Adaptive Instructional Systems: First International Conference, AIS 2019, Held as Part of the 21st HCI International Conference, HCII 2019, Orlando, FL, USA, July 26–31, 2019, Proceedings [1st ed.] 978-3-030-22340-3;978-3-030-22341-0

This book constitutes the refereed proceedings of the First International Conference on Adaptive Instructional Systems, AIS 2019, held as part of the 21st HCI International Conference, HCII 2019, in Orlando, FL, USA, in July 2019.


English Pages XXI, 664 [672] Year 2019


Table of contents :
Front Matter ....Pages i-xxi
Front Matter ....Pages 1-1
Adaptation Vectors for Instructional Agents (Benjamin Bell, Robert Sottilare)....Pages 3-14
Adaptive Team Training for One (Elizabeth Biddle, Barbara Buck)....Pages 15-27
Adaptive Training: Designing Training for the Way People Work and Learn (Lara K. Bove)....Pages 28-39
Evolving Training Scenarios with Measurable Variance in Learning Effects (Brandt Dargue, Jeremiah T. Folsom-Kovarik, John Sanders)....Pages 40-51
Adaptive Instructional Systems: The Evolution of Hybrid Cognitive Tools and Tutoring Systems (Jeanine A. DeFalco, Anne M. Sinatra)....Pages 52-61
Lessons from Building Diverse Adaptive Instructional Systems (AIS) (Eric Domeshek, Sowmya Ramachandran, Randy Jensen, Jeremy Ludwig, Jim Ong, Dick Stottler)....Pages 62-75
Fundamentals, Flavors, and Foibles of Adaptive Instructional Systems (Paula J. Durlach)....Pages 76-95
Foundational Principles and Design of a Hybrid Tutor (Andrew J. Hampton, Arthur C. Graesser)....Pages 96-107
Change Your Mind (Dov Jacobson, Brandt Dargue)....Pages 108-117
Developing Authoring Tools for Simulation-Based Intelligent Tutoring Systems: Lessons Learned (James E. McCarthy, Justin Kennedy, Jonathan Grant, Mike Bailey)....Pages 118-129
Ibigkas! 2.0: Directions for the Design of an Adaptive Mobile-Assisted Language Learning App (Ma. Mercedes T. Rodrigo, Jaclyn Ocumpaugh, Dominique Marie Antoinette Manahan, Jonathan D. L. Casano)....Pages 130-141
Adaptive Learning Technology for AR Training: Possibilities and Challenges (Alyssa Tanaka, Jeffrey Craighead, Glenn Taylor, Robert Sottilare)....Pages 142-150
Intelligent Tutoring Design Alternatives in a Serious Game (Elizabeth Whitaker, Ethan Trewhitt, Elizabeth S. Veinott)....Pages 151-165
Front Matter ....Pages 167-167
Missing Pieces: Infrastructure Requirements for Adaptive Instructional Systems (Avron Barr, Robby Robson)....Pages 169-178
Standards Needed: Competency Modeling and Recommender Systems (Keith Brawner)....Pages 179-187
Measuring the Complexity of Learning Content to Enable Automated Comparison, Recommendation, and Generation (Jeremiah T. Folsom-Kovarik, Dar-Wei Chen, Behrooz Mostafavi, Keith Brawner)....Pages 188-203
Capturing AIS Behavior Using xAPI-like Statements (Xiangen Hu, Zhiqiang Cai, Andrew J. Hampton, Jody L. Cockroft, Arthur C. Graesser, Cameron Copland et al.)....Pages 204-216
Standardizing Unstructured Interaction Data in Adaptive Instructional Systems (Vasile Rus, Arthur C. Graesser, Xiangen Hu, Jody L. Cockroft)....Pages 217-226
Exploring Methods to Promote Interoperability in Adaptive Instructional Systems (Robert Sottilare)....Pages 227-238
Examining Elements of an Adaptive Instructional System (AIS) Conceptual Model (Robert Sottilare, Brian Stensrud, Andrew J. Hampton)....Pages 239-250
Interoperability Standards for Adaptive Instructional Systems: Vertical and Horizontal Integrations (K. P. Thai, Richard Tong)....Pages 251-260
Front Matter ....Pages 261-261
Integrating Engagement Inducing Interventions into Traditional, Virtual and Embedded Learning Environments (Meredith Carroll, Summer Lindsey, Maria Chaparro)....Pages 263-281
Productive Failure and Subgoal Scaffolding in Novel Domains (Dar-Wei Chen, Richard Catrambone)....Pages 282-300
Adaptation and Pedagogy at the Collective Level: Recommendations for Adaptive Instructional Systems (Benjamin Goldberg)....Pages 301-313
Developing an Adaptive Trainer for Joint Terminal Attack Controllers (Cheryl I. Johnson, Matthew D. Marraffino, Daphne E. Whitmer, Shannon K. T. Bailey)....Pages 314-326
Using an Adaptive Intelligent Tutoring System to Promote Learning Affordances for Adults with Low Literacy Skills (Anne Lippert, Jessica Gatewood, Zhiqiang Cai, Arthur C. Graesser)....Pages 327-339
Development of Cognitive Transfer Tasks for Virtual Environments and Applications for Adaptive Instructional Systems (Anne M. Sinatra, Ashley H. Oiknine, Debbie Patton, Mark Ericson, Antony D. Passaro, Benjamin T. Files et al.)....Pages 340-351
Application of Theory to the Development of an Adaptive Training System for a Submarine Electronic Warfare Task (Wendi L. Van Buskirk, Nicholas W. Fraulini, Bradford L. Schroeder, Cheryl I. Johnson, Matthew D. Marraffino)....Pages 352-362
Learning Analytics of Playing Space Fortress with Reinforcement Learning (Joost van Oijen, Jan Joris Roessingh, Gerald Poppinga, Victor García)....Pages 363-378
Wrong in the Right Way: Balancing Realism Against Other Constraints in Simulation-Based Training (Walter Warwick, Stuart Rodgers)....Pages 379-388
Front Matter ....Pages 389-389
Evaluation of Diagnostic Rules for Real-Time Assessment of Mental Workload Within a Dynamic Adaptation Framework (Anna Bruder, Jessica Schwarz)....Pages 391-404
Model for Analysis of Personality Traits in Support of Team Recommendation (Guilherme Oliveira, Rafael dos Santos Braz, Daniela de Freitas Guilhermino Trindade, Jislaine de Fátima Guilhermino, José Reinaldo Merlin, Ederson Marcos Sgarbi et al.)....Pages 405-419
The Influence of Gait on Cognitive Functions: Promising Factor for Adapting Systems to the Worker’s Need in a Picking Context (Magali Kreutzfeldt, Johanna Renker, Gerhard Rinkenauer)....Pages 420-431
Using Learning Analytics to Explore the Performance of Chinese Mathematical Intelligent Tutoring System (Bor-Chen Kuo, Chia-Hua Lin, Kai-Chih Pai, Shu-Chuan Shih, Chen-Huei Liao)....Pages 432-443
Eye Blinks Describing the State of the Learner Under Uncertainty (Johanna Renker, Magali Kreutzfeldt, Gerhard Rinkenauer)....Pages 444-454
Adaptive Remediation with Multi-modal Content (Yuwei Tu, Christopher G. Brinton, Andrew S. Lan, Mung Chiang)....Pages 455-468
Diagnostic Requirements for Efficient, Adaptive Robotic Surgery Training (Thomas E. F. Witte, Martin Schmettow, Marleen Groenier)....Pages 469-481
Supporting Human Inspection of Adaptive Instructional Systems (Diego Zapata-Rivera)....Pages 482-490
Front Matter ....Pages 491-491
Adaptive Agents for Adaptive Tactical Training: The State of the Art and Emerging Requirements (Jared Freeman, Eric Watz, Winston Bennett)....Pages 493-504
Cognitive Agents for Adaptive Training in Cyber Operations (Randolph M. Jones, Ryan O’Grady, Fernando Maymi, Alex Nickels)....Pages 505-520
Consideration of a Bayesian Hierarchical Model for Assessment and Adaptive Instructions (Jong W. Kim, Frank E. Ritter)....Pages 521-531
Developing an Adaptive Opponent for Tactical Training (Jeremy Ludwig, Bart Presnell)....Pages 532-541
Application of Artificial Intelligence to Adaptive Instruction - Combining the Concepts (Jan Joris Roessingh, Gerald Poppinga, Joost van Oijen, Armon Toubman)....Pages 542-556
Validating Air Combat Behaviour Models for Adaptive Training of Teams (Armon Toubman)....Pages 557-571
Six Challenges for Human-AI Co-learning (Karel van den Bosch, Tjeerd Schoonderwoerd, Romy Blankendaal, Mark Neerincx)....Pages 572-589
Front Matter ....Pages 591-591
Authoring Conversational Intelligent Tutoring Systems (Zhiqiang Cai, Xiangen Hu, Arthur C. Graesser)....Pages 593-603
A Conversation-Based Intelligent Tutoring System Benefits Adult Readers with Low Literacy Skills (Ying Fang, Anne Lippert, Zhiqiang Cai, Xiangen Hu, Arthur C. Graesser)....Pages 604-614
Adaptive Instructional Systems and Digital Tutoring (J. D. Fletcher)....Pages 615-633
Conversational AIS as the Cornerstone of Hybrid Tutors (Andrew J. Hampton, Lijia Wang)....Pages 634-644
Ms. An (Meeting Students’ Academic Needs): Engaging Students in Math Education (Karina R. Liles)....Pages 645-661
Back Matter ....Pages 663-664


LNCS 11597

Robert A. Sottilare Jessica Schwarz (Eds.)

Adaptive Instructional Systems First International Conference, AIS 2019 Held as Part of the 21st HCI International Conference, HCII 2019 Orlando, FL, USA, July 26–31, 2019, Proceedings


Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board Members
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology Madras, Chennai, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA

11597

More information about this series at http://www.springer.com/series/7409

Robert A. Sottilare · Jessica Schwarz (Eds.)

Adaptive Instructional Systems First International Conference, AIS 2019 Held as Part of the 21st HCI International Conference, HCII 2019 Orlando, FL, USA, July 26–31, 2019 Proceedings


Editors Robert A. Sottilare Soar Technology, Inc., Orlando, FL, USA

Jessica Schwarz Fraunhofer FKIE Wachtberg, Germany

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-22340-3 ISBN 978-3-030-22341-0 (eBook) https://doi.org/10.1007/978-3-030-22341-0 LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI © Springer Nature Switzerland AG 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

The 21st International Conference on Human-Computer Interaction, HCI International 2019, was held in Orlando, FL, USA, during July 26–31, 2019. The event incorporated the 18 thematic areas and affiliated conferences listed on the following page. A total of 5,029 individuals from academia, research institutes, industry, and governmental agencies from 73 countries submitted contributions, and 1,274 papers and 209 posters were included in the pre-conference proceedings. These contributions address the latest research and development efforts and highlight the human aspects of design and use of computing systems. The contributions thoroughly cover the entire field of human-computer interaction, addressing major advances in knowledge and effective use of computers in a variety of application areas. The volumes constituting the full set of the pre-conference proceedings are listed in the following pages. This year the HCI International (HCII) conference introduced the new option of “late-breaking work.” This applies both for papers and posters and the corresponding volume(s) of the proceedings will be published just after the conference. Full papers will be included in the HCII 2019 Late-Breaking Work Papers Proceedings volume of the proceedings to be published in the Springer LNCS series, while poster extended abstracts will be included as short papers in the HCII 2019 Late-Breaking Work Poster Extended Abstracts volume to be published in the Springer CCIS series. I would like to thank the program board chairs and the members of the program boards of all thematic areas and affiliated conferences for their contribution to the highest scientific quality and the overall success of the HCI International 2019 conference. This conference would not have been possible without the continuous and unwavering support and advice of the founder, Conference General Chair Emeritus and Conference Scientific Advisor Prof. Gavriel Salvendy. For his outstanding efforts, I would like to express my appreciation to the communications chair and editor of HCI International News, Dr. Abbas Moallem. July 2019

Constantine Stephanidis

HCI International 2019 Thematic Areas and Affiliated Conferences

Thematic areas: • HCI 2019: Human-Computer Interaction • HIMI 2019: Human Interface and the Management of Information Affiliated conferences: • EPCE 2019: 16th International Conference on Engineering Psychology and Cognitive Ergonomics • UAHCI 2019: 13th International Conference on Universal Access in Human-Computer Interaction • VAMR 2019: 11th International Conference on Virtual, Augmented and Mixed Reality • CCD 2019: 11th International Conference on Cross-Cultural Design • SCSM 2019: 11th International Conference on Social Computing and Social Media • AC 2019: 13th International Conference on Augmented Cognition • DHM 2019: 10th International Conference on Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management • DUXU 2019: 8th International Conference on Design, User Experience, and Usability • DAPI 2019: 7th International Conference on Distributed, Ambient and Pervasive Interactions • HCIBGO 2019: 6th International Conference on HCI in Business, Government and Organizations • LCT 2019: 6th International Conference on Learning and Collaboration Technologies • ITAP 2019: 5th International Conference on Human Aspects of IT for the Aged Population • HCI-CPT 2019: First International Conference on HCI for Cybersecurity, Privacy and Trust • HCI-Games 2019: First International Conference on HCI in Games • MobiTAS 2019: First International Conference on HCI in Mobility, Transport, and Automotive Systems • AIS 2019: First International Conference on Adaptive Instructional Systems

Pre-conference Proceedings Volumes Full List 1. LNCS 11566, Human-Computer Interaction: Perspectives on Design (Part I), edited by Masaaki Kurosu 2. LNCS 11567, Human-Computer Interaction: Recognition and Interaction Technologies (Part II), edited by Masaaki Kurosu 3. LNCS 11568, Human-Computer Interaction: Design Practice in Contemporary Societies (Part III), edited by Masaaki Kurosu 4. LNCS 11569, Human Interface and the Management of Information: Visual Information and Knowledge Management (Part I), edited by Sakae Yamamoto and Hirohiko Mori 5. LNCS 11570, Human Interface and the Management of Information: Information in Intelligent Systems (Part II), edited by Sakae Yamamoto and Hirohiko Mori 6. LNAI 11571, Engineering Psychology and Cognitive Ergonomics, edited by Don Harris 7. LNCS 11572, Universal Access in Human-Computer Interaction: Theory, Methods and Tools (Part I), edited by Margherita Antona and Constantine Stephanidis 8. LNCS 11573, Universal Access in Human-Computer Interaction: Multimodality and Assistive Environments (Part II), edited by Margherita Antona and Constantine Stephanidis 9. LNCS 11574, Virtual, Augmented and Mixed Reality: Multimodal Interaction (Part I), edited by Jessie Y. C. Chen and Gino Fragomeni 10. LNCS 11575, Virtual, Augmented and Mixed Reality: Applications and Case Studies (Part II), edited by Jessie Y. C. Chen and Gino Fragomeni 11. LNCS 11576, Cross-Cultural Design: Methods, Tools and User Experience (Part I), edited by P. L. Patrick Rau 12. LNCS 11577, Cross-Cultural Design: Culture and Society (Part II), edited by P. L. Patrick Rau 13. LNCS 11578, Social Computing and Social Media: Design, Human Behavior and Analytics (Part I), edited by Gabriele Meiselwitz 14. LNCS 11579, Social Computing and Social Media: Communication and Social Communities (Part II), edited by Gabriele Meiselwitz 15. LNAI 11580, Augmented Cognition, edited by Dylan D. Schmorrow and Cali M. Fidopiastis 16. LNCS 11581, Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management: Human Body and Motion (Part I), edited by Vincent G. Duffy


17. LNCS 11582, Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management: Healthcare Applications (Part II), edited by Vincent G. Duffy 18. LNCS 11583, Design, User Experience, and Usability: Design Philosophy and Theory (Part I), edited by Aaron Marcus and Wentao Wang 19. LNCS 11584, Design, User Experience, and Usability: User Experience in Advanced Technological Environments (Part II), edited by Aaron Marcus and Wentao Wang 20. LNCS 11585, Design, User Experience, and Usability: Application Domains (Part III), edited by Aaron Marcus and Wentao Wang 21. LNCS 11586, Design, User Experience, and Usability: Practice and Case Studies (Part IV), edited by Aaron Marcus and Wentao Wang 22. LNCS 11587, Distributed, Ambient and Pervasive Interactions, edited by Norbert Streitz and Shin’ichi Konomi 23. LNCS 11588, HCI in Business, Government and Organizations: eCommerce and Consumer Behavior (Part I), edited by Fiona Fui-Hoon Nah and Keng Siau 24. LNCS 11589, HCI in Business, Government and Organizations: Information Systems and Analytics (Part II), edited by Fiona Fui-Hoon Nah and Keng Siau 25. LNCS 11590, Learning and Collaboration Technologies: Designing Learning Experiences (Part I), edited by Panayiotis Zaphiris and Andri Ioannou 26. LNCS 11591, Learning and Collaboration Technologies: Ubiquitous and Virtual Environments for Learning and Collaboration (Part II), edited by Panayiotis Zaphiris and Andri Ioannou 27. LNCS 11592, Human Aspects of IT for the Aged Population: Design for the Elderly and Technology Acceptance (Part I), edited by Jia Zhou and Gavriel Salvendy 28. LNCS 11593, Human Aspects of IT for the Aged Population: Social Media, Games and Assistive Environments (Part II), edited by Jia Zhou and Gavriel Salvendy 29. LNCS 11594, HCI for Cybersecurity, Privacy and Trust, edited by Abbas Moallem 30. LNCS 11595, HCI in Games, edited by Xiaowen Fang 31. LNCS 11596, HCI in Mobility, Transport, and Automotive Systems, edited by Heidi Krömker 32. LNCS 11597, Adaptive Instructional Systems, edited by Robert Sottilare and Jessica Schwarz 33. CCIS 1032, HCI International 2019 - Posters (Part I), edited by Constantine Stephanidis


34. CCIS 1033, HCI International 2019 - Posters (Part II), edited by Constantine Stephanidis 35. CCIS 1034, HCI International 2019 - Posters (Part III), edited by Constantine Stephanidis

http://2019.hci.international/proceedings

First International Conference on Adaptive Instructional Systems (AIS 2019)

Program Board Chair(s): Robert A. Sottilare, USA, and Jessica Schwarz, Germany

• Avron Barr, USA
• Benjamin Bell, USA
• Elizabeth Biddle, USA
• Gautam Biswas, USA
• Keith Brawner, USA
• Barbara Buck, USA
• Brandt Dargue, USA
• John Dexter Fletcher, USA
• Stephen Goldberg, USA
• Xiangen Hu, USA
• Jong Kim, USA
• R. Bowen Loftin, USA
• Benjamin Nye, USA
• Jan Joris Roessingh, The Netherlands
• Thomas Schnell, USA
• Anne Sinatra, USA

The full list with the Program Board Chairs and the members of the Program Boards of all thematic areas and affiliated conferences is available online at:

http://www.hci.international/board-members-2019.php

HCI International 2020 The 22nd International Conference on Human-Computer Interaction, HCI International 2020, will be held jointly with the affiliated conferences in Copenhagen, Denmark, at the Bella Center Copenhagen, July 19–24, 2020. It will cover a broad spectrum of themes related to HCI, including theoretical issues, methods, tools, processes, and case studies in HCI design, as well as novel interaction techniques, interfaces, and applications. The proceedings will be published by Springer. More information will be available on the conference website: http://2020.hci.international/. General Chair Prof. Constantine Stephanidis University of Crete and ICS-FORTH Heraklion, Crete, Greece E-mail: [email protected]

http://2020.hci.international/

Contents

Adaptive Instruction Design and Authoring Adaptation Vectors for Instructional Agents . . . . . . . . . . . . . . . . . . . . . . . . Benjamin Bell and Robert Sottilare

3

Adaptive Team Training for One . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Elizabeth Biddle and Barbara Buck

15

Adaptive Training: Designing Training for the Way People Work and Learn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lara K. Bove

28

Evolving Training Scenarios with Measurable Variance in Learning Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Brandt Dargue, Jeremiah T. Folsom-Kovarik, and John Sanders

40

Adaptive Instructional Systems: The Evolution of Hybrid Cognitive Tools and Tutoring Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jeanine A. DeFalco and Anne M. Sinatra

52

Lessons from Building Diverse Adaptive Instructional Systems (AIS) . . . . . . Eric Domeshek, Sowmya Ramachandran, Randy Jensen, Jeremy Ludwig, Jim Ong, and Dick Stottler

62

Fundamentals, Flavors, and Foibles of Adaptive Instructional Systems. . . . . . Paula J. Durlach

76

Foundational Principles and Design of a Hybrid Tutor . . . . . . . . . . . . . . . . . Andrew J. Hampton and Arthur C. Graesser

96

Change Your Mind: Game Based AIS Can Reform Cognitive Behavior. . . . . Dov Jacobson and Brandt Dargue

108

Developing Authoring Tools for Simulation-Based Intelligent Tutoring Systems: Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . James E. McCarthy, Justin Kennedy, Jonathan Grant, and Mike Bailey Ibigkas! 2.0: Directions for the Design of an Adaptive Mobile-Assisted Language Learning App. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ma. Mercedes T. Rodrigo, Jaclyn Ocumpaugh, Dominique Marie Antoinette Manahan, and Jonathan D. L. Casano

118

130


Adaptive Learning Technology for AR Training: Possibilities and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alyssa Tanaka, Jeffrey Craighead, Glenn Taylor, and Robert Sottilare Intelligent Tutoring Design Alternatives in a Serious Game . . . . . . . . . . . . . Elizabeth Whitaker, Ethan Trewhitt, and Elizabeth S. Veinott

142 151

Interoperability and Standardization in Adaptive Instructional Systems Missing Pieces: Infrastructure Requirements for Adaptive Instructional Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Avron Barr and Robby Robson Standards Needed: Competency Modeling and Recommender Systems . . . . . Keith Brawner Measuring the Complexity of Learning Content to Enable Automated Comparison, Recommendation, and Generation. . . . . . . . . . . . . . . . . . . . . . Jeremiah T. Folsom-Kovarik, Dar-Wei Chen, Behrooz Mostafavi, and Keith Brawner Capturing AIS Behavior Using xAPI-like Statements. . . . . . . . . . . . . . . . . . Xiangen Hu, Zhiqiang Cai, Andrew J. Hampton, Jody L. Cockroft, Arthur C. Graesser, Cameron Copland, and Jeremiah T. Folsom-Kovarik

169 179

188

204

Standardizing Unstructured Interaction Data in Adaptive Instructional Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vasile Rus, Arthur C. Graesser, Xiangen Hu, and Jody L. Cockroft

217

Exploring Methods to Promote Interoperability in Adaptive Instructional Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robert Sottilare

227

Examining Elements of an Adaptive Instructional System (AIS) Conceptual Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robert Sottilare, Brian Stensrud, and Andrew J. Hampton

239

Interoperability Standards for Adaptive Instructional Systems: Vertical and Horizontal Integrations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . K. P. Thai and Richard Tong

251

Instructional Theories in Adaptive Instruction Integrating Engagement Inducing Interventions into Traditional, Virtual and Embedded Learning Environments . . . . . . . . . . . . . . . . . . . . . . Meredith Carroll, Summer Lindsey, and Maria Chaparro

263


Productive Failure and Subgoal Scaffolding in Novel Domains . . . . . . . . . . . Dar-Wei Chen and Richard Catrambone Adaptation and Pedagogy at the Collective Level: Recommendations for Adaptive Instructional Systems . . . . . . . . . . . . . . . . . Benjamin Goldberg Developing an Adaptive Trainer for Joint Terminal Attack Controllers. . . . . . Cheryl I. Johnson, Matthew D. Marraffino, Daphne E. Whitmer, and Shannon K. T. Bailey Using an Adaptive Intelligent Tutoring System to Promote Learning Affordances for Adults with Low Literacy Skills. . . . . . . . . . . . . . . . . . . . . Anne Lippert, Jessica Gatewood, Zhiqiang Cai, and Arthur C. Graesser Development of Cognitive Transfer Tasks for Virtual Environments and Applications for Adaptive Instructional Systems . . . . . . . . . . . . . . . . . . Anne M. Sinatra, Ashley H. Oiknine, Debbie Patton, Mark Ericson, Antony D. Passaro, Benjamin T. Files, Bianca Dalangin, Peter Khooshabeh, and Kimberly A. Pollard Application of Theory to the Development of an Adaptive Training System for a Submarine Electronic Warfare Task . . . . . . . . . . . . . . . . . . . . . . . . . . Wendi L. Van Buskirk, Nicholas W. Fraulini, Bradford L. Schroeder, Cheryl I. Johnson, and Matthew D. Marraffino Learning Analytics of Playing Space Fortress with Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joost van Oijen, Jan Joris Roessingh, Gerald Poppinga, and Victor García Wrong in the Right Way: Balancing Realism Against Other Constraints in Simulation-Based Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Walter Warwick and Stuart Rodgers


282

301 314

327

340

352

363

379

Learner Assessment and Modelling Evaluation of Diagnostic Rules for Real-Time Assessment of Mental Workload Within a Dynamic Adaptation Framework . . . . . . . . . . . . . . . . . . Anna Bruder and Jessica Schwarz Model for Analysis of Personality Traits in Support of Team Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guilherme Oliveira, Rafael dos Santos Braz, Daniela de Freitas Guilhermino Trindade, Jislaine de Fátima Guilhermino, José Reinaldo Merlin, Ederson Marcos Sgarbi, Carlos Eduardo Ribeiro, and Thiago Fernandes de Oliveira

391

405


The Influence of Gait on Cognitive Functions: Promising Factor for Adapting Systems to the Worker’s Need in a Picking Context . . . . . . . . . Magali Kreutzfeldt, Johanna Renker, and Gerhard Rinkenauer Using Learning Analytics to Explore the Performance of Chinese Mathematical Intelligent Tutoring System. . . . . . . . . . . . . . . . . . . . . . . . . . Bor-Chen Kuo, Chia-Hua Lin, Kai-Chih Pai, Shu-Chuan Shih, and Chen-Huei Liao

420

432

Eye Blinks Describing the State of the Learner Under Uncertainty . . . . . . . . Johanna Renker, Magali Kreutzfeldt, and Gerhard Rinkenauer

444

Adaptive Remediation with Multi-modal Content . . . . . . . . . . . . . . . . . . . . Yuwei Tu, Christopher G. Brinton, Andrew S. Lan, and Mung Chiang

455

Diagnostic Requirements for Efficient, Adaptive Robotic Surgery Training. . . Thomas E. F. Witte, Martin Schmettow, and Marleen Groenier

469

Supporting Human Inspection of Adaptive Instructional Systems. . . . . . . . . . Diego Zapata-Rivera

482

AI in Adaptive Instructional Systems Adaptive Agents for Adaptive Tactical Training: The State of the Art and Emerging Requirements . . . . . . . . . . . . . . . . . . . . Jared Freeman, Eric Watz, and Winston Bennett Cognitive Agents for Adaptive Training in Cyber Operations . . . . . . . . . . . . Randolph M. Jones, Ryan O’Grady, Fernando Maymi, and Alex Nickels Consideration of a Bayesian Hierarchical Model for Assessment and Adaptive Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jong W. Kim and Frank E. Ritter Developing an Adaptive Opponent for Tactical Training . . . . . . . . . . . . . . . Jeremy Ludwig and Bart Presnell Application of Artificial Intelligence to Adaptive Instruction - Combining the Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Joris Roessingh, Gerald Poppinga, Joost van Oijen, and Armon Toubman Validating Air Combat Behaviour Models for Adaptive Training of Teams. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Armon Toubman Six Challenges for Human-AI Co-learning . . . . . . . . . . . . . . . . . . . . . . . . . Karel van den Bosch, Tjeerd Schoonderwoerd, Romy Blankendaal, and Mark Neerincx

493 505

521 532

542

557 572


Conversational Tutors Authoring Conversational Intelligent Tutoring Systems . . . . . . . . . . . . . . . . Zhiqiang Cai, Xiangen Hu, and Arthur C. Graesser A Conversation-Based Intelligent Tutoring System Benefits Adult Readers with Low Literacy Skills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Fang, Anne Lippert, Zhiqiang Cai, Xiangen Hu, and Arthur C. Graesser

593

604

Adaptive Instructional Systems and Digital Tutoring . . . . . . . . . . . . . . . . . . J. D. Fletcher

615

Conversational AIS as the Cornerstone of Hybrid Tutors . . . . . . . . . . . . . . . Andrew J. Hampton and Lijia Wang

634

Ms. An (Meeting Students’ Academic Needs): Engaging Students in Math Education. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Karina R. Liles

645

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

663

Adaptive Instruction Design and Authoring

Adaptation Vectors for Instructional Agents

Benjamin Bell (Eduworks Corporation, Corvallis, OR 97333, USA, [email protected]) and Robert Sottilare (Soar Technology, Inc., Orlando, FL 32817, USA, [email protected])

Abstract. This paper examines the application of intelligent agents to guide and adapt instruction in a class of learning technologies known as adaptive instructional systems (AISs). AISs are artificially-intelligent, computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each individual learner or team in the context of domain learning objectives. Intelligent agents are autonomous entities which observe the conditions in their environments through percepts (e.g., sensors) and then act upon the environment using actuators. Their activity is intelligently directed to strategies (plans for action) and tactics (actions executed by the AIS) which enhance the progress of the learner toward the achievement of assigned goals and objectives. To optimize agent and learner performance we examine a notional set of principles that is needed to guide adaptive instructional designers and system developers toward adaptation vectors that support both instructionally meaningful (effective) and doctrinally correct (relevant and tactically plausible) actions by the AIS. In conclusion, we explore the literature to understand the design of adaptation vectors and provide recommendations for future AIS research and standards development.

Keywords: Adaptation · Vectors · Adaptive instructional system (AIS) · Instructional strategies · Instructional tactics


1 Introduction

The focus of this paper is on the application of intelligent agents and their role in various tailoring methods for adaptive instructional systems (AISs). We have termed these tailoring methods adaptation vectors. We begin by defining AISs and intelligent agents along with their major components and functions. We follow this up with a problem definition.

1.1 Defining Adaptive Instructional Systems

During the last year, an IEEE working group chartered through Project 2247 has taken on the task of developing standards and best practices for AISs. This IEEE working group will be debating what is and is not an AIS, and what standards and best practices will evolve from the marketplace. To date, the group has identified three potential areas for standardization: (1) a conceptual model for AISs, (2) interoperability standards for AISs, and (3) evaluation best practices for AISs. This paper explores design principles and methods to enhance interoperability and reuse for AISs, which are defined as: artificially-intelligent, computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each individual learner or team in the context of domain learning objectives [1].

It is important to distinguish between adaptive and adaptable system attributes. Both adaptive and adaptable systems provide system flexibility, but adaptive systems are able to observe the environment, identify changing conditions, and then take action without human intervention, while adaptable systems give the user control over change and flexibility [2, 3]. Adaptive and adaptable attributes are not mutually exclusive, and both can exist together in the same system. For example, the Generalized Intelligent Framework for Tutoring (GIFT) has an authoring system that features adaptability by allowing users to configure course objects and also features adaptivity by automatically building a tutor based on specifications developed by the author [4, 5].

Adaptive instructional systems come in many forms, and this makes standardization challenging. The most common form of AIS is the intelligent tutoring system (ITS), a computer-based system that automatically provides real-time, tailored, and relevant feedback and instruction to learners [6, 7]. Other forms of AISs include intelligent mentors (recommender systems), which promote the social relationship between learners and intelligent agents [8], and web-based intelligent media used for instruction. Next, we define intelligent agents and discuss their various attributes and forms.

1.2 Defining Intelligent Agents

Intelligent agents are autonomous entities which observe the conditions in their environments through percepts (e.g., sensors) and then act upon the environment using actuators [9]. Agents may take different forms:

• simple reflex agents [9] – act only in response to the current percept, ignoring the rest of the percept history; use condition-action rules to determine action selection
• model-based reflex agents [9] – act based on a model that represents percept history and interprets the current state of the world, how the world has evolved (trends), and the impact of decisions in terms of rewards; use condition-action rules to determine action selection
• goal-based agents [9] – expand on the capabilities of model-based agents by using goal information that describes desirable states; use progress toward goals to determine action selection
• utility-based agents [9] – define a measure of desirability of a particular state, called a utility function, which maps each state to a measure of utility; improve upon goal-based agents, which can only distinguish between goal states (desirable) and non-goal states (not desirable); act to be in states of the highest utility
• learning agents [9] – able to operate initially in unknown environments and become more competent through experience; implement separate processes for learning and performance, which are responsible for making improvements and selecting external actions respectively
• multi-agent architectures [10] – composed of hierarchies of agents with different functions; the GIFT architecture [4, 5] is a multi-agent architecture with agents assigned to manage dialogue between learners and virtual humans, and decisions for recommendations, strategy selection, and tactic selection

A set of desirable characteristics for intelligent agents (derived from Russell and Norvig [9]) might include:

• Autonomous – agents are able to act without human direction or intervention
• Adaptive and Reactive – agents are responsive to changes in their environment and are active in enforcing policies (rules)
• Proactive – agents take initiative to achieve long-term goals, recognize opportunities to progress toward goals, and learn and adapt to make more effective choices in the future
• Sociable and Cooperative – in multi-agent architectures, agents share information and act together to recognize and achieve long-term goals

For AIS architectures, intelligent agents not only act on the environment, but also observe and act upon the learner(s), as shown in Fig. 1.

Fig. 1. Intelligent agents interacting with learners and the environment [11]
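To make the distinctions above concrete, the following minimal Python sketch contrasts a simple reflex tutor agent with a utility-based one. It is an illustration only, not code from GIFT or any AIS described here; the percept fields, actions, and utility values are invented assumptions.

from dataclasses import dataclass


@dataclass
class Percept:
    """A single observation of the learner/environment (hypothetical fields)."""
    last_answer_correct: bool
    hints_used: int
    time_on_task_s: float


class SimpleReflexTutorAgent:
    """Acts only on the current percept via condition-action rules."""

    def act(self, percept: Percept) -> str:
        if not percept.last_answer_correct and percept.hints_used == 0:
            return "give_hint"
        if not percept.last_answer_correct:
            return "show_worked_example"
        return "present_next_problem"


class UtilityBasedTutorAgent:
    """Keeps a model of the learner and picks the action with the highest estimated utility."""

    def __init__(self) -> None:
        self.proficiency = 0.5  # crude running estimate in [0, 1]

    def update_model(self, percept: Percept) -> None:
        # Percept history folded into a single state variable (model-based behavior).
        self.proficiency += 0.05 if percept.last_answer_correct else -0.05
        self.proficiency = min(1.0, max(0.0, self.proficiency))

    def act(self, percept: Percept) -> str:
        self.update_model(percept)
        # Utility of each candidate action given the current learner state
        # (illustrative numbers only).
        utilities = {
            "give_hint": 1.0 - self.proficiency,
            "present_harder_problem": self.proficiency,
            "review_prerequisites": 0.8 - self.proficiency,
        }
        return max(utilities, key=utilities.get)


if __name__ == "__main__":
    p = Percept(last_answer_correct=False, hints_used=0, time_on_task_s=42.0)
    print(SimpleReflexTutorAgent().act(p))   # -> give_hint
    print(UtilityBasedTutorAgent().act(p))   # -> give_hint (proficiency drops to 0.45)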

In AISs, the activities of effective agents are intelligently directed to develop recommendations, strategies (plans for action), and tactics (actions executed by the AIS) which enhance the progress of the learner toward the achievement of assigned goals and objectives. Policies (often rule-based, decision trees, or Markovian processes) are developed to aid AIS decision-making in selecting optimal next actions. The goal of each action is to optimize learning outcomes, and policies drive recommendations, strategies, and tactics that have the highest rewards with respect to learning, performance, retention, and transfer of skills. Next, with a greater understanding of AISs and intelligent agents, we begin to shape our problem space.
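Before moving on, the following sketch illustrates one way such a policy could be realized: a small tabular, epsilon-greedy mapping from a coarse learner state to instructional tactics, with reward estimates updated from observed learning gains. The state labels, tactic names, and reward scale are assumptions made for illustration, not elements of any published AIS policy.

import random
from collections import defaultdict

TACTICS = ["hint", "prompt", "worked_example", "harder_problem"]


class TacticPolicy:
    """Epsilon-greedy selection over per-state reward estimates (a contextual-bandit-style policy)."""

    def __init__(self, epsilon: float = 0.1) -> None:
        self.epsilon = epsilon
        # q[state][tactic] -> running estimate of the learning reward of that tactic
        self.q = defaultdict(lambda: {t: 0.0 for t in TACTICS})
        self.counts = defaultdict(lambda: {t: 0 for t in TACTICS})

    def select(self, state: str) -> str:
        if random.random() < self.epsilon:
            return random.choice(TACTICS)      # explore occasionally
        values = self.q[state]
        return max(values, key=values.get)      # otherwise exploit the best-known tactic

    def update(self, state: str, tactic: str, reward: float) -> None:
        """Fold an observed learning gain (e.g., a post-step assessment delta) into the estimate."""
        self.counts[state][tactic] += 1
        n = self.counts[state][tactic]
        self.q[state][tactic] += (reward - self.q[state][tactic]) / n


if __name__ == "__main__":
    policy = TacticPolicy()
    state = "struggling_on_step_3"
    tactic = policy.select(state)
    policy.update(state, tactic, reward=0.3)  # reward as measured by an assessment component
    print(tactic, policy.q[state][tactic])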

1.3 Defining Our Problem Space

Much of the research in AISs has been focused on blending agent-enabled instruction with adaptive tutoring. At the nexus of these converging areas is the construct of an adaptive instructional agent, a label that we use fully aware of the ambiguity between the adaptivity of the agent and the tailoring of the instruction facilitated by an adaptive agent. In real time, AISs use adaptive agents to autonomously make instructional decisions and, as part of this process, act on both learners and the environment. To be effective in selecting actions, the adaptive agent must be cognizant of its environment and have a model of how its decisions will impact learning. Tailoring based on individual needs, goals, and preferences is dependent on a comprehensive learner model that includes learner attributes which are known to moderate or influence learning.

So why are we staking out such a subtle distinction? The reason for this distinction is more than a semantic exercise and really has to do with the authoring process for adaptive instruction and specifically agent impacts on the environment (e.g., simulation, serious games, problem banks).

Research Question: How might we use intelligent agents to author more relevant and appropriate content to offer better, more diverse options for adaptive instruction, especially during remediation, where new experiences are used to assist learners in achieving expected competencies after initial instruction results in learning gaps?

As we noted previously, intelligent agents come in different forms. Assuming we need agents that learn and are flexible, cooperative, proactive, and autonomous, we can envision agent policies that not only target a diversity of solutions, but also provide doctrinally correct solutions. It is also important that agents are situationally aware and have fitness criteria to weigh their decisions when choosing between multiple good actions that assist in progress toward multiple learning goals. All this complexity indicates that we should consider a multi-agent architecture, but where should we begin?

Agents that exhibit adaptive behaviors might be valuable in training simulations and educational courses by creating a variety of diverse challenges for learners and resulting in a variety of performance outcomes. While generating a variety of behaviors for constructive entities, sundry scenarios, or an assortment of problem sets is an important element of adaptive instruction, by itself it is not enough. The agents operating as part of AISs or in partnership with AISs need to act using tactically plausible adaptive behaviors, produce doctrinally relevant scenarios, or produce rich, germane problem sets. Diversity is not enough. When we want intelligent agents to contribute to learning during instruction, the adaptive behaviors they exhibit must meet different standards, and their policies must weigh outcomes in terms of their importance or fitness with respect to learning goals.


A policy (set of principles) is needed that can guide training developers in creating adaptive instructional agents for simulation-based training whose adaptations are both instructionally meaningful (effective) and doctrinally correct (relevant and tactically plausible). We term these adaptation vectors. In this paper, we explore what the adaptive tutoring literature has to offer the designers of agent-based instruction, discuss recent examples of adaptive, agent-based tutoring systems, and synthesize preliminary adaptation vectors from these case studies.

2 Adaptive Instruction vs. Adaptive Agents

Earlier we defined and listed essential attributes of intelligent agents and suggested a difference between adaptive instruction and adaptive agents. Here we begin to compare and contrast these terms, and ultimately define an adaptive instructional agent and its essential policies. We define adaptive instruction as guided learning experiences in which an artificially-intelligent, computer-based system tailors instruction and recommendations based on the goals, needs, and preferences of each individual learner or team in the context of domain learning objectives [1]. How an AIS might facilitate the accomplishment of domain learning objectives is illustrated in Fig. 2.

Fig. 2. Notional AIS architecture - tailoring is facilitated by the pedagogical model

Adaptive agents are one type of mechanism that facilitates adaptive instruction, and we have noted that adaptive agents take many forms and use many methods to adapt instruction. We emphasize that adaptive agents that act on an environment during instruction may or may not do so with the goal of enhancing learning, and that not every adaptive instructional method is facilitated by an adaptive agent. To continue this discussion, we examine adaptation mechanisms in both the literature and other papers within this volume.


2.1 Adaptation Mechanisms

In manually-adapted training systems, an instructor or simulation manager initiates adaptations when the conditions of the learner(s) warrant a change in the level of difficulty or the need for more or less support. In artificially-intelligent systems, adaptations are autonomously initiated/triggered by a set of conditions recognized by an AIS. Regardless of the nature of the instructor, the learning theory behind these adaptations is largely based upon Vygotsky’s Zone of Proximal Development (ZPD), a nexus within a learning experience where the difficulty level of the experience is perfectly balanced with the learner’s proficiency [12]. According to Vygotsky, this balance is necessary to keep the learner engaged during the learning experience (Fig. 3).

Fig. 3. Zone of proximal development (pictures courtesy of Graesser and D’Mello [13])
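A minimal sketch of how an AIS might operationalize the ZPD idea is shown below: the difficulty of the next activity is nudged toward a band around the learner's estimated proficiency. The 0-to-1 scales, band width, and step size are assumptions made for the example, not values from Vygotsky or from this paper.

def next_difficulty(proficiency: float, current_difficulty: float,
                    band: float = 0.1, step: float = 0.05) -> float:
    """Nudge difficulty toward the learner's proficiency, staying within [0, 1].

    proficiency, current_difficulty: both on a 0..1 scale.
    band: half-width of the "zone" in which no adjustment is made.
    step: how aggressively the system adapts per decision.
    """
    if current_difficulty > proficiency + band:
        current_difficulty -= step   # too hard: risk of frustration, ease off
    elif current_difficulty < proficiency - band:
        current_difficulty += step   # too easy: risk of boredom, raise the challenge
    return min(1.0, max(0.0, current_difficulty))


if __name__ == "__main__":
    d = 0.8
    for observed_proficiency in (0.4, 0.45, 0.5, 0.6):
        d = next_difficulty(observed_proficiency, d)
        print(round(d, 2))   # drifts down toward the learner's zone: 0.75, 0.7, 0.65, 0.65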

Jones and O’Grady (this volume) adopt a mixed-autonomy approach that allows all adaptations to be controlled on a spectrum from automated tuning to manual manipulation by human instructors. They discuss three types of adaptation: (1) tactics and techniques, e.g., adapting an attack or defense; (2) level of sophistication, e.g., making an attacker more or less aggressive, or limiting a defender’s awareness to focus training; and (3) personality parameters, e.g., tuning preferences of various types of agents in the ecosystem.

Freeman, Watz, and Bennett (this volume) discuss adaptive agents for adaptive tactical training. They emphasize the role of agents built using architectures capable of generating realistic and instructive behaviors. The authors specify the data required to drive such agents and describe a testbed that delivers some of these data to accelerate development of agents and improve evaluation of agents’ tactical smarts and agility. The adaptations in this case address tactical proficiency in battles against sophisticated human and machine adversaries, instructional efficacy, and maintenance and extension of agents to new training challenges. These adaptations span inferences fit to varied tactical situations, generation or selection of tactical actions to fit those inferences, use of different behavioral “modes” (e.g., the tactical preferences of different adversaries), fitting instructional actions to student needs, enabling agent evolution over time, and publishing metadata that enable systems to adaptively select and parameterize agents for new training tasks.

Warwick (this volume) advocates for adaptive agents that strike a balance between realism and tractability. The author argues that realism is subordinate to instructional objectives in training applications, and that in any specific instance, the level of realism required to meet those training objectives is an empirical question. For adaptive behaviors, he offers two general mechanisms. First, agents in training simulations (as opposed to simulation for other purposes) should have full visibility into “ground truth” and use that perspective to probe a learner’s weaknesses or to create tactically challenging circumstances as teaching moments. Second, formalisms for encapsulating agent behaviors can be blended, so that, for instance, finite state machines can drive routine behaviors with richer cognitive representations injecting adaptive behaviors.
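The blended-formalism idea can be illustrated with a short sketch: a finite state machine supplies routine entity behavior, while an optional adaptive-layer callback with access to ground truth can override a transition to create a teaching moment. The states, events, and override rule below are hypothetical, not drawn from Warwick's paper.

from typing import Callable, Optional

# Routine behavior: (state, event) -> next state
ROUTINE_TRANSITIONS = {
    ("patrol", "contact_detected"): "intercept",
    ("intercept", "contact_lost"): "patrol",
    ("intercept", "in_weapons_range"): "engage",
    ("engage", "contact_destroyed"): "patrol",
}

# Signature of the adaptive layer: given ground truth about the learner,
# it may return a replacement state (or None to keep the routine behavior).
AdaptiveOverride = Callable[[str, str, dict], Optional[str]]


def probe_weakness(state: str, event: str, ground_truth: dict) -> Optional[str]:
    """Example override: if the learner keeps missing flank attacks, force one."""
    if ground_truth.get("missed_flank_attacks", 0) >= 2 and state == "patrol":
        return "flanking_approach"
    return None


def step(state: str, event: str, ground_truth: dict,
         override: AdaptiveOverride = probe_weakness) -> str:
    injected = override(state, event, ground_truth)
    if injected is not None:
        return injected                                          # adaptive layer creates the teaching moment
    return ROUTINE_TRANSITIONS.get((state, event), state)        # otherwise the routine FSM drives behavior


if __name__ == "__main__":
    truth = {"missed_flank_attacks": 2}
    print(step("patrol", "contact_detected", truth))   # -> flanking_approach
    print(step("patrol", "contact_detected", {}))      # -> intercept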

2.2 Defining Adaptive Instructional Agents

Now as we mix concepts of adaptive instruction and intelligent agents, we begin to formulate the characteristics of adaptive instructional agents. An adaptive agent can be thought of as modifying its behaviors to align with the current activity context, or adjusting its actions to learned user preferences. Adaptive instruction is, generally speaking, a learning interaction that adjusts to learner proficiencies, affective state, or other user characteristics [1, 6, 7, 14–16].

In examining the automated instructional decisions expected to be managed by intelligent agents, Sottilare distills them into three simple types: recommendations, strategies, and tactics [11]. Recommendations are relevant proposals that suggest possible next steps (e.g., next problem or lesson selection) and fit into what VanLehn [17] describes as the outer loop of the tutoring process, sometimes referred to as macro-adaptations. Strategies are plans for action by the AIS, and tactics are actions executed by an AIS intelligent agent. Tactics are tied to VanLehn’s inner loop [17] and are micro-adaptations used to guide step-by-step experiences (e.g., prompts, hints, or feedback during problem solving) in adaptive instruction.

As previously discussed, intelligent agents are autonomous entities which observe their environment through sensors and act upon their environment using actuators while directing their activity toward achieving goals [9]. In AISs, Baylor [18] identifies three primary functions or abilities for intelligent agents: (1) the ability to manage large amounts of data, (2) the ability to serve as a credible instructional and domain expert, and (3) the ability to modify the environment in response to learner performance. To this end, we add the requirement for the agent to be a learning agent, an entity that makes more effective decisions with each experience. In a multi-agent AIS, we also add some additional capabilities prescribed by Nye and colleagues [19]:

• Specialization: agents can specialize in the tasks they perform and can communicate their abilities using semantic messages and agent communication languages [20]
• Decentralization: a large number of agent services interacting across domains, platforms (e.g., desktop vs. mobile), and client- and server-based software
• Customization: distinct agents that communicate using messages, where each agent does not rely on the specific internal state of any other agent
• Data-driven enhancements: related to customization, where agents (and simpler services) use data to improve their performance against specified measures and goals

Finally, we acknowledge the requirement that adaptive instructional agents attenuate their “correctness” in order to incorporate instructionally meaningful and tactically plausible errors into the behaviors of synthetic teammates. The ability of an agent to commit deliberate errors at instructionally strategic moments can provide critical practice in team coordination skills [16].

The term “adaptive instructional agent” conveniently captures the convergence of adaptive agents that make recommendations or execute instructions (e.g., Google Assistant or Amazon Alexa) and applies their abilities to adaptive instruction. A key distinction here is that agents participating in instruction are, or could be, role-players in a scenario. Such agents thus play multiple functions: performing some role in the story; ensuring that instructional goals are achieved; providing real-time feedback; understanding and generating natural language. With this in mind, we define adaptive instructional agents as: intelligent software-based entities that observe and act upon both the learner and the environment with the goal of enhancing learning with respect to allocated learning objectives. It seems that we need and expect a lot from our adaptive instructional agents, but what impact does this have on the authoring process for AISs?
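A minimal message-passing sketch of the multi-agent properties listed above (specialization, decentralization, customization) follows. It is not GIFT's implementation; the agent names, topics, and FIPA-style performative labels are assumptions chosen for illustration.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    performative: str          # e.g., "request", "inform" (FIPA-style verbs)
    topic: str                 # e.g., "assess_performance", "select_feedback"
    payload: dict = field(default_factory=dict)


class MessageBus:
    """Decentralization: agents share only a bus, never each other's internal state."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Callable[[Message], None]]] = {}

    def register(self, topic: str, handler: Callable[[Message], None]) -> None:
        self.handlers.setdefault(topic, []).append(handler)   # specialization by topic

    def publish(self, msg: Message) -> None:
        for handler in self.handlers.get(msg.topic, []):
            handler(msg)


class AssessmentAgent:
    def __init__(self, bus: MessageBus) -> None:
        self.bus = bus
        bus.register("assess_performance", self.on_assess)

    def on_assess(self, msg: Message) -> None:
        score = 1.0 if msg.payload.get("answer") == msg.payload.get("expected") else 0.0
        self.bus.publish(Message("inform", "select_feedback", {"score": score}))


class FeedbackAgent:
    def __init__(self, bus: MessageBus) -> None:
        bus.register("select_feedback", self.on_feedback)

    def on_feedback(self, msg: Message) -> None:
        text = "Well done." if msg.payload["score"] >= 1.0 else "Try re-reading step 2."
        print(f"Feedback: {text}")


if __name__ == "__main__":
    bus = MessageBus()
    AssessmentAgent(bus)
    FeedbackAgent(bus)
    bus.publish(Message("request", "assess_performance",
                        {"answer": "B", "expected": "C"}))   # -> Feedback: Try re-reading step 2.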

3 Implications for Authoring AISs

As we begin to evaluate the implications of a multi-agent architecture for the authoring process, we should begin with understanding the types of users directly interacting with authoring services to produce adaptive instruction. The following list was compiled from GIFT in its context as an evolving multi-agent architecture for authoring, delivering, managing, and evaluating adaptive instruction [19]:

• Instructors (Use Services): Deliver, modify, and possibly design GIFT courses
• Basic Course Designers (Configure Services): Modify or build course content with wizards/tutorials, including selecting or configuring a group of services verified to work together
• Advanced Course Designers (Compose Services): Build advanced content and adaptive behavior by selecting and configuring services to work together as a group
• Service/Agent Programmers (Make/Add Services): Code new services or agents used by GIFT
• Framework Providers (Combine Service Ecosystems): Maintain and interface other large frameworks, sensors, or external environments with GIFT through interfaces and gateways

Given these authoring roles, we envision authoring tools that link specified learning objectives with multi-agent services (e.g., assess learner performance or select optimal feedback). An AIS conceptual model that includes an ontology, a set of concepts and categories in a subject area or domain that shows their properties and the relations between them, is needed to identify variables for common services.
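As one possible illustration of linking learning objectives to agent services, the sketch below defines a tiny conceptual model with objectives, service descriptions, and explicit bindings between them. None of the class or field names come from the AIS conceptual model or ontology work; they are placeholders.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LearningObjective:
    objective_id: str
    description: str
    measures: List[str] = field(default_factory=list)   # how progress is assessed


@dataclass
class AgentService:
    service_id: str
    role: str                  # e.g., "assess_performance", "select_feedback"
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


@dataclass
class CourseModel:
    """The piece of the conceptual model an authoring tool would populate."""
    objectives: Dict[str, LearningObjective] = field(default_factory=dict)
    services: Dict[str, AgentService] = field(default_factory=dict)
    bindings: Dict[str, List[str]] = field(default_factory=dict)   # objective -> services

    def bind(self, objective_id: str, service_id: str) -> None:
        self.bindings.setdefault(objective_id, []).append(service_id)


if __name__ == "__main__":
    model = CourseModel()
    model.objectives["lo-1"] = LearningObjective(
        "lo-1", "Identify hostile contacts on radar", measures=["classification accuracy"])
    model.services["svc-assess"] = AgentService(
        "svc-assess", "assess_performance",
        inputs=["learner actions"], outputs=["performance state"])
    model.bind("lo-1", "svc-assess")
    print(model.bindings)   # {'lo-1': ['svc-assess']}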

4 Adaptation Vectors

While not comprehensive, the following adaptation vectors have been identified in conjunction with the agent-based services associated with AISs:

• Learning objective services – authoring capabilities for automatically developing and tracking progress toward learning objectives
• Constructive behavioral services – authoring capabilities for developing, observing, and modifying adaptive entity behaviors for use in virtual environments stimulated by AISs; adaptations may be to provide more intelligent interactions and/or to provide more complex and difficult scenarios in response to learner performance
• Adaptive realism services – authoring capabilities that automatically scale the realism of the environment to align with the resolution required to perform the training task
• Learner modeling services – authoring capabilities for automatically populating individual learner or team models with variables that measure progress toward learning objectives and maintain learner states for learning, performance, retention, transfer of training, domain competency and/or proficiency
• Learner strategy services – authoring capabilities for automatically matching the difficulty of future experiences to align with learner proficiency (à la Vygotsky’s Zone of Proximal Development)
• Learner recommender services – automated capabilities to analyze organizational or career learning goals and recommend next steps for training, education, and job experiences based on gaps in knowledge or skills
• Dialogue-based and virtual human services – authoring capabilities for developing scripts, decision trees, and conditional models to drive dialogue based on available content

In addition to services, we might also seek to establish principles or tenets that govern the application and decision-making of adaptive instructional agents:

• Learning first – tactical realism (feasibility) of solutions is important, but may be sacrificed to create advantageous conditions for learning; an AIS might alter lesson sequences, level of difficulty, or interaction style, but may not necessarily apply such adaptations in ways that maintain the tactical believability of a scenario
• Long-term learning first – adaptations must enhance learning outcomes, but long-term learning goals are more important than short-term learning goals; an AIS might forego short-term learning successes to pursue long-term learning goals
• Satisficing – timely solutions are important, and timely solutions that are less optimal may be selected in lieu of later, more optimal ones; a timely good solution is better than a late optimal solution; an AIS might implement tactics that are less than optimal
• Longitudinal modeling – adaptations must be based not just on current states, but on an examination of learner trends across longer periods of time; a comprehensive model of the learner and their progress toward learning goals is a better basis for decision-making than a snapshot
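The following sketch suggests one way an adaptive instructional agent might weigh these tenets when scoring candidate adaptations. The weights, thresholds, candidate fields, and example actions are illustrative assumptions, not values proposed by the authors.

from dataclasses import dataclass
from typing import List


@dataclass
class CandidateAdaptation:
    name: str
    short_term_gain: float      # expected immediate learning benefit (0..1)
    long_term_gain: float       # expected contribution to long-term goals (0..1)
    tactical_realism: float     # doctrinal plausibility (0..1)
    decision_cost_ms: float     # time needed to compute/deliver the action


def score(c: CandidateAdaptation, proficiency_trend: float) -> float:
    """Learning first: long-term gain weighted above short-term gain, realism last."""
    s = 0.5 * c.long_term_gain + 0.3 * c.short_term_gain + 0.2 * c.tactical_realism
    # Longitudinal modeling: a declining trend makes remediation-style gains count more.
    if proficiency_trend < 0:
        s += 0.1 * c.short_term_gain
    return s


def choose(candidates: List[CandidateAdaptation], proficiency_trend: float,
           deadline_ms: float = 200.0, good_enough: float = 0.6) -> CandidateAdaptation:
    """Satisficing: take the first timely candidate that clears the bar,
    otherwise fall back to the best-scoring timely one."""
    timely = [c for c in candidates if c.decision_cost_ms <= deadline_ms]
    for c in timely:
        if score(c, proficiency_trend) >= good_enough:
            return c
    return max(timely or candidates, key=lambda c: score(c, proficiency_trend))


if __name__ == "__main__":
    options = [
        CandidateAdaptation("inject_flank_attack", 0.4, 0.8, 0.9, 50),
        CandidateAdaptation("replan_full_scenario", 0.6, 0.9, 1.0, 5000),
    ]
    print(choose(options, proficiency_trend=-0.1).name)   # -> inject_flank_attack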

5 Conclusion and Recommendations

Designers of AISs and developers of the agents that inhabit them must reconcile a host of coincident and in some cases competing constructs that shape how instruction adapts to support the learner. In this paper we presented a notional set of principles to help instructional designers and agent developers implement adaptive instruction. Our specific focus is on the application of intelligent agents and their role in various tailoring methods, which we term adaptation vectors. We distinguished agents for simulation (which act to advance scenario events but not necessarily to enhance learning) from agents for adaptive instruction, casting adaptation vectors as explicitly supporting both instructionally meaningful and doctrinally plausible actions by the AIS. We noted as well that “plausible” does not always equate to “correct” or “optimal”, since a legitimate adaptation to support learning objectives can steer an agent toward deliberately erroneous behavior.

As an initial framework, we proposed seven adaptation vectors within which AIS developers can define specific instances of adaptations: learning objectives; constructive behaviors; adaptive realism; learner modeling; learner strategy; learner recommender; and dialogue. As an early step toward granting agents the logic to select among these adaptation types, we further proposed four tenets that could be incorporated into adaptive agent logic: learning first; long-term learning first; satisficing; and longitudinal modeling.

In conclusion, we explored the literature to understand the design of adaptation vectors and provided recommendations for future AIS research and standards development. It should be noted that while not every adaptive instructional method is facilitated by an adaptive agent, many are, and there is a trend toward learning agents. Our goal in setting forth this preliminary analysis is to encourage AIS developers to be deliberate in how adaptive agents influence instructional interactions. Further research is needed to validate these categories as effectively capturing distinctive and useful adaptation types.


References

1. Sottilare, R., Brawner, K.: Component interaction within the generalized intelligent framework for tutoring (GIFT) as a model for adaptive instructional system standards. In: The Adaptive Instructional System (AIS) Standards Workshop of the 14th International Conference of the Intelligent Tutoring Systems (ITS) Conference, Montreal, Quebec, Canada, June 2018
2. Oppermann, R.: Adaptive User Support: Ergonomic Design of Manually and Automatically Adaptable Software. Routledge, London (2017)
3. Oppermann, R., Rasher, R.: Adaptability and adaptivity in learning systems. Knowl. Transf. 2, 173–179 (1997)
4. Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The generalized intelligent framework for tutoring (GIFT). In: Concept Paper Released as part of GIFT Software Documentation. U.S. Army Research Laboratory—Human Research & Engineering Directorate (ARL-HRED), Orlando (2012)
5. Sottilare, R., Brawner, K., Sinatra, A., Johnston, J.: An Updated Concept for a Generalized Intelligent Framework for Tutoring (GIFT). US Army Research Laboratory, Orlando (2017)
6. Anderson, J.R., Boyle, C.F., Reiser, B.J.: Intelligent tutoring systems. Science 228(4698), 456–462 (1985)
7. Psotka, J., Massey, L.D., Mutter, S.A.: Intelligent Tutoring Systems: Lessons Learned. Lawrence Erlbaum Associates, Hillsdale (1988). ISBN 978-0-8058-0192-7
8. Baylor, A.: Beyond butlers: intelligent agents as mentors. J. Educ. Comput. Res. 22(4), 373–382 (2000)
9. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Pearson Education Ltd., Malaya (2003)
10. Niazi, M., Hussain, A.: Agent-based computing from multi-agent systems to agent-based models: a visual survey. Scientometrics 89(2), 479–499 (2011)
11. Sottilare, R.A., Burke, C.S., Salas, E., Sinatra, A.M., Johnston, J.H., Gilbert, S.B.: Designing adaptive instruction for teams: a meta-analysis. Int. J. Artif. Intell. Educ. 28(2), 225–264 (2018)
12. Vygotsky, L.S., Rieber, R.W.: The Collected Works of LS Vygotsky: Volume 1: Problems of General Psychology, Including the Volume Thinking and Speech. Cognition and Language: A Series in Psycholinguistics, vol. 1. Springer, Heidelberg (1987). https://doi.org/10.1007/978-1-4613-1655-8
13. Graesser, A.C., D'Mello, S.: Emotions during the learning of difficult material. In: Ross, B. (ed.) Psychology of Learning and Motivation, vol. 57, pp. 183–225. Academic Press, San Diego (2012)
14. Sottilare, R.: Considerations in the development of an ontology for a generalized intelligent framework for tutoring. In: International Defense & Homeland Security Simulation Workshop in Proceedings of the I3M Conference, Vienna, Austria, September 2012
15. Sottilare, R., Ragusa, C., Hoffman, M., Goldberg, B.: Characterizing an adaptive tutoring learning effect chain for individual and team tutoring. In: Proceedings of the Interservice/Industry Training Simulation & Education Conference, Orlando, Florida (2013)
16. Bell, B., McNamara, J.: How to avoid using stupid agents to train intelligent people. In: Proceedings of the Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), December 2005
17. VanLehn, K.: The behavior of tutoring systems. Int. J. Artif. Intell. Educ. 16(3), 227–265 (2006)


18. Baylor, A.: Intelligent agents as cognitive tools for education. Educ. Technol. 1, 36–40 (1999)
19. Nye, B.D., Auerbach, D., Mehta, T.R., Hartholt, A.: Building a backbone for multi-agent tutoring in GIFT (work in progress). In: Proceedings of the 5th Annual Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym5), 17 July 2017, p. 23 (2017)
20. Bellifemine, F., Caire, G., Poggi, A., Rimassa, G.: JADE: a software framework for developing multi-agent applications. Inf. Softw. Technol. 50(1), 10–21 (2008)

Adaptive Team Training for One

Elizabeth Biddle and Barbara Buck

The Boeing Company, Orlando, FL, USA
[email protected]
The Boeing Company, St. Louis, MO, USA
[email protected]

Abstract. There is an increase in pilot demand at current commercial training system facilities. With technology advances in head-worn augmented reality and virtual reality (AR/VR) interfaces, adaptive training technologies, and human behavior representation, the potential exists to leverage this technology to offload or download training from traditional pilot training simulators to virtual training devices. Accordingly, Boeing has been applying these capabilities to the development of VR training prototypes designed to address different aspects of a pilot training curriculum [1]. The first prototype implemented is a Ground Procedures Trainer that focuses on both the taskwork and teamwork competencies, as defined in Sottilare et al. [2], to provide an individual student pilot the opportunity to train on his or her own. This paper discusses the method used to provide training to an individual for both taskwork and teamwork related skills.

Keywords: Team training · Adaptive instructional systems · Intelligent tutoring

1 Introduction

1.1 Purpose

There is an increase in pilot demand at current commercial training system facilities. Consequently, training providers are looking for new economical training technologies to increase pilot training and provide an alternative to Full Flight Simulators (FFSs) and Fixed Training Devices (FTDs). Pilot training conducted in these devices focuses on the mechanics of flying as well as the interaction between the Captain and First Officer, and their interaction with external personnel (e.g., ground support, cabin crew). With technology advances in head-worn augmented reality and virtual reality (AR/VR) interfaces, adaptive training technologies and human behavior representation, the potential exists to leverage this technology to offload or download training from the FFS or FTD to VR training devices. Accordingly, Boeing has been applying these capabilities to the development of VR training prototypes designed to address different aspects of a pilot training curriculum [1]. The first prototype implemented is a Ground Procedures Trainer that focuses on both taskwork and teamwork competencies, as defined in Sottilare et al. [2], prior to the taxi phase. It is an adaptive team training prototype designed for individual student training.


While there have been team intelligent tutoring systems (ITSs) implemented previously [3], this prototype differs in that it employs a synthetic role-player to fill out the team. The Ground Procedures Trainer provides a lesson to an individual student pilot that covers four procedures that a Captain and First Officer perform while the aircraft is at the gate, prior to taxi. The student assumes the role of the First Officer, while the VR Pilot provides the Captain role. For the purposes of this prototype, taskwork refers to the performance of checklists and interaction with the flight deck to ensure the aircraft is ready for engine start and taxi. Teamwork in the context of the work described herein refers to the interaction between the Captain and First Officer, which is typically referred to as Crew Resource Management (CRM) [4]. The focus of this paper is to describe the methodology used to implement adaptive training that supports the development of the taskwork and teamwork skills required for ground procedures in a commercial pilot flight training scenario.

1.2 Crew Resource Management

An accepted definition of team is "a distinguishable set of two or more people who interact dynamically, interdependently, and adaptively toward a common and valued goal/objective/mission, who have each been assigned specific roles or functions to perform, and who have a limited life-span of membership" [5]. Team performance research has led to the development of a wide range of models and theories [5] to describe the interactions within a team and provide a set of competencies to train teamwork. With respect to flight crew operations and, in particular, commercial aviation, crew resource management (CRM) is the term primarily used to refer to the interactions between the flight deck crew and other human roles involved with a flight, such as cabin crew in the commercial sector and maintenance personnel. A related term, Cockpit Resource Management, refers solely to the interactions between the flight deck crew. With that said, there are differing definitions of Crew/Cockpit Resource Management.

Federal Aviation Administration Crew Resource Management (CRM)
The FAA [4] defines CRM as the "effective use of all available resources: human resources, hardware and information…to operate a flight safely" (p. 2). While CRM initially referred to Cockpit Resource Management and focused on the interactions between the Captain and First Officer (flight deck crew) only, it was expanded to Crew Resource Management and the combined activities of the flight deck crew, cabin crews, aircraft dispatchers, maintenance personnel and air traffic controllers. CRM involves the following activities: "team building and maintenance, information transfer, problem solving, decision making, maintaining situational awareness and dealing with automated systems" (p. 2). The FAA has defined three high-level performance clusters to describe the categories of CRM activities. The clusters are: 1-Communications Processes and Decision Behavior, 2-Team Building and Maintenance, and 3-Workload Management and Situational Awareness. Each cluster is categorized further into a set of behaviors, and then a set of behavioral markers, which are demonstrable examples that support the behavior.


"Crewmembers acknowledge their understanding of decisions" is an example of a behavioral marker, and it is associated with the Communications Processes and Decision Behavior cluster and the Communications/Decisions behavior. The CRM clusters and behaviors are shown in Table 1. The FAA [4] recommends three critical components of CRM training: indoctrination/awareness, recurrent feedback and practice, and continuing reinforcement.

Table 1. CRM clusters and behaviors

Cluster: Communications processes and decision behavior
• Briefings: Preflight activity to coordinate, plan and identify potential problems
• Inquiry/Advocacy/Assertion: Purposeful promotion of the course of action the team member feels is best despite potential conflict that could arise within the crew
• Crew Self-Critique Regarding Decisions and Actions: Facilitating discussion after an event that includes the product, the process, and the people involved
• Communications/Decisions: Free and open communication in which appropriate information is shared clearly and crew involvement in decision making

Cluster: Team building and maintenance
• Leadership Followership/Concern for Tasks: Extent to which crew is concerned with effective accomplishment of tasks
• Interpersonal Relationships/Group Climate: Quality of interpersonal relationships and pervasive climate of flight deck

Cluster: Workload management and situational awareness
• Preparation/Planning/Vigilance: Anticipating contingencies and the various actions that may be required
• Workload Distributed/Distractions Avoided: How well crew manages to prioritize tasks, share workload, and avoid being distracted from essential activities

CRM Reference in Evidence-Based Training
Most airline companies in and outside of the United States participate in the International Civil Aviation Organization (ICAO). In 2013, ICAO introduced publication 9995, entitled "Manual of Evidence-Based Training (EBT)" [6], which focuses on the core competencies of piloting skills. Five of the eight competencies defined in the EBT method focus on CRM, which underscores the importance of CRM to the role of a commercial airline pilot. While these competencies are similar to the clusters and behaviors defined by the FAA, they are distinctly different. The CRM-related competencies identified in the EBT manual are provided in Table 2. Similar to the FAA documentation, the EBT manual defines a set of performance indicators, another term for the behavioral markers used in the FAA model, to provide examples of behaviors that demonstrate performance of the CRM competencies. The EBT manual stresses that the CRM competencies be addressed throughout various training stages and regimes, as recommended by the FAA.


However, the methods for developing the training regimes and defining competencies differ between the two approaches, specifically in how the competencies were derived.

Table 2. CRM competencies defined by the ICAO EBT manual

• Communication: Demonstrates effective oral, non-verbal and written communications, in normal and non-normal situations
• Leadership & Teamwork: Demonstrates effective leadership and team working
• Problem solving & Decision making: Accurately identifies risks and resolves problems; uses the appropriate decision-making processes
• Situation awareness: Perceives and comprehends all of the relevant information available and anticipates what could happen that may affect the operation
• Workload management: Manages available resources efficiently to prioritize and perform tasks in a timely manner under all circumstances

CRM in the Military
The Air Force refers to CRM as Cockpit/Crew Resource Management, defined as the effective use of all available resources by individuals or crews to safely and efficiently accomplish an assigned mission or task [7], or more simply, "things aircrews do" (p. 4) [8]. Air Force Instruction (AFI) 11-290 [8] defines CRM and identifies six knowledge and skill sets, which are provided in Table 3. The CRM training doctrine outlined in AFI 11-290 encompasses both human-crewed flight and remotely piloted flight activities.

Table 3. CRM knowledge and skill sets per AFI 11-290

• Communication: Knowledge of common errors/barriers, listening/feedback skills and efficient information exchange
• Crew/Flight coordination: Knowledge and skills to enable internal and external team mission coordination, understanding of impact of attitudes and ability to resolve conflict
• Mission analysis: Pre-, current and post-mission analysis and threat and error management techniques
• Risk management/Decision making: Risk assessment, management and problem-solving; understanding hazards and breakdowns
• Situational awareness: Identifying errors; prevention, recognition and recovery of loss of situational awareness
• Task management: Setting priorities; recognizing over/under-load; automation management; and checklist discipline

1.3 Adaptive Instruction and Team Training Issues

The term "Adaptive Instructional Systems" (AIS) was first coined by Brawner and Sottilare [9] and refers to a family of systems that supports the provision of tailored instruction to a student. Intelligent tutoring systems (ITSs) fall under the AIS family. The general components of an ITS include an expert model, an instructional model, a student model, and a problem-solving environment. The expert model encapsulates the expert's problem-solving skill in a machine-executable form. The student model provides a current and historical record of the student's mastery of the expert's problem-solving skills. The instructional model provides the logic and implementation of an instructional strategy based on a comparison of the expert and student models. The problem-solving environment is where the student can demonstrate and practice his/her skills. Team training introduces additional issues, since it requires monitoring both taskwork and teamwork skills.

With respect to the prototype described in this paper, our ITS architecture (Fig. 1) involves three primary components: a Student Model, an Instructional Model, and an Expert Model. The student model implements a profile of dynamically maintained variables, each corresponding to one learning objective (LO). These variables are evaluated over a number of observations. As a result, changes due to learning are reflected across exercises, as the score increases due to correct performance (based on the expert model), or decreases as errors are made. The amount that scores are changed can be weighted according to the degree to which the action reflects mastery of the LO. The amount of change is also adjusted according to the degree of support (e.g., hints) the ITS provided to the student during the exercises.

Fig. 1. The Boeing ITS architecture.
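A minimal sketch (ours, with hypothetical names and weights, not the actual Boeing ITS code) of the kind of per-learning-objective update described above: each LO variable rises with correct performance, falls with errors, and the change is weighted both by how strongly the action reflects mastery and by how much support (hints) the ITS provided.

```python
# Hypothetical sketch of a student-model update keyed to learning objectives (LOs).
class StudentModel:
    def __init__(self, lo_ids, initial=3.0, lo_min=1.0, lo_max=5.0):
        self.scores = {lo: initial for lo in lo_ids}
        self.lo_min, self.lo_max = lo_min, lo_max

    def record(self, lo_id, correct, mastery_weight=1.0, hints_used=0, step=0.25):
        """Update one LO score for one observed action."""
        if correct:
            # Credit is reduced when the ITS had to provide support (e.g., hints).
            delta = step * mastery_weight * max(0.0, 1.0 - 0.25 * hints_used)
        else:
            delta = -step * mastery_weight
        new_score = self.scores[lo_id] + delta
        self.scores[lo_id] = min(self.lo_max, max(self.lo_min, new_score))
        return self.scores[lo_id]

# Example: a correct action performed with one hint, then an error on another LO.
model = StudentModel(["LO1", "LO2"])
model.record("LO1", correct=True, mastery_weight=1.0, hints_used=1)
model.record("LO2", correct=False, mastery_weight=0.5)
```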

This initial capability, the Boeing ITS, uses discrete event scenarios in which the student's actions bring them through a particular path (using the instructional model) that is more or less correct. An evolution of the ITS was the Virtual Instructor (VI), which supports dynamic simulation-based training environments, in which the problem-solving scenarios can be a range of simulation/gaming environments. The Ground Procedures Trainer extends the VI to support training of taskwork and teamwork skills for an individual student.

While there has been tremendous progress in the development of ITSs to train individual tasks, or taskwork (the technical performance of a role or function), ITSs designed to train teamwork, such as CRM, are still in the research phases due to additional complexities [2, 3]. The implementation of an AIS capability in a team training environment requires the training audience (students) to simultaneously perform taskwork and teamwork. Further, both the individuals on the team and the team itself have sets of taskwork and teamwork goals. This requires assessment of these areas at both the individual and team level, as well as a decision on whether to provide an instructional intervention to the entire team or just an individual, and on how to adapt the scenario if the issue is at the individual level rather than the team level (Fig. 2).

Fig. 2. Assessment and instructional intervention model for team training.

Specifically, a team training AIS requires multi-level automated human performance assessment of individual and team taskwork and teamwork-related competencies. The next issue is to use these multi-level assessments to update student models at both the individual and team level, and to diagnose the learning needs at these two levels. Learner needs are identified and used to guide the implementation of feedback and other instructional interventions. Providing interventions at a team and collective level is complex, as some individuals may require feedback that is irrelevant to other team members. This makes the decision of which interventions to apply, and when, most difficult. The development of the VR pilot training tools discussed in this paper addresses the issue of assessing taskwork and teamwork simultaneously. However, since the training tool is designed for an individual student, using a synthetic team member to fill out the flight crew, the issue of whether to provide feedback to the individual or the team did not need to be addressed.
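To make the individual-versus-team decision concrete, here is a small illustrative sketch (not from the paper; the thresholds, score scale, and names are assumptions) of how multi-level assessments might drive the choice of intervention level depicted in Fig. 2.

```python
# Hypothetical sketch: choosing the intervention level from multi-level assessments.
def choose_intervention(individual_scores, team_score, threshold=3.0):
    """individual_scores: dict of learner id -> taskwork/teamwork score (1-5 scale).
    team_score: aggregate teamwork score for the whole crew.
    Returns (level, targets) where level is 'team', 'individual', or 'none'."""
    struggling = [lid for lid, s in individual_scores.items() if s < threshold]
    if team_score < threshold:
        # Team-level gap: give feedback (or adapt the scenario) for the whole crew.
        return "team", list(individual_scores)
    if struggling:
        # Only some members need help: target them without interrupting the rest.
        return "individual", struggling
    return "none", []

# Example: one crew member is below threshold while the team as a whole is on track.
level, targets = choose_intervention({"captain": 4.2, "first_officer": 2.6}, team_score=3.8)
```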

2 Ground Procedures Trainer Overview

The Ground Procedures Trainer integrates three primary technologies: (1) VR Flight Deck, (2) VR Pilot, and (3) VR Instructor. The VR Flight Deck used as the training environment in this lesson is a commercial aircraft virtual reality flight deck. The VR Pilot is a synthetic representation of the Captain (Fig. 3) that supports CRM training by enabling an individual student to train on procedures that require interaction between the Captain and First Officer without requiring a second student or instructor role-player. While the student performs the procedures, the VR Pilot will perform the expected duties of a Captain, as well as respond to the student's actions. For example, if the student says something out of context or incorrect, the VR Pilot will ask the student to repeat or remind the student of the procedure they are currently performing.

Fig. 3. The VR pilot.

Evaluation of the student's actions is supported by the VR Instructor using data from the VR Flight Deck and the student's interactions with the VR Flight Deck (e.g., head movement, hand/arm movement) and the VR Pilot (e.g., speech). Additionally, the VR Instructor will provide feedback to the student in response to his or her actions through a variety of means. For example, if the VR Instructor determines that the student is looking at the incorrect instrument, the correct instrument will be highlighted, or the VR Instructor may verbally inform the student what instrument to use. The student is also able to prompt the VR Instructor for help. In some cases, the VR Instructor will prompt the VR Pilot to provide specific feedback, such as directing the student's attention to a different part of the procedure. The VR Pilot will interact with the student and the VR Flight Deck to complete the flight tasks that are expected of the Captain (taskwork) as well as support CRM-related interactions with the student.

2.1 Lesson Learning Objectives

In order to provide adaptive training to develop both taskwork and teamwork competencies, student performance assessment is linked to either a set of task-related behavioral indicators or CRM-related behavioral indicators. As discussed previously, there are a variety of competencies and behavioral indicator sets that have been developed to support CRM training in a fixed-wing aircrew environment. Since the VR pilot training prototypes are focused on commercial pilot type rating training, the Ground Procedures Trainer uses multiple sets of learning objectives taken from a type rating syllabus as well as the FAA and ICAO competencies. The learning objectives are identified below in Table 4.

Table 4. Learning objectives for the ground procedures trainer prototype

Type rating
• T1: Locate all airplane systems (in the scenario)
• T2: Operate all airplane systems (in the scenario)
• T3: Demonstrate proficiency in performing normal procedures
• T4: Demonstrate proficiency in the use of the associated checklists

ICAO CRM
• I1: Application of Procedures: Identifies and follows all operating instructions in a timely manner
• I2: Application of Procedures: Correctly operates aircraft systems and associated equipment
• I3: Communication: Ensures the recipient is ready and able to receive the information
• I4: Communication: Conveys messages clearly, accurately and concisely
• I5: Communication: Uses eye contact, body movement and gestures that are consistent with and support verbal messages
• I6: Leadership and Teamwork: Carries out instructions when directed

FAA CRM
• F1: Communications: Crew members seek help from others when necessary
• F2: Team Building and Maintenance: Time available for the task is well managed
• F3: Workload Management and Situational Awareness: Crewmembers speak up when they recognize work overloads in themselves or in others

2.2 Scenario Event Linkage to Learning Objectives

After the learning objectives were specified, the ground procedures that comprise the lesson were decomposed into behaviors and expected actions, with the expected actions providing the basis for assessment. The expected actions were linked to the learning objectives identified in Table 4. In some cases, an expected action was related to more than one learning objective and therefore linked to multiple learning objectives. The linkage to learning objectives is what enables the student model to be updated as a result of an assessment of the student's actions. The following example is provided to illustrate the process; a minimal code sketch of the resulting linkage follows the lists.

One of the steps in a procedure addressed in the lesson is for the student to visually check the flap setting and verbally confirm the flap setting to the Captain. This step involves two primary actions. First, the student must look at the correct instrument, in this case, the flaps. This action addresses the following learning objectives from the above list:

• T1 - Locate all airplane systems (in the scenario) – the student must look at the flaps
• T4 - Demonstrate proficiency in the use of the associated checklists – knowing that checking the flaps is the next step and demonstrating knowledge of the proper order for the checklist items
• I1 - Application of Procedures: Identifies and follows all operating instructions in a timely manner – the student is given a time constraint for receiving credit for identifying the flaps

Next, the student must correctly state the current flap setting so that the Captain is able to acknowledge it. This second behavior is related to the following learning objectives:

• T1 - Locate all airplane systems (in the scenario) – the student must say the correct name for the flaps
• T4 - Demonstrate proficiency in the use of the associated checklists – the student is verbally confirming the flap setting, which is a required part of the procedure
• I4 - Communication: Conveys messages clearly, accurately and concisely – in order to receive credit, the confirmation of the flaps must be understood by the VR Instructor speech recognition system
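As a sketch of how such a linkage might be represented (our illustration; the LO identifiers follow Table 4, while the action names, data structure, and update rule are assumptions), an expected action can simply carry the list of LO identifiers it evidences, so that one assessed action updates several LO scores at once.

```python
# Hypothetical sketch: expected actions carry the LO identifiers they evidence.
EXPECTED_ACTIONS = {
    "look_at_flaps":      {"los": ["T1", "T4", "I1"], "time_limit_s": 10},
    "state_flap_setting": {"los": ["T1", "T4", "I4"]},
}

def assess(action_id, correct, lo_scores, step=0.25):
    """Propagate one observed action to every learning objective linked to it."""
    for lo in EXPECTED_ACTIONS[action_id]["los"]:
        lo_scores[lo] = lo_scores.get(lo, 3.0) + (step if correct else -step)
    return lo_scores

# Example: the student looked at the flaps within the time limit.
scores = assess("look_at_flaps", correct=True, lo_scores={})
```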

2.3 Assessment of Learning Objective Scores

Linking the learning objectives to the expected actions and related metrics enables the learning objectives to be updated in real time based on the assessment of the student's actions. The learning objectives are scored on a scale of 1–5, adapted from commercial pilot training assessment standards, with a score of 5 demonstrating perfect performance and a score of 1 indicating that performance was not at all demonstrated. Scores are decremented if the student required a hint in order to perform the task correctly. Figure 4 shows a graph of learning objective scores by node over the course of a lesson.


Fig. 4. Learning objective scores graphed over time throughout the lesson.

The tabs across the top indicate that graphs are provided for the three sets of learning objectives, and the hint tab displays the number of hints used for each node. The ICAO competencies tab is selected in Fig. 4. The graph shown is for the selected learning objective, I1, and it shows the nodes in which learning objective I1 was evaluated as well as the score at each node. The average score for the learning objective is shown at the end of the learning objective description; in this case, the average score for learning objective I1 is 4.4. The scores for the other ICAO learning objectives are shown at the end of their textual descriptions. The graphs for these learning objectives are viewable by clicking on the learning objective text.

2.4 Feedback Strategy

Instructional feedback is provided in several forms: highlighting the instrument/equipment that the student should be viewing, verbal feedback from an "instructor" or the VR Pilot, and auto-completion of a step in the procedure. Examples of feedback that the VR Instructor may provide to the student include statements to let the student know that the action is not part of the procedure, statements that say the action is part of the procedure but not in the proper sequence, or even telling the student what action they need to perform. Feedback used by the VR Pilot when the student's speech is unclear includes statements such as "come again" or "I didn't get that". The student receives feedback either at their request, through the use of the hint function, or if they fail to perform a step within a certain time period.

The student is able to request hints verbally if he or she is having difficulty. There are three levels of hints. The first level hint highlights the instrument that the student is supposed to interact with. The second level hint has the VR Instructor tell the student what task to perform. The third level hint results in the VR Instructor completing the task automatically and informing the student to move on to the next step in the procedure. The final form of feedback provided to the student is a review of the lesson when they have completed it. The review provides the overall scores for each of the learning objectives so they are able to determine which learning objectives they were most successful at, and which learning objectives will require further work. The student, or an instructor, will be able to drill down in the graph to determine on which nodes they scored better or worse. Figure 5 provides a screenshot of the learning objective summary.

Fig. 5. Learning objective summary.
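The three-level hint escalation described above could be sketched as follows; this is an illustration only, and the function names, data fields, and highlight/speech hooks are placeholders, not the actual VR Instructor API.

```python
# Hypothetical sketch of the three-level hint escalation used by the VR Instructor.
def give_hint(step, hint_level):
    """hint_level counts how many hints the student has already received on this step."""
    if hint_level == 0:
        return {"action": "highlight", "target": step["instrument"]}          # level 1
    if hint_level == 1:
        return {"action": "speak", "text": f"Please {step['description']}."}  # level 2
    return {"action": "auto_complete", "target": step["instrument"],
            "text": "Let's move on to the next step."}                        # level 3

# Example step from a ground procedure checklist (invented values).
step = {"instrument": "flap_lever", "description": "check the flap setting"}
for level in range(3):
    print(give_hint(step, level))
```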

3 Conclusion

3.1 Training Effectiveness Evaluation

The Ground Procedures Trainer will undergo an evaluation to compare its training effectiveness to the current means of learning flight procedures, which is in a Fixed Training Device (FTD). When the student receives training in an FTD, there are typically two students, one in the role of the Captain and one in the role of the First Officer, and the students will switch roles to experience both roles. In some cases, an instructor, in addition to the instructor teaching the lesson, plays one of the roles if two students are not available. The lessons in the FTD walk through the procedures for an entire flight, with the goal of making the procedures more natural and fluid. The FTD typically is not flown for the entire flight, but jumps to various stages of flight to focus on the completion of procedures. The FTD uses a mix of real aircraft instrumentation and flat panel displays representing the aircraft systems. Typically, a realistic seat from an aircraft cockpit is used, although some FTDs use office-style chairs.

For the Ground Procedures Trainer, the VR cockpit provides the flight deck environment as described previously. The student wears a VR headset to view the cockpit, with earphones and a microphone for audio feedback and verbal interaction. The student uses a VR hand controller or haptic glove to manipulate instruments in the cockpit and is seated in an office-style chair. For this lesson, the student's role is the First Officer, who is therefore seated in the right seat. The Captain is role-played by the VR Pilot, as discussed previously, and located in the left seat.

The training effectiveness evaluation is designed as an experiment to compare the FTD lesson with the VR Ground Procedures Trainer. The training effectiveness will be evaluated through the performance of the ground procedures in a Full Flight Simulator (FFS), which provides an environment that closely replicates an operational flight deck. The participants will begin by reviewing and signing the consent form. Next, they will complete a demographic survey to obtain information regarding their prior flight experience as well as gaming and VR experience. They will then be provided an overview of the ground procedures trainer and a practice period to review the ground procedures using a poster of the flight deck, which is what the students typically do to prepare for their FTD lessons. Next, the participants will receive instruction on the use of the VR prototype or the current training device. They will then complete the lesson three times in either the VR prototype or the current training device. For the current training device condition, the FTD, one instructor pilot will serve as the instructor, and a second instructor pilot will role-play the Captain. After the participants complete the lesson three times in either condition, they will complete a post-training survey to obtain feedback on their experience. Finally, the participants will complete an assessment in an FFS with the participant assuming the First Officer role and another pilot role-playing the Captain.

Summary

The VR ground procedures trainer is the first step in the creation of a variety of VR based training capabilities to reduce time spent in large, traditional simulators. The use of adaptive training capabilities provides the student with more opportunities to receive training at their convenience, and the use of the VR Pilot enables the student to train without the need of a second student or role player. There are multiple theories and frameworks for training and assessing teamwork competencies. In order for the VR pilot training applications to be used under various regulatory authorities that prescribe to different CRM models and training best practices, the VR adaptive training capability is designed to work with multiple competency frameworks.


References

1. Buck, B., Genova, M., Dargue, B., Biddle, E.: Adaptive learning capability: user-centered learning at the next level. In: Sottilare, R. (ed.) Intelligent Tutoring Systems Conference 2018 Industry Track Proceedings, pp. 3–11 (2018). http://ceur-ws.org/Vol-2121/Preface.pdf. Accessed 19 Feb 2019
2. Sottilare, R.A., Burke, C.S., Salas, E., Sinatra, A.M., Johnston, J.H., Gilbert, S.B.: Towards a design process for adaptive instruction of teams: a meta-analysis. Int. J. Artif. Intell. Educ. (2017). https://doi.org/10.1007/s40593-017-0146-z
3. Sinatra, A., et al.: Considerations for dealing with real-time communications in an intelligent team tutoring system experiment. In: The Team Tutoring Workshop of the 19th International Conference of the Artificial Intelligence in Education (AIED) Conference, London, England, United Kingdom (2018)
4. Federal Aviation Administration (FAA): Crew resource management training. Advisory Circular 120-51E (2004)
5. Salas, E., Dickinson, T.L., Converse, S., Tannenbaum, S.I.: Toward an understanding of team performance and training. In: Sweezey, R.W., Salas, E. (eds.) Teams: Their Training and Performance, pp. 3–29. Ablex, Norwood (1992)
6. International Civil Aviation Organization (ICAO): Manual of evidence based training methodology. Document 9995, AN/497 (2013)
7. Nullmeyer, R., Spiker, V.A., Wilson, D., Deen, G.: Key crew resource management behaviors underlying C-130 aircrew performance. In: Proceedings of the 2003 Interservice/Industry Training Simulation and Education Conference, Orlando, FL (2003)
8. Secretary of the Air Force: Cockpit/crew resource management program. Air Force Instruction 11-290 (2018). https://static.e-publishing.af.mil/production/1/acc/publication/afi11-290_accsup/afi11-290_accsup.pdf. Accessed 19 Feb 2018
9. Brawner, K., Sottilare, R.: Proposing module level interoperability for intelligent tutoring systems. In: The Exploring Opportunities to Standardize Adaptive Instructional Systems (AIS) Workshop of the 19th International Conference of the Artificial Intelligence in Education (AIED) Conference, London, England, United Kingdom (2018)

Adaptive Training: Designing Training for the Way People Work and Learn

Lara K. Bove

SAIC, Reston, VA, USA
[email protected]

Abstract. Many people think "adaptive instruction" is synonymous with "personalized instruction" or training that is designed to appeal to the desires and whims of the individual learner. However, since the purpose of training is to support an organization's needs for a fully competent and developed workforce, adaptive instruction is better thought of as training that is designed to change the training based on the learning needs of the individual (or group) as identified by gaps in the learner's knowledge as well as the learner's depth of understanding with the content. After all, the purpose of training is to "impact … individual, process, work team, and/or organizational performance" [1]. The adaptations are all about providing each learner with the training they need based upon precise gaps in the individual's knowledge and skill level (e.g., novice or expert) with the content, all while ensuring the training is designed in line with learning theories about how people learn. For example, if the gap is for a specific motor skill, then the training will provide opportunities for the learner to practice, develop, and hone the skill. This could be done in virtual, simulated, or live environments. The reason for having physical practice is not because the learner "prefers motion" or is a so-called "kinesthetic learner," but because the desired training outcome is physical performance of the task. Those who are unfamiliar with the instructional design approaches to adaptive instruction might view the notion of "testing out" of a course as a form of adaptive training (or at the very least a precursor to the current approach). However, the test-out method is built around a linear view of training, as well as a view that all learners must work through all of the training materials. The problem is that people have very different backgrounds and knowledge bases and clearly do not all require the same training to be able to succeed in their work [2]. In addition, sometimes the training gaps demonstrate that some of the training which has been traditionally taught in a linear fashion does not need to be delivered in this way. This is not to suggest that there are not logical sequences or progressions of development. Rather, instructional design should expand its view of what is possible in training delivery and design the course to best meet the needs of the learners and workforce. Consider that 20 years ago we had to design courses to fit into specific time blocks because we had to account for the limitations of having an instructor and trainees gathered together. This led to the development of some courses that follow a highly structured progression even though the concepts and skills are not logically linear at all. The use of adaptive instruction can allow for a very different method of delivery. Instructional designers use learning theories and best practices in instructional design to determine which teaching strategies to use for specific types of learning.


Thus, it falls to the instructional designer to determine how the training ought to be developed to best meet the needs of the individual learner. The adaptive instructional system (AIS) must have methods to identify the learner's knowledge and skill level and must deliver training which supports each unique combination of knowledge and skill level, especially as learners gain knowledge and their skill levels change. The author investigates research-based approaches to identifying the knowledge gaps and skills in academic settings and applies these to workplace training. Further, the author identifies specific instructional design approaches and organizational thinking which are needed to support this type of adaptive approach to workplace training. The paper identifies the necessary technical components and capabilities for a fully functional AIS which provides adaptive training based on the learner's gaps in knowledge and different skill level (from novice to expert). The discussion provides specific examples to illustrate how the adaptive training would differ for the individual learners in training designed for pilots, air traffic controllers, project managers, etc. Finally, the author calls upon instructional designers and software engineers to collaborate in developing systems which will provide the type of training the workforce needs and help us learn more about how people learn so that we can refine the training even as we learn about what works best.

Keywords: Adaptive instruction · Training · Expert · Novice · Training efficiency measures

1 Pilot Training: A Need for Adaptive Instructional Systems

The idea for this research paper began when the author was tasked with evaluating and redesigning the academic curriculum component of an undergraduate pilot training program. The original curriculum was designed to be delivered over an 18-month period during which students would have a limited number of flights in high-fidelity simulators, and eventually they would have solo flights. The goal of the new program was to reduce the total training time by as much as 9 months. Students were given low-fidelity simulators which they could use as often as they liked from day one of the training. In order to reduce the total time for the course, the academic training needed to occur in a condensed time frame (as compared to its original format) and would need to happen in tandem with the students' initial simulated flight experiences. In fact, because the low-fidelity simulators led to students' solo flights happening sooner, the customer even sought to provide academic instruction prior to students' arrival so that they would be able to complete the academics before live flight.

The existing curriculum had been designed for live instruction combined with e-learning modules and reading materials. The e-learning modules were primarily click-and-read with a limited amount of animated graphics and videos to demonstrate or explain some of the concepts. The reading materials were lengthy guides of 100 or more pages per topic. Most of the live instruction was designed to provide students with support for their independent work (e-learning and readings).


The customer believed that students could likely test out of some of the lessons or pieces of the curriculum and that this would reduce the time burden. Further, the customer was interested in providing individualized training. As the author examined the academic training content, several challenges became apparent.

1. What is the best way to determine which components students have already mastered? There was so much material (over 1,000 pages of student readings alone) that even a simplistic test-out method could become burdensome in and of itself.
2. How should the training be designed to adapt to the needs of each student while still ensuring all students meet all of the requirements (i.e., mastered the objectives)? They needed an adaptive delivery system which could accommodate the individual needs and account for all the variables.
3. What design tools could the author use to map all of this training material? The tool needed to enable the design to allow for different learning paths, different learner aptitudes, sequencing, additional supports for students when needed, etc.

The author conducted research to find the answers to the questions described above, and in so doing developed an outline for an adaptive instructional system which could be used for training such as the pilot training program, but is much more broad-based in its application to workplace training. This paper describes that adaptive instructional approach.

2 Adaptive Instruction for Organizational Training

2.1 Adaptive Training vs. Personalized Training

The purpose of training is to enable individuals and groups to improve their performance to better support the organization [1]. Adaptive instruction for workplace training should be designed to improve the employee-learner's ability to achieve the outcome or otherwise improve the outcomes for the organization. Thus, adaptive training is defined as training which is delivered or tailored to the individual in order to best support learning mastery for the individual within the confines of the organization's needs.

It is important to distinguish training that is tailored to meet a learner's needs from training that is tailored to meet a learner's preferences. Some argue that adaptive training should be developed to support the individual learner's preferences (e.g., create videos and games for learners who prefer learning this way). However, research shows that developing training for different delivery methods based on learner preference yields no greater results, and may even be harmful because learners do not develop or hone skills needed to learn (or succeed) in different learning environments/contexts [3]. Indeed, we must ensure our workforce can learn in all different contexts because we cannot possibly deliver every training to meet every learner's preference. Further, developing duplicative training is costly. The author distinguishes the two by referring to training delivered based on user preference as personalized training and training delivered based on learner needs as adaptive training.

2.2 Adaptive Training for Different Skill Levels

In the workplace it is highly likely that individuals, even those working the same job, have varying skill levels with the specific content. Yet there is a commonly held misunderstanding that all learners need all of the same content [2]. In addition to the problem of not recognizing differences among learners (or not acknowledging those differences when developing training), most organizations do not have systems in place to support the development or delivery of such differentiated training. Consider the traditional live training in which employees are brought together for the training for a set amount of time (i.e., half-day training or a three-day course). In such an environment, all learners need to be ready to tackle each topic at the same time, moving at the same pace and reaching the finish line at the same time. While it is possible, and even likely, that there are instructors who differentiate instruction for some learners, when appropriate and possible, there are also built-in constraints limiting such adaptations, if only due to the need to meet all course objectives. Further, such differentiation is not often built into the training. As for computer-based training, this too is mainly designed with a one-size-fits-all approach. It is true that learners do not necessarily have to move at the same pace, but they are generally all provided the same exact training content and are expected to go through all of it.

The Learning Continuum. Thinking of learning as a continuum, the author defines three specific skill levels among learners: novice, intermediate, and expert. A skill level refers to the depth of understanding or familiarity a learner has with the topic. A novice is someone who has little, if any, prior knowledge in the particular content area; and an expert is someone who has a broad knowledge base and can draw from experience and a complex level of understanding. The intermediate falls somewhere in between the novice and expert. It is important to note that the term expert is used to describe the level of experience for the learner, and is not to be confused with a subject matter expert (SME). All of the learners identified–from novice through to expert–need some training on the topic; the difference lies in how much training or what type of training they need. A person can be a novice in one area and an expert in another, and in fact this is to be expected. Thus, it is important to determine the learner's skill level (novice, intermediate, expert) for each topic.

The one-size-fits-all model not only wastes the valuable resource of employee time when employees are given training they do not need, but it also results in diminishing returns because people respond to different training approaches as their knowledge, skill, and/or ability improves [4, 5]. In addition, the expertise-reversal effect shows that providing experts with training designed for a novice can have negative impacts on the training outcomes for experts and vice versa [6]. In other words, if we remove the extra supports and information which could help the novice, but require the novice to take the training, it is likely that the novice will not do well. And if we require the expert to take the training designed for the novice, the expert is likely to do poorly in the training as compared to training designed for the expert.


The expertise-reversal effect is not intuitive at all; we often assume that the more experienced worker will thrive regardless of the training approach. However, research has shown time and again that providing extraneous materials leads to poorer learning outcomes for various learning contexts, including math, literature, statistics, foreign language, computer skills, accounting, etc. [7]. Since the purpose of training is to enable people to perform well at their jobs, poor training scenarios risk poor job performance outcomes. As noted, research shows the need for differing training approaches based on the learner's skill with the topic. Following are two examples which the author created to demonstrate how adaptive training could be provided to the novice, intermediate, and expert.

Project Manager Training. A company develops training for new project managers which includes several topics: budgeting; reports; human resources and personnel management; and contracts and legal compliance. The new managers have different levels of experience with each of the topics. Some of the learners will be familiar with all of the required reports, and will be able to tackle this component of their new role with minimal training. Others will be very comfortable with all of the information required in the reports, but will have no idea how to navigate the section of the company's intranet in order to find the report templates as well as information that goes into the reports. The adaptive training would be provided as such:

• The learner who only needs training on navigating the intranet will be provided with that information only.
• The learners who do not require information about the intranet will be given training which focuses on the reports themselves.
• For learners who need all of the information, the training will focus on one component at a time, guiding learners throughout the process so that learners can master each task.

Inquiry/Discovery-Based Training for IT Security. The following example is specifically provided to demonstrate that even inquiry-based (discovery-based) approaches can be used in an adaptive instructional model. Discovery-based training approaches are based on the work of Jerome Bruner, who said that people learn when they build meaning for themselves through the act of discovery [8]. Kalyuga identified specific approaches to discovery-based instruction for the novice [4]. When considering the discovery-based approach for adaptive instruction, it is important to recognize the challenges facing learners of different skill levels. In particular, if the training is too open-ended, the novice will expend effort trying to understand some of the basics at the expense of having cognitive resources to make discoveries. Because the novice is new to the content, s/he will not know what is most important for the task at hand. Thus, the discovery-based approach must be designed differently for the novice than for the expert. In the case of the novice, the training must limit the extraneous information (or unknowns) and focus on the area where the learner should be making the discovery. Here is how the IT Security training might be different for the novice and the expert. The training is about complex technical security policies.


Learners are provided with a real-life example and asked what they should do in the situation. All learners will be given an open-ended question, but the training will differ as follows (a minimal sketch of one possible way to assemble such differentiated content follows the list).

• The training for the novice will only include the terminology which is important to the task at hand, and will explain the terminology. This enables the novice to focus on the challenge of figuring out what the employee should do rather than having to look up the terminology and struggle with figuring out the meaning of the policy.
• The training for intermediate learners will provide explanations for some of the terminology, as well as links (or text that the learner can click on to learn more) for explanations of the policy. In this way, the intermediate learner is given some guidance, but not as much as the novice.
• The training for the expert will provide an open-ended question with links to policies and terminology, but does not provide any specific explanations of the policies within the body of the problem. The problem also may include extraneous policies or a more complex situation. This allows the learner to come up with a solution without being distracted by the basic information that is already familiar.
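One hedged way to encode this kind of differentiation (our sketch, not the author's design; the problem text, policy names, and content keys are invented) is to store each problem once and attach scaffolding by skill level, so the delivery system assembles the novice, intermediate, or expert variant from the same source material.

```python
# Hypothetical sketch: assembling one problem at three scaffolding levels.
PROBLEM = {
    "prompt": "An employee reports a suspicious USB drive found in the lobby. What should be done?",
    "glossary": {"removable media policy": "Rules governing use of USB drives and similar devices."},
    "policy_links": ["IT-SEC-014 Removable Media", "IT-SEC-002 Incident Reporting"],
    "distractor_policies": ["HR-101 Dress Code"],  # extra material shown to experts only
}

def assemble(problem, skill_level):
    """Return the training content tailored to 'novice', 'intermediate', or 'expert'."""
    content = {"prompt": problem["prompt"]}
    if skill_level == "novice":
        content["glossary"] = problem["glossary"]          # terms explained inline
    elif skill_level == "intermediate":
        content["policy_links"] = problem["policy_links"]  # guidance available on request
    else:  # expert
        content["policy_links"] = problem["policy_links"] + problem["distractor_policies"]
    return content

# Example: the same problem rendered for a novice and for an expert.
novice_view, expert_view = assemble(PROBLEM, "novice"), assemble(PROBLEM, "expert")
```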

2.3 Adaptations for Different Learning Paths

The adaptive training should address more than just differentiation based on learner familiarity; it must also allow for different learning paths. Returning to the example of the pilot training, the academic content included material on physics and aerodynamics. Since all of the students were undergraduate college students, it is likely that some students were familiar with some of the physics concepts. However, even among the group with familiarity, some of the students would need the training to cover more of the physics content than others. The adaptive training model could provide each student with only the components they needed. To sync this up with the idea of the novice, intermediate, and expert: suppose we have two students that need all of the pilot-specific aerodynamics content. The student who has a strong grasp of physics is provided training designed for an expert, and the student who has no physics background is provided training designed for a novice. A third student has already mastered the pilot-specific aerodynamics content and does not need this part of the training at all.

Figure 1 depicts how this might work, by indicating different learning paths as if they were routes on a map. Each point on the map represents a skill, knowledge, or ability (KSA) that the learner must master in order to succeed in the full course objectives. If a learner has already mastered the skill, knowledge, or ability, then the learner will not undergo training in that area (i.e., will not be directed to that point on the map). (The diagram does not indicate novice, intermediate, or expert as this is another layer that must be addressed at each point on the map, and may change as a learner progresses through the training.) This form of adaptive training allows each individual learner to travel the path which is most expedient and appropriate for that person. One person may need to go to all locations; another learner may only need to visit one or two. Some learners will need to take more time or revisit areas while others will grasp the content or develop the skill relatively quickly.


Fig. 1. Each map shows a unique learning path. Individuals are given only the training they need.
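As one hedged illustration of how these individualized paths could be computed (not prescribed by the author, who deliberately leaves the technical solution open; the KSA names and prerequisite graph are invented), a simple approach is to drop already-mastered KSAs and order the rest by their prerequisites.

```python
# Hypothetical sketch: building an individualized learning path over a KSA map.
from graphlib import TopologicalSorter  # Python 3.9+

# Each KSA lists its prerequisite KSAs (an invented miniature aerodynamics map).
KSA_MAP = {
    "basic_physics": [],
    "lift_and_drag": ["basic_physics"],
    "stall_recovery": ["lift_and_drag"],
    "crosswind_landing": ["lift_and_drag"],
}

def learning_path(ksa_map, mastered):
    """Return the KSAs the learner still needs, in prerequisite order."""
    ordered = TopologicalSorter(ksa_map).static_order()
    return [ksa for ksa in ordered if ksa not in mastered]

# A student with a strong physics background skips the first stop on the map.
print(learning_path(KSA_MAP, mastered={"basic_physics"}))
```

A separate layer would then decide how each remaining KSA is taught (novice, intermediate, or expert treatment), mirroring the additional layer the author notes is not shown in the diagram.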

3 The Adaptive Instructional System

Adaptive instruction allows for individualized learning paths and accounts for the different needs of learners based on their skill level, even when providing training for large groups of people. This adaptive approach requires very specific input from instructional designers (IDs) and a technical system or architecture which delivers the adaptive training. Since the instructional design approach is embedded or integrated into the technical system, the discussion of the different technical capabilities will include specifics on the role of instructional design. The adaptive instructional system (AIS) must include capabilities, methods, and systems which:

• Determine the learner's skill level and knowledge gaps
• Conduct periodic learner analysis to provide adaptive training delivery
• Evaluate the learner experience and effectiveness of the adaptive system
• Provide reporting tools and metrics for students, instructors, and organizational leaders
• Support adaptive training in digital and live training environments

In addition to the components within the AIS, there is a need for development tools that IDs could use to map the training components by learner levels, as well as the variety of learning paths, all while ensuring that the learning objectives and outcomes are addressed in such a way as to ensure all learners receive all the training they need. The following discussion looks at each of the AIS requirements in greater detail, from the perspective of instructional design and learning theory. While the discussion includes some specific examples, it deliberately avoids providing or prescribing the technical solution(s) because it is believed that the software engineers and computer scientists should bring their expertise and insight to the problem and are likely to arrive at a much more elegant solution which addresses all of the instructional requirements. 3.1

3.1 Learner’s Skill Level and Knowledge Gaps

In order to ensure the learner is given the right training, the system must determine the learner’s gaps in knowledge and the learner’s skill level for each specific component of the training. The system should take into account historical information, assessments, and self-reporting about knowledge and skill level.


Learner Certifications and Training History. There are many different individual data points which can act as a starting point for determining a learner’s knowledge base and skill level. This includes professional licenses, certifications, and academic degrees which indicate a base level of knowledge or skill with the particular content. The system should also account for other training the employee has taken, particularly any training that is provided by the company (i.e., the same organization which is providing the adaptive training). It is important to note that certifications and completions of training do not guarantee that the learner has mastered the material, but this information can provide helpful information about both gaps in knowledge and skill levels. The ID should provide lists of courses, certifications, and other indicators of knowledge in the particular area. As the system matures, the AIS may provide analysis showing which types of past training are strong indicators of a learner’s knowledge.

Assessing Skill Level and Perceived Effort. One way to determine whether a learner is a novice, intermediate, or expert is to ask the learner to describe how they would solve a difficult problem. This approach is based on research and theories about the differences in the way that novices and experts approach a problem, and follows a similar process used by Kalyuga and Sweller [9]. Here is how it works. At the beginning of the training, learners are presented with a complex problem and asked to describe or tell what they would do first in solving the problem. The learner’s answer is used to identify where s/he falls on the novice to expert continuum. Obviously the ID will need to develop the complex problems, possible student responses, and the linkages between student responses and skill level. Using the example of the project manager training, the learner could be given the following problem: The quarterly report XJ5 is due next week. Describe the first thing you need to do in order to get this done. Table 1 provides examples of the different responses and the rationale the ID would use to determine the skill levels.

Table 1. Assessing skill level
Question: The quarterly report XJ5 is due next week. Describe the first thing you need to do in order to get this done.
• “Find out what the XJ5 report is and who it gets sent to.” — Novice. The response indicates the learner is unfamiliar with the task, does not know who to ask or where to locate the template.
• “Download the template and begin collecting the data.” — Intermediate to Expert. An expert knows exactly what is required in the report and where to find the data; the response does not indicate whether the learner knows all of the requirements.
• “Copy last quarter’s report and update the information.” — Does not need training. The learner indicates complete mastery of the task.
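To make the placement mechanics concrete, the following Python sketch shows one possible encoding of an ID-authored placement item and a routine that combines the learner's response with a training-history signal. The option labels, skill levels, and routing rule are illustrative assumptions, not the author's design.

# Hypothetical encoding of an ID-authored placement item; the responses,
# skill labels, and routing are illustrative, not the paper's data model.
PLACEMENT_ITEM = {
    "prompt": "The quarterly report XJ5 is due next week. "
              "Describe the first thing you need to do in order to get this done.",
    "options": {
        "ask_what_xj5_is": "novice",          # unfamiliar with the task
        "download_template": "intermediate",  # knows the process, depth unclear
        "copy_last_quarter": "mastered",      # demonstrates complete mastery
    },
}

def place_learner(selected_option, history_signal=None):
    """Combine the placement response with any training-history signal."""
    level = PLACEMENT_ITEM["options"].get(selected_option, "novice")
    # A relevant certification or prior course can only raise the starting level.
    if history_signal == "certified" and level == "novice":
        level = "intermediate"
    return level

print(place_learner("download_template"))             # -> intermediate
print(place_learner("ask_what_xj5_is", "certified"))  # -> intermediate

A real system would, as the text notes, weigh several such signals together and revisit the placement as the learner progresses.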


It is important to assess skill levels at various points in the training, as some learners will progress from novice to expert rather quickly and others may require more practice and guidance, moving from novice to intermediate.

Self-reporting Skill Levels. The training should include questions asking the learner to rate their own skill level in the course material at intervals throughout the training. It is true that a novice might claim to be an expert in the hope of skipping the training. However, the learner would not skip the training. Instead, the learner might (depending upon all of the other measures in this category) begin with a training component that is too difficult. When the learner does not master the training concepts, the system will adapt and provide different training which is better suited to the learner.

3.2 Ongoing Learner Analysis and Adaptive Delivery

In order to ensure learners are receiving the training that best supports their needs, it is important to conduct ongoing assessments of the learner’s knowledge base and skill level to enable additional (or ongoing) adaptations throughout the training. In terms of the knowledge base, the instructional designer needs to identify specific measures of success throughout the training as well as the indicators (data) of differing levels of success. The ID should determine metrics/data and pathways based on a “theory-based prediction model” of the training [10]. According to Reigeluth, IDs should design training based on theories about how people learn and should use data to test those theories, all the while using the data to make decisions about what learners need [11]. The ID designs the training based on a framework of how the learner will progress through the training as well as markers of progress or indicators of needing more support. Further, the ID needs to provide specific instructions about how the course should be altered depending upon what the data analysis reveals [10]. Two important characteristics of the data the ID collects are that (1) it helps to answer a specific question or set of questions about the learner or training, and (2) it provides information that can be acted upon [10].

The system should also include a way to determine the effectiveness of the adaptations: is the training at the right level of difficulty for the learner (is the cognitive load ideal), and does it provide the learner with the content that s/he needs? The author’s research uncovered several different methods of using the difficulty level (the learner’s perceived level of effort) and the learner’s performance on the task to see if learners were given training that met their skill level. Using both types of data is important because one learner may do very well on the task while exerting a high level of cognitive effort, and another learner may do very well while exerting little effort at all. Of course there are several other possibilities as well. We must have a way to distinguish among these learners so that each can be provided with the training that enables him/her to progress to the point of exerting little effort (i.e., develop expertise). One way to determine the effectiveness of the adaptations is to look at whether the level of difficulty is what would be expected based on the learner’s knowledge and skill level.


As an example, following is a description of the data points used during a research project on adaptive approaches for air traffic controller (ATC) training [12]. The ATC training had tasks which were scored according to specific components of the task. For example, if the task included a plane with an initial flight altitude that differed from the desired exit flight altitude, that component of the problem was scored as a 1. If the task involved the possibility of an airplane leaving the airspace if the ATC did not give a command to change its heading, that component was scored as a 3. The complexity of the problem was the sum of the component scores. Upon completing each problem, learners indicated how much mental effort they invested (on a five-point scale from very low to very high). If a learner succeeds in a task but indicates that the problem was difficult, the system could adapt by providing the learner with guided problems or additional problems; this should be determined by the ID based on the content. The system could be designed to require learners to achieve a specific efficiency score—a measure of perceived difficulty and success—so that learners do not move on to tasks based solely on a pass/fail score, but rather only move on when they demonstrate the appropriate skill level [9]. It is also recommended that development of the AIS involve research to determine which algorithms provide the best learning outcomes [9].
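The scoring and efficiency ideas above can be sketched in a few lines of code. The Python example below sums ID-assigned component scores into a problem complexity value and computes a relative-efficiency measure of the form E = (zPerformance − zEffort)/√2 that is commonly used in cognitive-load research. The component scores, performance values, and effort ratings are illustrative, not data from the ATC study.

from statistics import mean, pstdev
from math import sqrt

def problem_complexity(component_scores):
    """Complexity of a problem = sum of its ID-assigned component scores."""
    return sum(component_scores)

def relative_efficiency(performances, efforts):
    """Per-learner efficiency E = (zP - zE)/sqrt(2).

    performances, efforts: parallel lists across learners (or attempts).
    Positive E: good performance for relatively little invested effort.
    """
    def z(xs):
        spread = pstdev(xs) or 1.0  # avoid division by zero when all values match
        center = mean(xs)
        return [(x - center) / spread for x in xs]
    zp, ze = z(performances), z(efforts)
    return [(p - e) / sqrt(2) for p, e in zip(zp, ze)]

# Example: altitude-change component scored 1, exit-heading component scored 3.
print(problem_complexity([1, 3]))                       # -> 4
print(relative_efficiency([0.9, 0.9, 0.5], [2, 5, 3]))  # first learner is most efficient

An AIS could require the efficiency value to cross a threshold, rather than a pass/fail score alone, before advancing the learner, in the spirit of the approach cited in [9].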

3.3 Learner Experience and Training Effectiveness

As described in the previous section, during the training delivery, the AIS should be calculating the efficiency of the training. At the same time, the AIS should be collecting metrics on how well it is working. In other words, the system should be designed to evaluate whether the pre-determined plan or map worked as expected. It is possible to have a situation where the ID indicated a task was very difficult, but the learner experience suggests otherwise. Or the learner outcomes may show that intermediate learners are having to repeat the same content and tasks several times, which may indicate the need for additional content or problems with the design of that material (or some other issue).

3.4 Tools and Reporting Methods

In order for the system to create reports, the system must be able to access data which was saved during the training event as well as collect and analyze data from various sources. As such, it is not likely that the AIS can be built upon a learning management system (LMS), or at least not solely upon an LMS: most LMSs do not track data about how learners interact with the training or “how they go about solving an educational task” [10, p. 158]. Learning record stores (LRSs) provide a way to capture learner experience data and may support the need for data captured throughout the training event [13].

The reporting methods and tools must be designed to provide reports that are specifically tailored to the various users, including at a minimum: students, instructors, IDs, and organizational leadership. Further, the ID should develop materials which help the different users not only to understand the reports, but also to know what they should do next. For example, a report provided to the student should include recommendations for next steps: specific actions the learner can take to expand understanding of the concepts, to build upon the success within the training, to ensure the training mastery is demonstrated on the job, and so on. Reports for organizational leaders should explain the findings as well as make recommendations, such as how the training might be revised in the future.
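As a concrete illustration of the kind of learner-experience record an LRS could hold and such reports could draw on, the following Python sketch builds an xAPI-style statement. The actor, activity identifiers, and the mental-effort extension key are hypothetical examples, not part of any system cited in this paper.

import json

# An illustrative xAPI-style statement an LRS could store for one training task.
# The actor, activity IDs, and extension key are hypothetical examples.
statement = {
    "actor": {"mbox": "mailto:learner042@example.org", "name": "Learner 042"},
    "verb": {"id": "http://adlnet.gov/expapi/verbs/completed",
             "display": {"en-US": "completed"}},
    "object": {"id": "https://example.org/activities/report-xj5-task",
               "definition": {"name": {"en-US": "Quarterly report XJ5 task"}}},
    "result": {
        "success": True,
        "score": {"scaled": 0.85},
        # Self-reported mental effort on a 1-5 scale, stored as an extension.
        "extensions": {"https://example.org/xapi/mental-effort": 4},
    },
}

print(json.dumps(statement, indent=2))

Because each statement carries both the outcome and how the learner got there, reports for students, instructors, and leaders can be generated from the same underlying records.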

3.5 Supports Many Learning Platforms

The AIS should be developed to support e-learning, instructor-led training, blended training, etc. As shown in Fig. 1, learners may need different training components, and these are not limited to computer-based training. Not only must the ID develop adaptive training components for the different delivery methods, but the AIS must also provide a way for the instructor to identify which training to use (e.g., when the instructor should use the training designed for novices and when the instructor should use the training designed for experts).

4 Conclusion

As employees have varying levels of experience and knowledge even when they perform the same job, there is a need to provide training which meets the unique needs of each person so that organizations are not wasting resources by providing unneeded training. Further, the expertise-reversal effect shows that it is not ideal to provide experts with training designed for novices: the experts will not do as well as they would if the training were designed for them. An adaptive training approach should include identifying the learner’s knowledge gaps and skill level. Adaptations should be made periodically throughout the training in order to support the learner’s development. The AIS should evaluate the effectiveness of the adaptations in order to support refining the training as well as informing the design of future training. The AIS needs to include reporting tools, and the ID should ensure that the reports not only explain what is in the report but also provide guidance and suggestions for next steps based on the findings. Finally, the adaptive training system should be designed to support all training environments, from e-learning to the instructor-led classroom. This paper provided the theory and concepts which should inform the AIS; the next step is for IDs and software experts to collaborate to build a system, and then refine the design by conducting research using actual training and employees.

References

1. Swanson, R.A.: Analysis for Improving Performance. Berrett-Koehler Publishers Inc., San Francisco (2007)
2. Hannum, W.: Training myths: beliefs that limit the efficiency and effectiveness of training solutions, part 1. Perform. Improv. 48(2), 26–30 (2009)
3. Hannum, W.: Training myths: false beliefs that limit the efficiency and effectiveness of training solutions, part 2. Perform. Improv. 48(6), 25–29 (2009)
4. Kalyuga, S.: For whom explanatory learning may not work: implications of the expertise reversal effect in cognitive load theory. Technol. Instr. Cognit. Learn. 9, 63–80 (2012)
5. Nguyen, F.: What you already know does matter: expertise and electronic performance support systems. Perform. Improv. 45(4), 9–12 (2006)
6. Jiang, D., Kalyuga, S., Sweller, J.: The curious case of improving foreign language listening skills by reading rather than listening: an expertise reversal effect. Educ. Psychol. Rev. 30(3), 1139–1165 (2018)
7. Chen, O., Kalyuga, S., Sweller, J.: The expertise reversal effect is a variant of the more general element interactivity effect. Educ. Psychol. Rev. 29, 393–405 (2017)
8. Culatta, R.: InstructionalDesign.org. https://www.instructionaldesign.org/theories/constructivist/. Accessed 11 Mar 2019
9. Kalyuga, S., Sweller, J.: Rapid dynamic assessment of expertise to improve the efficiency of adaptive e-learning. Educ. Technol. Res. Dev. 53(3), 83–93 (2005)
10. Davies, R., Nyland, R., Bodily, R., Chapman, J., Jones, B., Young, J.: Designing technology-enabled instruction to utilize learning analytics. TechTrends 61, 155–161 (2017)
11. Reigeluth, C.: What is instructional-design theory and how is it changing? In: Instructional-Design Theories and Models: A New Paradigm of Instructional Theory, pp. 5–30. Lawrence Erlbaum Associates, Mahwah (1999)
12. Salden, R., Paas, F., Broers, N., Van Merriënboer, J.: Mental effort and performance as determinants for the dynamic selection of learning tasks in air traffic control training. Instr. Sci. 32, 153–172 (2004)
13. xAPI Homepage. https://xapi.com. Accessed 11 Mar 2019

Evolving Training Scenarios with Measurable Variance in Learning Effects

Brandt Dargue1, Jeremiah T. Folsom-Kovarik2, and John Sanders3

1 The Boeing Company, St. Louis, MO 63166, USA [email protected]
2 Soar Technology, Inc., Orlando, FL, USA [email protected]
3 US Army, Ft. Knox, KY, USA [email protected]

Abstract. One major cost driver in simulation-based training (SBT) and Intelligent Tutoring System (ITS) development is authoring scenario content. Effort is compounded when creating a catalog of scenarios to enable optimal scenario selection for the individual learner. Automated Scenario Generation (ASG) methods have been successful at creating variants of scenarios using procedural generation, evolution, or event templates with variable parameters. These methods typically describe boundaries within which content may vary, but do not describe how the variation will affect learning. The authors developed an evolutionary algorithm approach to generate variations in training scenario content that can integrate with the Generalized Intelligent Framework for Tutoring (GIFT) being developed by the Army Research Lab (ARL). One of the primary goals of the approach is to greatly increase not just surface variations that do not affect learning, but variations on scenario content and instructional feedback that make the scenarios measurably different as described by multidimensional learning effect metrics. The approach leverages a structured cognitive task analysis (CTA) which outputs a set of scenarios in a schema that directly drives the ITS. A small change to the CTA identifies dimensions and constraints of variation and the effect on learning from the variations. Both manual and machine learning approaches are being explored. The key contributions of our research are (a) developing domain-general computational measures of effect on learning for creative variation, (b) making evolution effective for many teaching and tutoring domains by evolving textual content in addition to event sequences, and (c) leveraging human expert input and usage data to influence and feed back into evolution. This paper summarizes the need for ASG, describes the novel approach, and shares the key contributions of our research.

Keywords: Adaptive instructional system · Automated Scenario Generation · Scenario-based learning

J. Sanders—US Army (ret), Ft. Knox, KY, USA. © Springer Nature Switzerland AG 2019 R. A. Sottilare and J. Schwarz (Eds.): HCII 2019, LNCS 11597, pp. 40–51, 2019. https://doi.org/10.1007/978-3-030-22341-0_4


1 Introduction

Intelligent Tutoring Systems (ITSs), scenario-based training, and simulation-based training can be very effective [e.g. (NTSA 2018; Kulik and Fletcher 2016; VanLehn 2011)]. They are also time-consuming to develop, test, and modify. The extra time and cost results in reduced use of these training methods, relatively few choices being created, and existing scenarios stagnating. In nature, systems improve and adapt to changes, such as changes in their environment, through methods such as evolution. Likewise, a training system should evolve to keep it up to date, to adapt to new situations, and to enable the system to provide a better learning experience. In nature, the variant that survives evolved in a way that enables it to overcome the problem. For training, where the goal is effective learning, it should be the instructional value of the variant scenario that determines its worth. We are prototyping two methods for evolving: one by design, and one by random mutations. The scenarios generated by both methods are assessed by their instructional value, so determining the instructional value is a primary focus of this paper. The ability to evolve scenarios provides a method for what is known as Automated Scenario Generation (ASG).

Busch et al. (1995) investigated and validated the ability to auto-generate instructional tutoring for maintenance training based on summarizing the state of the problem solution in different ways: suggesting a strategy, suggesting a system or component that still needs to be tested, or suggesting a specific test. Auto-generating these multiple levels of mentoring enables scenarios that are automatically generated to be used instructionally without any further human effort.

A second method of evolving new scenarios is by creating novel sequences of scenario segments or a specific sequencing of events within a scenario (Zook et al. 2012). Dividing the scenarios into segments that are natural for the domain makes the segments effective learning experiences when used standalone, when assembled into complete stories, or when assembled with short transitional segments that keep the learner engaged. Using natural scenario segments also provides an intuitive framework to lower the effort of manual authoring and provides entry points for more focused practice in the specific segment. This method was applied when developing scenarios for the SUAS COMPETE (Small Unmanned Aviation Systems Company Employment Training Exercises) adaptive training prototype (Durlach and Dargue 2010), which was selected for use in this current project. Selecting individual scenario phases is a typical method used for skill practice. Both commercial pilot and tactical pilot students practice the more difficult landing and take-off phases of flight more than cruise phases. One of the great benefits of flight simulators is that students can repeatedly practice landings without having to taxi, take off, and perform approach phases. Musicians greatly benefit from repeating specific sections of a musical score without having to repeat the entire score. Similarly, athletes such as baseball players hone their skills by practicing specific portions of their sport, such as batting.

These methods of ASG are effective at providing unique, tailored learning experiences for individuals or teams. However, they require substantial authoring effort to create the individual building blocks that are assembled into the unique scenarios.
This project set out to alleviate the authoring burden by automatically generating the individual segments or the entire scenario. We also set out to explore and
validate domain-general methods to create novel scenario variants in less predictable ways. These methods are still instructionally sound, and better prepare learners to be more adaptive, resourceful, and innovative.

1.1 The Theory of (Scenario) Evolution

Training scenarios are authored to intentionally provide a learning experience for specific learning objectives. If not done correctly, a variant could very easily reduce the value of that learning experience. A primary objective of this project is to increase, or at least maintain, the instructional value of the training exercise. To avoid superficial or counterproductive variants, ASG systems need to understand the domain. Novelty Search helps alleviate the domain dependency by rewarding how unique a variant is rather than rewarding progress toward a goal (Stanley and Lehman 2015b). The solution is still measured against the goal, but the distance from the goal is not the primary measure of value/worth. If the strain/branch/variant is unique and still within bounds, then it is worth keeping and worth exploring (evolving) further. Novelty Search then needs two litmus tests for each variant: a determination of whether it is within bounds and a measure of novelty or uniqueness from other variants. Both of these measures are domain-specific.

The visual demonstration (Stanley and Lehman 2015a) on the Novelty Search Users Page is in the domain of solving a maze. The goal is to reach the end of the maze, and the demonstration shows how the typical approach of focusing on proximity to the goal can lead to failure. For Novelty Search, the demonstration uses the maze walls as bounds and distance from previous mutations rather than distance to goal. In a domain where, for example, a student needs to learn how to create a smoke screen to hide from an enemy, the measures of novelty might be the position of the student, the position of the enemy, and the wind direction. Boundaries for this exercise might be that the wind direction can only be the eight primary compass directions, along with a minimum and maximum distance between student and enemy. For this project, we want a measure of novelty that includes instructional value. An instructionally meaningful measure of novelty for the smoke exercise should be something like the angle between the wind vector and the line of sight to the enemy, and the distance to the goal along that line of sight, as these are the factors that the student must use to determine the correct placement of smoke.
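As a rough illustration of the two litmus tests described above, the following Python sketch checks a smoke-screen variant against domain bounds and scores its novelty as the mean distance to its k nearest neighbors in an archive of accepted variants, a standard Novelty Search measure. The feature choices follow the smoke example; the numeric ranges, threshold, and function names are assumptions for illustration.

from math import dist

ARCHIVE = []  # feature vectors of previously accepted variants
K = 3

def in_bounds(variant):
    """Domain bounds from the smoke example (numeric limits are assumptions)."""
    wind_ok = variant["wind_dir"] in range(0, 360, 45)   # eight compass points
    range_ok = 50 <= variant["enemy_distance_m"] <= 400  # min/max separation
    return wind_ok and range_ok

def features(variant):
    """Instructionally meaningful features: wind-to-line-of-sight angle, distance to goal."""
    return (variant["wind_los_angle_deg"], variant["goal_distance_m"])

def novelty(variant):
    """Mean distance to the k nearest archived variants (higher means more novel)."""
    if not ARCHIVE:
        return float("inf")
    dists = sorted(dist(features(variant), f) for f in ARCHIVE)
    return sum(dists[:K]) / min(K, len(dists))

def consider(variant, threshold=20.0):
    """Keep a variant only if it is legal and sufficiently different from what we have."""
    if in_bounds(variant) and novelty(variant) >= threshold:
        ARCHIVE.append(features(variant))
        return True
    return False

The key design choice, as in the maze demonstration, is that the archive rewards distance from previous variants rather than distance to the goal, while the bounds check keeps the exploration within the domain.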

1.2 The Goal of This Paper

Mager (1962) emphasized that a learning objective needs to be written as a clear description of competent performance so that it is clear how to generate instruction for, and assessments of, those learning objectives. For the smoke lesson described above, a learning objective written in this format might be “Given a terrain map showing position of the learner’s unit, positions of the enemy, position of the goal, and weather information including wind direction and speed, the learner will be able to select the target location for smoke grenades and number of grenades within a given tolerance to sufficiently keep the enemy from seeing the learner’s approach toward the goal.” With this information, a developer can determine how to evolve the scenarios to provide practice and assessment of that objective. The goal of this project is to enable software
to automatically generate a set of scenarios. In addition to the mechanism/algorithm that performs the automation, we need a method to describe to the ASG system how to automate the generation and how to assess the learner’s attempt. We also need to make this domain-neutral. This paper focuses on this aspect of defining the learning objectives in such a way that the scenario meets the learning objectives (the instructional value can be determined). For more information about how the authors are implementing Novelty Search for this project, see Folsom-Kovarik and Brawner (2018).
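To make such an objective machine-checkable, the tolerance can be encoded directly against an expert solution. The Python sketch below shows one possible way to do this for the smoke objective; the tolerance values, coordinate scheme, and the simple distance check are assumptions for illustration, not the project's actual assessment logic.

from math import dist

def assess_smoke_plan(expert, attempt, pos_tolerance_m=25.0, grenade_tolerance=1):
    """Score a learner's smoke plan against the expert solution within tolerances.

    expert / attempt: dicts with 'target' (x, y) in meters and 'grenades' (int).
    Returns True if the placement and grenade count fall within tolerance.
    """
    placement_ok = dist(expert["target"], attempt["target"]) <= pos_tolerance_m
    count_ok = abs(expert["grenades"] - attempt["grenades"]) <= grenade_tolerance
    return placement_ok and count_ok

expert_plan = {"target": (120.0, 340.0), "grenades": 4}
print(assess_smoke_plan(expert_plan, {"target": (130.0, 355.0), "grenades": 5}))  # True
print(assess_smoke_plan(expert_plan, {"target": (300.0, 340.0), "grenades": 4}))  # False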

2 Training Scenario Variability

The US Army scaffolds a Soldier’s training using a methodology known as crawl-walk-run (Headquarters, United States (U.S.) Army Combined Arms Center 2016), which conducts training events sequenced in progressive levels of difficulty. Intelligent tutoring systems such as the US Army Research Lab’s Generalized Intelligent Framework for Tutoring (GIFT) (Sottilare et al. 2012) feature the ability to scaffold the learner by sequencing content in such a progressively more difficult order. These systems can automatically select a training exercise (scenario) from a pool of exercises based on the learning complexity of the exercise. A single dimension of complexity is often sufficient within a specific learning objective. For example, the difficulty factor for adding two positive integers might be based on whether or not the sum of any of the digits is greater than 9, so that the learner has to perform the concept of carrying. However, within the larger learning objective of adding two numbers, there are multiple additional dimensions of complexity, such as whether or not there are negative numbers and whether there are fractions. To properly challenge the learner, the complexity should increase along the appropriate dimension.

Simulation scenarios are much more complex than the simple math example given. For example, the scenarios used for this project involved 49 enabling learning objectives that address nine higher-level concepts. Additionally, there are three distinct phases to the scenarios, in which the learner must plan for the mission, prepare for the mission, and execute the mission. Each of the SUAS COMPETE scenarios uses over 300 multi-dimensional learner performance measures at approximately 45 discrete decision points to precisely determine the learner’s mastery of the learning objectives. Evolving the scenario within a specific learning objective dimension will enable the system to generate variants that provide learning experiences and practice tailored to a learner’s particular needs. However, these learning objectives are domain-specific, so we need more generic dimensions of variability.

2.1 Scenario Complexity

Dunne et al. (2015) developed a tool to measure simulation scenario complexity based on three characteristics of the scenario: Task Complexity (TC), Task Framework (TF), and Cognitive Context Moderators (CCM). Each characteristic is based on factors about a given scenario, such as the number of cues, actions, subtasks across actions, interdependent subtasks, possible paths, criteria to satisfy, conflicting paths, and distractions (see Table 1, Scenario Characteristics).

Table 1. Scenario characteristics

• Task Complexity
  – Component Complexity
    • Number of Tasks/Actions required
    • Number of Information Cues that must be monitored and assessed
  – Coordinative Complexity
    • Number of Sub-Tasks
    • Number of Inter-Dependent Tasks
• Task Framework (How well defined is the domain)
  – Task Paths – Presence of multiple potential ways to accomplish outcome(s)
  – Task Criteria – Number of desired outcomes
  – Unknown/Conflicting Paths – Degree of uncertainty
• Cognitive Context Moderators
  – Distraction Factor

Dunne et al. summed these dimensions into a single complexity factor for the scenario and validated it with domain experts. The single-value attribute of complexity allows the instructor to select the scenario with the proper level of difficulty for individual learners. It also allows analysts to verify that they have authored enough scenarios with the proper coverage of complexity levels to enable the US Army’s crawl-walk-run methodology.

Providing the right level of challenge enables the learner to “get into the flow of optimal learning” described by the psychologist Csikszentmihályi (1990). His research discovered three conditions required for getting into the state of optimal performance and learning. First, the task must be a challenging activity with a clear set of goals and a measure of progress toward those goals. Second, there must be immediate feedback to the person about the progress in those measures. Third, the person must perceive that the level of challenge of the task is aligned to one’s own perceived abilities. In other words, the learner must believe that the task is difficult and requires effort, the learner must be able to access and interpret the cues that are required to measure progress toward the goal, and the learner must believe that achieving the goal is possible. The ITS used for this project—like the Sherlock ITS it was based on (Lesgold et al. 1988; Lesgold and Nahemow 2013)—prepares the learner for the task through instruction, then immerses the learner in the task-based scenario. Sherlock further encouraged the learner that she/he had the skills, and that the system would provide help if needed.

Keeping the measure of complexity for each characteristic, or even each base variable, separate will help tailor the experience more precisely for the individual learner. For example, a particular learner may not have any problem with the number of cues or distractions, but needs more practice in scenarios that have a high number of interdependent subtasks. Individual measures of each dimension for the scenarios and for the learner’s mastery will enable the scenario selection mechanism to select the appropriate scenario for the learner. It may not be practical to directly measure the learner’s level of proficiency in each of these complexity dimensions independently in the same way as measuring proficiency in each learning objective.
Rather than measuring directly, analytics of the learner’s performance in the scenarios can provide insight into the learner’s competency in the individual factors.

A training scenario is a series of situations, events, decisions, actions, and tasks. Some dimensions of complexity are only valid for specific phases or tasks within the scenario. For example, factors that vary the complexity of scenarios for piloting an airplane include length and direction of runway, cross-wind, and wind shear. However, those factors only influence the complexity of the tasks performed during takeoff and landing, whereas route changes only add complexity during cruise or approach. As shown by experiments by Biddle et al. (2006), allowing distinct phases to be selected independently from a pool of variants will allow for more focused practice for the individual needs of each learner. The smoke example above is a distinct phase of an overall scenario about breaching a minefield. For that scenario, new elements might be wind speed, multiple enemy positions, uncertainty of enemy position, and the position of the goal to which the student/unit must move. Therefore, the complexity factors and other instructional value measures should be individually computed and associated with each distinct phase of the scenarios.

Any LMS that is compliant with SCORM 2004 or AICC can select scenarios based on measured gaps in a learner’s expertise on explicit learning objectives (Perrin et al. 2004; Biddle et al. 2006). However, the scenario selection algorithms need to be able to determine which scenarios focus on which learning objectives. Typically, scenarios are intentionally authored for explicit learning objectives (LOs). For ASG, either the system similarly needs to generate scenarios explicitly for the specific LOs, or there needs to be a method to determine which LOs a student will encounter in each of the machine-generated scenarios.

3 The Approach

To address the needs of generating scenarios for specific instructional value, or of generating scenarios using Novelty Search and then calculating their instructional value, we are leveraging the SUAS COMPETE scenario XML files. The structure of those scenarios comprises optional paths through a series of situations, events, decisions, actions, and tasks. Therefore, the scenario complexity can be autonomously determined by software. Additionally, for each action along the paths in the scenario XML there are individual measures for each of the 49 LOs. Although the LOs are domain-specific, the format of the LO measures in the XML is domain-independent.

3.1 Dimensions of Evolution and Instructional Value

We defined the specific ways in which a scenario can evolve as dimensions. We looked at dimensions specific to the scenarios we considered as well as dimensions that were envisioned to be generic and thus applicable across domains. As one method to validate a dimension as generic, we considered domains such as corporate leadership, factory workers, engineering, and compliance training.


The domain-generic dimensions of complexity were validated with the subject matter expert (SME) who authored the original scenarios. LTC John Sanders (US Army, retired), whose specialty areas include combined arms, authoring tactical operations doctrine, and advanced military instruction, helped define the roles of SUAS. He agreed that those factors measure the complexity of learning and, in general, also measure the complexity of the tasks within the scenarios. The relevant effect of each measure varies based on the phase of the scenario. Additionally, the effect is dependent on the specific task or decision within the phase.

LTC Sanders, the SUAS COMPETE SME, previously worked with the author to define three dimensions of complexity for military scenarios (see Fig. 1) (Sanders and Dargue 2012). These factors include the level of threat, the complexity of the task or system used in the task, and environmental factors. The level-of-threat dimension could be expressed in a domain-generic way as level of risk or urgency. The eight dimensions defined by Dunne et al. discussed earlier are more specific and therefore can be classified in a hierarchy under those defined by Sanders and Dargue (see Table 2).

Fig. 1. Progressive training matrix in three dimensions from (Sanders and Dargue 2012)

Table 2. Three dimensions of scenario complexity
• Level of risk or urgency
  – Number of criteria to satisfy
  – Number of conflicting paths
• Complexity of the task or system used in the task
  – Number of actions
  – Number of subtasks across actions
  – Number of interdependent subtasks
  – Number of possible paths
• Environmental factors
  – Number of distractions
  – Number of cues
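As a rough illustration of how software might roll these factor counts up into per-dimension scores for a single scenario phase, the Python sketch below follows the Table 2 hierarchy. The weights and example counts are assumptions for illustration; they are not part of Dunne et al.'s validated tool or the SUAS COMPETE scenarios.

# Weights per Table 2 dimension; the numeric values are illustrative assumptions.
WEIGHTS = {
    "risk_urgency":    {"criteria": 1.0, "conflicting_paths": 1.5},
    "task_complexity": {"actions": 1.0, "subtasks": 1.0,
                        "interdependent_subtasks": 1.5, "possible_paths": 1.0},
    "environment":     {"distractions": 0.5, "cues": 0.5},
}

def phase_complexity(counts):
    """Return per-dimension scores and their sum for one scenario phase."""
    scores = {
        dim: sum(w * counts.get(factor, 0) for factor, w in factors.items())
        for dim, factors in WEIGHTS.items()
    }
    scores["total"] = sum(scores.values())
    return scores

# Example factor counts for an "execute the mission" phase (illustrative).
execute_phase = {"criteria": 3, "conflicting_paths": 2, "actions": 6,
                 "subtasks": 9, "interdependent_subtasks": 4,
                 "possible_paths": 5, "distractions": 2, "cues": 12}
print(phase_complexity(execute_phase))

Keeping the per-dimension scores rather than only the total supports the tailoring discussed above, for example routing a learner toward phases with many interdependent subtasks while leaving the cue count unchanged.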

3.2 CTA and Authoring

The scenarios we are using were authored using a Cognitive Task Analysis (CTA) originally developed to capture knowledge of maintenance technicians for an ITS: the Precursor, Action, Results, Interpretation (PARI) method (Hall et al. 1995; Means and Gott 1988). The PARI CTA is typically a structured interview process using two SMEs. Using a spreadsheet, we enabled a single SME to perform PARI (Dargue and Biddle 2016). For each task, decision, or action to be performed in the scenario, the spreadsheet codifies the SME’s mental model and four possible actions that can be performed. The SME ranks each of the four possible actions and defines changes to the assessment of the learner’s level of expertise for each learning objective. For each decision made in the scenario, the learner model is dynamically updated using these assessments. A unique, inherent capability is that for any given decision point, different learning objectives are scored based on which decision is made at that point. For example, if the situation requires the learner to select the proper tool to tighten a hex bolt, and the learner selects the proper wrench, she will get credit for understanding when to use a wrench. If she had selected a hammer, her measure of expertise in both wrenches and hammers will be reduced.

Another feature of the PARI CTA is that the decisions are made in a specific context of other decisions. Each phase of the scenario comprises a series of tasks, actions, or decisions. The expert’s path is defined by the expert’s decisions plus the outcomes or results of those decisions. To fully define the scenario, the tool is used to codify the results of each possible decision, how the mental model changes, and what new decisions can be made. This can be done by the same SME or multiple SMEs. In this manner, multiple paths with continuity in branching are defined for the scenario. One simple method to generate a scenario variant is to reduce the complexity of the scenario by removing optional or conflicting paths (Fig. 2).

Fig. 2. PARI CTA process
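The decision-point records produced by this process lend themselves to a simple data structure. The Python sketch below shows one possible encoding of a PARI-style decision point and the per-learning-objective updates it drives; the field names, the wrench/hammer actions, and the numeric deltas are illustrative assumptions, not the SUAS COMPETE XML schema.

# Illustrative decision-point record in the spirit of the PARI output;
# field names and LO deltas are assumptions, not the SUAS COMPETE schema.
decision_point = {
    "precursor": "A hex bolt must be tightened before the panel is closed.",
    "actions": {
        "crescent_wrench": {"rank": 1, "lo_deltas": {"wrench_use": +0.2}},
        "socket_wrench":   {"rank": 2, "lo_deltas": {"wrench_use": +0.1}},
        "pliers":          {"rank": 3, "lo_deltas": {"wrench_use": -0.1}},
        "hammer":          {"rank": 4, "lo_deltas": {"wrench_use": -0.2,
                                                     "hammer_use": -0.2}},
    },
}

def apply_decision(learner_model, decision_point, chosen_action):
    """Update per-LO mastery estimates from the deltas tied to the chosen action."""
    for lo, delta in decision_point["actions"][chosen_action]["lo_deltas"].items():
        updated = learner_model.get(lo, 0.5) + delta
        learner_model[lo] = max(0.0, min(1.0, updated))  # keep mastery in [0, 1]
    return learner_model

model = {"wrench_use": 0.5, "hammer_use": 0.5}
print(apply_decision(model, decision_point, "hammer"))
# -> {'wrench_use': 0.3, 'hammer_use': 0.3}

Because different actions carry deltas for different LOs, a single decision point can, as described above, score several learning objectives at once depending on which choice the learner makes.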


A second method to generate a scenario variant for a specific learner is to modify the scenario to change the situations so that different decisions/actions should be made. A search through the decision points can determine which path will present the decision points that the learner needs to experience to target specific gaps in expertise. If that path is not the optimal path or a path that the learner is likely to take, the software will need to know how to vary the situation at one or more decision points. The problem statement and the precursor for each decision point captured by PARI contain the situation and the mental model of the expert. These are also the cues that are counted to determine scenario complexity. Currently, this information is free-form text in the XML describing the expert’s thoughts for the precursor for the decision and the interpretation of the results of the decision. While this is effective for transferring the expert mental model to the learner, it is not easily understood by the software, nor is it easily modified by the software. A relatively simple addition to the scenario authoring tools will capture the cues used by the expert for each decision in a machine-understandable and variable way. Often the factor is an indication of cost or risk for that choice and is therefore weighed by the expert in making the decision. The addition to PARI requesting the author to define the few factors of complexity that are not currently computable will make that critical information explicit for the software. With that information, the software can determine what level of challenge each decision point presents and how that challenge level can be changed. With the cues explicitly defined, the software can determine which factors need to be changed to make a less than optimal, or even unacceptable, path the optimal choice. The software will also be able to calculate the new complexity measure and make changes to ensure the right level of challenge is given to the learner.

Returning to the wrench/hammer decision, if the learner has already demonstrated mastery of selecting a wrench for hex bolts, we might want to change the number-of-conflicting-paths complexity factor by adding a socket wrench, pliers, and a pipe wrench in addition to the original crescent wrench. We might change the number-of-actions complexity factor by stating there is a nut on the other side and that the bolt has to be torqued to a standard the learner has to calculate. If the learner has not demonstrated proficiency in hammers and nails, the software could change the fastener to a nail.
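A minimal sketch of such a search is shown below: candidate paths are ranked by the learner's remaining mastery gap on the LOs their decision points exercise. The path names, LO labels, and mastery values are illustrative assumptions, not part of the project's scenario data.

# Candidate paths mapped to the LOs their decision points exercise (illustrative).
PATH_LOS = {
    "path_A": ["wrench_use", "torque_calculation"],
    "path_B": ["hammer_use", "fastener_selection"],
    "path_C": ["wrench_use", "hammer_use", "fastener_selection"],
}

def best_path(learner_model, path_los=PATH_LOS):
    """Pick the path whose decision points target the largest total mastery gap."""
    def gap_score(los):
        return sum(1.0 - learner_model.get(lo, 0.0) for lo in los)
    return max(path_los, key=lambda p: gap_score(path_los[p]))

model = {"wrench_use": 0.9, "hammer_use": 0.3, "fastener_selection": 0.4,
         "torque_calculation": 0.6}
print(best_path(model))  # -> path_C (largest combined gap)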

3.3 Scenario Continuity

In many cases, such as our fastener example, the path from making the correct choice does not change: the path only depends on whether the learner selected the correct decision. For the fastener, when the correct tool is used, regardless of whether it was a nail or a bolt, the fastener is properly fastened. If the user selects the wrong tool, the fastener will work loose. In other cases, decision points further down the scenario path might need to reflect the change.

We also run the danger of making a variant that is nonsensical or impossible, for example if the fastener we change to a nail is holding the wheel on a car. To avoid those cases, bounds need to be defined. A straightforward solution is to have the author specify the bounds or possible variants.
In many cases, this is a natural extension of the addition that defines the cues used by the expert to make the decision. For domains such as military tactical decision making, discontinuity is inherent. Uncertainty caused by the “fog of war” and unexpected changes from adaptive threats and deceptive techniques used by the enemy make discontinuity in the scenario more realistic and better prepare the learner for warfare. Variants made by adjusting the number of cues, paths, subtasks, or outcomes do not present a risk of discontinuity or nonsense. The paths created by the original authors and encoded in the original scenario provide scenario continuity for all the possible paths, including the conflicting paths, as the scenario evolves or unfolds based on the decisions and actions performed by students in the simulation.

4 Conclusions and Recommendations for Future Research

This project consists of researching, prototyping, and evaluating methods to autonomously evolve variants of simulation scenarios to be used in adaptive training systems. The primary purpose for generating a variety of variants is to provide a library of scenarios so that the optimal learning experiences can be selected for students. There are basically two approaches, with slight alternatives, being prototyped. The first approach is to use information within the scenario to purposely create variants. The second approach is to generate a large set of variants that are then analyzed for instructional value. For the first approach, the existing scenario format in XML includes enough information to intelligently make a small set of variants for specific instructional outcomes. By adding information to the scenarios, such as the cues an expert uses to make decisions, the software can generate a larger set of tailored scenario experiences that are focused on a greater number of different instructional outcomes. Both approaches require research and validation of methods to autonomously assess the instructional value of scenario variants. The first uses the variables of the assessment as guides in making the variants. The second approach uses the assessment as a “litmus test” of the instructional worthiness of scenario variants. The variables used by Dunne et al. to determine scenario complexity provide what may be an ideal domain-independent method to determine the instructional value of a simulation scenario variant. Those variables also provide information that may be used to make specific variants to address an individual learner’s needs.

Acknowledgements. The research reported in this document/presentation was performed in connection with contract number W911NF-18-C-0005 with the U.S. Army Contracting Command - Aberdeen Proving Ground (ACC-APG). The views and conclusions contained in this document/presentation are those of the authors and should not be interpreted as presenting the official policies or position, either expressed or implied, of ACC-APG, U.S. Army Research Laboratory or the U.S. Government unless so designated by other authorized documents. Citation of manufacturer’s or trade names does not constitute an official endorsement or approval of the use thereof. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.


References

USARMYCOMBATIVES.COM. 2-10 Crawl, Walk and Run. U.S. Army FM 3-25.150 Combatives (2018). http://www.usarmycombatives.com/2-10-crawl-walk-and-run/
Biddle, E., Perrin, B., Dargue, B., Lunsford, J., Pike, W., Marvin, D.: Performance-based advancement using SCORM 2004. In: Proceedings of the 2006 Interservice/Industry Simulation, Training, & Education Conference (I/ITSEC), Orlando (2006)
Busch, D., Dargue, B., Perrin, B.: ASTUTE: an architecture for intelligent tutor development. In: Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), Albuquerque, New Mexico, pp. 672–681 (1995)
Csíkszentmihályi, M.: Flow: The Psychology of Optimal Experience. Harper and Row, New York (1990)
Dargue, B., Biddle, E.: Chapter 13 – Mining expertise: learning new tricks from an old dog. In: Sottilare, R., Graesser, A., Hu, X., Olney, A., Nye, B., Sinatra, A. (eds.) Design Recommendations for Intelligent Tutoring Systems: Volume 4 - Domain Modeling, pp. 147–157. U.S. Army Research Laboratory, Orlando (2016). https://gifttutoring.org/attachments/download/1736/Design%20Recommendations%20for%20ITS_Volume%204%20-%20Domain%20Modeling%20Book_web%20version_final.pdf
Dunne, R., Sivo, S.A., Jones, N.: Validating scenario-based training sequencing: the scenario complexity tool. In: Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), Orlando, pp. 2886–2898 (2015)
Durlach, P., Dargue, B.: An adaptive training prototype for small unmanned aerial system employment. In: The 23rd Florida Artificial Intelligence Research Society Conference (FLAIRS-23), Daytona Beach, FL (2010)
Folsom-Kovarik, J.T., Brawner, K.: Automating variation in training content for domain-general pedagogical tailoring. In: Proceedings of the Sixth Annual GIFT Users Symposium. U.S. Army Research Laboratory, Orlando (2018). https://gifttutoring.org/attachments/download/2700/17_GIFTSym6_Authoring_paper_18.pdf
Hall, E.P., Gott, S.P., Pokorny, R.A.: The PARI methodology. Technical report 1995-0108. Armstrong Laboratory, Human Resources Directorate, Brooks AFB, TX (1995)
Headquarters, United States (U.S.) Army Combined Arms Center: FM 7-0, Train to Win in a Complex World. Fort Leavenworth, KS (2016). https://rdl.train.army.mil/catalog/search?current=true&search_terms=%23TrainingManual. Accessed 22 June 2018
Kulik, J.A., Fletcher, J.D.: Effectiveness of intelligent tutoring systems. Rev. Educ. Res. 86(1), 42–78 (2016). https://doi.org/10.3102/0034654315581420
Lesgold, A., Nahemow, M.: Tools to assist learning by doing: achieving and assessing efficient technology for learning. In: Klahr, D., Carver, S.M. (eds.) Cognition and Instruction. Psychology Press (2013)
Lesgold, A., Lajoie, S., Bunzo, M., Eggan, G.: SHERLOCK: A Coached Practice Environment for an Electronics Troubleshooting Job. University of Pittsburgh, Learning Research and Development Center, Pittsburgh (1988)
Mager, R.F.: Preparing Instructional Objectives (Revised Second (1984) ed.). David S. Lake Publishers, Belmont (1962)
Means, B., Gott, S.P.: Cognitive task analysis as a basis for tutor development: articulating abstract knowledge representations. In: Psotka, J., Massey, D., Mutter, S. (eds.) Intelligent Tutoring Systems: Lessons Learned. Erlbaum, Hillsdale (1988)
NTSA: Why Use Simulation? Return on Investment. NTSA - National Training and Simulation Association (2018). http://www.trainingsystems.org/publications/simulation/roi_effect.cfm. Accessed 21 June 2018


Perrin, B.M., Banks, F., Dargue, B.W.: Student vs. software pacing of instruction: an empirical comparison of effectiveness. In: The Proceedings of the 2004 Interservice/Industry Training, Simulation, and Education Conference, Orlando, FL (2004)
Sanders, J., Dargue, B.: A design whose time has finally arrived: use of adaptive training technology for staff training. In: Presentation at the 23rd International Training Equipment Conference (ITEC), London (2012)
Sottilare, R., Brawner, K., Goldberg, B., Holden, H.: The Generalized Intelligent Framework for Tutoring (GIFT). US Army Research Laboratory Human Research & Engineering Directorate, Orlando (2012)
Stanley, K., Lehman, J.: Novelty search demonstration. The Novelty Search Users Page (2015a). http://eplex.cs.ucf.edu/noveltysearch/userspage/demo.html. Accessed 3 June 2018
Stanley, K., Lehman, J.: The novelty search users page. The Evolutionary Complexity (EPlex) Research Group at the University of Central Florida (2015b). http://eplex.cs.ucf.edu/noveltysearch/userspage/index.html. Accessed 3 June 2018
VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems and other tutoring systems. Educ. Psychol. 46(4), 197–221 (2011). https://doi.org/10.1080/00461520.2011.611369
Zook, A., Lee-Urban, S., Riedl, M.O., Holden, H.K., Sottilare, R.A., Brawner, K.W.: Automated scenario generation: toward tailored and optimized military training in virtual environments. In: Proceedings of the International Conference on the Foundations of Digital Games, pp. 164–171. ACM (2012)

Adaptive Instructional Systems: The Evolution of Hybrid Cognitive Tools and Tutoring Systems

Jeanine A. DeFalco1,2 and Anne M. Sinatra2

1 Oak Ridge Associated Universities, Oak Ridge, USA [email protected]
2 US Army Combat Capabilities Development Command Soldier Center, Simulation and Training Technology Center, Orlando, USA [email protected], [email protected]

Abstract. This paper begins with a discussion on framing technology as tools that facilitate a distributed cognition for problem solving and supporting discriminating minds. Specifically, Dewey’s advocacy of using technology as a learning tool to support the development of discriminating minds will serve as a framework for understanding the purpose and aim of this next generation of Adaptive Instructional Systems (AISs). This paper reviews the evolution of computer assisted instruction, distinguishing between the iterations of early computer cognitive tools to more effective intelligent tutoring systems, followed by next generation hybrid systems, or AISs, that are a unique blend of cognitive tools and intelligent tutoring. These AISs are the result of improved technological affordances combined with research in education to achieve meaningful learning and discriminate intelligence, an objective aligned with Dewey’s framework for using technology as a learning tool.

Keywords: Adaptive instructional systems · John Dewey · ITS · Learning · Computer assisted instruction

1 Introduction

Arguably, a good working definition of technology can be articulated as follows: the practical application of knowledge, especially in a particular area, or a manner of accomplishing a task, especially using technical processes, methods, or knowledge [1]. Gee [2] has framed technology as those tools – including digital tools – that facilitate a distributed cognition for problem solving. Humans are, Gee notes, tool users, and he believes it is misleading to even discuss intelligence as an unaided entity. Gee argues that part of what constitutes human stupidity is being left alone without tools or collaboration with other people [2]. Within this context, this paper looks back to Dewey’s advocacy of using technology as a learning tool as a framework to define adaptive instructional systems (AISs) for the purpose of supporting the development of discriminating minds. A distinction will be made between early iterations of computers as cognitive tools and how
the confluence of educational research and current technological affordances allows for more effective, constructivist-oriented hybrids of computer assisted instruction (CAI) that are widely being recognized as AISs. Further, this paper will briefly discuss the history of how CAIs moved from less effective instructional systems to more effective intelligent tutoring systems (ITSs), and to the current state of AISs. Finally, we will close on how examples of these hybrid AISs realize Dewey’s beliefs about the purpose of using technology as a learning tool.

2 Dewey, Technology, and Meaningful Learning

2.1 Dewey and Technology

Although written almost a century ago, Dewey’s essay Creative Intelligence [3] examines the benefit of an educated public and laments the state of the existing public education in the face of significant new technologies. Dewey reminds us that at one time in the history of America, education was necessary and appropriate to address a new world burgeoning with new technology, namely the railway, telegraph, telephone, and cheap printing press. However, education in America needed to change as it was transitioning from a landscape of farms and small towns to a world of industry and advanced transportation and communication. Industrialized America, outfitted with ever-evolving new technology, necessitated that schools develop their students’ expertise in content and skills to navigate technology responsibly.

Dewey’s idea of “responsible technology” critiqued the technological culture of the 1920s. In his critique, Dewey defines his view of a balanced technology, which included equipping citizens to develop the skills of consideration and criticism. Essentially, Dewey believes the aim of education is to develop the intelligent learner who can identify and adjust to problems of understanding, and reconstruct and reorganize their former ways of thinking, or what he termed intellectual habits [4]. Ultimately, the purpose of learning should support thinking as intellectual discernment [4]—the ability to find solutions to problems through intelligent conduct—or what we term discriminate intelligence. Dewey wanted students to develop this “discrimination intelligence” to protect them and future generations from what he termed “bunk,” especially “social and political bunk” [5]. Furthermore, Dewey pointed out that one of the reasons that schools were failing to educate our students rested on the misguided belief that by educating students to have an undiscriminating mental habit, void of the habit of criticism, schools would produce a “loyal, patriot, a well-equipped good citizen” [5].

But an undiscriminating mind can never fully participate in a democracy. At its core, democracy is hallmarked by negotiation and compromise amongst citizens who are not only knowledgeable about content but can navigate that content within the context of the dialectic of personal and social values. As such, a mind incapable of discrimination – reflection and analysis – cannot participate in acts of parrhesia: speech activity where one articulates one’s beliefs truthfully and courageously to effect change.
Dewey suggests the solution to this problem of producing undiscriminating minds is to be found in a greater confidence in intelligence, inquiry, the use of the scientific method, and engagement with responsible technology. According to Hickman [6], Dewey’s critiques of technology can be found throughout his half century and 13,000 published pages of work. Although Dewey’s writing may not systematically reflect this examination, his critique is consistent. Importantly, Dewey sees engagement with technology as a method to support discriminate intelligence and promote the autonomy of the individual. Technology, Dewey believed, would include educating individuals to select the materials and the techniques of trades for the sake of securing industrial intelligence, so that the individual may be able to make his own choices and be master of his own economic fate [3].

This analysis of Dewey is relevant to the discussion of CAIs and the next generation of AISs because it provides a framework to define and design effective learning systems that are becoming ubiquitous in training and learning environments. For educators, it is important to recognize not only how to integrate innovation into training and learning, but also how to achieve meaningful learning with technology that supports discriminate intelligence.

2.2 Meaningful Learning with Technology: Standards and 21st Century Skills

Howland, Jonassen, and Marra [7] outline a number of learning objectives that should be considered when pursuing meaningful learning with technology. These objectives speak to anchoring learning and instructional activities through tools that are engaging and supportive of authentic, active, constructive, intentional, and cooperative learning. These objectives are also mirrored in a 2017 report by the Institute for the Future [8] that analyzed the requisite proficiencies and abilities that will be required across different job and work settings in the next era of human/machine partnerships. These human proficiencies and abilities include vision, perseverance, and creative problem-solving [8]. Within these frameworks of proficiencies and abilities necessary for the future workforce, the overarching aims of learning should include integrating technology-based tools that prepare individuals for future work. These tools should also contribute to the critical thinking and social skills that are the hallmarks of discriminate intelligence, enabling individuals to meaningfully participate as engaged and reflective democratic citizens. Thus, educators need to address the inclusion of technologies that can support both the short-term goal of knowledge mastery in a specific domain and the development of proficiencies and skills necessary to support creative reasoning and problem solving. Achieving this lies in the thoughtful application of innovative technologies that support authentic engagement in an educational experience. In examining the history of CAIs, there has been a steady progression towards achieving this end, with the current iteration being systems that not only support tools that allow for inquiry and constructivist learning but also have the additional capability of tailoring instruction based on the needs and traits of their learners. These are hybrid systems that combine the early generation of computer cognitive tools with the next generation of AISs.


3 Evolution of Computer Cognitive Tools to Adaptive Instructional Systems

3.1 Computers as Cognitive Tools

In the early 1990s, educational psychologists Jonassen and Reeves [9] and Lajoie [10] advocated for using computers as cognitive tools. Jonassen and Reeves [9] defined cognitive tools as "technologies, tangible or intangible, that enhance the cognitive powers of human beings during thinking, problem solving, and learning." Lajoie [10] identified four types of cognitive tools: (1) tools that support cognitive processes such as memory and metacognitive processes; (2) tools that offset cognitive load for lower-level cognitive skills to free up resources for higher-order thinking skills; (3) platforms for learners to engage in cognitive activities that would otherwise be out of their reach, such as simulations; and (4) tools that allow learners to engage in problem solving by generating and testing hypotheses. These tools would not only enhance cognitive powers but would also serve as platforms for constructivist learning, e.g., through simulations and through generating solutions to gaps in knowledge and understanding. Historically, most of the positive learning effects from early CAIs involved memorization, understanding, and application of facts, concepts, and procedures, as reported by Vinsonhaler and Bass [11] and later confirmed by Kulik [12], whose meta-analysis reviewed 97 studies of basic computer-based instruction effectiveness and found an average effect size of 0.32. Upon closer examination, Kulik [12] noted great variation in the differences in learning between CAIs and classroom instruction: some differences exceeded 1.00 standard deviations, whereas others reported zero difference. More complex systems, known then as intelligent tutoring systems, showed greater overall improvement in learning outcomes. Further, Dodds and Fletcher [13] noted comparative effect sizes of 0.39 for learning via computer-based training, 0.50 for multimedia platforms, and 1.08 for ITSs. From this, we can infer that the effectiveness of ITSs was due not only to the capabilities of the advancing technologies; arguably, their increased success could also be attributed to the informed design of systems based on evidence-driven methods that promote deeper, more constructive learning activities.
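
The effect sizes cited above are standardized mean differences between a technology-supported condition and a conventional-instruction control; Cohen's d is the usual formulation. The formula below is added here as a reader aid and is not quoted from the cited meta-analyses:

d = \frac{\bar{X}_{\mathrm{treatment}} - \bar{X}_{\mathrm{control}}}{s_{\mathrm{pooled}}}, \qquad s_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}

An effect size of 0.32 therefore corresponds to roughly a third of a standard deviation of improvement over conventional classroom instruction.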

3.2 Intelligent Tutoring Systems

ITSs have been defined as computer-based systems that seek to capture the capabilities and practices of a human tutor who is both a subject matter expert and able to respond intelligently to the learner and their actions [14]. In a review by VanLehn [15], step-based ITSs produced an average improvement in learning outcomes of 0.76 standard deviations. Kulik and Fletcher [16] found similar improvements in their examination of 39 ITSs. While Kulik and Fletcher [16] also showed great variation in results – from marginally negative to greater than 2.00 standard deviations – they suggested that the poor performance of some ITSs was due to insufficient teacher support when students used ITSs, and in other cases to misalignment between assessments and the objectives assessed in an ITS.


Most interesting among the findings of Kulik and Fletcher [16], however, is a report on repeated evaluations of an ITS that compared learning outcomes for basic facts and simple procedures with deeper, conceptual learning. Their findings included evidence that ITSs were better suited to supporting and promoting deep, conceptual learning rather than basic, procedural learning. This is not a surprising outcome, for the salient element of an ITS is that content and assessments can be authored to automate steps in the system that adapt to the states (performance or emotional) of a learner. However, these systems are not as ubiquitous in schools as they are in industry and military training. ITSs are increasingly being utilized for skills training and decision making in industry (e.g., air pilot training) and in the military (e.g., land navigation, medical care). Further, ITSs have been built primarily to support training and education for individuals. However, there are efforts underway to build ITSs that support collective training for teams, crews, and units, which is essential in meeting the needs of organized military missions and addressing collaborative problem solving [17, 18]. Examples of four widely used adaptive ITSs for individual learners include AutoTutor (developed by the University of Memphis), the Authoring Software Platform for Intelligent Resources in Education (ASPIRE; developed by the University of Canterbury in New Zealand), the Cognitive Tutor Authoring Tools (CTAT; developed by Carnegie Mellon University), and the Generalized Intelligent Framework for Tutoring (GIFT; developed by the CCDC Soldier Center - STTC) [19]. GIFT, however, is unique amongst this representative sample of adaptive ITSs in that it is essentially a framework that allows ITS authors to create tutors in any domain, and its functionality is being expanded to help train and teach teams, not just individuals. While early ITSs were envisioned less as cognitive tools and more as tutors, more recent ITSs have evolved considerably, both in functionality and, more importantly, in purpose. Functionality and purpose have merged in authoring functions that feature adaptation in communication, teaching content, domain/student knowledge, and knowledge representation to improve and support deep learning [19].

3.3 Adaptive Instructional Systems

While the efficacy of early, basic CAIs was admittedly modest, they were but the first in what is now a more robust domain of adaptive instructional systems (AISs). Lajoie and Derry [10] suggested that computers should be viewed more as mind-extensions or cognitive tools than as teachers/experts. Additionally, the field of CAI has expanded in its efficacy, producing greater learning outcomes from tutoring systems. This expansion includes advancements in the technological affordances available to deliver tutoring capabilities that are responsive to the individual. For example, systems have been devised to adapt the feedback they deliver to the individual using sensor-based and sensor-free data [20]. Natural language processing capabilities can now facilitate and support human/computer dialogue to provide additional adaptive capabilities for tailoring instruction based on learner responses. AISs reflect a current paradigm shift within the field of ITSs. This shift is seen substantively in a newly organized IEEE working group, P2247.1, that seeks to define and standardize the nature, purpose, and specifications of AISs.


At present, this working group has established that AISs will include traditional ITSs, but there is a recognition that the category must also include other kinds of systems, such as virtual human agents, which are a more faithful rendering of using the computer as an agent for constructivist learning. However, to date there is no agreed-upon definition that distinguishes AISs from ITSs or any other kind of technology-based learning platform. It is expected that this working group will delineate between mere cognitive tools and this new generation of hybrid systems that are part cognitive tool and part tutor. One proposed definition identifies AISs as systems that are not limited to mere extensions or enhancements of cognition, but rather support discriminate intelligence as hallmarked by expert reasoning and decision making – the aim of educational efforts, according to Dewey [4]. If defined in this way, AISs would be readily recognized as the next evolutionary step from computer-assisted instruction to ITSs to this new hybrid mindtool/tutor. Ideally, operationalizing AISs would include delineating this generation of technology-mediated learning platforms as ones that support educational experiences that promote intelligent learning and discriminate intelligence.

3.4 Efficacy of AISs

The efficacy of AISs in improving learning outcomes and supporting discriminate intelligence resides both in the system's ability to adapt to the learner based on levels of content complexity and in the effect of feedback. These systems can also provide a variety of educational experiences to facilitate authentic and simulated constructivist learning environments. This can be seen in systems such as GIFT [21], which can integrate game-based learning tools such as Virtual Battle Space [22] and VMedic [23]. There is also a use for virtual tutors by way of tutorial dialogue systems that interact with humans through conversations and adaptively respond to a person's actions, emotions, and verbal contributions, as seen in AutoTutor [24]. Additionally, virtual human assistants such as ELLIE 1.0 [25] can be used to assess mental health. ELLIE 1.0 is a virtual human agent that can "see" an individual's gestures, facial movements, and postures, respond to these changes, and engage in natural dialogue to draw a participant into a therapist-type conversation. Currently, there are efforts underway to create a new and improved ELLIE 2.0 that can not only assist in training efforts for the military but can also support the ethical decision making of individuals. As noted by Chi and Menekse [26], dialogic learning, which is essentially learning through discussion, is superior to passive, active, or even collaborative learning methods. In this way, ELLIE 2.0 will not only function as an ITS but will also support constructivist learning and, in turn, discriminate intelligence. This integration of a variety of cognitive tools into tutoring systems is essentially what constitutes the new hybrid systems that serve both as tutors and as platforms for constructivist educational experiences. Their efficacy lies in the interdisciplinary development of these systems, where teams of computer scientists work with educational psychologists to build and test the efficacy of new AIS designs. In this way, systems are being designed to reflect best practices regarding effective pedagogy, supporting more than basic, procedural skills training and memorization.


Instead, systems are being developed to facilitate educational experiences that employ tools and learning objects that allow for more constructivist approaches to learning and that support discriminate intelligence across a range of disciplinary domains. Importantly, there is evidence that AISs can produce learning effects equivalent to those of expert human tutors, though these results to date have been limited to well-defined cognitive domains, such as computer programming, physics, and mathematics [14]. Yet there is an emerging belief that these AISs could become viable stand-alone training options if the fundamental challenges to achieving increased and accelerated learning outcomes were identified and solutions were discovered. Within this context falls more current research focused on accurately modeling the learner and the educational experience [27].

4 GIFT: The Generalized Intelligent Framework for Tutoring

In addition to devising AISs that can support increased learning in science and math, work is also emerging to devise pedagogical templates that address creative and ethical thinking. GIFT is emerging as a leader among these kinds of platforms [21]. Specifically, GIFT can make instructional decisions to adapt content and the sequencing of content to support expert problem solving, as well as make adaptive selections based on learner traits, needs, and preferences. Further, it can host a range of constructivist learning objects to support meaningful learning in educational experiences, e.g., game-based learning, virtual tutors, and virtual human agents. As a content delivery platform, learning object materials can be presented to the learner for opportunities to experiment with content and construct their own knowledge through simulated yet authentic learning scenarios. GIFT's adaptive functionality also allows participants to return to their learning courses populated with prior engagement data that feeds forward into the adaptive elements. In this way, GIFT functions as a cognitive tool. As a cognitive tool, GIFT serves as an extension and repository of the learner's prior experiences, one that not only feeds forward the content and sequencing of material but can also remind the learner about their prior work, adapting assessments and simulations based on prior demonstrated competency. The challenge in this lies in devising and validating individual student models and in identifying the trait and behavioral markers by which to structure an adaptive model. This learner modeling is recognized as a persistent, difficult task. The solution lies in part in merging innovative technologies with new pedagogical theories based on the science of learning and in conducting empirical investigations to validate the models [28]. One current research effort to address this challenge is an investigation into the relationship between learner traits and the sequencing of increasingly complex content as it contributes to accelerated learning mediated by GIFT.
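
As a rough, generic illustration of the kind of mastery-driven, feed-forward sequencing described above, the sketch below selects a returning learner's next learning object from a persisted record of prior engagement. The data structures, names, and selection heuristic are hypothetical, invented for this discussion; this is not GIFT code and does not reflect GIFT's actual data model or APIs.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class LearningObject:
    # Hypothetical content unit: a lesson page, simulation, or exercise.
    name: str
    concept: str         # curriculum concept it teaches or assesses
    complexity: int      # 1 = introductory ... 5 = expert-level

@dataclass
class LearnerRecord:
    # Persisted across sessions so prior engagement feeds forward.
    mastery: Dict[str, float] = field(default_factory=dict)   # concept -> 0..1
    completed: List[str] = field(default_factory=list)        # finished objects

def next_learning_object(record: LearnerRecord,
                         catalog: List[LearningObject],
                         mastery_threshold: float = 0.8) -> Optional[LearningObject]:
    """Pick the easiest not-yet-completed object for the least-mastered concept."""
    candidates = [lo for lo in catalog
                  if lo.name not in record.completed
                  and record.mastery.get(lo.concept, 0.0) < mastery_threshold]
    if not candidates:
        return None  # learner has met the threshold everywhere
    return min(candidates,
               key=lambda lo: (record.mastery.get(lo.concept, 0.0), lo.complexity))

# Example: a returning learner who has mastered hemorrhage control but not airway care.
catalog = [LearningObject("hemorrhage_basics", "hemorrhage_control", 1),
           LearningObject("hemorrhage_sim", "hemorrhage_control", 3),
           LearningObject("airway_basics", "airway_management", 1)]
returning = LearnerRecord(mastery={"hemorrhage_control": 0.9, "airway_management": 0.3},
                          completed=["hemorrhage_basics"])
print(next_learning_object(returning, catalog).name)   # airway_basics

In practice, an AIS would also weigh learner traits, needs, and preferences and would update the mastery estimates from assessments after each exercise.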


4.1 Devising a Learner Model for Accelerated Learning Mediated by GIFT

An inter-institutional study is currently underway between Columbia University and the CCDC Soldier Center - STTC using GIFT as an experimental test bed and delivery platform to investigate the traits and cognitive abilities that will be used to devise a pedagogical template to support expert decision making for medical personnel training in critical care education. After correlational analyses and experimentation, the investigators intend to come to a better understanding of the salient traits relevant to abstract/creative reasoning in expert decision making, using GIFT to deliver materials and monitor outcomes in support of accelerated critical care learning. Two pilot studies were recently conducted at the United States Military Academy to examine the initial effects of priming with analogical and spatial reasoning tasks on verbal and mathematical learning outcomes. This initial look at priming is part of a larger effort to determine whether there would be an effect on learning if participants were primed with these tasks, delivered through GIFT, prior to engagement with medical content. This is an important effect to study as it speaks directly to the pedagogical template that can be authored within GIFT or any other AIS in the effort to effectively employ these systems as hybrid cognitive tools and tutors. In the first pilot test, mental rotation problems were used to prime participants to solve mathematical reasoning problems. When participants were primed with mental rotation tasks, there was a trend for these participants to outperform those who did not receive the mental rotation task on a mathematical reasoning test. In a second, informal pilot test, mental rotation problems were again used to prime participants, this time before solving verbal analogies. The results indicated that those who solved mental rotation tasks prior to solving verbal analogies had a higher overall success rate than those who had not been primed with the mental rotation tasks. While no conclusive, statistically significant claims can be made from this informal examination, there is enough anecdotal evidence to support a broader, more comprehensive experimental study examining the effect of spatial reasoning tasks on learning outcomes mediated by an AIS. Accordingly, the authors of this paper are in the process of launching two experimental studies that will provide data on how the sequencing of content in a pedagogical design within GIFT can provide further evidence of the efficacy of this AIS as an effective hybrid cognitive tool and tutor. As a framework, GIFT's robust efficacy ultimately lies in its structural flexibility to author courses that sequence learning objects that are not merely static but dynamic, meaningful, and authentic.

5 Conclusion

This paper has advocated for a re-examination of Dewey's call to support intelligent learning, in which technology-based learning platforms are designed as constructivist educational experiences that support the skills necessary to develop discriminate intelligence – a trait necessary as the driving force that sustains a democratic society. This call becomes increasingly possible to realize given the progress in technological affordances available for integration into CAI educational experiences, driven by informed research by educational psychologists.


Essentially, AISs are uniquely positioned to function both as computer cognitive tools and as tutors, providing opportunities for meaningful engagement in authentic experiences to represent and construct knowledge as well as to assist in complex decision making. In this way, AISs are already moving far beyond mere cognitive tools or the limited experiences afforded by even a human tutor. Rather, AISs are creating novel, engaging, liminal spaces where learners can more fully engage in a constructivist approach to learning, allowing for distributed cognition in problem solving to support discriminate intelligence.

Acknowledgements. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-17-2-0152. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

1. Merriam-Webster Online Dictionary (2019). https://www.merriam-webster.com/dictionary/technology
2. Gee, J.P.: The Anti-education Era: Creating Smarter Students Through Digital Learning. St. Martin's Press, Chicago (2013)
3. Dewey, J., Moore, A.W.: Creative Intelligence: Essays in the Pragmatic Attitude. H. Holt, New York (1917)
4. Lamons, B.: Habit, education, and the democratic way of life: the vital role of habit in John Dewey's philosophy of education (2012). https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Lamons%2C+B.+Habit%2C+&btnG=&httpsredir=1&article=5314&context=etd
5. Dewey, J.: Education as politics. New Repub. 32(409), 140 (1922)
6. Hickman, L.A.: John Dewey's Pragmatic Technology. Indiana University Press, Bloomington (1990)
7. Howland, J.L., Jonassen, D.H., Marra, R.M.: Meaningful Learning with Technology. Pearson, Upper Saddle River (2012)
8. ITFT: Institute for the Future. The next era of human/machine partnerships (2017). http://www.iftf.org/fileadmin/user_upload/downloads/th/SR1940_IFTFforDellTechnologies_Human-Machine_070717_readerhigh-res.pdf. Accessed 20 Sept 2018
9. Jonassen, D., Reeves, T.: Learning with technology: using computers as cognitive tools. In: Jonassen, D. (ed.) Handbook of Research for Educational Communication and Technology, pp. 693–719. Simon & Schuster Macmillan, New York (1996)
10. Lajoie, S.P., Derry, S.J. (eds.): Computers as Cognitive Tools. Routledge, New York (2013)
11. Vinsonhaler, J.F., Bass, R.K.: A summary of ten major studies on CAI drill and practice. Educ. Technol. 12, 29–32 (1972)
12. Kulik, J.A.: Meta-analytic studies of findings on computer-based instruction. In: Baker, E.L., O'Neil, H.F. (eds.) Technology Assessment in Education and Training, pp. 9–33. Lawrence Erlbaum Associates, Hillsdale (1994)


13. Dodds, P., Fletcher, J.D.: Opportunities for new smart learning environments enabled by next generation web capabilities. J. Educ. Multimed. Hypermedia 13, 391–404 (2004)
14. Sottilare, R., Fletcher, J.: Research task group (HFM-237) work plan. In: Assessment of Intelligent Tutoring System Technologies and Opportunities. STO/NATO (2018). https://gifttutoring.org/attachments/download/2520/$$TR-HFM-237-ALL.pdf
15. VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 46, 197–221 (2011)
16. Kulik, J.A., Fletcher, J.D.: Effectiveness of intelligent tutoring systems: a meta-analytic review. Rev. Educ. Res. 86, 42–78 (2016)
17. Adamson, D., Dyke, G., Jang, H., Rosé, C.P.: Towards an agile approach to adapting dynamic collaboration support to student needs. Int. J. Artif. Intell. Educ. 24(1), 92–124 (2014)
18. Rosen, Y.: Computer-based assessment of collaborative problem solving: exploring the feasibility of the human-to-agent approach. Int. J. Artif. Intell. Educ. 25(3), 380–406 (2015)
19. Sottilare, R.: A review of intelligent tutoring system authoring tools and methods. In: Assessment of Intelligent Tutoring System Technologies and Opportunities. STO/NATO (2018). https://gifttutoring.org/attachments/download/2520/$$TR-HFM-237-ALL.pdf
20. Paquette, L., et al.: Sensor-free or sensor-full: a comparison of data modalities in multichannel affect detection. International Educational Data Mining Society (2016)
21. Sottilare, R.A., Brawner, K.W., Sinatra, A.M., Johnston, J.H.: An updated concept for a generalized intelligent framework for tutoring (GIFT). GIFTtutoring.org (2017)
22. Hoffman, M., Markuck, C., Goldberg, B.: Using GIFT wrap to author domain assessment models with native training applications. In: Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym4), p. 75, July 2016
23. DeFalco, J.A., et al.: Motivational feedback messages as interventions to frustration in GIFT. In: Proceedings of the Fourth GIFT User Symposium (GIFTSym4), Princeton, NJ, pp. 25–35 (2016)
24. Graesser, A.C., Cai, Z., Morgan, B., Wang, L.: Assessment with computer agents that engage in conversational dialogues and trialogues with learners. Comput. Hum. Behav. 76, 607–616 (2017)
25. DeVault, D., et al.: SimSensei Kiosk: a virtual human interviewer for healthcare decision support. In: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pp. 1061–1068. International Foundation for Autonomous Agents and Multiagent Systems, May 2014
26. Chi, M.T., Menekse, M.: Dialogue patterns in peer collaboration that promote learning. In: Socializing Intelligence Through Academic Talk and Dialogue, pp. 263–274 (2015)
27. DeFalco, J.A., Hum, R., Wilhelm, M.: Developing accelerated learning models in GIFT for medical military and civilian training. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) AC 2018. LNCS (LNAI), vol. 10916, pp. 183–191. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91467-1_15
28. Perez, R., Skinner, A., Sottilare, R.: Review of intelligent tutoring systems for science technology engineering and mathematics (STEM). In: Assessment of Intelligent Tutoring System Technologies and Opportunities. STO/NATO (2018). https://gifttutoring.org/attachments/download/2520/$$TR-HFM-237-ALL.pdf

Lessons from Building Diverse Adaptive Instructional Systems (AIS)

Eric Domeshek, Sowmya Ramachandran, Randy Jensen, Jeremy Ludwig, Jim Ong, and Dick Stottler

Stottler Henke Associates, Inc., 1650 S. Amphlett Blvd., Suite 300, San Mateo, CA 94402, USA
[email protected]

Abstract. This paper presents lessons learned from building a wide range of Adaptive Instructional Systems (AISs), ultimately bearing on the question of how to characterize the space of potential AISs to advance the cause of standardization and reuse. The AISs we consider support coached practice of complex decision-making skills—e.g., military tactical decision-making, situation assessment, and systems troubleshooting and management. We illustrate forces that affect system design and dimensions along which systems then vary. The relevant forces derive from the AIS's area of application, the project structure within which it is built, and the customer's priorities. Factors to consider include (1) the extent to which the target domain is well-defined versus ill-defined; (2) the degree of fidelity required, preferred, and/or available for an exercise simulation environment; (3) the intended roles of automated and human instructors in instructional delivery; and (4) imperatives for short-term and/or long-term cost containment. The primary dimensions of AIS design we consider in this paper include (1) Exercise Environment; (2) Knowledge Models; (3) Tutor Adaptations; and (4) Supporting Tools. Each of these is further broken down into a set of more detailed concerns. Together, they suggest structures that can inform an ontology of AIS methods and modules.

Keywords: Adaptive Instructional Systems · Application and project demands · Design and implementation choices · Ontologies and standardization



1 Experience with Intelligent Tutoring Systems (ITSs)

Funded primarily by DoD and NASA, Stottler Henke has developed dozens of Intelligent Tutoring Systems (ITSs) for training and education in a wide range of subject areas. By reflecting on our direct experience, this paper aims to describe patterns that may help advance broader standardization goals for future applications that share features with some of our past ITSs. Most of these systems support coached practice of complex decision-making skills, such as military tactical decision-making, situation assessment, or systems troubleshooting and management. To support coached practice, ITSs typically integrate interactive exercise environments—e.g., simulations or problem-solving tools—that present problems, tasks, or scenarios to the student (Woolf 2010).


Training exercises typically correspond to activities performed on the job. Educational exercises may teach knowledge and skills that are less tightly coupled to real-world tasks. In either case, the student works in the environment to perform task-relevant actions observable by the automated tutor. Students may also provide additional evidence of their thought process when analyzing situations or making decisions. The tutor may communicate with the student directly, via user interface widgets or an avatar, or it may communicate indirectly via simulated characters.

Stottler Henke ITSs, like other tutoring systems, adapt their instruction to individual students by:

• Selecting exercises and instruction to address an individual student's current learning needs, and modifying the course of exercises based on student performance; and
• Providing individually tailored interventions such as prompts, hints, performance feedback, questions, and remedial instruction, before, during, and after exercises.

The applications our ITSs address, the project structures within which we build them, and our customers' priorities greatly influence the methods and technologies we apply. Consider some examples of these kinds of constraints:

• Well- Versus Ill-Defined Domains: In well-defined domains, common ITS design approaches involve tracing fully competent expert cognitive performance models, often accompanied by models of flawed student performance [1–3]. Such ITSs can assess performance by matching the student's actions with correct and buggy solutions generated from these models. It is feasible (though challenging and costly) to develop a sufficiently complete performance model when a subject is well-bounded and stable over time, such as high school math or physics. However, many of the ITSs we have developed address ill-defined domains [4, 5], involving complex decision-making in poorly bounded domains, in which the knowledge and skills—and therefore the exercises that teach them—must keep up with changing situations, methods, and tools used in the real world. Thus, we often implement ITSs that complement limited general-purpose performance models with more complex scenario-specific models of expertise.
• Fidelity of Simulated Task Environments: Rooted in the doctrine, "train as you fight," DoD customers often have a strong preference for high-fidelity simulation. Unfortunately, using a high-fidelity simulation often incurs high costs, whether the ITS includes a new simulator or integrates with a legacy system [6, 7]. Instrumenting legacy simulations so they report the kinds of information needed by an ITS is a recurring problem, as is finding means to route ITS output into the user interaction context created by a legacy simulation [8]. Alternately, quickly constructing a "cognitively realistic" simulation offers its own challenges, especially when it must include appropriately responsive non-student agents.
• The Role of Instructors: In our experience, practical ITSs are often used in blended learning environments, so most ITSs should also help human instructors understand and address the needs of individual students and the class overall.


• Cost Containment: There are always constraints on the time and money available to build an ITS: its exercise environment and instrumentation, its control knowledge and logic, and its instructional and exercise content. Since training applications tend to change relatively quickly, the systems often need to be modified and maintained over time—new doctrine accommodated, new exercises created, etc.—ideally without having to establish another contract and pay for expensive AI expertise on an ongoing basis. This drives a desire for authoring tools, which in turn affects the forms that the ITS's knowledge and supported algorithms can practically take [9, 10].

The overarching goals of suiting the task domain, addressing the learning objectives, and achieving training effectiveness within available resource limitations can be addressed in many ways. Thus, as illustrated below, we have employed a variety of approaches to developing adaptive exercise environments, assessment knowledge, and instructional interactions.

2 Requirements and Dimensions for AISs

In order to adapt instruction, one must be able to:

• Assess the student's knowledge and skills (and possibly attitudes, aptitudes, and emotional states), based on performance within the work environment and/or interactions with the tutor, perhaps supplemented by legacy instruments; and
• Select and present instructional interactions at the right time and in an effective manner to address issues identified during assessment.

In the context of the broader set of issues raised by ITSs, we have found it useful to characterize our systems along the following dimensions, all of which affect adaptation:

1. Exercise Environment
   a. Degree of Free Play
   b. Modes of Interaction
   c. Content of Interaction
   d. Interpretation Issues
   e. Simulation Constraints
2. Knowledge Models
   a. Expert Model
   b. Tutor Model
   c. Student Model
3. Tutor Adaptations
   a. Instruction Selection
   b. Exercise Selection
   c. Prompting
   d. Hinting
   e. Immediate Feedback
   f. After-Action Review
   g. Socratic Dialog
   h. Remedial Instruction


4. Supporting Tools
   a. Instructor Support
   b. Authoring Support

In the following section, we provide discussion and examples of these dimensions.

3 System Design Dimensions and Their Effects on Adaptation

3.1 Exercise Environment

Degree of Free Play. Degree of Free Play [11] is our first environmental factor. Free play, here, means relatively unconstrained—yet still tracked and tutored—pursuit of specific training-defined tasks. It does not imply relatively undirected exploration of a simulated "microworld" [12, 13]. Consider two possible points on this dimension and how they affect adaptation.

• Some of our systems, such as the Tactical Action Officer (TAO) ITS [14, 15], provide free-play simulations whose events evolve realistically in response to a wide range of possible student actions. The timing and sequence of tactical situations depends upon actions taken by computer-generated forces, which in turn respond to student actions and the actions of other simulated agents. Thus, one cannot assess student performance simply by recognizing student actions at prespecified times or in a specified sequence. Instead, student actions must be evaluated in the context of the tactical situation by considering the state of other friendly and opposing forces and their recent actions. Thus, the assessment subsystem within TAO-ITS uses Behavior Transition Networks (BTNs), an extension of finite state machines. Developed at Stottler Henke, BTNs detect significant sequences of events, conditions, and student actions to assess performance and identify knowledge and skill gaps [16, 17]; a simplified illustration of this style of recognition appears after this list.
• By contrast, effective training systems can often be created that provide semi-free-play simulations in which a moderate number of options are made available to the student. For these kinds of training simulations, we have developed an in-house intelligent tutoring engine called the Task Tutor Toolkit (T3) [18, 19]. T3 matches student actions to a solution template that encodes correct actions within a scenario and their allowable variation. Although this approach supports simulations that are somewhat constrained, compared to free-play simulations, the development of these simulations and the solution template for each scenario requires much less effort.
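
To make the BTN idea more concrete, the following is a minimal, hypothetical sketch (in Python) of a condition-guarded sequence recognizer of the general kind described above: the recognizer advances only when an observed student action matches an expected transition whose guard holds in the current simulation state, and anything else is recorded as a potential gap. All class names, actions, and guards are invented for illustration; this is not SimBionic's BTN formalism nor actual TAO-ITS code.

from dataclasses import dataclass
from typing import Callable, Dict, List

# A simulation snapshot: whatever tactical context the guards need to inspect.
SimState = Dict[str, object]

@dataclass
class Transition:
    action: str                          # observed student action name
    guard: Callable[[SimState], bool]    # must hold in the current simulation state
    next_state: str

class SequenceRecognizer:
    """Advance through expected states on (action, context) pairs; flag mismatches."""
    def __init__(self, start: str, goal: str,
                 transitions: Dict[str, List[Transition]]):
        self.state, self.goal, self.transitions = start, goal, transitions
        self.violations: List[str] = []

    def observe(self, action: str, sim: SimState) -> None:
        for t in self.transitions.get(self.state, []):
            if t.action == action and t.guard(sim):
                self.state = t.next_state
                return
        # Action did not fit any expected transition in this context.
        self.violations.append(f"{action} in state {self.state}")

    def satisfied(self) -> bool:
        return self.state == self.goal and not self.violations

# Example: the student should warn a closing contact before engaging it.
recognizer = SequenceRecognizer(
    start="monitoring", goal="engaged",
    transitions={
        "monitoring": [Transition("issue_warning",
                                  lambda s: s.get("contact_range_nm", 99) < 10,
                                  "warned")],
        "warned": [Transition("engage",
                              lambda s: bool(s.get("contact_hostile", False)),
                              "engaged")],
    })
recognizer.observe("issue_warning", {"contact_range_nm": 8})
recognizer.observe("engage", {"contact_hostile": True})
print(recognizer.satisfied(), recognizer.violations)   # True []

A real BTN adds hierarchy, parallelism, and richer event handling; the point of the sketch is only that assessment is driven by action-in-context pairs rather than by actions alone.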


Modes of Interaction. The available forms of input and output (I/O) supported by an exercise environment are a major determinant of training realism, a potential cost driver of ITS development, and an opening for widening the range of adaptations the system can exhibit. Most of our systems include traditional graphical user interface (GUI) elements with forms and widgets, often designed to mimic or abstract some real-world equivalent—e.g., systems control interfaces or traditional paperwork. Many of our systems include more strongly graphical elements such as maps (for tactical situations, as in TAO-ITS, InGEAR [20], ComMentor [21, 22], and many others), diagrams (for complex logic or control systems, as in ICT Tutor [23]), or graphs (for data analysis problems, as in AAITS [24]). We have done some work on leveraging 3D or virtual reality environments, including Commercial Off-The-Shelf (COTS) game engines. We have devoted more effort to allowing for language-based interaction, including speech-based I/O. For instance, students using the second-generation TAO-ITS system could converse with simulated watchstanders staffing a ship's combat center, and students in METTLE [25] could engage in a diagnostic interview with a simulated patient. In the line of prototypes starting with ComMentor, we explored mechanisms to support extended Socratic dialogs.

No matter the external format, a system can ultimately only respond sensibly to some range of inputs. In some situations, widgets are used to build tightly restricted forms of input, such as multiple-choice questions or checklists. In other situations, we use more flexible mechanisms, such as interactive graphics (e.g., diagrams, maps, and timelines), or constrained understanding and generation of spoken language and typed text. Most diagrams have some natural structure—e.g., nodes and/or links with some sets of available states—that effectively defines what can be communicated. Many maps or data presentations can be analyzed into a smaller set of qualitatively distinct and meaningful regions. Even with language, a massive universe of student utterances can be mapped to a much smaller set of expected and meaningful inputs. Driven by domain needs and project constraints, the recognizable inputs may constitute a relatively large or quite small set of alternatives to which the exercise environment and/or tutor must adapt. We say more below about the kinds of content that can be communicated and the degree of contextual interpretation required even once a student input is recognized and classified.

Specifying and controlling a range of environment or tutor outputs presents a different and often simpler set of challenges. For instance, understandable language generation is easier than language interpretation. Utterances can be completely canned or templated to allow some useful range of variation. In some cases, pre-recorded audio and video are appropriate. Means to inject language or other kinds of environment manipulations into an exercise can be designed into custom-built simulations. However, introducing ITS-driven adaptive behaviors into an existing simulation can be more of a challenge.

Content of Interaction. A typical exercise environment focuses on providing means to support carrying out a task to be trained. However, sometimes it is not possible to assess the student's knowledge and skills based solely on their observable task-relevant actions. For such applications, it is useful to elicit the student's reasoning, either by requiring the student to show their reasoning as part of the assigned task, or by asking the student questions about their thought process. As an example of showing reasoning, the ITADS [26, 28] tutor provides a "rationale panel" where students can maintain an inventory of hypotheses underlying their actions in its free-play diagnostic environment. As an example of asking about thought processes, the ComMentor system presents tactical decision games that prompt students, typically Army Captains, to sketch tactical plans.


Then the system evaluates the student's plan and engages in a Socratic dialog that probes their understanding of the situation, guiding them through questions and feedback to build up an argument structure for appropriate interpretations and courses of action. AAITS teaches underwater acoustic analysis and the ICT Tutor teaches counter-intelligence analysis; both of these systems require students to enter observations and inferences using a domain-specific, graphical analysis tool. In all these cases, the tutors condition their assessments and their instructional support not just on overt task performance, but also on revealed student rationale.

Interpretation Issues. Student inputs often require contextually sensitive interpretation in order to support tracking and assessment of performance and learning. This context-sensitivity can exist at different points on the free-play spectrum, and for different input forms and contents, even after basic input meaning is determined. As noted earlier, for free-play exercises as in TAO-ITS, we use BTNs to evaluate student actions in their tactical context, including the state and recent actions of other friendly and opposing forces. For constrained-play exercises using T3, student actions are interpreted with respect to their place in an observed sequence relative to a flexible solution template. In ITADS, student assertions about hypotheses are compared to the system's own record of justifiable hypotheses given the diagnostic information uncovered by student actions. In ComMentor, student input about their understanding of the situation and decision rationale is interpreted relative to a set of argumentative points to be discussed, while tracking context for topics that have already been raised, are currently in focus, and are yet to be discussed. Without context-sensitive interpretation, tutor assessments will often be incorrect and system adaptations ill-informed.

Simulation Constraints. An ITS's exercise environment may be custom-built, rely on integration with a legacy system, or use some combination of the two. Beyond the cost and realism impacts, the build and reuse options tend to introduce different constraints on observing student activity and controlling the student experience—e.g., modifying the course of the exercise and/or injecting tutor interactions. For instance, the second-generation PORTS/TAO-ITS was integrated with the legacy Aegis PORTS simulator; enabling the ITS BTNs to observe and manipulate elements of the simulation required close collaboration with the PORTS developers. Similar collaboration was required for InGEAR's integration with a legacy game. Sometimes, however, opportunities to ask for access to desired interfaces in legacy systems may be limited, as when we used the standard High Level Architecture (HLA) interfaces for the BC2010 ITS [26]. For ITADS, we were required to provide a high-fidelity simulation of a representative shipboard information technology environment. The resulting network of custom-configured virtual machines required new instrumentation in order to support rich observation and control. For the students' free-play troubleshooting task in ITADS, we were forced to invent new forms of modeling to maximize inferences given limited observability of student actions. Within the same system, students also have a procedure execution task, for which we fell back on (more restricted) solution templates (like T3) for modeling and coaching. In contrast, systems such as AAITS, ICT, and METTLE all relied on custom-built exercise environments in which we could, with some development effort, make the environments do whatever was needed to support adaptive instruction.


3.2 Knowledge Models

In our ITSs, the dominant modeling effort is typically devoted to building the Expert Model—describing what constitutes good understanding and behavior in the domain, and therefore what we would like the student to learn. The details of the Expert Model are often abstracted and linked to an associated Curriculum Model—a hierarchy of principles that students are expected to master. The Curriculum Model summarizes what the students are to learn; the Expert Model contains all the details of how to recognize and/or perform adequate knowledge and behavior. Tutor interventions such as hints, prompts, and feedback can be associated with either or both of Curriculum nodes and Expert behaviors. The Tutor Model—controlling delivery of available tutor interventions in response to student performance and states of the Student Model—is typically left as code, though often with some parameterization to allow fine-tuning to suit instructors' pedagogical preferences. The Student Model is typically a mastery overlay on the nodes of the Curriculum Model, capturing the results of automated assessments that are part of the Expert Model. The discussion here focuses on the various forms the Expert Model takes in different ITSs, with the aim of achieving training effectiveness within available resource limitations while suiting the task domain and learning objectives.

• Expert Behaviors: Probably the most flexible form of Expert modeling we use is context-sensitive sequence recognizers of the kind most easily built using BTNs (as applied in TAO-ITS, the C2 V-ITS [29, 30], and many other systems). An important issue with such models is the extent to which the resulting behavior specifications are situationally bound. That is, does a BTN apply only within a specific scenario (or some part of a scenario), or is it a more generally applicable characterization of good performance? It is common for our ITSs to include a combination of scenario-specific and cross-scenario knowledge. Versions of this question arise for the other modeling approaches discussed below. Our SimBionic system for BTN authoring supports hierarchical decomposition and target-specific variants of BTNs, which can be useful when seeking to partition and generalize behavior specifications for reuse.
• Domain Constraints: Another powerful and common approach to modeling Expert knowledge and behavior is to capture constraints among domain objects and actions. This approach was used in a series of tactical tutors such as InGEAR, AAIRS [31, 32], and the BC2010 ITS. InGEAR paid particular attention to generalizing such constraints by starting the process of formalizing key concepts to facilitate automated recognition in different situations—e.g., characterizing the concept of cover and concealment as it applies to different terrains. The ITADS troubleshooting Expert Model exploited constraints on what symptoms provided evidence for or against what faults, and what student actions could produce additional evidence. The form of the model was envisioned as potentially applying across a wide range of troubleshooting applications.


• Solution Templates: The template approach introduced by T3 provides a lower-cost, though more limited, means to track and coach activity when more flexible approaches are not needed or perhaps not affordable. A similar scheme was used for those aspects of ITADS focused on fixing (rather than diagnosing) problems. Variation in activity ordering and parameterization, as well as some context-sensitive selection of alternatives, can be accommodated. Coaching can be provided to help keep the student making progress along a viable path. Again, templates can be written so they apply across multiple exercises; a minimal illustration of template matching appears after this list.
• Reusable Script Elements: Scripted exercises seem like they would offer only limited adaptation. However, the degree of limitation depends on the sophistication of the scripting formalism and the size of the scripts. For example, the METTLE system is explicitly built on a concept of scripting, but the scripts are very large (e.g., hundreds of lines for a simulated patient) and each script line is conditional. Conditions can include not only immediate triggers—such as things just done or said by the student—but also tests depending on earlier actions or logical combinations of such actions. To speed authoring and promote generality, script lines can be reused by particular actors in specific scenarios and/or scenes, with selective overrides for chosen aspects of such lines—e.g., conditions, actions, or instructional annotations.
• Direct Mastery Evidence: In some ITSs, exercises are composed from more focused interactions—e.g., questions and available answers—that are directly linked as evidence for or against mastery of Curriculum principles or skills. For instance, in ReadInsight, a text passage is accompanied by (an adaptive set of) comprehension questions. Available answers to each question are treated as mastery evidence.
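
As a minimal illustration of the solution-template idea referenced above, the sketch below checks an observed action sequence against a template whose steps may be marked optional or order-flexible. The representation, step names, and matching rules are invented for this discussion; they are not the Task Tutor Toolkit's actual formats.

from dataclasses import dataclass
from typing import List

@dataclass
class TemplateStep:
    action: str
    optional: bool = False
    order_flexible: bool = False   # may occur anywhere in the sequence

def assess_against_template(observed: List[str],
                            template: List[TemplateStep]) -> List[str]:
    """Return a list of issues (missing or out-of-order steps); empty means acceptable."""
    issues: List[str] = []
    cursor = 0  # position in the observed sequence for order-fixed steps
    for step in template:
        if step.action in observed:
            idx = observed.index(step.action)
            if not step.order_flexible:
                if idx < cursor:
                    issues.append(f"'{step.action}' performed out of order")
                cursor = max(cursor, idx + 1)
        elif not step.optional:
            issues.append(f"required step '{step.action}' missing")
    return issues

# Example: a maintenance procedure where a log entry may happen at any point.
template = [TemplateStep("power_down"),
            TemplateStep("log_entry", optional=True, order_flexible=True),
            TemplateStep("replace_card"),
            TemplateStep("power_up")]
print(assess_against_template(["power_down", "replace_card", "power_up"], template))  # []

Coaching hooks would hang off the same structure, e.g., offering the next unmet required step as a hint when the student stalls.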

3.3 Tutor Adaptations

Across our many ITSs we have built versions of all the kinds of tutor adaptation listed in Sect. 2. Instruction Selection and Remedial Instruction are probably the two simplest techniques in our work. That is because instruction is generally taken to be some kind of mostly opaque multimedia, often prepared and possibly delivered using external COTS tools such as PowerPoint, Articulate, or Flash. In our systems, such instruction packets are linked to curriculum nodes and may be further tagged in other ways—such as being particularly appropriate for initial or secondary exposure. Instruction or remediation is chosen based on student curriculum mastery estimates, subject to tunable heuristic rules regarding issues such as score thresholds for offering remediation and prerequisite structures for introducing new topics.

Exercise Selection is generally also driven by links to curriculum nodes and student mastery estimates. However, as noted in connection with TAO-ITS, since we control the exercises more completely, we can sometimes use exercise configuration or real-time exercise steering as alternatives to selecting an entirely new exercise.

During-exercise coaching includes Prompting, Hinting, and Immediate Feedback. Of these, prompting—taken to mean proactive tutor suggestions—is probably used least often. While most of our expert modeling approaches can determine some reasonable next action at most moments, it is harder to judge when it would be productive to break in on a student and suggest such a move. METTLE does this using author-settable timeouts that start running when some conditions are first met. Tactical decision-making tutors such as InGEAR and C2 V-ITS monitor the passage of time as part of the tactical situation, such that the absence of student action can be a trigger for proactive suggestions.


More often, the tutor's knowledge of useful next steps is used to drive a series of hint offerings. The student must explicitly request such hints (though the tutor may flag their availability). A common practice is to offer a progression of hints going from general to more specific—e.g., what to consider, what to do, how to do it, why to do it. Immediate feedback is typically tied to assessment logic, though again with additional control annotations and/or heuristics to determine which assessments are worth commenting on in the middle of the exercise flow. It is usually judged more important to provide negative feedback aimed at immediately correcting a student mistake—whether that be for pedagogical purposes or to ease the system's student-tracking chores—than to provide positive feedback reinforcing correct decisions. All of these interventions are generally under author control, since instructors may have opinions about what kinds of interventions to include. For instance, some military instructors may be less concerned about student motivation and hence de-emphasize positive feedback. All of these interventions can potentially be combined with fading, wherein control logic takes into account the student's assessed mastery level on relevant curriculum points when deciding whether or not to offer a prompt, hint, or feedback. Most of our systems focus exclusively on adapting based on mastery assessments. Again, military training addresses students who have been pre-selected for certain attributes and personality traits, so adaptation to factors beyond performance, such as motivation, may not be as interesting in that context.

After-Action Review is provided by most of our ITSs. In its simplest form, it is a comprehensive collection of exercise assessments and paired feedback, like a report card, organized by a scheme such as exercise chronology or curriculum hierarchy. Links to chosen remediations are typically embedded with the (negative) assessments. But assembling a truly effective AAR can require substantial adaptation, as illustrated by AAIRS. In a tactical trainer, making a point clearly can involve constructing an adaptive presentation that selects particular events and then uses overlay graphics, vantage points, and filtering to highlight important information. Also, with team trainers, part of the challenge is to illustrate team dynamics and focus on individuals or sub-teams and their roles in the problem. If the tutor is not smart about how the AAR is assembled and just presents a (non-tailored) playback to the training audience for each training point, the benefits of individualized assessment may largely be lost.

Finally, Socratic Dialogs are a kind of extended structured interaction that can be used either as an element within AAR (as in ITADS), as an occasional reflective interlude (as in METTLE), or as the dominant format for an entire exercise (as in ComMentor). The surface form of a Socratic dialog is an exchange driven by tutor-generated questions. The intent of a Socratic dialog is to engage the student in guided self-explanation, and to provide relatively direct evidence to the tutor of mental processes, along with possible gaps and misconceptions. Thus, a Socratic dialog has both a teaching purpose and an assessment purpose. Our dominant approach—derived from ComMentor—uses a tree-structured argument script as the backbone for the interaction.
Tutor questions aim to elicit key statements about the situation and/or proposed solution from the student. Any points the student misses or only partially states can be revisited through a combination of repeated focused questions, drill-down to pieces of an extended line of argumentation intended to build up to the missing insight, and/or tutor summary or recapitulation of the argument and supported point.
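
The sketch below illustrates two of the coaching behaviors described earlier in this subsection: a general-to-specific hint ladder offered on explicit student request, and mastery-based fading that suppresses proactive coaching once the student's assessed mastery passes a threshold. The skill name, hint wording, and threshold value are hypothetical; this is an illustration of the ideas, not code from any of the systems named above.

from typing import Dict, List, Optional

# Hint ladders keyed by curriculum skill, ordered from general to specific.
HINT_LADDERS: Dict[str, List[str]] = {
    "classify_contact": [
        "Consider what information you still need about the new contact.",          # what to consider
        "Compare the contact's speed and heading against known commercial routes.",  # what to do
        "Open the track profile panel and match its emissions to the threat library.",  # how to do it
    ],
}

def next_hint(skill: str, hints_given: int) -> Optional[str]:
    """Return the next, more specific hint in the ladder, if any remain."""
    ladder = HINT_LADDERS.get(skill, [])
    return ladder[hints_given] if hints_given < len(ladder) else None

def should_offer_coaching(mastery: float, fade_threshold: float = 0.75) -> bool:
    """Fading: proactively flag or offer coaching only while mastery is below threshold."""
    return mastery < fade_threshold

# Example: a struggling student (mastery 0.4) asks for a second hint.
if should_offer_coaching(mastery=0.4):
    print(next_hint("classify_contact", hints_given=1))

The same mastery check can gate prompts and immediate feedback, which is why the authorable thresholds mentioned above matter to instructors.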


3.4 Supporting Tools

The need to provide Instructor Support has relatively minor impacts on adaptation. The main constraints it introduces are that (1) curriculum elements must be meaningful and the hierarchy's organization comprehensible to an instructor (both strongly preferable in any case), and (2) the connection of curriculum-node mastery assessments to exercise behavior assessments must be clearly traceable and explicable.

The need to provide Authoring Support has much more pervasive impacts on the forms of adaptation that a system can support. Generally, what is desired is a fully integrated authoring environment that supports the creation of, and linking among, curriculum, instruction, exercises, assessments, tutor interventions, etc. The greatest challenge is usually support for authoring of the Expert Model, in any of the forms surveyed in Sect. 3.2. The SimBionic toolkit provides robust support for drawing BTN logic as hierarchical flow charts. The T3 tools support authoring of more restrictive solution templates by demonstration of an activity sequence, followed, if necessary, by GUI-based editing to relax and add constraints or alternate paths. We have experimented with several schemes for authoring tree-structured Socratic dialogs, most recently embedded in ITADS and its authoring suite. Systems that exploit domain constraints typically introduce their own custom GUI editors with forms tuned to the formats of those constraints. That is true, for instance, with InGEAR and ITADS. Directly linking exercise actions as mastery evidence likewise calls for a custom editor, though a relatively simple design suffices if users only need to link exercise options to curriculum elements. Finally, scripting schemes may depend on a custom structure editor or on a textual editor and parser for a scripting language.

As suggested by this paper, there are many degrees of freedom in ITS design (all the dimensions listed in Sect. 2). We have also suggested the need to make design tradeoffs, compromises, and innovations to suit domain and project constraints (all the examples in this Sect. 3). In our experience, those constraints are often discovered incrementally by working through examples. The consequence is that authoring tools are often unavailable and/or unreliable when early content needs to be developed, because the formalism to be authored is still being invented and changed on the fly. This conundrum accentuates the value of highly polished tools that support proven families of modeling mechanisms, like SimBionic for BTNs and T3 for solution templates. We have explored other schemes to quickly bootstrap flexible, reliable, and helpful authoring tools. These include (1) mapping model structures into tabular formats and using COTS spreadsheet or database applications that support constrained input (such as Excel or Access); (2) mapping model structures into textual formats that are easily parsed (such as s-expressions or context-free grammars); and (3) building tools using GUI frameworks that provide rapid configuration of custom editors driven by an underlying data model (such as the Eclipse Rich Client Platform and its Modeling Framework).
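
As one small example of the first bootstrapping scheme (mapping model structures into tabular formats), the sketch below loads curriculum-to-assessment links from a CSV file that an author could maintain in an ordinary spreadsheet. The column names and file layout are invented for illustration and do not correspond to any of our actual tools.

import csv
from collections import defaultdict
from typing import Dict, List

def load_curriculum_links(path: str) -> Dict[str, List[dict]]:
    """Read rows of (curriculum_node, exercise, assessment, weight) into a lookup
    from curriculum node to the assessments that provide mastery evidence for it."""
    links: Dict[str, List[dict]] = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            links[row["curriculum_node"]].append({
                "exercise": row["exercise"],
                "assessment": row["assessment"],
                "weight": float(row["weight"]),
            })
    return dict(links)

# Expected spreadsheet columns (illustrative):
# curriculum_node,exercise,assessment,weight
# identify_threat,scenario_03,issued_warning_in_time,1.0

Because the table lives in a spreadsheet, instructors can extend it without programmer support, which is precisely the appeal of this bootstrapping route.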


4 Towards an Ontology for AIS Design

AISs can be characterized in many ways—e.g., in terms of their domains of applicability, pedagogical commitments, styles of interaction, forms of modeling, bases for adaptation, ways of adapting, and so on. Our focus in this paper has been to point out how the combination of domain and project constraints can have pervasive impacts on the kinds of interaction, modeling, and adaptation that may be needed or possible, and, in turn, on the mechanisms that will work to provide those target capabilities. Accordingly, when we think about carving up the space of AISs—especially as we look to identify and build reusable tools—we tend to think first in terms of how these domain and project constraints point towards families of mutually compatible interaction, modeling, and adaptation mechanisms. This is reminiscent of the points made in Bell [33] when he focused his breakdown of the AIS space around the questions: “how do I build one?” and “what’s hard about that?” Our stance also shares features and some lineage with the approach taken within the Goal-Based Scenarios research program [34], where specialized authoring tools and runtimes were developed to support particular styles of interaction—in that case, centered on a major learner activity—and to embed guidance on how to construct pedagogically effective content.

The authors of this paper come to AIS design primarily with backgrounds in symbolic Artificial Intelligence. So, when it comes to developing ontologies to characterize AISs, we would naturally develop fine-grained breakdowns along any or all of the lines suggested above—domain attributes, pedagogical mechanisms, interaction styles, modeling approaches, adaptation drivers, and adaptation mechanisms. Each point in such an ontology space could suggest a potentially recurring situation or need or capability, and hence a possibly reusable technology or method or module. However, as illustrated in this paper, there will necessarily be dependencies across those different dimensions. Not every possible combination is likely to make sense, or be realizable, or fit with commonly recurring project constraints. We also suspect that most of these dimensions are somewhat open-ended or are at least likely to see further growth for some time. Thus, we expect there could be a large number of reusable components, and that those components will vary in how commonly applicable and freely combinable they are.

In consequence, we expect substantial advantage to pre-packaging consistent combinations of components, together tuned to address a range of recurring needs. Even if one overarching framework (say GIFT [35, 36]) could encompass configuration of all such modules, implementation effort would be reduced by re-using pre-configured packages that (nearly) address current needs. For example, a package containing expert model mechanisms tailored to troubleshooting domains could be constructed in a form that can be instantiated for future training applications involving troubleshooting skills. Further, some of the earlier work cited above notes that effectively exploiting the capabilities of reusable AIS modules or module-constellations remains challenging. That is, pedagogical effectiveness will ultimately depend on how well the chosen pieces are used, so embedding authoring guidance is essential [33, 34].


Therefore, we would propose that in addition to fine-grained dimension-specific ontologies, AIS standardization would benefit from development of a higher-level catalog of AIS applications, characterized in terms of their use of the lower-level pieces. Such a catalog would be open-ended (non-exhaustive) and would need to allow for partial overlaps (non-exclusive). It would likely lead to introduction of additional lower-level taxonomies to characterize the roles and relationships of the primary components, as well as the applicability conditions of the modules and constellations. Obviously, this implies a substantial community effort. Our range of experiences with coached practice for complex decision-making skills offers suggestive guidance, but substantial and ongoing input from the larger ITS and AIS community will be required to build and maintain a framework that covers the full range of potential AIS requirements.

References

1. Anderson, J.R., Boyle, C.F., Yost, G.: The geometry tutor. In: IJCAI, pp. 1–7, August 1985
2. Anderson, J.R., Corbett, A.T., Koedinger, K.R., Pelletier, R.: Cognitive tutors: lessons learned. J. Learn. Sci. 4(2), 167–207 (1995)
3. Vanlehn, K., et al.: The Andes physics tutoring system: lessons learned. Int. J. Artif. Intell. Educ. 15(3), 147–204 (2005)
4. Jonassen, D.H.: Instructional design models for well-structured and ill-structured problem-solving learning outcomes. Education Tech. Research Dev. 45(1), 65–94 (1997)
5. Lynch, C., Ashley, K., Aleven, V., Pinkwart, N.: Defining ill-defined domains; a literature survey. In: Proceedings of the Workshop on Intelligent Tutoring Systems for Ill-Defined Domains at the 8th International Conference on Intelligent Tutoring Systems, pp. 1–10, June 2006
6. Fink, C.D., Shriver, E.L.: Simulators for maintenance training: some issues, problems and areas for future research. Kinton Inc., Alexandria, VA (1978)
7. Andrews, D.H., Carroll, L.A., Bell, H.H.: The future of selective fidelity in training devices. Educ. Technol. 35(6), 32–36 (1995)
8. Stottler, R.H., Richards, R., Spaulding, B.: Use cases, requirements and a prototype standard for an intelligent tutoring system (ITS)/simulation interoperability standard (I/SIS). In: Proceedings of the SISO 2005 Spring Simulation Interoperability Workshop, pp. 3–8, April 2005
9. Murray, T.: An overview of intelligent tutoring system authoring tools: updated analysis of the state of the art. In: Murray, T., Blessing, S.B., Ainsworth, S. (eds.) Authoring Tools for Advanced Technology Learning Environments, pp. 491–544. Springer, Dordrecht (2003). https://doi.org/10.1007/978-94-017-0819-7_17
10. Sottilare, R., Graesser, A., Hu, X., Brawner, K. (eds.): Design Recommendations for Intelligent Tutoring Systems: Authoring Tools and Expert Modeling Techniques (2015)
11. Andrews, D.H., Windmueller, H.W.: Lock-step vs. free-play maintenance training devices: definitions and issues. Educ. Technol. 26(7), 29–33 (1986)
12. Papert, S.: Mindstorms: Children, Computers, and Powerful Ideas. Basic Books, New York (1980)
13. Lawler, R.: Designing computer based microworlds. Byte 7, 138–146 (1982)


14. Stottler, R., Harmon, N.: Transitioning an ITS developed for schoolhouse use to the fleet: TAO ITS, a case study. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2001) (2001)
15. Stottler, R., Davis, A., Panichas, S., Treadwell, M.: Designing and implementing intelligent tutoring instruction for tactical action officers. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2007) (2007)
16. Houlette, R., Fu, D., Jensen, R.: A visual environment for rapid behavior definition. In: Proceedings of the 2003 Conference on Behavior Representation in Modeling and Simulation, Scottsdale, AZ (2003)
17. Presnell, B., Houlette, R., Fu, D.: Making behavior modeling accessible to nonprogrammers: challenges and solutions. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2007) (2007)
18. Ong, J., Noneman, S.: Intelligent tutoring systems for procedural task training of remote payload operations at NASA. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2000) (2000)
19. Mohammed, J., Ong, J., Li, J., Sorensen, H.: Rapid development of scenario-based simulations and tutoring systems. In: AIAA Modeling and Simulation Technologies Conference and Exhibit, p. 6419 (2005)
20. Jensen, R., Lunsford, J., Presnell, B., Cobb, M.G., Kidd, D.: Generalizing automated assessment of small unit tactical decision making. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2013) (2013)
21. Domeshek, E., Holman, E., Ross, K.: Automated Socratic tutors for high-level command skills. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2002) (2002)
22. Domeshek, E., Holman, E., Luperfoy, S.: Discussion control in an automated Socratic tutor. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2004) (2004)
23. Ramachandran, S., Remolina, E., Barksdale, C.: Scenario-based multi-level learning for counterterrorism intelligence analysis. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2006) (2006)
24. AAITS description. https://www.stottlerhenke.com/solutions/education-and-training/aaitsteaches-undersea-acoustic-analysis-to-navy-sonar-technicians/. Accessed 1 Apr 2019
25. Domeshek, E.: Scenario-based conversational intelligent tutoring systems for decision-making skills. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2009) (2009)
26. Stottler, R.H., Pike, B., Bingham, R., Jensen, R.: Adding an intelligent tutoring system to an existing tactical training simulation. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2002) (2002)
27. Ramachandran, S., Domeshek, E., Jensen, R., Aukamp, A.: Uncovering the hidden: tradeoffs in rationale elicitation for situated tutors. In: Proceedings of the Interservice/Industry Training, Simulation & Education Conference (I/ITSEC 2016) (2016)
28. Ramachandran, S., Jensen, R., Ludwig, J., Domeshek, E., Haines, T.: ITADS: a real-world intelligent tutor to train troubleshooting skills. In: Penstein Rosé, C., Martínez-Maldonado, R., Hoppe, H.U., Luckin, R., Mavrikis, M., Porayska-Pomsta, K., McLaren, B., du Boulay, B. (eds.) AIED 2018. LNCS (LNAI), vol. 10948, pp. 463–468. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93846-2_87
29. Jensen, R., Marshall, H., Stahl, J., Stottler, R.: An intelligent tutoring system (ITS) for future combat systems (FCS) robotic vehicle command. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2003) (2003)


30. Jensen, R., Mosley, J., Sanders, M., Sims, J.: Intelligent tutoring methods for optimizing learning outcomes with embedded training. In: Proceedings of the NATO Workshop on Human Dimensions in Embedded Virtual Simulation (NATO HFM-169) (2009)
31. Jensen, R., Nolan, M., Chen, D.Y.: Automatic causal explanation analysis for combined arms training AAR. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2005) (2005)
32. Jensen, R.: Assessing perceived truth versus ground truth in after action review. In: Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2009) (2009)
33. Bell, B.: One-size-fits-some: ITS genres and what they (should) tell us about authoring tools. Des. Recommendations Intell. Tutoring Syst. 3, 31–45 (2015)
34. Jona, M., Kass, A.: A fully-integrated approach to authoring learning environments: case studies and lessons learned. In: The Collected Papers from the AAAI-97 Fall Symposium Workshop on Intelligent Tutoring System Authoring Tools. AAAI Press (1997)
35. Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The generalized intelligent framework for tutoring (GIFT). US Army Research Laboratory–Human Research & Engineering Directorate (ARL-HRED), Orlando, FL (2012)
36. Sottilare, R.A., Brawner, K.W., Sinatra, A.M., Johnston, J.H.: An updated concept for a generalized intelligent framework for tutoring (GIFT). GIFTtutoring.org (2017)
37. Woolf, B.P.: Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing E-Learning. Morgan Kaufmann, Boston (2010)

Fundamentals, Flavors, and Foibles of Adaptive Instructional Systems

Paula J. Durlach

U. S. Army DEVCOM Soldier Center, Orlando, FL 32826, USA
[email protected]

Abstract. Adaptive instructional interventions have traditionally been provided by a human tutor, mentor, or coach; but, with the development and increasing accessibility of digital technology, technology-based methods of creating adaptive instruction have become more and more prevalent. The challenge is to capture in technology that which makes individualized instruction so effective. This paper will discuss the fundamentals, flavors, and foibles of adaptive instructional systems (AIS). The section on fundamentals covers what all AIS have in common. The section on flavors addresses variations in how different AIS have implemented the fundamentals, and reviews different ways AIS have been described and classified. The final section on foibles discusses whether AIS have met the challenge of improving learning outcomes. There is a tendency among creators and marketers to assume that AIS—by definition—support better learning outcomes than non-adaptive technology-based instructional systems. In fact, the evidence for this is rather sparse. The section will discuss why this might be, and potential methods to increase AIS efficacy.

Keywords: Adaptation · Assessment · Learning

1 Introduction

Park and Lee (2004) defined adaptive instruction as educational interventions aimed at effectively accommodating individual differences in students while helping each student develop the knowledge and skills required to perform a task. Traditionally, this type of customized instruction has been provided by a human tutor, mentor, or coach; but, with the development and increasing accessibility of digital technology, technology-based methods of creating adaptive instruction have become more and more prevalent. Shute and Psotka (1994) point out that availability of computers began occurring at about the same time that educational research was also extolling the benefits of individualized learning; so, it was natural that harnessing computers was seen as a practical way to increase individualized instruction for more students. From the beginning, the challenge has been: how to capture in technology that which makes individualized instruction so effective. This paper will discuss the fundamentals, flavors, and foibles of adaptive instructional systems (AIS). The section on fundamentals covers what all AIS have in common. The section on flavors addresses variations in how different AIS have implemented the fundamentals, and reviews different ways AIS have been described and classified. The final section on foibles discusses whether AIS


have met the challenge of improving learning outcomes. There is a tendency among creators and marketers to assume that AIS—by definition—support better learning outcomes than non-adaptive technology-based instructional systems. In fact, the evidence for this is rather sparse. The section will discuss why this might be, and potential methods to increase AIS efficacy.

2 Fundamentals of AIS

All AIS have three things in common: (1) automated measurement of the learner’s behavior while using the system; (2) analysis of that behavior to create an abstracted representation of the learner’s level of competency on the knowledge or skills to be learned; and (3) use of that representation (the student model) to determine how to adapt the learner’s experience while engaged with the system. The first two have to do with assessing learner competency, while the third is concerned with how to ameliorate competency weaknesses. The first two together constitute formative assessment, while the third is the adaptive intervention.
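As a minimal sketch of these three components, the toy loop below measures responses, folds them into a per-skill student model, and picks the next activity. The class name, the running-average update, and the 0.8 mastery threshold are illustrative assumptions, not features of any particular AIS.

```python
# Toy AIS skeleton: measure behavior, update a student model, adapt the next step.
# The update rule and the 0.8 mastery threshold are arbitrary illustrative choices.

class ToyAIS:
    def __init__(self, skills):
        # Student model: estimated competency per skill, initially uncertain (0.5).
        self.student_model = {skill: 0.5 for skill in skills}

    def observe(self, skill, correct):
        """(1) Measurement and (2) analysis: fold one observation into the model."""
        old = self.student_model[skill]
        self.student_model[skill] = 0.7 * old + 0.3 * (1.0 if correct else 0.0)

    def next_activity(self):
        """(3) Adaptation: practice the weakest skill until all reach mastery."""
        skill, estimate = min(self.student_model.items(), key=lambda kv: kv[1])
        return None if estimate >= 0.8 else f"practice item targeting {skill}"

ais = ToyAIS(["fractions", "decimals"])
ais.observe("fractions", correct=False)
ais.observe("decimals", correct=True)
print(ais.next_activity())   # -> practice item targeting fractions
```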

2.1 Formative Assessment in AIS

Formative assessments in AIS can be thought of as experiments about learners’ knowledge. The AIS user experience is designed so as to test theories about that knowledge. The theories predict how a learner should behave if they possess such and such knowledge, given the stimuli and behavioral affordances of the learning environment. Consider a multiple choice question. Inferences about knowledge based on the student’s selection depend on an implicit theory, though it is rarely explicitly stated: if the student understands the content, they will choose the right answer. But competing theories are possible; e.g., perhaps the student selected the correct answer by chance. Because both theories make the same prediction, it might be necessary to assess the knowledge multiple times and in various ways, until enough evidence has accumulated to make a competency inference with more certainty. Questions can provide more fine-grained information about learner competency by wisely selecting the foils to detect common misconceptions, or by varying the difficulty of the question, based on normative data. This allows a deeper understanding of the student’s competency vs. just a binary rating. Other methods besides direct questions, of course, can be used to assess student knowledge. Learners can be asked to create artifacts, or perform activities. “Stealth assessment” refers to assessing student behavior during the performance of an activity, such that assessment occurs without any explicit question-asking (Shute et al. 2016). The process of designing assessments has been formalized in various ways (Shute et al. 2016). Here I will focus on the Evidence-Centered Design (ECD) framework (e.g., Behrens et al. 2010; Mislevy et al. 2001; Shute et al. 2008). The first step in designing an assessment is to determine what mixture of knowledge, skills, and other factors should be assessed. An explicit representation of this in ECD is the competency model (also referred to as a domain model or an expert model in other contexts). For any particular domain (e.g., troubleshooting a washing machine) the competency


model organizes the things people need to know (e.g., the pump won’t work if the lid is open), things they need to do (check there is power to the pump), and the conditions under which they have to perform (e.g., determine why the pump failed to run in the spin cycle). Collecting and organizing the domain information into a competency model may be one of the most challenging steps in creating an AIS, particularly for domains in which there are multiple ways of solving problems, and one best solution might not exist. Traditionally, the domain knowledge must be captured from experts, and experts don’t always agree or have explicit access to the knowledge they use. Even for well-defined domains, like algebra, consulting experts is needed in order to capture instructional knowledge (e.g., common mistakes and various teaching methods), which are not represented in text books (Woolf and Cunningham 1987). This knowledge may be used in designing the AIS content and interventions, even if not represented in the competency model. Making the competency model explicit is not required for assessment, nor for an AIS; but it is recommended, especially for complex knowledge structures and/or stealth assessment. For relatively simple learning, the domain model may remain in the designer’s head and only be implicitly represented in the content presented to the student. An AIS designed for learning by rote (e.g., learning the times tables or vocabulary) is an example. An adaptive intervention for these types of AIS is to give each student practice on each association in inverse proportion to their ability to recall it (so for each student, their own pattern of errors determines how frequently specific association probes occur). The domain model may be represented only implicitly by the associations chosen for the student to learn. Hypotheses about the state of the learner’s ability with respect to the elements of the competency model are represented in the student model (or learner model or competency profile in other contexts), by assigning values against each ability. The student model may be a single variable reflecting an overall proficiency, or multiple variables representing finer grains of knowledge, and arranged in a network hierarchy with required prerequisite knowledge also represented. Typically the complexity of the competency model is reflected in the student model. When the competency model is not even explicit, the student model tends to be relatively simple, such as a list of values (e.g., percent correct on each cell of the times tables). For more complex competency models, the student model may be represented computationally, for example as a Bayesian network. A Bayesian network provides the ability to make predictions about how students might respond to a new problem, based on the current student model values. Regardless of the specific model, student model values are updated when new relevant information becomes available. That information comes from the student’s performance on tasks specifically designed for that purpose. In ECD, these tasks are the task model. The task model consists of the learning environment the learner experiences, materials presented, actions that the student can take, and products they may produce. It specifies what will be measured and the conditions under which they will be measured. 
Elements in the task model are linked to elements in the competency model by the evidence model, so as to specify how student actions under different conditions indicate evidence with respect to different competencies. This requires a two-step process. First is evaluating the appropriateness or correctness of a student behavior or product (the evaluation step). This might be something like: correct


vs. incorrect, or low, medium, high. Second is turning these evaluations into student model values. ECD calls this the statistical step. The statistical step turns the evaluations into beliefs (or likelihoods) of student model values. In other words, if the student did x under conditions a and e, the probability is .68 that they understand the relation between the switch activated by lid closure and the pump (in keeping with the washing machine troubleshooting example). The actual probability to assign might be determined using expert input, approximations based on task features, a fitted model, or cognitive task analyses of students at known levels of mastery. The student model values are made by inferences based on evidence collected; therefore, the assessment designer must set rules for how to collect the evidence and when enough evidence has been amassed. In ECD these rules are called the assembly model. The assembly model controls the sequence of tasks and the rules for ending the assessment.

When the sole purpose of the assessment is to gauge aptitude (summative assessment), a rule for ending the assessment may be determined by time to complete. A traditional aptitude test allowing a specific amount of time and using paper and pencil is an example of a fixed summative assessment. All the students have the same questions and are given the same amount of time. In contrast, an adaptive summative assessment is possible. It changes the sequence of tasks for each student, depending on the values in the student model, in order to collect evidence as efficiently as possible. Therefore different students may have a different sequence of questions and be allowed to work for different amounts of time. The Graduate Management Admission Test (GMAT) is an example of an adaptive summative assessment. In AIS, the purpose of assessment is not to rank students, but rather to diagnose student misconceptions and to identify where special attention might be needed. Stopping rules therefore may be based on some sort of threshold (e.g., student model value of .90) or criterion (e.g., correctly diagnosing why the pump won’t run in less than three minutes), indicating that mastery of the competency has been achieved, and no further practice is required. Adaptive sequencing rules in AIS are intended to give the student the next “best” learning experience to advance their progress; they are not part of the assessment, but rather one of several possible adaptive interventions.
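A minimal sketch of the statistical step and an assembly-model-style stopping rule is shown below. The slip and guess likelihoods and the 0.90 threshold are assumed values chosen for illustration; ECD does not prescribe this particular Bayesian update.

```python
# Sketch of the "statistical step" and a mastery stopping rule.
# Slip/guess likelihoods and the 0.90 threshold are illustrative assumptions.

P_SLIP = 0.10    # P(incorrect | student has the competency)
P_GUESS = 0.25   # P(correct | student lacks the competency)
MASTERY_THRESHOLD = 0.90

def update_belief(p_mastery, correct):
    """Bayes update of P(competency) given one evaluated observation."""
    if correct:
        likely_if_mastered = 1.0 - P_SLIP
        likely_if_not = P_GUESS
    else:
        likely_if_mastered = P_SLIP
        likely_if_not = 1.0 - P_GUESS
    numerator = likely_if_mastered * p_mastery
    return numerator / (numerator + likely_if_not * (1.0 - p_mastery))

def assessment_finished(p_mastery):
    """Assembly-model style stopping rule: stop once belief crosses the threshold."""
    return p_mastery >= MASTERY_THRESHOLD

belief = 0.5
for outcome in [True, True, False, True, True]:
    belief = update_belief(belief, outcome)
    print(f"belief = {belief:.2f}, done = {assessment_finished(belief)}")
```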

2.2 Adaptive Interventions in AIS

Once an AIS has an abstracted representation of the learner’s competency profile with respect to particular learning objectives, how can an AIS adapt to the individual? To answer this, several researchers have analyzed tutor-learner interactions to extract information about what occurs during one-to-one human tutor-tutee sessions. Much of this research is summarized in VanLehn (2011; see also Lepper and Woolverton 2002). Other sources of adaptive interventions have been educational research (e.g., the mastery learning technique) and psychological research (e.g., spacing). Bloom’s research (1976, 1984), as well as Western interpretations of Vygotsky (1978), have had a large influence on instructional interventions in AIS, particularly the use of the mastery learning technique. Both essentially suggested that students should progress at their own rate. Tasks should not be too hard, so as to be frustrating, or too easy, so as to be boring. Learners should be given assistance with new tasks or concepts until they have been mastered. Formative assessments should be used to determine where a learner


needs help and when they are ready to move to a higher level of challenge. Help can be given in a number of ways, and this can also be adapted to the learner. Different learners may need different forms of help, so different methods for remediating weaknesses should be available (e.g., hints, timely feedback, alternative content and media, examples, self-explanation, encouragement, etc.). As far as implementing these recommendations in technology, almost all AIS have incorporated the mastery learning technique, and provide learners with feedback on their performance. Feedback is known to be critical to learning; however, whether it is considered adaptive is kind of a grey area. Positive feedback can be motivating and reduce uncertainty. Sadler (1989) suggested that one critical role of feedback is to support the student in comparing his or her own performance with what good performance looks like, and to enable students to use this information to close that gap. In general, the more fine-grained the feedback, the more likely it is to fit these requirements. Durlach and Spain (2014) suggested that feedback can be at different levels of adaptivity depending on how helpful it is in allowing the learner to self-correct, or how it is used as a motivational prop.

Another adaptive intervention that appears frequently in AIS stems from the psychological literature on the scheduling of practice “trials.” It is well established from learning science that retention is better when learning experiences are spaced over time compared to when they are massed (Brown et al. 2014). Repeated recall of information just on the edge of forgetting seems to solidify it better in memory compared to when it is recalled from relatively short-term memory (which occurs when repetitions are close together). But people learn and forget at different rates for different items, so some AIS, particularly those aimed at learning associations (like the times tables or foreign language vocabulary), have devised adaptive spacing algorithms such that the sequencing of trials is personalized to the learner based on their accuracy for each association. So if you are learning French vocabulary, once you get the chien-dog association correct several times, you will be tested for it less and less frequently. These AIS are sometimes referred to as electronic flashcards. Although this “drill and practice” method of learning is sometimes looked down on as primitive, Kellman et al. (2010) have argued that it is a good way of learning structural invariance across otherwise variable cases, which is required for expertise. They argue that it can be applied to conceptual materials to increase fluency (greater automaticity and lesser effort), a hallmark of expertise; Kellman and colleagues have demonstrated this for several mathematical concepts (Kellman and Massey 2013). Note that adaptive spacing can be applied within a learning session or across days, weeks, months, etc. Bringing back items from an earlier topic at the right time may be particularly useful when the goal is to discriminate when to apply certain methods of problem solving vs. executing the method itself (Pan 2015; Rohrer 2009).
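The sketch below shows one way an electronic-flashcard scheduler of this kind might work: items answered correctly are pushed out at expanding intervals, while missed items come back soon. The doubling rule and starting interval are arbitrary illustrative choices, not the algorithm of any specific AIS.

```python
# Leitner-style flashcard scheduler sketch: correct items are retested at
# expanding intervals; misses reset an item to frequent review.
# The doubling rule and starting interval of 1 trial are arbitrary choices.

import heapq

class SpacedDeck:
    def __init__(self, items):
        self.clock = 0
        self.queue = [(0, item) for item in items]   # (due_trial, item)
        heapq.heapify(self.queue)
        self.interval = {item: 1 for item in items}

    def next_item(self):
        due, item = heapq.heappop(self.queue)
        self.clock = max(self.clock + 1, due)        # each presentation is one trial
        return item

    def record(self, item, correct):
        if correct:
            self.interval[item] *= 2                 # answered correctly: wait longer
        else:
            self.interval[item] = 1                  # missed: bring it back soon
        heapq.heappush(self.queue, (self.clock + self.interval[item], item))

deck = SpacedDeck(["chien-dog", "chat-cat", "poisson-fish"])
for _ in range(6):
    card = deck.next_item()
    deck.record(card, correct=(card != "poisson-fish"))  # pretend one item is hard
    print(card, deck.interval[card])
```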

2.3 PLATO (Programmed Logic for Automatic Teaching Operation)

To illustrate the fundamentals of AIS, the workings of PLATO, one of the first AIS, is summarized (Bitzer et al. 1965). PLATO was developed to explore the possibilities of using computers to support individualized instruction. Several variations were created, but only the simplest of the “tutorial logics” will be outlined here (see Fig. 1). This logic led a student through a fixed sequence of topics, by presenting facts and examples


and then asking questions covering the material presented (the Main Sequence in Fig. 1). PLATO responded with “OK” or “NO” to each answer. Students could re-answer as many times as they liked, but could not move on to new content until they got an “OK.” If required, the student could ask for help by pressing a help button. This took them off to a branch of help material (Help Sequence in Fig. 1), which also had questions and additional branches as needed. After completing each help branch, or short-circuiting the help sequence by selecting an “AHA!” button, the student returned to the original question and had to answer it again. Each question had its own help branch, and in some versions different help branches for different wrong answers. If a student used all the help available, and could still not answer the original question, then PLATO supplied the correct answer. Teachers could create the content and questions on the computer itself, and could also examine stored student records. Records could also be aggregated for different types of analysis. One thing to notice about PLATO is that it is the type of system where the competency model is not explicit. The competency model is only implicit in the content and questions that are included by the teacher/author. Another thing to notice is that the student model is quite simple. It is made up of the correct and incorrect answers chosen by the student. The computer supplies the branching logic, depending on those answers. Although created in the 1960s, PLATO’s outward behavior is quite similar to that of many AIS in use today.

Fig. 1. Illustration of sequencing in PLATO. Students could request help for each question. The depth of help branches was determined by the content author.
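A rough sketch of the tutorial logic in Fig. 1 appears below, with invented questions and hints. It is only an approximation of the described behavior (OK/NO responses, per-question help branches, the AHA! short-circuit, and supplying the answer once help is exhausted), not PLATO's actual TUTOR implementation.

```python
# PLATO-style "tutorial logic" sketch: a fixed main sequence, OK/NO feedback,
# an optional help branch per question, and the answer supplied once all help
# has been used. All content and prompts are invented for illustration.

def run_question(question, answer, help_branch):
    while True:
        reply = input(f"{question} ")
        if reply.strip().lower() == answer:
            print("OK")
            return
        print("NO")
        if help_branch:
            if input("Need HELP? (y/n) ") == "y":
                for hint in help_branch:
                    print("HELP:", hint)
                    if input("AHA!? (y/n) ") == "y":
                        break                  # AHA! short-circuits back to the question
                help_branch = []               # this question's help is now used up
        else:
            print("The answer is:", answer)    # all help exhausted; supply the answer
            return

MAIN_SEQUENCE = [
    ("2 + 2 = ?", "4", ["Count two objects, then two more."]),
    ("Capital of France?", "paris", ["It is on the Seine.", "It starts with P."]),
]

if __name__ == "__main__":
    for question, answer, help_branch in MAIN_SEQUENCE:
        run_question(question, answer, list(help_branch))
```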


Advances in computing power have allowed today’s AIS to be more sophisticated, by using explicit competency models and artificial intelligence; however, it is not entirely clear whether AIS in 2019 are any more effective.

3 Flavors of AIS

Like ice cream, AIS come in many different flavors and qualities. You can get basic store-brand vanilla. You can get artisanal Salt & Straw’s Apple Brandy & Pecan Pie©; or you can get up-market varieties in stores (e.g., Chunky Monkey®). PLATO is kind of like a store brand. It provides different paths through material and problems, with each decision based on only an isolated student action. The equipment is the same for each domain; only the ingredients (the content) differ. More complicated machinery (artificial intelligence) or customization (hand crafting) are required for more up-market brands. The analogy for “Artisanal” consists of AIS from academic researchers. These typically have been hand-crafted, use computational models, and are experimented with and perfected over years. Such AIS are typically referred to as Intelligent Tutoring Systems (ITS). ITS tend to provide coached practice on well-defined tasks requiring multiple steps to reach a solution (such as solving linear equations), although one particular type—constraint-based reasoning ITS (e.g., Mitrovic 2012)—has had success with more open-ended domains such as database design. ITS are domain-specific, and cannot be repurposed easily for a different domain. Defining what makes an ITS “intelligent” has never been clearly agreed upon. Some researchers suggest it is only the appearance of intelligence, regardless of the underlying engineering, while others require explicit competency, student, and/or learner models (Shute and Psotka 1994).

Most AIS use mastery learning and provide feedback; but other ingredients can be added: on-demand help, dashboards showing progress, interactive dialogue, simulations, and gamification, to name a few. Some AIS also use student models that include student traits (e.g., cognitive style) or states (frustration, boredom), in addition to the student’s competency profile. The states or traits are used to influence adaptive interventions such as altering feedback or content (e.g., D’Mello et al. 2014; Tseng et al. 2008). Durlach and Ray (2011) reviewed the research literature on AIS and identified several adaptive interventions for which one or more well-controlled experiments showed benefits compared to a non-adaptive parallel version. These were: (1) Mastery learning, which has already been explained; (2) Adaptive spacing and repetition, which also has already been explained; (3) Error-sensitive feedback–feedback based on the error committed; (4) Self-correction, in which students find errors in their own or provided example problems; (5) Fading of worked examples (here students are first shown how to solve a problem or conduct a procedure; they are not asked to actually solve steps until they can explain the steps in the examples); (6) Metacognitive prompting—prompting students to self-explain and paraphrase. Because there are many ways that AIS can differ, different ways to classify the various types have been suggested.


4 Macro and Micro Adaptation

Shute (1993) and Park and Lee (2004) describe adaptive interventions as being either macro-adaptive or micro-adaptive. A single AIS can employ both. Macro-adaptation uses pre-task measures or historical data (e.g., grades) to adapt content before the instructional experience begins. Two types of macro-adaptive approaches are Adaptation-as-Preference and Mastery. With Adaptation-as-Preference, learner preferences are collected before training and this information is used to provide personalized training content. For example, it may determine whether a student watches a video or reads; or whether examples are given with surface features about sports, business, or the military. With Mastery macro-adaptation, a pretest determines the starting point of instruction, and subsequently already-mastered material, as determined by the pretest, may be skipped. For both these macro-adaptations, the customization is made at the beginning of a learning session or learning topic. In contrast, micro-adaptive interventions respond to learners in a dynamic fashion while they interact with the AIS. These systems perform on-going adaptations during the learning experience, based upon the performance of the learner or other behavioral assessment (e.g., frustration, boredom). They may use a pattern of response errors, response latencies, and/or emotional state to identify student problems or misconceptions, and make an intervention in real-time. Micro-adaptive interventions may be aimed at correcting specific errors or aimed at providing support, such as giving hints or encouragement.
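As a small illustration of Mastery macro-adaptation, the sketch below uses pretest scores to decide, before instruction starts, which topics a learner can skip. The topic names and the 0.8 cutoff are assumptions made for the example.

```python
# Sketch of "Mastery" macro-adaptation: a pretest score per topic decides, before
# instruction begins, which topics are skipped. The 0.8 cutoff is an assumed value.

COURSE = ["fractions", "decimals", "percentages", "ratios"]

def placement(pretest_scores, cutoff=0.8):
    """Return the topics the learner still needs, preserving course order."""
    return [t for t in COURSE if pretest_scores.get(t, 0.0) < cutoff]

print(placement({"fractions": 0.9, "decimals": 0.6, "percentages": 0.3}))
# -> ['decimals', 'percentages', 'ratios']
```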

4.1 Regulatory Loops

VanLehn (2006) popularized the idea of characterizing ITS behaviors in terms of embedded loops. He originally suggested that ITS can be described as consisting of an inner loop and an outer loop. The inner loop is responsible for providing within-problem guidance or feedback, based on the most recently collected input. The outer loop is responsible for deciding what task a student should do next, once a problem has been completed. In ITS, data collected during inner-loop student-system interactions are used to update the student model. Comparisons of the student model with the competency model drive the outer loop decision process, tailoring selection (or recommendation) of the next problem. In order to do this, the available problems must be linked to the particular competencies their solution draws upon. Zhang and VanLehn (2017) point out that different particular AIS use different specific algorithms for making this selection or recommendation. The complexity of the algorithm used tends to depend on how the competency model is organized, e.g., whether knowledge is arranged in prerequisite relationships, difficulty levels, and/or whether there are one-to-one or many-to-one mappings between underlying competencies and problems. There is no one algorithm, nor even a set of algorithms proven to be most effective. VanLehn (2011) introduced the interaction granularity hypothesis. According to this hypothesis, problem solving is viewed as a series of steps, and a system can be categorized by the smallest step upon which the student can obtain feedback or support. The hypothesis is that the smaller the step, the more effective the AIS should be. This is because smaller step sizes make it easier for the student (and the system) to identify and address gaps in knowledge or errors in reasoning. The granularity hypothesis is in


accord with educational research on the effects of “immediacy” (Swan 2003). Combining the notion of loops and of granularity, it seems natural to extend the original conception of inner and outer loops to a series of embedded loops in which the inner loop is the finest level of granularity. Loops can extend out beyond the next task or problem to modules, chapters, and courses. Different types of adaptations can be made to help regulate cognitive, meta-cognitive, social, and/or motivational states at different loop levels (VanLehn 2016).
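The sketch below illustrates one possible outer-loop selection rule of the kind just described: each problem is tagged with the competencies it exercises, and the next problem is the unsolved one covering the most not-yet-mastered competencies. As noted above, no single algorithm has been proven most effective; this scoring rule is purely illustrative.

```python
# Outer-loop sketch: problems are tagged with the competencies they exercise, and
# the next problem is the one that most exercises the learner's weakest skills.
# The scoring rule is just one illustrative choice among many possible algorithms.

PROBLEMS = {
    "p1": {"isolate_variable"},
    "p2": {"isolate_variable", "combine_like_terms"},
    "p3": {"combine_like_terms", "distribute"},
}

def pick_next(student_model, solved, mastery=0.9):
    """Choose the unsolved problem covering the most not-yet-mastered competencies."""
    def need(problem):
        return sum(1 for c in PROBLEMS[problem] if student_model.get(c, 0.0) < mastery)
    candidates = [p for p in PROBLEMS if p not in solved and need(p) > 0]
    return max(candidates, key=need) if candidates else None

model = {"isolate_variable": 0.95, "combine_like_terms": 0.4, "distribute": 0.2}
print(pick_next(model, solved={"p1"}))   # -> p3 (covers the two weakest skills)
```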

4.2 Levels of Adaptation

Durlach and Spain (2012, 2014) and Durlach (2014) proposed a Framework for Instructional Technology (FIT), which lays out various ways of implementing mastery learning, corrective feedback, and support using digital technology. Mastery learning and feedback have already been discussed. Support is anything that enables a learner to solve a problem or activity that would be beyond his or her unassisted efforts. In FIT, mastery learning is broken down into two separate components, micro-sequencing and macro-sequencing. Micro-sequencing applies to situations in which a given mastery criterion has yet to be met, and a system must determine what learning activity should come next to promote mastery. It can roughly be equated with remediation. Macro-sequencing applies to situations in which a mastery criterion has just been reached and a system must determine the new mastery goal–what learning activity to provide next. It can be equated with progression to a new topic or deeper level of understanding. For each of the four system behaviors (micro-sequencing, macro-sequencing, corrective feedback, and support), FIT outlines five different methods of potential implementation. These are summarized in Table 1. Except for macro-sequencing, the five methods of implementation fall along a continuum of adaptation. At the lowest level (Level 0), there is no adaptation – all students are treated the same. Each successive level is increasingly sophisticated with respect to the information used to trigger a system’s adaptive behavior. The types of intervention (macro-sequencing, micro-sequencing, feedback, and support) crossed with the levels of adaptation (0 to IV) do not fit into a neat four-by-five matrix, however. This is because the adaptation levels advance differently for the different types, based on differences in the pedagogical functions of the types. Macro-sequencing and micro-sequencing are about choosing the next learning activity. So, e.g., the adaptive levels of micro-sequencing have to do with how personalized the remedial content is, which is, in turn, based on the granularity of the student model. The more granular the student model, the more personalized the remedial content can be (though this is not explicitly acknowledged in the framework). Analogously, the macro-sequencing levels depend on the student information used to determine the content of the next topic (e.g., job role).

Clougherty and Popova (2015) also proposed levels of adaptivity for content sequencing, but they were organized as seven levels along a single dimension. The levels progress with increasingly flexible methods of “remediation” and “advancement,” where remediation is analogous to FIT’s micro-sequencing, and “advancement” is analogous to FIT’s macro-sequencing. At their first level, all students receive the same content and the only flexibility is that they can proceed at their own pace. The next level is quite similar to the sequencing outlined in Fig. 1 for PLATO. Clougherty and Popova’s top level, which they call an


adaptive curriculum, exceeds what is specified by FIT. The adaptive curriculum allows students, assisted by an AIS, to pursue personalized, interdisciplinary learning.

Table 1. Framework for Instructional Technology (FIT)

Corrective feedback
  Level 0: No item-level feedback—summary score only
  Level I: Minimal item feedback—item accuracy information
  Level II: Correct answer or explanation of correct answer
  Level III: Error-sensitive feedback—compares and contrasts correct answer to specific incorrect answer
  Level IV: Context-aware feedback—knowledge of past performance used to influence feedback message

Support
  Level 0: None
  Level I: Fixed sources of information (e.g., glossary); problem-determined hints; student initiates access
  Level II: Locally adaptive hints (answer determined): (a) on request, (b) triggered
  Level III: Context-aware hints, prompts, or pumps (faded scaffolding): (a) on request, (b) triggered
  Level IV: Level III + mixed-initiative interactive dialogue

Micro-sequencing
  Level 0: Recycling (repeat already-seen content)
  Level I: Supplemental Remediation (one remedial content set)
  Level II: Supplemental Remediation levels (multiple remedial content sets)
  Level III: Adaptive Content (remedial content composed for the individual)
  Level IV: Real-Time Adaptation (remediation via within-task adaptation)

Macro-sequencing
  Level 0: Fixed sequence—one version only
  Level I: Student choice or hybrid choice/fixed
  Level II: Test-out (placement)
  Level III: Adapted ahead (multiple versions)
  Level IV: Adapted During (multiple versions)
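As one concrete reading of the corrective-feedback rows in Table 1, the sketch below returns a Level I, II, or III message for the same wrong answer. The item, the foil-to-misconception mapping, and the messages are invented for illustration.

```python
# One concrete reading of the corrective-feedback levels in Table 1.
# The item, foil-to-misconception mapping, and messages are invented examples.

ITEM = {"question": "3/4 + 1/4 = ?", "answer": "1"}

ERROR_SENSITIVE = {            # Level III: keyed to the specific incorrect answer
    "4/8": "You added the denominators as well as the numerators; the denominator stays the same.",
    "2/4": "Check the numerators: 3 + 1 is not 2.",
}

def feedback(response, level):
    if response == ITEM["answer"]:
        return "Correct."
    if level == 1:                                   # Level I: accuracy information only
        return "Incorrect."
    if level == 2:                                   # Level II: supply the correct answer
        return f"Incorrect. The answer is {ITEM['answer']}."
    return ERROR_SENSITIVE.get(response,             # Level III: error-sensitive message
                               f"Incorrect. The answer is {ITEM['answer']}.")

print(feedback("4/8", level=3))
```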

In contrast to sequencing, the functions of feedback and support are to guide attention and to assist recall and self-correction for the current learning activity. The advancing FIT levels of adaptation for these are based on how much of the learning


context is taken into account in deciding how feedback and support are provided. Support will be used as an example. For Level 0 there is no support. Level I support is the same for all learners, and accessed on learner initiative. This includes support like glossaries and hyperlinks to additional explanatory information. Level I also includes problem-determined hints, where students are given the same advice for fixing an error no matter what error was made. Answer-based hints are at Level II. Answer-based hints are different, depending on the type of error made. FIT refers to this type of support as locally adaptive because it depends on the specific error the student made at a specific (local) point in time. Nothing needs to be accessed from the student model in order to supply answer-based support. In contrast, context-aware support needs to draw upon information from the student model. Context-aware support provides different levels of support as the student gains mastery; this is often referred to as fading. For example, at Level III the availability of hints may change depending on a student’s inferred competency. Finally, Level IV adds more naturalistic interactive dialogue to Level III. It should be noted, however, that interactive dialogue can be implemented at Level II; it does not necessarily require access to a student model that persists across problems (Level III). AutoTutor (D’Mello and Graesser 2012) is an example of a Level II AIS with interactive language. AutoTutor supports mixed-initiative dialogues on problems requiring reasoning and explanation in subjects such as physics and computer science; but the interaction is not influenced by information about student mastery from one problem to the next.

One intention behind FIT was to provide procurers of instructional technology with a vocabulary to describe the characteristics they desire in to-be-purchased applications. Terms like “adaptive” and “ITS” are not precise enough to ensure that the customer and the developer have exactly the same idea of the features that will be engineered into the delivered product. However, in an attempt to be precise, the resulting complexity of the FIT model may have undermined its ability to meet the intention. And despite this complexity, there are multiple factors that FIT does not deal with at all. For example, it does not make distinctions according to different kinds of competency, student, or domain models. Nor does it address factors like the quality of the instructional content, instructional design, or user interface considerations.

Tyton Partners’ white paper (2013) presented another framework for instructional technology, using six attributes: (1) Learner profile, (2) Unit of adaptivity, (3) Instruction coverage, (4) Assessment, (5) Content Model, and (6) Bloom’s Coverage. An AIS can be located at a point along each attribute’s continuum (e.g., an AIS can be high on one attribute and low on another). Explanations of the attributes and their continua are shown in Table 2. The white paper suggests that, along with maturity (defined by eight attributes), the taxonomy can be a useful guide to selecting instructional technology. They also suggest consideration of instructor resources (e.g., dashboards and data analytics) and evidence on learning outcomes. FIT and Tyton’s frameworks are largely complementary.
The Tyton taxonomy covers attributes not addressed by FIT (Assessment frequency, Instruction Coverage, Content Model, Bloom’s coverage), whereas FIT provides finer-grained detail on Tyton’s Unit of adaptivity and Learner profile attributes.


Table 2. Taxonomy and adaptivity continua proposed by Tyton Partners (2013). Each attribute runs along a continuum from low to high adaptivity.

Learner Profile: a structured repository of information about the learner used to inform and personalize the learning experience.
  Low → High: Student data informs initial placement → Student data drives adaptivity during a learning sequence → Student data is dynamic following each adaptive experience

Unit of Adaptivity: refers to the structure of the instructional content and the scale at which that content is modified for specific learner needs.
  Low → High: Course prerequisite level → Unit/lesson → Learning object level

Instruction Coverage: refers to the pedagogical flexibility of a product to deliver an adaptive learning experience and the scope/scale of that experience within the context of a course.
  Low → High: Targeted study aid → Supplemental instruction → Whole course

Assessment: the frequency, format, and conditions under which learners are evaluated.
  Low → High: Infrequent/Benchmark → Formative → Adaptive/Continuous

Content Model: describes the accessibility of the product’s authoring environment to instructors or other users and their ability to add and/or manipulate instructional content in the system.
  Low → High: Closed, with some configurability → Open, authoring capability offered as a service → Authoring platform

Bloom’s Coverage: highlights to what extent a product can support the learning objectives within the Cognitive Domain of Bloom’s Taxonomy.
  Low → High: Understanding/Remembering → Analyzing/Applying → Creating/Evaluating


5 Foibles of AIS

AIS have been seen as a way to scale up individualized instruction. While the aim is to produce better learning outcomes than can be achieved with less adaptive methods, head-to-head comparisons of AIS to equivalent-in-content non-adaptive instructional technology have been rare (Durlach and Ray 2011; Murray and Perez 2015; Yarnall et al. 2016). More typically, AIS have been assessed as a supplement to traditional classroom instruction (e.g., Koedinger et al. 1997), and/or by looking at pre-AIS vs. post-AIS assessment outcomes. In that context, it has been suggested that an effect size of one be set as a benchmark for AIS learning gains; i.e., learning gains should raise test scores by around one standard deviation, on average (Slavin 1987; VanLehn 2011), to be on par with human tutoring (or, if compared with a non-adaptive control intervention, an effect size of about .75 or greater). While some AIS have been shown to meet or exceed this benchmark (e.g., Fletcher and Morrison 2014; VanLehn 2011), this is not always the case (Durlach and Ray 2011; Kulik and Fletcher 2016; Pane et al. 2010; Vanderwaetere et al. 2011; Yarnall et al. 2016). Consequently, the inclusion of some kind of adaptive intervention in instructional technology does not guarantee better learning outcomes.

A generalized “secret sauce” – the right combination of instructional design, content selection, and individualized feedback and support that will boost learning outcomes for any given set of learning objectives (and for any learners in any context) – has yet to be identified. Evaluation studies conducting systematic manipulation of AIS features and behaviors to determine those which are required for superior learning outcomes have rarely been conducted (Brusilovsky et al. 2004; Durlach and Ray 2011); but, perhaps this is not surprising due to the myriad decisions that must be made during AIS design (thus the large number of variables that could be systematically examined), and the challenge of conducting educational research in naturalistic settings (Yarnall et al. 2016). This section will review some of the factors that might help improve AIS efficacy.
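For reference, the effect-size benchmark discussed here is Cohen's d: the difference between group means divided by the pooled standard deviation. The sketch below computes it for two made-up sets of test scores; the data are invented purely to show the calculation.

```python
# Cohen's d: (mean_treatment - mean_control) / pooled standard deviation.
# The score lists below are made-up illustration data, not study results.

from statistics import mean, stdev

def cohens_d(treatment, control):
    pooled_var = ((len(treatment) - 1) * stdev(treatment) ** 2 +
                  (len(control) - 1) * stdev(control) ** 2) / (len(treatment) + len(control) - 2)
    return (mean(treatment) - mean(control)) / pooled_var ** 0.5

ais_scores = [78, 85, 90, 74, 88, 81]
control_scores = [70, 75, 72, 68, 80, 71]
print(round(cohens_d(ais_scores, control_scores), 2))
```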

5.1 AIS Design

There is still much “art” in AIS design. Many decisions are made by intelligent guesswork (Koedinger et al. 2013; Zhang and VanLehn 2017). What should the mastery criteria be? What algorithms should be used to update the student model or to select the next learning task? There are no standard solutions to these questions. Even if mirroring approaches that have proven successful for others (e.g. using a Bayesian model), there are still many parametric decisions required; and what if the designed model does not align with learners’ actual knowledge structures? There is little learning science available to guide these decisions. Some researchers have looked to improve student models by re-engineering them based on whether the data produced by students using their AIS fit power-law curves (e.g., Koedinger et al. 2013; Martin and Mitrovic 2005). Poor fits suggest a misalignment. More recently, machine learning approaches are being applied to optimize pedagogical interventions (e.g., Chi et al. 2011; Lin et al. 2015; Mitchell et al. 2013; Rowe et al. 2016). Whether these data-driven approaches lead to better learning outcomes remains to be seen.


Much AIS design has been inspired by what human tutors do; but, AIS pedagogical interventions have focused more on knowledge and skill acquisition compared with motivational interventions. Tutoring behaviors concerned with building curiosity, motivation, and self-efficacy are known to be important (Clark 2005; Lepper and Woolverton 2002). Therefore AIS learning outcomes may benefit from greater incorporation of tactics aimed at bolstering these. Many AIS do not incorporate “instruction;” but, rather focus on practice. They therefore do not necessarily abide by instructional design best practices. Merrill (2002) described five principles of instruction and multiple corollaries; but, it is not clear that these have had much impact on the design of AIS content. Instructional design advises on what content to provide and when during the learning process (e.g., ordering of concrete vs. abstract knowledge). “Content is king” and may outweigh the effects of adaptive intervention. The technology used to provide content is less relevant than the content itself (Clark 1983). Finally, with respect to content, it is possible that greater effort at including “pedagogical content knowledge” in AIS intervention tactics may improve AIS effectiveness. Pedagogical content knowledge is knowledge about teaching a specific domain (Hill et al. 2008). This includes what makes learning a specific topic difficult, common preconceptions, misconceptions, and mistakes. An AIS that can recognize these may be better able to help students overcome them. Only a few AIS have included explicit representation of pedagogical content knowledge, often referred to as bug libraries (Mitrovic et al. 2003). Greater attention to heuristics for good human-computer interaction might also help AIS effectiveness. Certainly, interfaces that incur a high cognitive load will leave students with fewer cognitive resources for learning. Beyond good usability design, it has been suggested that the structure of interactions can affect knowledge integration and the mental models formed by students (Swan 2003). Woolf and Cunningham (1987) suggest that AIS environments should be intuitive, make use of key tools for attaining expertise in the knowledge domain, and as far as possible resemble the real world in which the tasks need to be conducted. Learners should have the ability to take multiple actions so as to permit diagnosis of plausible errors. Finally, a shift from focusing on improving performance during the use of an AIS to improving retention and transfer may enable AIS to provide better learning outcomes. The performance measured during or immediately after an AIS experience may fail to translate into retention and transfer. Interventions aimed at improving performance during practice may actually have the inverse effect on retention and transfer (Soderstrom and Bjork 2015). An example is massed vs. spaced practice: performance during practice may appear better when practice opportunities are massed in time; however, retention over the long term is better when it is spaced in time. Research has shown that the optimal spacing depends on encoding strength, and so may be different for different items and different across learners. Therefore an inferred measure of encoding strength can be used to adapt when a repetition will likely enhance that strength further. As already mentioned, some AIS do use adaptive spacing algorithms to schedule tasks so as to improve retention. 
Most of these are applied in the context of simple associative learning, and use measures such as accuracy, reaction time, and/or student judgement of certainty to infer encoding strength. This tactic has found less application in AIS concerning more complex types of learning (Kellman and Massey


2013). Therefore more consideration of when tasks should be retested might improve AIS impacts on retention. Methods of practice that tend to impair performance during initial learning but facilitate retention and transfer have been referred to as “desirable difficulties” (Bjork and Bjork 2011). These desirable difficulties include making the conditions of practice less predictable, including variations in format, contexts, timing, and topics. Formative assessment using recall (vs. recognition) is also desirable. Not all sources of difficulty are desirable (e.g., unintuitive user interfaces, or content too advanced for the learner). Desirable difficulties are ones that reinforce memory storage and de-contextualize retrieval. Just as spacing effects can be made adaptive, it is possible that implementation of some of the other desirable difficulties can be done adaptively (e.g., personalizing when to vary the format of a task based on prior performance). Except for spacing, the potential benefits of implementing these desirable difficulties adaptively have received little attention.

5.2 AIS Deployment

The impact of AIS on learning outcomes is not only dependent on AIS design, but also on the deployment environment. The quality and context of AIS implementation is important. The same piece of courseware can be used in different ways, and the specifics of usage can affect learning outcomes (Yarnall et al. 2016). First-order barriers to successful technology integration include limitations in hardware capability, hardware availability, time, systems interoperability, technical support, and instructor training (Ertmer and Ottenbreit-Leftwich 2013; Hsu and Kuan 2013). But AIS are socio-technical systems, so their impact also depends on human activities, attitudes, beliefs, knowledge, and skills (Geels 2004). Educational technology integration is facilitated when instructors believe that first-order barriers will not impede its use, have had adequate training, and perceive institutional support (Ertmer and Ottenbreit-Leftwich 2013; Hsu and Kuan 2013). Additionally, the instructor's ability to co-plan lessons and/or content with the technology can have an impact on technology integration (Ertmer and Ottenbreit-Leftwich 2013) and learning outcomes (Yarnall et al. 2016).
Much research has examined the factors that affect technology integration in education in general (e.g., Knezek and Christensen 2016), and much advice can be found on how to facilitate technology integration (e.g., https://www.edutopia.org/technology-integration-guide-description and https://www.iste.org/standards/essential-conditions); but there is rather less available on how to assess it objectively or on its relation to learning outcomes. The literature on human aspects of technology integration in education has tended to focus almost exclusively on the teacher and has devoted far less attention to other factors, including social variables like class size, student demographics, or student attitudes.
With respect to implementing AIS specifically, Yarnall et al. (2016) suggest that AIS providers work with their institutional partners and researchers to articulate and validate implementation guidelines. The technology needs to fit the environment in which it will be used. AIS providers need to fully explore the boundary conditions of recommended use. If the provider recommends one hour per day but the program of instruction can only accommodate three hours per week, what are the implications? Instructors deviating from recommended guidelines may undermine AIS effectiveness;
but it is not entirely clear that guidelines have been validated through empirical testing for multiple types of learning environments and learners. Beyond following guidelines, Tyton Partners (2016) recommend establishing a shared understanding of the impact of AIS on the teaching process among all stakeholders, and involving faculty in the selection and implementation processes throughout. They also recommend educating faculty on AIS in general, including the concepts, models and techniques used, and the learner data that can be accessed.

6 Conclusions

AIS have three fundamental characteristics: (1) automated measurement of the learner's behavior while using the system; (2) analysis of that behavior to create an abstracted representation of the learner's level of competency on the knowledge or skills to be learned; and (3) use of that representation (the student model) to determine how to adapt the learner's experience while engaged with the system. Different flavors of AIS do this in different ways, however, and there is still much art in their design.
AIS can provide learning benefits, including reductions in time to learn and increases in passing rates; but this is not a guaranteed outcome. AIS is not a homogeneous category that can be established as "effective" or "ineffective" (Yarnall et al. 2016). It is likely an AIS will not be any less effective than a non-adaptive analog; however, an AIS will typically require more upfront resources, and potentially more radical changes to an existing program of instruction. This is because AIS require more content and under-the-hood engineering, and allow for different learners within a class to be at substantially different places in the curriculum. The decision to use AIS also introduces new issues of privacy and security concerning the usage of the learner data generated (Johanes and Lagerstrom 2017).
Some have suggested that a great value of AIS may be in the revelation of those data to the instructor and perhaps the learners themselves (e.g., Brusilovsky et al. 2015; Kay et al. 2007; Martinez-Maldonado et al. 2015). If the data can be provided in a form that is meaningful for instructors and learners, that may enable beneficial instructional adaptation on the part of an instructor, or self-regulation on the part of the learner, without necessarily requiring especially effective use of the data for adaptive intervention by the system itself. Given the current state of the art of artificial intelligence applied to AIS, one of AIS's greatest potential benefits may be to empower the "more knowledgeable other" (words attributed to Vygotsky) to guide learning. Further breakthroughs may enable AIS to become more reliably effective in and of themselves.
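As a minimal illustration of these three characteristics, the sketch below (Python, with invented names and a single-skill mastery estimate standing in for a full student model) shows one pass through the measure-analyze-adapt cycle; it is a conceptual sketch, not a description of any fielded AIS.

    def update_student_model(mastery, observed_correct, weight=0.3):
        """(2) Analyze measured behavior into an abstracted competency estimate.
        An exponentially weighted running estimate stands in for a real model."""
        return (1 - weight) * mastery + weight * (1.0 if observed_correct else 0.0)

    def choose_adaptation(mastery):
        """(3) Use the student model to decide how to adapt the experience."""
        if mastery < 0.4:
            return "present a worked example with hints"
        if mastery < 0.8:
            return "present a practice problem with feedback"
        return "advance to the next objective"

    # (1) Automated measurement: each interaction yields an observed response.
    mastery = 0.5
    for observed_correct in [False, True, True, True]:
        mastery = update_student_model(mastery, observed_correct)
        print(f"mastery={mastery:.2f} -> {choose_adaptation(mastery)}")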

References

Behrens, J., Mislevy, R., DiCerbo, K., Levy, R.: An evidence centered design for learning and assessment in the digital world. CRESST Report 778. University of California, Los Angeles (2010)
Bitzer, D., Lyman, E., Easley, J.: The uses of PLATO: a computer-controlled teaching system. Report R-268. University of Illinois, Urbana, Illinois (1965)

Bjork, E., Bjork, R.: Making things hard on yourself, but in a good way: creating desirable difficulties to enhance learning. In: Gernsbacher, M., Pew, R., Hough, L., Pomerantz, J. (eds.) Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society, pp. 56–64. Worth Publishers, New York (2011)
Bloom, B.: Human Characteristics and School Learning. McGraw-Hill, New York (1976)
Bloom, B.: The 2 sigma problem: the search for methods of group instruction as effective as one-to-one tutoring. Educ. Res. 13, 4–16 (1984)
Brown, P., Roediger, H., McDaniel, M.: Make It Stick. The Belknap Press of Harvard University, Cambridge (2014). https://doi.org/10.1021/ed5006135
Brusilovsky, P., Karagiannidis, C., Sampson, D.: Layered evaluation of adaptive learning systems. Int. J. Continuing Eng. Educ. Lifelong Learn. 14, 402–421 (2004). https://doi.org/10.1504/IJCEELL.2004.005729
Brusilovsky, P., Somyurek, S., Guerra, J., Hosseini, R., Zadorozhny, V., Durlach, P.: Open social student modeling for personalized learning. IEEE Trans. Emerg. Top. Comput. 4(3), 450–461 (2015). https://doi.org/10.1109/tetc.2015.2501243
Chi, M., VanLehn, K., Litman, D., Jordon, P.: An evaluation of pedagogical tutorial tactics for a natural language tutoring system: a reinforcement learning approach. Int. J. Artif. Intell. Educ. 21, 83–113 (2011). https://doi.org/10.3233/JAI-2011-014
Clark, R.: Reconsidering research on learning from media. Rev. Educ. Res. 53(4), 445–459 (1983). https://doi.org/10.3102/00346543053004445
Clark, R.: What works in distance learning: motivation strategies. In: O'Neil, H. (ed.) What Works in Distance Learning: Guidelines, pp. 89–110. Information Age Publishers, Greenwich (2005)
Clougherty, R., Popova, V.: How adaptive is adaptive learning: seven models of adaptivity or finding Cinderella's shoe size. Int. J. Assess. Eval. 22(2), 13–22 (2015)
D'Mello, S., Blanchard, N., Baker, R., Ocumpaugh, J., Brawner, K.: I feel your pain: a selective review of affect-sensitive instructional strategies. In: Sottilare, R., Graesser, A., Hu, X., Goldberg, B. (eds.) Design Recommendations for Adaptive Intelligent Tutoring Systems: Adaptive Instructional Strategies, vol. 2, pp. 35–48. U.S. Army Research Laboratory, Orlando, FL (2014)
D'Mello, S., Graesser, A.: AutoTutor and affective AutoTutor: learning by talking with cognitively and emotionally intelligent computers that talk back. ACM Trans. Interact. Intell. Syst. 2, 23–39 (2012). https://doi.org/10.1145/2395123.2395128
Durlach, P.: Support in a framework for instructional technology. In: Sottilare, R., Graesser, A., Hu, X., Goldberg, B. (eds.) Design Recommendations for Adaptive Intelligent Tutoring Systems: Adaptive Instructional Strategies, vol. 2, pp. 297–310. Army Research Laboratory, Orlando (2014)
Durlach, P., Spain, R.: Framework for instructional technology. In: Duffy, V. (ed.) Advances in Applied Human Modeling and Simulation, pp. 222–223. CRC Press, Boca Raton (2012)
Durlach, P., Spain, R.: Framework for instructional technology: methods of implementing adaptive training and education. Technical report 1335. U.S. Army Research Institute for the Behavioral and Social Sciences (2014). www.dtic.mil/docs/citations/ADA597411. Accessed 26 Dec 2018
Durlach, P., Ray, J.: Designing adaptive instructional environments: insights from empirical evidence. Technical report 1297. U.S. Army Research Institute for the Behavioral and Social Sciences (2011)
Ertmer, P., Ottenbreit-Leftwich, A.: Removing obstacles to the pedagogical changes required by Jonassen's vision of authentic technology-enabled learning. Comput. Educ. 64, 175–182 (2013). https://doi.org/10.1016/j.compedu.2012.10.008

Fletcher, J., Morrison, J.: Accelerating development of expertise: a digital tutor for navy technical training. Institute for Defense Analysis Document D-5358. Institute for Defense Analysis, Alexandria, VA (2014)
Geels, F.: From sectoral systems of innovation to social-technical systems: insights about dynamics of change from sociology and institutional history. Res. Policy 33, 897–920 (2004). https://doi.org/10.1016/j.respol.2004.01.015
Hill, H., Ball, D., Schilling, S.: Unpacking pedagogical content knowledge: conceptualizing and measuring teachers' topic-specific knowledge of students. J. Res. Math. Educ. 39(4), 372–400 (2008)
Hsu, S., Kuan, P.: The impact of multilevel factors on technology integration: the case of Taiwanese grade 1–9 teachers and schools. Educ. Technol. Res. Dev. 61(1), 25–50 (2013)
Johanes, P., Lagerstrom, L.: Adaptive learning: the premise, promise, and pitfalls. Am. Soc. Eng. Educ. (2017). https://peer.asee.org/adaptive-learning-the-premise-promise-and-pitfalls. Accessed 08 Jan 2019
Kay, J., Reimann, P., Yacef, K.: Mirroring of group activity to support learning as participation. In: Luckin, R., Koedinger, K., Greer, J. (eds.) Artificial Intelligence in Education, pp. 584–586. IOS Press, Amsterdam (2007)
Kellman, P., Massey, C.: Perceptual learning, cognition, and expertise. In: Ross, B. (ed.) The Psychology of Learning and Motivation, vol. 58, pp. 117–165. Academic Press, Elsevier (2013). https://doi.org/10.1016/b978-0-12-407237-4.00004-9
Kellman, P., Massey, C., Son, J.: Perceptual learning modules in mathematics: enhancing students' pattern recognition, structure extraction, and fluency. Top. Cogn. Sci. 2(2), 285–305 (2010). Special Issue on Perceptual Learning
Knezek, G., Christensen, R.: Extending the will, skill, tool model of technology integration: adding pedagogy as a new model construct. J. Comput. High. Educ. 28, 307–325 (2016). https://eric.ed.gov/?id=ED562193
Koedinger, K., Anderson, J., Hadley, W., Mark, M.: Intelligent tutoring goes to school in the big city. Int. J. Artif. Intell. Educ. 8, 30–43 (1997)
Koedinger, K., Booth, J., Klahr, D.: Education research. Instructional complexity and the science to constrain it. Science 342, 935–937 (2013). https://doi.org/10.1126/science.1238056
Koedinger, Kenneth R., Stamper, John C., McLaughlin, Elizabeth A., Nixon, Tristan: Using data-driven discovery of better student models to improve student learning. In: Lane, H. Chad, Yacef, Kalina, Mostow, Jack, Pavlik, Philip (eds.) AIED 2013. LNCS (LNAI), vol. 7926, pp. 421–430. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39112-5_43
Kulik, J., Fletcher, J.: Effectiveness of intelligent tutoring systems: a meta-analytic review. Rev. Educ. Res. 86(1), 42–78 (2016). https://doi.org/10.1016/0004-3702(90)90093-F
Lawless, K., Pellegrino, J.: Professional development in integrating technology into teaching and learning: knowns, unknowns, and ways to pursue better questions and answers. Rev. Educ. Res. 77(4), 575–614 (2007). https://doi.org/10.3102/0034654307309921
Lepper, M., Woolverton, M.: The wisdom of practice: lessons learned from the study of highly effective tutors. In: Aronson, J. (ed.) Improving Academic Achievement: Impact of Psychological Factors on Education, pp. 135–158. Academic Press, San Diego (2002)
Lin, H., Lee, P., Hsiao, T.: Online pedagogical tutorial tactics optimization using genetic-based reinforcement learning. Sci. World J. 2015 (2015). Article ID 352895. https://www.hindawi.com/journals/tswj/2015/352895/cta/. Accessed 04 Jan 2019
Martin, Brent, Mitrovic, Antonija: Using learning curves to mine student models. In: Ardissono, Liliana, Brna, Paul, Mitrovic, Antonija (eds.) UM 2005. LNCS (LNAI), vol. 3538, pp. 79–88. Springer, Heidelberg (2005). https://doi.org/10.1007/11527886_12

Martinez-Maldonado, R., Pardo, A., Mirriahi, N., Yacef, K., Kay, J., Clayphan, A.: The LATUX workflow: designing and deploying awareness tools in technology-enabled learning settings. In: Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, pp. 1–10. ACM, New York (2015). https://doi.org/10.1145/2723576.2723583
Merrill, M.: First principles of instruction. Educ. Technol. Res. Dev. 50(3), 43–59 (2002)
Mislevy, R., Steinberg, L., Almond, R.: On the structure of educational assessments. Measur. Interdisc. Res. Perspect. 1(1), 3–62 (2001)
Mitchell, Christopher M., Boyer, Kristy Elizabeth, Lester, James C.: A Markov decision process model of tutorial intervention in task-oriented dialogue. In: Lane, H. Chad, Yacef, Kalina, Mostow, Jack, Pavlik, Philip (eds.) AIED 2013. LNCS (LNAI), vol. 7926, pp. 828–831. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39112-5_123
Mitrovic, A.: Fifteen years of constraint-based tutors: what we have achieved and where we are going. User Model. User Adap. Interact. 22(1–2), 39–72 (2012)
Mitrovic, Antonija, Koedinger, Kenneth R., Martin, Brent: A comparative analysis of cognitive tutoring and constraint-based modeling. In: Brusilovsky, Peter, Corbett, Albert, de Rosis, Fiorella (eds.) UM 2003. LNCS (LNAI), vol. 2702, pp. 313–322. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-44963-9_42
Murray, M., Pérez, J.: Informing and performing: a study comparing adaptive learning to traditional learning. Inf. Sc. Int. J. Emerg. Transdiscipline 18, 111–125 (2015). https://doi.org/10.1080/13636820000200120, http://www.inform.nu/Articles/Vol18/ISJv18p111-125Murray1572.pdf. Accessed 03 Jan 2019
Pan, S.: The interleaving effect: mixing it up boosts learning. Sci. Am. (2015). https://doi.org/10.1177/1529100612453266, https://www.scientificamerican.com/article/the-interleaving-effect-mixing-it-up-boosts-learning/. Accessed 26 Dec 2018
Pane, J., McCaffrey, D., Slaughter, M., Steele, J., Ikemoto, G.: An experiment to evaluate the efficacy of cognitive tutor geometry. J. Res. Educ. Effectiveness 3(3), 254–281 (2010). https://doi.org/10.1080/19345741003681189
Park, O., Lee, J.: Adaptive instructional systems. In: Jonassen, D.H. (ed.) Handbook of Research on Educational Communications and Technology, pp. 651–684. Lawrence Erlbaum Associates Publishers, Mahwah (2004)
Rohrer, D.: The effects of spacing and mixing practice problems. J. Res. Math. 40(1), 4–17 (2009)
Rowe, J., et al.: Extending GIFT with a reinforcement learning-based framework for generalized tutorial planning. In: Sottilare, R., Ososky, S. (eds.) Proceedings of the 4th Annual Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym4), pp. 87–97. U.S. Army Research Laboratory, Orlando, FL (2016)
Sadler, D.: Formative assessment and the design of instructional systems. Instr. Sci. 18, 119–144 (1989)
Shute, V.: A comparison of learning environments: all that glitters… In: Lajoie, S., Derry, S. (eds.) Computers as Cognitive Tools, pp. 47–74. Lawrence Erlbaum Associates, Hillsdale (1993)
Shute, V., Hansen, E., Almond, R.: You can't fatten a hog by weighing it – or can you? Evaluating an assessment for learning system called ACED. Int. J. Artif. Intell. Educ. 18(4), 289–316 (2008)
Shute, V., Leighton, J., Jang, E., Chu, M.: Advances in the science of assessment. Educ. Assess. 21(1), 34–39 (2016)
Shute, V., Psotka, J.: Intelligent tutoring systems: past, present, and future. In: Jonassen, D. (ed.) Handbook of Research for Educational Communications and Technology, pp. 570–600. Simon & Schuster Macmillan, New York (1994)

Slavin, R.: Mastery learning reconsidered. Rev. Educ. Res. 57(2), 175–213 (1987). https://doi.org/10.3102/00346543057002175
Soderstrom, N., Bjork, R.: Learning versus performance: an integrative review. Perspect. Psychol. Sci. 10(2), 176–199 (2015). https://doi.org/10.1177/1745691615569000
Swan, K.: Learning effectiveness online: what the research tells us. In: Bourne, J., Moore, J. (eds.) Elements of Quality Online Education, Practice and Direction. Sloan Center for Online Education, Needham, MA, pp. 13–45 (2003)
Tseng, J., Chu, H., Hwang, G., Tsai, C.: Development of an adaptive learning system with two sources of personalization information. Comput. Educ. 51, 776–786 (2008)
Tyton Partners: Learning to adapt: understanding the adaptive learning supplier landscape (2013). http://tytonpartners.com/tyton-wp/wp-content/uploads/2015/01/Learning-to-Adapt_SupplierLandscape.pdf. Accessed 03 Jan 2019
Tyton Partners: Learning to adapt 2.0: the evolution of adaptive learning in higher education (2016). http://tytonpartners.com/tyton-wp/wp-content/uploads/2016/04/Tyton-Partners-Learning-to-Adapt-2.0-FINAL.pdf. Accessed 09 Jan 2019
Vanderwaetere, M., Desmet, P., Clarebout, G.: The contribution of learner characteristics in the development of computer-based adaptive learning environments. Comput. Hum. Behav. 27, 118–130 (2011). https://doi.org/10.1016/j.chb.2010.07.038
VanLehn, K.: The behavior of tutoring systems. Int. J. Artif. Intell. Educ. 16, 227–265 (2006). https://doi.org/10.1145/332148.332153
VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems and other tutoring systems. Educ. Psychol. 46(4), 197–221 (2011)
VanLehn, K.: Regulative loops, step loops, and task loops. Int. J. Artif. Intell. Educ. 26, 107–112 (2016). https://doi.org/10.1007/s40593-015-0056-x
Vygotsky, L.: Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge (1978)
Woolf, B., Cunningham, P.: Multiple knowledge sources in intelligent teaching systems. IEEE Expert 2(2), 41–54 (1987). https://doi.org/10.1109/MEX.1987.4307063
Yarnall, L., Means, B., Wetzel, T.: Lessons learned from early implementations of adaptive courseware. SRI Education, SRI International, Menlo Park, CA (2016). https://www.sri.com/sites/default/files/brochures/almap_final_report.pdf. Accessed 19 Sep 2018
Zhang, L., VanLehn, K.: Adaptively selecting biology questions generated from a semantic network. Interact. Learn. Environ. 25(7), 828–846 (2017). https://doi.org/10.1080/10494820.2016.1190939

Foundational Principles and Design of a Hybrid Tutor

Andrew J. Hampton(&) and Arthur C. Graesser

University of Memphis, Memphis, TN 38152, USA
[email protected]

Abstract. This paper describes the concept of a hybrid tutor as a type of adaptive instructional system (AIS). A hybrid tutor is a confederation of several digital learning resources and human interactions so that the right resource is available to the learner at the right time. We discuss a method for combining several existing educational technologies into a unified platform that tracks progress on learning a subject matter across several constituent parts and offers recommendations on what to do next. There is a learning record store that keeps track of progress and enables intelligent recommendations at several levels: broad topics, specific knowledge components, material difficulty, and mode of instruction. The fine-grain adaptability allows the incorporation of several cognitive learning principles, such as multiple representations and modalities, mental model construction, item spacing, and support for self-regulated learning. The proposed web-based learning environment can function as a stand-alone instructional platform that is integrated into classrooms with topics assigned according to a curriculum-based calendar, or as an adaptive learning environment that suggests learning activities generated by an intelligent recommender system. As a proof of concept, we developed ElectronixTutor, a hybrid tutor designed for introductory and intermediate electrical engineering education. This paper describes the rationale for its design and preliminary results.

Keywords: Adaptive instructional systems · Intelligent tutoring systems · Learning principles
1 Introduction

Adaptive instructional systems (AISs) are computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each learner in the context of domain learning objectives [1]. These advanced computer learning environments help students master knowledge and skills by implementing algorithms that adapt to students and that are informed by scientific principles of learning [2]. Typically, this type of instruction focuses on one student at a time to be sensitive to individual differences relevant to the topic at hand or instruction generally. It is also possible to have an automated tutor or mentor interact with small teams of learners in collaborative learning and problem-solving environments [3, 4].
Many of these systems go far beyond the capabilities of conventional computer training systems. Adaptivity in conventional systems often consists of no more than
coarse-grained signal-response using primitive learning principles. For example, a learner may study static material (e.g., text), take a multiple-choice assessment, receive a score, and iterate through the same process until achieving a threshold performance. Progression through topics often follows a predetermined order. Advanced AISs can drastically improve upon this approach by implementing fine-grained adaptivity. This can include providing feedback within individual problems to work toward a correct answer or directing learners to a subsequent problem suited to their level of mastery (as determined by previous performance). This is known as two-loop adaptivity [5]. Intelligent tutoring systems, a subset of AISs, track detailed learner characteristics such as knowledge, skills, and other psychological attributes and apply computational models based on the combination of artificial intelligence and cognitive science [2, 6, 7].
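The sketch below illustrates this two-loop structure in Python: an outer loop selects the next problem from the learner's estimated mastery, and an inner loop gives feedback within a single problem. The problem bank, thresholds, and closest-match heuristic are assumptions made for illustration rather than any system's actual implementation.

    def outer_loop_select(problems, mastery):
        """Outer loop: pick the unsolved problem whose difficulty best matches
        the learner's current mastery estimate (closest-match heuristic)."""
        candidates = [p for p in problems if not p["done"]]
        return min(candidates, key=lambda p: abs(p["difficulty"] - mastery), default=None)

    def inner_loop(problem, answer_attempts):
        """Inner loop: give feedback on each attempt within a single problem."""
        for attempt in answer_attempts:
            if attempt == problem["answer"]:
                print("Correct - moving on.")
                return True
            print("Not quite; here is a hint:", problem["hint"])
        return False

    # Illustrative problem bank and learner state
    problems = [
        {"id": 1, "difficulty": 0.3, "answer": "A", "hint": "Recall Ohm's law.", "done": False},
        {"id": 2, "difficulty": 0.7, "answer": "C", "hint": "Sum voltages around the loop.", "done": False},
    ]
    mastery = 0.4
    nxt = outer_loop_select(problems, mastery)
    if nxt is not None:
        solved = inner_loop(nxt, answer_attempts=["B", "A"])
        nxt["done"] = solved
        mastery = min(1.0, mastery + 0.1) if solved else max(0.0, mastery - 0.05)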

1.1 Contributions of AISs

The evolution of cognitive learning principles and models of learning has produced a range of pedagogically advanced AIS environments. Several mature systems have demonstrated significant learning outcomes. Some examples cover well-defined subject matters such as algebra and geometry, including Cognitive Tutors [8–10] and ALEKS [11]. Other efforts in electronics (SHERLOCK [12], BEETLE-II [13]) and digital information technology (Digital Tutor [14]) also have successful use cases.
The inclusion of verbal interaction with conversational agents can scaffold more natural engagement with the subject matter [15, 16] and open doors to less well-defined domains. Conversational systems encourage learners to explain concepts in their own words and thereby engender reflection and reasoning. Mixed-initiative dialogues allow learners to direct the conversation to personally relevant areas of the topic. Paralinguistic cues (e.g., pointing, facial expressions) increase realism and allow for visual reinforcement of information. Conversational systems with two or more agents (e.g., a teacher agent and a student agent) allow multiple kinds of interactions and encourage greater social involvement [15, 17, 18]. Some examples of conversational AISs include AutoTutor [16, 19], Betty's Brain [20], Coach Mike [21], Crystal Island [22], and Tactical Language and Culture System [23], all with demonstrated advantages over conventional instructional techniques.
Analyses of the effectiveness of AISs (and more specifically intelligent tutoring systems) have demonstrated value added over more conventional approaches like classroom instruction or reading static materials. While the effect sizes vary substantially, from d = 0.05 [24, 25] to an impressive d = 1.08 [26], most converge on relatively large values between d = 0.40 and d = 0.80 [5, 27]. Together, AISs cover a wide range of topics, and often the same topic from a variety of pedagogical angles.

1.2 Some Practical Challenges of AISs

Several practical problems have challenged widespread creation and adoption of these AISs despite their individual successes and collective contribution to our understanding of educational technologies more generally. AISs require large, diverse teams to work together effectively. For example, AutoTutor problems in electrical engineering required multiple experts in the fields of computer science, cognitive psychology, natural language processing, and, of course, electrical engineering [28]. These systems require a major investment in time and resources. Smaller, less expensive systems focus on a small band of pedagogical methods, learning principles, modalities, and content, but this runs the risk of yielding smaller learning gains and fewer learners who benefit from the AIS.

1.3 Multiple Representations

Learning functions on many levels (e.g., [29, 30]) that can benefit from varied forms of instruction. Levels of learning also tend to build on one another, such that higher levels often assume competency on lower ones. While individual AISs may suffer from lack of breadth or prohibitive development time, they each potentially provide a valuable representation of the information. A staged algebraic approach can offer a concrete mathematical complement to the conceptual focus of conversational AISs. Simple word problems provide remedial representation of key concepts and relationships when targeted at problem areas. Easily accessible definitions and functional descriptions lower the barrier to interaction with information. And functional comprehension follows from representations integrating concepts, components, and relationships. Leveraging multiple representations in AISs can provide staged advancement when they are made concurrently available. This can also help ensure that learners stay within Vygotsky's zone of proximal development [31].
The National Academy of Sciences, Engineering, and Medicine [32] identified affordances of learning technologies. The affordances include interactivity, where the technology responds to learners' actions, and adaptivity, where information is contingent on the past behavior, knowledge, or characteristics of the learner. Taken together, these present a baseline technical qualification for AISs. Other affordances include providing feedback on quality of performance, offering a choice on what to learn next, and allowing nonlinear access to content for self-directed learners. Learning technology also affords linked representations that emphasize different conceptual viewpoints, open-ended learner input to encourage self-expression, and communication with others. Leveraging all of these affordances in a single AIS presents a daunting challenge.
However, given the potential for concurrently developed and available AISs in the same domain, a possible solution is to combine existing AISs into a larger, complex system. The resulting system would provide diverse modes of interacting with content, strengthening learners' understanding by reinforcing it through varied, stratified repetition. It could also foster ownership on the learner's part by allowing choice of not just content, but order, representation, and difficulty. The resulting confederated system would include a human instructor who orchestrates learning, together with the recommendations from an intelligent recommender system. This essentially is what we mean by a hybrid tutor: the best of accomplished human tutors and digital intelligence from an ensemble of digital resources. The hybrid tutor would still require substantial investment by experts in diverse fields to create content, but the advantages stated should mean those hours are more likely to yield fruitful interactions with students. The pressing challenge becomes developing a way of translating progress in one system to progress in another, and making intelligent recommendations across an array of resources on several levels (e.g., topic, modality, difficulty, sources, human vs. computer).

2 ElectronixTutor

The promise of a hybrid tutor spurred the development of ElectronixTutor, a hybrid tutor designed to supplement classroom instruction by leveraging multiple AISs (and conventional static learning resources) in a single platform. Critically, all individual learning resources contribute to a unified learner record store. This store translates progress among the many resources on several discrete levels. The disciplined classification of these resources and levels allows the learning record store to inform an integrated recommender system. Collectively, these components leverage the established benefits of AIS interaction and provide both detailed (i.e., individual) and composite (i.e., classroom or population) learner information to a human instructor, who can then manually set assignments at the item or topic level.
The inclusion of multiple learning resources, both adaptive and static, allows ElectronixTutor to present learning content in multiple modes. Learners then have detailed records of how they interact with each one (e.g., time, performance, self-selected versus recommended or assigned). These resources all appear in a common user interface (see Fig. 1). In this example, an AutoTutor conversation (complete with optional dialogue history and scroll-over information from Point & Query) appears in the activity window that dominates the screen. The left-side navigation bar includes site navigation as well as all course content. Featured prominently is the "Topic of the Day" facility, where instructors set the area of content to be mastered. Below that appear "Recommendations", based on the learners' history of interaction with the system holistically. Finally, learners can self-select any of the available problems from a drop-down menu (though instructors can limit content availability for pedagogical reasons).

Fig. 1. The ElectronixTutor user interface, here showing an AutoTutor question with Point & Query engaged.

2.1 AutoTutor

AutoTutor [16, 19] presents conceptual questions on electrical circuits in a conversational exchange. Learners have both a tutor agent and a peer agent with whom to engage in a natural language discussion on a designated topic. The resulting "trialogues" always orient the learner to the topic, introduce an appropriate graphical representation, and directly address the learner by name when asking the main question. This allows learners to go from the concrete image to the deeper concept. Further, each main question has several components of a full correct answer, with the AutoTutor Conversational Engine able to extract partial, as well as incorrect, responses from natural language input. This affords follow-up hints, prompts, or pumps from one or both conversational agents to elicit all information the learner knows about the topic at hand. In addition to the depth of understanding that AutoTutor examines, the analysis of breadth makes it an excellent diagnostic interaction with the learner to identify the appropriate next problem within the larger system. This approach has proven successful across numerous domains, including STEM topics such as computer literacy, physics, biology, and scientific reasoning.
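The toy Python sketch below suggests the flavor of this kind of expectation matching: the learner's contribution is scored against each expected answer component, and the least-covered expectation drives the choice of a pump, hint, or prompt. The keyword-overlap scorer and the thresholds are simplifying assumptions, not the actual AutoTutor Conversational Engine, which relies on richer semantic matching.

    def coverage(student_answer, expectation_keywords):
        """Fraction of an expectation's keywords present in the student's answer.
        Real systems use semantic matching; keyword overlap is a stand-in."""
        words = set(student_answer.lower().split())
        hits = sum(1 for kw in expectation_keywords if kw in words)
        return hits / len(expectation_keywords)

    def next_dialogue_move(student_answer, expectations):
        """Choose a pump, hint, or prompt for the least-covered expectation."""
        scored = [(coverage(student_answer, e["keywords"]), e) for e in expectations]
        score, target = min(scored, key=lambda pair: pair[0])
        if score >= 0.8:
            return "SUMMARIZE: " + target["ideal"]
        if score >= 0.4:
            return "PROMPT: " + target["prompt"]     # elicit a specific missing word
        if score > 0.0:
            return "HINT: " + target["hint"]         # nudge toward the expectation
        return "PUMP: Can you say more about that?"  # generic request for more input

    expectations = [
        {"ideal": "Voltage across parallel branches is equal.",
         "keywords": ["voltage", "parallel", "equal"],
         "hint": "Think about what is shared by parallel branches.",
         "prompt": "The voltage across parallel branches is ____?"},
    ]
    print(next_dialogue_move("the branches are in parallel", expectations))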

2.2 Point and Query

Within AutoTutor, Point & Query [28] aims to mitigate the difficulty many learners have in identifying appropriate questions to ask by offering a simple mouse-over interaction with circuit diagrams. The learner clicks on a hot spot, which launches a set of good questions to ask; the learner selects a question and immediately receives a good answer [33]. Research has demonstrated that this facility greatly increases the absolute number of interactions with the learning program. The low effort necessary to engage with content lowers the barrier and encourages further engagement by reinforcing question-asking behavior with immediate answers.

2.3 Dragoon

Dragoon [34] has learners construct and manipulate dynamic models of circuits, ensuring functional understanding of interacting parts by fostering the development of appropriate mental models (see Fig. 2). These questions represent by far the most difficult problems available. The holistic perspective on structures, parameters, and relationships among them requires comprehensive understanding. This adds substantial value, both in ensuring mastery with a high degree of confidence, and in providing challenges to the most advanced, diligent learners.

Fig. 2. A sample Dragoon problem, requiring detailed, comprehensive knowledge of the circuits.

2.4 LearnForm

LearnForm is oriented toward complex problem-solving on electronics circuit problems, with overarching problems that deconstruct into constituent parts and feedback (see Fig. 3). Mathematical reasoning and algebraic logic play an important and recurring role in electrical engineering. These problems ensure that learners have detailed knowledge of all required steps, with explanations provided in relatively simple mathematical sentences that build on one another until a complete, applied problem is solved.

Fig. 3. A sample LearnForm problem, with algebraic formulas broken down into constituent parts while still relating to the original problem in total.

2.5 Beetle-II

BEETLE-II [13] addresses basic understanding of circuits, with a focus on introductory concepts such as voltage, current, open versus closed circuits, and how to find faults using voltage. These problems have demonstrated learning gains, but they engage only the macro-level of discourse and pedagogy, as opposed to the micro-level language and content adaptation present in AutoTutor.

2.6 NEETS and Topic Summaries

As mentioned above, quality static texts will retain their place on the pedagogical landscape for the foreseeable future. The Navy Electricity and Electronics Training Series (NEETS) is a hefty collection of documents encapsulating all essential training information for the several Navy specialties dealing with electronics. These documents are both irrefutably useful and irredeemably dry. We provide these in context as a necessary backstop for any well-rounded education in Navy electrical engineering, though relying on learners to voluntarily engage (or reliably engage when compelled) remains a challenge beyond our scope. ElectronixTutor has indexed these texts and provides hyperlinks to the appropriate section when listed or recommended to learners. The considerable depth of the NEETS suggested the need for a more approachable static text resource that would still allow learners to peruse at their own pace. To that end, subject matter experts collaboratively created topic summaries for each of the 15 topics covered in ElectronixTutor. These summaries comprised between one and four pages of a high-level overview, including diagrams, important definitions or formulas, and links to external resources such as Wikipedia or university webpages deemed to be of value.

2.7 Unified Learner Model and Recommender System

The inclusion of learning resources that instantiate such divergent pedagogical strategies is an important first step. However, these resources need to be organized and launched to the right person at the right time according to a disciplined framework. Knowledge components [35] provide a common currency of content to support this (see Fig. 4). In ElectronixTutor, every problem is annotated by experts on various content topics and knowledge components. They determine whether each learning resource includes these topics and knowledge components. For each of the 15 topics in the system, and for devices versus circuits, the experts decide whether particular knowledge components tap the structure, behavior, function, or a parameter. This 15 (topic) × 2 (device versus circuit) × 4 (structure, behavior, function, parameter) matrix yields a list of 120 combinations. The landscape of learning resources has questions/items that touch at least one of these, and potentially several.

Fig. 4. Knowledge component mapping in ElectronixTutor

All learner interactions with learning content are discretized through knowledge components, formalized in xAPI format, and sent to the unified learner record store. An intelligent recommendation system ensures adequate coverage of each topic while keeping learners in the zone of proximal development, with problems not too difficult or too easy, but just within the learner's mastery.
ElectronixTutor allows learners to choose problems in three ways. They can stay in the Topic of the Day, which completely controls their selection and guides them to what our algorithm judges to be the optimal problem for their advancement within a topic designated by a human instructor. Alternatively, they can choose from among three recommended activities (items/questions in a topic), providing a degree of control (and thereby psychological engagement) while staying within near-optimal parameters. Finally, learners can opt for completely self-directed learning, allowing full access to the suite of learning resources and range of topics. Progress in any of these will update the recommendations accordingly.
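To make this bookkeeping concrete, the Python sketch below enumerates the topic-by-scope-by-aspect knowledge-component grid described above, records one interaction as a simplified xAPI-like statement, and recommends items whose predicted success falls inside an assumed zone-of-proximal-development band. The field names, the two sample topics, and the 0.6-0.85 band are illustrative assumptions, not ElectronixTutor's actual schema, topic list, or thresholds.

    from itertools import product
    from datetime import datetime, timezone

    # Two sample topics stand in for the 15 in the full system; with all 15,
    # the product below yields the 120 combinations described in the text.
    TOPICS = ["Ohm's Law & Kirchhoff's Law", "Series & Parallel Circuit"]
    ASPECTS = ["structure", "behavior", "function", "parameter"]
    knowledge_components = [
        {"topic": t, "scope": s, "aspect": a}
        for t, s, a in product(TOPICS, ["device", "circuit"], ASPECTS)
    ]

    def xapi_like_statement(learner_id, item_id, kc, score):
        """Simplified xAPI-style record of one interaction (illustrative fields)."""
        return {
            "actor": {"id": learner_id},
            "verb": "answered",
            "object": {"id": item_id, "kc": kc},
            "result": {"score": score},
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    def recommend(items, predicted_success, low=0.6, high=0.85, k=3):
        """Return up to k items whose predicted success sits in the assumed ZPD band."""
        in_band = [i for i in items if low <= predicted_success(i) <= high]
        return sorted(in_band, key=predicted_success)[:k]

    # Example usage with a dummy item bank and prediction function
    items = [{"id": f"item-{n}", "kc": knowledge_components[n % len(knowledge_components)]}
             for n in range(10)]
    predicted = lambda item: 0.5 + 0.04 * int(item["id"].split("-")[1])
    record = xapi_like_statement("learner-42", "item-3", items[3]["kc"], 0.75)
    print(len(knowledge_components), "KCs;", [i["id"] for i in recommend(items, predicted)])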

3 Preliminary Data

We have conducted several small-scale studies in university and trade school settings, though we have not yet attained our optimal goal of full classroom integration. Available data indicate that learners' performance in ElectronixTutor was not significantly correlated with degree of problem difficulty (ρ(12) = –0.068, p = 0.818), and stayed relatively stable near 78% correct. This means learners stayed roughly in the zone of proximal development, which was intentional in the design of the system. That is, the stability is likely explained by the adaptivity of the recommender system, and in part by allowing users to self-select problems commensurate with their experience and level of comfort with the material.

We saw a good distribution of engagement relative to available topics. In Table 1, the most advanced topics appear at the top of the list, and most basic at the bottom. With the first eleven progressing topics garnering generally increasing use, we feel confident in the content match between sample and target populations to a point. Minimal use for the four most advanced topics suggests the need for some advanced classes to be incorporated into future studies for full system evaluation.

Table 1. Relative time spent in each available topic (chart of hours spent, axis 0–10, per topic). Topics, from most advanced at top to most basic at bottom: Pushpull Amplifier, Multistage Amplifier, CB Amplifier, CC Amplifier, CE Amplifier, Transistor, Zener Diode & Regulator, Diode Limiter & Clamper, Power Supply Rectifier, PN Junction, Filter, Series/Parallel Combination, Series & Parallel Circuit, Ohm's Law & Kirchhoff's Law.

Survey data obtained from those who completed ten hours of interaction provide some promising results as well. While the total number of participants to complete the study was small (only 6 completed out of 50 who requested log-on credentials), the majority indicated that they would continue to use ElectronixTutor without paid compensation. The low completion rate indicates a certain start-up cost in learning the system that could be mitigated by a more thorough introduction. Our initial attempt at classroom integration showed largely the same effect, with few able to move past the opening stages. It should be noted that these systems were used voluntarily, so adoption was expected to be low. The results are compatible with research on MOOCs, which are known to have high dropout rates.
We also identified some learners who became disengaged, particularly during the posttest assessment. Performance features like time on task and scores relative to historical performance make this detection relatively simple. In this way, human instructors can more readily intervene when learners become distracted or discouraged.
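A minimal sketch of this kind of disengagement flagging is shown below, comparing recent time on task and scores against the learner's own history; the window size and thresholds are assumptions chosen for illustration, not the measures actually used in these studies.

    from statistics import mean

    def disengagement_flags(events, rapid_seconds=5, score_drop=0.25, window=5):
        """Flag possible disengagement from simple performance features.
        events: list of dicts with 'seconds_on_task' and 'score' (0-1), oldest first.
        Returns human-readable warnings for an instructor dashboard."""
        flags = []
        recent = events[-window:]
        if recent and mean(e["seconds_on_task"] for e in recent) < rapid_seconds:
            flags.append("rapid-fire responding: very low time on task")
        if len(events) > window:
            historical = mean(e["score"] for e in events[:-window])
            current = mean(e["score"] for e in recent)
            if historical - current > score_drop:
                flags.append("scores well below this learner's historical average")
        return flags

    # Example: a learner who slows engagement late in a session
    history = [{"seconds_on_task": 40, "score": 0.8}] * 10 + \
              [{"seconds_on_task": 3, "score": 0.2}] * 5
    print(disengagement_flags(history))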

4 Conclusions and Future Work

The integration of multiple existing AISs in a unified, cohesive platform is the essence of hybrid tutors. Integration expands the range of interaction types available to learners and leverages more of the potential affordances of learning technology. Variability in representations is useful in fostering deep, lasting understanding of complex topics like electrical engineering. The multiple modes of content and strategy acquisition are expected to provide better transfer to new problems that the learner may encounter. Providing varying levels of control over content selection encourages engagement and investment in learning activities. Students can defer to classroom assignment (adaptive to their mastery level), evaluate a manageable number of recommendations (based on historical performance), or choose to explore freely. These overarching factors collectively suggest a high likelihood that hybrid tutors like ElectronixTutor will yield a product greater than the sum of their parts.
Future work in this field will focus on tight integration into classrooms, with calendar functions determining assignments and system participation built into the syllabus. This is the ideal application of hybrid tutors. Interim goals include improvements based on learner feedback, notably improvements in the early stages when many participants dropped off. Further, deployment of ElectronixTutor "in the wild" (that is, made available widely to any who are interested in using it) will opportunistically recruit participants with motivation to learn, thus providing data and opportunities for iterative improvement. The University of Memphis library offers a "sandbox" learning tools interface that may facilitate this branch of inquiry. In addition, Shelby County Schools, the school district surrounding the University of Memphis, has expressed interest in supplementing conventional classroom instruction with ElectronixTutor, potentially expanding the learner base.

References

1. Sottilare, R., Brawner, K.: Exploring standardization opportunities by examining interaction between common adaptive instructional system components. In: Proceedings of the First Adaptive Instructional Systems (AIS) Standards Workshop, Orlando, Florida (2018) 2. Graesser, A.C., Hu, X., Sottilare, R.: Intelligent tutoring systems. In: Fischer, F., Hmelo-Silver, C.E., Goldman, S.R., Reimann, P. (eds.) International Handbook of the Learning Sciences, pp. 246–255. Routledge, New York (2018) 3. Gilbert, S., et al.: Creating a team tutor using GIFT. Int. J. Artif. Intell. Educ. 28, 286–313 (2018) 4. Sottilare, R., Graesser, A.C., Hu, X., Sinatra, A. (eds.): Design Recommendations for Intelligent Tutoring Systems: Team Science, vol. 6. U.S. Army Research Laboratory, Orlando (2018) 5. VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems and other tutoring systems. Educ. Psychol. 46, 197–221 (2011) 6. Sottilare, R., Graesser, A. C., Hu, X., Holden, H. (eds.): Design Recommendations for Intelligent Tutoring Systems, Volume 1—Learner modeling. U.S. Army Research Laboratory, Orlando (2013)

7. Woolf, B.P.: Building Intelligent Interactive Tutors. Morgan Kaufmann Publishers, Burlington (2009) 8. Aleven, V., Mclaren, B.M., Sewall, J., Koedinger, K.R.: A new paradigm for intelligent tutoring systems: example-tracing tutors. Int. J. Artif. Intell. Educ. 19(2), 105–154 (2009) 9. Koedinger, K.R., Anderson, J.R., Hadley, W.H., Mark, M.: Intelligent tutoring goes to school in the big city. Int. J. Artif. Intell. Educ. 8, 30–43 (1997) 10. Ritter, S., Anderson, J.R., Koedinger, K.R., Corbett, A.: Cognitive tutor: applied research in mathematics education. Psychon. Bull. Rev. 14, 249–255 (2007) 11. Falmagne, J., Albert, D., Doble, C., Eppstein, D., Hu, X.: Knowledge Spaces: Applications in Education. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35329-1 12. Lesgold, A., Lajoie, S.P., Bunzo, M., Eggan, G.: SHERLOCK: a coached practice environment for an electronics trouble-shooting job. In: Larkin, J.H., Chabay, R.W. (eds.) Computer Assisted Instruction and Intelligent Tutoring Systems: Shared Goals and Complementary Approaches, pp. 201–238. Erlbaum, Hillsdale (1992) 13. Dzikovska, M., Steinhauser, N., Farrow, E., Moore, J., Campbell, G.: BEETLE II: deep natural language understanding and automatic feedback generation for intelligent tutoring in basic electricity and electronics. Int. J. Artif. Intell. Educ. 24, 284–332 (2014) 14. Fletcher, J.D., Morrison, J.E.: DARPA Digital Tutor: Assessment data (IDA Document D4686). Institute for Defense Analyses, Alexandria (2012) 15. Johnson, W.L., Lester, J.C.: Face-to-face interaction with pedagogical agents, Twenty years later. International Journal of Artificial Intelligence in Education 26(1), 25–36 (2016) 16. Nye, B.D., Graesser, A.C., Hu, X.: AutoTutor and family: a review of 17 years of natural language tutoring. Int. J. Artif. Intell. Educ. 24(4), 427–469 (2014) 17. Craig, S.D., Twyford, J., Irigoyen, N., Zipp, S.A.: A Test of spatial contiguity for virtual human’s gestures in multimedia learning environments. J. Educ. Comput. Res. 53(1), 3–14 (2015) 18. Graesser, A.C., Li, H., Forsyth, C.: Learning by communicating in natural language with conversational agents. Curr. Dir. Psychol. Sci. 23, 374–380 (2014) 19. Graesser, A.C.: Conversations with AutoTutor help students learn. Int. J. Artif. Intell. Educ. 26, 124–132 (2016) 20. Biswas, G., Jeong, H., Kinnebrew, J., Sulcer, B., Roscoe, R.: Measuring self-regulated learning skills through social interactions in a teachable agent environment. Res. Pract. Technol. Enhanced Learn. 5, 123–152 (2010) 21. Lane, H.C., Noren, D., Auerbach, D., Birch, M., Swartout, W.: Intelligent tutoring goes to the museum in the big city: a pedagogical agent for informal science education. In: Biswas, G., Bull, S., Kay, J., Mitrovic, A. (eds.) AIED 2011. LNCS (LNAI), vol. 6738, pp. 155–162. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21869-9_22 22. Rowe, J.P., Shores, L.R., Mott, B.W., Lester, J.C.: Integrating learning, problem solving, and engagement in narrative-centered learning environments. Int. J. Artif. Intell. Educ. 21, 115–133 (2011) 23. Johnson, L.W., Valente, A.: Tactical language and culture training systems: Using artificial intelligence to teach foreign languages and cultures. AI Magazine 30, 72–83 (2009) 24. Dynarsky, M., et al.: Effectiveness of Reading and Mathematics Software Products: Findings from the First Student Cohort. U.S. Department of Education, Institute of Education Sciences, Washington (2007) 25. 
Steenbergen-Hu, S., Cooper, H.: A meta-analysis of the effectiveness of intelligent tutoring systems on college students’ academic learning. J. Educ. Psychol. 106, 331–347 (2013) 26. Dodds, P.V.W., Fletcher, J.D.: Opportunities for new “smart” learning environments enabled by next generation web capabilities. J. Educ. Multimedia Hypermedia 13, 391–404 (2004)

27. Kulik, J.A., Fletcher, J.D.: Effectiveness of intelligent tutoring systems: a meta-analytic review. Rev. Educ. Res. 85, 171–204 (2015) 28. Graesser, A.C., et al.: ElectronixTutor: an intelligent tutoring system with multiple learning resources. Int. J. STEM Educ. 5(15), 1–21 (2018) 29. Bloom, T.M.E.: Bloom’s Taxonomy of Educational Objectives. Longman, New York (1965) 30. Kyllonen, P.C., Shute, V.J.: Taxonomy of Learning Skills. Universal Energy Systems Inc., Dayton (1988) 31. Chaiklin, S.: The zone of proximal development in Vygotsky’s analysis of learning and instruction. Vygotsky’s Educ. Theor. Cult. Context 1, 39–64 (2003) 32. National Academy of Sciences, Engineering, and Medicine: How people learn II: Learners, contexts, and cultures. National Academies Press, Washington, D.C. (2018) 33. Graesser, A.C., Hu, X., Person, N.K., Jackson, G.T., Toth, J.: Modules and information retrieval facilities of the human use regulatory affairs advisor (HURAA). Int. J. E-Learn. 3 (4), 29–39 (2004) 34. VanLehn, K., Chung, G., Grover, S., Madni, A., Wetzel, J.: Learning science by constructing models: can dragoon increase learning without increasing the time required? Int. J. Artif. Intell. Educ. 26(4), 1033–1068 (2016) 35. Koedinger, K.R., Corbett, A.C., Perfetti, C.: The Knowledge-Learning-Instruction (KLI) framework: bridging the science-practice chasm to enhance robust student learning. Cogn. Sci. 36(5), 757–798 (2012)

Change Your Mind
Game Based AIS Can Reform Cognitive Behavior

Dov Jacobson1(&) and Brandt Dargue2(&)

1 GamesThatWork, Atlanta, GA, USA
[email protected]
2 The Boeing Company, St. Louis, MO 63166, USA
[email protected]

Abstract. Among the most challenging Learning Objectives are those that require the learner to unlearn behavior - especially behavior deeply rooted in ancient evolutionary selection [1]. One such challenge is the mitigation of cognitive bias. After observing that traditional linear training does not reliably deliver lasting improvement to such deeply held cognitive behavior [2], a group of researchers began to experiment with game-based learning. Much of the success of the game experiments can be attributed to the efficacy of Adaptive Instruction Systems (AIS). The principles of AIS are usually applied consciously to the design of game-based learning. Even when they are not explicitly invoked, the nature of good game design is inherently one of adaptive instruction. This is most easily demonstrated with a case study of one of the games - Enemy of Reason. Enemy of Reason has been shown to be a particularly efficacious tool for mitigating Cognitive Bias. A large scale and complex game, it employs a variety of mechanics rooted in game design, adaptive instructional design and in the literature of intelligence analysis. Enemy of Reason is acutely relevant to the current moment. It treats cognitive bias not only as a natural phenomenon but as the result of an attack by an enemy determined to wage psychological warfare.

Keywords: Adaptive instruction system · Game based learning · Adaptive learning · Serious games · Intelligent tutoring system



1 Introduction

Game Based Learning is a form of Adaptive Instructional System in which the tropes and mechanics of entertainment games are employed to gracefully deliver the adaptive content. The format is familiar and engaging to learners of all ages. While only certain AIS systems are games, the case can be made that all games are adaptive learning systems. This is especially true of digital computer and videogames.
Games are used to present Adaptive Instructional Systems in a form that is readily acceptable. This is not because the game structure disguises the AIS. On the contrary, the game structure makes transparent the adaptive nature of the learning experience. Many salient features that facilitate adaptive learning (e.g., continual assessment,
branching, repetition, even simple learner modeling) [3] are familiar features of games, and in that context, players have come to trust their value. Players expect certain adaptive learning features in a digital game and will employ them naturally. For example, a gameplayer expects and utilizes a tutorial level just as a textbook user expects a table of contents. Both are useful conventions of their respective learning media. This report presents a case study of Game Based Learning and examines how it realizes the goals of adaptive learning using mechanisms derived both from game science and from advanced learning design theory.

1.1 Enemy of Reason

Enemy of Reason was developed under IARPA’s SIRIUS program, by a diverse consortium of experts assembled and directed by Boeing’s Senior Learning Scientist, Brandt Dargue. Enemy of Reason is a plot-driven mystery adventure game, in which several high-speed mini-games are embedded. It is an Intelligent Learning System in which the player is immersed for several hours of computationally mediated experiences. Every player enjoys a unique sequence of experiences as the game guides players of varying backgrounds and diverse learning capabilities past a common threshold of competence. Like most learning games, and many entertainment games, the Enemy of Reason can be best seen as a complex, interactive and media-rich Adaptive Learning System. To elevate learners to a standard of mastery beyond that achieved by the control condition (a powerful but linear video lesson), the game employs several different pedagogic principles. Some of these are common to most media. Others are particular to game-based learning. These principles include • • • • • •

Schema Graphs Concept Navigation Synthetic Mentor Spaced Learning Just In Time Instruction Germane Cognitive Load

In the following pages, we will examine each of these learning mechanisms to see how they advance learning and how they fit together to form a successful adaptive system.

2 Learning Objective

The terminal learning objective is a lasting behavior change which replaces the player's instinctual heuristic decision-making with rational analytic judgement. Specifically, the player will acquire behavior that helps identify and mitigate certain pernicious cognitive biases.

2.1 Salience

The Enemy of Reason emerged from a team of game designers and experienced members of the intelligence community. Behind its playful allegory, the game warns of a new and unseen existential threat to democratic civilizations. Developed before Brexit and the 2016 US elections, the game was a harbinger of a new species of warfare. Nation shall not lift sword against nation. Nor will it hurl nuclear bombs. Instead, to topple our society, a foe attacks the thought process of our citizens. Enemy of Reason boldly illustrates such an attack. The game starts on a sunny September morning as the city-state of Capital City suffers a dramatic surprise attack. Explosions are heard and a dense red cloud rapidly engulfs the city (see Fig. 1). The scene evokes painful memories of urban destruction and panic grips the population (see Fig. 2).

Fig. 1. Capital city under the Red Cloud.

But the Red Cloud rapidly dissipates, revealing a city apparently intact. The attack inflicts no physical damage on the city’s infrastructure nor on the health of its terrified inhabitants. The nature of the weapon, if it was a weapon, is as mysterious as the identity of the attacker. In this game, the player is Ian Solitaire, a Capital City intelligence agent, assigned to solve two mysteries. Who did this? What did they do? The Red Cloud damage is very subtle. Many affected citizens deny that the cloud contained any destructive agent at all. As Ian Solitaire, the player will discover that it did. But before they can understand what hit the city, players must learn the dangers of heuristic thinking. By deploying the Red Cloud, the enemy unleashed a Cognitive Bias Virus. The virus amplifies dangerous heuristic thinking already present in the human host. These cognitive biases corrupt rational thought and unravel social trust.


Fig. 2. Within the Red Cloud, citizens panic.

Citizens afflicted by Confirmation Bias support their current opinion by avidly seeking consistent evidence. At the same time, they unconsciously distort and discount any contradictory observation. The biased citizen attempts to test a theory, but only reinforces it. Someone suffering from Fundamental Attribution Error sees the behavior of other citizens as evidence of their character, most often evidence of their character flaws. They overlook the fact that most human behavior is actually driven by circumstance [4]. Those living with Bias Blind Spot have learned to recognize biases in other people, while incorrectly assuming this skill inoculates them against the same biases [5]. All these biases are hard to fully understand, difficult to detect, and nearly impossible to mitigate. To save Capital City, the player embarks on a hero's journey during which these challenges are mastered.

3 Common Features of GBL and AIS

3.1 Spaced Learning

In order to shape behavior (as opposed to transferring declarative knowledge), repetition is critical. Small bites of learning delivered at regular intervals ("trickle training") are most effective for developing new habits. Since it strove to build new habits of thinking, Enemy of Reason was designed as an episodic game, intended to be deployed as a series of brief scenarios encountered daily, or even weekly. Other Adaptive Instructional Systems also exploit the advantages of spaced learning [6].

3.2 Schema Graphs

Learning objectives are structured into units of comprehension called schemas. Acquisition of each schema depends on the prior acquisition of its predicate schemas. The entire hierarchy of dependencies is often represented as a Directed Acyclic Graph (DAG). This graph provides a map of all the learning required to successfully complete the unit's primary learning objective. Learner model tracking software and game scenario sequencing engines can use such a map to flexibly and meaningfully organize player experience [7].
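As a minimal sketch of this idea (the schema names and the mastered set below are hypothetical, not taken from Enemy of Reason), a learner-model tracker might walk the dependency DAG to decide which schemas are currently teachable:

```python
# Hypothetical schema dependency DAG; each key maps a schema to its predicate schemas.
PREREQUISITES = {
    "deductive_reasoning": [],
    "recognize_confirmation_bias": ["deductive_reasoning"],
    "mitigate_confirmation_bias": ["recognize_confirmation_bias"],
    "recognize_bias_blind_spot": ["recognize_confirmation_bias"],
}

def teachable(mastered: set) -> list:
    """Schemas not yet mastered whose predicate schemas have all been acquired."""
    return [s for s, predicates in PREREQUISITES.items()
            if s not in mastered and all(p in mastered for p in predicates)]

print(teachable({"deductive_reasoning"}))
# -> ['recognize_confirmation_bias']
```

A scenario sequencing engine could use the same query to decide which playable nodes to surface next.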

3.3 Concept Navigation

In much Game-Based Learning, instructional topics are organized and distributed predictably across the gameplay topology. Gameplay topology can be geospatial or can refer to more abstract spaces such as narrative, puzzle, or conversational space. In a game, the learner typically has some degree of agency to move among these spaces and to determine the instructional path.

There are several levels of adaptive learning designed into Enemy of Reason. Practical implementation varies: while the simplest strategies are fully realized throughout the game, implementation of some complex adaptive mechanics is more irregular. In large part, this irregular implementation is a natural product of demands inherent in a narrative game of the 'adventure' genre. The playable nodes contain critical plot points and solution clues in addition to their cargo of learning content. Basic adaptive design might reorder the nodes to add or remove instructional scaffolding as needed. If not executed carefully, this adaptive tactic invites plot holes and missing clues that put the player at a great disadvantage or simply break the game. If the designers supply optional nodes that contain alternative learning (but similar plot elements), they invite a combinatorial explosion and a production nightmare that would break the budget.

3.4 Synthetic Mentor

One solution is to separate instruction from exposition at each node and deliver the latter even when the former is avoided. In Enemy of Reason, this is facilitated by the game design. Most instruction is delivered through Dr K, a character presented as an artificial anthropomorphic app embedded in the hand-held Think Machine. Dr K is herself an Intelligent Tutor. Her presence removes some of the burden of adaptive instruction from the gameplay graph.

A second limit to implementation of a simple adaptive strategy is predictability. The adaptive engine probes the player by offering a set of choices that assess the learner's state. These probes fit a pattern, and were they implemented strictly throughout the game, the pattern would be obvious to the player. Indeed, the design of an irregular decision space topology is one of the most difficult creative tasks in this type of project.


Just in Time Instruction

In classic pedagogy, instruction is followed by assessment. This assessment often follows the instruction at a considerable distance: at the end of a week of lessons, say, or at the midterm exam. By contrast, in game-based learning, as in AIS, instruction and assessment coincide. Finer analysis will show that while they may occur in the same few seconds of interaction, they are generally distinct phases.

In a well-designed game, assessment begins immediately (and never ends). It is integrated into the nature of gameplay. The player is offered a challenge that assumes mastery of the learning material. If the player fails the challenge, remedial instruction is given to cure the deficiency before the challenge is (in some form) repeated. The remediation can range from crude repetition to material selected from a matrix of alternatives, based on the state variables that track user experience, demonstrated competencies, and successful learning modalities. Meanwhile, players who demonstrate working competency move forward in the game unaware of the instruction they avoid [8].

A more complex design allows more nuanced adaptive features. At almost every node in the gameplay topology, the player has several action choices. Often, for pragmatic purposes, the designer has attached the same path forward to multiple choices, either directly or after the player experiences inconsequential activity. This is true, for instance, in the case discussed previously, where a game challenge is used to determine whether the learner has mastered a particular skill: the player's action either leads immediately further into the game narrative or engages the player in needed instruction first. The more complex design, like most sophisticated adaptive tutors, instead seeks to classify the player's deficiencies so these can be addressed.

At higher levels of Enemy of Reason, for example, the player is learning to avoid a cognitive phenomenon called Bias Blind Spot. This is the unfortunate heuristic that prevents those experts sensitized to the appearance of a cognitive bias in other people's reasoning from recognizing this same bias in their own. In one level of Enemy of Reason, the player plays as Consuela, a veteran of guerilla warfare who returns to her mountain cadre seeking information leading to the perpetrator of the Red Cloud attack (see Fig. 3). The player must guide her as she interviews her former comrades, assembling clues from these often very unreliable informants. Some guerillas may deceive Consuela, but far more often they report what they think. Unfortunately, the fighters do not think very clearly, and the player is expected to detect the cognitive biases that distort the reports Consuela receives. However, at another level, Consuela's behavior might be driven by the player's own cognitive bias. Fundamental Attribution Error, for example, can cause the player to assume malevolent intent in the actions of characters that were driven by simple situational constraints.

The game presents choices designed to assess the player's specific skills. Struggling players, directing the Consuela avatar, fail to employ the bias detection skills introduced much earlier in the game. Choosing an option that reveals this failure, they navigate into game sequences designed for remedial instruction. A better player might correctly detect and mitigate bias in her informant but fail to realize that she herself is distorting the informant's answer to conform to her own biases.
The game will drive this player into material designed to sensitize her to the problem of Bias Blind Spot.
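As a rough illustration of this classification step (not the game's actual engine; the state fields and node names below are assumptions), the post-challenge routing logic might look like this:

```python
from dataclasses import dataclass

@dataclass
class LearnerState:
    # Hypothetical state variables inferred from the player's choices.
    detects_bias_in_others: bool = False
    detects_bias_in_self: bool = False

def next_node(state: LearnerState) -> str:
    """Route the player after a challenge, mirroring the logic described above."""
    if not state.detects_bias_in_others:
        return "remedial_bias_detection"    # reteach the earlier detection skill
    if not state.detects_bias_in_self:
        return "bias_blind_spot_sequence"   # sensitize the player to Bias Blind Spot
    return "advance_narrative"              # expedite competent players forward

print(next_node(LearnerState(detects_bias_in_others=True)))
# -> 'bias_blind_spot_sequence'
```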


Fig. 3. Consuela interrogates guerillas to locate Stilo

Some players will identify both the guerilla's bias and their own. These players will be expedited forward in the game. In one scene, a love-struck young female fighter misinterprets the cadre leader's ordinary remarks as sly declarations of affection. Consuela must disregard this bias, since logic based on that interpretation will lead her away from the leader's true location. Consuela must also fight her own bias to consider the leader malevolent. When the female fighter cites, as a sign of affection, the black eye given to her by the leader, Consuela recoils. She is ready to consider it evidence of abuse. However, dispassionate investigation will reveal that the black eye resulted from neither abuse nor affection. It was an accidental slip during a martial arts exercise. Each player investigates as much as seems necessary before deciding what the black eye represents, and chooses an action to pursue based on this decision. This action reveals the player's competencies and selects the next learning experience.

More subtle adaptive design exploits features of the game environment. For example, a learning game design generally serves two distinct objectives. First, the player must master the game's extrinsic learning content. In our case, the player must master certain skills that mitigate several cognitive biases.

3.5 Germane Cognitive Load

At the same time, the same player, in the same game, must solve intrinsic game challenges. In Enemy of Reason, the player must identify and capture the perpetrator of the Red Cloud attack on Capital City from a broad array of enemies and rivals within the City and among her foreign adversaries. The fact that the game offers challenges in both its intrinsic puzzles and its extrinsic training should warn designers that they might overload the player. Proponents of classic Cognitive Load Theory will be especially cautious, since this appears to be the


sort of extraneous load that the Theory abhors. More game-friendly learning theorists admit the possibility that well-engineered game challenges may serve instead as germane cognitive load. By shouldering germane load, a learner acquires schemas which expedite further progress toward the core objectives. In Enemy of Reason, the game design indulges frequent excursions from the narrow instructional path. Each of these is designed to be germane cognitive load and delivers a constructive schema to the player.

For example, the player, seeking the perpetrator of the Red Cloud attack on Capital City, might seek evidence that implicates Beloved Leader, the dynastic autocrat who rules an isolated and militarized neighboring country. Beloved Leader brutalizes his starving and enslaved population but rallies their support by loudly threatening Capital City with annihilation. In chasing this rather obvious suspect, the player is cautioned to avoid confirmation bias. Nevertheless, many players surrender to heuristic thinking and pursue more and more evidence to confirm their suspicions.

Among the evidence that they amass are secret videos belonging to Beloved Leader's Intelligence Analyst Training division. Each of these items is a prized token for winning the game level. But the player can also view them. One video is an awkwardly translated Kunqu (Chinese opera) performance that glorifies deductive reasoning, an important prerequisite skill for those seeking to overcome confirmation bias. (The singers, of course, credit the Beloved Leader with the discovery of deductive reasoning.) Another displays the progress of a traditional Asian children's game with slate and tiles (see Fig. 4). The game, in fact, perfectly illustrates the Analysis of Competing Hypotheses (ACH), an intellectual tool for defeating confirmation bias, developed by Richard Heuer at the CIA's Sherman Kent School for Intelligence Analysis and a critical learning objective treated elsewhere in the game.

Fig. 4. Slate and tiles game to illustrate the Analysis of Competing Hypotheses (ACH).


To win, the player identifies the secret boxed token using the fewest moves. By choosing wooden clue tiles, the player works to disconfirm all but one of the three alternative solutions in the top row. Both the Chinese opera and the children’s game are, of course, completely fictional, invented to reinforce important schemas that support the core learning objectives.

4 Conclusion

Game-Based Learning employs purpose-built digital games ("serious games") to help players overcome real-world learning challenges. These games employ many sophisticated pedagogic features, many of which are immediately recognizable to the student of Adaptive Instructional Systems (AIS). While it is tempting to suggest these games generally include some form of AIS, it is more useful to declare that the games themselves comprise a particular class of Adaptive Instructional System. A Venn diagram of the intersection of the pedagogic feature set of advanced GBL with that of state-of-the-art non-game AIS would be dominated by their common overlap. Nevertheless, each discipline has developed areas of specialization.

Practitioners of Game-Based Learning would do well to follow the mainstream AIS literature. Work being advanced in learner modeling, data-driven artificial intelligence, and advanced sensors can be used to upgrade serious game engines. In general, the game designer is better equipped to offer the learner attractive, meaningful choices than to determine which choices the learner currently needs. Conversely, proponents of non-game Intelligent Tutoring Systems and traditional Adaptive Instructional Systems would profit from a study of the work done in Game-Based Learning. Rich media assets are an obvious advantage of games, but these are not the most important. Games achieve learning by leveraging player agency and by exercising constructive failure. These, more than the media assets, are attributes that can find their way into traditional Adaptive Instructional Systems.

Acknowledgements. This research was supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Air Force Research Laboratory. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.

References

1. Staddon, J.: Adaptive Behavior and Learning. Cambridge University Press, Cambridge (2016). https://doi.org/10.1017/CBO9781139998369
2. Croskerry, P.: Cognitive bias mitigation: becoming better diagnosticians. In: Diagnosis: Interpreting the Shadows, pp. 257–287. CRC Taylor and Francis Group, Boca Raton (2017)
3. Park, O.C., Lee, J.: Adaptive instructional systems. Educ. Technol. Res. Dev. 25, 651–684 (2003)


4. Langdridge, D., Butt, T.: The fundamental attribution error: a phenomenological critique. Br. J. Soc. Psychol. 43(3), 357–369 (2004). https://doi.org/10.1348/0144666042037962
5. West, R.F., Meserve, R.J., Stanovich, K.E.: Cognitive sophistication does not attenuate the bias blind spot. J. Pers. Soc. Psychol. 103(3), 506 (2012)
6. Mettler, E., Massey, C.M., Kellman, P.J.: Improving adaptive learning technology through the use of response times. Grantee Submission, July 2011
7. Peirce, N.: The Non-invasive Personalisation of Educational Video Games. Doctoral dissertation, Trinity College Dublin (2013)
8. Shute, V.J., Zapata-Rivera, D.: Adaptive educational systems. In: Durlach, P. (ed.) Adaptive Technologies for Training and Education, pp. 7–27. Cambridge University Press, New York (2012)

Developing Authoring Tools for Simulation-Based Intelligent Tutoring Systems: Lessons Learned

James E. McCarthy1, Justin Kennedy2, Jonathan Grant2, and Mike Bailey2

1 Sonalysts, Inc., Fairborn, OH 45324, USA
[email protected]
2 Sonalysts, Inc., Waterford, CT 06385, USA
{jkennedy,jgrant,mbailey}@sonalysts.com

Abstract. Intelligent tutoring systems have a long history of significantly improving student performance. Unfortunately, they also have a long history of being very expensive to develop while producing very short shelf lives. To address these deficiencies, we developed and evaluated an authoring suite known as the Rapid Adaptive Coaching Environment (RACE). While developing RACE, we frequently encountered tension between maximizing the power of the intelligent tutoring system that authors would produce and minimizing the level of effort associated with producing this tool. In this paper, we will attempt to highlight this tension as we consider the five core design challenges that we addressed in the RACE prototype. We will review the formal and informal evaluations conducted by the development team and discuss how those evaluations led to the maturation of the design.

Keywords: Intelligent tutoring · Authoring systems · User-centered development

1 Introduction

Over the past 30 years, the United States Department of Defense (DoD) has been a leading developer of intelligent tutoring systems (ITSs; Fletcher 1988; McCarthy 2008). Perhaps one of the most thoroughly researched intelligent tutoring systems in the military domain is SHERLOCK, developed near the beginning of the 1990s (Katz and Lesgold 1993; Lesgold et al. 1992). Like many other military ITSs, SHERLOCK was developed to improve troubleshooting skills, this time in conjunction with the F-15 Avionics Test Station. The original SHERLOCK research demonstrated that about 25 h of practice in that environment had an impact on post-test performance equal to about 4 years of on-the-job experience (Corbett et al. 1997). Gott et al. (1995) followed up with a report that reviewed five separate evaluations of SHERLOCK. That review showed a performance improvement from the 50th percentile to the 85th percentile (an effect size of d = 1.05 standard deviations, considered large and robust).



However, despite this long history and many demonstrable benefits, very few intelligent tutoring systems are currently in use within the DoD. There are several reasons for this, including the characteristics of the tutors that have been developed, the level of effort associated with their development, and the difficulty and cost associated with maintaining these systems over time. The current effort was designed to address these limitations. Specifically, our team set out to develop an authoring system that would make it easier, less expensive, and less time-consuming to develop and maintain intelligent tutoring systems. Our charter mandated that we address problems and systems representative of those encountered throughout the military, and that we do so in a way that was independent of a particular simulation environment or tutoring engine.

The observation that intelligent tutoring systems are very powerful but are hampered by high development and maintenance costs is not a new one. Several "workarounds" have attempted to make intelligent tutoring systems more cost-effective. While many of the approaches have real merit, they all exhibit deficiencies. For example, machine learning approaches have been used to develop a multidimensional representation of what a correct solution looks like within a complex problem space. A student's performance is compared to this standard and a proficiency assessment is made. A critical deficiency of machine learning approaches is that they yield only "black box" models of expertise. The student's score indicates how much his/her performance differs from what the tutor had been taught to expect, but not why the performance is different. Real-time feedback is not possible.

An alternative to machine learning approaches is to use formal cognitive modeling architectures and languages to develop glass-box cognitive models of expert performance. These architectures generally provide a programming language that allows users to model human cognitive processes and behaviors. However, these languages almost always are specified at a primitive level, similar to assembly code (Cohen et al. 2005; Ritter and Koedinger 1995). This makes it very time consuming to develop the models, and usually the models are over-specified. That is, they include details that either cannot be supported or that are irrelevant to the task at hand (Amant et al. 2005). To address these concerns, several groups have worked to simplify this programming process.

Across these efforts, the strengths and weaknesses of the high-level language approach very nearly complement those of the machine learning approach. High-level languages can produce glass-box representations of expertise and offer the opportunity to provide real-time performance assessment and feedback. Unfortunately, they still are based on relatively conventional task analyses and require manual model development involving a relatively large team of domain experts, knowledge engineers, and software engineers. While this development process is unquestionably easier, it still requires technical knowledge of both the language itself and the underlying architecture(s). As a result, while high-level languages can reduce development and maintenance costs, they do so incrementally. They do not provide the order-of-magnitude cost savings we desire from this project.

Instead of adopting either of these approaches, our team focused on demonstration-based authoring of intelligent tutoring systems (cf., Koedinger et al. 2003).
We felt that, like machine learning approaches, demonstration-based authoring would allow experts to do what they do best – perform – while also producing glass-box representations like those offered by expert models defined with high-level languages. Such a technique would reduce the cost and level of effort associated with expert model development dramatically.


Academic research has indicated that demonstration-based tutors have the potential to increase the efficiency of tutor production by an order-of-magnitude (Koedinger et al. 2003; Aleven et al. 2009). Further, they produce glass-box tutors that can provide real-time feedback and hints. In doing so, they avoid the cost of complex task analyses and replace programmers and instructional designers with subject-matter experts (SMEs). However, this approach is not without problems of its own. These tutors generally have been developed for relatively simple domains such as introductory mathematics. These simple, static domains simplify the problem considerably along a number of dimensions.

One characteristic of the demonstration-based tutors developed for simple domains is that their user interface requirements are easily met. Few user controls are needed and those that are needed are common. As a result, toolboxes of reusable and instrumented widgets are easier to create. Across military domains (aircraft operations, tactical system employment, etc.), interfaces are likely to be much more complex, thus building an interface from a toolbox is less practical. Similarly, the simple domains associated with demonstration-based tutors developed so far have been static. Once a problem is presented, it remains the same except for the actions of the operator. This is not true of many military domains; dynamically shifting situations and priorities require the system to constantly reassess which actions are most appropriate. Moreover, the content associated with these simple domains rarely changes in any significant way over the course of years. Across military domains, change is relatively constant.

Within this project, we set out to address these challenges within a prototype authoring tool known as the Rapid Adaptive Coaching Environment (RACE). While developing RACE, we frequently needed to balance our efforts to maximize the power of the intelligent tutoring system that authors would produce while minimizing the level of effort associated with producing the tool. In the discussion that follows, we will attempt to highlight this tension as we consider the five core design challenges that we addressed in the RACE prototype:

1. Assessment of Evolving Situations
2. Capturing State and Action Information from Arbitrary Simulations
3. Exporting Assessment Models for Arbitrary Intelligent Tutors
4. Specifying Evaluation Standards, and
5. Providing Instructive Hints and Feedback.

2 Baseline Approach

Our initial approach to developing RACE focused on the development of a learning objective (LO) hierarchy and the association of each LO with a scenario event that would serve as a practice opportunity. The following subsections provide an overview of this process.

2.1 Outlining

The grand vision for RACE included the ability to develop both simulation-based intelligent tutoring that would allow students to independently practice skills and media-based adaptive interactive multimedia instruction (IMI) that would allow students to master the fundamentals of a given domain. As such, it was very important to the development team to establish a solid foundation for this work through the development of a well-structured hierarchy of learning objectives (LOs). However, we were also aware that the likely users of RACE were not instructional designers per se. Rather, they were likely to be SMEs who were, perhaps, serving as instructors. Therefore, we felt that it was important to scaffold their performance.

To do this, we started by asking authors to identify the thing that was probably most familiar to them: the tasks that they wanted their students to be able to perform when they were done with training. The development team felt (and later evaluations confirmed) that the SMEs/authors would have clear expectations about this. Next, we asked the authors to take one small step away from tasks and think about what the students would have to do to complete each task. That is, what skills are required for successful task performance? Scaffolds, in the form of general directions, more detailed explanations, and access to a very detailed "how-to" guide, supported authors in this process. Just as we asked the authors to identify the skills that were needed to perform the task, we also asked the authors to identify the knowledge that was required to perform each skill. That is, what facts, concepts, principles, etc. must be mastered before the student can successfully perform the skill?

The last step in this outlining process guided the authors through defining learning objectives for each of the skill and knowledge elements that they had defined. Within this process, the authors selected (1) an LO Type (Fact, Procedure, etc.) that was consistent with the current element type (knowledge vs. skill), (2) an LO Level that indicated the depth of mastery that was required (Remember, Use, etc.), and (3) a verb that was appropriate for that Type and Level (State, Configure, etc.). RACE then asked the author to specify an object for the verb (e.g., "the primary components of the radar receiver" or "the radar receiver for in-port operations"). Together, these elements allowed RACE to define LOs with known outcome types. The final LO took the form "The student will be able to VERB OBJECT" (e.g., "The student will be able to state the primary components of the radar receiver.").
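A compact sketch of how such an LO record might be captured and validated follows; the Type/Level/verb vocabularies shown are abbreviated assumptions rather than RACE's actual lists:

```python
from dataclasses import dataclass

# Abbreviated, illustrative vocabularies; RACE's real Type/Level/verb tables are larger.
VALID_VERBS = {
    ("Fact", "Remember"): {"State", "List"},
    ("Procedure", "Use"): {"Configure", "Operate"},
}

@dataclass
class LearningObjective:
    lo_type: str   # e.g., "Fact" or "Procedure"
    level: str     # e.g., "Remember" or "Use"
    verb: str      # must be appropriate for the (Type, Level) pair
    obj: str       # the object of the verb

    def __post_init__(self):
        if self.verb not in VALID_VERBS.get((self.lo_type, self.level), set()):
            raise ValueError(f"{self.verb!r} is not valid for {self.lo_type}/{self.level}")

    def text(self) -> str:
        return f"The student will be able to {self.verb.lower()} {self.obj}."

lo = LearningObjective("Fact", "Remember", "State",
                       "the primary components of the radar receiver")
print(lo.text())
# -> The student will be able to state the primary components of the radar receiver.
```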

2.2 Developing Practice Scenarios

Next, the author was asked to develop practice scenarios. This was a two-step process. First, RACE guided the author through the process of associating each skill-based LO with one or more scenario events that could provide an opportunity to practice the associated skill (or check for mastery of that skill). The second step in the process was to actually develop a scenario that included that event. RACE was designed to work with any conformant simulation system. As a result, it did not include scenario development support; the development team felt that it was more appropriate to leave that on the simulation “side of the fence.”

2.3 Demonstrating Solutions

To demonstrate a solution, the author would select a skill-based learning objective and launch an associated scenario within the simulation environment. For our prototype, we used various applications developed within the Standard Space Trainer (SST) framework (Nullmeyer and Bennett 2008). Both RACE and the simulation connected to a piece of middleware. Using a well-defined interface, the middleware component captured data from the simulation, reformatted it, and passed it to RACE for processing.

To demonstrate the correct procedure for the event (and, by extension, the LO), the author had to do two things. First, the author had to use a RACE interface to define a situation assessment rule that the system would use to recognize that the preconditions for the LO had been satisfied and that the student should begin the associated procedure. Second, the author had to demonstrate the appropriate procedure. As the author conducted the demonstration, the simulation passed the recorded actions to the middleware component and on to RACE, where they appeared on a display that the author could review (see Fig. 1).

Fig. 1. RACE review captured action screen. Authors use this screen to review, edit, and organize procedures demonstrated within the associated simulation system.

Using this interface, authors could delete or re-order steps to correct minor errors made during the demonstration process. In addition, they were encouraged to organize steps in the process according to the goals that were being pursued. These organization labels were used in the feedback development process.
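A simplified sketch of the kind of action record the middleware might forward, and of the delete/re-order/label edits supported on the review screen, is shown below; the field names and control names are assumptions, not the actual SST or RACE data format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CapturedAction:
    control: str                        # which control the author manipulated
    value: Optional[str] = None         # the value entered, if any
    goal_label: Optional[str] = None    # organization label used later for feedback

# Actions recorded during a demonstration (one step was mis-keyed).
demo = [
    CapturedAction("power_switch", "ON"),
    CapturedAction("frequency_entry", "1250"),
    CapturedAction("frequency_entry", "1255"),  # erroneous extra step
    CapturedAction("transmit_button"),
]

demo.pop(2)  # author deletes the erroneous step on the review screen
for step in demo[:2]:
    step.goal_label = "Configure transmitter"   # author groups steps by goal
demo[2].goal_label = "Begin transmission"

for step in demo:
    print(f"{step.goal_label}: {step.control} {step.value or ''}".strip())
```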

2.4 Setting Evaluation Standards

After reviewing and refining their demonstrations, authors were asked to define evaluation standards that would specify how precisely they expected students to mirror their demonstrated procedures. Authors used the RACE interface to specify standards for accuracy and pacing. Accuracy standards included things like using the right control or entering the correct value. Pacing standards included the order in which users performed tasks and how much time elapsed between actions. Different types of actions had different standards associated with them, and the authors could choose to enable or disable standards or to adjust the values against which student performance would be judged (including adjusting the "margin of error" that is acceptable).
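A sketch of how such standards might be applied to a single student action appears below; the error codes, field names, and thresholds are illustrative assumptions:

```python
def check_step(expected: dict, observed: dict, value_tolerance=0.0, max_elapsed_s=None) -> list:
    """Compare one observed student action against the demonstrated step.

    Returns a list of error codes; an empty list means the step met the
    enabled accuracy and pacing standards.
    """
    errors = []
    if observed["control"] != expected["control"]:
        errors.append("WRONG_CONTROL")
    elif expected.get("value") is not None:
        if abs(float(observed["value"]) - float(expected["value"])) > value_tolerance:
            errors.append("WRONG_VALUE")
    if max_elapsed_s is not None and observed.get("elapsed_s", 0) > max_elapsed_s:
        errors.append("TOO_SLOW")
    return errors

print(check_step({"control": "frequency_entry", "value": 1250},
                 {"control": "frequency_entry", "value": 1252, "elapsed_s": 4.0},
                 value_tolerance=5, max_elapsed_s=10))
# -> []
```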

2.5 Validating Coaching Statements

RACE automatically generated hints and feedback that the intelligent tutoring system could deliver to students to enhance their performance. RACE applied a grammar to the captured actions to create this feedback. Each action was associated with three levels of progressively more directive hints. Similarly, RACE associated each correct action with three levels of progressively more detailed feedback. RACE also associated each action with an arbitrary number of coaching statements that the intelligent tutor could deliver when the student made an error. The specific evaluation standards that the author selected were associated with specific types of errors. Each type of error was associated with up to five levels of progressively more detailed feedback. The authors' task was to review this automatically generated feedback and make any adjustments that they felt improved its quality.
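As a toy illustration of this kind of template grammar (the templates below are invented; they are not RACE's actual generation rules), one demonstrated action could yield three progressively more directive hints:

```python
def hints_for(action: dict) -> list:
    """Three progressively more directive hints for one demonstrated action.

    The verb/control/value fields and the phrasing templates are assumptions
    made for illustration.
    """
    verb, control, value = action["verb"], action["control"], action["value"]
    return [
        f"Think about what you need to do with the {control}.",
        f"You need to {verb} the {control}.",
        f"{verb.capitalize()} the {control} and set it to {value}.",
    ]

for hint in hints_for({"verb": "set", "control": "frequency entry field", "value": 1250}):
    print(hint)
```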

2.6 Supporting Multiple Tutoring Environments

Just as RACE was designed to be independent of a given simulation environment, it was also designed to be independent of a given tutoring engine. To pursue this goal, the development team adopted an industry standard for the specification of task models: The W3C XML Task Model (2014). The W3C XML task model includes a metamodel expressed in Unified Modeling Language (UML) and an eXtensible Markup Language (XML) Schema. While the task model has been targeted as the basis for interchange of task models between different user interface development tools, it is general enough to express most of the concepts involved in the RACE assessment model, and can be extended to cover the rest. A key attribute of the W3C Task Model is that it already includes a number of operators to describe relations between tasks. Those operators include interleaving, order independence, synchronization, parallelism, choice, disabling, suspend-resume, enabling, iteration and optional. For our prototyping effort, middleware was developed that allows Sonalysts’ intelligent tutoring system, ExpertTrain (McCarthy 2008), to ingest the W3C Task model and convert the data that it contained into the data structures needed by ExpertTrain to support automated performance assessment and coaching.
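As a rough illustration of the export step, a demonstrated goal and its steps could be serialized to an XML task description as sketched below. The element and attribute names here are simplified stand-ins chosen for readability; they are not the actual W3C XML Task Model schema:

```python
import xml.etree.ElementTree as ET

def export_task_model(goal_name: str, steps: list, operator: str = "enabling") -> str:
    """Serialize a demonstrated goal as a simple XML task description (illustrative schema)."""
    root = ET.Element("TaskModel")
    task = ET.SubElement(root, "Task", name=goal_name, operator=operator)
    for step in steps:
        ET.SubElement(task, "Subtask", name=step["name"], control=step["control"])
    return ET.tostring(root, encoding="unicode")

print(export_task_model(
    "Configure transmitter",
    [{"name": "power_on", "control": "power_switch"},
     {"name": "set_frequency", "control": "frequency_entry"}],
))
```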


3 Evaluations and Revisions

Throughout the development process, the RACE team conducted a series of evaluations of the emerging prototype. Two of these were formal evaluations with potential users of the simulation; others were less formal review and feedback sessions conducted in concert with our Agile development approach (cf. Schwaber and Beedle 2002).

3.1 Formal Evaluations

Formal evaluations of RACE were conducted in 2016 and 2017. In these evaluations, U.S. Air Force instructors were asked to use the RACE software to complete authoring (2016/17) or coaching (2017) tasks. Their performance was monitored to identify aspects of the process that could be improved. Further insights were gathered via participant surveys and debrief discussions. Indicative results are shown in Table 1. Generally, the participants felt strongly about the need for the type of independent practice offered by RACE, and they were relatively happy with the ease of the authoring process.

Table 1. Representative results from RACE evaluations (Authoring Ease and Coaching Value ratings for the 2016 and 2017 evaluations).

3.2 Informal Evaluation

Despite the overall favorable results of the evaluations, the development team was left with the impression that improvement was possible. One source of this impression was the debrief comments made by participants at the conclusion of the formal evaluations. A more powerful source was the less formal comments that we received at the end of


each development sprint. In this section, we review some of those findings and the steps that we took to enhance the performance of the system.

Authoring Approach was Too Restrictive. As the development team studied the feedback, certain trends emerged. First, it was clear that authors wanted to be able to define multiple correct procedures for many tasks, but that the current approach to doing so felt cumbersome. They wanted an easier way to provide students with the flexibility to complete tasks in a variety of ways. Second, because procedures were frequently repeated in support of various tasks, they wanted an ability to easily re-use the procedures previously demonstrated. Third, in some cases, task procedures had complex timing and sequential patterns. There was a concern that RACE wasn't sufficiently expressive to support user needs as we moved from a prototype to a fielded system.

To address these challenges, the design team undertook an investigation regarding the ability to create a more graphical "front end" to the authoring process. Our goal was not to implement this approach, but to determine if we could create a design concept that felt more natural to users. Sonalysts developed a "Performance Roadmap" approach to authoring. With this approach, the author would select an LO and then use a visual workspace and icons to build a sequence of goals that would satisfy that learning objective. Each goal, in turn, would have one or more methods associated with it. A method would be a path for a response: a series of steps to accomplish a specific task or satisfy a goal. This approach would allow authors the flexibility to build many ways to allow students to perform a task, answering one of the usability concerns.

After building a timeline, an author would "populate" method blocks in one of two ways. First, an author could import an existing method from an ever-growing RACE library. This would allow the re-use of "building blocks," answering the second of the usability concerns. Second, an author could demonstrate new methods using a simulation linked to the learning objective. Besides being saved to the current roadmap, newly recorded methods would be saved to the RACE library and become available for import into future roadmaps. After populating method blocks, an author would individually adjust timing and accuracy standards for each using a visual interface. A representation of this approach is shown in Fig. 2.

Concern About Coaching Level and Associated Level of Effort. With our new, more graphical authoring approach in hand, the development team reviewed it with a focus group. The user population expressed continued concern about the level of effort associated with the authoring task and their ability to support that level while doing their "day job." To help reduce the level of effort, the audience suggested doing away with methods and focusing instead on outcomes (e.g., "we only care if they get the job done, not how they do it."). In reaction to this feedback, the development team undertook parallel analysis tasks. One group investigated whether an outcome-based approach was practical. The second group investigated a mechanism for simplifying the authoring process while maintaining process-oriented methods.

The development team concluded that a purely "outcome-based" approach to supporting independent practice was possible, but probably not advisable. There were several reasons for this determination.
First, although the SST environment included the level of detail necessary to define required outcomes, we could not be sure that was true


Fig. 2. Representation of graphical roadmap approach to authoring

for all simulation environments, and adopting this approach might violate our goal of being simulation-agnostic. Second, the very richness of the SST data model that made outcome-based assessment possible might make developing outcome recognition rules complex. As a result, the reduction in the level of effort that authors were seeking might not be realized. Third, the outcome-based strategy would limit the breadth of coaching that was available to whether or not the outcome was achieved, not why. As a result, the system would have a much diminished ability to help students improve.

Along the other path (simplifying process-based authoring), the development team proceeded by making a few simplifying assumptions. The primary simplification was to replace the "task-skill-knowledge-LO" decomposition process, which would support both skill- and knowledge-focused instruction and practice, with a scenario-goal-method decomposition focused on supporting skill-based practice. Within this framework, the authoring process would have two major tasks. First, the author would decompose each scenario in the practice curriculum into a sequential series of usable goals. Second, the author would demonstrate the methods associated with each goal. By limiting the demonstration process to one goal at a time, this approach avoided the combinatorial explosion among methods that could sometimes occur within the LO-centric approach.

To explore this design concept, the design team proceeded along two paths. First, we wanted to see if this approach could produce high-value coaching. To explore this, we created a practice curriculum comprising eight scenarios that spanned an introductory curriculum in space operations. An SME analyzed the procedures associated with each practice activity and decomposed them into a set of relatively small goals that had a high degree of re-use across the practice activities. This process produced approximately 56 small re-usable goals. In addition, the SME created a sequence chart that illustrated when each goal could be completed. Figure 3 illustrates the outcome for a relatively simple and a relatively complex practice activity.


Fig. 3. Sample goal sequences for a relatively simple practice activity and a relatively complex practice activity.

We then updated the RACE software to allow us to populate these goals with demonstrated methods and to replicate these goal sequences within the coaching sequence. To approximate the outcome-oriented coaching that the instructors preferred, we disabled most of the evaluation standards (e.g., we no longer assessed pacing or the parameters that students entered). We then asked our SME to review the resultant performance assessment and coaching. These "pilot tests" allowed us to make further refinements to how the RACE coaching engine responded to errors, how the system recognized that the students thought they were done, and how the system summarized student performance following the end of an activity.

Second, the development team replaced the graphical authoring approach discussed earlier with one that provided explicit support for defining the methods associated with each goal and sequencing those goals to support a given practice activity. The revised approach asked authors to demonstrate the methods for one goal at a time. In doing so, it avoided overwhelming the instructors. Further, developing the roadmap was simply a process of importing completed goals. To provide a little more flexibility, we also allowed authors to disable methods if they felt that they were not appropriate in a given setting and to adjust the performance standards that would be enforced within a given practice activity. The result seems to strike a good balance between expressive flexibility and simplicity.
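A sketch of the goal/method library implied by this revised approach follows; the class structure, names, and example content are assumptions for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class Method:
    name: str
    steps: list
    enabled: bool = True   # authors may disable a method for a given practice activity

@dataclass
class Goal:
    name: str
    methods: list = field(default_factory=list)

library = {}   # re-usable goals built up from demonstrations

def record_goal(name: str, demonstrations: list) -> Goal:
    """Store the methods demonstrated for one goal and make them importable later."""
    goal = Goal(name, [Method(f"{name} (method {i + 1})", steps)
                       for i, steps in enumerate(demonstrations)])
    library[name] = goal
    return goal

record_goal("Acquire satellite", [["slew_antenna", "lock_signal"]])

# A practice activity's roadmap is simply a sequence of imported goals.
roadmap = [library["Acquire satellite"]]
print([g.name for g in roadmap])
```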


4 Discussion

The RACE project explored the ability to develop authoring tools that would make it simpler to develop and maintain intelligent tutoring systems that can enable independent student practice. From a technology perspective, the development team felt that they were successful. The most daunting technical challenge was developing a generalized and usable approach to situation assessment. We made progress on this front, but ultimately other factors (specifically, the explicit definition of goal sequences to simplify other facets of authoring) obviated the need for a complete solution. This is an area ripe for further exploration in a cross-simulation/cross-domain environment.

A second challenge was the need to make the authoring as simple and intuitive as possible for our authors while maintaining the full power of an ITS. Our efforts here were helped by the inclusion of formal usability assessments within the effort. Perhaps more strikingly, our efforts were also enhanced by the periodic feedback afforded by the monthly development "sprints" used within these efforts. The Agile development approach allowed us to experiment with ideas and to get timely internal/external feedback on what "worked" while there was still time to benefit from the lessons learned. We plan to combine these two methods in most of our future research and development efforts.

Across the span of the prototyping effort, the development team feels that they have made significant progress in demonstrating the ability to field an authoring tool that is agnostic to both the hosting simulation environment and the tutoring engine and that produces high-quality coaching. We look forward to the opportunity to continue to test and refine this emerging technology.

References

Aleven, V., McLaren, B.M., Sewall, J., Koedinger, K.R.: A new paradigm for intelligent tutoring systems: example-tracing tutors. Int. J. Artif. Intell. Educ. 19(2), 105–154 (2009)
Amant, R.S., Freed, A.R., Ritter, F.E.: Specifying ACT-R models of user interaction with a GOMS language. Cogn. Syst. Res. 6(1), 71–88 (2005)
Cohen, M.A., Ritter, F.E., Haynes, S.R.: Herbal: a high-level language and development environment for developing cognitive models in Soar. In: Proceedings of the 14th Conference on Behavior Representation in Modeling and Simulation, pp. 133–140 (2005)
Corbett, A.T., Koedinger, K.R., Anderson, J.R.: Intelligent tutoring systems. In: Helander, M.G., Landauer, T.K., Prabhu, P.V. (eds.) Handbook of Human-Computer Interaction. Elsevier Science B.V., Amsterdam (1997)
Fletcher, J.D.: Intelligent training systems in the military. In: Andriole, S.J., Hopple, G.W. (eds.) Defense Applications of Artificial Intelligence: Progress and Prospects. Lexington Books, Lexington (1988)
Gott, S.P., Kane, R.S., Lesgold, A.: Tutoring for transfer of technical competence. Air Force Technical Report AL/HR-TP-1995-0002. Armstrong Laboratory, Human Resources Directorate, Brooks AFB, TX (1995)
Katz, S., Lesgold, A.: The role of the tutor in computer-based collaborative learning situations. In: Lajoie, S.P., Derry, S.J. (eds.) Computers as Cognitive Tools. Erlbaum, Hillsdale (1993)


Koedinger, K.R., Aleven, V., Heffernan, N.: Toward a rapid development environment for Cognitive Tutors. In: Artificial Intelligence in Education: Shaping the Future of Learning through Intelligent Technologies, Proceedings of AI-ED 2003. IOS Press (2003)
Lesgold, A., Lajoie, S., Bunzo, M., Eggan, G.: SHERLOCK: a coached practice environment for an electronics troubleshooting job. In: Larkin, J., Chabay, R. (eds.) Computer-Assisted Instruction and Intelligent Tutoring Systems: Shared Goals and Complementary Approaches. Lawrence Erlbaum Associates, Hillsdale (1992)
McCarthy, J.E.: Military applications of adaptive training technology. In: Lytras, M.D., Gaševiće, D., Ordóñez de Pablos, P., Huang, W. (eds.) Technology Enhanced Learning: Best Practices. IGI Publishing, Hershey (2008)
Nullmeyer, R., Bennett, W.: Leading-edge training research is catalyst for training transformation in space operations. Fight's On 7(1), 1 (2008)
Ritter, S., Koedinger, K.R.: Towards lightweight tutoring agents. In: Proceedings of AI-ED 1995 - World Conference on Artificial Intelligence in Education, Washington, DC, pp. 91–98, August 1995
Schwaber, K., Beedle, M.: Agile Software Development with Scrum, vol. 1. Prentice Hall, Upper Saddle River (2002)

Ibigkas! 2.0: Directions for the Design of an Adaptive Mobile-Assisted Language Learning App

Ma. Mercedes T. Rodrigo1, Jaclyn Ocumpaugh2, Dominique Marie Antoinette Manahan1, and Jonathan D. L. Casano1,3

1 Ateneo de Manila University, Quezon City, Philippines
[email protected], [email protected]
2 University of Pennsylvania, Philadelphia, PA, USA
3 Ateneo de Naga University, Naga City, Philippines

Abstract. Ibigkas! is a team-based mobile-assisted language learning application that provides students with English language practice. Working collaboratively rather than competitively, players must find the rhyme, synonym, or antonym of a given target word among different lists of words on their mobile phones. At this time, Ibigkas! is not adaptive. In order to anticipate the needs of an adaptive version of the game, we conducted a workshop in which students and teachers from the target demographic played the game and then participated in focus group discussions. Based on their feedback, we conclude that an adaptive version of the game should include metacognitive support and a scoring system that enables monitoring of individual performance based on individual mistakes or non-response. Tracking of individual performance will enable us to build in other articulated student and teacher preferences such as levelling up, rankings, adaptive difficulty level adjustment, and personalized post-game support.

Keywords: CSCL · Ibigkas! · English language learning

1 Adaptation Systems in Collaborative Contexts

Computer-supported collaborative learning (CSCL) is a branch of learning science concerning how people learn together with computers [11]. Collaboration differs from cooperation. Cooperative learning occurs when individuals accomplish assigned tasks independently and combine their work to arrive at a single output. Collaborative learning, on the other hand, occurs when learning is constructed socially, through a process of negotiation and sharing [11]. Thus, successful collaborative learning means successfully learning the subject matter and successfully working with others [9].

Traditionally, CSCL environments provided learners with a variety of communication tools to support their activities: email, chats, discussion fora, audio and video conferencing, and others. Recent years have seen a growing interest in the application of artificial intelligence (AI) to automatically and adaptively provide learners with the scaffolding they need in order to learn the content and to work with each other


productively [see 5, 11]. Indeed, CSCL applications aim to capture knowledge of group activity and use it to better support group interactions [5].

Magnisalis and colleagues [5] characterize adaptive CSCL along several dimensions. Pedagogical objective is the general pedagogical goal of the system, i.e., whether its goal is to present the values of selected activity types (mirroring), to provide information on what productive activity might be (metacognitive), or to advise learners towards behaviors that might improve collaboration (guiding). Target of the intervention refers to the aspect of the learning task that the system's intelligence supports. The target could be group formation, content support, or peer interaction support. Modeling refers to the aspects of the domain, pedagogy, activity, and learners that the system internally represents in order to make informed adaptations. Closely linked to modeling are the technologies used to generate these models, which include AI and non-AI techniques. Finally, design space refers to how these adaptations are presented to the learner: explicitly or implicitly, directly or indirectly.

Creating adaptive CSCL is a non-trivial undertaking for at least two reasons. First, group interactions are inherently complex. The best possible group interactions are marked with idea generation, constructive criticism, a plurality of viewpoints, shared understanding, and peer support [9]. In the more usual instances, and in the absence of pressure from a teacher, learners can fail to interact productively with their peers [4], loafing instead and depending on a few competent, responsible group members to complete the work [2]. Second, existing models for adaptation or intelligent support tend to assume individualized learning [5], possibly because theoretical frameworks and models for team interactions have not yet reached a level of maturity that translates easily to software [3]. In group contexts, how do we balance the giving and withholding of assistance? What type of assistance should be provided: cognitive, social, metacognitive, affective? Should the assistance be presented to a specific group member, to another group member, or through a change in the learning environment? These are all open questions [8] and are likely dependent upon individual considerations (e.g., student knowledge) as well as culturally-specific expectations.

Researchers have made inroads into finding some answers. At the conceptual level, Chopade and colleagues [1] offer a framework for constructing intelligent tutoring systems for teams. It augments classic intelligent tutoring system architecture components (learner model, domain knowledge, interface) with new units needed to support team work, e.g., a team model and a dynamic team adaptation module. Soller and Lesgold [10] compare an array of computational methods of modeling the task-based and social processes of collaborative learning through fine-grained analysis of team interactions. At the level of implementation, Viswanathan and VanLehn [12] made use of tablet gestures and superficial speech features to measure levels of collaboration among pairs of students solving complex math problems. Vizcaino and colleagues [13] created and tested a collaborative, adaptive environment for supporting programming. The system compares group behaviors against known patterns of behavior to decide which content the group needs or what behaviors it should encourage.


Researchers have also looked to the automation of collaboration scripts, i.e., teacher-formulated activities, guidelines, and procedures that structure non-computer-supported student group work [4]. Examples of these scripts include the assignment of students to specific roles and the distribution of learning materials among members to force consultation (also known as jigsaw scripts) [2]. Rau and colleagues [6] successfully created adaptive collaboration scripts to support chemistry learners. Their script posed questions that resulted in student discussion or prompted the students towards certain behaviors.

We observe that the proposed conceptual frameworks require the use of high-fidelity data, including dialog and sensor data [1]. We also observe that these systems are usually deployed in complex problem-solving STEM environments [6, 12, 13]. In this paper, we attempt to consider the questions from [8] in the context of Ibigkas!, a team-based mobile game for English language learning. The game keeps interaction logs but makes no use of video or audio recording, or sensors. The objective of this paper is to collect design considerations and directions for the development of an adaptive version of the game. The design considerations and directions are drawn from the literature and from focus group discussions with students and teachers representing the game's target user groups.

2 An Overview of Ibigkas!

Ibigkas! is a collaborative drill-and-practice style game that helps learners develop fluency in identifying rhymes, synonyms, and antonyms in English. It was developed (as discussed more fully in [7]) by the Ateneo Laboratory for the Learning Sciences of the Ateneo de Manila University in the Philippines, and it is available free of charge for both Android and iOS. It was intended for use by under-resourced students in grades 4, 5, and 6 in Philippine public schools, where English is an official language, but is not necessarily the language of instruction.

Ibigkas! allows both multiplayer and single-player modes. To play in multiplayer mode, each student must first have a mobile phone with the game installed. The game does not require Internet access, but each device must be connected to the same network hotspot in order to communicate. When the game begins, a random player from the team receives a target word (in Fig. 1a, the target word is KIT). All players receive lists of words, only one of which is the correct answer, i.e., the rhyme, synonym, or antonym of the target word (in this example, the words received are MISS, NO, WRONG, ANOTHER, HIT, and BAD, and the correct answer is HIT, which rhymes with the target word KIT; see Fig. 1b). The player presented with the target word must say it aloud so that the other players can hear it. The requirement to say the word aloud is the origin of the game's name, as ibigkas is the Filipino word for "pronounce" or "say out loud". All other players then check their list of words to see if they have the correct answer. The player with the correct answer should say the answer aloud and tap it. Once the correct answer is tapped, the round is over and a new round begins.
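A sketch of how one multiplayer round could be dealt so that exactly one player holds the correct answer appears below; the word bank is a tiny illustrative sample, and the assumption that the reader does not receive a word list is ours, not the paper's:

```python
import random

# Tiny illustrative word bank; the real game draws from graded word lists.
RHYMES = {"KIT": ["HIT", "BIT", "FIT"]}
DISTRACTORS = ["MISS", "NO", "WRONG", "ANOTHER", "BAD", "TREE", "SOON", "GO"]

def make_round(target: str, n_players: int, words_per_player: int = 6):
    """Deal word lists so exactly one non-reader player receives the correct answer."""
    answer = random.choice(RHYMES[target])
    reader = random.randrange(n_players)  # this player sees and reads the target word
    answer_holder = random.choice([p for p in range(n_players) if p != reader])
    hands = {}
    for p in range(n_players):
        if p == reader:
            continue
        hand = random.sample(DISTRACTORS, words_per_player - (p == answer_holder))
        if p == answer_holder:
            hand.insert(random.randrange(words_per_player), answer)
        hands[p] = hand
    return reader, hands

reader, hands = make_round("KIT", n_players=3)
print("Reader:", reader)
print("Hands:", hands)
```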


Fig. 1. Sample of a rhyming task. (a) Player 1 has target word KIT. (b) Player 2’s screen has the correct answer HIT.

Because data suggests that mobile phone access is relatively common among poorer Filipino families, a non-collaborative version was also designed. This single-player mode enables students to practice on their own. In this mode, the target word's answer is always among the three choices at the bottom half of the student's screen.

Fig. 2. Samples of the Ibigkas! printable cards showing the three types of linguistic relationships that the game teaches (synonyms, antonyms, and rhymes).

However, even if the students we are hoping to serve with Ibigkas! may have access to a phone for individual practice at home, their schools are unlikely to provide one-to-one mobile devices, and even less likely to be able to provide reliable network or Internet access. For these reasons, a card game equivalent of Ibigkas! (see Fig. 2) has also been designed. Like the mobile game, the printable cards and rule book are available for download, free of charge, from http://penoy.admu.edu.ph/~alls/downloads-2. In its simplicity, the multiplayer version of Ibigkas! fits the criteria for a jigsaw CSCL activity: the learners each hold a piece of the question and the answer, and bringing question and answer together requires working together.

At the time of this writing, the Ibigkas! mobile game is not adaptive. Players can select levels of difficulty, but the game has no native intelligence that enables it to make automatic adjustments based on players' behaviors. To consider an adaptive version of this game, we first have to identify the characteristics of collaborative activities in the normal classroom, without technology interventions. From these characteristics, we draw considerations for the design of adaptive collaborative support.

3 Workshop Sessions

In early February 2019, we conducted workshop sessions with Grades 4, 5, and 6 students and English teachers in a public elementary school in Quezon City in the Philippines. Each workshop session consisted of Ibigkas! game play followed by focus group discussions. Participating teachers and students, and the students' parents, gave their written consent to participate in the study. Immediately after the focus group discussions, teachers were given a token of PhP200.00 (approximately US$4.00) while students were given a token of PhP50.00 (approximately US$1.00).

3.1 School and Student Profile

School A was one of two participating schools in the study reported in [7]. As of 2017, School A had 7,419 students, with class sizes averaging 50 students. Dependent wholly on government support, the school is often under-resourced. Teachers told Rodrigo and colleagues [7] that they bring their own laptops, projectors, and Internet hotspots to class, which they procure at their own expense. During the focus group discussion, teachers added that even traditional materials like colored markers were already in short supply, since the school year was ending in four weeks.

The profile of the student population mirrors that of the school. Most of the students come from difficult socio-economic circumstances. While some parents hold managerial positions, many work as day laborers or at jobs with relatively low skill requirements. Many of the children have to work before or after school to augment family incomes [7].

3.2 Study Participants

Six teachers participated in the workshop, two from each grade level. All teachers were female. Combined, they had 82 years of teaching experience (see Table 1). Eleven grade 4 students, 12 grade 5 students, and 12 grade 6 students joined the workshop. Ages ranged from 8 to 13, and 46% of participants were male. Most students owned their own cellular phones (see Table 2). For the purposes of the workshop, the research team brought cellular phones owned by their laboratory, with the game already installed.

Table 1. Teacher profile.

  Grade   Teacher   Number of years teaching
  4       A         17
  4       B         4
  5       A         20
  5       B         3
  6       A         28
  6       B         10

Table 2. Student profile.

  Grade     No. of students   Male   Average age   Owns a cell phone
  4         11                45%    9.7 years     55%
  5         12                42%    10.6 years    67%
  6         12                50%    11.3 years    50%
  Overall   35                46%                  57%

3.3 Data Collection Methods

Each grade level had separate workshop sessions lasting about one hour each. All sessions took place within the same day. The students were divided into three groups of three to four members each. One trained facilitator was assigned per group. The teachers were in a separate group with their own facilitator. The first author of this paper began each session with a reiteration of the consent form's contents. We then asked the students to complete a brief demographics questionnaire. Once the students finished answering the questionnaire, the facilitators taught their groups how to play Ibigkas! The students and teachers played all versions of the game: the mobile game, in both single- and multiplayer modes, and the card game. The facilitators then conducted the focus group discussion in which they asked members of their groups two sets of questions. The first set was about the game experience and the second was about group work within their classroom context (see Table 3).

Table 3. Focus group discussion questions.

Set 1: Game experience (asked of both students and teachers)
• What did you like most/least about the mobile game?
• How do you think the game can be improved?
• What did you like most/least about the card game?
• How do you think the card game can be improved?

Set 2: Group work (questions for students)
• Do you like working in groups? Why or why not?
• How do you choose your group members?
• How large are your groups?
• How do you help each other learn when you are working together? Do good students coach those who are not as good?
• How do you handle freeloaders?
• How are you graded? Are there grades for individual members or do you receive one group grade? Do you get to tell the teacher how much each person contributed?

Set 2: Group work (questions for teachers)
• Do students like working in groups? Why or why not?
• How do you go about assigning students to their groups?
• How large are the groups?
• What kind of help do you provide to the groups—help with content, help with team management?
• How do you handle freeloaders?
• How do you grade group outputs? Are there grades for individual contributions or just a single grade for the entire group?

4 Results

We discuss the results of the focus group discussion, with emphasis on the responses to the second set of questions. From the first set of questions, we focus on the responses to the questions about the mobile game rather than the card game, to stay within scope.

4.1 Game Experience

When teachers and students were asked for their opinions about the game and the game experience, much of the feedback was positive. They found the game easy to learn and fun to play. The students said they learned new words. Many students said that they enjoyed the multiplayer mode of the mobile game. They found it exciting, and they liked that they were able to help each other if they didn't know what to do. The multiplayer mode gave them the feeling that they were "in it together." They also enjoyed seeing teamwork improve over time. The teachers also liked the multiplayer mode. They agreed that it was exciting and challenging. They liked that it required students to stay alert and provided opportunities for interactivity with both the device and their classmates. For example, one teacher praised the fact that students could check each other's pronunciation.

Both teachers and students had a number of suggestions. In terms of content, they suggested adding more words to the program's corpus, especially words that appear in the students' textbooks, and they hoped that we would consider expanding to other languages or even other subject areas (e.g., building games to teach mathematics). They suggested categorizing words by type (e.g., verbs vs. nouns). Students indicated that they were motivated to improve their game performance by studying, even suggesting that the game give them time to take notes or else provide a reviewer so they could memorize the word lists. While teachers said that they liked the collaborative nature of the multiplayer mode, they made an almost contradictory request: increase the competition between students. They wanted individual students to be able to level up and they wanted the game to track student rankings, which would require the game to track individual progress.

4.2 Group Work

Group work is a staple teaching/learning format in Philippine public schools, sometimes to a fault: one teacher shared that the latest basic education curriculum required group work formats whether or not they were appropriate for the subject matter. The students said that they liked group work because working as a team makes school work easier. They could share ideas, help each other, and in the process have more fun. Individual work, one student said, was boring.

The teachers offered a more nuanced perspective that reflects the culture of Filipino classrooms. They said that students liked group work because it gave them license to chat. Also, while it has been observed that in some cultures the stronger students prefer to work alone rather than learn with peers [14], the teachers reported that, interestingly, it is the better students who prefer group work. The less proficient students preferred to work individually because incompatibilities in working styles led to arguments or fights between classmates.

Students said they were not often given a chance to choose their groupmates. When they did, people tended to gravitate towards the smart students. Hence, groups were usually assigned by the teachers using a variety of strategies: they had students count off, they grouped students by rows or alphabetically, and so on. The teachers said they tried to make sure that all groups had a mix of stronger and weaker students. Group sizes tended to range from four to eleven members per group, although one teacher said she only grouped students into dyads or triads. These large group sizes were consequences of large class sizes and limited classroom space. Groups often had to spill into the corridor in order to convene.

Once groups convened, students said that they would make sure that everyone understood the task and shared the work. They tried to teach each other, give feedback, assist, and direct those who were lost or confused. On occasion, students disengage from the learning task, partly out of boredom or lack of interest. They refuse to cooperate, preferring instead to chat or play. In these instances, students tried to self-manage by scolding these members, giving them work to do, or reporting them to the teacher. Teachers then try to address the specific needs of these members. They break down the task into simpler steps or give alternate tasks that might appeal to student interests or abilities.

Teachers, in the meantime, went from group to group to check on group dynamics and the learning process. They also reported using other strategies to ensure collaboration, some of which may not be typical in other cultures. For instance, Filipino teachers sought to make the smarter members of a group responsible for less-able peers. Their grading rubrics include group discipline, which subsumes teamwork. During the evaluation of group work, teachers ask each member to account for their individual contributions to the group, but it is also common for teachers to assign a single grade to all members of the group. This strategy is meant to make group members responsible for each other, but student leaders within the group are, in some extreme cases, allowed to drop a particularly uncooperative student from the group roster. When this happens, the student who has been removed from the group receives a lower grade than the rest of the group or no grade at all.

5 Design Directions

The game format of Ibigkas! circumvents many of the problems and issues that the teachers and students raised in our pre-design workshops, e.g., students disengaging from the tasks and refusing to work with the rest of the group. Students are automatically excited to try it out. The instructions are relatively simple. Students must pay attention or the entire group suffers. We now attempt to reconcile the other focus group discussion responses with what we know of adaptive CSCL, referring in large part to the characteristics from [5].

Pedagogical Goals. Given the simplicity of Ibigkas!, it seems inappropriate to aspire for the system to guide learners towards more productive interaction, at least during game play. In this game's context, metacognitive support might be the most appropriate pedagogical objective, as students mainly need to stay alert in order to succeed in the game. However, as discussed below under Targeting Specific Students for Intervention, students could also be pushed to try words that are considered more difficult—based either on measures typically employed in Natural Language Processing (NLP) tools (e.g., age of acquisition) or on the student's prior performance with the word and/or its linguistic properties. Likewise, it might be possible to have the game incorporate new words, either based on past student performance or with some input and guidance from teachers. For example, future designs of Ibigkas! could allow a teacher who is aligning their English instruction with the science curriculum to specify that students practice the vocabulary associated with those lessons.

Targeting Specific Students for Intervention. Teachers said that they try to form heterogeneous groups, with a balance of strong and weak learners. Ibigkas! only provides a group score; there is no differentiation in score among members. For the system to support group formation, it would have to track individual students' performance and report these to the teacher. One way to implement individual tracking is by penalizing students who answer incorrectly or who fail to tap the correct answer when it appears on their screen. If the system tracks individual performance in this manner, it would be possible to accommodate student and teacher suggestions for levelling up, ranking, and overall providing a more competitive environment.
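
A minimal sketch of the kind of individual tracking suggested above, assuming the game can log, per player, correct taps, wrong taps, and missed opportunities (rounds in which the correct answer was on a player's screen but was never tapped). The class and scoring rule are illustrative placeholders, not the game's actual data model.

```python
from dataclasses import dataclass

@dataclass
class PlayerRecord:
    """Illustrative per-player tally; field names are ours, not Ibigkas! internals."""
    correct: int = 0    # tapped the right answer on their own screen
    incorrect: int = 0  # tapped a wrong word
    missed: int = 0     # had the right answer on screen but never tapped it

    def score(self) -> float:
        """Simple accuracy-style score that penalizes wrong taps and misses."""
        attempts = self.correct + self.incorrect + self.missed
        return 0.0 if attempts == 0 else self.correct / attempts

class TeamTracker:
    """Keeps one record per player so teachers can see individual performance."""
    def __init__(self, player_ids):
        self.records = {pid: PlayerRecord() for pid in player_ids}

    def log_event(self, pid, outcome):
        # outcome is one of "correct", "incorrect", "missed"
        setattr(self.records[pid], outcome, getattr(self.records[pid], outcome) + 1)

    def report(self):
        """Per-player scores a teacher could use when forming balanced groups."""
        return {pid: round(rec.score(), 2) for pid, rec in self.records.items()}

tracker = TeamTracker(["P1", "P2", "P3"])
tracker.log_event("P1", "correct")
tracker.log_event("P2", "missed")
print(tracker.report())   # e.g. {'P1': 1.0, 'P2': 0.0, 'P3': 0.0}
```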

This also opens the door for content adaptation. While in-game, the level of difficulty of the content could be increased or decreased depending on student performance. That is, students could be given progressively less-common synonyms or antonyms for a target word with which they have demonstrated familiarity. Likewise, if a student seems to have mastered the rhymes involving one phonological class of sounds, the game could shift these examples out of rotation and instead require students to practice words containing other classes of sounds. Future versions of the game may also take advantage of the devices being connected to each other to learn how difficult the words are as more games get played. Words that the majority of students seem to get wrong may be labelled as more difficult. After some time, a hierarchy of word difficulty for a given context of respondents may materialize, and this could serve as an additional basis for shifting the in-game difficulty.

While the design of the multiplayer version of the game, especially, does not lend itself naturally to hints or clues, it may be possible to provide bottom-out hints. Particularly when the game is being played in single-player mode, it could be useful to prompt students to advance to the next set of words. This could be done, for example, by having the correct answer change color if the student does not answer correctly within a certain amount of time. Empirical research is needed in order to determine the appropriate amount of time to allow a student to struggle (e.g., [5]). In general, however, finding some way to discourage students from wheel-spinning, as in [15], is probably an important design strategy.

Building on student suggestions, future designs of Ibigkas! could also provide post-game notes. These notes could be personalized to individual learner needs, but with further help from teachers, they could also be provided to encourage group study activities outside of the game. For example, students in a group that has struggled with a particular set of synonyms or antonyms could be directed to specific reading/writing tasks outside of the game. Providing students with content support would hypothetically have a positive impact on peer interaction. Because the game mechanic mainly requires knowledge of the subject matter and attention, students who know the subject matter should be able to perform better at the game.

Modeling and Technology. The domain needs to be structured based on the level of difficulty of the words. At the moment, a language expert has binned the words in the Ibigkas! corpus based on word length and familiarity, and words have been semi-automatically coded for phonological (sound-based) patterns that are related to the rhyming tasks. Moving forward, it may be possible to automate this process using NLP tools; however, we will need to take into account contextual factors (i.e., a word that is familiar in a Western context may not be as familiar in a Philippine context), and we will also need to consider the perceptual factors that differ among non-native speakers when they are asked to identify rhymes.

As discussed earlier, learner performance can be tracked and modeled in terms of the timing and correctness of response patterns, and these transactions might provide further information about group collaborations. For example, a student who is repeatedly mispronouncing a word (e.g., the player in one of our workshops who pronounced gorgeous with two hard "g"s, resulting in gor-gee-us) might impede his or her classmates from correctly identifying the synonyms or antonyms in their lists. One can imagine, for example, that if the group's response is delayed each time the target word appears on one particular student's screen, it could be that the group needs help remediating that student.

Design Space. Automatically changing difficulty levels would constitute an implicit design because it takes place without providing the student with directions, clues, or hints. The design is both indirect and direct. It is indirect because when in-game difficulty is shifted or adapted for one person, the way the game is played allows this change to affect the whole team. A student who advances in difficulty will, as per the rules of the game, cause teammates to receive answers that match the question originally intended for the target student. This allows the target student to receive the adapted instruction indirectly through another student. It is direct because at certain points in the multiplayer mode (where the correct solution appears on the device held by the target student) and throughout the single-player mode, the instruction is directly received by the student it targets.
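
As a concrete illustration of the crowd-sourced word-difficulty idea discussed earlier in this section, the sketch below aggregates right/wrong outcomes per word across games and bins words into difficulty tiers. The smoothing constant and tier cut-offs are arbitrary placeholders to be tuned against real play data.

```python
from collections import defaultdict

class WordDifficultyEstimator:
    """Illustrative aggregation of right/wrong outcomes per word across games."""
    def __init__(self, smoothing=1.0):
        self.right = defaultdict(float)
        self.wrong = defaultdict(float)
        self.smoothing = smoothing  # Laplace smoothing keeps unseen words neutral

    def record(self, word, was_correct):
        (self.right if was_correct else self.wrong)[word] += 1.0

    def error_rate(self, word):
        r, w, s = self.right[word], self.wrong[word], self.smoothing
        return (w + s) / (r + w + 2 * s)

    def tier(self, word):
        """Arbitrary cut-offs; a real deployment would tune these empirically."""
        e = self.error_rate(word)
        return "easy" if e < 0.33 else "medium" if e < 0.66 else "hard"

est = WordDifficultyEstimator()
est.record("HIT", True)
est.record("GORGEOUS", False)
print(est.tier("GORGEOUS"))   # "hard" -- one error and no correct answers yet
```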

6 Conclusion

We describe design considerations and directions for an adaptive version of Ibigkas!, a collaborative mobile game for English language learning. Based on a reading of the literature and feedback from teachers and students who participated in focus group discussions about the game, an adaptive version of the game would include metacognitive support, a scoring system that enables monitoring of individual performance, and adaptive difficulty adjustment. Post-game content support would be an added benefit, especially if it addresses learners' individual needs. While it does not have the same level of sophistication as the systems discussed in [6, 12, 13], the simplicity of Ibigkas! might also be its strength. The game offers an opportunity to design a few simple, adaptive interventions and measure their effects on the target population. Future work will look into the possibility of implementing these adaptations and testing them in the field.

Acknowledgements. We thank the Ateneo de Manila University, specifically the Ateneo Center for Educational Development, Areté, and the Department of Information Systems and Computer Science. We thank the principals, teachers, and learners of our partner public schools for their participation. We thank our support staff composed of Francesco Amante, Michelle Banawan, Jose Isidro Beraquit, Philip Caceres, Marie Rianne M. Caparros, Marco De Santos, Walfrido David Diy, Marika Fernandez, Ma. Rosario Madjos, Monica Moreno, and Lean Rimes Sarcilla. Finally, we thank the Commission on Higher Education and the British Council for the grant entitled Jokes Online to improve Literacy and Learning digital skills amongst Young people from disadvantaged backgrounds.

References

1. Chopade, P., Yudelson, M., Deonovic, B., von Davier, A.A.: Modeling dynamic team interactions for intelligent tutoring. In: Building Intelligent Tutoring Systems for Teams: What Matters, pp. 131–151. Emerald Publishing Limited (2018)
2. Diziol, D., Walker, E., Rummel, N., Koedinger, K.R.: Using intelligent tutor technology to implement adaptive support for student collaboration. Educ. Psychol. Rev. 22(1), 89–102 (2010)
3. Freeman, J., Zachary, W.: Intelligent tutoring for team training: lessons learned from US military research. In: Building Intelligent Tutoring Systems for Teams: What Matters, pp. 215–245. Emerald Publishing Limited (2018)
4. Kobbe, L., et al.: Specifying computer-supported collaboration scripts. Int. J. Comput.-Support. Collab. Learn. 2(2–3), 211–224 (2007)
5. Magnisalis, I., Demetriadis, S., Karakostas, A.: Adaptive and intelligent systems for collaborative learning support: a review of the field. IEEE Trans. Learn. Technol. 4(1), 5–20 (2011)
6. Rau, M.A., Bowman, H.E., Moore, J.W.: An adaptive collaboration script for learning with multiple visual representations in chemistry. Comput. Educ. 109, 38–55 (2017)
7. Rodrigo, M.M.T., et al.: Ibigkas!: the iterative development of a mobile collaborative game for building phonemic awareness and vocabulary. Comput.-Based Learn. Context (in press)
8. Rummel, N., Walker, E., Aleven, V.: Different futures of adaptive collaborative learning support. Int. J. Artif. Intell. Educ. 26(2), 784–795 (2016)
9. Soller, A., Lesgold, A.: Knowledge acquisition for adaptive collaborative learning environments. AAAI Technical Report (2000)
10. Soller, A., Lesgold, A.: Modeling the process of collaborative learning [Abstract only]. In: Hoppe, H.U., Ogata, H., Soller, A. (eds.) The Role of Technology in CSCL. CULS, vol. 9, pp. 63–86. Springer, Boston (2007). https://doi.org/10.1007/978-0-387-71136-2_5
11. Stahl, G., Koschmann, T.D., Suthers, D.D.: Computer-supported collaborative learning (2006)
12. Viswanathan, S.A., VanLehn, K.: Using the tablet gestures and speech of pairs of students to classify their collaboration. IEEE Trans. Learn. Technol. 11(2), 230–242 (2018)
13. Vizcaíno, A., Contreras, J., Favela, J., Prieto, M.: An adaptive, collaborative environment to develop good habits in programming. In: Gauthier, G., Frasson, C., VanLehn, K. (eds.) ITS 2000. LNCS, vol. 1839, pp. 262–271. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45108-0_30
14. Wallace, J.: Do students who prefer to learn alone achieve better than students who prefer to learn with peers? Institute of Education (1992)
15. Beck, J.E., Gong, Y.: Wheel-spinning: students who fail to master a skill. In: Lane, H.C., Yacef, K., Mostow, J., Pavlik, P. (eds.) AIED 2013. LNCS, vol. 7926, pp. 431–440. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39112-5_44

Adaptive Learning Technology for AR Training: Possibilities and Challenges

Alyssa Tanaka, Jeffrey Craighead, Glenn Taylor, and Robert Sottilare

Soar Technology Inc., Orlando, FL 34786, USA
{Alyssa.tanaka, Jeffrey.Craighead, Glenn, Bob.Sottilare}@Soartech.com

Abstract. The life-saving care of a combat casualty begins moments after an injury is sustained. In these moments, the caregiver is tasked with maintaining tactical objectives and making critical decisions for care that may determine if a casualty lives. While the United States has achieved unprecedented survival rates for casualties arriving alive to combat hospitals, as high as 98%, evidence still suggests that 25% of battlefield deaths are potentially preventable. The majority of these deaths are found to occur during the pre-hospital phase of care. This phase of care presents a significant opportunity for the improvement of battlefield medicine and casualty care outcomes. Having providers engage in realistic Tactical Combat Casualty Care (TC3) scenarios can optimize the leadership, teamwork, tactical, and medical skills required to succeed in the challenging situations they may encounter. A hurdle facing the community is creating realistic training scenarios that adequately challenge the cognitive and decision-making processes of trainees. In the real world, patients provide explicit and implicit information to caregivers, who make observations and collect evidence to determine the best course of care. During a training scenario, with a presumably healthy patient, it can be challenging to provide the trainee with the cues to trigger their clinical decision making. Further, medical diagnosis and intervention place high demand on a human's sense of touch and sight. Providers often rely on visual cues of the injury or illness to assist in making the best decision for treatment. Similar visual cues are also used during treatment to perform the correct treatment procedure. The proliferation of augmented reality (AR) technologies provides an interesting opportunity to address some of these obstacles in training by enhancing training scenarios with the overlay of realistic visual scenes onto the real world. The use of AR technologies in medical training provides the capability to supply these cues, as well as more explicit guidance, to learners. Supplementing training with these technologies presents the opportunity to make significant improvements to a simulation experience, improve training, and ultimately have a positive impact on medical care. The enhanced visual displays can create more compelling environments for the trainees to interact with and provide a more realistic environment; however, many factors regarding the use of AR in training are still being explored. The potential impact of AR on learning is still relatively unknown. Also, the methodology for integrating AR with advanced techniques must be evaluated. One such technique, adaptive training, offers the opportunity to provide tailored learning experiences to individual learners. This paper will explore learning opportunities and applications for AR in TC3 training. Specifically, this paper will discuss opportunities for integrating AR training with adaptive instruction for TC3 training. Finally, this paper will discuss future directions and opportunities for AR in TC3 training.

Keywords: Augmented reality · Medical training · Adaptive instruction

1 Introduction

1.1 Tactical Combat Casualty Care

The life-saving care of a combat casualty begins moments after an injury is sustained. In these moments, the caregiver is tasked with maintaining tactical objectives and making critical decisions for care that may determine if a casualty lives. While the United States has achieved unprecedented survival rates for casualties arriving alive to combat hospitals, as high as 98%, evidence still suggests that 25% of battlefield deaths are potentially preventable. The majority of these deaths are found to occur during the pre-hospital phase of care [1]. This phase of care presents a significant opportunity for the improvement of battlefield medicine and casualty care outcomes. Having providers engage in realistic Tactical Combat Casualty Care (TC3) scenarios can optimize the leadership, teamwork, tactical, and medical skills required to succeed in the challenging situations they may encounter [2].

A hurdle facing the community is creating realistic training scenarios that adequately challenge the cognitive and decision-making processes of trainees. Patients provide explicit and implicit information to caregivers, who make observations and collect evidence to determine the best course of care. During a training scenario, with a presumably healthy patient, it can be challenging to provide the trainee with the cues to trigger their clinical decision making. Current training for TC3 often involves a patient being simulated by the use of a mannequin or a human actor (i.e., a standardized patient). The mannequin can range in fidelity level, from providing the basic human form for practice to providing advanced physiologic interactions to the learner. In the case of a human actor, military personnel participating in a training exercise often carry a "casualty card" that instructs the person on how to portray a specific wound named on the card, if nominated to simulate a casualty. The card is also used to tell the trainee what wound to treat. These simulated patients are often enhanced with the use of moulage to simulate the individual wounds and bring another layer of realism to the scenario. Moulage may range from simple pieces that demonstrate some characteristics of the wound (e.g., rubber overlays with synthetic blood) to more complex sleeves that have representative blood dynamics (Fig. 1).

Fig. 1. Moulage simulating a wound being placed on a simulated patient.

While these techniques provide the basic information needed to support a training scenario, the simplicity of the presentation often requires the instructor to describe the wound or to remind the trainee during an exercise about the qualities of the wound that are not portrayed, including how the wound is responding to treatment. For example, an instructor may spray fake blood on the moulage to simulate arterial bleeding. This effort by the instructors compensates for the low-fidelity simulation and takes away from time that could be spent providing instruction. While relatively simple, even these simulations take time and effort to create, set up, and manage, before and during the training exercise.

AR as a Solution. Augmented Reality (AR), especially the recent boom in wearable AR headsets, has the potential to revolutionize how TC3 training happens today. AR can provide a unique mix of immersive simulation with the real environment. In a field exercise, a trainee could approach a casualty role-player or mannequin and see a simulated wound projected on the casualty. The hands-on, tactile experience combined with simulated, dynamic wounds and casualty responses has the potential to drastically increase the realism of medical training. The proliferation of AR technologies provides an interesting opportunity to enhance training by overlaying realistic visual scenes onto the real world. The enhanced visual displays can create more compelling patients for the trainees to interact with and provide a more realistic and valuable learning opportunity. Medical diagnosis and intervention place high demand on a human's sense of touch and sight. Providers often rely on visual cues of the injury to assist in making the right decision for treatment. Similar visual cues are also used during treatment to determine the correct treatment procedure. Training technology capabilities that can provide these cues, or more explicit guidance to assist with diagnosis and treatment, have the potential to make significant improvements to a simulation experience.

1.2 Augmented Reality

AR typically refers to technology that allows a user to see a real environment while digital information is overlaid on that view. Heads-Up Displays (HUDs), such as those in cockpits or fighter pilot helmets, represent early work in AR, though typically these overlays do not register with objects in the environment. Later work includes registering information with the environment for tasks ranging from surgery, to machine maintenance, to entertainment such as the addition of AR scrimmage lines in NFL football games or the highlighting of the hockey puck in NHL games. See [1, 2] for thorough surveys of augmented reality. As mobile devices (phones, tablets) have become more capable, augmented reality has become more mobile, with game examples such as Pokémon GO™, which provides an "AR view" option to show 3D renderings of game characters overlaid on top of camera views. More recently, wearable AR hardware has tended to focus on see-through glasses, visors, or individual lenses that allow computer-generated imagery to be projected hands-free while allowing the user to see the surrounding environment directly. Additionally, more sophisticated AR projections are registered with the real environment, where digital objects can be placed on real tables or seem to interact with real obstacles (Fig. 2).

Fig. 2. Microsoft HoloLens is one example of the commercially available AR headsets.

AR manufacturers, like Microsoft, have recognized the value of medical applications of the technology and have sponsored or supported multiple projects to explore its real feasibility, which have resulted in prototypes that validate AR's potential to improve training and, ultimately, medical care. While the technology continues to improve, there are still several limitations with current AR systems that have real implications for training, including limited computer processing power and limited field of view; however, newer systems like those from SA Photonics and Magic Leap promise many improvements.

1.3 Adaptive Training

An interesting element of TC3 training is the multi-faceted role of the instructor. While providing classroom instruction is part of their role, instructors are also tasked with simulating patient and combat conditions during a hands-on scenario. During a scenario, instructors will question trainees about their treatments, make suggestions or give hints, or directly order the trainees as needed. The instructor may also vary the difficulty of the training to suit the particular trainee.

A possible solution to relieving some of this burden from the instructor is to rely on the use of adaptive instructional systems (AISs) to tailor training for learners. AISs have been the topic of research and development for decades in many fields, but have recently seen renewed interest in military programs like the US Army's Synthetic Training Environment (STE) rapid development program and the US Navy's My Navy Learning science and technology program. AISs are artificially-intelligent, computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each individual learner or team in the context of domain learning objectives [3]. Examples of AISs include intelligent tutoring systems (ITSs), intelligent mentors, and intelligent instructional media.

In 2017, a NATO Research Task Group (HFM RTG 237) completed its mission to investigate various intelligent tutoring system technologies and opportunities for their use within NATO military organizations [4]. The group noted that "military operations, especially those characteristic of current irregular warfare environments, require, among other things, improvisation, rapid judgment, and the ability to deal with the unexpected. They go beyond basic instructional objectives and call for education and training focused on higher order cognitive capabilities such as analysis, evaluation, creativity, and rapid synthesis of novel approaches – approaches that must intersperse judgment with the automatic responses provided by training involving memorization and practice of straightforward procedures. These capabilities can make the difference between success and failure in operations, and require more sophisticated forms of instruction such as one-to-one tutoring" [4], and thus the impetus for AISs.

The RTG found a substantial body of research and development of instructional technologies and especially AISs. They also concluded that while it is not practical to provide one-to-one human tutoring to every soldier, sailor, and airman, it is practical to provide computer-based tutoring that is dynamically tailored to every learner's capabilities, preferences, and needs (e.g., knowledge and skill gaps). Technology to produce effective and efficient tutoring requires 'intelligent' systems that rapidly tailor instruction to individual learner abilities, prior knowledge, experience, and, to some extent, misconceptions (e.g., common errors or malformed mental models about the training domain). In an effort to exploit current and emerging technologies and to push toward needed research to create new technology, the RTG made several recommendations [4] with respect to:

• Expanding authoring tools to new instructional domains
• Enhancing automation in the authoring process to reduce developer workload
• Enhancing user experiences through adaptive interfaces
• Standardizing components and data in ITSs to enhance interoperability and reuse
• Modeling aggregate levels of learning (e.g., team and organizational learning) resulting from ITS adoption

It is through adaptive interfaces linked to AR technologies that we might realize improved access to adaptive instruction (e.g., anytime anywhere training), more effective and efficient interaction through natural language dialogue, enhanced realism to match the complexity of training tasks, and better quality after action reviews which enable more detailed analysis of critical decision points in our training. Specifically relating to TC3 training, adaptive instruction embedded within AR training can provide even greater realism to the scenario and enhanced feedback for trainees, without added instructor burden.

2 Integration Opportunities

In the following section, we discuss several ideas for providing AR-based adaptive instruction in TC3 training. AR provides a platform for immersing trainees in a visually rich and dynamic environment that places a heavy emphasis on user interaction. Thus, the focus of this paper is on the application of interaction strategies for instructional adaptation. While there are several approaches to providing adaptation for learning, the application of interaction strategies leverages the highly interactive nature of AR systems.

2.1 Content Adaptation

Content adaptation involves using the AR visual display not only to progress the scenario, but also to adapt the scenario to the abilities of the learner and the instructional goals. When using AR for TC3 scenarios, the visual scene often provides imagery of an injury's initial state, as well as the reaction of the injury to trainee input (i.e., intervention input or lack thereof).

Injury Identification. The diagnostic and intervention process involves many different interactions with the patient, ranging from asking diagnostic questions to performing a complex surgical intervention. One example of content adaptation is to adapt the level of clinical questioning that must be performed to diagnose or treat a patient. Varying amounts of visual cues may be provided to encourage a trainee to ask the right questions or order the right diagnostic tests.

Injury Timing. Another method of content adaptation involves the timing of the injury exposure. To increase scenario complexity, the patient state could degrade at a rapid pace, requiring the student to make clinical decisions faster. The patient could also have multiple injuries that present at the same time and need to be addressed concurrently.

Injury Response. Simulated injuries can also be used to represent tasks of varying difficulty and challenge trainees. This can be a very natural adaptation in the scenario, driven automatically using a physiology engine (e.g., BioGears). Using a physiology engine, the patient's health would degrade or improve according to the actions performed by the student. The AR wound would ideally adapt to the physiology as well. For example, if the trainee does not apply pressure to an arterial wound, the physiology engine would generate continuous blood loss in the patient's physiology. Subsequently, the virtual injury would display significant blood loss. The adaptations could become increasingly complex in this case because the deteriorating patient state could affect other physiologic systems (e.g., the respiratory system) and cause compounding challenges for the trainee. Similarly, the injury complexity can also be increased by adding compounding injuries to the visual overlay of the patient. This not only requires more complex procedures to be performed, but also requires the trainee to think multi-dimensionally about the patient's injuries. For example, an unconscious patient may prompt the trainee to perform an intubation. However, an unconscious patient with significant injury to the head and neck should indicate to the trainee that a cricothyroidotomy may be necessary.
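
The sketch below illustrates the kind of physiology-driven injury response described above with a toy update loop: untreated arterial bleeding drains blood volume each tick, and the AR overlay's rendered blood flow scales with the accumulated loss. It is a simplified stand-in for illustration only, not the BioGears API or any particular AR engine.

```python
from dataclasses import dataclass

@dataclass
class ArterialWound:
    """Toy patient-state model; a real system would delegate to a physiology engine."""
    blood_volume_ml: float = 5000.0
    bleed_rate_ml_s: float = 30.0
    pressure_applied: bool = False

    def tick(self, dt_s: float) -> None:
        # Direct pressure (or a tourniquet) stops the simulated blood loss.
        if not self.pressure_applied:
            loss = self.bleed_rate_ml_s * dt_s
            self.blood_volume_ml = max(0.0, self.blood_volume_ml - loss)

    def overlay_intensity(self) -> float:
        """0..1 value the AR layer could map to blood-flow particle density."""
        return min(1.0, (5000.0 - self.blood_volume_ml) / 2000.0)

wound = ArterialWound()
for _ in range(20):        # 20 simulated seconds without intervention
    wound.tick(1.0)
print(wound.overlay_intensity())   # 0.3 -- the rendered wound visibly worsens
```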

2.2 Prompts

Using AR, trainees can receive helpful prompts via the HUD that they would not otherwise see in the real world. These prompts can provide localization information and explanation using floating text callouts and pointers. Prompts can also be used to highlight the location at which an action can be performed, along with accompanying instructions on how to proceed.

Explicit Prompts. One use of prompts is to provide explicit hints, reminders, or instructions to the trainee in the form of text. This approach is well suited for assisting with a procedural understanding of the task at hand, and is particularly relevant if the learner has forgotten to perform a step of the procedure. Another method of providing explicit prompts is to provide cues that show where the trainee will need to interact next. For example, a prompt box or arrows may be used to show a trainee where to insert an IV or place a tourniquet.

Implicit Prompts. Another use of prompts is to provide more subtle hints to the trainee about the patient state or the necessary next steps. This can be done through the use of both visual and audio cues. One example is to display the vital signs to the trainee and allow them to use these as an indicator of the patient state. This is a more challenging prompt because it requires the trainee to think critically about the signs and symptoms involved with injuries. Implicit prompts can also be provided through audio cues. For example, if the trainee needs to provide the patient with pain medication, the system could have the patient make groaning noises.
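
One way to operationalize the explicit/implicit distinction above is to escalate from subtle cues to explicit text instructions as a required step goes unperformed. The step names, prompt strings, and time thresholds below are hypothetical placeholders, not part of any existing system.

```python
def choose_prompt(step, seconds_overdue):
    """Hypothetical prompt policy: stay silent, then cue implicitly, then explicitly."""
    prompts = {
        "apply_tourniquet": {
            "implicit": "audio: patient groans; vitals panel shows falling blood pressure",
            "explicit": "HUD text: 'Apply a tourniquet above the wound' with an arrow marker",
        },
    }
    if seconds_overdue < 10:
        return None                          # give the trainee time to act unaided
    if seconds_overdue < 30:
        return prompts[step]["implicit"]     # subtle cue about the patient state
    return prompts[step]["explicit"]         # bottom-out instruction

print(choose_prompt("apply_tourniquet", 12))   # implicit cue
print(choose_prompt("apply_tourniquet", 45))   # explicit HUD instruction
```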

2.3 Assessment and Feedback

Assessment and feedback are important components of instruction and provide the learner with valuable information about their performance. AR is highly interactive and incorporates assessment and feedback in very natural ways. As the trainee is interacting with the patient and visual scene, the system is identifying the trainee actions and incorporating that into the scenario. As the system identifies the actions, the system provides a subsequent reaction. The reaction may be positive or negative to the patient state, depending on the correctness of the action. This assessment and feedback loop is an inherent part of the AR scenario.

Scaffolding Feedback. As the trainee progresses through a scenario, the system could provide scaffolds to support the trainee by adapting the specific feedback based on previous actions. Scaffolding can be an important component in providing the appropriate amount and types of prompts. As the trainee gains proficiency, the system may fade the support that it provides. This might include moving from explicit scaffolding (e.g., text hints about what might need to be done), to implicit scaffolding (e.g., simple cues such as emphasizing the blood squirting from an arterial wound or changing vitals), to implicit challenging (e.g., providing an ambiguous mix of cues on the condition of a casualty).

Another layer of scaffolding involves adjusting the amount of feedback given. As the trainee progresses through the scenario, the system may provide less explicit feedback. The trainee may also receive less feedback if they are performing particularly well on the task. One consideration when adapting the amount of feedback is that many implicit feedback mechanisms are needed to progress the training scenario (e.g., blood squirting), so explicit prompts are more appropriate for this type of adaptation. This approach is consistent with known properties of effective feedback [5], but also with more fundamental learning theories, such as Vygotsky's Zone of Proximal Development [6], in which the learning environment adapts continuously to student learning to deliver learning situations that are consistently challenging but matched to the student's current capability and rate of learning.

Timing of Feedback. Another consideration for feedback is its timing. Feedback timing can challenge trainees by not providing them with a hint or prompt immediately. As with scaffolding, care must be taken when using this adaptation for implicit feedback mechanisms. Using the blood-squirting example, it is important that this feedback is given immediately to the trainee to maintain the realism of the scenario. Timing adaptation, however, can be valuable in allowing the trainee time to contemplate the correct course of action.
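
A minimal sketch of the fading idea above, assuming proficiency is estimated from a few recent scenario scores: the support level moves from explicit scaffolding to implicit scaffolding to implicit challenging, and the hint delay grows as proficiency increases. The thresholds are illustrative, not empirically derived.

```python
def scaffolding_policy(recent_scores, window=5):
    """Illustrative fading rule: average the last few scenario scores (0..1) and
    map them to a support level plus a hint delay, per the fading idea above."""
    recent = recent_scores[-window:]
    proficiency = sum(recent) / len(recent) if recent else 0.0
    if proficiency < 0.5:
        return {"level": "explicit scaffolding", "hint_delay_s": 10}
    if proficiency < 0.8:
        return {"level": "implicit scaffolding", "hint_delay_s": 30}
    return {"level": "implicit challenging", "hint_delay_s": 60}

print(scaffolding_policy([0.4, 0.6, 0.7, 0.9]))  # implicit scaffolding, 30 s delay
```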

3 Conclusion

Military medical personnel are the first responders of the battlefield. Their training and skill maintenance are of preeminent importance to the military; however, several challenges exist to providing the most realistic and effective training possible. The simulated patients used during training exercises often cannot replicate the injuries being trained, which requires significant intervention from instructors. AR offers improvements over this approach by providing realistic, dynamic visual scenes that can mimic battlefield injuries. These technologies and their applications are still emerging, and exploration is needed to ensure their appropriate integration into instruction. Specifically, this paper explored options for leveraging adaptive instruction techniques in AR training.

Content adaptation involves adapting the illustration and animation of wounds and procedures, projected onto the simulated patient. Adaptation in this mode provides a much richer experience; however, the tracking and animation projection needed for this is at the very edge of current AR capabilities. Illustration and animation of wounds and procedures can be a very challenging problem. Also, creating animated models is a very labor-intensive task, so this may limit the number of scenarios that can be created.

Hints and prompts are also valuable methods for providing adaptation within an AR scenario. Prompts can be particularly valuable for instruction when a student is not performing the task in a timely manner or has potentially missed a critical component of the patient's injuries. This relies on the HUD to deliver instructional content and information to the trainee, but does not require high-resolution, animated models, nor does it require precise placement of augmentations. These capabilities can be met by many commercially available AR systems.

Adaptation can also be provided within the assessment and feedback mechanisms of the scenario. Automated feedback is a first-class feature in an AR scenario because the dynamic visual scene is constantly providing the trainee with feedback on the patient state and the result of any treatments performed; however, the amount, timing, and type of feedback given to the trainee can be adapted to the needs of the scenario and the trainee.

The incorporation of adaptation in AR provides even further opportunity to enhance the educational content of the scenarios. TC3 training can involve a variety of learners at varying learning levels, and adaptive instruction would allow learners to receive instruction based on their level of expertise and instructional goals without putting more burden on the instructor. Our future work will include the incorporation of these techniques within our AR training capabilities and validation of such techniques as educational tools.

References

1. Mabry, R.L., DeLorenzo, R.: Challenges to improving combat casualty survival on the battlefield. Mil. Med. 179(5), 477–482 (2014)
2. Milham, L.M., et al.: Squad-level training for Tactical Combat Casualty Care: instructional approach and technology assessment. J. Def. Model. Simul. 14(4), 345–360 (2017)
3. Sottilare, R., Brawner, K.: Component interaction within the Generalized Intelligent Framework for Tutoring (GIFT) as a model for adaptive instructional system standards. In: The Adaptive Instructional System (AIS) Standards Workshop of the 14th International Conference on Intelligent Tutoring Systems (ITS), Montreal, Quebec, Canada, June 2018
4. Sottilare, R.A., et al.: NATO Final Report of the Human Factors & Medicine Research Task Group (HFM-RTG-237), Assessment of Intelligent Tutoring System Technologies and Opportunities. NATO Science & Technology Organization (2018). https://doi.org/10.14339/sto-tr-hfm-237. ISBN 978-92-837-2091-1
5. Shute, V.J.: Focus on formative feedback. Rev. Educ. Res. 78, 158–189 (2008)
6. Vygotsky, L.S.: Mind and Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge (1978)

Intelligent Tutoring Design Alternatives in a Serious Game

Elizabeth Whitaker¹, Ethan Trewhitt¹, and Elizabeth S. Veinott²

¹ Georgia Tech Research Institute (GTRI), Atlanta, GA, USA
{elizabeth.whitaker, Ethan.trewhitt}@gtri.gatech.edu
² Michigan Technological University, Houghton, MI, USA
[email protected]

Abstract. The adaptive learning potential in video games can be increased by incorporating intelligent tutoring system (ITS) approaches. However, very few studies have examined ITS capability in a game. Across a four-year research project, we implemented a number of intelligent tutoring design alternatives and integrated them into a serious 3D video game. The game, Heuristica, was designed to improve critical thinking by exposing players to different cognitive biases through game play, and then adaptively giving players opportunities to practice avoiding or mitigating each bias. We describe the intelligent tutoring system, and how we structured the learner's environment, provided learners with tailored feedback, supported spaced and massed practice, and provided opportunities for in-game learner reflection. We describe the design and functionality of the Student Modeler and interface features that support student feedback and interaction. We evaluate the tradeoffs made in the student model design and the impact they had on the game experience, focusing on the open student model summary screen, mixed-initiative opportunities, and algorithms for selection of learning opportunities. This paper contributes to the intelligent tutoring literature by describing one of the first intelligent tutoring systems embedded in a 3D game for training and provides lessons learned from implementations in a single game.

Keywords: Adaptive learning environment · Intelligent tutoring design · Video games

1 Introduction

Video games can be used to support player learning and decision-making strategies, and the training power in games can be increased by incorporating intelligent tutoring approaches. Few intelligent tutoring systems have been implemented in a serious, 3D, replayable video game [5, 6, 11, 19]. This combination introduces some difficult challenges for ITS designers. VanLehn (2011) evaluated the state of human tutoring and intelligent tutoring and identified several novel approaches to intelligent tutoring that needed further research (e.g., the open student model) [15]. As part of a large research project, a number of intelligent tutoring design alternatives were integrated into a serious, 3D-immersive game called Heuristica. The purpose of Heuristica is to teach students to recognize and mitigate cognitive biases [12].

Cognitive biases are tendencies for humans to think in certain ways that provide shortcuts in everyday decision-making, but which can lead to errors in complex decision-making. Heuristica provides a set of scenarios within a space station narrative where the player can interact with game characters to perform tasks such as diagnosing and repairing problems, and observing and evaluating game characters performing tasks. The student is provided with definitions of the cognitive biases, methods to mitigate the tendency toward those biases and opportunities to apply those methods in both decision-making opportunities and in recognizing those biases in others. Figure 1 shows an example of the in-game experience.

Fig. 1. An example Learning Opportunity (LO) in the Heuristica game asks players to diagnose a patient and recognize confirmation bias, a common cognitive bias.

Video games can be used to support player learning and decision-making strategies [9, 10, 14, 16, 17], and the training power in games can be increased by incorporating intelligent tutoring approaches [13, 19]. For the Heuristica project we developed and implemented a set of intelligent tutoring approaches and experimented with alternatives centered around the Student Model working with the Content Selector, which makes decisions based on the state of the Student Model. The learning experience was tailored based on the student’s prior knowledge, performance in gameplay, and personal preferences. Game activities were modularized, enabling game content to be selected and ordered to support the student’s need for extra practice in certain areas and less in others. These game activities are called Learning Opportunities (LOs). We describe the design and functionality of the Student Modeler and interface features that support student feedback and interaction, along with the representation of the Student Model, its mapping to concepts to be learned, and the indexing of in-game learning opportunities to support the tailoring of gameplay consistent with the Content Selector’s reasoning. We describe our experiments with open student model summary screen representations, mixed-initiative opportunities, massed vs. spaced practice, and algorithms for selection of learning opportunities.

1.1 Intelligent Tutoring in the Heuristica Game

The Heuristica game hosts a set of modular components for facilitating a student's training and the analysis of that student's learning through game participation. These training components, shown in Fig. 2, include the Student Modeler and the Content Selector. The Student Modeler and the Content Selector use reasoning techniques guided by Learning Theories and Teaching Theories, extended from existing learning and teaching theories by general psychology theory and by studies performed during the Heuristica project. Our Student Model, overlaid on the Curriculum Model, is used to identify areas of misunderstanding or lack of knowledge and to reason about the student's knowledge based on activities in the game, including the identification of cognitive biases exhibited in the student's performance. Figure 3 shows, in detail, the step-by-step interactions among the components of the Heuristica intelligent tutoring system.

Fig. 2. Heuristica intelligent tutoring components

Student Model and Curriculum Representation. Traditionally, student models are tied to a domain model that defines the study areas of the learning system [5, 8, 10]. A domain model provides a way to represent, index or describe the concepts, procedures or skills that the student should be learning [6]. There are many approaches to modeling the domain. We have chosen a practical approach that allows us to index the game LOs by which concepts they teach and/or evaluate. This is tied to the Student Model through a common representation called an overlay model.
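
A minimal sketch of this arrangement, under our own naming: the curriculum is a set of concepts with prerequisites, LOs are indexed by the concepts they teach or evaluate, and the student model overlays a mastery level on each concept. None of these identifiers come from the Heuristica implementation.

```python
# Curriculum: concepts and their prerequisites (an illustrative subset).
CURRICULUM = {
    "confirmation_bias.definition": [],
    "confirmation_bias.recognition": ["confirmation_bias.definition"],
    "confirmation_bias.mitigation": ["confirmation_bias.recognition"],
}

# Learning Opportunities indexed by the concepts they teach and/or evaluate.
LO_INDEX = {
    "lo_diagnose_patient": ["confirmation_bias.recognition"],
    "lo_observe_colleague": ["confirmation_bias.recognition",
                             "confirmation_bias.mitigation"],
}

class OverlayStudentModel:
    """Mastery level (0..1) overlaid on every curriculum concept."""
    def __init__(self):
        self.mastery = {concept: 0.0 for concept in CURRICULUM}

    def update(self, concept, score, rate=0.3):
        # Nudge mastery toward the latest observed score (exponential smoothing).
        self.mastery[concept] += rate * (score - self.mastery[concept])

    def weak_concepts(self, threshold=0.6):
        return [c for c, m in self.mastery.items() if m < threshold]

model = OverlayStudentModel()
model.update("confirmation_bias.definition", 1.0)
print(model.weak_concepts())   # everything still below threshold after one update
```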

[Figure 3 is a timeline diagram. It shows the Heuristica gaming environment (gameplay, event log, LO ordering and scoring, and a database holding pretest seeds and the Student Model) interacting with the Student Modeler, the Content Selector, and the student. The depicted cycle is: (1) pretest seeds are used to initialize concept mastery levels; (2) the Student Model is updated; (3) the Student Model is read; (4) the student's next LO is chosen; (5) the next LO is loaded and played by the student; (6) the student plays the LO; (7) performance on the LO is evaluated and scored; (8) the new concept scores are used to update mastery levels, and the cycle repeats at step 2. The LO choice proceeds by (a) considering concept teaching goals, (b) considering the student's per-concept mastery levels using a sliding window (averaging the n most recent scores for each concept), (c) considering game and concept prerequisites, (d) ranking concepts to prioritize based on pedagogy, and (e) ranking learning opportunities (LOs) by their ability to teach the highest-priority concepts.]

Fig. 3. Timeline of the interactions among the Heuristica Game and student modeling components.

An overlay representation [1] of a student model is an approach in which the student model is built on top of a curriculum model; that is, they use a common representation. The curriculum is a form of the domain model and is a representation, enumeration, or indexing of the concepts being taught by the learning system. The overlay student model representation is consistent with the curriculum model and associates a mastery level with each concept in the curriculum. The Student Model uses student performance and behaviors in the learning system to provide evidence of mastery levels associated with elements in the curriculum. These mastery levels in the Student Model are updated as the student moves through the tutoring system. The mastery levels increase or decrease depending on the student’s performance. A readout of an individual student model at any time provides an estimate of the student’s accomplishments and makes explicit the concepts or skills that need further work. This information is used by the Content Selector to select and order scenarios and activities in the simulations. The curriculum model in Heuristica consists of an explicit representation of the concepts and skills related to recognizing and mitigating cognitive biases as well as the relationships and interconnections among them. This curriculum model includes the concepts that are to be taught or experienced through the student’s interaction with Heuristica. It represents declarative, procedural, and metacognitive knowledge. We also represent relationships among concepts, including prerequisite and sub-concept relationships that are used to reason about the sequence of activities and game scenarios that should be made available to the student. In our evaluations, LOs were developed that teach the definitions, recognition, and mitigation techniques for the following cognitive biases:


• Confirmation Bias refers to the tendency to favor confirming information over disconfirming information when searching for information or when testing hypotheses.
• Fundamental Attribution Error occurs when individuals weigh personal or dispositional explanations for others’ actions while neglecting the role that situational forces have in determining behavior.
• Bias Blind Spot is a meta-cognitive bias in which a person reports that he or she is less susceptible to a bias than others are.
• Anchoring Bias occurs when a person’s estimates are unduly affected upward or downward based on a single piece of information given immediately before the estimate.
• Representativeness Bias arises when a person focuses too much on salient characteristics that are similar to those of certain populations.
• Projection Bias is a person’s belief that other people have characteristics similar to his or her own.

Two different Heuristica games were developed, each teaching three of the six biases. For each of the six biases, a single concept group encompasses the set of specific concepts that collectively cover the knowledge that students should learn with respect to that bias.

Student Modeler Maintains the Student Model. The Student Modeler uses information about the student, such as his or her background, preferences, knowledge state, and goals, to create and maintain the Student Model. The Student Model is built and stored in a database, then updated over time as the student interacts with the system. This is the part of an intelligent tutoring system that enables it to answer questions about the student, and it can be used to provide a snapshot of the student’s mastery levels at any time. The information in the Student Model is used to tailor the instruction to the needs of the student. The Student Modeler monitors the activity of the student in the serious game, infers and models his or her strengths and weaknesses (by analyzing the activity log produced by the gaming system, using inferencing techniques guided by the Learning Theory), and updates a set of values in the Student Model to represent the current state, or mastery level, of each concept that is a component of the student’s knowledge. The Student Modeler works in conjunction with the part of the Heuristica framework that scores performance on each LO. That component accesses database tables containing LO decisions and appropriate answers, which are used in evaluating the student’s performance and identifying the level of mastery exhibited for the skills and concepts used in the activities. This information is made available to the Student Modeler, which then uses it to update the Student Model.

The Content Selector Organizes Gameplay. The Content Selector reasons about and selects scenarios and activities stored in a database structure in which each learning activity has associated with it a set of concepts or skills (used to index the content in a content library) that the student is expected to use in performing that activity. The Content Selector chooses (guided by the Teaching Theory and the current state of the Student Model) a scenario or activity (an LO) that the student needs to complete in order to master the cognitive biases curriculum. It explicitly keeps track of the activities


in which the student has already participated, and of the concepts for which the student has already shown mastery, when selecting what to teach next. The Content Selector chooses the learning opportunities to be presented to the student in a given interaction with the game, and selects the sequence of presentation through a computational implementation guided by the learning and teaching theories. It may choose to replay an LO that the student has experienced before (within the replay-limit parameter associated with that LO in the database), and it may decide to skip learning activities that address concepts the student already understands, based on the contents of the Student Model. These learning activities are sequenced in a database table for use in driving the gaming scenarios.

Scoring and Mastery Levels. In Heuristica, scoring of the student’s performance within a given LO is performed by the game component being played. The LO records the performance scores for each concept in the game log. The Student Model calculates a concept understanding score for the student, called the mastery level, which is a value from 0.0 to 1.0 for each concept. A mastery level threshold, adjustable by an initialization parameter, is used to determine when the student’s performance is sufficient to decide that the student no longer needs additional practice on a given concept. The choice of this threshold must be made in consideration of the length of gameplay allowed and the level of student performance that indicates sufficient mastery. This choice of threshold is also affected by the strictness of the scoring component in the game.

Initializing the Student Model. A student’s interaction with the serious game system begins outside the game itself, in the form of a pretest conducted online. This pretest serves two important roles: it provides a baseline picture of the student’s initial understanding of the relevant concepts, and pretest scores may be used to seed the Student Model with initial mastery values. When the Student Model is used to make gameplay decisions, students who demonstrate a priori knowledge can move more quickly than students with less initial understanding through concepts that need not be covered in as much detail.
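The following sketch illustrates one way the Content Selector’s ranking logic, as summarized in the LO-selection detail of Fig. 3, could be implemented. The data structures, default parameter values, and ranking heuristic here are illustrative assumptions, not the project’s actual code.

```python
# Illustrative sketch of the LO-selection steps from Fig. 3: sliding-window
# mastery, prerequisite filtering, concept prioritization, and LO ranking.
from statistics import mean

def sliding_window_mastery(scores, n=4):
    """Average the n most recent per-concept scores (0.0 if none yet)."""
    recent = scores[-n:]
    return mean(recent) if recent else 0.0

def choose_next_lo(score_history, lo_index, prerequisites, play_counts,
                   replay_limit=2, mastery_threshold=0.8, window=4):
    # Concepts still below mastery, ordered with the weakest first.
    mastery = {c: sliding_window_mastery(s, window) for c, s in score_history.items()}
    open_concepts = sorted((c for c, m in mastery.items() if m < mastery_threshold),
                           key=lambda c: mastery[c])
    # Keep only concepts whose prerequisites are already mastered.
    ready = [c for c in open_concepts
             if all(mastery.get(p, 0.0) >= mastery_threshold
                    for p in prerequisites.get(c, []))]
    priority = {c: len(ready) - i for i, c in enumerate(ready)}  # higher = more urgent
    # Rank playable LOs by how many high-priority concepts they teach.
    candidates = [lo for lo, concepts in lo_index.items()
                  if play_counts.get(lo, 0) < replay_limit
                  and any(c in priority for c in concepts)]
    return max(candidates,
               key=lambda lo: sum(priority.get(c, 0) for c in lo_index[lo]),
               default=None)
```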

2 Practical Game Play Considerations

In the experiments with Heuristica reported elsewhere [16–19], students were given a pretest on the concepts taught in the game and a posttest after playing the game, to provide a measure of the learning that occurred in the game and to compare the learning in the tailored game approach to that in a fixed-order game. One of the constraints of the Heuristica intelligent learning serious game was to limit gameplay to approximately one hour, with learning to take place over a single playthrough of the game. This set of conditions differs from the conditions in which student models are traditionally applied, and some of the advantages of a longitudinal student model are not available. Several goals are desired for the proper learning and gameplay experience, and the Content Selector uses these conditions to reason about LO selections. Among these goals are the following:


1. Students should reach mastery in all concepts.
2. Students should experience a variety of gameplay options, such as game LOs mixed with worked examples.
3. Students should feel that they are making progress as the game is played.
4. Students’ time should be used in an efficient but effective manner: gameplay should not be too short or too long.
5. Mastery of the concepts should improve posttest cognitive bias test scores.

The Student Model is designed such that students are expected to play until they reach mastery in all concepts. Three conditions will end the game:

• Game complete: all concepts have been mastered.
• Content exhausted: there are no more playable LOs available to teach the concepts that have not yet been mastered. This is usually because the relevant LOs have already reached their replay limits. The replay limits for individual LOs are parameters that can be tuned to balance the improved learning from additional practice against the potential for the student to lose interest after a certain number of LO replays.
• Time limit reached: the student has exceeded the maximum time allowed for gameplay.

The first end state is preferred, and we call this a “complete” game. The other two end states indicate that something prevented the student from learning the relevant content before the game ended; these are “incomplete” games. There are several underlying causes that can result in an incomplete game. For example:

1. Gameplay difficulty is greater than learning difficulty, which means a competent student is unable to successfully complete in-game tasks irrespective of learning the underlying material.
2. Mastery scoring in gameplay is too harsh with respect to game objectives, which means a student who is performing qualitatively well is earning poor quantitative scores.
3. Gameplay content is failing to teach the learning material, which means that a competent student is failing to advance in learning.
4. A student is unable to learn the material quickly enough due to personal limitations.

This analysis was important to the project as the game was being developed and we were experimenting with scoring techniques, gameplay LO development and teaching quality, gameplay mechanics, and ease of use. Analysis of the student model representation and mastery levels was useful in the design iterations of these game components.
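A minimal sketch of the three game-ending conditions follows, assuming a per-concept mastery dictionary and a list of still-playable LOs; the function name, threshold, and time limit are illustrative rather than taken from the implementation.

```python
# Hedged sketch of the three end states described above; all names and
# default values are illustrative assumptions.
def game_end_state(mastery, playable_los, elapsed_minutes,
                   mastery_threshold=0.8, time_limit_minutes=60):
    if all(level >= mastery_threshold for level in mastery.values()):
        return "complete"                       # all concepts mastered
    if elapsed_minutes >= time_limit_minutes:
        return "incomplete: time limit reached"
    if not playable_los:                        # replay limits exhausted
        return "incomplete: content exhausted"
    return None                                 # keep playing
```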

3 Intelligent Tutoring Design Alternatives

The modular design of Heuristica and its student modeling components enabled the testing and evaluation of numerous teaching approaches and system design alternatives. As we experimented with some of our earliest designs, some of the shortcomings


were revealed and we replaced some of the reasoning and algorithms with improved approaches. There are lessons to be learned from a review of the alternatives and experiences in this project. In the context of Heuristica, under the constraints of gameplay duration and single-playthrough learning, a few of the intelligent tutoring design alternatives provided learning improvements over a non-tailored game experience, but many did not. Still, lessons can be drawn from the software and interaction designs and approaches. Some of these interaction approaches may be useful with extensions or under other conditions, and we present the most interesting of them here.

3.1 Open Student Model

An open student model [4, 7, 20] allows the student to inspect his or her progress as tracked by the student model, including the opportunity for the student to learn from the weaknesses recognized by reflecting on the mastery levels. An open student model also provides the opportunity for mixed-initiative interaction with the game; for example, if the student needs more practice on a particular concept, he or she could request that practice based on what was learned by inspecting the student model. Figure 4 shows one iteration of this interface within the game.

Fig. 4. Open Student Model interface within the game, including the student’s progress in six concept subgroups and a reflection opportunity.

In order to support the open student model, we provided a summary screen to the student as a snapshot of his or her progress, for insight into learning and for motivation. Throughout the Heuristica project we experimented with several representations of the summary screen. We were given feedback that our initial approach was too “busy” and contained too much information; later versions included a simpler graph that combined


several related concepts and tracked progress in those primary areas. We experimented with a screen that better fit the game’s narrative and used the language of badges and promotion on the Heuristica space station. A straightforward bar graph showing the progress made in each concept subgroup, and how far the student was from mastery in each, was well accepted.
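As an illustration only, the following sketch computes a simple per-subgroup progress display of the kind described above; the subgroup names, threshold, and text-bar rendering are assumptions, not the game’s actual interface code.

```python
# Illustrative text-based version of the bar-graph summary screen: one bar
# per concept subgroup, scaled against the mastery threshold.
def summary_screen(subgroup_mastery, mastery_threshold=0.8, width=20):
    lines = []
    for subgroup, level in subgroup_mastery.items():
        filled = round(min(level / mastery_threshold, 1.0) * width)
        lines.append(f"{subgroup:<22} [{'#' * filled}{'.' * (width - filled)}]")
    return "\n".join(lines)

print(summary_screen({"Confirmation Bias": 0.55,
                      "Anchoring Bias": 0.80,
                      "Projection Bias": 0.20}))
```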

3.2 Massed vs. Spaced Practice

Repeatedly exposing students to material in small, distributed lessons (spaced presentation) generally leads to better learning outcomes than aggregated presentation of the same material (massed presentation). The benefits of distributed spacing have been recognized for over a century [21]. In Heuristica, in which gameplay consists of a set of short LOs and total gameplay time is limited, the Content Selector must decide how much gameplay should be presented on a particular topic before switching to LOs focused on a different bias concept group. Working with cognitive psychologists, we identified three approaches:

1. Go deep (massed practice): provide the student with experience in LOs associated with one cognitive bias at a time.
2. Go broad (spaced practice): rotate through LOs related to multiple cognitive biases, then return to iterate on all of them.
3. Mixed approach: allow the student to achieve a moderate level of mastery in a given cognitive bias before moving to the next one. Once the student has covered each bias to a moderate level, return to reach full mastery in each of the cognitive biases.

Recognizing that there will be instances when any one of these is most desirable, we implemented a parameter that allows the game to run in any of these modes. The mixed approach solved several problems for us. The cognitive bias content that the Heuristica game teaches requires several LOs before the student has enough experience with a cognitive bias to begin to grasp the concepts, which argues against a pure go-broad approach. In addition, one of our designs includes a mixed-initiative functionality (described later in this paper) that offers the student a choice of two LOs for his or her next learning activity. If we are giving the student a meaningful choice, these two LOs should contain notably different content. The mixed approach enabled us to address both of these considerations. We recognize that there will likely be different tradeoffs in other serious games.
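The sketch below illustrates how the go-deep, go-broad, and mixed policies could be expressed as a single run-time mode parameter; the group handling, thresholds, and rotation logic are illustrative assumptions rather than Heuristica’s implementation.

```python
# Illustrative selection of the next bias concept group under the three
# practice policies; "moderate" and "full" mastery levels are assumptions.
def next_concept_group(groups, mastery, mode="mixed",
                       moderate=0.5, full=0.8, current=None):
    unmastered = [g for g in groups if mastery[g] < full]
    if not unmastered:
        return None
    if mode == "massed":      # go deep: stay on one group until fully mastered
        return current if current in unmastered else unmastered[0]
    if mode == "spaced":      # go broad: rotate through all unmastered groups
        if current in unmastered:
            i = unmastered.index(current)
            return unmastered[(i + 1) % len(unmastered)]
        return unmastered[0]
    # mixed: reach a moderate level on the current group before moving on;
    # once every group is at the moderate level, finish each to full mastery
    if current in unmastered and mastery[current] < moderate:
        return current
    below_moderate = [g for g in unmastered if mastery[g] < moderate]
    return (below_moderate or unmastered)[0]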

3.3 Mixed Initiative

In the context of an intelligent tutoring system, “mixed-initiative” interaction refers to an approach in which, in some cases, the system decides which learning opportunity the student will experience next, and in other cases the student may decide. As one of the experimental designs in Heuristica, we developed an interaction that provides the student with two choices as to which LO to play next. The choices offered by the Content Selector were both relevant to the student’s learning process, as represented in the Student Model. This choice encouraged the student to reason about his or her own learning needs and allowed the student to select something that may be


of interest [4, 20]. In the Heuristica mixed-initiative approach, the set of choices was constrained by the narrative flow (in the form of LO prerequisites) and concept presentation order (in the form of concept prerequisites). Figure 5 shows how this choice was presented to players.

Fig. 5. In-game user interface showing mixed-initiative choice. The highlighted bars indicate to the student which bias subgroups are covered by the choice under the cursor.

The first mixed-initiative algorithm design was implemented with the go-deep approach, which resulted in the student often being offered the choice between two LOs that were very similar in content. We followed this with a combined approach, in which the student received either (a) a choice between two LOs of the same bias whenever the student was in the massed portion of the learning process, or (b) a choice between two LOs in different biases whenever the student had reached moderate mastery of the current bias. The mixed approach to the massed vs. spaced tradeoff worked well in support of the mixed-initiative goal of allowing the student some choice within the constraints of Heuristica.

3.4 Reflection

In an effort to increase interaction and to encourage students to analyze their own learning [3, 20, 22], we implemented a reflection screen after some of the LOs. This reflection screen presented the student with a question related to what had been learned about the most recently presented topic, such as “What is an important consideration in bias mitigation?” Another version of reflection used an in-game narrative prompt that encouraged the student to leave hints for the next person who played that LO. Figure 4 includes one iteration of the reflection interaction within the game.


The reflection approach provided us with some interesting insight into what was being learned in some of the LOs. It allowed us to see whether the messages being presented in the LOs were coming through clearly. It also presented a trade-off decision regarding how often we could ask the student to reflect without the prompts becoming irritating [18].

3.5 Student Model Seeding Alternatives

We experimented with seeding the Student Model using different multipliers against the pretest values, in addition to conditions in which students began with no seeding. The implementation of the Student Modeler included parameters that controlled seeding behavior. A maximum seed multiplier, 1.0, would allow students to receive full credit for per-concept scores earned on the pretest, enabling a student to reach measured mastery quickly, potentially shortening gameplay time but leaving gaps in the student’s real-world learning. Lower multiplier values (and completely disabled seeding) would require the student to spend more time in the game to prove their mastery of concepts. The seeding multiplier must also take into consideration the accuracy of the mapping between pretest questions and in-game concept identifiers. A seeding multiplier of 0.65 represented a compromise between these two extremes.
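A minimal sketch of pretest seeding with a tunable multiplier follows. The function name and the example pretest scores are illustrative; only the 0.65 compromise value is taken from the discussion above.

```python
# Sketch of pretest seeding: per-concept pretest scores (assumed 0.0-1.0)
# are scaled by a tunable multiplier to produce initial mastery levels.
def seed_student_model(pretest_scores, multiplier=0.65):
    """Initialize per-concept mastery levels from pretest performance."""
    return {concept: multiplier * score for concept, score in pretest_scores.items()}

# multiplier=1.0 gives full credit for pretest knowledge (shorter games);
# multiplier=0.0, or disabling seeding, forces in-game proof of every concept.
seeds = seed_student_model({"confirmation_bias.definition": 0.9,
                            "anchoring_bias.definition": 0.4})
```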

3.6 Mastery Level Calculation Alternatives

Multiple algorithms for calculating the mastery level of a given concept were evaluated as part of this project. Our experiments identified several tradeoff dimensions for the mastery level calculations: the desire to manage the length of the gameplay, the recognition that scores in the distant past are not as relevant to estimating the student’s current knowledge state, and the desire to keep students motivated by not decreasing mastery levels following an unsuccessful LO performance. In one design alternative, the mastery level is based on a rolling average of the last n scores posted for a particular concept for this student (where n = 4, typically, for a short game like Heuristica). Basing the measure of the student’s knowledge of a given concept on a rolling average allows the student’s improvement over time to be measured in a way that excludes scores from the “distant” past, e.g., the complete lack of knowledge at the beginning, but includes the recent performance that exhibits the student’s current state of knowledge across LOs. In an attempt to shorten the time of gameplay, we tested the use of a “best-case” mastery level, in which each concept’s overall mastery level was calculated using the student’s best n scores. This approach also addressed a concern expressed in playtesting comments and reviews by designers that students were discouraged when more play and experience with a particular concept resulted in a drop in their mastery levels. This can happen if a student performs poorly on a later LO and the low score gets averaged in for the mastery level calculation. However, the best-case scoring algorithm (as opposed to the most-recent scoring algorithm) resulted in too little practice for students, as it allowed mastery levels to rise more quickly than actual concept learning, indicated by lower learning levels on the posttest.
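The two calculations compared above can be sketched as follows, assuming each concept keeps an ordered list of 0.0–1.0 LO scores; the function names are ours, and n = 4 follows the short-game configuration described above.

```python
# Sketch of the rolling-average and best-case mastery calculations.
from statistics import mean

def mastery_most_recent(scores, n=4):
    """Rolling average of the n most recent scores for a concept."""
    return mean(scores[-n:]) if scores else 0.0

def mastery_best_case(scores, n=4):
    """Average of the student's best n scores; rises faster and never drops."""
    return mean(sorted(scores, reverse=True)[:n]) if scores else 0.0

history = [0.2, 0.5, 0.9, 0.4, 0.7]
assert mastery_most_recent(history) == mean([0.5, 0.9, 0.4, 0.7])
assert mastery_best_case(history) == mean([0.9, 0.7, 0.5, 0.4])
```

Because the best-case value is never lower than the rolling average over the same history, it shortens gameplay but, as noted above, overstated concept learning in practice.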

3.7 Novelty in the Content Selector

In Heuristica, each LO has a parameter that controls the number of times that a student can be given that LO to play. Some LOs teach several important concepts, and the Content Selector’s algorithms tend to choose them for replay as many times as the replay parameter will allow. The result is that some LOs never get played. In order to give students adequate practice and repetition, while still providing variety for an interesting game, a “novelty” metric was added to the Content Selector. This decreased the selection likelihood of LOs that had already been played in favor of unplayed LOs that covered the same target concepts. The purpose of this parameter is to prevent the game from being too repetitive and to provide more interest and variety to the players.
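One possible form of such a novelty adjustment is sketched below; the relevance function and the decay factor are illustrative assumptions, not the project’s tuned values.

```python
# Sketch of a novelty adjustment: each prior play of an LO scales its score
# down so an unplayed LO covering the same concepts wins ties.
def novelty_weighted_choice(candidate_los, relevance, play_counts, decay=0.5):
    def score(lo):
        return relevance(lo) * (decay ** play_counts.get(lo, 0))
    return max(candidate_los, key=score, default=None)
```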

3.8 Simulations to Support Rapid Prototyping

The Student Model works in tandem with the game to evaluate student performance and schedule gameplay elements that are best suited to the student’s current progress. To test Student Modeler improvements without playing the entire game, we developed a Game Surrogate that stands in for the game itself and allows the tester to enter concept scores either manually or from historical cases gathered from real students. Figure 6 shows the Student Model above the Game Surrogate, its History window, and the Game Surrogate Automator. The Game Surrogate prompts the tester to enter scores for each concept in the current LO. The History window enables the user to select from a set of real concept scores gathered from the logs of previous student tests.

Fig. 6. The Student Model diagnostic view, showing the current player’s scores for all concepts after each LO played so far (top); the Game Surrogate testing harness (bottom left) and its historical data selection dialog (bottom middle); and the Game Surrogate Automator.


In addition to direct testing of the Student Model for a single configuration, statistical analysis of many concepts across many student scores was necessary. The Game Surrogate Automator allows rapid simulation of multiple complete gameplays in order to determine the effect of parameter and behavior adjustments to the Student Modeler and Content Selector components. The Automator simulates responses from a student playing the game, randomly selecting a set of concept scores for each LO based on the scores from real student gameplay history. It creates multiple games and virtually plays each game to completion, with the Student Modeler and Content Selector performing their normal roles in the evaluation and LO-selection process. For example, to test the effect of adjusting the Mastery Score Threshold from 0.8 to 0.82, the Automator was configured to play 50 games with each setting. The logs of those games were then analyzed, revealing that, based on in-game scores earned by students in previous testing cycles, the average number of LOs played increased by two. Evaluation of the game with real students is still required for testing some design decisions. For many changes, however, these testing components provided a harness within which certain parameters and behaviors of the student modeling components could be compared much more rapidly, with a reasonable level of fidelity.
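The Automator’s role can be sketched as follows, with stand-in functions for the Content Selector and Student Modeler and with historical per-LO scores sampled at random; none of the names or structures are taken from the actual test harness.

```python
# Hedged sketch of the Game Surrogate Automator: replay many simulated games,
# drawing concept scores from historical logs, to compare two configurations.
import random
from statistics import mean

def simulate_game(historical_scores, choose_next_lo, update_mastery, config):
    """Virtually play one game to completion using sampled historical scores."""
    mastery, los_played = dict(config["seeds"]), 0
    while (lo := choose_next_lo(mastery, config)) is not None:
        scores = random.choice(historical_scores[lo])  # per-concept scores from real logs
        update_mastery(mastery, lo, scores, config)
        los_played += 1
    return los_played

def compare_settings(historical_scores, choose_next_lo, update_mastery,
                     base_config, thresholds=(0.80, 0.82), runs=50):
    """Average LO count over `runs` simulated games per mastery-threshold setting."""
    return {t: mean(simulate_game(historical_scores, choose_next_lo, update_mastery,
                                  dict(base_config, mastery_threshold=t))
                    for _ in range(runs))
            for t in thresholds}
```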

4 Conclusion

This paper has described our project’s investigation of design alternatives to support learning in the serious game Heuristica and the lessons learned through the exploration of their usage and behaviors in the integrated system. The exploration centered on the design decisions and resulting tradeoffs for a student model and its associated content selector, which makes decisions based on the state of the student model. These design alternatives were explored within the constraints of a short (approximately one hour) gameplay session with a requirement to complete learning in one playthrough of the game. We believe that the described designs and components can be extended and adapted to other games with longer allowed gameplay and multiple sessions, and that the lessons learned can be applied under other conditions. This paper discusses the design tradeoffs for a number of characteristics of the intelligent tutoring portion of this project, including:

1. An open student model that shows students how the system perceives their concept mastery;
2. Massed vs. spaced practice approaches that guide content selection with respect to bias concept groups;
3. Mixed-initiative choices offered to students, and their dependence on the massed vs. spaced approaches;
4. A reflection question to encourage the student to reason about his or her own learning, and the value of the responses in providing insight into what was actually being learned from the LOs;


5. Pretest seeding as a form of initialization of the student model mastery levels, so that students who demonstrate a priori knowledge can move more quickly through concepts that need not be covered in as much depth;
6. Different mastery level calculation approaches to best represent the learning of a student for content selection; and
7. A novelty metric to guide gameplay so that it is diverse and engaging to the student.

We also described the design and implementation of an automated testing framework that allowed the behaviors of the student modeling components to be evaluated with respect to their software design, algorithm implementations, and parameter adjustments, prior to testing with students. Together, this work describes several design choices that leverage aspects that have been effective in human tutors [2] and intelligent tutoring systems [4, 15], and it contributes to a growing body of research on integrating intelligent tutoring into video games [5, 6, 10, 11, 19].

Acknowledgements. This research was supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Air Force Research Laboratory (Contract #FA8650-11-C7177). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.

References

1. Carr, B., Goldstein, I.: Overlays: a theory of modeling for computer-aided instruction. AI Lab Memo 406. MIT, Cambridge (1977)
2. Chi, M.T., Siler, S.A., Jeong, H.: Can tutors monitor students’ understanding accurately? Cogn. Instr. 22(3), 363–387 (2004)
3. Goodman, B., Soller, A., Linton, F., Gaimari, R.: Encouraging student reflection and articulation using a learning companion. In: Proceedings of the AI-ED 1997 World Conference on Artificial Intelligence in Education, pp. 151–158 (1997)
4. Graesser, A.C., Hu, X., Sottilare, R.: Intelligent tutoring systems. In: International Handbook of the Learning Sciences, pp. 246–255. Routledge (2018)
5. Graesser, A., et al.: Critiquing media reports with flawed scientific findings: Operation ARIES! A game with animated agents and natural language trialogues. In: Aleven, V., Kay, J., Mostow, J. (eds.) ITS 2010. LNCS, vol. 6095, pp. 327–329. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13437-1_60
6. Halpern, D.F., Millis, K., Graesser, A.C., Butler, H., Forsyth, C., Cai, Z.: Operation ARA: a computerized learning game that teaches critical thinking and scientific reasoning. Think. Skills Creat. 7(2), 93–100 (2012)
7. Holden, S., Kay, J.: The scrutable user model and beyond. Basser Department of Computer Science, University of Sydney (1999)
8. Hwang, G.J.: A conceptual map model for developing intelligent tutoring systems. Comput. Educ. 40(3), 217–235 (2003)


9. Lynch, C., Ashley, K.D., Aleven, V., Pinkwart, N.: Defining ‘Ill-Defined Domains’: a literature survey. In: Aleven, V., Ashley, K.D., Lynch, C., Pinkwart, N. (eds.) Eighth International Conference on Intelligent Tutoring Systems Workshop on Ill-Defined Domains, Jhongli, Taiwan, pp. 1–10 (2006)
10. Mayer, R.E.: Computer Games for Learning: An Evidence-Based Approach. MIT Press, Cambridge (2014)
11. Millis, K., Forsyth, C., Butler, H., Wallace, P., Graesser, A., Halpern, D.: Operation ARIES!: a serious game for teaching scientific inquiry. In: Ma, M., Oikonomou, A., Jain, L. (eds.) Serious Games and Edutainment Applications, pp. 169–195. Springer, London (2011). https://doi.org/10.1007/978-1-4471-2161-9_10
12. Mullinix, G., et al.: Heuristica: designing a serious game for improving decision making. Paper presented at IEEE Games Innovation Conference (IGIC), Vancouver, BC (2013)
13. Psotka, J., Massey, L.D., Mutter, S.A.: Intelligent Tutoring Systems: Lessons Learned. Psychology Press, London (1988)
14. Roose, K., Veinott, E.: Roller coaster park manager by day, problem solver by night: effect of video game play on problem solving. In: Extended Abstracts Publication of the Annual Symposium on Computer-Human Interaction in Play, Amsterdam, Netherlands, pp. 277–282. ACM (2017)
15. VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 46(4), 197–221 (2011)
16. Veinott, E., et al.: The effect of 3rd person perspective and session duration on training decision making in a serious video game. In: Proceedings of the Games Innovation Conference (IGIC), 2013 IEEE International, Vancouver, BC, pp. 256–262. IEEE (2013)
17. Veinott, E., et al.: The effect of cognitive and visual fidelity on decision making: is more information better? In: IEEE Games Innovation Conference (IGIC), 2014 IEEE International, Toronto, ON. IEEE, October 2014
18. Veinott, E., Whitaker, E.: Leaving hints: using in-game player reflection to improve and measure learning. In: International Conference on Human Computer Interaction (HCII), Orlando, FL (2019)
19. Whitaker, E.T., et al.: The effectiveness of intelligent tutoring on training in a video game. In: 2013 IEEE International Games Innovation Conference (IGIC), pp. 267–274. IEEE, September 2013
20. Woolf, B.P.: Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing e-Learning. Morgan Kaufmann, San Francisco (2009)
21. Dempster, F.N.: The spacing effect: a case study in the failure to apply the results of psychological research. Am. Psychol. 43(8), 627 (1988)
22. Wylie, R., Chi, M.T.: The self-explanation principle in multimedia learning. In: Mayer, R.E. (ed.) The Cambridge Handbook of Multimedia Learning, pp. 271–286. Cambridge University Press, Cambridge (2014)

Interoperability and Standardization in Adaptive Instructional Systems

Missing Pieces: Infrastructure Requirements for Adaptive Instructional Systems

Avron Barr¹ and Robby Robson²

¹ Institute for Defense Analyses and IEEE Learning Technology Standards Committee, Aptos, USA
[email protected]
² Eduworks Corporation and IEEE Standards Association, Corvallis, USA
[email protected]

Abstract. As the market embraces adaptive instructional systems (AISs) and related data-intensive learning technologies, effective and economical deployment of products from multiple vendors will require critical infrastructure components. Four important types of data must be managed and securely shared across applications: the learner’s background and objectives; a profile of the learner’s current state of mastery; live data recording the learner’s current activities; and metadata describing the available learning activities. Additionally, software tools to manage multi-product learning environments will be needed. This paper explains the importance of these infrastructure components, reviews their state of readiness, and anticipates the benefits they will offer to various stakeholders.

Keywords: Adaptive instructional systems · Artificial intelligence · Competencies · Data analytics · eLearning infrastructure · Experience Application Program Interface (xAPI) · Intelligent Tutoring Systems (ITSs) · Learner modeling



1 Evolving Digital Learning Architectures

Digital learning, i.e., the use of computer- and network-based learning applications, has been adopted in all major education and training markets: schools, colleges, enterprise training, and professional certification. eLearning systems have been used to save costs, broaden markets, improve learning, and make learning more convenient for the learners (anytime, anyplace). Since they were introduced in the 1990s, the focal point of digital learning has been the institutional Learning Management System (LMS). The LMS handles many administrative issues:

• Student rosters and assignments;
• Learning materials, contracts with publishers, and usage statistics;
• Tracking and reporting learner activity and performance;
• Class management and communications; and
• Room and equipment scheduling.



Since the advent of software as a service (SaaS), many organizations have adopted cloud-based LMS offerings, but the basic configuration remains the same: the multipurpose LMS is the main learning-specific infrastructure component needed to offer and manage digital learning environments. However, both technological and market developments are changing the way digital learning will be deployed.

1.1 Market Changes

Modern learners often explore multiple cloud-based offerings in their pursuit of knowledge, in addition to those that have been assigned via an LMS. In fact, learners are likely to be affiliated with multiple learning institutions and on-line activity vendors simultaneously. Changes in career paths and reduced corporate investments in in-house employee training have created increased demand for recording and managing an individual’s learning over their entire lifetime, from schooling to employment and beyond. As a result, “learner portability” has replaced “content portability” as the central interoperability issue in digital learning. While we once focused on assuring that all content could run on all LMSs, our biggest problem now is securely sharing data about learner preferences, backgrounds, traits, and achievements across learning systems and across institutional and content provider boundaries. Only through such sharing can we assure that a learner’s interactions with all of these systems are informed by the learner’s past activity and current state of understanding. The demand for learner portability will only grow over time, with increasing global availability of online opportunities, the “gig economy,” and labor market pressures to facilitate frequent re-training and career changes [1].

1.2 Cloud-Based Learning Activities

Simulations, gaming, virtual reality, augmented reality, and other newer technologies are being used more frequently to create immersive, exploratory, and collaborative learning and rehearsal environments. Increasingly, these more complex offerings are delivered via the cloud. The institutional LMS is not designed to manage a learner’s activities outside of a single institution or to track learner data across these systems.

1.3 The Data Storm

Learning platforms now include mobile phones, tablets, and virtual/augmented reality devices as well as the instrumentation of learners with devices that measure their location, movements, biometrics, affective state, and instant-by-instant actions. Digital learning activities can generate detailed data about the learner’s actions, choices, responses, hesitations, biometrics, affective state, and much more. As learners spend more and more time using these digital platforms, we anticipate a dramatic increase in the variety, velocity, and volume of data they produce.

1.4 Artificial Intelligence

Learning activities are getting smarter. Adaptive Instructional Systems (AIS), for example, use data about a learner’s history, preferences, recent performance, and state of mind to offer personalized instruction. In an AIS, the path that a student takes through the material, the modality in which instruction is delivered, the style of remediation, and the feedback about progress can all be tuned to the individual learner’s needs and preferences. There are several other emerging product categories that use data-intensive algorithms to improve learning and help teachers, students, and administrators, including Intelligent Tutoring Systems, learning analytics engines, coaches, recommenders, robo-graders, and pedagogical agents. In general, the requirements for learner data are largely driven by AI technologies, which are increasingly being applied to education and training in many forms, including machine learning, learning analytics, language and speech processing, computer vision, and affect recognition. The use of AI technologies improves the way learning activities model and interact with learners, e.g., by personalizing their experience, processing written and spoken language, or diagnosing their misunderstandings. Some learning activity publishers, having access to data from tens of thousands or even millions of customers, use machine-learning algorithms to glean insights into learner behavior, effective remediation, and product improvement. As with all AI algorithms, the more data a system has access to, the better the results.

1.5 Architectural Variations

Different kinds of organizations will evolve towards different infrastructure architectures. Public schools, for instance, may rely in part on infrastructure components supplied by their school district or state Department of Education. Corporations, associations, and government organizations may build the entire ecosystem in-house, or rely, partially or completely, on SaaS offerings. Publishers will track how learners use their products in aggregate, but they may not be allowed to track individual learner progress. It is not possible at this time to offer concrete guidelines about how to construct an organization’s future elearning ecosystem for several reasons:

• Installations in different market segments have different goals, constraints, and already-installed systems;
• The functionality of some infrastructure components is still evolving;
• There is no way to predict how the needed functionality will be combined and packaged in new products; and
• Some important interoperability standards are still in development.

Today, every instance is an experiment. By describing the key infrastructure components, we hope to inform the many architectural decisions that lie ahead for all providers of digital learning.


2 Essential Architectural Components

Modern learning ecosystems must extend the traditional, LMS-based architecture to effectively manage the dramatic increase in learner mobility across suppliers, the variety of learning activities, and the amount of learning data. Specifically, operators will need tools to help them manage several types of data as well as the overall ecosystem configuration:

• Learner Background: It is essential that all learning activities have access to data about the learner’s history, current objectives, preferences, and current activities. These are the things any teacher or tutor would want to know about a new student.
• Competency: Characterization of the learner’s level of mastery relative to the level required to achieve their objectives, as defined by a certifying institution or firm in terms of a competency framework.
• Activity stream data: Detailed data about what a learner is doing in real time, required by AI-enhanced learning activities and by supplemental products such as analytics engines and data visualization tools, which respond to data about what the learner is doing second by second.
• Learning activity metadata: Learners, as well as intelligent tutors and recommender systems, benefit from knowing specifics about the available learning activities and their relevance to the learner’s current state of understanding and learning goals.
• Component registration and configuration management: Typically, a modern learning organization will deploy multiple systems locally and integrate with an array of cloud-based data providers. Tools to support the easy (automatic) registration and management of those systems will eventually be a necessity.

Some of these needed infrastructure components are available commercially now. Some we expect will be available soon, because prototypes have been used in real applications. And some key components are still on the drawing board. Commercial products often combine functionality, e.g., by adding features to the LMS, so there is not necessarily a one-to-one correspondence between the functionality needed and the emerging product categories.

2.1 The Learner’s Background

Current commercial and research systems that model the learner’s background, interests, preferences, objectives, and current state of understanding start off with a blank slate. This practice will eventually become problematic for teachers and learners who use multiple systems simultaneously. It will be necessary to decouple the learner model from the learning activity and to create a means that enables all products from all vendors to effectively share this important information about the learner. Moreover, since over the course of their lives, learners work across institutions and activity providers, the learner model will need to be persistent and independent of any single institution or provider. To the best of our knowledge, there are no stand-alone products on the market now that maintain a learner model across activities and vendors. In fact, there is no


agreed-upon standard format for the data contained in such a learner model. While there has been research in learner modeling for a decade (e.g., Bull and Kay [2]), the demand for products that support multiple learning activities has been more theoretical than market-driven. Now, however, there is an increased real-world need for multiple learning activities to share learner model data.

2.2 Competency Frameworks and Learner Profiles

In research systems, keeping track of the learner’s developing mastery of the subject at hand is usually considered part of the Learner Model, along with the background information described above. Nonetheless, it makes sense to architecturally separate competency management functionality from Learner Models for several reasons:

• Background information is typically used and updated less frequently, and by a different set of applications, than competency information.
• The competencies of interest, i.e., the learner’s objectives, may vary as the learner moves between learning activities, e.g., when studying Algebra and Geometry in the same semester.
• Descriptions of competencies and rules for managing them are idiosyncratic. Schools, employers, publishers, and government agencies all have their own ways of thinking about competencies and ascribing mastery.
• Security considerations for data about the learner’s progress and status may be very different from those for background data.

In addition to describing the learner’s current state and learning objectives, competency information is used to describe course requirements; degree requirements; certificates and credentials; job requirements; and the intended use of learning materials and activities. Every educational institution, corporate HR department, and learning activity vendor has its own framework for describing relevant competencies. Managing competency data requires new tools. Their functionality may differ somewhat across markets, and, again, future products may implement different sets of features, but digital learning organizations will need tools to work with the following aspects of competency management:

• Competency framework: A structured representation of the knowledge, skills, and abilities of interest. Frameworks capture the titles and descriptions of competencies, relationships among competencies, and data such as levels and assessment methods.
• Competency evidence: Each organization defines the nature and format of the evidence it accepts in evaluating progress towards mastery.
• Learner competency profile: Profiles track each learner’s level of mastery of each competency relative to their learning goals. Some organizations may incorporate a model of “decay over time” into the way they model the learner’s state of knowledge.
• Evidence evaluation and rollup rules: Different organizations have different rules for accepting evidence of learning and different algorithms for aggregating that evidence to draw conclusions about learners’ progress. Rules for concluding the


learner’s state of understanding of a higher-level competency, based on their mastery of sub-competencies, will also vary across organizations.

Some LMSs and enterprise talent management systems offer tools for building and maintaining competency frameworks. These tools, however, are focused on a single institutional context and do not conform to standards that have been developed to compare or share competency frameworks. More recently, projects such as the Competency and Skills System project (CaSS, www.cassproject.org) and OpenSALT (www.opensalt.org) have been launched with the capability of sharing competency frameworks across multiple systems. CaSS, in particular, enables frameworks to be imported and exported in standardized formats, conforms to web standards such as linked data, and enables users to:

• Create, store, and share competency frameworks, specifying local terminology, rollup rules, and associated assessments;
• Compare and exchange competency definitions among organizations, recognizing the likely differences in terminology and semantics [3]; and
• Specify relationships among competencies and rules for concluding mastery from evidence and from mastered sub-competencies.

In addition, products are emerging that enable learner profiles to be stored, updated, and consumed by multiple learning, training, and staffing systems. These products include CaSS, MARI (www.mari.com), Viridis (www.viridislearning.com), Degreed (www.degreed.com), and many others, each with its own focus and target market. There are even more products that store learner profiles internally for their own use. Standardizing a learner profile component could enhance the value and lower the production cost of all of these systems [4].
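To illustrate the kind of rollup rule discussed above (and not the CaSS API or any product’s actual format), the following sketch treats a parent competency as mastered once all of its sub-competencies are mastered, with an optional decay-over-time factor; the framework, profile data, and half-life are invented for the example.

```python
# Illustrative competency framework, learner profile, and rollup rule.
from datetime import datetime, timedelta

framework = {                      # parent -> sub-competencies (assumed structure)
    "algebra": ["linear_equations", "factoring"],
    "linear_equations": [],
    "factoring": [],
}

profile = {                        # competency -> (mastery level, last evidence date)
    "linear_equations": (0.9, datetime(2019, 5, 1)),
    "factoring": (0.85, datetime(2019, 1, 15)),
}

def decayed(level, last_seen, half_life_days=365, now=None):
    """Optional decay-over-time model: mastery halves every half_life_days."""
    now = now or datetime.now()
    age = (now - last_seen) / timedelta(days=half_life_days)
    return level * (0.5 ** age)

def mastered(competency, threshold=0.8, now=None):
    subs = framework.get(competency, [])
    if subs:                                        # rollup rule for parents
        return all(mastered(s, threshold, now) for s in subs)
    level, last_seen = profile.get(competency, (0.0, datetime.min))
    return decayed(level, last_seen, now=now) >= threshold
```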

2.3 Runtime Activity Stream Data

Perhaps the most advanced new infrastructure components, in terms of commercial product availability, are tools to collect and manage the runtime data generated during learners’ actual learning activity. Many types of applications use this activity stream data [5], including:

• AI-enhanced tutors, recommenders, and adaptive instructional systems that track what a learner is doing in real time in order to make diagnoses and to offer useful remediation and recommendations;
• A variety of learning analytics and data visualization products that use real-time data to give feedback to students, apprise teachers of the status of their students, and issue early warnings about learners who need attention; and
• Publisher applications that monitor runtime data to analyze how learners use their materials, looking for insight into problematic content.

The Learning Record Store (LRS) is an emerging database product category for collecting and managing learner activity stream data. Most LMSs now incorporate an LRS, and most independent LRS products on the market are stand-alone “analytics” systems that typically offer more robust data visualization functionality than the LMSs


that produce the data. The US Advanced Distributed Learning Initiative sponsored development of the protocols for sending activity stream data to an LRS and for querying an LRS [6]. LRSs can exchange data with any conformant learning activity and with each other. The xAPI protocol is relatively mature and, we expect, will soon be published as an IEEE standard. Many installations now have multiple LRSs, each monitoring some or all learner data streams for specific purposes. Security and privacy considerations will influence the architecture of LRSs and the implementations of xAPI-based data sharing.
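For illustration, the sketch below builds a minimal xAPI-style actor/verb/object/result statement as a plain dictionary, of the kind a learning activity might send to an LRS statements endpoint. The activity IRI and learner identifier are placeholders, and a real deployment would follow the full xAPI specification for identifiers, authentication, and versioning.

```python
# Minimal xAPI-style statement (illustrative IRIs and learner identifier).
import json
from datetime import datetime, timezone

statement = {
    "actor": {"objectType": "Agent", "mbox": "mailto:learner@example.org"},
    "verb": {"id": "http://adlnet.gov/expapi/verbs/completed",
             "display": {"en-US": "completed"}},
    "object": {"id": "https://example.org/activities/diagnose-patient",
               "objectType": "Activity"},
    "result": {"score": {"scaled": 0.75}, "success": True},
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(statement, indent=2))  # serialized form sent to an LRS
```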

2.4 Learning Activity Metadata

Related to the problem of sharing competencies across institutions, where each has its own way of describing learning objectives and learner mastery, there are problems with current approaches to describing learning activities and materials. Current standards for describing learning material are largely bibliographic, with a few nods to pedagogical considerations, such as grade level, reading level, and links to national school curricula standards or textbook chapters [7]. In a world where AI-enhanced tutors and recommenders are trying to identify exactly the right next step for learners, based on knowledge about their past activity and current state of knowledge, the current schemes fall short. Unfortunately, research doesn’t bode well for a general solution. Pilot studies using the Advanced Distributed Learning Initiative’s Total Learning Architecture [8, 9], a framework for integrating advanced learning activities and applications, showed that different recommenders and intelligent tutors used different schemes for tagging learning activities with the information they use to make decisions. For example, some recommender engines want to see content tagged as “intro vs. easy vs. hard,” while others might base their recommendations on a formal pedagogical framework such as Bloom’s taxonomy [10]. The precise parameters on which each recommender bases its decisions seem to be part of the recommender’s “secret sauce” and hence a poor candidate for standardization at this time. Of course, each learning activity needs to be tagged with the appropriate metadata. Publishers are able to adequately describe their own learning activities, for their own instructional environment and recommendation engines, but tagging content for general use by recommenders and other AI-enhanced products can involve a great deal of manual labor. AI language processing techniques have been explored for many years as a way to automate metadata tagging [11, 12].

2.5 Component Registration and Configuration Management

Finally, broad adoption of advanced systems and newly enabled pedagogical initiatives such as competency-based training or mastery-based education will require new tools. Administrators will need to add, remove, and monitor all of these infrastructure components and control communications with the tools that students and teachers are using and with the learning activities themselves, whether they are located locally or in the cloud. All stakeholders, not just administrators, will benefit from these tools. For


publishers, for instance, one might imagine automatic registration of a new cloud-based learning activity, including alignment of competency descriptions [13].

3 Infrastructure and the Economics of Adaptive Instruction

Privacy concerns will shape, and possibly impede, market acceptance of intelligent learning technologies [14]. The full commercial development of all of the infrastructure components needed to support the economic deployment of Adaptive Instructional Systems and related AI-enhanced products may be years off. To deal with natural privacy concerns, the industry needs to present a strong argument about the value of storing and sharing learner data, now. In our opinion, AI technologies will continue to drive the demand for more learner data, while privacy concerns will increasingly throttle the sharing of that data in many markets. Infrastructure tools that allow individuals and institutions to define the rules governing the procurement and deployment of data are key to assuring the community and the public that their data is needed and that it is managed professionally. In addition to addressing data privacy, another benefit of the supportive infrastructure described earlier is facilitating the sharing of data from multiple vendors among AI-enabled products. In some markets, publishers with adaptive learning products in multiple subject areas have begun to share internal data elements, such as background and competency information, across their product lines. Sharing data across apps benefits learners by improving personalization. For example, a Geometry tutor might be able to use information about the learner’s knowledge of Algebra. As AIS products mature, and as customers acquire and deploy multiple systems from different vendors, the pedagogical, economic, and technical inefficiencies of the monolithic AIS model will become increasingly evident. Education and training organizations that are running multiple adaptive courses will begin to build out infrastructure to make it easier for them to install, deploy, use, monitor, evaluate, and change out AIS products. The resulting technical infrastructure will in turn simplify requirements and processes for product developers, teachers, publishers, and students, as well as for IT managers. Market adoption of key standards is essential to realizing economic gains over the lifetime of AI-enhanced products. Every infrastructure element depends on standard data protocols that are implemented by all of the vendors. Several research projects, including the US Advanced Distributed Learning Initiative’s Total Learning Architecture and the US Army Research Lab’s Generalized Intelligent Framework for Tutoring [15, 16], have explored the issues involved in deploying complex adaptive systems. Both of these projects are simultaneously contributing to IEEE standards working groups that will define the software interfaces needed for these infrastructure components [17]. Current relevant projects at the IEEE Learning Technology Standards Committee include:

• xAPI: base standard, best practices, and, soon, xAPI profiles;
• Adaptive Instructional Systems: concept and definitions, component interoperability, and best practices for evaluation of an AIS;


• Competency definitions and frameworks;
• Child and Student Data Governance: definitions and best practices; and
• Federated Machine Learning: how do you use big data techniques when the data is distributed across many institutions that are not allowed to share?

At the present time, some of the key products needed to build the infrastructure described in this paper do not exist or are in a pre-product (custom-built) state of evolution. Also, we expect to see the functionality and configuration of these infrastructure products evolve differently across market segments, including K-12; higher education; enterprise and military training; and professional certification. Customers are just beginning to use AI-enhanced learning products, but they will soon find that infrastructure components are needed to realize the potential benefits of AI and to manage these inherently more complex learning ecosystems. Operators will also need to establish policy in new areas, e.g., requiring that publishers generate certain data during learning and store it in the right place. Today, every installation is an experiment. Eventually, we will have the products and experience needed to bring the full power of AI to education and enterprise training.

References

1. Robson, R., Barr, A.: Learning technology standards - the new awakening. In: Sottilare, R., Brawner, K., Sinatra, A., Goldberg, B. (eds.) Proceedings of the Sixth Annual GIFT Users Symposium: US Army Research Laboratory (2018). https://www.gifttutoring.org
2. Bull, S., Kay, J.: Open learner models. In: Nkambou, R., Bourdeau, J., Mizoguchi, R. (eds.) Advances in Intelligent Tutoring Systems. SCI, vol. 308, pp. 301–322. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14363-2_15
3. Credential Engine: The credential engine - moving credentialing forward (2018). http://www.credentialengine.org/. Accessed June 2018
4. Robson, R., Barr, A., Fletcher, J.D.: Universal learner profiles. To appear as an Institute for Defense Analyses Report (in preparation)
5. Downes, A.: Learning analytics dimensions: learning experience analysis (2019). https://www.watershedlrs.com/blog/learning-experience-analysis
6. ADL Initiative: The xAPI overview (2018). https://www.adlnet.gov/research/performance-tracking-analysis/experience-api/
7. LRMI: LRMI - a project of DCMI (2018). From Dublin Core Metadata Initiative: http://lrmi.dublincore.org/. Accessed June 2018
8. ADL: Total Learning Architecture (TLA) (2017). From Advanced Distributed Learning initiative: https://www.adlnet.gov/tla/. Accessed 20 Jan 2017
9. Smith, B., Gallagher, P.S., Shatz, S., Vogel-Walcutt, J.: Total learning architecture: moving into the future. In: Proceedings of the Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC) (2018)
10. DeFalco, J.: Proposed standard for metadata tagging with pedagogical identifiers. In: Sottilare, R., et al. (eds.) Proceedings of the Workshop on Standards for Adaptive Instructional Systems at the Intelligent Tutoring Systems Conference, Montreal (2018)
11. Cardinaels, K., Meire, M., Duval, E.: Automating metadata generation: the simple indexing interface. In: International World Wide Web Conference Committee (IW3C2). ACM (2005). 1-59593-046-9/05/0005
12. Simko, M., Bielikova, M.: Automated educational course metadata generation based on semantics discovery. Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technology, Slovak University of Technology (2009)
13. Schema.org: Alignment object (2018). From Schema.org: https://schema.org/AlignmentObject. Accessed June 2018
14. Herold, B.: inBloom to shut down amid growing data-privacy concerns, 24 April 2014. From Education Week: http://blogs.edweek.org/edweek/DigitalEducation/2014/04/inbloom_to_shut_down_amid_growing_data_privacy_concerns.html. Accessed June 2018
15. Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The Generalized Intelligent Framework for Tutoring (GIFT). Concept paper released as part of GIFT software documentation. U.S. Army Research Laboratory—Human Research & Engineering Directorate (ARL-HRED), Orlando, FL, USA (2012)
16. Sottilare, R., Brawner, K., Sinatra, A., Johnston, J.: An updated concept for a Generalized Intelligent Framework for Tutoring (GIFT). US Army Research Laboratory, Orlando, FL, USA (2017)
17. Robson, R., Barr, A., Sottilare, R.: Overcoming barriers to the adoption of IEEE standards. In: AI in Education (AIED), London (2018)

Standards Needed: Competency Modeling and Recommender Systems

Keith Brawner
Army Research Laboratory, Orlando, FL, USA
[email protected]

Abstract. There are a number of ongoing efforts to revise military training and generally bring it into the 21st century, including the Army Learning Model, the Synthetic Training Environment, the Sailor 2025 initiative, and other service-level training revamps. These revamps are expected to do more than past developments in content and LMS standards: tracking students, providing mappings of competencies, recommending for and against future training items, and other relatively advanced tasks. The Institute of Electrical and Electronics Engineers (IEEE) has created the Adaptive Instructional Systems standards group, which is investigating the issues faced by the next wave of learning software. This paper discusses some of the technical and social issues of moving to the new model of education.

Keywords: Standards · Recommender systems · Adaptive Instructional Systems · Competencies

1 Introduction and Background

There are several active efforts in large-scale training system acquisition in the Department of Defense community. The first of these is the Army’s priority for the Synthetic Training Environment (STE) [3], which is putting into practice the requirements from the Army Learning Model 2015 [1]. The second is the Learning Continuum And Performance Assessment (LCAPA) [4], part of the Navy’s Sailor 2025 initiative [8, 10]. Each of these efforts represents a large-scale acquisition of modern training technologies with the intention of boosting readiness, based on competencies. The Army demands of its Soldiers a broad set of competencies (e.g., many different ground operations maneuvers, tasks, and procedures), while the Navy requires a deep set of competencies (e.g., disassembly and reassembly of a reactor), but both rely upon competency assessment to determine unit and task readiness.

There is a need for the representation of competencies to be transferable at all levels: an Army Soldier moving between Units, an Army Squad moving between Sections, an Army Battalion moving between Regiments, a Navy Destroyer being assigned a new Carrier Group, a 1st class sonarman being assigned a new ship. The receiving unit should know about the abilities of the incoming group to the greatest extent possible to ensure the continuity of operations. Transfer between services should be viewed similarly. Further, those who exit service can benefit from a model of their abilities being transferred to potential employers.
For those in reserve units, it is helpful to have a model of their abilities available for military duty (e.g., a Reserve Soldier who owns a welding business should be considered for relevant positions if recalled, alongside official expertise and training).

This type of model is not only useful for Warfighters switching between groups, but also for the recommendation of training content at the various levels. As examples, consider an individual who is lacking in a single skill required for the next promotion, a unit lacking an individual with specific training (e.g., a heavy gunner or cryptographer), or the discovery of a new required skill now needed for an individual area (e.g., IED detection/disposal during the war in Iraq). These deficiencies can be targeted if relevant recommender systems can highlight the weaknesses to decision makers.

Each area may assess competency in a different manner. As an example, the 75th Ranger Regiment and Delta Force should be enabled to differ in their assessment of what it takes to be “jump qualified”; these standards are likely higher than the standards for the Airborne Soldier. At the same time, the assessment logic to determine jump qualification can and should be shared between the two areas. Additionally, what it means for an individual to be “jump qualified” should differ based on whether it is an individual assessment (“did they pass jump school?”) or an assessment of the unit (“did each member pass jump school?” and “does the unit have a jumpmaster?”), and may vary based on the organization (“have they performed at least one practice jump together?”).

These areas of competency serve as the basis for the recommendation of learning resources. As such, they require a method and standard for interchange such that multiple recommendation engines are available to service the needs of the learner, or that one recommendation engine can service multiple communities of learners. This paper presents suggestions for initial technologies to identify the opportunities for standards and interchange in association with the Adaptive Instructional Systems (AIS) IEEE group.

2 Traditional Educational Model

Generally speaking, the traditional educational model largely avoids the problems of competency modeling and interchange that the rest of this paper discusses. These are relatively new problems, brought on by new technologies. It is worth reflecting on the traditional educational model, on why the problems of competency and competency interchange have not been particularly relevant to it, and on what makes them relevant today.

Firstly, there is primary education (K-12 in the United States). Students are educated according to a relatively internally consistent model which is somewhat resistant to change by the nature of its throughput. As a specific example, consider a school which must teach single-digit addition (SDA), multi-digit addition (MDA), single-digit subtraction (SDS), and multi-digit subtraction (MDS). Two alternative curricula, SDA->MDA->SDS->MDS (Curr1) and SDA->SDS->MDA->MDS (Curr2), may be equally viable; administrators choose one or the other. Provided that students learn in the order prescribed, there is little problem. Transfer students from an alternative curriculum present an issue, but they represent less than 5% of the total volume and can be individually attended to – especially if they transfer
at the beginning or end of an instructional block. Absenteeism presents a similar problem with a similar solution. However, a change from Curr1 to Curr2 midstream is disastrous – forcing the entire cohort of students to learn MDS without SDS. At the Curr1/Curr2 level the absent and transfer students represent a relatively small portion and a relatively minimal problem. The same “spot fix” solution is applied at a larger level with the change of school zone (region, state, country, etc.) – students must be migrated among grades or remedial/basic/advanced levels of the same content. The basic solution to that problem is the implementation of nationwide (or continent-wide, as is the case in the EU) standards of knowledge per year.

When the primary education system is the State education system, instructed primarily with textbooks (or digital equivalents), the solutions of spot-fixing, advancing, or holding back are possible; this is appropriate for “Know What” knowledge [5]. The new world, however, demands knowledge workers – workers who are primarily valued based on their ability to interact with concepts and formulate solutions; “Know How” knowledge [5]. This forces individuals into taking charge of their own educational training, educational systems into inheriting piecework-educated students, and employers into receiving little proof that the employee can perform the job.

2.1 Requirements for Competencies and Recommenders

The knowledge workers of today have a mishmash of educational experiences which is poorly represented in a resume. The educational system and the employer both look at a resume which says things like “Computer Science degree, 5 years’ experience networking projects, 3 years hobbyist website developer” and have difficulty discerning whether this person can perform the “make our website have a database backend” task or needs the “Databases 101” class. Naturally, both of them can ask – but this requires an accurate self-assessment of the knowledge (or lack of it) on the part of the student/worker, and an accurate interpretation of the answer by the employer. This knowledge requirement on the part of the employer prevents moving the task to a cheaper job category. Information on what the individual knows and what the individual can perform is required in order to gain benefit for the individual, employer, and student.

3 Military Educational Model

Like the industrial educational model above, the military educational model has the ability to operate as a “top down” structure. Organizations such as the Training and Doctrine Command (TRADOC) and the Naval Education and Training Command (NETC) can dictate curriculum to the subordinate schoolhouses. The resultant educational programs, however, can be very different from each other, reflecting the different services and missions. As an example, the Navy traditionally trains for deep knowledge using an “A school”/“C school” model thoroughly complemented by time spent at sea training under more senior personnel; the 3rd Class Petty Officer nearly always has frequent interactions with a 1st Class Petty Officer. The Army has a model for training broad knowledge through training on individual skills (Land Navigation, Marksmanship,
etc.), tactical drills (Break Contact, Clear Room, etc.), and leadership training. While both organizations took a top-down dictation approach to what information and skills the individual needed to have, the resultant models are remarkably dissimilar.

Somewhat unlike the K-12 industrial schoolhouse model, however, Warfighters (Soldiers, Sailors, Airmen, Reservists, etc.) encounter the real world and gain an incredibly diverse set of experiences with mentors before being cycled back into formal education. Further complicating matters, an individual switches units, deployments, and groups with relative frequency throughout a typical military career for a variety of reasons, not the least of which includes the needs of the nation. This problem is magnified for Reservist Warfighters, who make up approximately 20% of the total Warfighter population, and do things like “own a welding business for 8 years” between official tours of duty, leaving that expertise and ability on the table for the majority of deployment groups.

3.1 Requirements for Competencies and Recommenders

On the surface, this problem seems simple to solve – the military has significant ability to prescribe training to individual Warfighters. The military can assign schooling, ensure individual school compliance, mandate daily training activities, set service standards for individual Military Occupational Specialties (MOS), and attempt to enforce a doctrine that the Warfighter gets their training credit from the military system. A significant amount of military training is currently performed in this manner – using certificates of completion and badges to authenticate training and qualification.

Naturally, the military runs afoul of the same problem as the secondary educational market. The Warfighters of today have a mishmash of educational experiences which is poorly represented in a resume. The military promotional system and educational system both look at a deployment history which says things like “Sonarman MOS, A/C school, 4 deployments with Carrier Group East Coast” and have difficulty discerning whether this person can perform the “find Russian submarines” task or needs geospecific threat refresher training. Naturally, both of them can ask – but this requires an accurate self-assessment of the knowledge (or lack of it) on the part of the Warfighter, and an accurate interpretation on the part of the receiving command. Needless to say, lives depend on the correct answer, which rightfully biases the military to over-train on any tasks critical to job performance.

4 Need for an Updated Industrial Educational Model

The current educational experience system works on the back of the “education + yearsExperience” or “schoolhouse + deploymentsServed” metrics for knowledge workers and Warfighters, respectively. This background includes no documentation of informal learning activities such as “website hobbyist” or “welding business” in the above examples. It has limited documentation of On-the-Job Training (OJT), without individual tasks assigned or completed. Generally, this model is opaque to the end user of the employees’ labor.
At the individual level, this results in lost opportunities where skills could be used or fields could be switched; the above welder-Warfighter should perhaps be reclassified as a Combat Engineer regardless of prior training, the programmer perhaps as a Full Stack Developer. The individual would be better served if his competencies and abilities were organizationally represented to the other institutions, rather than personally represented in interview forms. Further, if the individual were able to see representations of his own knowledge, or lack thereof, his training could be optimized towards the goals of the other institutions.

At the educational level, the lack of transparency of actual knowledge results in educational waste as individuals are given instruction that they do not need or are unprepared for. Students who are already trained in one subject end up repeating training because of the lack of knowledge at the educational level. As a concrete example, a retired Navy Electronics Technician (ET) has significant expertise in circuit diagnostics, but a class on circuit diagnostics is required to meet university requirements for a degree in Electrical Engineering. A Sailor with the Electrical Engineering degree may be assigned to “A”/“C” school for Electronics.

At the level of the employer, the existing educational model does not answer the basic questions of how the individual can be useful to the organization or what training they would be most suited for in order to be more useful to the organization. The organization ends up receiving individuals without knowledge of other credentials, such as recruiting someone with a degree in Electrical Engineering without the knowledge that this person has prior service as an ET. Alternatively, the employer receives an individual without a mapping of the individual’s expertise. While this model can be, and is, manually corrected for errors by Human Resources offices, it is wholly insufficient, as it does not serve the individual, the educational institution, or the employer without intervention. A better model would be to track the relevant learning interests of the individual, and to provide these to the interested parties.

5 Features of an Updated Model for Competency and Recommender Systems

If we have established that the existing system requires replacement or update in order to enhance the productivity and effectiveness of interacting organizations, the question of “in what manner?” remains. The top-down enforcement of grade-by-grade, year-by-year regional standards is reasonably effective for the problems of K-12 education, but somewhat insufficient in its execution in the other organizations with which it must interact.

5.1 Ontology of All Knowledge

One of the things that must happen in order to have a representation of a person’s knowledge is that there must be a representation of what it is that people can know. From this, the knowledge of individuals can be mapped onto the representation. There have been organizations which have tried to create an ontological mapping of all
of knowledge that a person may be inclined to possess [2]. This is possible, and even relevant, in certain domains; consider the mapping of all K-12 knowledge which was conducted in order to establish the Common Core standards. In one manner, these efforts are laudable – they represent a direct path to the goal. In small settings, it is possible to create such mappings, and it is possible in larger settings with concerted manpower. Inevitably, however, such an ontology frays at the edges – where does such an ontology begin to place things like the soft skills possessed by management? The development of new fields of knowledge? The combination of existing fields? At what grain size? What does one do with a mapping stating that an individual possesses knowledge of “Math”?

Whatever standard is created must, by its nature, be flexible enough to encapsulate both all knowledge and all possible knowledge. Statements of the knowledge of individuals must be agreed upon by the parties interested in such statements, rather than dictated from above. Standards should allow for interchange between groups dealing with similar ontologies or ontological frameworks, which also allows the individual organizations to expand, contract, or redefine a shared vocabulary as needed by business or human resources processes.

5.2 Trusted Sources

Existing models of competency and accreditation exist in the form of “trusted sources”. A high school diploma from a US state carries the weight of that state – agreements or disagreements can be undertaken at that level as to whether this is trusted, but the worth of the diploma is determined above the level of the high school. Similarly, a degree from a secondary institution carries the weight of that institution – the University of ___ or Trump University or the University of ___ – and individual organizations must decide which of these items is trusted. Similarly, certain organizations, such as the Accreditation Board for Engineering and Technology, exist as centralized authorities of quality [9]. Organizations may choose to trust the trusting authority, rather than evaluate and establish their own basis of trust of the credentials.

The new model of educational credentials must follow in this footing. As an example, consider the YouTube and Khan Academy platforms. In one context, they both may be trusted – simple knowledge that an individual has watched 30 videos on the subject of Dishwasher Repair may be sufficient for the task envisioned (repair a dishwasher). However, a higher Khan Academy standard which has coupled assessment (have they repaired a dishwasher successfully?) may be needed for a more advanced task (train someone to repair a dishwasher). Further, it is possible to blend both YouTube and Khan Academy experiences into a unified credential issued by a trust authority. To use a military example, knowledge of the Littoral Combat Ship (LCS)-class webcam-based shipwide remote monitoring system may be sufficient for a cook or Action Officer, while engineers who use the system to troubleshoot complex problems may be held to a higher standard which includes assessment. The maintainers of such a system may be held to a higher standard yet.

Different sources may be trusted at different levels for different tasks, standards, and systems for competency and recommendation. These sources of trust should be flexible enough to accommodate variations in the standards of the task. The vending of trust,
independent of the granting of credentials, must also be supported in emerging standards, as this practice has already been widely established for many educational institutions. Trusted sources can then be used as the basis for the recommendations of automated systems.
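A minimal sketch of how trust-per-task rules like those above might be represented follows; the source names, tasks, and evidence levels are hypothetical, and the two-level scale is only an assumption for illustration.

# Illustrative only: each task names the credential sources it will accept
# and the minimum evidence level required from that source.
# Source names, tasks, and levels below are hypothetical.
TRUST_REGISTRY = {
    "repair_dishwasher": {
        "video_platform_watch_history": "exposure",   # watching videos is enough
        "assessed_practical_exam": "exposure",
    },
    "train_dishwasher_repair": {
        "assessed_practical_exam": "demonstrated",    # requires coupled assessment
    },
}

EVIDENCE_LEVELS = ["exposure", "demonstrated"]        # ordered, illustrative scale

def source_accepted(task: str, source: str, evidence_level: str) -> bool:
    """Return True if evidence from `source` at `evidence_level` satisfies `task`."""
    required = TRUST_REGISTRY.get(task, {}).get(source)
    if required is None:
        return False                                   # source not trusted for this task
    return EVIDENCE_LEVELS.index(evidence_level) >= EVIDENCE_LEVELS.index(required)

print(source_accepted("train_dishwasher_repair", "video_platform_watch_history", "exposure"))  # False
print(source_accepted("repair_dishwasher", "video_platform_watch_history", "exposure"))        # True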

5.3 Custom Assessment Queries – Individual Level

The basic need at the individual level is an answer to the questions “does this person know X?” and “what evidence do I have?”. At the individual level, queries such as this are required. These naturally feed placement and recommendations based on existing knowledge. Further, systems should be flexible enough to allow for different standards for the query source – a 70% performance may be good enough for some organizations and tasks, but insufficient for others. As a concrete example, an 18-year-old Army recruit must run 2 miles in 16 min and 36 s, while an Army Ranger must complete the same task in 13 min. At the human resources level, developed standards must have the ability to discover the potential of a 12-min 2-mile run in a recruit; this individual may be a good candidate for Ranger School.
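A minimal sketch of such an individual-level query with caller-defined thresholds, using the 2-mile run example from the text; the data structure and evidence records are hypothetical.

from dataclasses import dataclass
from typing import List

@dataclass
class Evidence:
    competency: str      # e.g., "two_mile_run"
    value: float         # measured performance, here in seconds
    source: str          # who vouches for the measurement

def query_individual(evidence: List[Evidence], competency: str, threshold: float,
                     lower_is_better: bool = True):
    """Answer 'does this person meet X?' and 'what evidence do I have?' for one standard."""
    relevant = [e for e in evidence if e.competency == competency]
    if not relevant:
        return False, []
    best = min(e.value for e in relevant) if lower_is_better else max(e.value for e in relevant)
    meets = best <= threshold if lower_is_better else best >= threshold
    return meets, relevant

# Thresholds drawn from the text: 16 min 36 s for a recruit, 13 min for a Ranger.
records = [Evidence("two_mile_run", 12 * 60, "unit_fitness_test")]
print(query_individual(records, "two_mile_run", 16 * 60 + 36))  # meets the recruit standard
print(query_individual(records, "two_mile_run", 13 * 60))       # also meets the Ranger standard

The same evidence satisfies two different standards here, which is the flexibility the text calls for: the threshold belongs to the querying organization, not to the evidence.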

5.4 Custom Assessment Queries – Group Level

A group-level query is likely a collection of individual-level queries. There are multiple ways to phrase such a query. Consider the query “is the unit jump ready?”. This query has multiple component queries, which may vary among divisions and Warfighter services:

• Is everyone in the unit jump qualified? – Have they been training in parachute drills and bag packing, completed a number of jumps, etc.?
• Is everyone jump ready? – Has each individual completed a jump within the last number of months?
• Is there a jump master? – Is at least one person in the unit a certified jump master, which has its own set of standards/competencies?

Consider an answer to this query of ‘no, this unit is not jump ready’. The natural follow-up queries are ‘in what way is this unit lacking?’ and ‘how can the deficiency be fixed?’ and ‘what is the fastest way to fix the deficiency?’. Following the example, the answer may be as simple as “Individual2 needs to do a live jump” or as complex as “this unit has only one individual with 20% Jump Master Training, it is best to assign another Jump Master”. This knowledge provides information to recommender systems.
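A minimal sketch of composing individual answers into the “is the unit jump ready?” query above; the per-member records and the currency window are hypothetical. The same pattern could be nested to support the groups-of-groups roll-up discussed next.

from typing import Dict

# Hypothetical per-member records; keys and values are illustrative only.
unit = {
    "member_1": {"jump_qualified": True, "months_since_last_jump": 2, "jumpmaster": True},
    "member_2": {"jump_qualified": True, "months_since_last_jump": 7, "jumpmaster": False},
}

def unit_jump_ready(members: Dict[str, dict], currency_months: int = 6):
    """Aggregate individual-level answers into a group-level answer plus deficiencies."""
    deficiencies = []
    for name, record in members.items():
        if not record["jump_qualified"]:
            deficiencies.append(f"{name} is not jump qualified")
        if record["months_since_last_jump"] > currency_months:
            deficiencies.append(f"{name} needs a live jump")
    if not any(r["jumpmaster"] for r in members.values()):
        deficiencies.append("no certified jumpmaster in the unit")
    return len(deficiencies) == 0, deficiencies

ready, gaps = unit_jump_ready(unit)
print(ready, gaps)   # False, ['member_2 needs a live jump']

Returning the deficiency list alongside the yes/no answer is what lets a recommender system propose the fastest fix rather than only reporting a failure.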

5.5 Custom Assessment Queries – Groups-of-Groups Level

In the military, the gold standard assessment of knowledge is “readiness level”, which is provided to very senior leaders. In its highest form, “readiness level” represents an answer, for Congress and the President, to the question “if we had to go to war tomorrow, how ready for the task are we?”. It is intended to be an honest assessment of military capability. In the current manner of business, the readiness of units rolls up into
divisions, brigades, etc., into a total assessment of capability. The civilian world has an equivalent ‘readiness level’ in items such as doctor/nurse teams, oil rig workers, or software development teams. Whatever standards and recommenders exist need to support roll-up of teams-of-teams assessments, with the recommendation systems existing to provide solutions to gaps in the assessment.

6 The Need for Models of Competency and Recommender Systems

Much ink has been spilled about 21st century competencies [6], the new normal of knowledge work [11], and bridging the gaps between high school, college, military, and workforce, and within schools, colleges, military, and workforce [7]. However, these are not the problems of tomorrow; they are the problems of today. The existing system is not serving the individual, employer, or educator; it needs to change. At the core of this change is the representation of both individual and group “know what” and “know how” [5]. Making this change has significant benefit – it makes the individual more transportable across the workforce, it limits educational waste, and it helps employers to place individuals and groups in areas where they can prosper. Technology has created a problem where individuals are forced into a path of lifetime learning and educational experiences, but fortunately this is a problem which technology can also solve and optimize through the representation of interchangeable competencies and personalized educational recommendations.

References

1. ALC 2015 TRADOC Pamphlet 525-8-2, The United States Army Learning Concept for 2015. U.S. Government Printing Office (GPO), Washington, DC, November 2010
2. Béziaud, L., Allard, T., Gross-Amblard, D.: Lightweight privacy-preserving task assignment in skill-aware crowdsourcing. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10439, pp. 18–26. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64471-4_2
3. Center, U.S.A.A.S.: Synthetic Training Environment (STE) (2019)
4. Division, U.S.N.N.A.W.C.T.S.: Learning Continuum and Performance Aid (LCAPA) (2019)
5. Gonçalves, M.J.A., Rocha, Á., Cota, M.P.: Information management model for competencies and learning outcomes in an educational context. Inf. Syst. Front. 18(6), 1051–1061 (2016)
6. Greiff, S., et al.: Domain-general problem solving skills and education in the 21st century. Educ. Res. Rev. 13, 74–83 (2014)
7. Kazis, R.: Remaking Career and Technical Education for the 21st Century: What Role for High School Programs? Jobs for the Future (2005)
8. Olde, B.: Vision statement: navy career management and training of the future. Des. Recomm. Intell. Tutoring Syst. Assess. Methods 5, 47
9. Prados, J.W., Peterson, G.D., Lattuca, L.R.: Quality assurance of engineering education through accreditation: the impact of engineering criteria 2000 and its global influence. J. Eng. Educ. 94(1), 165–184 (2005)
10. Richardson, J.M.: A design for maintaining maritime superiority. Naval War Coll. Rev. 69(2), 11 (2016)
11. Sadri Mccampbell, A., Moorhead Clare, L., Howard Gitters, S.: Knowledge management: the new challenge for the 21st century. J. Knowl. Manag. 3(3), 172–179 (1999)

Measuring the Complexity of Learning Content to Enable Automated Comparison, Recommendation, and Generation

Jeremiah T. Folsom-Kovarik1, Dar-Wei Chen1, Behrooz Mostafavi1, and Keith Brawner2

1 Soar Technology, Inc., Orlando, USA
[email protected]
2 U.S. Army Combat Capabilities Development Command, Orlando, USA

Abstract. Learning content is increasingly diverse in order to meet learner needs for individual personalization, progression, and variety. Learners may encounter material through different content, which invites a measurable comparison method in order to tell when delivered content is sufficient or similar. Content recommendation and generation similarly motivate a fine-grained measure that enhances the search for just the right content or identifies where new learning content is needed to support all learners. Complexity offers a fine-grained way of measuring content which works across instructional domains and media types, potentially adding to existing qualitative and quantitative content descriptions. Reductionist complexity measures focus on quantifiable accounting which practitioners, and computers in support of practice, can use together to communicate about the complexity of learning content. In addition, holistic complexity measures incorporate contextual influences on complexity that practitioners typically reason about when they understand, choose, and personalize learning content. A combined measure of complexity uses learning objectives as a focus point to let teachers and trainers manage the scope of reductionist elements and capture holistic context factors that are likely to affect the learning content. The combined measure has been demonstrated for automated content generation. This concrete example enables an upcoming study on the expert acceptance and usability of complexity for differentiating between hundreds of generated scenarios. As the combined complexity measure is refined and tested in additional domains, it has potential to help computers reason about learning content from many sources in a unified manner that experts can understand, control, and accept.

Keywords: Training · Complexity · Recommender system · Content generation · Learning personalization · Assessment · Holistic

1 Introduction

Learning content accomplishes many tasks in modern practice: from simple presentation of material to guided practice with feedback, from formative assessment to high-stakes credentialing, and from ongoing learning for personal interest to ongoing
training on the job. Learning content may also be presented in different modes ranging through text on a page, video, interactive simulations, or performance observations situated in a real or realistic context. Interactive learning content such as training simulations can also change internally to give learners different levels of support or create variable conditions that make performance easier or harder. Intuitively, each type of learning content is likely to be well suited for different learners at different stages of learning. As a result, expert teachers and trainers can identify what specific learning content will ensure learners meet a required level of readiness, or what content will best help an individual to progress. Offering different learning content on an individual basis makes sense for reasons including personalization, progression, and variety. First, personalized content can vary the support and challenge available to individual learners so that they get the help they need to succeed on the target content (Wray et al. 2009; Wray and Woods 2013) or confront specific misconceptions (Folsom-Kovarik et al. 2018). Learning content can be presented with accommodations that cater to learning differences. The same content can also be couched in personalized terminology or contexts, in order to align with an individual learner’s past experience. Personalization motivates comparison between personalized learning content, to help ensure all learners received commensurate training or received some required minimum of training. Second, variable progression has been demonstrated in settings such as mastery learning (Bloom 1984) and others, where learners master a topic before moving on, even if additional learning content is needed. Controlling how learners progress also reduces time and resources that would be wasted on presenting content for which the learner is not yet prepared (Dunne et al. 2014). Personalization and variable progression introduce the need for recommendation, matching each learner with the optimal learning content to suit their needs. Third, variety of presentation can keep learners interested or motivated. During assessment, varying the learning content can help test learners’ ability to generalize and assess near or far transfer. Variety is also crucial for retest validity of assessments that will be presented to the same learners repeatedly over time, or in settings where learners may be expected to share information that would bias how others interact with the content. Increasing variety without a burden on authors is the goal of content generation which can automate some tasks for authors and make many more variants of learning content. Teachers, trainers, and learners all value having a range of content for reasons of personalization, progression, and variety. However, challenges exist in understanding the specific ways in which various content is similar or different. Measures are needed which provide insight beyond surface differences which may not change learning, and instead express facts about how the content can be expected to affect learners and learning. Such measures could help to answer important questions. When two learners received assessments under different conditions, were their outcomes comparable? Which types and amount of support or challenge does a particular learner need at this time? Given a history of content that a learner has experienced, what is the best next step to progress?

To answer these questions, various measures of complexity have been explored. Describing learning content with various dimensions of complexity suggests a quantifiable approach to expressing what it is that expert teachers and trainers think about whenever they choose to vary learning content. If successful, a measure of complexity would enable computational support for learning such as recommender systems that suggest what content a learner needs, metadata standards that enable reusing and exchanging learning content, and generative programs that create new learning content to fit a given need.

2 Reductionist Definitions of Complexity Offer Quantifiable Descriptions of Learning Content

As a foundation for these new capabilities, a clear definition of complexity is needed which includes an explanation of the senses in which complexity can be objective rather than subjective. Several such measures from past work offer expressions of complexity. Some examples are considered with the goal of reflecting how practitioners such as teachers and trainers could use the measure without needing theoretical or technical knowledge to understand and control complexity in learning content.

Complexity is a characteristic that can describe any learning content and has been studied in many forms in the past. For example, Piaget (e.g., 1936, 1972) theorized about how children develop from predominantly interacting with just the physical world to being able to understand relatively complex abstract concepts. Bloom et al. (1956) formally ranked learning activities in a taxonomy that could be related to complexity (e.g., inferring that knowledge or recall acts at a lower level of cognitive complexity than application, which is a lower level of complexity than evaluation).

A reductionist perspective on task complexity involves breaking a task down into its component processes and rating the complexity of each process. One first step to creating a definition for complexity is to list the possible dimensions of complexity in a given component process. Wulfeck et al. (2004) built a list of these dimensions based on previous work by Feltovich and colleagues (e.g., Feltovich et al. 2012), and that list is summarized here in Table 1. Other possible dimensions of task complexity include (Dunn 2014): number of required acts, number of information cues, number of sub-tasks, number of interdependent sub-tasks, number of possible task paths, number of criteria to satisfy a task, number of task paths that conflict or are unknown, and level of distraction. These dimensions, along with those outlined in Table 1, can form the basis for evaluating whether a task is complex in a reductionist manner. The dimensions can be related additively or through a more sophisticated calculus accounting for mediators and moderators.

Reductionist perspectives tend to favor labeling and counting dimensions of complexity. Nuance can be obtained by gauging the degree to which a task fulfills a dimension and adjusting the importance level of a given dimension. An example of a reductionist perspective on measuring training is the U.S. Army’s “Leader’s Guide to Objective Assessment of Training Proficiency”, or “Objective T” (Department of the Army 2017), in which training is not considered complex unless all four operational variables (terrain, time, military threat, and social population) are considered dynamic.
Table 1. Dimensions of task complexity (Wulfeck et al. 2004)

Abstract (vs. concrete) – A phenomenon is not physically observable and therefore must be imagined in some way (e.g., sound propagation)
Multi-variate (vs. univariate) – Any number of factors could have caused a given outcome (e.g., sound refraction in water depends on salinity, pressure, depth, and temperature of water)
Interactive (vs. separable or additive) – Outcomes are dependent on how causes interact with each other and cannot be predicted by assuming that causes act separately/additively (e.g., the effectiveness of a teaching method varies depending on the maturity of the learner – a method that works for a high-school student might not work for a first-grader, or vice versa)
Continuous (vs. discrete) – Operators must understand how the changing of a continuous variable affects the environment or task, as opposed to merely memorizing the effects of a few discrete states of the variable (e.g., understanding how the speed of a car affects its operation [continuous], as opposed to understanding how the gear selection affects its operation [park, reverse, neutral, drive, etc.])
Non-linear (vs. linear) – The relationship between a variable and outcome is something other than an easily-understood straight-line function (e.g., adding workers to a project improves efficiency up to a point, but adding workers past that point decreases efficiency due to convoluted communication and workers not having enough to do [diminishing returns])
Dynamic (vs. static) – The relationship between variables and/or outcomes changes over time (e.g., the productivity of an employee generally decreases throughout the workday)
Simultaneous (vs. sequential) – When a variable changes, the corresponding outcome change is immediate, as opposed to at a later stage (e.g., a company policy change is effective immediately, which creates more complexity than when it is effective in one month, which provides preparation time)
Conditional (vs. universal) – The relationship between a variable and outcome depends on context or the status of a boundary condition (e.g., in a war zone, attacking civilians is against the rules, unless the civilian is armed and in a threatening position)
Uncertain (vs. certain) – Operators must perform without knowing precise values of a given variable (e.g., firing an artillery projectile at an enemy stronghold with only a visual estimate of how far away they are)
Ambiguous (vs. unique) – A given outcome can be the result of many different combinations of circumstances, or a given set of circumstances can lead to many different outcomes (e.g., receiving a promotion at work could be the result of achievement in Project A, but could also have been the result of achievement in Project B)
As an alternative to task complexity, another accounting measures mental processes that must be recruited to accomplish learning. Mental processes include perception, attention, memory, and executive function. Together these mental processes drive underlying cognitive skills and are evidenced in behaviors. The complexity of content can be inferred based on the behaviors required of operators while engaging with the content. Table 2 gives a sampling of some operator behaviors that are often associated with complexity in the content, environment, or task. The more of these behaviors an operator is required to perform during learning, the more complex the learning content likely is.

Table 2. Common operator behaviors in complex environments

Anticipation (e.g., Branlat et al. 2011) – Understanding and predicting how events might occur or how relevant actors might behave (e.g., anticipating the strategy that an adversary might use based on one’s own weaknesses)
Attentional control (e.g., Schatz et al. 2012) – Directing and sustaining attention on a chosen target and being aware of one’s own attention, regardless of temptations to do something else (e.g., maintaining focus on a work assignment instead of checking social media)
Coordination (e.g., Gulati et al. 2012) – Aligning actions with others such that jointly-determined goals can be achieved (e.g., giving each team member a different section of the paper to write, depending on technical strengths and weaknesses of the team members)
Anomaly detection (e.g., Branlat et al. 2011) – Mapping environmental indicators against baseline mental models to determine whether a given event should be considered unusual (e.g., monitoring a radar screen for potential enemy combatants)
Resilience (e.g., Bonanno 2004) – Maintaining healthy levels of psychological and physical functioning in the face of disruptive events (e.g., persisting in searching for a job even after being rejected several times)
Ambiguity tolerance (e.g., Bochner 1965) – Being comfortable with uncertain elements in the environment or task (e.g., not allowing the lack of knowledge of a variable’s value to keep oneself from progressing on other parts of the task at hand)

The aforementioned elements of complexity (task complexity, required behaviors of operators) can be used, in a reductionist manner, to determine whether a given piece of content is complex. Many measures of complexity use reductionist perspectives because they feel more objective and are more easily quantifiable. However, reductionist perspectives require a theoretical understanding of cognitive processes which may or may not be sensitive to differences in competing theories or to application of theory in a practical setting. The many dimensions and their definitions can feel inaccessible for teachers, trainers, and instructional designers who need to assess and
control learning content in order to achieve a target outcome. In addition, the measures might not capture the totality of factors that create complexity in a task or environment.
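As a minimal sketch of the additive relation described in this section, the following assumes an analyst has already rated each Table 1 dimension of a piece of content on a 0–1 scale; the weights and ratings shown are placeholders, not validated values.

# Illustrative only: ratings (0-1) and weights for the Table 1 dimensions would
# come from a practitioner or analyst; the values below are placeholders.
DIMENSION_WEIGHTS = {
    "abstract": 1.0, "multivariate": 1.0, "interactive": 1.5, "continuous": 1.0,
    "nonlinear": 1.0, "dynamic": 1.5, "simultaneous": 1.0, "conditional": 1.0,
    "uncertain": 1.0, "ambiguous": 1.0,
}

def reductionist_complexity(ratings: dict, weights: dict = DIMENSION_WEIGHTS) -> float:
    """Weighted additive score; a 'more sophisticated calculus' could replace the sum."""
    return sum(weights[d] * ratings.get(d, 0.0) for d in weights)

scenario_ratings = {"dynamic": 1.0, "uncertain": 0.5, "interactive": 0.5}  # placeholder content
print(reductionist_complexity(scenario_ratings))  # 2.75

Such a score captures the counting-and-weighting character of reductionist measures, but, as the section notes, it omits the holistic and contextual factors discussed next.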

3 A Combined Measure of Complexity Reflects Expert Practitioner Understanding

When discussing how to evaluate the complexity of learning content, two views of complexity include a reductionist accounting based in mental processes (discussed previously) and a more holistic or macrocognitive accounting of complexity as a feature of sensemaking situated in context and culture. Taken together, these views, although in tension with each other, define a combined measure of complexity that aligns with practitioner understanding of learning content and thus is usable by teachers and trainers.

In contrast to reductionist perspectives, holistic perspectives on complexity emphasize less strictly defined factors that can contribute to the complexity of a task. In this accounting, factors can operate individually but complexity increases when factors collectively produce “emergence” in the task or environment. Emergence describes cumulative effects that vary widely based on small initial differences and are therefore difficult to predict just from knowing about each component process individually (Paries 2006). The relationships between the factors can give rise to unpredictable operator behaviors, which further increase the complexity of a task. The number and structure of relationships can be counted to help measure complexity.

Holistic complexity can also include contexts external to learning content such as students’ learning careers, current expertise, or learning objectives (LOs) because that information will affect how a task should be presented and how content should be described. Any cognitive model related to situational factors (“situated cognition”; Brown et al. 1989) is relevant to a holistic perspective on complexity because it considers knowledge imparted by training as inextricable from the contexts in which it is presented. Holistic complexity also accounts for ways in which learners understand the gist and deeper structural meaning of information as they progress in their mastery of a task (sensemaking; Boulton 2016).

A combined approach to measuring complexity could be informed by these existing holistic measures in considering those parts of external context which instructors already use to describe learning content, such as learning objectives. For example, a single scenario in a virtual environment might be differentially complex based on the learning objective of the user. If the scenario contains numerous and fast-changing air combat, it might be complex for the purpose of teaching air support but not at all complex for instructing infantry maneuver under a contested airspace.

Complete formal approaches to holistic perspectives are rare, but accounting for contextual factors and technical interactions between elements is often at the core of any holistic perspective. In the Army’s “Objective T” guide to training assessment (Department of the Army 2017), one way in which contextual factors are accounted for is in conditions for performance. Nighttime training is rated as more complex than daytime training due to lower visibility.
When some or all component processes of a task interact in a way that is above and beyond what could be predicted from knowing the component processes alone, such added complexity is a part of holistic complexity that might not be accounted for when using a reductionist approach. An example of interacting processes from the sports realm is dribbling up the court and shooting a basketball – the component processes of dribbling and shooting are each complex at a certain level, but the entire task is additionally complex because of the footwork required to transition between dribbling and shooting (additionally complex compared to just the summative complexities of dribbling and shooting considered separately). Although the individual complexities of component processes may be easier to quantify than holistic complexities, both types of complexity are instrumental in determining the true complexity level of learning content. The reductionist-perspective dimensions and behaviors (Sect. 2) provide some relatively objective and concrete criteria for complexity. Holistic complexity characteristics might be difficult for practitioners to precisely measure, but a combined measure of complexity does not necessarily require precise measurement to show improvement over reductionist or holistic perspectives alone. From the perspective of reproducing and reinforcing expert teaching and training practices, a computer-accessible measure of complexity should strive to account for both types of complexity. Expert teachers and trainers incorporate both perspectives when evaluating learning content. Instructors may present component processes separately at first so that emergent complexities can be taught separately as well, or they may start with an overview and explore separate factors in detail later. Both approaches are chosen with intention and contribute to managing complexity in the training. One manner of combining reductive and holistic complexity measures recognizes the key role played by domain knowledge in determining context. The example presented below captures domain knowledge about what is important to learning and what is only a surface feature in the form of learning objectives. Some variations in training scenarios could change complexity, but other variations might not produce relevant changes in complexity. The key insight is that the resulting complexity might be influenced by countable factors such as the number of cues or distractors, but only a domain expert can determine what the cues or distractors are. The factors an expert identifies are likely to vary by instructional domain, population, and expected level. Therefore, a combined measure is needed which is able to quantify those complexity factors that a practitioner highlights, differentiate what learning objectives they impact, and express broad strokes or fine-grained detail in a manner that is robust to human imprecision. Measuring complexity in a combined fashion ensures that the measure is useful and understandable to practitioners. Furthermore, given that human tutors are considered by many to be a “gold standard” of instructional effectiveness (e.g., Graesser et al. 2001), strategies used by human tutors are worthy of emulation in intelligent training systems. The possibilities for comparing, recommending, and generating content would expand significantly with the ability to perform automated assessment of task complexity in a way that combines reductionist and holistic perspectives. 
An example is the scenario generation algorithm described in the following section.
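A minimal sketch of how a combined measure might be keyed to learning objectives, so that the same scenario scores differently per objective (as in the air-support versus infantry-maneuver example above); the objective names, factor counts, and context multipliers are hypothetical.

from collections import defaultdict

# Hypothetical scenario description: counts of expert-identified cues and distractors,
# recorded per learning objective (LO) rather than for the scenario as a whole.
scenario_factors = {
    "employ_close_air_support": {"cues": 6, "distractors": 4, "context_modifier": 1.5},
    "maneuver_under_contested_airspace": {"cues": 2, "distractors": 1, "context_modifier": 1.0},
}

def complexity_by_lo(factors: dict) -> dict:
    """Combine countable (reductionist) factors with a holistic context multiplier per LO."""
    scores = defaultdict(float)
    for lo, f in factors.items():
        scores[lo] = (f["cues"] + f["distractors"]) * f["context_modifier"]
    return dict(scores)

print(complexity_by_lo(scenario_factors))
# {'employ_close_air_support': 15.0, 'maneuver_under_contested_airspace': 3.0}

Keying the scores to the LO is what lets a domain expert decide which scenario features count as cues or distractors for each objective, while the arithmetic itself stays simple and inspectable.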

4 Measuring Complexity Improves Comparison and Recommendation of Learning Content

Computer support for humans in teaching, training, and designing instruction is common today. Three areas in which a complexity measure can contribute to learning impact include comparison across learning opportunities, recommending how to learn most effectively or efficiently, and generating new variations on learning content.

Modern teaching and training encompass much more than formal or classroom learning. Learners may seek out a how-to video online at a moment’s whim of personal interest, or may encounter a new technology to learn on the job even after years of building expertise. Comparing learning across all these formal, nonformal, and informal opportunities is driving a recent renaissance of interest in standards that describe learning. Representative examples of existing standards that describe learning include the IEEE standard for Learning Object Metadata (LOM; IEEE 2018) and the Shareable Content Object Reference Model (SCORM; ADL Net 2018a), among others. The goal of these standards at a high level is to let more than one computer system reason about a unit of learning content. Authors of these and other standards defined fields to describe learning content both qualitatively and quantitatively. For example, existing standards can express language, grade level, media type, and expected duration. These help to understand what learning content will be like before a learner attempts it, and to communicate what the learner did after the content is completed. As a result, it becomes possible to search for relevant content and to piece together content into a sequence or program of instruction.

It should be noted that in the field of learning, an inherently human enterprise, these standards necessarily simplify the full understanding of an expert instructor into categorical descriptions that combine similarities and blur subtle details. Their goal is not to express every possible idea, but to express enough information for shared understanding. In the same way, some values defined in existing standards require subjective determination or allow for disparate definitions. For example, the grade level assigned to content might differ between countries or, in the U.S., between states. Categories that can be described in words but not in math, such as the interactive multimedia instruction (IMI) level, also have fuzzy boundaries but are widely used and useful. All these examples suggest that a definition of complexity can be useful in describing learning content without necessarily being mathematically precise or agreed by all parties, as long as the definition provides enough agreement to add detail to the existing descriptions.

The need for comparing learning content has recently increased beyond what could have been envisioned in the early days of standards development. Future learning ecosystems (Raybourn et al. 2017) are being created that work to unify a learning experience in an entire career-long or lifelong learning, deployed across the boundaries of separate computer systems from many different vendors. New standards such as Experience API (xAPI; ADL Net 2018b) are emerging to share much more detailed information about how people learn as it occurs, from second to second. The Experience API is part of a movement to acknowledge and act on the subtle differences in
how individuals interact with learning content. Clearly, content that has the same grade level and IMI level can still vary widely in the experience it presents and the impact it can have on learning. To prepare for this future requirement, most existing standards provide methods for extension that would be compatible with one or more complexity measures. Along with other fine-grained metadata, a complexity measure is likely to express needed information about learning content that will support finding, using, and interpreting the learner experience in a future learning ecosystem. If the capability to objectively measure complexity is developed, such a capability will improve the methods computers have available to recommend, or support experts recommending, content based on learner needs. Learning content that covers different topics or different levels can still progress through well-known methods. When learning content is available that has substantially similar topic and level, then recommender systems can use complexity as another characteristic to differentiate learning content options. With the addition of information about individual learners, recommenders will be able to use complexity to predict the subjective and apparent difficulty of a piece of content, estimating how likely a learner is to succeed or how much effort will be required. These predictions will be useful in recommending content that is located within a learner’s “zone of tolerable problematicity,” the range of task difficulty that a learner is willing to engage with because the task is neither too complex nor too simple (Elshout 1985; Snow 1989). Instructional order of content, enabling progression, is one area that would improve substantially. For example, a common instructional strategy is scaffolding and fading, which refer to the gradual withdrawal of learner support over time so that a learner who initially needs the support to perform the task eventually learns how to perform the task without support (Vygotsky 1978). Scaffolded tasks possess support structures that are associated with decreased complexity; a scaffolded task might ask the learner to consider the effects of just one variable (as opposed to multiple), or require the learner to only consider his or her actions (as opposed to coordinating with other actors). Measuring these types of complexity therefore would facilitate the ordering of tasks such that relatively simple scaffolded tasks can gradually give way to relatively complex unscaffolded tasks. For fine-grained recommendations, tasks could be categorized not just as complex in a general sense, but complex in particular competencies (and not other competencies). For example, some concepts in physics are complex phenomena to understand, but the underlying mathematics are actually quite simple (e.g., do not require more than basic algebra). This would enable recommending different versions of content that contain support only for the learning objectives that are not familiar to an individual learner. The inherent dependencies and interactive nature of complex content inform recommendations about content presentation strategies as well, such as presenting new information in various and full contexts (Feltovich et al. 1988). Learners must be provided with many different contexts for information so that they do not overgeneralize from incidental features or happenstance. 
Relatedly, content should not be oversimplified – not only is overgeneralization a risk with oversimplified content, but learners can also create false senses of security in how well they grasp information


(Hoffman et al. 2013), leading to overconfidence that can hinder learning in the future. When new information is presented, it will often affect entire mental models that learners have constructed, due to the dependencies and interactive nature of complex content (Klein and Baxter 2006). Therefore, for complex learning content, it is of even more importance for all foundational information to be presented early on (even at the risk of overwhelming the learner initially), and for new information to not disrupt that foundation. Measuring the complexity of learning content enables informed recommendations regarding both what learning content to present and how to present it.
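As a rough illustration of how such recommendations could be computed, the sketch below filters a content library by a single combined complexity score and a learner-ability estimate, keeping only items inside a tolerance band around the learner's level. The score scale, tolerance band, field names, and item descriptions are illustrative assumptions, not part of any existing standard or system.

```python
from dataclasses import dataclass

@dataclass
class ContentItem:
    title: str
    topic: str
    complexity: float  # hypothetical combined complexity score in [0, 1]

def recommend(items, topic, learner_ability, tolerance=0.15):
    """Return items whose complexity falls inside the learner's
    'zone of tolerable problematicity': close enough to the learner's
    estimated ability to be neither trivial nor overwhelming."""
    candidates = [
        item for item in items
        if item.topic == topic
        and abs(item.complexity - learner_ability) <= tolerance
    ]
    # Prefer items slightly above the learner's current level (a "desirable difficulty").
    return sorted(candidates, key=lambda item: learner_ability - item.complexity)

library = [
    ContentItem("UAS route planning, scaffolded", "uas_planning", 0.35),
    ContentItem("UAS route planning, unscaffolded", "uas_planning", 0.60),
    ContentItem("UAS handoff coordination", "uas_planning", 0.80),
]
print(recommend(library, "uas_planning", learner_ability=0.5))
```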

5 Operationalizing an Example Complexity Measure to Enable Learning Content Generation With complexity as a measured characteristic of learning content, automation for generating content will become more viable. It will be possible, for example, to generate several versions of content covering the same concepts with varying levels of complexity. An example of a combined complexity measure was demonstrated in a system for generating variants on a training scenario. A sophisticated training scenario offers several examples of the combined complexity measure as applied to different aspects of the scenario learning content. As a result of operationalizing the combined complexity measure in this training scenario, much more content can be generated and labeled in order to improve automated recommendation and support instructors who use the training. For the purposes of this example, training was selected that targets U.S. Army small-unit employment of unmanned air systems (UASs). At the Army Squad or Platoon level, infantry units operate hand-launched UASs with a wingspan approximately 1–2 m and a flight time of 1–2 h. A small UAS is useful for reconnaissance and surveillance tasks within the immediate area of operations. However, a key need is for small unit leaders to understand proper utilization of their UAS assets. These learners need training to plan, prepare, and execute UAS missions employing proper tactics, following required procedures, and coordinating with other units. Training was created for small unit leaders consisting of initial and final assessments, introductory and remedial text documents, and two adaptive training scenarios. Out of this content, the scenarios are the focus of the research and development. The UAS training scenarios use instructional principles that are relevant to many typical automated and instructor-led training settings. Learners’ decisions can trigger immediate feedback, change scenario events, and possibly end a scenario prematurely (followed by remediation and a later attempt). The training can be delivered through the Army’s Generalized Intelligent Framework for Tutoring (GIFT), a computer system that helps adapt training in a way that can be reused in different instructional domains and is not specific to any one training system (Sottilare et al. 2012; Sottilare et al. 2017). Some barriers to adaptive training exist in the operational setting. First, learners can have a wide range of learning needs at the start of training. Some may be more expert while others are novices. Furthermore, the training contains 48 learning objectives as


defined by subject matter experts (SMEs). Based on their past experience, learners sometimes need support only in some learning objectives while having previously mastered other learning objectives. Finally, learners need to progress through training at different rates rather than wasting time on content that is too basic or moving into advanced content when unprepared. To help address these barriers, complexity measures can help describe content in a way that enables recommendation algorithms to predict how scenarios will combine support or challenge for different learning objectives and pick the best match to learner needs. Whether delivered through GIFT or under instructor control, adaptive training can only recommend the content that most nearly fits learner needs within the range of learning content that exists. When choices are too few, adaptation may not find a perfect fit or might settle for a suboptimal choice. Furthermore, learners can memorize scenarios that are delivered more than once, or can share information to increase performance while actually avoiding a deep understanding of the target material. To increase the range of scenarios available to choose from, content generation tasks should be automated. The complexity measures enable content generation algorithms that produce varied scenarios because they help the algorithm determine how each variation fits into the library of all existing content, and whether it is similar to other content or offers a new combination of support and challenge. Recommendation in GIFT can occur between scenarios or, for quick response to learner needs, during a scenario in response to learner performance. Detailed descriptions of the technical basis that lets GIFT deliver adaptive training in any instructional domain are available in Sottilare et al. (2013). A key concept is that GIFT encodes instructional strategies which inform the selection of instructional tactics. Instructional strategies are general and work across instructional domains, such as choosing to support or challenge a particular learning objective. Instructional tactics are specific to an instructional domain, such as presenting a unit in a different location for the UAS training domain. In parallel, the content generation example aims to give GIFT scenarios that offer every combination of support or challenge across learning objectives. When more combinations are available, GIFT is better able to select and execute its instructional strategy. The manner in which each generated scenario delivers support or challenge is defined in domain-specific rules. These rules capture expert knowledge about what makes content more or less complex. They are structured in the same manner as the domain conditions which help GIFT assess learners and choose between instructional tactics. Multiple past approaches to authoring content variation exist. Some examples are content templates, learner cognitive models or simulated students, and simple enumeration or random changes of content. Content templates are useful in, for example, the Cognitive Tutor Authoring Tools (Aleven et al. 2006). Templates make it possible for a practitioner to create content with variables which have ranges of possible values chosen to provide similar complexity. For example, a math tutor can generate infinite addition problems. 
If the number of digits being added changes the complexity of the problems, or the presence of specific addends that require carrying between columns is important, then the templates must expressly contain those limitations. The reasoning behind the range


limits is also implicit, not available for computers to reason about in comparing or progressing between content generated from different templates. Learner cognitive models use a cognitive theory basis to predict how learners will interact with content. One widely used example is GOMS (John and Kieras 1996) and, more recently, SimStudent (MacLellan et al. 2014) which models learning and novice errors as a focus. SimStudent is a principled extension of CTAT and can work with GIFT. An important consideration that applies to the cognitive model approach in general is that models with few factors are limited in predictive power, while sophisticated models can present usability and acceptance challenges to non-technical practitioners. It would seem that a model is needed that captures only factors practitioners want to reason about, and at a functional level rather than a cognitive processes level. Finally, random or enumerative changes do not provide a way to capture what variation is important to teaching or training. It does not help to produce all possible locations of a hostile unit in a training simulation when most locations happen to be the same in terms of challenge to the learner. Instead, a method is needed to generate the unit in those few locations where complexity is affected. For example, if the unit is located in a few locations with tree cover, then that might increase the scenario’s challenge of one learning objective, locating and surveilling a hostile force. The example scenario generation method uses domain rules to express as many or few factors as practitioners care to use in describing the complexity of scenarios. Each rule is considered separately and no calculus for adding or multiplying separate factors is needed, although future work may explore such functions. Instead, the combined complexity measure captures emergent behavior by directly describing its observable effects within the scenario. The example of the tree cover illustrates how a small change in location, just out from under a tree, can greatly impact complexity. This method illustrates an example in which experts are empowered to capture exactly those factors or emergent behaviors which they consider important to teaching and training. Once domain-specific rules are in place, the resulting complexity measures give a straightforward and domain-general way to describe and compare different scenarios. The outputs of the domain-specific rules are domain-general values which the GIFT architecture can reason about because they only express support or challenge values as dimensionless quantities. As a result, GIFT is able in real time to select a particular scenario which might support several learning objectives and challenge several others. The example complexity measures are useful in two contexts. First, they provide a high-dimensional search space in which to apply a novelty search algorithm. Novelty search is used to quickly generate new content based on the criteria of maximizing novelty or difference from examples that have been seen before (Lehman and Stanley 2011). Crucially, novelty is not defined at any surface level which could be easily enumerated, such as locations of units. Instead novelty is defined as providing different combinations of high, medium, and low complexity in each of the many instructordefined dimensions (Folsom-Kovarik and Brawner 2018). This ensures that the generated scenarios give GIFT a wide selection of choices that are different in instructionally relevant ways. 
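A minimal sketch of this use of novelty search follows. It treats each generated variant purely as a vector of instructor-defined complexity levels and keeps a variant only if it is sufficiently different from the variants already archived. The dimension names, the low/medium/high coding, the thresholds, and the random variant generator are illustrative assumptions; a real generator would produce scenario content governed by the domain-specific rules rather than random vectors.

```python
import random

DIMENSIONS = ["cues", "distractors", "coordination", "time_pressure"]  # hypothetical dimensions

def random_variant():
    # Stand-in for a real scenario generator plus domain-specific complexity rules.
    return {d: random.choice([0, 1, 2]) for d in DIMENSIONS}  # 0 = low, 1 = medium, 2 = high

def distance(a, b):
    return sum(abs(a[d] - b[d]) for d in DIMENSIONS)

def novelty(candidate, archive, k=3):
    """Average distance to the k nearest archived variants (infinite if none exist)."""
    if not archive:
        return float("inf")
    nearest = sorted(distance(candidate, kept) for kept in archive)[:k]
    return sum(nearest) / len(nearest)

def novelty_search(generations=200, threshold=2.0):
    archive = []
    for _ in range(generations):
        candidate = random_variant()
        if novelty(candidate, archive) >= threshold:
            archive.append(candidate)  # keep variants that are instructionally different
    return archive

print(len(novelty_search()), "distinct complexity profiles kept")
```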
A second context in which the example complexity measures play a role is in describing learning content to teachers, trainers, and instructional designers. User


interfaces are being designed which summarize the complexity measures and allow drill-down, content previews, and paradata or usage pattern collection. These features are hypothesized to enable practitioners to search and select from thousands of training scenarios based on criteria they find important for the current learning need. Another key feature of practitioner interaction is requesting new content when the available scenarios do not meet the need. The complexity measures give an easy way to specify what gap the new variants should fill. A final consideration for future work is the most useful and usable manner in which practitioners can change or control the complexity scores assigned to content. It seems likely that letting instructors assign complexity scores directly will introduce challenges with generalizing from individual inputs, and with justifying changed values to other instructors in a shared system. Therefore, a proposed direction for future work is to allow non-technical teachers, trainers, and instructional designers to edit and select the rules that assign complexity scores. Currently, GIFT offers a capability to edit domainspecific rules for learner assessment. One important difference is that assessments have binary outcomes (pass or fail), while currently the complexity rules are continuous valued functions which are likely to be less user-friendly. One approach to address this concern would be to present rules in the authoring interface via low, medium, and high thresholds which might be easier to define correctly. In future work, a data collection opportunity is upcoming to help determine which presentations of complexity align best with expert instructors’ needs. The study will explore to what extent, and under what conditions, the instructors accept complexity as a tool to help differentiate and search among many training scenarios.
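For the thresholding idea just mentioned, a sketch of how continuous rule outputs might be banded into low, medium, and high for authoring purposes is shown below; the cut points, rule output range, and learning-objective names are illustrative assumptions.

```python
def band(value, low_cut=0.33, high_cut=0.66):
    """Map a continuous rule output in [0, 1] to an authoring-friendly label."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"

# Hypothetical domain-rule outputs for one scenario variant, keyed by learning objective.
rule_outputs = {"locate_hostile_force": 0.72, "report_procedures": 0.25}
print({objective: band(score) for objective, score in rule_outputs.items()})
```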

6 Conclusions Adding complexity as a measure to describe learning content will enable more detailed comparison, recommendation, and generation by making computers better able to communicate and reason about an important kind of difference and differentiator in learning content. Expert teachers, trainers, and instructional designers may be able to use a complexity measure to both express their understanding of learning content in more depth and also remove unimportant surface details from consideration. The complexity measure is being explored in a real-world training domain and has enabled computer generation of many training scenarios for fine-grained recommendation to meet learners’ individual needs. Immediate next steps will include validating usefulness and usability of the complexity measure in the context of a scenario generation example as presented to expert instructors in the given domain. This study will help to identify effective ways to present instructors with learning content options and to compare the options using their complexity. The research question in focus for this study is to identify the manner and conditions under which teachers and trainers accept and prefer to use complexity in describing or choosing learning content. Another key research question which must also be answered in relation to the combined complexity measure is agreement between teachers and trainers about complexity. Agreement is likely to be improved using domain-specific rules such as


those described. One statistic to help answer the question of practitioner agreement will be capturing statistics such as inter-rater reliability. In addition, the ability to measure complexity provides opportunities for future work in improving training methods. Are there differential effects of various complexity dimensions on training effectiveness? For example, are numbers of cues and distractors linearly additive or do they themselves interact in more interesting ways? Are there multiple interaction functions, which might change how measures interact depending on expertise or other context? Measuring combined complexity is an important first step, but knowing each complexity dimension’s effects on training would enable learning content descriptions that are more effective without requiring additional effort from instructors. When faced with a given contributing factor or dimension of complexity, do learners tend to use certain heuristics or biases, or are there certain types of mistakes that are common? If so, then measuring complexity and particular dimensions of complexity enables recommendations of learning content that can combat the mistakes commonly associated with high measures in a certain dimension of complexity. Initial learning content generally needs to be presented for novices in a manner that is not overwhelming. However, oversimplification in training can create misconceptions and hinder learning transfer in real-world environments. Can a task be simplified sufficiently for novices while maintaining its essential complexities? A hypothesis from Feltovich et al. (1988) is that novices who are exposed to full complexity at the start might not achieve as much right away and might be less satisfied early on in the learning process, but might also possess greater “horizons of understanding.” This hypothesis would be consistent with the “desirable difficulty” hypothesis, which states that there exists an optimal amount of initial learning difficulty that produces long-term gains (Bjork 2013). In conclusion, the measure of complexity that reflects how practitioners think about learning content offers a valuable perspective and an opportunity to improve how computers can support teachers, trainers, and instructional designers to maximize learning.

References

ADL Net - SCORM 2004, 4th edn. https://adlnet.gov/adl-research/scorm/scorm-2004-4th-edition/. Accessed 1 Jan 2018
ADL Net - xAPI. https://www.adlnet.gov/xapi/. Accessed 1 Jan 2018
Aleven, V., McLaren, B.M., Sewall, J., Koedinger, K.R.: The cognitive tutor authoring tools (CTAT): preliminary evaluation of efficiency gains. In: Ikeda, M., Ashley, K.D., Chan, T.-W. (eds.) ITS 2006. LNCS, vol. 4053, pp. 61–70. Springer, Heidelberg (2006). https://doi.org/10.1007/11774303_7
Bjork, R.A.: Desirable difficulties perspective on learning. In: Pashler, H. (ed.) Encyclopedia of the Mind. Sage Reference, Thousand Oaks (2013)
Bloom, B.S.: The 2 sigma problem: the search for methods of group instruction as effective as one-to-one tutoring. Educ. Res. 13(6), 4–16 (1984)


Bloom, B.S., Engelhart, M.D., Furst, E.J., Hill, W.H., Krathwohl, D.R.: Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I: Cognitive Domain. David McKay Company, New York (1956)
Bochner, S.: Defining intolerance of ambiguity. Psychol. Rec. 15(3), 393–400 (1965)
Bonanno, G.A.: Loss, trauma, and human resilience: have we underestimated the human capacity to thrive after extremely aversive events? Am. Psychol. 59(1), 20 (2004)
Boulton, L.: Adaptive flexibility: examining the role of expertise in the decision making of authorized firearms officers during armed confrontation. J. Cogn. Eng. Decis. Mak. 10(3), 291–308 (2016)
Branlat, M., Morison, A., Woods, D.D.: Challenges in managing uncertainty during cyber events: lessons from the staged-world study of a large-scale adversarial cyber security exercise. In: Human Systems Integration Symposium, Vienna, VA, 25–27 October 2011
Brown, J.S., Collins, A., Duguid, P.: Situated cognition and the culture of learning. Educ. Res. 18(1), 32–42 (1989)
Department of the Army: Leader's Guide to Objective Assessment of Training Proficiency (2017)
Dunne, R.: Objectively Defining Scenario Complexity: Towards Automated, Adaptive Scenario-Based Training. (Unpublished doctoral dissertation). University of Central Florida, Orlando, FL (2014)
Dunne, R., Cooley, T., Gordon, S.: Proficiency evaluation and cost-avoidance proof of concept M1A1 study results. Paper Presented at the Interservice/Industry Training, Simulation & Education Conference (I/ITSEC), Orlando, FL (2014)
Elshout, J.J.: Problem solving and education. Paper Presented at the First Conference of the European Association for Research on Learning and Instruction, Leuven, Belgium (1985)
Feltovich, P.J., Spiro, P.J., Coulson, R.L.: The nature of conceptual understanding in biomedicine: The deep structure of complex ideas and the development of misconceptions (Technical report No. 440). University of Illinois at Urbana-Champaign: Center for the Study of Reading (1988)
Feltovich, P.J., Spiro, R.J., Coulson, R.L.: Learning, teaching, and testing for complex conceptual understanding. In: Frederiksen, N., Mislevey, R.J., Bejar, I.J. (eds.) Test theory for a new generation of tests, pp. 181–218. Routledge, Hillsdale, NJ (2012)
Folsom-Kovarik, J.T., Boyce, M.W., Thomson, R.H.: Perceptual-cognitive training improves cross-cultural communication in a cadet population. Paper Presented at the 6th Annual GIFT Users Symposium, Orlando, FL (2018)
Folsom-Kovarik, J.T., Brawner, K.: Automating variation in training content for domain-general pedagogical tailoring. Paper Presented at the 6th Annual GIFT Users Symposium, Orlando, FL (2018)
Graesser, A.C., VanLehn, K., Rose, C.P., Jordan, P., Harter, D.: Intelligent tutoring systems with conversational dialogue. AI Mag. 22(4), 39–41 (2001)
Gulati, R., Wohlgezogen, F., Zhelyazkov, P.: The two facets of collaboration: cooperation and coordination in strategic alliances. Acad. Manag. Ann. 6(1), 531–583 (2012)
Hoffman, R.R., Ward, P., Feltovich, P.J., DiBello, L., Fiore, S.M., Andrews, D.H.: Accelerated Expertise: Training for High Proficiency in a Complex World. Psychology Press, New York (2013)
IEEE 1484.12.1-2002 - IEEE Standard for Learning Object Metadata. https://standards.ieee.org/findstds/standard/1484.12.1-2002.html. Accessed 1 Jan 2018
John, B.E., Kieras, D.E.: The GOMS family of user interface analysis techniques: comparison and contrast. ACM Trans. Comput. Hum. Interact. (TOCHI) 3(4), 320–351 (1996)


Klein, G., Baxter, H.C.: Cognitive transformation theory: contrasting cognitive and behavioral learning. In: Interservice/Industry Training Systems and Education Conference, Orlando, Florida, December 2006
Lehman, J., Stanley, K.O.: Novelty search and the problem with objectives. In: Riolo, R., Vladislavleva, E., Moore, J. (eds.) Genetic Programming Theory and Practice IX, Genetic and Evolutionary Computation, pp. 37–56. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-1770-5_3
MacLellan, C.J., Koedinger, K.R., Matsuda, N.: Authoring tutors with SimStudent: an evaluation of efficiency and model quality. Paper Presented at the 12th International Conference on Intelligent Tutoring Systems, Honolulu, HI (2014)
Paries, J.: Complexity, emergence, resilience. In: Hollnagel, E., Woods, D.D., Leveson, N. (eds.) Resilience Engineering: Concepts and Precepts, pp. 43–53. Ashgate, Burlington (2006)
Piaget, J.: Origins of Intelligence in the Child. Routledge & Kegan Paul, London (1936)
Piaget, J.: The Psychology of the Child. Basic Books, New York (1972)
Raybourn, E.M., Schatz, S., Vogel-Walcutt, J., Vierling, K.: At the tipping point: learning science and technology as key strategic enablers for the future of defense and security. Paper Presented at the Interservice/Industry Training Simulation and Education Conference (I/ITSEC), Orlando, FL (2017)
Schatz, S., Bartlett, K., Burley, N., Dixon, D., Knarr, K., Gannon, K.: Making good instructors great: USMC cognitive readiness and instructor professionalization initiatives (Technical report No. 12185). Marine Corps Training and Education Command, Quantico, VA (2012)
Snow, R.E.: Aptitude-Treatment Interaction as a framework for research on individual differences in learning. In: Ackerman, P., Sternberg, R.J., Glaser, R. (eds.) Learning and Individual Differences. W.H. Freeman, New York (1989)
Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The generalized intelligent framework for tutoring (GIFT). US Army Research Laboratory Human Research & Engineering Directorate, Orlando, FL (2012)
Sottilare, R.A., Brawner, K.W., Sinatra, A.M., Johnston, J.H.: An updated concept for a Generalized Intelligent Framework for Tutoring (GIFT). US Army Research Laboratory, Orlando, FL (2017)
Sottilare, R., Graesser, A., Hu, X., Holden, H. (eds.): Design Recommendations for Intelligent Tutoring Systems. Learner Modeling, vol. 1. U.S. Army Research Laboratory, Orlando, FL (2013). ISBN 978-0-9893923-0-3. https://gifttutoring.org/documents/42
Vygotsky, L.S.: Mind and Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge (1978)
Wray, R.E., Lane, H.C., Stensrud, B., Core, M., Hamel, L., Forbell, E.: Pedagogical experience manipulation for cultural learning. Paper Presented at the Conference on Artificial Intelligence in Education, Brighton, England (2009)
Wray, R.E., Woods, A.: A cognitive systems approach to tailoring learner practice. Paper Presented at the 2nd Advances in Cognitive Systems Conference, Baltimore, MD (2013)
Wulfeck, W.H., Wetzel-Smith, S.K., Dickieson, J.L.: Interactive multisensor analysis training. Paper Presented at the RTO HFM Symposium on Advanced Technologies for Military Training. Space and Naval Warfare Systems Center, Genoa, Italy (2004)

Capturing AIS Behavior Using xAPI-like Statements

Xiangen Hu1,2(&), Zhiqiang Cai1, Andrew J. Hampton1, Jody L. Cockroft1, Arthur C. Graesser1, Cameron Copland3, and Jeremiah T. Folsom-Kovarik3

1 University of Memphis, Memphis, TN 38104, USA
[email protected]
2 Central China Normal University, Wuhan 430072, Hubei, China
3 SoarTech, Orlando, FL 32836, USA

Abstract. In this paper, we consider a minimalistic and behavioristic view of AIS to enable a standardizable mapping of both the behavior of the system and of the learner. In this model, the learners interact with the learning resources in a given learning environment following preset steps of learning processes. From this foundation, we make several subsequent arguments. (1) All intelligent digital resources, such as intelligent tutoring systems (ITS), need to be well-documented with a standardized metadata scheme. We propose a learning science extension of IEEE Learning Object Metadata (LOM); specifically, we need to consider the cognitive learning principles that were used in creating the intelligent digital resources. (2) We need to consider the AIS as a whole when we record system behavior. Specifically, we need to record all four components delineated above (the learners, the resources, the environments, and the processes). We point to selected learning principles from the literature as examples for implementing this approach. We concretize the approach using AutoTutor, a conversation-based ITS, as a typical intelligent digital resource.

Keywords: Experience API · Intelligent tutoring systems · Learning object metadata · Learning principles



1 Introduction

xAPI (Experience API) [1] was initially established as a generic framework for storing human learners' activities in both formal and informal learning environments. Each xAPI statement contains elements that record a learner's behavior, answering Who, Did, What, Where, Result, and When questions. For example, the current xAPI specification for Who is designed to uniquely identify the human learner (with tag name actor) by an identification such as an email address. In addition, the collection of verbs for Did is also entirely for human learner actions (with tag name verb), such as attempted, learned, completed, answered, etc. (see a list of xAPI verbs [2]). The xAPI specifications for actor and verb are appropriate and work well for traditional e-learning or distributed learning environments where the learning content and resources are mostly static, with limited types of interactions. When considering adaptive


instructional systems (AIS) such as intelligent tutoring systems (ITS), the existing specification for actor and verb may need to be extended. In fact, all existing ITS implementations use a computer to mimic a human tutor’s interactions with a human learner. To fully capture mixed-initiative interaction between the human learner and anthropomorphic computer tutors, we propose to establish an xAPI profile for AIS where behaviors for both the human learners and anthropomorphic AIS cast members are captured in the same learning record store. More importantly, the xAPI profile for AIS provides a standardized description of the behavior of anthropomorphic AIS. In this paper, we will demonstrate the feasibility of such an xAPI profile for AIS by analyzing the AutoTutor Conversation Engine (ACE) for a conversation-based AIS called AutoTutor [3]. We will then extrapolate from this case to propose a general guideline for creating xAPI profiles for AIS. We argue that creating this kind of xAPI profile specifically for AIS constitutes a concrete and appropriate step toward establishing standards in AIS.
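For reference, a conventional learner-centered xAPI statement of the kind described above can be sketched as follows. The learner identity, activity identifier, and timestamp are invented for illustration; the actor/verb/object structure and the "answered" verb URI follow published xAPI conventions, but the sketch is not drawn from any particular system.

```python
import json

# A minimal, hypothetical xAPI statement in which the human learner is the actor.
statement = {
    "actor": {
        "objectType": "Agent",
        "name": "John Doe",                      # invented learner
        "mbox": "mailto:john.doe@example.org",   # xAPI identifies people by e-mail (mbox), among other identifiers
    },
    "verb": {
        "id": "http://adlnet.gov/expapi/verbs/answered",
        "display": {"en-US": "answered"},
    },
    "object": {
        "id": "http://example.org/activities/main-question-1",  # invented activity id
        "definition": {"name": {"en-US": "Main Question 1"}},
    },
    "timestamp": "2019-02-09T10:15:00Z",
}
print(json.dumps(statement, indent=2))
```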

2 Adaptive Instruction System Taken from the context of e-learning, the following definition of adaptivity fits well in the context of AIS: “Adaptation in the context of e-learning is about creating a learner experience that purposely adjusts to various conditions (e.g. personal characteristics and interests, instructional design knowledge, the learner interactions, the outcome of the actual learning processes, the available content, the similarity with peers) over a period of time with the intention of increasing success for some pre-defined criteria (e.g. effectiveness of e-learning: score, time, economical costs, user involvement and satisfaction).” Peter Van Rosmalen et al., 2006 [4]

In this definition, if “e-learning” is replaced with “learning”, we are looking at the entire educational system. In fact, the way we create curricula for different student populations (such as age and grades), build schools and learning communities, train teachers, and implement educational technologies all deliberately build toward having adaptive learning environments for students. When broadly defined, the concept and framework of AIS dates back over 40 years [5] and has been studied by generations of learning scientists [6–8].

3 A Four-Component Model of AIS For the purpose of this paper, we take a minimalistic, behavioristic view of AIS that contains only four components in the following fashion: The learners interact with the learning resources in a given learning environment following preset steps of learning processes. The four components are learners, resources, environments, and processes. In most of the learning systems, human learners are at the center (learner centered design; [9]). There are several types of resources in AIS. Schools, classrooms, etc. are physical


learning resources; teachers, librarians, etc. are human resources; static online content, audio/video, etc. are digital resources; computer tutors such as AutoTutor, etc. constitute intelligent digital resources. Types of environments and processes are classified based on type of learning theories implemented in the given AIS. In the context of this paper, we specially consider AIS that include intelligent digital resources. For example, a conversation-based ITS such as AutoTutor [10, 11] delivers learning in a constructive learning environment that engages learners in a process that follows expectation-misconception tailored dialog in natural language. It is reasonable to assume that most of existing AISs with intelligent digital resources are created with certain guiding principles of learning.

4 Guiding Principles of Learning Systems There have been numerous “learning principles” relevant to different levels of learning. For example, psychologists provide twenty key principles [12, 13] to answer five fundamental questions of teaching and learning (Appendix A). An IES report identified the following seven cognitive principles of learning and instruction (Appendix B) as being supported by scientific research with a sufficient amount of replication and generality [14]. Further, the 25 Learning Principles to Guide Pedagogy and the Design of Learning Environments [15] list what we know about learning and suggest how we can improve the teaching-learning interaction (Appendix C). The list provides details for each principle to foster understanding and guide implementation. For example, Deep Questions (principle 18, Appendix C) indicates “deep explanations of material and reasoning are elicited by questions such as why, how, what-if-and what-if not, as opposed to shallow questions that require the learner to simply fill in missing words, such as who, what, where, and when. Training students to ask deep questions facilitates comprehension of material from text and classroom lectures. The learner gets into the mindset of having deeper standards of comprehension and the resulting representations are more elaborate.” Graesser and Person, 1994 [16]

Specifically, questions can be categorized into 3 categories and 16 types (Appendix D). Nielsen [17] extended the above taxonomy with five question types (Appendix E).

5 Learning Science Extension of LOM

Since the early 2000s, the e-learning industry has benefited greatly from IEEE LOM [18] and the Advanced Distributed Learning (ADL) SCORM [19, 20]. More recently, effort has been devoted to enriching metadata for learning objects, for example through the creation of the Learning Resources Metadata Initiative (LRMI) [21]. Although IEEE LOM and LRMI focus almost exclusively on the learning objects (resources), they have placeholders for other components, such as typicalAgeRange (which takes values such as the target learners' age) for



learners; interactivityType (which takes values such as "Exposed" or "Active") for processes; and context (which takes values such as "Training" or "Higher Education") for environments. However, researchers have started to notice their limitations when they are used as metadata for learning objects in AIS such as ITSs, and it has been suggested that metadata for learning content should include Pedagogical Identifiers [22]. It is almost definitional that all well-designed, effective AIS with intelligent digital resources are based on some aspects of learning science. For example, in AutoTutor, one can easily identify the use of learning principles:

• (6) in Appendix A: Clear, explanatory and timely feedback to students is important for learning.
• (7) in Appendix B: Ask deep explanatory questions.
• (18) in Appendix C: Deep Questions.

For any specific AutoTutor module, expectation-misconception tailored dialog works best if deep and complex questions (type 3 in Appendix D) are asked as the main question to start the dialog. Unfortunately, detailed documentation at the level of foundational learning science is rarely available in existing AIS applications. Only a few AIS implementations provide documentation for each of the four components (learners, resources, environments, and processes) at the level of learning science. Our proposed approach is to first extend the existing metadata standards (such as IEEE LOM) to include learning science relevant metadata for each of the four components, and then make recommendations to enhance the xAPI statement. As an intuitive approach, we have "extended" the IEEE LOM with a "learning science extension" which could include a specific set of learning principles and associated implementation details (see Appendix F).
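As a rough, hypothetical illustration of what such a learning science extension could carry alongside a few standard LOM-style educational fields, consider the plain data structure below. The extension element names and values are illustrative assumptions and are not part of IEEE LOM or of any published AutoTutor profile.

```python
# Hypothetical metadata record for one intelligent digital resource.
lom_record = {
    "general": {"title": "Newtonian forces tutoring pack", "language": "en"},
    "educational": {
        "interactivityType": "active",
        "typicalAgeRange": "15-18",
        "context": "higher education",
    },
    # Assumed learning science extension: which principles the intelligent
    # digital resource implements, and how.
    "learningScienceExtension": {
        "learningPrinciples": ["Deep Questions", "Feedback Effects"],
        "questionTypes": ["causal consequence"],
        "interactionStyle": "expectation-misconception tailored dialog",
    },
}
print(lom_record["learningScienceExtension"]["learningPrinciples"])
```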

6 xAPI for AIS Behavior

Intuitively, xAPI is a way to record the process and result of the learner's interactions with the learning resources in a given learning environment. Technically, xAPI is a "specification for learning technology that makes it possible to collect data about the wide range of experiences a person has (online and offline)" [23]. Functionally, xAPI (1) lets applications share data about human performance. (2) lets you capture (big) data on human performance, along with associated instructional content or performance context information. (3) applies human (and machine) readable "activity streams" to tracking data and provides sub-APIs to access and store information about state and content. (4) enables nearly dynamic tracking of activities from any platform or software system—from traditional Learning Management Systems (LMSs) to mobile devices, simulations, wearables, physical beacons, and more. (Experience xAPI - ADL Initiative [24])

The most important components of xAPI statements are actor, verb, and activity. xAPI has been learner centered, where actor is always the learner, verb is the action of



the learner, and activity is relatively flexible to include only limited information about the learning environment and process. As we have pointed out earlier, most AIS derive design and functionality from learning science theory. For each interaction between the learner and the system, sophisticated computations govern progression, involving all three other components (resources, environments, and processes). For example, in AutoTutor, an expectation-misconception tailored dialog involves only one learner action (the input), but multiple processes and components of the AIS are involved. Some of the steps enumerated below (steps 1.3, 2.2, 3.1) are based on theories of learning.

1. Evaluate the learner's input. The result is a function of
   1.1. stored expectations,
   1.2. stored misconceptions,
   1.3. the semantic space, etc.
2. Construct feedback. The feedback is a function of
   2.1. the evaluation outcome,
   2.2. dialog rules, etc.
3. Deliver feedback. The delivery of feedback is a function of
   3.1. the delivery agent (teacher agent or student agent),
   3.2. the type of delivery technology,
   3.3. the timing (latency) of the feedback, etc.

Unfortunately, when the learning record store (LRS) records learner behavior for such an AIS, only the input and result (or feedback) are recorded, but no system behavior. To capture the holistic behavior of an AIS, we need to consider all system behavior. So we necessarily extend the current xAPI behavior data specification, such that

• actor includes all components of the AIS: not just the learner, but also digital resources such as an ITS.
• verb includes actions of the AIS: not just the learner's actions, but also the actions of the AIS.
• activity includes the extended LOM that provides detailed documentation of the AIS at the level of learning science.

As an example, when we send system behavior to the LRS, the following is a sample list of statements from the LRS in reverse order, where statement #13 is the start of the tutoring. Here the human learner is John Doe, the intelligent digital resource is TR_DR_Q2_Basics, and Steve is the agent that represents the intelligent digital resource.

1. John Doe listen Steve TR_DR_Q2_Basics on Hint/Prompt
2. TR_DR_Q2_Basics follow_rule_StartExpectation John Doe
3. TR_DR_Q2_Basics follow_rule_FB2MQMetaCog John Doe
4. TR_DR_Q2_Basics Evaluate John Doe on TutoringPack Q1
5. TR_DR_Q2_Basics follow_rule_TutorHint John Doe
6. TR_DR_Q2_Basics transition John Doe on TutoringPack Q1
7. John Doe answer Steve TR_DR_Q2_Basics on Main Question
8. John Doe listen Steve TR_DR_Q2_Basics on Main Question
9. TR_DR_Q2_Basics follow_rule_Start John Doe
10. TR_DR_Q2_Basics follow_rule_StartTutoring John Doe
11. TR_DR_Q2_Basics follow_rule_AskMQbyTutor John Doe
12. TR_DR_Q2_Basics follow_rule_Opening John Doe
13. TR_DR_Q2_Basics transition John Doe on Tutor Start

There are three actions for the learner (John Doe): first, listen to Steve (#8); then answer Steve (#7) when the Main Question is delivered; and finally listen to Steve (#1) as he delivers hints. But there are multiple actions for the AIS: five actions (#13 to #9) to prepare the delivery of the Main Question, and five actions (#6 to #2) to prepare the delivery of the hint after John Doe answered the Main Question.
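The same pattern can be written down explicitly for a system action. The sketch below is an illustration rather than any system's actual serialization: it encodes something like statement #5 above with the intelligent digital resource as the actor and the learner as the object. The verb URI and account identifiers are invented; the Agent and account structures follow standard xAPI usage.

```python
# Hypothetical xAPI-like statement in which the AIS, not the learner, is the actor.
system_statement = {
    "actor": {
        "objectType": "Agent",
        "name": "TR_DR_Q2_Basics",  # the intelligent digital resource
        "account": {"homePage": "http://example.org/ais", "name": "TR_DR_Q2_Basics"},
    },
    "verb": {
        "id": "http://example.org/ais/verbs/follow_rule_TutorHint",  # invented verb URI
        "display": {"en-US": "followed rule TutorHint"},
    },
    "object": {
        "objectType": "Agent",
        "name": "John Doe",
        "mbox": "mailto:john.doe@example.org",  # invented learner identity
    },
}
```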

7 Discussion

When we consider AIS within a simple four-component model that involves the learners, resources, environments, and processes, we need a way to document the implementation in as much detail as possible. This is especially true when we assume an AIS with intelligent digital resources is built with the guidance of learning science. To do this, we propose to extend the IEEE LOM of the intelligent digital resource to include a learning science extension. This approach is much like the practice in health science, in which all medicine seeking approval from the Food and Drug Administration (even over-the-counter medicine) must include detailed information such as chemical compounds, potential side effects, and the best way to administer the medicine. That information is for physicians and for patients; the proposed learning science extension of LOM for the intelligent digital resources in AIS is for teachers and learners. Furthermore, when it is stored in the LRS, this information can be used for either post-hoc or real-time analysis of the AIS. In addition to the learning science extension of LOM for intelligent digital resources, we have also proposed to extend xAPI statements beyond their current limitations to include all behaviors of the AIS. More importantly, we need to make the intelligent digital resource a legitimate actor and record its actions together with standard descriptions of environments and processes. When we have the learning science extension of LOM for intelligent digital resources and make the LRS store actions from an intelligent digital resource in the same way as a human learner's actions, we are literally proposing a "symmetric" view of AIS: the human learner and the intelligent digital resource are exchangeable in the stored AIS behavior data (in xAPI). The only difference is that the actions of human learners are observed behavior, while the actions of intelligent digital resources are programmed.

8 Conclusions

We advocate a simple four-component model of AIS that includes learners, resources, environments, and processes. We consider this approach particularly valuable for those AIS that include intelligent digital resources. We assume intelligent digital resources


are created with the guidance of learning science (such as learning principles), and consequently we have proposed to use the learning science extension of IEEE LOM to document the intelligent digital resources as metadata. We further argue that behavior data records for AIS with intelligent digital resources should include the behaviors of the whole AIS, including the behavior of the human learners and of the intelligent digital resources. The proposed learning science extension of LOM for intelligent digital resources and the modified xAPI for AIS are being tested in a conversation-based AIS where AutoTutor is the intelligent digital resource.

Acknowledgment. The research was supported by the National Science Foundation (DRK12-0918409, DRK-12 1418288), the Institute of Education Sciences (R305C120001), the Army Research Lab (W911NF-12-2-0030), and the Office of Naval Research (N00014-12-C-0643; N00014-16-C-3027). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF, IES, or DoD. The Tutoring Research Group (TRG) is an interdisciplinary research team comprised of researchers from psychology, computer science, and other departments at the University of Memphis (visit http://www.autotutor.org).

Appendix A Top 20 Principles for Pre-K to 12 Education Divided in Five Types [12] How Do Students Think and Learn? (1) Students’ beliefs or perceptions about intelligence and ability affect their cognitive functioning and learning. (2) What students already know affects their learning. (3) Students’ cognitive development and learning are not limited by general stages of development. (4) Learning is based on context, so generalizing learning to new contexts is not spontaneous but instead needs to be facilitated. (5) Acquiring long-term knowledge and skill is largely dependent on practice. (6) Clear, explanatory and timely feedback to students is important for learning. (7) Students’ self-regulation assists learning, and self-regulatory skills can be taught. (8) Student creativity can be fostered. What Motivates Students? (9) Students tend to enjoy learning and perform better when they are more intrinsically than extrinsically motivated to achieve. (10) Students persist in the face of challenging tasks and process information more deeply when they adopt mastery goals rather than performance goals.


(11) Teachers’ expectations about their students affect students’ opportunities to learn, their motivation and their learning outcomes. (12) Setting goals that are short-term (proximal), specific and moderately challenging enhances motivation more than establishing goals that are long-term (distal), general and overly challenging. Why Are Social Context, Interpersonal Relationships, and Emotional Well-Being Important to Student Learning? (13) Learning is situated within multiple social contexts. (14) Interpersonal relationships and communication are critical to both the teachinglearning process and the social-emotional development of students. (15) Emotional well-being influences educational performance, learning and development. How Can the Classroom Best Be Managed? (16) Expectations for classroom conduct and social interaction are learned and can be taught using proven principles of behavior and effective classroom instruction. (17) Effective classroom management is based on (a) setting and communicating high expectations, (b) consistently nurturing positive relationships and (c) providing a high level of student support. How to Assess Student Progress? (18) Formative and summative assessments are both important and useful but require different approaches and interpretations. (19) Students’ skills, knowledge and abilities are best measured with assessment processes grounded in psychological science with well-defined standards for quality and fairness. (20) Making sense of assessment data depends on clear, appropriate and fair interpretation.

B Seven Principles from an IES Report [14]

(1) Space learning over time.
(2) Interleave worked example solutions with problem-solving exercises.
(3) Combine graphics with verbal descriptions.
(4) Connect and integrate abstract and concrete representations of concepts.
(5) Use quizzing to promote learning.
(6) Help students manage study.
(7) Ask deep explanatory questions.


C 25 Learning Principles to Guide Pedagogy and the Design of Learning Environments [15]

(1) Contiguity Effects. (2) Perceptual-motor Grounding. (3) Dual Code and Multimedia Effects. (4) Testing Effect. (5) Spaced Effects. (6) Exam Expectations. (7) Generation Effect. (8) Organization Effects. (9) Coherence Effect. (10) Stories and Example Cases. (11) Multiple Examples. (12) Feedback Effects. (13) Negative Suggestion Effects. (14) Desirable Difficulties. (15) Manageable Cognitive Load. (16) Segmentation Principle. (17) Explanation Effects. (18) Deep Questions. (19) Cognitive Disequilibrium. (20) Cognitive Flexibility. (21) Goldilocks Principle. (22) Imperfect Metacognition. (23) Discovery Learning. (24) Self-regulated Learning. (25) Anchored Learning.

D Graesser and Person [16] Classification of Questions (1) Simple or shallow (a) Verification: Is X true or false? Did an event occur? (b) Disjunctive: Is X, Y, or Z the case? (c) Concept completion: Who? What? When? Where? (d) Example: What is an example or instance of a category? (2) Intermediate (a) Feature specification: What qualitative properties does entity X have? (b) Quantification: What is the value of a quantitative variable? How much? (c) Definition questions: What does X mean? (d) Comparison: How is X similar to Y? How is X different from Y? (3) Deep or complex (a) Interpretation: What concept/claim can be inferred from a pattern of data? (b) Causal antecedent: Why did an event occur? (c) Causal consequence: What are the consequences of an event or state? (d) Goal orientation: What are the motives or goals behind an agent’s action? (e) Instrumental/procedural: What plan or instrument allows an agent to accomplish a goal? (f) Enablement: What object or resource allows an agent to accomplish a goal? (g) Expectation: Why did some expected event not occur? (h) Judgmental: What value does the answerer place on an idea or advice?


E Nielson [17] Taxonomy of Question Types 1. Description Questions 1:1. Concept Completion: Who, what, when, where? 1:2. Definition: What does X mean? 1:3. Feature Specification: What features does X have? 1:4. Composition: What is the composition of X? 1:5. Example: What is an example of X? 2. Method Questions 2:1. Calculation: Compute or calculate X. 2:2. Procedural: How do you perform X? 3. Explanation Questions 3:1. Causal Antecedent: What caused X? 3:2. Causal Consequence: What will X cause? 3:3. Enablement: What enables the achievement of X? 3:4. Rationale Questions 3:4:1. Goal Orientation: What is the goal of X? 3:4:2. Justification: Why is X the case? 4. Comparison Questions 4:1. Concept Comparison: Compare X to Y? 4:2. Judgment: What do you think of X? 4:3. Improvement: How could you improve upon X? 5. Preference Questions 5:1. Free Creation: requires a subjective creation. 5:2. Free Option: select from a set of valid options.

F Example of Learning Science Extension of IEEE LOM for AutoTutor


Editing interface for learning science extension in AutoTutor authoring tool. It only considered the 25 learning principles for the current version.

The XML for learning principles. For the specific AutoTutor Module, only “Deep_Questions” is explicitly relevant.


Editing interface for learning science extension in AutoTutor Authoring tool to specify type of seed questions asked in a given AutoTutor Module.

The XML for the question type for the specific AutoTutor module. The question type implemented in this AutoTutor module is Causal consequence.

References

1. ADL Initiative Homepage. https://adlnet.gov/. Accessed 9 Feb 2019
2. xAPI Vocabulary & Profile Publishing Server. http://xapi.vocab.pub/. Accessed 9 Feb 2019
3. Graesser, A.C., Chipman, P., Haynes, B.C., Olney, A.: AutoTutor: an intelligent tutoring system with mixed-initiative dialogue. IEEE Trans. Educ. 48(4), 612–618 (2005)
4. Van Rosmalen, P., Vogten, H., Van Es, R., Passier, H., Poelmans, P., Koper, R.: Authoring a full life cycle model in standards-based, adaptive e-learning. J. Educ. Technol. Soc. 9(1), 72–83 (2006)
5. Atkinson, R.C.: Adaptive Instructional Systems: Some Attempts to Optimize the Learning Process. Stanford University, Palo Alto (1974)
6. Park, O.-C., Lee, J.: Adaptive instructional systems. Educ. Technol. Res. Dev. 25, 651–684 (2003)
7. Durlach, P.J., Spain, R.D.: Framework for instructional technology: Methods of implementing adaptive training and education. Defense Technical Information Center, January 2014
8. Shute, V.J., Psotka, J.: Intelligent tutoring systems: Past, present, and future. Armstrong Lab Brooks AFB TX Human Resources Directorate (1994)
9. Soloway, E., Guzdial, M., Hay, K.E.: Learner-centered design: the challenge for HCI in the 21st century. Interactions 1(2), 36–48 (1994)
10. Nye, B.D., Graesser, A.C., Hu, X.: AutoTutor and family: a review of 17 years of natural language tutoring. J. Artif. Intell. Educ. 24(4), 427–469 (2014)
11. Graesser, A.C., Wiemer-Hastings, K., Wiemer-Hastings, P., Kreuz, R.: AutoTutor: a simulation of a human tutor. Cogn. Syst. Res. 1(1), 35–51 (1999)


12. Lucariello, J.M., Nastasi, B.K., Dwyer, C., Skiba, R., DeMarie, D., Anderman, E.M.: Top 20 psychological principles for PK–12 education. Theory Pract. 55(2), 86–93 (2016)
13. Top 20 Principles for Pre-K to 12 Education. https://www.apa.org/ed/schools/teaching-learning/principles/index.aspx. Accessed 31 Jan 2019
14. Pashler, H., et al.: Organizing instruction and study to improve student learning. National Center for Education Research, Institute of Education Sciences, US Department of Education, Washington, DC (2007)
15. Graesser, A.C., Halpern, D.F., Hakel, M.: 25 principles of learning. Task Force on Lifelong Learning at Work and at Home, Washington, DC (2008)
16. Graesser, A.C., Person, N.K.: Question asking during tutoring. Am. Educ. Res. J. 31(1), 104 (1994)
17. Nielsen, R.D., Buckingham, J., Knoll, G., Marsh, B., Palen, L.: A taxonomy of questions for question generation. In: Workshop on the Question Generation Shared Task and Evaluation Challenge (2008)
18. Duval, E., Hodgins, W., Sutton, S., Weibel, S.L.: Metadata principles and practicalities. D-lib Mag. 8(4), 1082–9873 (2002)
19. Poltrack, J., Hruska, N., Johnson, A., Haag, J.: The next generation of SCORM: innovation for the global force. In: The Interservice/Industry Training, Simulation & Education Conference, I/ITSEC (2012)
20. Fletcher, J.D., Tobias, S., Wisher, R.A.: Learning anytime, anywhere: advanced distributed learning and the changing face of education. Educ. Res. 36(2), 96–102 (2007)
21. LRMI Homepage. http://lrmi.dublincore.org/. Accessed 7 Feb 2019
22. DeFalco, J.A.: Proposed standard for an AIS LOM model using pedagogical identifiers. In: Rosé, C.P., et al. (eds.) Artificial Intelligence in Education, London, United Kingdom, p. 43 (2018)
23. xAPI Homepage. https://xapi.com/overview/. Accessed 10 Feb 2019
24. Experience xAPI—ADL Initiative. https://adlnet.gov/research/performance-tracking-analysis/experience-api/. Accessed 10 Feb 2019

Standardizing Unstructured Interaction Data in Adaptive Instructional Systems

Vasile Rus1(&), Arthur C. Graesser2, Xiangen Hu2, and Jody L. Cockroft3

1 Department of Computer Science, The University of Memphis, Memphis, TN 38152, USA
[email protected]
2 Department of Psychology, The University of Memphis, Memphis, TN 38152, USA
3 Institute for Intelligent Systems, The University of Memphis, Memphis, TN 38152, USA

Abstract. Standardization of unstructured information such as freely generated verbal responses in adaptive instructional systems poses many challenges. For instance, free responses have no clear delimitation of knowledge components, i.e., basic learning units (BLU), and therefore identifying the BLUs in such responses automatically is a challenge. We will review and exemplify major challenges and solutions and make recommendations for standardization of such unstructured information. Our work will inform student models that rely on student free responses, domain models that are derived from textual sources such as textbooks or general resources such as Wikipedia, the interoperability of adaptive instructional systems with free text inputs, and learning record stores.

Keywords: Unstructured learner data · Adaptive instructional systems · Natural language



1 Introduction

Many Intelligent Tutoring Systems (ITSs; Rus et al. 2013; Woolf 2007) employ open-ended responses at least partially; in dialogue-based ITSs the whole interaction is dialogue-based, which means all student responses are freely generated. Examples of such freely generated responses are shown in Table 1. Each of the three columns illustrates a major category of free responses: the first column shows short responses usually found in dialogue-based ITSs, the second column shows paragraph-size responses such as design justifications in the Virtual Internship-inator (Arastoopour et al. 2016), whereas the last column refers to a typical essay such as the argumentative essay used in the SAT test. Each of these types of responses poses its own challenges when it comes to extracting more structured information to be exchanged or aligned with data from other systems, or simply to be stored in a common repository such as the Learning Record Store (LRS).


Table 1. Examples of freely generated responses in educational contexts.

Short response/short essays: one sentence or less; dialogue-based Intelligent Tutoring Systems (ITS). Examples: A1: "Equal"; A2: "Forces are equal and opposite in direction"; A3: "equal and opposite".

Medium responses: one paragraph; David Shaffer's Internship-inator system. Example: "Design Specifications: PAM, Vapor, Negative Charge, 4%. Justification: This prototype was altered slightly from the original with this material by changing from 2% CNT to 4%. This is an attempt to increase reliability without hindering flux or blood cell reactivity."

Essays: argumentative essay; SAT-like essays. Example: too long to show an example.

1.1 Freely Generated Student Responses

As we argued in a recent paper (Rus 2018), open ended student responses have the potential to facilitate true assessment, i.e., building an accurate model of students’ level of understanding of a target topic. Indeed, freely articulated student responses reveal their thinking as opposed to, for instance, other forms of input used in some ITSs such as clicking on on-screen elements. In the latter case, similar to multiple choice questions in assessment instruments, students have to choose an answer from a set of given answers (only one of which is correct, usually) without providing any explanations. One major risk of using this kind of input is that students may pick the correct answer for the wrong reasons, without the ITS/assessor even knowing it. One can argue that such multiple choice interfaces/questions can be augmented with requests for explanations. It should be noted that there is a subtle difference between multiple-choicewith-explanation prompts and a traditional open-ended question in that the former gives students more information to work with, including the correct answer. Some students will be able to recognize the correct answer and in retrospect generate an explanation whereas in a pure open-ended question assessment scenario students have to generate both the answer and a solid explanation without any extra hints in the form of a set of potential answer choices. There are other significant advantages of open ended questions. For instance, open ended responses give students, in particular high-ability and high-knowledge students, the opportunity to provide novel and creative responses which is not possible when all they need to do is selecting one response from a limited set of predefined responses. Furthermore, eliciting freely-generated self-explanations has beneficial impact on learning (Chi et al. 2001). While open-ended responses have clear advantages when it comes to offering opportunities to reveal students’ understanding of a target topic, assessing them is extremely challenging. For instance, if manually assessed by experts it becomes prohibitively expensive to scale up the process, e.g., to assess open ended responses from millions of learners.


Another major issue with open-ended responses, and in particular with responses generated in the context of dialogue-based ITSs, is standardization. Indeed, standardization of unstructured information such as freely generated verbal responses in adaptive instructional systems poses many challenges. For instance, free responses have no clear delimitation of knowledge components (KCs), i.e., basic learning units (BLUs), and therefore identifying the BLUs in such responses automatically is a challenge. We will review and exemplify major challenges and solutions and make recommendations for the standardization of such unstructured information. Our work will inform student models that rely on student free responses, domain models that are derived from textual sources such as textbooks or general resources such as Wikipedia, the interoperability of adaptive instructional systems with free-text inputs, and learning record stores.

The rest of the paper is organized as follows. Section 2 outlines two major approaches to standardizing freely-generated responses. Next, we present a model to standardize log files in dialogue-based ITSs. Section 4 discusses the standardization of speech acts in dialogue-based ITSs, whereas Sect. 5 discusses in more detail the extraction of KCs from freely-generated responses. The Conclusions section ends the paper.

2 Two Approaches

When it comes to the standardization of systems, there are two major elements that must be addressed (Rus et al. 2018): (i) adopting a common architecture and (ii) designing interface protocols to facilitate communication among the main components of the common architecture as well as with external components, systems, or users. This paper is primarily relevant to the latter, i.e., interface standardization. Indeed, we analyze what it takes to map freely-generated learner responses to standardized protocols and for data exchange purposes in general.

A widely used approach to representing what students know is to rely on the notion of a knowledge component (KC) or basic learning unit (BLU): an atomic piece of knowledge (concept or skill) that a learner must acquire while learning to master a domain. KCs are typically the result of cognitive task analyses followed by validation and refinement cycles based on actual student performance data (Fancsali et al. 2016).

A major challenge for the task of interface standardization in the area of AISs is that many components and the underlying representations they use, e.g., those based on knowledge components (KCs), are domain and to some degree vendor specific. Indeed, one vendor may use one set of KCs while another vendor may use a very different set of KCs for the same target domain. The set of KCs could even be proprietary, being part of a vendor's IP or "secret sauce" and therefore less likely to be disclosed. On one hand, there is a need for openness; on the other hand, there is a need for IP protection. Our proposed solution is to adopt a hierarchical approach to specify units of data exchange, e.g., KCs, and use numeric ids for identifying ontologies and the KCs in those ontologies (details are available in Rus et al. 2018). The use of ids helps avoid ontological commitments in the interface itself. This design results in a decoupling of


the interface specification from the actual ontology specification, which is accomplished separately. The set of KCs for a domain, or domain model, can be organized into more complex structures, such as parameterized prerequisite knowledge structures, which are used for, among other things, shaping students' learning trajectories, i.e., the sequence in which students explore the concepts and skills to be mastered in the target domain. The set of KCs constitutes the basis of the student model, which could be as simple as specifying the mastery level on each KC at a given moment in time. In the literature, this type of learner model is called the overlay model: the learner model can be viewed as an overlay of the domain model, covering the parts of the domain model the student has explored so far together with performance information, i.e., whether or not the student has mastered the explored topics.

We also suggested generalizing the notion of KC to account for other aspects of learning such as social, motivational, behavioral, emotional, psychomotor, and physiological aspects (Rus et al. 2018). For instance, learners' emotional states can be modelled as a set of emotion components (ECs). An example of an emotion component would be frustration, reflecting a student's level of frustration relative to the current learning goal or instructional task at one particular moment in time. A learner could be frustrated for reasons not related to the current task, which should probably be accounted for separately. Similarly, we work under the assumption that there are social components (SCs) capturing learners' social skills at one particular moment in time, motivational components (MCs), behavioral components (BCs), and so on.

When processing learners' freely generated responses, which are unstructured textual data, one needs to extract the parts related to a reference set of KCs (and similarly for the other xCs such as ECs or BCs). In one approach, if student models are required to be updated continuously, the extraction of KCs should be done on-the-fly and the extracted KCs sent to, for instance, the Learning Record Store (LRS). Dialogue-based ITSs can run without an explicit alignment of their internal representation to standardized sets of KCs and therefore do not necessarily need to extract the KCs at runtime. For instance, such systems only need some form of an ideal answer against which the student responses are automatically compared using semantic similarity approaches (Rus et al. 2013). The ideal answers do not have to be annotated with KCs. However, such annotations would be needed for standardized communication and data integration with other sources of learner data, as explained later. They could be done offline or outside the ITS itself, e.g., by a separate process that analyzes log files, as discussed next for the second approach.

In a second approach, one can log students' interactions directly and then an additional process extracts the KCs off-line and eventually sends them to the LRS. We will focus here on the latter approach, although many ideas apply to the former approach as well. To this end, we propose a standard to record dialogue-based ITS-learner interactions in log files and then indicate ways to extract relevant information. In particular, we will focus on extracting two types of information: speech acts and KCs. Speech acts map dialogue interactions into sequences of dialogue acts which can be used as input for tutorial strategy discovery methods (Rus et al. 2016).
The KCs can be used to infer learner knowledge state and for domain modelling.
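To make the id-based, hierarchical KC referencing described in Sect. 2 concrete, the sketch below shows one way an extracted KC observation could be packaged for a learning record store. This is a minimal illustration only: the ontology id, KC id, and field names are hypothetical and are not part of xAPI or any existing specification.

```python
import json
from datetime import datetime, timezone

# Hypothetical KC observation extracted from a free response; the ontology and
# KC identifiers are illustrative numeric ids, decoupled from any specific ontology.
kc_statement = {
    "actor": {"learner_id": "urn:learner:12345"},           # universal learner id (URI)
    "verb": "demonstrated",
    "object": {
        "ontology_id": 42,                                   # which KC ontology is referenced
        "kc_id": 1007,                                       # which KC within that ontology
        "evidence": "forces are equal and opposite in direction",
    },
    "result": {"mastery_estimate": 0.81},                    # e.g., from a semantic similarity score
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Serialized form that could be posted to a Learning Record Store (LRS).
print(json.dumps(kc_statement, indent=2))
```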


3 Log File Standardization

In one extreme approach, one can argue for recording in a log file as much information as possible, following the Big Data principle "if you can afford it, store it". The rationale is that you never know what kind of information might be useful for a future study. While this principle seems reasonable, in practice one needs to find the right balance between not-too-much and not-too-little information being logged. This is also the case for log files in ITSs that record the system-learner interaction. We suggest the following general guidelines when deciding what to store in a log file:

• Log as much as possible; every detail is important. Practical constraints may impose some reasonable limits on what one can store, and privacy and security issues should be kept in mind as they impose restrictions as well.
• Use a machine-readable format (XML or XML-like), which will make data extraction, fusion, and exchange easier.
• Use proper ids/links to the task, configuration file, and dialogue policy so that, when needed, everything can be linked together (a data provenance requirement; more on this later).
• The format should make it possible to easily extract some sections and present them in a user-friendly format (such as HTML).

Besides recording standard information for each session such as a learner id (i.e., some kind of universal learner id, a URI, that could be used throughout the learning ecosystem), each utterance must be recorded together with a timestamp and the speaker (learner or tutor). This would be the bare minimum of information to be stored. We also strongly suggest adding data provenance related information.

Data Provenance. According to the National Network of Libraries of Medicine's website (https://nnlm.gov/data/thesaurus/data-provenance; accessed on January 30, 2019), data provenance is defined as the origins, custody, and ownership of research data. This is a critical aspect in human subjects research and practice. It has been addressed more systematically in biomedical domains, but other areas such as education should pay more attention to data sources, traceability, and ownership. The concept of provenance guarantees that data creators are held accountable for their work, and provides a chain of information along which data can be tracked as various stakeholders use other stakeholders' data and transform it for their own purposes. An example of the kind of information that could be stored in a log file to ensure traceability and replicability of data collected by an ITS is given in Fig. 1. For dialogue-based ITSs, we recommend including in any log file standard information about the various parameters of the system used to record particular learner data (Fig. 2). It is beyond the scope of this paper to provide an exhaustive specification of the log file. Instead, the above suggestions are meant to be illustrations of what could be stored, at a bare minimum, in a log file of dialogue-based ITSs.


[Figure omitted: excerpt of a log generated by the DeepTutor application.]

Fig. 1. System data to be logged for data traceability and reproducibility purposes.

[Figure omitted: auto-generated key-value list of configuration parameters; some less informative items may be filtered out when displayed in HTML format.]

Fig. 2. System configuration data for traceability and reproducibility purposes.
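As a purely illustrative companion to Figs. 1 and 2, the sketch below builds a minimal XML-like log entry of the kind suggested above: a session with provenance attributes and one timestamped utterance per speaker. The element and attribute names are hypothetical and not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical session log; ids and attribute names are illustrative only.
session = ET.Element("session", {
    "learner_uri": "urn:learner:12345",
    "task_id": "task-207",
    "config_id": "cfg-2019-01-30",        # link to the configuration file (provenance)
    "dialog_policy_id": "policy-7",       # link to the dialogue policy used (provenance)
    "system": "ExampleDialogueITS v1.0",  # which system produced the data
})

for speaker, ts, text in [
    ("tutor",   "2019-01-30T14:02:11Z", "What can you say about the two forces?"),
    ("learner", "2019-01-30T14:02:35Z", "Equal and opposite in direction."),
]:
    utt = ET.SubElement(session, "utterance", {"speaker": speaker, "timestamp": ts})
    utt.text = text

# Machine-readable output that could later be converted to a user-friendly HTML view.
print(ET.tostring(session, encoding="unicode"))
```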

4 Speech Act Standardization

A key research question in ITSs (Rus et al. 2013) and in the broader instructional research community is understanding what expert tutors do as well as what learners do. To this end, we need to map interactions in dialogue-based ITSs from pure conversations to sequences of actions in order to characterize general tutor and learner strategies. A step towards modelling tutorial dialogues as sequences of actions is to map a tutorial dialogue into a sequence of speech acts, i.e., utterances are viewed as actions based on the language-as-action theory (Austin 1962; Searle 1969).

To map utterances in a tutorial dialogue onto corresponding dialogue acts or actions, a predefined dialogue or speech act taxonomy is typically used. For instance, in a prior project we used a taxonomy defined by educational experts. It is a two-level hierarchy of 15 top-level dialogue acts and a number of dialogue subacts. The exact number of


subacts differs from dialogue act to dialogue act. The overall two-level taxonomy consists of 126 unique dialogue-act+subact combinations (Morrison et al. 2014). It should be noted that automatically discovered dialogue act taxonomies are also being built (Rus et al. 2012).

One challenge for standardizing speech act information is that various research and development groups use different taxonomies. Table 2 offers an overview of three taxonomies used by three different research groups. As can be seen from the table, there is a wide range of speech act taxonomies, from flat (no hierarchy) taxonomies to multi-layer taxonomies. Furthermore, the taxonomies use different numbers of speech acts, some of which are speaker specific (tutor vs. learner). If various systems adopting various speech act taxonomies were to exchange data, or simply send data to a common repository for multi-system integration, there must either be an alignment among the various taxonomies (the alignment approach) or the various groups must agree on some common taxonomy, or a partial, upper-level taxonomy, that everyone uses (the consensus approach).

Table 2. Different research groups use different speech act taxonomies.

University of Memphis: 126 dialogue act/subact combinations; 15 upper-level dialogue acts (Rus et al. 2015; Morrison et al. 2014).
North Carolina State University: 12 speech/dialogue acts; one level (no hierarchy) (Boyer et al. 2011a, b).
University of Pittsburgh: 16 dialogue acts; one level (no hierarchy); some acts are speaker specific (tutor vs. student) (Litman and Forbes-Riley 2006).

In many cases, when big commercial players are involved in data exchange operations, they rarely agree on a common taxonomy (this is a lesson from the Semantic Web movement of the early 2000s), and therefore the only practical solution may be the alignment approach, in which every player uses its own proprietary taxonomy and a taxonomy alignment or mapping process is needed. This latter approach has its own drawbacks, such as a fundamental semantic misalignment between the speech act categories in the two taxonomies to be aligned, resulting in a disputable alignment result. It should be noted that speech acts can also be used to annotate student and tutor utterances in tutorial interactions through online tutoring services, where human tutors interact with actual learners via chat-based conversations, so that the resulting data could then be integrated with learner data from ITSs. An example of such a taxonomy used to annotate human tutor – human learner interactions is described by Morrison and colleagues (2014). A sketch of what a taxonomy alignment might look like in practice is shown below.


5 KC Extraction for Standardization

In order to extract KCs from unstructured, freely-generated student responses, there is a need for advanced natural language processing algorithms. Computational approaches to natural language understanding (NLU) can be classified into three major categories: true understanding, information extraction, and text-to-text similarity.

In true understanding, the goal is to map language statements onto a deep semantic representation such as first-order logic (see Rus et al. 2017). This approach relates language constructs to world and domain knowledge that is stored in a well-specified computational knowledge base, and ultimately enables inferencing. Inconsistencies and contradictions can be automatically detected, revealing potential flaws in students' mental model of the target domain. Current state-of-the-art approaches in this true-understanding category offer adequate solutions only in very limited contexts, i.e., toy domains. They lack scalability and thus have limited use in real-world applications such as summarization or intelligent tutoring systems.

Information extraction approaches use shallow processing to automatically detect in learners' free responses the presence of certain nuggets of information that represent key expected concepts, or derived measures that could be used as predictors of the correctness or overall quality of student responses. These approaches focus on surface text features or matching of exact words. They are most appropriate for item types such as fill-in-the-blank short answers where there is a limited range of expected correct responses.

Text-to-text (T2T) similarity approaches to textual semantic analysis avoid the hard task of true understanding by defining the meaning of a text based on its similarity to other texts whose meaning is assumed to be known. Such methods are called benchmarking methods because they rely on a benchmark text, which is generated, checked, or annotated by experts, to identify the meaning of new, unseen texts. We focus in this paper primarily on T2T approaches because they are the scalable and dominant practical approaches at this moment and probably for the foreseeable future. One may argue that some information extraction approaches fall under the category of T2T approaches because, for instance, identifying in student responses a number of expected concepts specified by an expert is equivalent to a T2T approach in which parts of student responses are compared to key expected concepts provided by experts. To make the distinction less ambiguous, we will call T2T approaches those that compare a full student response to a full benchmark or ideal response provided by an expert. If the expert only provides a set of key expected concepts, then such an approach falls under the information extraction category.

The key challenge when it comes to standardization is mapping unstructured free responses to KCs (proprietary or standardized). There are two approaches that could be adopted. In the first approach, the extraction process is fully automated, in which case the trade-off is noisier KCs. The second approach involves manual annotation of KC-related phrases at authoring time. As shown in Table 3, ideal responses can be marked to delimit the parts of a free response that correspond to a KC. If such an approach is adopted, then at runtime a semantic similarity approach can be used to extract KCs by


comparing fragments of student responses (a window of a certain size, e.g., 5 consecutive words in the student response) to the annotated text segments in the ideal responses.

Table 3. Examples of ideal responses annotated with KCs at authoring time (from the Virtual Internship-inator).

Example 1 (medium response): Based off of Natalie's wish to decrease runoff into rivers, I decided to change most of the land around the river that was industrial to wetlands.
Example 2 (medium response): In order to keep industrial space in the town, I changed some open land further away from the river into industrial space.
Example 3 (medium response): I created more housing by changing some single family home areas located west of the river into multifamily homes.
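The sketch below illustrates the runtime matching step just described: sliding a fixed-size window over the student response and comparing each window to the expert-annotated KC segments. A plain lexical (bag-of-words) cosine is used here as a stand-in for the semantic similarity measures cited in the paper; the KC ids, segments, window size, and threshold are all illustrative.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def extract_kcs(response, annotated_segments, window=5, threshold=0.6):
    """Compare sliding windows of the student response to expert-annotated KC
    segments; return the ids of KCs whose best-matching window exceeds the
    (illustrative) similarity threshold."""
    tokens = response.lower().split()
    windows = [Counter(tokens[i:i + window])
               for i in range(max(1, len(tokens) - window + 1))]
    detected = []
    for kc_id, segment in annotated_segments.items():
        seg_vec = Counter(segment.lower().split())
        if max(cosine(w, seg_vec) for w in windows) >= threshold:
            detected.append(kc_id)
    return detected

# Hypothetical KC-annotated segments taken from an ideal response.
segments = {
    "kc:runoff-reduction": "change industrial land near the river to wetlands",
    "kc:preserve-industry": "keep industrial space by converting open land away from the river",
}
student = "I changed most of the industrial land near the river to wetlands to cut runoff"
print(extract_kcs(student, segments))   # ['kc:runoff-reduction']
```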

6 Conclusion

In this paper we discussed some key topics related to standardizing freely-generated learner responses. We discussed what to log when interactions between a tutor (human-based or computer-based) and a learner involve such free responses, the need to adopt a common speech act taxonomy or to align taxonomies, and finally the issue of mapping free, unstructured learner responses onto structured sets of KCs, which are needed for exchanging or pooling data from the different learning systems that students may use. As already noted, the ideas presented here for extracting KCs from unstructured data could also apply to other learner aspects such as their emotional state; e.g., one can imagine deriving ECs from the unstructured free response data. The free response data could be combined with other types of data, e.g., timestamps recorded in the log files, to infer, for instance, behavioral components (BCs) such as wheel spinning. The ideas in this paper are meant to inform standardization efforts in AISs.

Acknowledgements. The research was supported by the National Science Foundation Data Infrastructure Building Blocks program under Grant No. ACI-1443068, the Army Research Lab (W911NF-12-2-0030, W911NF-17-2-015), and National Science Foundation grant EEC-1340402. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of, and should not be interpreted as representing the official policies, either expressed or implied, of, the NSF, IES, ONR, ARL, or the U.S. Government.

References

Boyer, K.E., Grafsgaard, J., Ha, E.Y., Phillips, R., Lester, J.C.: An affect-enriched dialogue act classification model for task-oriented dialogue. In: Proceedings of the International Conference of the Association for Computational Linguistics (ACL), Portland, Oregon, pp. 1190–1199 (2011a)


Arastoopour, G., Shaffer, D.W., Swiecki, Z., Ruis, A.R., Chesler, N.C.: Teaching and assessing engineering design thinking with virtual internships and epistemic network analysis. Int. J. Eng. Educ. 32(2) (2016). http://www.academia.edu/download/45975653/Teaching_and_Assessing_Engineering_Design_Thinking.pdf
Austin, J.L.: How to Do Things with Words. Clarendon Press, Oxford (1962)
Boyer, K.E., et al.: Investigating the relationship between dialogue structure and tutoring effectiveness: a Hidden Markov modeling approach. Int. J. Artif. Intell. Educ. 21(1–2), 65–81 (2011b)
Chi, M., Siler, S.A., Jeong, H., Yamauchi, T., Hausmann, R.G.: Learning from human tutoring. Cogn. Sci. 25(4), 471–533 (2001)
Fancsali, S.E., Ritter, S., Yudelson, M., Sandbothe, M.: Implementation factors and outcomes for intelligent tutoring systems: a case study of time and efficiency with Cognitive Tutor Algebra. In: FLAIRS Conference (2016). http://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS16/paper/download/12965/12614
Litman, D., Forbes-Riley, K.: Correlations between dialogue acts and learning in spoken tutoring dialogues. Nat. Lang. Eng. 12(2), 161–176 (2006)
Morrison, D.M., Nye, B., Samei, B., Datla, V.V., Kelly, C., Rus, V.: Building an intelligent PAL from the Tutor.com session database - Phase 1: data mining. In: The 7th International Conference on Educational Data Mining, pp. 335–336 (2014)
Rus, V., Graesser, A., Moldovan, C., Niraula, N.: Automatic discovery of speech act categories in educational games. In: 5th International Conference on Educational Data Mining (EDM 2012), Chania, Greece, 19–21 June 2012
Rus, V., D'Mello, S., Hu, X., Graesser, A.C.: Recent advances in conversational intelligent tutoring systems. AI Mag. 34(3), 42 (2013)
Rus, V., Banjade, R., Maharjan, N., Morrison, D., Ritter, S., Yudelson, M.: Preliminary results on dialogue act classification in chat-based online tutorial dialogues. In: Proceedings of the 9th International Conference on Educational Data Mining, Raleigh, NC, 29 June–2 July 2016
Rus, V., Olney, A.M., Foltz, P., Hu, X.: Automated assessment of learner-generated natural language responses. In: Sottilare, R., Graesser, A., Hu, X., Goodwin, G. (eds.) Design Recommendations for Intelligent Tutoring Systems: Assessment Methods, vol. 5, pp. 155–170. U.S. Army Research Laboratory, Orlando (2017)
Rus, V.: Explanation-based automated answer assessment of open ended learner responses. In: Proceedings of the 14th International Scientific Conference eLearning and Software for Education, Bucharest, Romania, 19–20 April 2018
Rus, V., Niraula, N., Maharjan, N., Banjade, R.: Automated labelling of dialogue modes in tutorial dialogues. In: Proceedings of the Twenty-Eighth International Florida Artificial Intelligence Research Society Conference, pp. 205–210 (2015)
Rus, V., Graesser, A.C., Hu, X., Cockroft, J.L.: A computational perspective of adaptive instructional systems for standards design. In: Proceedings of the Workshop on Exploring Opportunities to Standardize Adaptive Instructional Systems (AISs) in Conjunction with the 19th International Conference on Artificial Intelligence in Education (AIED 2018), London, United Kingdom, 27–30 June 2018
Searle, J.R.: Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge (1969)
Woolf, B.P.: Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing E-Learning. Morgan Kaufmann Publishers Inc., San Francisco (2007)

Exploring Methods to Promote Interoperability in Adaptive Instructional Systems

Robert Sottilare
Soar Technology, Inc., Orlando, FL 32817, USA
[email protected]

Abstract. This paper explores design principles and methods to promote interoperability and reuse in a class of instructional technologies called adaptive instructional systems (AISs), which include intelligent tutoring systems (ITSs), intelligent mentors or recommender systems, and intelligent instructional media. AISs are artificially-intelligent, computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each individual learner or team in the context of domain learning objectives. IEEE is exploring standards and best practices for AIS modeling, interoperability, and evaluation under its Project 2247 and affiliated working group. This paper was composed to document recommendations related to interoperability standards for AISs. The after-school market for AISs is large and on the rise in China, the US, and Europe. The desired level of interoperability for AISs is at the lowest possible level to allow component reuse without impinging upon the intellectual property of the vendor.

Keywords: Adaptive instructional system (AIS) · Intelligent tutoring systems (ITSs) · Interoperability · Reuse · Standards

1 Introduction

During the last year, an IEEE working group chartered through Project 2247 has taken on the task of developing standards and best practices for AISs. This IEEE working group will be debating what is and is not an AIS, and what standards and best practices will evolve from the marketplace. To date, the group has identified three potential areas for standardization: (1) a conceptual model for AISs, (2) interoperability standards for AISs, and (3) evaluation best practices for AISs.

This paper explores design principles and methods to enhance interoperability and reuse for adaptive instructional systems (AISs), which are defined as artificially-intelligent, computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each individual learner or team in the context of domain learning objectives [1].

Adaptive instructional systems (AISs) come in many forms, and this makes standardization challenging. The most common form of AIS is the intelligent tutoring system (ITS), which is a computer-based system that automatically provides real-time,


tailored, and relevant feedback and instruction to learners [2, 3]. Other forms of AISs include intelligent mentors (recommender systems), which promote the social relationship between learners and intelligent agents [4], and web-based intelligent media used for instruction.

Interoperability is defined as "the ability of two or more software components to cooperate despite differences in language, interface, and execution platform. It is a scalable form of reusability…" [5]. When examining interoperability as a design goal, we are attempting to design interfaces that are defined to a degree that allows information to be exchanged with and understood by other systems or system components [1]. Effective AISs allow information to flow between components and to be used by those components to enable transitions from less optimal learning states to new, positive learning states. This is accomplished through changes to the level/type of support, feedback, and difficulty of the content. AIS decisions may be triggered by trends in learner data (e.g., behavioral data), traits (e.g., personality traits), or states (e.g., behavioral or physiological states, emotions, learning, and performance). Data and traits are shared by the components and may be used to derive current states or project future states. Information (all data, traits, or states) is used to guide AIS decisions on the selection of domain-independent strategies (e.g., ask a question, provide support, prompt the learner for more information, and request a reflective dialogue). Based on the strategy selection, tactics, i.e., specific domain-dependent actions, are selected by the AIS. This sequence of interactions is conducted with the goal of optimizing tutoring decisions and ultimately learning. For AISs, this set of interactions has been described in the learning effect model (LEM) for both individual learners and teams of learners [6–10].

We are suggesting that the interactions between common AIS components, which are often arranged as encoded messages, might be our best near-term opportunity for standardizing AISs. Interactions within and external to the AIS involve its four common components: a domain model, an individual learner or team model, an instructional model, and an interface model [11]. But will all AISs have these common components? Will these components truly function the same, to the degree that we could drop a domain model from one AIS into another AIS? These questions remain open until we have an approved conceptual model for AISs. Development of that formal conceptual model is still about a year away, but we can make some credible assumptions about these common components and how they function.
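As an illustration of the data-to-decision chain just described (learner data and traits, a derived state, a domain-independent strategy, then a domain-dependent tactic), the sketch below encodes one highly simplified pass through it. The state labels, strategy names, and tactics are invented for this example and do not come from any published LEM implementation.

```python
# Hypothetical, highly simplified strategy/tactic selection chain.
def derive_state(recent_scores, frustration_level):
    """Collapse raw learner data into a coarse learning state label."""
    if frustration_level > 0.7:
        return "frustrated"
    return "struggling" if sum(recent_scores) / len(recent_scores) < 0.5 else "progressing"

# Domain-independent strategies keyed by learner state.
STRATEGY = {
    "frustrated": "provide_support",
    "struggling": "increase_scaffolding",
    "progressing": "increase_difficulty",
}

# Domain-dependent tactics (here, for an algebra domain) keyed by strategy.
TACTIC = {
    "provide_support": "show_worked_example('completing the square')",
    "increase_scaffolding": "give_hint('divide all terms by the coefficient of x^2')",
    "increase_difficulty": "present_problem('quadratic with fractional coefficients')",
}

state = derive_state(recent_scores=[0.2, 0.4, 0.3], frustration_level=0.4)
strategy = STRATEGY[state]
print(state, "->", strategy, "->", TACTIC[strategy])
# struggling -> increase_scaffolding -> give_hint('divide all terms by the coefficient of x^2')
```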

1.1 Interoperability in Domain Models

The domain model describes the set of skills, knowledge, strategies (plans for action), and tactics (actions executed by the ITS) for the topic/domain being instructed [11]. In the case of model tracing tutors, the domain model contains the expert knowledge in the form of an ideal learner model, and also includes the bugs, mal-rules, and misconceptions that students periodically exhibit. In the case of constraint-based tutors, the domain model includes constraints that have three data elements [12]:

• Relevance Condition – describes when the constraint is applicable
• Satisfaction Condition – specifies assessments to be applied to ascertain the correctness of the solution


• Feedback Message – communicates with the learner to advise them that their solution is incorrect and why, and reminds the learner of the corresponding declarative knowledge

ITSs use the context or constraints found in domain models, along with learner states, to select optimal instructional strategies. The key attribute of AISs is intelligence, or the ability to observe and adapt. Noting that this is critical to being an AIS, we advocate that domain models be included as common components of all AISs. It should also be emphasized that the intellectual property in AISs resides primarily in the domain model, in the form of content, methods of assessment, and instructional strategies and tactics. This makes interoperability much more difficult than in other AIS components, but even so, there could be opportunities for interoperable AIS domain models at the common component level by standardizing the format of the data that passes between domain models and the other common models (i.e., learner, instructional, and interface models).

1.2 Interoperability in Individual Learner and Team Models

Individual learner models consist of the cognitive, affective, motivational, and other psychological states that evolve during instruction and moderate (enhance or diminish) learning. Often, the data required to classify or predict learner states is acquired from sensors, learner input (e.g., self-reported data), learner assessments (e.g., tests or tasks used to define a level of learner proficiency within a domain of instruction), or historical databases (e.g., a long-term learner model, learning record store, or learning management system). There are opportunities to standardize the format of this data to allow it to be used by AISs, and we shall revisit this topic later in this paper. Since learner performance is primarily tracked in the domain model, the learner model is often viewed as an overlay (subset) of the domain model, which changes over the course of tutoring. For example, "knowledge tracing" tracks the learner's progress from problem to problem and builds a profile of strengths and weaknesses relative to the domain model [13].

Team models are not specifically called out as ITS common components, but the focus on the dynamics and functions of teams in collaborative learning and collaborative problem solving has placed new emphasis on the design of ITSs for team use [11, 14–16]. The team model, sometimes referred to as a collective model, must be able to track progress toward collective task learning objectives for either training or collaborative learning goals. To support team development and enhance collaboration skills, team models are significantly more complex than individual learner models, which are primarily focused on the assessment of tasks. Team models also assess teamwork, which is the "coordination, cooperation, and communication among individuals to achieve a shared goal" [17].

ITSs use models of individual learners or teams to assess their progress toward defined learning objectives and to drive instructional decisions that optimize learning. While some learner attributes are relatively static, most of the attributes in ITS learner models are unique to the task or domain of instruction. Again, this poses a significant challenge to standardizing learner or team models, but standardized data formats could


support interoperability by allowing the exchange of information between learner models and other components (i.e., the domain, instructional, or interface model).
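The "knowledge tracing" mentioned above is commonly realized as Bayesian Knowledge Tracing. The sketch below shows the standard per-observation update; the parameter values (prior, slip, guess, and learn rates) are purely illustrative, as is the sequence of responses.

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """One Bayesian Knowledge Tracing step: condition the mastery estimate on the
    observed response, then apply the learning-transition probability."""
    if correct:
        evidence = p_know * (1 - slip)
        posterior = evidence / (evidence + (1 - p_know) * guess)
    else:
        evidence = p_know * slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - guess))
    return posterior + (1 - posterior) * learn

# Trace mastery of a hypothetical KC across a short sequence of responses.
p = 0.3  # illustrative prior probability that the learner already knows the KC
for observed_correct in [True, False, True, True]:
    p = bkt_update(p, observed_correct)
    print(round(p, 3))
```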

1.3 Interoperability in Instructional Models

The instructional model (also known as the pedagogical model or the tutor model) takes conditions from the domain model and states from the learner model as input and makes recommendations in the form of tutoring strategies (plans for future AIS actions), next steps, and tactics (actions on what the tutor should do next in the exchange with the learner). In mixed-initiative systems, learners may also take actions, ask questions, or request help [18, 19], but the AIS should be ready in real-time to decide "what to do next" at any point, and this is determined by an instructional model that captures pedagogical theories in the form of recommended instructional practices. The instructional model has a duality to it: it encompasses generalizable theories of learning and instructional practices, as well as principles and measures of assessment that are specifically linked to particular domains of instruction.

As we begin to think about the interoperability of AISs, we should consider open and proprietary solutions. Open solutions might be generalizable across domains or specific to a domain of instruction. Proprietary solutions may likewise be generalizable or specific. Either way, we must seek opportunities to standardize in areas that do not violate intellectual capital and do not limit the movement of effective solutions into the marketplace. One way to accomplish this might be to treat these solutions as black boxes and standardize the information flowing into and out of the black boxes rather than attempt to dictate how the processes within are constructed. To this end, we next explore the fourth common element in AIS architectures, the interface model.

1.4 Interoperability in Interface Models

The interface model interprets the learner's contributions through various input media (speech, typing, clicking) and produces output in different media (text, diagrams, animations, agents). In addition to conventional human-computer interface features, some recent systems have incorporated natural language interaction [20–22], speech recognition [23, 24], and the sensing of learner emotions [25–27]. The interface model also governs how learners interact within the AIS. It should be noted that part of the interface model is targeted directly at providing the instructional model with the information needed to tailor the training, and part is targeted at the learner's interaction with the environment, where the environment might be an external simulation, some media, or a set of mathematical problems for the learner to solve. The interface captures behaviors related to interaction with the instructional content and provides this information to the instructional model within the AIS so it can update learner states and environmental conditions (e.g., context – where the learner is in the map of items to be learned); the AIS can then make decisions with the goal of optimizing learning outcomes.

When we think about interface models, we naturally think about the flow of information through the AIS architecture: learner data captured by various sources as


discussed earlier in this paper, learner states derived from learner data, and instructional options and recommendations for next steps (strategies and tactics). The data that flows from one model to another usually takes on the form of messages or blocks of data with specific formatting, but the message types and formats vary from one AIS architecture to another, and this is problematic when our goal is interoperability and reuse of tools and methods across AISs.

Now that we have explored the types of data that AIS common components generate, process, and share with each other, it is time to discuss interoperability types, how interoperability might be supported in AISs, and finally, how interoperability might lead to reuse opportunities across different types of AIS platforms.

2 Examining Interoperability and Reuse in AISs

Santos and Jorge [28] argued that "because of interoperability issues, intelligent tutoring systems [a subset of AISs] are difficult to deploy in current educational platforms without additional work. This limitation is significant because tutoring systems require considerable time and resources for their implementation. In addition, because these tutors have a high educational value, it is desirable that they could be shared, used by many stakeholders, and easily loaded onto different platforms." At the heart of the problem is a lack of interoperability, i.e., the inability of two or more software-based systems (or components) to cooperate by sharing information in spite of differences in their programming languages, interfaces, and functions [5]. Interoperability may be thought of as a scalable form of reuse [5]. Therefore, logic dictates that any enhancement that makes systems more interoperable will likely result in higher reuse of components, lower development costs, and increased opportunities for collaboration.

For AISs, our goal is to allow plug-and-play applications for the principal components described in the section above: the domain, learner/team, instructional, and interface models. If we think of AISs as composable networks, then plug-and-play components are capable of configuring both themselves and other cooperative network components without human intervention; they are in fact intelligent agents. Whether supporting plug-and-play is possible or practical will be discussed in this section. We begin by describing two levels of interoperability.

2.1 Syntactic and Semantic Interoperability

According to Ouksel and Sheth [29] and Euzenat [30], interoperability occurs at two levels:

• Syntactic Interoperability: occurs when two or more systems are able to communicate by exchanging data; syntactic interoperability is a prerequisite for semantic interoperability, defined below
• Semantic Interoperability: occurs when the data exchanged between two or more systems is understandable to each system and can be used in each system's processes


In addition to internal component interoperability, we should also consider how AISs and their components might be made interoperable with non-AIS components. There are in fact many requirements for AIS interoperability:

• Internal component interoperability
• External system and component interoperability
  – Other AISs (e.g., information consumers and producers as part of the Total Learning Architecture [31])
  – Learning Management Systems (LMSs) and other repositories
  – Computer-based systems (e.g., training simulations)
  – Sensors and other data-generating components

We explore each of these interoperability requirements below in the context of the Generalized Intelligent Framework for Tutoring (GIFT) [32, 33].

2.2 Internal Component Interoperability

In AIS architectures like GIFT, there are standardized messages that facilitate the exchange of information between internal components (e.g., the sensor module, learner module, and pedagogical module). GIFT's standardized messages provide a form of syntactic and semantic interoperability within GIFT. Sottilare and Brawner [34] identified the following categories of messages (changes/additions shown in bold), which might also be used by other systems to promote internal interoperability and the reuse of components from other AISs or AIS architectures:

• Domain Model
  – Inputs
    • Requests for action (from Instructional Model)
      – Increase/decrease scaffolding (support)
      – Increase/decrease frequency or type of feedback
      – Increase/decrease the difficulty of future problems or scenarios
    • Feedback associated with concepts
    • A model of domain tasks, conditions, and standards (measures)
  – Outputs
    • Learner assessments (to Learner Model)
      – Performance states
      – Learning states
      – Domain proficiency states
      – Retention models and states
      – Transfer of skills states
      – Emotional states (moderators of performance and learning states)
• Learner Model
  – Inputs
    • Learner assessments for each learning objective or concept (from Domain Model)
    • Learner state representation (from Domain Model or derived from data)
    • Sensor data (if applicable)
    • Longer term data (if applicable)


  – Outputs
    • Domain proficiency (to the long-term learner model at the end of a course or lesson)
    • Learner state representation (to Instructional Model) to support new recommendations or strategies
• Instructional Model
  – Inputs
    • Learning state representation (from Learner Model)
      – Cognitive state of the learner
      – Performance expectations (above, below, at) for each concept
      – Predicted future performance based on a competency model
    • Physiological state representation (from Learner Model)
      – Derived emotional and physical states (e.g., fatigue)
      – Physiological stressors
    • Behavioral state representation (from Learner Model)
      – Derived attitudes or psychomotor performance based on primitive behaviors
    • Longer term learner attributes (from Learner Model or Learner Record Store)
      – Demographics and traits
      – Historical performance (competency and/or levels of proficiency)
  – Outputs
    • Request for changes to course direction (to Domain Model)
    • Request for feedback (to Domain Model)
    • Request for scenario adaptation (to Domain Model)
    • Request for assessment (to Domain Model)

Additional message sets will need to be developed to accommodate the exchange of new information (in bold) between AIS components; a hedged sketch of what one such message might look like is shown below. As suggested by Bell and Sottilare [35] (this volume), intelligent agents can and likely will play a large role in observing the learner and the environment and then triggering interactions (sharing of information) between AIS internal components.
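The sketch below illustrates what one such encoded message might look like: a learner state representation sent from a learner model to an instructional model. The field names and values are hypothetical and do not reproduce GIFT's actual message schema.

```python
import json

# Hypothetical internal AIS message: Learner Model -> Instructional Model.
# Field names are illustrative only and are not GIFT's actual message format.
learner_state_message = {
    "header": {
        "source": "learner_model",
        "destination": "instructional_model",
        "message_type": "LEARNER_STATE",
        "timestamp": "2019-07-26T10:15:00Z",
    },
    "payload": {
        "learner_id": "urn:learner:12345",
        "cognitive_state": {"engagement": "high", "workload": "moderate"},
        "performance_expectations": {"concept:steady_position": "below",
                                     "concept:trigger_squeeze": "at"},
        "predicted_future_performance": 0.62,   # e.g., from a competency model
    },
}

# A receiving module would parse the payload and select a strategy accordingly.
print(json.dumps(learner_state_message, indent=2))
```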

2.3 External System and Component Interoperability

The next target of opportunity for AIS interoperability that we will discuss is AIS compatibility with external systems and components. In GIFT, a standard gateway has been defined to allow external systems and components (e.g., sensors, repositories, and training simulations) to: (1) push data into GIFT to support instructional decisions and assessments and (2) pull data from GIFT to act on either the learner or the environment. This provides a level of syntactic interoperability through the transport of data. Opportunities for semantic interoperability are enabled by defining variables and their data structures so AISs can understand how to use this data within AIS processes. A set of desired instances of interoperable systems/components is discussed in the sections below.


Interoperability with Other AISs. A major standardization goal for AISs is their compatibility with other AISs, to the level that major components (e.g., learner models or domain models) from one AIS could be inserted into another AIS and still function appropriately. Within the GIFT architecture, learner models are constructed based upon the variables and measures established by the AIS author. For instance, the AIS author might have a learning objective to master concepts associated with rifle marksmanship. Since this is a psychomotor task, it will be important for the AIS to track learner behaviors associated with steady position, aiming, breathing control, and trigger squeeze. The data would be acquired by sensors and used to assess performance states for steady position, aiming, breathing control, and trigger squeeze in the domain model. The derived performance states would then be transferred to the learner model. If the task was different (e.g., solving quadratic equations), then the learner model might track whether any of the three steps listed below were completed and in what order (see the step-tracking sketch later in this section):

• Step 1: Divide all terms by a (the coefficient of x2).
• Step 2: Move the number term (c/a) to the right side of the equation.
• Step 3: Complete the square on the left side of the equation and balance this by adding the same value to the right side of the equation.

It might also be desirable to track individual learning during group tasks (e.g., collaborative problem solving) and assess contributing factors to successful performance [10]. In this case, we could see the need for a collective model of the group that receives state information from all the supporting AISs relative to their roles, tasks, and progress toward learning objectives for the subject collective task. To automate assessment, the collective model might be tied to an agent-based hierarchical model that decides on a group strategy for task feedback and support.

Interoperability with Learning Management Systems (LMSs). Another target of opportunity to enhance interoperability in AISs is to enable their functional interaction with learning management systems (LMSs) like edX, Blackboard, Moodle, or Canvas. LMSs are software applications that support the authoring, documentation, tracking, reporting, and delivery of instruction for educational courses (e.g., Massive Open Online Courses – MOOCs) and training programs (e.g., Florida Boating Safety) [36]. AISs might be used to stimulate LMSs and provide adaptations (e.g., tailoring of feedback, support, and content). For instance, Aleven and colleagues [37] used GIFT and the Cognitive Tutor to stimulate adaptations in an edX course, resulting in an adaptive MOOC. The Learning Tools Interoperability (LTI) e-learning standard [38] was used to facilitate the integration of GIFT, the Cognitive Tutor, and edX. In this instance, GIFT was an LTI tool provider and edX was an LTI tool consumer, as defined in the 2019-1 GIFT documentation:

• LTI Tool Consumer – a system that requests access to learning tools from an external source and then displays them internally
• LTI Tool Provider – a system that provides the learning tool to the Consumer
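Returning to the quadratic-equation example above, the sketch below shows one simple way a learner model could record which solution steps were completed and in what order. The step labels and the representation are illustrative; they are not drawn from GIFT or any published learner model.

```python
# Hypothetical overlay-style record of step completion for one learner on one task.
EXPECTED_ORDER = ["divide_by_a", "move_constant", "complete_square"]

class StepTracker:
    """Track which solution steps a learner performed and in what order."""
    def __init__(self):
        self.observed = []

    def record(self, step):
        if step in EXPECTED_ORDER and step not in self.observed:
            self.observed.append(step)

    def summary(self):
        return {
            "completed": {s: (s in self.observed) for s in EXPECTED_ORDER},
            "in_expected_order": self.observed == EXPECTED_ORDER[: len(self.observed)],
        }

tracker = StepTracker()
for step in ["divide_by_a", "complete_square"]:   # learner skipped the middle step
    tracker.record(step)
print(tracker.summary())
# {'completed': {'divide_by_a': True, 'move_constant': False, 'complete_square': True},
#  'in_expected_order': False}
```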


Interoperability with Computer-Based Systems. As noted in the section above on interoperability with LMSs, we also envision the need to be interoperable with other computer-based systems that support instruction. Training systems that are currently classified as low-adaptive systems, or systems that only adapt on performance, might benefit from additional tailoring based on individual learner or team learning needs, goals, and preferences. For instance, the serious game Virtual BattleSpace (VBS) is a staple for military training in many countries. VBS has been integrated with GIFT (via the GIFT gateway module) to allow for more comprehensive tailoring. The primary function of the gateway module is to listen for communication outside of GIFT and then convert it into GIFT messages, and vice versa. When a message is received from outside of GIFT (e.g., via the VBS distributed interactive simulation (DIS) connection), the gateway module converts that message into a GIFT message and sends it to the gateway module's topic, which is used to send GIFT simulation messages from interop plugins affiliated with systems or available information streams (e.g., DIS, the VBS plugin).

A common gateway approach may be a viable method of promoting syntactic interoperability between AISs and other computer-based systems used for training (task learning) and education (concept learning). Associated interop plugins would be required for any system or component that wanted to share information with a gateway-compliant AIS. To achieve semantic interoperability, a mechanism is needed to identify the variables to be shared and the conditions they are intended to assess. In GIFT this is accomplished through a JavaScript condition class. For example, for VBS, a condition class might be used to check whether a specific entity avoided an area in the virtual environment represented in a VBS scenario. This information could then be used to assess the learner's ability to move by terrain association and/or dead reckoning while avoiding certain obstacles, areas, or terrain features, which is a learning objective for mastering a land navigation task.

Interoperability with Sensors and Other Data-Generating Components. Sensors are devices that detect, identify, acquire, measure, and store a physical property associated with the environment or with the behaviors or physiology of the learner(s). Sensors also require gateways to facilitate the movement of data acquired by the sensor to GIFT.
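To make the condition-class idea concrete, the sketch below implements a generic "avoid area" check of the kind described above. It is written here as a standalone Python function, not as GIFT's actual JavaScript condition class, and the scenario data is invented.

```python
# Hypothetical assessment condition: did the entity stay out of a restricted area?
# Coordinates and the area definition are invented for illustration.
RESTRICTED_AREA = {"x_min": 100.0, "x_max": 150.0, "y_min": 200.0, "y_max": 260.0}

def inside(area, x, y):
    return area["x_min"] <= x <= area["x_max"] and area["y_min"] <= y <= area["y_max"]

def assess_avoided_area(entity_track, area=RESTRICTED_AREA):
    """Return 'at_expectation' if no position in the entity's track entered the area,
    otherwise 'below_expectation' (labels are illustrative)."""
    violated = any(inside(area, x, y) for x, y in entity_track)
    return "below_expectation" if violated else "at_expectation"

# Positions reported by a simulation (e.g., over a DIS-like feed) for one learner entity.
track = [(90.0, 190.0), (120.0, 195.0), (160.0, 230.0)]
print(assess_avoided_area(track))   # at_expectation: the track skirts the area without entering it
```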

3 Recommendations for Interoperability Standards

Our first recommendation is to examine closely what has been done with GIFT that might enhance the internal component interoperability of AISs along with their interoperability with other systems and components. GIFT was designed to allow instruction in a variety of domains, and its architectural principles have enabled interoperability with a variety of systems. The types of interoperability that have been demonstrated by GIFT support effective interaction with systems and components in a way that allows the maintenance of intellectual property by treating internal processes as black boxes. Interoperability is facilitated at the shared data level.


Our second recommendation is to treat the types of interoperability described in Sect. 2 of this paper as separate requirements. A solution (standard or recommended practice) proposed for one requirement should not affect the type of solutions proposed for other requirements.

Finally, while it is basic to what AISs are, we recommend revisiting the structure that established AISs as four common components. There are still applications of AISs that do not specifically fit this model, as well as processes that usually reside in one component but have been shifted to run in other AIS components. Agreeing on a conceptual model at either a component or process level is the precursor to any successful standards or recommended practices intended to facilitate the movement of AIS products to the marketplace.

References

1. Sottilare, R., Brawner, K.: Component interaction within the Generalized Intelligent Framework for Tutoring (GIFT) as a model for adaptive instructional system standards. In: The Adaptive Instructional System (AIS) Standards Workshop of the 14th International Conference of the Intelligent Tutoring Systems (ITS) Conference, Montreal, Quebec, Canada, June 2018
2. Anderson, J.R., Franklin Boyle, C., Reiser, B.J.: Intelligent tutoring systems. Science 228(4698), 456–462 (1985)
3. Psotka, J., Mutter, S.A.: Intelligent Tutoring Systems: Lessons Learned. Lawrence Erlbaum Associates, Hillsdale (1988). ISBN 978-0-8058-0192-7
4. Baylor, A.: Beyond butlers: intelligent agents as mentors. J. Educ. Comput. Res. 22(4), 373–382 (2000)
5. Wegner, P.: Interoperability. ACM Comput. Surv. (CSUR) 28(1), 285–287 (1996)
6. Sottilare, R.: Considerations in the development of an ontology for a Generalized Intelligent Framework for Tutoring. In: International Defense & Homeland Security Simulation Workshop in Proceedings of the I3M Conference, Vienna, Austria, September 2012
7. Fletcher, J.D., Sottilare, R.: Shared mental models of cognition for intelligent tutoring of teams. In: Sottilare, R., Graesser, A., Hu, X., Holden, H. (eds.) Design Recommendations for Intelligent Tutoring Systems. Learner Modeling, vol. 1. Army Research Laboratory, Orlando (2013). ISBN 978-0-9893923-0-3
8. Sottilare, R.: Adaptive Intelligent Tutoring System (ITS) Research in Support of the Army Learning Model - Research Outline. Army Research Laboratory (ARL-SR-0284), December 2013
9. Sottilare, R., Ragusa, C., Hoffman, M., Goldberg, B.: Characterizing an adaptive tutoring learning effect chain for individual and team tutoring. In: Proceedings of the Interservice/Industry Training Simulation and Education Conference, Orlando, FL, December 2013
10. Sottilare, R.A., Burke, C.S., Salas, E., Sinatra, A.M., Johnston, J.H., Gilbert, S.B.: Designing adaptive instruction for teams: a meta-analysis. Int. J. Artif. Intell. Educ. (2017). https://doi.org/10.1007/s40593-017-0146-z
11. Sottilare, R., Graesser, A., Hu, X., Sinatra, A.M.: Introduction to team tutoring and GIFT. In: Design Recommendations for Intelligent Tutoring Systems. Team Tutoring, vol. 6. U.S. Army Research Laboratory, Orlando (2018). ISBN 978-0-9977257-4-2
12. Mitrovic, A., Martin, B., Suraweera, P.: Intelligent tutors for all: the constraint-based approach. IEEE Intell. Syst. 4, 38–45 (2007)


13. Anderson, L.W., et al.: A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Pearson, Allyn & Bacon, New York (2001)
14. Sottilare, R., Holden, H., Brawner, K., Goldberg, B.: Challenges and emerging concepts in the development of adaptive, computer-based tutoring systems for team training. In: Proceedings of the Interservice/Industry Training Simulation & Education Conference, Orlando, Florida, December 2011
15. Sottilare, R.A., Burke, C.S., Salas, E., Sinatra, A.M., Johnston, J.H., Gilbert, S.B.: Designing adaptive instruction for teams: a meta-analysis. Int. J. Artif. Intell. Educ. 28(2), 225–264 (2018)
16. Johnston, J., Sottilare, R., Sinatra, A.M., Burke, C.S. (eds.): Building Intelligent Tutoring Systems for Teams: What Matters. Emerald Group Publishing, Bingley (2018)
17. Salas, E.: Team Training Essentials: A Research-Based Guide. Routledge, London (2015)
18. Aleven, V., Mclaren, B., Roll, I., Koedinger, K.: Toward meta-cognitive tutoring: a model of help seeking with a Cognitive Tutor. Int. J. Artif. Intell. Educ. 16(2), 101–128 (2006)
19. Rus, V., Arthur, C.G.: The question generation shared task and evaluation challenge. The University of Memphis, National Science Foundation (2009)
20. Graesser, A.C.: Conversations with AutoTutor help students learn. Int. J. Artif. Intell. Educ. 26(1), 124–132 (2016)
21. Johnson, W.L., Lester, J.C.: Face-to-face interaction with pedagogical agents, twenty years later. Int. J. Artif. Intell. Educ. 26(1), 25–36 (2016)
22. Nye, B.D., Graesser, A.C., Hu, X.: AutoTutor and family: a review of 17 years of natural language tutoring. Int. J. Artif. Intell. Educ. 24(4), 427–469 (2014)
23. D'Mello, S.K., Graesser, A., King, B.: Toward spoken human–computer tutorial dialogues. Hum. Comput. Interact. 25(4), 289–323 (2010)
24. Litman, D.: Natural language processing for enhancing teaching and learning. In: Thirtieth AAAI Conference on Artificial Intelligence, 5 March 2016
25. Baker, R., D'Mello, S., Rodrigo, M., Graesser, A.: Better to be frustrated than bored: the incidence and persistence of affect during interactions with three different computer-based learning environments. Int. J. Hum. Comput. Stud. 68(4), 223–241 (2010)
26. D'Mello, S., Graesser, A.: Dynamics of affective states during complex learning. Learn. Instr. 22(2), 145–157 (2012)
27. Goldberg, B.S., Sottilare, R.A., Brawner, K.W., Holden, H.K.: Predicting learner engagement during well-defined and ill-defined computer-based intercultural interactions. In: D'Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011, Part I. LNCS, vol. 6974, pp. 538–547. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24600-5_57
28. Santos, G.S., Jorge, J.: Interoperable intelligent tutoring systems as open educational resources. IEEE Trans. Learn. Technol. 6(3), 271–282 (2013)
29. Ouksel, A.M., Sheth, A.: Semantic interoperability in global information systems. ACM SIGMOD Rec. 28(1), 5–12 (1999)
30. Euzenat, J.: Towards a principled approach to semantic interoperability. In: Proceedings of the IJCAI 2001 Workshop on Ontology and Information Sharing, pp. 19–25, 4 August 2001
31. Folsom-Kovarik, J.T., Raybourn, E.M.: Total Learning Architecture (TLA) enables next-generation learning via meta-adaptation. In: Interservice/Industry Training, Simulation, and Education Conference Proceedings, November 2016
32. Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The Generalized Intelligent Framework for Tutoring (GIFT). Concept paper released as part of GIFT software documentation. U.S. Army Research Laboratory—Human Research & Engineering Directorate (ARL-HRED), Orlando, FL, USA (2012)

238

R. Sottilare

33. Sottilare, R., Brawner, K., Sinatra, A., Johnston, J.: An updated concept for a Generalized Intelligent Framework for Tutoring (GIFT). US Army Research Laboratory, Orlando, FL, USA (2017) 34. Sottilare, R., Brawner, K.: Exploring standardization opportunities by examining interaction between common adaptive instructional system components. In: Proceedings of the First Adaptive Instructional Systems (AIS) Standards Workshop, Orlando, Florida, March 2018 35. Bell, B, Sottilare, R.: Adaptation vectors for instructional agents. In: Proceedings of the First HCII Adaptive Instructional Systems Conference, Orlando, Florida, July 2019 36. Ellis, R.K.: Field guide to learning management systems. ASTD learning circuits, 1–8 August 2009 37. Aleven, V., Sewall, J., Andres, J.M., Sottilare, R., Long, R., Baker, R.: Towards adapting to learners at scale: integrating MOOC and intelligent tutoring frameworks. In: Proceedings of the Fifth Annual ACM Conference on Learning at Scale, p. 14. ACM, 26 June 2018 38. IMS Global. Learning Tools Interoperability (LTI) Version 1.2 (2019). https://www. imsglobal.org/specs/ltiv1p2

Examining Elements of an Adaptive Instructional System (AIS) Conceptual Model

Robert Sottilare1, Brian Stensrud1, and Andrew J. Hampton2

1 Soar Technology, Inc., Orlando, FL 32817, USA
{bob.sottilare,stensrud}@soartech.com
2 University of Memphis, Memphis, TN, USA
[email protected]

Abstract. This paper examines the components, functions, and interactions of adaptive instructional systems (AISs) as a method to construct a conceptual model for use in the development of IEEE standards. AISs are artificially-intelligent, computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each individual learner or team in the context of domain learning objectives. IEEE is exploring standards and best practices for AIS modeling, interoperability, and evaluation under its Project 2247 and affiliated working group. This paper was composed to document the interaction of learners with AISs in the context of a domain of instruction. The goal is to identify key interactions within AISs that drive instructional decisions, and to identify the data and methods required to support those machine-based instructional decisions. In other words, we seek to identify methods to assess learner/team progress toward instructional objectives (e.g., knowledge acquisition, skill development, performance, retention, and transfer of skills from instruction to operational/working environments). As part of the examination of AIS elements, we review a set of popular AIS architectures as a method of identifying what makes AISs unique from other instructional technologies. We conclude with recommendations for future AIS research and standards development.

Keywords: Instructional decisions · Learner states · Learner data · Learner interaction
1 Introduction

Adaptive instructional systems (AISs) come in many forms, with the most common being the intelligent tutoring system: a computer-based system that automatically provides real-time, tailored, and relevant feedback and instruction to learners [1, 2]. Other forms include intelligent mentors (recommender systems), which promote the social relationship between learners and intelligent agents [3], and other intelligent media used for instruction. The common components of AISs include a domain model, a learner model, an instructional model, and an interface model [4].

During the last year, an IEEE working group chartered through Project 2247 has taken on the task of developing standards and best practices for AISs. This IEEE working group will be debating what is and is not an AIS. To date, the group has identified three potential areas for standardization: (1) a conceptual model for AISs, (2) interoperability standards for AISs, and (3) evaluation of best practices for AISs. This paper examines the development of an AIS conceptual model.

A conceptual model is a representation of a system composed of a set of concepts that support understanding of the principles, functions, and interactions of the system it represents [5]. According to Kung and Solvberg [6], a successful conceptual model should satisfy four fundamental objectives:

• Enhance an individual's understanding of the represented system
• Facilitate efficient conveyance of system details between stakeholders
• Provide a point of reference for system designers to extract system specifications
• Document the system for future reference and provide a means for collaboration.

2 Defining AISs

We begin by defining AISs, examining the definition in detail to understand the drivers of adaptation, and then providing examples of popular AIS architectures, relating each to our chosen AIS definition. AISs are defined as: artificially-intelligent, computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each individual learner or team in the context of domain learning objectives [7]. In examining our definition of AISs, we note that key words and phrases separate AISs from other classes of systems.

2.1 Artificially-Intelligent, Computer-Based

The phrase artificially-intelligent, computer-based indicates that we are discussing a system that is adaptive to changing conditions (and likely also adaptable). While both adaptive and adaptable systems provide system flexibility, adaptive systems are able to observe the environment, identify changing conditions, and then take action without human intervention [8]. In adaptable systems, the control over change/flexibility is in the hands of the user [8]. Many methods of adaptation have been used in AISs, ranging from complex, real-time, autonomous decision-making to simple, prescriptive rules or decision trees.
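To make the rule-based end of that spectrum concrete, the Python sketch below shows a minimal prescriptive adaptation rule. It is an illustration only, not drawn from any particular AIS; the state fields, thresholds, and action names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LearnerState:
    """Hypothetical snapshot of a learner used by a simple adaptive rule."""
    recent_score: float      # proportion correct on the last problem set (0-1)
    hints_requested: int     # hints asked for during the last problem
    frustration: float       # estimated affective state (0-1), e.g., from sensors

def select_adaptation(state: LearnerState) -> str:
    """Prescriptive, decision-tree-style adaptation: no learning, just fixed rules."""
    if state.recent_score < 0.5 and state.frustration > 0.7:
        return "present_worked_example"     # reduce load before more practice
    if state.recent_score < 0.5:
        return "assign_easier_problem"
    if state.hints_requested == 0 and state.recent_score > 0.9:
        return "advance_to_next_objective"
    return "assign_similar_problem"

print(select_adaptation(LearnerState(recent_score=0.4, hints_requested=3, frustration=0.8)))
```

More sophisticated AISs replace these fixed rules with decision-making that is itself learned and revised over time.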

2.2 Guided Learning Experiences

The phrase guided learning experiences indicates that we are discussing an intelligent system where the guide or tutor helps align learning with focused learning objectives. Learning is the process of acquiring new, or modifying existing, knowledge, behaviors, skills, values, or preferences [9]. Learning theories attempt to describe how learners acquire, process, and retain knowledge during learning while also accounting for various influences on the learning process: memory, emotions, prior experience, and environmental factors. While there are many approaches to learning, guided learning activities in AISs are usually behavioral, cognitive, constructive, experiential, or social/collaborative, but it should be understood that these approaches are not mutually exclusive.

Behavioral learning approaches propose that learning is a long-term change in observable behavior in response to stimuli presented during instruction [10] and are primarily concerned with measurable results. Much of the game-based instruction today is intended to stimulate and reinforce performance and decision-making. Behavioral reinforcement may be positive (e.g., increased scoring or status, rewards) or negative (e.g., declining health status).

Cognitive learning is a process where the acquisition of knowledge results from the internal processing of information as it is transferred from a knowledgeable individual to the learner [11]. Cognitive approaches propose that learning is moderated by factors such as memory, engagement, motivation, fatigue, thinking, and reflection, and are also concerned with instructional methods leading to retention of knowledge. In AISs, we could expect to see activities where the learner is asked to reason (e.g., complete a task, solve a problem).

Constructive learning is a process where learning results in the development of mental models as part of a construction process in which learners develop new ideas and concepts from their own knowledge and experience [12]. Constructive strategies include reflective thinking [13], learning by doing or experimentation [14], and discovery learning [15], which enable learners to construct mental models that have individual meaning, with the goal that they take ownership of their learning. In AISs, scenario-based instruction offers opportunities for learners to explore their environment, expand their situational awareness (build their mental model), and then act on the environment in order to solve a problem or optimize a decision.

Experiential learning is a process of learning through experience [16] and combines aspects of behavioral, cognitive, and constructive learning. Specifically, Kolb's theory of experiential learning is a cycle of four stages: concrete experience, reflective observation, abstract conceptualization, and active experimentation [16]. In AISs, the same four stages may be implemented to support knowledge/skill acquisition, practice, reflection, modification of mental models, and then beginning anew.

Social/collaborative learning is an approach in which learners socially interact with others (peers, instructors, and others) with the goal of expanding their knowledge and skill [17]. Collaborative learning reinforces active participation by learners in the group, generally focusing on a learning goal, and includes computer-supported collaborative learning (CSCL) activities [18]. AISs have also been used to support collaborative learning, but AIS applications have mainly concentrated on team training [19] and teamwork, where "coordination, cooperation, and communication among individuals [is applied] to achieve a shared goal" [20].

2.3 Tailored Instruction and Recommendations

The phrase tailored instruction and recommendations indicates that AISs are learner-centric systems. Tailoring or adaptation in AISs is based on the goals, needs, and preferences of individual learners or teams. This close tie between actions by the AIS and the learner's states (e.g., knowledge, performance, emotion) and desired states (e.g., competency) forms the basis of the learning effect model (LEM) [19, 21]. The LEM links learner data (e.g., physiological or behavioral data) to learner states (e.g., assessed or data-derived states—performance, proficiency, or emotions) to instructional strategies (plans for action generated by the AIS) to instructional tactics (actions executed by the AIS). The terms goals, needs, and preferences also provide a temporal element to the AIS conceptual model in that they can be near-term (in the moment of instruction) or longer-term as related to future desired states.
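The Python sketch below illustrates the learning effect chain described above, from learner data to states to strategies to tactics. The class and field names are illustrative assumptions, not part of the LEM as published; the point is only to show the linkage.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LearnerData:
    """Raw observations (hypothetical fields): behavioral and physiological signals."""
    response_times_s: List[float]
    errors: int
    heart_rate_bpm: float

@dataclass
class LearnerStates:
    """States derived from the data, e.g., performance and emotion estimates."""
    performance: float   # 0-1
    frustration: float   # 0-1

def assess_states(data: LearnerData) -> LearnerStates:
    # Toy derivations standing in for real classifiers/assessments.
    performance = max(0.0, 1.0 - 0.2 * data.errors)
    frustration = min(1.0, max(0.0, (data.heart_rate_bpm - 70) / 50))
    return LearnerStates(performance, frustration)

def select_strategy(states: LearnerStates) -> str:
    """Strategy = plan for action, chosen from learner states."""
    if states.frustration > 0.6:
        return "reduce_difficulty_and_encourage"
    return "increase_challenge" if states.performance > 0.8 else "provide_guided_practice"

def select_tactic(strategy: str) -> str:
    """Tactic = concrete action executed by the AIS for the chosen strategy."""
    return {
        "reduce_difficulty_and_encourage": "show_worked_example_with_supportive_feedback",
        "increase_challenge": "present_transfer_problem",
        "provide_guided_practice": "present_problem_with_hints_enabled",
    }[strategy]

states = assess_states(LearnerData([12.0, 15.5], errors=3, heart_rate_bpm=95))
print(select_tactic(select_strategy(states)))
```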

2.4 Context of Domain Learning Objectives

The phrase context of domain learning objectives indicates that AIS strategies and tactics are formulated with the goal of progressing toward specified learning objectives within a domain of instruction including team-based training [22]. It is important to note that a generalized AIS conceptual model would enable application to various domains of instruction. Already, AISs have been applied to cognitive domains such as mathematics [23], psychomotor domains such as marksmanship [24] and land navigation [25], and team/social/collaborative domains such as collaborative problem solving [26]. Next we examine decision making in AISs.

3 Examining Instructional Decisions in AISs

Key elements related to the AIS decision processes are learner proficiency (also known as prior knowledge) and context. Learner and domain data drive AIS decision making. In examining the automated instructional decision processes within AISs, we can distill them down into three simple types: recommendations, strategies, and tactics. Recommendations are relevant proposals that usually suggest possible next steps (e.g., problem or lesson selection) and fit into what VanLehn [27] describes as the outer loop of the tutoring process, which executes once for each task, multi-step problem, or scenario. As noted in our discussion of the LEM, strategies are plans for action by the AIS and tactics are actions usually executed by an AIS intelligent agent. Strategies and tactics are associated with the inner loop of the tutoring process, which executes once for each step, turn, or action taken by the learner as they work toward a successful solution to a problem or scenario. During inner loop execution, feedback and/or hints may be provided to the learner during each step, turn, or action, and the learner's developing competence is assessed and updated in the learner model. The learner states in the learner model are used by the outer loop to select a next task that is appropriate for that particular learner.

Intelligent agents are autonomous entities that observe their environment through sensors and act upon their environment using actuators while directing their activity towards achieving goals [28]. In AISs, Baylor [29] identifies three primary functions for intelligent agents: (1) the ability to manage large amounts of data, (2) the ability to serve as a credible instructional and domain expert, and (3) the ability to modify the environment in response to learner performance. To this end, we add the requirement for the agent to be a learning agent, an entity that makes more effective decisions with each experience.
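The sketch below illustrates the outer-loop/inner-loop structure described above. It is a simplified Python illustration under assumed names (task records, step scoring, a mastery threshold), not VanLehn's formal specification or any fielded AIS.

```python
from typing import Dict, List

def run_tutor(tasks: List[Dict], learner_model: Dict[str, float], mastery: float = 0.9) -> None:
    """Outer loop: select one task at a time based on the learner model."""
    while True:
        # Choose the next task whose target skill is not yet mastered.
        pending = [t for t in tasks if learner_model.get(t["skill"], 0.0) < mastery]
        if not pending:
            break
        task = pending[0]

        # Inner loop: one iteration per learner step, with feedback and model updates.
        for step in task["steps"]:
            correct = attempt_step(step)           # learner acts; result observed
            give_feedback(step, correct)           # hint or confirmation on this step
            update_estimate(learner_model, task["skill"], correct)

def attempt_step(step: str) -> bool:
    return True  # placeholder for real learner interaction

def give_feedback(step: str, correct: bool) -> None:
    print(f"{step}: {'correct' if correct else 'hint provided'}")

def update_estimate(model: Dict[str, float], skill: str, correct: bool) -> None:
    # Toy update: nudge the proficiency estimate toward 1 (correct) or 0 (incorrect).
    prior = model.get(skill, 0.5)
    model[skill] = prior + 0.2 * ((1.0 if correct else 0.0) - prior)
```

The inner loop feeds the learner model; the outer loop reads that model to pick the next task, which is exactly the division of labor described in the text.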


As shown in Fig. 1, intelligent agents in AISs observe and act on both the learner and the domain model (also known as the environment). The agents learn by reviewing the effectiveness of their decisions and updating policy when appropriate.

Fig. 1. Decision making process in AISs

We have discussed how models of the learner and the domain (environment), along with intelligent agents observing and acting on the instructional model, support decision making in AISs. A critical element feeding the AIS decision process is the learner model, with both real-time and historical data. Next, we will examine how the AIS interface model also contributes real-time data from the learner to support state assessment.

4 Examining Learner–AIS Interaction

As noted earlier, a common element in AISs is an interface model, but the model and the data on which it acts are anything but standard. AIS interfaces can take many forms, from simple dashboards for mathematics tutors to scenario-based virtual environments for instructing military tactics. The AIS uses the interface model to push and pull learner data to/from external environments. The notion of an external environment as part of an AIS may be transparent to the learner. The learner interacts with a computer program in the form of a simulation, a game, a webpage, or some other media as part of their instructional experience. The learner selects controls, moves avatars through simulated terrain, solves mathematical problems, or receives content from these media, which we are calling a domain or environment.


This interaction results in the generation of data that can be used by the AIS to assess the learner's progress toward assigned learning objectives. Referring back to Fig. 1, we see the learner acting on the environment and observing its response. The intelligent agent is also observing and acting on the environment and the learner to assess conditions and learner states, respectively. Depending on the AIS and the tasks to be learned in the domain of instruction, the learner could be interfacing through controls or through natural modes with the aid of sensors. These interfacing paradigms include:

• Passive sensing of visual or other stimuli—this is the most common mode and usually involves presentation of content to the learner.
• Unobtrusive sensing of non-verbal learner behaviors—sensors are used to acquire the location, position, or gestures of the learner.
• Haptic interaction of the learner with the environment—the learner interacts with the environment through a sense of touch facilitated by technology (e.g., haptic glove or controller).
• Natural language interaction with other entities—learners talk to human or virtual instructors or other learners.
• Text-based interaction with other entities—learners chat with human or virtual instructors or other learners.

Each of the modes noted above provides data to the AIS for decision making, resulting in recommendations, strategies, or tactics. For data coming from outside of the AIS architecture, a mechanism must be provided to allow for both the transport and the decoding of that data. Transport means the movement of data from outside the architecture to where it can be processed by the AIS; this is usually facilitated by a gateway. The decoding of the data so it can be understood and used by the AIS is usually accomplished via a defined condition class that describes the format and establishes a variable name for each type of data. Now that we have reviewed learner interfaces in AISs, we move on to review a few common AIS forms and associated architectures in the next section.
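Before moving on, here is a minimal illustration of that transport-and-decode step in Python: a hypothetical external message is mapped onto typed variables the AIS can use. The message fields and class name are invented for illustration; this is not GIFT's actual gateway or condition-class API.

```python
import json
from dataclasses import dataclass

@dataclass
class MarksmanshipShotCondition:
    """Hypothetical 'condition class': names and types the AIS expects for one data source."""
    shooter_id: str
    distance_m: float
    hit: bool

def decode_shot_message(raw: bytes) -> MarksmanshipShotCondition:
    """Gateway-side decoding: map an external JSON payload onto the declared variables."""
    payload = json.loads(raw.decode("utf-8"))
    return MarksmanshipShotCondition(
        shooter_id=str(payload["shooterId"]),
        distance_m=float(payload["distanceMeters"]),
        hit=bool(payload["hit"]),
    )

# Example message as it might arrive from an external training environment.
msg = b'{"shooterId": "L-07", "distanceMeters": 300, "hit": false}'
print(decode_shot_message(msg))
```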

5 AIS Forms and Exemplar AIS Architectures

AISs take many forms and have many features (e.g., natural language dialogue or open learner models), but AISs can be categorized broadly as follows:

• Cognitive or Model Tracing AISs
• Example Tracing AISs
• Constraint-Based AISs.

5.1 Cognitive or Model Tracing AISs

In model tracing systems, the AIS uses a cognitive model to trace the learner's steps as they move through the problem-solving process. This enables the AIS to provide step-by-step feedback to the learner as part of the inner loop of adaptive instruction [27].


Cognitive models attempt to represent domain knowledge in the same manner in which knowledge is represented in the human mind [30]. According to the Adaptive Control of Thought—Rational (ACT-R) cognitive architecture [31], "acquiring cognitive knowledge involves the formulation of thousands of rules relating task goals and task states to actions and consequences" [32]. Model tracing is therefore very process-centric, with the AIS attempting to comprehend the process that a learner uses to solve a problem and ultimately arrive at a solution. Model tracing AISs are composed of expert rules, buggy rules, a model tracer, and a user interface. Expert rules represent the steps that a proficient or ideal learner might take to solve the problem [33]. A minimal sketch of this rule-matching idea follows the examples below. Examples of cognitive or model tracing AISs include:

• Cognitive Tutor [34]—various tutors authored using the Cognitive Tutor and associated tools, including GeneticsTutor and MathTutor
• Dragoon [35]—an intelligent tutoring system used to teach the construction and exploration of models of dynamic systems for use in mathematics and science.
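The Python sketch below is a toy model tracer under assumed names: one expert rule and one buggy rule for two-digit subtraction, matched against a learner's step. It is illustrative only and far simpler than the rule systems in Cognitive Tutor or ACT-R.

```python
from typing import Callable, List, NamedTuple, Optional

class Rule(NamedTuple):
    name: str
    kind: str                                    # "expert" or "buggy"
    matches: Callable[[int, int, int], bool]     # (minuend, subtrahend, learner_answer)
    feedback: str

RULES: List[Rule] = [
    Rule("correct-subtraction", "expert",
         lambda a, b, ans: ans == a - b,
         "Correct."),
    Rule("smaller-from-larger", "buggy",
         lambda a, b, ans: ans == abs((a % 10) - (b % 10)) + 10 * ((a // 10) - (b // 10)),
         "It looks like you subtracted the smaller digit from the larger one in each column."),
]

def trace_step(a: int, b: int, learner_answer: int) -> Optional[Rule]:
    """Return the first rule (expert or buggy) whose behavior reproduces the learner's step."""
    for rule in RULES:
        if rule.matches(a, b, learner_answer):
            return rule
    return None

matched = trace_step(52, 38, 26)   # 52 - 38: the "smaller-from-larger" bug produces 26
print(matched.name if matched else "no rule matched")
print(matched.feedback if matched else "ask for another try")
```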

5.2 Example Tracing AISs

Example tracing AISs, also called pseudo-tutors, are actually a subset of cognitive AISs, but they have a much simpler cognitive model and use generalized examples of problem-solving behavior, as opposed to model-tracing AISs, which use a rule-based cognitive model to interpret learner behavior. An advantage of example-tracing AISs is that they can be built quickly without formal computer programming knowledge, and they can serve as a tool for "rapid prototyping", or creating iterative prototypes over a short amount of time. Examples of example tracing AISs include:

• Tuning Tutor [36]—an example-tracing tutor developed to teach learners about applied machine learning, and specifically how to apply general principles of avoiding overfitting in cross-validation to the case where parameters of a model need to be tuned
• ASSISTment Builder [37]—a tool designed to rapidly create, test, and deploy a simple type of pseudo-tutor called ASSISTments, which have a simple cognitive model based upon a state graph designed for a specific problem.

5.3 Constraint-Based AISs

Per Mitrovic and colleagues [38], constraint-based AISs use constraints to represent correct knowledge related to pedagogically significant states in order to eliminate the need to model the learner's misconceptions. A constraint is linked to a set of solution states that share the same domain concept. Constraints are composed of three elements:

• Relevance Condition—describes when the constraint is applicable.
• Satisfaction Condition—specifies assessments to be applied to ascertain the correctness of the solution.
• Feedback Message—communicates with the learner to advise them that their solution is incorrect and why it is incorrect, and provides reminders to the learner of corresponding declarative knowledge.

An example constraint for a land navigation (orienteering) task might be "when using a compass in the northern hemisphere, place the compass on your map and rotate the map until the needle points to the top of the map". In this case, the relevance condition is using a compass in the northern hemisphere. The satisfaction condition is the needle pointing to the top of the map. The feedback message might be to continue rotating until the needle points to the top of the map. Modeling of the learner is facilitated by assessments of the satisfaction or violation of constraints related to the domain concepts experienced (a minimal sketch of this constraint appears after the examples below). Examples of constraint-based AISs include:

• Java Language Acquisition Tile Tutoring Environment (J-LATTE) [39]—a constraint-based intelligent tutoring system that teaches a subset of the Java programming language.
• POSIT Constraint-Based Tutor [40]—a process-oriented subtraction interface for tutoring.
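The Python sketch below encodes that compass constraint as a relevance predicate, a satisfaction predicate, and a feedback message. The state fields and class names are hypothetical; real constraint-based tutors such as those built with ASPIRE represent constraints in their own authoring formalism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

State = Dict[str, object]   # learner's current solution state (hypothetical fields)

@dataclass
class Constraint:
    relevance: Callable[[State], bool]      # when does this constraint apply?
    satisfaction: Callable[[State], bool]   # is the applicable state correct?
    feedback: str                           # message if relevant but violated

compass_constraint = Constraint(
    relevance=lambda s: s.get("tool") == "compass" and s.get("hemisphere") == "north",
    satisfaction=lambda s: s.get("needle_points_to") == "top_of_map",
    feedback="Keep rotating the map until the compass needle points to the top of the map.",
)

def check(constraint: Constraint, state: State) -> Optional[str]:
    """Return feedback only when the constraint is relevant and violated."""
    if constraint.relevance(state) and not constraint.satisfaction(state):
        return constraint.feedback
    return None

state = {"tool": "compass", "hemisphere": "north", "needle_points_to": "left_of_map"}
print(check(compass_constraint, state))
```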

5.4 Multi-domain AIS Architectures

The above AIS types (i.e., cognitive or model tracing, example tracing, and constraint-based) focus primarily on single domains. Now we move on to multi-domain architectures which, as the name suggests, are able to author, deliver, and automatically manage adaptive instruction in different educational and training domains. In addition to the Cognitive Tutor discussed above, we review three multi-domain architectures in this section:

• Generalized Intelligent Framework for Tutoring (GIFT)
• AutoTutor
• Active Student Participation Inspires Real Engagement (ASPIRE).

Generalized Intelligent Framework for Tutoring (GIFT). The Generalized Intelligent Framework for Tutoring (GIFT), developed by the Learning in Intelligent Tutoring Environments (LITE) Lab at the US Army Research Laboratory, is emerging as a multi-domain, open-source tutoring architecture [41, 42]. GIFT is a research prototype intended to reduce the computer skills and cost required to author ITSs, deploy them, manage them, and continuously evaluate the adaptive instruction they provide. A major advantage of GIFT is that three of its four functional elements are reusable across task domains. GIFT may also be linked to external training environments (e.g., serious games or virtual and augmented reality simulations) through a standardized gateway. GIFT authoring tools require no formal knowledge of computer programming or instructional design to develop effective ITSs. GIFT is freely available and may be hosted either locally or in the cloud. GIFT-based tutors have been prototyped to support training in adaptive marksmanship, land navigation, medical casualty care, and other military and non-military domains. GIFT, like other ITS technologies, has focused on training individuals, but research is underway to create tools and methods to support tutoring of collectives. At the time of this writing, GIFT has a community of over 2000 government, academic, and industry users in 76 countries. Additional information about GIFT is available at www.GIFTtutoring.org.

AutoTutor. AutoTutor, developed at the University of Memphis, has been a stalwart in dialogue-based tutoring over the last 20 years. AutoTutor is an intelligent tutoring system that holds conversations with the human learner in natural language. AutoTutor has produced learning gains across multiple domains (e.g., computer literacy, physics, critical thinking). AutoTutor research is focused on three main areas: human-inspired tutoring strategies, pedagogical agents, and natural language tutoring. AutoTutor has been applied to several task domains in support of one-to-one tutoring, and it has a comprehensive set of authoring tools and services. An emerging capability in AutoTutor is the trialogue: intelligent pedagogical agents that help students learn by holding a conversation in natural language among the student, a virtual instructor, and a virtual student peer [43]. Additional information about AutoTutor is available at www.autotutor.org/.

Active Student Participation Inspires Real Engagement (ASPIRE). ASPIRE, developed by the University of Canterbury in New Zealand, is a system for developing and delivering adaptive instruction on the web [44]. The system consists of ASPIRE-Author, a tutor development server, and ASPIRE-Tutor, a tutoring server that delivers the resulting ITSs to students for guided instruction. The authoring system provides a unique process for composing an ontology of the domain by outlining basic domain concepts, their properties, and the relationships between concepts, forming the basis of an expert model. Lessons learned from the ASPIRE authoring process may reduce the time and cost associated with authoring ITSs and/or increase the accuracy of the represented domain. Additional information about ASPIRE is available at http://aspire.cosc.canterbury.ac.nz/.

6 Next Steps—Recommendations for AIS Research and Standards Development

We presented and dissected a definition of AISs that addresses the functional interaction of their four common components. We explored AIS forms and discussed the characteristics of AIS architectures to identify their commonalities and to compare and contrast their differences. As noted throughout this paper, AIS complexity and diversity of form present a challenge to developing standards. A conceptual model will provide the essential ontology and terms of reference for members of the AIS research, development, and standards community to rally around. We conclude that difficult work is ahead in the development of a comprehensive AIS conceptual model, but that the development of this model is an essential step towards identifying AIS standards for interoperability and reuse.


References

1. Anderson, J.R., Boyle, C.F., Reiser, B.J.: Intelligent tutoring systems. Science 228(4698), 456–462 (1985)
2. Psotka, J., Mutter, S.A.: Intelligent Tutoring Systems: Lessons Learned. Lawrence Erlbaum Associates, Hillsdale (1988). ISBN 978-0-8058-0192-7
3. Baylor, A.: Beyond butlers: intelligent agents as mentors. J. Educ. Comput. Res. 22(4), 373–382 (2000)
4. Sottilare, R., Graesser, A.C., Hu, X., Sinatra, A.M.: Introduction to team tutoring and GIFT. In: Design Recommendations for Intelligent Tutoring Systems: Volume 6—Team Tutoring. U.S. Army Research Laboratory, Orlando (2018). ISBN 978-0-9977257-4-2
5. Hoppenbrouwers, S.J.B.A., Proper, H.A., van der Weide, Th.P.: A fundamental view on the process of conceptual modeling. In: Delcambre, L., Kop, C., Mayr, H.C., Mylopoulos, J., Pastor, O. (eds.) ER 2005. LNCS, vol. 3716, pp. 128–143. Springer, Heidelberg (2005). https://doi.org/10.1007/11568322_9
6. Kung, C.H., Solvberg, A.: Activity modeling and behavior modeling. In: Olle, T., Sol, H., Verrijn-Stuart, A. (eds.) Proceedings of the IFIP WG 8.1 Working Conference on Comparative Review of Information Systems Design Methodologies: Improving the Practice, pp. 145–171. North-Holland, Amsterdam (1986)
7. Sottilare, R., Brawner, K.: Component interaction within the Generalized Intelligent Framework for Tutoring (GIFT) as a model for adaptive instructional system standards. In: The Adaptive Instructional System (AIS) Standards Workshop of the 14th International Conference on Intelligent Tutoring Systems (ITS), Montreal, Quebec, Canada, June 2018
8. Oppermann, R.: Adaptive User Support: Ergonomic Design of Manually and Automatically Adaptable Software. Routledge, Abingdon (2017)
9. Gross, R.: Psychology: The Science of Mind and Behaviour, 7th edn. Hodder Education, London (2015)
10. Latham, G.P.: Behavioral approaches to the training and learning process. In: Goldstein, I.L. (ed.) Training and Development in Organizations, Frontiers of Industrial and Organizational Psychology (The Jossey-Bass Management Series and The Jossey-Bass Social and Behavioral Science Series), pp. 256–295. Jossey-Bass, San Francisco (1989)
11. Ausubel, D.P.: Educational Psychology: A Cognitive View. Holt, Rinehart & Winston, New York (1968)
12. Piaget, J.: Psychology and Epistemology: Towards a Theory of Knowledge. Grossman, New York (1971)
13. Dewey, J.: How We Think: A Restatement of the Relation of Reflective Thinking to the Educative Process, 2nd edn. D.C. Heath & Company, Boston (1933)
14. Anzai, Y., Simon, H.A.: The theory of learning by doing. Psychol. Rev. 86(2), 124 (1979)
15. Mayer, R.E.: Should there be a three-strikes rule against pure discovery learning? Am. Psychol. 59(1), 14 (2004)
16. Kolb, D.A.: Experiential Learning: Experience as the Source of Learning and Development. FT Press, New Jersey (2014)
17. Dillenbourg, P.: Collaborative Learning: Cognitive and Computational Approaches. Advances in Learning and Instruction Series. Elsevier Science Inc., New York (1999)
18. Van Berlo, M.P.: Team training vs. team building and cooperative learning: defining the field of research (Team training vs. team building en cooperatief leren: afbakening van het onderzoeksterrein). Human Factors Research Institute of Technology, Soesterberg, Netherlands (1997)
19. Sottilare, R.A., Burke, C.S., Salas, E., Sinatra, A.M., Johnston, J.H., Gilbert, S.B.: Designing adaptive instruction for teams: a meta-analysis. Int. J. Artif. Intell. Educ. 28(2), 225–264 (2018)
20. Salas, E.: Team Training Essentials: A Research-Based Guide. Routledge, Abingdon-on-Thames (2015)
21. Sottilare, R.: Considerations in the development of an ontology for a generalized intelligent framework for tutoring. In: International Defense & Homeland Security Simulation Workshop, Proceedings of the I3M Conference, Vienna, Austria (2012)
22. Sottilare, R., Ragusa, C., Hoffman, M., Goldberg, B.: Characterizing an adaptive tutoring learning effect chain for individual and team tutoring. In: Proceedings of the Interservice/Industry Training Simulation & Education Conference, Orlando, Florida (2013)
23. Corbett, A.: Cognitive computer tutors: solving the two-sigma problem. In: Bauer, M., Gmytrasiewicz, P.J., Vassileva, J. (eds.) UM 2001. LNCS (LNAI), vol. 2109, pp. 137–147. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44566-8_14
24. Goldberg, B., Amburn, C., Ragusa, C., Chen, D.W.: Modeling expert behavior in support of an adaptive psychomotor training environment: a marksmanship use case. Int. J. Artif. Intell. Educ. 28(2), 194–224 (2018)
25. Sottilare, R.A., LaViola, J.: Extending intelligent tutoring beyond the desktop to the psychomotor domain. In: Proceedings of the Interservice/Industry Training Simulation and Education Conference (I/ITSEC), Orlando, FL (2015)
26. Singley, M.K., Fairweather, P.G., Swerling, S.: Team tutoring systems: reifying roles in problem solving. In: Proceedings of the 1999 Conference on Computer Support for Collaborative Learning, p. 66. International Society of the Learning Sciences (1999)
27. VanLehn, K.: The behavior of tutoring systems. Int. J. Artif. Intell. Educ. 16(3), 227–265 (2006)
28. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Pearson Education Ltd., Malaysia (2003)
29. Baylor, A.: Intelligent agents as cognitive tools for education. Educ. Technol. 1, 36–40 (1999)
30. Corbett, A., Kauffman, L., MacLaren, B., Wagner, A., Jones, E.: A cognitive tutor for genetics problem solving: learning gains and student modeling. J. Educ. Comput. Res. 42(2), 219–239 (2010)
31. Lebiere, C., Anderson, J.R.: A connectionist implementation of the ACT-R production system. In: Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, pp. 635–640. Lawrence Erlbaum Associates, Mahwah (1993)
32. Anderson, J.R.: Rules of the Mind. Erlbaum, Hillsdale (1993)
33. Kodaganallur, V., Weitz, R.R., Rosenthal, D.: A comparison of model-tracing and constraint-based intelligent tutoring paradigms. Int. J. Artif. Intell. Educ. 15(2), 117–144 (2005)
34. Ritter, S., Anderson, J.R., Koedinger, K.R., Corbett, A.: Cognitive Tutor: applied research in mathematics education. Psychon. Bull. Rev. 14(2), 249–255 (2007)
35. VanLehn, K., Wetzel, J., Grover, S., Van De Sande, B.: Learning how to construct models of dynamic systems: an initial evaluation of the Dragoon intelligent tutoring system. IEEE Trans. Learn. Technol. 10(2), 154–167 (2017)
36. Aleven, V., et al.: Example-tracing tutors: intelligent tutor development for non-programmers. Int. J. Artif. Intell. Educ. 26(1), 224–269 (2016)
37. Heffernan, N.T., Turner, T.E., Lourenco, A.L., Macasek, M.A., Nuzzo-Jones, G., Koedinger, K.R.: The ASSISTment Builder: towards an analysis of cost effectiveness of ITS creation. In: FLAIRS Conference, pp. 515–520 (2006)
38. Mitrovic, A., Martin, B., Suraweera, P.: Intelligent tutors for all: the constraint-based approach. IEEE Intell. Syst. 4, 38–45 (2007)
39. Holland, J., Mitrovic, A., Martin, B.: J-LATTE: a constraint-based tutor for Java. In: Kong, S.C., et al. (eds.) Proceedings of the 17th International Conference on Computers in Education, pp. 142–146. Asia-Pacific Society for Computers in Education, Hong Kong (2009)
40. Ohlsson, S.: Constraint-based student modeling. In: Greer, J.E., McCalla, G.I. (eds.) Student Modelling: The Key to Individualized Knowledge-Based Instruction. NATO ASI Series, pp. 167–189. Springer, Heidelberg (1994). https://doi.org/10.1007/978-3-662-03037-0_7
41. Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The Generalized Intelligent Framework for Tutoring (GIFT). Concept paper released as part of GIFT software documentation. U.S. Army Research Laboratory—Human Research & Engineering Directorate (ARL-HRED), Orlando, FL, USA (2012)
42. Sottilare, R., Brawner, K., Sinatra, A., Johnston, J.: An Updated Concept for a Generalized Intelligent Framework for Tutoring (GIFT). US Army Research Laboratory, Orlando, FL, USA (2017)
43. Feng, S., Stewart, J., Clewley, D., Graesser, A.C.: Emotional, epistemic, and neutral feedback in AutoTutor trialogues to improve reading comprehension. In: Conati, C., Heffernan, N., Mitrovic, A., Verdejo, M.F. (eds.) AIED 2015. LNCS (LNAI), vol. 9112, pp. 570–573. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19773-9_64
44. Mitrovic, A., Suraweera, P., Martin, B., Zakharov, K., Milik, N., Holland, J.: Authoring constraint-based tutors in ASPIRE. In: Ikeda, M., Ashley, K.D., Chan, T.-W. (eds.) ITS 2006. LNCS, vol. 4053, pp. 41–50. Springer, Heidelberg (2006). https://doi.org/10.1007/11774303_5

Interoperability Standards for Adaptive Instructional Systems: Vertical and Horizontal Integrations

K. P. Thai and Richard Tong

Squirrel AI Learning by Yixue Education Group, Highland Park, NJ 08904, USA
[email protected]

Abstract. Adaptive instructional systems (AISs) – tools and methods that tailor each student's instructional experiences to their needs within a set of domain learning objectives – are becoming increasingly common. In an ideal configuration, AISs work in concert using open interoperability standards to provide a seamless experience for students and instructors, while leveraging high-frequency contextual data to inform the learning flow. With the large amount of learning interactions that can take place in AISs, however, existing industry standards are unable to support the interoperability and extensibility of components within an AIS and among different AISs. In this paper, we propose extensions on top of current industry standards to enable interoperability among components within an AIS. We also discuss the need for interoperability standards across different AISs on the learning ontology and data models, and the opportunity to leverage recent advances in federated machine learning to enable horizontal integration across separate AISs.

Keywords: Adaptive instructional systems · Interoperability standards · Learning ontology · Federated machine learning

1 Introduction

We do not yet have a conceptual model for defining adaptive instructional systems (AISs); however, it is typically understood that, unlike traditional computer-based instructional systems, AISs guide learning experiences by tailoring instruction and recommendations to each learner based on their goals, needs, and preferences in a specific learning domain [1]. AISs describe a class of software that includes intelligent tutoring systems (ITSs), adaptive learning technologies, interactive media, and other learning tools or methods that are used to personalize and optimize instruction for a particular learner or team. They typically seek to maximize learning outcomes and efficiency toward knowledge acquisition, skill development, retention, performance, and/or transfer of skills from the instructional environment to that of the "real world" where those skills would be applied.


1.1 AIS vs. Non-AIS

In a typical non-AIS, instruction is delivered to all students in the same way, often consisting of a fixed set and sequence of reading materials, videos, and/or exercises to be completed by all students. An AIS, on the other hand, may use individual variability in learning performance, learning pace, preferences, motivation, affective states, and other learner or team attributes, together with instructional conditions, to identify appropriate learning strategies and/or tutor actions. Recent advances in artificial intelligence, sensing technology, and data mining methods afford modern AISs exciting new ways to engage students in more open-ended tasks and draw new insights into the learning process. From increasing integration of natural language processing and affective state monitoring, to applications of simulations and interactive media, AISs are increasingly able to capture, process, and fuse high-frequency interaction data and naturally rich modalities of communication, such as speech, writing, and nonverbal interaction during real learning activities. This provides unprecedented insight into the moment-to-moment development of a number of learning experiences, especially those involving multiple dimensions of activity and social interaction, enabling researchers to gain a far more nuanced and complex understanding of student learning processes, something that we have only begun to study at scale.

In many AISs, machine learning algorithms are typically used to describe the interconnection among the learner's state, the context of the learning experience, and AIS decisions, and to recommend a learning path or actions that can maximize learning outcomes. The adaptive architectures of different AISs differ by application across content, students, and learning objectives, but they can often be described as having two loops: an outer loop and an inner loop (Fig. 1; [2]). Take the example of a simple AIS built around many multi-step math problems, where the primary instructional mechanism is the student answering questions and receiving feedback. In this AIS, the outer loop tailors the problem set that a student sees, and the inner loop personalizes instruction at the level of individual problem-solving steps. The outer loop executes once for each problem and iterates over the problems, giving feedback at the problem level (i.e., correct or incorrect) and selecting the next problem that is appropriate for the student. The inner loop executes once for each problem-solving step and gives feedback or hints on each step. The inner loop assesses and updates the student's proficiency state, or their learner model, which is used by the outer loop to select the next appropriate problem for that student [3]. It does this by looking at the skills that the student has currently mastered, evaluating the student's knowledge state, and selecting the next optimal learning task.

This two-loop model can involve much more complex interactions in a more sophisticated AIS that involves different types of learning tasks. An outer loop interaction may involve videos, interactive simulations, writing or speaking prompts, etc., in which proficiency estimation is not straightforward. The inner loop interaction also depends on the task. For a task that evaluates a student's speaking skill, for example, the inner loop would need to evaluate the student's speaking pattern against an optimal expert. This dialog-based inner loop adaptivity would require a task-specific ontology, one that is separate from that of the outer loop.
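One common way to implement the inner-loop proficiency update mentioned above is Bayesian Knowledge Tracing. The Python sketch below shows the standard BKT update for a single skill; the parameter values are illustrative assumptions, and this is not a claim about how any particular AIS implements its learner model.

```python
def bkt_update(p_know: float, correct: bool,
               slip: float = 0.1, guess: float = 0.2, learn: float = 0.15) -> float:
    """One Bayesian Knowledge Tracing step: posterior given the observation, then learning transition."""
    if correct:
        evidence = p_know * (1 - slip)
        posterior = evidence / (evidence + (1 - p_know) * guess)
    else:
        evidence = p_know * slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - guess))
    return posterior + (1 - posterior) * learn

p = 0.3  # prior probability the student already knows the skill
for observed in [True, False, True, True]:
    p = bkt_update(p, observed)
    print(round(p, 3))
```

The outer loop can then compare the running estimate `p` against a mastery threshold when choosing the next problem.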


Fig. 1. A high-level description of an intelligent tutoring system. Source: https://www.carnegielearning.com/blog/truly-personalized-learning-all-about-those-loops/

AISs involve a highly complex process that requires a technology- and data-driven system for integrating instructional resources, learning objectives, and assessment activities into single, progressive modular learning elements that can be adapted to individual learners, reordered, or shared between learning systems. A key component of an AIS that enables this process is the domain model. Domain models define the structure of a particular domain, including learning objectives (i.e., a knowledge map) and the learning content, measures, assessments, and interventions (e.g., feedback, dialog) associated with those learning objectives in that domain (i.e., a content map). The domain models can define these while taking into account the learner's goals, prior knowledge, assessed skills, and other attributes such as motivation and interest. Taken together, the knowledge map and the content map can be thought of as layers of the learning ontology of an AIS. In order to appropriately measure a student's knowledge and provide recommendations, this ontology often requires the knowledge components to be very fine-grained and well-defined.

The other three common components of AISs are the learner model, the instructional model, and the interface model [4, 5, cf. 1]. The learner model obtains and interprets data about the student through learning records, physiological or behavioral sensors, and surveys, and from learning record stores (LRSs) or learning management systems (LMSs). Instructional models assess the student's progress toward learning objectives and recommend appropriate next steps. The interventions by the tutor/instructor and the student are managed through a user interface [1].
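As a minimal sketch of the knowledge-map and content-map layers described above, the Python fragment below links fine-grained knowledge components, their prerequisites, and the content tagged to them. The component names and structure are hypothetical, not drawn from any published ontology.

```python
# Knowledge map: fine-grained knowledge components and their prerequisites.
knowledge_map = {
    "fractions.add_like_denominators": [],
    "fractions.add_unlike_denominators": ["fractions.add_like_denominators",
                                          "fractions.find_common_denominator"],
    "fractions.find_common_denominator": [],
}

# Content map: learning content and assessments tagged to knowledge components.
content_map = {
    "fractions.add_unlike_denominators": {
        "instruction": ["video:common-denominators-intro"],
        "practice": ["item:1/3+1/4", "item:2/5+1/10"],
        "feedback": ["hint:rewrite-with-common-denominator"],
    },
}

def ready_to_learn(component: str, mastered: set) -> bool:
    """A component is available once all of its prerequisites are mastered."""
    return all(prereq in mastered for prereq in knowledge_map[component])

print(ready_to_learn("fractions.add_unlike_denominators",
                     mastered={"fractions.add_like_denominators"}))   # False: one prerequisite missing
```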

1.2 Need for Interoperability

With the large number of learning interactions that can take place in AISs, there is an increasing need for standards that enhance the interoperability and extensibility of course content and configuration within an AIS. The ability to exchange models in an AIS allows for greater flexibility and enhanced instructional capabilities while reducing development costs. To do that, each of these models must be described in a standardized way to allow for an interchange of components and data exchange across components. There is also a need for interoperability across AISs. In an ideal configuration, AISs work in concert, leveraging open interoperability standards to provide a seamless experience for student and instructor, while ubiquitously leveraging rich, high-frequency data to inform the learning flow.

In this paper, we focus on the interoperability among components within an AIS (vertical integration) and among separate AISs (horizontal integration) to provide services to each other. Specifically, we discuss the integration of components to support a more adaptable and extensible ecosystem within an AIS by building extensions on top of existing industry standards. We also discuss the need for interoperability standards across different AISs on the learning ontology and data models, and for leveraging recent advances in federated machine learning to enable horizontal integration.

Prior work from the IEEE Learning Technology Standards Committee (LTSC), IMS Global, and others has established standards that would be essential for transitioning to an adaptable and extensible ecosystem. Samantha Birk [6] highlighted a set of existing IMS Global standards that play a key role in transitioning to AISs and provided a useful visualization of the existing interoperable adaptable learning ecosystem (Fig. 2).

Fig. 2. An adaptive learning ecosystem using some of the existing IMS Global standards: LTI integration, Gradebook Service, Caliper Analytics, (Thin) Common Cartridge and QTI/APIP. Source: http://www.imsglobal.org/adaptive-adaptable-next-generation-personalized-learning.


As is, however, existing standards remain limited in their capability to support the unique needs of modern AISs. We propose that extensions built upon existing standards, including LTI, Caliper Analytics/xAPI, Common Cartridge, QTI, and others, can support seamless data exchanges within and across AISs for adaptation at both the outer loop and inner loop levels.

In an ecosystem of adaptive learning process and system components (Fig. 3), a teaching and learning platform typically employs a Plan-Build-Deliver-Analyze cycle. During Plan, the learning map and the blueprint (i.e., goals) provide the basis for curriculum planning. During Build, content including the courseware, instructional items, and assessment items is created. Each of these uses different standards to express different corresponding data. The Competencies and Academic Standards Exchange (CASE) can be used to build the knowledge map, while Question and Test Interoperability (QTI) and Common Cartridge (CC) specifications can be employed to build assessment items and instructional items. During Deliver, the LMS is integrated with an environment like Customer Relationship Management (CRM), a Student Information System (SIS), or Enterprise Resource Planning (ERP) via OneRoster or Learning Information Services (LIS). The individual components that plug into the LMS are supported by the Learning Tools Interoperability (LTI) standard. Caliper Analytics or the Experience API (xAPI) can be used to capture and store real-time learning event data in the Learning Record Store (LRS) for analysis and reporting. Taken together, these existing standards provide an initial step toward vertical and horizontal integration, but all require extensions to be effective in this AIS ecosystem.

Fig. 3. A proposed ecosystem of the adaptive learning process and system components with extensions to existing interoperability standards (in blue). (Color figure online)


2 Vertical Integration (Within AISs)

There is a myriad of tools and methods that can enhance student learning, but too often they are developed by different parties in isolation. Without a standardized way to transfer rich and contextual learning data among components, developers often resort to rebuilding a third-party toolset within their own environment each time they seek to enhance their AIS's capability.

At the outer loop level, extensions to existing standards like LTI and xAPI/Caliper Analytics can enable AIS developers to build learning experiences from multiple components, by linking instructional components together and tracking the data generated during the learning process. For example, a chemistry student can seamlessly access a third-party chemistry lab toolset from within the AIS, launched using an LTI extension, to practice applying the scientific method in a simulated lab experiment, prior to returning to the AIS for problem solving practice. The AIS can receive data about the student's progress and performance in the chemistry lab, store them in the LRS, and use the updated learner model to adapt the learning experience when the student returns to the AIS.

Such extensions can also support interoperability at the inner loop level. In a language learning AIS, for example, a grading tool that uses natural language processing (NLP) embedded within a practice question can be hugely beneficial for learning. Alternatively, when a student answers a question correctly, an NLP-based dialog can be triggered to query the student's reasoning process, perhaps to identify misconceptions using a mistake-reasoning ontology specific to that domain learning objective. Such inner loop adaptivity requires a tremendous amount of development but allows us to capture individualized learning in ways that were not possible before.

To enable these rich learning experiences, in this new framework, there must be a mapping of corresponding ontologies among learning tools and components to support continuous proficiency estimation, updates of learning records, and high-frequency user data interchange. Without interoperability standards connecting these components, AIS developers currently would have to build each tool from scratch to enable this kind of integration, and each of these tools requires its fair share of research and development.
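To make the data-exchange idea concrete, the Python sketch below builds an xAPI-style statement for the chemistry-lab scenario above. The activity IDs and extension keys are illustrative assumptions rather than an officially registered vocabulary, and the snippet simply prints the JSON rather than posting it to a real LRS endpoint.

```python
import json
from datetime import datetime, timezone

def lab_completion_statement(student_id: str, score_scaled: float) -> dict:
    """Assemble an xAPI-like statement describing the simulated lab activity."""
    return {
        "actor": {"account": {"homePage": "https://example-ais.org", "name": student_id}},
        "verb": {"id": "http://adlnet.gov/expapi/verbs/completed", "display": {"en-US": "completed"}},
        "object": {
            "id": "https://example-labs.org/activities/titration-sim",
            "definition": {"name": {"en-US": "Simulated titration lab"}},
        },
        "result": {"score": {"scaled": score_scaled}, "completion": True},
        "context": {"extensions": {
            "https://example-ais.org/ext/learning-objective": "apply-scientific-method"}},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

statement = lab_completion_statement("student-042", 0.85)
print(json.dumps(statement, indent=2))  # in practice this would be sent to the LRS and read by the AIS
```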

2.1 IMS Learning Tools Interoperability (LTI)

The LTI standard currently prescribes an easy and secure way to connect any LMS with learning applications that range from general communication tools for chat and virtual classrooms to domain-specific learning engines for particular subjects like math or history. The LTI core establishes a secure connection and confirms the learning application's authenticity, allowing students to switch seamlessly between, for example, a video conferencing tool and an assessment tool within the same workflow. Thus, rather than having to leave the LMS to log into an adaptive learning system outside of the LMS, with LTI a student can seamlessly move from an LMS to a third-party platform for the adaptive content. Possible extensions are available to add optional features and functions, such as features that support the exchange of assignment and grade data between an assessment tool and the LMS gradebook. This is a good step toward an ecosystem of tools, platforms, and apps for AISs.

Such extensions are limited, however, when the LMS is adaptive or is an AIS. An interoperability standard that extends the current LTI capabilities within an AIS framework can support the transfer of learning data among the components within the AIS while preserving the learning contexts and the user's role within that context. When a student moves between the instructional system and an assessment tool, for example, the assessment tool can carry with it and return from it information that ensures she continues with the learning goal. The same information can be used to check whether the instructional system is providing the expected learning outcome and can feed back into the adaptive engine.

Importantly, this extended LTI capability can be combined with specifications from the Experience API (xAPI) and/or the Caliper Analytics standard from the IMS Global Learning Consortium to receive and send data about a user's behaviors in different AIS components in a consistent format using a single vocabulary. By providing a more comprehensive understanding of what a student is doing, such vertical integration of information within an AIS can support better predictions about student achievement for better adaptivity, and can enable learning analytics for new insights into how different learning interactions within an AIS relate to learning outcomes.

2.2 Common Cartridge (CC) and Thin Common Cartridge

Common Cartridge and Thin Common Cartridge are specifications useful for packaging and exchanging digital learning materials and assessments, often for importing and exporting them to or from an LMS. CC and Thin CC are useful in that they provide a standard way to represent learning course materials that can be developed and used across LMSs. They provide an easy way to add content to a course, saving developers and instructors content development time. The content, however, is typically static (e.g., textbooks, chapters) and cannot be rearranged and repackaged, thereby limiting the customizability for individual students. The content would also need to be connected to an ontology recognizable by the AIS. Thus, an extension of these standards could afford better interconnectedness among content and the transfer of richer content.

2.3 Question and Test Interoperability (QTI) and Accessible Portable Item Protocol (APIP)

QTI enables the interoperability of assessment item content and results between authoring tools, item banks, learning platforms, and other systems. APIP adds accessibility functionality for students with accommodation requirements. In an AIS, assessment items often provide the performance measures needed to inform the learner model and guide instructional next steps. However, the current QTI and APIP specifications do not allow for the content changes needed to tailor the learning path for a particular student. The algorithms that inform the learning path are currently tied to prescribed assessments. An extension would be needed to incorporate a level of logic that enables content to be modified and reordered as needed by the curriculum creator.


Enhancements that make systems more interoperable will likely result in higher reuse of components, lower development costs, and more collaboration in both the research and development of AISs, benefiting learners, instruction, and domains. An ideal extensible and cohesive AIS can leverage and build extensions upon these existing open interoperability standards (and others) to deliver a seamless experience for students via vertical integration among learning components, while leveraging the rich data stream to inform learning adaptation at both the outer loop and inner loop levels.

3 Across AISs (Horizontal Integration)

The reform literature in mathematics and science is replete with calls for the cross-curricular integration of subjects (i.e., between STEM subjects and also between STEM and the humanities). However, there remain very few AISs (and few non-adaptive instructional programs) that can handle the prerequisite skills, knowledge bases, and experiences necessary to implement such integrated instruction. A typical AIS addresses a single subject domain at a time, and it often has a unique content ontology, adaptive engine, and data management method. A math AIS ontology, for example, consists of a knowledge map of prerequisite mappings of granular learning objectives in math, but it can have many connections to objectives in physics and chemistry, so an interoperability mechanism can allow us to combine the maps together and exchange student information across subjects. A student's calculus knowledge, for example, could inform their experience in a physics or chemistry course where calculus is a prerequisite. Thus, an extension of ontologies across AISs would not only expand the scope of multiple AISs but also enable a richer, cross-curricular learning experience for students.

Corresponding to the knowledge graph, a "Federated" AIS approach can also be used to extend the learner model from one AIS to another. The combined and synchronized user model would require not only a common learner competency standard (such as CASE) but also the semantic and ontological constructs for data exchange/synchronization.

3.1 Ontology

Competencies and Academic Standards Exchange (CASE). CASE was created to address the need for competency-based educational programs to manage competency statements and students' associated assessment results in a consistent and digital way. It supports the exchange of competencies and rubrics despite differences in the terminology, processes, and roles across different programs. Some disciplines, like math and physics, have highly developed knowledge graphs with well-defined learning objectives. Others are engaged in exciting debates about the best way to organize their domain knowledge. We propose that, while professional associations should be responsible for developing the knowledge graphs as they see fit, there is a need for a CASE standard extension to track and share the structural knowledge graph information across AISs.


3.2 Data Models

Caliper Analytics and Experience API (xAPI). Caliper Analytics and xAPI provide the means for consistently capturing and labeling learning data and securely sending it to an LRS, setting the stage for an extensible adaptive ecosystem. These common data formats are particularly important because AISs are often built around proprietary standards and algorithms that are siloed, giving users little or no visibility into what is happening in and across the learning environment. It is also increasingly common for students to work in multiple learning environments, and possibly multiple AISs. A standardized data format allows the data to be collected and combined with another provider's data points so that they can be shared and analyzed to gain a more comprehensive understanding of the student's progress within the curriculum. Currently, for example, scores can be passed back to the LMS via the LTI Gradebook Services, but the detailed usage and performance data (e.g., click stream, number of activities or questions answered, video views or pages read, simulation performance) that track the student's learning progress cannot be sent across multiple systems to inform and trigger academic interventions. Furthermore, these current specifications sometimes do not satisfy the needs of AISs, where very fine-grained data and analytics are critical to measure and monitor students' progress.

Federated Machine Learning. Another dimension of cross-AIS interoperability might be borrowed from Federated Machine Learning (FML), a recently proposed approach for aggregating machine learning capability across multiple homogeneous or even heterogeneous AI systems. FML focuses on using the power of distributed systems to train and enhance machine learning models. By training the models on the devices using local data and only transporting the models (i.e., not the data) between devices and the central server, this approach ensures data security and privacy while enabling real-time prediction on local machines with minimal infrastructure [7]. A "Federated" AIS approach can be used to extend the knowledge graph, the learner model, the data reach, and even the ML depth by updating, sharing, and aggregating different layers or aspects from one AIS to another. In this way, each AIS benefits from the data synchronization, latency, and security features of a federated system while maintaining its own ontology, adaptive engine, and data.
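The sketch below is a minimal federated-averaging example under the assumptions described above: each participating AIS fits a model on its own (here synthetic) local data and shares only parameter vectors with a central aggregator. The dataset shapes, learning rate, and round count are illustrative, not values from [7].

```python
# Minimal federated-averaging sketch: clients train locally and share only
# model parameters, never raw learner data.
import numpy as np


def local_update(weights, features, labels, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps of linear regression."""
    w = weights.copy()
    for _ in range(epochs):
        grad = features.T @ (features @ w - labels) / len(labels)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Server step: weight each client's parameters by its local sample count."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))


rng = np.random.default_rng(0)
true_w = np.array([0.5, -1.0, 2.0])
# Two AISs with private local datasets (synthetic here); data stays on the client.
clients = []
for n in (50, 80):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    clients.append((X, y))

global_w = np.zeros(3)
for _ in range(10):  # communication rounds
    local_models = [local_update(global_w, X, y) for X, y in clients]
    global_w = federated_average(local_models, [len(X) for X, _ in clients])

print(np.round(global_w, 2))  # approaches true_w without pooling raw learner data
```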

4 Conclusion

In summary, these vertical integrations of components within an AIS and horizontal integrations across AISs seek to address fundamental needs in the development, enhancement, and evaluation of AISs. We believe that it is critically important to (1) create extensions on top of existing IMS Global and xAPI standards to form the foundation for the standardization of AISs, and (2) create a standards framework based on a reference architecture, within which existing standards can be extended (e.g., LTI, QTI) and new standard sets may be proposed (e.g., federated machine learning, ontology merging, or other AIS system rules).


Acknowledgements. This work was conceived in collaboration with members of the IEEE Adaptive Instructional Workgroup (AIS) P2247.1, and was written within the framework of the group’s effort. Special thanks to key contributors of the workgroup, including Bob Sottilare, Avron Barr, Robby Robson, Xiangen Hu, Arthur Graesser, and others.

References

1. Sottilare, R., Brawner, K.: Exploring standardization opportunities by examining interaction between common adaptive instructional system components. In: Proceedings of the First Adaptive Instructional Systems (AIS) Standards Workshop, Orlando, Florida, March 2018
2. Carnegie Learning. https://www.carnegielearning.com/blog/truly-personalized-learning-allabout-those-loops/
3. VanLehn, K.: The behavior of tutoring systems. Int. J. Artif. Intell. Educ. 16(3), 227–265 (2006)
4. Murray, T.: Authoring intelligent tutoring systems: an analysis of the state of the art. Int. J. Artif. Intell. Educ. (IJAIED) 10, 98–129 (1999)
5. Woolf, B.P.: A Roadmap for Educational Technology. National Science Foundation # 0637190 (2010)
6. Birk, S.: IMS Global. http://www.imsglobal.org/adaptive-adaptable-next-generationpersonalized-learning
7. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 10(2), 12–19 (2019)

Instructional Theories in Adaptive Instruction

Integrating Engagement Inducing Interventions into Traditional, Virtual and Embedded Learning Environments

Meredith Carroll, Summer Lindsey, and Maria Chaparro

Florida Institute of Technology, Melbourne, FL 32901, USA
[email protected]

Abstract. The key to an effective learning environment is keeping the learner attentive and engaged [51]. The shift towards virtual learning environments, such as online and computer-based learning environments, distances the learner from the instructor and can lead to some less-than-engaging learning experiences. Such environments often lack key attributes necessary to fully engage learners, such as clear goals, adequate feedback, and instructor support. Decades of study in cognitive and educational psychology provide a foundation of knowledge regarding the factors that influence learner engagement, and how we can leverage this knowledge base to create engaging learning experiences in these new technology-driven learning environments. This paper presents a taxonomy that maps ten engagement-inducing learning interventions to learning environments in which they have been found to improve learner engagement, factors that influence learner engagement, and learning gains. Implementation of this taxonomy is then illustrated by presenting a use case implementation within a virtual learning environment, followed by a discussion of important considerations during implementation.

Keywords: Classroom · Involvement · Online learning · Blended learning · Flow

1 Introduction

The key to an effective learning environment is keeping the learner attentive and engaged [51]. Unfortunately, many learning environments, especially those in the military and professional development/training world, depend heavily on PowerPoint-based classroom lectures and Computer-Based Training (CBT) environments, which can often be boring and result in disengagement. This is especially challenging for learning domains such as Unmanned Aircraft Systems (UAS) and aircraft maintenance training, which may require a great deal of declarative and procedural knowledge absorption prior to actual hands-on training. There is a need to develop training methods and tools to support engagement optimization to increase learning effectiveness and efficiency in these learning environments. The shift towards virtual learning environments, such as online and computer-based learning environments, distances the



learner from the instructor and can lead to some less-than-engaging learning experiences. Such environments often lack key attributes necessary to fully engage learners, such as clear goals, adequate feedback, and instructor support. Decades of study in cognitive and educational psychology provide a foundation of knowledge regarding the factors that influence learner engagement, and how we can leverage this knowledge base to create engaging learning experiences in these new technology-driven learning environments. This paper presents a taxonomy that maps ten engagement-inducing learning interventions to learning environments in which they have been found to improve learner engagement, factors that influence learner engagement, and learning gains. Implementation of this taxonomy is then illustrated by presenting a use case implementation within a virtual learning environment, followed by a discussion of important considerations during implementation across various learning environments.

1.1 Influencing Learner Engagement

This paper builds on previous work in which we developed an Applied Model of Learner Engagement, identifying factors that influence the likelihood of a learner becoming engaged in a learning context [16]. The model presents influencing factors related to (1) the individual learner (cognitive ability, personality traits, motivation, interest, self-efficacy, and anxiety), (2) the learning task (clarity of goals, feedback, level of challenge, enjoyment, and meaningfulness), and (3) the learning environment (level of autonomy, safety, and support). These influencing factors provide opportunities for an instructor to intervene to improve learning (see Carroll et al. [16] for a full description of the model and factors). Utilizing this model as a foundation, we conducted a literature review identifying instructional interventions that can be used in the modern educational environment to effectively target these factors, promote engagement, and improve learning outcomes. For the purposes of this effort, we define an instructional intervention as an instructional tool or method that facilitates the presentation of relevant information to be learned, creates opportunities for trainees to practice skills, and/or provides feedback to trainees during and after practice [58]. The criteria for an intervention to be included required empirical evidence indicating that the intervention resulted in (1) an increase in engagement, or (2) a positive effect on the factors that influence engagement, and (3) learning gains such as knowledge, achievement, or performance gains. Ten interventions were identified for inclusion: (1) Metacognitive Intervention, (2) Challenge Level Optimization, (3) Goal Clarity, (4) Feedback, (5) Autonomous Self-Regulated Learning, (6) Personalization, (7) Experiential Learning, (8) Game-based Learning, (9) Interactivity and Multimedia, and (10) Meaningful Learning. These interventions are presented in Table 1 along with brief descriptions and example implementations.


Table 1. Engagement inducing interventions

Metacognitive intervention
  Description: Prompt to increase frequency/accuracy of self-assessment of [metacognitive] knowledge/learning process [61]
  Example implementation: Prompting students to reflect on the strategy used to solve a math problem

Challenge level/skill optimization
  Description: Optimizing challenge to an individual learner's skill level where the difficulty of the learning experience provides adequate challenge without frustration [24]
  Example implementation: Increase/decrease difficulty based on performance

Goal clarity
  Description: Learning goals presented to learners, and taken up and transparent throughout the learning activity [62]
  Example implementation: Provide overviews, transition statements, and summaries

Feedback
  Description: Information provided to learner that "aims to reduce the gap between current and desired learning outcome" [70]
  Example implementation: Providing areas of performance improvement on a grading rubric

Autonomous self-regulated learning
  Description: Strategies that allow learners to be engaged in learning outcomes of their own goals; involves "autonomous motivation"; "acting with a sense of volition and choice" [53]
  Example implementation: Increase simulator availability so the learners can repeat and master the task and practice at their own pace

Personalization
  Description: Tailor instructional content to "student knowledge, interests, preferences, and goals" [13]
  Example implementation: Surveying students on interests/goals, tailoring topics/learning content

Experiential learning
  Description: Learning from experience; learning by doing. Immerses learner in an experience, encourages reflection about the experience to develop new skills/ways of thinking [44]
  Example implementation: Providing a problem scenario and allowing students to work through and find solutions to the problem

Game-based learning
  Description: "A system in which players engage in an artificial conflict, defined by rules, that results in a quantifiable outcome" [59]
  Example implementation: Adding incentives for completing a math times table within an allotted time

Interactivity & multimedia
  Description: Dynamically communicating with an individual by either providing response information or allowing individual to participate through feedback, adaptation, control, or multimedia [64]
  Example implementation: Using clickers in a science class; respond to user inputs and display user responses/performance scores

Meaningful learning
  Description: Connecting new ideas and knowledge to existing cognitive structures to give new information meaningful connections and to enhance memory retention [8]
  Example implementation: Developing a concept map of species to help understand distinction between mammals and non-mammals


2 Taxonomy of Interventions and Learning Environments

A taxonomy was then created that mapped these instructional interventions to the learning environments in which they had been effective. Specifically, the taxonomy identified whether each intervention had been shown to (a) increase learner engagement or positively impact engagement factors and (b) improve learning gains, across one of three learning environments: (1) traditional (i.e., classroom-based), (2) virtual (e.g., computer/simulation-based), or (3) embedded learning environments (e.g., live, on-the-job). An overview of the taxonomy is presented in Table 2 and described in detail in the following sections.

Table 2. Taxonomy of instructional interventions and learning environments

                                      Traditional  Virtual  Embedded
Metacognitive intervention
  Engagement (a)                      ✔*           ✔        ✔*
  Knowledge (b)                       ✔            ✔        ✔
  Skill/Performance (c)               ✔            ✔        ✔
Challenge level optimization
  Engagement (a)                      •            ✔*       •
  Knowledge (b)                       •            ✔        •
  Skill/Performance (c)               •            ✔        •
Goal clarity
  Engagement (a)                      ✔*           •        •
  Knowledge (b)                       •            ✔        ✔
  Skill/Performance (c)               •            ✔        •
Feedback
  Engagement (a)                      ✔*           ✔        •
  Knowledge (b)                       •            ✔        •
  Skill/Performance (c)               ✔            ✔        •
Autonomous self-regulated learning
  Engagement (a)                      ✔*           ✔*       •
  Knowledge (b)                       •            ✔        •
  Skill/Performance (c)               ✔            •        ✔
Personalization
  Engagement (a)                      ✔            ✔*       •
  Knowledge (b)                       ✔            •        •
  Skill/Performance (c)               ✔            •        •
Experiential learning
  Engagement (a)                      ✔            ✔        •
  Knowledge (b)                       ✔            •        •
  Skill/Performance (c)               •            •        •
Game-based learning
  Engagement (a)                      ✔            ✔*       •
  Knowledge (b)                       ✔            ✔        •
  Skill/Performance (c)               ✔            ✔*       •
Interactivity and multimedia
  Engagement (a)                      ✔*           ✔*       ✔
  Knowledge (b)                       ✔            ✔        •
  Skill/Performance (c)               ✔            ✔        •
Meaningful learning
  Engagement (a)                      ✔            •        •
  Knowledge (b)                       ✔            •        •
  Skill/Performance (c)               •            •        •

(a) Engagement includes: flow, engagement, and task strategies, as well as the factors motivation, self-efficacy, value, competence, satisfaction, and interest. (b) Knowledge includes: understanding/comprehension, performance, procedural knowledge, declarative knowledge, transfer of knowledge, perceived or actual learning, retention, and recall. (c) Skill/Performance includes: academic achievement, general performance, strategies, training performance, training efficiency, mastery, effort, and information search.
✔ = Shown to be beneficial to a trait in this category. • = No research presented on the impact in this category. * = Contingencies for the intervention effectiveness or a mix of beneficial and negative impacts exists.
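As an illustration only (a hypothetical encoding, not part of the paper's materials), a slice of Table 2 could be captured as a lookup structure so that a course designer or an adaptive engine can query which interventions have evidence for a given learning environment; "mixed" stands in for the contingent (*) findings.

```python
# Illustrative sketch: a hand-copied subset of Table 2 as a lookup structure.
TAXONOMY = {
    ("metacognitive", "traditional"): {"engagement": "mixed", "knowledge": True, "skill": True},
    ("metacognitive", "virtual"):     {"engagement": True,  "knowledge": True, "skill": True},
    ("challenge_optimization", "virtual"): {"engagement": "mixed", "knowledge": True, "skill": True},
    ("feedback", "virtual"):          {"engagement": True,  "knowledge": True, "skill": True},
    ("meaningful_learning", "traditional"): {"engagement": True, "knowledge": True, "skill": None},
}


def interventions_for(environment, outcome="engagement"):
    """Return interventions with (possibly contingent) evidence for an outcome."""
    return sorted(
        intervention
        for (intervention, env), evidence in TAXONOMY.items()
        if env == environment and evidence.get(outcome) in (True, "mixed")
    )


print(interventions_for("virtual"))
# ['challenge_optimization', 'feedback', 'metacognitive']
```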

2.1 Metacognitive Interventions

Metacognitive interventions have demonstrated improved knowledge transfer, higher performance and self-efficacy, increased comprehension and mastery, and more efficient use of learning time [7, 22, 40, 42, 61, 68]. Metacognitive interventions are typically facilitated by using self-reflection prompts, which can be delivered using handouts in a classroom, a virtual cognitive tutor, or verbal prompts presented by an instructor or computer program during training [7, 40, 61, 68]. Metacognitive interventions have the potential to increase learner engagement by improving self-efficacy [22, 61] and task value [40]. However, research has shown that this type of intervention interacts with an individual's goal orientation: it may only be beneficial for individuals who aim to perform well, and may decrease performance for individuals who avoid situations where they may perform poorly.

Metacognitive Interventions and Traditional Learning Environments. Metacognitive interventions have led to improved learning strategies, understanding, and academic success in traditional environments [7, 42, 49]. Kramarski and Mevarech [42] found that metacognitive training, in which students were presented with self-addressed metacognitive questions during tasks (e.g., "what strategy is most appropriate for this task?"), improved students' ability to create graphs and improved transfer of knowledge from learning to performing. Askell-Williams et al. [7] evaluated students' existing


levels of metacognitive activity and implemented metacognitive strategies in which teachers gave verbal prompts focused on identifying key ideas, strategy instruction, and monitoring understanding. Students were also asked what the topic was about and what strategies they would use, to draw concept maps of the topic, and to write key points of the lesson and what they did not understand. Over the course of the class, students improved their learning strategies.

Metacognitive Interventions and Virtual Learning Environments. Metacognitive interventions led to improved knowledge and understanding in virtual environments [5, 27, 40]. Ford et al. [27] measured metacognitive activity on complex decision tasks, such as radar operations, and found that metacognitive activity was related to improved knowledge acquisition, performance, and self-efficacy. Vincent and Koedinger [68] evaluated the outcomes of using a virtual cognitive tutor in a computer math program to prompt students to engage in self-explanation; these students exhibited better visual and verbal declarative knowledge, more in-depth procedural knowledge, and better transfer of knowledge. Kohler [40] evaluated the effects of a metacognitive intervention with second language learners using a computer-based learning program that prompted students with self-reflection questions. Those in the metacognitive intervention condition had higher perceived training value and exhibited increased comprehension and mastery of vocabulary, speaking, and listening. Downing et al. [22] were interested in how metacognition was affected by different learning styles. The researchers measured undergraduate students' perceptions of their thinking or metacognitive development and performance in problem-based learning (PBL) and non-PBL classes. PBL students had higher metacognitive ability, which led to higher self-efficacy. Metacognitive activity improved in those high in mastery goal orientation.

Metacognitive Interventions and Embedded Learning Environments. Metacognitive interventions have led to improved knowledge and performance in embedded environments [61]. Schmidt and Ford [61] used display prompts to facilitate metacognitive learning for students learning how to design webpages. The metacognitive intervention was administered by having students reflect on their learning and encouraging them to go back and revisit material that was not fully understood. Students exposed to the metacognitive intervention had higher declarative knowledge, training performance, self-efficacy, and time efficiency. Additionally, the metacognitive interventions were more beneficial for individuals high in performance-approach goal orientation than for individuals high in performance-avoidance goal orientation, for whom such strategies may actually lead to lower knowledge acquisition.

2.2 Challenge Level Optimization

Optimizing challenge to an individual's skill level can increase learning, performance, and motivation, even when the perceived difficulty is lower than desired. The primary learning context in which challenge optimization has been used successfully is virtual environments. Increasing challenge in a simulation is one of the important factors for effective training and skill mastery [38]. Challenge can be optimized through target difficulty, by making enemies more skilled, or through other factors. Simulator difficulty must be comparable to reality for challenge level optimization to be beneficial [11].


Challenge that adapts to one's skill may be best suited for individuals with high openness and neuroticism. Individuals low in these traits may be better suited to static difficulty [11]. Individuals without experience in video games/simulations may do best with static/adaptive difficulty [11, 60].

Challenge Level Optimization and Virtual Learning Environments. Within virtual learning environments, challenge level optimization has led to higher learning and performance. The extent of the benefits can differ based on personality, the medium of challenge level optimization, and experience in simulated environments. Sampayos-Vargas et al. [60] compared three different mediums for learning Spanish (i.e., word matching, a fixed-difficulty game, and an adaptive-difficulty game). Motivation stayed constant across the conditions; learning and performance increased for the adaptive-difficulty game. Sampayos-Vargas et al. [60] also found that the fixed- and adaptive-difficulty simulated Spanish games resulted in higher perceived competence. Bauer, Brusso, and Orvis [11] researched adaptive difficulty in a military simulation game. The game asks trainees to find intel on enemy soldiers while being exposed to increasing, static, or adaptive difficulty (i.e., difficulty remains the same, increases, or increases/decreases based on performance, respectively) by making enemies more skilled and damaging. Individuals high in openness and neuroticism performed better under adaptive difficulty, while those with low openness did better with static or increasing difficulty and those with low neuroticism with static difficulty. Orvis et al. [50] evaluated performance and motivation in a military shooting training game. Performance and motivation increased regardless of the difficulty condition. Participants without prior gaming experience performed best under adaptive or no difficulty adjustment, whereas those with gaming experience improved equally under all conditions.
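A minimal sketch of the performance-based adaptation these studies describe is shown below; the thresholds, step size, and level bounds are illustrative assumptions, not values from the cited experiments.

```python
# Minimal sketch of performance-based adaptive difficulty: raise the challenge
# when recent performance is high, lower it when performance is low, and keep
# the learner in the current band otherwise.
def adapt_difficulty(level, recent_scores, low=0.4, high=0.8, step=1,
                     min_level=1, max_level=10):
    """Return the next difficulty level based on average recent performance."""
    if not recent_scores:
        return level
    average = sum(recent_scores) / len(recent_scores)
    if average >= high:
        level = min(level + step, max_level)
    elif average <= low:
        level = max(level - step, min_level)
    return level


level = 5
for trial_scores in ([0.9, 0.85, 0.95], [0.3, 0.2, 0.4], [0.6, 0.7, 0.65]):
    level = adapt_difficulty(level, trial_scores)
    print(level)  # 6, then 5, then 5
```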

2.3 Goal Clarity

Goal clarity has resulted in improved learning, performance, and presence in a range of learning environments [12, 15, 46, 62]. Goal clarity has resulted in higher motivation in the classroom [62], but may not improve performance for those with low motivation [15]. Goal clarity can be varied by the method in which goals are presented (e.g., consistent goals from managers and colleagues, clear deliverables, or adding audio in a simulation) [12, 46]. Goal clarity may not benefit those in highly autonomous live training unless they have prior experience [12].

Goal Clarity and Traditional Learning Environments. Seidel et al. [62] found that students had increased competence and higher self-determined motivation (i.e., intrinsic or identified rather than external motivation) when presented with high goal clarity/coherence videos. Students were presented with various videos in a physics course with high or low goal clarity and coherence with the lessons. Students viewed high goal clarity/coherence lessons as more supportive environments. No change in interest was found as a result of goal clarity in the classroom.


Goal Clarity and Virtual Learning Environments. Within online virtual environments, goal clarity has led to increased learning, perception of learning, learner presence, and positive perceptions of instructors when presented in a multimodal medium. Goal clarity can improve performance for those with high motivation to think deeply about lessons. Limperos et al. [46] presented a brief online lecture on flow theory, varying both the clarity of goals and the delivery modality. The visual and auditory conditions resulted in improvements in actual and perceived learning, presence, and perceptions of the instructor's credibility, goodwill, and competence. The results support the notion that adding multimodal CBTs raises overall clarity more than altering content clarity. Bolkan et al. [15] evaluated the effects of goal clarity through a video-based communication studies lecture. Goal clarity was manipulated by providing elements such as advance organizers. Goal clarity interacted with motivation to learn: those with high goal clarity and motivation to think deeply had increased test performance; those with low motivation did not.

Goal Clarity and Embedded Learning Environments. Goal clarity has improved learning for those with prior experience in autonomous work environments. Beenen and Rousseau [12] surveyed MBA interns during their internships on goal clarity, autonomy, prior experience, learning, and job acceptance intention. Goal clarity accounted for 15.5% of the variance in learning and significantly correlated with job acceptance intentions. Beenen and Rousseau [12] suggest establishing high goal clarity by ensuring goals are consistent across managers and colleagues and are tied to clear deliverables. Goal clarity may not benefit those in highly autonomous training unless they have prior experience [12].

2.4 Feedback

Providing process-level feedback (i.e., feedback on the individual's methods for task completion) as opposed to performance-level feedback (i.e., how well or poorly they performed) can result in improved performance perception, learning outcomes, improved strategies, effort, self-confidence, competence, and engagement [17, 23, 70]. Feedback can also be delivered through various methods (e.g., simulator, instructor) [17, 38, 70]. However, providing only performance-level feedback can cause competence to decrease [70].

Feedback and Traditional Learning Environments. Wollenschlager et al. [70] evaluated the effects of different feedback types in a science classroom. Students planned scientific experiments within three feedback conditions: (1) grades on the overall assignment, (2) a grade from one to five on each aspect of a rubric, or (3) feedback marking where they did well and where they could improve on the next scientific plan assignment. Students rated what they believed they would receive before actually getting their grade. Improvement feedback resulted in higher perceived task improvement and more accurate expected outcomes; however, no change in interest was seen. Solely providing performance feedback or transparency information can cause perceived competence to decrease. A study by Gan, Nang, and Mu [29] explored what classroom feedback practices trainee teachers experience, and how their feedback experiences relate to learning motivation. The study found that


activity-based feedback, teacher evaluation feedback, peer/self-feedback, and longitudinal-development feedback led to motivational increases. Peer/self-feedback and longitudinal-development feedback were the most powerful.

Feedback and Virtual Learning Environments. Earley et al. [23] studied a simulated stock market exercise to evaluate the effects of goal setting and process/outcome feedback on performance and other outcome variables. Process-level feedback related to goal setting, improved task strategies, and information search. Outcome feedback (in combination with goal setting) improved effort. The authors noted that challenging goals in combination with high process and outcome feedback resulted in the highest performance. Earley et al.'s [23] study also revealed that outcome feedback and goal setting can result in improved self-confidence. Issenberg et al. [38] performed a meta-analysis of medical simulators, finding that feedback during the learning experience was the highest contributing factor to effective learning. Chapman, Selvarajah, and Webster [17] evaluated a computer-based training program on motivating employees in three conditions: text, audio with images, and video. The video condition resulted in higher perceived feedback and, in turn, the highest engagement.

2.5 Autonomous Self-regulated Learning

Self-regulated learning allows for a more autonomous environment, where offering choice allows an individual to repeat a task until learning is achieved [38, 57] or to regulate the difficulty [43] to achieve mastery [33]. Self-regulated learning can result in increased motivation, achievement, performance, and engagement, and improved retention [19–21, 33, 43, 57, 67]. Self-regulated learning is commonly stimulated through choices (e.g., allowing learners to pick the assignments or trainings they wish to pursue [21]) and variability in availability (e.g., when to complete assignments [38]). However, those with low understanding [57], experience [12], or self-efficacy beliefs may not perform well in highly self-regulated learning environments.

Autonomous Self-regulated Learning and Traditional Learning Environments. In traditional learning environments, self-regulated learning can result in higher academic achievement, intrinsic motivation, engagement, and improved skills in enhancing self-learning techniques. Rotgans and Schmidt [57] examined how autonomy in PBL within classroom environments affects cognitive engagement. Their findings were consistent with increased cognitive engagement and higher academic achievement. However, the benefit of autonomy seemed to be largely dependent on students' understanding of the topic. Deci, Vallerand, Pelletier, and Ryan [21] found that students who perceived their professors or environment as autonomy supportive had higher levels of intrinsic motivation. Those who felt they understood what controls the outcomes in school were rated as more engaged in school by their teachers and had higher achievement, as shown by grades, when compared to those with less control understanding. Gillard, Gillard, and Pratt [33] induced autonomy through choices, mastery, and purpose. Survey findings and grades supported the belief that students who are given more choice become more motivated to work towards mastery of a subject compared to more structured environments. Cleary and Zimmerman [19] trained high school volleyball students in multiple self-regulated learning processes, including perceived


instrumentality and task interest (forethought), self-monitoring (performance), and self-evaluation (self-reflection). The outcome was improved performance, engagement, and greater achievement. Initiation and sustainment of self-regulated learning is largely dependent on self-motivation beliefs such as self-efficacy, outcome expectations, task interest, and goal orientation.

Autonomous Self-regulated Learning and Virtual Learning Environments. Leiker et al. [43] evaluated the effects of autonomous difficulty selection on performance and motivation in a motion-based video game. Participants either were given the ability to choose to raise or lower the difficulty or had the difficulty selected for them. Those in the choice condition exhibited higher retention and intrinsic motivation. No effect on motivation and engagement was found. Issenberg et al. [38] performed a meta-analysis of reports on medical simulators, finding that allowing simulators to be available at any time allows for repetitive training. Availability was found to be the second most important factor in effective training.

Autonomous Self-regulated Learning and Embedded Learning Environments. In embedded learning environments, autonomous self-regulated learning can lead to increases in performance, learning, and intrinsic motivation. Curado et al. [20] reviewed multiple studies in which employees were either given the choice of which training to enroll in or were assigned to training. They found that when a job offers employees a choice of which training to participate in, they perform better. Hicks and Klimoski [35] conducted a study in which a company's managers were either assigned to a training session or given a choice. At the end of the training session, those in the choice group showed an increase in learning, motivation, and satisfaction compared to the assigned group. Thomas and Velthouse [67] conducted a literature review and found that four task assessment dimensions were important to engagement (impact, competence, meaningfulness, and choice). Freedom of choice leads to a more intrinsically motivated individual, as well as to creativity, initiative, resiliency, and self-regulation.

2.6 Personalization

Personalizing content to an individual's interests can result in increased engagement, learning, effort, and situational interest [13, 26, 34, 36]. Personalization can be achieved through content manipulation, such as tailoring problems to topics of interest or creating games related to the lesson [13, 34, 36]. Personalizing content can trigger situational interest, which is not long-lasting [13] yet can be maintained and increased through repeated exposure [36].

Personalization and Traditional Learning Environments. In traditional learning environments, personalization can lead to improvements in engagement during task performance, the triggering and maintenance of interest, enhanced learning, and lower mental effort. Fives and Manning [26] conducted a study on teachers' knowledge of research-endorsed motivational strategies for student engagement. The study found that incorporating students' interests and values can lead to improved task engagement. Hidi and Renninger [36] proposed a four-phase model of interest development in learning from a review of the literature, finding that when interest is captured it can lead to


seeking challenge and goals related to a task. Engaging activities (e.g., games) can trigger interest, while involved tasks (e.g., group projects) can maintain and build individual interest. Ginns and Fraser [34] examined whether the personalization of paper-based instructional materials through modified text (i.e., writing instructional text to address the learner directly while emphasizing personal relevance) would enhance learning of heart terminology. Findings indicated that personalization enhanced learning and lowered ratings of mental effort during testing, indicating deeper learning. Personalization did not lead to higher levels of interest and enjoyment.

Personalization and Virtual Learning Environments. Bernacki and Walkington [13] used a math tutoring program (i.e., Cognitive Tutor Algebra) in conjunction with a survey on interests to personalize math problems to topic areas participants were interested in (e.g., music, TV, sports). Participants in the personalization group perceived more meaningfulness and, in turn, heightened situational interest. Ambroziak, Ibrahim, Marshall, and Kelling [6] assessed the use of the simulation program MyDispense, which allowed students to personalize the skills trained. Students generally perceived the virtual simulation as an effective tool for learning medication dispensing skills. However, a drawback of the personalized simulated environment was the amount of time it took to create and test the personalized exercises.

2.7 Experiential Learning

Experiential learning can lead to higher engagement levels, knowledge gains, and knowledge retention [28, 66]. Experiential learning is commonly facilitated through PBL and inquiry-based learning (IBL). Learning by doing can increase task time and is not suggested for difficult learning material, as the cognitive load may be too high for students attempting to develop an initial knowledge foundation [66].

Experiential Learning in Traditional Learning Environments. A study conducted by Ahlfeldt, Mehta, and Sellnow [3] measured student engagement across fifty-six university classes, some of which were taught by teachers trained in PBL. Higher levels of engagement were experienced in smaller classes, classes with teachers who had the most PBL training, and classes with teachers who implemented more PBL strategies [3]. This is consistent with other researchers who have found higher levels of engagement in PBL environments [28, 37]. Hmelo-Silver, Duncan, and Chinn [37] conducted a review of the literature on IBL and PBL environments. Their efforts showed that both types of experiential learning can help foster mastery goal orientation and higher knowledge gains. Winsett, Foster, Dearing, and Burch [69] assessed how experiential learning affected engagement in business management students. Higher levels of behavioral, emotional, and out-of-class cognitive engagement resulted from group-based experiential learning. It is important to note that the results suggest experiential learning has specific effects when paired with specific mediums: when paired with group discussions it yields physical engagement; group projects appear to drive emotional engagement; and variability in group work drives cognitive out-of-class and emotional engagement.


Experiential Learning in Virtual Environments. Fukuzawa and Boyd [28] created the online Monthly Virtual Mystery game, which provided case studies to engage learners. Students were divided into two groups, either receiving the case studies or answering regular discussion board questions. The findings were consistent with PBL leading to higher levels of engagement and higher perceived value in online discussion boards. Students utilizing the game had a higher completion rate, demonstrating that an active learning project can be implemented using PBL principles through an online discussion board. However, no difference in learning was found, which the researchers attributed to the large class discussion sizes. A literature review by Al-Elq [4] on the use of simulators for experiential learning found that simulators led to improvement in learners' competence and confidence. Al-Elq [4] referenced a study where practitioners practiced their life-saving skills. Survey results showed that practitioners felt more confident after experiential learning and that the use of simulators led to higher competence, as they allowed for a deeper understanding of complex medical factors.

2.8 Game-Based Learning

Within traditional and virtual learning contexts, game-based learning has led to gains in conceptual, tacit, declarative, procedural, and strategic knowledge and in knowledge transfer, as well as increased performance, flow, and engagement [2, 10, 39, 45]. Learning activities can be "gamified" by adding incentives and gaming qualities to an activity that may be uninteresting [51]. However, game-based learning can also result in lower explicit knowledge if students become too focused on how to beat the game, which can result in decreased confidence when tested on knowledge gains (Rieber and Noah [54]). Game-based learning can be facilitated by adding gaming elements or developing a game that may include clear rules/goals and feedback, competitiveness, opportunities to solve problems, uncertain outcomes, or scores [30, 31]. One example is an interactive game-based physics lesson that allows students to control a ball's acceleration and velocity [54]. In game-based learning, the depth of flow may depend on the academic level of the student [2], but game-based learning has the potential to increase motivation for all students [9, 30, 45, 51]. It should be noted that technological issues can decrease the flow of game-based learning if they become a distraction [2].

Game-Based Learning and Traditional Learning Environments. Game-based learning in the classroom has the potential to increase learning, understanding of how concepts are related, and motivation. Bai, Pan, Hirumi, and Kebritchi [9] utilized game-based learning in mathematics. Students learned algebra either using the game-based DimensionM method or through regular instruction. Students in the game-based learning condition had increased mathematical knowledge and maintained motivation to learn when compared to those who did not. A study by Kao, Chiang, and Sun [39] utilized game-based learning in a science classroom to teach physics. Participants in the game-based learning groups scored higher than those in the no-game group, demonstrating higher related knowledge. Admiraal, Huizenga, Akkerman, and Ten Dam [2] utilized the game Frequency 1550 to teach history to high school students. The game resulted in students showing a state of flow and improved


performance during the game. However, the game did not produce gains in learning outcomes. The authors propose this was due to the distracting features of the game. A meta-analysis of science game-based learning conducted by Li and Tsai [45] found that it led to improved interest, motivation, and engagement in the classroom.

Game-Based Learning and Virtual Learning Environments. Game-based learning can lead to better performance, deeper learning, and engagement in virtual learning environments. A study by Barab et al. [10] developed a 3D game-based curriculum designed to teach water quality concepts in a simulated environment. Students experienced traditional, framed, or immersive world conditions. Those in the immersive world condition performed significantly better than those in traditional learning. A study by Squire and Jan [65] found that incorporating game-based learning in a simulated environment led to increased engagement and deeper learning. A group of students utilized a place-based augmented reality game called Mad City Mystery to learn about diseases. Increased engagement was seen in the form of students revisiting different areas in the game to answer the problem posed by the game.

2.9 Interactivity and Multimedia

Adding interactive elements can increase engagement [1, 14, 41], but the effect can be hindered if the student does not know how to interact with the technology or becomes distracted by its features [1]. Interactivity can also result in increased understanding and mastery [1], increased collaborative learning, and increased performance [14]. However, some studies have shown limited learning gains from adding interactive elements [52]. Interactivity is usually accomplished by allowing interaction with an activity or system through feedback, control, simulation, and adaptation. Additionally, interactivity can be achieved by adding more complex mediums such as animations, simulations, or live environments [1, 14, 52, 64]. Adding interactivity to learning activities can increase attention, motivation, confidence, satisfaction, relevance [56], perceptions of learning effectiveness [41], situational interest [52], and enjoyment, but can take away from other learning aspects if not implemented appropriately [1].

Interactivity and Multimedia in Traditional Learning Environments. Interactivity can lead to improved learning, performance, understanding, and engagement. A study by Blasco-Arcas, Buil, Hernandez-Ortega, and Sese [14] incorporated interactivity by using clickers in a social sciences university classroom. Clickers resulted in increased engagement, collaborative learning, and performance. Krain [41] found that using a video format to present case studies and PBL led to increased engagement and student perceptions of effectiveness.

Interactivity and Multimedia in Virtual Learning Environments. Within virtual learning environments, interactivity and multimedia can lead to increases in interest, enjoyment, understanding, and mastery. A study conducted by Adams and Reid [1] surveyed two hundred students who were using interactive simulated environments to learn physics, and observed five students utilizing an interactive simulation to create a circuit. Interactivity led to increased understanding, enjoyment, and mastery. However, it is important that the simulated environment be easy to use and the features not be too


distracting [1]. A study by Pedra, Mayer, and Albertin [52] utilized interactivity in maintenance video instruction. High interactivity led to increased situational interest; however, no learning outcome gains were seen.

Interactivity and Multimedia in Embedded Learning Environments. Rodgers and Withrow-Thorton [56] were interested in how different instructional media affected learner motivation in a workplace training situation for a hospital. Results indicated that interactive safety training led to increases in attention, satisfaction, and confidence, as well as in motivation.

2.10 Meaningful Learning

Utilizing meaningful learning tactics has resulted in higher engagement [25], achievement [32], understanding, recall, and transfer of knowledge [25, 47]. Meaningful learning can also increase an individual's satisfaction and motivation [55, 63]. However, in some environments, motivation can decrease with the addition of meaningful learning aids such as concept maps (e.g., while playing a game to learn history), as learners prefer to learn through the more enjoyable means when they exist [18].

Meaningful Learning in Traditional Learning Environments. Eppler [25] studied the effects of meaningful learning in a knowledge management and research methods course. Meaningful learning was induced using visual aids such as concept maps, mind maps, visual metaphors, and conceptual diagrams to build knowledge. Students exposed to the visual aids showed increased engagement, as well as improved understanding, enjoyment, and recall. Gidena and Gebeyhu [32] utilized advance organizers in physics instruction to discern their effects on academic achievement. Advance organizers led to improved academic achievement, knowledge, and understanding. Mayer and Bromage [48] examined how advance organizers affected learners' understanding of a new computer programming language when presented prior to or after learning. Students read the advance organizers either before or after reading the text. Advance organizers led to improved transfer of knowledge, connections, and recall, but retention of information was not affected. Students in the before group had higher recall of conceptual idea units, more appropriate intrusions, and novel inferences. Students who were presented with the advance organizers afterward scored higher on recall of technical idea units and had less appropriate intrusions, connections, and nonspecific summaries. Shihusa and Keraro [63] investigated the effects of advance organizers on learner motivation in a biology class learning about pollution. Advance organizers led to motivation increases compared to a conventional learning experience.

3 Virtual Learning Environment Use Case

We include here a use case example of how a subset of these interventions could be implemented within a virtual surgery simulator course in a medical program. Before the training course, a set of queries could be sent to each trainee via a mobile application


to collect demographic and individual trait information known to impact an individual's propensity to become engaged in the learning topic (e.g., personality, self-efficacy, interests). The instructor could also send learning material and advance organizers related to the upcoming simulated surgery to prompt the trainees to cognitively engage in the topic beforehand. The instructor could then review the profiles of each individual prior to the first training session to familiarize themselves with each trainee, including their propensity to become engaged. If a particular learner has low self-efficacy, the instructor could consider providing process-level feedback with specific areas of improvement. During the training session, the instructor could monitor performance to determine whether the simulated surgery task is too overwhelming or too simplistic for the trainee's level of skill and adjust accordingly. For motivated learners who seem to have developing skill levels, the instructor could provide more autonomy by allowing the learner to practice outside of normal training sessions to work toward mastery. If other trainees show boredom and lack of interest, the instructor can tailor the task to individual goals. For example, if an individual learner aims to work in the pediatric field, the instructor can adjust the surgery simulation task so that it targets pediatric emergency care. Instructors can provide process-level feedback regarding performance on the simulated task and utilize metacognitive prompts to promote self-reflection.
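As a hedged sketch of the pre-training step only, the rules below show how a hypothetical learner profile gathered by the mobile queries could seed the instructor's initial intervention settings; the trait names, cutoffs, and setting labels are illustrative assumptions, not part of the use case as specified.

```python
# Illustrative sketch: a hypothetical learner profile drives the starting
# intervention settings the instructor reviews before the first session.
def initial_settings(profile):
    """Map learner traits to starting intervention choices (illustrative rules)."""
    settings = {
        "feedback": "outcome-level",
        "autonomy": "standard sessions",
        "scenario": "general surgical emergency",
    }
    if profile.get("self_efficacy", 1.0) < 0.5:
        settings["feedback"] = "process-level, with specific areas of improvement"
    if profile.get("motivation", 0.0) > 0.7:
        settings["autonomy"] = "open simulator access for extra practice"
    if profile.get("career_interest") == "pediatrics":
        settings["scenario"] = "pediatric emergency care"
    return settings


trainee = {"self_efficacy": 0.4, "motivation": 0.8, "career_interest": "pediatrics"}
print(initial_settings(trainee))
```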

4 Implementation Considerations and Conclusion

It is important to consider the environment, the individual, and the task itself when choosing an instructional intervention to implement, as some interventions may be more effective, feasible, or applicable in certain contexts. For instance, challenge/skill optimization is easiest to implement in environments in which there are ongoing assessments of individual learner knowledge and performance, such as in simulation or computer-based learning. Such is the case for feedback as well. In contrast, instructors can easily deliver clear goals in a range of environments. In addition, consideration should be given to individual learner characteristics, as some interventions only foster engagement in the right individual. For example, those who are afraid of performing poorly will perform worse under metacognitive interventions, whereas those who are not afraid of learning from mistakes and seek to master a task will show improved performance with a metacognitive intervention [61]. Additional caution should be taken when combining multiple instructional interventions. Combining interventions aimed at increasing the depth of understanding (e.g., concept maps) with interventions aimed at increasing enjoyment (e.g., games) can have a negative effect: learners may become disengaged when completing the concept maps, as they prefer to engage with the content in the more entertaining way [18]. Consideration must be given to the particular intricacies of each intervention presented above.


5 Conclusion

The modern educational environment is changing the way individuals learn, and the way we teach must change with it. When fashioning learning opportunities to facilitate effective learning, instructors should consider the optimal learning environment, the ideal instructional interventions, and students' individual characteristics. There is a wealth of knowledge available in the education and cognitive psychology literature that can be leveraged to achieve this. This paper attempts to facilitate this process by presenting a taxonomy of instructional interventions that aim to increase learner engagement, mapped to the learning environments in which they have been shown to be effective in the literature. Also presented are a use case implementation in a virtual environment and important considerations for implementation. By marrying proven instructional techniques with emerging and innovative technology, instructors have the opportunity to more fully engage learners in each learning opportunity.

References

1. Adams, W.K., et al.: A study of educational simulations part 1 - engagement and learning. J. Interact. Learn. Res. 19(3), 397–419 (2008)
2. Admiraal, W., Huizenga, J., Akkerman, S., Ten Dam, G.: The concept of flow in collaborative game-based learning. Comput. Hum. Behav. 27(3), 1185–1194 (2011)
3. Ahlfeldt, S., Mehta, S., Sellnow, T.: Measurement and analysis of student engagement in university classes where varying levels of PBL methods of instruction are in use. High. Educ. Res. Dev. 24(1), 5–20 (2005)
4. Al-Elq, A.H.: Simulation-based medical teaching and learning. J. Fam. Community Med. 17(1), 35 (2010)
5. Aleven, V., Koedinger, K.: An effective metacognitive strategy: learning by doing and explaining with a computer-based cognitive tutor. Cogn. Sci. 26, 147–179 (2002)
6. Ambroziak, K., Ibrahim, N., Marshall, V.D., Kelling, S.E.: Virtual simulation to personalize student learning in a required pharmacy course. Curr. Pharm. Teach. Learn. 10, 750–756 (2018)
7. Askell-Williams, H., Lawson, M., Skrzypiec, G.: Scaffolding cognitive and metacognitive strategy instruction in regular class lessons. Instr. Sci. 40(2), 413–443 (2012)
8. Ausubel, D.: Cognitive structure and the facilitation of meaningful verbal learning. J. Teach. Educ. 14, 217–222 (1963)
9. Bai, H., Pan, W., Hirumi, A., Kebritchi, M.: Assessing the effectiveness of a 3-D instructional game on improving mathematics achievement and motivation of middle school students. Br. J. Educ. Technol. 43(6), 993–1003 (2012)
10. Barab, S.A., et al.: Transformational play as a curricular scaffold: using videogames to support science education. J. Sci. Educ. Technol. 18(4), 305 (2009)
11. Bauer, K.N., Brusso, R.C., Orvis, K.A.: Using adaptive difficulty to optimize videogame-based training performance: the moderating role of personality. Mil. Psychol. 24(2), 148–165 (2012). https://doi.org/10.1080/08995605.2012.672908
12. Beenen, G., Rousseau, D.M.: Getting the most from MBA internships: promoting intern learning and job acceptance. Univ. Mich. Alliance Soc. Hum. Resour. Manag. 49(1), 3–22 (2010)


13. Bernacki, M.L., Walkington, C.: The impact of a personalization intervention for mathematics on learning and non-cognitive factors. In: EDM (Workshops) (2014)
14. Blasco-Arcas, L., Buil, I., Hernandez-Ortega, B., Sese, J.: Using clickers in class. The role of interactivity, active collaborative learning and engagement in learning performance. Comput. Educ. 62, 102–110 (2013)
15. Bolkan, S., Goodboy, A., Kelsey, D.: Instructor clarity and student motivation: academic performance as a product of students' ability and motivation to process instructional material. Commun. Educ. 65(2), 129–148 (2016)
16. Carroll, M., Lindsey, S., Chaparro, M.: An applied model of learner engagement and strategies for increasing learner engagement in the modern educational environment. Submitted to Interactive Learning Environments (Under review)
17. Chapman, P., Selvarajah, S., Webster, J.: Engagement in multimedia training systems. In: Proceedings of the 32nd Annual Hawaii International Conference on System Sciences (1999)
18. Charsky, D., Ressler, W.: "Games are made for fun": lessons on the effects of concept maps in the classroom use of computer games. Comput. Educ. 56(3), 604–615 (2011)
19. Cleary, T.J., Zimmerman, B.J.: A cyclical self-regulatory account of student engagement: theoretical foundations and applications. In: Christenson, S., Reschly, A., Wylie, C. (eds.) Handbook of Research on Student Engagement, pp. 237–257. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-2018-7_11
20. Curado, C., Henriques, P.L., Ribeiro, S.: Voluntary or mandatory enrollment in training and the motivation to transfer training. Int. J. Train. Dev. 19, 98–109 (2015). https://doi.org/10.1111/ijtd.12050
21. Deci, E.L., Vallerand, R.J., Pelletier, L.G., Ryan, R.M.: Motivation and education: the self-determination perspective. Educ. Psychol. 26(3–4), 325–346 (1991)
22. Downing, K., Kwong, T., Chan, S., Lan, T., Downing, W.: Problem-based learning and the development of metacognition. High. Educ. 57, 609–621 (2008)
23. Earley, C., Northcraft, G., Lee, C., Lituchy, T.: Impact of process and outcome feedback on the relation of goal setting to task performance. Acad. Manag. J. 33(1), 87–105 (1990)
24. Engeser, S., Rheinberg, F.: Flow, performance and moderators of challenge-skill balance. Motiv. Emot. 32(3), 158–172 (2008)
25. Eppler, M.J.: A comparison between concept maps, mind maps, conceptual diagrams, and visual metaphors as complementary tools for knowledge construction and sharing. Inf. Vis. 5(3), 2002 (2006)
26. Fives, H., Manning, D.K.: Teachers' strategies for student engagement: comparing research to demonstrated knowledge. In: Annual Meeting of American Psychological Association (2005)
27. Ford, K.J., Smith, E.M., Weissbein, D.A., Gully, S.M.: Relationships of goal orientation, metacognitive activity, and practice strategies with learning outcomes and transfer. J. Appl. Psychol. 83(2), 218–233 (1998)
28. Fukuzawa, S., Boyd, C.: Student engagement in a large classroom: using technology to generate a hybridized problem-based learning experience in a large first year undergraduate class. Can. J. Sch. Teach. Learn. 7(1), 7 (2016)
29. Gan, Z., Nang, H., Mu, K.: Trainee teachers' experiences of classroom feedback practices and their motivation to learn. J. Educ. Teach. 44, 1–6 (2018)
30. Garris, R., Ahlers, R.: A game-based training model: development, application, and evaluation. In: Interservice/Industry Training, Simulation & Education Conference, Orlando, FL (2001)
31. Garris, R., Ahlers, R., Driskell, J.E.: Games, motivation, and learning: a research and practice model. Simul. Gaming 33(4), 441–467 (2002)

280

M. Carroll et al.

32. Gidena, A., Gebeyhu, D.: The effectiveness of advance organizer model on students’ academic achievement in learning work and energy. Int. J. Sci. Educ. 39(6), 2226–2242 (2017) 33. Gillard, S., Gillard, S., Pratt, D.: A pedagological study of intrinsic motivation in the classroom through autonomy, mastery, and purpose. Contemp. Issues Educ. Res. (Online) 8(1), 1 (2015) 34. Ginns, P., Fraser, J.: Personalization enhances learning anatomy terms. Med. Teach. 32(9), 776–778 (2010) 35. Hicks, W.D., Klimoski, R.J.: Entry into training programs and its effects on training outcomes: a field experiment. Acad. Manag. J. 30(3), 542–552 (1987) 36. Hidi, S., Renninger, K.A.: The four-phase model of interest development. Educ. Psychol. 41(2), 111–127 (2006) 37. Hmelo-Silver, C.E., Duncan, R.G., Chinn, C.A.: Scaffolding and achievement in problembased and inquiry learning: a response to Kirschner, Sweller, and Clark (2006). Educ. Psychol. 42(2), 99–107 (2007) 38. Issenberg, B.S., McGaghie, W.C., Petrusa, E.R., Gordon, D.L., Scalese, R.J.: Features and uses of high-fidelity medical simulations that lead to effective learning: a BEME systematic review. Med. Teach. 27(1), 10–28 (2005) 39. Kao, G.Y.M., Chiang, C.H., Sun, C.T.: Customizing scaffolds for game-based learning in physics: impacts on knowledge acquisition and game design creativity. Comput. Educ. 113, 294–312 (2017) 40. Kohler, D.B.: The effects of metacognitive language learning strategy training on lowerachieving second language learners. ProQuest Information & Learning (2002) 41. Krain, M.: The effects of different types of case learning on student engagement. Int. Stud. Perspect. 11(3), 291–308 (2010) 42. Kramarski, B., Mevarech, Z.R.: Enhancing mathematical reasoning in the classroom: the effects of cooperative learning and metacognitive training. Am. Educ. Res. J. 40(1), 281–310 (2003) 43. Leiker, A., et al.: The effects of autonomous difficulty selection on engagement, motivation, and learning, in a motion-controlled video game task. Hum. Mov. Sci. 49, 326–335 (2016) 44. Lewis, L.H., Williams, C.J.: In: Jackson, L., Caffarella, R.S. (eds.) Experiential Learning: A New Approach, pp. 5–16. Jossey-Bass, San Francisco (1994) 45. Li, M.C., Tsai, C.C.: Game-based learning in science education: a review of relevant research. J. Sci. Educ. Technol. 22(6), 877–898 (2013) 46. Limperos, A., Buckner, M., Kaufmann, R., Frisby, B.: Online teaching and technological affordances: an experimental investigation into the impact of modality and clarity on perceived and actual learning. Comput. Educ. 83, 1–9 (2015) 47. Mayer, R.E.: Elaboration techniques that increase the meaningfulness of technical text: an experimental test of the learning strategy hypothesis. J. Educ. Psychol. 72(6), 770–784 (1980). https://doi.org/10.1037/0022-0663.72.6.770 48. Mayer, R.E., Bromage, B.K.: Difference recall protocols for technical texts due to advance organizers. J. Educ. Psychol. 72(2), 209 (1980) 49. McQuirter Scott, R., Meeussen, N.: Self-regulated learning: a touchstone for technologyenhanced classrooms. Read. Teach. 70(6), 659–666 (2017) 50. Orvis, K.A., Horn, D.B., Belanich, J.: The roles of task difficulty and prior videogame experience on performance and motivation in instructional video games. Comput. Hum. Behav. 24(5), 2415–2433 (2008) 51. Paas, F., Tuovinen, J.E., Van Merrienboer, J.J., Darabi, A.A.: A motivational perspective on the relation between mental effort and performance: optimizing learner involvement in instruction. Educ. Technol. Res. Dev. 
53(3), 25–34 (2005)

Integrating Engagement Inducing Interventions

281

52. Pedra, A., Mayer, R.E., Albertin, A.L.: Role of interactivity in learning from engineering animations. Appl. Cogn. Psychol. 29(4), 614–620 (2015) 53. Reeve, J., Ryan, R., Deci, E.L., Jang, H.: Understanding and promoting autonomous selfregulation: a self-determination theory perspective. In: Schunk, D.H., Zimmerman, B. J. (eds.) Motivation and Self-Regulated Learning: Theory, Research, and Applications, pp. 223–244. Lawrence Erlbaum Associates Publishers, Mahwah, NJ, US (2008) 54. Reiber, L., Noah, D.: Games, simulations, and visual metaphors in education: antagonism between enjoyment and learning. Educ. Media Int. 45(2), 77–92 (2008) 55. Rendas, A.B., Fonseca, M., Pinto, P.R.: Toward meaningful learning in undergraduate medical education using concept maps in a PBL pathophysiology course. Adv. Physiol. Educ. 30, 23–29 (2006) 56. Rodgers, D.L., Withrow-Thorton, B.J.: The effect of instructional media on learner motivation. Int. J. Instr. Media 32(4), 333–342 (2005) 57. Rotgans, J., Schmidt, H.: Cognitive engagement in the problem-based learning classroom. Adv. Health Sci. Educ. 16(4), 465–479 (2011) 58. Salas, E., Cannon-Bowers, J.: The science of training: a decade of progress. Annu. Rev. Psychol. 52(1), 471–499 (2001) 59. Salen, K., Zimmerman, E.: Rules of Play: Game Design Fundamentals. MIT Press, Cambridge (2004) 60. Sampayo-Vargas, S., Cope, C.J., He, Z., Bryne, G.J.: The effectiveness of adaptive difficulty adjustments on students’ motivation and learning in an educational computer game. Comput. Educ. 69, 452–462 (2013) 61. Schmidt, A.M., Ford, K.J.: Learning within a learner control training environment: the interactive effects of goal orientation and metacognitive instruction on learning outcomes. Pers. Psychol. 56, 405–429 (2003) 62. Seidel, T., Rimmele, R., Prenzel, M.: Clarity and coherence of lesson goals as a scaffold for student learning. Learn. Instr. 15, 539–556 (2005) 63. Shihusa, H., Keraro, F.N.: Using advance organizers to enhance students’ motivation in learning biology. EURASIA J. Math. Sci. Technol. Educ. 5(4), 413–420 (2009) 64. Sims, R.: Interactivity for effective educational communication and engagement during technology based and online learning. In: McBeath, C., Atkinson, R. (eds.) Planning for Progress, Partnership and Profit, Proceedings EdTech 1998. Australian Society for Educational Technology, Perth (1998) 65. Squire, K.D., Jan, M.: Mad City Mystery: developing scientific argumentation skills with a place-based augmented reality game on handheld computers. J. Sci. Educ. Technol. 16(1), 5–29 (2007) 66. Stull, A.T., Mayer, R.E.: Learning by doing versus learning by viewing: three experimental comparisons of learner-generated versus author-provided graphic organizers. J. Educ. Psychol. 99(4), 808–820 (2007) 67. Thomas, K.W., Velthouse, B.A.: Cognitive elements of empowerment: an “interpretive” model of intrinsic task motivation. Acad. Manag. Rev. 15(4), 666–681 (1990) 68. Vincent, A., Koedinger, K.: An effective metacognitive strategy: learning by doing and explaining with a computer-based cognitive tutor. Cogn. Sci. 26, 147–179 (2002) 69. Winsett, C., Foster, C., Dearing, J., Burch, G.: The impact of group experiential learning on student engagement. Acad. Bus. Res. J. 3, 7–17 (2016) 70. Wollenschlager, M., Hattie, J., Machts, N., Harms, U.: What makes rubrics effective in teacher feedback? Transparency of learning goals is not enough. Contemp. Educ. Psychol 44, 1 (2016)

Productive Failure and Subgoal Scaffolding in Novel Domains

Dar-Wei Chen1 and Richard Catrambone2

1 Soar Technology Inc., 12124 High Tech Ave, Suite 350, Orlando, FL 32817, USA
[email protected]
2 Georgia Institute of Technology, Atlanta, GA 30332, USA

Abstract. The assistance dilemma asks how learning environments should “balance information or assistance giving and withholding” (Koedinger and Aleven 2007, p. 239). Minimal guidance (MG) methods posit that students learn best when exploring problems freely, while direct instruction (DI) methods provide canonical solutions early on to streamline students’ efforts (problems later). Each method type provides unique benefits, but both are important (Schwartz and Martin 2004) and not easily delivered together. A relatively new MG-based method called “productive failure” (PF) is hypothesized to capture both sets of benefits by requiring students to struggle through problems early on and revealing canonical solutions afterward (Kapur 2008). Students using PF are hypothesized to more effectively transfer and retain information because balancing heuristics and formal knowledge produces diverse solution attempts (diSessa and Sherin 2000) and struggling during exploration pushes students to fill knowledge gaps (Kulhavy and Stock 1989). In the present studies, participants learned to perform tasks in two domains, cryptarithmetic (more traditional) and Rubik’s Cube (psychomotor, less traditional) while using either PF or DI. Analyses revealed that (A) PF participants did not outperform DI participants on either immediate post-tests or retention tests, although they did report being more exploration-oriented and trying more unique strategies, (B) subgoal labels increased learning, but only for the relatively novel Rubik’s Cube domain (and they sometimes increased workload in cryptarithmetic, surprisingly), and (C) effects of subgoal labels did not change with instruction type. Future research should determine how PF methods can be scaffolded to foster exploration mindsets and diverse solutions.

Keywords: Productive failure · Educational methods · Scaffolding · Assessment · Retention · Subgoals · Training transfer

1 Introduction

For many years, education researchers have debated a seemingly simple question called “the assistance dilemma,” which can be summarized as: “How should learning environments balance information or assistance giving and withholding to achieve optimal student learning?” (Koedinger and Aleven 2007, p. 239). The answer to this question has the potential to shape future instructional design in fundamental ways, but no

consensus has been reached thus far. For now, two categories of instructional methods dominate the debate. Traditional methods that provide canonical instruction early on and utilize problem-solving as application practice are called “direct instruction” (DI), while “minimal guidance” (MG) methods require learners to discover information through guided exploration and problem-solving, instead of receiving canonical instruction. Although MG and DI methods are pedagogically different, they are similar in that they both strive to help students avoid struggle and failure (i.e., being unsuccessful in producing canonical solutions) while learning; both types of methods provide various levels of scaffolding to reduce learner struggle and failure, ostensibly because struggle and failure ultimately do more harm than good. However, a relatively new method called “productive failure” (PF; e.g., Kapur 2008) is hypothesized to leverage struggle and failure for unique learning benefits. In PF, learners attempt problems before receiving canonical instruction, and it is hypothesized that, as a result, they will potentially be better able to (A) solve transfer problems, (B) retain knowledge past immediate comprehension tests, (C) know why a given solution is correct, as opposed to just knowing that it is correct, and (D) identify their own gaps in knowledge, among other benefits. Furthermore, given that PF is an exploration-based method with canonical instruction implemented, learners using PF are hypothesized to reap benefits usually associated with both minimal guidance (e.g., self-generated concepts) and direct instruction (e.g., streamlining of attention and resource allocation). The experiments described here tested the “productive failure” hypothesis and aimed to provide a new perspective on existing learner assistance approaches as well.

1.1 “Minimal Guidance” Model

Productive failure methods are based, in part, on a variety of existing minimal guidance methods, but PF is hypothesized to improve on each of those methods in some fashion. • Discovery learning. An early instantiation of minimal guidance was “discovery learning,” in which students freely explore domains and material for themselves to create governing insights about the world (Anthony 1973), often without concrete goals in mind. • Constructivism. Learners in constructivist environments are hypothesized to build “conceptually functional representations of the external world” that are not necessarily unique to themselves (Jonassen 1991, p. 61). Therefore, while the basic pedagogical premise of constructivism is similar to that of discovery learning (i.e., active construction of meaning), a conceptual difference is that in discovery learning, students are hypothesized to instead construct their own unique representations of the world. • Impasse-driven learning. Impasse-driven learning is one of the first methods to implement struggle and failure to a large extent; impasses are defined by VanLehn et al. (2003, p. 220) as situations in which a student is stuck, “detects an error, or does an action correctly but expresses uncertainty about it.” The governing principle of impasse-driven learning is that impasses are effective in helping learners adopt learning-oriented mindsets, which cause them to be more likely to search their

memories, examine the environment, or ask nearby people, etc. in attempts to discover what they do not understand (VanLehn et al. 2003). After students reach impasses, tutors are to provide explanations soon afterward if students are not able to resolve the impasse themselves. No matter the specific instantiation, MG methods are hypothesized to mitigate working memory constraints by encouraging learners to connect new information with prior long-term knowledge (Kapur and Bielaczyc 2011) during unstructured problem-solving periods. These connections increase the chances that new information is understood at a deeper level than if it were learned via DI, where the new information is often stored in working memory and available in external memory.

1.2 “Direct Instruction” Model

On the opposite side of the learner assistance debate from minimal guidance are direct instruction methods, which generally guide students strongly and limit exploration. The worked example is considered “the epitome of strongly guided instruction” and “provides some of the strongest evidence for the superiority of directly guided instruction over minimal guidance” (Kirschner et al. 2006, p. 80). Worked examples are hypothesized to streamline attention to the most important parts of problems, reducing problem-solving search and thus lowering working memory loads (Kirschner et al. 2006). For most learners, and novices in particular, this streamlining is key because they do not possess the relevant schemas with which to integrate new information and prior knowledge, and therefore cannot construct new schemas that are durable (Rourke and Sweller 2009). When unguided, many novices often resort to methods such as trial-and-error, which are burdensome on working memory, causing it to be unavailable for contributing to long-term memory (Kirschner et al. 2006). If working memory is occupied with tasks such as trial-and-error or problem-solving search, unguided students will not be able to use working memory to learn, and they could therefore potentially search problem spaces for long periods without adding to long-term memory (Sweller et al. 1982). Learners can also sometimes lean too much on pre-existing knowledge to explore a domain (as opposed to devising learning goals), which can then lead to flawed conclusions (Wineburg and Fournier 1994). Direct instruction can be instantiated in many ways: lectures, models, videos, presentations, demonstrations, as well as the aforementioned worked examples (Clark et al. 2012).

1.3 Solving the Assistance Dilemma Through “Productive Failure”

A growing body of literature posits that the productive failure methodology can help students learn in ways that achieve the objectives of both minimal guidance and direct instruction (e.g., Kapur 2011); that is, the “MG vs. DI” debate might be a false choice. On a high level, productive failure requires students to invent solutions to presented problems first (in the “generation period”) before receiving canonical instruction (“consolidation period”), thereby reversing the traditional order of these two teaching elements in DI. This order leads to struggle (and ultimately, failure) early on in the learning process, but there often exists “a latent productivity in what initially seemed to

be failure” (Kapur 2008, p. 379). The generation effect, “which refers to the long-term benefit of generating an answer, solution, or procedure versus being presented that answer, solution, or procedure” (Bjork and Bjork 2011), could explain this latent productivity, in part. The ensuing canonical instruction then serves to combat the “negative transfer” (Bransford and Schwartz 1999) that often plagues minimalguidance methods. It should be noted, however, that PF students do receive some basic domain information before entering the generation period, which lessens the probability of unproductive failures in which students attempt solutions that are too irrelevant to yield any valuable information. Most MG methods employ scaffolding so that learners can avoid failure, ostensibly because it will hinder learning; however, failure is embraced and explicitly designed into the PF process through the use of problem-solving early on (generation period), and difficult ill-structured problems in particular are frequently used. In practice, scaffolding is withheld and “solution features” are deliberately made inconspicuous in PF so learners will be unlikely to guess canonical solutions, instead being encouraged to lean on heuristics and prior knowledge to generate solutions (Loibl and Rummel 2014a). After initial problem-solving, canonical instruction follows for learners to fill in the rest of their understanding and remedy any mistakes they made. Sometimes, an initial assessment is implemented first immediately after the initial problems to ensure more concrete failure. Each of the following sections summarizes a key component of the productive failure hypothesis. Heuristics Plus Formal Knowledge. In minimal-guidance environments, learners are led to utilize prior knowledge and heuristics during problem-solving, thereby mitigating some working memory constraints (Kapur and Bielaczyc 2011) on the whole, even if searching problem spaces also increases learners’ working memory burdens somewhat (Sweller 1988). In the event that some learners do encounter higher cognitive demands in PF, they also often report feeling more engaged because of the autonomy they are afforded during initial problem-solving (diSessa et al. 1991). This prior knowledge activation is crucial for helping learners connect new material with long-term knowledge, which enables better encoding and assembling of schemas (Hiebert and Grouws 2007) as well as better transferability and durability of learning (Kapur 2008). The blending of heuristics, prior knowledge, and formalized canonical instruction allows PF methods to provide benefits that MG and DI alone cannot. For example, PF students are more likely to generate relatively large amounts of diverse solutions for novel problems (diSessa and Sherin 2000), a hallmark of how experts attempt problems (Clement 1991). Through these diverse solution attempts, students are expected to develop the ability to extrapolate new information to other contexts (procedural flexibility; Gorman et al. 2010). Another hypothesized benefit is the priming of students to solve transfer problems later using the relative wealth of available information (prior knowledge, heuristics, canonical instruction), even if the information is not germane to any given initial problem (Bransford and Schwartz 1999). A fair question regarding the above information might be whether DI methods can also achieve results similar to PF, given that many of them also implement canonical

instruction and problem-solving. The key difference is that in productive failure, students use problem-solving to “assemble or structure key ideas and concepts while attempting to represent and solve the ill-structured problems” (Kapur et al. 2010, p. 1722). However, in direct instruction, problems are used “not as vehicles for making discoveries, but as a means of practicing recently-learned content and skills” (Clark et al. 2012, p. 6). As a result, students in DI are less likely to blend heuristics and formal knowledge, and more likely to receive formal knowledge and merely re-activate it when solving problems, leading to transferability that is not as robust. In contrast, PF students are led to use heuristics and prior knowledge during initial problem-solving (before receiving canonical instruction to remedy gaps in understanding), which ensures that both knowledge types are activated while learning. The order of material presentation is the key difference. Failure-Related Cognition. “Expectation failure” is the idea that learning is most successful when the outcome expected by a student from the domain does not, in fact, occur (Schank 1997). Key principles of expectation failure include: • Learners are less likely to develop creative solution attempts if environment is too controlled and failures are therefore not possible • Learners are predisposed to explaining occurrences in the domain and adjusting their mental models to avoid being surprised by similar events • For expectation failures to be most effective, they must occur during initial/practice problem-solving (more likely to be activated in future problems) The key function of expectation failures is exposing learners to gaps in their understanding and eliciting learners’ natural misunderstanding-induced curiosity in the material. In these situations, learners are more driven to fill knowledge gaps on their own (e.g., studying feedback), particularly when discrepancies between solution attempts and canonical solutions are wide (Kulhavy and Stock 1989). Due to the “problem-solving prior” instructional order, PF methods are particularly conducive to learners producing initial solution attempts that are discrepant from canonical solutions. Expectation failures also disrupt learners’ stability bias, the overconfident belief that currently-accessible information will remain just as accessible in the future (Kornell and Bjork 2009). Chi’s (2000) theory of the imperfect mental model also accords with the notion that failure can be effective and essential for learning; in short, the theory states that learning is done through updates to one’s own mental models and that selfexplaining, in particular, is an efficient way for learners to update their own models according to their own needs. Furthermore, when experiencing failures and ensuing canonical instruction, learners will also tend to identify reasons that a solution is plausible and why noncanonical solutions do not always work, which improves their capacity for transfer to novel situations (Kapur and Lee 2009). Comparing invented solutions and canonical solutions aids in the encoding of critical conceptual features and selecting relevant problem-solving procedures, even when performing transfer tasks (Siegler 2002). For example, when students were allowed to observe the consequences of entering incorrect spreadsheet formulas, as opposed to being corrected immediately upon

entering an incorrect formula, they achieved higher scores on transfer tasks than immediately-corrected students (Mathan and Koedinger 2003). Immediate Performance vs. Enduring Learning. “Desirable difficulties” (Bjork and Bjork 2011), even if not severe enough to consistently induce failure, can still induce decreased immediate performance and PF-related learning benefits in the long term. Examples of these difficulties include environmental factors (e.g., interface clutter; Fiore et al. 2006), training variation (e.g., practicing tasks that are adjacent to the target task; Kerr and Booth 1978), practice scheduling (e.g., interleaved schedule produces better retention than blocked schedule; Shea and Morgan 1979), and secondary tasks (adding relevant concurrent secondary task improves test performance; Young et al. 2011). The goal of any instructional method should be learning, which can be defined as “permanent changes in comprehension, understanding, and skills of the types that will support long-term retention and transfer” (Soderstrom and Bjork 2015, p. 176). Learning is a separate observed variable from immediate performance, which is a possibly temporary measure that can be an unreliable indicator of learning (Soderstrom and Bjork 2015). Many instructional methods focus on producing immediate performance improvements, but some evidence indicates that immediate performance is not indicative of long-term retention and/or transfer, which is perhaps more important (e.g., Schmidt and Bjork 1992). When learners demonstrate strong immediate performance, they could be merely exhibiting retrieval strength, which is recall activated in particular contexts; however, durable learning is a function of storage strength, which comprises the depths to which the material is associated with prior knowledge (Bjork and Bjork 2011). Increasing storage strength is most efficiently done through information retrieval (as opposed to information review) because the creation of “new routes” to information inherently activates previous knowledge as well (Carrier and Pashler 1992). The observation that enduring learning and immediate outward performance improvement can be uncorrelated is seen in research ranging from maze rats (rats’ abilities to finish mazes improve after ostensibly random wandering; Blodgett 1929) to statistics classes (students who invented solutions and received canonical instruction later outperformed DI students; Schwartz and Martin 2004). Furthermore, methods that aim to improve immediate performance can actually undermine enduring learning: For example, frequent and/or specific feedback, a common DI component, often helps students complete test problems that are similar to the ones they practiced, especially if tested soon after instruction. However, learners that receive the crutch of immediate and frequent feedback are shielded from creating generalizable problem-solving strategies, an important skill that is developed in those that are forced to struggle without immediate feedback (Cope and Simmons 1994). 1.4

1.4 Examining Subgoal Scaffolding in Productive Failure

Many of the PF studies to this point have required learners to complete initial problem-solving (the “generation period”) without scaffolding of any sort, perhaps because this arrangement increases the chances of failure and of the learner reaping the benefits

associated with failure. When no scaffolding structure is present, one potential concern is that learners might not fail in constructive ways, which could then lead to difficulty during canonical instruction because learners will have strayed “off course” to varying extents. Therefore, it is possible that PF methods could be even more effective for learning with the implementation of some scaffolding, especially those scaffolding mechanisms that provide just enough guidance to ensure that failures are indeed productive (i.e., help students unearth fundamental truths about the domain). A few PF studies have implemented scaffolding during the generation period, but there are many more scaffolding mechanisms to be examined with regard to interactions with PF, some of which might produce better learning than non-scaffolded PF methods. The scaffolding mechanism chosen for manipulation in the current study is “subgoals,” which are labels for functional groupings of steps that can help learners recognize fundamental components of a problem (Catrambone 1998). Subgoals are a promising scaffolding mechanism for PF because they can potentially alleviate one of the major weaknesses in PF methods, which is the possibility that learners might fail unproductively by misunderstanding the deep structure of a given problem space.

1.5 General Overview of Current Study and Hypotheses

The experiments in the present study compared the effectiveness of productive failure and direct instruction in two domains that have not been examined before in this PF context. In Experiment 1, participants learned about cryptarithmetic, a domain that functions like the traditional academic domain of algebra and is somewhat similar to physics and math domains that have been used in past PF studies, but is more likely to be unfamiliar to participants (example problem: OOOH + FOOD = FIGHT). The tasks inherent in this domain (deducing variable values, logical reasoning, etc.) allow for reasonable comparison of the results to those from existing PF studies, which have centered mostly on STEM domains. In Experiment 2 (which was procedurally identical to Experiment 1), participants learned about solving the first layer of the Rubik’s Cube, a spatially-oriented task that requires some psychomotor coordination. The generalizability of PF methods to non-traditional domains was tested in this experiment. Experiment 2 provided an opportunity to examine whether Experiment 1 findings replicated or whether the effects of the manipulations might depend on how academic in nature the domain is. The specific methodological details used in these experiments can be found in the next section.
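
To make the cryptarithmetic task concrete, the sketch below brute-forces the example puzzle quoted above. It is purely an illustration of the domain (participants solved such problems by hand), and the function name and structure are ours, not part of the study materials.

```python
from itertools import permutations

def solve_cryptarithm(addend1: str, addend2: str, total: str):
    """Return every digit assignment that satisfies addend1 + addend2 == total."""
    letters = sorted(set(addend1 + addend2 + total))
    assert len(letters) <= 10, "a cryptarithm can use at most ten distinct letters"
    leading = {word[0] for word in (addend1, addend2, total)}
    solutions = []
    for digits in permutations(range(10), len(letters)):
        mapping = dict(zip(letters, digits))
        if any(mapping[ch] == 0 for ch in leading):
            continue  # leading zeros are not allowed
        as_int = lambda word: int("".join(str(mapping[ch]) for ch in word))
        if as_int(addend1) + as_int(addend2) == as_int(total):
            solutions.append(mapping)
    return solutions

# The example problem quoted above: OOOH + FOOD = FIGHT
for solution in solve_cryptarithm("OOOH", "FOOD", "FIGHT"):
    print(solution)
```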

2 Method – Experiment 1 (Cryptarithmetic)

2.1 Participants

A meta-analysis of productive failure studies (Chen 2016) found that PF methods have produced, on average, a performance improvement of about 0.66 SD in deep conceptual knowledge when compared to direct instruction methods, and because PF was hypothesized to improve this kind of generalizable knowledge (as opposed to performance on procedurally-similar tasks), this effect size drove the power analysis used to

determine the sample size in this study. To achieve 80% power and a 5% Type I error rate when searching for an effect of this size, 64 participants were used. These participants were recruited through the online SONA research participation system at the Georgia Institute of Technology and compensated with class credits for their time. All students at the Institute qualified for the experiment except for those who had prior experience in systematically solving cryptarithmetic problems.
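
For illustration, an a priori calculation of this kind can be reproduced with standard power-analysis tooling. This is our sketch, not the authors’ script, and because the exact test family and sidedness they assumed are not reported, the resulting N may not match the reported 64 participants exactly.

```python
from statsmodels.stats.power import TTestIndPower

# d ~= 0.66 (from the Chen 2016 meta-analysis), alpha = .05, power = .80
n_per_group = TTestIndPower().solve_power(
    effect_size=0.66, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"~{n_per_group:.1f} participants per group "
      f"(~{2 * n_per_group:.0f} total) for a simple two-group comparison")
```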

2.2 Experimental Design

Experiment 1 was a laboratory experiment in which all participants were required to learn how to solve basic cryptarithmetic addition problems involving two numbers. The two manipulated independent variables that will be covered in this paper are:

• Instruction type (between subjects): productive failure or direct instruction
• Subgoal labels (between subjects): subgoal labels were provided or withheld

All variables were fully crossed to form a factorial design for the experiment. Observed dependent measures included immediate task performance (near transfer, medium transfer, far transfer), retention task performance after a one-week break (near transfer, medium transfer, far transfer), and several secondary assessments that could predict task performance (e.g., workload, number of solution methods generated).
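
A minimal sketch of the fully crossed 2 × 2 between-subjects structure described above; the condition labels and the assignment helper are hypothetical, not the study’s materials.

```python
from itertools import product
import random

INSTRUCTION = ("productive_failure", "direct_instruction")
SUBGOALS = ("labeled", "non_labeled")
CONDITIONS = list(product(INSTRUCTION, SUBGOALS))  # four fully crossed cells

def assign_conditions(n_participants: int, seed: int = 0) -> dict:
    """Spread participants evenly across the four cells (n must be divisible by 4)."""
    assert n_participants % len(CONDITIONS) == 0
    cells = CONDITIONS * (n_participants // len(CONDITIONS))
    random.Random(seed).shuffle(cells)
    return {pid: cell for pid, cell in enumerate(cells, start=1)}

assignments = assign_conditions(64)  # e.g., the Experiment 1 sample size
```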

2.3 Procedures

Table 1 summarizes, in order, the procedures that participants completed during the experiment and some associated details.

3 Method – Experiment 2 (Rubik’s Cube)

Procedures for Experiment 2 were identical to those in Experiment 1 except for the domain (see Table 2); participants in Experiment 2 learned how to solve the first layer of the Rubik’s Cube.

4 Results and Discussion

4.1 Instruction Type Main Effects (Immediate Post-test)

A general linear model (GLM) was created to analyze how the manipulated independent variables affected immediate post-test scores in both domains. For each individual problem type as well as overall test score, the data indicated that there was generally no significant difference between productive failure and direct instruction, except for one instance that is likely a random outlier given the pattern of the other results. Table 3 outlines these results (maximum possible test score is 100%). In the realm of near-transfer test problems, it was not expected that productive failure would produce significantly better task performance than direct instruction, especially when the problems were administered immediately after learning has

Table 1. Summary of Experiment 1 procedures (cryptarithmetic)

occurred. This expectation was realized in the above results. Many of the hypothesized advantages of PF methods were expected to instead become manifest during medium- and far-transfer problems, as well as retention problems, while DI methods’ usage of isomorphic problems as practice (Clark et al. 2012) is conducive to performance on test problems that are similar to the practiced ones. The “regurgitative” nature of completing procedurally-similar problems immediately after learning increases the importance of streamlined problem-solving search processes often emphasized in DI while rendering the potentially deeper structural learning in PF relatively less useful. However, a reason that DI was not hypothesized to actually overtake PF in immediate near-transfer task performance is that PF participants tend to report greater curiosity during canonical instruction than DI participants do (Loibl and Rummel 2014b), a phenomenon that was indirectly observed in this study when participants were surveyed about the purpose of the problem-solving learning period. In the cryptarithmetic domain, PF participants (M = 95%) were significantly more likely than DI participants (M = 24%) to say that the problem-solving period was to be used for

Table 2. Summary of Experiment 2 procedures (Rubik’s Cube)

exploration (as opposed to practice and application), F(1, 43) = 43.711, MSE = 0.128, p = 0.000, partial η2 = 0.504 (mean difference = 71%); a similar pattern of results for PF (M = 100%) and DI (M = 30.8%) held in the cube domain, F(1, 49) = 54.044, MSE = 0.113, p = 0.000, partial η2 = 0.524 (mean difference = 69.1%). This question served to illuminate the mindsets of participants in the two instructional conditions and indeed revealed the exploratory approaches that PF participants tended to take. According to Loibl and Rummel (2014b), initial unguided problem-solving periods in PF help learners to identify knowledge gaps that they are then more curious about resolving later when canonical instructions are presented; DI learners are not given intrinsic reason to pay as much attention to the canonical instructions. The benefits of the extra attention paid by PF participants to canonical instructions should be particularly evident during near-transfer test problems, given that the instructions focus on those types of problems. Moreover, not only were PF learners expected to be more curious and engaged, they were also expected to be more able to appreciate critical features of the presented canonical solutions due to comparisons of the strengths and

Table 3. Post-test score differences between instruction types

weaknesses of their invented solutions and the canonical ones (Moore and Schwartz 1998). Therefore, the advantages for each method were expected to “cancel out” to some extent, and the non-significant differences between PF and DI in both domains fulfilled those expectations. Productive failure was hypothesized to produce significantly better performance in medium- and far-transfer problems, but that largely turned out not to be the case. The hypothesis was based on the notion that PF methods, just through the order of instruction, would require learners to combine heuristics and formal knowledge in ways that the “canonical instruction, then application practice” order in DI does not (Kapur and Bielaczyc 2011). This combining of various knowledge bases in PF was expected to provide learners with the resources to generate relatively wide ranges of solution methods (diSessa and Sherin 2000) due in part to the exploratory information gleaned from the initial problem-solving periods, and these different solution methods should have enabled better attempts at transfer problems that cannot be solved solely using canonical instructions. Participants in PF conditions (M = 0.594 unique solution strategies, SD = 0.837) did indeed attempt unique solution strategies more often than DI participants (M = 0.219, SD = 0.420) in cryptarithmetic, F(1, 62) = 5.131, MSE = 0.439, p = 0.027, partial η2 = 0.076 (mean difference = 0.375), and the Rubik’s Cube domain revealed similar differences between PF (M = 0.781, SD = 0.552) and DI (M = 0.375, SD = 0.492), F(1, 62) = 9.648, MSE = 0.274, p = 0.003, partial η2 = 0.135 (mean difference = 0.406). However, the use of unique strategies (those that were not explicitly explained in instructional material) apparently did not aid participants on tasks of medium and far transfer. While it still might be the case that those tasks do require novel and creative

solution methods, perhaps the participants’ invented methods were either not particularly relevant or did not enable the participants to learn deep structural information about the domain. Furthermore, deciphering the parts of a solution attempt that are generalizable, and those that are context-specific and ungeneralizable, is often difficult for novices due to a lack of experience (Patel et al. 1993), an issue that is likely magnified in PF when participants initially are relying more on their own heuristics to make assumptions about the domain.
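
The GLM analyses reported in this section amount to a 2 (instruction type) × 2 (subgoal labels) between-subjects model of test score, with partial eta squared computed per effect as SS_effect / (SS_effect + SS_error). The sketch below shows one way such a model could be fit; the column names are hypothetical placeholders rather than the study’s variable names.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def posttest_anova(df: pd.DataFrame) -> pd.DataFrame:
    """2 x 2 between-subjects ANOVA on immediate post-test score.

    Expects hypothetical columns: 'score' (0-100), 'instruction' (PF/DI),
    and 'subgoals' (labeled/non-labeled).
    """
    model = ols("score ~ C(instruction) * C(subgoals)", data=df).fit()
    table = sm.stats.anova_lm(model, typ=2)  # Type II sums of squares
    ss_error = table.loc["Residual", "sum_sq"]
    effects = table.drop(index="Residual")
    # partial eta squared per effect; the residual row is left as NaN
    table["partial_eta_sq"] = effects["sum_sq"] / (effects["sum_sq"] + ss_error)
    return table
```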

4.2 Instruction Type Main Effects (Retention Assessments)

To analyze the retention test performance dependent measure, pre-existing ability (covariate), immediate post-test score (covariate), and the independent variables were used as predictors in a GLM. No significant retention score differences were found between PF (M = 45.94%, SD = 21.62%) and DI (M = 48.62%, SD = 20.10%) in cryptarithmetic, F(1, 18) = 0.114, MSE = 376.147, p = 0.739, partial η2 = 0.006 (mean difference = 2.68%), and no significant retention score differences were found between PF (M = 63.72%, SD = 26.6%) and DI (M = 66.35%, SD = 26.6%), in Rubik’s Cube, F(1, 16) = 0.219, MSE = 171.214, p = 0.646, partial η2 = 0.014 (mean difference = 2.63%). It was hypothesized that the inherently frequent activation of prior and long-term knowledge during initial PF problem-solving would require learners to connect new material with relatively stable information that they already knew (Kapur 2008) and furthermore lead to deeper encoding and assembling of schemas (Hiebert and Grouws 2007). As a result, the learning that ensued was expected to be more enduring and less fleeting, a difference that would be most apparent on retention problems. However, when surveyed on a Likert scale (1–7, 7 = most), participants in PF (M = 4.25, SD = 2.11) did not report using significantly more prior knowledge than DI (M = 4.03, SD = 1.56) in cryptarithmetic, F(1, 62) = 0.223, MSE = 3.435, p = 0.639, partial η2 = 0.004 (mean difference = 0.22) and the differences between PF (M = 3.31, SD = 1.79) and DI (M = 3.13, SD = 1.66) were also not statistically significant in Rubik’s Cube, F(1, 62) = 0.189, MSE = 2.974, p = 0.665, partial η2 = 0.003 (mean difference = 0.188). For now, these data can inform some discussion and conclusions, but more-detailed analyses are likely needed in the future to examine, more generally, the differences in how PF and DI participants used problem-solving periods. Question prompts during problem-solving, for example, could enable researchers to more deeply study why a participant invented a particular solution strategy and whether that strategy contributed any generalizable domain knowledge through its use, or how a participant could be encouraged to activate more relevant prior and long-term knowledge. In the current experiments, given that PF methods did not prove superior to DI in terms of encouraging participants to lean more on their prior knowledge, it is then unsurprising that retention performance was about equal between the two conditions. This pattern of findings on retention performance contradicts what “desirable difficulties” research would predict (i.e., slow performance improvements early on due to difficulty designed into the instruction, but better performance later; e.g., Bjork and Bjork 2011). It was expected that PF participants surpass their DI counterparts on assessments like the retention test, which was administered one week after the material

was learned. Participants’ struggles during the PF generation period would require deeper and more durable processing to navigate (i.e., connected to prior knowledge and/or self-generated heuristics), while DI participants would be more likely to fall into a false sense of competency because the learning process is relatively easier and performance on immediate tasks improves relatively quickly (Marsh and Butler 2013). However, survey measures such as workload (via NASA TLX) revealed that PF was not an appreciably more difficult experience than DI, and in some instances was actually reported to be an easier experience. Furthermore, not all participants in PF actually failed after the initial “struggle” period, which likely means that the given tasks were not difficult enough to yield productive failures and the associated benefits: 8 of 32 cryptarithmetic participants scored 100% on the mid-point check, while 6 of 32 Rubik’s Cube participants performed likewise. Therefore, PF did not create enough desirable difficulty for participants.
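
The retention analysis described above adds pre-existing ability and the immediate post-test score as covariates. Extending the earlier sketch (again with hypothetical column names, not the study’s variable names):

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

def retention_ancova(df):
    """GLM for retention score with two covariates plus the manipulated factors."""
    model = ols(
        "retention_score ~ pre_ability + posttest_score"
        " + C(instruction) * C(subgoals)",
        data=df,
    ).fit()
    return sm.stats.anova_lm(model, typ=2)
```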

4.3 Subgoal Label Main Effects (Immediate Post-test and Retention Assessments)

Upon examining the subgoal predictor of the GLMs for immediate test and retention test performance, a pattern emerged regarding scores across domains. Table 4 summarizes the scores of participants who received subgoals (SUB) and those who received non-labeled (NL) instructions (maximum possible test score is 100%):

Table 4. Test performance with subgoal- (SUB) and non-labeled (NL) instructions

In the cryptarithmetic domain, subgoal labels appeared to make very little difference in test scores. Previous research has demonstrated that subgoal labels outline high-level information that can help learners organize domain content in meaningful ways (Atkinson et al. 2000), which theoretically should improve performance. However, it is probable that the college-educated participants did not require subgoal labels to help them organize content in a domain that is similar to algebra. According to the data, Rubik’s Cube participants were aided greatly by subgoal labels. Sweller (2010) notes that subgoals enable learners to focus just on fundamental structures of problems and not incidental features. In a domain like the Rubik’s Cube in

which participants likely do not possess much relevant experience, this generalizable information from subgoal labels is crucial so that participants do not extrapolate from concepts that might have been specific only to a given example.

4.4 Subgoal Label Main Effects (Workload)

Some evidence suggests that the subgoal labels in cryptarithmetic, if anything, served only to increase participant workload, possibly because of extra effort needed to interact with them. Tables 5 and 6 outline the workload data for both domains (maximum possible reported workload is 100%).

Table 5. Cryptarithmetic: Workload differences between SUB and NL instructions

According to Table 5, subgoals increased workload significantly in the cryptarithmetic domain. Furthermore, subgoal labels did not improve performance in cryptarithmetic, suggesting that the increased load might have been extraneous. As was stated before, it is perhaps the case that subgoal labels were not necessary in the cryptarithmetic domain due to participants’ familiarity with algebra, which could explain why participants reported subgoals as relatively taxing to interact with. Subgoals did not increase workload in the Rubik’s Cube domain, as demonstrated in Table 6. The participants likely found the Rubik’s Cube subgoal labels to be essential information and therefore did not perceive them as difficult to engage. After all, the subgoal labels improved Rubik’s Cube performance substantially. Given the relatively robust findings in previous research regarding how subgoals reduce cognitive load in learners (e.g., Morrison et al. 2015), the findings in the current experiments are surprising. In future experiments, methods of implementing subgoal

Table 6. Rubik’s Cube: Workload differences between SUB and NL instructions

labels (e.g., frequency of labeling, type of content conveyed, learner role in generation of labels) could be manipulated to examine whether workload and performance results depend on the method of labeling.
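
Workload in these studies was self-reported via the NASA TLX. One common scoring approach, the unweighted “Raw TLX,” simply averages the six subscale ratings onto a 0–100 scale; the paper does not state whether raw or weighted scoring was used, so the sketch below is illustrative only.

```python
SUBSCALES = ("mental_demand", "physical_demand", "temporal_demand",
             "performance", "effort", "frustration")

def raw_tlx(ratings: dict) -> float:
    """Unweighted composite: mean of the six subscale ratings (each on 0-100)."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

print(raw_tlx({s: 50 for s in SUBSCALES}))  # -> 50.0
```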

4.5 Interaction Between Instruction Type and Presence of Subgoal Labels

Before the experiments started, subgoal labels presented during the PF generation period were expected to mitigate the chances that learners aimlessly pursued irrelevant objectives and formed structural misconceptions, risks inherent in any minimallyguided method (Brown and Campione 1994). While subgoal labels are generally important in DI materials as well, they were expected to be relatively less so because DI participants received instruction at the start of the learning process that was at least somewhat organized whether subgoals were labeled or not, and the participants were merely applying learned knowledge during the problem-solving phase, likely using the subgoal labels just as reminders. The data suggested that no such interaction between instruction type and subgoal labeling occurred, regardless of domain or timing of test (see Table 7). Instead, a plausible explanation is that the positive effects of subgoals are robust across various methods of instruction, but not necessarily across all domains (per previous findings). After all, the key purpose of subgoal labels is helping learners recognize fundamental components of a domain (Catrambone 1998), a useful aid regardless of whether a learner is using productive failure or direct instruction. However, the extent to which that aid increases performance might depend on domain

Table 7. Interaction between instruction type and subgoal labeling (test scores)

familiarity and how easily learners can discern fundamental components on their own in that given domain. In summary, subgoal labels improved performance in the Rubik’s Cube domain, regardless of instruction type, but failed to improve performance in the cryptarithmetic domain (also regardless of instruction type). A potential future research direction could involve manipulating the scaffolding mechanism used in PF instruction to examine whether other scaffolding mechanisms are more reliable across domains (e.g., selfexplanation prompts, social discourse). Preventing learners from failing unproductively and veering too far off track is a scaffolding mechanism that has been shown to be effective in general (e.g., training wheels; Carroll and Carrithers 1984), but other methods could prove superior in particular learning contexts. A systematic examination of domains is also necessary to study how these various scaffolding mechanisms interact with domains of particular characteristics; for example, the motivational aspects of group discourse (Lin et al. 1999) could improve learning relatively substantially in inherently uninteresting domains, but not spur much improvement in domains that are inherently more interesting.

5 Conclusions

In general, PF methods in the present studies produced some ostensibly positive ancillary developments for learners (exploratory mindsets, diverse solution attempts, and occasionally lower workload). However, those ancillary developments did not lead to the ultimate goal of increasing post-test and retention test performance. This phenomenon suggests questions for further study such as whether the relevance and quality of learners’ solution attempts should be regulated somehow (perhaps through the use of scaffolding methods other than subgoals), or whether lower workload is inherently beneficial. Research in productive failure is still in its early stages and therefore much work remains to be done in improving the method itself. Potential improvements include explicit elicitation of prior domain knowledge, more meaningful subgoal labels, and group learning implementation. Replicating findings in various domains will also be an important task, given that people have access to a wider variety of information to learn than ever before, but most learning research still centers on just science- and mathematics-related domains. Some patterns of results from the current experiments changed

depending on domain, but systematic selection of domains would enable researchers to find more precisely the dimensions and characteristics of domains that drive changes in results.

References

Anthony, W.S.: Learning to discover rules by discovery. J. Educ. Psychol. 64(3), 325–328 (1973)
Atkinson, R.K., Derry, S., Renkl, A., Wortham, D.: Learning from examples: instructional principles from the worked examples research. Rev. Educ. Res. 70(2), 181–214 (2000)
Bjork, E.L., Bjork, R.A.: Making things hard on yourself, but in a good way: creating desirable difficulties to enhance learning. In: Gernsbacher, M.A., et al. (eds.) Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society. Worth Publishers, New York (2011)
Blodgett, H.C.: The effect of the introduction of reward upon the maze performance of rats. Univ. Calif. Publ. Psychol. 4, 113–134 (1929)
Bransford, J.D., Schwartz, D.L.: Rethinking transfer: a simple proposal with multiple implications. In: Iran-Nejad, A., Pearson, P.D. (eds.) Review of Research in Education, pp. 61–101. American Educational Research Association, Washington, DC (1999)
Brown, A., Campione, J.: Guided discovery in a community of learners. In: McGilly, K. (ed.) Classroom Lessons: Integrating Cognitive Theory and Classroom Practice, pp. 229–270. MIT Press, Cambridge (1994)
Carrier, M., Pashler, H.: The influence of retrieval on retention. Mem. Cogn. 20, 633–642 (1992)
Carroll, J.M., Carrithers, C.: Training wheels in a user interface. Commun. ACM 27(8), 800–806 (1984)
Catrambone, R.: The subgoal learning model: creating better examples so that students can solve novel problems. J. Exp. Psychol. Gen. 127(4), 355–376 (1998)
Chen, D.: The Role of Struggle and Productive Failure in Learner Assistance. Unpublished manuscript (2016)
Chi, M.T.H.: Self-explaining: the dual processes of generating inferences and repairing mental models. In: Glaser, R. (ed.) Advances in Instructional Psychology, pp. 161–238. Lawrence Erlbaum, Mahwah (2000)
Clark, R.E., Kirschner, P.A., Sweller, J.: Putting students on the path to learning: the case for fully guided instruction. Am. Educ. 36, 6–11 (2012)
Clement, J.: Non-formal reasoning in science: the use of analogies, extreme cases, and physical intuition. In: Voss, J.F., Perkins, D.N., Siegel, J. (eds.) Informal Reasoning and Education. Lawrence Erlbaum Associates, Hillsdale (1991)
Cope, P., Simmons, M.: Some effects of limited feedback on performance and problem-solving strategy in a logo microworld. J. Educ. Psychol. 86(3), 368–379 (1994)
diSessa, A., Hammer, D., Sherin, B., Kolpakowski, T.: Inventing graphing: children’s metarepresentational expertise. J. Math. Behav. 10(2), 117–160 (1991)
diSessa, A., Sherin, B.L.: Meta-representation: an introduction. J. Math. Behav. 19, 385–398 (2000)
Fiore, S.M., Scielzo, S., Jentsch, F., Howard, M.L.: Understanding performance and cognitive efficiency when training for x-ray security screening. In: Proceedings of the HFES 50th Annual Meeting, pp. 2610–2614. HFES, Santa Monica (2006)
Gorman, J., Cooke, N., Amazeen, P.: Training adaptive teams. Hum. Factors 52, 295–307 (2010)
Hiebert, J., Grouws, D.: The effects of classroom mathematics teaching on students’ learning. In: Lester, F.K. (ed.) 2nd Handbook of Research on Mathematics Teaching and Learning, pp. 371–404. Information Age, Charlotte (2007)
Jonassen, D.: Objectivism vs. constructivism. Educ. Technol. Res. Dev. 39(3), 5–14 (1991)
Kapur, M.: Productive failure. Cogn. Instr. 26(3), 379–424 (2008)
Kapur, M.: A further study of productive failure in mathematical problem solving: unpacking the design components. Instr. Sci. 39, 561–579 (2011)
Kapur, M., Bielaczyc, K.: Classroom-based experiments in productive failure. In: Carlson, L., Holscher, C., Shipley, T. (eds.) Proceedings of the 33rd Annual Conference of the Cognitive Science Society, pp. 2812–2817. Cognitive Science Society, Austin (2011)
Kapur, M., Dickson, L., Yhing, T.P.: Productive failure in mathematical problem solving. Instr. Sci. 38(6), 523–550 (2010)
Kapur, M., Lee, K.: Designing for productive failure in mathematical problem solving. In: Proceedings of the 31st Annual Conference of the Cognitive Science Society, pp. 2632–2637. Cognitive Science Society, Austin (2009)
Kerr, R., Booth, B.: Specific and varied practice of a motor skill. Percept. Mot. Ski. 46, 395–401 (1978)
Kirschner, P.A., Sweller, J., Clark, R.E.: Why minimal guidance during instruction does not work: an analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based learning. Educ. Psychol. 41(2), 75–86 (2006)
Koedinger, K.R., Aleven, V.: Exploring the assistance dilemma in experiments with cognitive tutors. Educ. Psychol. Rev. 19, 239–264 (2007)
Kornell, N., Bjork, R.A.: A stability bias in human memory: overestimating remembering and underestimating learning. J. Exp. Psychol. Gen. 138, 449–468 (2009)
Kulhavy, R.W., Stock, W.A.: Feedback in written instruction: the place of response certitude. Educ. Psychol. Rev. 1(4), 279–307 (1989)
Lin, X., Hmelo, C., Kinzer, C.K., Secules, T.J.: Designing technology to support reflection. Educ. Technol. Res. Dev. 47(3), 43–62 (1999)
Loibl, K., Rummel, N.: The impact of guidance during problem-solving prior to instruction on students’ inventions and learning outcomes. Instr. Sci. 42, 305–326 (2014a)
Loibl, K., Rummel, N.: Knowing what you don’t know makes failure productive. Learn. Instr. 34, 74–85 (2014b)
Marsh, E.J., Butler, A.C.: Memory in educational settings. In: Reisberg, D. (ed.) The Oxford Handbook of Cognitive Psychology, pp. 299–317. Oxford University Press (2013)
Mathan, S., Koedinger, K.R.: Recasting the feedback debate: benefits of tutoring error detection and correction skills. In: Hoppe, H.U., et al. (eds.) Artificial Intelligence in Education, pp. 13–20. IOS Press (2003)
Moore, J.L., Schwartz, D.L.: On learning the relationship between quantitative properties and symbolic representations. In: Proceedings of the International Conference of the Learning Sciences, pp. 209–214. Erlbaum, Mahwah (1998)
Morrison, B.B., Margulieux, L.E., Guzdial, M.: Subgoals, context, and worked examples in learning computing problem solving. In: Proceedings of the 11th Annual International Conference on International Computing Education Research, pp. 21–29. ACM, New York (2015)
Patel, V.L., Groen, G.J., Norman, G.R.: Reasoning and instruction in medical curricula. Cogn. Instr. 10, 335–378 (1993)
Rourke, A., Sweller, J.: The worked-example effect using ill-defined problems: learning to recognize designers’ styles. Learn. Instr. 19, 185–199 (2009)
Schank, R.: Virtual Learning: A Revolutionary Approach to Building a Highly-Skilled Workforce. McGraw-Hill, New York (1997)
Schmidt, R.A., Bjork, R.A.: New conceptualizations of practice: common principles in three paradigms suggest new concepts for training. Psychol. Sci. 3(4), 207–217 (1992)
Schwartz, D.L., Martin, T.: Inventing to prepare for future learning: the hidden efficiency of encouraging original student production in statistics instruction. Cogn. Instr. 22(2), 129–184 (2004)
Shea, J.B., Morgan, R.L.: Contextual interference effects on the acquisition, retention, and transfer of a motor skill. J. Exp. Psychol.: Hum. Learn. Mem. 5, 179–187 (1979)
Siegler, R.S.: Microgenetic studies of self-explanation. In: Garnott, N., Parziale, J. (eds.) Microdevelopment: A Process-Oriented Perspective for Studying Development and Learning, pp. 31–58. Cambridge University Press, Cambridge (2002)
Soderstrom, N.C., Bjork, R.A.: Learning versus performance: an integrative review. Perspect. Psychol. Sci. 10(2), 176–199 (2015)
Sweller, J.: Cognitive load during problem solving: effects on learning. Cogn. Sci. 12, 257–285 (1988)
Sweller, J.: Element interactivity and intrinsic, extraneous, and germane cognitive load. Educ. Psychol. Rev. 22(2), 123–138 (2010)
Sweller, J., Mawer, R., Howe, W.: The consequences of history-cued and means-ends strategies in problem solving. Am. J. Psychol. 95, 455–484 (1982)
VanLehn, K., Siler, S., Murray, C., Yamauchi, T., Baggett, W.B.: Why do only some events cause learning during human tutoring? Cogn. Instr. 21(3), 209–249 (2003)
Wineburg, S.S., Fournier, J.E.: Contextualized thinking in history. In: Carretero, M., Voss, J.F. (eds.) Cognitive and Instructional Processes in History and the Social Sciences, pp. 285–308. Erlbaum, Hillsdale (1994)
Young, M.D., Healy, A.F., Gonzalez, C., Dutt, V., Bourne, L.E.: Effects of training with added difficulties on RADAR detection. Appl. Cogn. Psychol. 25, 395–407 (2011)

Adaptation and Pedagogy at the Collective Level: Recommendations for Adaptive Instructional Systems

Benjamin Goldberg

U.S. Army Combat Capability Development Command – Soldier Center, Simulation and Training Technology Center, Orlando, FL 32826, USA
[email protected]

Abstract. Adaptive training and intelligent tutoring functions provide mechanisms for balancing coaching and challenge during a practice scenario to maximize performance outcomes. Extending these practices into team and collective domain environments requires an understanding of the pedagogical requirements an Adaptive Instructional System (AIS) must account for in design. This includes identifying what pedagogical functions a system must support and how those functions apply within foundations of knowledge/skill acquisition and team development. In this paper, we present a high-level taxonomy of feedback and adaptation types for application in a team-based AIS. The taxonomy is designed to serve as a foundation of pedagogical activities around which policies can be built based on real-time monitoring and demographic information.

Keywords: Adaptive Instructional Systems · Team tutoring · Pedagogy · Feedback · Adaptation



1 Introduction

Adaptive Instructional Systems (AIS) are designed to accelerate knowledge and skill acquisition through mediated experiences that are managed by algorithms and processes grounded in learning science and cognitive psychology. The tenets of AISs are based on effective instructional methodologies that are captured in Artificial Intelligence (AI) modeling techniques for tracking skill acquisition and guiding system-level pedagogical decisions. What is of immediate importance in the current training climate is extending these methods to account for team and collective structures across an array of roles and missions linked to Army objectives. With a modernization strategy underway to update the Army's simulation-based training technology base, influencing the learning science component of these maturing applications early on is critical. This involves establishing research-informed best practices for managing real-time coaching and adaptation across an array of team formations and team-of-teams structures. From a pedagogical perspective, Vygotsky's Zone of Proximal Development (ZPD) [1] provides a useful theoretical foundation that highlights the interplay between guidance and adaptation when managing an educational interaction (see Fig. 1), with its simplicity holding merit when also considering team formations. As an example, when a learner's or team's ability does not match the complexity of a problem (e.g., errors are present or there is a lack of understanding of what to do next), an initial approach would focus on feedback and coaching to resolve that impasse. A system will monitor interaction and assess performance for the purpose of diagnosis. With a representation of the domain, feedback can target specific Knowledge, Skills, and Abilities (KSAs) for the purpose of correcting mental models and influencing procedural and process changes.

Fig. 1. The Zone of Proximal Development’s teaching strategy [1].

When feedback fails to improve the observed/modeled deficiency, adapting the pedagogical approach (e.g., engaging a worked example, modifying scenario complexity, restarting the task/scenario) provides a mechanism to maintain learner engagement in an effort to prevent frustration and improve understanding. Alternatively, if the scenario complexity is below the ability of the interacting units, the system should have mechanisms to increase challenge for the purpose of maintaining desirable difficulties [2]. It is through these mechanisms that an instructor will adapt training to better meet the needs of the interacting party in an effort to maximize learning outcomes (i.e., time to proficiency, retention of skill, transfer of skill, etc.). The mechanisms by which to engage these instructional interventions are well researched in the AIS community [3, 4]; however, most of the literature examines performance at the individual level and within well-defined domain spaces. While informing pedagogical strategies requires robust modeling techniques to track learner performance and competency (i.e., learner/team modeling), there is also a requirement to establish instructional design paradigms that guide the creation and configuration of context-specific instructional injects that manage the experience for the purpose of optimizing outcomes. In this paper, we discuss the design implications of extending AIS functions into team and collective training environments, with a focus on system-level pedagogical supports. The impetus for this research stems from the U.S. Army's modernization strategy called the Synthetic Training Environment (STE).

2 Learning Science and the Synthetic Training Environment

The STE is a large undertaking that aims to modernize the Army's current capability sets across Live, Virtual and Constructive (LVC) simulation-based training methods to support collective training. The overarching objective is to leverage advancements across industry and the government's technology base to provide a modern training solution to today's soldier, one that incorporates immersive interactions with high-fidelity realism to train critical skill sets across the echelon structures within the Army's operational force. The resulting STE will provide the Army with a mechanism to rapidly simulate numerous "bloodless" battles in an effort to optimize force structure performance through exposure to realistic battle drills and operational dynamics in a multi-domain battlespace [5]. There are multiple functional capabilities required to support the maturation of the STE. The one we focus on here deals with the instructional components and learning science driving the design and application of the training.

2.1 Training Management Tools and Team-Based AIS

The STE subcomponent called Training Management Tools (TMT) comprises a set of technologies that assist in establishing relevant training content, managing the execution of that content, and organizing an After Action Review (AAR) following interaction with that content. To organize these functions, TMT supports activities in the: (1) planning, (2) preparation, (3) execution, and (4) assessment/review phases of an STE training event. While most legacy training systems in the Army require human Observer Controllers (OC) to manage assessment and guide coaching, one capability TMT is conceptualized to offer for STE is intelligent tutoring by way of AIS methods. As a starting point for research, the AIS functions for the TMT baseline leverage the most mature components native to the U.S. Army's Generalized Intelligent Framework for Tutoring (GIFT). GIFT was designed as a set of de facto best practices for authoring AIS content within a domain-independent architecture [6]. Initial GIFT development focused on individual domains, with successful applications designed for training: (1) care under fire procedures for combat medics [7], (2) land navigation fundamentals [8], (3) basic rifle marksmanship [9], and (4) enhanced situational awareness within the context of counter-insurgency [10]. While the initial focus was on establishing workflows in individualized domain spaces to support a more ubiquitous approach to authoring, teams and collective environments were not ignored. These included efforts examining the technical requirements at the architectural level associated with intelligent tutoring for collective teams [11, 12], and defining a theoretical construct by which to guide measurement design that will inform tutor decisions (i.e., defining the dimensions of teamwork and establishing behavioral markers by which to measure those dimensions [13]). At the current state of research, the authors offer two salient observations related to team-based tutoring: (1) while there is a mature understanding of what makes


an effective team and how to measure markers of performance [13], automating and generalizing those measures beyond simplistic go/no-go rule determinations will require robust research methods in a mature STE-endorsed environment, and (2) there is little understanding of what a team-focused intelligent tutor should objectively do during run-time at the pedagogical level. While there have been some initial thought papers looking at how the tenets of learning science [14] and sports psychology [15] can influence an initial set of pedagogical policies, further work is required to translate those recommendations into a schema that functions within the TMT baseline. Before we discuss the instructional decision points and implications for intelligent tutoring in STE, it is important to define the activities and workflows that are currently in place to establish these capabilities. In the following sub-sections, we define how the organizing TMT activities (i.e., plan, prepare, execute, and assess) associate with AIS requirements. It is important to note that the following descriptions are explicitly represented as potential workflows for building AIS content in STE as informed by the GIFT architecture, and do not represent the other TMT functions that serve different aspects of STE exercise development.

Plan/Preparation. The plan and preparation activities involve defining and configuring all the AIS components that are required for real-time measurement/assessment and pedagogical injects. An objective of the STE is to automate as much of this process as possible, but it is still important to document the various dependencies that will require configuration in one way or another prior to training execution. We separate plan and preparation components, as we believe there are varying levels of background and technical expertise required to support those workflows. As technology matures, the goal is to provide tools and methods that promote a sustainable AIS training environment maintained by people who know the domain and actually use the system, rather than by contractors and support personnel required for preparation purposes. To achieve this, not only do we need robust mechanisms to assess performance and infer competency, we need robust and intuitive authoring tools to support rapid configuration of these functions across roles, scenarios, and environments.

Planning Activities. In the plan portion of AIS implementation, performing front-end analyses is critical. This involves: (1) representing a set of training objectives that a scenario will target, (2) storyboarding a scenario with events and triggers based on prescribed tasks and conditions of a defined training objective, (3) deconstructing the training objectives into the KSAs that are required to meet performance standards, (4) establishing which KSAs associate with the sequenced events defined in the storyboard, (5) establishing criteria thresholds for task standards based on representative conditions, and (6) defining pedagogical strategy functions at both the task and KSA level for use in real time to influence training. Each of the activities listed above assists in building a requirements list for use during system preparation (i.e., configuration). The goal is to elicit the necessary information from Subject Matter Experts (SMEs) through structured task analyses that are designed to link the workflow above with schema structures in an established AIS framework.
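To make the output of these planning activities concrete, the sketch below (in Python for brevity) shows one hypothetical way the six products could be captured as structured data prior to preparation. The field names and values are illustrative assumptions of ours and do not reflect the actual schemas used by GIFT or the TMT baseline.

# Hypothetical planning-phase record; all names and values are illustrative only and do
# not correspond to GIFT's or the TMT baseline's actual schemas.
training_objective = {
    "id": "TO-01",
    "description": "React to contact as a squad",
    # (3) KSAs required to meet the performance standard
    "ksas": ["report_contact", "select_covered_position", "maintain_360_security"],
    # (2, 4) storyboarded events/triggers and the KSAs they exercise
    "storyboard_events": [
        {"event": "ambush_initiated", "trigger": "opfor_first_shot",
         "linked_ksas": ["report_contact", "select_covered_position"]},
        {"event": "consolidate", "trigger": "objective_secured",
         "linked_ksas": ["maintain_360_security"]},
    ],
    # (5) criteria thresholds for task standards under representative conditions
    "standards": {"report_contact": {"max_seconds": 30},
                  "select_covered_position": {"max_errors": 1}},
    # (6) pedagogical strategy hooks at the task/KSA level
    "strategies": {"below_standard": "provide_feedback",
                   "repeated_failure": "adapt_scenario"},
}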
The front-end analyses are important as they specify conditions for use during preparation activities. Depending on the assessment techniques (e.g., decision trees, Bayesian nets, Markov Decision Processes, etc.), the training environment must be modeled across a set of states that a trainee and/or unit may experience based on their underlying actions. This involves documenting all related tasks/events and associated triggers by building references in the environment that can be used during AIS configuration (e.g., waypoints, paths, zones of interest, objects, etc.). Until AI methods provide automated processes to support the six defined tasks above, there need to be dedicated authoring tools and generalized methods for producing this content across any STE-focused event, regardless of the terrain and echelon structure. This dependency will be discussed in further detail after the following sub-sections.

Preparation Activities. Following the planning phase, preparation activities are initiated to configure the associated intelligent tutor modules. This involves establishing a gateway (i.e., defining dataflow specifications), configuring assessments around the set of tasks storyboarded during planning, and building instructional interventions that can be triggered based on the established assessments. A critical component of the success of this concept is a library of generalizable measures that can be referenced during this portion of AIS development. If a measure does not exist, it is up to the interacting party to create the new measure in source and recompile the code. Through this mechanism, there exists a community approach to measurement, where techniques and condition classes can be shared and repurposed across numerous contexts. In an ideal situation, tools and methods are available so that the same individuals involved in the plan phase can support these AIS preparation activities. However, AIS scenario preparation can involve complex technical tasks that require a deep understanding of the underlying architecture and its inherent dependencies when authoring automated assessments. To reduce this complexity, prior research has investigated the impact of overlay authoring functions for map-oriented simulation environments on authoring AIS logic (see Fig. 2), with results showing significant reductions in both errors performed and time to author [16].
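For illustration, the sketch below shows the kind of reusable, zone-referenced condition class such a measure library might contain. The class name, interface, and values are our own assumptions and are not how GIFT's actual condition classes are implemented.

# A minimal, hypothetical illustration of a reusable condition class; the class name and
# interface are assumptions and do not mirror GIFT's actual measure implementations.
from dataclasses import dataclass

@dataclass
class Zone:
    """Axis-aligned rectangular area of interest, e.g., defined with an overlay tool."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


class AvoidAreaCondition:
    """Flags a violation whenever a tracked entity enters the referenced zone."""

    def __init__(self, zone: Zone):
        self.zone = zone
        self.violations = []  # (timestamp, entity_id) pairs for real-time feedback or AAR

    def update(self, timestamp: float, entity_id: str, x: float, y: float) -> bool:
        if self.zone.contains(x, y):
            self.violations.append((timestamp, entity_id))
            return True  # below-standard assessment for this check
        return False


# Example usage: in practice the zone reference would come from the overlay authoring
# tool rather than being hard-coded.
kill_zone = Zone("vulnerable_draw", x_min=120.0, y_min=40.0, x_max=180.0, y_max=95.0)
condition = AvoidAreaCondition(kill_zone)
condition.update(timestamp=312.5, entity_id="rifleman_2", x=150.0, y=60.0)  # records a violation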

Fig. 2. AIS overlay authoring functions for establishing contextualized reference objects


The current tool provides a mechanism to quickly build points of interest, areas of interest, and paths of interest that can be referenced as contextual anchors when configuring measures. The approach is extensible and applies across any map-based simulation, with current examples in live, virtual, and constructive environments. With contextual anchors, the author can easily populate measures that require references from the environment on which to base performance thresholds (e.g., designating a vulnerable area during a battle drill by using the 'AvoidArea' condition class with the zone established through the overlay tool). This stealth assessment captures specific interaction patterns in the environment and provides temporal associations when violations are observed, either for real-time feedback or for logging for AAR purposes. Once assessments are configured across a set of tasks and conditions, preparation activities are required to configure pedagogical tactics that can be executed if AIS conditions are met. This is the critical gap in team-based tutoring that is not well researched in the AIS community. Understanding how to measure an environment and infer performance is one thing; understanding how to use those measures to drive automated pedagogical decisions is another. In an effort to maintain generalizability, it is important to represent these system actions in an abstract way that can translate across tasks, domains, and environments. In the current AIS TMT baseline leveraging the GIFT architecture, there are three pedagogical models that an author can reference. Each approach has dependencies on the assessment modeling technique, with the assessment outputs driving pedagogical logic. These three models include:

• State Transition Model: bases pedagogical decisions on observed shifts in performance at a concept-by-concept level, with four supported actions (provide guidance, adapt scenario, ask question, and do nothing).
• Trend and Competency Model: examines performance over time and applies algorithms to determine the focus of coaching and remediation based on task model priorities and trends. Applies the same supported actions with variation in pedagogical reasoning.
• ICAP-inspired (Interactive, Constructive, Active, Passive) Model (based on Chi's [17] learning activity framework): formalized using Markov Decision Processes and incorporating a reinforcement learning backend [18]. Policies determine remediation interactivity based on demographics and observed patterns in performance to optimize reward functions.

Regardless of the model, content is needed. In addition, the more adaptive the system, the more content is required. Automating the construction of feedback and remedial activities is another challenge research aims to address, but there has not been much success outside the application of "hint factories" [19]. As a starting point, there is a need to establish an initial team-focused pedagogical activity model that accounts for individual and team states in a generalizable form. It should highlight the audience and intended function of each activity, with the goal of establishing policies across the activities to determine pedagogical practice based on data-driven methods. We focus on the pedagogical activity model component of the problem space later in the paper.
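As a rough illustration of the first of these models, the sketch below encodes a state-transition style policy over a simple three-value, per-concept performance state. The transition table, state labels, and function names are assumptions made for illustration and are not the policies used in GIFT or the TMT baseline.

# Hypothetical state-transition pedagogical policy over per-concept performance states.
ACTIONS = ("provide_guidance", "adapt_scenario", "ask_question", "do_nothing")

# (previous_state, new_state) -> action
TRANSITION_POLICY = {
    ("at_expectation", "below_expectation"): "provide_guidance",
    ("below_expectation", "below_expectation"): "adapt_scenario",   # guidance did not resolve the impasse
    ("below_expectation", "at_expectation"): "do_nothing",
    ("at_expectation", "above_expectation"): "ask_question",        # probe understanding before adding challenge
    ("above_expectation", "above_expectation"): "adapt_scenario",   # raise difficulty to maintain desirable difficulty
}

def select_action(previous_state: str, new_state: str) -> str:
    return TRANSITION_POLICY.get((previous_state, new_state), "do_nothing")

print(select_action("at_expectation", "below_expectation"))  # provide_guidance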


Execution. After the preparation activities are complete, an AIS-configured STE scenario is ready for execution. The assessments and pedagogical logic have been established around the planned storyboard conceived in the front-end analysis. The intended training audience is now ready to interact, with the underlying assessments in place to guide training when deficiencies in performance are identified. In this instance, procedures are required to initialize the appropriate AIS modules to support the Adaptive Tutoring Learning Effect Chain oriented for team and collective training structures (ATLEC, see Fig. 3) [20]. The ATLEC shows the process of using learner interaction data across a team structure to inform performance states at the individual and team level for guiding instructional decisions. Interfacing technology for observer controllers to provide injects into the ATLEC in real time is planned, with their inputs linked to dedicated nodes within the task schema where automated measures are not supported and/or feasible. The critical component here is having the required instructional context in place before an adaptive intervention can be selected, with those anchors accounted for in the planning and preparation phases.

Fig. 3. ATLEC model for teams [20]

While the resulting AIS can operate in an automated closed-loop capacity, the platform is designed to support human-in-the-loop decisions at both the assessment and pedagogical level. In this instance, an OC uses the AIS technology to track/insert assessments and to manipulate the environment with adaptive injects and feedback. The goal is to reduce the workload on the OCs and provide access to direct manipulation and guidance functions that can be automatically carried out by the system, along with recommendations based on the underlying pedagogical policies. Assessment/Review. Assessment and review activities can be differentiated across two categories: (1) assessment and review for the training audience through After Action Review (AAR) activities intelligently informed from observations collected during the execution portion, and (2) assessment and review of the system-level interventions performed during the execution phase and their resulting impact on measurable performance and objective outcomes, with reinforcement learning methods applied where feasible. The former assessment and review activities apply AIS methods for facilitating a guided AAR with reference points and remediation materials to address deficiencies recognized at both the individual role and team context, while the latter institutes AI methods to support a self-optimizing system that modifies


pedagogical policies based on evidence-based methods. While AAR is a critical component of team and collective training, the focus of this paper is AIS functions at the execution level.

3 Adaptive Pedagogy at the Team Level

With an understanding of the development phases that go into establishing AIS functions for teams, we spend the remainder of the paper discussing an initial pedagogical activity framework based around targeted feedback and adaptation strategies. The roles of feedback and adaptation are discussed within the context of AIS, followed by considerations related to the role of each pedagogical function and how they associate across individual, team-lead, and whole-team (i.e., global) structures.

3.1 Role of Feedback and Adaptation

The power of AISs lies in using sophisticated modeling techniques to understand an individual's and/or team's progress toward a task objective for the purpose of guiding that experience. While the measurement and assessment functions of an AIS drive the underlying selection of a pedagogical strategy, the feedback and adaptive functions are the interfacing components a trainee will experience, making them a critical capability need that research should address. In the context of instruction and training, feedback is credited as fundamental to efficient knowledge transfer [21–23]. According to Narciss [24], feedback in the learning context is provided by an external source of information not directly perceivable during task execution, and is used as a means for comparing performance outcomes with desired end states. This facilitation is useful for multiple purposes. Feedback: (1) can often motivate higher levels of effort based on current performance compared to desired performance [25]; (2) reduces uncertainty about how well an individual is performing on a task [26]; and (3) is useful for correcting misconceptions and errors when executing inappropriate strategies [27]. Understanding this interaction space in the context of teams is important. Teams can be composed of individuals from various backgrounds and with a wide array of personality traits. Managing motivation while providing coaching becomes critical, as one disengaged member of the team can impact the effectiveness of the training. These nuances should be accounted for when designing feedback policies. In addition, the game-based training platform itself must have mechanisms for reacting to state and performance measures in real time. These mechanisms are designed to impact scenario storylines and force objective reactions by those training. In-game adaptations should provide the ability to adjust difficulty levels based on inferred individual and team performance, adjust the pace and flow of guidance and coaching strategies, and deliver cues in the virtual environment that may act as a form of scenario-specific feedback. In the following subsections, we present feedback and adaptation taxonomies that can guide the development of a team-based pedagogical model that is represented domain-agnostically.


Feedback Target Taxonomy. Feedback should be structured around assessment characteristics and the target of coaching. There should be explicit representations in the domain model that inform the feedback target, but this needs to be accomplished while maintaining a flexible ontological schema. In the current TMT adaptive baseline, scenarios are represented as a series of tasks that have associated conditions and standards based on the context of the environment. Now that teams are being represented in the domain model, tagging task concepts and their associated assessments with metadata can support differentiation of feedback at levels within the team structure. As a starting point, we propose an initial feedback taxonomy that will guide pedagogical model development (see Table 1). The taxonomy establishes mechanisms at three levels of interaction (individual, team lead, and global), two levels of valence (positive and negative), and two levels of timing (real-time [RT] and AAR; the number inside parentheses represents the number of observed errors for an associated concept assessment). This approach supports a simplified representation that puts bounds on required content, thus reducing development time at the SME/instructor level. Before proceeding to system adaptations, there are a few dependencies to recognize with this approach: (1) task-specific concept assessments can be tagged at the process or procedure level, (2) assessment mechanisms exist (either automated or human-informed) for natural language and communication processes, and (3) checkpoints are specified that enable a battle update briefing, with assessment classes in place to manage performance states at the team task level.

Adaptation Target Taxonomy. When it comes to system-level adaptations, those building AIS components are limited to the adaptive functions their integrated environment supports. However, there are common functions, depending on the simulation engine, that enable direct manipulation of actors, objects, and scenario variables in real time. This includes adding/removing/relocating entities and non-player characters, teleporting interacting learners to any map location, adding building and environment features, adjusting time of day and weather, etc. While there are numerous ways to adapt a scenario in real time, there needs to be a learning-science-informed workflow to assist in their configuration and execution. In an effort to maintain simplicity, the adaptation taxonomy has limited dimensions. There are two levels of complexity (increase and decrease) and two levels of progress-check decisions (continue and restart). Increasing and decreasing complexity can be any combination of adaptations; what is important is having pre-established sets that the system can act on automatically and in real time. Progress-check decisions involve either allowing a team to continue on in the mission or having the system restart the scenario due to observed critical errors. In this instance, SMEs will need to identify tasks that have designated performance criteria deemed critical at the task level. This enables SMEs to designate specific sub-tasks that require more focused training than others. Getting SME input to inform these decisions is critical. Next, there needs to be an agent or set of policies that manages the interplay between feedback and adaptation.
While feedback can be delivered at the individual level, system adaptations will impact the scenario and team tasking at large. As a starting point, individual role and team-lead performance on tasks should not directly lead to system adaptations as errors are performed.


Table 1. Team interaction feedback taxonomy for AIS (N: Novice, J: Journeyman, E: Expert; number in parentheses represents number of errors observed across the task/concept structure)

Audience | Feedback target | Val. | Description | Timing: RT | Timing: AAR
Individual | Procedure | + | Confirm/reinforce proper role execution | N, J(1) | J(2), E
Individual | Procedure | − | Correct error and incorrect step within specified task procedure | N, J(1) | J(2), E
Individual | Process | + | Task-level performance confirmation across proper sequencing of procedures | N, J(1) | J(2), E
Individual | Process | − | Correct procedures, but incorrect sequence to perform task | N, J(2) | J(1), E
Team lead | Coordination | + | Tactical communication focused. Proper execution. Aim to influence leadership | N, J(2) | J(1), E
Team lead | Coordination | − | Poor tactical communication. Specify feedback around storyboard event and communication timeline | N, J(2) | J(1), E
Team lead | Communication | + | Situational awareness focused. Information sharing across team structure but communicated to team lead. Influence leadership performance | N, J(2) | J(1), E
Team lead | Communication | − | Poor team sharing of information. Specify feedback around storyboard event and salient clue requiring reporting | N, J(2) | J(1), E
Global (everyone) | Progress | + | Objective performance update at specified scenario checkpoint based on aggregated task-level assessments | N, J | E
Global (everyone) | Progress | − | Error summary across tasks at team and role level performed prior to checkpoint | N, J | E
Global (everyone) | Shared cognition | + | Aggregate of communication and performance. Represent information/skill network for team (who knows/does what) | N, J | E
Global (everyone) | Shared cognition | − | Poor representation of team information/skill network across scenario timeline | N, J | E
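To suggest how this taxonomy might be operationalized, the sketch below keys feedback templates by audience, target, and valence and applies a single simplified timing rule. Table 1 varies the error-count split by row, so the rule, templates, and names here are illustrative assumptions only, not production content.

# Hypothetical feedback selection keyed by the Table 1 taxonomy dimensions.
FEEDBACK_TEMPLATES = {
    ("individual", "procedure", "-"): "Correct the error in step {step} of {task}.",
    ("individual", "process", "-"): "Procedures were correct, but the sequence for {task} was not.",
    ("team_lead", "coordination", "-"): "Review the communication timeline around {event}.",
    ("global", "progress", "+"): "Checkpoint reached: {summary}",
}

def feedback_timing(expertise: str, error_count: int) -> str:
    """Return 'RT' (real-time) or 'AAR'; a single simplified reading of the Table 1 split."""
    if expertise == "novice":
        return "RT"
    if expertise == "journeyman":
        return "RT" if error_count <= 1 else "AAR"
    return "AAR"  # experts: defer coaching to the AAR

msg = FEEDBACK_TEMPLATES[("individual", "procedure", "-")].format(step=3, task="call for fire")
print(feedback_timing("journeyman", 2), msg)  # AAR Correct the error in step 3 of call for fire.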


Feedback should be provided to correct errors based on the associations in Table 1, but the mission continues until an observed checkpoint is registered in the AIS. At that point, performance is aggregated across the interacting team structures, and adaptation-level criteria determine whether complexity should be adjusted and whether the scenario continues. This provides discrete time markers that enable dynamic elements in the interacting environment.
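A minimal sketch of such a checkpoint policy is shown below, assuming task scores aggregated to a 0–1 scale and adding a neutral band in which complexity is left unchanged. The thresholds and aggregation rule are placeholder assumptions, not values prescribed here.

# Hypothetical checkpoint-based adaptation decision; thresholds are placeholders.
def checkpoint_decision(task_scores: dict, critical_failures: int,
                        low: float = 0.6, high: float = 0.9) -> tuple:
    """Aggregate team task scores (0-1) at a checkpoint and return
    (complexity_adjustment, progress_decision)."""
    mean_score = sum(task_scores.values()) / len(task_scores)

    if critical_failures > 0:
        return ("decrease", "restart")      # critical task standards were violated
    if mean_score < low:
        return ("decrease", "continue")     # reduce challenge, keep the mission going
    if mean_score > high:
        return ("increase", "continue")     # raise challenge to maintain desirable difficulty
    return ("maintain", "continue")

print(checkpoint_decision({"react_to_contact": 0.95, "report": 0.92}, critical_failures=0))
# ('increase', 'continue')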

4 Conclusion

Team-based adaptive training is a desired capability in the future Army STE. To provide an effective solution to the soldier, research is required to determine what pedagogical interactions should be supported in these environments and what influence they have on the learning and team development process. In this paper, we presented an initial taxonomy for both feedback and adaptation in the context of team AIS across individual roles, team leads, and global associations. The taxonomy provides a starting point for developing requirements for a pedagogical model that is based on policies surrounding the pedagogical activity dependencies. The leading dependency is having a robust assessment capability to support the pedagogical reasoning described above, which is its own research vector, requiring measures of taskwork, teamwork, and communication. Next steps will involve adapting the GIFT task model structure to include the assessment metadata requirements and the feedback and adaptation schemas.

References

1. Vygotsky, L.S.: Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge, MA (1978)
2. Bjork, E.L., Bjork, R.A.: Making things hard on yourself, but in a good way: creating desirable difficulties to enhance learning. In: Gernsbacher, M.A., Pew, R.W., Hough, L.M., Pomerantz, J.R. (eds.) Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society, pp. 56–64. Worth, New York (2011)
3. Shute, V.J.: Focus on Formative Feedback. Educational Testing Service, Princeton, NJ (2007)
4. Sottilare, R.A., Graesser, A., Hu, X., Goldberg, B.: Design Recommendations for Intelligent Tutoring Systems: Volume 2 – Instructional Management. US Army Research Laboratory, Orlando, FL (2014)
5. Scales, B.: Virtual immersion training: bloodless battles for small unit readiness. Army Mag. 63(7), 24–27 (2013)
6. Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The generalized intelligent framework for tutoring (GIFT). U.S. Army Research Laboratory, Aberdeen Proving Ground, MD (2012)
7. Goldberg, B., Cannon-Bowers, J.: Feedback source modality effects on training outcomes in a serious game: pedagogical agents make a difference. Comput. Hum. Behav. 52, 1–11 (2015)


8. Goldberg, B., Roberts, N., Powell, W.G., Burmester, E.: Intelligent tutoring in the wild: leveraging mobile app technology to guide live training. In: Proceedings of the Defense and Homeland Security Simulation (DHSS) Workshop, Budapest, Hungary (2018)
9. Goldberg, B., Amburn, C., Ragusa, C., Chen, D.: Modeling expert behavior in support of an adaptive psychomotor training environment: a marksmanship use case. Int. J. Artif. Intell. Educ. 28(2), 194–224 (2018)
10. Rajendran, R., Mohammed, N., Biswas, G., Goldberg, B., Sottilare, R.A.: Multi-level user modeling in GIFT to support complex learning tasks. In: Proceedings of the 5th Annual Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym5), Orlando, FL (2017)
11. Gilbert, S., et al.: Creating a team tutor using GIFT. Int. J. Artif. Intell. Educ. 28, 286–313 (2018)
12. McCormack, R.K., Kilcullen, T., Sinatra, A.M., Brown, T., Beaubien, J.M.: Scenarios for training teamwork skills in virtual environments with GIFT. In: Proceedings of the 6th Annual GIFT Symposium, Orlando, FL (2018)
13. Sottilare, R.A., Burke, C.S., Salas, E., Sinatra, A.M., Johnston, J., Gilbert, S.B.: Designing adaptive instruction for teams: a meta-analysis. Int. J. Artif. Intell. Educ. 28(2), 225–264 (2018)
14. Johnston, J., Burke, S., Milham, L., Ross, W., Salas, E.: Challenges and propositions for developing effective team training with adaptive tutors. In: Building Intelligent Tutoring Systems for Teams, pp. 75–97. Emerald Publishing Limited (2018)
15. Goldberg, B., Nye, B., Lane, H.C., Guadagnoli, M.: Team assessment and pedagogy as informed by sports coaching and assessment. In: Sottilare, R., Graesser, A., Hu, X., Sinatra, A. (eds.) Design Recommendations for Intelligent Tutoring Systems, vol. 4: Team Tutoring. U.S. Army Research Laboratory (2018)
16. Davis, F., Riley, J., Goldberg, B.: Iterative development of the GIFT wrap authoring tool. Paper presented at the Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym6), Orlando, FL (2018)
17. Chi, M.T.H.: Active-constructive-interactive: a conceptual framework for differentiating learning activities. Top. Cogn. Sci. 1, 73–105 (2009)
18. Rowe, J., Spain, R., Pokorny, B., Mott, B., Goldberg, B., Lester, J.: Design and development of an adaptive hypermedia-based course for counterinsurgency training in GIFT: opportunities and lessons learned. In: Proceedings of the 6th Annual Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym6), Orlando, FL (2018)
19. Koedinger, K.R., Brunskill, E., de Baker, R.S.J., McLaughlin, E.A., Stamper, J.C.: New potentials for data-driven intelligent tutoring system development and optimization. AI Mag. 34(3), 27–41 (2013)
20. Sottilare, R., Ragusa, C., Hoffman, M., Goldberg, B.: Characterizing an adaptive tutoring learning effect chain for individual and team tutoring. In: Proceedings of the Interservice/Industry Training, Simulation & Education Conference, Orlando, FL (2013)
21. Andre, T.: Selected micro instructional methods to facilitate knowledge construction: implications for instructional design. In: Tennyson, R.D., Schott, F. (eds.) Instructional Design: International Perspectives: Theory, Research, and Models, vol. 1, pp. 243–267. Lawrence Erlbaum Associates, Mahwah, NJ (1997)
22. Bilodeau, E.A.: Principles of Skill Acquisition. Academic Press, New York (1969)
23. Bloom, B.: Human Characteristics and School Learning. McGraw-Hill, New York (1976)
24. Narciss, S.: Feedback strategies for interactive learning tasks. In: Spector, J., Merrill, M., van Merrienboer, J.J.G., Driscoll, M. (eds.) Handbook of Research on Educational Communications and Technology, 3rd edn., pp. 125–144. Lawrence Erlbaum, New York (2008)


25. Locke, E., Latham, G.: A Theory of Goal Setting & Task Performance. Prentice-Hall, Upper Saddle River (1990)
26. Ashford, S.J.: Feedback-seeking in individual adaptation: a resource perspective. Acad. Manag. J. 29, 465–487 (1986)
27. Davis, W.D., Carson, C.M., Ammerter, A.P., Treadway, D.C.: The interactive effects of goal orientation and feedback specificity on task performance. Hum. Perform. 18, 409–426 (2005)

Developing an Adaptive Trainer for Joint Terminal Attack Controllers

Cheryl I. Johnson1, Matthew D. Marraffino1, Daphne E. Whitmer2, and Shannon K. T. Bailey1

1 Naval Air Warfare Center Training Systems Division, Orlando, FL 32826, USA
{cheryl.i.johnson,matthew.marraffino,shannon.bailey}@navy.mil
2 Zenetex, LLC, Orlando, FL 32817, USA
[email protected]

Abstract. Adaptive training (AT) is training that is tailored to an individual trainee's strengths and weaknesses, such that each trainee receives a unique training experience. Previous research has demonstrated that AT can lead to higher learning gains and decreased training time when compared to traditional training approaches in certain domains [1]. However, more systematic research is needed to define which AT techniques to employ and for what content in order to determine when to invest in these technologies. The goal of this research is to examine the benefits of two particular AT techniques (i.e., adapting feedback and scenario difficulty) based on trainee performance in a complex military decision-making task. In this paper, we discuss the development of a research testbed, the Adaptive Trainer for Joint Terminal Attack Controllers (ATTAC), from a science of learning perspective. In particular, we review how the Cognitive Theory of Multimedia Learning and the Expertise Reversal Effect drove design decisions and present preliminary results on participants' impressions of ATTAC from a pilot study.

Keywords: Adaptive training · Decision-making · Feedback · Assessment

1 Introduction

1.1 Joint Terminal Attack Controllers

The U.S. Marine Corps (USMC) Vision and Strategy 2025 describes the critical path for maintaining the Corps' dominance as the expeditionary force of choice in a future of increasingly complex and volatile environments. Marines must be prepared to operate in a decentralized manner, placing unprecedented demands on them to make difficult decisions in high-stress and high-stakes situations. Close Air Support (CAS) is an example of such an environment. CAS involves aircraft attacking ground-based hostile targets in close proximity to friendly forces, which requires a clear understanding of the battlespace, including the locations of targets relative to friendlies, the weapons capabilities of the aircraft, and the possible timing and effects of the weapons (e.g., to minimize collateral damage and fratricide). Joint Terminal Attack Controllers


(JTACs) are the qualified service members who direct aircraft engaged in these highly dynamic, time-sensitive CAS missions [2]. CAS is a complex 12-step process that involves constructing plans for the mission and acquiring the necessary data for the attack (e.g., assessing attack geometry, determining the location of friendly forces) and then communicating those plans to the aircraft and other members of the ground units before, during, and after the mission's execution. Individuals training to become JTACs come to the schoolhouse with a variety of military occupational specialties, such as infantry, artillery, and pilots. Many enter the course with hundreds of hours of experience participating in CAS missions, such as pilots who may have executed CAS attacks against targets in theater, whereas other trainees may have never participated in CAS at all. The marked differences in students' experience levels in the course lead to challenges for the less experienced students, who have a very steep learning curve relative to experienced students, and for the instructors who must bring these less experienced students up to speed. To address the challenges of training individuals at various skill levels, we developed an adaptive training testbed, the Adaptive Trainer for Joint Terminal Attack Controllers (ATTAC), to provide JTAC trainees with additional opportunities to practice CAS decision-making skills. Adaptive training (AT) is training that is customized to an individual's strengths and weaknesses, such that each student receives a tailored training experience [3]. In this paper, we briefly review the AT literature, discuss how cognitive theory drove design decisions for ATTAC, and present initial results from pilot testing on users' perceived usability of and overall satisfaction with ATTAC. We conclude with a discussion of future research plans with the ATTAC testbed.

1.2 Adaptive Training Overview

One-on-one tutoring is often considered a best practice for training and education. In a seminal paper, Bloom [4] found in his analysis of the literature that students who received one-on-one tutoring performed two standard deviations higher than students in a traditional classroom setting [cf. 5]. Of course, one-on-one tutoring is not a tenable solution for training in the real world, as this would be prohibitively expensive in terms of cost, manpower, and time. Therefore, it has been the goal of instructional designers to try to approximate this experience through technology-based training solutions. One such solution that has gained traction recently is AT. In a review of AT systems used in military settings, McCarthy [6] described how AT has been utilized in a vast number of domains, from troubleshooting issues in electronic systems to learning procedures for operating radar systems. AT systems have been implemented to teach diverse knowledge and skills, including both conceptual information and procedural tasks [7]. Durlach and Ray [8] performed an extensive review of the AT research literature and generally found that AT methods were effective and led to better learning outcomes. However, the authors noted that there were relatively few examples of carefully controlled experiments that included comparisons between an adaptive group and a non-adaptive group or two adaptive groups, which limited their ability to draw strong conclusions about what adaptive techniques work best under which conditions. More recent research has found support for adapting the feedback and difficulty of a task based on the trainee's performance [1], while


some others have not [9–11]; therefore, more research is needed to determine when and how to implement AT techniques successfully. AT systems rely on three core components to tailor a student's instruction effectively, a structure referred to as the Observe-Assess-Respond (OAR) AT model [12]. First, the AT system must be able to observe characteristics about a student within the context of the learning environment. These observations could be based on a student's behavior within the environment or some trait variable (e.g., spatial ability or prior knowledge). Next, the system must assess what these observations mean about the student. Finally, once the system has made an accurate assessment of the student, the system must then respond with some instructional intervention based on learning theory to efficiently guide the student to meet his/her learning objectives. One example of an instructional intervention is feedback provided in response to a student's input. When developing an AT system, one must consider each component of the Observe-Assess-Respond model during the design process. In this research effort, we were particularly interested in examining the latter component; that is, we investigated what instructional interventions were most effective in a given task domain.

1.3 Research Approach

The overall approach of this effort was to build upon Landsberg and colleagues' [1] research in which they developed a scenario-based adaptive trainer for making periscope calls. Specifically, students were trained using a periscope simulation to determine and report a contact's angle on the bow (AOB), or the orientation the contact is presenting relative to the periscope operator's eye. Trainees received feedback after each scenario that adapted based on how they performed, and after a series of scenarios, the difficulty of the next block of scenarios was also adapted. This research demonstrated that adapting the difficulty of scenarios and the type of feedback trainees received based on an assessment of their performance led to more effective and efficient training when compared to traditional non-adaptive training approaches of similar length or longer. Although these results were very promising, more research is needed to determine the generalizability of this approach to other types of tasks. Therefore, for the current effort, we sought to extend Landsberg and colleagues' [1] research in a number of important ways. First, we conducted this research to determine whether using the same approach of adapting the feedback and difficulty generalizes to a different type of task. Determining AOB is primarily a visuo-spatial task, but do these adaptive approaches apply in a complex decision-making task, such as CAS? That is, perhaps the underlying cognitive mechanisms that support learning a spatial task (e.g., calling AOB) differ from those that support more complex conceptual tasks (e.g., CAS decision-making), because spatial information is processed differently than conceptual information. Second, determining a "level of correctness" is often less straightforward in a decision-making task. In the periscope training study, trainees responded with an orientation and angle (e.g., Starboard 160°), which has a clear correct answer and can easily be scored (e.g., an answer of Starboard 100° is off by 60°). However, in a decision-making task, there may be more than one correct answer, with correct answers falling along a continuum (i.e., correct, incorrect, and partially correct) rather than a binary assessment of correct or incorrect, making assessment less straightforward. Third, in the present


research, trainees were expected to make 5–7 decisions in each scenario, while in previous research adaptations were based on a single answer [1]. Therefore, determining how to handle adaptations with more decision points within a single scenario posed a unique challenge. To explore these research questions, we developed the ATTAC testbed.

2 Developing ATTAC

2.1 ATTAC Overview

Because CAS is such an involved task with 12 individual steps, we chose to focus our effort on one particularly challenging step, the critical planning process called "game plan" development. The JTAC's game plan sets the stage for the execution of the entire CAS mission and is a difficult topic for JTAC trainees to master. The game plan consists of four interdependent decisions: Type, Method, Ordnance, and Interval. "Type" refers to the "Type of control" the JTAC employs over the CAS mission, and the decision is based on factors such as whether the JTAC can meet the criteria for controlling the mission or prefers the CAS pilots to assume some level of control. "Method" is the "Method of attack" the attacking aircraft will utilize, and the JTAC must consider how the target will be correlated with the aircraft (e.g., the JTAC provides precise grid coordinates or the aircraft uses a sensor to find the target). To determine which Ordnance to employ, the JTAC must decide which weapon to use against the target to achieve the desired effects. For Interval, the JTAC determines how much time separation is needed between attacking aircraft. In a typical CAS mission, a JTAC controls two or more aircraft, and the JTAC must decide whether each aircraft will follow the same game plan or each aircraft will have a different game plan.

When designing ATTAC, we intended for trainees to practice developing game plans for a variety of situations, receive feedback about their responses, and experience different levels of scenario difficulty. ATTAC is an adaptive scenario-based trainer that presents a series of CAS scenarios in which trainees develop game plans that will meet the commander's intent. As depicted in Fig. 1, for each scenario, the trainee is presented with the information a JTAC has on hand while conducting a CAS mission. This information includes: a brief on the situation and description of the target(s), the capabilities of the JTAC (e.g., map and compass, laser-target designator, range finder, etc.), the type of aircraft and weapons available to conduct the mission (e.g., two F/A-18Es, each with two Mk-83s), the distance of the target from the JTAC and nearest friendlies, the presence of any surface-to-air threats, and current weather conditions. Given this information, the trainee must choose the best Type, Method, Ordnance, and Interval combination that most efficiently and effectively prosecutes the targets and meets the scenario's objective (i.e., commander's intent). Based on the trainee's response, the trainee receives tailored feedback and the difficulty of subsequent scenarios is also adjusted. The overall adaptation design decisions for ATTAC were based on well-established science-of-learning principles and theory.
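For illustration only, the sketch below shows one way a game plan submission could be represented in a testbed like ATTAC. The field names and option strings are our assumptions based on the description above, not ATTAC's actual data model.

# Hypothetical game plan representation; names and option strings are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class GamePlan:
    aircraft: str       # e.g., "F/A-18E #1"
    control_type: str   # "Type 1", "Type 2", or "Type 3"
    method: str         # e.g., "BOT" (bomb on target) or "BOC" (bomb on coordinate)
    ordnance: str       # e.g., "Mk-83"
    interval: str       # e.g., "30 s - 1 min"

@dataclass
class GamePlanSubmission:
    scenario_id: str
    plans: List[GamePlan]   # one per attacking aircraft; the plans may differ

submission = GamePlanSubmission(
    scenario_id="CAS-07",
    plans=[GamePlan("F/A-18E #1", "Type 2", "BOT", "Mk-83", "30 s - 1 min"),
           GamePlan("F/A-18E #2", "Type 2", "BOT", "Mk-83", "30 s - 1 min")],
)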


Fig. 1. Example ATTAC scenario.

2.2 Theoretical Approach to Designing ATTAC

Cognitive Theory of Multimedia Learning. During the development of ATTAC, we utilized the Cognitive Theory of Multimedia Learning [CTML; 13, 14], a popular framework in educational psychology for understanding how people learn. Using this approach allowed us to make predictions about the effectiveness of different instructional techniques and drove our overall testbed design. A main assumption of CTML is that learners have a limited working memory capacity; therefore, instruction should be designed to limit the amount of unproductive cognitive processing imposed on the learner and foster productive cognitive processing. According to CTML, individuals engage in three different types of cognitive processing while learning: unproductive processing due to poor instructional design (i.e., extraneous), processing related to the complexity of the material (i.e., essential), and processing to make sense of the material (i.e., generative). Extraneous cognitive processing arises from poor instructional design, such as using an interface that is cumbersome or awkward. Essential cognitive processing stems from the complexity of instructional content itself. For example, learning how to repair an engine would require more essential processing for novices who know very little about car repair, because they have to build these schemas about how an engine works as they learn; but repairing an engine would require less essential processing for master mechanics, because they already have schemas in place and their retrieval is an automated process, requiring fewer cognitive resources to process the incoming information. Therefore, the amount of essential cognitive processing is highly dependent upon a learner’s prior knowledge and experience. Generative processing refers to the level of mental effort learners must expend in order to make sense of the material they are learning. Generative processing is considered productive cognitive processing, because the learner is relating the learning content to his prior


knowledge. These three types of cognitive processing are traditionally thought of as additive, and it is possible for individuals to reach their capacity, which is called cognitive overload. In the event of cognitive overload, learning and task performance suffer. Therefore, the goal of instructional designers should be to minimize extraneous processing, manage essential processing, and foster generative processing. Consistent with CTML, the expertise reversal effect (ERE) is the well-documented finding that instructional interventions that are effective for novice learners may actually be detrimental for more knowledgeable learners, and it has been demonstrated in a number of different domains [15]. For example, in one experiment researchers found that providing more detailed instruction was beneficial for novice learners by giving them useful information about the task. However, as learners gained more expertise about the domain, providing the detailed instruction hurt their performance; in fact, providing less structured instruction was more beneficial for the experts [16]. The authors argued that the detailed instruction led to extraneous cognitive processing for the experts, as the additional detail was unnecessary and distracting, using up limited cognitive resources that could have otherwise been devoted toward meaningful cognitive processing. On the other hand, the additional detail was necessary for the novices in order to manage their essential processing demands. The ERE is just one example of why tailoring training to the needs of an individual learner is beneficial; it is important to consider trainees' prior knowledge of the domain, as some instructional strategies may be more or less effective depending upon their level of expertise.

Adapting Feedback and Scenario Difficulty. Grounded in our review of the literature, we determined that adapting feedback and scenario difficulty based on trainee performance were both promising techniques for ATTAC. Feedback is considered by many researchers to be one of the most effective instructional strategies [e.g., 5, 17], but there are many remaining questions on how to apply feedback in complex training situations, such as AT and scenario-based training [18–20]. In the current research, our goal was to test predictions from CTML and ERE that providing detailed feedback may be more effective for low-performing students, while less detailed feedback may be more helpful for high-performing students. Regarding scenario difficulty, a recent meta-analysis [21] reported that training with adaptively increased scenario difficulty was more effective than training that increased difficulty at fixed intervals or training that remained at a constant difficulty throughout. This finding is consistent with CTML. If a trainee is performing in a scenario that is too difficult, essential cognitive processing demands will be very high, and that trainee may not have enough cognitive resources available for meaningful learning to take place. Likewise, if a trainee performs in a scenario that is too easy, cognitive processing demands will be low, and the trainee may become bored or distracted, increasing extraneous processing. That is, these theories predict that an optimal strategy is to keep trainees in a "sweet spot" during training, in which the scenario is neither so difficult that it overwhelms the learner nor so easy that it bores the learner.


2.3 How ATTAC Works

ATTAC provided individualized training by following each of the components of the OAR model, each of which is discussed in turn. For each game plan scenario, ATTAC first observed a trainee's game plan responses that were input via the drop-down menus for each decision (e.g., Type, Method, Ordnance, Interval). Once a selection was made for each drop-down, the trainee submitted the game plan for assessment. Next, ATTAC assessed the trainee's game plan by comparing responses to a database of possible game plan combinations. Due to the nature of game plan development, determining assessment criteria posed two distinct challenges: (1) how to aggregate 5–7 interdependent decisions into a single performance score from which to make adaptation decisions, and (2) how to account for the fact that there are often many correct approaches to prosecuting a CAS mission (or as JTAC instructors like to say, "there is more than one way to skin a cat"). To address these issues, the individual decisions (e.g., the Type, Method, Ordnance, and Interval selection for each aircraft) were considered holistically, because it is the combination of factors that determines whether or not the game plan will be effective (e.g., one method of attack may be appropriate if used with a certain ordnance, but not with another). Therefore, the entire game plan was assigned a score based on how likely it was to meet commander's intent. There were three possible assessment outcomes: ideal, acceptable, or unacceptable. A game plan was considered ideal if it met mission requirements as efficiently as possible, while also considering the safety of the attacking aircraft, friendly forces, non-combatants, and the JTAC. An acceptable game plan also met mission requirements but may not have been the most efficient answer. Finally, an unacceptable game plan was potentially unsafe, inconsistent with CAS doctrinal requirements, and/or unlikely to meet commander's intent. Based on these criteria, it was possible for a scenario to have several different game plans that could be scored as ideal, acceptable, or unacceptable, which, in turn, allowed us to handle the multiple ways the trainee could approach a scenario. Lastly, based on the assessment of the trainee's game plan, ATTAC responded by adapting the feedback the trainee received and the difficulty of subsequent scenarios. As previously discussed, feedback and difficulty adaptations were selected as instructional interventions in order to support productive cognitive processing consistent with CTML. With regard to adapting feedback, we considered the ERE literature [15], such that the type of feedback trainees received was based on the correctness of their game plan. Examples of feedback for each type of game plan assessment are provided in Table 1. For example, if the trainee's answer was assessed as an ideal game plan, the trainee received positive outcome feedback and their answer was displayed (Table 1, top row). When trainees provided an ideal game plan, we reasoned that providing elaborative process feedback would serve as an extraneous cognitive processing demand, because presumably the trainee performed the correct decision-making process to arrive at the correct answer. If the trainee's answer was an acceptable game plan, the trainee received outcome feedback with elaborative process feedback specific to the response to help the trainee understand why his/her answer was not ideal (Table 1, middle row).
In this case, the trainee’s answers were mostly correct, so we provided error-specific feedback designed to minimize the amount of cognitive processing required to understand how to arrive at the ideal answer. Finally, if the trainee’s game plan was unacceptable, outcome feedback was provided in addition to fully elaborative process feedback that described the correct decision-making process for Type, Method, Ordnance, and Interval decisions for an ideal game plan (Table 1, bottom row). For unacceptable game plans, we managed essential processing demands by modeling the decision-making steps an expert takes to arrive at an ideal solution. For both the acceptable and unacceptable game plan responses, the feedback screen also displayed the trainee’s answer and an ideal answer so that the trainee could compare their answer to an ideal solution. In all cases, trainees could toggle between the feedback screen and the scenario to review them before moving on to the next scenario.

Table 1. Example feedback by game plan assessment.

Game plan assessment: Ideal
Outcome feedback: Good job! Your game plan should meet Commander’s intent
Elaborative process feedback: N/A

Game plan assessment: Acceptable
Outcome feedback: Almost there. Your game plan would work but consider the following adjustments
Elaborative process feedback:
• 30 s – 1 min spacing is recommended for unguided munitions (p. 47, TACSOP)
• Mk-83 is a better target-weapon match to destroy a building (p. 112–113, JFIRE)

Game plan assessment: Unacceptable
Outcome feedback: Not quite. Your game plan is unlikely to be successful
Elaborative process feedback:
• Type 3 is recommended because the target set lends itself to multiple attacks in a single engagement window to achieve GFC’s intent (p. III–46, JP 3-09.3). But Type 1 or Type 2 could be used for GP ordnance based on JTAC’s tactical risk assessment
• BOT is the recommended method of attack with unguided ordnance delivery (p. 43, TACSOP)
• For a Type 1 or Type 2 attack, 30 s – 1 min spacing is recommended for unguided munitions (p. 47, TACSOP)
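To make the observe, assess, and respond cycle concrete, the following is a minimal sketch, assuming a lookup table of scored game plan combinations. The names (GamePlan, SCORED_PLANS, FEEDBACK_POLICY) and the example entries are hypothetical placeholders, not ATTAC's actual database or doctrinally vetted combinations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GamePlan:
    """One submitted game plan; fields mirror the drop-down decisions."""
    attack_type: str   # e.g., "Type 2"
    method: str        # e.g., "BOT"
    ordnance: str      # e.g., "Mk-83"
    interval: str      # e.g., "30 s - 1 min"

# Each scenario carries a table mapping whole game plan combinations to an
# assessment outcome, which allows several plans to count as ideal or acceptable.
SCORED_PLANS = {
    GamePlan("Type 3", "BOT", "Mk-83", "N/A"): "ideal",
    GamePlan("Type 2", "BOT", "Mk-83", "30 s - 1 min"): "acceptable",
    # all other combinations default to "unacceptable"
}

FEEDBACK_POLICY = {
    "ideal": "outcome_only",          # positive outcome feedback, answer displayed
    "acceptable": "error_specific",   # outcome feedback plus elaboration tied to the errors made
    "unacceptable": "full_process",   # outcome feedback plus the full expert decision-making walkthrough
}

def assess(plan: GamePlan) -> str:
    """Assess the whole game plan holistically against the scored combinations."""
    return SCORED_PLANS.get(plan, "unacceptable")

def respond(plan: GamePlan) -> dict:
    """Return the assessment outcome and the level of feedback to display."""
    outcome = assess(plan)
    return {"assessment": outcome, "feedback_level": FEEDBACK_POLICY[outcome]}

print(respond(GamePlan("Type 1", "BOC", "GBU-12", "N/A")))
# -> {'assessment': 'unacceptable', 'feedback_level': 'full_process'}
```

The holistic lookup reflects the design choice described above: a game plan is never scored decision by decision, because it is the combination of decisions that determines effectiveness.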

Similar to how the feedback adaptations worked, the assessment of a trainee’s game plan also drove the difficulty adaptation of subsequent scenarios. In an attempt to manage the cognitive processing demands of the trainee, the difficulty of the scenarios (i.e., basic, intermediate, and advanced) was adapted to match the trainee’s level of performance. To illustrate, those who were performing poorly may have been experiencing high essential processing demands, which could have prevented meaningful learning from taking place. Reducing scenario difficulty lowers the essential processing demands for the trainee and allows more cognitive resources to be directed to more productive cognitive processing. Likewise, scenario difficulty increased for trainees who performed well to ensure they continued to be challenged and did not become bored, which could lead to underutilization of cognitive resources. However, unlike feedback, which adapted after every scenario, scenario difficulty was adjusted after a set of two scenarios. This was done to prevent the difficulty level from cycling too reactively.
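A minimal sketch of this block-based difficulty rule follows. The promotion and demotion conditions are illustrative assumptions; the actual thresholds ATTAC uses are not reported here.

```python
# Difficulty only moves after a block of two scenarios, and only one step at a
# time, so it cannot oscillate after every single response.
LEVELS = ["basic", "intermediate", "advanced"]

def next_difficulty(current: str, block_assessments: list) -> str:
    """Adjust difficulty once per block of two scenario assessments.

    block_assessments holds the outcomes ("ideal", "acceptable", "unacceptable")
    of the last two scenarios; the rules below are illustrative only.
    """
    idx = LEVELS.index(current)
    if all(a == "ideal" for a in block_assessments):
        idx = min(idx + 1, len(LEVELS) - 1)   # keep challenging high performers
    elif all(a == "unacceptable" for a in block_assessments):
        idx = max(idx - 1, 0)                 # reduce essential processing load
    return LEVELS[idx]

# Example: two ideal game plans at intermediate difficulty move the trainee up.
assert next_difficulty("intermediate", ["ideal", "ideal"]) == "advanced"
```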

Once ATTAC was fully developed, we conducted a pilot study to gauge participants’ initial impressions of ATTAC and test the adaptive algorithms to ensure they were working appropriately. Our ultimate goal is to perform a training effectiveness evaluation of ATTAC, so this pilot study was a necessary first step to determine that the system was working as planned and our experimental measures were adequate.

3 Pilot Study

3.1 Design

In this study, U.S. Marine Corps personnel evaluated either an adaptive version or a non-adaptive version of ATTAC in a between-subjects design. Because one of our main goals was to assess how well the adaptive algorithms were working, more participants were assigned to the adaptive condition than to the non-adaptive condition.

3.2 Participants

Pilot data were collected from a total of 22 male participants. Participants ranged in age from 20 to 28 years (M = 24.15, SD = 2.99) and had been in the Marine Corps for 1 to 11 years (M = 4.23, SD = 2.46). Of the 22 participants, 10 indicated they had prior CAS experience, primarily as forward observers (i.e., qualified service members who assist the ground unit and the JTAC by providing targeting information) during live and/or simulated training scenarios. Eighteen participants were in the adaptive condition, and four participants were in the non-adaptive condition.

3.3 ATTAC Testbeds

There were two ATTAC testbeds used during this study, the adaptive version and a non-adaptive version. There were two main differences between these versions of ATTAC. The first difference was the level of detail provided in the feedback. The adaptive version provided tailored feedback based on the trainee’s performance (as described above in Sect. 2.3; see Table 1), whereas the non-adaptive version only displayed the trainee’s submitted answer and the ideal answer. The second difference involved scenario difficulty. The adaptive version of ATTAC adapted the scenario difficulty as described above in Sect. 2.3, whereas participants in the non-adaptive version only completed intermediate difficulty scenarios regardless of how well they performed during training.

3.4 Materials and Procedure

Participants first completed a demographic questionnaire that contained biographical items about their background, military experience, and previous CAS experience. Participants then completed a 4-item scenario-based pre-test, in which they did not receive feedback on their performance. Next, during the training phase, participants briefly reviewed a PowerPoint tutorial on how to use ATTAC and then completed training scenarios in ATTAC for 45 min. After the training phase, participants completed the System Usability Scale [SUS; 22] and an Instruction Reaction Questionnaire (IRQ). The SUS is a 10-item survey that asks participants to indicate their agreement on a 1 (strongly disagree) to 5 (strongly agree) scale regarding the usability of the training environment (e.g., “I thought the system was easy to use.”). The IRQ is a 14-item survey we developed that asks participants to rate their agreement regarding their perceptions of the training on a 1 (strongly agree) to 6 (strongly disagree) scale (e.g., “Overall, the training was useful to me.”). Finally, participants completed a 4-item scenario-based post-test with no feedback and were debriefed and dismissed.

3.5 Results

We examined descriptive data from the SUS and IRQ to ascertain users’ perceptions of ATTAC. Due to the small number of participants in the non-adaptive condition, we chose not to submit these data to further statistical analysis or to make direct comparisons between the conditions. The SUS data were transformed according to the scale author’s instructions [22] so that the SUS scores range from 0 to 100. The average SUS score for those in the adaptive condition was 76.91 (SD = 13.59), and the non-adaptive condition had an average score of 70.63 (SD = 12.31). A higher score indicates better usability, and an average score above 68 is considered “above average” [22, 23]. Therefore, both versions of ATTAC were rated as having above-average usability.
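For readers unfamiliar with the SUS transformation referenced above, the sketch below follows Brooke's published scoring procedure [22]; it is an illustration, not the code used in this study.

```python
def sus_score(responses):
    """responses: ten ratings in questionnaire order, each from 1 to 5."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd-numbered items are positively worded, even-numbered items negatively.
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5  # 0-100; scores above 68 are conventionally "above average"

print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # -> 75.0
```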

On the IRQ, lower numbers indicate agreement with the statement. Means and standard deviations for the most pertinent items by condition are presented in Table 2. In general, participants in the adaptive condition rated their experience using ATTAC below the middle of the scale (i.e., toward agreement). Those in the adaptive condition reported that they liked the training, wanted to complete more training like ATTAC in the future, and believed it was useful overall. In addition, the adaptive condition participants believed the difficulty of scenarios was appropriate for their skill level and that the feedback they received helped them develop strategies to improve their performance. However, the results indicated that there was room for improvement regarding the feedback in ATTAC, given the relatively neutral responses to some of the feedback-specific items. Participants’ ratings and informal discussions with them during the debrief suggested that they believed the feedback could have been more useful and easier to understand. Those in the non-adaptive condition had generally less favorable attitudes, with scores generally above the midpoint of the scale. However, strong conclusions cannot be made due to the small sample size of participants in the non-adaptive condition.

Table 2. Selected results from the Instruction Reaction Questionnaire. Lower scores indicate a higher level of agreement with the statement.

“I liked the content in this training”: Adaptive (n = 18) M = 2.22, SD = 1.52; Non-adaptive (n = 4) M = 2.75, SD = 0.96
“I would like to complete more training like this in the future”: Adaptive M = 2.11, SD = 1.45; Non-adaptive M = 3.00, SD = 1.83
“Overall, the training was useful to me”: Adaptive M = 1.94, SD = 1.47; Non-adaptive M = 3.00, SD = 1.83
“The difficulty of scenarios was appropriate for my skill level”: Adaptive M = 2.29, SD = 1.26; Non-adaptive M = 2.25, SD = 1.26
“I believe that the feedback I received provided me with effective strategies to help me perform better”: Adaptive M = 2.39, SD = 1.33; Non-adaptive M = 4.25, SD = 0.96
“The feedback I received was easy to understand”: Adaptive M = 2.94, SD = 1.21; Non-adaptive M = 3.75, SD = 1.71
“I believe that the feedback I received could have been more useful” (negative statement): Adaptive M = 3.39, SD = 1.14; Non-adaptive M = 1.75, SD = 0.96

4 Discussion

The initial results of the pilot study were promising in terms of usability and favorable impressions of the ATTAC system. The usability data suggest that both versions of ATTAC were easy to use and that extraneous processing due to poor interface design was minimized. Importantly, participants in the adaptive condition reported favorable attitudes toward the instruction. The adaptive training may have fostered engagement due to the variation in scenario difficulty and feedback. Furthermore, participants in the non-adaptive training condition may have experienced frustration and suppressed generative processing because they were only presented with their answer and the ideal answer, with no additional explanation or feedback.

Since the pilot study, we have made improvements to the ATTAC testbed and have begun experimentation for a training effectiveness evaluation. Participants in the adaptive condition indicated during the pilot study that the feedback could be easier to understand and more useful. Previously, the feedback was relative to an arbitrarily selected ideal game plan (recall that there is more than one possible correct game plan in most cases). We have since developed an algorithm to match a student’s game plan to the closest ideal game plan, such that the feedback that is displayed will be most similar to the student’s approach to the scenario. In addition to making improvements to the feedback statements, a training effectiveness evaluation is currently underway to establish whether adapting feedback and scenario difficulty is an effective approach in this domain. In this new experiment, we are comparing learning outcomes from three between-subjects training conditions, using the adaptive and non-adaptive versions of ATTAC along with a traditional training condition. The traditional training condition acts as a control condition, such that participants do not interact with ATTAC; it simulates JTAC students’ current training experience with no access to a game plan development trainer. We will examine performance across the three conditions to assess whether learning gains are greatest after performing the training scenarios with adaptive features, compared to non-adaptive features and traditional training. In a future experiment, we plan to tease out the effects of adaptive feedback and scenario difficulty to determine whether one approach is more effective than the other or whether they have an additive effect on learning gains.

The research and development of ATTAC provides an important contribution to the training literature. Previous research [8] has discussed the need for more research to determine which AT techniques are effective for which domains. Therefore, the goal of our research with ATTAC is to apply techniques that have been effective in one domain to a new domain to determine whether these effects still hold. Game plan development is a rich decision-making task, and it represents a domain that has not been explored extensively in the AT literature.

Acknowledgements. We gratefully acknowledge Dr. Peter Squire and the Office of Naval Research, who sponsored this work (Funding Doc# N0001418WX00298). Presentation of this material does not constitute or imply its endorsement, recommendation, or favoring by the U.S. Navy or Department of Defense (DoD). The opinions of the authors expressed herein do not necessarily state or reflect those of the U.S. Navy or DoD.

References

1. Landsberg, C.R., Mercado, A.D., Van Buskirk, W.L., Lineberry, M., Steinhauser, N.: Evaluation of an adaptive training system for submarine periscope operations. In: 56th International Proceedings on Human Factors and Ergonomics, vol. 56, no. 1, pp. 2422–2426. SAGE Publications, Los Angeles (2012)
2. Joint Publication 3-09.3, Close Air Support (2014)
3. Landsberg, C.R., Astwood, R.S., Van Buskirk, W.L., Townsend, L.N., Steinhauser, N.B., Mercado, A.D.: Review of military adaptive training system techniques. Mil. Psychol. 24(2), 2–113 (2012)
4. Bloom, B.S.: The 2 sigma problem: the search for methods of group instruction as effective as one-to-one tutoring. Educ. Res. 13(6), 4–16 (1984)
5. VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 46(4), 197–221 (2011)
6. McCarthy, J.E.: Military applications of adaptive training technology. In: Lytras, M.D., Gasevic, D., Ordonez de Pablos, P., Huang, W. (eds.) Technology Enhanced Learning: Best Practices, pp. 304–347. IGI Publishing, Hershey (2008)
7. Landsberg, C.R., Van Buskirk, W.L., Astwood, R.S., Mercado, A.D., Aakre, A.J.: Adaptive training considerations for simulation-based training. Special report No. 2010-001, NAWCTSD. Naval Air Warfare Center Training Systems Division, Orlando (2011)
8. Durlach, P.J., Ray, J.M.: Designing adaptive instructional environments: insights from empirical evidence. Technical report No. 1297, ARI. U.S. Army Research Institute for the Behavioral and Social Sciences, Arlington (2011)
9. Billings, D.R.: Efficacy of adaptive feedback strategies in simulation-based training. Mil. Psychol. 24(2), 114–133 (2012)
10. Conati, C., Manske, M.: Evaluating adaptive feedback in an educational computer game. In: Ruttkay, Z., Kipp, M., Nijholt, A., Vilhjálmsson, H.H. (eds.) IVA 2009. LNCS (LNAI), vol. 5773, pp. 146–158. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04380-2_18
11. Serge, S.R., Priest, H.A., Durlach, P.J., Johnson, C.I.: The effects of static and adaptive performance feedback in game-based training. Comput. Hum. Behav. 29(3), 1150–1158 (2013)
12. Campbell, G.E.: Adaptive, intelligent training systems: just how “smart” are they? In: Campbell, G.E.: Adaptive Training Systems. Symposium conducted at the Naval Air Systems Command Fellows Lecture Series, Orlando (2014)
13. Mayer, R.E.: Multimedia Learning, 2nd edn. Cambridge University Press, New York (2009)
14. Mayer, R.E.: Cambridge Handbook of Multimedia Learning. Cambridge University Press, New York (2014)
15. Kalyuga, S., Ayres, P., Chandler, P., Sweller, J.: The expertise reversal effect. Educ. Psychol. 38, 23–31 (2003)
16. Kalyuga, S., Chandler, P., Sweller, J.: Learner experience and efficiency of instructional guidance. Educ. Psychol. 21(1), 5–23 (2001)
17. Shute, V.J.: Focus on formative feedback. Rev. Educ. Res. 78(1), 153–189 (2008)
18. Bolton, A.E.: Immediate versus delayed feedback in simulation based training: matching feedback delivery timing to the cognitive demands of the training exercise. Unpublished doctoral dissertation. University of Central Florida, Orlando (2006)
19. Johnson, C.I., Priest, H.A.: The feedback principle in multimedia learning. In: Mayer, R.E. (ed.) The Cambridge Handbook of Multimedia Learning, 2nd edn., pp. 449–463. Cambridge University Press, New York (2014)
20. Van Buskirk, W.L.: Investigating the optimal presentation of feedback in simulation-based training: an application of the cognitive theory of multimedia learning. Unpublished doctoral dissertation. University of Central Florida, Orlando (2011)
21. Wickens, C.D., Hutchins, S., Carolan, T., Cumming, J.: Effectiveness of part-task training and increasing-difficulty training strategies: a meta-analysis approach. Hum. Factors 55(2), 461–470 (2013)
22. Brooke, J.: SUS: a quick and dirty usability scale. Usability Eval. Ind. 189(194), 4–7 (1996)
23. Brooke, J.: SUS: a retrospective. J. Usability Stud. 8(2), 29–40 (2013)

Using an Adaptive Intelligent Tutoring System to Promote Learning Affordances for Adults with Low Literacy Skills

Anne Lippert, Jessica Gatewood, Zhiqiang Cai, and Arthur C. Graesser

University of Memphis and the Institute for Intelligent Systems, Memphis, TN 38111, USA
[email protected]

Abstract. One out of six adults in the United States possesses low literacy skills. Many advocates believe that technology can pave the way for these adults to gain the skills that they desire. This article describes an adaptive intelligent tutoring system called AutoTutor that is designed to teach adults comprehension strategies across different levels of discourse processing. AutoTutor was designed with a simple, easy-to-use interface that caters to the special technological needs of adult learners. Though the interface may be simple, the functionality is not. AutoTutor leverages empirically based learning principles from cognitive psychology to scaffold the acquisition of reading comprehension skills. In particular, it embeds six major learning affordances, or learning opportunities, that help students master difficult material. We provide an overview of AutoTutor, describe its learning affordances, and discuss its potential as a reading comprehension tool. We conclude by considering some of the challenges when building adaptive technologies to support low literacy adults.

Keywords: Adaptive technology · Intelligent Tutoring System · Learning principles

1 Introduction

One in six adults in the United States has literacy skills at a low level of proficiency [1] and faces difficulties with daily literacy tasks. Adult education programs offer instruction to help struggling adults (ages 16 and older) improve reading, writing, math, science, and social studies skills with the culminating goal of obtaining a high school equivalency degree or a job. Federally funded adult education programs serve an estimated 2.6 million adults, which represents a small percentage of the nation’s struggling adult readers [2]. Unfortunately, these programs are beset with many obstacles: poor funding, little professional development for teachers and tutors, high absenteeism and attrition rates, and a diversity of students in terms of racial, ethnic, and gender identities, age (between 16 and 80+), employment, education, language status, and psychosocial attributes of esteem, anxiety, and motivation. As a result of these obstacles, administering quality adult literacy instruction has been a challenge [3] and attempts to improve the literacy of these adults have been disappointing.


The difficulties of adult literacy instruction have led some to advocate the use of adaptive learning technologies on the Internet as a possible solution [4]. Being able to access a computer program on the Internet is an excellent way to combat absenteeism due to unstable work hours, transportation difficulties, and childcare issues. Computers with Internet access are available for adult learners in public libraries, children’s schools, and adult literacy programs. Newnan’s [5] survey of more than 1000 programs indicated that more than 80% of survey respondents had computers in their classrooms with consistent access to the Internet (although significant variability was noted). Peterson [6] reported that an increasing number of adult literacy programs are infusing technology into their classrooms and curricula.

On the other hand, a major challenge lies in developing technology that addresses the poor digital literacy skills of adults in the United States [6]. Olney, Bakhtiari, Greenberg, and Graesser [7] recently tested the digital literacy ability of 114 adults reading below the 8th grade level. Even though 72% of the adult learners reported using a computer for five or more years, the majority of these adults were not able to complete simple tasks such as opening a Word document in a taskbar, typing in a web address and clicking NEXT, or choosing a secure password and typing it in a “re-enter password” box. Thus, technology geared for adult learners needs to be understood easily by adult learners, include scaffolding to help them use the technology, and minimally depend on open-ended learner input such as writing (because poor readers are able to write very little).

One solution to handling the limited digital literacies of adult readers is to use conversational agents as part of the technology [4]. Conversational agents are talking heads or avatars that speak to the adult with pre-recorded voices or synthetic text-to-speech facilities. The agents can give instructions to the learners when they have trouble using important features on the computer interface. When properly designed, these agent technologies can provide support that is analogous to a human teacher or tutor. An Intelligent Tutoring System (ITS) called AutoTutor was designed with conversational agents in order to teach comprehension skills to adults who read between a 3rd and 8th grade level.

Besides providing a simple, intuitive design and abundant scaffolding, AutoTutor has a number of characteristics of educational environments that facilitate learning. Old-school media consisted of listening to lectures, watching video presentations, and reading books. For these media, the learners passively observe or linearly consume the materials at their own pace. However, the learning environments in today’s world require learners to be more active by strategically searching through hypermedia, constructing knowledge representations from multiple sources, performing tasks that create things, and interacting with technologies or other people [8, 9]. From the standpoint of technology, it is worthwhile for technologies to embody characteristics that facilitate active, constructive, interactive learning environments. The National Academies of Sciences, Engineering, and Medicine [10] identified eight characteristics, referred to as affordances, of advanced learning technologies that are grounded in cognitive and educational psychology principles/strategies. The affordances that AutoTutor implements for adults with lower reading literacy include interactivity, adaptivity, feedback on performance, choice, linked representations, and communication with other people or agents. AutoTutor includes the two other affordances (nonlinear access and open-ended learner input), but only minimally.


The AutoTutor technology has been used to develop modules for 35 comprehension strategies that cover different levels of reading comprehension, including words, the explicit text, the referential situation model, rhetorical structure, and discourse genre [11]. A 4-month hybrid intervention that included human tutors in addition to AutoTutor was conducted with 252 struggling adult readers in Toronto and Atlanta. Fang et al. [12] reported improvements on three psychometric measures of comprehension in a pretest–posttest design, with effect sizes that varied from .12 to .63 for four clusters of readers.

This paper describes AutoTutor for struggling adult readers and reports highlights of empirical findings that speak to its efficacy. A primary emphasis is on the affordances of AutoTutor’s interface and pedagogical strategies that are grounded in learning principles and, as such, should improve adaptive computer-based tutoring for struggling adult readers. We first give an overview of the AutoTutor system and describe some general considerations in building educational technology for low literacy adults. We then describe what is meant by “learning affordances” of technology and describe six primary affordances of AutoTutor that support active, deeper knowledge acquisition and learning. Finally, we turn to studies that consider the effectiveness of AutoTutor as a reading comprehension tool for the struggling adult reader. We conclude with some recommendations for future research in the area of adaptive learning environments for adult literacy students.

2 Overview of AutoTutor for Reading Comprehension

AutoTutor is a conversational ITS that teaches adults reading comprehension skills by holding conversations in natural language. These conversations are called “trialogues” because two computer agents, a teacher (Christina) and a peer student (Jordan), engage one adult learner in discussion about course topics. The “talking heads” help adults learn by interacting with them through speech and by frequently referring to texts and multimedia. They scaffold students through different types of reading comprehension strategies (e.g., clarifying pronouns, identifying main ideas, understanding compare-contrast structures) and also help with navigating the computer environment. The AutoTutor curriculum has 35 lessons that focus on specific comprehension components [13]. Each AutoTutor lesson takes 10 to 50 min to complete.

Adult learners typically have substantial challenges with writing, so AutoTutor tends to rely on point & click (or touch) interactions, multiple-choice questions, drag and drop functions, and other conventional input channels. However, the system does include some writing components that require semantic evaluation of open-ended student contributions. AutoTutor has many pictures, diagrams, and multimedia that help grab and maintain the attention of the adult learner. The system also has the capability of reading texts aloud when the learner asks for such assistance by clicking on a screen option. This is an important feature because many of the adult learners have limited decoding and word identification skills [14].

When the AutoTutor system was being created, the designers took into account the distinctive characteristics of adult learners. For example, it was necessary to have an AutoTutor intervention that makes little or no use of keyboard input [7]. Instead, there was an emphasis on clicking on visible options on the display, much like an appliance that attempts to make the hidden mechanisms invisible [15]. There was also a need to create an introductory video on digital literacy to train learners on any particular computer feature that was absolutely essential to include in a particular lesson. For example, scrolling was needed in many of the lessons so that the adults could read lengthier texts. However, only 60% of the adults could do scrolling [7]. The introductory video included instructions on scrolling in addition to other important behaviors that many adult learners in the sample had not mastered, as discussed earlier in this paper. Interestingly, there are many tutorials on digital literacy on the Internet that one might have considered using. Unfortunately, these tutorials routinely assume that the users are able to read at higher levels than the adults with low literacy. In the next section, we see how, despite the limitations of low literacy adults when it comes to technology, AutoTutor successfully implements many possibilities to support learning in the area of reading comprehension.

3 Learning Affordances

Learning technologies like ITSs open up significant opportunities to support learners. The term “affordance” refers to opportunities a technology makes possible related to learning and instruction [10, 16]. For example, a bench affords users a way to sit, whereas a staircase affords users the ability to reach higher ground. Certain features of contemporary digital environments, including multimedia displays with texts, pictures, diagrams, visual highlighting, sound, spoken messages, and input channels (clicking, touching) for entering information, can afford important learning opportunities for users. The learning environments in today’s world require learners to be more active by strategically searching through hypermedia, constructing knowledge representations from multiple sources, performing tasks that create things, and interacting with technologies or other people [8, 9]. From the standpoint of technology, it is worthwhile to take stock of the characteristics of learning environments that facilitate active, constructive, interactive learning. Table 1 shows some of these characteristics that were identified by the National Academies of Sciences, Engineering, and Medicine in the second volume of How People Learn [10]. It is important to consider these characteristics when creating technologies to support the acquisition of knowledge.

Table 1. Key affordances of learning technologies (NASEM, 2018).

1. Interactivity: The technology systematically responds to the actions of the learner
2. Adaptivity: The technology presents information that is contingent on the behavior, knowledge, or characteristics of the learner
3. Feedback: The technology gives the learner information about the quality of their performance and how it could improve
4. Choice: The technology gives learners options on what to learn and how to regulate their own learning
5. Nonlinear access: The technology allows the learner to select or receive learning activities in an order that deviates from a set order
6. Linked representations: The technology provides quick connections between representations for a topic that emphasizes different conceptual viewpoints, media, and pedagogical strategies
7. Open-ended learner input: The technology allows learners to express themselves through natural language, drawing pictures, and other forms of open-ended communication
8. Communication: The learner communicates with one or more people or agents

4 Learning Affordances of AutoTutor

In this section, we discuss how AutoTutor embeds six primary learning affordances empirically shown to support learning. They are Interactivity, Adaptivity, Feedback, Choice, Linked Representations, and Communication with other people. For each affordance, we describe the cognitive or educational principle that it reflects. We then demonstrate how this affordance is captured in AutoTutor.

4.1 Interactivity

Unlike static textbooks, audiotapes or films, an interactive system presents new information in response to the learner. Underlying interactivity is the idea of a two-way action (between learner and instructor) as opposed to a one-way action (i.e., from instructor to learner) that helps the learner change his or her knowledge to promote learning [17]. AutoTutor was designed as an interactive system that responds to the actions and even non-actions of adult learners in an effort to promote understanding. For example, interactivity occurs at the level of question asking and answering. After the user selects an answer, the system responds with a sound that tells the user he or she was either correct (higher pitched chime) or incorrect (lower pitched beep). In the case of an incorrect response, the student is often asked to interact further with the system and provide a different answer. In addition, there are responses to more unique actions of the users. For instance, there is a “repeat” button to press whenever the learner wants the previous turn of an agent to be repeated. Users can press on an option to have text read to them whenever the materials involve a multi-sentence text (but not when a single sentence is presented). They can press the home icon at the bottom whenever they want to start at the beginning, and the system will return them to the start. AutoTutor is responsive to these periodic needs of the learner. The system is also responsive to adults who do not initiate a response before a timeout period expires by repeating the agent’s question or request. The AutoTutor system handles any action or non-action of a learner at every point in the conversation when the learner is expected to contribute. This system behavior increases interactivity and guides the user toward specific learning goals.
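As a rough, hypothetical sketch of these interactivity rules (the handler names, return strings, and the 30-second idle timeout are illustrative assumptions, not AutoTutor internals):

```python
IDLE_TIMEOUT_S = 30  # assumed value; the paper does not report the actual timeout

def handle_event(event: dict, state: dict) -> str:
    """Map a learner action (or inaction) to the system's response."""
    kind = event.get("type")
    if kind == "answer_selected":
        if event["answer"] == state["correct_answer"]:
            return "play high-pitched chime; advance conversation"
        return "play low-pitched beep; ask learner to try again"
    if kind == "repeat_pressed":
        return "repeat the previous agent turn"
    if kind == "read_aloud_pressed":
        return "read the current multi-sentence text aloud"
    if kind == "home_pressed":
        return "return to the start of the lesson"
    if kind == "idle" and event.get("seconds", 0) >= IDLE_TIMEOUT_S:
        return "repeat the agent's question or request"
    return "no response required"

# Example: an idle learner is re-prompted rather than left waiting.
print(handle_event({"type": "idle", "seconds": 45}, {"correct_answer": "B"}))
```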

4.2 Adaptivity

There is good evidence that instruction is more effective if it takes into account that learners are different and that they change as they learn [18]. For example, task selection based on assessment of individual students’ knowledge can contribute to the effectiveness of instruction [19, 20]. AutoTutor was designed to be adaptive to help foster learning. In particular, there are three components of AutoTutor that provide adaptive interaction.

The first assigns texts to read (or shorter instruction episodes) that are tailored to the learner’s ability (not too easy or too difficult), as calibrated by prior performance of the learner. A lesson starts out with a text at an intermediate difficulty level, but then increases or decreases the difficulty of the assigned materials in a manner that is sensitive to the learner’s previous performance. The difficulty level of the texts is computed by Coh-Metrix, a system that scales texts on difficulty by considering characteristics of words, syntax, discourse cohesion, and text category [21, 22]. After performance is scored on the questions associated with the initial text in a lesson, the next text assigned will be relatively more difficult if the score is high and will be relatively easier if the adult’s score is low.

The second adaptive component designs the trialogue conversations in a manner that adapts to the adults’ ability and/or motivation, as reflected in their performance scores during training. For example, there is an AutoTutor activity in which the computer peer competes in a Jeopardy-like game with the adult learner. The learner and peer agent take turns answering questions and score points in the competition that is guided by the tutor agent. Sometimes the learner wins and sometimes the peer agent wins, but ultimately the adult learner manages to end up winning or tying the overall competition, no matter how poorly the adult learner performs. The learner’s winning the competition against the peer agent is expected to boost the confidence of the adult learner.

Regarding the third adaptive component, the conversations associated with a particular tutor question depend on the responses of the adult learner. When the adult answers a question correctly when first asked, the adult gets full credit for answering the question. When the adult answers the question incorrectly, AutoTutor generates a hint and gives the adult a second chance; the adult gets partial credit when the answer is correct on the second attempt. Another approach is to have the peer agent generate information or make a selection and to ask the adult whether Jordan’s answer is correct; the adult gets partial credit if they decide correctly. Open-ended responses (that require the learner to type in information using natural language) are assessed with computational linguistics techniques that match the student’s input to expectations [23]. In this way, AutoTutor scaffolds learning by being adaptive to the ability or motivation of the learner as well as his or her progress throughout the lesson.
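A minimal sketch of the first and third adaptive components, under stated assumptions: the thresholds, credit values, and function names below are illustrative, and the difficulty values stand in for Coh-Metrix-style scores; AutoTutor's actual calibration is more sophisticated.

```python
def pick_next_text(texts, last_score):
    """texts: list of (text_id, difficulty) pairs; last_score: 0.0-1.0 on the prior text."""
    ordered = sorted(texts, key=lambda t: t[1])        # easiest to hardest
    if last_score >= 0.8:
        return ordered[-1]                             # step up in difficulty
    if last_score <= 0.4:
        return ordered[0]                              # step down in difficulty
    return ordered[len(ordered) // 2]                  # stay near intermediate

def credit(first_try_correct, second_try_correct=False, judged_peer_correctly=False):
    """Full credit on the first attempt, partial credit for scaffolded success."""
    if first_try_correct:
        return 1.0
    if second_try_correct or judged_peer_correctly:
        return 0.5
    return 0.0

# Example: a strong score on the previous text pulls in the hardest available text.
print(pick_next_text([("t1", 0.2), ("t2", 0.5), ("t3", 0.9)], last_score=0.85))  # -> ('t3', 0.9)
```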

4.3 Feedback

There is a wealth of evidence that feedback powerfully influences learning outcomes. In a review of 12 meta-analyses that included information regarding feedback in classrooms, the average effect size was d = .79, which is twice the average effect of other academic influences [24]. Feedback can make learning visible to the student, can lead to error detection, and can enhance students’ assessment capabilities about their learning [25].

Feedback is central to the AutoTutor system. For each response the user gives, AutoTutor provides the learner with information about the quality of their performance and how it could improve. The feedback is both timely and takes into consideration the abilities of the adult learner. There are three main ways the AutoTutor system provides feedback. First, when a user submits an answer, they hear a sound (either a negative “wonk” or a positive “chime”) that quickly alerts them to whether they answered correctly or not. Second, following the correct or incorrect sound, the user hears what is called a “canned response”. This is a general response such as “nice job” or “that’s not quite the answer we want”, depending on the correctness of the given answer. The canned response for incorrect answers tries to be more neutral in feedback, since adult learners may struggle with confidence. These responses are typically “Sorry, I was thinking of a different response” or “Hmmm, that is not the best answer in this case”, which lets the user know his or her answer was not correct, but without being inadvertently disparaging. The canned response for correct answers is very positive, using phrases like “Yes! That’s it! Way to go” or “Nice work. You are really getting the hang of this”. When canned feedback comes from the peer agent, it is often in the sense of the student peer benefitting from the wisdom of the adult learner. Phrases such as “Wow, I had no clue that was the answer. Thanks so much for your help!” or “[Adult learner’s name], thanks for choosing the right answer. You are helping me so much” may boost the learner’s confidence and motivation. The third type of feedback comes after the canned feedback, when the system provides more specific feedback to the user regarding the question. The correct answer on the screen is highlighted in green, and an agent describes why this answer is correct in one or two clear sentences. If the tutor agent provides the explanation, it may be followed up with the peer agent summarizing this information with a statement such as “Oh, I get it now. So certain words like ‘first’ or ‘then’ can help us determine the order of steps in a procedure.” In this way, feedback can come from both agents in order to help pinpoint the correct answer and add clarification. In general, AutoTutor attempts to provide timely, clear, and relevant feedback to enhance learning.
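A hypothetical sketch of this three-layer feedback sequence follows; the phrase pools paraphrase the examples above, and the function and constant names are invented for illustration rather than taken from AutoTutor.

```python
import random

POSITIVE = ["Yes! That's it! Way to go.",
            "Nice work. You are really getting the hang of this."]
NEUTRAL = ["Sorry, I was thinking of a different response.",
           "Hmmm, that is not the best answer in this case."]

def build_feedback(is_correct: bool, explanation: str) -> list:
    """Assemble the sound cue, canned response, and specific explanation, in order."""
    steps = ["play chime" if is_correct else "play wonk"]                  # 1. sound
    steps.append(random.choice(POSITIVE if is_correct else NEUTRAL))       # 2. canned response
    steps.append(f"highlight correct answer; agent explains: {explanation}")  # 3. specific feedback
    return steps

for step in build_feedback(False, "Words like 'first' and 'then' signal the order of steps."):
    print(step)
```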

4.4 Choice

Research supports the idea that instruction that gives students a choice in what to learn and when to learn it is generally better than instruction in which all students follow the same scope and sequence at the same pace [26]. As such, AutoTutor gives learners options on what to learn and how to regulate their own learning. Though AutoTutor was initially designed to act as a web-based component of an instructor-led course that followed a particular curriculum, the program can act as a stand-alone reading comprehension tool. The program is web-based, and anyone at any time can access any of the available lessons. This affords adult learners the choice of developing their skills at their own pace in a variety of environments, including their home. AutoTutor lessons are divided into three main categories: Words, Texts and Stories, and Computer and Internet. Each lesson was meant to stand alone, independent of other lessons, so that adults can work on any lesson within any category. If they find certain lessons too difficult, they can choose a different lesson more suitable to their level, for example, choosing a lesson from Words instead of Texts and Stories.

Within the lessons themselves, learners have choice. For example, there are three “Jeopardy!” style lessons where the human student and the peer agent answer questions in a Jeopardy-style fashion. At each turn, the user chooses what question to receive from among a board that contains 16 questions corresponding to different topics and having different point values.

Early versions of AutoTutor also included auxiliary computer components that augment the learning experience and motivation, and similar features are being considered for the present version. One example is an online independent reading facility for the adult learners to use. This facility has a text repository (i.e., http://csal.gsu.edu/content/library) with thousands of texts categorized by topic (such as health, family, work, etc.) and difficulty level. The independent reading facility also provides access to Simple English Wikipedia, a version of Wikipedia for English language learners, and newspaper articles. Adults are encouraged to read documents on topics that interest them, with the guidance and encouragement of the teachers in the adult literacy centers. The hope is that use of the independent reading facility will increase the adults’ practice time and self-regulated learning.

4.5 Linked Representations

The use and construction of different representations to inform on the same concept can promote a deeper understanding of domain concepts that would be difficult to achieve with a single representation [27]. The ability to switch between multiple perspectives in a domain helps learners build abstractions necessary for a grasp of domain content [28]. Furthermore, insights achieved through the use of multiple representations increase the likelihood that knowledge acquired will transfer to new situations [29].

AutoTutor was designed to provide quick connections between representations for a topic that emphasize different conceptual viewpoints, media, and pedagogical strategies. As such, it helps promote deeper learning and cognitive flexibility. For example, the majority of lessons include a 2-min video tutorial called a “nutshell” that gives the learner a brief visual and audio overview of the lesson topic. This tutorial is typically viewed before a user begins a lesson, but may be accessed at any point during the lesson by pressing a “watch video” button on the bottom of the screen. AutoTutor also uses visuals such as charts or diagrams to enhance learning. For example, Fig. 1 shows a diagram used at the beginning of the lesson “Connecting Ideas” that helps the user build a model of how the characters and events of stories interrelate. These visuals may be presented during the opening dialogue, when the student agent and tutor agent are giving an overview of the lesson. Like the nutshells, the user can often access these visuals throughout the lesson with a click of a button. Often, these visuals appear again at the end of the lesson, while the agents are recapping what was learned. The goal is to facilitate learning by modeling information in multiple ways to help the user build a coherent representation of the topic.


Fig. 1. An example of AutoTutor’s use of linked representations. This diagram depicts relationships between characters and events, a key concept within the lesson “Connecting Ideas”. The concept this diagram reflects is represented throughout the lesson in other ways including text, pictures and conversation.

4.6 Communication

For the learner, conversations provide multiple opportunities and resources for the development of intellectual competencies and positive motivational orientations toward learning [30]. As described previously, conversation is at the heart of AutoTutor and is used to scaffold learning in students. When using AutoTutor, the learner communicates with a peer agent and a tutor agent in what is called a trialogue. The presence of both a tutor agent and a peer agent makes possible three primary conversation modes in AutoTutor: testing mode, game mode, or help mode. In testing mode, the tutor tests both the adult and the peer agent on their comprehension by asking questions or making a request, giving short feedback (“you are right”), and also providing content that repeats, elaborates, or explains the correct answer. When lessons were being designed for AutoTutor, curriculum developers tried not to rely on this testing mode too often because it has a “schoolish” pragmatic foundation that may turn off many of the adults. The trialogue conversations also have a game mode, which is presumably more motivating. Another mode that is frequently used is a help mode, where the peer agent needs help with a task and the adult learner is encouraged to help the peer agent. The help mode is designed to increase the adult’s self-esteem and feelings of positivity toward learning. These types of communication illustrate how the agent conversation can be designed to enhance motivation in addition to improving cognitive comprehension strategies.


5 Effectiveness of AutoTutor as a Reading Comprehension Tool

The primary goal of AutoTutor is to increase reading comprehension skills in lower literacy adults. The six affordances within AutoTutor reflect cognitive principles known to enhance learning and, as such, should facilitate the acquisition of reading skills in adult learners. We have recently begun to explore the question of how effective AutoTutor is for reading comprehension. At this point, we have considered the effectiveness of AutoTutor by analyzing data from 52 adult literacy students who interacted with AutoTutor in a 100 h intervention designed to improve their reading skills [4]. The intervention was blended between teacher-led sessions and the computer-based AutoTutor. The purpose of the study was to gather information regarding the feasibility of running this intervention in authentic adult literacy settings.

Self-report data from the adult learners indicated that they were very engaged with AutoTutor. They related to the student agent’s trials and tribulations, for example, when he had a real-world problem that required reading to resolve. The adults sometimes felt sorry for the student agent when he incorrectly answered questions. The students rated the refresher “nutshell” videos as being very helpful, succinct, and engaging overviews of lessons. The behavioral performance data were also encouraging. The adults in the feasibility study completed 71% of the lessons, which is an excellent retention statistic compared with norms of attrition rates in adult literacy centers [2]. The adults answered 55% of the questions correctly in the AutoTutor conversations, where chance responding is approximately 33%. This level of performance indicates that the questions were sometimes challenging and required the system to adaptively offer hints to scaffold learning. This conversational scaffolding is very different than traditional computer-based trainings that do not adapt to the user’s response and provide only multiple choice questions with no scaffolding. The results of the feasibility study were sufficiently encouraging to continue testing on approximately 200 additional adult learners in a study that has been completed and is currently being analyzed. These additional data will help us determine whether AutoTutor is a viable approach to improving adult readers’ comprehension strategies and skills.

6 Discussion

Digital technologies are expected to play an increasing role in helping adults learn reading comprehension skills in a society where there are higher expectations on adults in the workforce and the community at large [31, 32]. As such, it is critical that learning technologies take heed of empirical evidence regarding what works and what does not work in order to effectively promote knowledge acquisition. This paper described an adaptive ITS called AutoTutor that was designed with learning affordances that reflect empirically based cognitive principles of learning. We provided examples of how AutoTutor implements six of eight affordances that help facilitate aspects of successful learning environments.


While early results suggest that AutoTutor is a promising technological tool for low literacy adults, more data need to be collected and analyzed, and there are further challenges to address. For example, adults with low literacy skills tend to vary not only in demographic variables (age, gender, and race/ethnicity), but also with respect to educational backgrounds, learning disabilities, primary languages (English or other), and motivation to improve their literacy level [2]. Furthermore, Fang et al. [12] showed that adult readers have distinctive behavioral profiles when it comes to learning. There are higher performing readers who may benefit from more challenge and should be encouraged to increase their reading activities. There are conscientious readers who benefit by spending extra time on the material and questions, unlike struggling readers who also spend a good amount of time on the material but show minimal gains. There are also underengaged readers who would benefit from reminders to concentrate or more motivating material. By leveraging the multiple affordances of AutoTutor, such as adaptivity, linked representations, and choice, it is possible to better deal with this variation in both background and behavior. A future hope is that we can improve learning by tailoring instruction and materials to meet the various needs of the individuals in this group.

Acknowledgements. The research reported here is supported by the Institute of Education Sciences, US Department of Education, through Grant R305C120001, and the National Science Foundation Data Infrastructure Building Blocks program under Grant No. ACI-1443068. The opinions expressed are those of the authors and do not represent views of the Institute or the US Department of Education, or the National Science Foundation.

References

1. National Center for Education Statistics: The Condition of Education 2016. The National Academies Press, Washington, D.C. (2016)
2. Lesgold, A., Welch-Ross, M.: Improving Adult Literacy Instruction: Options for Practice and Research. National Academies Press, Washington, D.C. (2012)
3. Greenberg, D.: The challenges facing adult literacy programs. Community Literacy J. 3(1), 39–54 (2008)
4. Graesser, A., Greenberg, D., Olney, A., Lovett, M.: Educational technologies that support reading comprehension for adults who have low literacy skills. In: Perin, D. (ed.) Wiley Adult Literacy Handbook. Wiley, New York (in press)
5. Tytonpartners. http://tytonpartners.com/library/learning-for-life-the-opportunity-for-technology-to-transform-adult-education/. Accessed 02 Mar 2019
6. The Digital Divide in 2016. http://techtipsforteachers.weebly.com/. Accessed 28 Sept 2016
7. Olney, A., Bakhtiari, D., Greenberg, D., Graesser, A.: Assessing computer literacy of adults with low literacy skills. In: Hu, X., Barnes, T., Hershkovitz, A., Paquette, L. (eds.) International Conference of Educational Data Mining 2017, pp. 128–134. International Educational Data Mining Society, Wuhan (2017)
8. Chi, M.T.: Active-constructive-interactive: a conceptual framework for differentiating learning activities. Top. Cogn. Sci. 1(1), 73–105 (2009)
9. Wiley, J., Goldman, S., Graesser, A., Sanchez, C., Ash, I., Hemmerich, J.: Source evaluation, comprehension, and learning in Internet science inquiry tasks. Am. Educ. Res. J. 46, 1060–1106 (2009)
10. National Academies of Sciences, Engineering, and Medicine: How People Learn II: Learners, Contexts, and Cultures. The National Academies Press, Washington, D.C. (2018)
11. Graesser, A., McNamara, D.: Computational analyses of multilevel discourse comprehension. Top. Cogn. Sci. 3(2), 371–398 (2011)
12. Fang, Y., et al.: Clustering the learning patterns of adults with low literacy interacting with an intelligent tutoring system. In: Boyer, K., Yudelson, M. (eds.) International Conference on Educational Data Mining 2018, pp. 348–354. Educational Data Mining Society, Buffalo (2018)
13. Graesser, A., et al.: Reading comprehension lessons in AutoTutor for the Center for the Study of Adult Literacy. In: Crossley, S., McNamara, D. (eds.) Adaptive Educational Technologies for Literacy Instruction, pp. 288–293. Taylor & Francis Routledge, New York (2016)
14. Sabatini, J., Shore, J., Holtzman, S., Scarborough, H.: Relative effectiveness of reading intervention programs for adults with low literacy. J. Res. Educ. Effectiveness 4, 118–133 (2011)
15. Norman, D.: The Invisible Computer. MIT Press, Cambridge (1998)
16. Collins, A., Neville, P., Bielaczyc, K.: The role of different media in designing learning environments. Int. J. Artif. Intell. Educ. 11, 144–162 (2000)
17. Wagner, E.: In support of a functional definition of interaction. Am. J. Dist. Educ. 8, 6–29 (1994)
18. Aleven, V., McLaughlin, E., Glenn, R., Koedinger, K.: Instruction based on adaptive learning technologies. In: Mayer, R., Alexander, P. (eds.) Handbook of Research on Learning and Instruction, 2nd edn., pp. 522–560. Routledge, New York (2017)
19. Anderson, J., Corbett, A., Koedinger, K., Pelletier, R.: Cognitive tutors: lessons learned. J. Learn. Sci. 4(2), 167–207 (1995)
20. Atkinson, R.: Optimizing the learning of a second-language vocabulary. J. Exp. Psychol. 96(1), 124–129 (1972)
21. Graesser, A., McNamara, D., Cai, Z., Conley, M., Li, H., Pennebaker, J.: Coh-Metrix measures text characteristics at multiple levels of language and discourse. Elementary Sch. J. 115, 210–229 (2014)
22. McNamara, D., Graesser, A., McCarthy, P., Cai, Z.: Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press, Cambridge (2014)
23. Graesser, A.: Conversations with AutoTutor help students learn. Int. J. Artif. Intell. Educ. 26, 124–132 (2016)
24. Hattie, J.: Visible learning: a synthesis of over 800 meta-analyses relating to achievement. Routledge, Oxford (2009)
25. Hattie, J., Gan, M., Brooks, C.: Instruction based on feedback. In: Mayer, R., Alexander, P. (eds.) Handbook of Research on Learning and Instruction, pp. 290–324. Routledge, New York (2017)
26. Connor, C., Morrison, F., Fishman, B., Schatschneider, C., Underwood, P.: Algorithm-guided individualized reading instruction. Science 315(5811), 464–465 (2007)
27. Ainsworth, S.: DeFT: a conceptual framework for considering learning with multiple representations. Learn. Instr. 16(3), 183–198 (2006)
28. Ainsworth, S., Van Labeke, N.: Multiple forms of dynamic representation. Learn. Instr. 14(3), 241–255 (2004)
29. Bransford, J., Schwartz, D.: Rethinking transfer: a simple proposal with multiple implications. Rev. Res. Educ. 24, 61–100 (1999)
30. Wentzel, K., Edelman, D.: Instruction based on peer interactions. In: Mayer, R., Alexander, P. (eds.) Handbook of Research on Learning and Instruction, pp. 365–387. Routledge, New York (2017)
31. Carnevale, A.P., Smith, N.: Workplace basics: the skills employees need and employers want. Hum. Res. Dev. Int. 16, 491–501 (2013)
32. Organisation for Economic Co-operation and Development: Adults, Computers and Problem Solving: What’s the Problem? OECD Publishing, Paris (2015)

Development of Cognitive Transfer Tasks for Virtual Environments and Applications for Adaptive Instructional Systems

Anne M. Sinatra1, Ashley H. Oiknine2,3, Debbie Patton4, Mark Ericson4, Antony D. Passaro4, Benjamin T. Files4, Bianca Dalangin2, Peter Khooshabeh4, and Kimberly A. Pollard4

1 US Army Combat Capabilities Development Command Soldier Center – Simulation and Training Technology Center, Orlando, USA
[email protected]
2 Department of Psychological and Brain Sciences, University of California, Santa Barbara, USA
[email protected]
3 DCS Corporation, Los Angeles, USA
[email protected]
4 US Army Research Laboratory, Human Research and Engineering Directorate, Aberdeen Proving Ground, Los Angeles, USA
{debra.j.patton4.civ,mark.a.ericson.civ,antony.d.passaro.civ,benjamin.t.files.civ,peter.khooshabehadeh2.civ,kimberly.a.pollard.civ}@mail.mil

Abstract. Spatial navigation and spatial learning are important skills with implications for many different fields, and can be trained in Virtual Environments (VEs). Research in these areas presents a critical challenge: how to best design tests of knowledge transfer to measure learning from the VEs. Additionally, spatial learning tasks and VEs can be implemented in Adaptive Instructional Systems (AISs), which could result in improved spatial navigation and learning. The current paper describes the development of transfer tasks to assess different levels of spatial learning from experience in a VE, including recall, recognition, route, bearings, and map drawing transfer tests. Details are provided in regard to how transfer questions were generated and the reasoning behind them. Implications for the use of similar tasks in AISs and Intelligent Tutoring Systems are discussed.

Keywords: Spatial navigation · Virtual environments · Transfer tasks · Adaptive Instructional Systems · Spatial learning

1 Introduction to Spatial Navigation in Virtual Environments

1.1 Introduction

Spatial navigation and spatial memory are important skills for a variety of occupations as well as for everyday living. Spatial learning is frequently studied in an experimental setting utilizing virtual environments (VEs) and immersive technologies, such as
Virtual Reality (VR) head-mounted displays (HMDs), to display spatial tasks. An important challenge in spatial navigation research is how to design tests of learning transfer from the immersive VEs. The underlying assumption is that navigational skills learned by navigating a VE will transfer to and improve navigation in real-world environments [3, 4]. Measures of transfer of learning are especially important for automated systems. Many Intelligent Tutoring Systems (ITSs) and Adaptive Instructional Systems (AISs) are in well-defined domains (e.g., math and physics), and few have been integrated with real-time feedback in external games and virtual environments [14]. However, with the development of domain-independent ITS frameworks such as the Generalized Intelligent Framework for Tutoring (GIFT), there are opportunities for integrating custom VEs with adaptive tutoring and after-the-fact assessment. Since spatial learning can involve interacting with a VE and meeting measurable goals, it is a natural fit for intelligent tutoring and incorporation into AISs.

1.2 The Current Environment and Applications for AISs

The current paper describes and details the development of experimental transfer tasks and assessment materials associated with an experiment to investigate the relationship among immersion, individual differences, and training transfer in VEs. The primary method used for determining transfer performance in the described experiment was based on traditional cognitive psychology approaches and measures. This paper describes the methods that were used to develop the transfer tasks that were associated with the VEs in order to examine learning as a result of engaging with the VEs, and provides an initial explanation of approaches that can be used to grade them. The implications for adapting these types of materials and approaches for AISs are discussed, as well as lessons learned from working as a distributed team to generate these materials.

2 Generating Transfer Tasks for a Virtual Environment

2.1 General Experiment Design and Task Description

In order to examine the relationship between immersion, individual differences and training transfer, a study was designed that had three levels of immersion: (i) low: desktop monitor with desktop speakers, (ii) medium: partially occluded HMD with headphones, and (iii) high: fully occluded Oculus Rift HMD with circumaural headphones. The visual fidelity of the devices used in each level increased as the associated immersion level went up (e.g., the Oculus had higher visual fidelity than the partially occluded HMD). Additionally, the fidelity of the associated audio increased with the level of immersion. The low immersion audio included only distance-based intensity cues played over loudspeakers. The medium immersion audio condition used directional audio cues based on free-field head-related transfer functions without head motion cues. The high immersion condition used head motion tracked directional cues and room acoustic cues using Oculus’ 3D audio spatialization effects from their audio Software Development Kit.


Three different mini and three different larger Unity3D-based environments were created for use in the experiment. The objects used within the environments included Unity "free assets" and objects developed by the research team. The mini-environments were intended to serve as an introduction and practice environment before engaging in the full task and were highly tied to schemas. The mini-environments were: (i) Holiday environment (items included a pot of gold at the end of a rainbow, a four-leaf clover, and a Christmas tree), (ii) Museum environment (items included a dinosaur skeleton and a knight's armor), and (iii) Recreation Room environment (items included a dart board and a chess board). Each of these mini-environments included items that were appropriate for the theme assigned to the environment and could be used for assessment in the transfer tasks. Three larger main environments were also created and had more traditional items of interest in them. The main environments were (i) House (items included an alarm clock and a bed), (ii) Office (items included a conference table and a vending machine), and (iii) School (items included a chalkboard and an abacus). Each environment featured a scavenger hunt; there were 8 target scavenger hunt items in each main environment. Additionally, there were a number of incidental non-target items that were included in each environment, with the numbers being as consistent as possible between environments. Environments and immersion levels were counterbalanced. The study had a within-subjects design, with participants returning after no less than 14 days for each additional session to minimize carry-over effects. Therefore, all participants who completed each of the three sessions experienced all environments and immersive technologies. As an example of an environment layout, an overhead view of the Office environment is shown in Fig. 1.

Fig. 1. Overhead screenshot of the Office environment with researcher-generated room labels (labels not shown to participants). Different items can be viewed in each room, such as a coat rack, treadmill, conference table, and copy machine.
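The paper states that environments and immersion levels were counterbalanced across the three sessions but does not specify the exact scheme. One conventional option is a Latin-square assignment; the Python sketch below is a minimal illustration under that assumption (participant IDs, group rotation, and the pairing of the two squares are hypothetical, not the study's actual assignment code).

ENVIRONMENTS = ["House", "Office", "School"]
IMMERSION = ["low", "medium", "high"]

def latin_square(items):
    # 3x3 Latin square: each item appears exactly once per session position.
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

ENV_ORDERS = latin_square(ENVIRONMENTS)
IMM_ORDERS = latin_square(IMMERSION)

def assign_sessions(participant_id):
    """Return an (environment, immersion) pair for each of the three sessions.

    Rotating the two squares independently by participant ID spreads both
    environment order and immersion order across participants.
    """
    env_order = ENV_ORDERS[participant_id % 3]
    imm_order = IMM_ORDERS[(participant_id // 3) % 3]
    return list(zip(env_order, imm_order))

if __name__ == "__main__":
    for pid in range(6):
        print(pid, assign_sessions(pid))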

2.2 Task Description

In the experiment, the participants were tasked with navigating through the environment and conducting a scavenger hunt for 8 target objects that each had numbered flags on top of them. In every condition, the participants navigated with an Xbox One controller. They were asked to look serially for a target item (e.g., "Find item number 1.") with the given numbered flag instead of the item name so as not to introduce any semantic information about the item to the participant. This was to ensure that when participants engaged with the transfer tests they were remembering the environment that they navigated and not information or instructions that they were told. When a target item was found, the participant would indicate it by pressing a button on the controller, and the system would record it and ask them to find the next item number. Each main environment had an equal number of rooms (13), an equal number of target items (8), and a similar number of incidental (non-target) items. The participants had 15 min to engage in the scavenger hunt, and if they completed the task prior to this they were asked to explore the environment until the time elapsed.

Careful consideration was given to which items would be considered target items and to ensuring that there was no duplication of target items between the environments. For instance, if a sink was a target object in one environment, it was ensured that (a) no other environment had the same sink and (b) no sinks served as target items in other environments. In order to ensure that distinct (i.e., no identical objects) questions could be asked about each environment, the selection of the target items for the scavenger hunt was tightly coupled with the creation of the transfer tasks that would be used for later assessment. In order to ensure that the environments were equally complex, the number of rooms, size of rooms, and number of doors were the same, and the number of items in each room were counted similarly with high inter-rater agreement, as this would influence the number of items that it would be possible to recall. Further details about the creation of the environments, metrics supporting equal complexity, and the determination of the order of the scavenger hunt are documented by Files et al. [5]. This equivalency ensured that transfer questions were derived from equally complex environments, could be balanced, and that no environment had an obvious advantage or disadvantage.
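The actual environments were implemented in Unity3D; the Python sketch below only illustrates the scavenger-hunt control logic described above (serial numbered prompts, button confirmation, event logging, and the 15-minute cap). The class name, event labels, and log format are illustrative assumptions, not taken from the study's software.

import time

class ScavengerHunt:
    """Minimal sketch of the scavenger-hunt flow: prompt for numbered flags
    in order, record confirmations, and stop after the time limit."""

    def __init__(self, n_targets=8, time_limit_s=15 * 60):
        self.n_targets = n_targets
        self.time_limit_s = time_limit_s
        self.log = []            # (elapsed_seconds, event, target_number)
        self.current_target = 1
        self.start_time = None

    def start(self):
        self.start_time = time.time()
        self.log.append((0.0, "prompt", self.current_target))  # "Find item number 1."

    def time_remaining(self):
        return self.time_limit_s - (time.time() - self.start_time)

    def confirm_button_pressed(self):
        """Called when the participant presses the controller button at a target."""
        t = time.time() - self.start_time
        self.log.append((t, "found", self.current_target))
        if self.current_target < self.n_targets:
            self.current_target += 1
            self.log.append((t, "prompt", self.current_target))
        else:
            # All targets found: participant free-explores until time runs out.
            self.log.append((t, "free_exploration", None))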

2.3 Transfer Tasks and Development

The transfer tasks were divided into five types with increasing complexity, based on the level of spatial knowledge processing required, from recall and recognition to more complex and integrative survey knowledge [10]. Types included: free recall, recognition, route description, bearings, and map drawing. All the tasks except the map drawing were completed in a computer-based survey system presented on a Surface Pro tablet. The map drawing was completed using pen and paper. After the completion of the interaction with the environment, the participants engaged in these activities in the aforementioned order so as not to "give away" or activate a schema for the environment. For example, because map drawing requires the most complete processing and integration of the participant's spatial memory, we wanted the participant to complete this
last. If they completed the map drawing before the route descriptions, they may be pulling from their memory of their own drawn map to answer the question rather than the memory of their actual experience in the environment. Similarly, because participants were shown photos of objects during the recognition portion, we wanted this to occur after the free recall. Otherwise, participants may have remembered an object from its photo during the recognition transfer task rather than from their memory of their experience in the environment. Free Recall. First, participants were asked to freely recall any and all objects that were present in the environment. The phrasing of the prompt was carefully worded as “Please list all of the objects you encountered in the virtual environment”, in order to encourage participants to list everything they recalled even if it was not specifically part of the numbered search task. For the mini environment participants had 1.5 min and the opportunity to type up to 25 items. In the main environments participants had 3 min and were able to type in as many as 74 items. These numbers were deemed sufficient based on pre-pilot runs where non-naïve participants were allowed to recall an unlimited number of items during the time limit. After the initial recall, the participants were prompted to indicate, of the objects recalled, which had been target objects in the scavenger hunt, and which objects had included any audio components. These measures indicate how much target and incidental (non-target) knowledge was gained from navigating the environment without providing any cues for recall. It is expected that participants are likely to recall items that were “targets” in the scavenger hunt task, but also that they will have noticed other objects throughout the exploration of the VE. Initial planned approaches for analysis of the recall questions include compiling the number of items listed by the participant; determining if there is accuracy in identifying the characteristics of the objects (e.g., sound; target), and looking for any frequently missed or added items. Recognition. Next, participants were given a recognition test that was meant to examine the impact of schemas on their knowledge of the environment. Participants were presented with actual images from the environment as well as foil images, and needed to determine if they were in the environment or not. There were two forms of these questions that had different overall intentions. The first type presented an image and asked the participant to indicate with a “yes” or a “no” if the specific object was in the environment. This type of question presented items that could reasonably be in the environment among objects that were actually in the environment. This was to see if participants were relying on their schema for the specific environment rather than direct memory of the environment. This method has been previously used in route learning research in order to determine what information was gleaned from the interaction [15]. The second type of question presented images of four objects and asked the participants to indicate which of the objects was present in the environment. All images in both question types featured an isolated image of the object on a white background to ensure no contextual information would impact response to transfer and allow for a seamless introduction to foils. We included four levels of cases in which these objects were schematically similar (e.g. 
objects found in a baby's room), functionally similar (e.g., different objects that tell time), identical but varied in color (e.g., same power drill but in four distinct colors), and cases in which objects differed in small details (e.g., all four were clocks of the same color, shape, and function but varied in number font, artistry, etc.). These questions were designed to determine if specific types of errors were being made, and if they varied systematically across conditions. In the case of both types of questions (yes vs. no and multiple choice), there was a mix of target and non-target objects included in the questions. This decision was made to distinguish if there is a difference in memory for objects from the task that were being engaged with (target) vs. incidental objects from the environments (non-target). It should be noted that there were no time constraints on the recognition items. Examples of these two types of questions are shown in Fig. 2.

Fig. 2. The top question is an example of recognition question type 1, which was a “yes” or “no” response. The bottom question is an example of recognition question type 2, which presents items that differed in small details and asks for the one in the environment to be identified.

Initial analysis approaches for the first type of recognition questions are expected to involve a count of how many accurate and inaccurate "yes" or "no" responses were given and whether there were any differences in accuracy for target or non-target items. For the second type of recognition questions, the overall accuracy will be examined, as well as which type of characteristic question results in the greatest number of errors. This will provide additional information about the level of information that was encoded
about the objects in the environment and if a particular level of immersion was conducive to a given “resolution” of recall. For example, if some subjects performed better on the low-level specific detail questions for the most immersive condition (HMD) then that would be a case in favor of that level of immersion being most effective for that subset of subjects. Route Description. After the recall and recognition questions, participants were asked to describe the route between two specified objects. More specifically, participants were asked to imagine giving directions to a person who had never been in the environment before and type in those directions. Below the route prompt, an example of a route description of a commonly known route (to the current testing area from the building lobby) was displayed. The example included the initial orientation in the environment at the starting object (e.g. elevator), directives (e.g. turn right), landmarks (e.g. couch), and goal location (e.g. lab). Once participants reviewed the instructions, they received either 60 or 90 s (mini and main environment respectively) to describe the route in a free response textbox. In choosing which objects to use in these questions, careful consideration was given to the number of rooms that would need to be traversed, and if the participant was being asked to describe a route from target to target, target to nontarget, non-target to target, or non-target to non-target. The number of questions about objects of the aforementioned types were kept equal between all conditions. This allows us to make comparisons between memory for scavenger hunt target items and incidental items. Our preliminary approach to evaluating these descriptions was largely borrowed from the work of Lovelace and colleagues [9]. This scoring protocol includes route quality, landmarks, elaboration, errors, and segments completed, among others. In addition to the route quality ratings from Lovelace et al. [9], we added categories that included interpretations of landmarks, vagueness, and whether participants reported they did not remember. Upon completion of the scoring protocol, we plan to establish interrater reliability among coders on a sub-set of the route descriptions. Once there is agreement, the remaining descriptions will be distributed amongst coders for evaluation. Bearings. Next, participants were given a test to identify the bearing directions of an object given a specific viewpoint. This was designed to evaluate their spatial understanding of the relationship between objects and the environment. Easy (within the same room), medium (1 or 2 rooms over), and hard (3 rooms over or more) questions were developed, again with careful consideration to the characteristics of the objects that were being considered (e.g., target to non-target; non-target to non-target) and ensuring they were equal across conditions. Participants were told to indicate the direction of a goal object given a starting viewpoint as though they were pointing to an object with a penetrating laser. The analogy of a “penetrating laser” was used to convey that the direction reported should not be influenced by potential obstructions such as walls or other objects. Below these instructions, participants were provided with an example question to familiarize themselves with the question type. The example used the spatial relationship between California, Washington, and Oregon, which should be common knowledge to our participants (living in California). 
The example prompt asked participants to imagine standing on Oregon facing towards Washington and to indicate in what direction they would find California. Underneath the prompt, the question contained a palm-sized circle with mouse-clickable tick marks every three degrees along the inner circumference. In the middle of the circle was a photo of the object they were to imagine "standing on," and above the circle was the object they were to imagine "facing." See Fig. 3 for an example of one of the bearings questions.

Fig. 3. An example of a Bearings question.
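The paper does not spell out how the clicked bearing responses are scored. One natural metric is the absolute angular error between the reported direction and the true bearing, wrapped to 0–180°; the sketch below is a minimal illustration under that assumption, with the difficulty bins taken from the easy/medium/hard distinction described above.

def angular_error(reported_deg, true_deg):
    """Absolute angular difference in degrees, wrapped to [0, 180]."""
    diff = abs(reported_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

def score_bearings(responses):
    """responses: list of (reported_deg, true_deg, difficulty) tuples.

    Returns the mean angular error per difficulty bin (easy/medium/hard).
    """
    by_difficulty = {}
    for reported, true, difficulty in responses:
        by_difficulty.setdefault(difficulty, []).append(angular_error(reported, true))
    return {d: sum(errs) / len(errs) for d, errs in by_difficulty.items()}

# Example: a response 9 degrees clockwise of the true bearing on an "easy" item.
print(score_bearings([(354.0, 3.0, "easy")]))  # {'easy': 9.0}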

Map Drawing. Finally, participants were given a piece of 8.5 × 11 in. graph paper and an erasable pen. Participants were instructed to draw a floor plan of the environment. They were asked to label each room, to refrain from using any label twice, and to indicate doorways with an "X". This task was developed to measure the survey knowledge of the participants and to assess how accurate their mental map of the rooms and their configuration relative to each other was. Further, the
participants were asked to label the rooms in order to determine if they had schematized the information. An erasable pen was provided to ensure that participants were able to generate a map that was legible for input into the Gardony Map Drawing Analyzer (GMDA) [6], while also retaining the ability to erase as needed. Participants had either 2 or 5 min to draw the floor plan when drawing the mini or main environment, respectively. We plan on using an open-source and validated analyzer, GMDA, to evaluate how accurate participants' map drawings were [6]. First, we will draw boundaries for each room on top of a true map of the environment within the software to serve as the template. Once an accurate template has been created, we will develop a protocol for scanning the participant maps and upload them to the GMDA. Using the same boundary drawing protocol we used for template creation, we will analyze the scanned map drawing for accuracy. The metrics we may use to compare maps include the number of rooms drawn, room position relative to others, scale, rotational bias, and angle bias, among others.

2.4 Distinction of Transfer Task Levels and Cognitive Psychology

These five levels of tasks described in detail above were developed to measure spatial knowledge acquisition, including spatial working memory (remembering objects, shapes, and colors in recall and recognition) and survey knowledge (e.g., route descriptions, bearing directions, map drawing). Survey knowledge includes visual, geometric, relational, emotional, and descriptive information [7], and requires information to be processed more deeply than with simple route or recognition memory [15]. Having these different levels of transfer tasks allows for nuanced comparison of spatial learning that occurred under conditions of different levels of immersion and different individual traits. From a cognitive psychology perspective, the initial spatial-memory questions (item recall, item identification) are representative of a shallow, automatic encoding that can be acquired with minimal processing and attention [2, 8]. In contrast, survey knowledge, as represented by being able to explain the relative location of items to each other and to draw a map, requires more in-depth and effortful processing to achieve [2]. Additionally, asking questions about target items as well as non-target items can lend insight into the encoding processes that are occurring during the interaction. Comparisons can be done to determine if the same level and type of knowledge was achieved in all conditions, or if certain levels of immersion were more conducive to the development of knowledge of the environment. For instance, if the high immersion condition results in higher performance on the survey knowledge measures, it could have important implications for the design and display of VEs.

3 Application of Methodology and Implications for AISs

The described experiment and the process used for the development of transfer tasks are highly relevant to the development of AISs. There are multiple relevant lessons learned and processes that were used during the development of the tasks. The team which developed these questions was highly distributed, which resulted in the materials needing to be very organized and highly annotated. Many AISs have teams that are
often working in different locations and that need to work collaboratively on the development of the system. Additionally, environments and transfer evaluations, such as the project described, could be incorporated within an intelligent tutoring framework such as GIFT [14] to engage in continuous assessment, adaptive tutoring, and individual remediation based on performance. GIFT has many functions that are useful to enhance and conduct psychology experiments [1, 11–13].

3.1 Implications of Examining Learning Transfer in an AIS

The learning transfer tasks that were described can be updated and used in similar training environments to assess the level and types of knowledge that are being gained from an interaction. Considering the distinction between shallow and deep encoding can lend insight into how much is being learned from interaction with an ITS or similar system. If it is determined that certain characteristics of ITSs result in deeper knowledge (e.g., the ability to identify bearings or generate a map), then those characteristics can be utilized in future ITSs and AISs in general. The specific types of questions that were used in our study are generalizable and can be updated or changed based on the topic or goals of a tutor that is being created. Determining whether there is learning transfer after engaging with a VE can help to improve the design of ITS interventions and provide a trainer with valuable information about what type of knowledge their learners are getting out of engaging with the ITS.

3.2 Application of VEs and Transfer to the Generalized Intelligent Framework for Tutoring

Spatial navigation and learning are ideal tasks to incorporate into an AIS using a framework such as GIFT. The VEs that were created for the described experiment were created using Unity3D. GIFT has the ability to interface with external training applications, and a gateway already exists between Unity and GIFT. Therefore, if desired, it would be fairly straightforward to connect a Unity environment such as the one described with GIFT. It would be necessary to ensure that the desired performance and state information (e.g., timing, finding target items during the scavenger hunt) was being communicated from Unity to GIFT. This could then be used to provide adaptive feedback that could be authored based on the desired training outcomes. If an ITS were to be created to assess spatial learning, the approaches that were used for the currently described transfer tasks would be useful. An ITS framework such as GIFT includes a survey authoring system that can be used to create nearly all of the types of questions that were described (with the exception of map drawing). While it would be difficult to leverage the assessment of recall questions in real time, accuracy on recognition and bearings questions could be determined in real time and used to provide recommendations as well as feedback on ways that learners could improve their performance. There would be many opportunities to provide adaptive feedback in real time in a VE, and to additionally store information for post-processing after the fact to inform future training.
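As a purely illustrative sketch of the kind of performance and state information the Unity environment would need to emit to an external tutoring framework, one could serialize each scavenger-hunt event into a small JSON payload like the one below. The field names and structure are assumptions for illustration only; a real integration would use the message schema expected by the framework's gateway (e.g., GIFT's), which is not described here.

import json
import time

def make_state_message(participant_id, event, target_number=None, elapsed_s=None):
    """Build a generic JSON payload describing a scavenger-hunt event.

    The field names here are illustrative; a real integration would follow
    the tutoring framework's own gateway message schema.
    """
    return json.dumps({
        "participant": participant_id,
        "timestamp": time.time(),
        "event": event,                 # e.g., "target_found", "prompt_issued"
        "target_number": target_number,
        "elapsed_seconds": elapsed_s,
    })

# Example: report that target 3 was found five minutes into the session.
print(make_state_message("P07", "target_found", target_number=3, elapsed_s=300.0))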

3.3 Implications of Working with a Distributed Team

As is the case with many AISs, the research team involved in the described study was not all working in the same location. The group was distributed across the West Coast, Southeast, and Mid-Atlantic of the United States, and comprised 10+ members. In order to support these many locations, planning meetings occurred over the phone, with some in-person synchronization meetings of the entire team. In order to engage in the actual development of the transfer tasks, a subset of the group met consistently through phone calls in order to come up with an overall plan for the task development. A full list of questions of each type for each environment was developed by the subgroup of researchers and then compiled and edited by one researcher in order to balance the characteristics of items and questions. All of the questions were carefully compared to ensure consistency between the transfer questions for the different environments. Careful notes and descriptions were included with the final version of the questions, as they were implemented by an additional group of team members. These additional team members collected, modified, and prepared the required images, and one team member formatted and entered the questions into an electronic survey platform. Finally, the team members involved with the development of the transfer questions reviewed the versions that were implemented in the survey platform.

4 Conclusions

The current paper describes the development of transfer tasks to evaluate the types of knowledge that participants gained after navigating in Unity3D-based virtual environments with different levels of immersive technology. The types of questions that were generated and described offer insight into the types of questions that could be generated for use in AISs by instructors and researchers who wish to assess the type of information (shallow vs. deep) that was retained after engaging in a VE. The described environments and tasks could be used as a basis for a spatial navigation tutor using a framework such as GIFT. The approaches for developing the tasks in a distributed team, and for ensuring consistency between experimental conditions, offer insight into processes that can be used for future AIS design and experimentation.

Acknowledgements. This work was funded by the US Army Research Laboratory's Human Research and Engineering Directorate. The authors would like to acknowledge Jason Moss who contributed to the conceptualization and development of the experiment. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.


References

1. Boyce, M.W.: From concept to publication – a successful application of using GIFT from the ground up. In: Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym4), p. 125 (2016)
2. Craik, F.I., Lockhart, R.S.: Levels of processing: a framework for memory research. J. Verbal Learn. Verbal Behav. 11(6), 671–684 (1972)
3. Darken, R.P., Banker, W.P.: Navigating in natural environments: a virtual environment training transfer study. In: Proceedings of VRAIS 1998, pp. 12–19 (1998)
4. Darken, R.P., Goerger, S.R.: The transfer of strategies from virtual to real environments: an explanation for performance differences? In: Proceedings of Virtual Worlds and Simulation 1999, pp. 159–164 (1999)
5. Files, B.T., Oiknine, A.H., Thomas, J., Khooshabeh, P., Sinatra, A.M., Pollard, K.A.: Same task, different place: developing novel simulation environments with equivalent task difficulties. In: Applied Human Factors and Ergonomics 2019 Conference (in press)
6. Gardony, A.L., Taylor, H.A., Brunyé, T.T.: Gardony map drawing analyzer: software for quantitative analysis of sketch maps. Behav. Res. Methods 48(1), 151–177 (2016)
7. Golledge, R.G., Dougherty, V., Bell, S.: Acquiring spatial knowledge: survey versus route-based knowledge in unfamiliar environments. Ann. Assoc. Am. Geogr. 85(1), 134–158 (1995)
8. Hasher, L., Zacks, R.T.: Automatic and effortful processes in memory. J. Exp. Psychol. Gen. 108(3), 356 (1979)
9. Lovelace, K.L., Hegarty, M., Montello, D.R.: Elements of good route directions in familiar and unfamiliar environments. In: Freksa, C., Mark, D.M. (eds.) COSIT 1999. LNCS, vol. 1661, pp. 65–82. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48384-5_5
10. Siegel, A.W., White, S.H.: The development of spatial representations of large-scale environments. Adv. Child Dev. Behav. 10, 9–55 (1975)
11. Sinatra, A.M.: The 2018 research psychologist's guide to GIFT. In: Proceedings of the Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym6) (2018)
12. Sinatra, A.M.: The updated research psychologist's guide to GIFT. In: Proceedings of the Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym4), p. 135 (2016)
13. Sinatra, A.M.: The research psychologist's guide to GIFT. In: Proceedings of the 2nd Annual GIFT Users Symposium, pp. 85–92 (2014)
14. Sottilare, R.A., Brawner, K.W., Sinatra, A.M., Johnston, J.H.: An updated concept for a Generalized Intelligent Framework for Tutoring (GIFT) (2017). GIFTtutoring.org
15. van Asselen, M., Fritschy, E., Postma, A.: The influence of intentional and incidental learning on acquiring spatial knowledge during navigation. Psychol. Res. 70(2), 151–156 (2006)

Application of Theory to the Development of an Adaptive Training System for a Submarine Electronic Warfare Task

Wendi L. Van Buskirk1(&), Nicholas W. Fraulini2, Bradford L. Schroeder1, Cheryl I. Johnson1, and Matthew D. Marraffino1

1 Naval Air Warfare Center Training Systems Division, Orlando, USA
{wendi.vanbuskirk, bradford.schroeder, cheryl.i.johnson, matthew.marraffino}@navy.mil
2 StraCon Services Group, LLC, Fort Worth, USA
[email protected]

Abstract. Adaptive training (AT) can be an efficient option for providing individualized instruction tailored to trainees' needs. Given promising research findings involving AT, we were challenged with developing an AT solution for Submarine Electronic Warfare (EW). Submarine EW is a complex task that involves classifying contacts, recognizing changes in the environment, and interpreting real-time displays. To train this task, we designed and developed the Submarine Electronic Warfare Adaptive Trainer (SEW-AT). We drew from multiple theoretical perspectives to drive our design decisions, including Multiple Resource Theory (MRT) and the Zone of Proximal Development (ZPD). Following the trainer's development, we conducted a training effectiveness evaluation (TEE) to gauge initial performance improvements from SEW-AT. Using Submariners (n = 34) from 4 different Submarine Learning Centers across the United States, we found a 46% reduction in missed reports and a 49% improvement in report accuracy while using SEW-AT. As a next step, we plan to explore how the frequency of adaptation, or adaptation schedules, affects training performance and efficiency to determine if finer-grained adaptations produce greater learning gains.

Keywords: Adaptive training · Submarine Electronic Warfare · Training effectiveness evaluation · Feedback



1 Introduction

1.1 Background

Submarine Electronic Warfare (EW) is a critical task for submarine safety and a highly perishable skill. It is a dynamic and complex assignment that requires operators to monitor multiple inputs in order to classify contacts, recognize changes in the environment, and submit reports. The sheer amount of information that operators must monitor and assess across multiple modalities while under tight time constraints
presents significant workload issues. The US Navy is investigating ways to improve operator performance in this domain with a particular focus on training. For example, the Submarine Tactical Requirements Group has recommended using Adaptive Training (AT) to fulfill critical training gaps in EW. Adaptive training can be conceptualized as consisting of content that can be adjusted based on an individual's aptitude, learning preference, or learning style [1]. Training can be adjusted, either during training or following a training session, based on a trainee's performance. The adjustments (adaptations) can be applied to different aspects of the task (e.g., instructional material, feedback, or difficulty) depending on training requirements [2]. AT is a practical option for instructors who otherwise may be unable to tailor instruction to an individual or group. Previous research has shown the benefits of this technique when compared against non-adaptive training [3, 4]. For example, Tennyson and Rothen [5] found benefits for adapting based on in-task responses compared to adapting based on pre-task data or not adapting. Furthermore, these benefits of AT have been extended across many applied fields, including medical and military training [6, 7].

Given these promising findings, we were challenged with developing an AT solution for Submarine Electronic Warfare (EW) Operators. To meet the Fleet's training needs, we developed the Submarine Electronic Warfare Adaptive Trainer (SEW-AT), which allows trainees to practice this task in a realistic environment. In order to develop SEW-AT, we started with an approach used by Landsberg et al. [2] in which they adapted both the scenario difficulty and feedback content based on performance in a training system for submarine periscope operators. The approach was applied to a submarine periscope task that was visual-spatial in nature and included a temporal demand. Landsberg et al. [2] provided evidence for performance benefits on both timeliness and accuracy. Considering the similarities in task demands between the periscope task and the EW task, we chose to utilize this approach in SEW-AT. However, the multifaceted nature of Submarine EW required extending and modifying the approach used by Landsberg et al. In order to make these changes, we consulted several cognitive theories to develop SEW-AT.

1.2 Theory

In order to understand the theory-based design choices we made, it is important to understand some context of the EW task. Operators must identify task events by simultaneously listening for radio frequency signals and navigating cluttered real-time displays. Moreover, operators must remain vigilant for counter-detection and threat recognition. EW Operators are also required to submit reports based on their current contact picture to the Officer of the Deck (OOD) during specific time windows. Therefore, SEW-AT was developed to allow trainees to perform the role of an EW operator in a series of 10 to 15 min scenarios and complete tasking and reports as required. To aid with trainees' report submissions in SEW-AT (which are verbal reports in the operational environment), we developed a report GUI that generates their verbal reports from operator selections and typed entries. Due to the complex nature of SEW-AT and its components, our team employed several theoretical frameworks during the design process. These theoretical frameworks contributed to the design and development of feedback delivery and difficulty
adaptation. Regarding feedback delivery, previous research advocates for detailed process feedback during training for military tasks [8]. Process feedback provides trainees information about their underlying performance to encourage strategy development for the future (versus outcome feedback which provides performance accuracy). Moreover, this method of feedback better approximates individualized instruction [9]. Unfortunately, workload and temporal demands imposed by the task would not allow for delivery of process feedback in real time as we were concerned that it would overload the operator. Our solution to this issue was two-fold: we provided real-time feedback in the form of audio cues or hints to alert trainees to important events they may have missed; additionally, we provided end-of-scenario feedback (a combination of process and detailed outcome feedback) to offer insights into areas of improvement. When it came to implementation of this feedback approach, we used Multiple Resource Theory and literature on the timing of training feedback to guide our specific design decisions. Multiple Resource Theory (MRT) and resulting box model [10] provided a theoretical framework for several key decisions in the development process. MRT assumes there are independent working memory sub-systems (i.e., verbal, spatial), which have a limited capacity. MRT emerged based on evidence that differences in performance on concurrent tasks emanated from differences in the qualitative demands of the separate tasks [11, 12]. Wickens [13] expanded upon prior research by proposing a model that linked supposed resource dimensions to underlying neurological mechanisms [14]. Specifically, the MRT framework identifies perceptual modalities (e.g. visual, auditory) through which sensory information is processed (e.g., verbal or spatial). Therefore, overloading the visual or auditory channels individually can add difficulty to information processing compared to providing information both visually and audibly. Additionally, the verbal and spatial processing codes represent the manner in which information displayed through the visual or auditory channels is encoded and then processed in working memory. Much like the separate perceptual modalities, processing codes are also subject to overuse. For example, Goodman, Tijerina, Bents, and Wierwille [15] provide evidence showing the benefits of voiced cell phone dialing while driving due to the various spatial and manual demands of controlling the vehicle. Voiced dialing avoids overloading the visual channel, which is highly taxed with driving. Further, it also utilizes verbal component versus spatial (locating numbers on the number pad) in order to avoid taxing the spatial processing used during driving. From this example, we also see how the spatial and verbal processing codes can combine with perceptual modality to increase difficulty. Using MRT as an inspiration, we took a two phased approach to providing feedback in SEW-AT. The first was to provide immediate feedback for critical events by using audio cues. Although the EW task primarily requires auditory recognition of RF frequencies, we felt that brief audio verbal cues during the task would not overly tax the auditory channel. This decision was made with the high taxation of visual verbal channel (e.g. real-time displays, verbal report GUI, etc.) in mind. The audio verbal cues would quickly alert trainees to critical task events without inhibiting their ability to perceive and process frequencies. 
Moreover, the audio verbal cues would not interfere with trainees’ ability to manually respond to these events through the report GUI.


The second phase to feedback delivery was to provide a combination of process and detailed outcome feedback after each scenario. The EW task also requires trainees to utilize both visual verbal and visual spatial processing codes to recognize, analyze, and interpret stimuli. Due to the degree in which the task taxes the visual channel, we did not feel that any real-time, detailed, feedback was appropriate for SEW-AT. Furthermore, several researchers [16, 17] have argued against the use of the immediate feedback suggesting that the processing of feedback in real-time competes for limited cognitive resources that are being used to perform the task and learning suffers as a result. Given the temporal demand and modality processing issues imposed by the SEW-AT task, we chose to delay providing detailed feedback until completion of the task in hopes of encouraging long-term learning and retention. In addition to feedback delivery, the process by which SEW-AT adapted trainees’ scenario difficulty also required theoretical consideration. Given our desire to both train and challenge EW Operators, we found that Vygotsky’s [18] Zone of Proximal Development (ZPD) most appropriately guided our design of SEW-AT’s difficulty adaptations. The ZPD represents the theoretical “zone” in which task difficulty challenges trainees during learning without overwhelming them and discouraging learning. Therefore, we developed an algorithm that adapted task difficulty between scenarios to be more challenging when trainees performed well, but less challenging when they performed poorly in order to keep them in this desired zone. For example, trainees who performed well on a given scenario would receive a more difficult scenario next; similarly, trainees who performed poorly on a given scenario receive an easier scenario next. Trainees remained at the same level of difficulty for their next scenario when their scores did not exceed either threshold for adapting. Additionally, the concept of the ZPD helped to guide our decisions for scenario length and adaptation frequency. Although our goal was to create algorithms that kept trainees in their ZPD, this did not come without concerns. Chief among these concerns was the question of how often to schedule difficulty adaptations to more precisely position trainees in their ZPDs. To address this, we consulted Van Lehn’s [19] concept of granularity for guidance. Granularity in this context refers to the number of opportunities trainees have to interact with the system when completing a task. In SEW-AT, users interact with the system by submitting multiple reports within each scenario. The problem at hand is what amount of interaction (i.e., what granularity of measurement) should be used to inform difficulty adaptations. Should each report trigger adaptation? Or, should a set of reports trigger adaptation? Though we could have adapted scenario difficulty after every report, we felt this would have adapted trainees too often, potentially forcing them from their respective ZPDs. As a result, we chose to implement 10–15 min scenarios and to adapt following each scenario, as trainees would then have the opportunity to submit several reports prior to an adaptation decision. In doing so, we believed that our adaptation decisions would be triggered by a more complete assessment of a trainee’s performance, appropriately positioning them into their ZPD. 
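As a schematic of the two-phase feedback approach described above, and not the actual SEW-AT implementation (whose event types and feedback content are not given in the paper), the logic can be summarized as: emit a brief audio-verbal cue when a critical event goes unreported during the scenario, and defer the detailed process/outcome feedback to the end of the scenario. The class and event names below are illustrative assumptions.

class FeedbackController:
    """Sketch of two-phase feedback: brief real-time audio cues plus a
    deferred end-of-scenario summary. Event names are illustrative."""

    def __init__(self, play_audio_cue):
        self.play_audio_cue = play_audio_cue   # callback that plays a short verbal cue
        self.scenario_log = []

    def on_missed_critical_event(self, event_name):
        # Phase 1: immediate, low-bandwidth hint; detailed feedback is withheld
        # so the heavily taxed visual channel is not further loaded.
        self.play_audio_cue(f"Check your displays: {event_name}.")
        self.scenario_log.append(("missed", event_name))

    def on_report_scored(self, report_id, accuracy, on_time):
        self.scenario_log.append(("report", report_id, accuracy, on_time))

    def end_of_scenario_summary(self):
        # Phase 2: combined process and outcome feedback after the scenario.
        reports = [e for e in self.scenario_log if e[0] == "report"]
        missed = [e for e in self.scenario_log if e[0] == "missed"]
        mean_accuracy = sum(r[2] for r in reports) / len(reports) if reports else None
        return {
            "reports_scored": len(reports),
            "mean_accuracy": mean_accuracy,
            "late_or_missed_events": len(missed) + sum(1 for r in reports if not r[3]),
        }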
In sum, our design decisions for SEW-AT were informed by theoretical perspectives of human information processing, adaptive training, and empirical evidence from prior training research. In the following sections, we provide an overview of the EW operator's task demands and detail our approach to evaluating SEW-AT's training
effectiveness. We completed our evaluations with usage data collected from trainees at several submarine learning centers across the United States.

2 Training Effectiveness Evaluation

2.1 Submarine Electronic Warfare Adaptive Trainer

SEW-AT simulates a trip to periscope depth (PD) to train Electronic Warfare Operators on three main training objectives: (1) maintaining safety of ship, (2) recognizing parameter changes, and (3) making accurate and timely reports. Prior to the PD trip, a detailed pre-PD brief is presented for that scenario, which provides the context and mission for that PD trip. Once the pre-PD brief is reviewed, the scenario starts and the EW Operator is presented with opportunities to provide reports to Control during the first 10–15 min at PD. Reports to Control are entered using a report GUI, which allows text-based input of EW verbal report litany; this input is passed off to algorithms that assess the accuracy and timeliness of the report. As discussed above, audio cues are presented during the scenario based on the near real-time assessment of performance. After completion of a scenario, performance feedback is provided and focuses on the training objectives listed above. SEW-AT then adapts the difficulty of the next scenario based on the operator's performance. There are 3 levels of difficulty based on submarine doctrine: basic, intermediate, and advanced. If a trainee performs well, they are moved up a level; if a trainee performs poorly, they move down a level of difficulty; if their performance is fair, they stay at the same level of difficulty. All trainees start SEW-AT on an intermediate-level scenario; after that, their path through SEW-AT is based on performance. Further, trainees create unique user log-ins so that their progress over time is tracked and to ensure scenarios are not repeated.

We delivered SEW-AT systems to several Submarine Learning Centers (SLC) across the country. The goal of providing these systems was threefold: (1) to obtain operator feedback on an initial version of the system, as well as collect site usage data that would (2) help us tune our adaptive algorithms, and (3) provide insight on initial training effectiveness. Although we did not have control over how the system was used [e.g., how many submariners used it, how many times they used the system, or if the sites incorporated the system into their curriculum (versus independent study), etc.], we were able to collect and analyze usage data periodically from the sites during software upgrades. From the first software update to the second, we received back 8 to 10 months of usage data across the sites from 66 EW Operators. These data were broken down into single-touch and multiple-touch users. Single-touch users (n = 32) were defined as trainees who actively played through only one scenario in SEW-AT. Multiple-touch users (n = 34) were defined as trainees who actively played more than one scenario in SEW-AT over the 8 to 10-month period. The multiple-touch data allow us to assess performance improvements over time and are presented below.
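The between-scenario adaptation rule just described can be summarized as a small threshold policy. The sketch below is a schematic only: the composite score scale and the cut-off values are placeholders, since the paper does not report SEW-AT's actual thresholds or scoring weights.

LEVELS = ["basic", "intermediate", "advanced"]

# Placeholder cut-offs on an assumed 0-100 composite scenario score; the
# operational thresholds used in SEW-AT are not reported in the paper.
PROMOTE_AT = 80
DEMOTE_AT = 50

def next_level(current_level, scenario_score):
    """Move up on strong performance, down on weak performance, else stay."""
    i = LEVELS.index(current_level)
    if scenario_score >= PROMOTE_AT:
        i = min(i + 1, len(LEVELS) - 1)
    elif scenario_score < DEMOTE_AT:
        i = max(i - 1, 0)
    return LEVELS[i]

# All trainees start at the intermediate level; difficulty then tracks performance.
level = "intermediate"
for score in [85, 90, 40, 62]:
    level = next_level(level, score)
    print(score, "->", level)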

2.2 Usage Results

In order to assess performance improvements, we derived a "pre-test" and a "post-test" score for both timeliness and accuracy for each operator. The pre-test data were extracted from the first scenario the operators played through, and the post-test data were extracted from the last scenario they played through. Accuracy measures were determined by assessing the percent accuracy of the information that was provided for each report that was required in the scenario. A paired samples t-test was performed to assess accuracy improvements from pre to post. This showed a statistically significant improvement, M_pre = 16.06% (15%), M_post = 33.47% (31%), t(33) = 4.37, p < .001. Timeliness scores were broken down into an analysis of the percentage of late calls (i.e., calls that were provided but were reported past the time they were due) and missed calls (i.e., reports that were never given). A paired samples t-test was performed to assess improvements from pre to post. Note: smaller numbers indicate better performance (e.g., a reduction in the amount of late and/or missed calls). The percentage of late calls was not statistically significantly different from pre to post, M_pre = 15.98% (15%), M_post = 19.96% (21%), t(33) = –0.41, ns. However, the percentage of missed calls did show a statistically significant improvement from pre to post, M_pre = 70.10% (24%), M_post = 51.30% (32%), t(33) = –2.52, p = .008.

Table 1 shows the number of multiple-touch users by SLC site. As the number of users per site varied, we were interested in looking at how performance improvements differed by site, as this was of particular interest for our Fleet stakeholders. Figures 1 and 2 show percentage performance improvements per site and overall for the accuracy and timeliness metrics. As can be seen in Fig. 1, pre to post percent accuracy improvement ranged from 28%–67% by site. Regarding timeliness, pre to post percent improvement in missed calls ranged from 31%–57% by site, and pre to post percent improvements in late calls ranged from 4%–44% by site. As can be seen in Fig. 2, SLC Site 2 showed a large improvement in late calls and, interestingly, that improvement is larger than the improvement in missed calls for that site. While pre to post improvement in late calls was not statistically significant overall, it appears Site 2 may have been struggling with that aspect of the task and had more opportunities to show improvements using SEW-AT.

Table 1. Multi-touch users by submarine learning center site

Submarine learning center site | n
Site 1 | 14
Site 2 | 11
Site 3 | 4
Site 4 | 5
Total | 34
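For completeness, the paired pre/post comparisons reported above can be computed with a standard paired-samples t-test once per-operator scores are extracted from each operator's first and last scenarios. The sketch below uses hypothetical values; it is not the study's analysis script.

import numpy as np
from scipy import stats

def paired_pre_post(pre, post):
    """Paired-samples t-test on per-operator pre/post scores."""
    pre, post = np.asarray(pre, dtype=float), np.asarray(post, dtype=float)
    t, p = stats.ttest_rel(post, pre)
    return {
        "M_pre": pre.mean(), "SD_pre": pre.std(ddof=1),
        "M_post": post.mean(), "SD_post": post.std(ddof=1),
        "t": t, "p": p, "df": len(pre) - 1,
    }

# Usage with hypothetical accuracy percentages (one pair per operator):
pre_accuracy = [12.0, 20.5, 8.0, 15.0]
post_accuracy = [30.0, 41.0, 22.5, 35.0]
print(paired_pre_post(pre_accuracy, post_accuracy))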

In addition to the usage data, we also asked the operators to complete questionnaires and provide us feedback on SEW-AT. These questions can be seen in Table 2. Each question was followed up with an open-ended comments section in order for us to identify and address issues for future software upgrades. Additionally, we asked for opinions on positive aspects of the system, areas for improvement, and details on software bugs (if any occurred).


Fig. 1. Percentage improvement in accuracy by site and overall (chart: "Pre-Post Accuracy Improvement", Sites 1–4 and All). Note: Error bars are SE.

Fig. 2. Percentage improvement in timeliness by site and overall (chart: "Pre-Post Timeliness Improvement", late and missed calls, Sites 1–4 and All). Note: Error bars are SE.


We received responses from 14–17 EW operators depending on the question, which was equivalent to a 21–26% response rate. As can be seen from the data below, the overall responses were promising. The system seemed to be well received, and the training materials and stimuli appeared to be face valid. One concern was the 3.5 rating on the difficulty of using the verbal report GUI. We found a positive correlation (r = 0.37) between the number of scenarios played and how the GUI was rated. Specifically, the more scenarios an operator played, the more highly the operator rated the GUI as easy to use, which indicates there may be a slight learning curve (1–2 scenarios) for operators to become comfortable with the verbal report GUI.

Table 2. SEW-AT user perception data

Question | Yes | No | n
Did you have any issues with the Pre-Periscope Depth briefs (e.g., were they realistic, was there information you would not normally receive, was there information that was missing)? | 12% | 88% | 17
Did you have issues with the realism of any of the scenarios (e.g., were there any emitters you would not expect to see, did something happen in the scenario that normally would not)? | 29% | 71% | 17
Did the difficulty levels appropriately reflect the assigned difficulty (e.g., did basic scenarios seem basic, did advanced scenarios seem advanced)? | 81% | 19% | 16
Did you feel the scenarios adjusted appropriately based on your performance (e.g., did the task become easier when you struggled, did the task become harder when you performed well)? | 93% | 7% | 15
Based on your knowledge of the task, did the verbal report GUI accommodate the actions you would take while sitting the stack (e.g., did it include all necessary litany, were you able to account for all events that took place in the scenarios)? | 73% | 27% | 15

Question | Scale | Mean (SD) | n
How difficult was it for you to use the GUI? | 1–Very Difficult, 5–Very Easy | 3.5 (0.85) | 15
How willing would you be to use SEW-AT in the future? | 1–Not at all willing, 5–Very willing | 3.9 (1.03) | 14

3 Discussion

3.1 Review of TEE Results

As mentioned previously, we had several goals when providing initial versions of SEW-AT software to the SLC sites. The first goal was to obtain operator perceptions and feedback on SEW-AT. In general, EW Operators rated the system favorably and indicated they would be willing to use the system in the future. The open-ended
comments they provided helped us identify software bugs and additional system capabilities that we intend to include in future versions of the system. The usage data we received also allowed us to gain insight on initial training effectiveness. Specifically, we wanted to determine if the previous approach and the theory-based decisions we made were impacting performance in a positive way. Overall, the performance improvements are highly encouraging and suggest this is the case. As alluded to earlier, the data should be viewed tentatively. The number of EW operators, days between sessions, and the number of scenarios that were completed varied by site and operator. Because we did not have experimental control, we cannot attribute performance improvements to SEW-AT versus practice alone. However, these initial data on the system are helpful to us for making improvements and tuning our adaptive and assessment algorithms. Additionally, we have positive results on Kirkpatrick's Level 1 (Reactions) and Level 2 (Learning) training evaluation criteria (see Kirkpatrick [20] for a full description of the training evaluation criteria). As this research program continues, we will be able to also gain insight on Levels 3 (Behavior) and 4 (Results) as we will have data on training transfer and be able to assess effects on operational performance. In addition to TEE data, we plan to take a more experimental approach to assess performance and learning gains in our future research.

3.2 Future Directions

Our results support that using SEW-AT improved trainee performance, but we are also interested in refining SEW-AT’s adaptive algorithms in efforts to continue to improve learning gains. Our future research will involve further exploring difficulty adaptation to examine questions about the ZPD and adaptive training methodologies. Specifically, we aim to investigate difficulty adaptation frequency empirically. Theoretically, adaptation schedules that are more frequent should allow for finer-tuned adjustments that more accurately align to a trainee’s ZPD [18]. This adaptation approach assesses performance to adapt training within scenarios in real-time. Conversely, it is possible that a less-frequent adaptation schedule (as implemented in the present version of SEW-AT) could be better suited to aligning difficulty to the trainee’s ZPD. Such an adaptation would be based on a more comprehensive assessment of performance at the conclusion of a scenario. These opposing schedules for difficulty adaptation are two approaches to Vygotsky’s [18] characterization of the ZPD as the zone where a task is challenging while still being achievable. For future research, we seek to understand which type of adaptation schedule is optimal for learning gains. In addition to adapting training difficulty, we also plan to investigate new approaches to adapting feedback in the Submarine EW task. Currently, SEW-AT provides endof-scenario feedback to avoid disrupting trainees during this complex and temporallydemanding task. It may be the case, however, that providing feedback in real time can guide trainees to more efficient means of excelling during training and retaining their skills long-term. This is especially true in tasks spanning multiple modalities, as providing immediate feedback in one modality (e.g. the auditory channel) may help to alleviate overwhelming demand on another modality (e.g. the visual channel) [21]. Potential avenues for implementing real-time feedback in SEW-AT include directing


Potential avenues for implementing real-time feedback in SEW-AT include directing trainees toward real-time displays, as well as providing visual cues for new events occurring in the environment.
An additional area we aim to explore is the individual differences that impact learning gains in adaptive training. For instance, previous research has identified that certain individuals can become distressed when task difficulty changes (e.g., those who employ emotion-focused coping strategies; [22]). If so, it is possible that a highly frequent adaptation schedule could be distressing for some trainees, limiting their ability to learn from training. This characteristic will be taken into consideration during our future investigations of these different adaptation schedules, as adaptive training solutions may need to account for trainee characteristics.

Acknowledgements. We gratefully acknowledge Dr. Kip Krebs and the Office of Naval Research, who sponsored this work (Funding Doc# N0001418WX00447). We would also like to thank Marc Prince, Bryan Pittard, and Derek Tolley for their development of the SEW-AT system. Presentation of this material does not constitute or imply its endorsement, recommendation, or favoring by the U.S. Navy or Department of Defense (DoD). The opinions of the authors expressed herein do not necessarily state or reflect those of the U.S. Navy or DoD.

References
1. Landsberg, C.R., Van Buskirk, W.L., Astwood, R.S., Mercado, A.D., Aakre, A.J.: Adaptive training considerations for simulation-based training (Special report 2010-001). Naval Air Warfare Center Training Systems Division, Orlando (2011)
2. Landsberg, C.R., Astwood Jr., R.S., Van Buskirk, W.L., Townsend, L.N., Steinhauser, N.B., Mercado, A.D.: Review of adaptive training system techniques. Mil. Psychol. 24(2), 96–113 (2012)
3. Cook, D.A., Beckman, T.J., Thomas, K.G., Thompson, W.G.: Adapting web-based instruction to residents' knowledge improves learning efficiency. J. Gen. Intern. Med. 23(7), 985–990 (2008)
4. Landsberg, C.R., Mercado, A., Van Buskirk, W.L., Lineberry, M., Steinhauser, N.: Evaluation of an adaptive training system for submarine periscope operations. In: Proceedings of the Human Factors and Ergonomics Society 56th Annual Meeting, pp. 2422–2426. SAGE Publications, Los Angeles, CA (2012). (CD ROM)
5. Tennyson, R.D., Rothen, W.: Pretask and on-task adaptive design strategies for selecting number of instances in concept acquisition. J. Educ. Psychol. 69(5), 586–592 (1977)
6. Romero, C., Ventura, S., Gibaja, E.L., Hervas, C., Romera, F.: Web-based adaptive training simulator system for cardiac support. Artif. Intell. Med. 38, 67–78 (2006)
7. Bauer, K.N., Brusso, R.C., Orvis, K.A.: Using adaptive difficulty to optimize videogame-based training performance: the moderating role of personality. Mil. Psychol. 24(2), 148–165 (2012)
8. Buff, W.L., Campbell, G.E.: What to do or what not to do? Identifying the content of effective feedback. In: Proceedings of the 46th Annual Meeting of the Human Factors and Ergonomics Society, pp. 2074–2078. SAGE Publications, Santa Monica, CA (2002)
9. Park, O.C., Lee, J.: Adaptive instructional systems. In: Jonassen, D. (ed.) Handbook of Research for Educational Communications and Technology, pp. 651–684. MacMillan Publishers, New York (1996)


10. Wickens, C.D., Hollands, J.: Engineering Psychology and Human Performance, 3rd edn. Prentice Hall, Upper Saddle River (2000)
11. Wickens, C.D.: The effects of divided attention on information processing in tracking. J. Exp. Psychol. Hum. Percept. Perform. 2, 1–13 (1976)
12. Wickens, C.D.: Multiple resources and performance prediction. Theoret. Issues Ergon. Sci. 3(2), 159–177 (2002)
13. Wickens, C.D.: The structure of attentional resources. In: Nickerson, R. (ed.) Attention and Performance VIII, pp. 239–257. Lawrence Erlbaum, Hillsdale (1980)
14. Kinsbourne, M., Hicks, R.E.: Functional cerebral space: a model for overflow, transfer and interference effects in human performance. In: Attention and Performance VII, pp. 345–362 (1978)
15. Goodman, M.J., Tijerina, L., Bents, F.D., Wierwille, W.W.: Using cellular telephones in vehicles: safe or unsafe? Transport. Hum. Factors 1, 3–42 (1999)
16. Schmidt, R.A., Wulf, G.: Continuous concurrent feedback degrades skill learning: implications for training and simulation. Hum. Factors 39, 509–525 (1997)
17. Schooler, L.J., Anderson, J.R.: The disruptive potential of immediate feedback. In: Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, Cambridge, MA, pp. 702–708 (1990)
18. Vygotsky, L.S.: Interaction between learning and development. In: Gauvain, M., Cole, M. (eds.) Readings on the Development of Children, 4th edn., pp. 34–40. Worth, New York, NY (2005). Reprinted from: Cole, M., John-Steiner, V., Scribner, S., Souberman, E. (eds.) Mind in Society: The Development of Higher Psychological Processes, pp. 71–91. Harvard University Press, Cambridge (1978)
19. VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 46(4), 197–221 (2011)
20. Kirkpatrick, D.: Great ideas revisited: revisiting Kirkpatrick's four-level model. Training Dev. 50, 54–59 (1996)
21. Moreno, R., Mayer, R.E.: A learner-centered approach to multimedia explanations: deriving instructional design principles from cognitive theory. Inter. Multimedia Electron. J. Comput. Enhanc. Learn. 2(2) (2000). http://imej.wfu.edu/articles/2000/2/05/index.asp
22. Matthews, G.: Extraversion, emotion and performance: a cognitive-adaptive model. Adv. Psychol. 124, 399–442 (1997)

Learning Analytics of Playing Space Fortress with Reinforcement Learning

Joost van Oijen, Jan Joris Roessingh, Gerald Poppinga, and Victor García

Netherlands Aerospace Centre, P.O. Box 90502, 1006 BM Amsterdam, The Netherlands
{Joost.van.Oijen,Jan.Joris.Roessingh,Gerald.Poppinga}@nlr.nl, [email protected]

Abstract. In this paper we analyze the learning process of a neural network-based reinforcement learning algorithm while making comparisons to characteristics of human learning. For the task environment we use the game of Space Fortress, which was originally designed to study human instruction strategies in complex skill acquisition. We present our method for mastering Space Fortress with reinforcement learning, identify similarities with the learning curve of humans, and evaluate the performance of part-task training, which corresponds to earlier findings in humans.

Keywords: Reinforcement learning · Human learning · Space Fortress · Skill acquisition · Transfer learning

1 Introduction

Space Fortress (SF) is an arcade-style game that was developed in the 80's with support of US DoD DARPA to study human learning strategies in complex skill acquisition [1]. The game has a long history of research in educational psychology (e.g. studying instructional design for skill acquisition [2]), cognitive science (e.g. developing cognitive models that simulate human learning [3]), and, more recently, machine learning (e.g. as a test-bed for addressing challenges in reinforcement learning [4]).
Recent breakthroughs in (deep) reinforcement learning (DRL) have led to algorithms that are capable of learning and mastering a range of classic Atari video games [5, 6]. SF shares many gameplay similarities with the Atari games. This has inspired us to investigate whether such algorithms are capable of mastering the game of SF and, if so, whether the manner in which such algorithms learn SF shares similarities with the human learning process. The underlying motivation stems from the idea that if one could construct a representative model to simulate characteristics of human learning, it could potentially be used to make predictions about human task training, for example, to identify task complexity, predict learning trends, or optimize part-task training strategies.
The goal of this study is twofold. First, our goal is to acquire a machine learning mechanism that is capable of learning the full game of SF. Others have addressed this game before, though only using simplified versions of the game (e.g. [4]).


Second, our goal is to analyze the learning process of the learner and compare it to characteristics of human learning in SF. For this comparison we employ human data acquired from an earlier study [7]. Detecting similarities in the learning trends is a prerequisite for the learning mechanism to exhibit predictive qualities for skill acquisition, such as transfer learning. This will be evaluated by comparing the performance of part-task training in SF between man and machine. In this study we take a more opportunistic approach by using an available neural network-based RL implementation that has been proven to learn tasks in game environments similar to SF (e.g. [6]). We recognize that such algorithms have limitations and that the way they learn cannot easily be compared to human learning [9], which we describe qualitatively in this paper. Still, this does not rule out the existence of underlying similarities between (parts of) the learning trends during complex skill acquisition.
The outline of this paper is as follows. Section 2 describes the game of SF and previous machine learning efforts on this game. In Sect. 3 we show how we designed an agent that learns SF with reinforcement learning (RL). In Sect. 4 we analyze and compare the learning curves of the RL algorithm and the human, after which in Sect. 5 we describe results of the transfer learning experiments that were conducted. Finally, in Sect. 6 we conclude on our findings.

2 Background

2.1 Space Fortress

Space Fortress (SF) was developed in the 80's as part of the DARPA Learning Strategies Program to study human learning strategies for skill acquisition on a complex task [1]. The game is challenging and difficult enough to keep human subjects engaged for extended periods of practice. It is demanding in terms of perceptual, cognitive and motor skills, as well as knowledge of the game rules and strategies to follow. SF has a long history of research in psychology and cognitive science, focusing on topics such as (computer-aided) instructional strategies to support human learning [2, 10, 11] and understanding human cognition [12, 13]. The game has also shown positive transfer of skill to the operational environment. This has been demonstrated in the actual flight performance of fighter pilots: subjects who practiced on the game performed significantly better than those who did not [14].

Space Fortress Gameplay
A simplified representation of SF is shown in Fig. 1. In the game the player controls a ship in a frictionless environment (cf. space dynamics) and is tasked to destroy a Fortress which is positioned in the center of the environment. Destroying the Fortress requires a delicate procedure: the player first has to make the Fortress vulnerable, which can be achieved by shooting the Fortress ten times with a missile with at least 250 ms between each shot. When it is vulnerable, the player can destroy the Fortress by executing a 'double shot', a burst of two shots fired at an interval of less than 250 ms.


If a double shot is performed while the Fortress is not yet vulnerable, the Fortress is reset and the player has to start the procedure from the beginning. The Fortress defends itself by shooting shells at the player, which have to be avoided.

Fig. 1. Simplified representation of the Space Fortress game

Further, mines appear periodically in the environment. They prevent the player from destroying the Fortress. There are two types of mines: 'friend' and 'foe' mines. To destroy a mine, the player first has to identify the mine as friend or foe, which is done by the so-called IFF (identification-friend-or-foe) procedure. This represents a cognitive task also known as the Sternberg task [15], a short-term memory task: on the screen, a letter appears that indicates the mine type (the IFF letter). The player has to compare this letter with a set of letters that is presented prior to the game. If the letter is part of that set, the mine is a foe mine. After identification, to destroy a foe mine, the player has to switch weapon systems. This is a control task in which the player has to press a button (the IFF-key) twice with an interval between 250 and 400 ms. Upon completion, the mine can be destroyed. Friend mines can be destroyed (a.k.a. 'energized') immediately. A mine disappears either when it is destroyed, when it hits the player, or after 10 s. The player has three lives. A life is lost when hit by a Fortress shell or upon collision with a mine. The game has a complex scoring mechanism; points are determined based on the player's performance in Fortress and mine destruction, weapon control, position control (not too far from, not too close to the Fortress), velocity control (not too slow or too fast), and speed of mine handling.
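For concreteness, the two timing rules above (the 250 ms double-shot window and the 250–400 ms IFF key interval) can be written as small checks. The following is a minimal illustrative sketch, not code from the game; all function and variable names are assumptions.

# Hypothetical sketch of the two timing rules described above; not taken from the game code.

DOUBLE_SHOT_MAX_MS = 250            # two shots closer together than this count as a "double shot"
IFF_MIN_MS, IFF_MAX_MS = 250, 400   # required interval between the two IFF key presses

def is_double_shot(prev_shot_ms, shot_ms):
    """True if this shot, together with the previous one, forms a double shot."""
    return prev_shot_ms is not None and (shot_ms - prev_shot_ms) < DOUBLE_SHOT_MAX_MS

def iff_switch_succeeds(first_press_ms, second_press_ms):
    """True if the two IFF key presses fall inside the required 250-400 ms window."""
    return IFF_MIN_MS <= (second_press_ms - first_press_ms) <= IFF_MAX_MS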

2.2 Machine Learning in Space Fortress

In recent years Space Fortress (SF) has been used as a relevant task environment for machine learning research with different purposes. From a cognitive science perspective, SF has been used as a medium for simulating human learning. In [3], the authors present a computational model in the ACT-R cognitive architecture [16]. The model is based on Fitts' three-phase characterization of human skill learning [17], and it is argued that all three phases are required to learn to play SF at a human level: (1) the interpretation of declarative instructions, (2) gradual conversion of declarative knowledge to procedural knowledge, and (3) tuning of motor-control variables (e.g. timing shots, aiming, thrust duration and angle). Online learning mechanisms are present in the latter two. For the former, the instructions of the game are interpreted and hand-crafted into the model as declarative instructions.
From a computer science perspective, SF has been considered a testbed for reinforcement learning. The game has challenging characteristics: context-dependent strategy changes are required during gameplay (e.g. mine appearance; the Fortress becoming vulnerable); actions require time-sensitive control (e.g. millisecond-level interval control to perform a double shot or switch weapon systems); rewards are sparse; and the environment is continuous and frictionless. In [4], SF is introduced as a benchmark for deep reinforcement learning. The authors present an end-to-end learning approach, based on a PPO (Proximal Policy Optimization) algorithm [18], that is shown to outperform humans. It should be noted, however, that only a simplified version of the game was addressed, without the presence of any mines.
In relation to the above works, our study takes a middle ground. On the one hand, our goal is similar to [3] with respect to simulating (aspects of) human learning (for predictive purposes in our case). On the other hand, we aim to explore the predictive qualities of a learner using a more readily available, neural network-based RL algorithm.

2.3 Comparing Human Learning and Machine Learning

Despite the recent success of neural network-based (D)RL algorithms for classic Atari games, their shortcomings also quickly become apparent: mastering these games may require hundreds of hours of gameplay, considerably more than humans need. Also, certain types of games cannot be mastered at all by RL (e.g. games that require a form of planning), though research efforts are underway to address such challenges (e.g. introducing hierarchical concepts [19]).
Underlying these observations are several crucial differences in the way the machine learns compared to humans. An extensive overview of such differences is described in [8]. To summarize a few: humans are equipped with 'start-up software', which includes priors developed from childhood such as intuitive physics and psychology, affordances, and semantics about specific entities and interactions [20, 21].


Humans have the capability to build causal models of the world and apply them to new situations, allowing them to compose plans of action and pursue strategies. Additionally, humans are capable of social learning and of acquiring knowledge or instructions from others or from written text, such as a game's instruction manual [22]. In Sect. 3.4 we discuss in more detail how deficiencies of many current algorithms affect the learning of SF.

3 Learning Space Fortress

In this section we explain the approach we took to teach Space Fortress (SF) to an agent using reinforcement learning (RL).

3.1 Game Definitions

For this study we define the goal of the game as the task of destroying the Fortress as quickly as possible, thereby minimizing the Fortress Destruction Time (FDT). Through continuous training, performance on this task can be improved and can easily be measured throughout the learning process (for the machine as well as for human subjects). Three different tasks (games) are defined: two so-called part-tasks (PTs), which represent simplified versions of the game, and the whole task (WT), which represents the full game:
• SF-PT1: No mines are present in the game, only the Fortress
• SF-PT2: Only friend mines are present (no IFF procedure or switching of weapon systems is required)
• SF-WT: Friend and foe mines are present. This represents the full game.
These different tasks are used to analyze and compare the performance of the RL algorithm and human subjects. Later, in Sect. 5, they are used to analyze transfer learning between different tasks.

3.2 Implementation Approach

The SF implementation that was used is a Python version of the game from CogWorks Laboratory [23]. This version was wrapped as an OpenAI Gym environment [24] for interfacing with the reinforcement learning (RL) algorithm. The RL algorithm is partly based on DQN [5]: the input of the network consists of 21 hand-crafted features, rather than pixel-based features as is common in DQN (i.e. no convolutional layers were used). Fully end-to-end learning is not a priority for this study, and the game state of SF can easily be captured in features. Figure 2 visualizes the design of the RL algorithm and the reward structure. These are further explained below.
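As a rough illustration of this setup, the sketch below shows how a feature-based SF simulation might be exposed through the classic OpenAI Gym interface, with a 21-dimensional observation and six discrete actions. The wrapped game object and its methods are assumptions for illustration only; the paper does not publish its wrapper code.

import numpy as np
import gym
from gym import spaces

class SpaceFortressEnv(gym.Env):
    """Hypothetical Gym wrapper around a feature-based Space Fortress simulation."""

    def __init__(self, game):
        self.game = game  # assumed object exposing reset/apply_action/reward/feature_vector
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(21,), dtype=np.float32)
        self.action_space = spaces.Discrete(6)  # thrust, left, right, fire, IFF-key, no-op

    def reset(self):
        self.game.reset()
        return self._features()

    def step(self, action):
        self.game.apply_action(action)       # advance one game step (the game runs at 20 Hz)
        obs = self._features()
        reward = self.game.reward()          # custom reward structure (cf. Fig. 2)
        done = self.game.episode_finished()
        return obs, reward, done, {}

    def _features(self):
        return np.asarray(self.game.feature_vector(), dtype=np.float32)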


Fig. 2. RL algorithm architecture and reward structure

Different types of features were used. First, 18 features encode the current state of the game and include: information about the ship's position, heading and speed; bearing and distance information in relation to the Fortress; any shell and mine that is present; and the Fortress vulnerability counter. Any location coordinates and bearing vectors were decomposed into a sine and cosine component to model a continuous cyclic dimension. Second, two features were used to cope with the network's memory limitations. These features allow the network to understand the time-sensitivity required for distinguishing between a single shot and a double shot (250 ms interval) and for switching weapon systems (pressing the IFF-key twice within a 250–400 ms interval). The features encode the time passed since the last shoot action and since the last key press to switch weapon system. We expect that algorithms with recurrent network units such as LSTM (long short-term memory) or GRU (gated recurrent unit) have the memory capability to encode these time-sensitive strategies, thereby making these features potentially obsolete for such algorithms. Finally, the last feature encodes the mine type (friend or foe). For humans playing the game, identifying a mine is achieved by the Sternberg memory task: one has to determine whether a letter on the screen (the IFF-letter) associated with a mine is a member of a set of letters displayed prior to the game. This type of task cannot easily be simulated by the algorithm used (or similar algorithms). Therefore we abstract away from this task by calculating the result analytically and presenting it as a feature input.
The game has six actions which correspond to human keyboard input for playing the game: add thrust, turn left, turn right, fire a missile, press the IFF-key to switch weapon systems, and a no-op (no operation).
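To make the feature design concrete, the snippet below sketches the cyclic decomposition and the two timing features described above. The normalization cap is an assumption, since the paper does not give the exact scaling.

import math

def angle_features(angle_rad):
    """Encode a cyclic quantity (heading, bearing) as sine and cosine components."""
    return math.sin(angle_rad), math.cos(angle_rad)

def timing_features(t_now_ms, t_last_shot_ms, t_last_iff_press_ms, cap_ms=1000.0):
    """Time elapsed since the last shot and since the last IFF key press, capped and
    normalized (assumed scaling), so the network can represent the 250 ms double-shot
    rule and the 250-400 ms weapon-switch rule without recurrent memory."""
    since_shot = min(t_now_ms - t_last_shot_ms, cap_ms) / cap_ms
    since_iff = min(t_now_ms - t_last_iff_press_ms, cap_ms) / cap_ms
    return since_shot, since_iff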


The reward function is custom designed and built from seven performance metrics (see the reward structure in Fig. 2). Two of these are also represented in the original game's scoring mechanism (Fortress destroyed and mine destroyed). Some metrics that are included in the scoring mechanism were found not to be required in the reward function, such as penalties for losing a life or for wrapping the ship around the screen. Five metrics are introduced to guide the learning process in understanding the rules of the game (e.g. rewarding a Fortress hit to increase its vulnerability, or rewarding correct use of the weapon system for foe mines).
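A reward function of this kind can be sketched as a weighted sum of event indicators. The metric names and weights below are placeholders chosen for illustration; the paper reports the structure (Fig. 2) but not the actual values.

# Illustrative only: placeholder event names and weights, not the values used in the study.
DEFAULT_WEIGHTS = {
    "fortress_destroyed": 1.0,
    "fortress_hit": 0.1,            # shaping signal: raising the vulnerability counter
    "mine_destroyed": 0.5,
    "weapon_switch_ok": 0.1,
    "premature_double_shot": -0.3,  # resets the Fortress vulnerability counter
    "missile_wasted": -0.01,
    "mine_timed_out": -0.2,
}

def step_reward(events, weights=DEFAULT_WEIGHTS):
    """Sum the weights of all events that occurred during this game step.
    `events` is a dict mapping event names to booleans."""
    return sum(w for name, w in weights.items() if events.get(name, False))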

3.3 Results

Experiments have been performed to learn each of the three game types (SF-PT1, SF-PT2 and SF-WT). The training time for each experiment was fixed to 60 million game steps. This corresponds to approximately 833 h of real-time human game play (the game runs at 20 Hz). During training, Fortress Destruction Times were measured and recorded to capture the progress of learning. Results from the experiments are compared to results from human subjects. Human subject data originated from an earlier study [7]. From this study a dataset is available from 36 participants, who each played SF (without prior experience) for 23 sessions of 8 games of 5 min, which totals around 15 h of gameplay per participant. The comparison with humans is shown in Table 1. In Sect. 4 we will compare the learning process itself.

Table 1. Comparison of Fortress Destruction Times (FDTs) between human and machine

Game type | Human(a) FDTs: best time, best average | Machine FDTs: best time, best average
SF-PT1    | 4.35 s, 6.16 s                         | 3.15 s, 3.37 s
SF-PT2    | 4.20 s, 8.57 s                         | 3.45 s, 4.92 s
SF-WT     | 4.55 s, 8.80 s                         | 3.80 s, 6.25 s
(a) The best average is based on a 5-min game session. Human data is taken from the best performing subject per task.

As can be seen from the comparison, the RL algorithm outperforms the best performing human subject for all three game types. From qualitative observations based on video recordings, the algorithm behaves in accordance with the optimal strategy for playing SF, which is to circumnavigate the Fortress in a clockwise manner. Note that this behavior was not explicitly encouraged by the reward structure. Further comparisons between humans and the machine are discussed below.

3.4 Comparison to Human Learning

Two immediate observations can be made when comparing the performance on SF between the machine and human subjects. These are qualitatively explained next.


Task Performance
The machine outperforms humans on all three tasks, converging to faster Fortress destruction times. One explanation is that the machine is not bounded by the perceptual and cognitive limitations of the human body and mind. To give some examples: humans have limited perceptual attention, and attention has to be divided between different elements in the game world and the information elements displayed on the screen around the game world (e.g. checking the Fortress vulnerability or the IFF-letter when a mine appears, see Fig. 1). In contrast, the machine has instant access to these information elements. Further, human reaction time is far higher than the time a machine requires to make a reactive decision: approximately 200–250 ms versus less than 50 ms (the time of a single game step). Finally, humans require accurate motor control of the game's input device (a joystick or keyboard), which is susceptible to human precision errors. The machine performs game actions instantly. In conclusion, under the proven assumption that the RL algorithm is capable of learning the game, and with the mentioned advantages over humans, it can reasonably be expected to outperform humans.

Training Time
The machine requires far more training time to achieve a performance comparable to human subjects. For instance, the best performing human on task SF-PT1 reaches their best time within 10 h (5.15 s), whereas the machine requires almost 400 h to reach the same time. In other words, the RL algorithm is sample inefficient, a common problem in RL research. Several factors contribute to this. First, humans are able to acquire the goal and rules of the game beforehand (e.g. by reading the game's manual). They are able to employ this knowledge from the start of the training process. The machine's algorithm cannot be programmed with this knowledge and has to infer the game's goal and rules purely from reward signals (such as the reward structure in Fig. 2). Second, humans are equipped with learned 'experiences' prior to training, which help them translate the rules of the game into a concrete game plan to execute: e.g. they have a general understanding of physical laws and object dynamics, or they have experience in solving similar problems (e.g. gaming experience). The machine's algorithm does not follow an a priori strategy but learns its policy through exploration and exploitation (using the ɛ-greedy method in our implementation). Finally, if the algorithm were to learn from pixels, sample inefficiency would be expected to be even higher because of the undeveloped vision system, which in (adult) humans is already present.

4 Learning Curve Analysis

In the previous section it was explained why the machine's algorithm does not exactly simulate a human learning the game of SF. Still, there may be trends in the machine's learning process that share characteristics of human learning. In this section we analyze the learning curves of the algorithm and human subjects. Finding similarities is a necessary condition for the algorithm if it is to be used for predictions concerning transfer learning, which we will evaluate in the next section.


For the comparative analysis we use the game type SF-WT. We measured the learning curve for accomplishing the goal of the game: destroying the Fortress. Figure 3 shows the learning curves representing Fortress Destruction Times in seconds as a function of successive Fortress destructions (also called trials). That is, the x-axis does not denote the training time.

Fig. 3. Learning curve data of the RL algorithm (upper) and human (lower). The blue line represents the measured data. The red line plots Eq. (1) which is described below. (Color figure online)

From the data it is apparent that the RL algorithm starts with much longer destruction times and that it needs many more trials to reach asymptote. This is a direct effect of the sample inefficiency of the algorithm, as explained in Sect. 3.4.


The curves of both the time series for the algorithm and the time series for the human learner seem to show similar trends that can be described by Eq. (1), which was originally proposed in [7] as an alternative to the so-called Power Law of Practice, e.g. [8, 25]:

T_n = 1/2 · T_mean + 1/2 · T_min    (1)

In the equation, T_n is the time needed to destroy the Fortress in the nth trial, T_mean is the mean of all preceding (n−1) trial times, and T_min is an estimate of the minimum attainable trial time. Additional analysis is required to determine the fit of Eq. (1) to the RL algorithm's learning curve, as well as the effects that the RL algorithm's hyper-parameters have on the learning curve during training (e.g. the learning rate or the variables of action-selection exploration strategies such as ɛ-greedy). This is left for future study.
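For reference, Eq. (1) can be evaluated iteratively as below; t_min is the estimate of the minimum attainable trial time, and the function is a direct transcription of the formula rather than code from the study.

def predicted_trial_times(observed_times, t_min):
    """Predicted Fortress destruction time per Eq. (1):
    T_n = 1/2 * mean(T_1 .. T_{n-1}) + 1/2 * T_min, for n >= 2."""
    predictions = []
    for n in range(2, len(observed_times) + 1):
        t_mean = sum(observed_times[: n - 1]) / (n - 1)
        predictions.append(0.5 * t_mean + 0.5 * t_min)
    return predictions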

5 Transfer Learning

In this section we investigate the potential benefit of part-task training in SF for the RL algorithm. Part-task training is a training strategy in which a whole task is learned progressively using one or more part-tasks. Positive transfer in part-task training implies that a task can be learned more efficiently compared to spending the full training time solely on the whole task. Part-task training has shown positive results for humans learning SF [26]. In the remainder of this section we describe the transfer learning experiments that were conducted and present our results.

5.1 Transfer Experiment Design

Transfer experiments were conducted using the three tasks (games) defined for SF in Sect. 3.1. To recapitulate, SF-PT1 represents a part-task with only the Fortress (no mines); SF-PT2 represents a part-task including only friend mines, which can be dealt with immediately; and SF-WT represents the full game with friend and foe mines, which requires the IFF procedure and switching weapon systems in the case of a foe mine. The primary objective of the experiments is to see whether SF-WT can be learned more efficiently by the RL algorithm using part-task training, and how the use of different part-tasks leads to potential positive (or negative) transfer. The experiments are visualized in Fig. 4. The top bar shows the baseline experiment, in which 100% of the total training time is spent on the whole task. The bottom two bars show the transfer experiments using either part-task one or part-task two, where a certain percentage of the total training time is spent on the part-task before switching to the whole task. For each of the transfer experiments, a range of different part-task/whole-task distributions was tested in order to gain insight into the optimal distribution.


Fig. 4. Transfer experiments

5.2 Implementation and Measurement

Transfer Implementation
The approach we took for implementing transfer learning in SF is quite straightforward because of the tasks' similarities. First of all, for all three tasks, the algorithm's input features and action space can be kept the same. The mere consequence is that for subtasks such as SF-PT1 (in which no mines appear), certain input features do not change or offer any relevant information (e.g. mine-related features), and certain actions become irrelevant and should be suppressed (e.g. the IFF-key). Further, the reward function can be kept the same since the primary goal of the game remains unchanged (destroying the Fortress). Finally, the actual knowledge that is transferred between different tasks is the current policy function, which is represented by the algorithm's network weights in the hidden and output layers.
Because of the above properties, running a part-task training experiment is similar to running a baseline, whole-task experiment. The RL algorithm can be kept running throughout a training session without requiring any change when switching tasks. The only difference is that the game environment is switched to a new task during training, at pre-configured moments. During this task transfer, the algorithm is confronted with new game elements and any associated reward signals to learn new sub-policies (e.g. destroying mines). The algorithm is expected to incorporate these into the policy learned on the previous task (e.g. circumnavigate and shoot the Fortress).
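The training schedule itself reduces to two consecutive training calls on the same agent, as in the hedged sketch below; agent.train and make_env are hypothetical interfaces, and only the schedule (part-task first, then whole task, with network weights carried over) is taken from the paper.

def part_task_schedule(agent, make_env, total_steps, part_task, part_fraction):
    """Train on a part-task for a fraction of the budget, then continue on the whole
    task with the same network weights (the transferred policy)."""
    part_steps = int(total_steps * part_fraction)       # e.g. a fraction of the 60M-step budget
    agent.train(make_env(part_task), steps=part_steps)  # "SF-PT1" or "SF-PT2"
    agent.train(make_env("SF-WT"), steps=total_steps - part_steps)
    return agent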


Transfer Measurement
There are different ways in which transfer can be measured between a learner that uses transfer (part-task training) and one that does not (baseline training). Several common metrics are explained in [27]: a jumpstart measures the increase of initial performance at the start of the target task after training on the source task; asymptotic performance measures the final performance at the end of training on the target task; total reward measures the accumulated performance during training; the transfer ratio measures the ratio of the total reward accumulated by the two learners; and time to threshold measures the training time needed to achieve some predetermined performance level.
For our analysis we found the transfer ratio to be the most appropriate metric. It requires measures of the total reward obtained during training on the target task. For this metric we define the total reward (accumulated performance) as the total number of Fortress destructions (FDs) during training on the target task. The transfer ratio is then calculated as follows:

Transfer Ratio = (Σ_{i=s}^{n} FD_i of transfer learner − Σ_{i=s}^{n} FD_i of baseline learner) / (Σ_{i=s}^{n} FD_i of baseline learner)    (2)

In the equation, Σ_{i=s}^{n} FD_i represents the total number of Fortress destructions from the time step at which the transfer learner starts on the target task (s) until the final time step of the full training time (n). For all experiments, the full training time is fixed to 60 million time steps. Thus, if a transfer learner is configured to spend 50% of the time on a part-task and 50% on the whole task, then Fortress destructions are measured for both the transfer learner and the baseline learner only over the last 30 million time steps.
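Computed on the Fortress-destruction counts, Eq. (2) is a one-liner; the sketch below assumes both counts are accumulated over the same window from step s to step n.

def transfer_ratio(fd_transfer, fd_baseline):
    """Eq. (2): relative gain (or loss) in Fortress destructions accumulated by the
    transfer learner compared to the baseline learner over the same training window."""
    return (fd_transfer - fd_baseline) / fd_baseline

# Example with made-up counts: transfer_ratio(1110, 1000) -> 0.11, i.e. 11% positive transfer.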

5.3 Transfer Analysis

Results of the transfer experiments are shown in Fig. 5, illustrating the transfer ratio (y-axis) for the two transfer games for different distributions (x-axis). Each bar is based on eight data points (i.e. training experiments). The error bar represents a confidence interval of .95 (ci = 95). Positive bars imply positive transfer compared to the baseline. The baseline is determined from an average of eight baseline (whole-task) experiments. According to the results, positive transfer can be observed when using SF-PT1 for 37.5–75% of the total training time (up to approximately 11%). For SF-PT2, spending little training time on this part-task (

1 SD from the baseline mean, and for HRV it is indicated by negative deviations > 1 SD from the baseline mean.
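The baseline-deviation rule quoted above can be written as a simple per-parameter check. The sketch below is an assumption about how such a rule might be coded (RASMUS' actual implementation is not given here); direction is '+' for parameters such as pupil diameter and respiration rate and '-' for HRV.

def critical_deviation(value, baseline_mean, baseline_sd, direction="+", k=1.0):
    """Threshold rule of the form 'deviation of more than k SD from the individual
    baseline, in the direction expected for this parameter'."""
    z = (value - baseline_mean) / baseline_sd
    return z > k if direction == "+" else z < -k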

2.2 Perceived Mental Workload

Perceived mental workload was used as a comparative measure for validating and optimizing the mental workload diagnosis of RASMUS. There are various questionnaires to assess subjective mental workload. The National Aeronautics and Space Administration Task Load Index (NASA-TLX [21]) is one of the superior scales with respect to sensitivity and user acceptance [22]. In both the prior and the repeated experimental study, the workload rating was performed using the NASA-TLX subscale of mental effort. Ratings were obtained each time RASMUS detected a performance decrement (the scenario was stopped at that time to ensure the rating did not affect the user's task completion).


The rating was performed on a 15-point scale proposed by Heller [23] that is divided into five subsections: very low (1–3), low (4–6), medium (7–9), high (10–12), very high (13–15).

2.3 Task

The generic diagnostic tool RASMUS has been integrated into a naval anti-air warfare (AAW) simulation [24]. In this simulation, operators completed four different simplified subtasks: identifying contacts, creating new contacts, warning contacts, and engaging contacts. Figure 1 shows the tactical display area (TDA) of the simulation. The blue dot in the center of the map represents the own ship. Identified radar contacts are visualized in green (neutral), blue (friendly), or red (hostile). New, unidentified contacts (yellow) have to be identified as neutral, friendly or hostile according to certain criteria. If hostile contacts enter the blue or the red circle around the own ship (see Fig. 1), they have to be warned or engaged, respectively.

Fig. 1. Screenshot of the tactical display area of the naval air-surveillance simulation (Color figure online)

The tasks occur at scripted times during the scenario. If tasks have to be performed simultaneously, users were told to process them in order of priority. Each task has to be finished within a specified time limit (cf. [10]). If the time limit is exceeded or the task is not completed correctly, RASMUS logs a performance decrement.

3 Definition of New Rules

ROC graphs or curves aim to quantify the accuracy of a binary diagnostic test or classifier and are created by plotting the diagnostic sensitivity and specificity values. The measure that is commonly used in this context is the area under the curve (AUC) of the ROC curve (cf. [11, 25]). Performing a ROC analysis requires information about the true state.


However, user states such as mental workload are latent constructs that cannot be measured directly. For this reason we used the NASA-TLX subjective mental effort rating (cf. Sect. 2.2) as an approximation of the true user state within the ROC curve analysis. Subjective rating outcomes were dichotomized in order to discriminate between critical and noncritical mental workload states. The cut-off value was set based on the subsections of the questionnaire (cf. Sect. 2.2): any rating above 9 on the 15-point scale was considered to be a high (critical) workload state, whereas a rating of 9 or smaller was not.
At first, ROC curves were calculated using the threshold value initially set for each parameter for discriminating between critical and noncritical outcomes with respect to mental workload. We then systematically varied the threshold values in order to determine the value that maximizes the AUC. The ROC curve analysis resulted in modified rules for pupil diameter (>.5 SD instead of >1 SD positive deviation from baseline) as well as HRV (>2 SD instead of >1 SD negative deviation from baseline). For respiration rate the analysis did not reveal any improvement from changing the rule; therefore, the existing rule (>1 SD positive deviation from baseline) was not modified. Table 1 summarizes the resulting AUC, sensitivity and specificity for the initial as well as the modified rules for each parameter. AUC values range between .6 and .7 for the modified rules, which can be considered a sufficient outcome [25].

Table 1. Comparison of initial and modified rules for the physiological parameters after performing individual ROC curve analyses

Parameter        | Initial rules                                  | Modified rules
Pupil diameter   | >1 SD positive deviation from baseline         | >.5 SD positive deviation from baseline
                 | AUC .577, sensitivity .533, specificity .622   | AUC .631, sensitivity .733, specificity .514
HRV              | >1 SD negative deviation from baseline         | >2 SD negative deviation from baseline
                 | AUC .631, sensitivity .762, specificity .5     | AUC .687*, sensitivity .714, specificity .659
Respiration rate | >1 SD positive deviation from baseline         | >1 SD positive deviation from baseline
                 | AUC .674*, sensitivity .552, specificity .795  | AUC .674*, sensitivity .552, specificity .795
Note: *Significantly different from .5 (probability of chance).
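The threshold search described above can be reproduced with a small sweep over candidate SD multipliers, comparing each resulting binary diagnosis against the dichotomized subjective rating. This is a hedged sketch using scikit-learn; the candidate grid and variable names are assumptions, and deviations are assumed to be pre-oriented so that larger values mean "more critical" (i.e. sign-flipped for HRV).

import numpy as np
from sklearn.metrics import roc_auc_score

def best_sd_threshold(deviations_sd, high_workload, candidates=(0.5, 1.0, 1.5, 2.0)):
    """For each candidate rule '> t SD deviation from baseline', compute the AUC of the
    binary diagnosis against the dichotomized rating (> 9 on the 15-point scale) and
    return the threshold with the largest AUC together with all AUC values."""
    deviations_sd = np.asarray(deviations_sd, dtype=float)
    high_workload = np.asarray(high_workload, dtype=int)
    aucs = {t: roc_auc_score(high_workload, (deviations_sd > t).astype(int)) for t in candidates}
    best = max(aucs, key=aucs.get)
    return best, aucs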

As a next step, we analyzed to what extent this modification of the rules for the physiological parameters affects the accuracy of the overall workload assessment in RASMUS.


Figure 2 shows the mean deviation (MD) from the baseline of the subjective rating for critical and noncritical system diagnoses with respect to the modified as well as the initial rule set. For the initial rule set, subjects rated their perceived workload significantly higher when the system diagnosis was critical than when it was noncritical (t(74) = 3.301; p < .01). The same outcome was observed for the modified rule set (t(74) = 3.882; p < .001). However, the results suggest a slightly better distinction between critical and noncritical system diagnoses for the modified rule set based on the subjective rating. The overall ROC curve for the diagnosis of high mental workload (see Fig. 3) also indicates a slightly higher AUC for the modified set of rules (AUC_modified = .780; p < .001) than for the initial set of rules (AUC_initial = .730; p < .01). Exceeding a value of .7, both diagnostic rule sets can be considered good diagnostic tests [25].

Fig. 2. Mean perceived workload ratings (with SE as error bars) for critical and noncritical system diagnoses by RASMUS for the initial set of rules (a) and the modified set of rules (b), applied to the data set of the prior validation study [10]

Fig. 3. Initial and modified rule set applied to the data set of the prior validation experiment [10] on which the optimization was based.


4 Repetition of Validation Study

A repetition of the initial validation study was conducted to investigate whether the outcomes obtained from the ROC curve analysis can be replicated, and thus would be temporally stable.

4.1 Methodological Design

Fifteen subjects (8 male, 7 female) aged between 20 and 51 years (M = 31.26 ± 8.27) participated in the experiment. A multisensory chest strap (Zephyr BioHarness3) was used to collect data on HRV and respiration rate. Pupil diameter was recorded with an eye tracker (Tobii X3-120) placed underneath the monitor. The setup is depicted in Fig. 4. After reading the instructions, participants completed a ten-minute training scenario, during which the examiner explained how to complete each subtask (cf. Sect. 2.3). Subsequently, participants performed the tasks in an experimental test scenario with a net duration of 45 min. The scenario was divided into three successive phases, merging into each other without breaks (see Fig. 5). The scenario paused whenever a performance decrement was detected. Users then rated their current perceived mental workload. Thus, the actual duration of the experiment depended on the user's performance. Perceived mental workload was also recorded at the end of the training phase as well as at the end of the experiment, to obtain an individual baseline of the subjective rating.

Fig. 4. Experimental setup. Multisensory chest strap (front left), eye tracking device attached to the monitor underneath the screen


Fig. 5. Sequence of the different phases and their durations (cf. [10])

4.2 Hypotheses

Two hypotheses were tested in this experiment (see below). The first hypothesis addresses the question whether the outcomes of the first validation experiment can be replicated. With the second hypothesis we aim to assess whether the modified rule set shows a higher diagnostic accuracy than the initial rule set.
H1: Perceived mental workload is rated higher for performance decrements with critical system diagnoses than for noncritical system diagnoses (a) using the initial rule set, (b) using the modified rule set.
H2: In comparison to the initial rule set, the diagnostic accuracy is increased by the modified rule set.

4.3 Data Analysis

The psychophysiological and behavioral data were logged to text and CSV files for each participant. Data preparation included allocating the subjective ratings to the corresponding diagnostic outcomes of RASMUS. Hypothesis 1 was tested by comparing the mean deviation of the subjective rating from baseline for high and non-high workload outcomes of RASMUS, using the initial rule set to test H1a and the modified rule set to test H1b. Concerning Hypothesis 2, the diagnostic accuracy was assessed for the modified and initial rule sets by performing ROC curve analyses with the dichotomized subjective rating as the "true" user state. The data analysis was conducted with SPSS (version 25.0).
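The same tests can be run outside SPSS; the following is a hedged Python equivalent of the analysis described above (a Mann-Whitney U test for H1 and an AUC of the rule-set diagnosis against the dichotomized rating for H2). Variable names are assumptions, and the snippet is not the authors' analysis script.

import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

def run_analysis(ratings_critical, ratings_noncritical, rule_diagnoses, true_high_workload):
    """H1: one-sided Mann-Whitney U test on subjective ratings for critical vs.
    noncritical diagnoses. H2: AUC of the (binary) rule-set diagnosis against the
    dichotomized 'true' high-workload labels."""
    u_stat, p_value = mannwhitneyu(ratings_critical, ratings_noncritical, alternative="greater")
    auc = roc_auc_score(np.asarray(true_high_workload, dtype=int),
                        np.asarray(rule_diagnoses, dtype=int))
    return {"U": u_stat, "p": p_value, "AUC": auc}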

5 Results

5.1 Descriptive Analysis

A total of 79 performance decrements occurred across all subjects. As expected, most of the performance decrements occurred in the high workload phase (see Table 2). During the monotony phase only 12 performance decrements could be observed. Two performance decrements were recorded during the second half of the baseline phase. The number of performance decrements for each of the phases is very similar or the same as in the first validation experiment [10] (see numbers in brackets in Table 2).


However, only a little more than 25% of the subjects showed performance decrements during the monotony phase in the second experiment, whereas almost 60% of the subjects were affected in the preceding experiment.

Table 2. Number of performance drops and subjects affected per phase. Numbers in brackets refer to the first validation experiment [10].

Phase         | Number of performance decrements | Subjects affected
Baseline      | 2 (0)                            | 1 (0)
High workload | 65 (64)                          | 15 (12)
Monotony      | 12 (12)                          | 4 (7)
Sum           | 79 (76)                          | 15 (12)

5.2 Hypothesis Testing

A non-parametric Mann-Whitney U-test was conducted to test H1 due to the violation of the assumption of normality for parts of the data set. Figure 6a and b show the subjective ratings for critical and noncritical system diagnoses for the initial and the modified set of rules, respectively. The analysis confirmed that perceived workload was rated significantly higher by the subjects for critical states of workload compared to noncritical states of workload diagnosed by RASMUS using the initial set of rules (z = −2.64; p < .01). However, subjective ratings differed less between critical and noncritical states of workload when using the modified set of rules (see Fig. 6b). The statistical analysis revealed the difference to be nonsignificant (z = −1.3; ns). Therefore, H1 can be confirmed for the initial rule set (H1a) but not for the modified one (H1b).

Fig. 6. Mean perceived workload ratings (SE) for critical and noncritical system diagnoses by RASMUS for the initial (a) and the modified set of rules (b) applied to the new data set


With respect to H2, we evaluated whether the modified set of rules led to an improved accuracy of the workload diagnosis. In the first experiment, the overall ROC curve indicated a higher discrimination between critical and noncritical states for the modified rule set compared to the initial rule set (see Fig. 3). Figure 7 shows the resulting ROC curves when applying both rule sets to the data set of the second experiment. The analysis showed that the modified rule set was less accurate than the initial rule set when applied to the new data set. The diagnostic accuracy of the initial rule set significantly differs from .5 at an AUC of .645 (p < .05; sensitivity = .643; specificity = .647), whereas for the modified rule set it does not (AUC = .588; p = .198; sensitivity = .607; specificity = .569). Consequently, the hypothesis that the diagnostic accuracy is higher with the modified rule set (H2) cannot be accepted.

Fig. 7. ROC curves for initial and modified set of rules applied to the new data set

6 Discussion

The results of the second validation study confirmed the temporal stability of the diagnostic outcomes when using the initial rule set. Surprisingly, the initial rules also showed a better overall diagnostic performance than the modified rules that were determined by performing the ROC curve analysis. Hence, the outcomes of the second study indicate that the initial rule set is likely to achieve a more consistent distinction between critical and noncritical subjective workload states than the modified rule set. The results imply that the modified diagnostic rules are specifically tailored to the data set the optimization is based on, and are thus not applicable to a different data set. It should be noted that, as part of a post hoc analysis not detailed in this paper, we also performed ROC curve analyses for each individual physiological parameter. These results also indicate that the initial rules for the physiological parameters provide a better diagnostic accuracy than the individual modified rules when applied to the new data set.


However, a surprising result was found for heart rate variability. The ROC curve analysis revealed that the AUC was even below .5 for both the modified and the initial rules. This means that the diagnosis for this data set is less accurate than guessing (e.g. [25]) and suggests that HRV behaved in the opposite way to what literature findings indicate. This could have various reasons, e.g. sensor-related measurement errors or inadequate sensor placement. However, further post hoc analyses revealed that, in line with expectations, HRV negatively correlates with the subjective (non-dichotomized) workload rating, even though the correlation is rather weak (r = −.27). Hence, the unexpected AUC outcome may result from the dichotomization of the subjective rating (critical states: ratings > 9) that was necessary for performing the ROC curve analyses.
This contradictory finding on HRV illustrates a general challenge of validating and optimizing rules for user state classification. In contrast to, e.g., medical diagnoses, it is hard to obtain an appropriate reference measure that reliably differentiates between true and false critical user states. In our analysis we used the subjective rating as an estimate of true workload. However, the subjective workload rating is also error prone, e.g. affected by response bias, and it has to be artificially dichotomized for performing ROC curve analyses. This means that the cut-off value chosen for discriminating between true and false high workload states also impacts the analysis outcomes. Nevertheless, the diagnostic accuracy of the initial rule set was confirmed by the second validation study, indicating that RASMUS can reliably differentiate between potentially high and non-high workload states. The fact that HRV was not found to be a reliable indicator of workload in the second study emphasizes the necessity of combining several indicators in order to provide a more robust diagnostic result.

7 Conclusion and Lessons Learned

ROC curve analysis is a common method to evaluate diagnostic tests in medical research. In this paper we investigated whether this approach can be used to evaluate and optimize diagnostic rules for physiological user state assessment in adaptive systems. Considering the results of our study, we suggest that ROC curve analysis may be useful for evaluating and comparing the diagnostic accuracy of different workload indicators. However, the results of this study could not show that this method is appropriate for defining and optimizing rules for single user state indicators. As these outcomes also depend on the validity of the subjective rating and on the dichotomization used to obtain a "true" high workload state, future studies could examine whether cut-off values for the subjective rating other than > 9 are more appropriate for classifying a high workload state. Other methods could also be investigated for optimizing and validating the user state indicators. Another option for optimizing diagnostic outcomes is to apply methods of artificial intelligence, such as artificial neural networks, as proposed e.g. by Wilson and Russell [26]. However, those systems are often considered "black boxes", as the algorithm which provides the diagnostic outcomes is often too complex to be understood [27]. The rule-based approach has the advantage of providing more transparency.


Considering RASMUS' application within an adaptation framework, the results indicate that the current workload assessments of RASMUS are sufficiently accurate and reliable to support a proper selection and configuration of adaptation strategies for this task domain. Nevertheless, we identified further options for improving the RASMUS diagnostics: two of the five parameters currently used for the workload assessment (heart rate variability and respiration rate) are retrieved from the same sensor (BioHarness). Hence, whenever there is a problem with this sensor, both indicators become unreliable, and thus the robustness of the diagnostic outcomes decreases. Adding one or more independent parameters to the diagnosis could avoid this problem. Hernandez et al. [28], for example, investigated the possibility of using a pressure-sensitive keyboard and a capacitive mouse in the context of stress detection. They found increased typing pressure in more than 79% of the participants, as well as an increase of surface contact with the mouse in 75% of the participants, during stressful tasks [28]. Another possible measure for workload and stress detection could be the inclination of the trunk (e.g. [29]), which can actually be retrieved from the BioHarness sensor; there is also the possibility of using a separate pressure-sensing mat placed on the seat of the operator.
One last note: the diagnostic framework of RASMUS has been applied to a naval air surveillance task. Hence, the indicators currently used for user state assessment in RASMUS were specifically selected for this task. In the context of AIS, these assessments might also prove useful for determining mental states of the learner in order to provide adequate feedback and support. However, this has to be investigated in more detail in future experimental studies.

References
1. Park, O., Lee, J.: Adaptive instructional systems. In: Jonassen, D.H. (ed.) Handbook of Research on Educational Communications and Technology. Simon & Schuster, New York (1996)
2. Ghergulescu, I., Muntean, C.H.: Learner motivation assessment with game platform. In: Proceedings of AACE E-Learn: World Conference on E-Learning in Corporate, Government, Healthcare and Higher Education, pp. 1212–1221. AACE, Honolulu, Hawaii (2011)
3. D'Mello, S., Olney, A., Williams, C., Hays, P.: Gaze tutor: a gaze-reactive intelligent tutoring system. J. Hum. Comput. Stud. 70(5), 377–398 (2012)
4. Derbali, L., Frasson, C.: Prediction of players motivational states using electrophysiological measures during serious game play. In: Conference on Advanced Learning Technologies, pp. 498–502. IEEE, Sousse (2010)
5. Hilburn, B., Jorna, P.G., Byrne, E.A., Parasuraman, R.: The effect of adaptive air traffic control (ATC) decision aiding on controller mental workload. In: Mouloua, M., Koonce, J. (eds.) Human Automation Interaction: Research and Practice, pp. 84–91. Erlbaum, Mahwah (1997)
6. Parasuraman, R.: Adaptive automation matched to human mental workload. In: Hockey, G.R.J., Gaillard, A.W.K., Burov, O. (eds.) Operator Functional State Assessment: The Assessment and Prediction of Human Performance Degradation in Complex Tasks, pp. 177–193. IOS Press, Amsterdam (2003)


7. Schwarz, J., Fuchs, S., Flemisch, F.: Towards a more holistic view on user state assessment in adaptive human-computer interaction. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1247–1253. IEEE, San Diego (2014)
8. Schwarz, J., Fuchs, S.: Multidimensional real-time assessment of user state and performance to trigger dynamic system adaptation. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) AC 2017. LNCS (LNAI), vol. 10284, pp. 383–398. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58628-1_30
9. Fuchs, S., Schwarz, J.: Towards a dynamic selection and configuration of adaptation strategies in augmented cognition. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) AC 2017. LNCS (LNAI), vol. 10285, pp. 101–115. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58625-0_7
10. Schwarz, J., Fuchs, S.: Validating a "Real-Time Assessment of Multidimensional User State" (RASMUS) for adaptive human-computer interaction. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 700–705. IEEE, Miyazaki, Japan (2018)
11. Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27(8), 861–874 (2006)
12. Marquart, G., Cabrall, C., de Winter, J.: Review of eye-related measures of drivers' mental workload. Procedia Manuf. 3, 2854–2861 (2015)
13. May, J.G., Kennedy, R.S., Williams, M.C., Dunlap, W.P., Brannan, J.R.: Eye movement indices of mental workload. Acta Psychol. 75(1), 75–89 (1990)
14. de Rivecourt, M., Kuperus, M.N., Post, W.J., Mulder, L.J.M.: Cardiovascular and eye activity measures as indices for momentary changes in mental effort during simulated flight. Ergonomics 51(9), 1295–1319 (2008)
15. Coyne, J., Sibley, C.: Investigating the use of two low cost eye tracking systems for detecting pupillary response to changes in mental workload. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 60(1), 37–41 (2016)
16. Backs, R.W., Seljos, K.A.: Metabolic and cardiorespiratory measures of mental effort: the effects of level of difficulty in a working memory task. Int. J. Psychophysiol. 16(1), 57–68 (1994)
17. Kim, H.-G., Cheon, E.-J., Bai, D.-S., Lee, Y.H., Koo, B.-H.: Stress and heart rate variability: a meta-analysis and review of the literature. Psychiatry Investig. 15(3), 235–245 (2018)
18. Vargas-Luna, M., Huerta-Franco, M.R., Montes, J.B.: Evaluation of the cardiac response to psychological stress by short-term ECG recordings: heart rate variability and detrended fluctuation analysis. In: Long, M. (ed.) World Congress on Medical Physics and Biomedical Engineering, vol. 39, pp. 333–335. Springer, Berlin, Heidelberg (2013). https://doi.org/10.1007/978-3-642-29305-4_89
19. de Waard, R.: The Measurement of Driver's Mental Workload. Traffic Research Centre VSC, University of Groningen, Haren, the Netherlands (1996)
20. Grassmann, M., Vlemincx, E., von Leupoldt, A., Mittelstädt, J.M., van den Bergh, O.: Respiratory changes in response to cognitive load: a systematic review. Neural Plast. 2016, 16 p. (2016)
21. Hart, S.G., Staveland, L.E.: Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. Adv. Psychol. 52, 139–183 (1988)
22. Hill, S.G., Iavecchia, H.P., Byers, J.C., Bittner, A.C., Zaklade, A.L., Christ, R.E.: Comparison of four subjective workload rating scales. Hum. Factors: J. Hum. Factors Ergon. Soc. 34(4), 429–439 (1992)
23. Heller, O.: Theorie und Praxis des Verfahrens der Kategorienunterteilung (KU). Würzburger Psychologisches Institut, Würzburg (1982)

404

A. Bruder and J. Schwarz

24. Kaster, A., Tappert, E., Ruckert, C., Becker, R.: Design of ergonomic user interfaces for asymmetric warfare (Gestaltung ergonomischer Benutzungsschnittstellen für Asymmetric Warfare). Final Report. Fraunhofer-Institute for Communication, Information Processing and Ergonomics – FKIE, Wachtberg (2010) 25. Šimundić, A.M.: Measures of diagnostic accuracy: basic definitions. EJIFCC 19(4), 203–211 (2009) 26. Wilson, G.F., Russell, C.A.: Operator functional state classification using multiple psychophysiological features in an air traffic control task. Hum. Factors 45(3), 381–389 (2003) 27. Matthias, A.: The responsibility gap: ascribing responsibility for the actions of learning automata. Ethics Inf. Technol. 6(3), 175–183 (2004) 28. Hernandez, J., Paredes, P., Roseway, A., Czerwinski, M.: Under pressure: sensing stress of computer users. In: Proceedings of the 32nd annual ACM Conference on Human Factors in Computing Systems 2014, CHI 2014, pp. 51–60. ACM Press, New York (2014) 29. Balaban, C.D., Cohn, J., Redfern, M.S., Prinkey, J., Stripling, R., Hoffer, M.: Postural control as a probe for cognitive state: exploiting human information processing to enhance performance. Int. J. Hum.-Comput. Interact. 17(2), 275–286 (2004)

Model for Analysis of Personality Traits in Support of Team Recommendation

Guilherme Oliveira1, Rafael dos Santos Braz1, Daniela de Freitas Guilhermino Trindade1(&), Jislaine de Fátima Guilhermino2, José Reinaldo Merlin1, Ederson Marcos Sgarbi1, Carlos Eduardo Ribeiro1, and Thiago Fernandes de Oliveira2

1 Centro de Ciências Tecnológicas, Universidade Estadual do Norte do Paraná, Jacarezinho, Paraná, Brazil
[email protected], [email protected], {danielaf,merlin,sgarbi,biluka}@uenp.edu.br
2 Ministério da Saúde, Fiocruz Mato Grosso do Sul, Campo Grande, Mato Grosso do Sul, Brazil
{jislaine.guilhermino,thiago.oliveira}@fiocruz.br, [email protected]

Abstract. Among the applications of Affective Computing, some studies focus on the identification of personality traits. Personality is a factor that can influence the development of a person or a team. In this context, analyzing the specificities of project teams, we observed the need to support their formation based on personality traits. However, the recommendation systems established in the literature are based on principles of similarity. This research therefore proposes a personality trait analysis model to support the development of project team recommendation systems based on principles of complementarity. Through a literature review, it was possible to associate the characteristics of project teams with personality traits. From this association, a model was proposed for the evaluation of personality traits, which was applied to a group of people from different areas of activity who can be characterized as potential members of project teams. After verifying the applicability of this model, guidelines were proposed for a project team recommendation system that considers the complementarity of profiles.

Keywords: Affective Computing · Personality traits · Recommendation systems · Project teams



1 Introduction

Affective Computing is a constantly expanding area that investigates how computers can recognize, model and respond to human emotions, and how they can express emotions through a computer interface (Picard 1997). Among the applications of Affective Computing, some studies focus on the identification of personality traits.


Personality is a factor that can influence the development of a person and how a team interacts. Taxonomies of personality have aided the understanding of the role of personality in a wide variety of domains (O'Neill and Steel 2017). Tracing personality helps in the psychological differentiation of individuals, but it is a very complex activity. For Nunes and Cazella (2011), personality is not merely the superficial, physical appearance of an individual; although it is relatively stable and predictable, it is not necessarily rigid and immutable.

Bejanaro (2005) states that the better team members are able to integrate with one another, the more the overall performance improves. A team composed only of leaders can hinder its own progress, given the insubordination that all members may display (Belbin 2010b). Boehm (1981) also states that a team must complement itself in every way, in terms of skills, profiles and goals.

Several works apply the identification of personality traits to recommendation systems in different contexts: customer loyalty systems (Nunes and Cazella 2011), the recommendation of work teams (Nunes 2012), the identification of personality traits from social media content (Gao et al. 2013), and the investigation of relationships between personality traits and motivational preferences for gamification (Yuan et al. 2016). Some researchers have also focused on the application of personality traits in people recommendation systems. What is observed in the proposed recommendation systems is that they are based on principles of similarity, homogeneity and attraction. In this perspective, Nass and Lee (2000) consider that people usually prefer to interact with others who have a personality similar to their own.

Thus, analyzing the specificities of project teams, we observed the need for a support system for team formation based on the complementarity of personality traits. This exploratory research therefore aims to combine and adapt models of personality trait analysis in order to support the development of recommendation systems for project teams based on the principle of complementarity. To achieve this objective, the methodological steps were: (i) identification of the ideal characteristics for forming project teams; (ii) combination, adaptation and proposition of a personality trait test model focused on the relevant team characteristics; (iii) application of this model to a group of people; and (iv) proposition of guidelines for team recommendation systems based on the complementarity of profiles.

2 Personality Traits

Affective Computing, according to Costa et al. (2015), comprises two areas of research: "one that studies the synthesis of emotions in machines, when one wants to insert human emotions in machines; and the other investigates recognizing human emotions or expressing emotions through machines in human-computer interaction." The focus of this research is on the second perspective, which seeks to recognize human emotions from a person's personality. Nunes (2012) considers that psychological aspects such as emotion and personality are important and influence the decision-making process, and that emotion is strongly influenced by personality.


There are different approaches to defining personality. One of them is the personality trait approach, which allows measurable, conceptually defined traits to be used to differentiate people psychologically (Nunes and Cazella 2011). Several models have been proposed to describe and identify the structure of personality. One of the most widespread models within psychometric personality trait theory is the Big Five Personality Factor Model, known as the Big Five, developed empirically by leading researchers such as Lewis Goldberg, Robert R. McCrae and Paul T. Costa, Jerry Wiggins and Oliver John (John and Srivastava 1999). Barroso et al. (2017) point out that the Big Five is one of the most accepted and used models to identify the psychological characteristics related to personality. Hutz et al. (1998) define the Big Five model as a modern version of Trait Theory due to its conceptual and empirical advancement in the field of personality.

The characteristics of each of the five factors of the Big Five model are described by Berger (2015) as:
• Extroversion: quantity and intensity of interactions, higher level of sociability, loquacity and assertiveness;
• Agreeableness: the capacity to be useful, cooperative, generous and relaxed towards others;
• Conscientiousness: responsibility, organization and discipline;
• Neuroticism: the degree of emotional stability, impulse control and anxiety;
• Openness: high intellectual curiosity, creativity and openness to new experiences.

The trait approach is considered the best way to represent personality in computers, and questionnaires, also known as personality inventories, are commonly used by psychologists (Nunes 2012). Some of the most commonly used Big Five personality trait inventories are:
• 240-item NEO-PI-R (Revised NEO (Neuroticism-Extraversion-Openness) Personality Inventory) (McCrae and John 1992);
• 300-item NEO-IPIP (Neuroticism-Extroversion-Openness International Personality Item Pool) (Johnson 2000);
• 100-item FFPI (Five Factor Personality Inventory) (Henrinks et al. 2002);
• 132-item BFQ (Big Five Questionnaire) (Barbaranelli and Caprara 2002);
• 120-item SIFFM (Structured Interview for the Five Factor Model) (Trull and Widiger 2002);
• 136-item NPQ and 60-item FF-NPQ (Nonverbal Personality Questionnaire and Five Factor Nonverbal Personality Questionnaire) (Paunonen and Ashton 2002);
• 504-item GPI (Global Personality Inventory) (Schimit et al. 2002).

Nunes (2008) hypothesized that the number of items influences the accuracy of the measured traits: the higher the number of items, the greater the accuracy of the traits extracted. However, no studies have been found to confirm this hypothesis. De Raad and Perugini (2002) affirm that the GPI is the largest inventory for measuring personality traits; its disadvantage, however, is the large number of items that compose it.


The NEO-PI-R is an inventory that contains 240 items categorized into 30 facets, 6 for each dimension of the Big Five, thus providing a fine-grained description of a person's personality traits (Nunes 2008). The author further states that the NEO-PI-R is used commercially and its items are protected by copyright, so it cannot be used freely by scientists. Johnson (2005) developed the NEO-IPIP inventory, containing 300 items with 6 facets for each dimension of the Big Five, totaling 30 facets; it is thus similar to the NEO-PI-R, but free of charge.

3 Project Teams

In teamwork, each person contributes his or her own abilities to reach a common goal. Luecke (2010) defines a team as a small number of people who complement each other with different skills in order to perform a task. Bejanaro (2005) states that the performance of a team depends on how well its members are able to integrate with one another, since each person brings a different personality and experience that will affect the whole team; thus, team formation must take into account the way these personalities and experiences articulate.

Thamhain (1988) states that efficient teams have some characteristics associated with the skills of the members and their interaction with the team: ability to solve conflicts; good communication; good team spirit; mutual trust; self-development of team members; effective organizational interface; and a high need for achievement and growth. In addition to the characteristics of team members, Thamhain (1988) also points out some characteristics that are directly related to project performance, constraining tasks and results: commitment to technical success; on-schedule, on-budget performance; commitment to producing high-quality results; innovation and creativity; flexibility and willingness to change; and the ability to predict trends.

According to Belbin (2010b), the structure of a team must take the personality of the individuals into account, since a team made only of leaders does not progress; conflict among the members is considered almost certain in that case. The author states that an ideal team should be assembled so that the strengths and weaknesses of each member complement each other. Belbin (2010a) states that forming good teams requires meticulous selection, as well as an adequate number of candidates offering specific abilities and characteristics, in order to seek a combination of these skills and personalities. Not attending to these criteria is a singular reason why teams fail.

Belbin (2010a) also describes nine "Team Roles", or preferential roles, developed from the individual behavioral patterns of members of successful teams:
• Plant: innovative and creative people who come up with new ideas and approaches. They tend to be highly creative and good at solving problems in unconventional ways. They thrive on praise, but criticism is especially hard for them to deal with. They may also be poor communicators and may tend to ignore given parameters and constraints;


• Resource Investigator: good at exploring opportunities and bringing resources to the team, improving the development of ideas;
• Monitor Evaluator: people with a high critical aptitude who are good at examining all aspects of a situation. They provide a logical eye, making impartial judgements where required, and assess the team's options in a dispassionate way;
• Co-ordinator: people who have an ability to get others to work towards shared goals. They are needed to focus on the team's objectives, draw out team members and delegate work appropriately;
• Shaper: highly motivated and energetic people with a need for achievement. They provide the necessary drive to ensure that team members keep moving and do not lose focus or momentum;
• Implementer: practical people with great self-control and discipline, who perform tasks systematically. They are needed to plan a workable strategy and carry it out as efficiently as possible;
• Teamworker: people who are sociable and interested in others, flexible, perceptive and diplomatic. They do not like conflict and do everything to avoid it. They help the team by using their versatility to identify needs and address them on behalf of the team;
• Completer Finisher: typically introverted people, as they prefer to work by themselves, but their standards are high and they have a great interest in accuracy and reliability. They are most effective at the end of tasks, polishing and scrutinizing the product and subjecting it to the highest standards of quality control;
• Specialist: dedicated people who are proud of their technical skill and knowledge and like to reach a high professional standard. They bring the team in-depth knowledge of a key area.

The Team Roles theory argues that balancing roles in a team improves the possibilities of cooperative work by creating synergy and balance between the strengths and weaknesses inherent in each individual role (Santos and Santos 2017).

4 Recommendation Systems

Recommendation systems are defined by Ricci (2011) as software tools and techniques that provide suggestions of items that may be useful to the user, such as products, music or news. Beyond these applications, recommendation systems can also be used to recommend people, supporting decision making in several areas of knowledge. Cazella et al. (2010) explain that one of the greatest challenges of recommendation systems is the right match between user expectations and the products, services and/or people to be recommended.

According to Cazella and Reategui (2005), recommendation systems can be classified into three types of approach: (i) content-based filtering, in which recommendations depend on the user's former choices and information from items seen in the past is used to recommend new items; (ii) collaborative filtering, which uses information from people who have common interests; and (iii) hybrid filtering, which combines more than one filtering approach so that the failures presented by each method are minimized.


For Al-Shamri and Al-Ashwal (2013), the most important recommendation system is the "collaborative recommender system which recommends people with similar tastes and preferences in the past to a given active user." A recommendation system called Group Recommender was proposed by Nunes (2012); it allows "recommending work teams considering the characteristics of the tutor and the similarity of Personality Traits of their students" in e-learning courses. The author explains that it is necessary to divide the students into subgroups, at which point the proposed system helps in decision making by recommending students with a personality similar to that of their tutor.

The majority of existing approaches to recommender systems focus on the similarity of characteristics, which would not be the most appropriate criterion for project teams. In this research, therefore, similarity measures will not be used, since the objective is to support the formation of teams with profiles that complement each other, seeking to improve team performance.
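To make the contrast between similarity-based and complementarity-based recommendation concrete, the sketch below compares the two scoring ideas on hypothetical trait vectors. It is an illustration only: the paper does not prescribe these formulas, and all names and numbers are our own assumptions.

```python
# Illustrative sketch only (not from the paper): a similarity score, as used
# in typical collaborative filtering, versus a simple complementarity score
# (how much a candidate adds to characteristics the current team lacks).
from math import sqrt

def similarity(candidate, team_mean):
    """Cosine similarity between a candidate's trait vector and the team mean."""
    dot = sum(candidate[k] * team_mean[k] for k in candidate)
    norm_c = sqrt(sum(v * v for v in candidate.values()))
    norm_t = sqrt(sum(v * v for v in team_mean.values()))
    return dot / (norm_c * norm_t)

def complementarity(candidate, team_mean):
    """How much the candidate raises the characteristics the team is weakest in."""
    return sum(max(candidate[k] - team_mean[k], 0.0) for k in candidate)

team_mean = {"Organization": 0.80, "Creativity": 0.35, "Sociability": 0.40}
candidate = {"Organization": 0.45, "Creativity": 0.90, "Sociability": 0.70}

print(similarity(candidate, team_mean))       # high for "more of the same" profiles
print(complementarity(candidate, team_mean))  # high for profiles that fill gaps
```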

5 Model for Analysis of Personality Traits in Support of the Recommendation of Project Teams

From the literature review and the research presented by Thamhain (1988), Boehm (1981), Belbin (2010a) and Bejanaro (2005), it was possible to relate the characteristics of project teams to the facets proposed in the Big Five model (Table 1). This research aimed to identify the emotional characteristics relevant to team formation, focused on the complementarity of profiles. Among the characteristics listed in the literature review there are some technical characteristics that are not the focus of this study at this moment, but which are also important for the next step of recommending project teams. Some of these technical characteristics are: self-development, commitment to budget and quality (Thamhain 1988); commitment to the technical part and technical skill (Belbin 2010a).

As can be seen in Table 1, there are several features that the authors recommend for an ideal project team. For each of these features, a dimension and one or more facets of the Big Five model were inferred.

Table 1. Association of team features to the Big Five model

| Author | Features of project teams | Big Five | Facet |
|---|---|---|---|
| (Thamhain 1988); (Boehm 1981) and (Belbin 2010a, 2010b) | Good communication | Extroversion | Sociable |
| | Innovation and creativity | Openness | Imagination |
| (Thamhain 1988) and (Belbin 2010a, 2010b) | Ability to resolve conflicts | Socialization | Altruism/Cooperation |
| | Mutual trust | Achievement | Reliable |
| | Capability of achievement | Achievement | Practical |
| (Boehm 1981) and (Belbin 2010a, 2010b) | Extroversion | Extroversion | Sociable/Enthusiastic/Energetic |
| | Intelligence | Openness | Intelligent |
| | Orderliness | Achievement | Orderliness |
| | Critical posture | Socialization | Critical |
| (Boehm 1981) | Optimistic posture | Socialization | Optimistic |
| (Belbin 2010a, 2010b) | Detail posture | Achievement | Meticulous |
| | Introversion | Extroversion | !Sociable/!Enthusiastic/!Energetic |
| | Enthusiasm | Extroversion | Enthusiastic |
| | Sympathy | Socialization | Nice |
| | Stable | Neuroticism | !Unstable |
| | Assertiveness | Extroversion | Assertiveness |
| | Efficient | Achievement | Efficient |

Some facets are negated using a "!" before their name so that the desired result is reached. For example, from the negation of the facet !Unstable it is possible to obtain the characteristic Stable. By associating the characteristics of project teams with the facets of the Big Five model, it was possible to elaborate the questionnaire for the personality trait test. The questionnaire is presented in Table 2.

Table 2. Questions for the personality test

| Facets | Questions |
|---|---|
| Altruism/cooperation | I make people feel welcome; I like to help others; !I am indifferent to the feelings of others; !I turn my back on others |
| Cooperation | I am easy to please; I cannot stand confrontations; !I yell at people; !I take revenge on others |
| Sociable | I talk to many different people at parties; I like to be part of a group; !I prefer to be alone; !I avoid crowds |
| Enthusiastic | I radiate joy; I express childlike joy; I look at the positive side of life; !I rarely play |
| Energetic | I do a lot in my free time; I can manage many things at the same time; !I like a calm lifestyle; !I react slowly |
| Confident | I believe that others have good intentions; I trust what people say; !I believe that most people are essentially evil; !I distrust people |
| Trustworthy | !I dive into things without thinking; I do things according to a plan; I am careful with others; I do not see the consequences of things |
| Efficient | I complete tasks successfully; I excel in what I do; I handle tasks smoothly; !I have little to contribute |
| Thorough | I avoid mistakes; I choose my words carefully; !I act without thinking; !I often make last minute plans |
| Organized | I love organization and regularity; !I often forget to put things in their proper place; !I leave my room messy; !I am not bothered by messy people |
| Imagination | I love to daydream; I get carried away by my fantasies; I spend time reflecting on things; !I have a hard time imagining things |
| Intelligent | I like to solve complex problems; I have a rich vocabulary; !I have difficulty understanding abstract ideas; !I am not interested in theoretical discussions |
| Practical | I go straight to the goal; I transform plans into actions; I demand quality; !I only do the work necessary to survive |
| Sympathy | I appreciate cooperation above competition; I suffer the pains of others; !I am not interested in others; !I cannot stand weak people |
| Unstable | I usually eat too much; I do things that I regret later; !I easily resist temptations; !I have never spent more than I can afford |
| Assertive | I like to lead others; I try to influence others; !I hope others will tell me the way; !I do not like to draw attention to myself |
| Optimistic | In uncertain times, I usually expect the best; !If something can go wrong for me, it will; !I almost never expect things to go my way; !I rarely count on good things happening to me |
| Critical | !I do not like rules and regulations; I take time to make decisions; !I am insecure; !I can easily adapt to new cultures |

After analyzing which facets would be relevant to this work, 64 questions were selected, 4 for each facet. However, there were no questions assessing two important facets: Critical and Optimistic. To fill this gap, four questions were also listed for each of these facets, developed from the research of Snyder and Lopez (2002), Naranjo (2001) and Belbin (2010a). Thus, the questionnaire had 72 questions in total, based on Nunes (2008), Snyder and Lopez (2002), Naranjo (2001) and Belbin (2010a). As already mentioned in Sect. 2, the NEO-IPIP questionnaire (Johnson 2005) is composed of 300 questions, separated by dimensions and facets. In this way, it was possible to select only the questions necessary to evaluate the facets reported in Table 1.

5.1 Application of the Proposed Model

The personality trait test model was applied to 30 participants with different profiles: students, teachers and professionals from different areas. The questionnaire was made available through the JotForm online form service; respondents had access to a link that led to the questionnaire, and at the end the results were sent via the Web.


For each question, the following alternatives, based on the work of Pimentel (2008), were presented: Totally disagree; Disagree; Neither agree nor disagree; Agree; Totally agree. The alternatives had weights from 0 to 4, in this sequence, reflecting the degree of agreement with the facet: 0 – Totally disagree; 1 – Disagree; 2 – Neither agree nor disagree; 3 – Agree; 4 – Totally agree. For negated facets, such as Stable (obtained from the negation of Unstable), the weights were inverted.

After the questionnaire was answered, the percentage of each facet was calculated for each participant. Since each facet is covered by 4 questions, the highest possible score per facet is 16 points (greatest affinity with the facet) and the lowest is zero (no affinity with the facet). To obtain the results, the points obtained in each facet were summed and then converted into a percentage. Table 3 presents the facet results for 3 of the 25 participants. Participant 1, for example, obtained a sum of 10 points in the Altruism facet, equivalent to 62.5% affinity with that facet.

Table 3. Partial result of the analysis of personality traits

| Facets | Participant 1 | Participant 2 | Participant 3 |
|---|---|---|---|
| Altruism | 62.50% | 43.75% | 56.25% |
| Cooperation | 87.50% | 37.50% | 43.75% |
| Sociable | 43.75% | 56.25% | 62.50% |
| Enthusiastic | 81.25% | 75.00% | 37.50% |
| Energetic | 50.00% | 43.75% | 62.50% |
| Confident | 58.33% | 58.33% | 58.33% |
| Efficient | 50.00% | 56.25% | 62.50% |
| Detailed | 68.75% | 62.50% | 68.75% |
| Organized | 62.50% | 87.50% | 56.25% |
| Creative | 68.75% | 50.00% | 37.50% |
| Intelligent | 81.25% | 37.50% | 37.50% |
| Practical | 56.25% | 68.75% | 75.00% |
| Nice | 68.75% | 50.00% | 43.75% |
| Stable | 62.50% | 68.75% | 56.25% |
| Assertive | 50.00% | 62.50% | 56.25% |
| Optimistic | 87.50% | 68.75% | 62.50% |
| Critical | 56.25% | 37.50% | 43.75% |
| Trustworthy | 66.67% | 41.67% | 41.67% |
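As an illustration of the scoring rule just described, the minimal sketch below computes a facet percentage from four Likert answers. The Likert labels and weights follow the text; the function name, the example answers and the reversal flags are hypothetical.

```python
# Minimal sketch of the facet scoring described above (not the authors' code).
# Each facet has 4 items; answers are weighted 0-4; items marked with "!"
# (negated) are reverse-scored; the facet score is the sum expressed as a
# percentage of the 16-point maximum.

LIKERT = {"Totally disagree": 0, "Disagree": 1,
          "Neither agree nor disagree": 2, "Agree": 3, "Totally agree": 4}

def facet_percentage(answers, reversed_flags):
    """answers: list of 4 Likert labels; reversed_flags: True for '!' items."""
    total = 0
    for label, is_reversed in zip(answers, reversed_flags):
        weight = LIKERT[label]
        total += (4 - weight) if is_reversed else weight
    return 100.0 * total / 16.0   # 16 = 4 items x maximum weight of 4

# Hypothetical example: a participant answering the four "Sociable" items.
answers = ["Agree", "Totally agree", "Disagree", "Neither agree nor disagree"]
reversed_flags = [False, False, True, True]   # the last two items start with "!"
print(facet_percentage(answers, reversed_flags))  # 75.0
```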

Each participant therefore has a percentage of affinity with each facet; in this study, the higher the percentage, the greater the affinity. With the results obtained for each participant, combinations of the facets are used to infer the characteristics relevant to work in project teams, as can be seen in Table 4 for 3 participants. The relationships between facets and characteristics were presented in Table 1.


Table 4 also presents the results as percentages for all inferred characteristics; again, the higher the percentage, the greater the participant's affinity with the characteristic. For characteristics that required more than one facet, the mean of the facet percentages was taken; those related to only one facet kept that facet's percentage.

Table 4. Result of the features of 3 participants

| Features | Participant 1 | Participant 2 | Participant 3 |
|---|---|---|---|
| Ability to resolve conflicts | 75.00% | 40.63% | 50.00% |
| Good communication | 43.75% | 56.25% | 62.50% |
| Mutual trust | 62.50% | 50.00% | 50.00% |
| Innovation and creativity | 68.75% | 50.00% | 37.50% |
| Intelligence | 81.25% | 37.50% | 37.50% |
| Organization | 62.50% | 87.50% | 56.25% |
| Critical posture | 56.25% | 37.50% | 43.75% |
| Optimism | 87.50% | 68.75% | 62.50% |
| Detail | 68.75% | 62.50% | 68.75% |
| Introversion | 41.67% | 41.67% | 45.83% |
| Enthusiasm | 81.25% | 75.00% | 37.50% |
| Practical | 56.25% | 68.75% | 75.00% |
| Sympathy | 68.75% | 50.00% | 43.75% |
| Stability | 62.50% | 68.75% | 56.25% |
| Assertiveness | 50.00% | 62.50% | 56.25% |
| Efficiency | 50.00% | 56.25% | 62.50% |
| Extroversion | 58.33% | 58.33% | 54.17% |

Analyzing these results, it is possible to observe that each individual, according to his or her most outstanding characteristics, can complement the profile of a team with his or her own personality. In this sense, Nunes (2012) states that, according to structural theories, it is the personality traits that lead individuals to seek, interpret and then react to life events in their own way. It is thus understood that personality can greatly influence the way a team develops, as argued by the theorist Murray (1938, apud Nunes 2012): "personality would function as an organizing agent, whose functions would be to integrate conflicts and limitations to which the individual is exposed, to satisfy their needs and to make plans for the achievement of future goals." From the characteristics presented, it is possible to infer the profile of each participant using Belbin's "Team Roles". On this basis, the next subsection proposes guidelines for a recommendation system that, starting from the Big Five facets, derives the characteristics and profiles described by Belbin (2010a).

5.2 Guidelines for Project Team Recommendation System

In contrast to the many existing recommendation systems that recommend people, products or services based on similarity, guidelines are proposed here for a recommendation system that forms project teams using, in addition to the technical profile of the members, their personality traits, in order to compose a team with different and complementary personalities. From the personality traits it is possible to infer the characteristics and, finally, to compose the profiles for a project team.

Belbin (2010a) presents nine important profiles in a project team, only one of which is focused on technical training; the other eight are focused on the personality of the individual. Keen (2003) explains that, among the eight personality-based profiles proposed by Belbin (2010a), four are essential for a quality team: Coordinator, Plant, Monitor Evaluator and Implementer. Thus, one proposed guideline for the recommendation system is that the team should have these four profiles in its composition. Each profile has some specific characteristics that are important in a project team. As an example, an Implementer profile, according to the works of Belbin (2010a), Bejanaro (2005) and Boehm (1981), should present the following characteristics: Practical, Organized, Efficient and Stable. There are two characteristics that all profiles need to have: (i) Mutual Trust, since all team members must have a certain level of trust between them, and (ii) Intelligence, since everyone must have a certain level of intelligence to help the team. Some profiles, such as the Plant and the Coordinator, stand out because they have a high level of these two characteristics. The Optimistic characteristic proposed by Boehm (1981) is not part of the profiles proposed by Belbin; it was added to the Completer Finisher profile, since this characteristic can balance a team with respect to its view of the problems faced by the group. It is worth emphasizing that the balance of these profiles is important: an excess of pessimism may discourage the team, and an excess of optimism may imply a lack of critical vision.

Another guideline is therefore to infer the profile of each team member. For this, the member's characteristics are listed and the mean of the characteristics that compose each profile is computed. With this calculation it is possible to associate a profile with each person; an example is demonstrated in Table 5. The recommendation system could then show the manager the profile with which each member has the greatest affinity, contributing to the composition of the team.

Although the recommendation considers the personality traits of its members, technical training is also important. A recommendation system for project teams should therefore:
• Relate the technical information of the members to the professional training and proficiency profiles required for the project;
• Match competency profiles with personality traits that can be complementary in an effective team.

Table 5. Affinity for each profile

| Participant X | Affinity |
|---|---|
| Sower | 65.97% |
| Implementer | 57.81% |
| Monitor | 54.16% |
| Complementary | 51.04% |
| Formatter | 47.39% |
| Resource researcher | 47.19% |
| Team worker | 45.83% |
| Coordinator | 40.27% |
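To illustrate the profile-affinity calculation described above (the mean of the characteristics that compose a profile), a minimal sketch follows. Only the Implementer mapping comes from the text; the function name and the rest of any real mapping are our own assumptions.

```python
# Sketch of the profile-affinity rule described above (not the authors' code):
# a member's affinity with a Belbin profile is the mean of the member's
# percentages for the characteristics that compose that profile.
# Only the Implementer mapping below is taken from the text; a full mapping
# would have to be completed from Table 1 and Belbin's role descriptions.

PROFILE_CHARACTERISTICS = {
    "Implementer": ["Practical", "Organization", "Efficiency", "Stability"],
}

def profile_affinity(member, profile):
    """member: dict of characteristic -> percentage (as in Table 4)."""
    characteristics = PROFILE_CHARACTERISTICS[profile]
    return sum(member[c] for c in characteristics) / len(characteristics)

# Example using Participant 1's values from Table 4.
participant_1 = {"Practical": 56.25, "Organization": 62.50,
                 "Efficiency": 50.00, "Stability": 62.50}
print(round(profile_affinity(participant_1, "Implementer"), 2))  # 57.81
```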

It is suggested that the recommendation flow should end with the recommendation of a member who fits the needs of the team. It begins with "Analyze the Technical Training", the step that verifies whether the member possesses the technical skills necessary for the development of the project. After that, the personality traits of the member are compared with those the project team seeks and, finally, the member is recommended. Members must already be registered in the system, with their technical training and the percentage of each related characteristic from the result of the personality trait test. This flow refers only to part of the system, which should also consider other aspects, such as training and the hiring of third parties.

The recommendation system may also have several functionalities focused on technical characteristics, such as those proposed by Mengato (2015) in a system to support the allocation of human resources in software development projects. Mengato (2015) proposes an "Allocate Employees" functionality that presents the compatibility, availability and experience of the members registered in the system who can compose a project team. In that system, the allocation is performed manually: the manager checks which members have the highest compatibility with the required function and also verifies their availability and length of experience. The system proposed by Mengato (2015) could consider, besides the technical characteristics, the personality characteristics of each member.

Finally, two configurations are suggested for the project team recommendation system:
• 1st Configuration: the project manager enters into the system the technical training, the personality characteristics and the number of members desired for the project team. In this way, the manager has the freedom to establish the composition considered appropriate for the team according to the characteristics of the project. In addition, the recommendation system may offer the option of assigning weights to the profiles or characteristics, making the customization of the team more detailed and facilitating the formation of an ideal project team. After the manager enters the data, the recommendation system can automatically combine the information, establishing links between the technical skills and the personality of each employee, and then recommend the employee to the team.


• 2nd Configuration: the manager defines only the technical profile and the system recommends the ideal personality characteristics, based on the profiles considered essential by Belbin (2010a). By default, the recommendation system takes into account, first, the essential profiles already mentioned (Coordinator or Formatter, Plant and Monitor) and, depending on the number of members, the other profiles complement the team (a simplified sketch of this configuration follows below).
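The sketch below is a hedged illustration of how the 2nd configuration could be realized. The essential-profile list follows the text above, but the candidate data, the function names and the greedy selection strategy are our own assumptions, not the authors' algorithm.

```python
# Hedged sketch of the 2nd configuration (not the authors' algorithm):
# given candidates that already passed the technical-profile filter, fill the
# essential Belbin profiles first, assigning each still-missing profile to the
# remaining candidate with the highest affinity for it (Table 5-style scores).

ESSENTIAL_PROFILES = ["Coordinator", "Plant", "Monitor", "Implementer"]

def recommend_team(candidates, team_size):
    """candidates: dict name -> {profile: affinity %}. Greedy, for illustration."""
    team, remaining = [], dict(candidates)
    profiles_to_fill = ESSENTIAL_PROFILES[:team_size]
    for profile in profiles_to_fill:
        if not remaining:
            break
        best = max(remaining, key=lambda name: remaining[name].get(profile, 0.0))
        team.append((best, profile))
        del remaining[best]
    return team

# Hypothetical candidates with profile affinities.
candidates = {
    "Ana":   {"Coordinator": 62.0, "Plant": 48.0, "Monitor": 51.0, "Implementer": 45.0},
    "Bruno": {"Coordinator": 40.0, "Plant": 66.0, "Monitor": 47.0, "Implementer": 58.0},
    "Carla": {"Coordinator": 44.0, "Plant": 39.0, "Monitor": 64.0, "Implementer": 61.0},
}
print(recommend_team(candidates, team_size=3))
# [('Ana', 'Coordinator'), ('Bruno', 'Plant'), ('Carla', 'Monitor')]
```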

6 Final Considerations

In this article, we propose a support model for the formation of project teams. The motivation for the work is the fact that the personality of the members influences the success of the team, whereas the literature in the area bases recommendations on principles of similarity. Considering that individuals with similar personality traits may in certain contexts compromise the success of the team (e.g., too many managers), the proposed recommendation approach is based on the complementarity (differences) of personality traits.

With the personality test models and the characteristics listed for the formation of project teams, it was possible to propose a test model that evaluates personality traits in a way that is meaningful for the characteristics of a project team. To identify the personality of the individuals, tests were applied. The starting point was the NEO-IPIP, which uses 300 questions to identify personality traits; the questions were refined, based on work in the literature, resulting in 72 questions. A questionnaire with these questions was applied to 30 people, allowing us to conclude that it is possible to identify personality differences through it. Once the profiles are identified, the team can be constituted in a way that complements its skills.

It is hoped that this research can contribute to project management, since the composition of the team is decisive for obtaining efficiency and effectiveness in the development of a project. The theoretical basis, together with the analyses and applications carried out, made it possible to perceive that personality traits also influence the behavior of an individual in the professional sphere. Composing a team with different profiles that complement each other, in the sense of balancing emotions and characteristics (organization, creativity, cooperation, optimism, …), can contribute greatly to improving team performance. From the proposed guidelines, it will be possible, in future works, to deepen the techniques and algorithms for the implementation of the system for recommending project teams.


References

Al-Shamri, M.Y.H., Al-Ashwal, N.H.: Fuzzy-weighted Pearson correlation coefficient for collaborative recommender systems. In: ICEIS, no. 1, pp. 409–414 (2013)
Barbaranelli, C., Caprara, G.V.: Studies of the big five questionnaire. In: De Raad, B., Perugini, M. (eds.) Big Five Assessment, 1st edn., chapter 5, pp. 109–128. Hogrefe Huber, Germany (2002)
Barroso, A.S., da Silva, J.S.M., Souza, T.D., Bryanne, S.D.A., Soares, M.S., do Nascimento, R.P.: Relationship between personality traits and software quality – Big Five model vs. object-oriented software metrics. In: Proceedings of the 19th International Conference on Enterprise Information Systems, ICEIS, vol. 3, pp. 63–74 (2017). ISBN 978-989-758-249-3
Bejanaro, V.C.: Como formar equipes com o equilíbrio ideal de personalidades e perfis pessoais: a teoria e as ferramentas de Meredith Belbin. XXXIII Congresso Brasileiro de Ensino de Engenharia (2005)
Belbin, R.M.: Team Roles at Work, 2nd edn. Butterworth Heinemann, Oxford (2010a)
Belbin, R.M.: Management Teams – Why They Succeed or Fail, 3rd edn. Butterworth Heinemann, Oxford (2010b)
Berger, K.S.: The Developing Person Through the Life Span, 9th edn. Worth Publishers (2015)
Boehm, B.W.: Software Engineering Economics. Prentice-Hall, Englewood Cliffs (1981)
Cazella, S.C., Nunes, M.A., Reategui, E.: A Ciência da Opinião: Estado da arte em Sistemas de Recomendação. In: André Ponce de Leon F. de Carvalho; Tomasz Kowaltowski (eds.) Jornada de Atualização de Informática – JAI (2010)
Cazella, S.C., Reategui, E.B.: Sistemas de recomendação. In: XXV Congresso da Sociedade Brasileira de Computação (2005)
Costa, S.W.S., Souza, A.L., Pires, Y.: Computação Afetiva: Uma ferramenta para avaliar aspectos afetivos em aplicações computacionais. Anais do Encontro Anual de Tecnologia da Informação e Semana Acadêmica de Tecnologia da Informação, pp. 286–290 (2015)
De Raad, B.E., Perugini, M. (eds.): Big Five Assessment. Hogrefe & Huber Publishers, Göttingen (2002)
Gao, R., Hao, B., Bai, S., Li, L., Li, A., Zhu, T.: Improving user profile with personality traits predicted from social media content. In: Proceedings of the 7th ACM Conference on Recommender Systems, pp. 355–358, Hong Kong, China. ACM (2013)
Henrinks, A.A.J., Hoofstee, W.K.B., Raad, B.: The five-factor personality inventory: assessing the Big Five by means of brief and concrete statements. In: De Raad, B., Perugini, M. (eds.) Big Five Assessment, 1st edn., chapter 4, pp. 79–108. Hogrefe Huber, Germany (2002)
Hutz, C.S., Nunes, C.H., Silveira, A.D., Serra, J., Anton, M., Wieczorek, L.S.: O Desenvolvimento de Marcadores Para a Avaliação da Personalidade no Modelo dos Cinco Grandes Fatores. Psicologia: Reflexão e Crítica (1998)
John, O.P., Srivastava, S.: The Big Five trait taxonomy: history, measurement, and theoretical perspectives. In: Handbook of Personality: Theory and Research, pp. 102–138, New York (1999)
Johnson, A.J.: Ascertaining the validity of individual protocols from web based personality inventories. J. Res. Pers. 39(1), 103–129 (2005)
Keen, T.: Creating Effective & Successful Teams. Purdue University Press, West Lafayette (2003)
Luecke, R.: Criando equipes. Tradução de Ryta Magalhães Vinagre. Record, Rio de Janeiro (2010)
McCrae, R.R., John, O.P.: An introduction to the five-factor model and its applications. J. Pers. 60(2), 175–216 (1992)


Mengato, J.R.C.: Ferramenta de apoio à alocação de equipes em projetos de desenvolvimento de software. Trabalho de Conclusão de Curso, Universidade Estadual do Norte do Paraná – UENP (2015)
Naranjo, C.: Os Nove Tipos de Personalidade: Um estudo do caráter humano através do Eneagrama. Objetiva, Rio de Janeiro (2001)
Nass, C., Lee, K.M.: Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2000, The Hague, The Netherlands, 01–06 April 2000, pp. 329–333. ACM, New York (2000)
Nunes, M.A.: Recommender System Based on Personality Traits. Tese de Doutorado, Université Montpellier 2 – LIRMM, France (2008)
Nunes, M.A., Cazella, S.C.: O que sua personalidade revela? Fidelizando clientes web através de Sistemas de Recomendação e traços de personalidade. In: Tópicos em banco de dados, multimídia e Web, 1st edn., vol. 1, pp. 91–122. Sociedade Brasileira de Computação, Florianópolis (2011)
Nunes, M.A.: Computação Afetiva personalizando interfaces, interações e recomendações de produtos, serviços e pessoas em ambientes computacionais. In: Nunes, M., Oliveira, A.A., Ordonez, E.D.M. (eds.) Pesquisas e Projetos, DCOMP e PROCC (2012)
O'Neill, T., Steel, P.: Weighted composites of personality facets: an examination of unit, rational, and mechanical weights. J. Res. Pers. 73, 1–11 (2017)
Paunonen, S.V., Ashton, M.C.: The nonverbal assessment of personality: the NPQ and the FF-NPQ (2002)
Picard, R.W.: Affective Computing. MIT Press, Cambridge (1997)
Ricci, F., Rokach, L., Shapira, B.: Introduction to recommender systems handbook. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P. (eds.) Recommender Systems Handbook. Springer, Boston (2011). https://doi.org/10.1007/978-0-387-85820-3_1
Santos, J.V.F., Santos, J.G.: A importância do levantamento de perfis para a formação de equipes a partir da teoria de Meredith Belbin. In: XII Workshop de Pós-Graduação e Pesquisa do Centro Paula Souza (2017). ISSN 2175-1897
Schimit, M.J., Kihm, J.A., Robie, C.: The global personality inventory (GPI). In: De Raad, B., Perugini, M. (eds.) Big Five Assessment, chapter 9, pp. 195–236. Hogrefe Huber, Germany (2002)
Snyder, C.R., Lopez, S.J.: Handbook of Positive Psychology (2002)
Thamhain, H.J.: Team building in project management. In: Project Management Handbook, 2nd edn. Van Nostrand Reinhold, New York (1988)
Trull, T.J., Widiger, T.A.: The structured interview for the Five Factor Model of personality. In: De Raad, B., Perugini, M. (eds.) Big Five Assessment, chapter 7, pp. 148–170. Hogrefe Huber, Germany (2002)
Yuan, J., Bin, X., Yamini, K., Stephen, V.: Personality-targeted gamification: a survey study on personality traits and motivational affordances. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, Santa Clara, California, USA (2016)

The Influence of Gait on Cognitive Functions: Promising Factor for Adapting Systems to the Worker's Need in a Picking Context

Magali Kreutzfeldt(&), Johanna Renker, and Gerhard Rinkenauer

Leibniz Research Centre for Working Environment and Human Factors, Dortmund, Germany
[email protected]

Abstract. We investigated the influence of gait and smart devices on selective attention in an order picking setting employing a task switching paradigm and an Eriksen flanker task. A cue indicated the relevant customer order (one of two tasks) and thereby the correct stimulus-response rule either via smart glasses or headset. The task transition (repetition vs. switch) was varied randomly in mixed blocks and kept constant in single blocks. Participants (n = 24) were asked to classify the central letter of a five-letter string with a manual response (left vs. right). Importantly, the central letter was either congruent or incongruent with the surrounding letters. Participants were either walking at their personal comfort speed or standing on a treadmill. We registered response times and error rates as dependent variables. The combination of a particular smart device and walking condition determined the effect on attention and thus the order picker's mental state: In mixed task response times, switch costs were higher for headset use than smart glasses, while incongruent flankers were especially harmful while walking wearing smart glasses. In single task errors, congruency effects were more pronounced for headset use than smart glasses but only while standing, not while walking. Results show context-specific effects, suggesting that gait speed and performance requirements can be used as cognitive load indicators in technical systems to adapt instructions.

Keywords: Cognitive performance · Gait · Adaptive systems

1 Introduction

Proceeding digitization of workplaces reveals new possibilities to support workers in their daily tasks and to optimize their cognitive workload. For example, smart devices in order picking present task instructions on the display of handheld scanners, tablets and smart glasses (pick-by-vision), or as pick-by-voice via headsets. Task instructions – formerly presented on a sheet of paper – may contain article location and quantity or route information. To ensure healthy workplace conditions, these instructions should be adapted to the worker's mental state. If cognitive load is high due to task difficulty, the device may adapt perceptual dimensions of the content, or the content itself, on the display of smart glasses or in audio devices. This individual adaptation of instructions should support the worker context-specifically and could prevent additional work strain.


In the current study, we focus on the order picking context and its state-of-the-art smart devices, as order pickers are still a very relevant human resource in intralogistics, with high demands regarding regular changes in physical and mental load as well as time pressure [1–4].

Findings from previous studies on dual-task walking suggest that gait analysis could provide useful insights into the order picker's current mental state. Dual-task walking refers to performing locomotion and another task concurrently [5] – for example, walking while observing the smart device display or the warehouse surroundings. A frequent finding here is a performance decrement in the other task while walking compared to a neutral condition, termed dual-task cost. However, dual-task costs may also refer to a performance decrement in the walking task associated with the dual-task situation [5] and are correlated with the risk of falling (for a concurrent hypothesis of walking as cognitive facilitation/arousal cf. [6]). The degree of dual-task costs is in turn determined by the nature of the secondary task [5, 7, 8]. Bock [7] compared the performance of young and older adults in several cognitive tasks while walking and found especially severe performance deterioration in a visual control task. He therefore concluded that the involvement of the visual system is a crucial factor for stable walking. Two visual processing streams need to be attended to, so that allocating mental resources to a visually demanding task hampers walking performance, just as walking while performing visually demanding tasks reduces performance in the latter. Along this line, Barra and colleagues [8] found that spatial tasks, and not verbal tasks, increase the risk of falling. Both studies are relevant for determining the mental state of an order picker and for the use of smart glasses to read a picking list – a visual observation task – while order picking.

Tomporowski and Audiffren [9] conducted a study in which young and older participants performed auditory task switching while standing, walking at preferred speed and walking faster than preferred speed (see below for an explanation of task switching and switch costs). All participants showed switch costs, but these were not modulated by the movement condition in young adults. Older adults showed a speed-accuracy trade-off: slower responses while standing compared with the walking conditions, but also fewer errors while standing than while walking. In this study employing an auditory task, the typical dual-task costs while walking could not be observed. This study may provide further evidence for differential mental states while walking depending on the nature of the task and might prove useful in the assessment of the order picker's mental state when picking with instructions via headset.

In a previous study, we investigated the influence of smart devices on cognitive performance in a simulated order picking task while standing. Results showed generally faster responses with headsets than with smart glasses, but also worse performance with respect to selective attention and cognitive flexibility [10]. Here, the mental state of the order picker clearly depended on the smart device presenting instructions (cues) either visually or auditorily. Nonetheless, additional research is required to understand the relationship and potential interaction of gait and cognitive functions while using smart devices.
If, for example, cognitive load shows a deteriorating effect on gait and therefore task performance, the order picker compensates for it by reducing gait speed. However, if the work context requires an increased gait speed resulting in a potentially detrimental effect on cognitive functions, the cognitive difficulty of the task should be reduced.


A failure to do so would particularly affect the cognitive resources of older order pickers. Ample empirical evidence suggests that gait and posture control require additional mental resources in older people [5, 11]. If these findings are not taken into account when adjusting information load, performance and safety are at risk.

In our study, we set out to investigate the influence of gait and smart devices on cognitive functions, more precisely on selective attention. Selective attention (i.e., focusing on relevant information while ignoring irrelevant information) is crucial while reading picking information as well as while searching for the relevant item in the storage racks [1, 10]. We investigated cognitive performance in an order picking setting employing a task switching paradigm and an Eriksen flanker task [12] while standing or walking at preferred speed. Generally, the task switching paradigm as well as the Eriksen flanker task enable the assessment of selective attention and cognitive flexibility, as participants are asked to change their focus of attention regularly in order to attend to varying relevant information in quick succession. Congruency effects (i.e., the performance decrement for incongruent relative to congruent stimuli) and switch costs (i.e., the performance decrement for task switches relative to task repetitions) are used as cognitive performance indicators for these concepts. Congruency effects arise from conflicting information in incongruent trials, in which both possible responses are triggered by components of the stimuli, and their size represents the size of the cognitive conflict [13]. Switch costs result from the additional effort in switch trials to update new task requirements as well as from interference control from competing tasks, and can be seen as empirical markers of cognitive flexibility [14].

In the beginning of each trial, a cue indicated the relevant customer order (one of two tasks) and thereby the correct stimulus-response rule either via smart glasses or headset (pick-by-vision vs. pick-by-voice). Participants were asked to classify the central letter of a five-letter string with a manual response. Importantly, the central letter was either congruent or incongruent with the surrounding letters, introducing potential cognitive conflict and the need to focus attention. Also, the customer order could change repeatedly (task transition: repetition vs. switch), requiring cognitive flexibility. Participants were either walking at their personal comfort speed or standing on a treadmill. Based on our preceding study [10], we expected generally increased switch costs and congruency effects for headset use compared with smart glasses while standing. In addition, we expected a stronger decrease of performance with smart glasses compared with headset use while walking, as dual-task costs due to the involvement of the visual system [7].
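For reference, the two indicators can be written compactly (our notation, restating the verbal definitions above; the same differences can be computed on error rates instead of response times):

```latex
\[
\text{switch cost} = \overline{RT}_{\text{switch}} - \overline{RT}_{\text{repetition}},
\qquad
\text{congruency effect} = \overline{RT}_{\text{incongruent}} - \overline{RT}_{\text{congruent}}
\]
```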

2 Methods

2.1 Participants

Twenty-four participants (21 female) with a mean age of 24 years (SD = 4) were tested in this experiment. They reported normal or corrected-to-normal vision and normal hearing abilities. Participants gave informed consent and received course credit or 20 € for participation.

2.2 Apparatus and Stimuli

Participants were either standing or walking on a treadmill (Woodway PPS 70 Ortho) equipped with a safety belt system while responding to a letter discrimination task [12]. The central target letter, embedded between distractors in a letter string, was supposed to be "placed" on the left or right side of a palette by a keypress, depending on the currently relevant "customer order". The "customer order" determining the response rule was indicated by a cue and could change from one trial to the next.

Participants were standing or walking on the treadmill at a distance of ca. 2.85 m in front of a screen (2.53 m × 1.58 m). The fixation cross and letter string stimuli were presented on it by a Panasonic PT-EZ590 WUXGA projector. Cues and response feedback were presented via a Brother AiRScouter WD-200B in the smart glasses part or via Bose SoundTrue Ultra in-ear headphones in the headset part. Manual responses were recorded via Speedlink Phantom Hawk joysticks held in the left and right hand.

Two color words (red or green) served as cues for the "customer order" and the corresponding response rule. The visual cues were presented in white on a black background on the smart glasses display (75 px height, display resolution of 1280 × 720, font: Consolas). The auditory cues, which were artificially created via an online text-to-speech service (fromtexttospeech.com), were presented on the headphones. Stimuli consisted of five-letter strings in white on the black screen made of the letters "S" and "H" (35 px height, resolution of 1920 × 1200, font: Consolas). While in congruent trials all five letters were identical, calling for the same response (e.g., left keypress), the central letter in incongruent trials was surrounded by the other letter type (e.g., SSHSS), associated with the opposite response. Response feedback was displayed on the smart glasses as a green checkmark or a red cross (correct vs. incorrect). In the headset part, response feedback was presented over the headphones: a high-pitched bell sound or a low-pitched "buzzer" after a correct or incorrect response, respectively.

2.3 Procedure

At the beginning of the experiment, the comfort speed (in mm/s) of each participant was assessed by a repeated step-wise acceleration routine. Participants' eyesight and hearing abilities were also tested. Subsequently, participants underwent a calibration routine to standardize the orientation of the smart glasses display relative to the screen across all participants (central alignment of a cross on the smart glasses to a square on the screen).

The experiment consisted of two movement and two smart device parts: standing or walking while using smart glasses or the headset (counterbalanced across participants). During all parts, participants were asked to discriminate the central letter of a letter string stimulus as either "H" or "S" by responding via a keypress of a joystick in either the left or right hand. The "customer order" (representing the task) determined the stimulus-response association (counterbalanced across participants) and was indicated by a cue (e.g., "green": S = right, H = left; "red": S = left, H = right) presented via smart glasses or headset. In single task blocks, only one type of customer order was presented (e.g., only red) as a baseline for behavioral performance, while both types were presented randomly in mixed task blocks (resulting in task repetitions or switches). Both block types appeared in counterbalanced order across participants (two single task blocks of 20 trials and two blocks of 160 trials each per smart device and movement condition), with single tasks either preceding or succeeding the mixed tasks of each condition. Practice trials were included before each new block type (eight trials per single task block and sixteen trials before the mixed task blocks).

Each trial began with the cue presentation for 400 ms, followed by a pause/blank screen on the headphones or smart glasses for 200 ms. Meanwhile, a fixation cross was presented on the screen until the stimulus appeared (cue-stimulus interval of 600 ms). The stimulus was presented for a maximum of 1500 ms or until a response was made. No response within this interval was registered as a response omission and therefore as an error. Response feedback was presented either on the headphones or smart glasses for 300 ms, followed by a pause/blank screen for 200 ms (response-stimulus interval of 1100 ms). The experiment lasted approx. 2 h.
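The trial structure just described can be summarized as a small configuration sketch. The timings and the example cue-response mapping are taken from the text above, while the function names, the data layout and the use of Python are our own assumptions.

```python
# Minimal sketch of the trial structure described above (illustration only,
# not the authors' experiment code).
import random

TIMING_MS = {
    "cue": 400,                  # cue via smart glasses or headset
    "post_cue_blank": 200,       # with the cue: 600 ms cue-stimulus interval
    "stimulus_max": 1500,        # stimulus shown until response or 1500 ms
    "feedback": 300,
    "post_feedback_blank": 200,  # 300 + 200 + next cue 400 + 200 = 1100 ms RSI
}

RESPONSE_RULES = {               # example mapping from the text; counterbalanced
    "green": {"S": "right", "H": "left"},
    "red":   {"S": "left",  "H": "right"},
}

def make_stimulus(target, congruent):
    """Build a five-letter flanker string, e.g. 'HHHHH' or 'SSHSS'."""
    flanker = target if congruent else ("S" if target == "H" else "H")
    return flanker * 2 + target + flanker * 2

def make_trial(cue):
    target = random.choice(["S", "H"])
    congruent = random.choice([True, False])
    return {"cue": cue,
            "stimulus": make_stimulus(target, congruent),
            "correct_response": RESPONSE_RULES[cue][target],
            "congruent": congruent}

print(make_trial("green"))
```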

2.4 Design

The independent within-subject variables were movement (standing vs. walking), smart device (smart glasses vs. headset), task transition (repetition vs. switch), and congruency (congruent vs. incongruent). The levels of movement and smart device were blocked and counterbalanced. The levels of task transition and congruency varied randomly. Response times (RT) and error rates (ER) were registered as dependent variables. All tests of significance were conducted at an alpha level of 0.05.

3 Results

For data analyses, the first trial of each block was excluded as it cannot be classified as a task repetition or switch. Trials with RT below 100 ms were removed (0.1%), as were trials with RT exceeding ±3 SD of the individual mean (1.6%). Also, trials following an error were discarded (15.7%). Incorrect responses and response omissions were registered as errors. For RT analyses, only correct trials were analyzed. We conducted repeated measures analyses of variance (ANOVAs) with the variables above and report only significant results. Mean RT and ER of single tasks are depicted in Fig. 1 and of mixed tasks in Fig. 2.
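The exclusion pipeline described above can be expressed as a short preprocessing sketch. The column names (subject, block, trial, rt, correct) are assumptions for illustration; the original analysis scripts are not reproduced here.

```python
import pandas as pd

def clean_trials(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the exclusion criteria described in the text.
    Assumed columns: subject, block, trial, rt (ms), correct (bool)."""
    df = df.sort_values(["subject", "block", "trial"]).copy()

    # Drop the first trial of each block (cannot be a repetition or switch).
    df = df[df.groupby(["subject", "block"]).cumcount() > 0]

    # Drop anticipations (RT < 100 ms).
    df = df[df["rt"] >= 100]

    # Drop RT outliers beyond +/- 3 SD of the individual mean.
    m = df.groupby("subject")["rt"].transform("mean")
    s = df.groupby("subject")["rt"].transform("std")
    df = df[(df["rt"] - m).abs() <= 3 * s]

    # Drop trials immediately following an error.
    prev_correct = df.groupby(["subject", "block"])["correct"].shift(1)
    df = df[prev_correct.fillna(True).astype(bool)]

    return df
```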



Fig. 1. Response times and error rates in single tasks as a function of movement, device and congruency. Error bars reflect the 95% confidence interval around the mean.

3.1 Single Tasks

The RT analysis indicated significant main effects of device, F(1, 23) = 12.22, p = .002, ηp² = .35, and congruency, F(1, 23) = 96.70, p < .001, ηp² = .81. Responding to instructions via smart glasses was slower compared with instructions via headset (527 ms vs. 494 ms). Congruent trials yielded faster responses than incongruent trials (492 ms vs. 529 ms), indicating a congruency effect.


The ER analysis showed a main effect of congruency, F(1, 23) = 38.63, p < .001, ηp² = .63, with fewer errors in congruent compared with incongruent trials (3.8% vs. 12.2%). In addition, a three-way interaction of movement, device, and congruency was significant, F(1, 23) = 7.71, p = .011, ηp² = .25. Post-hoc analyses showed a significant two-way interaction of device and congruency while standing, F(1, 23) = 13.70, p = .001, ηp² = .37, but not while walking, F < 1.00. While standing, the congruency effect was more pronounced for headset use compared with the use of smart glasses: the headset yielded a significant congruency effect of 13%, t(23) = 5.33, p < .001 (one-tailed), and the smart glasses of only 3.7%, t(23) = 1.75, p = .047 (one-tailed). The congruency effects of headset and smart glasses while walking did not differ significantly: 7.9%, t(23) = 4.82, p < .001 (one-tailed), and 9.3%, t(23) = 3.26, p = .002 (one-tailed).

3.2 Mixed Tasks

The RT analysis showed main effects of task transition, F(1, 23) = 49.59, p < .001, ηp² = .68, and congruency, F(1, 23) = 86.57, p < .001, ηp² = .79. Repetition trials yielded shorter RT than switch trials (572 ms vs. 651 ms), indicating switch costs, and congruent trials yielded shorter RT than incongruent trials (593 ms vs. 630 ms), indicating a congruency effect. Moreover, the two-way interactions of movement and device, F(1, 23) = 4.72, p = .040, ηp² = .17, and of device and task transition, F(1, 23) = 31.89, p < .001, ηp² = .58, were significant. While standing, responses to the devices were comparable (smart glasses: 610 ms, headset: 607 ms), whereas while walking, responses to smart glasses were slower (628 ms) than to the headset (601 ms). Importantly, however, switch costs were larger in the headset condition (104 ms) than in the smart glasses condition (54 ms). The three-way interaction of movement, device, and congruency was also significant, F(1, 23) = 6.20, p = .020, ηp² = .21. The congruency effect in the smart glasses condition was less pronounced than in the headset condition while standing (31 ms vs. 41 ms) and more pronounced while walking (48 ms vs. 33 ms). However, since there was a significant four-way interaction of movement, device, task transition, and congruency, F(1, 23) = 5.71, p = .025, ηp² = .20, incorporating the lower-level interactions, this interaction was explored in detail. We conducted post-hoc tests separately for each task transition condition, the additional factor to the three-way interaction above. For task repetitions, the three-way interaction of movement, device and congruency was significant, F(1, 23) = 7.44, p = .012, ηp² = .25, but not for task switches, F < 1.00, indicating similar congruency effects across movement and device conditions for switches. We therefore explored task repetitions further, separately for each device: the two-way interaction of movement and congruency was significant for smart glasses, F(1, 23) = 9.53, p = .005, ηp² = .29, but not for the headset, F(1, 23) = 2.39, p = .136, ηp² = .09. Using the headset, the congruency effect was 47 ms while standing, t(23) = 6.67, p < .001 (one-tailed), and 31 ms while walking, t(23) = 4.46, p < .001


(one-tailed), although this difference was not significant, as suggested by the non-significant interaction above. However, for the use of smart glasses, the congruency effect amounted to 27 ms while standing, t(23) = 4.15, p < .001 (one-tailed), and increased significantly to 57 ms while walking, t(23) = 5.36, p < .001 (one-tailed). The ER analysis indicated main effects of device, F(1, 23) = 6.61, p = .017, ηp² = .22, task transition, F(1, 23) = 157.04, p < .001, ηp² = .87, and congruency, F(1, 23) = 19.55, p < .001, ηp² = .46. Participants made more errors using the smart glasses compared with the headset (17.4% vs. 14.0%). ER were higher in switch compared with repetition trials (21.7% vs. 9.7%), indicating switch costs, and higher in incongruent compared with congruent trials (17.6% vs. 13.8%), indicating a congruency effect. The two-way interaction of task transition and congruency was also significant, F(1, 23) = 12.80, p = .002, ηp² = .36. The congruency effect was more pronounced in repetition trials compared with switch trials (5.3% vs. 2.1%). Post-hoc t-tests indicated significant congruency effects for both conditions: t(23) = 5.42, p < .001 (one-tailed) for repetitions, t(23) = 2.31, p = .015 (one-tailed) for switches.

3.3 Comfort Speed Analysis

To assess the relationship between individual gait aspects and cognitive performance, the personal comfort speed (in mm/s) of each participant was extracted. The comfort speed was then compared to measures of cognitive performance in the walking conditions across single and mixed tasks for RT and ER and for both devices. For single tasks, neither individual mean RT nor mean ER in walking blocks correlated with comfort speed. There was also no significant correlation with the individual congruency effect across devices in ER. In RT, however, there was a significant positive correlation between the size of the congruency effect when using the headset and comfort speed, r(22) = .57, p = .003, indicating that the faster the comfort speed, the more pronounced the congruency effect. The same contrast for the use of smart glasses was not significant. For mixed tasks, there was no significant correlation of comfort speed with switch costs or congruency effects in RT or ER, but there was one significant negative correlation of individual mean RT while walking with comfort speed, r(22) = −.41, p = .046: the faster the comfort speed, the slower the individual mean RT. At first sight, these findings might suggest a detrimental effect of increased gait speed on cognitive performance; however, the data are not convincing enough to generalize at this point (only two significant correlations out of 16 contrasts in total). Importantly, the introduction of the personal comfort speed was initially intended to compensate for individual differences in walking speed and to set an individual speed baseline. It is therefore not surprising that there were hardly any correlations between comfort speed and measures of cognitive performance.
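For readers unfamiliar with the notation, r(22) denotes a Pearson correlation with 22 degrees of freedom, i.e., N = 24 participants. A minimal sketch of such a correlation test, using simulated placeholder values rather than the study's data:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 24  # matches the reported degrees of freedom (df = n - 2 = 22)

# Hypothetical per-participant values, for illustration only.
comfort_speed = rng.normal(1200, 200, n)       # comfort speed in mm/s
congruency_effect = rng.normal(40, 15, n)      # RT(incongruent) - RT(congruent), ms

r, p = pearsonr(comfort_speed, congruency_effect)
print(f"r({n - 2}) = {r:.2f}, p = {p:.3f}")
```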



Fig. 2. Response times and error rates of mixed tasks as a function of movement, device, task transition and congruency. Error bars reflect the 95% confidence interval around the mean.

4 Discussion

In this study we investigated the influence of gait on selective attention while using smart devices in an order picking setting. The results shall be used to assess the order picker's current mental state in order to adapt task instructions on the smart devices (smart glasses and headset). To this end, we employed a task switching paradigm (two customer orders as tasks) [14] and an Eriksen flanker task (sorting articles to the left or right) [12] while standing or walking at comfort speed on a treadmill. Task instructions (cues) were presented on smart glasses or a headset. Participants performed single task


blocks with only one customer order or mixed tasks in which the customer order could change from trial to trial. We registered RT and ER and expected differential switch costs (cognitive flexibility) and congruency effects (distractibility) as cognitive load indicators with respect to the smart device and movement condition.

4.1 Synopsis of Results

Results indicated switch costs (i.e., a performance decrement for task switches relative to task repetitions) and congruency effects (i.e., a performance decrement for incongruent relative to congruent stimuli), which further depended on the current smart device and walking condition. In single task response times, responses to smart glasses were slower than to the headset. Results of single task errors suggested increased congruency effects for the headset while standing compared with the use of smart glasses, while congruency effects of the two smart devices were similar during walking. In mixed task response times, switch costs were more pronounced for headset use than for smart glasses. Congruency effects were most pronounced in repetitions during the use of smart glasses while walking. In mixed task errors, smart glasses produced more errors. In sum, the combination of a particular smart device and movement condition determined the effect on attention and thus the order picker's mental state.

4.2 Influence of Smart Devices on Cognitive Performance – Application in Warehouses

The results of the single tasks can be used to infer the cognitively ergonomic use of smart devices in well-structured warehouses without much distraction [10]. Here, smart glasses yielded slower responses but also, while standing, smaller congruency effects; participants were therefore less prone to distraction or cognitive conflict. Depending on the overall goal of system adaptation regarding speed and accuracy as cognitive load indicators, either the order picker's speed while using headsets or the distraction resistance of smart glasses can be favored. The performance in mixed tasks can be compared with the performance in crowded and busy warehouses requiring cognitive flexibility and resistance to distraction and cognitive conflict [10]. The smaller switch costs of smart glasses indicate better cognitive flexibility. However, smart glasses also showed larger congruency effects while walking in task repetitions and generally more errors. Smart glasses are therefore only favorable when distractions while walking are not frequent in the warehouse.

4.3 Influence of Gait on Cognitive Performance – Application in Warehouses

The movement conditions had an influence on cognitive performance with respect to smart glasses and congruency effects in single task errors while standing. The standing condition can simulate the situation of an order picker in front of a shelf, where selective attention is needed to search for the respective item. Here, congruency effects were smaller for smart glasses compared with headsets. Participants were less


distracted by conflicting information. The modality switch from instructions (auditory) to stimuli on the screen (visual) was accompanied by costs [15]. Instructions via smart glasses therefore seem favorable. Gait also had an effect on cognitive performance in mixed tasks. Here, in task repetitions, congruency effects were especially large while walking and wearing smart glasses. In line with the idea of two visual processing streams, one focusing on walking and the other on the cognitive task [5], incongruent stimuli yielded a large decrease in cognitive performance when the visual modality was involved in the cognitive task. Displaying instructions on smart glasses, which draws on visual attention, is therefore less favorable while order pickers move through warehouses. While walking, staying focused is crucial for workplace safety. Busy and crowded warehouses provide much visual input and are potentially dangerous, so the use of headsets for instructions should be preferred over smart glasses.

4.4 Instruction Adaptation in Technical Systems

The finding of context-specific cognitive load, which depends on gait as well as on the smart device in use, suggests that order pickers could benefit from adaptive instructions. Generally, instructions via smart glasses are favorable regarding cognitive flexibility, but perceiving conflicting visual information is more harmful when wearing smart glasses while walking. In order to establish safe and healthy workplaces, these differences in cognitive load indicators based on gait information and task requirements need to be taken into account to adapt instructions accordingly. For example, in challenging situations, the physical properties (e.g., contrast, color intensity, font size) or the content of instructions could be adjusted according to the respective load. As soon as the cognitive load decreases, the changes can be reversed. Moreover, instructions could fade out on the smart glasses to reduce distraction while the order picker moves around in the warehouse. In addition, other parameters of the work situation could be adapted, such as the physical workload of a worker or the number of assignments and breaks in a given time period. Using gait information might prove beneficial in the future compared to other indicators because no additional mobile equipment for monitoring the mental state of the worker is required. Acceleration sensors can easily be integrated into the smart devices themselves, and gait information could be derived from their data. However, more research is needed to link the acceleration information to cognitive functions. In addition, the influence of smart devices and cognitive functions on gait, which was outside the scope of the current study, needs to be further explored. In doing so, differentiating between dual-task costs and effects due to different task prioritization becomes possible [5].
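As an illustration of the kind of adaptation rule suggested here, the sketch below switches or simplifies the instruction presentation depending on whether the worker is walking and on a cognitive load estimate. The thresholds, data structures, and the derivation of the load estimate are hypothetical and only meant to make the idea concrete.

```python
from dataclasses import dataclass

@dataclass
class WorkerState:
    walking: bool            # e.g., derived from the device's acceleration sensors
    load_estimate: float     # 0 (low) .. 1 (high), e.g., from switch costs / congruency effects

@dataclass
class InstructionStyle:
    channel: str     # "smart_glasses" or "headset"
    font_px: int
    contrast: float  # 0 .. 1
    verbose: bool

def adapt_instruction(state: WorkerState, style: InstructionStyle) -> InstructionStyle:
    """Hypothetical adaptation rule following the discussion above:
    visual instructions are de-emphasized while walking or under high load."""
    if state.walking and style.channel == "smart_glasses":
        # Prefer the auditory channel while the worker moves through the warehouse.
        return InstructionStyle(channel="headset", font_px=style.font_px,
                                contrast=style.contrast, verbose=False)
    if state.load_estimate > 0.7:
        # Under high load: larger font, higher contrast, shorter content.
        return InstructionStyle(channel=style.channel,
                                font_px=max(style.font_px, 48),
                                contrast=1.0, verbose=False)
    return style
```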

5 Conclusion

The combination of a particular smart device and walking condition determined the effect on attention and thus the order picker's mental state, suggesting that gait speed and performance requirements can be used as cognitive load indicators in technical


systems to adapt instructions. Physical properties (e.g., contrast, color intensity, font size) or the content of instructions can be adjusted to match the current mental state.

References

1. Grosse, E.H., Glock, C.H., Jaber, M.Y., Neumann, W.P.: Incorporating human factors in order picking planning models: framework and research opportunities. Int. J. Prod. Res. 53, 695–717 (2015)
2. Grosse, E.H., Glock, C.H., Neumann, W.P.: Human factors in order picking: a content analysis of the literature. Int. J. Prod. Res. 55, 1260–1276 (2017)
3. Larco, J.A., de Koster, R., Roodbergen, K.J., Dul, J.: Managing warehouse efficiency and worker discomfort through enhanced storage assignment decisions. Int. J. Prod. Res. 55, 6407–6422 (2016)
4. Lodree Jr., E.J., Geiger, C.D., Jiang, X.: Taxonomy for integrating scheduling theory and human factors: review and research opportunities. Int. J. Ind. Ergon. 39, 39–51 (2009)
5. Beurskens, R., Bock, O.: Age-related deficits of dual-task walking: a review. Neural Plast. 2012, 1–9 (2012)
6. Yogev-Seligmann, G., Rotem-Galili, Y., Mirelman, A., Dickstein, R., Giladi, N., Hausdorff, J.M.: How does explicit prioritization alter walking during dual-task performance? Effects of age and sex on gait speed and variability. Phys. Ther. 90, 177–186 (2010)
7. Bock, O.: Dual-task costs while walking increase in old age for some, but not for other tasks: an experimental study of healthy young and elderly persons. J. Neuroeng. Rehabil. 5, 27 (2008)
8. Barra, J., Bray, A., Sahni, V., Golding, J.F., Gresty, M.A.: Increasing cognitive load with increasing balance challenge: recipe for catastrophe. Exp. Brain Res. 174, 734–745 (2006)
9. Tomporowski, P.D., Audiffren, M.: Dual-task performance in young and older adults: speed-accuracy tradeoffs in choice responding while treadmill walking. J. Aging Phys. Act. 22, 557–563 (2014)
10. Kreutzfeldt, M., Renker, J., Rinkenauer, G.: The attentional perspective on smart devices: empirical evidence for device-specific cognitive ergonomics. In: Rebelo, F., Soares, M.M. (eds.) AHFE 2018. AISC, vol. 777, pp. 3–13. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-94706-8_1
11. Boisgontier, M.P., Beets, I.A., Duysens, J., Nieuwboer, A., Krampe, R.T., Swinnen, S.P.: Age-related differences in attentional cost associated with postural dual tasks: increased recruitment of generic cognitive resources in older adults. Neurosci. Biobehav. Rev. 37, 1824–1837 (2013)
12. Eriksen, B.A., Eriksen, C.W.: Effects of noise letters upon the identification of a target letter in a nonsearch task. Percept. Psychophys. 16, 143–149 (1974)
13. Botvinick, M.M., Braver, T.S., Barch, D.M., Carter, C.S., Cohen, J.D.: Conflict monitoring and cognitive control. Psychol. Rev. 108, 624–652 (2001)
14. Koch, I., Poljac, E., Müller, H., Kiesel, A.: Cognitive structure, flexibility, and plasticity in human multitasking—an integrative review of dual-task and task-switching research. Psychol. Bull. 144, 557–583 (2018)
15. Hunt, A.R., Kingstone, A.: Multisensory executive functioning. Brain Cogn. 55, 325–327 (2004)

Using Learning Analytics to Explore the Performance of Chinese Mathematical Intelligent Tutoring System

Bor-Chen Kuo, Chia-Hua Lin, Kai-Chih Pai, Shu-Chuan Shih, and Chen-Huei Liao

National Taichung University of Education, Taichung, Taiwan
[email protected]

Abstract. This study introduces a Chinese dialogue-based mathematical intelligent tutoring system that was developed to help students learn mathematics without teachers. Student interactions with the system produce a wide range of learning sequences. The present study explored students' learning performance and the effectiveness of knowledge components by using the additive factors model and the Bayesian knowledge tracing model. The model fit and the learning patterns of students are discussed.

Keywords: ITS · Fraction multiplication and division · Learning analytics · Learning curve

1 Introduction

An intelligent tutoring system (ITS) is a computer system that provides adaptive learning feedback to learners. The development of intelligent tutoring systems (ITSs) has been explored in multiple domains over the last decades, such as physics, computer literacy, and mathematics [1, 2]. For mathematics, several ITSs have been developed and widely used for students of different ages, such as Cognitive Tutor (grade 9 to grade 12), Mathia (grade 6 to grade 8) and ALEKS (K to grade 12). These systems provide topics for students to learn and give adaptive context-specific instruction when an individual student needs it. Dialogue-based ITSs were developed to help students express domain knowledge in their own words or phrases [2]; examples include AutoTutor and Operation ARA [3]. Dialogue-based ITSs have been developed for computer literacy, physics, and critical thinking and have produced learning gains. Previous studies evaluated the effectiveness of ITSs by analyzing effect sizes in comparisons of ITSs to human tutors. The results showed that there was no statistical difference in effect size between expert one-on-one human tutors and step-based ITSs [4]. Other studies found similar results for dialogue-based ITSs [4]. ITSs therefore seem to be effective for students learning domain knowledge. Recently, some studies explored students' learning sequences in online courses, intelligent tutoring systems, and other learning platforms [5–7]. Students' interactions with


systems were recorded as log files while students typed, clicked on an action space, or dragged and dropped objects. Learning analytics models such as Bayesian knowledge tracing (BKT) and the Additive Factors Model (AFM) [8, 9] can be used to analyze the sequence of students' interactions in order to evaluate, explain, and predict students' knowledge. However, few studies have used such learning analytics models to analyze students' learning performance in a Chinese dialogue-based mathematical ITS. The purpose of the current study is to examine whether the Chinese dialogue-based mathematical intelligent tutoring system is effective in helping students learn mathematics.

2 Method

2.1 Chinese Dialogue-Based Mathematical Intelligent Tutoring System

The Chinese mathematical intelligent tutoring system is a web-based learning system that was developed based on the theoretical framework of intelligent tutoring systems and AutoTutor [10]. The four-model ITS architecture (domain model, tutoring model, interface model, and student model) is implemented in the system. The development and procedure of the system are illustrated in Fig. 1.

Fig. 1. The procedure of Mathematical Intelligent Tutoring System


Domain Model. The current study designed instruction on fraction multiplication and division for sixth graders. Ten main training mathematical problems were designed, covering ten main concepts of fraction multiplication and division. The knowledge structure of the concepts was constructed by domain experts. The sequence of the concepts' development and the relationships between these mathematical skills are presented in Figs. 2 and 3.

Fig. 2. The knowledge structure on fraction multiplication

Fig. 3. The knowledge structure on fraction division


Interface Model. The Chinese dialogue-based mathematical ITS contains four key interface components: a conversational agent, the dialogue history, the question display, and response typing. The computer agent interacts with students with facial expressions and speech. The main question asked by the computer agent is shown in the upper left; some questions contain a picture. Students type their answers, including math formulas and texts, drag and drop objects, or choose one answer in the lower area. Figures 4, 5 and 6 present the interface of our math ITS, including formula typing and drag-and-drop interaction.

Fig. 4. The interface of math ITS (formula typing)

Fig. 5. The interface of math ITS (drag and drop)


Fig. 6. The interface of math ITS (drag, drop and paint)

Tutoring Model. The adaptive feedback the system gives to students is based on human tutoring: it simulates a teacher's teaching interactions and strategies, a five-step tutoring frame, and the expectation and misconception-tailored dialogue (EMT) of AutoTutor [2, 11]. Students interact with a conversational tutor to learn mathematics by typing equations and texts, dragging and dropping objects, and choosing correct answers. The system assesses whether the students' responses are correct or incorrect and whether there are any misconceptions, and immediately gives corresponding feedback. The system provides some dialogue moves

Fig. 7. The instruction material of math ITS


for students to articulate the expectations. Each expectation is associated with a series of pumps, hints, prompts, and assertions that assist students in constructing their mathematics concepts. The system provides a cycle of pump → hint → prompt → assertion for each expectation until the expectation is covered. The system also provides an instruction video for students to review the math concepts if they fail to answer the question correctly more than three times. Figure 7 presents the instruction material of the mathematical ITS, and Table 1 presents examples of the different types of feedback.

Table 1. The examples of different types of feedback

Feedback type – Feedback examples
main question – "One box contains 3 cakes. If your sister eats 1 1/2 boxes, how many cakes does she eat? Please write down your equation."
main question – "Great! We did the correct equation. Now please calculate the answer. Remember that you need to represent your answer by a mixed fraction or an irreducible fraction."
hint – "Can the problem be solved by subtraction? Notice that the "unit" is different. Let's try again."
prompt – "Fraction multiplication is different from fraction addition. Fraction multiplication does not reduce to a common denominator."
teaching – "The orange colored part of the figure represents 5/6. The red colored part of the figure is what we need to calculate. The answer is 5/6 × 1/3 = 5/18."
short feedback – "Great! You are right."

Student Model. In the mathematical ITS, several methods were implemented to evaluate students' answers and update their learning state. For mathematical equations, an automated analysis process to diagnose error patterns, developed by Yang, Kuo, and Liao in 2011, was used [12]. The prototypes of the error patterns were compared to the responses of the participants by using block-based matching analysis.


Moreover, the Additive Factors Model (AFM) is a statistical model of student learning that estimates students' prior knowledge, the difficulty of tutored skills, and the rates at which these skills are learned. The current study tracks and updates students' learning state using AFM.
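For reference, the AFM is commonly written as a logistic model of the probability that student i answers step j correctly (the notation below follows the usual formulation in the learning analytics literature and is not taken from this paper):

```latex
\log\frac{p_{ij}}{1 - p_{ij}} \;=\; \theta_i \;+\; \sum_{k} q_{jk}\,\beta_k \;+\; \sum_{k} q_{jk}\,\gamma_k\,T_{ik}
```

Here θ_i is the proficiency of student i, q_{jk} indicates whether step j involves knowledge component k, β_k is the easiness of KC k, γ_k its learning rate, and T_{ik} the number of prior practice opportunities student i has had on KC k.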

2.2 Participants

The participants of this research were 72 sixth graders from one elementary school in Taichung City. Students used the mathematical ITS for remedial instruction.

2.3 Experiment Procedure

All students underwent a two-session intervention. Each session lasted forty minutes. Students learned five main concepts (fraction multiplication) in the first session and the other five main concepts (fraction division) in the second session. The students' responses were recorded as log files. The data were coded to fit the format of DataShop, which is a data repository and web application for researchers analyzing learning sequence data.

2.4 Data Analysis

In this research, we used existing student models available in LearnSphere (see http://learnsphere.org/index.html), namely the AFM and the BKT model. LearnSphere has been developed by researchers at CMU, MIT and the University of Memphis and is funded by the National Science Foundation [12]. Two statistics, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), are employed to evaluate these models. For both of these measures, the lower the value, the better the model performs. In addition, we use learning curves to understand students' learning performance on each KC; the shape of the learning curves may reveal opportunities for improving the domain model, the instructional activities, and their sequence [13].
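Both criteria are computed from a fitted model's log-likelihood, its number of parameters, and (for BIC) the number of observations; a minimal sketch with placeholder values (not the values reported in Table 2):

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Placeholder example: lower AIC/BIC indicates the better-fitting model.
print(aic(-1200.0, 100), bic(-1200.0, 100, 4343))
```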

3 Results

The cognitive model implemented in the fraction multiplication and division unit of the Chinese mathematical ITS had 22 knowledge components (KCs). The knowledge components are a combination of 10 problems and 2–4 steps. The problems are:

P1 – Fraction multiply whole numbers
P2 – Proper fraction multiply proper fraction
P3 – Mixed fraction multiply whole numbers
P4 – Improper fraction multiply improper fraction
P5 – Mixed fraction multiply mixed fraction
P6 – Dividing fractions by whole number
P7 – Fractions division in same denominator
P8 – Whole numbers divided by fraction
P9 – Fraction division in different denominator
P10 – Division with remainders as fractions

The steps are S1 – produce mathematic forms, S2 – calculation, S3 – write the remainder, and S4 – write the whole number, respectively. Therefore, P1S1 represents the KC of producing mathematic forms in fraction multiply whole numbers. Our data consist of 4343 data points involving 72 students and 275 unique steps. Each data point is a correct or incorrect student action corresponding to a single production execution.
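A small sketch of the KC naming scheme (purely illustrative; which steps occur for which problem is defined by the domain model, not by this code):

```python
# A KC is the combination of a problem (P1-P10) and a step (S1-S4), written e.g. "P1S1".
# Not every problem uses every step, which is why there are 22 KCs rather than 40.
problems = [f"P{i}" for i in range(1, 11)]
steps = [f"S{j}" for j in range(1, 5)]

def kc_label(problem: str, step: str) -> str:
    return f"{problem}{step}"

print(kc_label(problems[0], steps[0]))  # -> "P1S1": produce mathematic forms in P1
```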

3.1 AFM vs. BKT

As can be seen in Table 2, the AFM shows a better model fit than BKT, with lower AIC and BIC. This result is consistent with Gong et al. [15], who found that a factor-analysis-based method is as good as or better than BKT.

Table 2. Compare AFM and BKT model

Model  AIC      BIC
AFM    2605.01  3459.44
BKT    5225.52  6016.18

3.2 Learning Curve

Having gathered data on student practice with a set of learning activities, we can generate a set of learning curves for the fraction multiplication and division unit. A learning curve visualizes changes in student performance over time on different KCs. The line graph displays opportunities along the x-axis and a measure of student performance (e.g., error rate) along the y-axis. As Fig. 8 shows, these learning curves reveal improvement in student performance (i.e., the error rate decreases) as the opportunity count (i.e., practice with a given knowledge component) increases. As Fig. 9 shows, some KC curves are not declining but are already at a low error rate from the start; such a pattern indicates that the KC is already known and mastered, and therefore little to no learning is expected. For these KCs, one might consider reducing the number of opportunities in the student model or handling them in the tutoring model. In addition, there are two KCs classified as "Too little data", because students did not practice these KCs enough.
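Learning curves of this kind can be derived directly from DataShop-style transaction logs by computing, for each KC, the mean error rate at each opportunity count. A hedged sketch with assumed column names (student, kc, correct):

```python
import pandas as pd

def learning_curves(df: pd.DataFrame) -> pd.DataFrame:
    """Assumed columns: student, kc, correct (0/1), ordered by time within student.
    Returns the mean error rate per KC and opportunity count."""
    df = df.copy()
    # Opportunity count: how many times this student has practiced this KC so far.
    df["opportunity"] = df.groupby(["student", "kc"]).cumcount() + 1
    df["error"] = 1 - df["correct"]
    return (df.groupby(["kc", "opportunity"])["error"]
              .mean()
              .reset_index(name="error_rate"))
```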


Fig. 8. Learning Curves in category “Good”


Fig. 8. (continued)

Fig. 9. Learning Curves in category “Low and flat”


4 Conclusions

The current study explored students' performance on knowledge components in a Chinese dialogue-based mathematical intelligent tutoring system using the AFM and BKT models. This yields several observations. First, our results showed that the AFM fits the data better than the BKT model does. This observation is consistent with the results of Gong et al. [15] and differs from that of Lin and Chi [16], which shows that BKT is better than AFM. Therefore, more work is needed to explore the applicability of the models under different conditions. Second, most of the learning curves of the knowledge components demonstrated improvement in students' learning. This indicates that the Chinese math ITS with adaptive feedback is helpful for learning fraction multiplication and division.

References

1. VanLehn, K., Graesser, A.C., Jackson, G.T., Jordan, P., Olney, A., Rose, C.P.: When are tutorial dialogues more effective than reading? Cogn. Sci. 31, 3–62 (2007)
2. Graesser, A.C., D'Mello, S.K., Hu, X., Cai, Z., Olney, A., Morgan, B.: AutoTutor. In: McCarthy, P.M., Boonthum, C. (eds.) Applied Natural Language Processing and Content Analysis: Identification, Investigation and Resolution, pp. 169–187. IGI Global, Hershey (2012)
3. Nye, B.D., Graesser, A.C., Hu, X.: AutoTutor and family: a review of 17 years of natural language tutoring. Int. J. Artif. Intell. Educ. 24(4), 427–469 (2014)
4. VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 46(4), 197–221 (2011)
5. Koedinger, K.R., Cunningham, K., Skogsholm, A., Leber, B.: An open repository and analysis tools for fine-grained, longitudinal learner data. In: Proceedings of the International Conference on Educational Data Mining, Montreal, Quebec, Canada, pp. 157–166 (2008)
6. Chi, M., VanLehn, K., Litman, D., Jordan, P.: An evaluation of pedagogical tutorial tactics for a natural language tutoring system: a reinforcement learning approach. Int. J. Artif. Intell. Educ. 21, 83–113 (2011)
7. Sawyer, R., Rowe, J., Azevedo, R., Lester, J.: Filtered time series analyses of student problem-solving behaviors in game-based learning. In: Proceedings of the Eleventh International Conference on Educational Data Mining, Buffalo, New York, pp. 229–238 (2018)
8. Stamper, J.C., Koedinger, K.R.: Human-machine student model discovery and improvement using DataShop. In: Biswas, G., Bull, S., Kay, J., Mitrovic, A. (eds.) AIED 2011. LNCS (LNAI), vol. 6738, pp. 353–360. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21869-9_46
9. Corbett, A.T., Anderson, J.R.: Knowledge tracing: modeling the acquisition of procedural knowledge. User Model. User Adapt. Interact. 4(4), 253–278 (1994)
10. Pai, K.C., Kuo, B.C., Liao, C.H., Liu, Y.M.: An application of Chinese dialogue-based intelligent tutoring system in remedial instruction for mathematics learning. Educ. Psychol. (submitted)
11. Graesser, A.C.: Conversations with AutoTutor help students learn. Int. J. Artif. Intell. Educ. 26, 124–132 (2016)


12. Yang, C.W., Kuo, B.C., Liao, C.H.: A HO-IRT based diagnostic assessment system with constructed response items. Turk. Online J. Educ. Technol. TOJET 10, 46–51 (2011)
13. Koedinger, K., Booth, J.L., Klahr, D.: Instructional complexity and the science to constrain it. Science 342(6161), 935–937 (2013)
14. Goldin, I., Pavlik Jr., P.I., Ritter, S.: Discovering domain models in learning curve data. In: Design Recommendations for Intelligent Tutoring Systems, p. 115 (2016)
15. Gong, Y., Beck, J.E., Heffernan, N.T.: Comparing knowledge tracing and performance factor analysis by using multiple model fitting procedures. In: Aleven, V., Kay, J., Mostow, J. (eds.) ITS 2010, Part I. LNCS, vol. 6094, pp. 35–44. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13388-6_8
16. Lin, C., Chi, M.: Intervention-BKT: incorporating instructional interventions into Bayesian knowledge tracing. In: Micarelli, A., Stamper, J., Panourgia, K. (eds.) ITS 2016. LNCS, vol. 9684, pp. 208–218. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39583-8_20

Eye Blinks Describing the State of the Learner Under Uncertainty

Johanna Renker, Magali Kreutzfeldt, and Gerhard Rinkenauer

Leibniz Research Centre for Working Environments and Human Factors, Dortmund, Germany
[email protected]

Abstract. Adaptive systems are able to support human-machine interaction to a great extent. However, the question arises which parameters are useful for gaining insights into the user and can easily be implemented in an adaptive system. Eye blinks are frequent and mostly automatic actions that reflect attentional and cognitive processes. They have not gained much attention in the context of adaptive systems until now. Thus, the current experiment investigated the number of blinks as an indicator of the state of the user while interacting with a technical system. Participants had to perform a dynamic visual spatial search task while their eye blinks were tracked. The task was to predict the appearance of target objects and thereby to learn a probability concept in order to improve the prediction. Results showed that eye blinks could distinguish between good and poor learners and increased in parallel with increasing task performance. Further, eye blinks reflected the information processing during the performance of a trial and the completion of the task. Thus, eye blinks might inform about the needs of the user with regard to the amount and level of detail of new information as well as additional help. However, the individual variability necessitates a separate baseline to be determined for each user. Further research is needed to corroborate the results in more applied settings.

Keywords: Eye tracking · Blink rate · Probability learning · Mental representation · Uncertainty

1 Introduction

When users interact with technical systems for the first time, they start to develop a mental representation of the functioning of the technical system that gradually becomes more detailed. In dealing with the system, users have to learn how it works and deal with uncertainties about its exact functioning due to inexperience. If users interact with a ticket machine for the first time, for example, they might need some time to search for and find the appropriate items when using it, press the back button more often, or not understand that they have to touch the screen in order to activate the menu. If they are trained to use this ticket machine and know the possible processes, they are quite fast and can navigate to the desired ticket almost automatically. Thus, they develop a mental representation of the ticket machine, which is a reduced but, for the user, coherent copy of reality. Performance improves with a more accurate mental


representation [1]. It would be helpful if the technical system could take the state of the learner into account and adapt accordingly to support the learning process. In line with this, the question arises which indicators could be used for this adaptation. In the human-computer interaction literature, eye movement parameters are often described as useful for gaining insights into cognitive processes and thus for detecting usability problems. The best-known parameters are listed by, e.g., Jacob and Karn [2] and Ehmke and Wilson [3]. One parameter that might prove highly relevant but has not yet gained much attention in the context of adaptive systems is the blink rate. It has already been reported that eye blinks are related to information processing and attentional processes, e.g., during choice response tasks [4]. There is evidence that an increasing blink rate indicates the completion of a cognitive task, when attention decreases and the likelihood of missing relevant information is low. Further, the blink rate decreases when the task is cognitively more demanding, indicating a higher attentional load [5, 6]. Brookings et al. [5] also concluded in their applied study with air traffic controllers and different challenging scenarios that eye blinks are more sensitive to workload caused by task difficulty than other eye movement parameters. These findings emphasize the potential of the blink rate as an interesting indicator of the state of the learner. Due to the close link to information processing and attentional processes, blinks in the current study might also reflect learning processes and the level of uncertainty. In the following, it is assumed that the blink rate increases in parallel with the learning process. The blink rate has also been studied in the context of IQ by Paprocki and Lenskiy [7]. Participants had to rest and to perform an IQ test. The authors grouped their participants into a higher IQ group and a lower IQ group. Interestingly, they found that participants in the higher IQ group generally showed a lower blink rate and a higher eye-blink rate variability while resting than participants in the lower IQ group. The authors stated that the eye-blink rate variability indicated the cognitive state of the user and assumed a strong relation to cognitive performance. Thus, we also assume that users with accurate mental representations of the technical system show fewer blinks in comparison to users with flawed mental representations, due to cognitive abilities and the increased attentional effort to process information. Importantly, depending on the situation, the blink rate is also associated with other effects that have to be kept constant, such as lighting and climate, or have to be controlled, such as fatigue [8]. It is quite common to use blinks as an indicator of fatigue in sleepiness alert systems, for example in cars. In the literature, blink duration seems to be one of the most notable oculomotor parameters reflecting fatigue [e.g. 9]. However, in the current study we assume that fatigue is not the driving force, as participants constantly have to concentrate on the task for a limited period of time.

2 Method

2.1 Participants

A total of 31 participants (20 female) with a mean age of 23 years (SD = 4) took part in the experiment at the Leibniz Research Centre for Working Environments and Human Factors. All but


four were not students. All participants were right-handed and had normal vision without visual aids. They gave informed consent and were naïve about the study's purpose before participating in the experiment. At the end of the experiment, they received either course credit or payment for their participation.

2.2 Experimental Settings

Gabor figures with different patterns (vertical, horizontal or diagonal lines) and a diameter of 2 cm were presented on a 23.6-in. monitor (1080 × 1920) at a distance of approximately 75 cm. The target object moved into a dark room, i.e., a black square of 20 × 20 cm, that was presented in the middle of the monitor. An entrance at the bottom and three exits (left, top and right) were drawn in the black square (Fig. 1). While participants performed the experimental task, their eyes were tracked with the SMI RED500 (SensoMotoric Instruments, Teltow, Germany) every 2 ms (500 Hz sampling frequency). Blinks were detected with the Event Detector for High Speed Event Detection provided by the iView software. In order to avoid head movements, a chin rest was used. Further, indirect lighting was set up to prevent interference with the eye tracker due to light reflections.

Fig. 1. Starting position of the experiment: Participants were asked to predict the exit of the presented target object by pressing the appropriate arrow key (left, top or right).

2.3 Procedure

In order to investigate whether the blink rate is a useful indicator of the learning process under uncertainty, the occluded visual spatial search task (OVSST) was set up. Participants had to observe one of three target objects disappearing into a quadratic room and reappearing at one of three exits (left, top, right). Participants were asked to predict


the reappearance of the target objects at the exits by pressing the appropriate arrow key of the keyboard directly after the target object had disappeared into the room. Participants had to perform the prediction task within 2 000 ms, and auditory error feedback was provided if the arrow key was pressed too late or too early (see Fig. 2 for a schematic description). In order to increase task performance, participants had to learn the underlying probability concept. Each object was associated with one of the exits with a higher probability (74%) and with each of the other exits with a lower probability (11%). In the remaining 4% of the cases, objects reappeared at the bottom entrance. This rare occurrence was used to analyze the behavior of the participants with regard to special unexpected and uncertain situations.
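The probability concept can be made concrete with a small simulation sketch: each target object has one likely exit (74%), two unlikely exits (11% each), and a 4% chance of reappearing at the bottom entrance. The concrete object-to-exit mapping below is arbitrary and only for illustration; the actual assignment used in the experiment is not asserted here.

```python
import random

# Hypothetical object -> likely-exit mapping (for illustration only).
LIKELY_EXIT = {"diagonal": "left", "horizontal": "right", "vertical": "top"}
EXITS = ["left", "top", "right"]

def sample_exit(obj: str, rng: random.Random) -> str:
    """Draw the reappearance location for one trial: 74% likely exit,
    11% each for the two other exits, 4% bottom entrance."""
    likely = LIKELY_EXIT[obj]
    others = [e for e in EXITS if e != likely]
    return rng.choices([likely, others[0], others[1], "bottom"],
                       weights=[74, 11, 11, 4], k=1)[0]

rng = random.Random(1)
print(sample_exit("diagonal", rng))
```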

Fig. 2. Schematic description of the occluded visual spatial search task (OVSST) performed in the current experiment. First, participants fixate a fixation cross in the middle of the room (750 ms). Then, one of the three target objects appears at the bottom entrance (1 000 ms) and moves into the room within 1 000 ms. After the target object has disappeared into the room participants are instructed to predict at which exit the target object will reappear with the arrow keys of the keyboard within 2 000 ms. Finally, the target object fades in at one of the exit positions within 3 000 ms.

Participants were not informed about any probability concept; they were only instructed to improve their predictions. In the experimental condition, participants had to perform four blocks of the prediction task, each with 81 trials. They started with two training blocks of 21 trials each in order to become familiar with the task. At the end of the experiment, participants were asked to estimate the probability relations via


questionnaire and to indicate whether they had understood any probability concept ("Do you think the target object (diagonal lines, horizontal lines, vertical lines) influences the exit where the symbol reappears?"). Participants performed a computer version of the D2 Test of Attention by Brickenkamp [10] before and after the completion of the experimental task in order to check for systematic changes in attention capacities, e.g., due to fatigue as a confounding variable. The D2 Test of Attention measures accuracy and speed during the discrimination of similar visual stimuli.

2.4 Data Analysis

Data of the experimental condition were adjusted by excluding 3.5% of the trials from the analysis due to missing responses. Another 1.7% of the trials were excluded because less than 65% of the eye tracking data within those trials were available due to tracking errors. A two-way repeated measures ANOVA with the within-subjects variables block (1–4) and judgment (correct, incorrect) was employed to statistically analyze the development over time of the number of correct predictions and the number of blinks. For the within-trial analysis, blinks at the valley and peak values before and after the object appearance, prediction task and object reappearance (±50 ms) were averaged, and planned t-tests were performed to check whether the number of blinks indicates information processing and the completion of the task (Fig. 5). A hierarchical cluster analysis was performed to group participants into two performance clusters (high and low performers). High performers (N = 16) predicted 62% (SD = 7%) of the cases correctly, and low performers (N = 15) made a correct prediction in only 33% (SD = 5%) of the cases. Planned t-tests were used to identify meaningful differences in the number of correct predictions and the number of blinks between these groups.
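The grouping into high and low performers could, in principle, be reproduced with a standard hierarchical cluster analysis on each participant's proportion of correct predictions. The sketch below uses simulated accuracies and Ward linkage as one common choice; it is not claimed to be the exact procedure or software used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Simulated per-participant accuracy (proportion of correct predictions), N = 31.
accuracy = np.concatenate([rng.normal(0.62, 0.07, 16), rng.normal(0.33, 0.05, 15)])

Z = linkage(accuracy.reshape(-1, 1), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")   # two performance clusters

for cluster in np.unique(labels):
    print(cluster, accuracy[labels == cluster].mean().round(2))
```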

3 Results

Overall, 20 of the 31 participants confirmed that the target object was related to the exits. In 74% of the cases, this self-report coincided with the grouping of the cluster analysis into low and high performers. Further, results of the questionnaire showed that participants who confirmed understanding a probability concept had actually learned the tendencies of the probability concept correctly, as shown in Table 1 (mean maximum deviation from the given probabilities: 12%). In total, they predicted 62% of all trials correctly and chose the likely object-exit association of the probability concept in 62% of the cases. If the Gabor figure with diagonal lines appeared, participants predicted the likely left exit in 65.5% of the cases. If the Gabor figure with horizontal lines was presented, the likely left exit was chosen in 54.9% of the cases, and if the Gabor figure with vertical lines was displayed, the likely top exit was chosen in 65.6% of the cases. The results showed that in 64% of the cases the response strategies did not change after the rare occurrence, i.e., the object reappearance at the bottom entrance. Thus, this special situation was not examined in more detail.


Table 1. Subjective probability concept of all participants who confirmed to understand any probability concept (N = 20).

Gabor figure      Exit    Subjective probability concept (74-11-11)
Gabor figure 1    left    63% (24%)
                  top     18% (13%)
                  right   19% (12%)
Gabor figure 2    left    20% (14%)
                  top     18% (10%)
                  right   62% (21%)
Gabor figure 3    left    15% (10%)
                  top     70% (19%)
                  right   15% (10%)

Note. The object-exit associations with higher probabilities are shown in bold. Values in brackets show the standard deviation.

The number of correct predictions for all participants (N = 31) increased significantly across blocks, F(3, 90) = 13.65, p < .001, ηp² = .313, indicating learning effects particularly from Block 1 to Block 2 (Fig. 3). Planned t-tests for high performers, t(15) = 4.57, p < .001, as well as for low performers, t(14) = 6.29, p = .060, showed a significant or almost significant increase from Block 1 to Block 2. In parallel with the increasing number of correct predictions, the main effect of block for the number of blinks showed a similar trend, F(3, 90) = 2.25, p = .104, ηp² = .070, suggesting an increase in the number of blinks across blocks. However, we observed no main effect of judgment for the number of blinks, F(1, 30) = 0.80, p = .779, ηp² = .003 (Fig. 4). The within-trial analysis of the number of blinks showed an increase after the appearance and reappearance of the target object at the exit as well as after the completion of the prediction task (see Fig. 5), indicating a strong relation to the presentation of information and thus attentional processes. Most interestingly, results of the cluster analysis showed that high performers blinked significantly less often compared with low performers, t(29) = 3.24, p = .003 (Fig. 6). The analysis of the D2 test showed that participants selected significantly more targets after the experiment than before, t(30) = 7.04, p < .001. In the pre-test, they detected 64% (SD = 16%) of the cases on average and in the post-test 71% (SD = 17%) of the cases on average, presumably due to learning effects. Additionally, the error rate changed significantly from the pre-test (M = 98%, SD = 2%) to the post-test (M = 99%, SD = 1%), t(30) = 2.67, p = .013. The results did not differ between low and high performers (p > .05). Thus, attention generally did not seem to decrease from the beginning to the end of the experiment.


Fig. 3. Mean number of correct predictions for high and low performers across blocks (1–4). The maximum is 81 correct predictions per block. Error bars depict standard errors.

Fig. 4. Number of blinks as a function of block (1–4) and judgment (correct, incorrect). Error bars depict standard errors.


Fig. 5. The mean number of blinks during the time course of the trial for correctly and incorrectly predicted trials. The first vertical grey line marks the beginning of the object appearance (1), the second line marks the beginning of the prediction task (2) and the third line marks the beginning of the reappearance of the object (3) from left to right.

Fig. 6. The number of blinks for low and high performers. Error bars depict standard errors.


4 Discussion

Besides learning effects across blocks, it was expected that the blink rate reflects the development of the mental representation. The blink rate indeed increased while participants learned the probability concept, supposedly indicating that relevant information was processed more easily and the attentional effort accordingly decreased. However, we could only detect a trend, probably due to the rapid learning in Block 1 owing to the easiness of the task. Participants only had to learn three target objects associated with three exits and tended to focus mainly on the likely object-exit associations. This result is generally consistent with the inverse relation between task difficulty and eye blink frequency reported earlier in the review by Martins and Carvalho [11]. In the current experiment, task difficulty also decreases as participants learn to handle the situation. If participants reported understanding the probability concept, they had developed a quite accurate mental representation of the relations between the target objects and exits and mainly chose the likely object-exit association. Interestingly, they tended to underestimate higher probabilities and overestimate lower probabilities, probably because of a regression to the mean [12, 13]. Participants who did not understand any probability concept also seemed to learn a strategy to deal with the situation, but they developed an inaccurate mental representation of the task, so that task performance did not increase to a greater extent. The within-trial analysis showed that blinks were suppressed during the object appearance and object reappearance in order to process the presented information about the target object. Participants need to identify the target object to activate the association with the appropriate exit, and they have to learn this association by combining the target object with the reappearance at the distinct exits. One step further, they have to estimate the probability of occurrence to improve their predictions and to develop an accurate mental representation of the OVSST. Blinks were also suppressed before the completion of the prediction task, suggesting the acquisition of adequate attentional resources to perform the task. Thus, results of earlier studies could be replicated [e.g. 4, 14, 15]. Finally, high performers showed a lower blink rate than low performers, which could be attributed to the focus on the task and cognitive abilities. As already reported in the Introduction, Paprocki and Lenskiy [7] studied eye blinks during the performance of an IQ test, and they also found that subjects in the higher IQ group blinked less often than those in the lower IQ group. They explain this finding by the connection of eye blinks to the quality of higher cognitive processes. Our findings for high and low performers seem to mirror these results, since the high performers obviously developed a better mental representation of the task than the low performers. It might seem contradictory that, on the one hand, the overall number of blinks increases with the number of correct predictions but, on the other hand, high performers blink less than low performers. However, in general all participants develop a mental representation and learn a strategy, as indicated by the increasing number of blinks. Thereby, good learners develop an accurate mental representation of the task and seem to be more focused on the task, as reflected by fewer blinks than poor learners.
The low task performance of poor learners caused by an inaccurate mental representation of the task might have different reasons: motivational causes, concentration deficits or cognitive


inabilities. As the D2 concentration test indicated no systematic changes in attention capacities, a general lack of attention resources and fatigue can be excluded. Overall, there seemed to be high interindividual variability with regard to the number of correct predictions, indicated by the large standard error bars, as well as to the number of blinks. The uniqueness of individual participants' eye blink patterns has already been reported in the literature [11] and has to be taken into account if blinks are used as an indicator of the state of the learner, e.g., by first defining a baseline for each individual user. In sum, the blink rate might be a relevant indicator of the state of the learner. In the future, technical systems with eye-tracking software might detect the state of the user and adapt, for example, by keeping information simple and even repeating it and by presenting additional information for low performers. Thus, users will be supported on an individual level, possibly before they experience difficulties during the interaction with the technical system. Such an approach could provide a way to maintain or optimize the cognitive load during learning [e.g. 16]. However, more research is needed to corroborate the results. The next step towards using the blink rate as an indicator for learner state diagnosis in adaptive instructional systems would be to implement and test the blink rate in more applied research studies, for example in a VR learning setting [e.g. 17], as blinks could easily be detected and information adapted accordingly.
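As a closing illustration of how an adaptive instructional system might use the blink rate together with an individual baseline, the sketch below normalizes the current blink rate against a per-user baseline and flags when information should be kept simple or supplemented. The threshold, the direction of the rule, and all names are hypothetical simplifications of the idea discussed above.

```python
from dataclasses import dataclass

@dataclass
class BlinkBaseline:
    mean_bpm: float   # baseline blinks per minute for this user
    sd_bpm: float

def needs_support(current_bpm: float, baseline: BlinkBaseline,
                  z_threshold: float = -1.0) -> bool:
    """Hypothetical rule: a blink rate well below the user's own baseline is read
    as a sign of high attentional effort during learning (cf. the discussion above),
    so the system would keep information simple, repeat it, or offer additional help."""
    z = (current_bpm - baseline.mean_bpm) / baseline.sd_bpm
    return z < z_threshold

# Example: a user with a baseline of 15 +/- 4 blinks per minute currently blinking 8 times per minute.
print(needs_support(8.0, BlinkBaseline(mean_bpm=15.0, sd_bpm=4.0)))  # -> True
```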

References
1. Donnell, M.L.: Human cognition and the expert systems interface mental models and explanation facilities. In: Ntuen, C.A., Park, E.H. (eds.) Human Interaction with Complex Systems. The Kluwer International Series in Engineering and Computer Science, vol. 372, pp. 343–349. Springer, Boston (1996). https://doi.org/10.1007/978-1-4613-1447-9_26
2. Jacob, R.J., Karn, K.S.: Eye tracking in human-computer interaction and usability research: ready to deliver the promises. In: Hyönä, J., Radach, R., Deubel, H. (eds.) The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research, pp. 573–605. North-Holland, Amsterdam (2003). https://doi.org/10.1016/b978-044451020-4/50031-1
3. Ehmke, C., Wilson, S.: Identifying web usability problems from eye-tracking data. Paper presented at the British HCI Conference 2007. University of Lancaster, UK (2007)
4. Wascher, E., Heppner, H., Möckel, T., Kobald, S.O., Getzmann, S.: Eye-blinks in choice response tasks uncover hidden aspects of information processing. EXCLI J. 14, 1207–1218 (2015). https://doi.org/10.17179/excli2015-696
5. Brookings, J.B., Wilson, G.F., Swain, C.R.: Psychophysiological responses to changes in workload during simulated air traffic control. Biol. Psychol. 42, 361–377 (1996). https://doi.org/10.1016/0301-0511(95)05167-8
6. Maffei, A., Angrilli, A.: Spontaneous eye blink rate: an index of dopaminergic component of sustained attention and fatigue. Int. J. Psychophysiol. 123, 58–63 (2018). https://doi.org/10.1016/j.ijpsycho.2017.11.009
7. Paprocki, R., Lenskiy, A.: What does eye-blink rate variability dynamics tell us about cognitive performance? Front. Hum. Neurosci. 11, 1–9 (2017). https://doi.org/10.3389/fnhum.2017.00620


8. Stern, J.A., Boyer, D., Schroeder, D.: Blink rate: a possible measure of fatigue. Hum. Factors J. Hum. Factors Ergon. Soc. 36, 285–297 (1994). https://doi.org/10.1177/001872089403600209
9. Schleicher, R., Galley, N., Briest, S., Galley, L.: Blinks and saccades as indicators of fatigue in sleepiness warnings: looking tired? Ergonomics 51, 982–1010 (2008). https://doi.org/10.1080/00140130701817062
10. Brickenkamp, R.: Test D2: Aufmerksamkeits-Belastungstest, 2nd edn. Hogrefe, Göttingen (1994)
11. Martins, R., Carvalho, J.M.: Eye blinking as an indicator of fatigue and mental load—a systematic review. In: Arezes, P.M. (ed.) Occupational Safety and Hygiene III, pp. 231–235. CRC Press, Boca Raton (2015)
12. Beuer-Krüssel, M., Krumpal, I.: Der Einfluss von Häufigkeitsformaten auf die Messung von subjektiven Wahrscheinlichkeiten. Methoden – Daten – Analysen: Zeitschrift für Empirische Sozialforschung 3, 31–57 (2009)
13. Fischhoff, B., Beyth, R.: I knew it would happen. Organ. Behav. Hum. Perform. 13, 1–16 (1975). https://doi.org/10.1016/0030-5073(75)90002-1
14. Chen, S., Epps, J., Ruiz, N., Chen, F.: Eye activity as a measure of human mental effort in HCI. In: Pu, P., Pazzani, M., André, E., Riecken, D. (eds.) Proceedings of the 15th International Conference on Intelligent User Interfaces – IUI 2011, pp. 1–4. ACM Press, New York (2011). https://doi.org/10.1145/1943403.1943454
15. Irwin, D.E., Thomas, L.E.: Eyeblinks and cognition. In: Coltheart, V. (ed.) Macquarie Monographs in Cognitive Science. Tutorials in Visual Cognition, pp. 121–141. Psychology Press, New York (2010)
16. Sweller, J.: Implications of cognitive load theory for multimedia learning. In: Mayer, R. (ed.) The Cambridge Handbook of Multimedia Learning, pp. 19–30. Cambridge University Press, Cambridge (2005). https://doi.org/10.1017/CBO9780511816819.003
17. Law, B., Atkins, M.S., Kirkpatrick, A.E., Lomax, A.J.: Eye gaze patterns differentiate novice and experts in a virtual laparoscopic surgery training environment. In: Duchowski, A.T., Vertegaal, R. (eds.) Proceedings of the Eye Tracking Research & Applications Symposium – ETRA 2004, pp. 41–48. ACM Press, New York (2004). https://doi.org/10.1145/968363.968370

Adaptive Remediation with Multi-modal Content

Yuwei Tu¹, Christopher G. Brinton¹,², Andrew S. Lan³, and Mung Chiang⁴

¹ Zoomi, Inc., Chesterbrook, USA
[email protected]
² Princeton University, Princeton, USA
³ University of Massachusetts Amherst, Amherst, USA
⁴ Purdue University, West Lafayette, USA

Abstract. Remediation is an integral part of adaptive instructional systems: it supplements lectures when the delivered content proves too difficult for a user to fully grasp in a single class session. To extend current remediation methods from a single type of source to combinations of different material types, we propose an adaptive remediation system with multi-modal remediation content. The system operates in four main phases: ingesting a library of multi-modal content files and segmenting it into bite-sized chunks, linking the chunks based on topical and contextual relevance, modeling users' real-time knowledge state as they interact with the delivered course and determining whether remediation is needed, and finally identifying a set of remediation segments that address the current knowledge weakness using the relevance links. We conducted two studies of our adaptive remediation system in an advanced engineering course taught at an undergraduate institution in the US and evaluated the system in terms of productivity. Both studies show that our system increases productivity by at least 50%.

1 Introduction

Many online learning platforms have emerged in recent years, servicing learning scenarios from corporate training programs to Massive Open Online Courses (MOOCs). Originally developed as supplements to in-class delivery, Adaptive Instructional Systems (AISs) have received recent attention in online learning [7,8,17,18]. A typical AIS today analyzes question responses submitted by users to maintain individual knowledge state models, and then adjusts the delivery of future modules by re-ordering, augmenting, and/or skipping over content according to a set of rules and possibly alternate content files created by the instructor [16]. Remediation is an integral part of the learning process in AISs: it provides a supplement to lectures where the delivered content proves too difficult for a user to fully grasp in a single class session, due to many possible factors (e.g., weakness in prerequisites, disengagement in class, or unclear explanations by the


instructor). The traditional "one-size-fits-all" approach to remediation, in which the instructor creates a single set of remediation content for the entire class, is undesirable due to the well-documented heterogeneity in user backgrounds, abilities, and strategies. Creating and managing personalized remediation for each user, on the other hand, would be difficult for an instructor to scale to even medium-sized courses. As a result, it is desirable to build systems that can automatically create, select, and deliver individualized remediation content based on inferences of a user's weaknesses in particular topic areas.
To date, many methods for adaptive learning in education have been proposed, with results demonstrating varying levels of effectiveness [4,11,15,23]. One aspect that remains largely unstudied, however, is the systematic delivery of multi-modal remediation content, i.e., combinations of different material types such as textbooks, lecture videos, web pages, practice questions, and interactive simulations. Doing so is increasingly relevant today, as users are becoming accustomed to visiting multiple sources of content outside of the standard lecture material (e.g., through a search engine). It is thus desirable to develop methods that select pieces of multi-modal content for a particular user's remediation and integrate them into a single application for delivery to the user.
In this paper, we develop and evaluate a system that provides adaptive multi-modal remediation to users as they progress through a course. The operation of our system can be divided into four main phases:
1. Ingesting a library of multi-modal content files forming the baseline course and segmenting the materials into bite-sized chunks.
2. Linking those content chunks based on topical and contextual relevance.
3. Modeling users' knowledge state as they interact with the delivered course through the system, and determining whether remediation is needed at that point.
4. Identifying a set of remediation segments addressing the current knowledge weakness and integrating them into a single module of content for users.
In Sect. 5, we present the results of a trial we conducted to evaluate our remediation model in an advanced engineering mathematics course taught at an undergraduate institution in the US. In particular, we evaluate our method on a productivity metric defined as the total score on the questions divided by the total time spent. Using a series of statistical tests, we show that our individualization system significantly outperforms one-size-fits-all course delivery, increasing productivity by at least 50%, especially when content is varied at the segment level.

2 Related Work and Contributions

The long history of thoughts and admonitions about adapting instruction to individual students' needs has been documented by many researchers (e.g., [19,20,24]). There are several possibilities for how instruction can be adapted: the macro-adaptive approach, the aptitude-treatment interaction approach, the


micro-adaptive approach, and the constructivistic-collaborative approach, listed in chronological order beginning with the oldest [12]. Among these theoretical approaches, the most popular is the micro-adaptive approach, in which adaptive instruction is conducted on a micro level by diagnosing the user's specific learning needs during instruction [17]. It has been widely used in modern adaptive instructional systems such as Intelligent Tutoring Systems (ITSs), which apply a variety of artificial intelligence techniques to represent the learning and teaching process. Inspired by ITSs, and combined with hypermedia-based systems, another family of AISs has been developed: Adaptive Hypermedia Systems (AHSs). There are three main criteria for AHSs: the system is based on hypertext or hypermedia; a user model is applied; and the system is able to adapt the hypermedia using this user model [5]. There are also two different types of AHSs. Adaptive presentation directly provides an adaptation of the content, which is presented in different ways or orders; the content can be adapted in its level of detail, difficulty, and media usage to satisfy users with different needs, background knowledge, and knowledge states [6]. The other type is adaptive navigation support, which only presents an adaptation of navigation through direct guidance, adaptive hiding or re-ordering of links, link annotation, map adaptation, link disabling, and link removal [22]. As users are becoming accustomed to visiting multiple sources of content outside of the original lecture materials, it is necessary for AHSs to combine different material types (like textbooks, lecture videos, web pages, and external links); unfortunately, most existing adaptive systems do not meet this requirement [7,15]. One of our main contributions is to propose an adaptive remediation system that is capable of systematically delivering multi-modal remediation content. Our method identifies pieces of multi-modal content for a particular user's remediation without being limited to the standard course content and integrates them into a single application for convenient delivery to the user. This greatly enhances the extensibility and efficiency of regular adaptive remediation systems. In addition, even though a variety of AHSs have been designed and implemented, it is still claimed that there is little empirical evidence for the effectiveness of AHSs [2,4,17]. Moreover, most current works only compare users' grades with and without AHSs, but having higher scores does not mean learning more efficiently [21]. Another main contribution of our paper is that we propose a new evaluation metric, productivity, which considers the trade-off between points earned and time spent. This can be further extended into a uniform adaptive system evaluation criterion [9]. Finally, we also conduct trials in real classrooms, and our experimental results provide strong evidence of the effectiveness of our proposed multi-modal remediation system.

3 System Methodology

In this section, we present the system we have developed to support multi-modal remediation. A high-level block diagram of its major components is shown in Fig. 1, with the individualization itself consisting of three parts: Content Tagging (Sect. 3.2), User Modeling (Sect. 3.3), and Path Switching (Sect. 3.4). In describing these components, we will overview several variants implemented for different use cases. Subsequently, in Sect. 4, we will formalize the specific algorithms implemented for the user trials in Sect. 5.

Fig. 1. Overview of our adaptive remediation system and its key components.

3.1 Inputs

The Individualization System in Fig. 1 receives two types of inputs: measurements of user behavior, and the course content itself.
Course Content. The course content is stored in the Content Database. It consists of a series of modules ordered for delivery to users according to the instructor's syllabus for the course, with each module containing a set of content files. Importantly, these files can be of different formats corresponding to different learning modes, including videos (.mp4), PDFs (.pdf), and slideshow presentations (.pptx). The modules can also include quiz questions, which the Player application delivers to the user upon completion of the content in the module.
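A minimal sketch of how such a Content Database input could be represented is shown below; the type names and fields are assumptions made for illustration, since the paper does not prescribe a schema:

```python
# Hypothetical schema for the Content Database: modules ordered by the syllabus,
# each holding content files of different modes plus optional quiz questions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContentFile:
    path: str          # e.g., "module3/lecture.mp4"
    mode: str          # "video", "pdf", or "pptx"

@dataclass
class QuizQuestion:
    prompt: str
    answer: str

@dataclass
class Module:
    index: int
    files: List[ContentFile]
    quiz: List[QuizQuestion] = field(default_factory=list)

course = [
    Module(1,
           [ContentFile("m1/notes.pdf", "pdf"), ContentFile("m1/lecture.mp4", "video")],
           [QuizQuestion("What does OFDM stand for?",
                         "orthogonal frequency-division multiplexing")]),
]
```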


User Behaviors. As the user interacts with the course application, the Player collects four types of data: responses to assessment questions, clickstream measurements generated from navigation through content files [3], posts on discussion forums comprising the Social Learning Network (SLN) [25], and annotations consisting of notes, bookmarks, and highlights. Each assessment response is tied to a particular question, while each clickstream measurement is recorded on a particular segment: segments are partitions of content files, as determined by Content Tagging, described next. If necessary, performance prediction [13] can also be used to estimate a user's score on assessments she/he did not take.

3.2 Content Tagging

The purpose of Content Tagging is to associate the learning materials with the topics they contain. To do this, course files are first broken down into segments, where each segment is a 20 s chunk of video or a page/slide in a document/slideshow. In this process, a textual representation of each segment is obtained, where any audible components are passed through speech-to-text conversion and optical character recognition (OCR) is applied to any images. Quizzes are also included in this process, with each question comprising a quiz being treated as a segment. The collection of segments in the course is then passed through natural language processing (NLP) algorithms for topic inference, with each segment modeled as a bag-of-words. Several NLP techniques are possible here, the important requirement being that they infer segment-topic and topic-word distributions, by either (i) treating topics as latent dimensions of the model or (ii) treating an outline/syllabus of the course as the topics and associating each segment with these topics by, e.g., frequency of occurrence. The output of Content Tagging is then an association of each segment with its constituent topics: this association is specified as a probability distribution for each segment, with each part of the distribution expressing the amount of a particular topic comprising the segment.

3.3 User Modeling

The purpose of User Modeling is to estimate a user’s knowledge state and/or content preferences with respect to each topic as the user proceeds through the course, so that Path Switching can adapt accordingly. The user model is updated through analysis of all or a subset of the input measurements collected by the Player. As a simple example, answering a test question correctly signifies an increase in content knowledge on the tested topics [4]. As another example, exhibiting high engagement in a certain file is interpreted as an increase in preference for this content mode [9]. More generally, our AIS invokes specific behavioral sequences called “motifs” to update the user model: these motifs are recurring subsequences of actions identified through sequential pattern mining algorithms that have been a-priori associated with increases/decreases in knowledge state and/or increases/decreases in content preferences [3].


When a user exhibits a motif on a set of segments, the dimensions of the user model corresponding to the topics covered in this set are updated based on the motif's association with knowledge state changes. In this way, the user's knowledge state is tracked as they progress through the course, indicating topics needing further instruction/remediation and presentation preferences [10].
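To make the motif mechanism concrete, the sketch below maintains a per-topic knowledge-state vector and nudges it whenever a recognized motif fires on a set of segments. The motif catalogue, the update magnitudes, and the topic distributions are all invented for illustration; they are not taken from the paper.

```python
# Hypothetical motif-driven update of a per-topic knowledge state. Each motif
# has an a priori association (+ / -) with knowledge-state change; the update
# is weighted by the topic distributions of the segments the motif occurred on.
import numpy as np

MOTIF_EFFECT = {"rewatch_then_correct": +1.0, "rapid_skip": -0.5}  # invented catalogue

def update_knowledge_state(knowledge, motif, segment_topics, learning_rate=0.1):
    """knowledge: (T,) vector; segment_topics: (n_segments, T) topic distributions."""
    effect = MOTIF_EFFECT[motif]
    topic_weight = segment_topics.mean(axis=0)   # which topics the motif touched
    return np.clip(knowledge + learning_rate * effect * topic_weight, 0.0, 1.0)

knowledge = np.full(3, 0.5)                                  # three topics, neutral prior
segment_topics = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])
print(update_knowledge_state(knowledge, "rewatch_then_correct", segment_topics))
```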

3.4 Path Switching

Upon completion of each module, Path Switching analyzes the user model to determine whether remediation is needed and, if so, synthesizes the alternate content for rendering in the Player application. The decision to remediate is based on a comparison of the current knowledge state to an expected state for each topic from the module coverage. If the knowledge state is insufficient, the combination of topics for which the user needs assistance is identified, and Path Switching then searches for the segments that have the highest-scoring match to the needed remediation, described further below. These segments are drawn primarily from the course files in the Content Database, but may also come from alternate files available in an External Database, e.g., additional courses provided by the instructor. In general, the match score between a segment and the needed remediation is determined based on three factors: topic relevance, contextual relevance, and historical utility. Topic relevance measures the variation between the topic distributions (i.e., lower-variation segments are more useful), and contextual relevance measures how far away the segment is from the current module (i.e., closer segments are more relevant to what is being taught). Historical utility quantifies how effective the segment has been for remediation in the past, updating over time as the segment is chosen and subsequent changes in knowledge state are observed. With the list of segments in hand, the system splits them into different sets for each content type and creates remediation files containing these segments. Finally, the files are instantiated in a remediation module and delivered to the user, after which point the user is returned to the original path.
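The paper does not give a closed-form expression for the match score, but one plausible reading of the three factors is a weighted combination such as the sketch below; the weights, the distance decay used for contextual relevance, and the function name are assumptions for illustration only.

```python
# Hypothetical combination of the three matching factors described above:
# topic relevance (cosine similarity of topic distributions), contextual
# relevance (decaying with module distance), and historical utility (e.g., a
# running average of observed knowledge-state gains after past remediations).
import numpy as np

def match_score(theta_seg, theta_need, seg_module, current_module,
                historical_utility, w=(0.5, 0.3, 0.2)):
    topic_rel = float(np.dot(theta_seg, theta_need) /
                      (np.linalg.norm(theta_seg) * np.linalg.norm(theta_need)))
    context_rel = 1.0 / (1.0 + abs(seg_module - current_module))   # closer is better
    return w[0] * topic_rel + w[1] * context_rel + w[2] * historical_utility

print(match_score(np.array([0.6, 0.3, 0.1]), np.array([0.5, 0.4, 0.1]),
                  seg_module=3, current_module=5, historical_utility=0.8))
```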

4 Algorithms and Implementation

With an understanding of our AIS from Sect. 3, we now detail the specific algorithms used to test multi-modal content remediation in the trials in Sect. 5. Here, we restrict the input of the system to quiz responses only, which is the most common input in other adaptive systems. We denote the course library as a set of segments S, and the linear version of the course is delivered as a sequence of modules M = {1, 2, ...}. S_m ⊂ S is the subset of segments making up module m, and μ(s) is the module (or set of modules) in which s appears. Q ⊂ S is the set of segments that are quiz questions, with Q_m ⊂ Q being the set of questions asked in m. C is the set of content material (non-quiz) segments, S = C ∪ Q, and C_m is the set of content segments in module m, S_m = C_m ∪ Q_m.

4.1 Content Topic Modeling

With each file in the course broken down into segments and ordered accordingly, our system extracts key words from the collection of documents and then creates a bag-of-words representation for each segment s; concretely, the bag-of-words over the dictionary X = {w_1, w_2, ...} of non-stopwords appearing in the course is x_s, where x_s(k) is the number of times word w_k ∈ X appears in s. We then infer a topic distribution for each segment through the Latent Dirichlet Allocation (LDA) algorithm [1]. LDA extracts document-topic and topic-word distributions from a corpus of documents; here, segments are treated as separate documents. With the number of topics T extracted by the model chosen to minimize the coherence value, the resulting topic distributions θ_s, s = 1, ..., |S|, are T-dimensional probability vectors, forming the matrix Θ = [θ_s]. Each segment's topic distribution can then be taken as its content tag. As a simple example, suppose a course consists of six topics A, B, C, D, E, and F. The distribution [0, 0.2, 0, 0.5, 0.3, 0] then specifies a segment comprised of 50% of words from topic D, 30% of words from topic E, and 20% of words from topic B, which would occur if the breakdown of words in this segment followed these particular topic proportions (a distribution must sum to 1). The frequencies of the topic terms are significant here because, intuitively, the more frequently a term appears, the more important that term is likely to be to the particular segment. Note also that stop words (for example, I, and, the, and so forth) need to be excluded prior to applying the NLP techniques. Also, if a syllabus, outline, or related material is available, that material is usable as a guidepost to better understand the topics. The distributions, particularly the frequencies of the topic terms, are used later to calculate similarities between content files and are used relative to syllabus topics. With the segment-topic distribution matrix, we construct a segment-to-content similarity matrix D. It has dimensions |S| × |C|. More specifically, D = [d_{s,c}], where d_{s,c} = cos(θ_s, θ_c) is a number between 0 and 1 that measures how similar segment s (either a content material or quiz file) is to content segment c with respect to their topic proportions. Higher is more similar.
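As a concrete illustration of this pipeline, the following minimal sketch uses scikit-learn (an assumption; the paper does not name a library) to fit an LDA model over toy segment texts and to build the segment-to-content similarity matrix D. The texts, the number of topics, and the quiz/content split are invented for the example.

```python
# Minimal sketch of the content-tagging step: bag-of-words -> LDA topic
# distributions -> cosine-similarity matrix D between all segments and the
# content (non-quiz) segments. Toy data; not the authors' implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical segment texts: the first four are content segments, the last two are quiz questions.
segments = [
    "bandwidth spectrum allocation wireless channel",
    "modulation symbols constellation error rate",
    "routing packets network congestion control",
    "channel coding redundancy error correction",
    "which modulation scheme minimizes symbol error rate",   # quiz segment
    "how does congestion control react to packet loss",      # quiz segment
]
content_idx = [0, 1, 2, 3]   # indices of C, the non-quiz segments

# Bag-of-words x_s over the course dictionary X (stop words removed).
vectorizer = CountVectorizer(stop_words="english")
X_counts = vectorizer.fit_transform(segments)

# Segment-topic distributions theta_s via LDA; T kept small for the toy corpus.
T = 3
lda = LatentDirichletAllocation(n_components=T, random_state=0)
theta = lda.fit_transform(X_counts)                 # shape (|S|, T), rows sum to 1

# Segment-to-content similarity matrix D = [d_{s,c}] via cosine similarity.
D = cosine_similarity(theta, theta[content_idx])    # shape (|S|, |C|)
print(D.round(2))
```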

4.2 User Behavior Modeling

In this individualization trial, we only consider quiz performance as a trigger, and we construct the sequence of quiz questions that the user answered incorrectly in each module. Q_w ⊆ Q_m is the subset of these questions that the user answered incorrectly. Then, for each question q in Q_w, we search for a remediation set R = {N_1(q), N_2(q), ..., N_L(q)}. These are L neighborhoods for question q, where L is the maximum number of times the individualization review can be triggered for a single module, with each l = 1, ..., L corresponding to a different review iteration. Each neighborhood N_l(q) is of size K, |N_l(q)| = K, where K is the maximum number of segments shown per question per iteration. The neighborhood

N_l(q) = {c_1, ..., c_K} ⊂ C = arg max_{c_1, ..., c_K} d_{q,c}

is chosen such that m ≠ μ(c_1) ≠ μ(c_2) ≠ ··· ≠ μ(c_K) and c_k ∉ N_{1:l−1}(q) ∀k. In other words, the segments in the lth neighborhood for question q are the K segments with the highest similarity to q such that (i) none of the segments appear in the same module as each other, (ii) none of the segments already appear in a previous neighborhood, and (iii) none of the segments are in the current module m. In practice, this can be built by sorting C in descending order based on d_{q,c}, removing all the files from the current module, and then setting N_1(q) to the first K items from this list that are not in the same module, N_2(q) to the second K items not from the same module, and so on.
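The sketch below implements this sort-and-filter construction on toy data structures; the variable names and the input format (a similarity row per quiz question, a segment-to-module map) are illustrative choices, not the authors' code.

```python
# Sketch of the neighborhood construction described above: for an incorrectly
# answered question q, rank content segments by similarity d_{q,c}, drop those
# from the current module, and greedily fill L neighborhoods of K segments
# drawn from mutually distinct modules, never reusing a segment.
def build_neighborhoods(q, D, segment_module, content_idx, current_module, K=2, L=3):
    # Candidates: content segments outside the current module, most similar first.
    candidates = sorted(
        (c for c in content_idx if segment_module[c] != current_module),
        key=lambda c: D[q][c],
        reverse=True,
    )
    neighborhoods, used = [], set()
    for _ in range(L):
        picked, picked_modules = [], set()
        for c in candidates:
            if c in used or segment_module[c] in picked_modules:
                continue
            picked.append(c)
            picked_modules.add(segment_module[c])
            if len(picked) == K:
                break
        used.update(picked)
        neighborhoods.append(picked)
    return neighborhoods

# Hypothetical inputs: a similarity row for quiz question 4 over the content
# segments, a segment -> module map, and the content-segment indices.
D = {4: {0: 0.9, 1: 0.7, 2: 0.4, 3: 0.8}}
segment_module = {0: 1, 1: 2, 2: 3, 3: 1, 4: 2}
print(build_neighborhoods(4, D, segment_module, content_idx=[0, 1, 2, 3], current_module=2))
```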

4.3 Individualization

At a high level, path switching selects a sequence of segments that have a high likelihood of making the learning process more efficient based on the User Model. In our individualization trial, path switching is triggered when a user has just finished the quiz Q_m at the end of module m, with Q_w ⊆ Q_m again denoting the subset of these questions that the user answered incorrectly. Starting with l = 1, the following logic (sketched in code after the list) determines the individualization the user receives at this point:
1. If Q_w = ∅ (all questions correct) or l > L (maximum iterations reached), go to step 5.
2. Set N_l = ∪_{q∈Q_w} N_l(q), the collection of unique segments in the lth neighborhoods of the questions answered incorrectly.
3. For each segment s ∈ N_l:
(a) Obtain the similarities between s and all other content segments that appear in the same module as s, i.e., d_{s,c} ∀c ∈ C_{μ(s)}. S_s ⊆ C_{μ(s)} is the subset of these segments for which d_{s,c} ≥ δ, i.e., those for which the similarity is at least δ.
(b) Generate a module r containing the segments s′ ∈ {s} ∪ S_s for which e(o(s′)) > E (the segments in the set for which the user's engagement with that content mode exceeds E). If there are no such segments, then let this module consist of segment s as a standalone document.
(c) Show the user the module r.
4. Let the user retake the set of questions Q_w he/she answered incorrectly. Update Q_w based on the result (the subset of questions answered incorrectly again). Increment l and return to step 1.
5. Unlock the explanations of the questions in module m, and allow the user to proceed to module m + 1.
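The following skeleton mirrors steps 1–5 above. The helper callables (deliver_module, retake_questions, engagement) and the thresholds δ and E are placeholders standing in for the Player and the User Model; this is an illustrative sketch under those assumptions, not the deployed system.

```python
# Skeleton of the path-switching loop. Q_w: incorrectly answered questions;
# neighborhoods[q][l-1]: the l-th neighborhood N_l(q); D: similarity matrix;
# module_content[m]: content segments of module m; mu[s]: module of segment s.
def individualize(Q_w, neighborhoods, D, module_content, mu, L, delta, E,
                  engagement, deliver_module, retake_questions):
    l = 1
    while Q_w and l <= L:
        # Step 2: unique segments in the l-th neighborhoods of the wrong questions.
        N_l = {c for q in Q_w for c in neighborhoods[q][l - 1]}
        for s in N_l:
            # Step 3a: same-module content segments with similarity >= delta.
            S_s = [c for c in module_content[mu[s]] if c != s and D[s][c] >= delta]
            # Step 3b: keep segments whose content mode the user engages with.
            r = [c for c in [s] + S_s if engagement(c) > E] or [s]
            # Step 3c: deliver the remediation module.
            deliver_module(r)
        # Step 4: retake only the questions answered incorrectly.
        Q_w = retake_questions(Q_w)
        l += 1
    # Step 5: unlock explanations and proceed to module m + 1 (handled by the Player).
    return Q_w
```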

5 Experiments

To test the efficacy of our adaptive remediation system, we conducted randomized controlled trials with one course offered to upper-class engineering students at an undergraduate institution in the US. We performed two trials at different levels of the course: the module level and the segment level, which will be


described in detail in the following section. For each trial, we randomly divided the users into two groups: an experimental group using our adaptive system and a control group without it. By considering the trade-off between the points gained for each question answered correctly and the time spent on each module of material, we produce an overall score measurement called productivity. Higher productivity represents higher learning efficiency, and it was hypothesized that the experimental group would have significantly higher productivity than the control group.

Fig. 2. A comparison of the original module (non-adaptive) and the virtual module (adaptive).

5.1 Experimental Setting

46 students enrolled in the engineering course participated in this study. The course content was on WiFi and had no relation to material that users were learning in class. Furthermore, the course used for this study was taught completely online. The baseline content consisted of fourteen modules (chapters), each being a combination of a lecture video and a PDF of lecture notes. Twelve of these fourteen modules were followed by a set of questions. The test subjects were split in half, with one half in the experimental group and the other half in the control group, and users were treated differently based on whether their responses to these question sets were correct or incorrect. For the control group, users were automatically pushed forward to the next module regardless of their responses. For the experimental group, however, the adaptive remediation algorithm was invoked whenever a question was answered incorrectly, searching for the material in the database that most directly corresponded to that specific question. The adaptive remediation system was implemented in two variants, each tested in a separate trial. The first trial provided support material at the module level, whereas the second trial provided support material at the segment level (segments extracted from individual files and assembled into a new file). This supporting material, shown in Fig. 2, was presented to the user with the goal that the learned knowledge would be reinforced and the user would be able to answer the question correctly before moving on to the next module.

5.2 Evaluation Metric

It is clear that the adaptive version would have users gain more points due to multiple chances at incorrectly answered questions. However, this comes at the expense of spending more time to earn those points, since users sit through the virtual reinforcement segments. Therefore, the benefit is analyzed statistically by considering the ratio of points gained per unit of time spent on the respective modules. Our productivity measure for a user at a given point in the course is

p_α(s, t) = s / t^α,

where s is the total points cumulatively obtained up to that time, t is the cumulative time spent, and α is a parameter controlling the importance of s versus t in the productivity measure. A higher α value places more importance on time spent, and thus yields a lower overall score, while a lower α value places more importance on points gained.
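As a quick sanity check on the formula, the snippet below computes p_α for a hypothetical learner (the numbers are made up, not trial data) under the two α settings used in the experiments:

```python
# Productivity p_alpha(s, t) = s / t**alpha for a hypothetical learner who has
# earned 12 points in 20 minutes; alpha = 1 and alpha = 0.5 mirror the two
# settings reported below.
def productivity(points, minutes, alpha):
    return points / (minutes ** alpha)

print(productivity(12, 20, alpha=1.0))   # 0.6  points per minute
print(productivity(12, 20, alpha=0.5))   # ~2.68, time penalized less heavily
```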

5.3 Results

We start with some basic statistics of the course. Table 1 shows a summary comparison of the aggregated productivity p between adaptive and non-adaptive participants, for different choices of α. Clearly, participants in the adaptive course demonstrated higher productivity


Table 1. Summary statistics

Method                   | Mean  | SD    | Median | Min | Max   | Kurtosis | α
Adaptive (Segment level) | 1.433 | 0.805 | 1.454  | 0   | 4.046 | 0.816    | 1
Adaptive (Module level)  | 1.329 | 0.646 | 1.389  | 0   | 3.306 | 0.579    | 1
Non-adaptive             | 0.878 | 0.582 | 0.887  | 0   | 3.223 | 3.010    | 1
Adaptive (Segment level) | 4.585 | 2.172 | 5.108  | 0   | 8.552 | 0.390    | 0.5
Adaptive (Module level)  | 4.491 | 1.826 | 4.834  | 0   | 9.350 | 0.998    | 0.5
Non-adaptive             | 2.843 | 1.559 | 2.977  | 0   | 7.241 | -0.021   | 0.5

Fig. 3. Productivity p by Chapter for α = 1 (productivity vs. Chapters 0–14; legend: Linear, Module_Adaptive, Segment_Adaptive)

with both α choices. In particular, users gain relatively higher productivity on average with segment-level adaptation. Overall, our adaptation system increases productivity by 50%–60%. We now plot the cumulative productivity score chapter by chapter in Figs. 3 and 4, grouped by segment-level adaptation, module-level adaptation, and no adaptation. In Figs. 3 and 4, consistently across all chapters of the course, users always achieve higher productivity scores in the adaptive courses. In particular, in 8 out of the 11 chapters, module-level adaptation demonstrates a higher median, while in 8 out of 11 chapters segment-level adaptation demonstrates higher 75th-percentile values. Intuitively, this observation can be explained by adaptation at


Fig. 4. Productivity p by Chapter for α = 0.5 (productivity vs. Chapters 0–14; legend: Linear, Module_Adaptive, Segment_Adaptive)

a segment level providing supporting materials at a detailed level with exactly the needed information. However, supporting material consisting of file segments may lack the coherence of information that an adaptation at the module level provides.

Table 2. Statistical differences between groups with choice of α (p-values).

        | Non-adaptive versus MLA | Non-adaptive versus SLA | SLA versus MLA
α = 0.5 | 1.522e-25               | 1.894e-27               | 0.025
α = 1   | 1.933e-18               | 1.038e-21               | 0.113

In addition to the visual comparison in Figs. 3 and 4, we conduct the Wilcoxon rank-sum test, also called the Mann-Whitney U test [14], to test whether there is a statistical difference between the productivity under segment-level adaptation (SLA), module-level adaptation (MLA), and the non-adaptive version of the course. Table 2 above records the p-values for each test; they show a statistically significant difference between the groups, except for SLA versus MLA when α = 1. This demonstrates that users in the experimental group achieve significantly higher productivity than those in the control group, which shows that our


method is effective at individualization. Moreover, individualizing at the segment level shows relatively higher productivity overall than individualizing at the module level, which directs our future research toward individualization at finer granularity.
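For readers who want to run the same kind of comparison on their own data, the snippet below applies the rank-sum test with SciPy on fabricated productivity samples (the values are illustrative and are not the trial data):

```python
# Wilcoxon rank-sum / Mann-Whitney U test between two productivity samples.
# The arrays are fabricated for illustration; they are not the trial data.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
adaptive = rng.normal(loc=1.4, scale=0.8, size=23).clip(min=0)      # e.g., SLA group
non_adaptive = rng.normal(loc=0.9, scale=0.6, size=23).clip(min=0)  # control group

stat, p_value = mannwhitneyu(adaptive, non_adaptive, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```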

6 Conclusion

In this paper, we propose an adaptive remediation system with multi-modal remediation content. The system consists of four main phases: ingesting a library of multi-modal content files and segmenting them into bite-sized chunks, linking the chunks based on topical and contextual relevance, modeling users' real-time knowledge state as they interact with the delivered course through the system, and finally identifying a set of remediation segments addressing the current knowledge weakness using the relevance links. We conducted two studies to test our adaptive remediation system in an engineering mathematics course taught at an undergraduate institution and evaluated the system on productivity. Using a series of statistical tests, we show that users in the experimental group achieve significantly higher productivity than those in the control group. Moreover, individualizing at the segment level shows relatively higher productivity overall than individualizing at the module level. These results show that our method is effective at individualization, increasing overall productivity by 50%–60%. In the future, we will conduct trials using additional inputs to user modeling, such as viewing behaviors and social learning networks. More user features will enable more sophisticated user modeling techniques. Furthermore, for multi-modal content remediation, rather than ingesting the course materials by natural characteristics, such as splitting videos into equal durations, we can further develop our content digestion methods based on the semantic meaning of the course content with advanced text/language segmentation techniques. Another direction we are investigating is a comparison to an augmented version of our adaptive remediation system with self-learning reinforcement learning techniques. A system integrated with reinforcement learning can provide remediation options to users, collect their responses, and subsequently self-adjust the adaptation agent for each user based on their responses to the remediation content.

References
1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
2. Bra, P.D.: Pros and cons of adaptive hypermedia in web-based education. Cyberpsychology Behav. 3(1), 71–77 (2000)
3. Brinton, C.G., Buccapatnam, S., Chiang, M., Poor, H.V.: Mining MOOC clickstreams: video-watching behavior vs. in-video quiz performance. IEEE Trans. Signal Process. 64(14), 3677–3692 (2016)
4. Brinton, C.G., Rill, R., Ha, S., Chiang, M., Smith, R., Ju, W.: Individualization for education at scale: MIIC design and preliminary evaluation. IEEE Trans. Learn. Technol. 8(1), 136–148 (2015)


5. Brusilovsky, P., Kobsa, A., Vassileva, J.: Adaptive Hypertext and Hypermedia. Springer, Dordrecht (1998). https://doi.org/10.1007/978-94-017-0617-9
6. Brusilovsky, P., Millán, E.: User models for adaptive hypermedia and adaptive educational systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web. LNCS, vol. 4321, pp. 3–53. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72079-9_1
7. Brusilovsky, P., Peylo, C.: Adaptive and intelligent web-based educational systems. Int. J. Artif. Intell. Educ. (IJAIED) 13, 159–172 (2003)
8. Brusilovsky, P., et al.: Adaptive and intelligent technologies for web-based education. Ki 13(4), 19–25 (1999)
9. Chen, W., Joe-Wong, C., Brinton, C.G., Zheng, L., Cao, D.: Principles for assessing adaptive online courses
10. Corbett, A.T., Anderson, J.R.: Knowledge tracing: modeling the acquisition of procedural knowledge. User Model. User-Adapted Interact. 4(4), 253–278 (1994)
11. De Bra, P., Calvi, L.: AHA: a generic adaptive hypermedia system. In: Proceedings of the 2nd Workshop on Adaptive Hypertext and Hypermedia, pp. 5–12 (1998)
12. Froschl, C.: User modeling and user profiling in adaptive e-learning systems. Master Thesis, Graz, Austria (2005)
13. Lan, A.S., Waters, A.E., Studer, C., Baraniuk, R.G.: Sparse factor analysis for learning and content analytics. J. Mach. Learn. Res. 15(1), 1959–2008 (2014)
14. Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947)
15. Melis, E., et al.: ActiveMath: a generic and adaptive web-based learning environment. Int. J. Artif. Intell. Educ. (IJAIED) 12, 385–407 (2001)
16. Paramythis, A., Loidl-Reisinger, S.: Adaptive learning environments and e-learning standards. In: Second European Conference on e-Learning, vol. 1, pp. 369–379 (2003)
17. Park, O.C., Lee, J.: Adaptive instructional systems. Educ. Technol. Res. Dev. 25, 651–684 (2003)
18. Park, O.C., Tennyson, R.D.: Computer-based instructional systems for adaptive education: a review. Contemp. Educ. Rev. 2(2), 121–135 (1983)
19. Reiser, R.A.: Instructional technology: a history. In: Instructional Technology: Foundations, pp. 11–48 (1987)
20. Ross, E.W.: Adapting teaching to individual differences. Soc. Sci. Rec. 26(1), 27–29 (1989)
21. Shepard, L.A.: The role of assessment in a learning culture. Educ. Res. 29(7), 4–14 (2000)
22. Sonwalkar, N.: Adaptive individualization: the next generation of online education. In: EdMedia: World Conference on Educational Media and Technology, pp. 3056–3063. Association for the Advancement of Computing in Education (AACE) (2007)
23. Tadlaoui, M.A., Aammou, S., Khaldi, M., Carvalho, R.N.: Learner modeling in adaptive educational systems: a comparative study. Int. J. Mod. Educ. Comput. Sci. 8(3), 1 (2016)
24. Tobias, S.: When do instructional methods. Educ. Res. 11(4), 4–9 (1982)
25. Yang, T.Y., Brinton, C.G., Joe-Wong, C.: Predicting learner interactions in social learning networks. In: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pp. 1322–1330. IEEE (2018)

Diagnostic Requirements for Efficient, Adaptive Robotic Surgery Training

Thomas E. F. Witte¹, Martin Schmettow², and Marleen Groenier²

¹ Fraunhofer FKIE, Fraunhoferstr. 20, 53343 Wachtberg, Germany
[email protected]
² University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{m.schmettow,m.groenier}@utwente.nl

Abstract. Robot-assisted surgery training is shifting towards simulation-based training. Challenges that accompany this shift are high costs, working-hour regulations, and the high-stakes nature of the surgery domain. Adaptive training could be a possible solution to these problems. First, an adaptive system needs diagnostic data with which it can make an action selection. A scoping literature search was performed to give an overview of the state of the research regarding diagnostic requirements. Diagnostic metrics should be (a) useful for formative and not only summative assessment of trainee progress, (b) valid and reliable, (c) as nonintrusive as possible for the trainee, (d) predictive of future performance in the operating theater, (e) explanatory, and (f) suitable for real-time assessment of the trainee's learning state. For a more in-depth understanding, further research is needed into which simulator parameters can be used as diagnostic metrics that can be assessed in real time. A possible framework for adaptive training systems is discussed, and future research topics are presented.

Keywords: Adaptive training · Robotic surgical training · Robotic surgery · Real-time assessment

1 Introduction

Robot-assisted surgery (RAS) is the next step in the evolution of the surgery domain. It is accompanied by a major paradigm change for surgery training. Traditionally, residents are trained 'on-the-job', which nowadays shifts to simulation-based training [1, 2]. According to Schreuder and Verheijen [3], the main benefits of RAS are "better ergonomics of the surgeon" and "improved dexterity", among others ([2], p. 201). However, the fast introduction of new technology and methods into the operating theater (OT) comes with a number of challenges, some of which resemble the challenges encountered with the swift introduction of laparoscopic surgery. Brunt, a founding member of the fundamentals of laparoscopy training, describes these as follows: "(1) a huge group of surgeons required training, as did residents, in an environment where not a lot of teachers were available; and (2) surgeons were being trained through industry-funded courses that were highly variable in terms of their format"


([4], p. 11). Without a different training paradigm, such as simulation-based training, the described situation led to surgeons who were inexperienced with the surgical technique and applied it in the OT on actual patients. The situation resulted in patients being harmed and, in some circumstances, even in death [5]. These so-called adverse events had put classical minimally invasive surgery into jeopardy, despite the advantages it has to offer. In response, standardized basic minimally invasive surgery training, for example the fundamentals of laparoscopy, was introduced for credentialing surgeons [4]. Similar problems are reported for RAS. In a retrospective U.S. Food and Drug Administration (FDA) study, data on self-reported critical incidents and adverse events between 2000 and 2013 were analyzed [6]. A total of 10,624 adverse events were gathered that occurred during 1,735,000 robotic surgeries. There were 1,391 injuries and 144 reported deaths of patients. The surgeons or staff caused only 7% of the events. However, the incidents and events were often the result of a combination of different causes, for example surgeon mistakes and device malfunctions. Further, 50% of the events resulting in death were reported without a categorization, possibly due to legal issues. Of particular interest are the "common flawed operational practices used by the surgical team that contributed to catastrophic events during surgery" ([6], p. 14) (Table 1).

Table 1. Practices that contributed to adverse events during robot-assisted surgeries, as reported by Alemzadeh et al. [6].

Causes for catastrophic events during robot-assisted surgery ([6], p. 14):
- "Inadequate experience with handling emergency situations
- Lack of training with specific system features
- Inadequate troubleshooting of technical problems
- Inadequate system/instrument checks before procedure
- Incorrect port placement
- Incorrect electro-cautery settings
- Incorrect cable connections
- Inadequate manipulation of robot master controls
- Inadequate manipulation between hand & foot movements
- Incorrect manipulation or exchange of instruments"

One conclusion of the authors is to provide real-time assessment of the team's actions during simulation training and during surgery to prevent some of the causes of adverse events. The problems mentioned above are just the tip of the iceberg regarding the challenges of modern surgery training programs for RAS. High costs of the robot system and longer anesthesia times for patients during training make on-the-job training inefficient [7, 8]. A systematic approach is needed to create appropriate training programs in time and to prevent a technology-practice gap, as happened with laparoscopy. Therefore, a standardized, state-of-the-art simulation-based robotic surgery training has to be developed, taking a human factors perspective into account.

1.1 A Human Factors Approach to Efficient, Effective and Safe Robotic Surgery Training

Literature reviews [9–11] were used as a starting point for a broader literature search to identify the characteristics of current robotic surgery training programs, which is described in the report by Witte [12]. As a result of the review, the core requirements for a RAS Training curriculum and the main components of a training are summarized: A RAS curriculum should be (a) cognitive science and expert literature based, (b) proficiency-based, (c) simulation based, (d) standardized, (e) validated, and (f) adaptive to the experience level of the trainee. A training should contain: (a) technical and non-technical skill modules, (b) emergency training, and (c) recursive proficiency-based assessments and credentialing of surgeons. Developing, maintaining and providing RAS training according to the requirements consists of three processes that act upon the curriculum and receive information from it: (a) training development, (b) training and (c) iterative updates for new surgery technology (see Fig. 1).

Fig. 1. Prototypical surgery training curriculum creation process [12]

From a human factors point of view, the training development begins with a task analysis, incorporating critical incident analysis [13]. With the results of the task


analysis, a syllabus of the declarative and procedural knowledge a surgeon should obtain is created. For the training itself, training modalities and the scheduling of the training are important aspects to consider. Further, the credentialing process should be standardized to ensure the quality of the training. Methods and technologies newly introduced or updated for use in the OT should be regularly and systematically evaluated to trigger a new task analysis. The requirement that training should be adaptive to the trainee's experience level and the above-mentioned recommendation of providing real-time assessment of the trainee make the training design even more complex. The simplified creation process of Fig. 1 is used in this paper to discriminate diagnostic requirements for the different aspects of an adaptive RAS training design.

1.2 Information Processing Model for Robotic Assisted Surgery

According to Cao and Rogers [13], the surgeon and the technical system can be seen as information processing systems. The "efficiency can be associated with the amount of information an operator can process per unit time. Also, task difficulty can be viewed in terms of the amount of information and the rate at which information is presented" [13, p. 76]. An information processing model for RAS is summarized in Fig. 2. To perform the surgery on the patient, the surgeon controls the manipulators, such as the camera and instruments, via a console, without haptic feedback. The output of the manipulation is filtered via a three-dimensional display that contains the video feed of the camera within the patient. The haptic feedback is also filtered, because of the lack of haptic feedback and the lack of the fulcrum effect: in traditional minimally invasive surgery, the surgeon's movement is reversed relative to the movement of the tip of an instrument, whereas in RAS this effect is compensated by the robot. The surgeon first receives the sensory information, which is mainly visual but also auditory, and then perceives it through encoding of the data. The perceived information is then used for decision making about the next motor response or for memory storage for later decisions. In the final step, "response execution requires the call up, release, and generation of motor sequences with muscle activation" [13, p. 77]. After the introduction of requirements for RAS training in general and the description of a proposed information processing model for RAS tasks, the next section will introduce the adaptive aspects of RAS training.

1.3 Adaptive Robotic Surgery

An "adaptive training is training in which the problem, the stimulus, or the task is varied as a function of how well the trainee performs" ([15], p. 547). With adaptive training, issues like the "lack of training with specific system features" or "inadequate troubleshooting of technical problems" ([6], p. 14) that have led to adverse events in the past could be trained on an individual basis.
Diagnostic Requirements for Robotic Assisted Surgery Training. An adaptive training, based on the earlier-mentioned requirements for RAS training, could help to solve the problems of today's surgical training. It is a means of optimizing the training by modeling the learning curve representing a trainee's progress and providing


Fig. 2. The information processing model for robotic-assisted surgery from Cao and Rogers [13], modified, based on Wickens [14].

individualized training to account for individual differences in ability and experience. Schwarz and Fuchs [16, 17] propose a framework for a multidimensional real-time assessment of user state and performance for adaptive systems. The simplified version shows that the adaptive technical system interacts with the operator, or in the case of RAS training the trainee, and the environment (see Fig. 3). The state regulatory process is responsible for the adaptation of the technical system to the trainee. First, data should be gathered from multiple sources to form a reliable basis for evaluating the data in the next step. State assessment is then responsible for valid diagnostic results that can be displayed as feedback for the trainee and used as input for the action selection. Based on these diagnostics, the system can make decisions to support the trainee and then execute those decisions in the last step. Without reliable and valid diagnostic data, the adaptation process can become ineffective or even destructive.

1.4 Summary

The literature reviews on RAS training mentioned earlier lack information about the status of adaptive robotic-assisted surgery training. Hence, the scoping literature review performed in this paper focuses on the adaptive training topic to fill this gap. Further, diagnostic requirements for metric data of trainees and


Fig. 3. A simplified model of a generic adaptation framework of Schwarz and Fuchs [16]

surgeons are needed for an adaptive training framework. These data can be used for training system adaptations specific to the state of the trainee. To broaden the scope of the literature reviews for RAS, the following sections focus on the question: what are the diagnostic requirements for efficient, adaptive robotic surgery training? One key feature of adaptive training is to assess trainee performance; otherwise, a fine-grained adaptation of the training program to the trainee's needs would not be possible. Valid and reliable real-time robotic-assisted surgical performance assessment is the first step for further systematic research on the topic of adaptive robotic surgery training. By combining the previously described literature review [12] with the scoping review, the current state of the literature is analyzed and summarized.


2 Method

A scoping literature review was performed to identify requirements for adaptive RAS training. Two search engines were used to broaden the results: (a) Web of Science, and (b) Scopus with integrated PubMed hits. Search results were evaluated in phases: (1) by title, (2) by abstract, and (3) by full text. After deleting duplicates, a reference check of the hits was performed to add missing hits. Peer-reviewed articles or books with the keywords surgery adaptive training, surgery adaptive curriculum, diagnostic requirements surgery training, and requirements adaptive surgery training were considered as hits.

3 Results

The search resulted in a total of 1241 initial documents that were reviewed for further analysis. Figure 4 gives an overview of the search process. Of the 65 full texts included, two were books and 63 were peer-reviewed journal articles. After categorization, the information from the 65 documents included in the analysis was summarized and used to compare the previously introduced models of training creation, the adaptation framework, and the surgical information processing model in the context of adaptive surgery training.

Fig. 4. Search process for hits regarding adaptive surgery training (diagram nodes: Scopus n = 62, Web of Science n = 63; Scopus n = 51, Web of Science n = 43; included n = 65; categories: Review n = 1, Adaptive training n = 22, Assessment n = 5, Diagnostics n = 1, Requirements n = 3, Training factors n = 6)

3.1 Adaptive Surgery Training

Vaughan et al. [18] provide an overview of self-adaptive technologies for virtual reality. According to them, the field of adaptive training contains the following aspects: (a) "Mechanisms for adaptive learning about a user's requirements", (b) "adaptive and reactive features to enhance trainee's learning efficiency", (c) "autonomous training using simulation", (d) "intelligent monitoring of a trainee's progress", and (e) "various types of adaptive content can be included" ([18], p. 5). After providing the pros and cons of virtual reality training in the context of medical training, the review does not give details on the status quo of adaptive surgical training. Five adaptive motor skill training courses for surgery were described in the documents [19–23]. These pieces of training are not fully fledged training programs, but mostly research vehicles for adaptive training. Another line of research is adaptive training with the expert in the loop [24]. On the one hand, such an approach conflicts with cost efficiency and working-hour regulations, but on the other hand, it can increase the efficiency of complex task training. Reliable and valid diagnostic data are the basis for a successful adaptive strategy. Because of the limited information about adaptive surgical training in general and robotic surgery training in particular, it is essential to begin with the first step and summarize the diagnostic strategies used so far by the authors of the articles found in the literature search. The 2005 publication by Pham et al. [21] was the first to describe an adaptive simulated surgical training. In their paper, the authors discuss the Yerkes-Dodson principle, which states "that in situations of high or low stress, learning and performance are compromised" [21, p. 385]. If the difficulty level does not match the needs of the learner, frustration or boredom can occur, which makes the training inefficient. With the so-called Smart Tutor software, the task difficulty was kept in an "optimal learning zone" [21, p. 386] regarding frustration and boredom. Results of an empirical study indicated no increase in the learning rate. However, participants of the adaptive training were less frustrated compared to the non-adaptive training group. Pham et al. conclude that more research is needed to refine the algorithm. However, no specific diagnostic metrics or methods were disclosed. More recently, Mariani et al. [20] describe a training model where participants first perform a complex bimanual visuo-motor task to obtain a baseline performance measurement. A ring had to be moved virtually along a wire without touching it. This complex task was split into simpler subtasks. Performance data were compared to experts performing the same task. The adaptive system then evaluated which subtasks needed further training. The diagnostic metrics used were: (a) "distance between tool and target", (b) time to completion, (c) "distance between the actual and the ideal position", (d) "angular difference between the actual and the ideal pose", and (e) "number of drops while performing an object transfer" [20, p. 2163]. A group of trainees showed higher performance in the adaptive training condition compared to a self-managed training group. Siu et al. [22] propose an adaptive virtual reality training for surgery skills, based on the Adaptive Control of Thought/Rational (ACT-R) cognitive architecture, and point out a particular need for military medical training.
The needed skill set can differ significantly from deployment to deployment and should be effectively approached via


new training. They demonstrated a preliminary model based on their adaptive training framework that can predict “the majority of variance in the human performance data” ([22], p. 218). However, they also conclude that they did not account for fatigue. The data suggest a slowdown caused by fatigue after a couple of trials. A more holistic approach with multiple dimensions for the user state assessment is supported by this finding. Further, their framework aims to “predict the decay effect and maximize the training experience by monitoring the occurrence of mistakes during skill acquisition and retention” ([22], p. 217). The framework does not provide explanations for realtime performance problems or critical states of the trainee, which could lead to more fine-grained adaptive training. Kinematic data, time to task completion, total distance traveled, and average speed are used as diagnostic metrics. Also, muscle effort of four muscles was recorded. According to the authors, the ACT-R based model fits better than a logarithmic regression model. Another technique found in literature on monitoring a surgeon’s performance is the cumulative sum technique [19, 23, 25]. “The acceptable outcome rate, the unacceptable outcome rate, the Type I error rate and the Type II error rate” [24, p. 583] needs to be set in advance. The basis is data of experts with which the trainee performance will be statistically compared. 3.2
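
A minimal sketch of how such a CUSUM chart could be computed is given below, assuming the standard formulation for binary surgical outcomes described by Yap et al. [25]; the parameter values and the example outcome sequence are illustrative only.

from math import log

def cusum_scores(outcomes, p0, p1, alpha, beta):
    """Score a sequence of procedures (True = adverse outcome).

    p0:    acceptable adverse-outcome rate
    p1:    unacceptable adverse-outcome rate
    alpha: Type I error rate (flagging an acceptable performer)
    beta:  Type II error rate (missing an unacceptable performer)
    Returns the running CUSUM scores and the decision limits (h0, h1).
    """
    P = log(p1 / p0)
    Q = log((1 - p0) / (1 - p1))
    s = Q / (P + Q)                                # amount subtracted per acceptable outcome
    h0 = log((1 - alpha) / beta) / (P + Q)         # crossing -h0: performance acceptable
    h1 = log((1 - beta) / alpha) / (P + Q)         # crossing +h1: performance unacceptable
    score, scores = 0.0, []
    for adverse in outcomes:
        score += (1 - s) if adverse else -s
        scores.append(score)
    return scores, h0, h1

# Hypothetical trainee record: False = acceptable outcome, True = adverse outcome.
record = [False, False, True, False, False, True, True, False, True]
scores, h0, h1 = cusum_scores(record, p0=0.05, p1=0.20, alpha=0.05, beta=0.20)
for case, sc in enumerate(scores, start=1):
    status = "unacceptable" if sc > h1 else ("acceptable" if sc < -h0 else "keep monitoring")
    print(f"case {case}: score {sc:+.2f} ({status})")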

3.2 Assessment Strategies

Assessment of a trainee is closely related to the diagnostic requirements of adaptive robotic surgery. In the past, assessments were mostly used for the selection of aspiring surgeons, tracking of overall learning progress, and credentialing. In the traditional apprenticeship training model, experts evaluate learning progress during actual surgeries performed by the trainee and adapt the intensity and complexity of the learning experience and environment for the trainee. With simulation-based training, this role has to be compensated for. One option is to let trainees self-evaluate their progress; another option is to use a one-size-fits-all strategy. The self-evaluation strategy can lead to higher cognitive load during training because trainees have to determine their own learning needs for the training sessions [20]. The one-size-fits-all approach cannot account for individual differences among trainees, which requires a homogeneous trainee group in terms of prior knowledge, ability, and experience. Taking into account the changing surgical techniques and technologies in the OT, the varied experience levels, and the different learning needs of trainees, implementing efficient, effective, and safe training can be cumbersome. These developments call for robust and evidence-based adaptive training, which in turn requires real-time monitoring of trainee performance. As discussed in the previous section, most documents from the literature search concentrate on comparing diagnostic metrics regarding movement, errors, and time to completion with expert data [26]. Diagnostic parameters such as cognitive aptitude [27] are also considered valuable for real-time assessment. Requirements for the assessments are: (a) "reliability; if the test were to be administered to the same individual on separate days, in the absence of any learning, the results obtained should be similar", (b) "feasible; the test must be practical and straightforward to administer", (c) "fair; the results from any assessment process should be reproducible", (d) "objective", and (e) "valid" [28].
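
As an illustration of the expert-comparison strategy that dominates the retrieved documents, the sketch below flags metrics on which a trainee deviates from an expert reference band; the metric names, thresholds, and numbers are hypothetical and serve only to make the comparison concrete.

from statistics import mean, stdev

# Hypothetical expert reference data per metric (one list of expert scores each).
EXPERT_REFERENCE = {
    "time_to_completion_s": [42.0, 38.5, 45.1, 40.2, 39.8],
    "path_length_mm":       [310.0, 295.5, 330.2, 305.1, 318.7],
    "instrument_drops":     [0, 0, 1, 0, 0],
}

def flag_deficient_metrics(trainee_metrics, z_cutoff=2.0):
    """Return the metrics on which the trainee falls outside the expert band.

    A metric is flagged when the trainee value lies more than z_cutoff
    standard deviations above the expert mean (higher = worse for all
    metrics in this toy example).
    """
    flagged = []
    for metric, trainee_value in trainee_metrics.items():
        reference = EXPERT_REFERENCE[metric]
        mu, sigma = mean(reference), stdev(reference)
        z = (trainee_value - mu) / sigma if sigma > 0 else 0.0
        if z > z_cutoff:
            flagged.append((metric, round(z, 1)))
    return flagged

# Hypothetical trainee measurement from one simulator session.
session = {"time_to_completion_s": 71.3, "path_length_mm": 365.0, "instrument_drops": 2}
print(flag_deficient_metrics(session))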


4 Discussion

Research into adaptive surgical training, and more specifically robotic surgery training, is scarce. The literature search did not yield a concise picture of a coherent model taking all relevant human and technological factors into account. There is no consensus about which diagnostic metrics are best suited for implementation in adaptive robotic surgery training, despite the multitude of performance data generated by the simulation devices. Given the current state of the literature, the following requirements for diagnostic metrics for adaptive robotic surgery are proposed. Diagnostic parameters should be (a) useful for formative assessment of trainee progress and not only summative assessment, (b) valid and reliable, (c) as nonintrusive as possible for the trainee, (d) predictive of future performance in the OT, (e) explanatory, and (f) suitable for real-time assessment of the trainee's learning state.

These requirements and the outcomes of the literature search have multiple implications for curriculum design in robotic surgery and for the components of real-time user-state assessment [16]. The task analysis process should be performed with adaptive training in mind. The syllabus that feeds the curriculum should contain explanatory information about the tasks, skills, and knowledge that are required for robotic surgery. Furthermore, the syllabus should take a form that allows an adaptive training system to evaluate the trainee based on the causal relations among its items. More precisely, this means that a digital training system should be capable of deeply understanding the surgical domain in order to provide meaningful training tasks to the trainee. For a fully adaptive training program, diagnostic metrics of non-technical skills, such as communication, leadership, and teamwork, should also be considered, not only metrics of technical skill performance.

We propose an implementation of a simplified model of the generic adaptation framework of Schwarz and Fuchs [16]; see Fig. 5. The technical system detects critical states of the system's operator and assists by simplifying the task. When it comes to training, the system should not only assist but also keep the trainee within the optimal training zone. This means that the training system is introduced as a third component. The trainee, who is an adaptive information processing system too, interacts via the training system with the training environment. For operational use of the framework, the problem of two interacting adaptive systems was solved by intervening only when an operator performs at a critical level. In a training scenario, the trainee has to be kept challenged, and the learning process itself can be seen as a constant adaptation. This implies that the baseline has to be evaluated constantly; otherwise, the adaptations of the training system can be counterproductive. The framework uses six dimensions for evaluating the user state: (a) situation awareness, (b) attention, (c) fatigue, (d) emotional state, (e) workload, and (f) motivation [16]. The involvement of attention, fatigue, emotional state, and workload in the surgical task, and in learning to perform it, was described in the information processing model and in the documents on adaptive surgical training.
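
To make the proposed use of the Schwarz and Fuchs framework [16] more concrete, the sketch below evaluates a trainee state along the six dimensions named above and selects a coarse training adaptation; the dimension scores, thresholds, and adaptation labels are illustrative assumptions, not part of the framework itself.

# The six user-state dimensions of the generic adaptation framework [16].
DIMENSIONS = ("situation_awareness", "attention", "fatigue",
              "emotional_state", "workload", "motivation")

def select_adaptation(state, critical=0.8, low=0.3):
    """Map a multidimensional user state (values in [0, 1]) to a training adaptation.

    Illustrative rules: very high workload or fatigue -> simplify the task;
    very low workload with high motivation -> raise task difficulty;
    otherwise keep the current difficulty and continue monitoring.
    """
    missing = [d for d in DIMENSIONS if d not in state]
    if missing:
        raise ValueError(f"user state incomplete, missing: {missing}")
    if state["workload"] > critical or state["fatigue"] > critical:
        return "simplify task / insert pause"
    if state["workload"] < low and state["motivation"] > critical:
        return "increase task difficulty"
    return "keep difficulty, continue monitoring"

# Hypothetical real-time estimate for one trainee.
trainee_state = {
    "situation_awareness": 0.6, "attention": 0.7, "fatigue": 0.4,
    "emotional_state": 0.5, "workload": 0.9, "motivation": 0.8,
}
print(select_adaptation(trainee_state))   # -> simplify task / insert pause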


Fig. 5. Framework for future research for adaptive robotic surgery based on Cao and Rogers [13] and Schwarz and Fuchs [16] (modified)

5 Conclusion

The scoping literature review revealed promising developments for adaptive robotic surgery. Current developments in surgical training and technological advancements indicate a need for real-time user-state assessment during training. Research on diagnostic requirements for RAS training lacks a systematic approach. For future research, the generic adaptation framework of Schwarz and Fuchs [16] receives promising initial support from the literature and should therefore be considered as a starting point for more complex investigations of the interacting factors involved in learning and performing RAS. The challenges facing the development of adaptive training are still numerous and complex. Many questions remain: Which diagnostic metrics are best suited for real-time evaluation? How can the optimal individual proximal learning zone be defined? How can the curriculum be structured for constant updating without restarting the design phase all over again?


Our literature review shows that research into adaptive robotic surgical training is still in its infancy. The first attempts at formulating requirements seem promising and could potentially make robotic surgery more efficient, effective, and safe.

References

1. Bric, J.D., Lumbard, D.C., Frelich, M.J., Gould, J.C.: Current state of virtual reality simulation in robotic surgery training: a review. Surg. Endosc. 30, 2169–2178 (2016). https://doi.org/10.1007/s00464-015-4517-y
2. Gallagher, A.G., Ritter, E.M., Champion, H., et al.: Virtual reality simulation for the operating room. Ann. Surg. 241, 364–372 (2005). https://doi.org/10.1097/01.sla.0000151982.85062.80
3. Schreuder, H., Verheijen, R.: Robotic surgery. BJOG Int. J. Obstet. Gynaecol. 116, 198–213 (2009). https://doi.org/10.1111/j.1471-0528.2008.02038.x
4. Brunt, L.M.: Celebrating a decade of innovation in surgical education. Bull. Am. Coll. Surg. 99, 10–15 (2014)
5. Chapron, C., Querleu, D., Bruhat, M.A., et al.: Surgical complications of diagnostic and operative gynaecological laparoscopy: a series of 29,966 cases. Hum. Reprod. 13, 867–872 (1998). https://doi.org/10.1093/humrep/13.4.867
6. Alemzadeh, H., Raman, J., Leveson, N., et al.: Adverse events in robotic surgery: a retrospective study of 14 years of FDA data. PLoS One 11, e0151470 (2016). https://doi.org/10.1371/journal.pone.0151470
7. Martinic, G.: Glimpses of future battlefield medicine - the proliferation of robotic surgeons and unmanned vehicles and technologies. J. Mil. Veterans Health 22, 4–12 (2014)
8. Catchpole, K., Perkins, C., Bresee, C., et al.: Safety, efficiency and learning curves in robotic surgery: a human factors analysis. Surg. Endosc. 30, 3749–3761 (2016). https://doi.org/10.1007/s00464-015-4671-2
9. Schreuder, H., Wolswijk, R., Zweemer, R., et al.: Training and learning robotic surgery, time for a more structured approach: a systematic review. BJOG Int. J. Obstet. Gynaecol. 119, 137–149 (2012). https://doi.org/10.1111/j.1471-0528.2011.03139.x
10. Fisher, R.A., Dasgupta, P., Mottrie, A., et al.: An overview of robot assisted surgery curricula and the status of their validation. Int. J. Surg. 13, 115–123 (2015). https://doi.org/10.1016/j.ijsu.2014.11.033
11. Halvorsen, F.H., Elle, O.J., Fosse, E.: Simulators in surgery. Minim. Invasive Ther. Allied Technol. 14, 214–223 (2005). https://doi.org/10.1080/13645700500243869
12. Witte, T.E.F.: Requirements for Efficient Robotic Surgery Training. University of Twente (Master Thesis) (2015). https://essay.utwente.nl/68614/
13. Cao, C.G.L., Rogers, G.S.: Robot-assisted minimally invasive surgery: the importance of human factors analysis and design. Surg. Technol. Int. 12, 73–82 (2004). https://doi.org/10.1177/0278364909104276
14. Wickens, C.D., Gordon, S.E., Liu, Y.: An Introduction to Human Factors Engineering. Addison-Wesley Educational Publishers, Old Tappan (1998)
15. Kelley, C.R.: What is Adaptive Training? Hum. Factors J. Hum. Factors Ergon. Soc. 11, 547–556 (1969). https://doi.org/10.1177/001872086901100602
16. Schwarz, J., Fuchs, S.: Multidimensional real-time assessment of user state and performance to trigger dynamic system adaptation. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) AC 2017, Part I. LNCS (LNAI), vol. 10284, pp. 383–398. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58628-1_30


17. Schwarz, J., Fuchs, S., Flemisch, F.: Towards a more holistic view on user state assessment in adaptive human-computer interaction. In: 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1228–1234. IEEE (2014)
18. Vaughan, N., Gabrys, B., Dubey, V.N.: An overview of self-adaptive technologies within virtual reality training. Comput. Sci. Rev. 22, 65–87 (2016). https://doi.org/10.1016/j.cosrev.2016.09.001
19. Hu, Y., Goodrich, R.N., Le, I.A., et al.: Vessel ligation training via an adaptive simulation curriculum. J. Surg. Res. 196, 17–22 (2015). https://doi.org/10.1016/j.jss.2015.01.044
20. Mariani, A., Pellegrini, E., Enayati, N., et al.: Design and evaluation of a performance-based adaptive curriculum for robotic surgical training: a pilot study. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 2162–2165. IEEE (2018)
21. Pham, T., Roland, L., Benson, K.A., et al.: Smart tutor: a pilot study of a novel adaptive simulation environment. Stud. Health Technol. Inform. 111, 385–389 (2005)
22. Siu, K.-C., Best, B.J., Kim, J.W., et al.: Adaptive virtual reality training to optimize military medical skills acquisition and retention. Mil. Med. 181, 214–220 (2016). https://doi.org/10.7205/MILMED-D-15-00164
23. Ramirez, A.G., Hu, Y., Kim, H., Rasmussen, S.K.: Long-term skills retention following a randomized prospective trial on adaptive procedural training. J. Surg. Educ. 75, 1589–1597 (2018). https://doi.org/10.1016/j.jsurg.2018.03.007
24. Shahbazi, M., Atashzar, S.F., Ward, C., et al.: Multimodal sensorimotor integration for expert-in-the-loop telerobotic surgical training. IEEE Trans. Robot. 34, 1549–1564 (2018). https://doi.org/10.1109/TRO.2018.2861916
25. Yap, C.-H., Colson, M.E., Watters, D.A.: Cumulative sum techniques for surgeons: a brief review. ANZ J. Surg. 77, 583–586 (2007). https://doi.org/10.1111/j.1445-2197.2007.04155.x
26. Oropesa, I., Sánchez-González, P., Chmarra, M.K., et al.: Supervised classification of psychomotor competence in minimally invasive surgery based on instruments motion analysis. Surg. Endosc. 28, 657–670 (2014). https://doi.org/10.1007/s00464-013-3226-7
27. Luursema, J.-M., Rovers, M.M., Groenier, M., van Goor, H.: Performance variables and professional experience in simulated laparoscopy: a two-group learning curve study. J. Surg. Educ. 71, 568–573 (2014). https://doi.org/10.1016/j.jsurg.2013.12.005
28. Shah, J., Darzi, A.: Surgical skills assessment: an ongoing debate. BJU Int. 88, 655–660 (2001). https://doi.org/10.1046/j.1464-4096.2001.02424.x

Supporting Human Inspection of Adaptive Instructional Systems

Diego Zapata-Rivera
Educational Testing Service, Princeton, NJ 08541, USA
[email protected]

Abstract. Providing relevant information about students to various educational stakeholders is one of the main functions of an Adaptive Instructional System (AIS). External representations of the student model are used to communicate information to students, teachers, and other educational stakeholders. AISs can provide a variety of information including cognitive and noncognitive aspects of students in relation to their performance in the AIS. AISs should consider the information needs of educational stakeholders and support the development of dashboards or reporting systems that clearly communicate relevant assessment results. This paper describes issues concerning the inspection of student models in AISs.

Keywords: Open/inspectable student models · Adaptive instructional systems · Reporting systems

1 Introduction

Adaptive instructional systems (AISs) are used in formal and informal educational contexts to support student learning. These systems make use of a student/learner model to keep a representation of the student that is deployed by the system to create adaptive interactions. This adaptivity can take the form of different types and amounts of feedback, content sequencing, and access to educational materials offered to the student based on the status of the student model. Information to initialize and keep the student model up-to-date can come from different sources (e.g., student process and response data). No matter the type of AIS, one must keep in mind the need to provide students and other educational stakeholders with the information they need to support their decisions. However, many AISs are built as "black boxes," using data-driven approaches that provide limited information for students, teachers, parents, or other stakeholders [1–3]. This paper elaborates on the need to support human inspection of the student model (e.g., student performance levels) and some of the inner workings of the AIS (e.g., evidence used to infer such performance levels). That is, to guarantee transparency, AISs should provide educational stakeholders not only with information about student progress or mastery levels, but also with information about how the decisions made by the system were based on relevant evidence and how these decisions supported student learning.


2 Open Student Models

Research on open student models (OSMs) has shown that allowing students to inspect or interact with a representation of the student model can reduce the complexity of the diagnostic and selection processes and, at the same time, support student learning and self-reflection [4–6]. Creating open student models that support learning can be challenging, partially due to the various types of student data that can be represented in a model and the variety of ways people might interact with that model. The student model can include various types of information about the student, such as cognitive abilities, metacognitive skills, affective states, personality traits, learning styles, social skills, and perceptual skills. These sources of student model information can vary in terms of their quality and granularity. Several approaches to interacting with open student models have been explored. Some of these approaches include the use of external representations such as text, progress bars, concept maps, mastery grids, hierarchical graphs, and directed acyclic graphs, and interaction approaches like following interaction protocols, guided exploration using an artificial agent, collaborating with a peer, and negotiating the model with a teacher or a virtual tutor [7–11]. Multiple audiences can benefit from the student modeling information maintained by AISs. Audiences such as students, parents, teachers, and researchers are potential users of this information. While the type of representation and interaction mode may differ according to the needs and other characteristics of each audience, the AIS should have capabilities to support access to, and facilitate interpretation of, the information maintained by the system. That is, AISs should be able to support the creation of dashboards or reporting systems that can provide each audience with the information they need to support their decisions.

3 An Interpretation Layer for AISs

AISs integrate and make sense of these diverse pieces of student information to make adaptive decisions that positively influence student learning. However, AISs do not always implement mechanisms to support the creation of reporting systems or dashboards that provide information to students and other stakeholders (e.g., teachers and other third parties interested in knowing more about the inner workings of the AIS). Inspired by assessment design methodologies such as evidence-centered design (ECD; [12]), which makes use of argument-based structures to articulate the chain of reasoning connecting task-level data to the evidence required to support assessment claims about the student, Zapata-Rivera et al. [13] proposed the implementation of an evidence layer for intelligent tutoring systems. This evidence layer can facilitate the creation of adaptive systems by maintaining sound argument-based structures that represent how student data are used to support assessment claims about the student. This layer also aims at formalizing the concept of evidence, facilitating reuse of evidence, supporting the integration of claims and evidence in reporting information produced by the tutor, and aiding with the automatic evaluation of evidence.


Tools and services can be implemented on top of this evidence layer to create an interpretation layer that can support the creation of reporting systems, dashboards, and open student models. Zapata-Rivera et al. [13] show how the internal evidence layer in conversation-based assessments [14] can be used to identify and reuse pieces of evidence (e.g., quality of description of an earth science process, accuracy of a selected sequence of events associated with the process, accuracy of identification and quality of description of an event based on data collected, quality of selection between two data collection notes, and accuracy of prediction based on data collected) across two different systems designed to assess science inquiry skills. The authors also elaborate on applications of this approach to improve the use of evidence in military prototypes that were implemented using the Generalized Intelligent Framework for Tutoring (GIFT; [15, 16]). Technologies such as the Experience Application Programming Interface (xAPI; [17]) can support the implementation of common data structures that various AISs can use to share user information. By adopting an assessment vocabulary similar to the one offered by ECD, it is possible to implement an evidence layer for AISs [18].
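
As an illustration of how such shared data structures might look, the sketch below assembles an xAPI-style statement for one AIS event; the verb and activity identifiers, the placeholder learner address, and the threshold used for the success flag are invented for this example and would need to align with an agreed profile in practice.

import json
from datetime import datetime, timezone

def build_statement(learner_email, verb, activity_id, activity_name, scaled_score):
    """Assemble a minimal xAPI-style statement describing one learner interaction."""
    return {
        "actor": {"objectType": "Agent", "mbox": f"mailto:{learner_email}"},
        "verb": {
            "id": f"http://adlnet.gov/expapi/verbs/{verb}",
            "display": {"en-US": verb},
        },
        "object": {
            "objectType": "Activity",
            "id": activity_id,
            "definition": {"name": {"en-US": activity_name}},
        },
        "result": {"score": {"scaled": scaled_score}, "success": scaled_score >= 0.7},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical event: a learner completed a slope-calculation task in an AIS.
statement = build_statement(
    learner_email="learner@example.org",          # placeholder address
    verb="completed",
    activity_id="http://example.org/activities/calculate-slope-from-points",
    activity_name="Calculate slope from points",
    scaled_score=0.85,
)
print(json.dumps(statement, indent=2))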

4 Evidence-Based Interaction with Open Student Models

Given the richness of process and response data available in AISs, different interaction approaches and external representations could be employed to provide users with the information they require. The existence of an interpretation layer for AISs, such as the one described above, would enable the implementation of open student modeling approaches that not only provide users with information about their current knowledge levels and other characteristics, but also provide them with information about the evidence used to support these values. An example of such an approach to open student modeling is called evidence-based interaction with OSM (EI-OSM; [6]). In this approach, a user interface based on Toulmin's argument structure [19] is used to provide the student with information about the current assessment claims maintained by the system and the corresponding supporting evidence. Assessment claims are assertions regarding students' knowledge, skills, and other abilities (e.g., a student's knowledge level on a concept). Figure 1 shows an assessment claim and supporting evidence regarding Clarissa's knowledge of "Calculate slope from points." The assessment claim with the highest strength value appears at the top of the screen. The strength of an argument represents a measure of the credibility, relevance, and quality of the supporting evidence for each side of the argument. Supporting evidence for the system claim appears on the left, while the student's alternative explanation and available evidence appear on the right. This type of interaction can be part of a formative dialogue between teachers and students (e.g., about what type of evidence should be provided to strengthen the student's argument). The AIS can also provide suggestions about learning materials and tasks that can be administered to update the current state of the student model.
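
A minimal sketch of how an argument strength value of this kind might be computed is shown below, assuming strength is a simple combination of the credibility, relevance, and quality assigned to each piece of evidence; the combination rule and the example evidence are illustrative assumptions, not the EI-OSM implementation.

def evidence_strength(evidence):
    """Strength contributed by one piece of evidence.

    Each piece carries credibility, relevance, and quality ratings in [0, 1];
    their product keeps the contribution low unless all three are reasonably high.
    """
    return evidence["credibility"] * evidence["relevance"] * evidence["quality"]

def argument_strength(system_evidence, student_evidence):
    """Compare the system claim against the student's alternative explanation."""
    system_side = sum(evidence_strength(e) for e in system_evidence)
    student_side = sum(evidence_strength(e) for e in student_evidence)
    winner = "system claim" if system_side >= student_side else "student explanation"
    return system_side, student_side, winner

# Hypothetical evidence around the claim "knowledge of 'Calculate slope from points' is Low".
system_evidence = [
    {"source": "quiz item 12 incorrect", "credibility": 0.9, "relevance": 0.8, "quality": 0.9},
    {"source": "hint requested twice",   "credibility": 0.7, "relevance": 0.6, "quality": 0.8},
]
student_evidence = [
    {"source": "homework exercise solved", "credibility": 0.6, "relevance": 0.9, "quality": 0.7},
]
print(argument_strength(system_evidence, student_evidence))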


Fig. 1. The student suggests an alternative explanation to the claim “Clarissa’s knowledge level of Calculate slope from points is Low” and provides supporting evidence.

The effectiveness of interactive OSMs and external representations in clearly communicating information and supporting decisions should be evaluated with the intended audience. EI-OSM was evaluated with eight teachers who interacted with various use cases (i.e., understanding and using a proficiency map; exploring assessment claims and supporting evidence; and assigning credibility and relevance values and providing adaptive feedback) and provided feedback during small-group interviews (1–3 teachers and at least 2 interviewers per session) [6]. Each interview lasted about two hours. The results of this study included the following: (1) teachers appreciated the use of evidence-based argument structures due to their potential for improving communication with students and parents; however, interviewees mentioned that some teachers may not have enough time and resources to implement this approach, and teachers provided suggestions to facilitate implementation in real settings (e.g., generating email alerts for teachers to inform them about particular situations that may require their attention, and involving tutors and teacher assistants in the process to divide the work); (2) teachers would like the system to use algorithms to handle common cases but want to be in control in case they need to override the system's actions; teachers appreciated the option to use their own instructional materials and tasks but suggested the use of predefined instructional packages of materials and task banks that can be integrated automatically into the system; finally, (3) some teachers mentioned possible issues with students trying to game the system by adding unsupported claims but recognized that the system could support learning and help improve self-assessment skills by engaging students in a goal-oriented process that involves gathering evidence (e.g., by learning about particular topics and solving relevant tasks) that can be used to strengthen their assessment claims.

5 Designing and Evaluating Reporting Systems

Research on score reporting may provide some insights into the type of work that can be done in order to better support human inspection of AISs:
• Work on frameworks for designing and evaluating score reports [20, 21]. These are iterative frameworks involving activities such as gathering assessment information needs, reconciling these needs with the assessment information available to be shared with the audience, designing prototypes based on design principles from areas such as information visualization, human-computer interaction, and cognitive science, and evaluating them with the intended audience [22]. These frameworks also take into account client expectations and whether the report is designed in a research versus an operational context, and propose the use of various evaluation approaches including cognitive laboratories, focus groups, and large-scale studies evaluating the effectiveness of alternate score reports.
• Following standards on the quality of assessment information (e.g., psychometric properties of scores and subscores) to decide whether or not the information provided can support appropriate uses of the results [23, 24].
• Designing and evaluating score reports targeted for particular audiences (e.g., teachers, parents, students, and policymakers) in summative and formative contexts. This work includes the use of simple reports using traditional graphical representations and interactive report systems or dashboards. Evaluation of these reports usually includes examining comprehension and preference aspects [25, 26].
• Developing supporting materials such as tutorials and interpretive and ancillary materials that offer guidance on the appropriate use of assessment results [27].
This work, as well as work on designing and evaluating OSMs [10], can inform the development of AISs that consider the information needs of different types of users, which may positively impact the acceptance and extensive adoption of AISs in a variety of educational contexts. Table 1 shows sample assessment information needs for various types of users.


Table 1. Sample assessment information needs for various types of users.

Audience: Teachers, tutors, and mentors
Assessment information needs:
• Student performance at the individual, sub-group, and class levels (e.g., What are my students' strengths and weaknesses? How did the class perform on a task or a group of tasks? How does a student's performance compare to that of other students?)
• Progress information at the individual, subgroup, and class levels (e.g., How much progress have my students made towards mastery?)
• Information that helps understand current performance (e.g., Were my students engaged in the task(s)? Did my students try to game the system? How reliable are the knowledge estimates calculated by the system?)
• Information that can help inform future teaching (e.g., How difficult were the tasks for my students? What were the most frequent errors and misconceptions? What should I do next to help an individual student or the class as a whole?)

Audience: Students
Assessment information needs:
• Actionable feedback that they can use to guide their learning (e.g., What are my strengths and weaknesses? How can I improve?)
• Progress and performance information (e.g., How much progress have I made towards mastery? How does my performance compare to that of other students?)
• Evidence supporting assessment claims (e.g., What type of information was used to calculate my knowledge levels? Can I provide additional evidence to update my knowledge levels in the system?)

Audience: Parents
Assessment information needs:
• Actionable information on how to help the student (e.g., What are my child's strengths and weaknesses? How can I help my child? Should I talk to the teacher?)
• Progress and performance information (e.g., How did my child perform? How much progress has my child made towards mastery? How does my child's performance compare to that of other students?)

Audience: Administrators and policy makers
Assessment information needs:
• Aggregate data to inform decisions in areas such as evaluation of current educational policies, school improvement plans, professional development, program selection and evaluation, curriculum selection, improving student achievement, and staff allocation (e.g., How does student performance compare to other students at the classroom, school or district levels? How do students from particular subgroups perform? How much progress did students from particular subgroups make in the last month?)

Audience: Researchers
Assessment information needs:
• Information to evaluate and improve different aspects of the system (e.g., Were students engaged in the task(s)? How effective were adaptive mechanisms such as feedback and task sequencing in helping students learn? Do users understand and appropriately use the information in the reporting system or dashboard? To what extent do teachers and students use the information in the reporting system or dashboard to inform their teaching/learning? Which aspects of the system need to be improved?)


6 Discussion

In this section we discuss several recommendations and challenges for improving support for human inspection of AISs.
• An interpretation layer. As presented above, by relying on an evidence layer, it is possible to implement an interpretation layer that supports the creation of OSMs, reporting systems, and dashboards targeted for particular audiences. As more AISs are built using machine learning approaches that may be difficult to inspect [1–3], it is important to think about ways of interpreting the results or recommendations made by these models and making them available for the creation of systems that support the needs of users. For example, some students may want to know why a particular problem or piece of feedback was presented by the AIS [6, 28, 29]. Also, teachers, administrators, and researchers may be interested in knowing how students interact with the AIS and how their data are used by the system to adapt its interaction. Approaches for interpreting "black box" models are being explored [30–32].
• Iterative design and evaluation frameworks. AISs should follow well-established design and evaluation frameworks and standards that aim at producing systems that offer high-quality information at the right level to support the decision-making needs of target users, including students, teachers, parents, administrators, and researchers. Designing and evaluating interactive systems that clearly communicate AIS information to these users should be part of the development cycle of any AIS. Lessons learned from evaluating the use of external representations with particular audiences should be made available so that designers can benefit from these results. It is not enough to simply design a dashboard that shows all the data available in the system. Dashboard components should be evaluated to guarantee that users appropriately understand and use the information presented [33].

7 Future Work

Future work in this area includes the development of platforms that implement interpretation layers; the use of iterative, audience-centered frameworks for the design and evaluation of OSMs, reporting systems, and dashboards; additional work on communicating information from machine learning approaches that are difficult to interpret; and the creation of standards that serve as guidance for the type of information that should be available to support user decisions in AISs.

References

1. Conati, C., Porayska-Pomsta, K., Mavrikis, M.: AI in Education needs interpretable machine learning: lessons from Open Learner Modelling. In: Workshop on Human Interpretability in Machine Learning. arXiv preprint arXiv:1807.00154 (2018)
2. Mao, Y., Lin, C., Chi, M.: Deep Learning vs. Bayesian knowledge tracing: student models for interventions. JEDM – J. Educ. Data Min. 10(2), 28–54 (2018)


3. Min, W., et al.: DeepStealth: leveraging deep learning models for stealth assessment in game-based learning environments. In: Conati, C., Heffernan, N., Mitrovic, A., Verdejo, M.F. (eds.) AIED 2015. LNCS (LNAI), vol. 9112, pp. 277–286. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19773-9_28
4. Bull, S., Kay, J.: Student models that invite the learner in: the SMILI open learner modelling framework. Int. J. Artif. Intell. Educ. 17(2), 89–120 (2007)
5. Guerra-Hollstein, J., Barria-Pineda, J., Schunn, C.D., Bull, S., Brusilovsky, P.: Fine-grained open learner models: complexity versus support. In: Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, Bratislava, Slovakia, pp. 41–49 (2017)
6. Zapata-Rivera, D., Hansen, E., Shute, V.J., Underwood, J.S., Bauer, M.: Evidence-based approach to interacting with open student models. Int. J. Artif. Intell. Educ. 17(3), 273–303 (2007)
7. Bull, S., Pain, H.: Did I Say What I Think I Said, And Do You Agree With Me? Inspecting and Questioning the Student Model. In: Greer, J. (ed.) Proceedings of the World Conference on Artificial Intelligence in Education, pp. 501–508 (1995)
8. Dimitrova, V.: STyLE-OLM: interactive open learner modelling. Int. J. Artif. Intell. Educ. 13(1), 35–78 (2003)
9. Dimitrova, V., Brna, P.: From interactive open learner modelling to intelligent mentoring: STyLE-OLM and beyond. Int. J. Artif. Intell. Educ. 26(1), 332–349 (2016)
10. Bull, S., Kay, J.: SMILI☺: a framework for interfaces to learning data in open learner models, learning analytics and related fields. Int. J. Artif. Intell. Educ. 26(1), 293–331 (2016)
11. Zapata-Rivera, J.D., Greer, J.: Exploring various guidance mechanisms to support interaction with inspectable learner models. In: Proceedings of Intelligent Tutoring Systems ITS 2002, pp. 442–452 (2002)
12. Mislevy, R.J., Steinberg, L.S., Almond, R.G.: On the structure of educational assessments. Measur. Interdisc. Res. Perspect. 1, 3–62 (2003)
13. Zapata-Rivera, D., Brawner, K., Jackson, G.T., Katz, I.R.: Reusing evidence in assessment and intelligent tutors. In: Sottilare, R., Graesser, A., Hu, X., Goodwin, G. (eds.) Design Recommendations for Intelligent Tutoring Systems: Volume 5 - Domain Modeling, pp. 125–136. U.S. Army Research Laboratory, Orlando (2017)
14. Zapata-Rivera, D., Jackson, T., Katz, I.R.: Authoring conversation-based assessment scenarios. In: Sottilare, R.A., Graesser, A.C., Hu, X., Brawner, K. (eds.) Design Recommendations for Intelligent Tutoring Systems Volume 3: Authoring Tools and Expert Modeling Techniques, pp. 169–178. U.S. Army Research Laboratory, Orlando (2015)
15. Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The Generalized Intelligent Framework for Tutoring (GIFT). U.S. Army Research Laboratory, Human Research & Engineering Directorate (ARL-HRED), Orlando (2012)
16. Sottilare, R., Brawner, K., Sinatra, A., Johnston, J.: An Updated Concept for a Generalized Intelligent Framework for Tutoring (GIFT). U.S. Army Research Laboratory, Orlando (2017). https://doi.org/10.13140/rg.2.2.12941.54244
17. Advanced Distributed Learning (2019). https://www.adlnet.gov/newest-version-ofxapi-version-1-0-3/. Accessed 1 Feb 2019
18. Johnson, A., Nye, D.B., Zapata-Rivera, D., Hu, X.: Enabling intelligent tutoring system tracking with the Experience Application Programming Interface (xAPI). In: Sottilare, R., Graesser, A., Hu, X., Goodwin, G. (eds.) Design Recommendations for Intelligent Tutoring Systems: Volume 5 - Domain Modeling, pp. 41–45. U.S. Army Research Laboratory, Orlando (2017). ISBN 978-0-9893923-9-6
19. Toulmin, S.E.: The Uses of Argument. University Press, Cambridge (1958)


20. Hambleton, R., Zenisky, A.: Reporting test scores in more meaningful ways: a research-based approach to score report design. In: APA Handbook of Testing and Assessment in Psychology, pp. 479–494. American Psychological Association, Washington, D.C. (2013)
21. Zapata-Rivera, D., VanWinkle, W.: A research-based approach to designing and evaluating score reports for teachers (Research Memorandum 10-01). Educational Testing Service, Princeton (2010)
22. Hegarty, M.: Advances in cognitive science and information visualization. In: Zapata-Rivera, D. (ed.) Score Reporting Research and Applications, pp. 19–34. Routledge, New York (2018)
23. American Educational Research Association, American Psychological Association, National Council on Measurement in Education: Standards for Educational and Psychological Testing. American Educational Research Association, Washington, D.C. (2014)
24. Sinharay, S., Puhan, G., Haberman, S., Hambleton, R.K.: Subscores: when to communicate them, what are their alternatives, and some recommendations. In: Zapata-Rivera, D. (ed.) Score Reporting Research and Applications, pp. 35–59. Routledge, New York (2018)
25. Zapata-Rivera, D.: Why is score reporting relevant? In: Zapata-Rivera, D. (ed.) Score Reporting Research and Applications, pp. 1–6. Routledge, New York (2018)
26. Zapata-Rivera, D., Katz, R.I.: Keeping your audience in mind: applying audience analysis to the design of score reports. Assess. Educ. Principles Policy Pract. 21, 442–463 (2014)
27. Zapata-Rivera, D., Zwick, R., Vezzu, M.: Exploring the effectiveness of a measurement error tutorial in helping teachers understand score report results. Educational Assessment 21(3), 215–229 (2016)
28. Bull, S.: Negotiated learner modelling to maintain today's learner models. Research and Practice in Technology Enhanced Learning 11(10), 1–29 (2016)
29. Van Labeke, N., Brna, P., Morales, R.: Opening up the interpretation process in an open learner model. Int. J. Artif. Intell. Educ. 17, 305–338 (2007)
30. Doran, D., Schulz, S., Besold, T.R.: What does explainable AI really mean? A new conceptualization of perspectives. arXiv preprint arXiv:1710.00794 (2017)
31. Pardos, Z.A., Fan, Z., Jiang, W.: Connectionist recommendation in the wild: on the utility and scrutability of neural networks for personalized course guidance. User Modeling and User-Adapted Interaction, pp. 1–39 (2019)
32. Ras, G., van Gerven, M., Haselager, P.: Explanation methods in deep learning: users, values, concerns and challenges. In: Escalante, H.J., et al. (eds.) Explainable and Interpretable Models in Computer Vision and Machine Learning. TSSCML, pp. 19–36. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98131-4_2
33. Corrin, L.: Evaluating students' interpretation of feedback in interactive dashboards. In: Zapata-Rivera, D. (ed.) Score Reporting Research and Applications, pp. 145–159. Routledge, New York (2018)

AI in Adaptive Instructional Systems

Adaptive Agents for Adaptive Tactical Training: The State of the Art and Emerging Requirements

Jared Freeman1, Eric Watz1, and Winston Bennett2
1 Aptima, Inc., 12 Gill Street, Suite 1400, Woburn, MA 01801, USA
[email protected]
2 Warfighter Readiness Research Division, 711 HPW/RHA, USAF, Dayton, OH, USA

Abstract. Military tacticians require practice to learn their craft. Practice requires adaptive opponents capable of responding to trainee actions in ways that are realistic and instructionally productive. Current agents are generally too brittle, too scripted and unresponsive to support adaptive training in this way. What is required to develop adaptive agents are (1) real-time feeds of simulator data that are sufficiently rich and realistic to support agent development and execution; (2) agent architectures capable of generating realistic and instructive behaviors from these data; and (3) a testbed that can deliver data and performance measures in sufficient volume to enable modelers to accelerate agent development by applying emerging analytics and machine learning. The 711th Human Performance Wing/RHA has invested in precisely these capabilities over four years, engaging eight of the leading developers of intelligent agents. In this paper, we describe these capabilities, and, importantly, the data requirements these solutions impose on simulators and operational systems that can employ these technologies in the future.

Keywords: Adaptive training · Cognitive models · Tactical training

1 Introduction

U.S. Air Force pilots battle smart, agile enemy flyers in the wild. But in training simulations, the automated enemies – Computer Generated Forces (CGF) – are often predictable in flight and unresponsive to maneuvers by pilot trainees. Pilots can rehearse textbook tactics against textbook adversaries under these circumstances. That has high value early in pilot training. However, when pilots apply subtle, novel, or erroneous tactics, CGF typically respond unrealistically or do not respond at all. Human simulator operators often take over from CGFs when smart, agile behavior is needed. This work-around raises the staffing cost of simulation-based training and thus limits its availability to those times when staff are on hand.

Adaptive agents are needed to make simulation-based training more instructionally effective and available. This capability requires three developments. First, agents must be built using architectures capable of generating realistic and instructive behaviors. Second, data from simulators must be sufficiently rich to drive these agents. If the same data feeds are available from operational flight equipment, then agents used in training may eventually transition to the operational environment to support pilot decisions and accelerate tactical execution. Third, a testbed is needed that can deliver some of these data and performance measures to support the development of agents and the evaluation of their tactical smarts and agility. In the best case, the data and measures will be so voluminous that developers can apply modern machine learning techniques to efficiently create adaptive agents as powerful as the current AI champions of Go [1] and many video games [2]. The Air Force Research Laboratory, 711th Human Performance Wing/RHA, has invested in precisely these capabilities over four years, engaging eight of the leading developers of intelligent agents. In this paper, we describe key functionality of adaptive agents and the data requirements these solutions impose on the simulators and operational systems that drive them. We describe a testbed for agent development and evaluation that fulfills many of these data requirements, and concisely profile the agent architectures that the testbed currently hosts.

2 Functions of Smart, Agile Agents

Four functions enable pilot agents to behave in a smart and agile manner: tactical inference, tactical action, modal behavior, and instructional capability. Two further functions support the maintenance, extension, and use of smart agents: evolution and transparency. These functions can reside in agents themselves, or they can be distributed across an agent and the system that hosts it. Here, we define these six functions.

Tactical inference enables the agent to interpret instants and streams of tactical data (e.g., altitude, airspeed) in ways that support decision making. In Endsley's terminology of situation awareness [3], the agents must perceive the state of the environment, comprehend its tactical importance, and project (i.e., predict) potential evolutions of the environment.

Tactical action is the agent's capability to respond smartly to the tactical situation with actions that drive the adversary toward defeat and (if the adversary is a trainee) toward learning. An agent typically must synchronize its tactical actions with friendly forces to achieve strong and sure effects, and so the concept of tactical action necessarily includes communication and coordination between entities.

Modal behavior is the capacity of the agent to adopt distinct modes of tactical inference, tactical action, and other characteristics. For example, agent inference might be modeled on the doctrine of adversary force X in one mode and Y in another; an agent's tactical actions may vary in speed and synchronization as a function of the mode of expertise (e.g., novice, apprentice, journeyman, master); or an agent might vary its risk tolerance, and thus its tactical decisions, as a function of personality modes.1

1 Just how much variability in behavior is instructionally or operationally useful is an empirical question, as is the value of psychological realism (such as the decision biases described by Tversky and Kahneman [12]) in agent processing.


Instructional capability is the ability of the agent to selectively produce tactical behaviors that exercise and develop the skills of trainee pilots. Such an agent must maintain a model of the trainee skill state, the target skill state (typically represented by training objectives or a model of expert behavior), and the degree to which the available tactical conditions exercise deficient trainee skills. In addition, a training agent should generate guidance and feedback for trainees and, if it can do so, communicate these effectively but selectively based on some instructional strategy.

Evolution is the ability of an agent to learn from new data, whether those data arise from training simulations, live operations, or other sources such as tactical simulations. Evolution requires careful management. An agent must evolve at a deliberate pace, meaning that the agent should learn, or apply newly learned skills, at a tempo that ensures a reliable training environment or operational capability for each cohort of human pilots who must fight from the same foundation of tactical knowledge.

Transparency is the ability of an agent to describe the services it provides, the conditions under which it provides them, the accuracy and reliability with which it functions under those conditions, and the historical costs of modifying those services. With these descriptive and administrative metadata [4], an instructor or instructional system can select the optimal agent for a training task, and a user can estimate the return on investment of modifying an agent to provide new or better services.

By enumerating and defining these functions, we are better able to design adaptive instructional systems that can challenge trainees in a manner that is instructionally effective. Further, we are assured that these capabilities can be maintained and applied smartly and cost-effectively. In Fig. 1, we map the six functions above to three capabilities of an agent-based adaptive training system: domain proficiency, instructional efficacy, and maintainability and applicability.

Fig. 1. Six functions (right) enable three capabilities (center) of adaptive instructional systems.


3 Data Requirements of Smart, Agile Agents

Agents can implement the functions above only if they have access to a range of data, which may be generated directly by agents or by the host system on which they operate. Here, we define some of these data requirements, then briefly describe a testbed that delivers some of these data to agents developed by eight agent development firms.

3.1 Tactical Inference

Tactical inference implements situational awareness (SA). The three levels of SA defined by Endsley [3] impose distinct classes of data requirements.

To implement the perceptual level of SA requires a dynamic feed of data concerning the position, kinematics, and tactical actions of each entity in the battlespace (friendly, adversary, and neutral) and, in high-fidelity simulations or operations, the state of the environment (e.g., visibility conditions ranging from dark to light, cloudy to clear).

To implement the comprehension level of SA requires the agent to infer missing data and transform perceptions into tactically meaningful constructs. The inference challenge is significant; an agent may need to infer missing dynamic data given incomplete or inaccurate persistent data. Entity and environmental data may not be available in their entirety to an agent either directly (e.g., through sensor readings) or indirectly, through reports by other entities such as an Airborne Warning and Control System (AWACS) or wingmen on a common network. Thus, an agent may need to infer attributes of an entity from incomplete data, as when a pilot infers that an aircraft is armed for battle given its speed, altitude, and heading. This inference requires persistent data concerning the potential order of battle, meaning the type, capability, number, and organization of entities in the battlespace. Each of these may be complex data. For example, in tactical aviation, the capability of an aircraft includes at least its flight characteristics (e.g., potential speed, acceleration, and maneuverability), sensor capabilities, communications capability, and weaponry. In operations, these data may not be precisely known for any given platform, and it may not be known exactly which platforms the adversary will bring to a given battle. Training simulations typically evade the problem by providing agents with complete and accurate entity data, reducing the need for agents to implement rapid and accurate forms of inference. Comprehension SA also entails the transformation of flight data into tactically meaningful constructs. The locations of entities in space must be transformed into the labels and parameters of formations documented in persistent data; the recent history of change in position and kinematics must be transformed into maneuvers defined in persistent data.

The projection level of SA requires the agent to predict how the battle will evolve. This, in turn, taps some persistent store of data concerning the responses of entities to potential actions of their partners and adversaries, and the effects of those actions on the mission. For example, an agent may accurately predict the maneuvers of enemy aircraft and estimate the threat those maneuvers pose to friendly units and their mission.


In sum, tactical inference, which implements situational awareness (perception, comprehension, and projection), requires persistent and dynamic data concerning the characteristics of entities and their behaviors.
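
The sketch below shows one way the perception-level data feed described above might be represented and turned into a simple comprehension-level label; the fields, the formation rule, and the numbers are illustrative assumptions rather than a fielded message format.

from dataclasses import dataclass
from math import hypot

@dataclass
class EntityState:
    """One dynamic record in the battlespace feed (perception-level SA)."""
    entity_id: str
    side: str            # "friendly", "adversary", or "neutral"
    x_m: float           # position (toy 2-D coordinates)
    y_m: float
    altitude_m: float
    speed_mps: float
    heading_deg: float

def label_formation(a: EntityState, b: EntityState, max_separation_m=2000.0):
    """Toy comprehension step: label two co-heading, same-side entities as a formation."""
    separation = hypot(a.x_m - b.x_m, a.y_m - b.y_m)
    same_heading = abs(a.heading_deg - b.heading_deg) < 15.0
    if a.side == b.side and same_heading and separation < max_separation_m:
        return f"{a.entity_id}/{b.entity_id}: two-ship formation"
    return f"{a.entity_id}/{b.entity_id}: independent entities"

lead = EntityState("RED-01", "adversary", 0.0, 0.0, 9000.0, 250.0, 270.0)
wing = EntityState("RED-02", "adversary", 800.0, 600.0, 9100.0, 245.0, 268.0)
print(label_formation(lead, wing))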

3.2 Tactical Action

Tactical action entails selecting and dynamically adapting tactics to fulfill mission objectives and/or drive a trainee to learn. This requires persistent data concerning the effects that are appropriate given the mission (e.g., evade the enemy to approach a target, or engage the enemy to defend an asset), the tactics available, and the effects those tactics can achieve given the projected actions of the adversary. Persistent data concerning tactics must define formations, maneuvers, use of sensors and weapons, and, critically, coordination and communication with other members of the agent's force. Rich data supporting coordination might document a human or synthetic wingman's likely response time and accuracy of adherence to tactical doctrine and specific orders. Rich data supporting communication actions might document the effects of timing messages to arrive during periods of relative calm (when the recipient can attend to them) or relative chaos. Estimates of the effects of tactics may require relatively simple data (e.g., maneuver A has 75% likelihood of evading an adversary that is employing tactic B), or historical data that support a more game-theoretic decision strategy (e.g., with each repetition of tactic A, the likelihood of evasion drops by 50%). These persistent data concerning planned effects, tactical options, and likely effects of tactical options in context may suffice to select tactics. To adapt tactical actions continuously to the moment requires the same dynamic data used in tactical inference.
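
The game-theoretic wrinkle mentioned above, that a tactic loses effectiveness each time it is repeated, can be sketched as follows; the tactic names, base success likelihoods, and decay factor are hypothetical.

def select_tactic(tactics, history, decay=0.5):
    """Pick the tactic with the highest discounted likelihood of success.

    tactics: mapping of tactic name -> base likelihood of achieving the effect
    history: list of tactics already used in this engagement
    decay:   multiplicative penalty per prior use (repetition makes a tactic predictable)
    """
    def discounted(name):
        return tactics[name] * (decay ** history.count(name))
    return max(tactics, key=discounted)

# Hypothetical playbook for evading an adversary employing a given tactic.
playbook = {"maneuver_A": 0.75, "maneuver_B": 0.60, "maneuver_C": 0.50}

engagement_history = []
for _ in range(4):
    choice = select_tactic(playbook, engagement_history)
    engagement_history.append(choice)
print(engagement_history)   # repetition shifts the choice away from the strongest maneuver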

3.3 Modal Behavior

Modal behavior by agents requires configuration data that specify the agent characteristics that an instructor (human or automated) or operator wants to evoke. An agent's tactical behaviors might vary as a function of its configured nationality, where an agent representing a poorly trained and equipped adversary might select ill-fitting tactics predictably from a short playbook, while an agent representing a sophisticated adversary might select surprising and effective tactics from a larger doctrine. An agent's configured task work expertise might determine the accuracy and adaptability with which it executes a tactic at speed. Configurations for teamwork capability might drive the speed, accuracy, and timeliness of agent communications. Teamwork configuration data might control how well agents coordinate their actions to, for example, execute a pincer movement attacking an enemy's flanks. A configuration of agent personality or attitude might address tolerance for risk, which in turn might bias the agent to select tactics that achieve effects in a manner that is more aggressive but potentially less survivable. Implementation of modal behaviors requires persistent data that characterize the decision and behavior biases of different adversaries, at varied levels of expertise, or among different personalities. Data concerning the behavioral effects of these modes can be estimated from theory or research findings, or summarized from empirical data. In addition, parameter values must be available to select among these modes.
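
One way the configuration data described above might be expressed is sketched below; the mode names, parameter ranges, and values are invented for illustration.

from dataclasses import dataclass

@dataclass
class AgentModeConfig:
    """Configuration parameters that select an agent's behavioral mode."""
    doctrine: str            # which adversary playbook to draw tactics from
    taskwork_expertise: str  # "novice", "apprentice", "journeyman", or "master"
    teamwork_level: float    # 0 = uncoordinated .. 1 = tightly coordinated
    risk_tolerance: float    # 0 = conservative .. 1 = aggressive

    def tactic_error_rate(self):
        """Toy mapping from configured expertise to an execution error rate."""
        return {"novice": 0.30, "apprentice": 0.20,
                "journeyman": 0.10, "master": 0.03}[self.taskwork_expertise]

# A sophisticated, aggressive adversary versus a poorly trained one.
sophisticated = AgentModeConfig("large_doctrine", "master", teamwork_level=0.9, risk_tolerance=0.7)
poorly_trained = AgentModeConfig("short_playbook", "novice", teamwork_level=0.3, risk_tolerance=0.4)
print(sophisticated.tactic_error_rate(), poorly_trained.tactic_error_rate())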

3.4 Instructional Capability

Instructional capability enables an agent to apply tactics that train. Such instructive behavior requires that an agent maintain a model of the trainee skill state relative to the target skill state. Measures of performance and effectiveness (MOPs and MOEs) populate trainee skill models; training objectives (or expert models) define the target skill states. Both types of data are necessary for instructional capability. They are not sufficient. To smartly select actions that train requires at least data that map training objectives to the training conditions that grow skill. For example, to develop pilot competence in tactical communication requires scenarios in which a wingman or AWACS operator communicates accurately (or erroneously) with the trainee. There may be many training conditions that exercise a given skill. It is not trivial to choose the most effective among them for a given trainee. Ideally, an agent's tactical actions invoke a training challenge that is neither too small to trigger human learning nor so great as to prevent it; rather, it falls in the Zone of Proximal Development [5], in which the agent can manage student learning. Adapting the challenge to the trainee requires data concerning the effects of the challenge on trainees' task work and teamwork skills given trainees' state (or state history). Note that trainee state may include physiological data that indicate attention, workload, and other conditions that bear on learning. Systems that capture physiological data should make measures of these conditions available to agents.

Human ability to learn from experience – from interaction with intelligent agents and other humans – is robust. However, these effects increase for some trainees when they are prepared for the specific learning experience, when they receive coaching (or are otherwise "scaffolded") during it, and when they get feedback afterwards. Agents should train more adaptively and effectively if they can generate, communicate, and smartly deliver (or withhold) advance organizers, real-time coaching, and debriefs based on an instructional strategy. For example, immediate feedback benefits less proficient trainees, who are unable to assess their performance accurately, but may hamper journeymen, who must assess themselves in operational settings. Agents thus require data concerning the impact of instructional actions (not just tactical ones) on learning given trainee state.
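
A minimal sketch of challenge selection within a zone of proximal development is given below; the skill scores, the width of the zone, and the scenario catalogue are assumptions made only for illustration.

def pick_challenge(trainee_skill, target_skill, challenges, zone_width=0.2):
    """Choose a training condition that is neither trivial nor overwhelming.

    trainee_skill, target_skill: scores in [0, 1] for the skill being trained
    challenges: mapping of scenario name -> difficulty in [0, 1]
    zone_width: half-width of the zone of proximal development above the
                trainee's current skill level
    """
    if trainee_skill >= target_skill:
        return None   # objective met; train a different skill
    in_zone = {
        name: difficulty
        for name, difficulty in challenges.items()
        if trainee_skill < difficulty <= trainee_skill + zone_width
    }
    # Prefer the hardest challenge inside the zone; fall back to the easiest
    # scenario when nothing lies inside the zone.
    return max(in_zone, key=in_zone.get) if in_zone else min(challenges, key=challenges.get)

# Hypothetical catalogue of tactical-communication scenarios.
catalogue = {"single_bandit_clear_comms": 0.3,
             "two_ship_degraded_comms": 0.5,
             "four_ship_jammed_comms": 0.8}
print(pick_challenge(trainee_skill=0.35, target_skill=0.7, challenges=catalogue))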

3.5 Evolution

Evolution ensures that agents maintain or grow their tactical and instructional effectiveness as trainees, training objectives, measures, and the tactical environment change over time. Agent developers typically deliver evolution by manually modifying agent software. New machine learning techniques – notably deep learning from a static dataset and reinforcement learning from a dynamically generated dataset – are a cost-effective alternative that has produced superior agents in Go [1] and other domains [2]. Manual development benefits from, and automated learning requires, voluminous and

variable data concerning tactical states (developed in tactical inference, above) and tactical actions, as well as measurements of their effects in battle, and ideally of their similarity to doctrinal tactics (procedures). These data can be captured in training simulations, simulated tactical exercises, and real operations for use by agent developers and machine learning. Evolution requires careful management, as we note above. An agent that evolves more quickly than its users may appear to them to be unpredictable and untrustworthy. Agents or training systems should provide some control over the frequency with which agents learn or at which they apply what they learn. This implies a requirement for learning rate and/or learning application parameters.
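The parameters implied by that requirement might be externalized along the following lines; the names and values are illustrative assumptions rather than a published interface.

```python
# Illustrative evolution-control parameters: how the agent learns, how often it
# retrains, and when updated behavior is actually applied in front of trainees.
evolution_config = {
    "learning_rate": 1e-4,             # step size for automated policy updates
    "retrain_interval_episodes": 500,  # retrain only after this many scenario runs
    "apply_updates": "after_review",   # "immediately", "after_review", or "never"
}
```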

3.6 Transparency

Transparency enables an agent to describe itself to its users, both human and automated. The data required to do this are what the National Information Standards Organization calls descriptive metadata and administrative metadata [4]. Descriptive metadata document the services that an agent provides (e.g., pilots an F-35 and communicates with F-35 pilots and AWACS), the conditions under which it provides them (e.g., air superiority missions and ground attack missions), and the reliability with which it functions under those conditions (e.g., mean time between failure of tactical engagements). These data enable a system that hosts many agents to select those that are best for a given training task.

Administrative metadata enable users to manage a resource. Here, the required data concern the cost of acquiring an agent if it is purchased per use or per user (as are some software-as-a-service applications), of applying that agent (particularly if its installation and configuration are complex), and of modifying it for new uses. These cost metadata are sometimes competition sensitive, and agent vendors may not wish to publish them, but purchasers know them and may be able to make them available on their own systems. Doing so enables a user or a system to perform return-on-investment (ROI) calculations that trade the effectiveness of an agent (documented in its descriptive metadata) against its cost.
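The sketch below pairs descriptive and administrative metadata of the kind listed above with a simple ROI calculation; field names, costs, and the effectiveness figure are illustrative assumptions.

```python
agent_metadata = {
    # Descriptive metadata: what the agent does, under what conditions, how reliably.
    "services": ["pilots an F-35", "communicates with F-35 pilots and AWACS"],
    "conditions": ["air superiority missions", "ground attack missions"],
    "mean_engagements_between_failures": 120.0,
    # Administrative metadata: what it costs to acquire, apply, and modify.
    "cost_per_use_usd": 15.0,
    "integration_cost_usd": 20000.0,
}

def return_on_investment(training_value_per_use_usd, uses, meta):
    """Trade documented effectiveness against documented cost."""
    total_cost = meta["integration_cost_usd"] + uses * meta["cost_per_use_usd"]
    return (training_value_per_use_usd * uses - total_cost) / total_cost

print(round(return_on_investment(40.0, 1000, agent_metadata), 2))  # -> 0.14
```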

3.7 Summary

We have enumerated a representative sample of the data required to deliver six useful functions of smart, agile agents for adaptive instructional systems. Such systems should be able to generate instructional challenges for trainees by virtue of tactical inference and tactical action functions. They should select among those challenges to optimize learning speed (holding the proficiency target constant) and/or proficiency (holding training time constant) by virtue of modal behavior and instructional capability. It should be straightforward to maintain and apply these agents because they evolve in tactical and instructional competence, and their utility and cost are documented in metadata. We know of no single agent or host system that satisfies all these data requirements. However, one AFRL research program is systematically developing a testbed and agents that fulfill some of these data requirements. We describe those products below.

4 A Testbed for Developing and Evaluating Agile Agents

AFRL has developed a unique testbed for developing and evaluating smart, agile agents [6]. We call this testbed AGENT, the Agent Generation & Evaluation Networked Testbed (see Fig. 2). In general, the testbed is distinguished by (1) batch configuration and execution of scenarios in which agents fight against agile CGF and (2) automated performance measures that are (3) recorded in a common data store. These functions enable developers and testers to generate large volumes of variable performance and outcome data that sample a massive tactical space. Developers can access those data for machine learning of agent capabilities such as classification of tactical states from environmental data, and selection of tactical responses from behavioral and outcome data. The testbed fulfills some of the data requirements described above.

To support tactical inference, the testbed reports the standard entity position and kinematics data available through the DIS protocol. It also describes the adversary formation and location, much as an AWACS operator would do for pilots in flight, through a special-purpose protocol called m2DIS. Finally, it responds to requests for certain fundamental tactical information at the comprehension level of SA, such as "Am I being threatened? Am I in the adversary's weapons engagement zone? Where is my wingman in relation to me?" Agents in the testbed are responsible for predicting the evolution of the engagement.

Agents hosted on this testbed are responsible for tactic selection and parameterization, the core function of tactical action as we've defined it above. The testbed enables agents to execute maneuvers by name (e.g., dogfight, posthole, drag) with parameters (e.g., terminal speed and altitude). Similarly, it enables agents to control sensors and weapons with fine granularity at an unclassified level. Testbed agents are responsible for coordinating actions between one another, such as ensuring that each agent is prosecuting a different adversary, or that both are converging on the same one.

To support modal behavior, the testbed employs a relatively expert enemy (relative to developers' agile agents), which is implemented as behavior transition networks in the Next Generation Threat System [7]. We plan to eventually configure these to represent the multiple levels of tactical expertise of human trainees. Testbed developers are free to model and parameterize their agents to represent different nationalities, levels of expertise, and psychological profiles (such as risk tolerance).

To support instruction, the testbed computes measures of tactical procedures and effects. Agents can use those data to optimize their actions for instructional effects. The testbed also provides a graphical interface in which agents report or explain their actions. This feature has proved useful in assessing agents, and so we expect it to add value in After Action Reviews. The testbed currently does not issue other data specific to instructional decisions. However, members of the team have published [8] a specification for representing some of those data. Training Objective Packages (TOPs) represent training objectives, their relationships to scenario conditions, and the performance measures by which trainee skill on objectives is evaluated. TOPs are designed to enable trainers or simulators to generate scenario events that bear on training objectives, and to measure the tactical and instructional effects. A future version of the testbed may incorporate TOPs.

To support evolution, the testbed computes measures and stores them in a shared data store from which any agent developer can draw data concerning the performance of every agent. These data can be voluminous because of the testbed's batch operation function and agile opponent pilots. This should provide enough data for manual analysis and for machine learning.

To support transparency, the system will eventually document the capabilities of each agent in context. With these data, we will implement a function that selects the best agent for the training task from a library of agents.

Fig. 2. AGENT provides agent developers with a secure and private environment for exercising their red AI pilots against blue CGF in scenarios defined in a shared library. Blue airframes, sensors, and weapons are modelled in the Next Generation Threat System (developed initially by the Air Force and now by the Naval Air Warfare Center Training Systems Division). AGENT publishes standard entity state and interaction data using the DIS protocol. It communicates tactically meaningful information concerning the tactical situation over a custom Model-to-DIS (m2DIS) interface. Measures of performance and effects are computed automatically by the Performance Evaluation and Tracking System (PETS). Users can observe scenario runs on the LVC Network Control Suite (LNCS) (not shown). (Color figure online)
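To make the comprehension-level queries above concrete, the sketch below wraps them behind a small client interface; the function names, arguments, and return shapes are invented for illustration, and the actual m2DIS message set is not reproduced here.

```python
# Hypothetical wrapper over the testbed's comprehension-level SA queries.
class TacticalStateClient:
    def __init__(self, transport):
        # transport: any object exposing query(name, **kwargs), e.g., an m2DIS bridge.
        self.transport = transport

    def is_threatened(self, entity_id):
        return self.transport.query("is_threatened", entity=entity_id)

    def in_weapons_engagement_zone(self, entity_id, adversary_id):
        return self.transport.query("in_wez", entity=entity_id, adversary=adversary_id)

    def wingman_relative_position(self, entity_id):
        # e.g., {"bearing_deg": 45.0, "range_nm": 2.5}
        return self.transport.query("wingman_relative", entity=entity_id)
```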

5 Agent Architectures

The architectures and methods used to develop agents in this AFRL program are each capable of fulfilling some mix of the functions above. Here, we briefly profile the architectures and call out just one or two agent functions each supports particularly well.

TiER1 Performance Solutions, LLC, is using a hybrid architecture that integrates two different human behavior representations. A task network model represents operator goals or functions in graph form. An accumulator model aggregates data over time to control transitions through the task network. Thus, this architecture can smoothly adapt tactical inferences and actions (i.e., decision making) in a dynamic tactical scenario.

Stottler Henke Associates, Inc., applies the SimBionic architecture [9], which implements an agent as an integrated set of behavior transition networks. This open source architecture supports a dynamic scripting machine learning algorithm, developed to adapt the behavior of agents by learning from experience. Thus, SimBionic has unusual strength with respect to agent evolution.

Soar Technology, Inc., applies the Soar cognitive architecture [10], a production system that searches a problem space and dynamically revises agent knowledge and actions to accomplish goals. If programmed at a sufficiently fine level of granularity, a production system can effectively generate novel tactical inferences and actions. Thus, Soar agents are particularly capable of tactical inference and action within scenarios, and potentially of evolution over them.

Aptima, Inc., is applying deep learning techniques to infer tactical state and appropriate tactical response. These populate a Behavior Definition Language (BDL) that expresses goals, tactical state, behavioral constraints, actions, predictive measures, and other attributes necessary for intelligent agent behavior. BDL is input to Soar agents. This work exemplifies automated evolution of agents.

Eduworks Corporation employs Brahms, a government-owned agent modeling framework created to design, simulate, and develop work systems, which consist of humans and technologies. Accurate representation of human-human and human-machine interaction makes Brahms particularly capable of realistic coordination in tactical action [11].

Discovery Machine, Inc., applies a cognitive architecture called DMInd that represents hierarchies of pre-specified problem spaces and response strategies, which are retrieved as a function of fit to context. These functions are designed for accurate inference and action concerning tactics, but they can be applied to manage instruction as well.

CHI Systems applies its Personality-enabled Architecture for Cognition (PAC), a system that uses narrative threads to control perception and behavior. PAC explicitly represents personality and emotion. This makes it capable of modal behavior as a function of attributes such as risk tolerance and perception of threat.

Charles River Analytics, Inc., employs the Situation Assessment Model for Person-in-the-loop Evaluation (SAMPLE), which emulates recognition-primed decision-making, and the Hap model of reactive, goal-focused behavior. Thus, SAMPLE is well-suited to emulating expertise in tactical inference and action.

6 Conclusion

This article describes a set of functions that software agents should provide if they are to be tactically smart, instructionally effective, and cost-effective opponents in simulation-based flight training. Those functions, in brief, are tactical inference, tactical action, modal behavior, instructional capability, evolution, and transparency. These capabilities have interesting implications for data requirements, ranging from kinematic data already published by many simulators to training objectives and student skill assessments that are typically maintained only in the minds of expert instructors. A testbed developed by AFRL satisfies some of these new data requirements. Several agent development firms are testing the sufficiency of those data to drive smarter, more agile agents for training and, one day, for operations in battle.

Acknowledgements. This article is based upon work supported by the United States Air Force Research Laboratory, Warfighter Readiness Research Division 711 HPW/RHA, under Contract FA8650-16-C-6698. This article is cleared for public release on 28 Jan 2019, Case 88ABW2019-0371. Thanks also to members of the following organizations for the concepts and agents they have developed in this program: Aptima, CHI Systems, Charles River Analytics, Discovery Machine, Eduworks, Soar Technology, Stottler Henke Associates, and Tier1 Performance Solutions.

References

1. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)
2. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
3. Endsley, M.R.: Toward a theory of situation awareness in dynamic systems. Hum. Factors 37(1), 32–64 (1995)
4. Riley, J.: Understanding Metadata. National Information Standards Organization, Baltimore (2017)
5. Vygotsky, L.: Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, MA (1978)
6. Freeman, J., Watz, E., Bennett, W.: A testbed for developing & evaluating AI pilots. In: Proceedings of ITEC 2019, Stockholmsmässan, Sweden (2019)
7. Next Generation Threat System. http://www.navair.navy.mil/nawctsd/pdf/2017-NGTS.pdf. Accessed 1 Jan 2017
8. Stacy, W., Freeman, J.: Training objective packages: a mechanism for enhancing the effectiveness of simulation-based training. Spec. Issue Theoret. Issues Ergon. Sci. 27, 149–168 (2016)
9. SimBionic GitHub. https://github.com/StottlerHenkeAssociates/SimBionic. Accessed 21 Dec 2018
10. Laird, J.E.: The Soar Cognitive Architecture. MIT Press, Cambridge (2012)
11. Bell, B., Bennett, W., Clancey, W.: Socio-technical simulation for denied environments training: a contested airspace example. In: Proceedings of the Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC). National Defense Industry Association, Arlington, VA (2018)
12. Tversky, A., Kahneman, D.: Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974)

Cognitive Agents for Adaptive Training in Cyber Operations

Randolph M. Jones, Ryan O'Grady, Fernando Maymí, and Alex Nickels

Soar Technology, 3600 Green Court, Suite 600, Ann Arbor, MI 48105, USA
{rjones,alex.nickels}@soartech.com, [email protected], [email protected]

Abstract. To support training for offensive and defensive cyber operations, we focus on giving the trainee a realistic ecosystem to train in. This ecosystem includes models of attackers, defenders, and users. The high-level goals for adaptation in this ecosystem are of two types: realism in behavior and tailoring of training. In terms of realism, real-world cyber operations are highly adaptive. Attackers constantly innovate new attack techniques and adapt existing techniques to take advantage of emerging vulnerabilities. Defenders must adapt to ever-changing attack tactics and vulnerabilities. Users continuously adapt to rapidly changing technology. A realistic training ecosystem requires those adaptations to be reflected in the models of the synthetic actors. In terms of tailoring, training systems often require ecosystem actors to step outside of what would "realistically" happen and instead create artifices to focus the trainee's experience on particular learning objectives. In support of these high-level adaptation goals, the CyCog (CYber COGnitive) framework currently supports three types of adaptivity. These include adaptation of tactics and techniques (for example, innovating a new attack or defense), adaptation of level of sophistication (for example, to make an attacker more or less aggressive, or to limit or expand a defender's awareness to focus training), and adaptation of personality parameters (for example, to tune the preferences of various types of users in the ecosystem). To maintain maximum training flexibility, we use a mixed-autonomy approach that allows all forms of adaptation to be controlled on a spectrum from automated tuning to manual manipulation by human instructors.

Keywords: Cyber operations · Training · Cognitive modeling · Adaptive behavior · Adaptive instruction



1 Introduction

We have developed the CYber COGnitive (CyCog) framework to, among other reasons, add realistic decision-making models to the "ecosystem" used in cyber-operations training environments. The core of CyCog is a set of cognitive models that can play the roles of cyber attackers, defenders, or users, to various levels of fidelity. In addition to the cognitive models themselves, CyCog incorporates a generic framework for simulating applications and networks, as well as for integrating with real applications and networks.
Specialized components of CyCog focus on how best to represent the common knowledge and situation understanding shared by cyber operators.

One of the goals of the CyCog project is to provide training systems with a rich ecosystem of realistic actors, so defensive operators can train against threats as they appear "in the wild", rather than simply against individual tactics and techniques in a classroom or lab setting. The use of cognitive models provides two primary enhancements to the training experience. First, our cognitive models have goals, and the tactics, techniques, and procedures (TTPs) they use are consistent with those goals. This provides opportunities for diagnosis, attribution, prediction, and preemption that would otherwise be absent from a training experience. Goals can be long term, allowing CyCog to model Advanced Persistent Threats (APTs), as well as less persistent types of threats, such as "script kiddies". Second, our cognitive models include knowledge, situation-understanding capabilities, and learning mechanisms that allow them to adapt in their responses to defensive measures. Adaptive attackers provide the opportunity for a much richer and more realistic experience in training than current textbook and classroom activities provide, and they provide a more cost-effective training option than exercises that employ human experts as role players. We have also integrated the CyCog agents with a form of Dynamic Tailoring, which adapts agent behaviors to support training goals. This paper describes the CyCog framework and its application to training, with a focus on forms of adaptation that we are incorporating into the cognitive models and training systems.

2 Cyber Operations Training

Cyber operations present a persistent and evolving threat to military and civilian information systems. Both the Department of Defense [24] and the Office of the Director of National Intelligence [25] have ranked cyber warfare as our top national security concern. Department of Homeland Security Secretary Nielsen has stated that "cyber threats collectively now exceed the danger of physical attacks" [22]. In addition to threats to our military forces, cyber attacks pose domestic infrastructure and economic threats [27].

Attackers continually evolve their tactics, techniques, and procedures (TTPs) to exploit emerging vulnerabilities so they can exfiltrate, manipulate, or deny information. Would-be cyber attackers are constantly changing their attack vectors to take advantage of security lapses by human resources and the latest vulnerabilities in information technology. These activities are guided by cognitive behavior that includes a variety of types of goals and expertise: script kiddies, ideological activists, investigators, financial criminals, intelligence agents, or cyber warfighters [15]. At the human, cognitive level, offense reacts and adapts to actions of defenders [26] and users [1] that are also cognitively driven.

To counter these adversarial actions, cyber-security personnel must rapidly adapt to develop and refine their defensive skills. A common way to support this adaptation by defensive personnel is through training exercises within realistic environments. To be effective, these events require intelligent, adaptive opposition forces (OPFOR), which currently requires the use of human role players. Unfortunately, using human adversaries is not feasible to support the scale and frequency of exercises needed to maintain a highly skilled defense,
because these skilled OPFOR are scarce and expensive resources. In addition, a maximally effective training paradigm adjusts the training experience in response to the level of performance of the trainee. Using intelligent systems to provide role players introduces consistency and cost-effective automation across the training experiences.

Building effective training systems for cyber operations presents a suite of unique problems:

• Offensive and defensive activity is highly interactive and dynamic.
• Cyberspace environments are defined by both their structure and the activity of all actors operating within them. To be realistic, training environments must provide an ecosystem of users (whose behavior can be variable and unpredictable), as well as attackers and defenders.
• User behavior (willingly or not) can either assist or hinder the efforts of both attackers and defenders. For example, vigilant users can provide valuable intelligence, while careless users often create vulnerabilities that can be exploited.
• Cyber attackers and defenders themselves are extremely adaptive and creative. In order to meet their objectives, they will change tactics or tools based on opportunities detected in a computer network or responses initiated by adversaries or users.

These problems underscore the fact that cyber operations are a domain in which rapid adaptivity is the coin of the realm. Effective training systems and role players must exemplify this variability in their basic structure and their delivery of training experiences.

We have developed a set of cognitive models for cyber-operations training, together with a general framework in which they operate. The CyCog framework provides the shared infrastructure for building knowledge-based agents that can play various roles within the cyber-operations ecosystem. Individual attacker, defender, and user models instantiate CyCog with specific knowledge bases, modes of operation, and types of adaptivity. This paper describes three particular forms of adaptivity that we have so far emphasized in developing CyCog models that support cyber-operations training.

3 Intelligent Systems for Cyber Training

Our team began development of a cyber attacker model in 2013. Subsequently, we pursued projects that demanded the development of additional models of cyber defenders and network users. Combining the models and performing several major redesigns and refactors eventually produced the generalized CyCog framework. CyCog-A (the attacker instantiation of CyCog) and CyCog-D (the defender instantiation) are implemented in the Soar cognitive architecture [17], with associated tools implemented in Java for a high degree of portability. CyCog agents emulate human role players by modeling their decision-making processes using cognitive-systems design patterns we have developed over the course of more than two decades of applied research and development (see, for example, [8, 9, 11, 29, 30, 32]). This allows CyCog agents to provide not just static representations of cyber operations, but generative behavioral models that execute and interact with a network in real time.

One advantage of using the Soar architecture over some other approaches to cognitive modeling [14] is that it incorporates decades of experience in cognitive
research into an architecture that is reusable across new cognitive models. Soar’s models integrate varieties of knowledge, learning, and reasoning strategies, including semantic, episodic, and procedural memory, reinforcement learning, spatial reasoning, and activation dynamics, as well as cognitive models that depend on situation understanding and interaction with complex environments [18]. Soar provides a reusable software architecture that instantiates a unified theory of cognition, which gives CyCog firm practical and theoretical grounding. In the CyCog-A model (Fig. 1), knowledge is structured as composable hierarchical modules representing TTPs that can combine in multiple ways to support different situations. Combining goals and subgoals for a particular attack generates a flexible, situation-dependent attack tree structure. As the agent attempts to achieve the goals of its mission, it communicates its intent to an abstraction layer, which is responsible for translating intent into real-world actions. This enables the agent to reason over task-level goals, such as “list open ports on 192.168.1.5”, without requiring the agent to possess system-level knowledge of how to do so. To accomplish this, the abstraction layer determines which of its available resources are appropriate and translates the agent’s intent into commands for those resources. The abstraction layer supports interaction between CyCog-A and custom-built or commercial off-the-shelf (COTS) tools to perform specific offensive actions, such as port scanning, password cracking, or phishing. The modular nature of the toolkit facilitates the addition and removal of tools as needed, and also provides a robust framework for configurability and portability. Finally, a human supervisor can command and control the agent via a CyCog Command and Control (C4) Server, which uses an abstract communication model that currently implements HTTP and IRC support.
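A minimal sketch of such an abstraction layer is shown below; the intent schema and the mapping to specific tools (e.g., a port scanner) are our own illustrative assumptions, not CyCog's actual interfaces.

```python
def translate_intent(intent):
    """Map a task-level intent to a concrete command line for an available tool."""
    action = intent["action"]
    if action == "list_open_ports":
        # Delegate to a port scanner; nmap is used here purely as an example.
        return ["nmap", "-p-", "--open", intent["target"]]
    if action == "crack_password_hashes":
        return ["john", "--wordlist=" + intent["wordlist"], intent["hash_file"]]
    raise ValueError("no tool registered for action " + repr(action))

print(translate_intent({"action": "list_open_ports", "target": "192.168.1.5"}))
# ['nmap', '-p-', '--open', '192.168.1.5']
```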

Fig. 1. CyCog-A framework

CyCog agents use a key combination of situational understanding [14] and least-commitment reasoning [33] to generate human-realistic, situation-responsive behaviors. Once a scenario has been configured, CyCog agents use situation-understanding knowledge to engage with the network and begin building and maintaining a model of the environment. This model drives the agents in the pursuit of their scenario goals and the employment of tactics to achieve those goals. This, in turn, allows them to adapt robustly in response to changes in the environment, including actions and responses by other role players in the cyber ecosystem. Least-commitment reasoning (LCR) is a process for "generating partially ordered, partially specified sequences of actions whose execution will achieve an agent's goal" [33]. CyCog also implements the situation-understanding model in an external and sharable knowledge representation, which allows multiple CyCog agents to work together. This shared knowledge representation is stored in an Asset Database, which is currently implemented in a software database we call the Cyber Data Repository (CyDaR). This datastore enables storing, correlating, and retrieving data from all three layers of cyberspace: physical, logical, and persona [6]. As other actors and role players take actions that change elements of the ecosystem, CyCog agents may update their shared situation-understanding model and revise their goals and tactics in response to the changes.
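The sketch below illustrates what one shared asset record spanning the physical, logical, and persona layers might contain; the field names are assumptions for exposition and do not reproduce the CyDaR schema.

```python
asset_record = {
    "physical": {"device": "workstation-07", "location": "Building 3, Lab B"},
    "logical":  {"ip": "10.0.4.17", "os": "Windows 10", "open_ports": [135, 445, 3389]},
    "persona":  {"primary_user": "j.smith", "role": "finance analyst"},
    # Observations contributed by cooperating agents, per the example above.
    "observations": [
        {"agent": "attacker-1", "note": "remote access toolkit implanted via phishing"},
        {"agent": "attacker-2", "note": "implant stopped responding; defender may have contained it"},
    ],
}
```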

4 Adaptivity in Cyber Training

As described above, to support training for offensive and defensive cyber operations, we focus on giving the trainee a realistic ecosystem in which to train. This ecosystem includes models of attackers, defenders, and users. We have also stressed the adaptive nature of cyber operations in general, as well as the need for adaptivity in effective training. The high-level goals for adaptation in the training ecosystem include realism in behavior and tailoring of training. In terms of realism, real-world cyber operations are highly adaptive. Attackers constantly innovate new attack techniques, while defenders must adapt to ever-changing attacker TTPs, and users continuously adapt to rapidly changing technology. A realistic training ecosystem requires those adaptations to be reflected in the models of all of the synthetic actors. Simultaneously, we must balance realism and trainee proficiency in order to challenge learners without overwhelming them. Thus, any effective training environment must be tailorable in order to adapt to an individual's (or team's) proficiencies, as well as to other pedagogical goals (such as targeted lesson plans). In support of these high-level adaptation goals, the CyCog framework currently supports three types of adaptation. These include adaptation of tactics and techniques (for example, innovating a new attack or defense), adaptation of level of sophistication (for example, to make an attacker more or less aggressive, or to limit or expand a defender's awareness to focus training), and adaptation of personality parameters (for example, to tune the preferences of various types of users in the ecosystem).
To maintain maximum training flexibility, we use a mixed-autonomy approach that allows all forms of adaptation to be controlled on a spectrum from automated tuning to manual manipulation by human instructors.

4.1 Adaptivity of Tactics and Techniques

One of the key characteristics of cyber attackers (as well as effective defenders) is that they are adaptive: they perceive and react to changes in their environment and they learn to exploit the tendencies of their adversaries. While there are varying degrees of adaptivity, any viable autonomous agent must execute a four-phased loop: (1) sense the environment, (2) learn what is different and/or interesting, (3) decide how to best achieve its next set of goals, and (4) act on the environment in pursuit of those goals [20]. CyCog attacker models incorporate TTP-level adaptivity by combining a robustly updated representation of situational understanding (stored in the CyDaR database) with least-commitment reasoning over a modular, generative set of goals and subgoals that fluidly implement a variety of TTPs for a spectrum of cyber-operational goals.

CyDaR represents situational understanding by integrating an ontology of cyber-operational concepts with an application programming interface (API) for database queries and updates. When multiple agents are coordinating, they can use the shared database to drive adaptivity in their situation understanding and employment of tactics. For example, one agent may remember on which hosts it implanted remote access toolkits (RATs), while another agent notices when they stop responding to commands, reasoning that the defender may have detected and contained the compromise.

Situation understanding is only the first half of the story for the adaptivity of tactics and techniques. Least-commitment reasoning (LCR) is an approach to intelligent systems that allows the system to "satisfice", or "make the best decision in a reasonable amount of time". Least-commitment strategies differ from more traditional AI planning and rule-based systems along a number of dimensions:

• Traditional AI relies on "weak methods" (logical formalism with little knowledge), while LCR relies on significant amounts of domain-specific knowledge.
• Traditional AI downplays the role of dynamic situational understanding, while dynamic consideration of the environment is a key part of LCR.
• Traditional AI relies on computationally expensive search techniques to generate optimal plans and to precompute contingencies, while LCR trades off optimality for "reasonability" and reconsiders contingencies in rapid fashion at run time.
• Traditional AI expends significant time and effort identifying a "best course of action", making it very expensive to change course, while LCR continuously makes inexpensive decisions about whether to adjust or switch the current course of action.
• Traditional AI has difficulty anticipating actions that cannot be predicted ahead of time, while LCR reduces the need for anticipatory planning.

Fig. 2. Abstract representation of the least-commitment decision cycle for a system that integrates situation understanding, reasoning, planning, and action.

An LCR system implements intelligent decision making by engaging in a continuously running "decision loop". This loop must run many times per second, possibly hundreds or thousands, depending on the application, in order to deliver responsive behavior. Traditionally, when mapped onto human cognition, it is assumed that this cycle should run about 20 times per second [21]. In each 50 ms time slice, the LCR system must make incremental changes to its internal state, based on the most recent sensed information from the environment. The LCR's internal state consists of several functional elements including situational understanding (represented by beliefs), goal management (consisting of desires, which are things the agent would like to be true, and goals, which are things the agent has decided to commit resources to achieving), and action management (consisting of plans, which are short-term courses of action, and individual actions that are consistent with the current plans). Figure 2 provides an abstract illustration of this cycle [12]. Because each individual decision is rapid and incremental, the LCR decision cycle composes individual decisions into a course of goal-driven actions that can adapt in small or large ways, depending on changes to the situation. The key to making rapid, but good quality, decisions is to replace the resource-intensive search process of traditional AI with an efficient, knowledge-intensive pattern-matching process. Cognitive systems that support LCR provide fast methods for accessing large knowledge bases and bringing the knowledge to bear on the decision-making process in rapid fashion [13].

The Soar Cognitive Architecture

The Soar cognitive architecture provides the implementation substrate for CyCog Attacker and Defender agents, as it is one such mature engine for developing LCR cognitive systems. Soar includes an extremely efficient pattern-matching engine for
accessing knowledge, and it has been demonstrated to execute the LCR decision cycle orders of magnitude faster than human-scale decision times [7, 16, 19]. Every software architecture that supports LCR implements some variation of the LCR decision cycle [12]. Figure 3 illustrates Soar’s implementation of this cycle.

Fig. 3. Instantiation of the least-commitment decision cycle in Soar.
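As a minimal sketch of the decision cycle just described (the function and the agent interface are assumptions, not Soar's API), each pass makes one small, incremental update within its time slice:

```python
import time

def run_decision_loop(agent, sense, act, hz=20):
    """Skeleton of a least-commitment decision loop running at roughly 20 Hz."""
    period = 1.0 / hz  # roughly a 50 ms time slice
    while agent.running:
        start = time.monotonic()
        agent.beliefs.update(sense())        # refresh situational understanding
        agent.revise_goals()                 # goal management: desires -> committed goals
        action = agent.select_next_action()  # knowledge-driven pattern match, not search
        if action is not None:
            act(action)                      # action management: execute one step of the plan
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```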

Soar implements a state-based approach to LCR, but this is not to be confused with the way the term “state” is used for finite-state machines (FSMs). In FSMs, each state must contain a small number of situational variables, and the number of states in an FSM explodes combinatorially with the complexity of the problem domain. A Soar “state” instead contains some arbitrary number of features (possibly thousands) that describe the system’s current situational awareness and current goals. These features serve as input to the pattern-matching engine, which retrieves knowledge to create new beliefs, create new goals, or suggest new actions to take in pursuit of existing goals. During each instance of the decision cycle, a Soar agent retrieves knowledge to compute inferences (entailments and associations) from the current state, to retrieve candidates for the next “action decision” (which may be an action to create a new belief, to create a new goal, or to send output to the environment), to select a single next action, and then to execute that action. In addition to providing efficient native support for LCR, Soar contains a number of subsystems to support psychological relevance of implemented cognitive systems. Primary among these are learning and memory subsystems that support reinforcement learning, semantic memory and learning, and episodic memory and learning. For CyCog, episodic memory can assist in discovering unknown side effects of individual

decisions. Semantic memory can help in generalizing side effects into abstract patterns. Reinforcement learning can assist in making probabilistic evaluations of likely outcomes of competing combinations of offensive or defensive decisions. Each of these three mechanisms supports some form of adaptivity or learning. Some have already been exploited within CyCog, and others will lead to extended forms of adaptivity and learning in the future.

Cognitive Adaptivity and Learning

Our initial CyCog agents incorporate fairly simple logic for choosing offensive and defensive actions, and include a basic understanding of how to compromise or defend computer systems. The initial agents provide a conceptual structure in which to organize this knowledge, based on a cognitive analysis of deception, sensemaking, and relevant features of uncertainty and complexity. The initial agents also implement an extensible, first-principles knowledge representation. We are building from this base representation and additional analyses to implement agent extensions that naturally synthesize complex offensive and defensive strategies.

Our work on the generalized CyCog framework has also focused in part on building a relatively thorough (albeit scaled-down) body of cyber-relevant knowledge, and developing a grammatical model of cyber operations to facilitate the continual updating of agent knowledge with new goals, subgoals, and TTPs. Our companion analysis of cyber operations and associated workflows suggests that we can capture much of the structure of cyber strategy as a formal model. The key focus is to develop an ontological model that relates abstractions about tactical approaches, and constraints that govern how they may be combined, to types and permissible sequences of specific tactics. One key to this approach is the modularization of knowledge units that can then be recombined at run-time to generate novel attacks from first principles, reasoning as appropriate about possible moves at multiple levels of abstraction. This approach also forms the foundation for long-term tactical adaptivity through learning mechanisms that minimize the need for model engineers to update the agents manually through time.

Although most work on CyCog to date has emphasized expert models (as opposed to learning models), we have also performed initial investigations into four types of learning within the cyber-operations domain:

• Inferential learning is the result of Soar's goal-driven reasoning process, and occurs when this process derives a new high-level insight from previously known low-level steps. In the Soar architecture, this capability is called "chunking." It is analogous to the learning that a trainee experiences on working through a proof: the trainee "knows" all the pieces, but inferential learning leads her to recognize their implications. In our initial agents, inferential learning takes place when the agents dynamically compile multiple steps (possibly modules taken from multiple TTPs) to generate a single, coherent attack.
• Observational learning results from agent interaction with the external world. For example, our attacker agent can deploy a distributed denial of service attack for the goal of disrupting a service source, and it may learn from experience that the same action can also accomplish the goal of disrupting a service channel when the defensive response includes blocking traffic from the attack addresses. Architecturally, Soar supports this kind of reasoning through a combination of its semantic and episodic memories.
• Abductive learning is reasoning from observations to the best explanation. This mechanism may allow the agents to infer causal explanations for observed adversary responses, and then use these causal models to innovate new attacks or defenses. The inference of causal models is an essential example of adaptivity among human cyber operators, and it serves as an essential part of the "dance" between attackers and defenders.
• The development of a formal knowledge representation also opens the door to instructional learning, in which a human supervisor can coach individual instances of the agents. We are investigating this form of learning in the context of prior work on taskability and interactive task learning [4, 31].

In all four types of learning, as an agent instance learns, it can share its knowledge with other agent instances, either on the same problem or across multiple scenarios, so that the agent knowledge bases increase in capability over time with minimal need for support by model builders.

4.2 Adaptivity of Synthetic Role-Player Sophistication

As we have suggested, the most effective training environment would serve an entire cyber ecosystem containing a spectrum of actors and role players. For training purposes, a key dimension along which to vary agent behavior is in the level of sophistication of the agents populating the ecosystem. Trainees should receive training experiences containing a variety of levels of competence in terms of attackers, network users, and possibly also collaborative defenders. In addition to simply providing a rich ecosystem, levels of sophistication can be dynamically adapted to suit the level of competence of the trainee or to focus lessons on particular pedagogical goals. Training dynamics involves two issues:

1. Estimating the proficiency of the trainee
2. Managing the flow of training.

Conventional methods for estimating proficiency focus on explicit testing, which interrupts the flow of the training experience. Conventional methods for managing the flow of training are built around pre-defined lesson modules, which do not readily adapt as the trainee learns. Our colleagues have pursued two capabilities developed in other projects [5], which we have adapted for training in cyber operations: assessing proficiency by observing the trainee's actions without explicit tests, and dynamic tailoring of the exercise as it evolves. The Dynamic Tailoring System is another type of agent that interacts with the ecosystem agents (attackers, defenders, and users) to modulate their behavior (e.g., the range of attack tools available to the attacker; the complexity of the attack, the sophistication of the attack goals or techniques) as a function of the trainee's observed proficiency.

During our development of prototype training systems for cyber operations, we focused particularly on the adaptivity facilitated by Dynamic Tailoring (DT). Dynamic Tailoring is the process of adapting training content and gameplay dynamically based
on the observed needs of the individual trainees in the training environment. For example, in a social training simulation the system may automatically demonstrate content to a novice trainee who has demonstrated difficulty.

Fig. 4. Dynamic tailoring components and system input and outputs.

Research shows that immediate feedback on errors is a more effective training technique than delayed feedback [28]. Many training and simulation-based environments provide feedback at the conclusion of an exercise – reducing the potential learning effect. Dynamic Tailoring is capable of providing timely feedback, which can lead to a more effective training experience. The Dynamic Tailoring component is responsible for capturing the state of the trainee, including the trainee’s assessed proficiencies. The DT accomplishes this by building a memory model of the trainee’s most recent actions, assessment of those actions, training strategy, and scenario goals. Based on this information the DT is able to assess trainee performance in real time. As illustrated in Fig. 4, the Dynamic Tailoring component monitors multiple inputs and outputs of the system. These external inputs include information on individual CyCog agents representing the attackers, defenders, and users. The DT monitors and modifies the behavior of these agents as required to fulfill the overall scenario objectives. The system collects trainee data encompassing current and past actions, as well as an expert/novice classification. This data is used to tailor the experience based on trainee skill and performance. Scenario Data includes a scenario description, training strategies, and inputs for the external instructor interface. This information defines the scenario and the approach the

DT uses to complete it. Finally, a feedback mechanism provides the DT with the ability to influence external components. These outputs are used to change the simulation or provide prompting to the trainee.

Internally, the Dynamic Tailoring component has three core components for managing the exercise and monitoring the trainee. A Scenario Manager monitors the overall scenario status, controlling the declaration of success and failure. A Cognitive Monitor observes the cognitive status of the trainee and the CyCog agents working within the ecosystem. The third component is a trainee model that houses information on individual trainee performance. This component evaluates individual trainee actions and assesses their effectiveness. Based on these evaluations, DT can adjust the difficulty level of a scenario. Such adjustments can include altering the number of cyber attackers, changing the network topology, spawning or closing system vulnerabilities, and configuring the level of sophistication of attacks and counterattacks.

The primary method for adapting attack sophistication is to adjust the complexity, aggressiveness, or other features of the goals and TTPs that are assigned to CyCog attacker agents. To foster this type of adaptivity, we have pursued a formal mapping of TTPs (informal descriptions of tactics used largely by military personnel) to the formal knowledge representation language we invented for CyCog and other Soar-based intelligent agents. The term TTP is pervasive in the cybersecurity literature. Despite this ubiquity, there are no clear definitions allowing the community to differentiate tactics, techniques, and procedures. While ambiguity and imprecision when referring to TTPs are usually not problematic among security professionals, they are a significant impediment to using these concepts in autonomous systems. This problem manifested itself while developing actionable behavioral models of offensive cyberspace operations in the CyCog attacker agent. Our goal has been to ground TTPs in a semantic representation that enables adversarial behavior modeling and autonomous decision-making, reasoning, and learning. This representation will also allow translation of varieties of (informal) TTPs into formal attacker goals and methods, at varying levels of complexity, which DT can then adjust to foster effective training.
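A highly simplified tailoring rule of this kind is sketched below; the thresholds and scenario fields are assumptions chosen only to show how an assessed proficiency could drive the adjustments listed above.

```python
def tailor_scenario(scenario, assessed_proficiency):
    """Nudge scenario difficulty toward the trainee's assessed proficiency (0.0-1.0)."""
    if assessed_proficiency > 0.75:
        scenario["num_attackers"] += 1
        scenario["attacker_sophistication"] = "advanced"
    elif assessed_proficiency < 0.35:
        scenario["num_attackers"] = max(1, scenario["num_attackers"] - 1)
        scenario["open_vulnerabilities"] += 1  # give the trainee more to detect
        scenario["attacker_sophistication"] = "basic"
    return scenario

print(tailor_scenario({"num_attackers": 2, "open_vulnerabilities": 3,
                       "attacker_sophistication": "intermediate"}, 0.8))
```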

4.3 Adaptivity of Synthetic Role-Player Personality

Even the most adaptive and capable automated agents are of limited use if training exercise planners cannot configure and task them according to the exercise objectives. However, it is not reasonable to expect cyber training experts to also be engineering experts. Thus, it is necessary to develop an engine to translate training adaptations into ecosystem and agent parameter settings. CyCog agents are driven in part by their knowledge base of goals and TTPs, and in part by a fairly large number of “personality” parameters that govern decision making. These parameters can include preferences for which types of attacks to use, which attacker tools to use, how stealthy or deceptive to be, and many others. The result is a large and complex amount of formal information that may need to be specified to configure each agent in the ecosystem. Instructors must be able to avoid such fine-grained configuration, instead expressing high-level configuration preferences that can be automatically refined into individual parameter settings. To support these types of training adaptations, we are exploring AI

techniques and tools for collaborative planning and decision support to reduce the complexity of configuring and tasking the CyCog agents (see examples in Fig. 5).

Cyber operations are complex, and any approach to adapting the personalities of agents in the ecosystem cannot hope to ignore that complexity. It is important that we create tools that assist exercise planners in configuring, deploying, and tasking the agents, without hiding or obscuring access to key functionality. However, we recognize that instructors do run large-scale, complex training scenarios without having to specify every minute interaction and configuration option. Our approach is to analyze the existing workflow of instructors running exercises with human role players, and to adapt that workflow to the synthetic CyCog agents. We are using cognitive task and workflow analysis techniques to model how exercise planners currently brief human role players and relay mission objectives, tasks, and constraints. Based on the model we are developing for planner/role-player interaction, we plan in the near future to implement a proof-of-concept approach to configuring CyCog agents.

Matching the interaction between instructors and human role players, we are adopting a mixed-autonomy approach. For portions of the training exercise that the instructor wishes to "micromanage", they will be allowed to specify configuration options to any desired level of detail. For the more usual case, where role players must appropriately interpret the "instructor's intent", the CyCog agents will similarly make appropriate inferences about reasonable configuration options that meet the requirements of the instructor's high-level intentions. In general, we envision scenario configuration as a collaborative, iterative process in which the exercise planner specifies tasks and constraints, and the support system recommends changes based on conflicts, errors, and omissions. However, there are other interaction models that may prove to be more appropriate, such as the mission planner specifying generalized requirements and allowing the agents to derive specific parameters based on those general guidelines.

Fig. 5. Guided configuration of CyCog agents

In addition to providing tools to map an instructor's intent to individual parameter values, we must thoroughly and formally define the configuration parameters of the agents in the first place. Throughout our design of CyCog, we have seized every opportunity for data-driven configurability through the application of two techniques: externalization [23] and parameterization [2].

Externalization is the process of making explicit those aspects of the knowledge base (such as data structures and parameters) that can be usefully shared outside the core system. CyCog already externalizes a number of parameters that allow data-driven specification of goals and decision-making preferences, and CyDaR is another example of such an externalization. By externalizing information, it becomes much easier to create tools that can support configuration without having to make changes to the underlying software. This in turn fosters mechanisms for improving the adaptivity of training.

Parameterization is the process of moving from special-purpose to general-purpose implementations that can be configured to behave differently by supplying different parameters. It is the nature of complex intelligent systems that there are many different types of knowledge that cannot easily be generalized into parameter-driven patterns. However, there are also large portions of knowledge that can be generalized and parameterized, and we aggressively pursue every opportunity to do so when engineering the CyCog agent knowledge bases. It turns out that such parameterization efforts serve three important goals: they make knowledge bases easier to extend, they make it easier to develop agents that intelligently adapt their own parameters, and they make it easier to develop training-oriented adaptations. Thus, parameterization is already widely used throughout the CyCog framework, and our efforts at this point are to extend the use of the parameters to foster further types of adaptation.
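The sketch below shows what an externalized parameter set and a generic, parameterized configuration step might look like; the parameter names and values are illustrative assumptions rather than CyCog's actual externalized format.

```python
import json

# Externalized "personality" and preference parameters, kept outside the agent's
# knowledge base so tools, instructors, or Dynamic Tailoring can adjust them
# without changing agent code.
external_params = json.loads("""
{
  "role": "attacker",
  "stealth_preference": 0.8,
  "aggressiveness": 0.3,
  "preferred_attack_classes": ["phishing", "credential_theft"],
  "tool_whitelist": ["port_scanner", "password_cracker"]
}
""")

def configure_agent(agent, params):
    """Apply externalized parameters to a parameterized agent instance."""
    for name, value in params.items():
        setattr(agent, name, value)
    return agent
```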

5 Discussion and Conclusions

The ability to adapt is a key property of most behaviors that we would be willing to call "intelligent". Adaptivity is also a key component of training, because learning is fundamentally change, and effective training must adjust to the level of competence of a trainee. For the domain of cyber operations, in particular, adaptivity plays an even more prominent role, because the domain tasks themselves deal with constantly changing technology and tactics, as well as responding to (or preempting) the constantly changing actions that others are taking to achieve their operational goals.

We have developed cognitive models of the major players involved in cyber operations (attackers, defenders, and users). We have also developed a number of approaches to adaptivity in training for interactive domains. We are in the midst of research efforts to combine the two, extending our cognitive modeling and training systems to support effective training in cyber operations. We have enumerated a variety of types of adaptivity that play a role in these research efforts, including adaptation of tactics and techniques, sophistication of scenario actors and role players, and personality parameters (broadly construed) of scenario actors and role players. The goal of all of these forms of adaptivity is to realistically represent the dynamic nature of cyber operations, while simultaneously providing an effective training framework that tailors trainee experience to specific pedagogical goals associated with each trainee.

Our current and future research plans aim to overcome a number of challenges to building a cost-effective but realistic cyber-operations ecosystem that supports domain-level and training-level adaptivity. We are particularly focused on methods of expanding the depth and breadth of knowledge for the role-player models. This includes efforts to develop knowledge representations that ease the acquisition of subject-matter expertise, as well as the engineering of that knowledge into composable goal hierarchies and formal definitions of TTPs. We are also pursuing research on learning and abductive inference methods to build robust models of self-adaptation, based on the ability to explain unexpected outcomes and novel experiences. Farther in the future, we hope to incorporate formal models of deception, which are a key component of advanced cyber operations, and open up an additional category of behavioral adaptivity [3].

References 1. Bowen, B.M., Stolfo, S.J., et al.: Measuring the human factor of cyber security. In: Homeland Security Affairs, IEEE 2011 Conference on Technology for Homeland Security: Best Papers, Suppl. 5 (2012) 2. Goguen, J.A.: Parameterized Programming. IEEE Trans. Software Eng. 10(5), 528–543 (1984). https://doi.org/10.1109/TSE.1984.5010277 3. Henderson, S., Hoffman, R.R., Bunch, L., Bradshaw, J.: applying the principles of magic and the concepts of macrocognition to counter-deception in cyber operations. In: Proceedings of the 12th International Meeting on Naturalistic Decision Making. MITRE Corp., McLean, VA, June 2015 4. Huffman, S.B., Laird, J.E.: Instructo-Soar: learning from interactive natural language instructions (Video Abstract). In: Proceedings of the AAAI, p. 857 (1993) 5. Folsom-Kovarik, J.T., Newton, C., et al.: Modeling proficiency in a tailored, situated training environment. In: Proceedings of the Conference on Behavior Representation in Modeling and Simulation (BRIMS 2014) (2014) 6. Joint Chiefs of Staff. Joint Pub 3-12: Cyberspace Operations. Joint Chiefs of Staff (2018) 7. Jones, R.M., Furtwangler, S., van Lent, M.: Characterizing the performance of applied intelligent agents in Soar. In: Proceedings of the 2011 Conference on Behavior Representation in Modeling and Simulation (BRIMS), Sundance, UT (2011) 8. Jones, R.M., Laird, J.E., Nielsen, P.E., Coulter, K.J., Kenny, P., Koss, F.V.: Automated Intelligent Pilots for Combat Flight Simulations (1999) 9. Jones, R.M., Marinier, R.P., Koss, F.V., Bechtel, R.: Tactical behavior modeling for ground vehicles. Presented at the SAE World Congress Experience (2017). https://doi.org/10.4271/ 2017-01-0261 10. Jones, R.M., et al.: Modeling and integrating cognitive agents within the emerging cyber domain. In: Proceedings of the Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), vol. 20. Citeseer (2015) 11. Jones, R.M., Wallace, A.J., Wessling, J.: An intelligent synthetic wingman for army rotary wing aircraft. In: The Interservice/Industry Training, Simulation and Education (2004) 12. Jones, R.M., Wray, R.E.: Comparative analysis of frameworks for knowledge-intensive agents. AI Mag. 27(2), 45–56 (2006) 13. Klein, G.A.: Sources of Power: How People Make Decisions. MIT Press, Cambridge (1998)


14. Kokar, M.M., Endsley, M.R.: Situation awareness and cognitive modeling. IEEE Intell. Syst. 27(3), 91–96 (2012). https://doi.org/10.1109/MIS.2012.61
15. Lathrop, S., Hill, J.M.D., et al.: Modeling network attacks. In: Proceedings of the 12th Conference on Behavior Representation in Modeling and Simulation (BRIMS 2003) (2003)
16. Laird, J.E.: Millions of rules, billions of decisions. Presentation at the 29th Soar Workshop (2009)
17. Laird, J.E.: The Soar Cognitive Architecture. MIT Press, Cambridge (2012)
18. Laird, J.E., Newell, A., Rosenbloom, P.S.: SOAR: an architecture for general intelligence. Artif. Intell. 33(1), 1–64 (1987). https://doi.org/10.1016/0004-3702(87)90050-6
19. Laird, J.E., Voigt, J., Derbinsky, N.: Performance evaluation of declarative memory systems in Soar (2010). Manuscript submitted for publication
20. Maymí, F.J., Lathrop, S.D.: AI in cyberspace beyond the hype. Cyber Defense Rev. 3(3), 71–82 (2018)
21. Newell, A.: Unified Theories of Cognition. Harvard University Press (1990)
22. Nielsen, K.: Secretary Kirstjen M. Nielsen's National Cybersecurity Summit Keynote Speech, 31 July 2018. https://www.dhs.gov/news/2018/07/31/secretary-kirstjen-m-nielsen-snational-cybersecurity-summit-keynote-speech. Accessed 19 Jan 2019
23. Nonaka, I., Takeuchi, H., Umemoto, K.: A theory of organizational knowledge creation. Int. J. Technol. Manage. 11(7–8), 833–845 (1996)
24. Parrish, K.: Cyber May Be Biggest Threat, Hagel Tells Troops (2013). http://www.defense.gov/news/newsarticle.aspx?id=120178
25. Pellerin, C.: Cyber Tops Intel Community's 2013 Global Threat Assessment (2013). http://www.defense.gov/News/newsarticle.aspx?ID=119776
26. Pfleeger, S.L., Caputo, D.D.: Leveraging behavioral science to mitigate cyber security risk. Comput. Secur. 31(4), 597–611 (2012)
27. Ponemon Institute: Cost of a Data Breach Study: Global Overview. Ponemon Institute (2018)
28. Shute, V.J.: Focus on formative feedback. Rev. Educ. Res. 78(1), 153–189 (2008)
29. Taylor, G.E., Sims, E.M.: Developing believable interactive cultural characters for cross-cultural training, San Diego (2009)
30. Taylor, G.E., Stensrud, B.S., Eitelman, S., Durham, C., Harger, E.: Toward automating airspace management (2007)
31. van Lent, M.: Learning task-performance knowledge through observation. Ph.D. thesis, University of Michigan, Department of Electrical Engineering and Computer Science (2000)
32. van Lent, M., Fisher, W., Mancuso, M.: An explainable artificial intelligence system for small-unit tactical behavior, pp. 900–907 (2004)
33. Weld, D.S.: An introduction to least commitment planning. AI Mag. 15(4), 27 (1994)

Consideration of a Bayesian Hierarchical Model for Assessment and Adaptive Instructions

Jong W. Kim1 and Frank E. Ritter2

1 ORAU and US Army CCDC Soldier Center STTC, Orlando, FL 32826, USA
[email protected]
2 Pennsylvania State University, University Park, PA 16802, USA
[email protected]

Abstract. People appear to practice what they do know rather than what they do not know [1], suggesting the need for an improved assessment of multilevel complex skill components. An understanding of changing knowledge states is also important in that such an assessment can support instruction. Changing knowledge states can generally be visualized through learning curves. These curves are useful for identifying and predicting the learner's changing knowledge states in multi-domains, and for understanding the features of task/subtask learning. Here, we provide a framework based on a Bayesian hierarchical model that can be used to investigate learning and performance in the learner and domain model context, in particular a framework to estimate learning functions separately in a psychomotor task. We also take a production rule system approach (e.g., ACT-R) to analyze the learner's knowledge and skill in tasks and subtasks. We extend the current understanding of cognitive modeling to better support adaptive instruction, which helps to model the learner in multi-domains (i.e., beyond the desktop), and provide a summary of estimating the probability that the learner has learned each production rule. We find the framework useful for modeling the learner's changing knowledge and skill states by supporting an estimate of the probability that the learner has learned each knowledge component, and by comparing learning curves with varying slopes and intercepts.

Keywords: Assessment · Learning curves · Psychomotor skill · Bayesian hierarchical model

1 Introduction

Adaptive Instructional Systems (AISs) are intended to help the learner acquire knowledge and skills, practice them, and achieve expertise through a progression of stages that can be visualized as a learning curve, which has been shown to reflect learning in various task domains, including procedural troubleshooting tasks, mathematics, physics problem solving, and so on. Learning curves are useful for visualizing performance changes and for evaluating AISs. In particular, as a formative assessment, where a student is being taught about a concept, a fact, or a task,


an adaptive instructional system should appropriately assess the student's learning progress (or changing knowledge states) and properly provide the student with personalized instructional content. This is a difficult issue, but one that is essential for advancing the learning experience. As an assessment of performance changes, comparing two scores is a simple method. The two scores might be pre- and post-test results comparing two learning materials, performance before and after a unit of work, or scores after a period of time. This is called a summative assessment. Alternatively, the learner can be assessed during a course of learning, rather than at the end of a course, which helps us to see the progress of skill development. This progress (i.e., the changing knowledge states produced by deliberate practice) can be visualized and summarized as a learning curve. This formative assessment tool can be useful in military training. For example, soldiers are instructed to perform varying skill components. In general, a set of subtask skills for a psychomotor task would include a physiological control skill (e.g., deep slow breathing while determining the point of aim against a moving target during marksmanship training). Whether this skill set is being practiced correctly is not overtly identifiable, which necessitates a new look at assessment through learning curves.

1.1 The Learner and Domain Model

The domain model can be considered a repository of knowledge and skills. That is, a domain model is a representation of knowledge for a task, including domain content (scenarios, problems, or knowledge components), a learner model (covering both novices and experts), common misconceptions, and tactics/actions that can be taken by an intelligent tutoring system to help the learner engage in an optimized learning environment [see 2, p. 1]. The learner model specifies how the learner acquires knowledge and skills in the domain model through the stages of learning (e.g., from declarative to procedural stages). The domain and learner models can be comprised of knowledge components (e.g., declarative and procedural knowledge in ACT-R), a term used to generalize descriptions of pieces of cognition or knowledge, including production rules, facts, principles, concepts, and schemas [3]. Knowledge components can be authored manually based on a modeling framework (e.g., rule-based or constraint-based). Domain models created by experts can sometimes be wrong, and thus need to be evaluated against learning curve data, that is, whether error rates decrease through practice [4]. One kind of knowledge component model is the rule-based cognitive model, which can be computationally runnable [5]. A production-rule-based model can help in thinking about what knowledge may be needed to perform a particular task, how that knowledge might be decomposed to capture what the learner would do, and how widely specific knowledge components will transfer [6]. The different domains (e.g., cognitive, psychomotor, and social domains) can affect the way we approach domain and learner modeling. In the cognitive domain, one of the objectives of the learner would be to acquire the maximum number of knowledge components specified in the domain model. In the psychomotor domain, the learner model would need a more finely tuned domain model. For example, a novice golfer would acquire


knowledge and skills about a putting task, and this knowledge and these skills can be modeled using a rule-based system. The learner would go through cognitive processes (e.g., judging the line of the ball) before making an action (e.g., hitting the ball), while controlling his/her breath to increase accuracy. The task would be completed in seconds. Compared to other cognitive tasks (e.g., solving an algebra question), the golf putting task requires a finer granularity of instruction about how to successfully and fluently coordinate the cognitive, physiological, and physical processes.

The knowledge components based on the learner (domain) models can shape the behavior of an adaptive instructional system. Thus, the efficiency of maintaining and updating the knowledge components and their associated learner (domain) models plays a considerable role in the design of the grain size of adaptivity and in the behavior of the AIS. Authoring tools can also be improved by providing mechanisms to maintain and update the learner models with knowledge components, and to effectively alter aspects of the instruction as well [6].

1.2 The Adaptive Level and Assessment

Adaptive instruction can arise from an understanding of variation by individuals and by tasks/subtasks. As a formative assessment technique, learning curves can provide important insights into how to assess performance and define the adaptive levels. Educational outcomes generally reside in a time band of hours, months, and years, whereas outcomes in cognitive learning theory lie in the time band of seconds or milliseconds. For a meaningful assessment in training, it is necessary to consider the time band of human actions, ranging from milliseconds to years [7, p. 122]. It has been argued that there is a significant gap in the analysis of performance changes [8], and the gap can be framed by Newell's time scale of human action. We need an appropriate level of granularity and relevant theories [e.g., 3]. Learning curves have also been used in industry to predict the time (or cost) to produce a product [9]. In cognitive science and education, learning curves have been used to investigate practice effects for a given knowledge representation in human memory systems [e.g., 10]. In both cases, learning curves generally follow a log-linear model [11, 12]. In general, task completion time follows a power law of learning, representing a speed-up effect. One known pitfall of learning curves is that a larger domain model or a larger student sample is likely to exhibit a better fit than a smaller one, even if the system does not teach the students any better [13]. For example, a larger task with a large sample size and a sufficient amount of practice can still exhibit a power law of learning, yet some subtasks will be learned differently and some will be learned more slowly than others [14]. Thus, a simple learning curve analysis of a large task is not sufficient to define an instructional strategy. Furthermore, a near-term assessment based on comparing learning curves may not be related to the long-term stability of learning [e.g., 15]. It is therefore necessary to use a probabilistic model and its parameter estimation. In particular, a Bayesian hierarchical model is useful because it supports multilevel structures of variables to model variation explicitly; inappropriate averaging to construct variables can remove variation, leading to inappropriate certainty about the data and its handling [e.g., 16, p. 356].
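To make the log-linear relationship concrete, the following is a minimal sketch (not taken from the paper; the practice data are invented for illustration) of fitting a power law of practice, T = a · N^(-b), to completion times by linear regression in log-log space.

```python
import numpy as np

# Hypothetical completion times (seconds) over successive practice trials.
trials = np.arange(1, 11)
times = np.array([12.0, 9.1, 7.8, 7.0, 6.4, 6.1, 5.7, 5.5, 5.3, 5.2])

# A power law T = a * N**(-b) is linear in log-log space:
#   log(T) = log(a) - b * log(N)
slope, intercept = np.polyfit(np.log(trials), np.log(times), deg=1)
a, b = np.exp(intercept), -slope

print(f"Fitted power law: T = {a:.2f} * N^(-{b:.2f})")
```

The fitted exponent b summarizes the learning rate for a single curve; comparing such fits across subtasks and learners is precisely where the Bayesian hierarchical model discussed below becomes useful.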


Decomposing the Task to Subtasks. In psychology, there is the Reducibility Hypothesis [17]. Simply put, it states that a larger task can be meaningfully and functionally decomposed into smaller unit tasks. Thus, a subtask (on the order of seconds) might be relevant to educational outcomes, and experimentation that investigates human attention at the millisecond level has meaningful implications for tasks of longer duration. The adaptive level is related to how a task is decomposed into smaller tasks. A psychomotor task can be decomposed into cognitive subtasks, sensorimotor subtasks, and physiological control subtasks (like a tactical breathing subtask within a larger psychomotor task). This is useful in designing an assessment tool for an adaptive instructional system because a large task can be meaningfully decomposed and monitored.

Time Bands. To define the adaptive levels in an instructional system, it is useful to consider behavior with respect to a time scale. Newell's seminal work, Unified Theories of Cognition, proposed a time scale of human action [7, 8]. Newell described four bands (Biological, Cognitive, Rational, and Social), spanning from roughly 100 ms to 100 h and beyond, up to the 10,000 h of deliberate practice often associated with becoming an expert. This has inspired efforts in the AIS community, and is seen in Anderson's and Koedinger's work. It has been pointed out that there is a gap between millisecond-level experimentation in cognitive psychology and month/year outcomes in education [8]. Anderson suggested using Newell's time scale of human action: in his argument, the learner model resides in Newell's Cognitive Band (100 ms to 10 s), while educational outcomes lie in the Social Band, ranging from days to months. He argues that a cognitive modeling approach can bridge the gap in terms of the Reducibility Hypothesis. Koedinger and his colleagues proposed a cognitive-science-based framework, called Knowledge-Learning-Instruction (KLI), in an attempt to promote high-potential principles for generality [3], also referencing Newell's time scale of human action. Firing one production rule is assumed to take 500 ms (0.5 s). The Cognitive Tutor uses a set of production rules as a learner model; reading time can be modeled using a set of production rules and plotted on a log-log scale, representing a power law of learning. Educational outcomes would lie in the Social Band (days to months), while the granularity of adaptive instruction can go down to the Cognitive Band, with production rules firing in milliseconds. It is claimed that cognitive modeling can provide a basis for bridging between events on the small scale and desired outcomes on the larger scale.

In this paper, we summarize the aspects of knowledge and skills that can be represented in a domain and learner model within a cognitive architecture. Knowledge and skills can be referred to as knowledge components that are grounded in different theories of cognition. Based on the knowledge components in the learner (domain) model, learning curves (e.g., log-linear models) are modeled and compared to summarize performance changes using the Bayesian hierarchical modeling approach. We report a case study of learning a psychomotor task (golf putting) in a multi-domain context to test the proposed framework. Limitations and lessons learned are discussed at the end in order to improve our understanding of the learner and domain context in the development of AISs.


2 Constructs of Knowledge and Skills in a Psychomotor Task

Identifying what is learned is useful for designing instruction. The unobservable nature of knowledge and limited scientific tools lead to a need for cognitive task analysis [3]. It has been recognized that cognitive architectures have played an important role in understanding the level of knowledge and its learning [18]. We have taken a cognitive science approach to fathom the unobservable nature of knowledge and focus on learning a psychomotor task: a golf putting task [19, 20]. Learning a psychomotor task can be considered, simply, as achieving fluent coordination of the cognitive, physiological, and physical systems of the human body. For example, a golf putting task consists of the following subtask skills: (a) cognitive skills, including "judge the line of the ball", (b) a physiological control skill, including "slow breathing", and (c) a physical skill, including "hitting the ball". These systems need to be executed interdependently to produce accurate performance. We describe our theoretical base for the construct of knowledge and skills that is usable for an adaptive instructional system.

2.1 The Declarative to Procedural (D2P) Construct

The declarative to procedural (D2P) construct for knowledge and skills is based on the learning process implemented in ACT-R [21], with an understanding of the KRK learning and retention theory [22]. Learning theories [e.g., 12, 23–25] suggest a consensus understanding that a learning process consists of a number of stages, from acquiring declarative knowledge to forming procedural knowledge through practice, as shown in Fig. 1.

Fig. 1. The KRK theory of skill learning and retention [26].

Several learning processes are implemented in the ACT-R cognitive architecture. Facts of task knowledge are encoded and stored in declarative memory. With practice, the learning process converts the acquired knowledge into a procedural form of


knowledge based on both the activation mechanism of declarative chunks and production rule compilation [18]. As seen in Fig. 1, in the first stage, knowledge in declarative memory is strengthened by practice or degraded by lack of use in terms of the activation mechanism. In the second stage, declarative knowledge is compiled into procedural knowledge, and the task knowledge is represented as a mix of declarative and procedural memory. With lack of use, declarative knowledge is forgotten, leading to missed steps and mistakes, while procedural knowledge is essentially immune to decay. In the third stage, task knowledge is available in both declarative and procedural forms, but procedural knowledge predominantly produces performance.

2.2 Knowledge Components

The D2P construct is related to knowledge components (KCs) in the Knowledge-Learning-Instruction framework proposed by Koedinger and his colleagues [3]. A KC is defined as an acquired unit of cognitive function or structure inferred from performance on a set of related tasks. The main purpose of using KCs is to generalize the terms for describing concepts, facts, cognition, and knowledge that are represented as production rules with declarative knowledge. Koedinger and colleagues note that many KCs represent mental processes at the 10-s unit-task level in Newell's time band. However, we recognize that these KCs are not directly related to a psychomotor task; they are mostly limited to cognitive tasks. Unit tasks usually last about 10 s, and a single golf putt can be finished in less than 10 s. Decomposing KCs in a psychomotor task would require smaller time scales to provide an improved assessment. The KLI framework focuses on the analysis of academic learning (e.g., geometry, multiplication, etc.); it appears to have only a loose linkage to psychomotor domain learning (e.g., tennis, golf, archery, etc.). Knowledge components can be characterized by Newell's time bands, and can also be characterized by the properties of their application conditions and the student responses [3]. We take such an approach to summarize the properties of application conditions and responses for a golf putting learning domain, shown in Table 1. The relationship between application condition and response can be constant to constant, variable to constant, constant to variable, or variable to variable. For example, some KCs are applied under constant conditions, when there is a single unique pattern to which the KC applies, while others are applied under variable conditions, indicating there are multiple patterns to which a KC applies. In general, perceptual category learning can have a variable-constant KC type, which is essentially a category-recognition rule with many-to-one mappings. More complex production rules can be described by the variable-variable KC type. In our task domain, practicing a slow breathing technique corresponds to a variable-variable KC. The conditions and responses can then be further specified by whether they are verbal or non-verbal, indicating whether they are expressed in words or not [3]. This is similar to the binary memory classification of declarative and procedural knowledge [18].


Table 1. Examples of different kinds of knowledge components in a golf putting task.

| Knowledge/Skill category by subtasks | Application conditions | Response | Example |
|---|---|---|---|
| Cognitive: JudgeLineOfBall | Variable | Constant | Assess → Determine the line |
| Cognitive: JudgeGrainTurf | Variable | Constant | Assess variable conditions → Determine the property of turf grain |
| Cognitive: JudgeDistanceToHole | Variable | Constant | Estimate the distance from the ball to the hole → Decide the distance |
| Physiological: DeepSlowBreathing | Variable | Variable | Breathe in 4 s, hold 4 s, breathe out 4 s, and then hold 4 s |
| Physical: PositionBallBtwnCenterOfFeet | Variable | Constant | Assess the position → Position |

3 The Framework

Based on the insights about unobservable knowledge and its states, we need to fit learning functions separately for different knowledge components so that we can identify different learning mechanisms: for example, insights versus gradual accumulation of knowledge, strategy shifts that alter the subtask structure of the task, or subtask trade-offs, where the learner spends more time on one subtask in a way that reduces the time to complete another subtask. The D2P construct appears limited to implementing models of subtasks under the assumption of consistent learning, as do most intelligent tutors. Our approach, in contrast, can be useful to extend the understanding of both consistent and inconsistent learning.

3.1 Bayesian Hierarchical Models

We can consider a simple hierarchical model for golfers ($i = 1, \ldots, n$), nested within golf handicap groups (i.e., groups by golf handicap, which is how many strokes over par the golfer averages). The response variable ($y_{ij}$) can be the distance to the target from the ball hit by the golfer. We assume that practice helps the golfer reduce this distance as much as he/she can. We have a single predictor, the unit of practice (e.g., days or months; trials might be more appropriate in some cases but are less practical to compute). We assume that the error term ($\epsilon_{ij}$) is normally distributed with mean zero and unknown standard deviation.


$$y_{ij} = a_j + b_j x_{ij} + \epsilon_{ij} \qquad (1)$$

$$a_j = \mu_a + u_j, \qquad b_j = \mu_b + v_j, \qquad \epsilon_{ij} \sim N\left(0, \sigma_y^2\right)$$

In Eq. (1), we have variation in the $a_j$ and $b_j$. A correlation parameter $\rho$ can be defined as follows:

$$\begin{pmatrix} u_j \\ v_j \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_a^2 & \rho\,\sigma_a \sigma_b \\ \rho\,\sigma_a \sigma_b & \sigma_b^2 \end{pmatrix} \right)$$

This model can be specified by subtasks and by golfers (nested within handicap groups). Subtasks can include knowledge components specified by cognitive, physiological, and physical properties. This linear mixed-effects model is fit using Markov chain Monte Carlo sampling in Bayesian inference. The proposed model captures varying intercepts for the golfers and subtasks, and varying slopes for the effect of the practice trials over time. The model is useful to summarize learning rates by subtask and by subject. It is emphasized that adaptive instruction can arise from an understanding of variation, so successfully dealing with the varying components in the model is important. Thus, the proposed hierarchical model based on knowledge components is useful for understanding different learning curves by subtask and by subject, as an improved assessment for adaptive instruction.
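As an illustration of how such a model could be fit, the following is a minimal sketch using the PyMC library (an assumption on our part; the paper does not name its software). The data are simulated, the priors are weakly informative choices made only for illustration, and the correlation parameter ρ is omitted for brevity; it could be added with an LKJ prior over the covariance of the intercept and slope offsets.

```python
import numpy as np
import pymc as pm

# Hypothetical data: distance to target (y) per practice unit (x),
# for grouping units indexed by idx (here one index per golfer for simplicity).
n_groups = 8
rng = np.random.default_rng(0)
idx = np.repeat(np.arange(n_groups), 20)
x = np.tile(np.arange(20), n_groups)
y = rng.normal(2.0 - 0.05 * x, 0.3)

with pm.Model() as model:
    # Population-level means of intercepts and slopes (mu_a, mu_b).
    mu_a = pm.Normal("mu_a", mu=0.0, sigma=5.0)
    mu_b = pm.Normal("mu_b", mu=0.0, sigma=1.0)
    # Between-group standard deviations and residual noise.
    sigma_a = pm.HalfNormal("sigma_a", sigma=2.0)
    sigma_b = pm.HalfNormal("sigma_b", sigma=1.0)
    sigma_y = pm.HalfNormal("sigma_y", sigma=1.0)
    # Group-specific intercepts a_j and slopes b_j.
    a = pm.Normal("a", mu=mu_a, sigma=sigma_a, shape=n_groups)
    b = pm.Normal("b", mu=mu_b, sigma=sigma_b, shape=n_groups)
    # Likelihood: y_ij = a_j + b_j * x_ij + eps_ij.
    mu = a[idx] + b[idx] * x
    pm.Normal("y_obs", mu=mu, sigma=sigma_y, observed=y)
    trace = pm.sample(1000, tune=1000, chains=2)
```

The posterior samples for the per-group slopes give the varying learning rates that the framework compares across subtasks and subjects.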

4 Discussion and Conclusions

This paper has briefly discussed theoretical foundations of knowledge and skills (e.g., D2P, ACT-R, the KLI framework), task/subtask decomposition, and the granularity of the adaptive level. As an assessment tool, the framework with Bayesian hierarchical models supports an understanding of learning curves with varying slopes and intercepts. In this discussion, we note some lessons and future work.

4.1 Developing Fluent Knowledge and Skill Components

To help skill development, it is necessary to consider the grain size of adaptive instruction and feedback. As mentioned earlier, a golf putting task can be decomposed into a cognitive subtask, a motor subtask, and, for optimal behavior, a subtask about controlling breathing (a physiologically related subtask). Learning curves for all of these subtasks would vary. In an initial ACT-R model of golf putting [5, 19], the number of production rules is around 20. Learning in this domain involves the acquisition of such production rules.


The production-rule type of knowledge component is useful, and this model can be run to find some sequence of productions that produces the behavior exhibited by the learner. A physio-cognitive model has recently been proposed to represent slow breathing in a psychomotor task as well [20]. The physio-cognitive model appears to provide a rich understanding of the task for adaptive instruction. It is also necessary to consider a timing standard when looking at performance: physiological control in seconds, cognitive thought processes in seconds, and deliberate practice and outcomes in days. A time unit is used across the world as a standard, which helps us communicate with each other regarding time. Similarly, we could propose a standard regarding the granularity of adaptive instruction in an intelligent tutoring system. This would help us to better understand performance changes and their assessment. Newell's time bands [7] are one candidate, and there have been several attempts to verify their usefulness for an intelligent tutoring system [3, 8]. Based on this, an adaptive instructional system would be cognitively and physiologically inspired to produce instructional adaptivity. Adaptations in the appropriate time band could provide a finer granularity of adaptive instruction. This effort would be helpful for achieving cognitive training and brain plasticity [e.g., 27] in an AIS.

4.2 The Usefulness of the Framework: Predicting Readiness

Predicting readiness is important because soldier readiness is one of the top priorities in military training. Soldiers spend massive amounts of their time practicing knowledge and skills (e.g., shooting a static or a moving target). If the number of practice trials is sufficiently large, performance can be predicted by a regularity known as a power law of learning, where the time to complete a task decreases with practice or the number of errors decreases according to a power function [e.g., to note a few, 3, 4, 28, 29]. At the same time, if there is a period of skill disuse, soldiers and squads might not be fully ready for a military mission. A soldier's performance would lie somewhere on the curve shown in Fig. 1, which summarizes a learning and retention theory in ACT-R. There have been important attempts to create such predictions: to note a few, models in the KLM GOMS framework (not including learning) or in the ACT-R cognitive architecture. The KLM models predict expert performance, but they do not model learning; models in ACT-R predict learning. It may still not be fully known whether a soldier is actually ready for a combat mission and task under time stress and fatigue. Furthermore, a power law of learning provides much less predictability when the task is applied to multi-domains (e.g., in the wild or in a synthetic training environment). We and others suggest that psychomotor performance is interrelated with cognitive, physiological, and physical factors [e.g., 19, 30, p. 31, 31]. Predicting soldier and squad readiness is challenging, and it will fundamentally play a crucial role in enhanced soldier lethality. A statistical and probabilistic model of the soldier's changing knowledge and skill state can be useful to identify and predict performance readiness. We have reviewed attempts to monitor the changing knowledge state: one approach is to use Bayesian knowledge tracing [1, 32, 33], and another is to use performance factors analysis [34]. These approaches appear to be limited and unable to describe characteristics beyond the desktop environment and in multi-domains. It would be necessary to extend the existing theories. Based on a preliminary physio-cognitive model


in ACT-R/U, it is worth exploring Bayesian hierarchical model-based estimates of the probability that the learner has learned each of the production rules in the physio-cognitive model, until the learner has reached the procedural stage shown in Fig. 1. This approach can increase the probability of learner and warfighter readiness.
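For comparison with the knowledge-tracing work cited above, the following is a minimal sketch of the standard Bayesian knowledge tracing update for a single knowledge component (e.g., one production rule). The slip, guess, and learning parameters are illustrative values, not estimates from any dataset; the hierarchical extension proposed here would instead estimate such probabilities jointly across learners and subtasks.

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """One Bayesian knowledge tracing step: posterior probability that the
    knowledge component is known given the observed response, followed by
    the learning transition."""
    if correct:
        posterior = (p_know * (1 - slip)) / (
            p_know * (1 - slip) + (1 - p_know) * guess)
    else:
        posterior = (p_know * slip) / (
            p_know * slip + (1 - p_know) * (1 - guess))
    return posterior + (1 - posterior) * learn

p = 0.1  # prior probability that the production rule has been learned
for outcome in [False, True, True, True]:
    p = bkt_update(p, outcome)
    print(round(p, 3))
```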

References

1. Atkinson, R.C.: Optimizing the learning of a second-language vocabulary. J. Exp. Psychol. 96, 124–129 (1972)
2. Sottilare, R.A., Sinatra, A., Boyce, M., Graesser, A.: Domain modeling for adaptive training and education in support of the US Army learning model – Research outline. US Army Research Laboratory (2015)
3. Koedinger, K.R., Corbett, A.T., Perfetti, C.: The knowledge-learning-instruction framework: bridging the science-practice chasm to enhance robust student learning. Cogn. Sci. 36, 757–798 (2012)
4. Goldin, I., Pavlik Jr., P.I., Ritter, S.: Discovering domain models in learning curve data. In: Sottilare, R., Grasser, A., Hu, X., Olney, A., Nye, B., Sinatra, A. (eds.) Design Recommendations for Intelligent Tutoring - Domain Modeling, vol. 4, pp. 115–126. US Army Research Laboratory, Orlando (2016)
5. Kim, J.W., Dancy, C., Goldberg, B., Sottilare, R.: A cognitive modeling approach - does tactical breathing in a psychomotor task influence skill development during adaptive instruction? In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) AC 2017. LNCS (LNAI), vol. 10285, pp. 162–174. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58625-0_11
6. Aleven, V., Koedinger, K.R.: Knowledge component (KC) approaches to learner modeling. In: Design Recommendations for Intelligent Tutoring Systems - Learner Modeling, vol. 1, pp. 165–182. U.S. Army Research Laboratory, Orlando (2013)
7. Newell, A.: Unified Theories of Cognition. Harvard University, Cambridge (1990)
8. Anderson, J.R.: Spanning seven orders of magnitude: a challenge for cognitive modeling. Cogn. Sci. 26, 85–112 (2002)
9. Jaber, M.Y.: Learning Curves: Theory, Models, and Applications. CRC Press, Boca Raton (2016)
10. Anderson, J.R., Fincham, J.M., Douglass, S.: Practice and retention: a unifying analysis. J. Exp. Psychol. Learn. Mem. Cogn. 25, 1120–1136 (1999)
11. Newell, A., Rosenbloom, P.S.: Mechanisms of skill acquisition and the law of practice. In: Anderson, J.R. (ed.) Cognitive Skills and Their Acquisition, pp. 1–55. Lawrence Erlbaum, Hillsdale (1981)
12. Rosenbloom, P., Newell, A.: Learning by chunking: a production system model of practice. In: Klahr, D., Langley, P., Neches, R. (eds.) Production System Models of Learning and Development, pp. 221–286. MIT Press, Cambridge (1987)
13. Martin, B., Mitrovic, A., Koedinger, K.R., Mathan, S.: Evaluating and improving adaptive educational systems with learning curves. User Model User-Adap. 21, 249–283 (2011)
14. Kim, J.W., Ritter, F.E.: Microgenetic analysis of learning a task: its implications to cognitive modeling. In: Proceedings of the 14th International Conference on Cognitive Modeling, Penn State, pp. 21–26 (2016)
15. Schmidt, R.A., Bjork, R.A.: New conceptualizations of practice: common principles in three paradigms suggest new concepts for training. Psychol. Sci. 3, 207–217 (1992)
16. McElreath, R.: Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, Boca Raton (2016)


17. Lee, F.J., Anderson, J.R.: Does learning a complex task have to be complex? A study in learning decomposition. Cogn. Psychol. 42, 267–316 (2001)
18. Anderson, J.R.: How Can the Human Mind Occur in the Physical Universe? Oxford University Press, New York (2007)
19. Kim, J.W., Dancy, C., Sottilare, R.A.: Towards using a physio-cognitive model in tutoring for psychomotor tasks. In: Proceedings of the AIED Workshop on Authoring and Tutoring for Psychomotor, Mobile, and Medical Domains (2018)
20. Dancy, C.L., Kim, J.W.: Towards a physio-cognitive model of slow-breathing. In: Proceedings of the 40th Annual Conference of the Cognitive Science Society, pp. 1590–1595. Cognitive Science Society (2018)
21. Ritter, F.E., Yeh, K.-C., Cohen, M.A., Weyhrauch, P., Kim, J.W., Hobbs, J.N.: Declarative to procedural tutors: a family of cognitive architecture-based tutors. In: Proceedings of the 22nd Annual Conference on Behavior Representation in Modeling and Simulation, pp. 108–113. BRIMS Society (2013)
22. Kim, J.W., Ritter, F.E.: Learning, forgetting, and relearning for keystroke- and mouse-driven tasks: relearning is important. Hum. Comput. Interact. 30, 1–33 (2015)
23. Fitts, P.M.: Perceptual-motor skill learning. In: Melton, A.W. (ed.) Categories of Human Learning, pp. 243–285. Academic Press, New York (1964)
24. Anderson, J.R.: Acquisition of cognitive skill. Psychol. Rev. 89, 369–406 (1982)
25. Rasmussen, J.: Information Processing and Human-Machine Interaction: An Approach to Cognitive Engineering. Elsevier, New York (1986)
26. Kim, J.W., Ritter, F.E., Koubek, R.J.: An integrated theory for improved skill acquisition and retention in the three stages of learning. Theoret. Issues Ergon. Sci. 14, 22–37 (2013)
27. Fu, W.-T., Lee, H., Boot, W.R., Kramer, A.F.: Bridging across cognitive training and brain plasticity: a neurally-inspired computational model of interactive skill learning. WIREs Cogn. Sci. 4, 225–236 (2013)
28. Delaney, P.F., Reder, L.M., Staszewski, J.J., Ritter, F.E.: The strategy-specific nature of improvement: the power law applies by strategy within task. Psychol. Sci. 9, 1–7 (1998)
29. Seibel, R.: Discrimination reaction time for a 1,023-alternative task. J. Exp. Psychol. 66, 215–226 (1963)
30. Grossman, D., Christensen, L.W.: On Combat: The Psychology and Physiology of Deadly Conflict in War and in Peace. Warrior Science Publications, Belleville (2008)
31. Goldberg, B., Amburn, C., Ragusa, C., Chen, D.-W.: Modeling expert behavior in support of an adaptive psychomotor training environment: a marksmanship use case. Int. J. Artif. Intell. Educ., 1–31 (2017)
32. Baker, R.S.J., Corbett, A.T., Aleven, V.: More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In: Woolf, B.P., Aïmeur, E., Nkambou, R., Lajoie, S. (eds.) ITS 2008. LNCS, vol. 5091, pp. 406–415. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-69132-7_44
33. Corbett, A.T., Anderson, J.R.: Knowledge tracing: modeling the acquisition of procedural knowledge. User Model User Adap. 4, 253–278 (1995)
34. Pavlik Jr., P.I., Cen, H., Koedinger, K.R.: Performance factors analysis—a new alternative to knowledge tracing. In: Proceedings of the 14th International Conference on Artificial Intelligence in Education (2009)

Developing an Adaptive Opponent for Tactical Training

Jeremy Ludwig and Bart Presnell

Stottler Henke Associates, Inc., San Mateo, CA, USA
[email protected]

Abstract. This paper describes an effort to create adaptive opponents for simulation-based air combat, where the opponents behave realistically while at the same time fulfilling instructional objectives. Three different models are developed to control the behavior of red pilots against (simulated) blue trainees in a set of 2v2 scenarios. These models are then evaluated on their tactical and instructional performance, with the machine-learning model performing on par with the two hand-constructed models. The contribution of this paper is to investigate technology and infrastructure enhancements that could be made to existing systems used for simulation-based air combat training.

Keywords: Behavior modeling · Simulation · Adaptive training

1 Introduction

This paper describes an effort to create adaptive opponents for simulation-based training, where the opponents behave realistically while at the same time fulfilling instructional objectives. The domain for this work is air combat, where US Air Force pilots train in a simulator against computer-generated enemies. The effort is one example of a behavior model built under the research program described in [1]. The objective of this program is to create smart and agile opponents that respond more realistically, are less predictable, and take advantage of errors made by trainees while still providing specific types of instructional opportunities. The remainder of the paper includes an overview of the methods used to create and evaluate the behavior models. Following this, the results section highlights the tactical and instructional intelligence demonstrated in the models. Finally, the discussion section reviews the implications of the results and presents future work.

1.1 Related Work

There is a substantial amount of related work on creating intelligent, adaptive opponents for games and simulations and on developing systems for autonomous unmanned aerial vehicles. This paper focuses on a small subset of highly related work modelling opponent pilots in the air combat domain. One of the early, and still active, success stories in this domain is TacAir Soar [2]. This system used production rules created for the Soar architecture. TacAir Soar was used for operational training, flying many different types of fixed-wing aircraft in


simulations using appropriate doctrine and tactics. More recent work in this area includes utilizing genetic algorithms [3] and neural networks [4] to best human opponents. Developing a capable adversary in this domain is an issue that has been pursued for quite some time and is still an open problem. The work presented in [5] applies finite state machines combined with dynamic scripting to the problem of creating an opponent for air combat training. We followed nearly the same behavior representation approach as [5], combining hierarchical finite state machines with dynamic scripting. By contrast, an example of a behavior modelling approach currently in use within the US Air Force is the Next Generation Threat System (NGTS) [6]. NGTS is used across the US government and DoD to develop behavior models that human pilots train against. The contribution of this paper is to investigate future enhancements to systems currently in use, such as the NGTS, that would result in improved training outcomes.

2 Methods

This section includes the methods used both to create and to evaluate the behavior models. First, we provide an overview of the SimBionic architecture and dynamic scripting. Following this is a description of the specific modeling approach that we took using these tools. Finally, we present an overview of the evaluation approach.

2.1 Agent Architecture

This section describes the SimBionic agent architecture, followed by a description of the Dynamic Scripting reinforcement learning algorithm. The SimBionic architecture, along with a Dynamic Scripting extension, is available as open source on GitHub [7].

SimBionic. The goal of the SimBionic architecture [8] is to make it possible to specify real-time intelligent software agents quickly, visually, and intuitively by drawing and configuring behavioral transition networks (BTNs) as shown in Fig. 1. Each BTN is a network of nodes joined by connectors (links), similar to a flow chart or finite state machine. Visual logic makes it easy to show, discuss, and verify the behaviors with members of the development team, subject matter experts, and other stakeholders. The SimBionic architecture provides three components. The SimBionic Visual IDE application enables modelers to specify intelligent agent behaviors by creating and saving BTNs that are read and executed by the SimBionic Run-time System. The run-time

Fig. 1. This behavior will attempt to perform a stern intercept on an aircraft until specific conditions are met.


software library connects to the simulation API to query for state information and execute actions in the simulation, as specified by the BTNs. The SimBionic Debugger application helps developers test and debug behavior logic by stepping through the execution of the BTNs and inspecting the values of local and global variables.

As an example, while controlling an aircraft in the simulation, the Fig. 1 BTN is invoked as the initial behavior. Execution of this BTN starts in the green action node (Log) and then transitions to the SternIntercept node, which is a reference to another BTN. At this point, flow of control is passed to the SternIntercept BTN (Fig. 2). Execution of the SternIntercept BTN then starts in the green action node and will eventually call other BTNs (SternInterceptLowAA and SternInterceptHighAA). Meanwhile, the initial behavior is monitoring in the background for one of the conditional nodes (isFuelLow, isReturnToBase, or isDone) to become true. As soon as one of these becomes true, the SternIntercept BTN is interrupted and control is returned to the initial behavior, which transitions to the red action node (Final).

Each node in a SimBionic BTN includes both a user-friendly description (e.g., "High AA?") as well as JavaScript or Java code that will be called when the node is executed, as shown in the lower left of Fig. 2. The overall behavior model includes multiple levels: (i) the high-level flow of control represented in the BTNs, (ii) the lower-level building blocks written in Java, and (iii) the simulation API that carries out actions and provides sensory data.

Fig. 2. The SimBionic IDE showing the SternIntercept BTN.
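To make these execution semantics concrete, the following is a small, generic sketch in Python of a hierarchical behavior with monitored interrupt conditions. The class and function names are invented for illustration and do not correspond to SimBionic's actual Java/JavaScript API.

```python
from typing import Iterator

class Aircraft:
    """Toy world state queried by the condition checks."""
    def __init__(self) -> None:
        self.fuel = 100
        self.done = False

def stern_intercept(ac: Aircraft) -> Iterator[str]:
    """Sub-behavior, analogous to a referenced BTN."""
    while True:
        ac.fuel -= 10                    # each tick burns some fuel
        yield "maneuver toward the target's stern"

def initial_behavior(ac: Aircraft) -> Iterator[str]:
    """Top-level behavior: delegate to the sub-behavior until a
    monitored interrupt condition (low fuel or done) becomes true."""
    sub = stern_intercept(ac)
    while True:
        if ac.fuel < 50 or ac.done:      # background condition checks
            yield "final: return to base"
            return
        yield next(sub)                  # hand one step to the sub-behavior

ac = Aircraft()
for tick, action in enumerate(initial_behavior(ac)):
    print(tick, action)
    if action.startswith("final"):
        break
```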

A SimBionic entity is a thread of execution. The SimBionic engine can contain one or more entities, each executing its own set of BTNs simultaneously. When using SimBionic to develop agents for a game or simulation, different entities might


correspond to simulated characters or computer-generated forces, each executing its BTNs independently (though they may communicate with each other to coordinate their actions).

Dynamic Scripting. Dynamic scripting [9] is an online reinforcement learning algorithm developed specifically to control the behavior of adversaries in modern computer games and simulations. Put simply, dynamic scripting attempts to learn a subset of IF-THEN rules (called actions) that allows the entity to perform well. This subset, chosen from the larger rule set, is the "script" in dynamic scripting. Dynamic scripting makes a specific tradeoff for games and simulators, favoring speed of learning over context sensitivity. More concretely, actions contain (i) a value, (ii) an optional IF clause that describes when an action can be applied based on the perceived game state, and (iii) a user-defined priority that captures domain knowledge about the relative importance of each action. Action values are used to create scripts of length n prior to a scenario by selecting rules in a value-proportionate manner (e.g., softmax) from the complete set of actions available to the agent. During a scenario, applicable actions are selected from the script first by priority order and then by action value. Applicability is determined by the perceived game state and the action's IF clause. At the end of a scenario, action values are updated using the dynamic scripting updating function combined with a domain-specific reward function created by the behavior author. The reward is distributed primarily to the actions selected in the episode and then to actions in the script that were not selected, with a smaller negative reward given to actions not included in the script.

There are two primary benefits of this learning approach in the context of creating opponents in the air combat domain [5]. First, the algorithm can learn quickly and continuously: a few scenarios are enough to drive an obvious change in behavior. Second, the algorithm learns within a well-bounded space. Instructor pilots develop the space of possible actions, and the algorithm searches to find a set of actions from this space that work well together. This leads to behavior that is adaptive from the trainee's perspective while still being predictable from the instructor's perspective.
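As an illustration of the mechanism (a minimal sketch with invented constants, not the implementation used in this work), the code below builds a script by softmax sampling over action values, selects applicable actions during a scenario, and applies an end-of-scenario value update of the kind described above.

```python
import math
import random

class Action:
    """A dynamic scripting action: an IF-THEN rule with a learned value."""
    def __init__(self, name, priority, condition=None, value=1.0):
        self.name = name
        self.priority = priority                  # domain knowledge: lower runs first
        self.condition = condition or (lambda state: True)  # optional IF clause
        self.value = value                        # learned weight

def build_script(actions, n, temperature=1.0):
    """Before a scenario: sample n distinct actions, value-proportionately (softmax)."""
    pool, script = list(actions), []
    for _ in range(min(n, len(pool))):
        weights = [math.exp(a.value / temperature) for a in pool]
        chosen = random.choices(pool, weights=weights, k=1)[0]
        script.append(chosen)
        pool.remove(chosen)
    return sorted(script, key=lambda a: a.priority)

def select_action(script, state):
    """During a scenario: the first applicable action, in priority order."""
    for action in script:
        if action.condition(state):
            return action
    return None

def update_values(actions, script, used, reward):
    """After a scenario: reward actions that fired, give a share to unused
    script actions, and slightly penalize actions left out of the script.
    The 0.3 and 0.1 factors are illustrative constants only."""
    for a in actions:
        if a in used:
            a.value += reward
        elif a in script:
            a.value += 0.3 * reward
        else:
            a.value -= 0.1 * abs(reward)
        a.value = max(a.value, 0.1)               # keep every action selectable
```

In the 2v2 model described in the next subsection, each red aircraft maintains its own action values and script.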

2.2 Modeling Approach

We investigated three different models to control the behavior of red pilots in a set of scenarios that pit two red aircraft against two blue (2v2), as shown in Fig. 3. All three models were built in the SimBionic modeling architecture. Following the discussion of the three models is a description of some common functionality.

• SimBionic. A handcrafted behavior model using SimBionic's hierarchical behavior transition network representation.
• Rules. The second model is simply composed of a hand-selected set of IF-THEN rules, modeled in a SimBionic BTN.
• Dynamic Scripting (DS). The third model uses reinforcement learning to learn the behavior model from experience from among a larger set of IF-THEN rules.


Fig. 3. Example 2v2 scenario.

SimBionic. For the 2v2 behavior model, we developed three specific SimBionic entities. The first and second entities each control one of the two aircraft. These two entities run their own versions of the same set of hierarchical BTNs, similar to those shown in Figs. 1 and 2. The third entity is a controller, which assigns targets to the red aircraft. This entity uses a different set of BTNs. The controller BTN is the only explicit instance of coordination between the two aircraft; all other coordination is implicit in either the BTNs or the low-level Java building blocks.

Rules. After developing the SimBionic 2v2 model using hierarchical BTNs, a second, much simpler model to control the red aircraft was created. This model was composed of six IF-THEN rules implemented in a single BTN in SimBionic. The Rules model also reused the controller entity from the SimBionic model to assign targets to the two red aircraft. This enabled us to take advantage of all of the integration work and building blocks that had already been completed and quickly create the rule-based model. The idea behind this second model was to explore the performance of an extremely simple approach relative to the hierarchical model.

Dynamic Scripting. For the 2v2 Dynamic Scripting behavior model, thirteen possible actions to select from were created. These included the six actions from the Rules model and seven new actions. Since each DS action is an IF-THEN rule, we represented all thirteen rules in a single SimBionic BTN. The result is a BTN that looks very much like the Rules model, just with more nodes and transitions. The primary difference is the introduction of a special type of dynamic scripting node that indicates dynamic scripting transition selection should be used rather than the standard SimBionic transitions. When executing in a scenario, the first time a chooseDS node is reached, a script is generated based on action weights. Subsequent transitions through the BTN use the same script. A corresponding rewardDS node applies the reward function at the end of the scenario, which updates the weights and resets Dynamic Scripting so that a new script is selected the next time the chooseDS node is reached. With this setup, each red aircraft learns its own script.


The rewards are given to the behaviors based on two events (a minimal sketch of this reward scheme appears at the end of this subsection). The first event is the occurrence of any frame in which a red entity controlled by a behavior model is threatening a blue entity. For this event, a small reward is evenly shared between each red entity that is targeting the threatened blue entity. The intention is to provide a reward as entities move into position or distract the target. The second reward event occurs when a red entity is removed from the simulation. This reward takes the form of a large negative reward that is applied only to the red entity that was captured. This decision was based on the assumption that individual behaviors are responsible for maneuvering the entity into a dangerous position, regardless of how other entities are behaving.

Prior to evaluation, the dynamic scripting model was trained by running it against twelve scenarios five times each, for a total of 60 training runs. The behavior model was then frozen for evaluation. Learning was disabled during the evaluation so as to generate more stable results.

Common Functionality. There is significant common functionality across all three models. This includes being modeled in the SimBionic architecture, sharing the same Controller BTN for target assignments, and sharing the same lower-level building blocks developed in Java that execute actions such as turns or attempts to intercept another aircraft. The three models also share constrained variability through the use of thresholds in the SimBionic BTNs and Java building blocks. The thresholds allow behaviors to vary their performance in subtle ways in each scenario. For example, rather than always performing an action at 10 NM, a model using a threshold might perform the action at 9.2 NM one time and 10.7 NM another time. This approach has two advantages. First, the exact behavior of the red aircraft is more difficult to predict. Second, the behavior of the red aircraft is easy to change: updating the default, min, and max thresholds will affect how the model performs but requires no additional changes to the model.
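The following sketch illustrates the two reward events described above. The entity attributes and reward magnitudes are hypothetical; the paper does not specify the actual constants used.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative constants; the actual reward magnitudes are not given in the paper.
THREAT_REWARD = 0.1
CAPTURE_PENALTY = -10.0

@dataclass
class Entity:
    name: str
    is_threatened: bool = False
    target: Optional["Entity"] = None

def frame_rewards(red_entities, blue_entities):
    """Per-frame event: share a small reward among the red entities
    targeting any blue entity that is currently threatened."""
    rewards = {red.name: 0.0 for red in red_entities}
    for blue in blue_entities:
        if blue.is_threatened:
            attackers = [r for r in red_entities if r.target is blue]
            for r in attackers:
                rewards[r.name] += THREAT_REWARD / len(attackers)
    return rewards

def capture_penalty(captured_red):
    """Scenario event: a large negative reward applied only to the
    red entity that was removed from the simulation."""
    return {captured_red.name: CAPTURE_PENALTY}
```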

2.3 Evaluation Approach

To evaluate the tactical intelligence of the three behavior models (SimBionic, Rules, and DS), we ran each model against a set of thirteen scenarios and then averaged the quantitative results. The scenarios are not deterministic and returned slightly different results on different runs, so each scenario was completed twice. We also recorded videos of the evaluation scenarios for qualitative assessment of tactical and instructional intelligence by subject matter experts. In these scenarios, the behavior models controlled the red aircraft, while the blue aircraft were controlled by simplistic scripts. The quantitative goal was to maximize the number of blue aircraft removed from the scenario (captured) while minimizing the number of red aircraft lost. The qualitative goal was to perform realistically while following an instructor-developed plan that describes at a high level how the red aircraft should act in these scenarios against the blue trainees.


3 Results

The results section highlights the tactical and instructional intelligence demonstrated by the three different behavior models. Tactical intelligence focuses on successfully completing the scenario against a simulated trainee. Instructional intelligence focuses on aspects such as performing more realistically, taking better advantage of mistakes made by the simulated trainee, and being less predictable than current agent models. The quantitative results in Table 1 show the aircraft "captured to lost" ratio across the evaluation scenarios. Following the table are three charts showing the average captured and lost aircraft per scenario for each of the three models in each of the evaluation scenarios (Figs. 4, 5 and 6).

Table 1. Average model performance across two runs of all evaluation scenarios.

| Model     | Average # Blue Aircraft Captured [0–2] | Average # Red Aircraft Lost [0–2] | Captured/Lost Ratio |
|-----------|----------------------------------------|-----------------------------------|---------------------|
| SimBionic | 1.54                                   | 0.15                              | 10.3                |
| Rules     | 1.46                                   | 0.12                              | 12.2                |
| DS        | 1.38                                   | 0.12                              | 11.5                |

Fig. 4. Performance of SimBionic model on evaluation scenarios. The x-axis is the scenario and the y-axis is the number of aircraft captured (blue) and number of aircraft lost (orange). (Color figure online)


Fig. 5. Performance of Rules model on evaluation scenarios. The x-axis is the scenario and the y-axis is the number of aircraft captured (blue) and number of aircraft lost (orange). (Color figure online)

Fig. 6. Performance of Dynamic Scripting model on evaluation scenarios. The x-axis is the scenario and the y-axis is the number of aircraft captured (blue) and number of aircraft lost (orange). (Color figure online)

Qualitative examination was performed only on the SimBionic scenarios due to the limited amount of subject matter expert time available. The subject matter experts found that the SimBionic model generally behaved realistically and took advantage of trainees’ mistakes—with a couple of exceptions. One exception had to do with how red


carried out a maneuver. In this case, the error was in the low-level Java code that carries out the maneuver, not with the behavior model itself. This means that all three models would make the same mistake. The second exception had to do with teamwork, where one red aircraft was not supporting the other correctly under certain circumstances. As this error was embedded in the SimBionic model, the incorrect behavior would not necessarily be present in the Rules or DS models.

4 Discussion

These quantitative results show that all of the models were fairly successful in capturing opposing forces while not getting captured themselves. Using this as the primary metric, the results demonstrate good performance for all three behavior models. What is very surprising is that the simplest behavior model, Rules, displays the best performance in terms of captured-to-lost ratio. The model created via machine learning, DS, performs almost as well. This is not necessarily surprising, since DS has access to all of the IF-THEN statements in the Rules model, but it is a good outcome nonetheless. SimBionic performs reasonably well, too; it is just not quite as good as the other models at maximizing the captured-to-lost ratio. It is also interesting to note that the different models have different strengths and weaknesses: for example, SimBionic was at a stalemate in scenario 4e, where both Rules and DS were able to come out ahead.

Instructional intelligence is evaluated relative to the requirements set by the instructor pilots with respect to improved red behavior models: perform realistically, take advantage of mistakes made by trainees, and be less predictable. First, our belief is that the SimBionic model will behave more realistically than the Rules or DS models. Based on personal observation as modelers, we can see that the Rules and DS models behave differently than the more highly constrained SimBionic model; the realism of the Rules and DS models has not been independently examined. Second, all three models seem to take advantage of blue mistakes while protecting themselves, given the captured-to-lost ratio. Third, the constrained variability discussed under Common Functionality provides significant variation in aircraft behavior across scenario runs, though this was not directly noticeable to the subject matter experts reviewing model performance.

5 Conclusion

The overall objective of this work is to advance the intersection of cognitive modeling and machine learning in order to develop behavior modeling technology and supporting infrastructure for the US Air Force. Working towards the overall objective, this paper describes the development and evaluation of three different behavior models from both a tactical and instructional perspective. All three models performed well from both perspectives, with the machine learning model performing on par with the other two models after relatively little training.


In future work, we plan to combine the SimBionic and Dynamic Scripting approaches, adding dynamic scripting nodes at various points within the hierarchical SimBionic BTNs. We believe this will best support realistic behavior and adherence to an instructional plan, while at the same time applying machine learning to adapt to the trainees' performance.

Acknowledgements. This article is based upon work supported by the United States Air Force Research Laboratory, Warfighter Readiness Research Division 711 HPW/RHA, under Contract FA8650-16-C-6698. This article is cleared for public release on 14 Dec 2018, Case 88ABW2018-6265. Distribution Statement A: Approved for Public Release, Distribution Unlimited.


Application of Artificial Intelligence to Adaptive Instruction - Combining the Concepts

Jan Joris Roessingh, Gerald Poppinga, Joost van Oijen, and Armon Toubman

Netherlands Aerospace Center NLR, Anthony Fokkerweg 2, 1059 CM Amsterdam, The Netherlands
{Jan.Joris.Roessingh,Gerald.Poppinga,Joost.van.Oijen,Armon.Toubman}@nlr.nl

Abstract. In recent years, instructional systems for individuals and teams, including virtual environments, serious games, simulator-based training and on-the-job/live training, have been supplemented by Adaptive Instructional Systems (AISs). Artificial Intelligence (AI) techniques and Machine Learning (ML) techniques have been proposed, and are increasingly used, for a number of functions of AISs. This paper aims to combine, on the one hand, the concepts of AI and ML, and, on the other hand, adaptive instruction. The emphasis is put on simulator-based training in a professional context, predominantly skill learning by practicing tasks in simulated environments, either as an individual student or as part of a team. The major goals of this paper are: (1) to provide a basic description of available ML techniques, (2) to sketch the potential use of machine learning techniques in adaptive instruction, and (3) to provide examples of applications from the literature. This paper neither introduces a new AI approach to adaptive instruction, nor does it extensively review the literature of such approaches.

Keywords: Intelligent Tutoring Systems · Adaptive instruction · Artificial Intelligence

1 Introduction

In recent years, instructional systems for individuals and teams, including virtual environments, serious games, simulator-based training and on-the-job/live training, have been supplemented by Adaptive Instructional Systems (AISs). This is a general term for intelligent, computer-based tools used for education, instruction and training. AISs guide learning experiences by tailoring instruction and recommendations based on the goals, needs and preferences of each learner (or team) in the context of domain learning objectives [1]. Examples of AISs are Intelligent Tutoring Systems (ITSs), intelligent mentors and personal assistants for learning. Such technology has been integrated in computerized instructional media, including training systems that are embedded in operational systems (so-called embedded training systems).

Artificial Intelligence (AI) techniques and Machine Learning (ML) techniques have been proposed, and are increasingly used, for a number of functions of AISs (see [2–4]). These functions include expert domain knowledge representation, predicting the behavior, state and final performance of students, preventing drop-out of students, generating the behavior of non-player characters (NPCs) in training scenarios, and, more generally, the generation of suitable training scenarios and the orchestration of pedagogical interventions. Examples of the use of AI in scenario management and serious gaming packages for the military are provided in [5]. More recently, [6] provided several examples of the use of machine learning techniques for behavior generation in simulator-based training.

The current paper aims to combine, on the one hand, the concepts of AI, more specifically ML techniques, and, on the other hand, adaptive instruction. The emphasis is on simulator-based training in a professional context, predominantly skill learning by practicing tasks in simulated environments, either as an individual student or as part of a team. The purpose of this paper is to introduce the HCII AIS conference session on the application of AI to adaptive instruction. The major goals of this paper are: (1) to provide a basic description of available ML techniques, (2) to sketch the potential use of machine learning techniques in adaptive instruction, and (3) to provide examples of applications from the literature. This paper neither introduces a new AI approach to adaptive instruction, nor does it extensively review the literature of such approaches.

2 Machine Learning

AI is applied when a machine mimics "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving" [7]. The field of ML grew out of the broader study of AI over the past several decades. As a research field, ML studies computer algorithms that improve automatically through experience [8]. In a data processing sense, ML, also known as 'predictive analytics', is a set of techniques with several prominent applications, such as detection, clustering, classification, prediction, dimension reduction, decision support, simulation and data generation, in particular through non-pre-programmed methods.

Traditional AI methods for data processing and prediction have relied on explicitly programmed systems, often captured as a set of "if-then" rules of which a predictable subset would be activated under specified conditions. However, such expert systems only performed as well as their original programming, and were often unable to adapt to unexpected data sets, thus lacking robustness. ML, in contrast, attempts to apply the concepts of AI in a way that allows a computer program to improve its performance over time, without being explicitly programmed. ML approaches generally have the implicit benefit of producing a reasonably satisfactory solution in the face of new, unknown observations, where traditional expert systems might get stuck. The downside is that it can be unclear to humans how ML algorithms produce these solutions: the relationship between input and output may be difficult to infer and cannot always be explained, which hampers interpretation.

From a historical perspective, the application of ML to adaptive instruction in its current-day sense, i.e. beyond the Skinner teaching machine [9, 10], started around 1970 with Carbonell [11, 12], who developed a prototype ITS. This system, called SCHOLAR, was "[..] information structure oriented, based on the utilization of a symbolic information network of facts, concepts, and procedures." A semantic network is a set of concepts, like 'electrons', 'neutrons', 'protons' and 'atoms', and relations among those concepts, such as 'the nucleus of an atom consists of neutrons and protons' and 'electrons orbit the nucleus of an atom'. SCHOLAR's semantic network had the capability to learn such patterns from facts ("SCHOLAR learns what is told.", [12, p. 52]), but the main concern in its development was how to use and represent knowledge in the semantic network. The main feature of SCHOLAR was its ability to maintain a 'mixed-initiative' dialogue with the student, with questions asked by either SCHOLAR or the student and answered by the other. In the decade thereafter, adaptive instruction became a research-intensive subfield of ML, with many practitioners from the cognitive and computer sciences, as is apparent from, for example, [13].

In summary, whereas traditional AI tried to create self-contained expert systems with steady output, current-day ML aims to create a system that can, through repeated exposure to data, learn for itself and is therewith able to adapt to novel data.

2.1 Machine Learning Approaches

Domingos [14] subdivides ML methods as follows:

• Evolutionary: ML is considered as natural selection through the mating of computer programs, with genetic programs as the representational form of this class of methods. [15] provides an example of using this type of method in adaptive instruction.
• Connectionism: ML is considered as adjusting weights in Artificial Neural Networks (ANNs), the ANNs being the representational form of this class of methods.
• Symbolic reasoning: ML is considered as logical deduction from symbolic information ('symbol manipulation'). Logic symbolic systems, in the form of sets of rules or decision trees, are the representational form of this class of methods. The aforementioned SCHOLAR ITS [11, 12] is an example of the application of this method.
• Bayesian probabilistic reasoning: ML of probabilities (of events or of the truth of propositions) is based on inference, i.e. the propagation of probabilities through a graphical network. The latter network is the representational form for this class of methods, in which the nodes describe events and the arcs the relationships between events.
• Analogy-based learning: ML is based on determining similarity between data points. Representational forms for this class of methods are the nearest neighbor algorithm and support vector machines. These methods are able to find a decision rule for binary classification (deciding on one of two outcomes) based on several predictor variables.

Although each of these approaches has its preferred set of methods, tools and techniques, they are not mutually exclusive and allow for hybrid approaches.

2.2 Three Types of Machine Learning from an Input-Output Perspective

ML can also be looked at from the perspective of how the ML algorithm operates. What is needed at the input? How is the output structured? From this perspective, basically three types of machine learning can be distinguished: (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning.

Supervised Learning. This type of ML encompasses the techniques that learn from examples. For supervised learning techniques, the data (examples, observations) must be labeled, i.e., annotated by a human expert or some other source. In other words, a human or another entity supervises the learning process of the machine. The algorithm forms a model of the relations (features) between the data and the labels. For example, a supervised learning algorithm learns to classify correct and incorrect student responses after it has been presented with a large number of such responses, labeled by an expert instructor.

Unsupervised Learning. This type of ML includes techniques that automatically build a model of the data that they are presented with. For unsupervised learning techniques the input data is unlabeled, i.e., not annotated by an expert; in other words, the learning is unsupervised. The algorithm forms a model of the structures or patterns that are present in the data, without any explicit hints as to what those structures are. An unsupervised ML algorithm may discover patterns in student behavior that are not directly apparent to human instructors, for example, that good student performance on teamwork aspects in simulator-based training is strongly associated with a certain cluster of measures.

Reinforcement Learning. This type of ML can be characterized as 'learning by doing'. Reinforcement learning is often associated with a learning agent, i.e. a robot or software agent, that learns through interaction with some system or environment. Through its behavioral actions it interacts with the environment and thereby creates examples of this behavior. The agent receives a reward for its actions when it achieves some goal, such as solving a maze, completing a game, making a profitable deal, or neutralizing an opponent. The received reward serves as a kind of label for the example, i.e. the series of actions undertaken by the agent to reach its goal. The agent thereby learns to gradually improve the sequence of actions ('the policy') needed to achieve its goal.

As with the five ML approaches discussed in the previous section, hybrids of the different types of ML are often successful. For example, many of the successes with deep (reinforcement) learning (e.g. [16, 17]) are based on algorithms that combine supervised, unsupervised and reinforcement learning in a connectionist approach.

On-Line Learning and Off-Line Learning. Another useful distinction is between on-line learning and off-line learning. While an ML application is being trained, the algorithm will most often strike some balance between exploration (actions that create or include novel examples, but with uncertain reward) and exploitation (actions that are 'known' by the algorithm to lead to the desired reward). Both are essential for learning. However, in practical applications, exploration by the algorithm may be undesirable, for example because it may lead to emergent or unstable behavior of the algorithm, which would disturb the tutoring of an ITS. Therefore, with off-line learning, the algorithm is trained before actual application and is not allowed to continue learning during application. With on-line learning, learning (and therewith some degree of exploration) continues during practical application.
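As a minimal, hedged illustration of the reinforcement-learning loop and the exploration-exploitation balance sketched above, the fragment below implements epsilon-greedy tabular Q-learning on an invented toy task (the environment, states and rewards are placeholders, not any system discussed in this paper). Freezing the learned policy after training, as in the last lines, corresponds to the off-line regime in which no further exploration takes place during application.

```python
import random
from collections import defaultdict

# Invented toy task: move from position 0 to the goal at position 5 on a line.
ACTIONS = [-1, +1]

def step(state, action):
    """Apply an action; reward 1.0 only when the goal state is reached."""
    next_state = max(0, min(5, state + action))
    return next_state, (1.0 if next_state == 5 else 0.0), next_state == 5

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2):
    """Training phase: exploration (epsilon > 0) is still allowed."""
    q = defaultdict(float)  # Q-values indexed by (state, action)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Exploration vs. exploitation: random action with probability epsilon,
            # otherwise the greedy action (ties broken randomly).
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: (q[(state, a)], random.random()))
            next_state, reward, done = step(state, action)
            # Q-learning update: the experienced reward gradually improves the policy.
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
# During application (off-line learning) the policy is frozen: greedy actions only.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(6)}
print(policy)
```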

3 What Is Adaptive Instruction?

In an educational context, Gaines [18] describes adaptive instruction as sketched in Fig. 1. In his view, an adaptive instructional system has three elements: (1) the evaluation of the learning outcome, (2) a dynamic model of the learning process that is implemented in the so-called adaptive logic, and (3) an adaptive variable that changes the training task or the environment. The learning process is the result of interactions between a student (or a team) and a task. The control model is a model of the learning process. The parameters of this model are not fixed but are selected on the basis of the observed learning process.

Fig. 1. Concept for adaptive instruction (based on [18]). The figure shows the adaptive logic (observation of the learning process, evaluation of the learning outcome, selection and implementation of a control model) acting through an adaptive variable on the task, the trainee/team and the environment. The modules above the horizontal line establish the Artificial Intelligence. Interaction between trainee/team, task and environment is denoted by the black arrow. The adaptive variable may modify any of these.

Examples of adaptive instruction are:

• the control we take over our own learning process, based on an implicit model of that learning process;
• the control that an instructor takes over the learning process of a student, based on an implicit model of the learning process of the student;
• the control of a learning process of a student by an algorithm that explicitly models the learning process of the student.

The first two of these examples are examples of adaptive instruction based on human intelligence, while the latter example may be based on an (AI) algorithm, possibly a ML algorithm. Hence, in AI-based adaptive instruction, the educational experience is tailored by an AI-enabled tutor. More generally, the goal of adaptive instruction is to optimize learner outcomes [19]. Learner outcomes that can be optimized are, for example, knowledge and skill acquisition, performance, enhanced retention, accelerated learning, or transfer-of-training between different instructional settings or work environments.

A range of technologies may fall under the heading of AISs, including personal assistants for learning and recommender systems. For the current purposes, we consider the ITS as the most comprehensive technology under this heading, in the sense that it is not just an add-on, but creates the complete setting for tutoring: domain expertise, a tutor, and communication or other interactions with the student. According to [20], an ITS is an educational support system (a kind of virtual tutor), used to help learners in their tasks and to provide them with specific and adapted learning content. Nwana [21] emphasizes that ITSs are designed to incorporate techniques from the AI community in order to provide tutors which know what they teach, who they teach, and how to teach it. Hence, less comprehensive AIS technologies may not be capable of knowing the full scope of teaching (what? who? how?).

An ITS provides instant and personalized instruction or feedback to students, usually without needing the involvement of a human tutor, with the purpose of enabling learning in a meaningful and effective manner. Many situations, such as on-the-job, in the classroom or in remote locations, lack sufficient availability of one-to-one instruction, which is more effective than one-to-many instruction. In such situations, an ITS is capable of mimicking a personal tutor (one instructor per student or per team). There are many examples of ITSs being used in such situations, including in aerospace, the military, health care and industry, where their capabilities and limitations have been demonstrated (see e.g. [22]). Machine learning techniques may help to further enrich these capabilities and mitigate the limitations. In this paper, we will therefore consider the application of machine learning techniques to the various components of the ITS.
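To make the loop of Fig. 1 concrete, the hedged sketch below mirrors its elements in code: a learning outcome is evaluated from observations of the learning process, an adaptive logic implements a simple control model, and an adaptive variable (here, a hypothetical task difficulty) modifies the task. All names, thresholds and step sizes are illustrative assumptions, not part of Gaines' formulation.

```python
from dataclasses import dataclass

@dataclass
class TaskSettings:
    difficulty: float  # the adaptive variable acting on the task/environment

def evaluate_learning_outcome(scores):
    """Evaluate the observed learning process (here: mean of recent task scores)."""
    return sum(scores) / len(scores)

def adaptive_logic(outcome, settings, target=0.75, step=0.1):
    """A simple control model: adjust the adaptive variable so that the
    trainee/team is kept near a hypothetical target success rate."""
    if outcome > target:
        settings.difficulty = min(1.0, settings.difficulty + step)
    elif outcome < target - 0.2:
        settings.difficulty = max(0.0, settings.difficulty - step)
    return settings

# Illustrative run: scores observed from the trainee/team interacting with the task.
settings = TaskSettings(difficulty=0.5)
for recent_scores in [[0.9, 0.8, 1.0], [0.4, 0.5, 0.3], [0.7, 0.8, 0.7]]:
    outcome = evaluate_learning_outcome(recent_scores)
    settings = adaptive_logic(outcome, settings)
    print(f"outcome={outcome:.2f} -> difficulty={settings.difficulty:.2f}")
```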

4 Modules of an ITS

The general concept of an ITS [21, 23, 24] is based on four modules (see Fig. 2), in which arrows denote the exchange of information. For example, the domain expert module provides performance standards to the student model module. The tutoring module receives progress information (from the student model module) on a learning objective (selected by the tutoring module) and plans the next exercise. The exercise is made available to the student via the user interface module, the student provides his or her response back via the user interface module, and so on. The four modules are briefly discussed below.

4.1 The Domain Expert Module

The first module, the domain expert module, contains the concepts, rules, and problem-solving strategies for the domain to be learned. Its main function is to provide the standard for evaluating the student's response and therewith the ability to assess the student's or team's overall progress.

Expert knowledge must include not only shallow knowledge (e.g. the categories and explanations of the various concepts that the student has to acquire), but also the representational ability that has been acknowledged to be an essential part of expertise.


Fig. 2. General concept of an ITS and its relation to a student (from [21]). The figure shows the four modules (domain expert module, student model module, tutoring module and user interface module) and the student interacting with the user interface module.

Expert knowledge can be represented in various ways, including network representations (e.g. belief networks), (rule-based) production systems, behavior trees, hierarchical finite state machines, or sets of constraints, which can be used to analyze students' solutions in order to provide feedback on errors.

Knowledge elicitation and the codification of this knowledge can be very time-consuming, especially for a complex domain with an enormous amount of knowledge and many interrelationships within that knowledge. Thus, investigating how to encode knowledge and how to represent it in an ITS remains the central issue in creating an expert knowledge module [21]. Note that this bears similarity to issues of knowledge representation and explainability in ML and AI in general.

The expert module of an ITS should also be viewed in the context of simulator-based training or a gaming/virtual environment. In such an environment, the student has to learn by doing how to perform a given task, e.g. with the goal of defeating an enemy or of troubleshooting and resolving a malfunction in a piece of equipment. In such a context, not only the explicit knowledge but also the simulation itself is part of the domain expert module, with its built-in concepts, behavior of NPCs, rules, constraints and score keeping for indicating task performance.
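As a hedged sketch of one of the representations listed above, the fragment below encodes a few expert constraints as simple rules and uses them to flag errors in a student's solution. The domain facts are borrowed loosely from the semantic-network example in Sect. 2; the rule set and the answer format are invented for illustration only.

```python
# Hypothetical expert constraints: each rule pairs a check on the student's
# structured answer with the feedback message to give when the check fails.
EXPERT_RULES = [
    (lambda ans: ans.get("nucleus") == {"protons", "neutrons"},
     "The nucleus of an atom consists of protons and neutrons."),
    (lambda ans: ans.get("orbiting") == "electrons",
     "Electrons orbit the nucleus of an atom."),
]

def evaluate_student_answer(answer):
    """Compare a student's answer against the expert constraints and
    return feedback for every violated constraint."""
    return [message for check, message in EXPERT_RULES if not check(answer)]

# Illustrative student answer with one error (neutrons placed outside the nucleus).
student_answer = {"nucleus": {"protons"}, "orbiting": "electrons"}
print(evaluate_student_answer(student_answer))
# -> ['The nucleus of an atom consists of protons and neutrons.']
```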

4.2 The Student Model Module

The student model module refers to the dynamic representation of the evolving knowledge and skills of the student, as would become apparent from a 'learning curve' for this student. Important functions of the student model are: (1) to evaluate a student's or a team's competency on the tasks to be mastered against the standard established in the domain expert module, and (2) to evaluate how competency evolves with further exposure to the current state of the learning environment. The results of these evaluations feed into the tutoring module (discussed in the following subsection), which decides on pedagogical adaptations of the learning environment.

The student model module thus acts as a source of information about the student. Such knowledge is vital for the tutoring module of the ITS, as no intelligent tutoring can take place without such an understanding of the student. The student model should include those aspects (variables) of the student's behavior and knowledge that have an assumed effect on his/her performance and learning. Constructing a valid model is non-trivial. Human tutors would normally combine data from a variety of sources, possibly using bodily posture, gestures, voice effects and facial expressions. They may also be able to detect aspects such as boredom or motivation, which are crucial in learning. The evolution of these cognitive and affective states must then be traced as the learning process advances. However, in the absence of a human tutor, the student's cognitive and affective states must be inferred from the student input received by the ITS via a keyboard and/or other input devices or sensors.

Traditionally in ITSs, a student model could often be created from the representation of the target knowledge in the expert knowledge module. Accordingly, the student model can include a clear evaluation of the mastery of each unit of knowledge in the expert module. This allows the student's state of knowledge to be compared with the expert knowledge module, and instruction is then biased towards portions of the model shown to be weak. This form of student modelling is referred to as 'overlay' modelling [25], because the student's state of knowledge is viewed as a subset of the expert domain knowledge. Thus, in this form, the student model can be thought of as an overlay on the domain model. As the student progresses through the training tasks and the student model starts to deviate from the domain expert model, this is flagged to the tutoring module.
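A minimal sketch of the 'overlay' idea described above, under the assumption of a small, hypothetical set of knowledge units: the student model is simply a per-unit mastery estimate laid over the expert domain model, and the weakest units are flagged to the tutoring module.

```python
# Hypothetical expert domain model: the units of knowledge an expert masters.
DOMAIN_UNITS = ["radar_use", "missile_envelope", "formation_keeping", "comms"]

class OverlayStudentModel:
    """Student model as an overlay: a mastery estimate (0..1) per domain unit."""

    def __init__(self, units):
        self.mastery = {unit: 0.0 for unit in units}

    def update(self, unit, correct, rate=0.2):
        """Move the mastery estimate towards 1 after a correct response,
        towards 0 after an incorrect one."""
        target = 1.0 if correct else 0.0
        self.mastery[unit] += rate * (target - self.mastery[unit])

    def weak_units(self, threshold=0.6):
        """Units whose mastery deviates from the expert model; flagged to the tutor."""
        return [u for u, m in self.mastery.items() if m < threshold]

model = OverlayStudentModel(DOMAIN_UNITS)
model.update("radar_use", correct=True)
model.update("radar_use", correct=True)
model.update("comms", correct=False)
print(model.weak_units())  # instruction would be biased towards these units
```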

4.3 The Tutoring Module

The tutoring module is the part of the ITS that plans and regulates teaching activities (see Fig. 2) via the user interface module. In other architectures, this module is referred to as the teaching strategy or the pedagogic module. It plans the teaching activities on the basis of information from the student model module about the student's learning progress relative to the objectives defined in the domain expert module. The tutoring module thus decides on activities to achieve learning objectives: hints to overcome impasses in performance, advice, support, explanations and different practice tasks (see e.g. [26]). Such decisions or suggestions are based on the instructional strategy of the tutoring module, the evolution of the student's competencies and possibly the student's profile (see e.g. [20]).

The sequence and the way in which activities take place can lead to distinct learning outcomes. Tightly orchestrating the teaching activities might harm the student's explorative abilities. Sometimes it may be more effective to let the student struggle for some time before interrupting; however, the student should not lose his or her motivation when getting stuck during that struggle.

In traditional implementations of ITSs, for example an application for learning to solve algebra problems, the student may request guidance on what to do next at any point in the problem-solving process (e.g. [27]). This guidance is then based on a comparison between the student's state of knowledge and the expert knowledge: the tutoring module diagnoses that the student has turned away from the rules of the expert model and provides feedback accordingly. In a similar fashion, in an application that teaches the programming language Lisp [28], every time a student successfully applies a rule (from the domain expert module) to a problem, the student model module increases a probability estimate that the student has learned the rule. The tutoring module keeps presenting problems that require effective application of a rule until the probability that the rule has been learned exceeds a certain criterion.

The tutoring in existing ITSs can be ordered along a range of increasing flexibility of control. At the low end, there are systems that diligently monitor every response of the student, adjusting the tutoring activities to the student's responses but never relinquishing control. At the high end, there are guided discovery learning systems where the student has maximum control over the activity, and the only way the system can direct the course of action is by modifying the environment. Somewhere halfway along this range are more versatile tutors, where control is shared by the student and the ITS as they exchange information. The presence of this variety in tutoring styles underlines that variation in flexibility is required for different applications and possibly at different stages of the student's learning process. Such tutoring requirements are still challenging to formulate and to embody in an ITS. Nevertheless, some progress has been made, and machine learning techniques will certainly help to create the potential to adapt and improve strategies over time (as in the case of self-improving tutors), and for the same strategies to be reused in other domains.
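The mastery-learning mechanism described for the Lisp tutor [28] can be sketched, under assumptions, as a knowledge-tracing style update: each attempt at applying a rule updates the probability that the rule has been learned, and the tutor keeps selecting problems for that rule until the estimate exceeds a criterion. The update formula and the parameter values below are a generic Bayesian knowledge tracing variant, not the exact model used in [28].

```python
def update_p_learned(p_learned, correct, p_guess=0.2, p_slip=0.1, p_transit=0.15):
    """One Bayesian-knowledge-tracing style update of P(rule learned)."""
    if correct:
        evidence = p_learned * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_learned) * p_guess)
    else:
        evidence = p_learned * p_slip
        posterior = evidence / (evidence + (1 - p_learned) * (1 - p_guess))
    # Learning opportunity: the student may acquire the rule after the attempt.
    return posterior + (1 - posterior) * p_transit

# The tutoring module keeps presenting problems for a rule until mastery is reached.
p = 0.1          # prior probability that the rule has been learned
criterion = 0.95
observed = [True, False, True, True, True, True, True]  # fabricated responses
for attempt, correct in enumerate(observed, start=1):
    p = update_p_learned(p, correct)
    print(f"attempt {attempt}: P(learned) = {p:.2f}")
    if p >= criterion:
        print("mastery criterion reached; move on to the next rule")
        break
```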

4.4 The User Interface Module

The user interface module regulates the dialogue between the student and the tutoring module, as depicted in Fig. 2. It translates between the tutoring module's internal representation of the teaching activities and the behavior of the student, in a communication form that is on the one hand comprehensible to the student and can, on the other hand, be processed by the intelligent tutor.

Considering the user interface module as a distinct part of the ITS architecture should lead to user-interface design and usability issues being explicitly addressed during ITS development [29]. Challenges that relate to the user interface are: ease of use; natural interaction dialogues; a dialogue that is task-oriented and adaptive; effective screen design; and support for a variety of interaction styles and/or learning styles. No matter how 'intelligent' the internal system is, if these challenges have not been suitably addressed, the ITS is unlikely to yield positive transfer of learning and become acceptable to the student.

Progress in user interface design is steadily delivering superior tools whose interactive capabilities strongly influence ITS design. ITSs provide user interfaces which, for the input, range from fixed menus with multiple-choice answers to the use of natural language, gestures, hand-, finger- and eye-movements, 3D-pointing devices and a variety of physiological sensors/measurements. For the output, they range from the mere display of pre-stored texts to computer-generated speech and multi-modal virtual reality displays. Within these two ends of the range, designers are, in principle, flexible in their choices. Much more experimental research into the use of such user interfaces is still required.

5 Applying ML to Adaptive Instruction

Both adaptive instruction and ML are terms for broad sets of technologies and vast fields of research. In the preceding sections we have broken a commonly used ITS concept down into its modules: the domain expertise, the student model, the tutor, and the user interface. Given a specific application, it may be evaluated whether the required functionality of each module can benefit from ML. Whether or not it is worthwhile to apply an ML technique in the implementation of such functionality must be evaluated on a case-by-case basis by the ITS designers.

ML is notably strong in tasks that require detection, clustering, classification, prediction, dimension reduction, decision support, simulation and data generation. There are many software tools and libraries that provide ML solutions in support of these tasks. It may well be possible to design ITSs that do not perform such tasks; alternatively, such tasks may be performed manually or implemented with techniques that do not fall under the heading of ML as discussed in Sect. 2. In the following we provide examples, per module, of where ML could be a feasible technique to fulfill a functional requirement.

5.1 Applying ML to the Domain Expert Module

The main function of the expert module is to provide expert knowledge or expert behavior that can serve as the basis for evaluating the student's response. In tactical training, particularly in the military, trainees often have to respond to other parties that can be friendly, cooperative, hostile or neutral. In virtual games and simulator-based environments, these parties take the form of NPCs. The behavior of NPCs is part of the domain expertise. Specifying NPC behavior through manual programming of an ITS can be an expensive, time-consuming and tedious job, requiring specialized personnel. Several examples of behavior generation for NPCs using ML are provided in [6].

Different ML techniques have been used in different applications to overcome the knowledge elicitation challenge. A rule-based reinforcement learning technique called Dynamic Scripting [30] has been applied to generate the behavior of opponents in an air combat training system (see [31]). A different ML technique, so-called Data Driven Behavior Modeling (DDBM), has been applied (see [32]) to create NPCs in VBS3 (Virtual Battle Space, a game-based military simulation system). These NPCs learn bounding overwatch for dismounted infantry, a military tactical movement used to improve the security of units when they are moving towards a target.
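A hedged sketch of the core idea behind dynamic scripting [30]: rules are drawn from a rulebase with probability proportional to their weights, and after an encounter the weights of the rules that were used are raised or lowered according to the outcome. The rule names, parameter values and reward are invented placeholders, and the sketch is simplified; the actual technique in [30, 31] selects distinct rules and redistributes weight so that the total remains constant.

```python
import random

# Hypothetical rulebase for an opponent NPC; weights steer rule selection.
rulebase = {
    "fire_at_max_range": 100.0,
    "crank_after_shot": 100.0,
    "drag_when_targeted": 100.0,
    "beam_to_break_lock": 100.0,
}

def generate_script(rulebase, script_size=2):
    """Select rules for one encounter, with probability proportional to weight."""
    rules, weights = zip(*rulebase.items())
    return random.choices(rules, weights=weights, k=script_size)

def update_weights(rulebase, script, reward, step=25.0, w_min=10.0, w_max=400.0):
    """Reinforce (or punish) the rules used in the script, based on the outcome."""
    for rule in set(script):
        new_weight = rulebase[rule] + step * reward  # reward assumed in [-1, 1]
        rulebase[rule] = max(w_min, min(w_max, new_weight))

# Illustrative learning loop; in practice the reward comes from the simulation outcome.
for encounter in range(3):
    script = generate_script(rulebase)
    reward = random.uniform(-1.0, 1.0)  # placeholder for a real evaluation
    update_weights(rulebase, script, reward)
print(rulebase)
```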

5.2 Applying ML to the Student Model Module

One function of the student model is to evaluate how competency evolves with further exposure to the current state of the learning environment. In the study reported in [39], an intelligent agent was developed with the aim of mimicking student learning behavior. The agent managed to learn a complex game (the Space Fortress game) using reinforcement learning as the ML technique. Some of its learning characteristics, such as transfer-of-training between part-tasks, were comparable to those of human students. Hence the model may, in principle, be used to predict student learning characteristics as part of the student model.

5.3 Applying ML to the Tutoring Module

An important function of the tutoring module is to plan the teaching activities on the basis of information from the student model module. In [19], a genetic algorithm, in combination with novelty search and combinatorial optimization, was applied to automated scenario generation. As an example, the 'clear rooms' training task [40] was used, in which a team of soldiers has to learn to clear rooms under various complexity factors. The series of generated scenarios was made adaptive to a current competency measurement of the team.

In a more general sense, self-improving tutors can be devised that use ML to adapt the learning environment. Adaptive variables such as (1) error-sensitive feedback, (2) mastery learning, (3) adaptive spacing and repetition for drill-and-practice items, (4) fading of worked examples for problem-solving situations, or fading of demonstrations for behavioral tasks (such as in scenario-based simulations), and (5) metacognitive prompting, both domain-relevant and domain-independent, were suggested in [33]. For adaptive variables in flight simulations, [34] suggested aspects of the (simulated) environment such as illumination, sound level, turbulence, g-forces, oxygen supply, or manipulation of controls, displays and task load.
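As a hedged illustration of making generated scenarios adaptive to a competency measurement, the sketch below picks complexity factors for a "clear rooms" style scenario as a function of a team competency estimate. This is not the genetic-algorithm and novelty-search method of [19]; the factors, levels and mapping are invented placeholders.

```python
import random

# Hypothetical complexity factors for a "clear rooms" style scenario.
COMPLEXITY_FACTORS = {
    "rooms": [1, 2, 3, 4],
    "opponents_per_room": [0, 1, 2],
    "low_light": [False, True],
}

def generate_scenario(team_competency):
    """Pick complexity factors so that overall scenario difficulty tracks the
    team's current competency estimate (0 = novice, 1 = expert)."""
    scenario = {}
    for factor, levels in COMPLEXITY_FACTORS.items():
        # Map competency onto a level index, with slight random variation so
        # that consecutive scenarios are not identical.
        index = round(team_competency * (len(levels) - 1))
        index = max(0, min(len(levels) - 1, index + random.choice([-1, 0, 0, 1])))
        scenario[factor] = levels[index]
    return scenario

print(generate_scenario(team_competency=0.3))
print(generate_scenario(team_competency=0.9))
```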

5.4 Applying ML to the User Interface Module

The user interface provides the translation between the tutoring module's plans and the behavior of the student, in a form that is comprehensible to the student and can be processed by the intelligent tutor. Natural language processing (speech recognition, natural language understanding, natural language generation) is an aspect of the user interface module where ML could be applicable.

Moreover, Cha et al. [35] appreciate that each learner has different preferences and needs. It is therefore important to provide students who have different learning styles with learning environments suited to their preferences, offering them a more efficient learning experience. Cha et al. report a study of an ITS in which the learner's preferences are detected and the user interface is then customized in an adaptive manner to accommodate those preferences. An ITS with a specific interface was created based on a learning-style model. Different student preferences became apparent through user interactions with the system. Using this interface and different ML techniques (Decision Tree and Hidden Markov Model approaches), learning styles were diagnosed from the behavioral patterns of the student interacting with the interface.
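A hedged sketch of the kind of classifier mentioned above: interface-interaction features are mapped to a diagnosed learning style with a decision tree. The features, labels and data are fabricated placeholders, scikit-learn is assumed to be available, and the sketch does not reproduce the actual feature set or models of [35].

```python
from sklearn.tree import DecisionTreeClassifier

# Fabricated interaction features per student:
# [fraction of time on visual content, number of hint requests, mean response time (s)]
X = [
    [0.8, 2, 12.0],
    [0.7, 1, 10.0],
    [0.2, 6, 25.0],
    [0.3, 5, 30.0],
]
y = ["visual", "visual", "verbal", "verbal"]  # hypothetical learning-style labels

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Diagnose the learning style of a new student from interface behaviour, so that
# the user interface can be adapted to the diagnosed preference.
print(clf.predict([[0.75, 2, 11.0]]))  # e.g. ['visual'] (illustrative)
```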

6 Discussion/Conclusion

ML can potentially be applied in adaptive instruction for any domain requiring trained operators and teams. Life-long learning, the increasing demand for education and training, technological progress, and the scalability of adaptive instruction will presumably contribute to its spread. However, new developments may spread more slowly than we expect. As an example, it was generally expected that Massive Open Online Courses (MOOCs) would disrupt existing models of higher education, yet recent research (e.g. [36]) into their success reveals that most students do not complete such courses; dropout rates of MOOCs offered by Stanford, MIT and UC Berkeley are as high as 80–95%. Given that the first prototypes of ITSs were developed in the early seventies, one may assume that lessons learned from adaptive instruction could be applied to MOOC design in order to increase their success. This suggests that further progress can be made in this area.

In this paper, we present an architecture with which to discuss the application of ML techniques to adaptive instruction, particularly ITSs. In the sense of the model of Fig. 2, ML can potentially be applied in all four modules of the model, as supported by the concrete examples of the application of ML techniques in the previous sections.

The tutoring module decides how learner outcomes affect the adaptive variable. Such decisions must be better tailored to the individual student or team than the decisions built into non-adaptive computer-based instruction. The tutoring module takes into account the characteristics of the learning process of the individual or team, and changes in an adaptive variable can then be tailored to this learning process. This is what distinguishes adaptive from non-adaptive instructional strategies. It implies that adaptive instruction has added value over non-personalized instruction in applications where individuals or individual teams have sufficiently different learning processes and learner outcomes.

For the purpose of evaluating learner outcomes, valid behavioral markers must be defined. In turn, these must be represented by signals that can be used in the evaluation. For example, in a relatively straightforward task, such as a compensatory tracking task, the goal is to minimize the deviation, i.e. the difference between a manual output signal and a reference signal. A valid and reliable learner outcome may be found by averaging this deviation over a certain time period. For more complex real-life tasks, the determination of the learner outcomes of interest, the associated behavioral markers and the processing of the signals that represent these markers are equally complex. ML techniques may be of help in finding solutions in this context, too.

The application of ML may also have some potential disadvantages. Methods may be opaque in the sense that the relationship between input and output can neither be inferred nor explained. Some methods require massive amounts of data to converge to a solution, and such data cannot always be made available in an educational context. Emergent, unstable or unexpected behavior of ML-enabled functions may be problematic for instructors, and possibly for other purposes as well. Some methods are so resource intensive and computationally heavy that they may be unsuitable for, e.g., mobile platforms and real-time application. Also, in some settings it may be desirable that a human instructor temporarily takes over control from the intelligent tutor (or the expert module). This may constrain the use of ML for specific purposes or applications. ML techniques, the behavior models they generate, and the tools with which they are controlled should facilitate such takeovers, and the behavior of the ITS should adapt gracefully.
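For the compensatory tracking example mentioned above, the learner outcome can be computed directly from the signals. A minimal sketch follows; the signal values and sampling are assumptions made purely for illustration.

```python
# Compensatory tracking: the learner outcome is the mean deviation between
# the manual output signal and the reference signal over a time window.
def mean_absolute_deviation(output_signal, reference_signal):
    deviations = [abs(o - r) for o, r in zip(output_signal, reference_signal)]
    return sum(deviations) / len(deviations)

# Fabricated sampled signals for one practice trial.
reference = [0.0, 0.1, 0.2, 0.1, 0.0, -0.1]
output    = [0.1, 0.1, 0.4, 0.2, -0.1, -0.1]
print(f"learner outcome: {mean_absolute_deviation(output, reference):.3f}")
```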

References

1. Sottilare, R.A.: Applying adaptive instruction to enhance learning in non-adaptive virtual training environments. In: Bagnara, S., Tartaglia, R., Albolino, S., Alexander, T., Fujita, Y. (eds.) IEA 2018. AISC, vol. 822, pp. 155–162. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-96077-7_16
2. Hämäläinen, W., Vinni, M.: Comparison of machine learning methods for intelligent tutoring systems. In: Ikeda, M., Ashley, K.D., Chan, T.-W. (eds.) ITS 2006. LNCS, vol. 4053, pp. 525–534. Springer, Heidelberg (2006). https://doi.org/10.1007/11774303_52
3. Kotsiantis, S.B., Pierrakeas, C.J., Pintelas, P.E.: Preventing student dropout in distance learning using machine learning techniques. In: Palade, V., Howlett, R.J., Jain, L. (eds.) KES 2003. LNCS (LNAI), vol. 2774, pp. 267–274. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45226-3_37
4. Minaei-Bidgoli, B., Kashy, D.A., Kortemeyer, G., Punch, W.F.: Predicting student performance: an application of data mining methods with an educational web-based system. In: Frontiers in Education (FiE), p. T2A-13. IEEE (2003)
5. Abdellaoui, N., Taylor, A., Parkinson, G.: Comparative analysis of computer generated forces' artificial intelligence. Technical report, Defence Research and Development Canada, Ottawa, Ontario (2009)
6. Roessingh, J.J.M., et al.: Machine learning techniques for autonomous agents in military simulations - multum in parvo. In: IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff Center, Banff, Canada (2017)
7. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Pearson Education, Upper Saddle River (2003)
8. Mitchell, T.: The discipline of machine learning. Technical report CMU-ML-06-108. Carnegie Mellon University, Pittsburgh, PA 15213 (2006)
9. Skinner, B.F.: Teaching machines. Science 128(3330), 969–977 (1958)
10. Skinner, B.F.: Review lecture - the technology of teaching. Proc. R. Soc. 162, 427–443 (1965)
11. Carbonell, J.R.: AI in CAI: an artificial intelligence approach to computer-assisted instruction. IEEE Trans. Man-Mach. Syst. 11, 190–202 (1970)
12. Carbonell, J.R.: Mixed-initiative man-computer instructional dialogues. Final Report, BBN Report No. 1971, Job No. 11399. Bolt Beranek and Newman, Inc., Cambridge, MA, USA (1970)
13. Michalski, R.S., Carbonell, J.G., Mitchell, T.M. (eds.): Machine Learning: An Artificial Intelligence Approach. Springer, Heidelberg (1983). https://doi.org/10.1007/978-3-662-12405-5
14. Domingos, P.: The Master Algorithm: How the Quest for the Ultimate Learning Machine will Remake Our World. Basic Books, New York (2015)


15. Sottilare, R.: A hybrid machine learning approach to Automated Scenario Generation (ASG) to support adaptive instruction in virtual simulations and games. In: The International Defense & Homeland Security Simulation Workshop of the I3M Conference, Budapest, Hungary, September 2018
16. Hessel, M., et al.: Rainbow: combining improvements in deep reinforcement learning. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (2018)
17. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
18. Gaines, B.R.: The learning of perceptual-motor skills by men and machines and its relationship to training. Instr. Sci. 1(3), 263–312 (1972)
19. Sottilare, R.A., Burke, C.S., Salas, E., Sinatra, A.M., Johnston, J.H., Gilbert, S.B.: Designing adaptive instruction for teams: a meta-analysis. Int. J. Artif. Intell. Educ. 28(2), 225–264 (2017). https://doi.org/10.1007/s40593-017-0146-z
20. Fonte, F.A.M., Burguillo, J.C., Nistal, M.L.: An intelligent tutoring module controlled by BDI agents for an e-learning platform. Expert Syst. Appl. 39(8), 7546–7554 (2012)
21. Nwana, H.S.: Intelligent tutoring systems: an overview. Artif. Intell. Rev. 4(4), 251–277 (1990)
22. Kulik, J.A., Fletcher, J.D.: Effectiveness of intelligent tutoring systems: a meta-analytic review. Rev. Educ. Res. 86(1), 42–78 (2016)
23. Freedman, R.: A plan manager for mixed-initiative, multimodal dialogue. In: AAAI 1999 Workshop on Mixed-Initiative Intelligence (1999)
24. Nkambou, R., Bourdeau, J., Mizoguchi, R.: Advances in Intelligent Tutoring Systems. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14363-2
25. Goldstein, I.P.: The genetic graph: a representation for the evolution of procedural knowledge. Int. J. Man Mach. Stud. 11(1), 51–77 (1979)
26. Self, J.A.: Student models: what use are they? In: Ercoli, P., Lewis, R. (eds.) Artificial Intelligence Tools in Education, pp. 73–86. North Holland, Amsterdam (1988)
27. Koedinger, K.R., Anderson, J.R., Hadley, W.H., Mark, M.A.: Intelligent tutoring goes to school in the big city. Int. J. Artif. Intell. Educ. (IJAIED) 8, 30–43 (1997)
28. Corbett, A.T., Anderson, J.R.: Student modeling and mastery learning in a computer-based programming tutor. In: Frasson, C., Gauthier, G., McCalla, G.I. (eds.) ITS 1992. LNCS, vol. 608, pp. 413–420. Springer, Heidelberg (1992). https://doi.org/10.1007/3-540-55606-0_49
29. Padayachee, I.: Intelligent tutoring systems: architecture and characteristics. In: Proceedings of the 32nd Annual SACLA Conference, pp. 1–8 (2002)
30. Spronck, P., Ponsen, M., Sprinkhuizen-Kuyper, I., Postma, E.: Adaptive game AI with dynamic scripting. Mach. Learn. 63(3), 217–248 (2006)
31. Toubman, A., Roessingh, J.J., Spronck, P., Plaat, A., Van Den Herik, J.: Rewarding air combat behavior in training simulations. In: IEEE International Conference on Systems, Man, and Cybernetics, Hong Kong, pp. 1397–1402 (2015)
32. Kamrani, F., Luotsinen, L.J., Løvlid, R.A.: Learning objective agent behavior using a data-driven modeling approach. In: IEEE International Conference on Systems, Man, and Cybernetics (SMC), Budapest, Hungary, pp. 002175–002181, October 2016
33. Durlach, P.J., Ray, J.M.: Designing adaptive instructional environments: insights from empirical evidence. Technical report, U.S. Army Research Inst for the Behavioral and Social Sciences, Orlando, FL (2011)
34. Kelley, C.R.: What is adaptive training? Hum. Factors 11, 547–556 (1969)
35. Cha, H.J., Kim, Y.S., Park, S.H., Yoon, T.B., Jung, Y.M., Lee, J.-H.: Learning styles diagnosis based on user interface behaviors for the customization of learning interfaces in an intelligent tutoring system. In: Ikeda, M., Ashley, K.D., Chan, T.-W. (eds.) ITS 2006. LNCS, vol. 4053, pp. 513–524. Springer, Heidelberg (2006). https://doi.org/10.1007/11774303_51


36. Khalil, H., Ebner, M.: MOOCs completion rates and possible methods to improve retention - a literature review. In: Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications (EdMedia). AACE, Chesapeake, VA (2014)
37. Bellman, R., Kalaba, R.: On adaptive control processes. IRE Trans. Autom. Control 2(4), 1–9 (1959)
38. Hull, C.L.: Mind, mechanism, and adaptive behavior. Psychol. Rev. 44, 1–32 (1937)
39. van Oijen, J., Roessingh, J.J., Poppinga, G., Garcia, V.: Learning analytics of playing space fortress with reinforcement learning. In: Sottilare, R., Schwarz, J. (eds.) HCII 2019. LNCS, vol. 11597, pp. 363–378. Springer, Cham (2019)
40. US Army: Field Manual No. 3-21.8 - The Infantry Rifle Platoon and Squad. Department of the Army, Washington, D.C., 28 March 2007

Validating Air Combat Behaviour Models for Adaptive Training of Teams

Armon Toubman

Netherlands Aerospace Centre NLR, Amsterdam, The Netherlands
[email protected]

Abstract. On many occasions, the use of machine learning to adaptively generate new content for training simulations has been demonstrated. However, the validation of the new content (i.e., proof that the new content is fit for use in training simulations) has received relatively little attention. In this study, we design a validation procedure for one particular type of content, namely the behaviour models for the virtual opponents in air combat training simulations. As a case study, we generate a new set of behaviour models and apply the validation procedure to them. Our results are positive, but leave room for interpretation. We discuss why this is the case and suggest avenues for future work.

Keywords: Adaptive training · Air combat · Model validation

1 Introduction

In air combat training simulations, the role of opponent is often played by virtual entities known as computer generated forces (CGFs). Various research efforts have demonstrated the ability of machine learning (cf. Karli et al. 2017; Teng et al. 2013; Toubman et al. 2016) and other adaptive techniques (cf. Floyd et al. 2017; Karneeb et al. 2018) to generate air combat behaviour models for CGFs. The strength of such techniques is that the computer can automatically adapt the behaviour of the CGFs, and thus the training, to the trainee fighter pilots. However, the creative capabilities of these techniques may result in undesirable (e.g., non-humanlike) behaviour that is not useful for training (Petty 2003).

The main idea behind this paper is that newly generated behaviour models should be validated to prove their usefulness in training simulations. In the remainder of this paper, we investigate what this validation entails (see Sect. 2). The two contributions of this paper are the following:

1. We present a validation procedure for machine-learned air combat behaviour models (see Sect. 3). A key component of the procedure is a newly developed questionnaire for the assessment of the behaviour produced by air combat behaviour models. We call this questionnaire the Assessment Tool for Air Combat CGFs (ATACC).
2. As a case study, we generate novel air combat behaviour models by means of machine learning (see Sect. 4) and apply the validation procedure to the models (see Sect. 5). The results show that the generated behaviour models are valid to some extent, but also that both the behaviour models and the validation procedure require additional effort (see Sect. 6).

To the best of our knowledge, this is the first time that the validation of machine-generated air combat behaviour models has been treated as a research subject in its own right (see Sect. 7).

2 The Difficulty of Validating Behaviour Models

Since the advent of the use of simulation in military training, there has been a rising interest in the validation of simulation models (cf. Kim et al. 2015; Sargent 2011). Many definitions of validation have been stated throughout the literature (cf. Birta and Arbez 2013; Bruzzone and Massei 2017; Petty 2010). When military simulations in particular are discussed, we find references to the definition of validation that is used by the US Department of Defense (2009). We use this definition from now onwards. For convenience, we restate the definition.

Definition 1 (Validation). Validation is "[t]he process of determining the degree to which a model or simulation and its associated data are an accurate representation of the real world from the perspective of the intended uses of the model" (ibid.).

The definition names four important concepts: (1) a process, (2) a degree of accuracy, (3) a model (or simulation), and (4) the intended use of the model. We can readily fill in concepts (3) and (4). Regarding concept (3), the models that we wish to validate are newly generated behaviour models. Furthermore, regarding concept (4), the intended use of these models is to produce behaviour for opponent CGFs in air combat training simulations. However, this leaves open two questions for us to investigate: (1) what the process entails, and (2) how we should determine the accuracy of the models. The difficulty of validating behaviour models lies in answering these two questions for every specific case.

First, we investigate the question of what the process entails. There is no one-size-fits-all solution for validation processes, since different models have (1) different intended uses, and (2) different associated works available for use in the validation. Here, we use the notion "associated work" to refer to a range of results of performed work, e.g., (1) baseline models, (2) expected output data, (3) conceptual diagrams of the modelled phenomenon, or (4) expert knowledge. This being so, we still find that the various validation methods to be applied are well described in the literature (cf. Balci 1994; Petty 2010; Sargent 2011). In general, the four categories of validation methods are: (1) informal methods such as face validation, (2) static methods such as evaluating the model structure, (3) dynamic methods that involve executing the model and analysing the output data, and (4) formal methods based on mathematical proofs. An important factor in the choice of validation method(s) to use is the availability of associated works (cf. Petty 2010; Sargent 2011). For example, dynamic methods can only be applied if (1) it is possible to execute the model with input that is relevant with regard to the intended use of the model, (2) data can be collected on the execution of the model, and (3) it is known how the collected data should be interpreted (e.g., compared to another available set of data). In other words, the choice of validation methods is always limited by practical considerations.

The second question we would like to investigate reads: how should we determine the accuracy of the models? For instance, for a physics-based model, the accuracy can be defined in terms of the number of faults that is allowed when the data that the model produces is compared to data that is measured in the real world. However, for behaviour models the question is particularly difficult to answer, since the notion of fault is difficult to grasp (Hahn 2017). Goerger et al. (2005) identify five causes of the difficulty of validating behaviour models in general. Four of these causes relate to the problem of defining the accuracy of a behaviour model (the fifth cause, the lack of a standard validation process, was discussed earlier). These four causes are: (1) the cognitive processes that are modelled may be nonlinear, which makes the processes as well as their models hard to reason about, (2) it is impossible to investigate all possible interactions that may arise in simulations because of the large number of interdependent variables in the models, (3) the metrics for measuring accuracy are inadequate, and (4) there is no "robust" set of input data for the models.

An important consequence of the difficulty of validating behaviour models is that the outcome of a validation should not be interpreted as either "the model is valid" or "the model is not valid", as it is practically impossible to "completely validate" a model (Birta and Arbez 2013). Therefore, Birta and Arbez (ibid.) note that "degrees of success must be recognized and accepted." For them, it is important that the chosen validation methods are able to adequately reflect the extent of the validity of the models.

3 Our Proposed Validation Procedure

In this section, we present our validation procedure for air combat behaviour models. Specifically, the validation procedure is aimed at automatically generated (e.g., machine-learned) behaviour models. The main idea behind the validation procedure is a comparison of (a) the behaviour displayed by CGFs that use the generated behaviour models, to (b) the behaviour displayed by CGFs that use behaviour models that have been written by professional model builders and/or subject matter experts (henceforth: the professionals). Essentially, we use the latter, established type of behaviour models to provide a standard of behaviour to which the former, newly generated type of behaviour models should adhere. In other words, we do not aim for the generated models to surpass the established models in any way. Rather, we aim to show their equivalence, so that the new models can be used to supplement the established models, and thereby widen the variety of the training simulations that are offered.


Fig. 1. The validation procedure. In human-in-the-loop simulations, human fighter pilots engage CGFs that are controlled either by the 4m-models (the subject of the validation) or by the 4p-models (the baseline for comparison). Expert assessors assess the behaviour displayed by the CGFs by means of a newly developed assessment tool. Equivalence testing on the assessments results in a measurable extent of validity of the 4m-models.

In order to produce observable (and thus comparable) behaviour, all of the models have to be fed with the behaviour of their opponents, i.e., CGFs controlled by human fighter pilots, in a realistic air combat setting (see Sect. 3.1). Next, the displayed behaviour has to be assessed to create data on the basis of which a comparison can be made (see Sect. 3.2). For the actual comparison, we rely on a statistical method known as equivalence testing (see Sect. 3.3). Based on the outcome of the equivalence testing, we can state the extent of the validity of the generated behaviour models. Figure 1 provides an overview of the entire validation procedure.

Human-in-the-Loop Simulations

The validation procedure begins with human-in-the-loop simulations in a high-fidelity beyond-visual-range air combat simulator. We consider a simulator that accommodates four human participants acting as fighter pilots. In the simulations, the participants engage a so-called four-ship (viz. a team of four) of hostile CGFs. The behaviour of the four-ship of CGFs is driven by four behaviour models, one for each CGF. In our experience, the behaviour models for the CGFs in a four-ship are treated as a single model. Especially when the models are designed by professionals, they are carefully tuned to each other to provide the illusion of a cohesive team at work. We henceforth consider the four models that together control the behaviour of a four-ship to be an indivisible unit. For convenience, we introduce the term 4-model to refer to such a group of four behaviour models. Using the term 4-model, we can now distinguish between (1) 4-models that have been written by the professionals, and (2) 4-models that have been generated by means of machine learning. We introduce the terms 4p-model (where the p stands for professional) and 4m-model (where the m stands for machine learning) to refer to these two kinds of 4-model, respectively.

The 4m-models are the subjects of the validation procedure. However, by themselves they are not sufficient input for the validation process. As Petty (2010) stated succinctly, validation "[is a] process[] that compare[s] things." Therefore, we require either (a) a baseline model, (b) a set of expected output data, or (c) implicit expert knowledge as a reference against which to compare the 4m-models. For complex air combat behaviour models, it is almost infeasible to compile a set of expected output data, since the output depends on a wide range of possible interactions with other entities. However, what we do have available are behaviour models that have been written previously by professionals (i.e., 4p-models). These 4p-models constitute a sample of all behaviour models that have been written by the professionals, comparable to how the 4m-models that are validated are a sample of the behaviour models that can possibly be generated by machine learning. Furthermore, we argue that since the 4p-models have been developed by means of the behaviour modelling process, the 4p-models have themselves been validated to some extent. We therefore add the 4p-models as the second input to the validation process.

We record the human-in-the-loop simulations, resulting in a set of behaviour traces. The behaviour traces contain three-dimensional recordings of the simulated airspace, including the movements of all entities (i.e., CGFs and missiles) flying in the airspace. The behaviour traces serve as input for the assessment (see next section).

3.2 Assessment

The goal of the assessment is to summarise the CGFs' behaviour that is encoded in the behaviour traces into values that are (1) meaningful and (2) comparable between the 4m-models and the 4p-models. The assessment is performed by means of a structured form of face validation, which is one of the informal validation methods (see Sect. 2). However, there is little to no information available on measures of CGF behaviour that are relevant to training simulations. Therefore, we make use of the implicit knowledge of expert evaluators. We leverage this knowledge in two ways. First, we elicit knowledge on measures for the behaviour of air combat CGFs, and structure this knowledge into a novel assessment tool which we call the Assessment Tool for Air Combat CGFs (ATACC) (see below). This tool enables a structured assessment of CGF behaviour. Second, expert evaluators review the behaviour traces that we have collected and assess the behaviour that the CGFs display. The result of the assessment is a series of ratings on Likert scales. The ratings serve as input for the equivalence tests (see next section). Below, we describe the development and contents of the ATACC.


The Assessment Tool for Air Combat CGFs. Together with instructor fighter pilots, we identified three performance dimensions that should be taken into consideration in the assessment of the behaviour of air combat CGFs. These performance dimensions are (1) the challenge provided by the CGFs, (2) the situational awareness that the CGFs display, and (3) the realism of the behaviour of the CGFs. We briefly describe the three performance dimensions below.

Performance dimension 1: Challenge. The tool should measure whether (1) the CGFs behave in such a way that the human participants in the simulations need to think about and adjust their actions, and (2) the CGFs provide some form of training value to the simulations.

Performance dimension 2: Situational awareness. The tool should measure whether (1) the CGFs appear to sense and react to changes in their environment, and (2) multiple CGFs belonging to the same team appear to acknowledge each other's presence.

Performance dimension 3: Realism. The assessment tool should measure whether (1) the CGFs behave as can be expected from their real-world counterparts, and (2) the CGFs use the capabilities of their platform (including, e.g., sensors and weapons) in a realistic manner.

Next, we attempted to formulate examples of behaviour that relate to each of the performance dimensions. This was done in an iterative manner, such that proposed examples could be critically analysed by each of the instructor fighter pilots. We formulated eight examples of behaviour in total (listed below). Examples 1 through 4 relate to Challenge; 5 and 6 to Situational awareness; and 7 and 8 to Realism. In each of the examples, red air refers to the CGFs, whereas blue air refers to the human participants in the human-in-the-loop simulations.

Example 1. Red air forced blue air to change their tactical plan.
Example 2. Red air forced blue air to change their shot doctrine.²
Example 3. Red air was within factor range.³
Example 4. Blue air was able to fire without threat from red air.⁴
Example 5. Red air acted on blue air's geometry.
Example 6. Red air acted on blue air's weapon engagement zone.⁵
Example 7. Red air flew with kinematic realism.
Example 8. Red air's behaviour was intelligent.

² Jargon: pre-briefed instructions for the use of air-to-air weapons.
³ Jargon: the range within which opponents have to be taken into account in the selection of tactical actions.
⁴ We formulated this behaviour from the viewpoint of blue air, since we were unable to satisfactorily state the behaviour from the viewpoint of red air.
⁵ Jargon: the airspace in front of a fighter jet in which a fired missile can be effective.

In the ATACC, each of the eight examples of behaviour is presented as a separate rating item, so that the presence of each behaviour is rated on a five-point Likert scale. For all eight rating items, the scale is labelled as ranging from Never to Always. To conclude the ATACC, we added a general ninth rating item stating "Red air's behaviour tested blue air's tactical air combat skills." This item served to provide us with a general indication of the usefulness of the behaviour of the CGFs in relation to the human-in-the-loop simulations that were performed. The ninth item is also rated on a five-point Likert scale, ranging from Strongly disagree to Strongly agree.
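For illustration, the rating items and their coding could be represented as data as in the sketch below. This is not part of the original study (the ATACC was administered on paper forms; see Sect. 5.2); the structure, names, and the reverse-coding helper are our own, based on the item descriptions above and the coding described in Sect. 6.

```python
# Illustrative representation of the nine ATACC rating items (not the authors'
# implementation). Item texts and performance dimensions follow the description above.
ATACC_ITEMS = {
    1: ("Red air forced blue air to change their tactical plan.", "Challenge"),
    2: ("Red air forced blue air to change their shot doctrine.", "Challenge"),
    3: ("Red air was within factor range.", "Challenge"),
    4: ("Blue air was able to fire without threat from red air.", "Challenge"),
    5: ("Red air acted on blue air's geometry.", "Situational awareness"),
    6: ("Red air acted on blue air's weapon engagement zone.", "Situational awareness"),
    7: ("Red air flew with kinematic realism.", "Realism"),
    8: ("Red air's behaviour was intelligent.", "Realism"),
    9: ("Red air's behaviour tested blue air's tactical air combat skills.", "General"),
}

def code_response(item: int, rating: int) -> int:
    """Code a Likert response 1..5; item 4 is reverse-coded before analysis (cf. Sect. 6)."""
    assert item in ATACC_ITEMS and 1 <= rating <= 5
    return 6 - rating if item == 4 else rating
```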

3.3 Equivalence Testing

At this point in the validation process, we have two sets of data: (1) the assessment of the 4p-models, and (2) the assessment of the 4m-models. We wish to compare these two sets of data in a meaningful way. Since we used the 4p-models as the baseline, we assume that the assessment of the 4p-models contains information about the desirable properties of air combat CGF behaviour. Based on this assumption, we define the measure of validity of the 4m-models as the extent to which the assessment of the 4m-models and the assessment of the 4p-models can be measured to be equivalent.

Obviously, a simple comparison of the assessments (viz., determining whether the difference between the assessments equals zero) is too strict. The results of our assessments include noise from multiple sources (e.g., the behaviour of the pilots in the human-in-the-loop simulations, and possible bias of the assessors). Furthermore, standard statistical significance tests do not suffice, since these tests check for differences rather than for equivalence. We found a solution in a form of comparison testing that is called equivalence testing.

The two one-sided tests (TOST) method tests for equivalence of the means of two populations (cf. Anderson-Cook and Borror 2016; Lakens 2017; Meyners 2012). The method starts with the assumption that two populations are different, and then collects evidence to show that the populations are the same. Note that this is the opposite of traditional tests that compare two populations (e.g., Student's t-test), which (1) start with the assumption that two populations are similar or even the same, and then (2) collect evidence to show that the populations are different. In TOST, the assumption that two populations are different (viz., the null hypothesis or $H_0$) is stated as follows.

$$H_0:\quad \mu_A - \mu_B \le \delta_L \quad\text{or}\quad \mu_A - \mu_B \ge \delta_U \tag{1}$$

Here, the difference of the means of two populations A and B is compared. Two populations are considered different if the difference of their means lies outside of the indifference zone $[\delta_L, \delta_U]$. We assume that the indifference zone is symmetrical, i.e., $\delta = \delta_U = -\delta_L$. However, we are interested in examining the hypothesis that the means are not different, i.e., that the difference between the means lies inside of the indifference zone. The reformulation of the hypothesis (viz. the alternative hypothesis or $H_1$) is stated as follows.

$$H_1:\quad \delta_L < \mu_A - \mu_B < \delta_U \tag{2}$$


If the TOST finds evidence that the difference of the means lies within the indifference zone under the assumption that it does not, we reject $H_0$ and accept $H_1$, meaning that we conclude that the populations are the same (up to a very small difference). Finding this evidence is done by splitting $H_0$ into two hypotheses, which can be tested using standard one-sided t-tests. The p-value of the TOST then becomes the maximum of the two p-values obtained from the two one-sided t-tests.

The outcome of the TOST greatly depends on the value chosen for $\delta$. Until recently, $\delta$ could not be calculated directly. It was either (1) prescribed by regulatory agencies (e.g., in the field of pharmacology) or (2) determined by subject matter experts based on reference studies or expectations about the data (e.g., in psychology) (cf. Anderson-Cook and Borror 2016; Lakens 2017). For our validation, it is difficult to determine a suitable $\delta$, since we have neither a regulatory agency nor a reference study available. However, in 2016, an objective calculation of $\delta$ was introduced by Juzek (2016). The calculation of this $\delta$ (henceforth: Juzek's $\delta$) is as follows.

$$\delta = 4.58\,\frac{s_p}{N_p} \tag{3}$$

Here, $s_p$ is the pooled standard deviation of the two samples under comparison, and $N_p$ is the pooled number of data points in the samples. Juzek found the coefficient (4.58) by simulating a large number of TOST applications. The coefficient was approximated in such a way that Juzek's $\delta$ gives the TOST the appropriate statistical power ($1-\alpha = 95\%$, $1-\beta = 80\%$). Armed with the TOST method, we are now able to test the statistical equivalence of the assessments of the 4p-models and the 4m-models per rating item. The extent to which the rating items are equivalent can then be seen as the extent to which the 4m-models are valid.
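To make the procedure concrete, the sketch below shows how a Welch-based TOST with a margin computed as in Eq. (3) (as reconstructed above) could be implemented in Python. This is an illustrative sketch only, not the analysis code used in the study (the analyses in Sect. 6 used R's TOSTER package); the function names and the example ratings are ours.

```python
# Illustrative Welch-based TOST with an equivalence margin following Eq. (3).
import numpy as np
from scipy import stats

def juzek_delta(a, b, coefficient=4.58):
    """Equivalence margin from the pooled SD and pooled number of data points (Eq. 3)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return coefficient * np.sqrt(pooled_var) / (n_a + n_b)

def welch_tost(a, b, delta):
    """Two one-sided Welch t-tests against the bounds [-delta, +delta].

    Returns the TOST p-value (the larger of the two one-sided p-values);
    a small p-value supports equivalence of the two group means.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    n_a, n_b = len(a), len(b)
    var_a, var_b = a.var(ddof=1) / n_a, b.var(ddof=1) / n_b
    se = np.sqrt(var_a + var_b)                                   # Welch standard error
    df = se**4 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))    # Welch-Satterthwaite df
    diff = a.mean() - b.mean()
    p_lower = 1.0 - stats.t.cdf((diff + delta) / se, df)  # H0: diff <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)        # H0: diff >= +delta
    return max(p_lower, p_upper)

# Example with placeholder ratings for one ATACC item (not the study data).
ratings_4p = np.array([3, 3, 4, 2, 3, 3, 4, 3, 2, 3])
ratings_4m = np.array([3, 4, 3, 3, 2, 4, 3, 3, 3, 4])
delta = juzek_delta(ratings_4p, ratings_4m)
print(f"delta = {delta:.3f}, TOST p = {welch_tost(ratings_4p, ratings_4m, delta):.3f}")
```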

4 Generating Air Combat Behaviour Models

We generated four novel 4m-models in preparation for the application of the validation procedure. These 4m-models served as the subject of the validation. The 4m-models were generated by means of the dynamic scripting machine learning algorithm (Spronck et al. 2006). The specific method for applying dynamic scripting to generate 4m-models for air combat simulations is described by Toubman et al. (2016). We do not restate the full method here, as it is not the focus of this paper. In brief, the method consists of the following three steps:

1. We obtain four 4p-models that have been written by a professional and that have seen use in actual training simulations;
2. We decompose the 4p-models into their constituent "states"⁶ and the transitions between these states;
3. The dynamic scripting algorithm repeatedly recombines the states and transitions into new behaviour models (4m-models) and tests these models in automated, agent-based simulations. The algorithm halts after a certain number of repetitions and returns the four best-performing (viz. most-winning) 4m-models that it has found.

⁶ In our method, a state defines a "piece of behaviour", such as, but not limited to, "firing a missile" or "defensive moves".

The use of this method thus results in (a) the four professionally written 4p-models obtained in the first step, and (b) the four machine-generated 4m-models obtained in the third step. Together, the eight models serve as input to the validation procedure (see Sect. 5).
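As an illustration of step 3, the sketch below outlines the core loop of dynamic scripting in Python. It is a simplified reading of Spronck et al. (2006), not the generation code used in this work: the rules (states and transitions), the simulation function, and all parameter values are placeholders.

```python
# Minimal sketch of a dynamic scripting loop (illustrative, simplified).
import random

def dynamic_scripting(rules, run_simulation, episodes=1000, script_size=8,
                      w_init=100, w_min=10, w_max=500, reward=50):
    """Learn a script (a subset of rules) by adjusting rule weights after each episode."""
    weights = {rule: w_init for rule in rules}
    for _ in range(episodes):
        # Compose a script by weight-proportional sampling without replacement.
        pool, script = list(rules), []
        for _ in range(min(script_size, len(pool))):
            pick = random.choices(pool, weights=[weights[r] for r in pool])[0]
            script.append(pick)
            pool.remove(pick)
        won = run_simulation(script)   # automated agent-based simulation: True if the script wins
        # Reward or punish the rules used in the script; compensate unused rules
        # so that the total amount of weight stays roughly constant.
        adjustment = reward if won else -reward
        unused = [r for r in rules if r not in script]
        for r in script:
            weights[r] = min(w_max, max(w_min, weights[r] + adjustment))
        if unused:
            compensation = -adjustment * len(script) / len(unused)
            for r in unused:
                weights[r] = min(w_max, max(w_min, weights[r] + compensation))
    # Return the highest-weight rules as the final behaviour model.
    best = sorted(rules, key=lambda r: weights[r], reverse=True)[:script_size]
    return best, weights
```

A hypothetical call could look like `dynamic_scripting(states_and_transitions, simulate_engagement)`, where the first argument comes from step 2 and the second wraps the agent-based simulation environment.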

5 Applying the Validation Procedure

In this section, we report on the application of the validation procedure (see Sect. 3) to a set of newly generated 4m-models (see Sect. 4). We present the application in the form of an experiment: the current section contains the "experimental method", i.e., gathering behaviour traces in human-in-the-loop simulations (see Sect. 5.1) and performing the assessment (see Sect. 5.2). Later, we present the "experimental results", i.e., the ratings obtained from the assessment and the results of the equivalence tests (see Sect. 6).

5.1 Human-in-the-Loop Simulations

Human-in-the-loop simulations were used to determine how a four-ship of red CGFs behaves when the CGFs interact with human participants controlling four blue CGFs. The simulations were performed in NLR's Fighter 4-Ship simulator. This simulator consists of four networked F-16 mock-up cockpits. The behaviour of the reds was controlled by means of eight 4-models: the four 4p-models plus the four 4m-models (see Sect. 4). Using these eight 4-models, we defined eight scenarios. Each scenario was a simulation configuration in which a four-ship of red CGFs approached the human participants from the simulated north. In each scenario, the red four-ship used either one of the four 4p-models or one of the four 4m-models, so that each of the 4-models was used in one of the scenarios.

The human participants in the simulations were active-duty Royal Netherlands Air Force (RNLAF) F-16 pilots from Volkel Airbase (all male, n = 16, age μ = 32.0, σ = 5.35), and one former RNLAF F-16 pilot (age = 60).⁷ No selection criteria were applied. The active-duty pilots were assigned to the human-in-the-loop simulations based on availability. Experience levels ranged from wingman to weapons instructor pilot. Over the course of three days, five teams of four participants controlled the blue CGFs in the Fighter 4-Ship.

Before the simulations took place, the participants received a "mission briefing" document that described (1) the capabilities of the blue CGFs that they would control, and (2) the capabilities of the red CGFs that the participants were to expect in the simulator. The eight scenarios were presented sequentially in a random order. The participants were unaware of the origin of the 4-models controlling the red CGFs (i.e., the simulations were performed in a single-blinded fashion). Each scenario ended when either all four red CGFs or all four human participants were defeated. The human-in-the-loop simulations were recorded using the PCDS mission debrief software. In addition to behaviour traces, the recordings included (1) the voice communication that took place among the human participants, and (2) video recordings of the multi-functional displays of the cockpits occupied by the human participants. In total, 33 recordings⁸ were stored.

⁷ One of the active-duty participants had to leave after four scenarios. This situation presented us with three options: (1) continue without this participant (viz., with a three-ship), (2) cancel the remaining simulations, or (3) substitute the participant with a former F-16 pilot who was available. Since the participant had a non-commanding role in the four-ship, we deemed his influence on the decision-making of the human participants to be minimal. Still, by controlling the fourth blue CGF, he provided valuable input that allowed the red CGFs to function. Furthermore, participants were scarce. We decided that the collection of data was paramount, and let the former F-16 pilot substitute the participant in the remaining simulations.
⁸ Two teams were not available to complete all eight scenarios. Together, these two teams completed nine scenarios: the eight scenarios, plus one duplicate.

5.2 Assessment

The behaviour that the reds displayed in the human-in-the-loop simulations was assessed by human experts. Active-duty RNLAF F-16 pilots from Leeuwarden Airbase acted as assessors (all male, n = 5, age μ = 35.2, σ = 5.17). Assessors were selected on having a tactical instructor pilot or weapons instructor pilot qualification. All five assessors had the weapons instructor pilot qualification. The assessment was performed by means of the ATACC, implemented on paper.

Originally, we had planned to let each assessor assess all of the 33 recordings within a three-hour time span. However, a pilot study with two weapons instructor pilots (not counted above) revealed that this was unfeasible because of time constraints. We subsequently reduced the pool of recordings available for rating to 16 recordings. These 16 recordings came from two teams that completed all eight scenarios (i.e., simulations with the four 4p-models and the four 4m-models) in the human-in-the-loop simulations. From this reduced pool of recordings, we assigned ten recordings to each assessor, consisting of (1) eight recordings from one of the two teams in random order, and (2) two recordings from the other team.

Furthermore, the weapons instructor pilots in the pilot study expressed that they were unable to adequately assess the intelligence of the red CGFs (rating item 8) and the extent to which the red CGFs tested the skills of the pilots in the simulator (rating item 9) without knowing the experience levels of these pilots. Based on this feedback, we made the decision to disclose the experience levels to the assessors during the assessment.

The assessors were provided with a laptop computer with mouse and headphones, a stack of ten ATACC forms, and an instruction sheet. The PCDS recordings were opened on the computer. Each ATACC form was marked with a unique code that referred to a specific recording in PCDS. The assessors were instructed to view the recordings in the (pre-randomised) order indicated by their ATACC forms.


Table 1. Summary of the ATACC responses: the number of responses (n), mean response (μ), and standard deviation (σ) of the responses to the ATACC rating items for the 4p-models and the 4m-models.

Rating item    4p-models               4m-models
               n    μ      σ           n    μ      σ
1              28   3.04   0.79        24   3.25   0.99
2              28   2.07   0.98        24   2.33   1.13
3              28   3.18   1.19        24   3.92   1.02
4              27   2.26   0.86        24   2.71   0.91
5              28   3.29   0.71        24   3.42   0.58
6              28   2.75   0.89        24   3.33   0.70
7              22   3.82   0.66        20   3.70   0.73
8              28   2.86   0.80        24   2.96   0.69
9              27   3.81   0.68        24   3.63   0.65

6 Validation Results

In this section, we present the results of (a) the assessments and (b) the equivalence tests that were performed. Additionally, we provide the results of (c) follow-up tests in the cases where no equivalence was found.

Assessment Results. A summary of the responses to the ATACC is given in Table 1. The responses to the Likert-scale rating items were coded as integer values ranging from 1 (Never/Strongly disagree) to 5 (Always/Strongly agree). The coding for rating item 4 (Blue air was able to fire without threat from red air) was inverted so that the values reflected the occurrence of red behaviour (i.e., red influencing blue's ability to fire).

Equivalence Testing. We applied Schuirmann's (1987) TOST method to determine the equivalence of (1) the responses given on the ATACC for the 4p-models, and (2) the responses given on the ATACC for the 4m-models. We calculated δ (as Juzek's δ) for the responses to each rating item of the ATACC, and then performed the TOST on the responses to each rating item. The TOST was performed using the TOSTtwo.raw function from R's TOSTER package, with Welch's t-test as the underlying one-sided test. We chose to use Welch's t-test here because of the unequal sample sizes.⁹ The δ and the results of the TOST (t-value, degrees of freedom [df], p-value, and the 90% confidence interval [CI] of the difference of the means) are shown in Table 2. In Table 2, significant p-values are marked with an asterisk. Based on the results of the TOST, we conclude that the responses to rating items 1, 2, 5, 7, 8, and 9 are equivalent between the 4p-models and the 4m-models (see Sect. 3.2 for the definitions of the examples of behaviour represented by these rating items).

⁹ There is an ongoing discussion on the topic of whether parametric tests such as the t-test are suitable for use on ordinal Likert-scale data. Parametric tests have on multiple occasions been shown to be robust against violated assumptions (such as non-normal, ordinal data) (cf. De Winter 2013; Derrick and White 2017; Norman 2010). Using parametric tests in our TOST allows us to use well-tested, publicly available tools such as the mentioned R package.

Table 2. Results of the TOST method per rating item (i.). The TOST was based on Welch's t-test. For rating items where the TOST method did not find equivalence, an additional standard (Welch's) t-test was performed. Significant p-values at the α = 0.05 level are marked with an asterisk (*). The relevance (Rel.) of the outcome of the tests is indicated in the rightmost column (eq. = equivalent, diff. = different, und. = undecided).

i.  TOST                                                Standard t-test                          Rel.
    δ      t       df    p      90% CI                  t      df    p      95% CI
1   0.798  2.322   43.9  .012*  −0.637; +0.208                                                   eq.
2   0.944  2.307   45.9  .013*  −0.758; +0.234                                                   eq.
3   1.000  0.855   50.0  .198   −1.251; −0.225          −2.41  50.0  .020*  −1.353; −0.124       diff.
4   0.800  1.414   47.5  .082   −0.866; −0.032          −1.81  47.5  .077   −0.949; +0.050       und.
5   0.590  2.551   49.9  .007*  −0.432; +0.170                                                   eq.
6   0.725  0.643   49.7  .262   −0.953; −0.214          −2.64  49.7  .011*  −1.018; −0.149       diff.
7   0.697  −2.674  38.5  .005*  −0.247; +0.483                                                   eq.
8   0.677  2.779   50.0  .004*  −0.448; +0.246                                                   eq.
9   0.604  −2.223  48.8  .015*  −0.122; +0.502                                                   eq.

Follow-Up Testing. The TOST did not find equivalence for rating items 3, 4, and 6. For these rating items, we conducted a follow-up test to determine whether the responses to these rating items significantly differed between the 4p-models and the 4m-models. This test was a standard two-sided Welch's t-test. A significant difference was found for rating items 3 (Red air was within factor range) and 6 (Red air acted on blue air's weapon engagement zone). For both rating items, the responses indicated a higher frequency of the rated behaviour for the 4m-models (see Table 1). The responses to rating item 4 (Blue air was able to fire without threat from red air) were neither significantly equivalent nor significantly different. Therefore, we may conclude that their relationship is undecided.
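The follow-up comparison can be reproduced in outline with SciPy's standard Welch t-test, as in the short sketch below. The rating arrays are placeholders, not the study data.

```python
# Illustrative follow-up comparison for a non-equivalent rating item,
# using a standard two-sided Welch t-test.
import numpy as np
from scipy import stats

ratings_4p_item3 = np.array([3, 4, 2, 3, 3, 5, 2, 4, 3, 3])   # placeholder data
ratings_4m_item3 = np.array([4, 4, 5, 3, 4, 5, 4, 3, 4, 4])   # placeholder data

t_stat, p_value = stats.ttest_ind(ratings_4p_item3, ratings_4m_item3,
                                  equal_var=False)  # Welch's t-test
print(f"Welch t = {t_stat:.2f}, two-sided p = {p_value:.3f}")
```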

7 Discussion and Related Work

We started this paper by decomposing the difficulty of validating behaviour models into two questions (see Sect. 2): what does the process entail? and how should we determine the accuracy of the models? For the case of air combat behaviour models, our answer to the first question is the procedure laid out in Sect. 3. Our answer to the second question is embedded in the procedure: we determine the accuracy of newly generated 4m-models by a combination of simulation technology, behavioural science, statistical methods, and human input.

For our case study, we generated a new set of air combat behaviour models by means of machine learning, and applied the validation procedure to these models. Our key finding is that out of the nine rating items of the ATACC, six are assessed as equivalent between the 4m-models and the 4p-models. Following the advice of Birta and Arbez (2013) to recognise partial success, the results appear to moderately indicate validity. Still, the responses to the remaining three rating items do not support the notion of validity as we have defined it.

Is there any way in which we could have achieved a more convincing indication of the (non-)validity of the new models? We must acknowledge the large number of variables in our study, e.g., (1) the 4p-models, (2) the 4m-models, (3) the pilots, (4) the assessors, and (5) the ATACC. While efforts could be made to control the "noise" from these variables, it is important to consider that (1) and (2) exist in too many variations to ever be sampled effectively, and that (3) and (4) are assisting with all the implicit and explicit knowledge they have to offer. The contribution of this knowledge should be stimulated before it is controlled. We therefore propose that improvements should be sought in the area of the assessment tool (5), such as refinement of the examples of behaviour posed by the ATACC. One interesting approach might be to incorporate recent work on the mission essential competencies (MECs) into the tool (see, e.g., MacMillan et al. 2013; Tsifetakis and Kontogiannis 2017).

The validation study performed by Sadagic (2010) most closely resembles our work. The subjects of this study were behaviour models for troops in urban warfare. Expert assessors observed the behaviour of these troops, and rated its realism. The work of Sadagic differs from ours in that their simulations had no human participants. Furthermore, no statistical tests were performed, as the behaviour was rated against the assessors' ideal of realistic behaviour. In the air combat domain, we find small-scale validation studies attached to machine learning experiments. For instance, Teng et al. (2013) show that their adaptive CGFs are rated more favourably than non-adaptive CGFs on certain qualities (e.g., predictability and aggression) by expert assessors. However, in contrast to our work, Teng, Tan, and Teow aimed to develop CGFs that showed improvement on these qualities, rather than to find equivalence. By focusing on improvement, the adaptive capabilities of the CGFs have been validated, but the question remains whether the improved qualities are useful for training.

In conclusion, properly validating air combat behaviour models is difficult to accomplish, yet essential for the training simulations that aim to use them.


The validation procedure that we propose is likely one of many possible solutions. We invite more machine learning researchers and training experts to jointly address the issue of validation in future research, thereby paving the way to reliable adaptive training of teams.

Acknowledgment. The author gratefully thanks the 312, 313, and 322 squadrons of the RNLAF for their generous support during this study. Many thanks to Rich, Gump, Slime, and Speedy for sharing their ideas regarding the ATACC and test-driving the simulator. The author is also grateful to Jaap van den Herik, Pieter Spronck and Jan Joris Roessingh for their advice and thorough reviews.

References

Anderson-Cook, C.M., Borror, C.M.: The difference between equivalent and not different. Qual. Eng. 28(3), 249–262 (2016). https://doi.org/10.1080/08982112.2015.1079918
Balci, O.: Validation, verification, and testing techniques throughout the life cycle of a simulation study. Ann. Oper. Res. 53(1), 121–173 (1994). https://doi.org/10.1109/wsc.1994.717129
Birta, L.G., Arbez, G.: Modelling and Simulation: Exploring Dynamic System Behaviour. Springer, London (2013). https://doi.org/10.1007/978-1-4471-2783-3. ISBN 978-1-4471-2783-3
Bruzzone, A.G., Massei, M.: Simulation-based military training. In: Mittal, S., Durak, U., Ören, T. (eds.) Guide to Simulation-Based Disciplines. SFMA, pp. 315–361. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-61264-5_14. ISBN 978-3-319-61264-5
De Winter, J.C.F.: Using the Student's t-test with extremely small sample sizes. Pract. Assess. Res. Eval. 18(10) (2013)
Derrick, B., White, P.: Comparing two samples from an individual Likert question. Int. J. Math. Stat. 18(3), 1–13 (2017)
Floyd, M.W., et al.: A goal reasoning agent for controlling UAVs in beyond-visual-range air combat. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4714–4721. AAAI Press (2017)
Goerger, S.R., McGinnis, M.L., Darken, R.P.: A validation methodology for human behavior representation models. J. Def. Model. Simul. 2(1), 39–51 (2005). https://doi.org/10.1177/154851290500200105
Hahn, H.A.: The conundrum of verification and validation of social science-based models Redux. In: Schatz, S., Hoffman, M. (eds.) Advances in Cross-Cultural Decision Making. AISC, vol. 480, pp. 279–292. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-41636-6_23
Juzek, T.S.: Acceptability judgement tasks and grammatical theory. Ph.D. thesis, University of Oxford (2016)
Karli, M., Efe, M.Ö., Sever, H.: Air combat learning from F-16 flight information. In: Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–6 (2017). https://doi.org/10.1109/FUZZ-IEEE.2017.8015615
Karneeb, J., et al.: Distributed discrepancy detection for a goal reasoning agent in beyond-visual-range air combat. In: Roberts, M., et al. (eds.) AI Communications, vol. 31, no. 2, pp. 181–195 (2018). https://doi.org/10.3233/aic-180757
Kim, J.H., et al.: Verification, Validation, and Accreditation (VV&A) considering military and defense characteristics. Ind. Eng. Manag. Syst. 14(1), 88–93 (2015). https://doi.org/10.7232/iems.2015.14.1.088
Lakens, D.: Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Soc. Psychol. Pers. Sci. 8(4), 355–362 (2017). https://doi.org/10.1177/1948550617697177. ISSN 1948-5514
MacMillan, J., et al.: Measuring team performance in complex and dynamic military environments: the SPOTLITE method. Mil. Psychol. 25, 266 (2013)
Meyners, M.: Equivalence tests - a review. Food Qual. Prefer. 26(2), 231–245 (2012). https://doi.org/10.1016/j.foodqual.2012.05.003. ISSN 0950-3293
Norman, G.: Likert scales, levels of measurement and the laws of statistics. Adv. Health Sci. Educ. 15(5), 625–632 (2010)
Petty, M.D.: Benefits and consequences of automated learning in computer generated forces systems. Inf. Secur. 12, 63–74 (2003). https://doi.org/10.11610/isij.1203
Petty, M.D.: Verification, validation, and accreditation. In: Sokolowski, J.A., Banks, C.M. (eds.) Modeling and Simulation Fundamentals: Theoretical Underpinnings and Practical Domains, Chap. 10, pp. 325–372. Wiley, Hoboken (2010). ISBN 978-0-470-48674-0
Sadagic, A.: Validating visual simulation of small unit behavior. In: Proceedings of the 2010 Interservice/Industry Training, Simulation, and Education Conference, I/ITSEC, Orlando, Florida (2010)
Sargent, R.G.: Verification and validation of simulation models. In: Proceedings of the Winter Simulation Conference, WSC 2011, Phoenix, Arizona, pp. 183–198 (2011)
Schuirmann, D.J.: A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J. Pharmacokinet. Pharmacodyn. 15(6), 657–680 (1987). https://doi.org/10.1007/BF01068419
Spronck, P., et al.: Adaptive game AI with dynamic scripting. Mach. Learn. 63(3), 217–248 (2006). https://doi.org/10.1007/s10994-006-6205-6. ISSN 0885-6125
Teng, T.-H., Tan, A.-H., Teow, L.-N.: Adaptive computer generated forces for simulator-based training. Expert Syst. Appl. 40(18), 7341–7353 (2013). https://doi.org/10.1016/j.eswa.2013.07.004
Toubman, A., et al.: Rapid adaptation of air combat behaviour. In: Kaminka, G.A., et al. (eds.) ECAI 2016 – 22nd European Conference on Artificial Intelligence. Frontiers in Artificial Intelligence and Applications, vol. 285, pp. 1791–1796. IOS Press, The Hague (2016). https://doi.org/10.3233/978-1-61499-672-9-1791
Tsifetakis, E., Kontogiannis, T.: Evaluating non-technical skills and mission essential competencies of pilots in military aviation environments. Ergonomics, 1–15 (2017). PMID 28534423. https://doi.org/10.1080/00140139.2017.1332393
US Department of Defense: DoD Modeling and Simulation (M&S) Verification, Validation, and Accreditation (VV&A). Department of Defense Instruction 5000.61 (2009). http://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodi/500061p.pdf

Six Challenges for Human-AI Co-learning

Karel van den Bosch¹, Tjeerd Schoonderwoerd¹, Romy Blankendaal¹, and Mark Neerincx¹,²

¹ TNO, Kampweg 55, 3769 DE Soesterberg, The Netherlands
{karel.vandenbosch,tjeerd.schoonderwoerd,romy.blankendaal,mark.neerincx}@tno.nl
² Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands

Abstract. The increasing use of ever-smarter AI-technology is changing the way individuals and teams learn and perform their tasks. In hybrid teams, people collaborate with artificially intelligent partners. To utilize the different strengths and weaknesses of human and artificial intelligence, a hybrid team should be designed upon the principles that foster successful human-machine learning and cooperation. The implementation of the identified principles sets a number of challenges. Machine agents should, just like humans, have mental models that contain information about the task context, their own role (self-awareness), and the role of others (theory of mind). Furthermore, agents should be able to express and clarify their mental states to partners. In this paper we identify six challenges for humans and machines to collaborate in an adaptive, dynamic and personalized fashion. Implications for research are discussed.

Keywords: Co-active learning · Human-agent teaming · Hybrid teams · Theory of mind · Explainable AI · Mental model

1 Introduction

The literature on teams (e.g., [48,52]) has produced knowledge on how to design a training environment and the operational environment to ensure that a team of experts is also an expert team [47]. Now, with the introduction of advanced technology, people also have to form effective teams with artificially intelligent partners. The principles derived from studies on the effectiveness of human-human teams are valuable for designing human-technology teams, but there are also differences between human intelligence and Artificial Intelligence (from now on: AI) that must be taken into account. Modern AI-applications acquire knowledge about their domain and tasks by establishing correlations and patterns in the large sets of data they collect about their environment. They then use this knowledge to solve new problems. When the environment provides sufficient data, the algorithm can become very successful (e.g., recognizing cancerous tissue in MR-images [62]). However, the intelligence of such applications remains within the boundaries of the trained task. If these boundaries are narrow and well-defined, then AI does well. However, when the task context imposes a rich and a priori unknown variety of conditions (wide system boundaries), then the problem-solving intelligence of AI drops dramatically [3]. Where AI still falls short is in thinking in the abstract, in applying common sense, and in transferring knowledge from one area to another [7].

Thus, humans and AI each have their strengths and weaknesses. Humans, for example, are poor at storing and processing information, unlike AI. However, humans can make abstract decisions based on intuition, common sense, and scarce information [29]. Rather than acting as separate and equal entities, humans and AI should collaborate in a coordinated fashion to unlock the strengths of a heterogeneous team. It is believed that this needs to develop iteratively by interaction between partners [4,19,27]. This paper discusses the challenges for developing systems that enable humans and artificially intelligent technology to jointly learn and work together, adaptively and effectively.

1.1 Hybrid Teams

A hybrid team is a team of multiple agents that work together interdependently, where the agents can be either humans or machines. The cooperation of humans and machines sets new demands, as the nature of intelligence differs between agents [3]. One demand is that conditions must be created in which all agents come to recognize and acknowledge their respective capabilities. This may apply to a single human-machine combination, but it may also concern a team of multiple human-machine combinations. Another demand is that team members should have a shared understanding of how to exploit their complementary strengths to the benefit of the team. How team members should adapt to form an effective team varies from occasion to occasion. It depends upon many factors, such as the specific demands of the context and the capabilities and preferences of the other team members. Learning how to adapt takes place with every new performance of a team. Each training and each operation provides opportunities for team members to develop their skills, to refine their understanding of their own role within the team, and to deepen their knowledge of the other team members' roles, capabilities, and preferences. A further demand is that the members of a hybrid team should be able to use the progressive insights of its members to formalize and tune the work agreements.

Figure 1 shows a representation of a hybrid team. The green inner area of Fig. 1 shows a team consisting of four human-machine units. One human-machine unit is shown enlarged for explication purposes. The human and the machine both have, develop, and maintain a mental model (shown in the lower two clouds). The mental model of the human involves knowledge about the task, a concept of its own role in the team, and expectations about the contributions of other team members. The mental model of a machine is likely to be much less elaborated, involving specific knowledge about the task to be performed by the machine, and some aspects of the context.


Fig. 1. Human-machine cooperation in a hybrid team (Color figure online)

Both the human and the machine have an ePartner, an AI-based agent [5,40]. The purpose of an ePartner is to assist its user to act as a good team member. An agent (either machine or human) and its ePartner form a unit (hence the dotted ellipses). Indirectly, by assisting its user, the ePartner supports the team as a whole. An ePartner collects and processes information about the task (grey middle zone) and the context (outer yellow zone). It also collects and processes information about its user, about the partner of its user, as well as about the other human-machine units of the team (green inner zone). The ePartners use this information to construct and maintain an elaborated mental model (shown in the two upper blue clouds) containing a representation of its user (a 'self'-model), as well as a representation of the perspective of the partner-agent (a theory of mind model). Through these mental models the ePartners develop an understanding of the task, users, and team. Based on this understanding, the ePartner can initiate various support actions.

A hybrid team consists of agents. An agent is an entity that is autonomous, intentional, social, reactive, and proactive [61]. So a human is an agent. A machine can also be an agent, but only if it meets the criteria above. For instance, a robot arm that mechanically performs some kind of action is not considered an agent, even though its actions may be valuable to the team. In the hybrid team outlined in Fig. 1, we have machines in mind that are more or less intelligent agents.

A machine's mental model is typically targeted at the intelligence needed for acting adequately in a bounded variety of task conditions. It generally does not include the ability to acknowledge the needs of its partner, or of other members of the team. Thus, a mental model of the machine supports task behavior, not team behavior [51]. However, in a machine-ePartner unit (right dotted ellipse), the ePartner-agent is able to develop a mental model that covers the needs of others; not only of its machine-partner, but also of other agents in the team. This enables this ePartner to initiate supportive actions (e.g., informing its machine that a task has already been done by others in the team; informing the ePartner of the human partner that the machine's battery is almost empty). In a human-ePartner unit (left dotted ellipse), both the human as well as its ePartner-agent develop a mental model of the task and of the team. However, the mental models of these agents are not the same. The human's ePartner-agent can, for example, receive information from the ePartner of the machine about the status of its task work (e.g., the remaining battery power of the machine). It can also determine conditions of its partner (e.g., fatigue; high workload) that the human may not be aware of him- or herself [16]. Again, this enables the ePartner to initiate supportive actions (e.g., issue a warning to the human partner; request other agents in the team to take over tasks).

This envisioned cooperation between humans and machines in a hybrid team needs to develop through interaction and feedback during learning and operations, enabling all agents to acquire implicit and explicit knowledge about themselves and about their partners. Implicit knowledge about the partner is, for example, intuitively knowing how the partner will respond to a particular situation (often without realizing why). This is called 'tacit knowledge' [46], as it often cannot be adequately articulated. Explicit knowledge is, for example, knowing what the partner is likely to achieve, and accordingly, how it will act. Explicit knowledge is often obtained by deduction, logic, and reasoning [10].

The next chapter presents a use case of human-AI co-learning in hybrid teams, relating it to the literature for principles of successful development of human-AI partnerships. These principles are used to define the challenges for establishing human-AI co-learning in Sect. 3. The final chapter discusses the implications for research.

2 Co-learning in Hybrid Teams

The increasing use of ever-smarter AI technology is changing the way individuals and teams perform their tasks. Designing the models for successful hybrid teams should be based upon the principles that foster the cooperation between units consisting of human-machine combinations, and that promote the collaboration of multiple human-machine combinations at the team level (see Fig. 1). This section proposes a set of principles for human-AI co-learning, derived from the literature on human-machine interaction, human-agent teaming, and teamwork in general. It starts with a general use case to illustrate the co-learning process.

2.1 Use Case

Figure 2a presents an overview of a Human-ePartner-Robot-Team (HeRT) at a disaster scene of our use case (inspired by the TRADR use cases for robot-assisted disaster response; [14,28]). In this team, all agents have sensors for monitoring the environment (e.g., to identify human beings, passageways, objects) and their own states (e.g., health and location). However, machine agents will only have limited state-sensing capability. To support collaboration, humans and ePartners are also equipped with sensors that assess states of other agents (such as workload; [16]). There is a shared knowledge base; policies define the obligations, permissions and prohibitions for knowledge exchange (e.g., as adjustable work agreements [38]).

Fig. 2. HeRT team and scenario map: (a) Human-ePartner-Robot-Team; (b) example scenario.

When approaching the disaster scene, the Team Leader (TL) assesses the situation, selects the first Points-of-Interest (PoIs) to explore, and estimates the corresponding priorities. Based on previous missions, the ePartner of the Team Leader, i.e., ePartner(TL), proposes task allocations and work agreements for the team. Figure 2b gives an overview of the task context. The PoI is in a valley in between mountains. Victims might be found there, but the area is dangerous for humans due to the possible presence of toxic gases.

The ePartners initiate the operation by issuing work agreements among the groups and units, i.e., for notifying progress, agent states, and environmental events. A first example is that the TL will be notified about the progress of (all) groups when there is (i) a deviation from the plan, (ii) a change of an agent's state, or (iii) an unforeseen critical event. Second, specific to the Air-group, there is the agreement that the TL-unit will get regular updates ("situation reports", provided by the ePartner of the Air-group's Explorer, i.e., ePartner(E-air)) with the overview pictures of the UAV (so that the TL will maintain a general overview, and can immediately help the less experienced Air-group when needed). Third, specific between the Ground-group and the Air-group, there is the agreement that the other group is notified when a new obstacle is detected in the planned navigation routes.

Following the plan, Explorer(air) initiates the first (high-priority) task: the UAV has to explore the area between the base station and the PoI to assess its accessibility for UGV navigation. In parallel, Explorer(ground) initiates the first (high-priority) task: navigation of the UGV to the PoI, to gather information about the environment during the navigation and, subsequently, at the PoI. Based on the available information about the environment, the UGV calculates the best navigation route and starts navigating. The Air-group identifies a blockade of the planned route and ePartner(E-air) notifies the Ground-group. The UGV changes its route and continues; ePartner(UGV) provides an explanation; ePartner(TL) notifies the TL about the changed route plan with the explanation (and the information that the time of arrival at the PoI is extended).

In the meantime, the Air-group (i) is processing a large amount of environmental data with inconclusive outcomes, (ii) has to anticipate a required battery change, and (iii) is notified that storm and rain are approaching. The ePartner(E-air) identifies a "cognitive lock-up" in the data-processing task of its partner, draws her attention to the battery level and weather forecast, and notifies unit(TL). The TL assesses the adapted task plan, the UAV's battery level and the weather forecast, and determines that the UAV can stay in the air until the UGV approaches the PoI.

After the mission, all agents participate in a debriefing session. The ePartner(E-air) points to the cognitive lock-up event, and explains its assessment. The TL refines the explanation, enhancing the ePartners' knowledge base. Explorer(air) understands what happened and selects training scenarios to practice this type of situation in virtual reality.
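To make the notion of machine-readable work agreements more concrete, the sketch below encodes the three example agreements from the use case as notification policies. This is purely illustrative: the class, field names and values are our own and do not correspond to an existing TRADR or ePartner implementation.

```python
# Illustrative encoding of the use case's work agreements as data (hypothetical structure).
from dataclasses import dataclass

@dataclass
class WorkAgreement:
    sender: str      # which (e)Partner issues the notification
    receiver: str    # which agent or unit is notified
    trigger: str     # event that activates the agreement
    content: str     # what is communicated

AGREEMENTS = [
    WorkAgreement("ePartner(any group)", "unit(TL)",
                  "deviation from plan | change of agent state | unforeseen critical event",
                  "progress notification"),
    WorkAgreement("ePartner(E-air)", "unit(TL)",
                  "regular interval",
                  "situation report with UAV overview pictures"),
    WorkAgreement("ePartner(Ground-group) / ePartner(Air-group)", "other group",
                  "new obstacle detected on a planned navigation route",
                  "obstacle notification"),
]
```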

2.2 Principles of Human-AI Co-learning

Research in human-machine interaction provides useful models and methods for the required communication in the envisioned use case of Sect. 2.1, such as chat bots [11], virtual assistants [16,54], and personal teaching agents [25,55]. ePartners should tailor their communication to the specific characteristics of their human partner (e.g., preferences, experiences, mental state), the team (e.g., roles, work procedures, communication protocols) and the context (e.g., movement, noise, time pressure). However, collaboration and collaborative learning are not driven by explicit, demarcated communicative acts only. A joint task performance of a human and a machine agent requires that their social, cognitive, affective and physical behaviors are harmonized for the work processes. For establishing such harmonization, we identify a number of important principles: OPED (observability, predictability, explainability & directability), trust generation & calibration, self-awareness & theory of mind, lifelong learning on the job, and teams learning from teams.

Observability, Predictability, Explainability and Directability
Joint task performance requires that the agents deal with interdependencies: the coordinated adaptation of task performance of humans and machines to optimize their performance as a team [19,42]. Johnson et al. define three requirements for successful interdependent collaboration: Observability, Predictability and Directability [19]. In addition, Explainability has been identified as an important prerequisite for collaboration and learning (e.g., [12,41]).


Observability implies that the human agent and the machine agent are informed of their own actions, each other's actions, and the status of their role and progress in the task. In a human-machine partnership this requires that the state of a machine agent is observable to a human partner, and that the machine agent is informed about the human's status from explicit and implicit behavioral cues. The use case of Sect. 2.1 provides several example work agreements for establishing observability within a team (e.g., on an agent's state, like the robot's battery level and the explorer's stress level).

Predictability means that the actions of a team member are, to some extent, predictable, so that team members can understand them and anticipate them. The use case shows, for example, the processing of prediction information within the team on robot state (battery level), weather, and reaching the PoI, to decide on the UAV's route.

Explainability is needed in circumstances where partners desire a clarification of each other's behavior. One way of achieving this is by requesting explanations. In order to generate an explanation that fits the objective of the requesting agent, partners should have the capabilities to diagnose the state of the other agent (related to observability), and the partner's intention behind the request (related to predictability). In the use case of Sect. 2.1, for example, the ePartner of the UGV provides an explanation of the changed route towards the PoI.

Directability refers to the property of agents to take over and delegate tasks, both reactively and proactively. In the use case, for example, the TL takes over part of the task of Explorer(air) when she is in a "lock-up".

Trust Generation and Calibration
The research community has not (yet) provided a unified definition of trust, but it is commonly recognized that trust is a psychological state that is influenced by the complex interrelations between expectations, intentions and dispositions [9]. For now, we will use Mayer's trust definition: "The willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that other party" [36] (p. 712).

Trust development is a continuous process in teamwork, involving trust establishment and adjustment based on team members' experiences concerning each other's performances and the overall team performance. In teamwork, three processes should be considered: (i) interpersonal trust between members, (ii) collective trust at the team level, and (iii) the cross-level influences and dynamics [9]. It should be noted that high levels of team trust may have negative consequences, like the pressure to conform to the group's norms in "groupthink" [18]. Adequate trust calibration is crucial to establish appropriate attitudes and performances in teamwork. In the use case of Sect. 2.1, for example, the TL developed a higher level of trust for the (experienced) Ground-group than for the (less experienced) Air-group. Based on the low level of trust, a specific work agreement was made for the latter group: ePartner(E-air) provides regular updates with overview pictures to unit(TL), so that the TL can immediately help the Air-group when needed.


Self-awareness and Theory of Mind
In a well-functioning team, the team members learn how to perform their tasks, how their tasks relate to those of the other(s), and how to manage their tasks. They develop "self-awareness" of their own state and role in the partnership, "self-management capabilities", and a "Theory of Mind" (i.e., knowledge of the other, [45]). Current AI developments are enhancing machines' capabilities for self-awareness and self-management [59]. However, developing a Theory of Mind also proves to be crucial for effective human-human teamwork [35] and human-machine teamwork [24,30,43,53,60]. Furthermore, self-awareness, self-management and Theory of Mind can develop at four levels: agent, unit, group and team. The ePartners aim to enhance this by sensing, modeling, activating and sharing the relevant information (see Fig. 1). With the capabilities to form Theories of Mind, humans and ePartners develop the capability to maintain common ground, thereby meeting the challenge of Klein and colleagues [26] for successful joint activity. In the use case of Sect. 2.1, ePartner(E-air) detected a "cognitive lock-up" of its partner, sharing it (i) with the TL-unit (team level) to ensure an effective Air-group task performance, and (ii) with its partner ("self-awareness at unit level") for experience-based learning.

Lifelong Learning on the Job
Appropriate experience sharing will help teams to learn from their practices and improve their adaptive capabilities. For example, team reflections can make "tacit" knowledge explicit in a systematic way, such that the team can better cope with similar situations in the future (i.e., the team's resilience increases, [49]). The ePartner will support this process by providing (i) the "episodic memory" with the features that affect performance and resilience, and (ii) the procedures to reflect on these episodes [16]. One way to do this is to share experiences and to reflect upon these experiences [58], for example by engaging in an After Action Review [39]. The previous principle ("Self-awareness and Theory of Mind") already referred to the learning of Explorer(air) in the use case by recalling the "cognitive lock-up" episode. In addition, the TL-unit will learn from this episode about the effectiveness of its back-up behavior (i.e., enhancing the team's resilience).

Teams Learning from Teams
A learning organization requires that team experiences and knowledge are shared with other teams continuously (cf. [6]). Concerning this capability, ePartners will provide excellent support: their knowledge base can be, almost instantaneously and completely, shared with all the other ePartners. This way, an evolving library of constructive and destructive team patterns can be built and shared [57]. Subsequently, the ePartners can help to identify such patterns when they appear, together with the corresponding supporting or mitigating strategies. For the use case of Sect. 2.1, for example, the set of work agreements that proved to be effective will be shared by all teams.

3 Challenges for Developing Hybrid Team Agents

The previous chapter discussed the principles for human-AI co-learning from a team perspective. In this chapter we address the implementation of these principles: the challenges of creating learning human-AI partnerships, the constituting elements of a hybrid team. An important prerequisite for effective task and team performance is that humans and machines become aware of each other's knowledge, skills, capabilities, goals, and intentions. Humans store and structure such information in their brain in the form of mental models [8,21]. Mental models can be regarded as personal and subjective interpretations of what something is, and how something works in the real world. Humans use their mental models to explain and predict the world around them, for example when interpreting the behavior of others. In fact, it has been demonstrated that a mental model of the environment, including information about the task and knowledge of other agents, is required for efficient cooperation between humans in a team [35].

We argue that if a team consists of humans and machines, machine agents need to be initiated with a basic model of the task context, their own role, and the role of others. Furthermore, they need to be able to learn from experiences and feedback, to refine and adjust their mental models. Not all knowledge and functions need to reside in the individual agents; agents are able to share information, thus creating a kind of "team cloud" database. We identify the following six challenges to achieve effective human-machine team collaboration:

1. Agents of a hybrid team should have, develop, and refine a shared vocabulary of concepts and relations (taxonomy model)
2. Agents of a hybrid team should have access to a shared set of work agreements and interdependencies. This includes agreements on how agents can dynamically update this set as a result of learning (team model)
3. An agent should have, develop, and refine a mental model containing knowledge about the regularities between task conditions, actions and outcomes (task model)
4. An agent should have, develop, and refine a mental model containing knowledge about itself, including its needs, goals, values, capabilities, resources, plans, and emotions (self-model)
5. An agent should have, develop, and refine a mental model containing knowledge of other agents' needs, goals, values, capabilities, resources, plans, and emotions (theory-of-mind model)
6. An agent should have the functionalities, instruction, and training to communicate and explain experiences to other agents (communication model)

Challenges concerning the contents of agents' mental models are discussed in Sect. 3.1. The mental models of agents should not constitute a fixed representation of the world, but a dynamic one. The models' contents need to be constantly refined and adjusted as a result of learning from experiences. This raises the question of how machine agents should restructure their mental models in order to assimilate and represent newly acquired knowledge. Such representation challenges are discussed in Sect. 3.2.

A mental model is functional in the sense that it helps the agent to determine and tune its behavior and to develop an approach for solving a problem. At best, an agent's operations may be experienced as logical, plausible, or understandable by other agents. However, sometimes they lead to surprise or incomprehension. Establishing a flexible and resilient hybrid team requires mechanisms that enable agents to resolve misconceptions, ambiguities and inconsistencies. These challenges are discussed in Sect. 3.3.

3.1 Components of Mental Models

Conceptually, we distinguish between three types of integrated knowledge in the mental model of a hybrid team agent: knowledge about the task and context, knowledge about oneself, and knowledge about the partner.

Knowledge about the task and context: through instruction and experience, an agent accumulates knowledge about the regularities between task conditions, actions and outcomes. The agent should be able to expand its task model with the acquired knowledge (challenge 3). The agent may or may not be aware of its knowledge. Some of the relationships may be formally coded in the mental model (e.g., the task condition of seeing a 'stop' sign triggers the act of stopping the car, leading to the outcome of a safe crossing of the intersection). An agent's mental model should also contain strategies for conducting a task. Formal knowledge about relationships and strategies can easily be communicated to other agents. In addition, agents may also have implicit knowledge of regularities in their mental model. For example, when having to make a right turn, a driver agent uses subtle environmental cues to apply the forces to the steering wheel and gas pedal that produce an adequate bend. The implicit nature of such knowledge, also called 'tacit knowledge' [44], makes it hard to articulate and thus to communicate to other agents.

Knowledge about oneself: the mental model of a hybrid team agent should contain information about its own needs, goals, values, capabilities, resources, plans, and emotions (challenge 4). This enables the agent to be self-aware, an essential principle for self-management, as well as for alignment and adaptation in a team. An agent's self-knowledge should be adjustable under the influence of interactions, experiences and feedback.

Knowledge about other(s): agents should also be able to construct models of other agents (challenge 5), a theory of mind. The agent should have the metacognitive ability to attribute mental capacities and states to others [45], such as their assumed motivations, beliefs, values, goals, and aspects of personality. Furthermore, this theory-of-mind model should also include information about how the other agent thinks about its partner (i.e., "what could the other agent know about my knowledge, beliefs, values, and emotions?"). Another challenge is that an agent should be able to retrieve and connect information from the different sources of knowledge (challenge 1) so that the agent can detect and understand interdependencies within the team (challenge 2). This allows the agent to infer, for example, that a team agent may be too fatigued to carry out its task, and to offer assistance to that agent.
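To make these three knowledge types concrete, here is a minimal sketch of how they might be represented in code. It is an illustration under our own assumptions: the names (TaskRule, AgentProfile, MentalModel, too_fatigued) are hypothetical and do not correspond to any existing ePartner implementation.

```python
# Illustrative sketch only: hypothetical names, not part of any ePartner implementation.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaskRule:
    """A learned regularity: under these conditions, this action tends to yield this outcome."""
    conditions: List[str]          # e.g., ["stop sign visible"]
    action: str                    # e.g., "brake"
    outcome: str                   # e.g., "safe crossing of the intersection"
    explicit: bool = True          # tacit knowledge would be hard to articulate and communicate


@dataclass
class AgentProfile:
    """Knowledge about one agent: oneself (self-model) or a partner (theory-of-mind model)."""
    name: str
    goals: List[str] = field(default_factory=list)
    capabilities: List[str] = field(default_factory=list)
    resources: Dict[str, float] = field(default_factory=dict)   # e.g., {"energy": 0.4}
    emotions: Dict[str, float] = field(default_factory=dict)    # e.g., {"stress": 0.7}


@dataclass
class MentalModel:
    """The three integrated knowledge types of a hybrid team agent (challenges 3, 4, 5)."""
    task_model: List[TaskRule] = field(default_factory=list)
    self_model: Optional[AgentProfile] = None
    others: Dict[str, AgentProfile] = field(default_factory=dict)  # theory-of-mind per partner

    def too_fatigued(self, partner: str, threshold: float = 0.2) -> bool:
        # Connecting sources of knowledge: infer that a partner may need assistance.
        profile = self.others.get(partner)
        return bool(profile) and profile.resources.get("energy", 1.0) < threshold
```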

3.2 Representational Challenges for Mental Models

As argued in Sect. 3.1, agents should be able to develop a mental model consisting of different types of information, such as observations or factual information about the task environment (e.g., whether something or someone is present or not), known or perceived relationships between events and actions (e.g., "if I see fire, and I press this button, then an alarm will sound"), and assumptions (e.g., "if my partner is very busy, then he is more likely to ignore my request"). In a task domain, there often exist many relationships between different types of information. An agent's mental model should be able to represent all of these, and the representations should allow the agent to make connective associations between them. The sections below discuss the challenges associated with this requirement.

Hybrid AI
The literature reports a variety of models that can represent an individual's performance and psychological states, such as emotion, trust, stress, memory, and theory of mind (see [31] for an overview). A symbolic approach, for example, is based on knowledge and rules and works best on well-defined problems. An advantage of symbolic models is that they are understandable to people. Another approach is to represent knowledge as a network of nodes and associations (e.g., [50]). This data-driven, sub-symbolic approach to modeling is suited for ill-defined problem environments. However, a disadvantage is that knowledge is distributed throughout the network, and is therefore non-transparent for humans. It has been advocated to combine both approaches, for example as shown in Fig. 3. This is called hybrid AI [1,32,41,56]. Interestingly, human thinking is also considered to be the result of a combination of implicit intuitive knowledge (cf. sub-symbolic) and explicit, conscious reasoning (cf. symbolic) [22]. The nature of human information processing has recently been aptly described by Harari: "[..] the mind is a flow of subjective experiences [..] made of interlinked sensations, emotions and thoughts [..]. When reflecting on it, we often try to sort the experiences into distinct categories such as sensations, emotions and thoughts [..]" ([15], p. 123). For humans and intelligent machines to jointly learn and perform a task, both should develop and maintain a common vocabulary of concepts and relations (challenge 1) in order to reason and communicate with each other (challenge 6) about the task and environment (challenge 3), their own perspective (challenge 4), and the perspective of the other (challenge 5). This means, for example, that the ePartner of a machine agent should be able to translate implicit sub-symbolic knowledge, acquired through associations, into symbolic concepts (see Fig. 3). Only then can this ePartner agent communicate with other agents about it.

Perceptual, Cognitive, and Social Components
Machine agents and ePartner agents should be able to represent knowledge obtained from sensory experiences by building associative networks of perceptual inputs (challenge 3). This would allow the agent to, for example, perform image classification and object recognition.
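A minimal sketch of such a sub-symbolic-to-symbolic translation step is shown below. The classifier, labels, and shared vocabulary are hypothetical stand-ins chosen for illustration; this is not the architecture of Fig. 3, only an example of turning learned scores into symbolic, communicable statements.

```python
# Minimal sketch, not the authors' implementation: classifier, labels and vocabulary are invented.
from typing import Callable, Dict, List, Tuple

SHARED_VOCABULARY = {"person", "vehicle", "fire"}  # symbolic concepts agreed by the team (challenge 1)


def to_symbolic(classify: Callable[[bytes], List[Tuple[str, float]]],
                image: bytes, threshold: float = 0.8) -> List[Dict]:
    """Turn sub-symbolic classifier scores into communicable symbolic statements (challenge 6)."""
    statements = []
    for label, confidence in classify(image):
        if label in SHARED_VOCABULARY and confidence >= threshold:
            statements.append({"predicate": "detected", "object": label, "confidence": confidence})
    return statements


# Usage with a dummy classifier standing in for a trained network:
def dummy_classifier(image: bytes) -> List[Tuple[str, float]]:
    return [("person", 0.93), ("cat", 0.91), ("vehicle", 0.42)]

print(to_symbolic(dummy_classifier, b""))
# -> [{'predicate': 'detected', 'object': 'person', 'confidence': 0.93}]
```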


Fig. 3. A human-machine combination consisting of a human agent with its ePartner agent, and an intelligent robot (e.g., a UGV), also with its ePartner agent. All agents build and maintain their own mental models that contain learned regularities (acquired through, e.g., machine learning), as well as symbolic knowledge (e.g., in terms of BDI). To enable communication, ePartner agents should be able to translate sub-symbolic knowledge into symbolic terms. This symbolic model functions as a shared vocabulary for the agents.

In the human-AI co-learning concept, agents have different capabilities and they communicate in order to exploit their complementary strengths to the benefit of the team. Agents should therefore be able to show and share their status and intentions with their partners (challenge 6), demanding a mental model that allows them to express and explain their beliefs, goals, intentions, and actions in terms that are adequate for human understanding and appreciation (challenge 5) [41]. If necessary, the mental model can be expanded with computational models of emotion [23], enabling agents to take affective states into account when deciding which goals to pursue and which actions to perform. Some of the socially adaptive behavior of agents can be streamlined in advance by setting and agreeing upon work agreements. However, in order for agents to know when and how to adapt to changes, they should be able to continuously collect and update information about the individual team member(s) and the context (challenge 2). The agent's mental model should therefore have slots for social information, strategies for obtaining information from the task context to fill and refine the values of these variables, and algorithms to make social inferences from the data in the model (challenge 1).
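As an illustration of such social-information slots, the sketch below shows one possible representation; the slot names, cues, and thresholds are invented for this example, and a real ePartner would use richer models and learned inference.

```python
# A sketch under assumptions: slot names and thresholds are invented for illustration.
from typing import Dict


class SocialSlots:
    """Mental-model slots for social information, refined from observations (challenges 1 and 2)."""

    def __init__(self) -> None:
        self.slots: Dict[str, Dict[str, float]] = {}   # partner -> {"workload": .., "trust": ..}

    def update(self, partner: str, cue: str, value: float) -> None:
        # Strategy for filling and refining slot values from cues observed in the task context.
        self.slots.setdefault(partner, {})[cue] = value

    def should_offer_help(self, partner: str, workload_limit: float = 0.8) -> bool:
        # A simple social inference from the data in the model.
        return self.slots.get(partner, {}).get("workload", 0.0) > workload_limit


slots = SocialSlots()
slots.update("human_pilot", "workload", 0.9)
assert slots.should_offer_help("human_pilot")
```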

3.3 Functional Challenges for Mental Models

Constructing mental models for agents that allow them to represent perceptual, cognitive-affective, and social knowledge is only part of the challenge.


Agents should also have the capabilities to dynamically update and refine their models. This can be achieved in various ways, such as internal consistency checks, deduction, induction, reasoning, and validation. The sections below discuss the challenges associated with establishing these functions.

Dynamic Mental Models
Human-AI co-learning demands that mental models be dynamic, because human and machine agents will generally not have a mutual understanding right from the start. Instead, understanding develops over time, from experiences and interactions during training and operations. Of course, there may be some prior experience in the form of memories, 'lessons learned', and assumptions in the human agent, as well as computational task models in the machine agent. Furthermore, the human may have provided the machine agent with personal data to make themselves better known. But a deeper understanding and mutual awareness develops through prolonged collaboration, interaction, shared experiences, and feedback from the environment (see also Sect. 2.2). To facilitate these processes, the human should be instructed and trained to understand an AI agent, and the AI agent should have the functionalities to develop an understanding of its human teammate (challenge 5).

Mental Models That Support Observability, Predictability, Explainability, and Directability
Humans are cognitively wired to automatically infer mental states from subtle behavioral cues expressed by other agents [17,34]. To enhance its observability [19] for a human partner, a machine agent should be able to express such information about its 'mental' state in a way that is easy to comprehend for its human partner (challenge 6). In addition, a machine agent should be able to infer its human partner's mental state from their behavior (challenge 5). Agents should also be able to use their theory-of-mind model to make predictions about the behavior of team partners. Comparisons with observed behavior should be used to validate the model, and to make adjustments if necessary. Members of an effective human-human team try to detect and solve discrepancies in their mental models. They discover misunderstandings, diagnose the cause, and provide corrective explanations that get the team back on track [12,37]. Likewise, machine agents too need to be explainable. They need to be able to generate explanations that shed light on the underlying causes of their actions and that are attuned to the characteristics of the receiving agent (challenge 6) [37]. Agents may form explanations reactively, in response to a request by a partner, but also pro-actively, when the agent anticipates that a partner may not understand its (choice of) behavior. Directability refers to the ability of agents to take over and delegate tasks, both reactively and pro-actively. The agent should be able to consult its model of the team (challenge 2), also taking into consideration the level of interpersonal trust (see Sect. 2.2).
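The prediction-validation cycle described above could, in its simplest form, look like the following sketch. The belief-update rule and learning rate are placeholders chosen for illustration, not a mechanism proposed by the authors.

```python
# Hypothetical sketch of theory-of-mind validation; the update rule is a placeholder.
from collections import defaultdict
from typing import Dict


class TheoryOfMind:
    """Predicts a partner's next action and adjusts its beliefs from observed behavior."""

    def __init__(self) -> None:
        # situation -> action -> belief strength
        self.beliefs: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))

    def predict(self, situation: str) -> str:
        actions = self.beliefs[situation]
        return max(actions, key=actions.get) if actions else "unknown"

    def observe(self, situation: str, observed_action: str, lr: float = 0.1) -> bool:
        """Compare the prediction with the observation; strengthen or weaken the model accordingly."""
        predicted = self.predict(situation)
        self.beliefs[situation][observed_action] += lr
        if predicted not in ("unknown", observed_action):
            self.beliefs[situation][predicted] -= lr
            return False   # mismatch: a candidate trigger for requesting or giving an explanation
        return True
```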

4 Addressing the Challenges

In a successful hybrid team, humans and machines collaborate in an adaptive, dynamic, and personalized fashion. This requires that machine agents, just like humans, have mental models that contain information on the task context, their own role, and the role of others. Furthermore, human and machine agents should be able to express and clarify their mental states in a way that is easy for their partner to comprehend and that allows them to act in a coordinated and adaptive manner. In this paper, we have proposed six challenges for achieving successful human-AI partnership in hybrid teams. These challenges should be addressed and tested in research. A good start would be, for example, to investigate how the learning of an individual agent can be organized and supported, studying how this learning affects the performance of others, first at the unit level and subsequently at the team level. Research questions would concern the construction, maintenance, and use of mental models (e.g., what information should an agent disclose to elicit adaptive responses from its partner or partners, and what are the effects of different explanations by others on an agent's learning?).

In order to address the challenges, a suitable research simulation environment is needed. This needs to involve a task that is representative of a real-world environment in which humans and intelligent technology jointly work together. Yet the research task should allow simple and unambiguous manipulation and control of the demands on human-AI cooperation, and should allow the measurement of learning. A suitable research environment should meet the following requirements: (1) control over what information is available: if some information about the task is unknown or uncertain to some agents, this requires them to communicate and to generate explanations that facilitate mutual understanding; (2) the opportunity to create hard interdependencies [19], compelling agents to cooperate because each has unique capabilities; (3) control over the resources needed to carry out the task (e.g., imposing time limits); and (4) the opportunity to make task goals achievable in several ways, requiring agents of a unit or team to search for common ground on strategy and to explore the division of roles and tasks that results in good collective performance. Earlier studies on human-AI collaboration have used research environments that meet these requirements, such as Blocks World for Teams [20] and Hanabi [2]. We intend to use such environments to conduct experimental research into human-AI learning. As a first study we aim to investigate how a human and a machine agent can evaluate their joint task performance in terms of lessons for the future (i.e., how to re-assign tasks to improve overall performance). Controlled studies in the lab are needed to design, implement and evaluate the principles for successful human-AI co-learning. In addition, these principles should be tested further in practical field settings (e.g., similar to the experiments by De Greeff et al. [13] and Looije et al. [33]). For example, trust has been shown to be an important aspect of human-AI cooperation in real life. The co-learning of humans and agents over a prolonged period of time may not only benefit performance, but also trust calibration [58].


Given the developments in society, the future will unequivocally demand that humans and intelligent systems work together. This paper addresses the challenges that hybrid human-AI teams face in acquiring the strengths of human-human teams while exploiting the unique benefits of intelligent technology.

Acknowledgments. This study has been funded by the Netherlands Ministry of Defence, under program V1801.

References 1. Bader, S., Hitzler, P.: Dimensions of neural-symbolic integration-a structured survey (2005). arXiv preprint cs/0511042 2. Bard, N., et al.: The Hanabi challenge: a new frontier for AI research (2019). arXiv preprint: arXiv:1902.00506 3. Bergstein, B.: AI isn’t very smart yet. But we need to get moving to make sure automation works for more people (2017). https://www.technologyreview.com/s/ 609318/the-great-ai-paradox/ 4. van den Bosch, K., Bronkhorst, A.: Human-AI cooperation to benefit military decision making. In: Proceedings of the NATO IST-160 Specialist’ Meeting on Big Data and Artificial Intelligence for Military Decision Making, Bordeaux, France, 30 May–1 June 2018, S3-1/1-S3-1/12 (2018) 5. Bosse, T., Breebaart, L., Diggelen, J.V., Neerincx, M.A., Rosa, J., Smets, N.J.: Developing epartners for human-robot teams in space based on ontologies and formal abstraction hierarchies. Int. J. Agent-Oriented Softw. Eng. 5(4), 366–398 (2017) 6. Bron, R., Endedijk, M.D., van Veelen, R., Veldkamp, B.P.: The joint influence of intra-and inter-team learning processes on team performance: a constructive or destructive combination? Vocations and learning, pp. 1–26 (2018) 7. Brooks, R.: The Seven Deadly Sins of AI Predictions (2017). https://www. technologyreview.com/s/609048/the-seven-deadly-sins-of-ai-predictions/ 8. Converse, S., Cannon-Bowers, J., Salas, E.: Shared mental models in expert team decision making. In: Individual and Group Decision Making: Current Issues Issues, p. 221 (1993) 9. Costa, A.C., Fulmer, C.A., Anderson, N.R.: Trust in work teams: an integrative review, multilevel model, and future directions. J. Organ. Behav. 39(2), 169–184 (2018) 10. Evans, J.S.B.: Dual-processing accounts of reasoning, judgment, and social cognition. Annu. Rev. Psychol. 59, 255–278 (2008) 11. Fryer, L.K., Nakao, K., Thompson, A.: Chatbot learning partners: connecting learning experiences, interest and competence. Comput. Hum. Behav. 93, 279– 289 (2019) 12. de Graaf, M., Malle, B.F.: How people explain action (and autonomous intelligent systems should too). In: AAAI Fall Symposium on Artificial Intelligence for Human-Robot Interaction (2017) 13. de Greeff, J., Hindriks, K., Neerincx, M.A., Kruijff-Korbayova, I.: Human-robot teamwork in USAR environments: the TRADR project. In: Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts, pp. 151–152. ACM (2015)


14. de Greeff, J., Mioch, T., van Vught, W., Hindriks, K., Neerincx, M.A., KruijffKorbayov´ a, I.: Persistent robot-assisted disaster response. In: Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pp. 99– 100. ACM (2018) 15. Harari, Y.N.: Homo Deus: A Brief History of Tomorrow. Random House (2016) 16. Harbers, M., Neerincx, M.A.: Value sensitive design of a virtual assistant for workload harmonization in teams. Cogn. Technol. Work 19(2–3), 329–343 (2017) 17. Heider, F.: The Psychology of Interpersonal Relations. Psychology Press, New York (1958) 18. Janis, I.L.: Groupthink. IEEE Eng. Manag. Rev. 36(1), 36 (2008) 19. Johnson, M., et al.: Coactive design: designing support for interdependence in joint activity. J. Hum. Robot Interact. 3(1), 43–69 (2014) 20. Johnson, M., Jonker, C., van Riemsdijk, B., Feltovich, P.J., Bradshaw, J.M.: Joint activity testbed: blocks world for teams (BW4T). In: Aldewereld, H., Dignum, V., Picard, G. (eds.) ESAW 2009. LNCS (LNAI), vol. 5881, pp. 254–256. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-10203-5 26 21. Johnson-Laird, P.N.: Mental models in cognitive science. Cogn. Sci. 4(1), 71–115 (1980) 22. Kahneman, D., Egan, P.: Thinking, Fast and Slow, vol. 1. Farrar, Straus and Giroux, New York (2011) 23. Kaptein, F., Broekens, J., Hindriks, K.V., Neerincx, M.: CAAF: a cognitive affective agent programming framework. In: Traum, D., Swartout, W., Khooshabeh, P., Kopp, S., Scherer, S., Leuski, A. (eds.) IVA 2016. LNCS (LNAI), vol. 10011, pp. 317–330. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47665-0 28 24. Kenny, P., et al.: Building interactive virtual humans for training environments. In: Proceedings of I/ITSEC, vol. 174, pp. 911–916 (2007) 25. Kim, Y., Baylor, A.L.: based design of pedagogical agent roles: a review, progress, and recommendations. Int. J. Artif. Intell. Educ. 26(1), 160–169 (2016) 26. Klein, G., Woods, D.D., Bradshaw, J.M., Hoffman, R.R., Feltovich, P.J.: Ten challenges for making automation a “team player” in joint human-agent activity. IEEE Intell. Syst. 19(6), 91–95 (2004) 27. Knight, W.: More evidence that humans and machines are better when they team up - MIT Technology Review.pdf (2017). https://www.technologyreview.com/s/ 609331/more-evidence-that-humans-and-machines-are-better-when-they-teamup/ 28. Kruijff-Korbayov´ a, I., et al.: TRADR project: long-term human-robot teaming for robot assisted disaster response. KI-K¨ unstliche Intell. 29(2), 193–201 (2015) 29. Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and think like people. Behav. Brain Sci. 40, e253 (2017) 30. Lemaignan, S., Warnier, M., Sisbot, E.A., Clodic, A., Alami, R.: Artificial cognition for social human-robot interaction: an implementation. Artif. Intell. 247, 45–69 (2017) 31. Lin, J., Spraragen, M., Zyda, M.: Computational models of emotion and cognition. In: Advances in Cognitive Systems. Citeseer (2012) 32. Liszka-Hackzell, J.J.: Prediction of blood glucose levels in diabetic patients using a hybrid AI technique. Comput. Biomed. Res. 32(2), 132–144 (1999) 33. Looije, R., Neerincx, M.A., Cnossen, F.: Persuasive robotic assistant for health self-management of older adults: design and evaluation of social behaviors. Int. J. Hum. Comput. Stud. 68(6), 386–397 (2010) 34. Malle, B.F.: How the Mind Explains Behavior. Folk Explanation, Meaning and Social Interaction. MIT Press, Cambridge (2004)


35. Mathieu, J.E., Heffner, T.S., Goodwin, G.F., Salas, E., Cannon-Bowers, J.A.: The influence of shared mental models on team process and performance. J. Appl. Psychol. 85(2), 273 (2000) 36. Mayer, R.C., Davis, J.H., Schoorman, F.D.: An integrative model of organizational trust. Acad. Manag. Rev. 20(3), 709–734 (1995) 37. Miller, T.: Explanation in artificial intelligence: insights from the social sciences. In: Artificial Intelligence (2018) 38. Mioch, T., Peeters, M.M., Nccrincx, M.A.: Improving adaptive human-robot cooperation through work agreements. In: 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 1105–1110. IEEE (2018) 39. Morrison, J.E., Meliza, L.L.: Foundations of the after action review process. Technical report, Institute for Defense Analyses, Alexandria, VA (1999) 40. Neerincx, M., et al.: The mission execution crew assistant: improving humanmachine team resilience for long duration missions. In: Proceedings of the 59th International Astronautical Congress (IAC 2008) (2008) 41. Neerincx, M.A., van der Waa, J., Kaptein, F., van Diggelen, J.: Using perceptual and cognitive explanations for enhanced human-agent team performance. In: Harris, D. (ed.) EPCE 2018. LNCS (LNAI), vol. 10906, pp. 204–214. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91122-9 18 42. Nikolaidis, S., Hsu, D., Srinivasa, S.: Human-robot mutual adaptation in collaborative tasks: models and experiments. Int. J. Robot. Res. 36(5–7), 618–634 (2017) 43. Parasuraman, R., Barnes, M., Cosenzo, K., Mulgund, S.: Adaptive automation for human-robot teaming in future command and control systems. Technical report, Army Research Lab Aberdeen proving ground MD Human Research and Engineering Directorate (2007) 44. Patterson, R.E., Pierce, B.J., Bell, H.H., Klein, G.: Implicit learning, tacit knowledge, expertise development, and naturalistic decision making. J. Cogn. Eng. Decis. Mak. 4(4), 289–303 (2010) 45. Premack, D., Woodruff, G.: Does the Chimpanzee have a theory of mind? Behav. Brain Sci. 1(4), 515–526 (1978) 46. Reber, A.S.: Implicit learning and tacit knowledge. J. Exp. Psychol. Gen. 118(3), 219 (1989) 47. Salas, E.: Team Training Essentials: A Research-Based Guide. Routledge, London (2015) 48. Salas, E., Reyes, D.L., McDaniel, S.H.: The science of teamwork: progress, reflections, and the road ahead. Am. Psychol. 73(4), 593 (2018) 49. Siegel, A.W., Schraagen, J.M.: Team reflection makes resilience-related knowledge explicit through collaborative sensemaking: observation study at a rail post. Cogn. Technol. Work 19(1), 127–142 (2017) 50. Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354 (2017) 51. Stout, R.J., Salas, E., Carson, R.: Individual task proficiency and team process behavior: what’s important for team functioning? Mil. Psychol. 6(3), 177–192 (1994) 52. Stout, R.J., Cannon-Bowers, J.A., Salas, E.: The role of shared mental models in developing team situational awareness: implications for training. In: Situational Awareness, pp. 287–318. Routledge (2017)


53. Teo, G., Wohleber, R., Lin, J., Reinerman-Jones, L.: The relevance of theory to human-robot teaming research and development. In: Savage-Knepshield, P., Chen, J. (eds.) Advances in Human Factors in Robots and Unmanned Systems. AISC, vol. 499, pp. 175–185. Springer, Cham (2017). https://doi.org/10.1007/978-3-31941959-6 15 54. Tielman, M.L., Neerincx, M.A., Bidarra, R., Kybartas, B., Brinkman, W.P.: A therapy system for post-traumatic stress disorder using a virtual agent and virtual storytelling to reconstruct traumatic memories. J. Med. Syst. 41(8), 125 (2017) 55. Tielman, M.L., Neerincx, M.A., van Meggelen, M., Franken, I., Brinkman, W.P.: How should a virtual agent present psychoeducation? Influence of verbal and textual presentation on adherence. Technol. Health Care 25, 1–16 (2017). Preprint 56. Tsaih, R., Hsu, Y., Lai, C.C.: Forecasting s&p 500 stock index futures with a hybrid ai system. Decis. Support Syst. 23(2), 161–174 (1998) 57. Van Diggelen, J., Neerincx, M., Peeters, M., Schraagen, J.M.: Developing effective and resilient human-agent teamwork using team design patterns. IEEE Intell. Syst. 34(2), 15–24 (2018) 58. de Visser, E.J., et al.: Longitudinal trust development in human-robot teams: models, methods and a research agenda. IEEE Trans. Hum. Mach. Syst., 1–20 (2018) 59. Werkhoven, P., Kester, L., Neerincx, M.: Telling autonomous systems what to do. In: Proceedings of the 36th European Conference on Cognitive Ergonomics, p. 2. ACM (2018) 60. Wiltshire, T.J., Fiore, S.M.: Social cognitive and affective neuroscience in humanmachine systems: a roadmap for improving training, human-robot interaction, and team performance. IEEE Trans. Hum. Mach. Syst. 44(6), 779–787 (2014) 61. Wooldridge, M., Jennings, N.R.: Agent theories, architectures, and languages: a survey. In: Wooldridge, M.J., Jennings, N.R. (eds.) ATAL 1994. LNCS, vol. 890, pp. 1–39. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-58855-8 1 62. Xiao, Z., et al.: A deep learning-based segmentation method for brain tumor in MR images. In: 2016 IEEE 6th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pp. 1–6. IEEE (2016)

Conversational Tutors

Authoring Conversational Intelligent Tutoring Systems

Zhiqiang Cai1, Xiangen Hu1,2, and Arthur C. Graesser1

1 University of Memphis, Memphis, TN 38152, USA
{zcai,xhu,graesser}@memphis.edu
2 Central China Normal University, Wuhan 430079, Hubei, China

Abstract. Conversational Intelligent Tutoring Systems (ITSs) are expensive to develop. While simple online courseware can easily be authored by teachers, the authoring of conversational ITSs usually involves a team of experts with different expertise, including domain experts, linguists, instructional designers, programmers, artists, computer scientists, etc. Reducing the authoring cost has long been a problem in the use of ITSs. Using AutoTutor as an example, this paper discusses the authoring process of conversational ITSs and possible ways of automating parts of that process.

Keywords: Conversational ITSs · AutoTutor · Authoring tools · Question generation · Short answer grading

1 Introduction

The first educational computer system, PLATO (Programmed Logic for Automatic Teaching Operation), was developed in 1960 [15]. In 1968, Bitzer and Skaperdas [2] claimed that large-scale CBE (Computer-Based Education) systems could be built. As personal computers and the internet arrived, large-scale courseware could be developed more and more easily. The best example is probably MOOCs (Massive Open Online Courses). In 2017, 81 million students were registered in 9.4 thousand MOOCs at over 800 universities (https://www.class-central.com/report/mooc-stats-2017/). The fast growth of MOOCs is due to cheap and easy authoring, even though their effectiveness is debatable [20, 27]. In contrast, although ITSs have been proven to be as effective as human tutors [23, 24], they have never been as widely used as MOOCs. The reason behind this is simple: building ITSs is expensive.

Consider the typical authoring process of a MOOC lesson. A teacher had a PowerPoint document for a lesson. The teacher presented the slides in a classroom and had the lecture video recorded. The teacher then uploaded the video to a MOOC platform. Then the teacher announced that a MOOC course was up and students from all over the world could join the class online. Subsequent videos of later lectures were then continuously uploaded to the platform to complete the course. The teacher might do a little more, such as uploading assignments, grading student homework, and answering student questions online. For a teacher who is already giving a lecture to an offline class, such additional "authoring" processes are almost of zero cost. It is basically copying an offline class to an online platform.


It is easy to convince schools to use MOOCs because they can hugely expand class size with little investment. However, the major disadvantage of offline classroom teaching remains: classroom teaching lacks adaptation to individual students' needs.

Tutoring is an individualized learning process that cannot be pre-recorded. The interactions between a tutor and a student are generated on the fly. ITSs are created to simulate human tutors to help individual students, especially struggling students, improve their learning. Unlike common computer-based courseware that focuses on delivering knowledge, ITSs adaptively help individual students learn. An ITS usually contains a domain model that represents domain knowledge through specific problems, a student model that tracks each student's learning history, a pedagogical model that utilizes successful pedagogical strategies, and an interface model that provides easy access to learning content, learning history and rich interactions [10]. While common courseware systems allow students to select their own learning pace, ITSs go further: ITSs organize domain problems in a way that allows each individual student to follow an optimal learning path. A student model keeps track of the complete learning history of each student. The learning data is then used by the pedagogical model to select a learning path and provide adaptive feedback at each step.

Authoring the content of an ITS is usually a complicated process involving a team of experts with different expertise, including domain experts, language experts, computer scientists, instructional designers, interface designers, and more. This paper describes the AutoTutor authoring process as an example of authoring conversational ITSs and explores possible ways to automate parts of this process, thereby reducing the cost of ITS authoring and increasing the performance of the systems.

2 AutoTutor Systems

2.1 AutoTutor Conversation

AutoTutor is a system created by Arthur Graesser and his team in the 1990s [13]. Since then, many AutoTutor applications have been successfully developed, including a Computer Literacy tutoring system [19, 25], a Newtonian Physics tutoring system [7], OperationARIES [16], ElectronixTutor [10], and many others. AutoTutor helps students learn by holding a conversation in natural language. AutoTutor is known for its Expectation-Misconception Tailored (EMT) conversation style [11]. An EMT conversation starts with a main question or a problem. The goal of the conversation is to help the student construct a complete answer to the main question or a solution to the main problem. In each conversation turn, AutoTutor evaluates the student's input and finds the missing parts of the answer. AutoTutor then asks a hint or a prompt question targeting one of the missing parts. A student constructs a complete answer or solution by answering various hint or prompt questions. In AutoTutor, a hint question is a question that requires an answer of a sentence or a clause, while a prompt question requires an answer of a word or a phrase. When a misconception is identified, the misconception is corrected before a hint or prompt question is asked.


Figure 1 shows the flow of an EMT conversation. The conversation starts from the "Question". If the student's answer is complete, the conversation ends by presenting an ideal "Answer". If the student did not answer the question at all, the system goes into a "Pump" step, at which the tutor encourages the student to construct an answer. When a partial answer is received, the system enters an "Expectation/Misconception cycle". In this cycle, the tutor uses hints and prompts to help the student construct answers that cover all expectations. When an expectation is covered, an "Assertion" is given; an "Assertion" is a statement about the expectation. After the student answers a prompt question, the tutor always gives the correct prompt completion answer ("PComp"). Misconceptions are identified during the hint or prompt answering process. When a misconception is identified, the tutor may immediately correct it and continue the hint/prompt process.

Fig. 1. Flowchart of AutoTutor EMT conversation.
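To illustrate the control flow in Fig. 1, the sketch below implements a highly simplified EMT turn. The semantic_match function is a word-overlap placeholder for AutoTutor's actual semantic evaluation (AutoTutor has used, e.g., latent semantic analysis), and the script dictionary layout is our own assumption, not the ASAT script format.

```python
# A minimal sketch of the EMT loop in Fig. 1; not the actual AutoTutor engine.
def semantic_match(student_text: str, expectation: str) -> float:
    """Placeholder scorer: fraction of expectation words present in the student's text."""
    words = set(expectation.lower().split())
    return len(words & set(student_text.lower().split())) / max(len(words), 1)


def emt_turn(student_input: str, script: dict, coverage: dict, threshold: float = 0.7) -> str:
    """Evaluate the input, update expectation coverage, and pick the next tutor move."""
    # Correct an identified misconception first.
    for misconception, correction in script["misconceptions"].items():
        if semantic_match(student_input, misconception) >= threshold:
            return correction
    # Update how well each expectation is covered so far.
    for i, expectation in enumerate(script["expectations"]):
        coverage[i] = max(coverage.get(i, 0.0), semantic_match(student_input, expectation))
    uncovered = [i for i, _ in enumerate(script["expectations"]) if coverage.get(i, 0.0) < threshold]
    if not uncovered:
        return script["ideal_answer"]          # all expectations covered: present the ideal answer
    if not student_input.strip():
        return script["pump"]                  # no answer at all: pump for more
    return script["hints"][uncovered[0]][0]    # otherwise, hint toward the first missing part
```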

The EMT conversation can be used in single-agent systems as well as multi-agent systems. When two agents are used, the conversation between a human student and two computer agents is called a "trialog" [8]. In a trialog conversation, one computer agent usually plays the role of a tutor and the other agent plays the role of a peer student. With the help of the peer student, different conversation modes can be used. In the last decade, the AutoTutor team has successfully developed four trialog modes: vicarious learning mode, tutoring mode, teachable agent mode and competition mode.

In the vicarious learning mode, the conversation is mostly between the tutor agent and the peer student agent. To keep the human student engaged, the human student is occasionally asked to answer some simple definitional or verification questions during the conversation. This mode is used when the human student has very little knowledge about the topic under discussion. "Vicarious learning" refers to a learning strategy by which students learn by observing. This strategy has been proven effective [6, 9].

The tutoring mode conversation is mainly between the tutor agent and the human student. The peer student agent plays the role of a sidekick. The use of the peer student in the tutoring mode is especially helpful when the answer from the human student is likely wrong.


So far there is no natural language processing algorithm that can identify a wrong answer with 100% accuracy, and giving negative feedback to a correct answer because of a wrong judgement would be awkward. To avoid this, AutoTutor uses a safe conversation trick. When the system judges that the human student's answer is very likely wrong, the peer student chimes in and articulates a surely wrong answer that is semantically close to the one from the human student. Then the tutor agent critiques the peer student agent without taking the risk of giving awkward negative feedback to a correct answer. The tutoring mode is used when the human student has average knowledge about the topic.

When the human student has high knowledge about the topic, the teachable agent mode can be used. In this mode, the conversation is mainly between the peer student and the human student. The peer student seeks help from the human student. The human student's responsibility is to teach the peer student until the peer student fully understands the topic. The tutor agent helps when the human student has problems teaching. This provides the human student an opportunity to learn by teaching, which is also an effective learning strategy [14].

The competition mode is often used in a game-like situation. The peer student and the human student compete in finding the answer to the main question or the solution to the main problem. This mode has been proven to be helpful in engaging human students [16].

EMT conversations are useful for constructing complex answers and solutions that involve deep understanding of the knowledge. To author an engaging lesson, other, simpler types of conversations can be used as well. For example, a greeting and introduction conversation could be used at the beginning of a lesson, a transition conversation between two main questions, an instructional conversation for a short quiz, a closing conversation at the end of a lesson, etc.
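The mode selection and the safe conversation trick can be caricatured as follows. The thresholds, probabilities, and wording are invented for illustration and do not reflect AutoTutor's actual decision logic.

```python
# Illustrative only: thresholds and feedback wording are invented for this sketch.
def choose_trialog_mode(prior_knowledge: float) -> str:
    """Pick a trialog mode from an estimate of the student's prior knowledge (0..1)."""
    if prior_knowledge < 0.3:
        return "vicarious"        # student mostly observes the two agents
    if prior_knowledge < 0.7:
        return "tutoring"         # tutor agent leads, peer agent acts as a sidekick
    return "teachable_agent"      # student teaches the peer agent


def safe_feedback(answer_correct_prob: float, peer_wrong_answer: str) -> str:
    """The 'safe conversation trick': if the student is probably wrong, critique the peer instead."""
    if answer_correct_prob < 0.4:
        return ('Peer: "' + peer_wrong_answer + '" '
                'Tutor: "No, that is not quite right. Let us look at this again."')
    return 'Tutor: "Good, you are on the right track."'
```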

2.2 AutoTutor Tutoring Interface

A typical AutoTutor interface contains five major components: a main question area, an agent area, a media area, a conversation history area and an input box (see Fig. 2). The main question area displays the question under discussion. The main question usually stays there from the beginning to the end of a lesson. The agent area displays one or two animated talking heads. The talking heads deliver information by speech. The interactive media area displays texts, pictures, videos, and interactive controls, such as clickable buttons, dropdown menu items, drag-and-drop objects, text areas that can be highlighted, etc. The content displayed in the media area changes as the conversation proceeds. A student can interact with the controls in the media area and trigger events that may change how the conversation proceeds. The conversation history area displays the conversation between the agents and the human student. The student input box allows a student to enter utterances. Other components can be added; for example, a microphone component can enable voice input and a camera component can allow video input. Examples of such AutoTutor systems can be found at autotutor.org.


Fig. 2. AutoTutor Interface with five major components.

3 Authoring a Webpage-Based AutoTutor Lesson

Developing an AutoTutor application is a complicated process. The development tasks usually include creating conversational agents, developing interactive web pages, authoring conversation units and developing tutoring rules. None of these tasks can easily be done by a domain expert without expertise in web page development, conversation script authoring and intelligent tutoring. However, well-designed authoring tools can make it possible for domain experts to put in most of the content and thus minimize the involvement of other developers. Pre-created computer agents, webpage templates and conversation unit templates provided by AutoTutor authoring tools allow domain experts to create agents, webpages and conversation units by replacing the texts, images and videos in the templates. During the last decade, several ASATs (AutoTutor Script Authoring Tools) have been developed [3, 4]. Yet, improvements are still needed to make the tools easier to use.

3.1 Creating Agents

Creating agents is a process at the application level. Once agents are created, they are used across all lessons of the application. An AutoTutor agent is usually an animated talking head that can deliver speech and simple gestures. There are online character web services that can easily be integrated into AutoTutor systems. In addition to a name that matches the appearance and the voice of an agent, AutoTutor agents often store a list of commonly used expressions, such as greeting, agreeing, disagreeing, positive feedback, negative feedback, etc. Such expressions can be created to reflect an agent's "personality". For example, a negative feedback from a friendly agent could be "I don't think that is right", while a straightforward agent may say "That is terribly wrong!" While more agents could be used, AutoTutor often uses one agent for dialog-based systems and two agents for trialog-based systems.


To communicate with the AutoTutor interface, an agent component needs two basic functions. One is a Speak function that receives a textual speech and delivers it with voice and gestures. The textual speech may contain Speech Synthesis Markup Language (SSML) tags to indicate gestures or special speech effects [22]. The other is a Status Report function that tells the AutoTutor interface whether or not the agent is busy speaking. Optionally, the agent may have an interruption function that allows the interface to pause or stop the speech when needed.
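A sketch of this agent contract is given below. The class and method names are hypothetical and do not correspond to an actual AutoTutor or character-service API; they simply mirror the two basic functions plus the optional interruption function described above.

```python
# Hypothetical sketch of the agent component interface; not an actual AutoTutor API.
from abc import ABC, abstractmethod


class TalkingHeadAgent(ABC):
    """Minimal contract between the AutoTutor interface and an animated agent component."""

    @abstractmethod
    def speak(self, ssml_text: str) -> None:
        """Deliver a (possibly SSML-tagged) utterance with voice and gestures."""

    @abstractmethod
    def is_busy(self) -> bool:
        """Status report: whether the agent is currently speaking."""

    def interrupt(self) -> None:
        """Optional: pause or stop the current speech; the default is a no-op."""
        return None
```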

3.2 Developing Interactive Webpages

An interactive webpage is used to show a part of the problem under discussion or a step toward a solution. A webpage may contain static texts and images. It may also contain interactive elements, such as buttons, pulldown menus, highlightable texts, drag-and-drop objects, etc. Developing highly interactive webpages is often beyond what a domain expert can do. However, this task can be split into two different tasks. One task is for webpage developers to develop high-quality interactive template webpages. The other task is for the domain experts to fill in the content in the template pages. The question is whether we can provide enough templates for domain experts to create satisfactory lessons. The answer is yes and no. We can think of common webpages that are used in learning environments, such as multiple-choice questions, highlightable texts, images with hotspots, and so on. Yet, there could always be some webpage a domain expert needs but cannot find. Therefore, continuous template development and team authoring is probably unavoidable. On the other hand, even if webpage developers are readily available, template-based authoring is always more cost-effective.

An interaction on an AutoTutor webpage usually triggers an event, which can be associated with conversational units. For example, the selection of an answer to a multiple-choice question could trigger a "correct answer" or "incorrect answer" event, which can be associated with corresponding feedback in a conversation. The event could also contribute to the assessment of student performance and affect the tutoring path. For example, the current webpage could be followed by a webpage with harder or easier items according to the performance of the human student.
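One possible way to wire such page events to conversation units is sketched below. The EventRouter class and the event names are invented for illustration and are not part of the AutoTutor tooling.

```python
# Hypothetical sketch of routing template-page events to conversational responses.
from typing import Callable, Dict, List


class EventRouter:
    """Associates interface events (e.g., a multiple-choice selection) with conversation units."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Callable[[dict], str]]] = {}

    def on(self, event_name: str, handler: Callable[[dict], str]) -> None:
        self.handlers.setdefault(event_name, []).append(handler)

    def fire(self, event_name: str, payload: dict) -> List[str]:
        return [handler(payload) for handler in self.handlers.get(event_name, [])]


router = EventRouter()
router.on("correct_answer", lambda p: "Tutor: Well done!")
router.on("incorrect_answer", lambda p: f"Tutor: Not quite. Think about item {p['item']} again.")
# The same events could also update a performance score that selects an easier or harder next page.
print(router.fire("incorrect_answer", {"item": 3}))
```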

3.3 Authoring Conversation Units

Conversation units can be of very different sizes. A small unit could be a feedback with a single utterance. A large one could be an EMT conversation that can selectively deliver tens or even hundreds of utterances. EMT is the typical and most complicated conversation unit; other conversation types can be considered simplified versions of an EMT. In this section we focus on the authoring of an EMT unit.

Main Question. An EMT conversation starts with a main question. The length of the answer to the question determines the possible number of turns of the conversation. For a typical EMT conversation, the main question usually requires an answer of about three to ten sentences. There are several issues to keep in mind when authoring an EMT main question. First, the question should be a deep reasoning question in order to support a deep comprehension conversation.


Question schemas can be found, for example, in Graesser, Rus and Cai [12]. Second, the question should be of reasonable uncertainty. In other words, while there could be different ways to answer the main question, the answers should form a small number of semantically different clusters. A highly uncertain question would be, for example, "What is your favorite work?" There is no way to preset a possible ideal answer for this question because everyone may have a different answer. AutoTutor main questions are usually of low uncertainty. Here is an example: "The sun exerts a gravitational force on the earth as the earth moves in its orbit around the sun. Does the earth pull equally on the sun? Explain why." More detailed discussions of question uncertainty can be found in Cai et al. [3].

Ideal Answer to Main Question. For a question with low uncertainty, a domain expert can usually form a reasonable ideal answer. For example, the ideal answer to the above sun and earth question could be: The force of gravity between earth and sun is an interaction between these two bodies. According to Newton's third law of motion, if one body exerts a force on the other then the other body must exert an equal and opposite force on the first body. Therefore, the sun must experience a force of gravity due to the earth, which is equal in magnitude and opposite in direction to the force of gravity on the earth due to the sun.

Expectations. After an ideal answer is authored, expectations can be formed by splitting the ideal answer to the main question. For example, the above ideal answer may be split into three expectations:

1. The sun exerts a gravitational force on the earth.
2. Newton's third law says that for every action there is an equal and opposite reaction.
3. Thus, the force of the earth on the sun will be equal and opposite to the force of the sun on the earth.

An expectation is a part of the ideal answer. It could be a rewording of a sentence or a clause from the ideal answer. The collection of expectations should cover all aspects of the ideal answer. The expectations may overlap but should differ enough in meaning that they are semantically distinguishable.

Misconceptions. A misconception is a typical wrong answer based on incorrect thinking or misunderstanding. It is usually hard for authors to pre-imagine what misconceptions learners may have. Therefore, misconceptions are usually added when they are identified from learners' inputs. It is fine to leave the misconception element blank in an initial EMT script.

Hints. A hint is a question that helps the learner construct one of the expectations. The answer to a hint should be a part of an expectation. Multiple hints could be created for each expectation. A hint should be phrased so that the learner answers with a sentence. For example, "What is going on in this situation between the earth and sun?" The answer to this hint could be "The sun exerts a gravitational force on the earth." Since the purpose is to help the learner construct the answer, a hint question should contain a minimum number of "juicy" answer words. For example, the above hint question only uses "sun" and "earth" and leaves "exerts" and "gravitational" for the learner to construct.


Prompts. In AutoTutor, a prompt is a question that helps the learner construct a small part (a word or a phrase) of an expectation. For example, "What is the type of force exerted by the sun on the earth?" The answer could be a single word, "gravity".

Hint and Prompt Answers. In addition to the expert's answers to hint and prompt questions, AutoTutor requires authors to provide possible student answers. Such answers could be correct, partially correct, wrong, irrelevant or of any other type. These answers are important because they are used for matching learners' inputs. In the initial authoring process, such answers are "imagined" by authors. Unfortunately, it is usually hard for an expert to exhaustively imagine all possible answers learners may give. In practice, new answers and answer types are added after learner answers are collected. Thus, AutoTutor EMT authoring is usually an iterative process.

Question Answering Pairs. AutoTutor-style conversation mostly has the learner answering questions, i.e., learners learn by answering the computer agents' questions. However, AutoTutor also responds to learners' questions that are highly related to the topic. In the initial authoring phase, authors may prepare answers for questions that learners are likely to ask. Authors may also add answers to questions asked by learners as they interact with AutoTutor.

Agent Assignment. If there are multiple agents, questions and answers are assigned to specific agents. In other words, a question could be asked by a tutor agent (e.g., in vicarious and tutoring mode) or a peer student agent (e.g., in teachable agent mode). An answer could also be given by an agent. The tutor agent always gives a correct answer. The peer student may give any type of answer. The wording of a question or answer must be consistent with the role of the agent. For example, a tutor agent may say "The correct answer is gravity", while a peer student agent may say "I think the answer is gravity".
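Pulling these elements together, the sketch below shows a simplified EMT unit for the sun-and-earth example as a plain Python dictionary, compatible with the earlier EMT-turn sketch in Sect. 2.1. The real ASAT scripts are XML-based (see the template URL in Sect. 3.4) and considerably richer; this layout is an assumption made for illustration only.

```python
# A hypothetical, simplified EMT script unit; not the actual ASAT format.
emt_unit = {
    "main_question": ("The sun exerts a gravitational force on the earth as the earth moves in "
                      "its orbit around the sun. Does the earth pull equally on the sun? Explain why."),
    "ideal_answer": "The force of gravity between earth and sun is an interaction ...",
    "pump": "Can you say a bit more about what is going on here?",
    "expectations": [
        "The sun exerts a gravitational force on the earth.",
        "Newton's third law says that for every action there is an equal and opposite reaction.",
        "The force of the earth on the sun is equal and opposite to the force of the sun on the earth.",
    ],
    "misconceptions": {"Only the sun pulls; the earth does not pull back.":
                       "Actually, forces always come in equal and opposite pairs."},
    "hints": {0: ["What is going on in this situation between the earth and sun?"]},
    "prompts": {0: [("What is the type of force exerted by the sun on the earth?", "gravity")]},
    "qa_pairs": [("What does Newton's third law state?",
                  "For every action there is an equal and opposite reaction.")],
    # Agent assignment: who delivers which moves (tutor vs. peer student).
    "agents": {"hints": "tutor", "assertions": "tutor", "wrong_answers": "peer"},
}
```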

3.4 Authoring Conversation Rules

Conversation rules are created to adaptively select displayable content, feedback and questions, and to change interface settings. Each AutoTutor lesson may have its own rule set, and a rule set may be shared by multiple lessons. Each rule has two parts: a condition part and an action part. The condition part specifies the condition under which the rule can be selected, and the action part specifies a sequence of actions to execute when the rule is selected. The selection variables in the condition part include (1) Status: the current conversation step; (2) Response: the type of the learner's input (e.g., complete answer, partial answer, etc.); (3) Event: the learner's actions or interface changes, such as new content being loaded or a button being pressed; (4) HasItem: a flag indicating whether a specific conversation item is available; (5) Priority: the selection priority when multiple rules satisfy (1) to (4); and (6) Frequency: a relative selection frequency used to compute the random selection probability when multiple rules are found according to (1) to (5).


The action part is a list of actions to be executed on the server side for further rule selection, or on the interface side to display new content, deliver speeches or change interface settings. An action is an agent-act-data triplet. The agent could be "system", "tutor" or "peer student"; the act could be any action that can be executed by the AutoTutor Conversation Engine (ACE) or by the interface program of the specific AutoTutor application; the data is additional information needed to execute the action. We cannot cover more details about the rules here due to space limitations. Readers who are interested in knowing more about AutoTutor rules may find an example AutoTutor script template at http://ace.autotutor.org/attrialogtemplate001.xml.
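The rule structure and selection logic described above might be sketched as follows. The field names mirror the description, but the matching and weighting details are our own assumptions, not the actual ACE implementation.

```python
# Illustrative rule selection sketch; the matching logic is a guess, not ACE's behavior.
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Rule:
    status: str                       # current conversation step
    response: str                     # learner input type, e.g. "partial_answer"
    event: Optional[str] = None       # e.g. "button_pressed"
    has_item: Optional[str] = None    # required conversation item, if any
    priority: int = 0
    frequency: float = 1.0
    actions: List[Tuple[str, str, dict]] = field(default_factory=list)  # (agent, act, data) triplets


def select_rule(rules: List[Rule], status: str, response: str,
                event: Optional[str], items: Dict[str, bool]) -> Optional[Rule]:
    candidates = [r for r in rules
                  if r.status == status and r.response == response
                  and (r.event is None or r.event == event)
                  and (r.has_item is None or items.get(r.has_item, False))]
    if not candidates:
        return None
    top = max(r.priority for r in candidates)
    best = [r for r in candidates if r.priority == top]
    # Frequency acts as a weight for random selection among equally ranked rules.
    return random.choices(best, weights=[r.frequency for r in best], k=1)[0]
```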

4 Automatization in AutoTutor Script Authoring

High-quality conversation script authoring is a difficult task. Just imagine writing conversation scripts for a Hollywood movie. A writer may write the conversation for each character in the story scene by scene, linearly. In writing conversation scripts for an ITS, however, there is an unknown character: the learner. An author may easily determine what the tutor, the peer student or any other computer agent should say without interacting with the learner. However, once the learner is considered, the authoring becomes complicated. In each conversation turn, an author needs to imagine all possible things a learner may say and prepare the next turn's agent scripts for each type of learner response. Assume there is a single script S-1 for the first conversation turn. For the second turn, because AutoTutor needs to continue the conversation adapting to the learner's input, there will be multiple scripts S-2-1, S-2-2, …, S-2-N. The number of scripts may increase exponentially with the number of turns. AutoTutor solved this problem by restricting the conversation within the EMT framework, so that the conversation paths always converge to the answer of the main question.

Authoring AutoTutor scripts becomes easier with the help of the AutoTutor Script Authoring Tools. However, understanding and authoring high-quality conversation units is still challenging for domain experts. Recent advances in natural language processing (NLP) [21] may help to make AutoTutor script authoring easier. Question generation techniques may be used for AutoTutor hint and prompt generation [5, 18]. AutoTutor targets deep knowledge and requires deep reasoning questions to achieve this goal [12]. Using automatic question generation techniques, the authoring load is reduced from creating questions to selecting and editing them. Moreover, the automatically generated deep questions could be good examples that help authors construct high-quality questions.

Automatic question answering techniques may be used in AutoTutor so that authors do not have to prepare answers for possible student questions. Currently, question answering systems are mostly information retrieval systems [17]. Conversational systems like AutoTutor require answers in the style of a conversation. A conversational answer needs to be adequately conveyed through speech; such answers are usually short and clear. Long answers could be delivered as a conversation between multiple computer agents and the learner. Therefore, conversational question answering involves two types of language processing: (1) generating an answer to a given question; and (2) converting an answer into conversational scripts.


Text clustering techniques could be very helpful in data-driven script authoring. Learners’ answers could be collected and presented as semantic clusters so that authors can create adaptive feedback [1, 26]. Iterative answer clustering could increase the accuracy of user input assessment and thus increase system performance.
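As a rough illustration of such data-driven clustering, the sketch below groups a handful of learner answers with scikit-learn's TF-IDF vectorizer and k-means. The choice of library, parameters, and example answers is ours, not that of the cited work; a real pipeline would iterate with author feedback on each cluster.

```python
# A sketch of data-driven answer clustering, assuming scikit-learn is available.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

learner_answers = [
    "the earth pulls back on the sun with an equal force",
    "gravity from the earth acts on the sun equally",
    "the sun is much bigger so it pulls harder",
    "the bigger mass pulls more strongly",
]

vectors = TfidfVectorizer().fit_transform(learner_answers)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for cluster_id in set(labels):
    members = [answer for answer, label in zip(learner_answers, labels) if label == cluster_id]
    print(f"Cluster {cluster_id}: {members}")
    # An author reviews each cluster once and attaches one adaptive feedback move per cluster.
```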

5 Conclusion

The AutoTutor EMT framework successfully solved the conversation-turn explosion problem. Yet, authoring an effective conversational ITS like AutoTutor is still an expensive process. Using automatic natural language processing techniques may help reduce the cost and increase the quality of authoring. Would it be possible in the future for effective conversational tutoring scripts to be generated fully automatically? That may happen some time after Hollywood movie scripts can be generated fully automatically.

Acknowledgment. The research was supported by the National Science Foundation (DRK12-0918409, DRK-12 1418288), the Institute of Education Sciences (R305C120001), Army Research Lab (W911INF-12-2-0030), and the Office of Naval Research (N00014-12-C-0643; N00014-16-C-3027). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF, IES, or DoD. The Tutoring Research Group (TRG) is an interdisciplinary research team comprised of researchers from psychology, computer science, and other departments at the University of Memphis (visit http://www.autotutor.org).

References 1. Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-32234_4 2. Bitzer, D.L., Skaperdas, D.: PLATO IV-An Economically Viable Large Scale ComputerBased Education System. National Electronics (1968) 3. Cai, Z., Gong, Y., Qiu, Q., Hu, X., Graesser, A.: Making AutoTutor agents smarter: AutoTutor answer clustering and iterative script authoring. In: Traum, D., Swartout, W., Khooshabeh, P., Kopp, S., Scherer, S., Leuski, A. (eds.) IVA 2016. LNCS (LNAI), vol. 10011, pp. 438–441. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47665-0_50 4. Cai, Z., Graesser, A.C., Hu, X.: ASAT: AutoTutor script authoring tool. In: Sottilare, R., Graesser, A.C., Hu, X., Brawner, K. (eds.) Design Recommendations for Intelligent Tutoring Systems: Authoring Tools, pp. 199–210. Army Research Laboratory (2015) 5. Cai, Z., Rus, V., Kim, H.-J.J., Susarla, S.C., Karnam, P., Graesser, A.C.: NLGML: a Markup Language for question generation. In: Proceedings of E-Learn: World Conference on ELearning in Corporate, Government, Healthcare, and Higher Education, Honolulu, Hawaii, USA, pp. 2747–2752, October 2006 6. Craig, S.D., Gholson, B., Ventura, M., Graesser, A.C.: Overhearing dialogues and monologues in virtual tutoring sessions: effects on questioning and vicarious learning. Int. J. Artif. Intell. Educ. 11, 242–253 (2000)


7. Graesser, A., et al.: Why/AutoTutor: a test of learning gains from a physics tutor with natural language dialog AutoTutor and why/AutoTutor. In: Proceedings of the 25th Annual Conference of the Cognitive Science Society, Boston, pp. 474–479 (2003)
8. Graesser, A.C.: Conversations with AutoTutor help students learn. Int. J. Artif. Intell. Educ. 26(1), 124–132 (2016)
9. Graesser, A.C., Forsyth, C.M., Foltz, P.: Assessing conversation quality, reasoning, and problem-solving performance with computer agents. In: The Nature of Problem Solving: Using Research to Inspire 21st Century Learning (2017)
10. Graesser, A.C., Hu, X., Sottilare, R.: Intelligent tutoring systems. In: International Handbook of the Learning Sciences (2018)
11. Graesser, A.C., et al.: AutoTutor: a tutor with dialogue in natural language. Behav. Res. Methods Instrum. Comput. 36(2), 180–192 (2004)
12. Graesser, A.C., Rus, V., Cai, Z.: Question Classification Schemes. In: The Workshop on Question Generation (2008)
13. Graesser, A.C., Wiemer-Hastings, K., Wiemer-Hastings, P., Kreuz, R.: AutoTutor: a simulation of a human tutor. Cogn. Syst. Res. 1(1), 35–51 (1999)
14. Leelawong, K., et al.: Teachable agents learning by teaching environments for science domains. In: Proceedings of the Fifteenth Annual Conference on Innovative Applications of Artificial Intelligence, pp. 109–116 (2003)
15. Van Meer, E.: PLATO: From computer-based education to corporate social responsibility. Iterations Interdiscip. J. Softw. Hist. 2, 1–22 (2003)
16. Millis, K., Forsyth, C., Butler, H., Wallace, P., Graesser, A., Halpern, D.: Operation ARIES!: a serious game for teaching scientific inquiry. In: Ma, M., Oikonomou, A., Jain, L.C. (eds.) Serious Games and Edutainment Applications, pp. 169–195. Springer, London (2011). https://doi.org/10.1007/978-1-4471-2161-9_10
17. Mishra, A., Jain, S.K.: A survey on question answering systems with classification. J. King Saud Univ. Comput. Inf. Sci. 28(3), 345–361 (2016)
18. Olney, A.M., Graesser, A.C., Person, N.K.: Question generation from concept maps. Dialogue Discourse 3(2), 75–99 (2012). Editors: Paul Piwek and Kristy Elizabeth Boyer
19. Person, N.K., Graesser, A.C., Kreuz, R.J.: Simulating human tutor dialog moves in AutoTutor (2001)
20. Rezaei, E., Zaraii Zavaraki, E., Hatami, J., Abadi, K.A., Delavar, A.: The effect of MOOCs instructional design model-based on students’ learning and motivation. Man India 97(2017), 115–126 (2017)
21. Sun, S., Luo, C., Chen, J.: A review of natural language processing techniques for opinion mining systems. Inf. Fusion 36, 10–25 (2017)
22. Taylor, P., Isard, A.: SSML: a speech synthesis Markup Language. Speech Commun. 21(1–2), 123–133 (1997)
23. Vanlehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 46(4), 197–221 (2011)
24. Vanlehn, K., Graesser, A.C., Jackson, G.T., Jordan, P., Olney, A., Rosé, C.P.: When are tutorial dialogues more effective than reading? Cogn. Sci. 31(1), 3–62 (2007)
25. Wiemer-Hastings, P., Graesser, A.C., Harter, D., Grp, T.R.: The foundations and architecture of AutoTutor. Intell. Tutoring Syst. 1452, 334–343 (1998)
26. Xu, J., et al.: Self-taught convolutional neural networks for short text clustering. Neural Netw. 88, 22–31 (2017)
27. Yousef, A.M.F., Chatti, M.A., Schroeder, U., Wosnitza, M., Jakobs, H.: A review of the state-of-the-art. In: Proceedings of the 6th International Conference on Computer Supported Education - CSEDU 2014, pp. 9–20 (2014)

A Conversation-Based Intelligent Tutoring System Benefits Adult Readers with Low Literacy Skills

Ying Fang1, Anne Lippert2, Zhiqiang Cai2, Xiangen Hu1, and Arthur C. Graesser1

1 Department of Psychology, The University of Memphis, Memphis, TN 38152, USA
[email protected]
2 Institute for Intelligent Systems, The University of Memphis, Memphis, TN 38152, USA

Abstract. This article introduces three distinctive features of a conversation-based intelligent tutoring system called AutoTutor. AutoTutor was designed to teach low-literacy adult learners comprehension strategies across different levels of discourse processing. In AutoTutor, three-way conversations take place between two computer agents (a teacher agent and a peer agent) and a human learner. Computer agents scaffold learning by asking questions and providing feedback. The interface of AutoTutor is simple and easy to use and addresses the special technology needs of adult learners. One of AutoTutor’s strengths is that it is adaptive and as such can provide individualized instruction for the diverse population of adult literacy students. The adaptivity of AutoTutor is achieved by assessing learners’ performance and branching them into conditions with different difficulty levels. Data from a reading comprehension intervention suggest that adult literacy students benefit from using AutoTutor. Such learning benefits may be increased by enhancing the adaptivity of AutoTutor. This may be accomplished by tailoring instruction and materials to meet the various needs of individuals with low literacy skills.

Keywords: Intelligent tutoring system · AutoTutor · Conversational agents · Adult learner

1 Introduction

Approximately one in five adults aged 16 or older in thirty-three of the OECD (Organization for Economic Cooperation and Development) countries has literacy skills at a low level of proficiency [19]. A challenge for literacy centers is that low-literacy adults have heterogeneous characteristics such as age, race/ethnicity, country of origin, highest educational level attained, literacy skills, interests, and goals [2]. The diversity of this population makes it difficult for teachers to differentiate instruction to optimal levels to meet the needs of a group or classroom of students. A computer program, on the other hand, can use learner responses to adapt instruction for each student and get closer to this optimal level [4, 13, 25].


Besides diversity in skill level, another challenge for literacy programs is high absenteeism and attrition rates due to unstable work hours, transportation difficulties, and childcare issues [9, 10, 16, 22]. A web-based literacy program can also help with this issue. Being able to access a computer program on the Internet is an excellent way for absent adult learners to continue to improve their reading abilities. It gives students more choice about when and where they work on their literacy skills. These reasons and others are why many believe computer programs can offer a solution to the challenges faced by adult literacy centers. Web-based reading programs may indeed be a feasible solution, especially considering that computers with Internet access are becoming more common in literacy centers and in adult learners’ homes. In the United States, 1,200 federally funded adult literacy programs were surveyed between 2001 and 2002, and the results indicated that 80% of the programs used computers in some capacity with adult learners [24]. According to the 2003 National Assessment of Adult Literacy [14], 67% of adults who read at grade levels 3 to 7.9 had a computer in their home with Internet access. Computers with Internet access are also available to adult learners in public libraries, children’s schools, and adult literacy programs. Newnan (2015) surveyed more than 1,000 programs and found that more than 80% of survey respondents had computers in their classrooms with consistent access to the Internet (although significant variability was noted) [17]. In addition, it has been reported that an increasing number of adult literacy programs are infusing technology into their classrooms and curricula [21]. As part of the effort to increase computer-based instruction, we developed an intelligent tutoring system (ITS) called AutoTutor to help adult learners improve reading comprehension skills. In this paper, we introduce three distinctive features of AutoTutor and report the effectiveness of AutoTutor based on the early results of a reading comprehension intervention using AutoTutor.

2 Unique Features of AutoTutor

AutoTutor is a conversation-based intelligent tutoring system (ITS) shown to promote learning on a wide range of topics [5, 18]. Research reports average learning gains of approximately 0.8 standard deviations for AutoTutor compared to various traditional teaching controls [18]. AutoTutor teaches literacy skills by holding conversations in natural language. Conversations between a human student and a pedagogical agent are based on the expectation-misconception tailored (EMT) approach [7]. In this approach, a human student provides answers to questions asked by the agent. The AutoTutor system uses this response to assess a learner’s understanding of the content by comparing it to expected answers or misconceptions in real time. Using this EMT approach, AutoTutor constantly assesses the student and provides feedback, hints, pumps, and prompts to guide the learner through the content. Traditional AutoTutor systems implement conversations called dialogues, which model the interactions that occur between a single human tutor and a human student. More recent versions of AutoTutor may employ trialogues, which are tutorial conversations among three actors: a teacher agent, a human learner, and a peer agent [6].
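A highly simplified sketch of that comparison step is given below. It is our illustration only: AutoTutor's actual assessment uses richer semantic matching (e.g., latent semantic analysis) rather than the word-overlap stand-in here, and the function names are hypothetical.

def overlap(response, target):
    # Crude stand-in for semantic similarity between learner text and a target.
    r, t = set(response.lower().split()), set(target.lower().split())
    return len(r & t) / max(len(t), 1)

def evaluate_turn(response, expectations, misconceptions, threshold=0.6):
    moves = []
    for m in misconceptions:
        if overlap(response, m) >= threshold:
            moves.append(("correct_misconception", m))
    uncovered = [e for e in expectations if overlap(response, e) < threshold]
    if not uncovered:
        return moves + [("positive_feedback_and_summary", None)]
    # Otherwise give feedback and a hint or prompt on one missing expectation,
    # which is how the conversation converges on the expected answer.
    return moves + [("hint_or_prompt", uncovered[0])]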


Trialogues are implemented in AutoTutor for CSAL, an ITS developed by researchers at the Center for the Study of Adult Literacy. The system was designed to help adults with low literacy acquire strategies for comprehending text at multiple levels of language and discourse. The AutoTutor curriculum has 35 lessons that focus on specific comprehension components [5]. Each AutoTutor lesson takes 10 to 50 min to complete. The lessons typically start with a 2–3 min video that reviews a comprehension strategy. After the review, the computer agents scaffold students through the learning via conversation.

2.1 Easy to Use Interface

Before designing the system, a study was conducted to assess the digital literacy of 105 adult learners who read between the 3rd- and 8th-grade levels. The assessment consisted of a behavioral test and a self-report questionnaire. For the behavioral test, the participants were presented with tasks to complete in a simulated computer interface. For example, if the task is to drag a file to the Recycle Bin, the user must actually click and drag a file in the simulated interface and then release it over the Recycle Bin. The behavioral tasks covered various types of computer skills from four categories: Basic Computer Skills, World Wide Web, Windows, and Email. The questionnaire asked the participants about their frequency of computer use and computer habits. Results showed that the percentage of tasks the adults could perform correctly ranged from 20% to 96%. The tasks in which the adults were least proficient (

[Table fragment] Effect size (ES) descriptors: ES > 1.00, very large; ES > 2.00, Bloom's challenge. Notes: Extended from suggestions by Cohen [32]; What Works Clearinghouse [16]; Bloom [8].

Meta-analyses of digital tutoring have been performed by VanLehn [35] and Kulik and Fletcher [36]. VanLehn reviewed 27 studies of digital tutoring and found that they averaged a moderate effect size of 0.59. However, in further investigation of his data, he found an average effect size of 0.40 for sub-step-based tutoring compared to a large effect size of 0.76 for step-based tutoring. In other words, learning by 50th percentile learners would improve (roughly and on average) to the 66th percentile under fine-grained tutoring – about the same as frame-based learning – but improve (roughly and on average) to the 78th percentile under more general, less specific tutorial interactions. Additional research may better account for this finding, which may have been due to the need for students to reflect more carefully and thoroughly on the results of step-based tutoring than on those of sub-step-based tutoring. An extensive and more recent analysis by Kulik and Fletcher [36] found a large effect size of 0.66 for 50 digital tutors, with data ranging from –0.34 to 3.18 in effect size (after Winsorizing for outliers) – an overall result between VanLehn's findings for sub-step and step-based tutoring, but appreciably closer to the latter than the former. In any case, these findings, which used a more precise definition of digital tutoring than some earlier analyses, suggest substantial learning improvements over both classroom instruction and applications of frame-based programmed learning techniques in CAI.
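The percentile figures follow from treating an effect size as a shift on a standard normal distribution; a quick check of that arithmetic (our illustration, using SciPy):

from scipy.stats import norm

# A 50th-percentile learner in the control condition is expected to move to
# the percentile corresponding to the effect size d under tutoring.
for label, d in [("sub-step-based", 0.40), ("step-based", 0.76), ("overall", 0.59)]:
    print(label, round(norm.cdf(d) * 100))  # prints roughly 66, 78, and 72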

2.2 Design and Development of the DARPA Digital Tutor

Given this context, the design, development, and two recent assessments of a digital tutor developed by the Defense Advanced Research Projects Agency (DARPA) for the US Navy deserve attention. Development of this technology has been proceeding steadily since Feurzeig's MENTOR [14]. However, DARPA's mission is to leap ahead and develop high-payoff research that is too risky and expensive to be considered by the military Service laboratories. An example is the development of cigarette-pack-sized devices to replace suitcase-sized systems for determining locations on earth using Global Positioning System satellites. In education and training, DARPA's intent was to substantially accelerate the acquisition of expertise, well beyond journeyman levels, by learners starting with little, if any, prior knowledge or training in the subject area.


The subject matter chosen was Information Systems Technology, which is abbreviated by the Navy as IT in reference both to the technology and to individuals with this occupational specialty. Design and development of the Tutor was basically a matter of identifying high-quality tutorial ingredients and applying them in proportions determined by systematic empirical testing. It was pragmatic and not focused on verifying any particular theory of learning, cognition, and/or instruction. The developers had acquired years of experience studying one-on-one tutoring by humans. Findings from that work provided initial approaches for designing the DARPA Tutor. DARPA funding allowed extensive assessments in devising the Digital Tutor [37–39]. Early assessments of tutoring techniques used IT novices of about the same age, education level, and capabilities as novice Sailors assigned to the Navy IT school. The Tutor was developed step by step in sessions with detailed video and additional voice recordings to determine what was effective and what was not. These sessions were repeated as needed to develop instructional activities and interactions that reliably accomplished targeted learning objectives. This iterative approach provided the foundation for a problem-based learning environment, allowing computer development of subject matter and individualized information structures that were functionally similar to those of human tutors with expertise in both the subject matter and one-on-one instruction. Experts in specific IT topics were contacted based on their knowledge, publications, and reputation. These experts were auditioned in 30-min IT tutorial sessions with learners who were representative of new Navy sailors. The intention was to base (or "clone") the Digital Tutor on the practices of individuals who were expert in both an IT topic and in one-on-one tutoring. These sessions helped identify and select 24 tutors who were experts in requisite IT topics for use in designing the Digital Tutor. The tutors then tutored 15 IT-qualified sailors who were newly graduated from recruit training and chosen at random. The sailors were tutored one-on-one by these experts for 16 weeks to prepare them for IT careers in the Navy. Every session in this tutoring was again captured on video. These sessions, which were extensively reviewed and assessed, served as the basis for the tutorial instruction provided by the Digital Tutor. This analytic work, including further trials and analyses, continued as the Tutor was developed. The Tutor employs the following prescriptive procedures:
• Promote reflection by eliciting learner explanations of what went well and what did not;
• Probe vague and incomplete responses;
• Allow learners to discover careless errors but assist learners in correcting errors arising from lack of knowledge or misconceptions;
• Never articulate a misconception, provide the correct answer, or give a direct hint;
• In the case of a learner impasse, review knowledge and skills already successfully demonstrated by the learner and probe for why they might or might not be relevant to the current problem; and
• Require logical, causal, and/or goal-oriented reasoning in reviewing or querying both incorrect and correct actions taken by the learner to solve problems.


The Tutor used information structures to:
• Model the subject matter;
• Generate evolving models of the learner;
• Generate, adapt, and assign problems to maximize progress of individual learners;
• Engage in tutorial exchanges that shadow, assess, and guide learners' problem solving; and
• Ensure that learners reflect on and understand deeper issues and concepts illustrated by the problems.

Operationally, the design of the Digital Tutor emphasizes:
• Active, constant interaction with learners – which fostered the "flow" that is found in computer-based games [40].
• Capturing in digital form the processes and practices of one-on-one tutoring;
• Requiring problem solving in authentic environments – learners used actual Navy systems, not simulations of these systems. Problem solving was not based on copying the problem solving paths of experts. The Tutor was expected to help learners follow whatever path they chose to troubleshoot and solve problems.
• Continual, diagnostic assessment of individual learner progress;
• Focus on higher-order concepts underlying problem solving processes and solutions;
• Integration of human mentors. The presence of experienced Navy ITs as mentors was essential for this training. They resolved difficulties in human-computer communication, managed the study halls, and, especially, provided examples of Navy bearing and culture for the novice sailors. "Sea stories" might be viewed as little more than entertainment, but, as with Army "War Stories" and Air Force "Air Stories", few activities are as effective and important as these stories in providing civilians with the esprit de corps and culture needed to prepare them for military service.

Mirroring its development strategy, the Tutor's instructional approach is spiral. It presents conceptual material selected for individual learners by the tutor. This material is immediately applied in solving problems intended to be comprehensive and authentic. Learners interact directly with US Navy IT systems, not simulations, while the Tutor observes, tracks, and models their progress and solution paths. Tutoring tactics developed for the Tutor were the following:
• Promote learner reflection and abstraction by:
  – Prompting for antecedents, explanations, consequences, or implications of answers.
  – Questioning answers, both right and wrong.
  – Probing vague or incomplete explanations and other responses by the learner.
• Review knowledge and skills when the learner reaches an impasse or displays a misconception by asking why something did or did not happen.
• Avoid providing a correct answer, providing a direct hint, or articulating a misconception.


• Sequence instruction to pose problems that are tailored and selected to optimize each learner's progress.
• Require logical, causal, or goal-oriented reasoning in reviewing or querying steps taken by the learner to solve problems.
• Refocus the dialogue if the learner's responses suggest absent or misunderstood concepts that should have been mastered.
• If a learner makes a careless error in applying a concept already mastered, allow problem-solving to continue until the learner discovers it.
• Verify learner understanding of any didactic material before proceeding.
(These tactics are illustrated as a simple dialogue policy in the sketch below.)

A daily schedule for instruction consisted of 6 h using the Tutor followed by a two-hour study hall, which was proctored by one of the Navy instructors assigned to the school. It involved discussion and reflection on material presented during the day. At the end of the week, one of the senior designers of the Tutor would attend to participate in the discussion, address particularly difficult issues that the learners had encountered during the week, and, in return, gain insight into what the Tutor was doing well and not well.
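A minimal sketch of how tactics like those listed above might be reduced to a dialogue policy (our illustration only, with hypothetical state fields; the actual Tutor's decision logic was far richer):

def choose_tutor_move(state):
    # state: a hypothetical dict of booleans summarizing the learner's last action.
    if state["careless_slip_on_mastered_concept"]:
        return "continue_problem"            # let the learner discover the error
    if state["at_impasse"] or state["misconception_suspected"]:
        return "review_prior_knowledge_and_ask_why"
    if state["response_vague_or_incomplete"]:
        return "probe_for_explanation"
    if state["step_correct"]:
        return "question_the_answer"         # probe reasoning behind right answers too
    return "pose_tailored_problem"           # never give the answer or a direct hint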

2.3 Effectiveness of the DARPA Digital Tutor for the Active Navy

Navy Assessment

After the training was about half finished, the IT knowledge of the human-tutored sailors was assessed by a paper-and-pencil test prepared by Navy instructors. It included multiple-choice, network-diagram, and essay questions answered by the 15 human-tutored and 17 classroom-instructed sailors. The tutored sailors averaged 77.7 points compared to 39.7 points for the classroom-trained sailors on this test, which indicated an effect size of 2.48 in their favor, exceeding Bloom's 2.00 target [39]. Four other assessments of the Digital Tutor were performed with different sailors participating at progressive stages of the Tutor's development: 4 weeks, 7 weeks, 10 weeks, and finally 16 weeks [37, 39]. The first assessment of the Tutor compared the IT knowledge of 20 new sailors, who had completed the first 4 weeks of Digital Tutor training then available, with that of 31 sailors who had graduated from approximately 10 weeks of standard classroom training and with that of 10 Navy IT instructors. This study found an effect size of 2.81 in favor of the 4-week Tutor students over the students who had graduated from the 10-week IT course and an effect size of 1.32 in their favor compared to their Navy instructors. These differences were also statistically significant (p < 0.05) [39]. The next assessment compared both the IT troubleshooting ability and IT knowledge of 20 new sailors, who had completed the 7 weeks of Digital Tutor training then available, with that of 20 sailors who had graduated from a newly revised 19-week IT classroom and laboratory training course and with that of 10 instructors who only took the knowledge test. The IT troubleshooting effect size favoring the 7-week Tutor sailors over the 19-week classroom and laboratory sailors was 1.86.


The IT knowledge effect size favoring the 7-week Tutor sailors over the 19-week classroom and laboratory sailors was 1.91, and it was 1.31 in comparison with their instructors. Again, these differences were statistically significant (p < 0.05) [37, 39]. A final assessment was performed after another representative group of 12 sailors had completed training with the final 16-week version of the Tutor. The DARPA challenge was to produce in 16 weeks (the usual time for ab initio IT training) sailors who would be superior in skill and knowledge to (a) other novice sailors trained using conventional classroom and laboratory practice, and (b) ITs with many years of experience in the Fleet [39]. The assessment involved new sailors who completed IT training with the DARPA Digital Tutor, other new sailors trained for 35 weeks using the Navy's classroom-based Information Technology Training Continuum (ITTC), and senior ITs with an average of 9.2 years of Fleet experience. As with all the assessments, sailors who had just finished recruit training were assigned at random to the two training groups (DT and ITTC standard classroom training with laboratory experience). The Fleet ITs were chosen as the "go to" ITs from ships on shore duty in San Diego and Oak Harbor, Washington. There were 12 ITs in each group. Repeated measures were used because of the small sample sizes: 14 h of IT troubleshooting skill testing and 4 h of written (mostly short-answer) knowledge testing were used in the assessments. Other tests, such as oral examination by experienced ITs, development and design of IT systems according to typical specifications, and ability to ensure the security of an IT system, were also applied. IT troubleshooting in response to trouble tickets was the most important component of the training in preparing these novice sailors for their Navy IT occupation. It was intended to resemble Fleet IT requirements as closely as possible. Navy "Trouble Tickets," which had been submitted from the Fleet for shore-based assistance, were presented as problems to be solved by 3-member teams with a specified time for solution. Results of the troubleshooting testing are shown in Fig. 3. Notably, neither the ITTC nor the Fleet teams attempted to solve the "Very Hard" problems.

Fig. 3. Troubleshooting problems solved by DT, Fleet, and ITTC teams


Troubleshooting capability was the primary focus of the assessment because it best indicated how well the new ITs were prepared to do their jobs in the Fleet. Acquisition of IT knowledge was also of interest and was found to account for about 40% of individuals' troubleshooting scores. This is an appreciable amount, but performance in IT troubleshooting is the main concern of the Navy. More description of these assessments, along with additional data, testing, and findings, is provided by [39].

Veterans Assessment

An approximate replication of the above assessment was provided by assessing an 18-week version of the Tutor used to train 100 military veterans [38]. The course was extended by 2 weeks to assist the veterans in adapting and applying to civilian technical workplaces. As Table 2 shows, most of the veterans were unemployed before taking this course. There were no academic dropouts from the course, which was completed by 97 of the veterans. Fourteen of the veterans chose to seek higher education rather than apply directly for employment. Another 6 veterans did not reply to requests for post-training information. All 77 of the graduates who sought employment were hired. Their average annual salary was $73,000, which, at the time, was equivalent to civilian employment intended for IT technicians with 3–5 years of IT experience [41]. Most received early bonuses and promotions.

Table 2. Characteristics of 101 veterans (a) accepted for digital tutor IT training
Average years of separation from military service: 5.20
Avg age: 30.5
Married: 30
Armed Forces Qualification Test: 87.1
Full-time employment: 11
Part-time employment: 45
Prior civilian IT instruction: 8
High school/GED degree: 45
AA degree: 11
BA/BS: 44
Other: 1
Prior military IT instruction: 4
Note: (a) One veteran dropped out before beginning the course and was replaced.

Results from the return-on-investment analysis are shown in Fig. 4. It shows that the monetary return to the government over a 20-year period is appreciable for all monetary support provided to veterans. However, the return is much greater for the 18-week digital tutoring program than for government support for either a 2- or 4-year degree, and even greater than for veterans who completed a program of education after receiving no monetary support from the Veterans Administration [38].


Fig. 4. Monetary Return to US Government Per Individual from Support Provided for Education and/or Training ($000)

2.4 Summary Comments on Digital Tutoring to Provide Adaptive Instruction

Assessments of the DARPA Digital Tutor suggest a number of possibilities and issues, four of which are discussed here. Others will doubtless occur to readers.

Acceleration of Expertise

The value of technical expertise is as evident from empirical research as it is from random observation [42, 43]. However, the years of experience and practice needed to develop technical expertise increase its cost and limit its supply. Empirical demonstrations that the time to develop technical expertise can be compressed from years into months are few, but extant. For instance, the Sherlock project [44–46] prepared technicians to solve complex problems occurring in a test stand used to troubleshoot components of Air Force avionics systems. Assessments found that 20–25 h of Sherlock training produced about the same improvement in performing difficult and rarely occurring diagnostic tasks as 4 years of on-job experience [44]. Their approach presaged that of DARPA's Tutor in assuring that intensive technical learning was always followed by guided reflection on what worked and what did not. Other evidence was provided by IMAT, the Navy's Interactive Multi-Sensor Analysis Trainer [47]. This system focused on what the authors described as 'incredibly complex tasks': broad, multifaceted, abstract, co-dependent, and nonlinear tasks that require a large repertoire of patterns and pattern-recognition capabilities for their solution. An at-sea trial found that 2 days of training with a laptop version of IMAT increased submarine effective search area by a factor of 10.5 [48]. In effect, a submarine with IMAT-trained sonar operators could provide the sonar surveillance of 10 submarines with operators who lacked IMAT training. The operational and monetary value of this capability is substantial. These examples, in addition to the DARPA Digital Tutor discussed here, suggest training advances of considerable value, including advances that may reliably and significantly accelerate the acquisition of expertise: waiting and within our reach, but not yet in our grasp.

Return on Investment

Training and education are often viewed as expenses, not investments, which does not serve well either of them or the many who benefit from them.


If designed honestly and well,1 digital tutors are expensive to design and build. However, the monetary and operational costs of not doing so may be far greater. No quantity of digital tutors would cost more than the loss of a single submarine, or any ship that was lost because its internal network system failed. That aside, analysis [49] found that continuing to provide standard classroom IT training, with its requirement for years of follow-on development and on-job training, is far more expensive than the cost to design, develop, and reliably update a digital tutor. For instance, assuming that the Navy must train 2,000 ITs a year for the Fleet, the costs saved by using the 16-week Digital Tutor program to replace 16 weeks of classroom training followed by 7 years of on-job training were estimated to average $109M annually, in discounted dollars, over a 12-year period [49]. As further shown by the development and delivery of the DARPA Digital Tutor to veterans, its design, development, and delivery costs were substantially less than the costs to provide standard Veterans Administration education benefits for individuals to acquire a 2-year or 4-year college degree [38]. For that matter, individuals who completed a 4-year college degree in information network technology with no support from the government returned considerably less to the government in income than ITs who completed the Digital Tutoring program, which was extended to 18 weeks to prepare veterans for the civilian marketplace. The internal rate of return on the government investment that provided 18 weeks of housing, meals, and Digital Tutoring to prepare a veteran for a career in IT was estimated to be about 35 percent over a 20-year period. In general, training and education, and their consumers, might benefit significantly if, in addition to assessing their effectiveness for learning, our evaluations and assessments of instructional capabilities additionally included assessments of their likely return on investment.

Applying Digital Tutors

As discussed above, a case can be made that Digital Tutoring is more effective and suitable for instructional objectives involving conceptual learning rather than those concerning the initial rudiments of any subject. Objectives for these rudiments, such as nomenclature, common procedures, and basic procedures, are found at the low end of Bloom's often-referenced hierarchy of instructional objectives [6]. Nonetheless, they are essential for learning and instruction in most, if not all, subject matter. As early CAI programs demonstrated, these rudiments are readily learned through the techniques of drill and practice. Most successful drill and practice programs focus on discrete items such as arithmetic facts, vocabulary words, orthography, technical terms, and the like. Drill and practice is an effective and, when well done, motivating approach. Its promise was demonstrated early for introducing basics in subjects such as beginning reading [50] and elementary mathematics [3, 4, 51, 52]. Comments about "drill and kill" may apply to some classroom learning, but drill and practice programs presented by computer have been found to be successful and enjoyed by early learners [34, 53, 54]. Considerable data from the 1960s, 1970s, and onward have shown that these rudiments can be acquired efficiently and effectively through computer-assisted drill and practice.

1 “Intelligent Tutoring” has long been used as a marketing term by training developers and contractors.


Some drill and practice programs have applied sophisticated approaches such as statistical optimization routines to select and present items that maximize individual learning given constraints such as the time available and a learner's progress [51, 55–58]. Comparisons of these early drill and practice programs with conventional classroom instruction generally found effect sizes of about 0.40 [33]. Moreover, as shown in Table 3, digital tutoring does not do as well in preparing learners with these rudiments as does drill and practice. Tutoring appears necessary and better suited for the next step: applying subject matter rudiments to develop the abstractions and concepts needed for a deeper and more nuanced understanding of the subject matter. As the effect sizes in the table suggest, this conceptual area appears to be where digital tutoring is most needed and most successful in providing a full understanding of the subject matter and applying it successfully.

Table 3. Effect sizes for four digital tutoring systems assessed for conceptual and rudimentary learning
Source                        Concepts      Rudiments
Graesser et al. [59]          0.34          0.00
Koedinger et al. [60]         0.99          0.36
Person et al. [61]            0.30          0.03
VanLehn [35]                  0.95          −0.08
Average (Standard Deviation)  0.65 (0.326)  0.08 (0.168)

The idea of pairing drill and practice programs with digital tutoring is becoming less heretical in digital tutoring circles. For instance, this approach is suggested by Nye et al. [62], along with the sensible caution that drill and practice may be overdone by focusing entirely on solving specific problems. Overall, rudiments may be best left to drill and practice techniques, with digital tutoring brought in once the rudiments are sufficiently acquired. Reflection, which is enabled by dialogue exchanges in digital tutoring, reveals the abstract and generalizable concepts underlying the problems presented and increases both retention and transfer of what is learned [45, 46, 60, 63, 64].

Team Training

An issue raised by Fletcher and Sottilare [65] concerns Digital Tutoring capabilities applied to training for teams. As Jones [66] suggested, teams differ in their degree of "teamness". That is to say that in some teams the members perform their tasks almost independently, passing off their contributions without much adjustment in their actions based on what other team members do. In these cases, application of Digital Tutoring seems relatively simple: team members can be trained and taught in much the same manner that individuals are trained and taught. Many positions in baseball, as Jones points out, are like this. However, in some teams the interactions of at least some members depend closely on what other team members may do. Doubles tennis is a good example of this interaction. In baseball, as Jones suggests, the pitcher and the catcher may comprise a genuine team, with each responding to actions taken by the other. Whether we agree with this example or not, it seems evident that teams and their members differ in the teamness of their responsibilities.


Sottilare and others [65, 67] suggest that there is a role for intelligent tutoring in team training and that it may be organized and assisted by the Generalized Intelligent Framework for Tutoring (GIFT). Work in this area is recent and continuing. It has yet to be applied and assessed in a team training context, but, considering the intense requirements for teams and team activity in Defense, it seems likely to proceed.

3 Final Comment

No "magic sauce" or specific academic theory was used to produce the DARPA Digital Tutor. It was designed and developed by using empirical means to identify high-quality, but well-known, tutorial ingredients and applying them in proportions determined by systematic, empirical testing. Certainly, theory for instruction is essential [51, 58], but the Digital Tutor was initiated by a DARPA challenge to solve a practical problem. The approach used to develop a solution was based on performance requirements rather than an attempt to prove a theory. Like education and training, practice and theory appear to exist on a continuum, but the Tutor was more focused on solving a practical problem than on proving a theory. Its development was fundamentally eclectic and pragmatic, based on an iterative, formative evaluation approach. The DARPA Digital Tutor may have realized a breakthrough in the technology of adaptive learning. It was a catalyst for a 2017 National Academies of Sciences, Engineering, and Medicine symposium to press for wider and more routine use of this technology in order to prepare the national technical workforce for both present and emerging challenges to the national economy and productivity. The consensus was that digital tutoring technology is essential and ready to assume this responsibility. Nonetheless, how best to move it from the laboratory into the field remains undetermined.

References 1. James, W.: Principles of Psychology, vol I. Dover Press, New York (1890/1950) 2. Thorndike, E.L.: Principles of Teaching. A. G. Seiler & Company, New York (1906) 3. Suppes, P.C., Fletcher, J.D., Zanotti, M.: Performance models of American Indian students on computer-assisted instruction in elementary mathematics. Instr. Sci. 4, 303–313 (1975) 4. Suppes, P.C., Fletcher, J.D., Zanotti, M.: Models of individual trajectories in computerassisted instruction for deaf students. J. Educ. Psychol. 68, 117–127 (1976) 5. Gettinger, M.: Individual differences in time needed for learning: a review of literature. Educ. Psychol. 19(1), 15–29 (1984). https://doi.org/10.1080/00461528409529278 6. Carroll, J.B.: Problems of measurement related to the concept of learning for mastery. Educ. Horiz. 48, 71–80 (1970) 7. Tobias, S.: Extending Snow’s conceptions of aptitudes. Contemp. Psychol. 48, 277–279 (2003) 8. Bloom, B.: The 2 sigma problem: the search for methods of group instruction as effective as one-to-one tutoring. Educ. Res. 13(6), 4–16 (1984) 9. Cohen, P.A., Kulik, J.A., Kulik, C.-L.C.: Educational outcomes of tutoring: a meta-analysis of findings. Am. Educ. Res. J. 19, 237–248 (1982)


10. Evans, M., Michael, J.: One-On-One Tutoring by Humans and Computers. Lawrence Erlbaum, Mahwah (2006) 11. Graesser, A.C., D’Mello, S.K., Cade, W.: Instruction based on tutoring. In: Mayer, R.E., Alexander, P.A. (eds.) Handbook of Research on Learning and Instruction, pp. 408–426. Routledge Press, New York (2011) 12. Graesser, A.C., Moreno, K., Marineau, J., Adcock, A., Olney, A., Person, N.: The tutoring research group. AutoTutor improves deep learning of computer literacy: is it the dialog or the talking head? In: Hoppe, U., Verdejo, F., Kay, J. (eds.) Proceedings of Artificial Intelligence in Education, pp. 47–54. IOS Press, Amsterdam (2003) 13. Brown, J.S., Burton, R.R., DeKleer, J.: Pedagogical, natural language and knowledge engineering in SOPHIE I, II, and III. In: Sleeman, D., Brown, J.S. (eds.) Intelligent Tutoring Systems, pp. 227–282. Academic Press, New York (1982) 14. Feurzeig, W.: Computer systems for teaching complex concepts (BBN Report 1742). Bolt Beranek & Newman, Inc., Cambridge, MA (DTIC AD 684 831) (1969) 15. Fletcher, J.D., Tobias, S., Wisher, R.L.: Learning anytime, anywhere: advanced distributed learning and the changing face of education. Educ. Res. 36(2), 96–102 (2007) 16. USDoEd: Transforming American education learning powered by technology. National education technology plan 2010. United States Department of Education, Office of Educational Technology, Washington, DC (2010) 17. Keller, F.S.: Goodbye, teacher. J. Appl. Behav. Anal. 1, 79–89 (1968) 18. Crowder, N.A.: Automatic teaching by means of intrinsic programming. In: Galanter, E. (ed.) Automatic Teaching: The State of the Art, pp. 109–116. Wiley, New York (1959) 19. Skinner, B.F.: The science of learning and the art of teaching. Harv. Educ. Rev. 24, 86–97 (1954) 20. Kulik, J.A., Cohen, P.A., Ebeling, B.J.: Effectiveness of programmed instruction in higher education: a meta-analysis of findings. Educ. Eval. Policy Anal. 2, 51–64 (1980) 21. Kulik, C.-L.C., Schwalb, B.J., Kulik, J.A.: Programmed instruction in secondary education: a meta-analysis of evaluation findings. J. Educ. Res. 75, 133–138 (1982) 22. Barr, A., Feigenbaum, E.: Buggy. In: Handbook of Artificial Intelligence, vol. 2, pp. 279– 284. William Kaufmann, Los Altos (1982) 23. Fletcher, J.D., Rockway, M.R.: Computer-based training in the military. In: Ellis, J.A. (ed.) Military Contributions to Instructional Technology, pp. 171–222. Praeger Publishers, New York (1986) 24. Quillian, M.R.: The teachable language comprehender: a simulation program and theory of language. Commun. ACM 12(8), 459–475 (1969) 25. Carbonell, J.R.: AI in CAI: an artificial intelligence approach to computer-assisted instruction. IEEE Trans. Man Mach. Syst. 11, 190–202 (1970) 26. Fletcher, J.D.: Models of the learner in computer-assisted instruction. J. Comput. Based Instr. 3, 118–126 (1975) 27. Anderson, J.R., Boyle, C.F., Corbett, A., Lewis, M.W.: Cognitive modeling and intelligent tutoring. Artif. Intell. 42, 7–49 (1990) 28. Koedinger, K.R., Anderson, J.R., Hadley, W.H., Mark, M.A.: Intelligent tutoring goes to school in the big city. Int. J. Artif. Intell. Educ. 8, 30–43 (1997) 29. Shute, V., Ventura, M.: Stealth Assessment: Measuring and Supporting Learning in Video Games. MIT Press, Cambridge (2013) 30. Ellis, P.D.: The Essential Guide to Effect Sizes. Cambridge University Press, Cambridge (2010) 31. Grissom, R.J., Kim, J.J.: Effect Sizes for Research. Lawrence Erlbaum Associates, Mahwah (2005)


32. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Lawrence Erlbaum Associates, Hillsdale (1988) 33. Kulik, J.A.: Meta-analytic studies of findings on computer-based instruction. In: Baker, E.L., O’Neil Jr., H.F. (eds.) Technology Assessment in Education and Training, pp. 9–33. Erlbaum, Hillsdale (1994) 34. Vinsonhaler, J.F., Bass, R.K.: A summary of ten major studies on CAI drill and practice. Educ. Technol. 12, 29–32 (1972) 35. VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 46, 197–221 (2011). https://doi.org/10.1080/ 00461520.2011.611369 36. Kulik, J.A., Fletcher, J.D.: Effectiveness of intelligent tutoring systems: a meta-analytic review. Rev. Educ. Res. 86, 42–78 (2016) 37. Fletcher, J.D.: DARPA Education dominance program: April 2010 and November 2010 digital tutor assessments. (IDA Document NS D-4260). Institute for Defense Analyses, Alexandria, VA (2011) 38. Fletcher, J.D.: The value of digital tutoring and accelerated expertise for military veterans. Educ. Technol. Educ. Dev. 65, 679–698 (2017) 39. Fletcher, J.D., Morrison, J.E.: Accelerating development of expertise: a digital tutor for Navy technical training (IDA Technical report D-5358). Institute for Defense Analyses: Alexandria, VA (2014) 40. Csikszentmihalyi, M.: The Psychology of Optimal Experience. Harper Perennial, London (1990) 41. Salary.com. Network administrator II (2014). http://swz.salary.com/SalaryWizard/NetworkAdministrator-II-Salary-Details.aspx 42. Ericcson, K.A., Charness, N., Feltovich, P.J., Hoffman, R.R. (eds.): The Cambridge Handbook of Experts and Expert Performance. Cambridge University Press, New York (2006) 43. Hoffman, R.R., Ward, P., Feltovich, P.J., DiBello, L., Fiore, S.M., Andrews, D.H.: Accelerated Expertise: Training for High Proficiency in a Complex World. Psychology Press, New York (2014) 44. Gott, S.P., Lesgold, A.M.: Competence in the workplace: how cognitive performance models and situated instruction can accelerate skill acquisition. In: Glaser, R. (ed.) Advances in Instructional Psychology, pp. 239–327. Erlbaum, Hillsdale (2000) 45. Gott, S.P., Lesgold, A.M., Kane, R.S.: Tutoring for transfer of technical competence. In: Wilson, B.G. (ed.) Constructivist Learning Environments: Case Studies in Instructional Design, pp. 33–48. Educational Technology, Englewood Cliffs (1996) 46. Lesgold, A.S., Lajoie, M., Bunzo, E.G.: SHERLOCK: a coached practice environment for an electronics troubleshooting job. In: Larkin, J., Chabay, R., Scheftic, C. (eds.) Computer Assisted Instruction and Intelligent Tutoring Systems: Establishing Communication and Collaboration, pp. 201–238. Lawrence Erlbaum Associates, Hillsdale (1988) 47. Wetzel-Smith, S.K., Wulfeck, W.H.: Training incredibly complex tasks. In: Cohn, J.V., O’Connor, P.E. (eds.) Performance Enhancement in High-Risk Environments, pp. 74–89. Praeger, Westport (2010) 48. Chatham, R.E., Braddock, J.V.: Training superiority and training surprise. United States Department of Defense, Defense Science Board, Washington, DC (2001) 49. Cohn, J., Fletcher, J.D.: What is a pound of training worth? Frameworks and practical examples for assessing return on investment in training. In: Proceedings of the Interservice/Industry Training, Simulation, and Education Annual Conference. National Training and Simulation Association, Arlington, VA (2010)


50. Atkinson, R.C., Fletcher, J.D.: Teaching children to read with a computer. Read. Teach. 25, 319–327 (1972) 51. Suppes, P.C.: The place of theory in educational research. Educ. Res. 3, 3–10 (1974) 52. Suppes, P., Morningstar, M.: Computer-Assisted Instruction at Stanford 1966-68: Data, Models, and Evaluation of the Arithmetic Programs. Academic Press, New York (1972) 53. Jamison, D.T., Suppes, P., Wells, S.: The effectiveness of alternative instructional media: a survey. Rev. Educ. Res. 44, 1–67 (1974) 54. Niemiec, R., Walberg, H.J.: Comparative effects of computer-assisted instruction: a synthesis of reviews. J. Educ. Comput. Res. 3, 19–37 (1987) 55. Atkinson, R.C., Paulson, J.A.: An approach to the psychology of instruction. Psychol. Bull. 78, 49–61 (1972) 56. Chant, V.G., Atkinson, R.C.: Application of learning models and optimization theory to problems of instruction. In: Estes, W.K. (ed.) Handbook of Learning and Cognitive Processes, vol. 5. Erlbaum Associates, Hillsdale (1978) 57. Sottilare, R.A., Brawner, K.W., Goldberg, B.S., Holden, H.K.: The generalized intelligent framework for tutoring (GIFT). Concept Paper Released with GIFT Software Documentation. U.S. Army Research Laboratory – Human Research & Engineering Directorate (ARL-HRED), Orlando FL (2012) 58. Suppes, P.C.: Modern learning theory and the elementary-school curriculum. Am. Educ. Res. J. 1, 79–93 (1964) 59. Graesser, A.C., Person, N.K., Magliano, J.P.: Collaborative dialogue patterns in naturalistic one-on-one tutoring. Appl. Cogn. Psychol. 9, 495–522 (1995) 60. Koedinger, K.R., Corbett, A.T., Perfetti, C.: The knowledge-learning-instruction framework: bridging the science-practice chasm to enhance robust student learning. Cogn. Sci. 36, 757–798 (2012) 61. Person, N.K., Bautista, L., Graesser, A.C., Mathews, E.C., The Tutoring Research Group: Evaluating student learning gains in two versions of AutoTutor. In: Moore, J.D., Redfield, C. L., Johnson, W.L. (eds.) Artificial Intelligence in Education: AI-ED in the Wired and Wireless Future, pp. 286–293. IOS Press, Amsterdam (2001) 62. Nye, B.D., Pavik Jr., P.I., Windsor, A., Olney, A., Hajeer, M., Hu, X.: SKOPE-IT (shareable knowledge objects as portable intelligent tutors): overlaying natural language tutoring on an adaptive learning system for mathematics. Int. J. STEM Educ. (2018). https://doi.org/10. 1186/s40594-018-0109-4 63. Healy, A.F., Kole, J.A., Bourne, L.E.: Training principles to advance expertise. Front. Psychol. 5, 1–4 (2014) 64. Moreno, R., Mayer, R.: Interactive multimodal learning environments. Educ. Psychol. Rev. 19, 309–326 (2007). https://doi.org/10.1007/s10648-007-9047-2 65. Fletcher, J.D., Sottilare, R.A.: Shared mental models in support of adaptive instruction for teams using GIFT. Int. J. Artif. Intell. Educ. (2017). https://doi.org/10.1007/s40593-0170147-y 66. Jones, M.B.: Regressing group on individual effectiveness. Organ. Behav. Hum. Perform. 11, 426–451 (1974) 67. Sottilare, R.A., Ragusa, C., Hoffman, M., Goldberg, B.: Characterizing an adaptive tutoring learning effect chain for individual and team tutoring. In: Proceedings of the Interservice/Industry Training Simulation and Education Conference, Orlando, FL (2013)

Conversational AIS as the Cornerstone of Hybrid Tutors

Andrew J. Hampton and Lijia Wang

University of Memphis, Memphis, TN 38152, USA
[email protected]

Abstract. This paper describes the benefits of artificially intelligent conversational exchanges as they apply to multi-level adaptivity in learning technology. Adaptive instructional systems (AISs) encompass a great breadth of pedagogical techniques and approaches, often targeting the same domain. This suggests the utility of combining individual systems that share concepts and content but not form or presentation. Integration of multiple approaches within a unified system presents unique opportunities and accompanying challenges, notably, the need for a new level of adaptivity. Conventional AISs may adapt to learners within problems or between problems, but the hybrid system requires recommendations at the level of constituent systems as well. I describe the creation of a hybrid tutor, called ElectronixTutor, with a conversational AIS as its cornerstone learning resource. Conversational exchanges, when properly constructed and delivered, offer substantial diagnostic power by probing depth, breadth, and fluency of learner understanding, while mapping explicitly onto knowledge components that standardize learner modeling across resources. Open-ended interactions can also reveal psychological characteristics that have bearing on learning, such as verbal fluency and grit.

Keywords: Intelligent recommender · Knowledge components · Trialogue

1 Introduction

Adaptive instructional systems (AISs) guide learning experiences in computer environments, adjusting instruction depending on individual learner differences (goals, needs, and preferences) filtered through the context of the domain [1]. These systems leverage principles of learning to help learners master knowledge and skills [2]. Typically, this type of instruction focuses on one student at a time to be sensitive to individual differences relevant to the topic at hand or to instruction generally. It is also possible to have an automated tutor or mentor interact with small teams of learners in collaborative learning and problem solving environments. Adaptivity in learning technologies dramatically improves upon the paradigm of conventional computer-based training (CBT). Instruction via CBT often operates by the simple heuristic that a correct response on a given question calls for a more difficult question to follow, and the inverse. Though this and related techniques can reduce assessment time [3], the adaptivity remains coarse, with no more than basic learning principles applied.


In this paradigm, learners may read static material, take an adaptive assessment, receive feedback on performance, and repeat the process until reaching a threshold of performance. AISs, on the other hand, use fine-grained adaptivity, often within individual problems (step-adaptive), and structure interactions based on empirically validated pedagogical techniques. Subsequent learning material is typically recommended based upon a holistic assessment of mastery or other individual variables (macro-adaptive), rather than the immediately preceding interaction alone. Adaptivity both within and between problems forms two distinct "loops" of adaptivity [4] that can be used in conjunction for more intelligent interactions (see Fig. 1). AISs can also track detailed learner characteristics (e.g., knowledge, skills, and other psychological attributes) and leverage artificial intelligence, informed by cognitive science, to create computational models describing learners and recommending next steps [4–6].
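A schematic rendering of those two loops (a sketch under our own assumptions; the function names, update rule, and thresholds are hypothetical, not drawn from any particular AIS):

def outer_loop(learner_model, tasks):
    # Between problems: pick the task touching the learner's least-mastered skills.
    return min(tasks, key=lambda t: min(learner_model[s] for s in t["skills"]))

def inner_loop(task, learner_model, get_attempt, assess, give_feedback):
    # Within a problem: assess each step and adapt feedback immediately.
    for step in task["steps"]:
        correct = assess(step, get_attempt(step))
        give_feedback(step, correct)
        for skill in step["skills"]:
            # Simple running estimate; real AISs use richer models such as
            # Bayesian knowledge tracing.
            learner_model[skill] = 0.8 * learner_model[skill] + 0.2 * float(correct)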

Fig. 1. Adaptivity at the step and problem levels that characterize AISs.

1.1 Successes of AISs

A substantial array of advanced AIS environments has sprung from the application of pedagogical and technological advancements to the field of CBT. Many of these have matured to the point of demonstrating significant learning gains over conventional instruction methods. These include efforts in concrete domains such as algebra and geometry, including Cognitive Tutors [7–9] and ALEKS [10]. Systems targeting electronics (SHERLOCK [11], BEETLE-II [12]) and digital information technology (Digital Tutor [13]) have also proven effective. Meta-analyses of intelligent tutoring systems (a subset of AISs leveraging artificial intelligence) have shown benefits compared to traditional classroom instruction or directed reading. The effects found varied substantially, from a marginal d = 0.05 [14, 15] to a tremendous d = 1.08 [16], with most converging between d = 0.40 and d = 0.80 [17, 18]. This compares favorably to human tutoring, which can vary between roughly those same bookends depending on the expertise and effectiveness of the individual tutor [18]. Direct comparisons between human tutors and intelligent tutoring systems tend to show little or no statistical difference in learning outcomes [18, 19].


1.2 Conversational AIS

AISs that leverage verbal interaction can encourage more natural engagement with the subject matter [20, 21] and provide an avenue for addressing abstract domains. Explaining concepts extemporaneously urges learners to reflect and reason without relying on rigid repetition, formulas, or pre-defined answer options. Allowing learners to ask questions in return, through mixed-initiative design, can focus conversational exchanges on areas or concepts with which the learner struggles or in which she has particular interest. Paralinguistic cues like pointing or facial expressions, instantiated with the use of embodied conversational agents, can increase realism and reinforce information by leveraging appropriate (and organically processed) gestures. Successful conversational AIS resources include AutoTutor [19, 21], GuruTutor [22], Betty's Brain [23], Coach Mike [24], Crystal Island [25], and the Tactical Language and Culture System [26]. Each of these systems has demonstrated learning advantages relative to conventional instructional techniques. Conversational AISs have collectively addressed topics including computer literacy, physics, biology, and scientific reasoning. This paper will focus on an application of AutoTutor. Some versions of AutoTutor include multiple conversational agents, which afford richer interaction dynamics and greater interpersonal involvement [20, 27, 28]. In some iterations of AutoTutor, both a tutor agent and a peer agent engage in "trialogues" with the learner. This offers alternatives for presenting questions. The tutor agent can pose questions in a conventional instructor role, where the learner assumes the question-asker has the answer. Alternatively, the peer agent can pose the question from a position of relative weakness, asking the human learner for help. This tactic can reduce pressure to provide full and complete information, or allow the human learner to play the role of a tutor. Also, novel formulations of answers may not be readily identifiable as correct or incorrect based on the pre-defined parameters constructed by experts. Having the peer agent rephrase the best match and submit it as his or her own answer can keep the tutor agent from erroneously evaluating such learner responses as wrong. This avoids frustration and gives learners a chance to model more conventional expressions of the content.

1.3 Barriers to Adoption

Despite the learning gains demonstrated by the AISs above (both conversational and not), several factors have kept them from gaining ubiquity among the learning community. Primarily, they require a considerable investment of resources and personnel, including collaboration among many different specialties: experts in the domain of interest, computer science, and pedagogy. Additionally, the representation of information (e.g., graphical, interactive, conversational) likely requires specialists to synthesize expert input into a distinct approach. Smaller systems sometimes streamline this process by restricting the breadth of content available or the depth of representation. In either case, creating broad swaths of content proves difficult.


This restriction of content naturally impedes widespread adoption. Any learning system entails some start-up cost as individuals learn how to interact with the system comfortably. Even exceptionally efficient systems would struggle to convince learners that the initial effort was justified for a mere subset of their learning goals. Evaluation then tends to focus on experimental groups to whom the AIS was assigned.

1.4 Hybrid Tutors

Although individual AIS content may suffer from lack of depth, breadth, or both, several independently produced systems may address the same domain while representing slightly different (or complementary) content and engaging in distinct pedagogical strategies. This array of technologies presents an opportunity to link existing learning resources centered around common subject matter. A confederated approach built into a common platform gives learners a single access point for a significantly enhanced range of content, presented in myriad ways. Such a system can include both adaptive and conventional learning resources, providing a full range of approaches to the material.

Through integration, AISs may overcome the single greatest hurdle to widespread adoption. This, in turn, could afford in-kind expansion of the individual learning resources based on the increased base of learners and subsequent opportunities for data-driven improvements. Comprehensive coverage of the target topic also provides invaluable learner data to human instructors at both the individual and classroom level. This makes a combined system ideal for classroom integration, where instructors can rapidly adjust lesson plans in response to automatically graded out-of-class efforts. A confederated system designed for use in conjunction with a human instructor is known as a hybrid tutor [29].

1.5 Meta Loop Adaptivity

Integrating systems in a common platform creates new challenges to overcome. Learners may become overwhelmed by the number of options, both in learning resources and in individual exercises or readings. Even without the addition of multiple representations, learners naturally lack sufficient knowledge of the material, and of their own understanding of it, to navigate effectively. Human instructors understand progression through topics, but likely not the affordances and constraints of the constituent learning resources. Further, relying on them alone would not take full advantage of the capacity for learning technology to adapt to the individual. We must then construct a new "meta" loop of adaptivity to complement the existing within- and between-problem loops. While maintaining those two, a hybrid tutor must adapt to individual performance, history, and psychological characteristics at the level of the learning resource. Doing so requires two innovations: a way to translate progress among the several learning resources and a method of multifaceted assessment with which to begin recommendations. Conversational interactions have unique advantages in providing the latter.


Fig. 2. Four-loop adaptivity, with AIS two-loop adaptivity plus meta-adaptive resource selection and micro-adaptive interaction style selection, utilized in AutoTutor.

2 ElectronixTutor

The potential benefits of confederated systems detailed above drove the creation of ElectronixTutor [29]. ElectronixTutor epitomizes the hybrid tutor approach, intended not to bypass classroom instruction but to supplement it. The system addresses the topic of electricity and electrical engineering by leveraging diverse learning resources (both adaptive and static) in a unified platform. This platform (see Fig. 3) organizes content by topic and resource, allowing learners to quickly find and engage with relevant problems or readings. Once selected, all resources appear in the activity window that occupies most of the screen, so the primary interface does not change for the learner. These resources range from easy to difficult and employ a range of interaction types. As an introduction, Topic Summaries provide a brief (1–2 page) overview of each topic with hyperlinks to external resources like Wikipedia or YouTube. Built originally for the Navy, ElectronixTutor includes a massive collection of static digital files from the Navy Electricity and Electronics Training Series; these comprehensive files are indexed by topic to ensure learners find the relevant sections quickly. Derived from a conversational AIS, BEETLE-II multiple-choice questions offer remedial exposure to important concepts represented in circuit diagrams. Dragoon circuit diagram questions, by far the most difficult resource, require learners to understand circuit components, relationships, variables, and parameters holistically. For mathematical rehearsal, LearnForm gives applied problems that break down into constituent steps, providing feedback along the way. Finally, AutoTutor provides conversational adaptivity for deep learning. I will discuss its functionality in depth below.

Fig. 3. The ElectronixTutor user interface, here showing an AutoTutor question with Point & Query engaged.

2.1 Knowledge Components

Critically, all individual learning resources contribute to a unified learner record store. This store translates progress among the many resources on several discrete levels. To accomplish this interfacing endeavor, ElectronixTutor conceptualizes learning as discrete knowledge components [30]. Knowledge components in ElectronixTutor (see Fig. 4) divide the target domain of electrical engineering into fifteen topics, each crossed with device or circuit and with structure, behavior, function, or parameter, for a total of 120 knowledge components. Experts in electrical engineering use this structure to evaluate every problem and learning resource, indicating which knowledge component or components it addresses. That way, learner interactions with the problems produce a score from zero to one on each of the associated knowledge components, which updates a comprehensive learner model appropriately.

Fig. 4. Knowledge component mapping in ElectronixTutor
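As a concrete illustration of this bookkeeping, the sketch below shows one way such a knowledge-component record store could be represented and updated in Python. The topic names, the blending rule, and the weight are hypothetical stand-ins, not ElectronixTutor's actual implementation.

from dataclasses import dataclass, field
from itertools import product

TOPICS = [f"topic_{i:02d}" for i in range(1, 16)]        # 15 placeholder topic names
LEVELS = ["device", "circuit"]
ASPECTS = ["structure", "behavior", "function", "parameter"]
ALL_KCS = [f"{t}/{l}/{a}" for t, l, a in product(TOPICS, LEVELS, ASPECTS)]  # 120 knowledge components

@dataclass
class LearnerRecordStore:
    # running mastery estimate in [0, 1] for each knowledge component
    mastery: dict = field(default_factory=lambda: {kc: 0.0 for kc in ALL_KCS})

    def record(self, kc_scores: dict, weight: float = 0.3) -> None:
        """Blend a new 0-1 score reported by any learning resource into the estimate."""
        for kc, score in kc_scores.items():
            self.mastery[kc] = (1 - weight) * self.mastery[kc] + weight * score

store = LearnerRecordStore()
# e.g., one AutoTutor problem tagged with two knowledge components returns partial credit
store.record({"topic_03/circuit/behavior": 0.6, "topic_03/circuit/function": 0.8})

Because every resource reports on the same 120 components, a reading, a Dragoon diagram, and an AutoTutor conversation can all feed the same mastery estimates.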

2.2 Recommendations

Knowledge components and the associated learning record store enable the creation of an intelligent recommender system. The availability of copious amounts of information is bound to overwhelm if it is not presented in an order that makes sense to the individual learner. Recommendations must account for a learner's historical performance on a range of knowledge components, performance within learning resources, time since last interacting with content, and preferably psychological characteristics, such as distinguishing between motivated and unmotivated learners. ElectronixTutor offers two distinct methods of recommendation generation. One, instantiated in the Topic of the Day functionality, allows instructors to pick a topic either manually or through a calendar function; this restricts recommendations to content within that topic. The other, labeled Recommendations in Fig. 3, provides three options across the entire system. While the mechanisms for each differ slightly, both benefit from learners engaging with content that provides detailed feedback. Conversational AISs (AutoTutor in ElectronixTutor) surpass the other available resources in this respect for several reasons.
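One way to combine those factors is a simple weighted priority score over candidate items, sketched below. The weights, field names, and the motivated/unmotivated switch are illustrative assumptions rather than the recommender actually deployed in ElectronixTutor.

import time

def recommend(candidates, mastery, last_seen, motivated, k=3, now=None):
    """Rank candidate items (each tagged with KCs and a 0-1 difficulty) and return the top k."""
    now = now if now is not None else time.time()

    def priority(item):
        gap = 1.0 - min(mastery.get(kc, 0.0) for kc in item["kcs"])   # target the weakest tagged KC
        days_idle = (now - last_seen.get(item["id"], now)) / 86400.0
        staleness = min(days_idle / 7.0, 1.0)                         # favor content untouched this week
        stretch = item["difficulty"] if motivated else 1.0 - item["difficulty"]
        return 0.5 * gap + 0.3 * staleness + 0.2 * stretch

    return sorted(candidates, key=priority, reverse=True)[:k]

In this sketch a motivated learner is nudged toward harder items while an unmotivated one is steered toward review, echoing the distinction drawn above.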

2.3 AutoTutor

AutoTutor [19, 21] focuses on teaching conceptual understanding and encouraging deep learning. It accomplishes this through open conversational exchange in a mixed-initiative format. The availability of both a tutor agent and a peer agent opens the door to numerous conversational and pedagogical scenarios and techniques. These "trialogues" begin by introducing the topic and directing attention to a relevant graphical representation, calling the learner by name to encourage engagement. Figure 5 demonstrates how AutoTutor ("Conversational Reasoning") fits within the Topic of the Day recommendation decision tree. Here, the topic summary orients learners to the topic via a brief, static introduction. This avoids asking questions on topics learners have not considered recently, which would catch them off guard. From there, AutoTutor forms the hub of diagnosis and subsequent recommendation. It plays a similar role in populating the Recommendations section. It holds those positions for several reasons.

Each full problem in AutoTutor has a main question with several components of a full correct answer. The conversational engine can extract partial, as well as incorrect, responses from natural language input. Assuming learners do not offer a full and correct answer upon first prompting, either the tutor agent or the peer agent will follow up with hints, prompts, or pumps. This gives learners every opportunity to express all of the information (or misconceptions) that they hold about the content at hand. In so doing, AutoTutor gathers far more than binary correct/incorrect responses. Learners who require multiple iterations to supply accurate information have a demonstrably more tenuous grasp of the concepts and their application. This provides invaluable diagnostic data regarding depth of understanding for the creation of intelligent recommendations.
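The sketch below illustrates the kind of bookkeeping such a dialogue requires, using simple keyword overlap as a stand-in for AutoTutor's actual semantic evaluation of natural language input; the threshold and escalation order are assumptions for illustration only.

def coverage(answer: str, expectation: str) -> float:
    """Crude proxy for semantic match: fraction of expectation words present in the answer."""
    a, e = set(answer.lower().split()), set(expectation.lower().split())
    return len(a & e) / len(e) if e else 0.0

def next_move(answer, expectations, covered, attempt, threshold=0.6):
    """Update which answer components are covered and choose the next dialogue move."""
    for i, expectation in enumerate(expectations):
        if coverage(answer, expectation) >= threshold:
            covered.add(i)
    missing = [i for i in range(len(expectations)) if i not in covered]
    if not missing:
        return "summarize", None                      # every component has been expressed
    # escalate across turns: open-ended pump, then hint, then a pointed prompt
    move = ("pump", "hint", "prompt")[min(attempt, 2)]
    return move, missing[0]                           # address the first uncovered component

Counting how many turns a learner needs before the missing list empties yields exactly the graded diagnostic signal described above.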


Within AutoTutor, Point & Query [29] lowers the barrier to interacting with content by offering a simple mouse-over interaction with circuit diagrams. In conventional learning environments, students may fail to ask questions either because they do not know what the appropriate questions are or because they are intimidated by the scope of the content presented. Point & Query removes that barrier by providing both the questions and the answers, orienting learners and providing baseline information. This greatly increases the absolute number of interactions with the learning program and subsequently encourages more engagement by reinforcing question-asking behavior with immediate answers. Further, main questions typically touch on several knowledge components simultaneously. AutoTutor's focus on concepts naturally engenders questions about the relationships between multiple knowledge components. This both contributes to functional understanding and provides a breadth of information to the recommender systems.

Presentation of multiple knowledge components in a single problem is not unique to AutoTutor. Notably, Dragoon requires extensive conceptual and technical knowledge of circuits to complete. However, finishing a problem in Dragoon is difficult without a significant amount of prior expertise in the content, and Dragoon struggles to walk learners through problems unless they are already at a high level. AutoTutor, by contrast, provides ample lifelines for struggling learners. Point & Query establishes a baseline of information needed to answer the question; then hints, prompts, and pumps provide staged interventions, eliciting knowledge and encouraging reasoning to give learners every opportunity to complete the problem. Open-ended interactions also have the benefit of providing opportunities to assess more psychological characteristics than rigidly defined learning resources do. Willingness to continue engaging with tutor and peer agents through more than a few conversational turns indicates that a learner is motivated to master the content. These learners may prefer to push the envelope rather than refresh past material. Long answers indicate verbal fluency, perhaps a suggestion that the learner will respond well to more conversational instruction moving forward. Evaluation along these lines can determine not just what problems should come next (the macro-adaptive step), but also how to respond to learner input within a problem (micro-adaptivity; see Fig. 2).

Fig. 5. A sample flow chart for ElectronixTutor intelligent recommendations within a topic designated by human instructors. Conversational reasoning (AutoTutor) plays a central role in diagnosis and subsequent recommendation.

3 Conclusions

AISs represent a substantial advancement over conventional computer-based training. Providing adaptivity within or between problems allows learners to have more personalized learning experiences. However, the trade-off between depth of instruction and the investment needed to create it means that most individual systems lack the breadth to engender widespread adoption. Integration of these systems into a common platform serves two important purposes. First, it constitutes a method of improving individual learning in AISs by making multiple representations of content available and broadening the total area covered. Second, it represents a practical method of encouraging more learners and educational institutions to adopt such systems, providing data for improvement and incentive to continue creating.

However, fundamental challenges must be overcome to realize this idea. First, some method of translating progress and collecting learner characteristics must be implemented. Second, the system must include intelligent recommendations for how to proceed, thus avoiding overwhelming learners with too many unorganized options. ElectronixTutor approaches the first problem through knowledge components that discretize the domain relative to individual learning resources. The second problem requires a learning resource with exceptional diagnostic power to offer a preliminary assessment. Conversational reasoning questions serve this function, in this case through the AutoTutor AIS. These questions probe depth of understanding on a conceptual level in a way that few AISs can. Further, the inclusion of multiple knowledge components in a single interaction allows for considerable breadth. An array of remedial affordances allows fine-grained evaluation of knowledge and encourages completion despite potential difficulties. These embedded resources include Point & Query functionality to encourage initial interaction with content, along with a series of hints, prompts, and pumps to get learners across the finish line. Finally, the open-ended interaction provides an opportunity to glean psychological characteristics that can inform recommendations beyond what is possible with performance history alone. For these reasons (depth of assessment, breadth of content covered, availability of remedial steps, and potential for psychological assessment), conversational AISs form the cornerstone of intelligent recommendations in hybrid tutors.

References 1. Sottilare, R., Brawner, K.: Exploring standardization opportunities by examining interaction between common adaptive instructional system components. In: Proceedings of the First Adaptive Instructional Systems (AIS) Standards Workshop, Orlando, Florida (2018) 2. Graesser, A.C., Hu, X., Nye, B., Sottilare, R.: Intelligent tutoring systems, serious games, and the Generalized Intelligent Framework for Tutoring (GIFT). In: Using Games and Simulation for Teaching and Assessment, pp. 58–79 (2016) 3. Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J.: Computerized Adaptive Testing: A Primer. Routledge, New York (2000) 4. VanLehn, K.: The Behavior of tutoring systems. Int. J. Artif. Intell. Educ. 16, 227–265 (2006) 5. Sottilare, R., Graesser, A.C., Hu, X., Holden, H. (eds.).: Design Recommendations for Intelligent Tutoring Systems. Learner Modeling, vol. 1. U.S. Army Research Laboratory, Orlando (2013) 6. Woolf, B.P.: Building Intelligent Interactive Tutors. Morgan Kaufmann Publishers, Burlington (2009) 7. Aleven, V., Mclaren, B.M., Sewall, J., Koedinger, K.R.: A New paradigm for intelligent tutoring systems: example-tracing tutors. Int. J. Artif. Intell. Educ. 19(2), 105–154 (2009) 8. Koedinger, K.R., Anderson, J.R., Hadley, W.H., Mark, M.: Intelligent tutoring goes to school in the big city. Int. J. Artif. Intell. Educ. 8, 30–43 (1997) 9. Ritter, S., Anderson, J.R., Koedinger, K.R., Corbett, A.: Cognitive tutor: applied research in mathematics education. Psychon. Bull. Rev. 14, 249–255 (2007) 10. Falmagne, J., Albert, D., Doble, C., Eppstein, D., Hu, X.: Knowledge Spaces: Applications in Education. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35329-1 11. Lesgold, A., Lajoie, S.P., Bunzo, M., Eggan, G.: SHERLOCK: a coached practice environment for an electronics trouble-shooting job. In: Larkin, J.H., Chabay, R.W. (eds.) Computer Assisted Instruction and Intelligent Tutoring Systems: Shared Goals and Complementary Approaches, pp. 201–238. Erlbaum, Hillsdale (1992) 12. Dzikovska, M., Steinhauser, N., Farrow, E., Moore, J., Campbell, G.: BEETLE II: deep natural language understanding and automatic feedback generation for intelligent tutoring in basic electricity and electronics. Int. J. Artif. Intell. Educ. 24, 284–332 (2014) 13. Fletcher, J.D., Morrison, J.E.: DARPA Digital Tutor: Assessment data (IDA Document D4686). Institute for Defense Analyses, Alexandria, VA (2012) 14. Dynarsky, M., et al.: Effectiveness of reading and mathematics software products: Findings from the first student cohort. U.S. Department of Education, Institute of Education Sciences, Washington, DC (2007) 15. Steenbergen-Hu, S., Cooper, H.: A meta-analysis of the effectiveness of intelligent tutoring systems on college students’ academic learning. J. Educ. Psychol. 106, 331–347 (2013) 16. Dodds, P.V.W., Fletcher, J.D.: Opportunities for new “smart” learning environments enabled by next generation web capabilities. J. Educ. Multimed. Hypermedia 13, 391–404 (2004) 17. Kulik, J.A., Fletcher, J.D.: Effectiveness of intelligent tutoring systems: a meta-analytic review. Rev. Educ. Res. 85, 171–204 (2015)


18. VanLehn, K.: The relative effectiveness of human tutoring, intelligent tutoring systems and other tutoring systems. Educ. Psychol. 46, 197–221 (2011) 19. Graesser, A.C.: Conversations with AutoTutor help students learn. Int. J. Artif. Intell. Educ. 26, 124–132 (2016) 20. Johnson, W.L., Lester, J.C.: Face-to-face interaction with pedagogical agents, Twenty years later. Int. J. Artif. Intell. Educ. 26(1), 25–36 (2016) 21. Nye, B.D., Graesser, A.C., Hu, X.: AutoTutor and family: a review of 17 years of natural language tutoring. Int. J. Artif. Intell. Educ. 24(4), 427–469 (2014) 22. Olney, A.M., Person, N.K., Graesser, A.C.: Guru: designing a conversational expert intelligent tutoring system. In: Cross-Disciplinary Advances in Applied Natural Language Processing: Issues and Approaches, pp. 156–171. IGI Global (2012) 23. Biswas, G., Jeong, H., Kinnebrew, J., Sulcer, B., Roscoe, R.: Measuring self-regulated learning skills through social interactions in a teachable agent environment. Res. Pract. Technol. Enhanc. Learn. 5, 123–152 (2010) 24. Lane, H.C., Noren, D., Auerbach, D., Birch, M., Swartout, W.: Intelligent tutoring goes to the museum in the big City: a pedagogical agent for informal science education. In: Biswas, G., Bull, S., Kay, J., Mitrovic, A. (eds.) AIED 2011. LNCS (LNAI), vol. 6738, pp. 155–162. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21869-9_22 25. Rowe, J.P., Shores, L.R., Mott, B.W., Lester, J.C.: Integrating learning, problem solving, and engagement in narrative-centered learning environments. Int. J. Artif. Intell. Educ. 21, 115–133 (2011) 26. Johnson, L.W., Valente, A.: Tactical language and culture training systems: using artificial intelligence to teach Foreign languages and cultures. AI Mag. 30, 72–83 (2009) 27. Craig, S.D., Twyford, J., Irigoyen, N., Zipp, S.A.: A Test of spatial contiguity for virtual human’s gestures in multimedia learning environments. J. Educ. Comput. Res. 53(1), 3–14 (2015) 28. Graesser, A.C., Li, H., Forsyth, C.: Learning by communicating in natural language with conversational agents. Curr. Dir. Psychol. Sci. 23, 374–380 (2014) 29. Graesser, A.C., et al.: ElectronixTutor: an adaptive learning platform with multiple resources. In: Proceedings of the Interservice/Industry Training, Simulation, & Education Conference (I/ITSEC 2018), Orlando, FL (2018) 30. Tackett, A.C., et al.: Knowledge components as a unifying standard of intelligent tutoring systems. In: Exploring Opportunities to Standardize Adaptive Instructional Systems (AISs) Workshop of the 19th International Conference of the Artificial Intelligence in Education (AIED) Conference, London, UK (2018)

Ms. An (Meeting Students' Academic Needs): Engaging Students in Math Education

Karina R. Liles
Claflin University, Orangeburg, SC 29115, USA
[email protected]

Abstract. This research presents a new, socially adaptive robot tutor, Ms. An (Meeting Students' Academic Needs). The goal of this research was to use a decision tree model to develop a socially adaptive robot tutor that predicted and responded to student emotion and performance to actively engage students in mathematics education. The novelty of this multi-disciplinary project is the combination of the fields of HRI, AI, and education. In this study we (1) implemented a decision tree model to classify student emotion and performance for use in adaptive robot tutoring, an approach not previously applied to educational robotics; (2) presented an intuitive interface for seamless robot operation by novice users; and (3) applied direct human teaching methods (guided practice and progress monitoring) for a robot tutor to engage students in mathematics education. Twenty 4th and 5th grade students in rural South Carolina participated in a between-subjects study with two conditions: (A) with a non-adaptive robot (control group); and (B) with a socially adaptive robot (adaptive group). Students engaged in two one-on-one tutoring sessions to practice multiplication per the South Carolina 4th and 5th grade mathematics state standards. Although our decision tree models were not very predictive, the results gave answers to our current questions and clarity for future directions. Our adaptive strategies to engage students academically were effective. Further, all students enjoyed working with the robot and we did not see a difference in emotional engagement across the two groups.

Keywords: Education · Social robotics · Human robot interaction

1 Introduction

"The robot would be unable to understand children because it would lack reasoning skills that require cognitive, social, and emotional intelligences. Teaching requires that teachers understand what students already know and use that to help them make new connections. It also requires a relationship between student and teacher, one that allows and encourages risk-taking. A teacher's job is to balance a push for new knowledge and a stay for students to gain mastery. This takes a lot of intuition and personal judgment." -Educator

The above quote was taken from a survey we conducted on educators' opinions of a robot teaching assistant and mirrors the thoughts of many of the respondents [1]. To address this concern, it is crucial to (1) understand the state of mathematics education in the US, particularly for at-risk youth; (2) assess the potential of Human-Robot Interaction (HRI) for education; and (3) consider the Artificial Intelligence (AI) needed to develop a socially adaptive robot tutor. This three-pronged approach (education, HRI, and AI, respectively) lays the foundation of this research.

1.1 Education

Mathematics is among the core academic subjects identified by the US Department of Education [2]. Math competence in early education leads to career and college readiness, as it prepares students for undergraduate courses in college [3] and plays a critical part in the competency of workers in the technical workforce and in the nation's economic development [3]. Although math proficiency is extremely important, many students are not excelling in the field [4].

Tutoring. Tutoring is one approach to help students perform better in mathematics, as it is often used to assist students who may show weaknesses in academic areas. Tutoring is a supplemental aid in the learning process that can further enhance a student's academic ability [5]. Benjamin Bloom found that students who received one-on-one tutoring outperformed students who received traditional classroom instruction by two standard deviations (the two-sigma problem) [6]. A tutoring interaction is comprised of an academic component and a social component [7, 8]. Academically, tutors provide immediate and specific feedback. Socially, tutors provide positive reinforcement and guidance [8]. Together, these components are critical for success in tutoring [5]. Further, this academic and social interaction fosters student engagement [5].

Engagement. In education, student engagement influences student motivation and progress in learning. The term student engagement encompasses the student's attention, curiosity, interest, optimism, and passion when learning. There are many facets of engagement as it relates to education, including intellectual engagement and emotional engagement [9]. Intellectual engagement focuses on a student's cognitive state during learning [9]. Teaching strategies are often employed for the maximum benefit of intellectual engagement; two effective techniques that are encouraged are guided practice and progress monitoring [9]. Emotional engagement describes a student's affective state during learning. Student emotions impact cognition, and positive emotions stimulate attention [10]. It is important to organize emotions in a way that makes emotion groupings meaningful. Scherer et al. label emotions by valence and control/power [11].

1.2 Human-Robot Interaction

HRI is the field of study that involves understanding, designing, and evaluating robotic systems that communicate with humans [12]. HRI is applied in areas in which it is necessary for the robot to interact with the user [13, 14]. This is exactly the case in a tutoring scenario where a social interaction between the tutor (the robot) and the student is necessary for effective learning to occur [15].


Robots and Education. Several research studies have investigated the use of robots for education. These studies have shown that social robots are useful supplemental tools for education. Yun and colleagues documented a study where students were instructed via a robot tele-operated by a teacher, which led to learning gains for students [16]. Another study investigated the conceptual design of an educational robot that engaged students in a lesson about historical ancient cultures [17]. Though the robot's sociability has been shown to contribute to student achievement, little has been done to illustrate the specific aspects of the robot that facilitate learning and retention [18, 19].

Social robots have also been widely used to support mathematics education. Brown and Howard used verbal cues to minimize idle time and decrease boredom during a mathematics test [39]. In another study, researchers personalized an adaptive arithmetic game that students played with a robot [20]. Ramachandran and colleagues used a social robot that aided students while practicing fractions [21]. Socially responsive feedback (i.e., task-related feedback, motivational support) was effective in a robot learning companion that helped students practice mathematics problems [22].

Robots have also demonstrated positive trends in student perception and engagement [23]. One study documented how a robot's perceived sociability increased from the pre- to post-questionnaires during a mathematics tutoring session [24]. Howley et al. documented that students were more willing to ask the robot questions than a human tutor in most situations, due to varying perceptions of the robot's social role during a tutoring session [25]. Kanda et al. concluded that the social behavior of the robot aided in facilitating a better relationship with the student and increased the student's social acceptance of the robot during a mathematics lesson [26]. The implementation of adaptive robots is an important topic in HRI; to achieve it, AI can be applied to develop robots that adapt and respond to a student's needs.

1.3 Artificial Intelligence

AI is the field of study that involves synthesizing and analyzing computational agents that can act intelligently. An intelligent agent can make decisions about its actions based on factors such as goals and values, prior knowledge, observations, past experiences, and the environment [27]. An effective human tutor adapts to the student (tutee) by gathering information about the student (e.g., capabilities, motivations) and tailoring real-time instruction to meet the learning needs of the student [28]. This adaptability makes AI a promising approach to intelligent tutoring systems. Agents rely on an array of inputs, such as a student's prior knowledge, common student errors, or facial expressions, which can be used to conduct activities (e.g., assess student knowledge and provide relevant feedback). Figure 1 shows a sample agent system as a tutor. Previous work has focused on adaptive tutoring and the robot's [or computer's] response once information is inferred. In some cases, social responses are reactions to a student's state to aid in academic success [29–32].


Fig. 1. Agent system (tutor)

Decision Trees. A decision tree is a model used for classifying data and is one of the most effective methods for supervised classification learning. A tree is built from its training data, which it then uses to make classifications. The internal nodes of a decision tree represent its features, and its classes are represented by the leaves [33]. Figure 2 shows a sample decision tree that uses four predictors (outlook, temperature, humidity, and wind) to determine a decision (yes, no) on whether to play golf.

Fig. 2. Sample decision tree [34]
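A minimal worked example of the same idea, assuming scikit-learn is available; the handful of play-golf rows and the feature encoding below are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# outlook: 0 = sunny, 1 = overcast, 2 = rain; remaining features are yes/no flags
features = ["outlook", "hot", "high_humidity", "windy"]
X = [[0, 1, 1, 0], [0, 1, 1, 1], [1, 1, 1, 0], [2, 0, 1, 0],
     [2, 0, 0, 0], [2, 0, 0, 1], [1, 0, 0, 1], [0, 0, 1, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))   # the learned rules, printed as a tree
print(tree.predict([[1, 0, 0, 0]]))                # e.g., overcast, mild, normal humidity, calm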

1.4 Summary

Due to the need for student enrichment in mathematics and the benefits of using robots for education, socially adaptive robots are an ideal teaching tool for mathematics education. Social robots are not only capable of delivering mathematics content, but they are also capable of socially interacting with students to promote an enriching educational experience. However, how do we develop a socially adaptive robot with reasoning skills and an intuition about the student's emotional state?

2 Impetus of Research

2.1 Problem

Educators have expressed that, to best serve students, a robot tutor must possess reasoning skills and must be capable of having an intuition about the student's emotional state [1]. To date, there is a lack of literature that describes the implementation of a socially adaptive robot tutor that uses a decision tree model to predict student emotion and performance for practicing multiplication via effective teaching techniques (i.e., guided practice and progress monitoring).

2.2 Research Goal

The goal of this research was to use a decision tree model to develop a socially adaptive robot tutor that predicted and responded to student emotion and performance to actively engage students in mathematics education. To assess the research goal (i.e., the effectiveness of a robot's ability to educate and engage students), this study addressed the following research questions:

[Q1] How well can a decision tree model classify a student's emotion and performance?

[Q2] How well can a socially adaptive robot tutor engage 5th grade students to practice multiplication?
(a) How do students perform academically by studying with a socially adaptive robot tutor?
(b) How do students respond emotionally by studying with a socially adaptive robot tutor?

[Q3] What social perceptions do students have of a socially adaptive robot tutor while practicing multiplication?

To address these research questions, we conducted a study in which students interacted with a robot during multiple tutoring sessions. We recorded information (such as delay in answering) that was needed to help the robot make predictions about the student. We collected information about each student's mathematics performance before, during, and after the tutoring sessions, as well as information about each student's emotional states throughout the study. We also gathered information about the students' perceptions and opinions of the robot tutor.


3 Methodology

3.1 Platform

We used the NAO humanoid robot (see Fig. 3) as the robot tutor named Ms. An (Meeting Students’ Academic Needs). The NAO humanoid robot is an ideal platform for delivering education because of its multimodal capabilities such as speech and gesture. The NAO stands 58 cm tall. It has 25 degrees of freedom, 2 cameras, various touch sensors, and 4 microphones. The robot is also capable of voice and vision recognition.

Fig. 3. NAO humanoid robot [35]

3.2 Lesson

State Standards. The multiplication tutoring sessions covered problems that addressed the South Carolina state standards:

• (4th grade) 4.NSBT.5 Multiply up to a four-digit number by a one-digit number and multiply a two-digit number by a two-digit number using strategies based on place value and the properties of operations.
• (5th grade) 5.NSBT.5 Fluently multiply multi-digit whole numbers using strategies to include a standard algorithm [36].

Content. The content of the lessons spanned the different ways in which multiplication can be described: equal groups, area arrays, and comparison [37]. Students practiced multiplication with problems that included multiplying whole numbers of up to four digits by a one-digit number and multiplying two-digit numbers by two-digit numbers. To ensure a record of a holistic multiplication experience, students solved problems with different combinations of multiplication question and answer types. For example, Fig. 4 shows a session question that was given as a context question type with a pictorial (equal groups) answer type.

Fig. 4. Example session question

3.3 HRI Study Design

To analyze the effectiveness of the adaptive robot, we conducted a between-subjects study with two conditions: (A) with a non-adaptive robot (control group); and (B) with a socially adaptive robot (adaptive group). Table 1 shows a comparison of robot traits and behaviors for each condition.

Table 1. Adaptive robot versus non-adaptive robot

Non-adaptive robot (Control):
• Static emotional state (neutral)
• Asked multiplication questions [progress monitoring] without instructional support

Adaptive robot:
• Dynamic emotional state to match the student's emotional state (happy, angry, sad, surprised, neutral)
• Asked multiplication questions [progress monitoring] with instructional support [guided practice]

In the adaptive robot condition, Ms. An predicted the student's emotion and performance and proactively determined the next actions before presenting a question to be solved. For emotion classification, Ms. An used social responses that corresponded to each student's emotional state (happy, angry, sad, surprised, neutral) before asking a question. For performance, if Ms. An predicted that the student would answer the upcoming question correctly, it proceeded with progress monitoring; however, if Ms. An predicted that the student would answer the upcoming question incorrectly, the robot proceeded with guided practice. In the non-adaptive condition, Ms. An behaved in a neutral emotional state and completed only progress monitoring activities (i.e., asked mathematics questions to be solved without any intervention) regardless of the student's affective state or competency.

Student-Robot Interaction. In addition to gestures and movements, Ms. An communicated with the students verbally using speech and visually through the tablet (see Fig. 5). Ms. An performed actions such as reading each multiplication question, prompting the students at various points during the lesson, and giving feedback on answer choices (verbal communication). These actions corresponded with the question and activity display on the tablet (visual communication). Students could press buttons on the touch screen interface to select answer choices and enter values using a keypad. Ms. An received that data from the tablet and responded accordingly.
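The condition logic above amounts to a small per-question policy. The sketch below is a hypothetical rendering of it, with the robot, model, and question objects as stand-ins for the study's actual software.

def adaptive_turn(robot, student_features, emotion_model, performance_model, question):
    """One adaptive-condition turn: mirror the predicted emotion, then choose the teaching mode."""
    emotion = emotion_model.predict(student_features)       # happy / angry / sad / surprised / neutral
    robot.show_emotion(emotion)                              # social response matched to the student
    if performance_model.predict(student_features) == "incorrect":
        robot.guided_practice(question)                      # walk through the steps before asking
    else:
        robot.progress_monitoring(question)                  # present the question and record the answer

def control_turn(robot, question):
    """Non-adaptive condition: neutral affect and progress monitoring only."""
    robot.show_emotion("neutral")
    robot.progress_monitoring(question)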

Fig. 5. Student-robot communication diagram

Participants. Twenty 4th and 5th grade students were included in the study (10 males, 10 females), ranging from 9–12 years old (M = 9.95, SD = 0.84). Participants were recruited from Blenheim Elementary/Middle School in rural Blenheim, SC. Of those participants, 50% identified themselves as Black/African American; 30%, White/Caucasian; 5%, other; and 5% opted not to report race/ethnicity. Participants completed a Technology Experience Profile that measured their familiarity with and use of different technologies. While the students rated an overall familiarity with technology (M = 3.59, SD = 1.12), their experience with robots, specifically, was low (M = 2.95, SD = 1.28). The top technologies that the students reported using on at least a weekly basis (i.e., M = 4.0 or higher) were video games, tablet, smart board, smart phone, music player, and social media. The least used technologies (M = 3.0 or below) were webcam, electronic book reader, LCD projector, student response systems, robots, and a camera. See Fig. 5.

Eleven students were assigned to the control group and nine students were assigned to the adaptive group. To ensure the groups were equally split by student performance, we used the pre-test 3 scores to assign students to each condition. Students in the control group had a 32% (SD = 14) average score and students in the adaptive group had a 28% (SD = 9) average score.

3.4 Procedure

Students engaged in one-on-one tutoring sessions to practice multiplication per the South Carolina 4th and 5th grade mathematics state standards. Mainly due to time and resource constraints, most education interventions using robotics are short-term interventions (some being as short as one interaction) [38]. To have a longer intervention and multiple interactions, students worked with Ms. An for two sessions, one session a week for two weeks. Sessions lasted approximately 30–45 min. Prior to the study session, students completed a student demographic form, a technology experience profile, and a multiplication pre-test. Then each student was asked to sit in a small room, where they worked at a desk with the robot. The student began each session by answering the emotions questionnaire. Next, they interacted with the robot. For both the adaptive and non-adaptive groups, the robot acted as a tutor and completed progress monitoring activities. The robot asked students multiplication questions. Each question was displayed on the tablet, and students answered questions via the tablet interface. Contrary to the non-adaptive robot, the adaptive robot employed proactive behaviors and executed those behaviors when needed (see Sects. 3.2 and 3.3). Students completed the emotions questionnaire again halfway through the study session. Once the tutoring session was complete, students completed the emotions questionnaire a final time. After these two sessions were completed, students completed a final session on solving multiplication using the partial products technique. In this session, students began with guided practice and then concluded with progress monitoring. At the end of all three sessions, students completed a post-test, the RPI questionnaire, and an interview. Students were given a retention test after all students completed the study. Figure 6 details the study procedure for each student.

Fig. 6. Study procedure


4 Results

4.1 Data Analysis

Unless otherwise noted, alpha was set at .05 for all statistical tests. Due to the small sample size in each group, we report this data with guarded generalizations. We indicate all data that are statistically significant with an asterisk (*).

4.2 Decision Tree Model

Research question 1 (How well can a decision tree model classify a student's emotion and performance?) addressed the accuracy of a decision tree model. To evaluate the robot's ability to make classifications for emotion and performance and to better understand where misclassifications could occur, (1) for emotion, we compared the robot's prediction of each student's emotion to the student's self-reported emotions throughout the session, and (2) for performance, we compared the robot's prediction of each student's performance to the student's actual performance. We used data from a previous study to train our decision tree, a popular technique known as transfer learning. Transfer learning in artificial intelligence is a technique in which the data used as a training set to solve one problem is applied as a training set to solve a similar problem [39]. While using this technique is common, it may have contributed to the low prediction accuracy when applying both trained models to the new data. The comparisons for emotion classification are shown in the confusion matrix in Table 2. The values along the diagonal of the matrices are the success rates for predictions.

Table 2. Confusion matrix for each emotion

            Angry   Happy   Neutral   Sad    Surprised
Angry       0.00    0.00    0.04      0.00   0.00
Happy       0.00    0.00    0.51      0.40   0.00
Neutral     0.00    0.00    0.19      0.20   0.00
Sad         0.00    0.00    0.05      0.00   0.00
Surprised   0.00    0.00    0.21      0.40   0.00

The results show that the model is not accurate for each individual emotion (as expected from the results of the training set described in Sect. 3.2). Despite the students exhibiting other emotions, the robot only predicted neutral and sad emotions. Happy was most commonly classified as neutral (51%). Happy was also misclassified as sad, accounting for 40% of all sad classifications. Surprise was likewise misclassified as neutral (21%) and sad (40%). The comparisons for performance classification are shown in the confusion matrix in Table 3. The values along the diagonal of the matrix are the success rates for predictions.


Table 3. Confusion matrix for performance

            Correct   Incorrect
Correct     0.54      0.46
Incorrect   0.40      0.60

Incorrect performance was classified correctly at a higher rate than correct performance. However, the classifications were correct a little over half the time, which is only slightly better than choosing randomly.
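Rates of the kind reported in Tables 2 and 3 can be computed directly from the paired label lists; the sketch below assumes scikit-learn and uses invented labels. Passing normalize="pred" expresses each predicted class as proportions (the "out of all sad classifications" reading), while normalize="true" expresses each reported class instead.

from sklearn.metrics import confusion_matrix

reported  = ["happy", "happy", "neutral", "surprised", "sad", "happy", "neutral"]
predicted = ["neutral", "sad", "neutral", "sad", "sad", "neutral", "neutral"]
labels = ["angry", "happy", "neutral", "sad", "surprised"]

cm = confusion_matrix(reported, predicted, labels=labels, normalize="pred")
print(cm.round(2))   # rows: reported emotion, columns: predicted emotion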

4.3 Engagement

Research question 2 (How well can a socially adaptive robot tutor engage 5th grade students to practice multiplication?) emphasized student engagement. To consider two aspects of engagement, research question 2 was comprised of two sub-questions. To address question 2a, how students perform academically by studying with a socially adaptive robot tutor, we report average learning gains and percent correct by answer type. To address question 2b, how students respond emotionally by studying with a socially adaptive robot tutor, we report the frequency of emotions exhibited throughout the study sessions.

Learning Gains. The difference in pre-test and post-test scores is a measure of each participant's learning gain during the study. We also calculated the difference in session 1 and session 2 scores to measure each participant's learning gains. To allow for a reliable analysis for our between-subjects design, we calculated the normalized learning gain in each group [40].

Pre-/post-test 1. Pre-/post-test 1 was a test of students' ability to identify the different ways to represent multiplication problems. Figure 7 shows the normalized average learning gains for pre-/post-test 1 for each condition. We conducted Wilcoxon Signed Rank tests to compare pre- to post-test scores in each condition. The adaptive condition did show a statistically significant (z = −2.06, p < .05) improvement from pre- (M = 3.67, SD = 1.32) to post-test (M = 5.44, SD = 2.83) scores. Therefore, the adaptive robot did, in fact, promote learning gains in the students' ability to identify the different ways to represent multiplication problems. The control condition did not have a significant change (z = −0.239, p = .81) from pre- (M = 4.63, SD = 2.01) to post-test (M = 4.63, SD = 2.06) scores. Therefore, the control condition did not yield learning gains in this skill. A Mann-Whitney U test was used to compare the learning gains between conditions. While the adaptive group had higher learning gains from pre- to post-test 1 (M = 0.44, SD = 0.22) than the control group (M = −0.15, SD = 0.67), this difference between groups was not statistically significant (z = −1.62, p = .10). It is important to note that although this is a promising trend, the lack of a significant difference is likely due to the higher variance in the control group and to the small sample size.
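Assuming the commonly used normalized-gain definition (an assumption here, not a formula stated above), each learner's gain expresses the achieved improvement as a fraction of the improvement still available at pre-test:

\[
g = \frac{\text{post} - \text{pre}}{\text{max score} - \text{pre}}
\]

Averaging g within each condition yields the group values plotted in Fig. 7; a gain can be negative when a post-test score falls below the pre-test score.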


Fig. 7. Pre-/post-test 1 average learning gains

Pre-/post-test 2. Pre-/post-test 2 was a test of students' ability to correctly solve multiplication problems. Figure 8 shows the normalized average learning gains for pre-/post-test 2 for each condition. We conducted Wilcoxon Signed Rank tests to compare pre- to post-test scores in each condition. The adaptive condition did not show a statistically significant (z = −1.13, p = .26) improvement from pre- (M = 0.28, SD = 0.09) to post-test (M = 0.39, SD = 0.29) scores. The control condition also did not have a significant change (z = −1.66, p = .10) from pre- (M = 0.32, SD = 0.15) to post-test (M = 0.46, SD = 0.21) scores. Therefore, the control condition also did not yield learning gains in this skill.

Fig. 8. Pre-/post-test 2 average learning gains


A Mann-Whitney U test was used to compare the learning gains between conditions. There was not a decrease in learning gains for either group in test 2; thus, the adaptive session did not negatively impact the students. There was not a statistically significant difference (via Mann-Whitney U test) in learning gains between the two conditions for test 2 (z = −.34, p = .73).

Frequency of Emotions. We assumed that students were equally likely to select any of the 5 emotions (happy, angry, sad, surprised, neutral) and calculated a Pearson's Goodness of Fit Chi Square. The chi square statistics for both the control (X² = 37.77) and the adaptive (X² = 42.62) groups were significant (p < .001), suggesting that the reported emotions were not evenly distributed. Students reported happiness significantly more often than other emotions. Students were more likely to feel surprised in the control condition. This could be because they had less feedback/coaching on how they were doing. The emotions sadness and anger were not commonly reported.
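A minimal version of that goodness-of-fit test, assuming SciPy and invented emotion counts (chisquare defaults to a uniform expected distribution, matching the equal-likelihood assumption):

from scipy.stats import chisquare

observed = [2, 31, 14, 1, 6]          # counts of angry, happy, neutral, sad, surprised reports
stat, p = chisquare(observed)         # H0: all five emotions reported equally often
print(round(stat, 2), p)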

4.4 Robot Sociability

Research question 3 (What social perceptions do students have of a socially adaptive robot tutor while practicing multiplication?) focuses on students' perceptions of a robot tutor. To address research question 3, we used the results of the RPI questionnaire. We analyzed the RPI by each individual item. We then conducted Wilcoxon Signed Rank tests for a within-group comparison of each individual questionnaire item, comparing the mean to 3.00 (neutral). The significant questionnaire items are listed in Table 4. As depicted in this table, more items from the facilitated learning construct were statistically significant in the adaptive condition. The robot was interesting was statistically significant for both groups. More items from the credible construct were statistically significant in the control condition. The robot seemed knowledgeable and the robot seemed like a teacher were statistically significant for both groups. No items in the human-like construct were statistically significant. Lastly, more items from the engagement construct were statistically significant in the control condition. The robot was motivating was statistically significant for the adaptive group, which directly addressed the engaging persona factor (how well the agent motivated the student).

Table 4. Statistically significant RPI items by condition and agent factors

Control
• Facilitated Learning: Made multiplication interesting; Kept student's attention; Helped student focus on important information; Interesting
• Credible: Knowledgeable; Intelligent; Helpful; Seemed like a teacher
• Human-like: None
• Engagement: Expressive; Enthusiastic; Friendly

Adaptive
• Facilitated Learning: Made student think about multiplication more deeply; Encouraged students to think about what they were doing; Kept student's attention; Showed information effectively (p = .05); Helped student concentrate on lesson; Interesting; Enjoyable
• Credible: Knowledgeable (p = .05); Useful; Seemed like a teacher
• Human-like: None
• Engagement: Motivating

5 Results

This study investigated the use of a socially adaptive robot tutor to engage students in mathematics education. Often, it is difficult to get students to engage in mathematics education [41]. While technology is not a full solution, it can make significant contributions to better engage students in mathematics education [42]. This study was important because it offered strategies to better engage students (emotionally and academically) in mathematics education. Although our decision tree models were not very predictive, the results gave answers to our current questions and clarity for future directions. Our adaptive strategies to engage students academically were effective. All students enjoyed working with the robot, and we did not see a difference in emotional engagement across the two groups. Our adaptive strategies made students think more deeply about their work and focus more. This higher-order thinking is preferred in education, as it is a cognitive process that demonstrates deeper understanding of the academic material [43].

Not only does this study tell us more about education and AI, but it also tells us how to improve the methodology for educational HRI in rural areas. Novelty likely played an important role in this study on a rural population due to students' lack of exposure to robots. Future studies that include urban students may yield different results. This study offered insight for developing a socially adaptive robot tutor to engage students academically and emotionally while practicing multiplication. Results from this study will inform the human-robot interaction (HRI) and artificial intelligence (AI) communities on best practices and techniques within the scope of this work.


References 1. Liles, K.R., Beer, J.M.: The Potential of a Robot-teaching Assistant for Educators in Rural Areas (in prep.) 2. Title IX - General Provisions (2005). https://www2.ed.gov/policy/elsec/leg/esea02/pg107. html. Accessed 28 Feb 2017 3. Savage, M.D., Hawkes, T.: Measuring the Mathematics Problem. The Engineering Council, London (2000) 4. South Carolina Department of Education (n.d.). http://ed.sc.gov/data/test-scores/stateassessments/act-aspire-test-scores/2015/. Accessed 28 Feb 2017 5. Fuchs, D., Fuchs, L.S.: Introduction to response to intervention: what, why, and how valid is it. Read. Res. Q. 41(1), 93–99 (2006) 6. Bloom, B.S.: The 2 sigma problem: the search for methods of group instruction as effective as one-to-one tutoring. Educ. Res. 13(6), 4–6 (1984) 7. Ong, J., Ramachandran, S.: Intelligent Tutoring Systems: The What and the How. Learning Circuits, 1 (2000) 8. Fantuzzo, J.W., Riggio, R.E., Connelly, S., Dimeff, L.A.: Effects of reciprocal peer tutoring on academic achievement and psychological adjustment: a component analysis. J. Educ. Psychol. 81(2), 173 (1989) 9. Student Engagement Definition (2016). http://edglossary.org/student-engagement/. Accessed 28 Feb 2017 10. Safer, N., Fleischman, S.: How student progress monitoring improves instruction. Educ. Leadersh. 62(5), 81–83 (2005) 11. Engaging Emotions: Role of Emotions in Learning (n.d.). http://serendip.brynmawr.edu/ exchange/l-cubed/engaging-emotions-role-emotions-learning. Accessed 21 Jan 2017 12. Kidd, C.D., Breazeal, C.: Effect of a robot on user perceptions. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3559–3564 (2004) 13. Dautenhahn, K.: Socially intelligent robots: dimensions of human-robot interaction. Philos. Trans. R. Soc. B Biol. Sci. 362(1480), 679–704 (2007) 14. Fong, T., Nourbakhsh, I., Dautenhahn, K.: A survey of socially interactive robots. Robot. Auton. Syst. 42(3), 143–166 (2003) 15. Tiberius, R.G., Billson, J.M.: The social context of teaching and learning. College Teaching: From Theory to Practice, vol. 45, pp. 67–86. Jossey-Bass, San Francisco (1993) 16. Yun, S., Shin, J., Kim, D., Kim, C.G., Kim, M., Choi, M.T.: Engkey: tele-education robot. In: Proceedings of the International Conference on Social Robotics, pp. 142–152 (2011) 17. Cuéllar, F.F., Peñaloza, C.I., López, J.A.: Educational robots as promotors of cultural development. In: The 11th ACM/IEEE International Conference on Human Robot Interaction, p. 547 (2016) 18. Kennedy, J., Baxter, P., Senft, E., Belpaeme, T.: Social robot tutoring for child second language learning. In: Proceedings of the 11th ACM/IEEE International Conference on Human-Robot Interaction, pp. 231–238 (2016) 19. Tanaka, F., Takahashi, T., Matsuzoe, S., Tazawa, N., Morita, M.: Telepresence robot helps children in communicating with teachers who speak a different language. In: Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pp. 399–406 (2014) 20. Janssen, J.B., van der Wal, C.C., Neerincx, M.A., Looije, R.: Motivating children to learn arithmetic with an adaptive robot game. In: Proceedings of the International Conference on Social Robotics, pp. 153–162 (2011)


21. Ramachandran, A., Litoiu, A., Scassellati, B.: Shaping productive help-seeking behavior during robot-child tutoring interactions. In: The Eleventh ACM/IEEE International Conference on Human Robot Interaction, pp. 247–254 (2016) 22. Lubold, N., Walker, E., Pon-Barry, H.: Effects of voice-adaptation and social dialogue on perceptions of a robotic learning companion. In: Proceedings of the Eleventh ACM/IEEE Conference on Human-Robot Interaction, pp. 255–262 (2016) 23. Lee, E., Lee, Y., Kye, B., Ko, B.: Elementary and middle school teachers’, students’ and parents’ perception of robot-aided education in Korea. In: World Conference on Educational Multimedia, Hypermedia and Telecommunications, Vienna, pp. 175–183 (2008) 24. Gratch, J., Rickel, J., Andre, E., Cassell, J., Petajan, E., Badler, N.: Creating interactive virtual humans: some assembly required (2002) 25. Howley, I., Kanda, T., Hayashi, K., Rosé, C.: Effects of social presence and social role on help-seeking and learning. In: Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pp. 415–422 (2014) 26. Kanda, T., Shimada, M., Koizumi, S.: Children learning with a social robot. In: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, New York, NY, pp. 351–358 (2012) 27. Poole, D.L., Mackworth, A.K.: Artificial Intelligence: Foundations of Computational Agents. Cambridge University Press, Cambridge (2010) 28. Sottilare, R.A.: Adaptive Intelligent Tutoring System (ITS) research in support of the Army Learning Model—research outline. US Army Research Laboratory (ARL-SR-0284) (2013) 29. Leyzberg, D., Spaulding, S., Scassellati, B.: Personalizing robot tutors to individuals’ learning differences. In: Proceedings of the 2014 ACM/IEEE International Conference on Human-robot Interaction, pp. 423–430 (2014) 30. Spaulding, S., Gordon, G., Breazeal, C.: Affect-aware student models for robot tutors. In: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 864–872 (2016) 31. Szafir, D., Mutlu, B.: Pay attention: designing adaptive agents that monitor and improve user engagement. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 11–20 (2012) 32. Gordon, G., et al.: Affective personalization of a social robot tutor for children’s second language skills. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 3951–3957 (2016) 33. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, pp. 3–8. Prentice Hall, Englewood Cliffs (1995) 34. Decision Tree – Classification (n.d.). http://www.saedsayad.com/decision_tree.htm. Accessed 28 Feb 2017 35. Aldebaran documentation (n.d.). http://doc.aldebaran.com/2-1/family/nao_h25/index_h25. html. Accessed 28 Feb 2017 36. South Carolina Department of Education (n.d.). http://ed.sc.gov/instruction/standardslearning/mathematics/. Accessed 28 Feb 2017 37. Carpenter, T.P., Fennema, E., Franke, M.L., Levi, L., Empson, S.B.: Children’s mathematics: cognitively guided instruction. Heinemann, 361 Hanover Street, Portsmouth, NH 038013912 (1999) 38. Short, E., et al.: How to train your DragonBot: socially assistive robots for teaching children about nutrition through play. In: 2014 RO-MAN: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, pp. 924–929, August 2014 39. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010)


40. Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. J. Pers. Soc. Psychol. 17(2), 124 (1971) 41. Glaessner, B., Salk, S., Stodolsky, S.S.: Student views about learning math and social studies. Am. Educ. Res. J. 28, 89–116 (1991) 42. Masalski, W.J.: Technology-Supported Mathematics Learning Environments (Sixty-Seventh Yearbook) [2005 NCTM Yearbook (67th)]. National Council of Teachers of Mathematics (2005) 43. Bloom, B.S.: Taxonomy of Educational Objectives. Cognitive Domain, vol. 1, pp. 20–24. McKay, New York (1956)
