Human Activity Recognition Challenge [1st ed.] 9789811582684, 9789811582691


Table of contents :
Front Matter ....Pages i-xiv
Summary of the Cooking Activity Recognition Challenge (Sayeda Shamma Alia, Paula Lago, Shingo Takeda, Kohei Adachi, Brahim Benaissa, Md Atiqur Rahman Ahad et al.)....Pages 1-13
Activity Recognition from Skeleton and Acceleration Data Using CNN and GCN (Donghui Mao, Xinyu Lin, Yiyun Liu, Mingrui Xu, Guoxiang Wang, Jiaming Chen et al.)....Pages 15-25
Let’s Not Make It Complicated—Using Only LightGBM and Naive Bayes for Macro- and Micro-Activity Recognition from a Small Dataset (Ryoichi Kojima, Roberto Legaspi, Kiyohito Yoshihara, Shinya Wada)....Pages 27-37
Deep Convolutional Bidirectional LSTM for Complex Activity Recognition with Missing Data (Swapnil Sayan Saha, Sandeep Singh Sandha, Mani Srivastava)....Pages 39-53
SCAR-Net: Scalable ConvNet for Activity Recognition with Multimodal Sensor Data (Zabir Al Nazi)....Pages 55-64
Multi-sampling Classifiers for the Cooking Activity Recognition Challenge (Ninnart Fuengfusin, Hakaru Tamukoh)....Pages 65-74
Multi-class Multi-label Classification for Cooking Activity Recognition (Shkurta Gashi, Elena Di Lascio, Silvia Santini)....Pages 75-89
Cooking Activity Recognition with Convolutional LSTM Using Multi-label Loss Function and Majority Vote (Atsuhiro Fujii, Daiki Kajiwara, Kazuya Murao)....Pages 91-101
Identification of Cooking Preparation Using Motion Capture Data: A Submission to the Cooking Activity Recognition Challenge (Clément Picard, Vito Janko, Nina Reščič, Martin Gjoreski, Mitja Luštrek)....Pages 103-113
Cooking Activity Recognition with Varying Sampling Rates Using Deep Convolutional GRU Framework (Md. Sadman Siraj, Omar Shahid, Md Atiqur Rahman Ahad)....Pages 115-126


Smart Innovation, Systems and Technologies 199

Md Atiqur Rahman Ahad Paula Lago Sozo Inoue   Editors

Human Activity Recognition Challenge

Smart Innovation, Systems and Technologies Volume 199

Series Editors Robert J. Howlett, Bournemouth University and KES International, Shoreham-by-sea, UK Lakhmi C. Jain, Faculty of Engineering and Information Technology, Centre for Artificial Intelligence, University of Technology Sydney, Sydney, NSW, Australia

The Smart Innovation, Systems and Technologies book series encompasses the topics of knowledge, intelligence, innovation and sustainability. The aim of the series is to make available a platform for the publication of books on all aspects of single and multi-disciplinary research on these themes in order to make the latest results available in a readily-accessible form. Volumes on interdisciplinary research combining two or more of these areas is particularly sought. The series covers systems and paradigms that employ knowledge and intelligence in a broad sense. Its scope is systems having embedded knowledge and intelligence, which may be applied to the solution of world problems in industry, the environment and the community. It also focusses on the knowledge-transfer methodologies and innovation strategies employed to make this happen effectively. The combination of intelligent systems tools and a broad range of applications introduces a need for a synergy of disciplines from science, technology, business and the humanities. The series will include conference proceedings, edited collections, monographs, handbooks, reference books, and other relevant types of book in areas of science and technology where smart systems and technologies can offer innovative solutions. High quality content is an essential feature for all book proposals accepted for the series. It is expected that editors of all accepted volumes will ensure that contributions are subjected to an appropriate level of reviewing process and adhere to KES quality principles. Indexed by SCOPUS, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST), SCImago, DBLP. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/8767

Md Atiqur Rahman Ahad · Paula Lago · Sozo Inoue



Editors

Human Activity Recognition Challenge


Editors
Md Atiqur Rahman Ahad
University of Dhaka, Dhaka, Bangladesh
Osaka University, Osaka, Japan

Paula Lago
Kyushu Institute of Technology, Fukuoka, Japan

Sozo Inoue
Kyushu Institute of Technology, Fukuoka, Japan

ISSN 2190-3018 ISSN 2190-3026 (electronic) Smart Innovation, Systems and Technologies ISBN 978-981-15-8268-4 ISBN 978-981-15-8269-1 (eBook) https://doi.org/10.1007/978-981-15-8269-1 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Activity recognition applications include remote monitoring of the daily activities of elders living alone and automatic record creation for nurses in hospitals. Although these applications only require recognizing the activity being done, e.g., cooking, such complex activities are usually made up of several smaller activities like "taking something from the fridge". Recognizing such steps has multiple advantages. For instance, in the scenario of an elder living alone, recognizing activity steps enables applications such as reminders of the next steps. Current activity recognition systems focus on recognizing either the complex label (macro activity) or the small steps (micro activities), but their joint recognition is critical for analyses like the ones proposed. In fact, in a nursing scenario, washing hands after taking blood is quite different from doing it before, as it is mandatory. Therefore, with the Cooking Activity Recognition Challenge, we aimed at the recognition of the macro- and micro activities taking place during cooking sessions. This book contains 10 chapters on the activity recognition challenge, from Bangladesh, China, France, Japan, Slovenia, Switzerland, and the USA. As there were three data sources available, each solution selected the data source its authors found most suitable for the task. Most of the chapters used only the data from accelerometer sensors, which were placed on both wrists (smartwatches), on the right forearm, and on the left hip (smartphones). The chapters using accelerometer data explored a wide variety of approaches, and several devoted themselves to deep learning. Some chapters explored a combination of data sources as well as different models. All chapters used two separate models, one for each level of classification. The results are summarized and analyzed in the chapter "Summary of the Cooking Activity Recognition Challenge".


Chapter “Activity Recognition from Skeleton and Acceleration Data Using CNN and GCN” used the motion capture and only the accelerometer placed on the right arm as input to a Convolutional Neural Network. Chapter “Let’s Not Make It Complicated—Using Only LightGBM and Naive Bayes for Macro- and Micro-Activity Recognition from a Small Dataset” also combines the accelerometer with motion capture data and then a LightGBM model for classification. Chapter “Deep Convolutional Bidirectional LSTM for Complex Activity Recognition with Missing Data” also only used three accelerometers, excluding the left wrist sensor, and a deep learning approach. However, they chose LSTM as the model for the data. Chapter “SCAR-Net: Scalable ConvNet for Activity Recognition with Multimodal Sensor Data” proposed a convolutional network SCAR-Net which implements an end-to-end approach. However, they use two models, one for each level. Chapter “Multi-sampling Classifiers for the Cooking Activity Recognition Challenge” used all three data sources (accelerometer, motion capture, and OpenPose) and a classifier based on time-series similarity. Chapter “Multi-class Multi-label Classification for Cooking Activity Recognition” used both motion capture and accelerometer data as input to a multi-class random forest. Chapter “Cooking Activity Recognition with Convolutional LSTM Using Multilabel Loss Function and Majority Vote” uses LSTM and also all 4 accelerometer sensors. All deep learning models use raw data as input, with preprocessing to account for different sampling rates and missing values. Chapter “Identification of Cooking Preparation Using Motion Capture Data: A Submission to the Cooking Activity Recognition Challenge” used only the motion capture data. This data source records the 3D position of 25 body parts. They reconstructed the sequence of data to train over the whole sequence instead of segments. A random forest proved to be more effective in the classification than other approaches. Chapter “Cooking Activity Recognition with Varying Sampling Rates Using Deep Convolutional GRU Framework” used only three accelerometers, and a deep learning model combining a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU).


We would like to thank the expert reviewers for their time to review the chapters. We thank Springer for their prompt cooperation to publish this book. We strongly feel that this book will enrich the researcher communities in the academia and industry to move further on human activity and behavior understanding. Best regards, Md Atiqur Rahman Ahad, Ph.D., SMIEEE Professor, University of Dhaka Dhaka, Bangladesh Specially Appointed Associate Professor Osaka University Osaka, Japan Paula Lago, Ph.D. Postdoctoral Researcher Kyushu Institute of Technology Fukuoka, Japan Sozo Inoue, Ph.D. Professor, Kyushu Institute of Technology Fukuoka, Japan

Contents

Summary of the Cooking Activity Recognition Challenge .......... 1
Sayeda Shamma Alia, Paula Lago, Shingo Takeda, Kohei Adachi, Brahim Benaissa, Md Atiqur Rahman Ahad, and Sozo Inoue

Activity Recognition from Skeleton and Acceleration Data Using CNN and GCN .......... 15
Donghui Mao, Xinyu Lin, Yiyun Liu, Mingrui Xu, Guoxiang Wang, Jiaming Chen, and Wei Zhang

Let’s Not Make It Complicated—Using Only LightGBM and Naive Bayes for Macro- and Micro-Activity Recognition from a Small Dataset .......... 27
Ryoichi Kojima, Roberto Legaspi, Kiyohito Yoshihara, and Shinya Wada

Deep Convolutional Bidirectional LSTM for Complex Activity Recognition with Missing Data .......... 39
Swapnil Sayan Saha, Sandeep Singh Sandha, and Mani Srivastava

SCAR-Net: Scalable ConvNet for Activity Recognition with Multimodal Sensor Data .......... 55
Zabir Al Nazi

Multi-sampling Classifiers for the Cooking Activity Recognition Challenge .......... 65
Ninnart Fuengfusin and Hakaru Tamukoh

Multi-class Multi-label Classification for Cooking Activity Recognition .......... 75
Shkurta Gashi, Elena Di Lascio, and Silvia Santini

Cooking Activity Recognition with Convolutional LSTM Using Multi-label Loss Function and Majority Vote .......... 91
Atsuhiro Fujii, Daiki Kajiwara, and Kazuya Murao

Identification of Cooking Preparation Using Motion Capture Data: A Submission to the Cooking Activity Recognition Challenge .......... 103
Clément Picard, Vito Janko, Nina Reščič, Martin Gjoreski, and Mitja Luštrek

Cooking Activity Recognition with Varying Sampling Rates Using Deep Convolutional GRU Framework .......... 115
Md. Sadman Siraj, Omar Shahid, and Md Atiqur Rahman Ahad

Editors and Contributors

About the Editors Md Atiqur Rahman Ahad SMIEEE, is Professor, University of Dhaka (DU), and Specially Appointed Associate Professor, Osaka University. He did B.Sc. (Honors) & Masters (DU), Masters (University of New South Wales), and Ph.D. (Kyushu Institute of Technology) and is JSPS Postdoctoral Fellow and Visiting Researcher. His authored books are “Motion History Images for Action Recognition and Understanding,” in Springer; “Computer Vision and Action Recognition,” in Springer; “IoT-sensor based Activity Recognition,” in Springer. He has been authoring/editing a few more books. He published 150+ peer-reviewed papers, *80 keynote/invited talks, 25+ Awards/Recognitions. He is Editorial Board Member of Scientific Reports, Nature; Associate Editor of Frontiers in Computer Science; Editor of the International Journal of Affective Engineering; Editor-in-Chief: International Journal of Computer Vision & Signal Processing; General Chair: 10th ICIEV; 5th IVPR; 3rd ABC; Guest-Editor: Pattern Recognition Letters, Elsevier; JMUI, Springer; JHE, Hindawi; IJICIC; Member: OSA, ACM, IAPR. Paula Lago has a Ph.D. from Universidad de los Andes, Colombia. She received her Bachelor’s and Master’s degree in Software Engineering from the same university. From 2018 to 2020, she was a Postdoctoral Researcher at Kyushu Institute of Technology, Japan. Her current research is on how to improve the generalization of activity recognition in real-life settings taking advantage of data collected in controlled settings. In 2016, she was an invited researcher in the Informatics Laboratory of Grenoble, where she participated in smart home research in collaboration with INRIA. She has served as Reviewer for MDPI Sensors and ACM IMWUT journal and for several conferences. She is a Co-Organizer of the HASCA Workshop, held at Ubicomp yearly. She currently volunteers for ACM SIGCHI.


Sozo Inoue is a Full Professor in Kyushu Institute of Technology, Japan. His research interests include human activity recognition with smart phones, and healthcare application of web/pervasive/ubiquitous systems. Currently, he is working on verification studies in real field applications and collecting and providing a large-scale open dataset for activity recognition, such as a mobile accelerator dataset with about 35,000 activity data from more than 200 subjects, nurses’ sensor data combined with 100 patients’ sensor data and medical records, and 34 households’ light sensor dataset for 4 months combined with smart meter data. Inoue has a Ph.D. in Engineering from Kyushu University in 2003. After completion of his degree, he was appointed as an Assistant Professor in the Faculty of Information Science and Electrical Engineering at Kyushu University, Japan. He then moved to the Research Department at Kyushu University Library in 2006. Since 2009, he is appointed as an Associate Professor in the Faculty of Engineering at Kyushu Institute of Technology, Japan, moved to the Graduate School of Life Science and Systems Engineering at Kyushu Institute of Technology in 2018, and appointed as a Full Professor from 2020. Meanwhile, he was a Guest Professor in Kyushu University, a Visiting Professor at Karlsruhe Institute of Technology, Germany, in 2014, a special Researcher at the Institute of Systems, Information Technologies and Nanotechnologies (ISIT) during 2015–2016, and a Guest Professor at the University of Los Andes in Colombia in 2019. He is a Technical Advisor of Team AIBOD Co. Ltd during 2017–2019, and a Guest Researcher at RIKEN Center for Advanced Intelligence Project (AIP) during 2017–2019. He is a Member of the IEEE Computer Society, the ACM, the Information Processing Society of Japan (IPSJ), the Institute of Electronics, Information and Communication Engineers (IEICE), the Japan Society for Fuzzy Theory and Intelligent Informatics, the Japan Association for Medical Informatics (JAMI), and the Database Society of Japan (DBSJ).

Contributors Kohei Adachi Kyushu Institute of Technology, Fukuoka, Japan Md Atiqur Rahman Ahad University of Dhaka, Dhaka, Bangladesh Sayeda Shamma Alia Kyushu Institute of Technology, Fukuoka, Japan Brahim Benaissa Kyushu Institute of Technology, Fukuoka, Japan Jiaming Chen Shandong University, Jinan, China Elena Di Lascio Università della Svizzera italiana (USI), Lugano, Switzerland Ninnart Fuengfusin Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, Fukuoka, Japan


Atsuhiro Fujii Graduate School of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan Shkurta Gashi Università della Svizzera italiana (USI), Lugano, Switzerland Martin Gjoreski Jozef Stefan Institute, Ljubljana, Slovenia Sozo Inoue Kyushu Institute of Technology, Fukuoka, Japan Vito Janko Jozef Stefan Institute, Ljubljana, Slovenia Daiki Kajiwara Graduate School of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan Ryoichi Kojima KDDI Research, Tokyo, Japan Paula Lago Kyushu Institute of Technology, Fukuoka, Japan Roberto Legaspi KDDI Research, Tokyo, Japan Xinyu Lin Shandong University, Jinan, China Yiyun Liu Shandong University, Jinan, China Mitja Luštrek Jozef Stefan Institute, Ljubljana, Slovenia Donghui Mao Shandong University, Jinan, China Kazuya Murao Graduate School of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan Zabir Al Nazi Dhaka, Bangladesh Clément Picard École normale supérieure de Rennes, Bruz, France; Jozef Stefan Institute, Ljubljana, Slovenia Nina Reščič Jozef Stefan Institute, Ljubljana, Slovenia Swapnil Sayan Saha University of California, Los Angeles, USA Sandeep Singh Sandha University of California, Los Angeles, USA Silvia Santini Università della Svizzera italiana (USI), Lugano, Switzerland Omar Shahid University of Dhaka, Dhaka, Bangladesh Md. Sadman Siraj University of Dhaka, Dhaka, Bangladesh Mani Srivastava University of California, Los Angeles, USA Shingo Takeda Kyushu Institute of Technology, Fukuoka, Japan Hakaru Tamukoh Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, Fukuoka, Japan Shinya Wada KDDI Research, Tokyo, Japan


Guoxiang Wang Shandong University, Jinan, China Mingrui Xu Shandong University, Jinan, China Kiyohito Yoshihara KDDI Research, Tokyo, Japan Wei Zhang Shandong University, Jinan, China


Summary of the Cooking Activity Recognition Challenge Sayeda Shamma Alia, Paula Lago, Shingo Takeda, Kohei Adachi, Brahim Benaissa, Md Atiqur Rahman Ahad, and Sozo Inoue

Abstract The Cooking Activity Recognition Challenge [1] was organized as a part of ABC2020 [2]. In this work, we analyze and summarize the approaches in the submissions to the Challenge. A dataset consisting of macro and micro activities, collected in a cooking scenario, was opened to the public with the goal of recognizing both kinds of activities. The participating teams used the dataset and submitted their predictions for the test data, which was released on March 1st, 2020. The submissions of the teams were evaluated rigorously, and the winning team achieved about 92.08% averaged accuracy over the macro- and micro activities.

1 Introduction The combination of the Internet of Things (IoT) with Artificial Intelligence (AI) is giving rise to services and applications for personalized health care and monitoring. Among those services, monitoring at home has sparked attention for its S. S. Alia (B) · P. Lago · S. Takeda · K. Adachi · B. Benaissa · S. Inoue Kyushu Institute of Technology, Fukuoka, Japan e-mail: [email protected] P. Lago e-mail: [email protected] S. Takeda e-mail: [email protected] K. Adachi e-mail: [email protected] B. Benaissa e-mail: [email protected] S. Inoue e-mail: [email protected] M. A. R. Ahad University of Dhaka, Dhaka, Bangladesh e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 M. A. R. Ahad et al. (eds.), Human Activity Recognition Challenge, Smart Innovation, Systems and Technologies 199, https://doi.org/10.1007/978-981-15-8269-1_1


impact in elderly care services that allow them to Age in Place (https://en.wikipedia.org/wiki/Aging_in_place) while ensuring their safety. These services provide awareness of what the user is doing, so that an alert can be raised in case of emergency. For example, with fall detection [3] an immediate medical service can be dispatched to them. The future possibilities of using these services in various fields are numerous. For example, people who need special attention, such as patients with autism spectrum disorder [4], dementia [5], or Parkinson's disease [6], can be monitored at home while tailored care is provided to them based on their needs. Another example is automatic record creation for nurses in hospitals. One important use case for the elderly is cooking monitoring. As the activity of cooking is a strong indicator of cognitive health and independent living ability, it also opens the door for monitoring nutrition. Cooking is a complex activity, usually made up of several smaller activities like "taking from the fridge", "washing the food", "mixing in the bowl", etc. Recognizing such steps can have several advantages. For instance, in the scenario of an elderly person living alone, recognizing the steps can be used to remind them of a missing step, or to ensure a healthy diet is being followed. In the scenario of the nursing record, recognizing the steps can be useful for care quality assessment, or for ensuring that safety protocols have been followed, like washing hands at the proper moments. Cooking is a complex activity, and monitoring studies use several sensors embedded in the environment, such as temperature and motion sensors, as well as electric consumption sensors. However, such installations might be costly and difficult to maintain. Therefore, in this challenge, we explored the possibility of cooking activity monitoring with wearable (smart watch) and smartphone sensors, which are cheaper and already available at home. We collected a dataset [7] consisting of three recipes and nine steps (Cut, Take, Mix, Add, Other, Pour, Open, Peel, Wash). While the dataset was collected in a laboratory setup, it serves to make an initial evaluation of the cooking activity and to get a sense of its complexity, with the goal of automatically recognizing the recipes being prepared and monitoring the steps followed to make them. Current activity recognition systems focus on recognizing either the complex label (macro activity) or the small steps (micro activities), but their combined recognition is critical for real-life application analysis. In fact, in a nursing scenario, washing hands after taking blood is very different from doing it before, as it is mandatory. Thus, this challenge is aimed at the recognition of the macro- and micro activities taking place during cooking sessions. In this paper, we provide an overview of the submissions received to the challenge, analyze their approaches as well as metadata, and highlight the lessons learned.

2 Dataset Description The dataset was collected in a setting where cooking activities are performed. Each subject was instructed to cook three types of food following specific recipes. The cooking processes of these three types of food are considered as three different


activities. In this section, the activities in this dataset are described in detail, and the data collection environment and the sensors used are reported.

2.1 Activities Collected The dataset used for this challenge consists of activities and actions associated with cooking. Actions are referred to as micro activities and activities as macro activities. There are three macro activities and nine micro activities, and each macro activity consists of multiple micro activities. The micro activities of each macro activity are given below.
• CEREAL: Take, Open, Cut, Peel, Other, Put
• FRUITSALAD: Take, Add, Mix, Cut, Peel, Other, Put
• SANDWICH: Take, Cut, Other, Wash, Put
As we can see, the macro activities share many similar micro activities, which are done in slightly different ways. This increases the difficulty of correctly detecting these activities. The data is divided into 30 s segments, and macro and micro labels are given for each segment. Most of the time, a micro activity takes less than 30 s, so multiple micro-activity labels are observed in most of the segments. The number of samples for each of the activities is shown in Table 1.

Table 1 Number of samples for training and testing
Classes             Training   Testing
Macro  Cereal       73         26
       Fruitsalad   102        38
       Sandwich     113        35
Micro  Take         134        46
       Add          18         6
       Mix          19         4
       Open         23         6
       Cut          99         31
       Peel         96         36
       Other        74         34
       Wash         30         10
       Put          114        46
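Since a single 30-s segment usually carries several micro-activity labels, the micro task is multi-label. As an illustrative sketch (not part of the challenge kit), such label sets can be turned into a binary indicator matrix with scikit-learn, using the class names of Table 1; the example segments below are hypothetical:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Example micro-activity label sets for three hypothetical 30-s segments.
segment_micro_labels = [
    {"Take", "Cut", "Peel"},   # e.g. a fruitsalad segment
    {"Take", "Open", "Put"},   # e.g. a cereal segment
    {"Wash", "Cut"},           # e.g. a sandwich segment
]

mlb = MultiLabelBinarizer(classes=[
    "Take", "Add", "Mix", "Open", "Cut", "Peel", "Other", "Wash", "Put",
])
# Y is an (n_segments, n_micro_labels) 0/1 matrix usable by multi-label classifiers.
Y = mlb.fit_transform(segment_micro_labels)
print(mlb.classes_)
print(Y)
```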


Fig. 1 Motion capture markers used in this dataset

2.2 Experimental Settings and Sensor Modalities The data collection experiment was conducted in the Smart Life Care Unit of the Kyushu Institute of Technology in Japan. Four subjects participated in the data collection, and there was no overlap between the subjects. The experiment was conducted in a controlled environment where the steps were predefined for the subjects. They had to prepare three types of food following the defined steps. The data was collected using smartphones, smart watches, a motion capture system, and OpenPose. Only the first three sensors are open to the public for this challenge. The details of each sensor are given below. Motion Capture: A motion capture system from Motion Analysis Company [8] is used for this experiment. It has 29 body markers, whose placement on the body is shown in Fig. 1, and 16 infrared cameras are used to track the markers. Accelerometer sensors: Two smartphones are placed on the right arm and the left hip, and two smart watches are placed on both wrists of a subject. The smartwatches are TicWatch E devices; a Samsung Galaxy S9 SCV38 and a Huawei P20 Lite smartphone are used on the left hip and right arm, respectively. OpenPose [9]: OpenPose is a real-time 2D open-source pose detection system that can detect 135 key points of the human body from a single image. During this experiment, however, only the key points corresponding to the motion capture markers are used from OpenPose.

2.3 Data Format The data has been separated into training data and test data. Training data contains data from 3 subjects and test data contains the fourth subject’s data. Each recording has been segmented into 30-s segments as mentioned earlier. Each segment was


Fig. 2 Folder structure for the dataset

assigned a random identifier, so the order of the segments is unknown. Each of these segments is recorded in one CSV file individually. Data collected by each of the sensors are represented by one folder. One row of each file contains the file name, the macro activity and the micro activities which are all separated by commas. The structure of a folder is shown in Fig. 2.
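A minimal loading sketch for this layout is given below; the folder and file names used here (train/, train_labels.csv) are assumptions for illustration, since the released archive may name them differently:

```python
import glob
import os
import pandas as pd

# Hypothetical paths; the actual folder and file names in the released
# dataset may differ from the ones assumed here.
DATA_DIR = "train"               # one sub-folder per sensor
LABEL_FILE = "train_labels.csv"  # rows: segment_file, macro, micro, micro, ...

def load_labels(path):
    """Parse a label file where each row holds a segment file name,
    the macro label, and a variable number of micro labels."""
    labels = {}
    with open(path) as f:
        for line in f:
            parts = [p.strip() for p in line.strip().split(",") if p.strip()]
            if not parts:
                continue
            segment_id, macro, micro = parts[0], parts[1], parts[2:]
            labels[segment_id] = {"macro": macro, "micro": micro}
    return labels

def load_segments(sensor_folder):
    """Read every 30-s segment CSV of one sensor into a dict of DataFrames."""
    segments = {}
    for csv_path in glob.glob(os.path.join(DATA_DIR, sensor_folder, "*.csv")):
        segment_id = os.path.splitext(os.path.basename(csv_path))[0]
        segments[segment_id] = pd.read_csv(csv_path)
    return segments
```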

3 Challenge Tasks and Results The goal of the Cooking Activity Recognition Challenge is to recognize both the macro- and micro activities. The challenge provides training and test data, and the participants are asked to predict the macro- and micro labels for the test data. Evaluation is done based on the submitted labels for both kinds of activities separately. Initially, 78 unique teams registered, but only 9 teams submitted to the final challenge. This large difference may be due to a few reasons:
• the global COVID-19 pandemic,
• not enough time, and
• not being able to reach a high accuracy.
These reasons were obtained by talking with several teams that could not submit for the final stage of the challenge.

3.1 Evaluation Metric
To evaluate the submissions, accuracy is calculated for the macro- and micro labels individually. The accuracy of the macro labels is obtained using Eq. 1:

\text{Accuracy}(A_{ma}) = \frac{\text{Correctly Predicted Samples}}{\text{Total Samples}} \quad (1)


For micro activities, accuracy is calculated using the multi-label accuracy formula [10]:

\text{Accuracy}(A_{mi}) = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \quad (2)

Here, Y_i is the set of predicted micro-activity labels and Z_i the set of actual micro-activity labels for sample i, and n is the total number of samples. The final score is the average of the two accuracies (macro and micro):

A = \frac{A_{ma} + A_{mi}}{2} \quad (3)
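For reference, Eqs. 1-3 can be computed with a few lines of Python; the snippet below is an illustrative sketch rather than the organizers' official evaluation code:

```python
def macro_accuracy(y_true, y_pred):
    """Eq. (1): fraction of segments whose macro label is predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def micro_accuracy(Y_true, Y_pred):
    """Eq. (2): multi-label (Jaccard-style) accuracy averaged over segments."""
    total = 0.0
    for true_set, pred_set in zip(Y_true, Y_pred):
        union = true_set | pred_set
        total += len(true_set & pred_set) / len(union) if union else 1.0
    return total / len(Y_true)

def challenge_score(y_true_macro, y_pred_macro, Y_true_micro, Y_pred_micro):
    """Eq. (3): average of the macro and micro accuracies."""
    return (macro_accuracy(y_true_macro, y_pred_macro)
            + micro_accuracy(Y_true_micro, Y_pred_micro)) / 2

# Toy example with two segments:
print(challenge_score(
    ["cereal", "sandwich"], ["cereal", "fruitsalad"],
    [{"Take", "Cut"}, {"Wash"}], [{"Take"}, {"Wash", "Put"}],
))
```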

3.2 Results Participant teams used various preprocessing and classification methods for this challenge. Difference in the use of sensor modalities is also observed and it is shown in Fig. 3a. As we can see in the figure, three teams used all of the sensors. There were four accelerometer sensors: right arm (RA), right wrist (RW), left wrist (LW) and left hip (LH). Different teams used different combinations of sensors as shown in the figure. The difference in accuracy by using different sensors can be seen in Fig. 3b. A big difference in performance can be seen when Mocap and Mocap&RA are used. The best performance is achieved using Mocap. Performance degraded very much when it is combined with Accelerometer sensor in right arm (Mocap&RA). One possible reason can be: the use of different classifiers and features. The training time and testing time of the teams is presented in Fig. 4. For train and test, time was divided into Machine Learning (ML) and Deep Learning (DL) groups. Three teams used ML, five teams used DL and one team used both ML and DL. One of the teams mentioned that it took a very long time for training and testing. So, only 8 teams’ information is shown in the figure. Here, we can see that for training ML, it 100

Average Accuracy

Number of Teams

3

2

1

75

50

25

0

0 Acc (All)

Acc (RA,

All

RW,LH)

Sensors

(a)

Mocap

Mocap&RA

Acc (All)

Acc (RA, RW,LH)

All

Mocap

Mocap&RA

Sensors

(b)

Fig. 3 a Sensor modalities used by teams and b Accuracies by different sensor modalities

Summary of the Cooking Activity Recognition Challenge

7

Fig. 4 a Training time and b Testing time by classical machine learning (ML) and deep learning (DL) pipelines

took very less time whereas for DL, it took a long time. And for testing both of the groups required very less time. As for training it took longer, so it is shown in hours and testing took less than an hour, so to emphasize the difference between ML and DL, it is shown in minutes. As for the implementation, all of the teams used Python, and one team used both Python and MATLAB. This indicates the popularity of Python for AI and Machine learning related fields. Also in Fig. 5b, commonly used libraries are shown. The libraries which are used by multiple teams are shown in the figure. For the window size, different teams used different window sizes. The dataset was provided in a 30 s window frame. Some of the teams modified it and 0.5, 2, 3, 10 s are used. Many teams used different feature extraction and selection strategies. Mean, standard deviation, max and min are commonly used features among different teams. One of the teams converted the inputs to image and used it for classification. For the post-processing of the data, the use of resampling techniques and imputation strategies are observed. ML and DL pipelines used by different teams are shown in Fig. 6. It is visible that more people are using DL algorithms than ML, although ML is faster as seen in Fig. 4. The use of different ML algorithms like k-NN, SVM, Random Forest and LightGBM was noted. In the case of DL algorithms, the use of CNN, GCN, Convolutional Neural Network (SCAR-Net), Deep Convolutional Bidirectional LSTM, Deep Convolutional GRU and ConvLSTM was observed. Accuracy comparison of these algo-

8

S. S. Alia et al. 6

Python

10.0%

5

Matlab Teams

4 3 2 1

90.0%

0 sklearn

pytorch

pandas

keras

tsfresh

Libraries

(b)

(a) Fig. 5 a Programming Language and b Libraries used by the teams Fig. 6 ML and DL pipelines used by teams

ML 11.1%

DL 33.3%

Both

55.6%

rithms is done by dividing them into ML and DL groups, and it is shown in Fig. 7a. This accuracy is only based on training data, provided by participants in their submitted papers. In the Figure, we can see that most of the DL algorithms have 70–85% accuracy during training while the range of accuracy for ML algorithms is quite large. One thing to notice is that the number of teams that used DL algorithms is almost double than the teams used ML, still having a small range means that the performance of the DL algorithms is consistent and similar. On the other hand, accuracy on the testing set is shown in Fig. 7b. Samples of subject 4 are used as testing data and no cross validation is used. Unlike the training accuracy, the range of testing accuracy for DL is smaller. But the difference in median can be observed for training and testing. Both ML and DL have smaller median in testing accuracy. In Fig. 8, comparison of the accuracy of training and testing data is shown. Training accuracy is reported by participants in the papers and Testing accuracy is calculated by us using Eq. 3. The overall performance of all the teams are good. Team T2 has exact same accuracy for training and testing. For the teams T6, T8 and T4, the testing accuracies are quite close with the training accuracies. However, in all of the cases, the training accuracy is higher than the testing accuracy.

Summary of the Cooking Activity Recognition Challenge

9

Fig. 7 Accuracy of ML and DL pipelines for a Training data and b Testing data Train Set

Test Set

Average Accuracy

100

75

50

25

0 T1

T2

T7

T5

T6

T8

T4

T3

T9

Teams

Fig. 8 Accuracy comparison of Training and Testing data

Overall accuracy comparison on the test dataset for all the teams is shown in Fig. 9. Here, the result is grouped in 3 groups: Macro, Micro and All. All means the average of Macro and Micro. The details of each are given in Table 2. Overall, the recognition of micro activity was poorer than macro activity. One reason can be very few samples. 4 teams (T1, T2, T6, T7) were able to recognize micro activity better, than macro

10

S. S. Alia et al.

Fig. 9 Accuracy of Macro, Micro and All activities Fig. 10 Confusion Matrix for Macro activities

activity. Among the teams, 3 teams used DL and one team used ML algorithm. Team T8 was able to achieve 100% accuracy in Macro activity recognition and 84.16% in Micro activity recognition. On average team T8 outperforms all the other teams in both Macro and Micro activity recognition. Confusion matrix for the team with the highest accuracy is shown in Fig. 10. It can be seen that, all the macro activity classes: CEREAL, FRUITSALAD and SANDWICH are perfectly-predicted.

4 Conclusion After summarizing the results of this challenge, we can say that the average accuracy obtained by various teams ranges between 32% to 92%, which is quite large. Although for macro-activity recognition, most of the teams performed well, but the average results degraded because of the comparatively lower performance of micro activity classification. On the other hand, some teams are achieve higher accuracy in


micro activity classification. The highest accuracy for macro-activity classification is 100%, whereas for micro it is 84%; detecting macro activities is thus relatively easier, and the teams that also handled micro activities well obtained the higher average accuracies. Some classes like ADD and MIX have very few samples, which can be one reason behind the weaker performance of micro-activity classification. Another observation from the results is that, although deep learning methods are very promising and demonstrate excellent performance in many fields, machine learning algorithms seem to perform better on this dataset for both training and testing. Also, the running time of the models is much lower for the machine learning algorithms. One difference from the previous Nurse Care Activity Recognition Challenge [14] is that the winning team of that challenge had lower training accuracy than test accuracy, whereas this challenge shows the opposite; one reason can be that the previous challenge dataset has a larger number of samples than this one. In the future, we want to collect data over a wider range and share it with young researchers for more interesting results and observations.

Appendix 1 See Table 2

Cal. Accuracy

56.63%

55.00%

59.11%

32.75%

43.39%

61.05%

52.79%

92.08%

42.16%

Team

T1 (Chap. 2 [12])

T2 (Chap. 3 [15])

T3 (Chap. 4 [16])

T4 (Chap. 5 [17])

T5 (Chap. 6 [11])

T6 (Chap. 7 [18])

T7 (Chap. 8 [13])

T8 (Chap. 9 [19])

T9 (Chap. 10 [20])

Acc (RW, RA, LH)

Mocap

Acc (All)

All

All

Acc (All)

Acc (RW, RA, LH)

All

Mocap & RA

Used sensor modalities

Table 2 Details of Each Team

Python

Python

Python

Python

Python

Python

Python, Matlab

Python

Python

Prog. language

Seras, sklearn

Sklearn, xgb, hmm, tsfresh

Pytorch

Pandas, sklearn, numpy

Keras, sklearn, sktime, pandas, tslearn, numpy

Tensorflow, tsfresh, matplotlib

Keras, Sklearn, Tensorflow

Sklearn, pandas, lightgbm, optuna

Pytorch

Libraries NA

Window size

Deep Convolutional GRU

Combination of different classifiers

ConvLSTM

k -NN

Multi-sampling classifiers (MSC)

Convolutional Neural Network (SCAR-Net)

Deep Convolutional Bidirectional LSTM

3s

2s

500 ms

30 s

30 s

NA

10 s

LightGBM and 10 s and Naïve Bayes

CNN and GCN

Classifier

Longtime

threshold, where x is a signal, is satisfied, then the signal is cut at index t into two signals. 2. For each of the segmented signals, 12 features are calculated. The features are described below. 3. From the feature sets of the segmented signals, a final feature set is calculated by taking a weighted sum of the features calculated from the segmented signals. The weight is given based on the length of each sub-signal. 4. Another weighted sum is calculated based on the sensor channel. Each sensor channel is assigned equal weight. 5. The final feature set of length 12 is stored. The features used in this experiment are absolute energy, spectral moment 2, LOG, WL, autocorrelation, binned entropy, C3, AAC, mean second derivative central (MSDC), zero/mean crossing (ZC), time reversal asymmetry statistic, and variance. Features from both the time and frequency domains were used for the network. Twelve features were calculated from each time series. The list of features [16, 17] with mathematical definitions is given below. Absolute Energy is the sum of squared values:

E = \sum_{i=1}^{n} x_i^2 \quad (1)

Spectral Moments (SM2) is a statistical approach to characterize the power spectrum of the signal, defined as

SM2 = \sum_{i=1}^{n} P_i f_i^2 \quad (2)

Waveform Length (WL) is used to measure the complexity of the signal and is defined as

WL = \sum_{i=1}^{n-1} |x_{i+1} - x_i| \quad (3)

Binned Entropy (BE) is calculated as

BE = -\sum_{k=0}^{\min(\text{max\_bins},\, \text{len}(x))} p_k \log(p_k), \quad \text{where } p_k > 0 \quad (4)


Average Amplitude Change (AAC) is formulated as

AAC = \frac{1}{N} \sum_{i=1}^{N-1} |x_{i+1} - x_i| \quad (5)

Variance is the measure of how far a random variable is spread out, and the Time Reversal Asymmetry Statistic (TRAS) is

TRAS = \frac{1}{n - 2\,\mathrm{lag}} \sum_{i=0}^{n - 2\,\mathrm{lag}} \left( x_{i+2\,\mathrm{lag}}^2 \cdot x_{i+\mathrm{lag}} - x_{i+\mathrm{lag}} \cdot x_i^2 \right) \quad (6)

The features are used to train a random forest classifier (estimators = 50, max depth = 1), and a support vector machine to set a baseline accuracy for the task.
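As a rough sketch of this baseline (not the chapter's exact code), the 12-dimensional feature vectors can be fed to a shallow random forest and an SVM with scikit-learn; the feature matrix X below is a random placeholder standing in for the aggregated per-segment features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(288, 12))       # placeholder for the 12-feature vectors
y = rng.integers(0, 3, size=288)     # placeholder macro labels (3 classes)

# Shallow random forest baseline with the hyper-parameters quoted in the text.
rf = RandomForestClassifier(n_estimators=50, max_depth=1, random_state=0)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

print("RF  CV accuracy:", cross_val_score(rf, X, y, cv=3).mean())
print("SVM CV accuracy:", cross_val_score(svm, X, y, cv=3).mean())
```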

2.2.2 SCAR-Net 1: Network Architecture

SCAR-Net is a multi-input ConvNet, and each input layer corresponds to a sensor channel. In this experiment, only four channels are considered (‘right_arm’, ‘right_wrist’, ‘left_hip’, ‘left_wrist’). The most challenging part of the dataset is the variance of the segment lengths: the large differences in time-series length make it hard to decide on a fixed input length, so SCAR-Net is designed to work with any temporal dimension. To achieve this, SCAR-Net was trained with batch size 1. It was trained on the data of two subjects and then tested on the third one for selecting the optimal model. For the final training after model selection, the dataset was shuffled and trained with a very small learning rate (0.00005) for two epochs, and after each epoch the learning rate was divided by 1.5. The SCAR-Net architecture is shown in Fig. 1. It uses the same number of filters in each input branch, and since the batch size is 1, the temporal data can be concatenated after a few layers. Instead of using fully connected layers, global max pooling is used to flatten the features; this, alongside L1 and L2 regularization in the convolutional layers, is used to avoid over-fitting.
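The exact layer configuration is the one shown in Fig. 1; the following Keras sketch only illustrates the ideas described above (one convolutional branch per sensor channel, variable temporal length, concatenation of the branches, global max pooling instead of fully connected layers, and L1/L2-regularized convolutions) and should not be read as the published architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def conv_block(x, filters):
    # L1/L2-regularized temporal convolution, as described in the text.
    return layers.Conv1D(filters, kernel_size=5, padding="same", activation="relu",
                         kernel_regularizer=regularizers.l1_l2(1e-5, 1e-4))(x)

def build_scarnet_sketch(n_channels=4, n_axes=3, n_classes=3, filters=32):
    inputs, branches = [], []
    for _ in range(n_channels):                   # one branch per sensor position
        inp = layers.Input(shape=(None, n_axes))  # variable temporal length
        x = conv_block(inp, filters)
        x = conv_block(x, filters)
        inputs.append(inp)
        branches.append(x)
    # With batch size 1, branches of different lengths can be joined on the time axis.
    x = layers.Concatenate(axis=1)(branches)
    x = conv_block(x, 2 * filters)
    x = layers.GlobalMaxPooling1D()(x)            # flatten without fully connected layers
    out = layers.Dense(n_classes, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(5e-5),
                  loss="binary_crossentropy")
    return model

model = build_scarnet_sketch()
model.summary()
```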

2.2.3 Unit Batch Training with Sigmoid at the Last Layer

Figure 2 illustrates the distribution of the lengths of the signals in the dataset. It is evident that the length has a very high variance, so choosing a fixed-length model may not be optimal. As a result, SCAR-Net is designed to handle variable-length signals without any preprocessing. To allow this, the model is trained with batch size 1. Although it would be preferable to train with larger batch sizes, and unit batches can lead to convergence issues, it is observed from the experimentation that, under these conditions, it is easier


Fig. 1 SCAR-Net architecture

Fig. 2 Histogram plot of the time-series length


Table 1 Result comparison (task 1)
Method                     Accuracy
SVM (baseline)             0.30
Random Forest (baseline)   0.33
SCAR-Net 1                 0.54

to train the model with sigmoid activation in the last layer. So, SCAR-Net was trained with binary cross-entropy loss and Adam optimizer.

2.2.4 Result Analysis

For task 1 (macro activity), the model is first trained on the data of the three subjects separately, for 3 epochs each, with binary cross-entropy loss. After that, the model is trained again for 2 epochs on a shuffled set containing data from all the subjects, with a low initial learning rate (0.00005). Table 1 shows the performance metrics after cross-validation, where the model was trained with the data of two subjects and tested on the remaining subject. Here, SCAR-Net outperforms the baseline methods by a large margin while keeping all the metrics comparable. Note that the accuracy is the subject-wise cross-validation accuracy, and that SCAR-Net was trained without any data preprocessing.

2.3 Micro-Activity Recognition The same sensor data is used for this task too, but here there can be multiple labels for a task. There are 10 classes in the dataset which are ‘Add’, ‘Cut’, ‘Mix’, ‘Open’, ‘Peel’, ‘Pour’, ‘Put’, ‘Take’, ‘Wash’, ‘other’.

2.3.1 SCAR-Net 2: Network Architecture

SCAR-Net 2 has the same architecture as SCAR-Net 1, but, as there are 10 classes in this task, 10 neurons are used in the last layer. It is also trained with binary cross-entropy loss and the Adam optimizer. Learning rate scheduling was used to reduce the learning rate by a factor of 1.5 after each epoch.
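A learning-rate schedule of this kind can be expressed with a standard Keras callback; the sketch below assumes the low initial rate quoted later for the final training stage and is only illustrative:

```python
import tensorflow as tf

# Hypothetical illustration: reuse the earlier sketch with 10 sigmoid outputs
# (one per micro label) and shrink the learning rate by 1.5x after every epoch.
initial_lr = 1e-5

def shrink_lr(epoch, lr):
    return initial_lr if epoch == 0 else lr / 1.5

scheduler = tf.keras.callbacks.LearningRateScheduler(shrink_lr)
# model = build_scarnet_sketch(n_classes=10)
# model.fit(train_inputs, train_labels, batch_size=1, epochs=5, callbacks=[scheduler])
```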

2.3.2 Result Analysis

For task 2 (micro activity), first the model is trained on the data of three subjects separately for 10 epochs each with binary cross-entropy loss. After that, the model is again trained for 5 epochs on a shuffled set which contains data for all the subjects with a low initial learning rate (0.00001). Table 2 shows the performance metrics after cross-validation. The model was trained with data of 2 subjects, and tested on the other subject. Multi-label classification requires setting a threshold to the sigmoid output from the model to generate hard labels. To select a threshold, the test accuracy for each subject was considered for a range of thresholds. The data is shown in Fig. 3. For the inference stage, a threshold of 0.3 is selected based on the test accuracy on a range of thresholds. The best model achieves an average of 0.27 cross-validation accuracy on the unseen subjects.

Table 2 SCAR-Net 2 performance metrics (task 2)
Accuracy              0.272
F1-score              0.289
Mean absolute error   0.269

Fig. 3 Threshold selection for multi-label classification
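The threshold selection described above can be reproduced with a few lines of NumPy; the sweep range and the toy data below are assumptions for illustration:

```python
import numpy as np

def multilabel_accuracy(Y_true, Y_prob, threshold):
    """Jaccard-style accuracy of hard labels obtained by thresholding sigmoid outputs."""
    Y_pred = (Y_prob >= threshold).astype(int)
    scores = []
    for t, p in zip(Y_true, Y_pred):
        union = np.logical_or(t, p).sum()
        inter = np.logical_and(t, p).sum()
        scores.append(inter / union if union else 1.0)
    return float(np.mean(scores))

def select_threshold(Y_true, Y_prob, candidates=np.arange(0.1, 0.9, 0.05)):
    # Pick the cut-off that maximizes held-out multi-label accuracy.
    return max(candidates, key=lambda th: multilabel_accuracy(Y_true, Y_prob, th))

# Toy example: 4 segments, 10 micro-activity outputs in [0, 1].
rng = np.random.default_rng(1)
Y_true = rng.integers(0, 2, size=(4, 10))
Y_prob = np.clip(Y_true * 0.7 + rng.uniform(0, 0.4, size=(4, 10)), 0, 1)
print("selected threshold:", select_threshold(Y_true, Y_prob))
```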


3 Conclusion In this work, SCAR-Net is presented, which can handle variable-length data in an efficient way without any preprocessing. The model is fast to train and hard to over-fit, and it consists mostly of convolutional blocks, which makes it a suitable choice for the activity recognition task. Even though the model shows promising outcomes, the performance metrics are not high due to the complexity of the cross-subject features; generalization across subjects is a challenging task for data with such a diversified distribution. In the future, SCAR-Net can be augmented by adding an embedding layer at the end to design a Siamese network trained with a triplet loss, which can further improve the accuracy and exploit meta-learning characteristics.

Appendix
See Table 3.
Table 3 Processing and Resources
Sensor modalities: right arm, right wrist, left hip, left wrist
Features used: Convolutional blocks (automatic)
Programming language and libraries used: Python: tensorflow 2, tsfresh, matplotlib
Window size and Post-processing: None
Training and testing time: 4.2, 1.3 mins
Machine specification: RAM: 16 GB, CPU: i7, GPU: RTX 2060

References 1. Lara, O., Labrador, M.: A survey on human activity recognition using wearable sensors. Ieee Commun. Surv. Tutorials 15, 1192–1209 (2012) 2. Krishnan, N., Cook, D.: Activity recognition on streaming sensor data. Pervasive Mob. Comput. 10, 138–154 (2014) 3. Ronao, C., Cho, S.: Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 59, 235–244 (2016) 4. Ordóñez, F., Roggen, D.: Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 16, 115 (2016) 5. Mohamed, R.: Multi-label classification for physical activity recognition from various accelerometer sensor positions. J. Inform. Commun. Technol. 17, 209–231 (2020) 6. Asghari, P., Soleimani, E., Nazerfard, E.: Online human activity recognition employing hierarchical hidden Markov models. J. Ambient Intell. Humanized Comput. 11, 1141–1152 (2020) 7. Wang, L., Liu, R.: Human activity recognition based on wearable sensor using hierarchical deep LSTM networks. Circ., Syst., Sig. Process. 39, 837–856 (2020)


8. Irvine, N., Nugent, C., Zhang, S., Wang, H., Ng, W.: Neural network ensembles for sensor-based human activity recognition within smart environments. Sensors. 20, 216 (2020) 9. Akbari, A., Jafari, R.: Personalizing activity recognition models with quantifying different types of uncertainty using wearable sensors. Ieee Transactions On Bio-medical Engineering. (2020) 10. Qin, Z., Zhang, Y., Meng, S., Qin, Z., Choo, K.: Imaging and fusing time series for wearable sensor-based human activity recognition. Inform. Fusion 53, 80–87 (2020) 11. Soleimani, E., Nazerfard, E.: Cross-subject transfer learning in human activity recognition systems using generative adversarial networks. Arxiv Preprint Arxiv:1903.12489. (2019) 12. Wang, J., Zhao, Y., Ma, X., Gao, Q., Pan, M., Wang, H.: Cross-scenario device-free activity recognition based on deep adversarial networks. Ieee Transactions On Vehicular Technology. (2020) 13. Ismailfawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.: Adversarial attacks on deep neural networks for time series classification. Arxiv. pp. arXiv–1903 (2019) 14. Cooking Activity Recognition Challenge https://abc-research.github.io/cook2020/learn/ 15. Lago, P., Takeda, S., Adachi, K., Alia, S.S., Matsuki, M., Benai, B., Inoue, S., Charpillet, C.: Cooking activity dataset with macro and micro activities. (IEEE Dataport, 2020), https://doi. org/10.21227/hyzg-9m49 16. Christ, M., Braun, N., Neuffer, J., Kempa-liehr, A.: Time series feature extraction on basis of scalable hypothesis tests (tsfresh-a python package). Neurocomputing 307, 72–77 (2018) 17. Alnazi, Z., Biswas, A., Rayhan, M., Abir, T.: Classification of ECG signals by dot Residual LSTM Network with data augmentation for anomaly detection (2019)

Multi-sampling Classifiers for the Cooking Activity Recognition Challenge Ninnart Fuengfusin and Hakaru Tamukoh

Abstract We propose multi-sampling classifiers (MSC), a collection of multi-class and binary classifiers, to address the Cooking Activity Recognition Challenge (CARC). CARC provides macro and micro labels. To deal with these labels, MSC uses a multi-class classifier to recognize the macro labels and 10 binary classifiers to examine whether each micro label exists. To shield the MSC model from sampling noise, we generate three distinct resampling rates of the temporal sensor data. The predictions obtained from the three data sampling rates are gathered together using a soft-majority voting ensemble.

1 Introduction The increasing demands of smart mobile phones, smart watches, and Internet-ofThings (IoT) [23] show a growing trend with respect to the availability of cheap temporal sensor data. This huge amount of sensor data can provide patterns that can be used to gain insights into user activities. Such insights can permit a device to have a broader prior knowledge of a user and be able to smartly interact with the user’s preferences. Activity recognition (AR) aims to map time-series sensor signals into certain predefined activities. Machine learning (ML) algorithms are one of the methods used to achieve AR. However, ML might require domain-specific feature extraction, which consumes a large amount of user time. Recently, the deep learning [18] (DL) model, or deep neural networks, has become one of dominant algorithms in ML, particularly in the field of image recognition [24], object detection [25], and image N. Fuengfusin (B) · H. Tamukoh Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu-ku, Kitakyushu-shi, Fukuoka 808-0196, Japan e-mail: [email protected] H. Tamukoh e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 M. A. R. Ahad et al. (eds.), Human Activity Recognition Challenge, Smart Innovation, Systems and Technologies 199, https://doi.org/10.1007/978-981-15-8269-1_6


segmentation [12]. DL gains an advantage from its ability to automatically recombine given features into new representations. However, the trade-off is that DL requires a large amount of feature-rich data to operate efficiently with its ability to scale with the model [13–15]. Given the merits and demerits of each approach, a benchmark is necessary to measure the relative abilities of different approaches. The cooking activity recognition challenge (CARC) [3] corresponds to time-series sensor data recorded from when users performed cooking activities [16, 17]. The data were recorded from smartphones, smart watches, and motion capture markers. Each signal is segmented into 30 s intervals. The challenge is to recognize each signal segment given the provided macro-activity and multi-micro-activity labels. The macro label indicates the coarse activity, for example, fruitsalad, sandwich, and cereal. The micro label is a finegrained activity that might look at a section of time within a segment, for example, Put, Wash, and other. The CARC dataset consists of 288 training and 180 testing segments. The DL approach is not suitable for the CARC task due to the amount of data, which is not large enough to surpass the performance of conventional ML. Therefore, our considered classifiers use conventional ML approaches. The macro label can be solved directly using a multi-class classifier algorithm. However, the micro labels are different for each segment of data and might contain multi-labels; further, the number of labels is not fixed. To solve this challenge, we propose multi-sampling classifiers (MSC). MSC solves the micro-label problem by treating each micro label as a binary classification problem. In other words, each binary-class classifier corresponds to a micro label. This classifier considers whether the micro label exists or not. Therefore, our backbone MSC model consists of a multi-class classifier for the macro labels and 10 binary classifiers for the 10 unique micro labels. We preprocess the given data using three different sampling rates. The hypothesis of our preprocessing is that the varied sampling rates may allow our classifiers to detect certain patterns that might not be recognized at other sampling rates. With three different sampling rates, the algorithm can view three perceptions of the data trends from simple to complex views. All of the results of the classification of each sampling are used via a soft-voting ensemble. Our main contributions are as follows. • To identify the macro- and micro labels, we propose MSC, that is a combination of a multi-class classifier and multiple binary classifiers. • We propose a training method with preprocessing at three different rates. We show that this might improve the performance of a multi-sampling classifier.

Multi-sampling Classifiers for the Cooking Activity Recognition Challenge

67

2 Related Work In this section, we describe a related work that has a similar research problem to MSC.

2.1 Learning Subclass Representations for Visually Varied Image Classification Learning subclass representations for visually varied image classification (LSR) [19] was proposed for the 2013 Yahoo! large-scale flick-tag classification grand challenge [1]. This challenge consisted of 10 classes that were defined as top-class labels; each class consisted of 150,000 training and 50,000 test images. Each of the 10 classes was a top-rank tag generated by the users. The main challenge was the overgeneralized top-class labels, for example, nature, which can be represented in various contexts with images such as flower, bird, and forest. This results in a model learning little from a given task. This diverse definition of labels causes the model to be less robust. To address this problem, LSR can be trained with other provided tags that are not top-class labels; such labels are defined as subclasses. LSR statistically finds the co-occurrence between subclass and top-class labels; if the co-occurrence ratio of the subclass to the sum of all the top-classes is higher than a predefined threshold, then the subclasses produced by support vector machine (SVM) [8] will be used as the features and put into SVM to classify the top class. LSR concludes that, with the subclass approach, such a model outperforms a model directly trained with top-level labels. In our opinion, applying the LSR approach requires a statically significant amount of data, which the Yahoo! challenge provides. However, in CARC, there are only 288 training segments, with 3 macro labels and 10 different micro labels; therefore, the correlation between each macro label and the micro labels might not be sufficient.

3 Multi-sampling Classifiers In this section, we describe two main MSC principles: multi-classifiers and multisampling.

3.1 Multi-classifiers MSC is proposed to solve CARC; therefore, MSC directly focuses on solving for the macro- and micro labels provided by CARC. The macro labels consist of cereal,


Fig. 1 Multi-classifier receives an input signal and produces the macro- and micro-label predictions

sandwich, and fruitsalad. With three classes, the macro label can be recognized using a multi-class classifier. The major challenge in CARC is the micro labels. MSC converts the micro-label problem into multiple binary classification problems. This conversion is possible for CARC because no micro label is duplicated within a segment, except in one of the training segments. MSC uses the multiple binary classifiers to recognize the micro labels. CARC consists of 10 unique micro labels: Pour, Peel, Put, Take, Cut, Wash, Add, Mix, Open, and other. MSC matches one binary classifier to each unique micro label, and each binary classifier predicts whether its assigned micro label exists or not. Our backbone model is shown in Fig. 1. In this figure, the backbone model is called a multi-classifier. This model contains a multi-class classifier to deal with the macro labels and 10 binary classifiers to deal with the micro labels.
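A minimal sketch of this backbone structure is given below. It assumes segments have already been turned into fixed-length feature vectors X and uses plain scikit-learn random forests as stand-ins for the time-series classifiers discussed later; the MICRO_LABELS list and the MultiClassifier class are illustrative names, not the authors' implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

MICRO_LABELS = ["Pour", "Peel", "Put", "Take", "Cut", "Wash", "Add", "Mix", "Open", "other"]

class MultiClassifier:
    """Backbone model: one multi-class classifier for the macro label and
    one binary classifier per unique micro label."""

    def __init__(self, macro_estimator, micro_estimator):
        self.macro_clf = clone(macro_estimator)
        self.micro_clfs = {label: clone(micro_estimator) for label in MICRO_LABELS}

    def fit(self, X, y_macro, y_micro_sets):
        # X: (n_segments, n_features); y_micro_sets: one set of micro labels per segment
        self.macro_clf.fit(X, y_macro)
        for label, clf in self.micro_clfs.items():
            y_bin = np.array([label in s for s in y_micro_sets], dtype=int)
            clf.fit(X, y_bin)                      # binary target: does this micro label occur?
        return self

    def predict(self, X):
        macro = self.macro_clf.predict(X)
        flags = {label: clf.predict(X) for label, clf in self.micro_clfs.items()}
        micro = [{label for label in MICRO_LABELS if flags[label][i] == 1}
                 for i in range(len(X))]
        return macro, micro

# Plain random forests standing in for the time-series classifiers used in the paper.
model = MultiClassifier(RandomForestClassifier(n_estimators=1000),
                        RandomForestClassifier(n_estimators=50))
```

Treating each micro label as its own binary problem keeps every classifier conventional, at the cost of training 11 models per sampling rate.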

3.2 Multi-sampling Multi-sampling is used to generate three different perspectives of the signals. Sampling with short intervals allows the ML algorithm to focus on every detail of the time-series data. Conversely, sampling with long intervals allows the ML algorithm to view the trend of the time-series data more easily. Nonetheless, because each sensor was recorded at a different rate, all temporal data must be resampled to extend or reduce the signals to a common length. All of the data were resampled using an approach from [2]: the time-series data were first up-sampled to a 1 ms period and then down-sampled to the preferred periods. We selected three sampling rates, 100, 300, and 1000 ms, which differ from each other by roughly a factor of three. With a maximum recorded time length of 30 s, this resulted in time-series data with lengths of 300, 100, and 30, respectively. We selected these values so that the shapes of the resulting waveforms would differ across samplings.


Fig. 2 The three different samplings are input into each multi-classifier. All of the test dataset predictions from the multi-classifiers are combined using a soft-voting ensemble. The collection of multi-classifiers is called the multi-sampling classifiers

The resampled sequence length was kept below 1000 samples because MSC trains a number of models and would otherwise require an extensive training time. With three different sampling rates, three predictions are obtained for each segment. To reduce these predictions to one, we applied soft voting with equal weights. The multi-sampling is summarized in Fig. 2.
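The following sketch illustrates the two-step resampling with pandas, assuming each segment is a DataFrame of x/y/z columns indexed by timestamps; whether the intermediate 1 ms grid is filled by forward-filling or interpolation is our assumption, since the source only describes the two-step procedure.

```python
import pandas as pd

def two_step_resample(df, period_ms):
    """Two-step resampling: up-sample the unevenly spaced signal to a 1 ms grid
    (forward-filling between original samples), then down-sample to the target
    period by averaging."""
    df = df.sort_index()                         # df is indexed by a DatetimeIndex
    fine = df.resample("1ms").mean().ffill()     # 1 ms grid, gaps filled from the last sample
    return fine.resample(f"{period_ms}ms").mean()

# One segment resampled at the three periods used by MSC (100, 300, and 1000 ms):
# views = {p: two_step_resample(segment, p) for p in (100, 300, 1000)}
```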

4 Experimental Results and Discussion We first describe the overall experimental setting. We used all of the provided sensor data without excluding anything. CARC contains a number of segments with missing features; for example, the x-axis of the left-wrist smart watch might be missing for a whole segment. We could not impute with the segment mean when all data of a feature are missing in a segment, so we imputed with the signal mean instead: all segments of the same feature were averaged to obtain the signal mean, and all missing data were replaced with the signal mean of the same feature. Each signal was standardized (zero mean and unit variance). We randomly selected 33% of the training segments as the validation dataset. Each given segment contains nonuniform time stamps, and conventional resampling of an unevenly spaced signal might cause a shift between the original and the resampled signal. Therefore, we applied the two-step resampling from [2]: all sensor data were up-sampled to 1 ms and then down-sampled to 100, 300, and 1000 ms. To evaluate the models, the CARC metric uses the average accuracy, which is defined in Eq. 1, where a is the average accuracy, m_a is the macro accuracy, and m_i is the micro accuracy. The micro-accuracy metric, or Hamming score [9], is defined


Table 1 Macro-label accuracy at the baseline setting for the time-series ML models

Algorithm                        Package   Macro accuracy   Soft voting accuracy
KNeighborsTimeSeriesClassifier   tslearn   0.7396           0.7292
TimeSeriesSVC                    tslearn   0.6458           0.6563
TimeSeriesForestClassifier       sktime    0.8438           0.8542

as in Eq. 2, where n_t is the number of correct predictions, and n_c is the cardinality of the union between labels and predictions.

a = (m_a + m_i) / 2    (1)

m_i = n_t / n_c    (2)
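A small helper implementing Eqs. 1 and 2 might look as follows, where predictions and ground truth are given as sets of micro labels per segment; the function names are illustrative.

```python
def micro_accuracy(pred_labels, true_labels):
    """Hamming score (Eq. 2): correct predictions divided by the cardinality of the
    union of predicted and ground-truth label sets, averaged over all segments."""
    scores = []
    for pred, true in zip(pred_labels, true_labels):
        pred, true = set(pred), set(true)
        union = pred | true
        scores.append(len(pred & true) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

def average_accuracy(macro_acc, micro_acc):
    """Eq. 1: the CARC metric is the mean of macro and micro accuracy."""
    return (macro_acc + micro_acc) / 2
```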

An advantage of MSC is that it allows conventional binary classifiers to solve the multi-label classification problem. Therefore, we only needed to identify a suitable classifier instead of inventing a novel model. To identify the best classifier to apply in MSC, we tried several Python time-series ML packages: tslearn [26], sktime [20], seglearn [4], and tsfresh [7]. We did not select tsfresh because it consumed a considerable amount of time to perform feature engineering on this dataset. seglearn was also not chosen because it does not provide a function that converts element-wise predictions into segment-wise predictions. Therefore, by process of elimination, we selected tslearn and sktime. Time-series SVM (TimeSeriesSVC) and K-nearest neighbor (KNeighborsTimeSeriesClassifier) are included in the tslearn modules. Time-series random forest (TimeSeriesForestClassifier) is included in sktime. We evaluated all of these algorithms using the default settings with the macro-label accuracy as our metric. Each input signal was sampled at 100 ms. The validation scores are listed in Table 1. The "Soft voting accuracy" column in Table 1 additionally reports the macro accuracy after each classifier's predictions at the three different samplings were combined by soft voting. Random forest from the sktime package was selected as the main classifier in MSC because it outperformed the other models. Soft voting over the different samplings improved the accuracy for TimeSeriesSVC and TimeSeriesForestClassifier; for KNeighborsTimeSeriesClassifier, however, it resulted in worse macro accuracy. We found that the macro and micro labels require distinct sets of hyperparameters due to the different complexity of the tasks. In general, a high-complexity model performed well on the macro label but over-fitted the micro labels. Therefore, we selected TimeSeriesForestClassifier with 1000 estimators as the multi-class classifier and TimeSeriesForestClassifier with 50 estimators as the binary


Table 2 Validation results with the different sampling rates and soft voting

Sampling rate   Macro accuracy   Micro accuracy   Average accuracy
1000 ms         0.875            0.7927           0.8339
300 ms          0.8854           0.8255           0.8555
100 ms          0.8542           0.816            0.8351
Soft voting     0.875            0.8191           0.847

classifier. The micro, macro, and average accuracies of these models are shown in Table 2. Our model achieved its best average accuracy at the 300 ms sampling rate. The last row of Table 2 reports the accuracies after soft voting. Soft voting improved on the accuracies of the 100 and 1000 ms sampling rates; however, it did not exceed the accuracy of the 300 ms sampling rate. We expected that the soft-voting accuracy might improve further with a greater number of samplings. However, each additional sampling requires an extra 11 classifiers to be trained, which considerably increases the training time of MSC. We therefore leave this hypothesis for future work. Within the test dataset, we found that MSC produced no micro-label prediction for 26 test segments, whereas this problem did not occur on the validation dataset. We investigated this problem by looking for differences between the training, test, and validation datasets. The major difference is that the training and validation datasets were recorded from subjects 1, 2, and 3, whereas the test dataset was recorded from subject 4. We hypothesized that the distributions of the validation and test datasets differ and that our MSC was not robust enough to handle this shift. Another hypothesis is that converting the micro labels into binary labels might cause high sparsity within the binary labels, so that a binary classifier learns that the output is most likely zero and never produces its micro label. We plan to explore this sparsity hypothesis in future work. If it is correct, the problem might be mitigated by applying techniques for dealing with imbalanced datasets, such as SMOTE [5], Borderline-SMOTE [10], and ADASYN [11]. To fix the missing-label problem with the methods at hand, segments with no predicted micro label were assigned the most frequent micro label in the training data, which in this case was Take, as shown in Fig. 3.
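As an illustration, the soft voting over the three samplings and the fallback to the most frequent micro label can be sketched as follows; prob_per_sampling is assumed to be the stacked predict_proba outputs of the three multi-classifiers, which is our reading of the procedure rather than the authors' exact code.

```python
import numpy as np

def soft_vote(prob_per_sampling):
    """Average the predicted class probabilities obtained at the different
    sampling rates (soft voting) and pick the most likely class."""
    mean_prob = np.mean(prob_per_sampling, axis=0)   # (n_segments, n_classes)
    return mean_prob.argmax(axis=1)

def fill_empty_micro_predictions(micro_sets, fallback="Take"):
    """Replace empty micro-label predictions with the most frequent training
    micro label (Take in this dataset)."""
    return [s if s else {fallback} for s in micro_sets]
```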

5 Conclusion We proposed MSC for CARC. MSC provides a simple alternative for solving a multi-label classification problem by converting it into multiple binary classification problems. This allows conventional binary classifiers to be deployed to solve the multi-label classification problem. On the other hand, this method requires a number of binary classifiers, which increases the computation time. To deal with


Fig. 3 Distribution of the training micro labels after partitioning the sequence of micro labels

CARC, our MSC model consisted of a multi-class classifier and 10 binary classifiers to handle the macro- and micro labels. To make the model robust to resampling noise, our models were trained with three different sampling rates, and all test-data predictions from the different sampling rates were combined by soft voting. Acknowledgements This research was supported by JSPS KAKENHI Grant Number 17K20010.

Appendix As a requirement of CARC, our experimental settings are given in Table 3.

Table 3 Description of our experimental settings

Description                          Setting
Sensor modalities                    We used all provided sensors
Features used                        We applied all provided features
Programming language and libraries   Python with keras [6], sklearn [22], sktime, pandas [21], tslearn, and numpy [27]
Window size and post processing      30,000 ms
Training and testing time            1 h 49 min 12 s
Specifications                       CPU Intel Xeon E5-1620v3, RAM DDR4 64 GB, GPU GeForce GTX 1080


References 1. Yahoo! large-scale flickr-tag image classification grand challenge. http://www.sigmm. org/archive/MM/mm13/acmmm13.org/submissions/call-for-multimedia-grand-challengesolutions/yahoo-large-scale-flickr-tag-image-classification-challenge/index.html (2013). Accessed 18 May 2020 2. Rafael schultze-kraft - building smart iot applications with python and spark. https://www. youtube.com/watch?v=XkBbAymUDEo (2017). Accessed 18 May 2020 3. Alia, S.S., Lago, P., Takeda, S., Adachi, K., Benaissa, B., Ahad, M.A.R., Inoue, S.: Summary of the cooking activity recognition challenge. In: Human Activity Recognition Challenge, Smart Innovation, Systems and Technologies. Springer (2020) 4. Burns, D.M., Whyne, C.M.: Seglearn: A python package for learning sequences and time series. J. Mach. Learn. Res. 19(1), 3238–3244 (2018) 5. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority oversampling technique. J. Art. Intell. Res. 16, 321–357 (2002) 6. Chollet, F., et al.: Keras. https://github.com/fchollet/keras (2015) 7. Christ, M., Braun, N., Neuffer, J., Kempa-Liehr, A.W.: Time series feature extraction on basis of scalable hypothesis tests (tsfresh-a python package). Neurocomputing 307, 72–77 (2018) 8. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995) 9. Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. In: PacificAsia Conference on Knowledge Discovery and Data Mining, pp. 22–30. Springer (2004) 10. Han, H., Wang, W.Y., Mao, B.H.: Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International Conference on Intelligent Computing, pp. 878–887. Springer (2005) 11. He, H., Bai, Y., Garcia, E.A., Li, S.: Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE world congress on computational intelligence), pp. 1322–1328. IEEE (2008) 12. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017) 13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 14. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: European Conference on Computer Vision, pp. 646–661. Springer (2016) 15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012) 16. Lago, P., Takeda, S., Adachi, K., Alia, S.S., Matsuki, M., Benai, B., Inoue, S., Charpillet, F.: Cooking Activity Dataset with Macro and Micro Activities (2020). https://doi.org/10.21227/ hyzg-9m49 17. Lago, P., Takeda, S., Alia, S.S., Adachi, K., Bennai, B., Charpillet, F., Inoue, S.: A dataset for complex activity recognition with micro and macro activities in a cooking scenario (2020). arXiv:2006.10681 18. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015) 19. Li, X., Xu, P., Shi, Y., Larson, M., Hanjalic, A.: Learning subclass representations for visuallyvaried image classification (2016). arXiv:1601.02913 20. Löning, M., Bagnall, A., Ganesh, S., Kazakov, V., Lines, J., Király, F.J.: sktime: a unified interface for machine learning with time series (2019). arXiv:1909.07872 21. 
McKinney, W., et al.: pandas: a foundational python library for data analysis and statistics. In: Python for High Performance and Scientific Computing, vol. 14, No. 9 (2011) 22. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)


23. Perera, C., Liu, C.H., Jayawardena, S., Chen, M.: A survey on internet of things from industrial market perspective. IEEE Access 2, 1660–1679 (2014) 24. Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks (2019). arXiv:1905.11946 25. Tan, M., Pang, R., Le, Q.V.: Efficientdet: Scalable and efficient object detection (2019). arXiv:1911.09070 26. Tavenard, R., Faouzi, J., Vandewiele, G., Divo, F., Androz, G., Holtz, C., Payne, M., Yurchak, R., Rußwurm, M., Kolar, K., Woods, E.: tslearn: A machine learning toolkit dedicated to time-series data. https://github.com/tslearn-team/tslearn (2017) 27. Walt, S.v.d., Colbert, S.C., Varoquaux, G.: The numpy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13(2), 22–30 (2011)

Multi-class Multi-label Classification for Cooking Activity Recognition

Shkurta Gashi, Elena Di Lascio, and Silvia Santini

Abstract In this paper, we present an automatic approach to recognize cooking activities from acceleration and motion data. We rely on a dataset that contains three-axis acceleration and motion data collected with multiple devices, including two wristbands, two smartphones and a motion capture system. The data is collected from three participants while preparing sandwich, fruit salad and cereal recipes. The participants performed several fine-grained activities while preparing each recipe, such as cut and peel. We propose to use a multi-class classification approach to distinguish between cooking recipes and a multi-label classification approach to identify the fine-grained activities. Our approach achieves 81% accuracy in recognizing fine-grained activities and 66% accuracy in distinguishing between different recipes using leave-one-subject-out cross-validation. The multi-class and multi-label classification results are 27 and 50% points higher than the baselines. We further investigate the effect on classification performance of different strategies to cope with missing data and show that imputing missing data with an iterative approach provides a 3% point increase in identifying fine-grained activities. We confirm findings from the literature that extracting features from multiple sensors achieves higher performance than using single-sensor features.

1 Introduction Automatic detection of physical activities—known as Human Activity Recognition (HAR) [1]—is a key research area in mobile and ubiquitous computing. One of the main aims of HAR is to detect user's behavior with the goal of allowing mobile


systems to proactively assist users with their tasks [1]. The wide range of HAR applications also includes remote monitoring of the daily activities of elderly or cognitively impaired people, with the ultimate goal of developing technologies to assist and promote a healthy lifestyle. Quantity and quality of food intake are particularly crucial factors contributing to a healthy lifestyle [23]. An unhealthy diet may lead to nutrition-related diseases, which in turn can reduce the quality of life [21]. A system able to monitor people's cooking and eating behaviors could provide insightful information to the user towards the improvement of their health. For instance, it could remind elderly people living alone of a missing cooking step or help with the monitoring of a healthy diet. While impressive progress has been made in cooking activity recognition and HAR in general, the recognition of complex and fine-grained human activities is still an open research problem [18]. This is first due to the fact that the same activity can be performed in different ways, both by the same person and by different persons [18]. For instance, people use different arm or wrist postures to hold or pick up an object. Another challenge stems from the fact that there is no clear separation between activities but rather continuous motion and repetitive movements [18], which makes it difficult to segment activities precisely and to develop ground truth for each activity separately [1]. Further, a complex activity might be composed of multiple fine-grained activities, which are characterized by low interclass variability and fine-grained body motions [16, 24]. The majority of work in this area focuses only on either complex or fine-grained activity recognition and does not detect multiple fine-grained activities that occur simultaneously. In this paper, we address the problem of identifying both complex coarse-grained and fine-grained cooking activities. In particular, we propose automatic approaches to distinguish actions performed while preparing three different meals, namely, sandwich, fruit salad and cereal. We refer to these three activities as macro-activities. Additionally, we identify several fine-grained activities that occur while preparing each meal, such as cut, peel, take, pour and put. We refer to these fine-grained activities as micro-activities. To distinguish one macro-activity from the others, we develop a multi-class classification pipeline. Since a macro-activity can be composed of multiple micro-activities, we propose to address the problem of micro-activity recognition using a multi-label classification approach. Multi-label learning consists of predicting more than one output category for each input sample [8] and seems appropriate to identify micro-activities occurring while preparing a macro cooking activity. To train and validate our approach, we use an existing dataset, presented in [5–7], which contains acceleration and motion data collected with wristbands, smartphones and a motion capture system. Our results show that macro-activities can be distinguished with an accuracy of 66% using k-nearest neighbours and leave-one-user-out cross-validation, which is a 27% point increase over the baseline classifier that always predicts the majority class. Our multi-label classification approach identifies micro-activities with an accuracy of 81%, which represents a 50% point increase over a baseline that always predicts the most frequent micro-activity.


2 Related Work In the ubiquitous and wearable research community, multiple methods have been proposed for the automatic recognition of human activities [1–3, 12]. The activities explored in these approaches are rather coarse-grained and include full-body movements such as walking, waving and jumping. These activities may not be very relevant for application domains that aim to distinguish more fine-grained activities such as cut and peel or more complex activities such as preparing a sandwich or a salad, as we do in this work. Several researchers have addressed the specific problem of recognizing cooking activities [4, 13, 16, 18, 19, 21–23]. Pham et al. [4] propose a real-time approach to classify fine-grained cooking activities such as peeling, slicing and dicing, using accelerometer data. Their method achieves an accuracy of 82% using a leave-one-subject-out (LOSO) cross-validation approach. Lago et al. [13] investigate the use of single and multiple sensors to distinguish between macro-activities such as setting a table and eating a meal. Their approach achieves an F1 score of 51% using data from multiple sensors and a LOSO validation procedure. In contrast to these approaches, we investigate the problem of recognizing both macro- and micro-activities performed while cooking different recipes. We use a dataset, presented in [5–7], collected with multiple sensors available in wristbands, smartphones and a motion capture system. The authors in [18–20] provide multi-modal sensor datasets of humans performing multiple activities in a kitchen environment, including cooking and food preparation. Tenorth et al. [18], for instance, provide the TUM kitchen dataset, which includes data such as video sequences, full-body motion capture data recorded by a markerless motion tracker, RFID tag readings and more. While these datasets are very diverse and include activities and data similar to the ones used in this work, the authors in [19, 20] do not present an automatic approach to distinguish between cooking activities, and the dataset used in [18] does not contain data from on-body sensors such as wristbands and smartphones, which are less intrusive and more likely to be used by humans while they are cooking their meals. Further, these datasets include video or audio data, which are often considered privacy invasive. In contrast to the work presented above, we investigate the use of multi-class and multi-label classification for micro- and macro-activity recognition. Multi-label classification has been used for cooking ingredient and recipe recognition from images in [25] and for physical activity recognition from accelerometer sensors in [26]. To the best of our knowledge, this approach has not been previously applied to cooking activity recognition from acceleration and motion data.

3 Dataset We describe in this section the dataset used to train and validate the activity classifiers described in the next section.


Fig. 1 Overview of the number of samples for each type of macro-activity (i.e., meals)

Procedure and Participants. Four participants performed three macro-activities (meals), five times each. The data has been collected in a controlled setting; participants followed a script for each macro-activity and were asked to act as naturally as possible. The macro-activities were sandwich, fruit salad and cereal, which included the micro-activities (actions) cut, peel, take, put, pour, wash, mix and open. Devices and Collected Data. Movement data during cooking activities is collected using two wristbands (one for each wrist), two smartphones (on the right arm and left hip) and a motion capture system (mocap) with 29 markers. The wristbands and smartphones were used to collect three-axis acceleration data. The mocap collected three-axis motion data for markers on different parts of the body such as the top/front/rear head, left/right shoulder/elbow/wrist/knee and more. The sampling frequencies of the sensors in the wristband, smartphone and mocap are 100 Hz, 50 Hz and 100 Hz, respectively. The sensor data is then segmented into 30 s windows and annotated with one macro-activity and multiple micro-activities. Figure 1 shows the total number of macro-activity samples collected from three participants. In particular, there are 113 samples of preparing a sandwich, 102 of fruit salad and 73 of cereal; in total there are 288 macro-activity samples. Figures 2, 3 and 4 show the micro-activities that occur while preparing each macro-activity. We use the data from three participants to train and test the classifiers.

4 Cooking Activity Recognition Pipeline The macro- and micro-activity recognition pipelines are composed of data preprocessing, synchronization, imputation strategy, feature extraction, classification and evaluation procedure steps, which are common in HAR [1].


Fig. 2 Number of micro-activities while preparing a cereal

Fig. 3 Number of micro-activities while preparing a fruit salad

Preprocessing. We first preprocess the raw data acquired by the different sensors to be able to analyze them simultaneously. The wristband sensor data was transmitted to the smartphone via a Bluetooth connection, which may cause some sensor readings to arrive at the smartphone with a delay or even duplicated. To account for this, we first sort the data by the time of measurement and drop duplicates. Given that the wristband and mocap sensors captured data with a sampling frequency of 100 Hz and the smartphone with 50 Hz, we resample the wristband and mocap data to a 50 Hz sampling frequency. To


Fig. 4 Number of micro-activities while preparing a sandwich

synchronize the sensor readings, we first get the first and last timestamps when the data was collected from all the sensors. We then generate new timestamps of 50 Hz sampling frequency and add data from each sensor whenever available. Imputation Strategy. A basic strategy to handle missing data is to discard all the data where readings from at least one sensor are missing. However, this implies losing valuable and vast amounts of data especially when multiple sensors are used. To account for this issue, we explore different imputation strategies, namely, mean, constant, most frequent and iterative imputation explained in [8, 28]. For the first three strategies, we impute missing sensor data using the mean or the most frequent value of the sensor data or a constant value (e.g., 0), the latter has also been explored in [11]. For the iterative imputation strategy, we model each sensor data with missing values as a function of other sensors. In particular, the missing sensor data is considered as output y and the other sensor data as input X, then a regressor is fit on (X, y) for all available data and is used to predict the missing values of y. Authors in [10] have also explored this imputation strategy in a different context. We estimate the missing data from one sensor using the data from other available sensors first because the same macro- and micro-activity has been carried out multiple times by a user. Thereby, the available data when the activity is performed once could be indicative of the missing data when the same activity is executed again because the same user might perform the activity in a similar way. For instance, when cutting food, one hand is usually static and the other hand moves (operates the knife). Additionally, multiple sensors have been used to measure the movement in the same part of the body. For instance, the missing values from the left wristband data can be estimated using the mocap marker for the left hand.
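A minimal sketch of this synchronization and of the iterative imputation with scikit-learn is shown below; sensor_frames and the 50 Hz grid construction are assumptions about the data layout, and IterativeImputer stands in for the regressor-based scheme described above.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def synchronize(sensor_frames, target_hz=50):
    """Sort, de-duplicate and resample each sensor stream, then align all
    streams on a common 50 Hz time grid (missing readings become NaN)."""
    aligned = []
    for df in sensor_frames:                    # each df has a DatetimeIndex and unique column names
        df = df[~df.index.duplicated()].sort_index()
        aligned.append(df.resample(f"{1000 // target_hz}ms").mean())
    return pd.concat(aligned, axis=1)

def impute_iterative(merged):
    """Model each sensor column with missing values as a function of the other
    columns and predict the missing entries (iterative imputation)."""
    imputer = IterativeImputer(max_iter=10, random_state=0)
    values = imputer.fit_transform(merged.values)
    return pd.DataFrame(values, index=merged.index, columns=merged.columns)
```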


Feature Extraction. We reduce the 30 s sensor segments into features that might help discriminate between cooking activities. We extract features from each sensor separately as well as by combining different sensors. We group the features into two categories: single-sensor and multi-sensor features. In this way, we investigate which features play a discriminative role in distinguishing between cooking activities. Table 1 shows a detailed list of the features and modalities we use in this work. Single-sensor features. We first compute the magnitude of acceleration and motion for the left and right wrists, left hip, right arm and the markers from the mocap as in [27]. We then extract statistical features from each individual signal, i.e., the x-axis from the left hand or the acceleration magnitude from the right hand. The statistical features we extract are minimum, maximum, mean, median, standard deviation and skewness, as suggested in [1, 12]. We hypothesize that the upper-body position and movement play a more significant role in distinguishing between different macro- and micro-activities because the activities explored in this work involve mostly movements of the upper part of the body, e.g., cut, peel and slice. Therefore, from the mocap, we expect to see a difference in the features mainly from the first 10 markers (e.g., top front or rear head, right or left shoulder, right offset, right or left elbow, right or left wrist). These features can be further grouped by device type into wristband, mocap and smartphone features. Multi-sensor features. We then combine the signals from two sensors available in the same or two different devices. In particular, we compute the direction unit vector between each pair of the following markers from the mocap: top front or rear head, right or left shoulder, right offset, right or left elbow, right or left wrist. We then compute the direction unit vector from signals collected with two devices, such as right wrist data collected from the wristband with left arm data collected from the smartphone, similar to [9]. We then extract statistical features, as for the single-sensor features, from the directional unit vectors. From the data exploration shown in Figs. 2, 3 and 4, we observe that some activities such as take, put and cut are very common for all macro-activities and others such as mix, wash and open are unique to each macro-activity. Thereby, we aim at extracting features that could help to better characterize these activities, which would in turn help to distinguish between macro- and micro-activities. We expect that the statistical features extracted from the combined multi-sensor signals could help to distinguish between the classes. For instance, the distance between the left and right hand could be lower when we wash some food than when we mix the salad. Similarly, when we open the milk for the cereal, the distance between the elbow and wrist might be different from when we wash the food when making a sandwich. We then concatenate all the single- and multi-sensor features in a single feature vector, known as the feature-level fusion approach in HAR [1]. We scale each feature before providing it as input to the classifiers, using the standard scaler,1 as a common preprocessing procedure in [8].

1 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.
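The sketch below illustrates the single- and multi-sensor features described above (magnitude, per-signal statistics and direction unit vectors); the array shapes and the worked example at the end are assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import skew

def magnitude(xyz):
    """Acceleration or motion magnitude from the X-, Y- and Z-axis; xyz is (n_samples, 3)."""
    return np.sqrt((xyz ** 2).sum(axis=1))

def statistical_features(signal):
    """Statistical features extracted from one 30 s signal."""
    return [signal.min(), signal.max(), signal.mean(), np.median(signal),
            signal.std(), skew(signal)]

def direction_unit_vector(a, b):
    """Direction unit vector between two 3-axis signals (e.g., an elbow and a wrist marker)."""
    diff = b - a
    norm = np.linalg.norm(diff, axis=1, keepdims=True)
    return diff / np.where(norm == 0, 1, norm)

# Example for one segment (right_wrist, elbow_l, wrist_l are assumed (n_samples, 3) arrays):
# features = statistical_features(magnitude(right_wrist))                 # single-sensor
# features += [f for axis in direction_unit_vector(elbow_l, wrist_l).T    # multi-sensor
#              for f in statistical_features(axis)]
```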


Table 1 Summary of the features extracted from the five sensor modalities used in this paper. L/R stand for left and right, respectively. H stands for hip and A for arm. Motion or ACC Magn refers to acceleration or motion magnitude calculated from the X-, Y- and Z-axis

Feature group: Single-sensor
  Signals: Wristband (L/R): ACC Magn*; Smartphone (H/A): ACC Magn; Mocap: X-, Y-, Z-axis for the markers Head (Top/Front/Rear), Shoulder (L/R), Elbow (L/R), Wrist (L/R), Offset (R); Mocap: Motion Magn of the markers Shoulder (L/R), Elbow (L/R), Wrist (L/R)*
  Features: statistical features (mean, median, skewness**, minimum, maximum, standard deviation)

Feature group: Multi-sensor
  Signals: direction unit vector between each axis of the following sensors: Elbow (R) with Wrist (R), Elbow (R) with Wrist (L), Elbow (L) with Wrist (R), Elbow (L) with Wrist (L); distance between the following vectors: Wristbands (L/R); Wristband and Smartphone: (R/A) and (L/H); Mocap: Wrist (L) and Elbow (L), Wrist (R) and Elbow (R), Knee (L/R)
  Features: statistical features (as above)

Total features: 312 for macro-activity and 342 for micro-activity recognition

* Features extracted from this signal have been used only for micro-activity recognition
** This feature has been extracted from all the signals except x-, y- and z-axes


Multi-class Macro-Activity Recognition. To distinguish between the sandwich, fruit salad and cereal macro-activities, we set up a multi-class classification problem [8]. We experiment with a range of supervised classifiers including support vector machines, k-nearest neighbours, random forests, decision trees, gradient boosting and multi-layer perceptrons. K-nearest neighbours (kNN) achieved the best results; therefore, we report results using only kNN. As a baseline, we use a classifier that always predicts the majority class, used also in similar problems in [14]. Multi-label Micro-Activity Recognition. From Figs. 2, 3 and 4, we can observe that each 30 s segment may contain a unique micro-activity, such as take, peel or put, or multiple micro-activities, such as cut/peel/take. For this reason, to recognize the micro-activities in a window we set up a multi-label classification problem. In multi-label classification, the classifier learns from a set of instances, where each instance can belong to one or multiple classes, and is able to predict a set of classes for each new instance [8, 15, 25, 26]. We experiment with a range of supervised classifiers that support multi-label classification as suggested in [8], including k-nearest neighbours, random forests, decision trees, multi-layer perceptrons and the extra trees classifier. We obtain the best results using k-nearest neighbours (kNN) and we report only those. We consider as a baseline a classifier that always predicts the most frequent micro-activity. Evaluation Procedure and Metric. We evaluate the generalization of our models to new users by measuring their performance on a subject whose data has not been seen before, known as the leave-one-subject-out (LOSO) or person-independent approach [1]. In this approach, the classifiers are trained with the data of all subjects except one, which is used as a test set. This procedure is repeated for all the subjects and the performance of the model is reported as the average score across all the iterations. To evaluate the performance of the macro- and micro-activity recognition pipelines, we consider the accuracy metric [8]. Accuracy quantifies the fraction of samples correctly classified by the model. For the multi-label classification problem, we first compute the accuracy for each test sample and average the results to obtain an overall metric.
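A compact sketch of the multi-label kNN pipeline with leave-one-subject-out evaluation is given below; the per-sample accuracy is computed here as the ratio of the intersection to the union of predicted and true label sets, which is one common reading of the accuracy described above and an assumption rather than the authors' exact metric code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def loso_multilabel_knn(X, Y, subjects, n_neighbors=5):
    """Leave-one-subject-out evaluation of a kNN multi-label classifier.
    X: (n_segments, n_features); Y: binary indicator matrix of micro-activities;
    subjects: subject id per segment."""
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, groups=subjects):
        scaler = StandardScaler().fit(X[train_idx])
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(scaler.transform(X[train_idx]), Y[train_idx])
        pred = clf.predict(scaler.transform(X[test_idx]))
        # Per-sample multi-label accuracy, averaged over the test samples.
        inter = np.logical_and(pred, Y[test_idx]).sum(axis=1)
        union = np.logical_or(pred, Y[test_idx]).sum(axis=1)
        accuracies.append(np.mean(np.where(union > 0, inter / np.maximum(union, 1), 1.0)))
    return float(np.mean(accuracies))
```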

5 Results and Discussion In what follows, we report the classification results obtained by applying the analysis described in the previous section. In particular, we report the macro- and micro-activity recognition results, the differences in features between activities, and the performance obtained with different imputation strategies and feature groups.


Fig. 5 Accuracy of the multi-class kNN classifier and baseline for distinguishing between macro-activities

Fig. 6 Accuracy of the kNN classifier and baseline for identifying micro-activities using multi-label classification approach

5.1 Macro- and Micro-Activity Recognition Results Figure 5 shows the multi-class classification results for the baseline classifier and the kNN classifier trained using all the features described in Sect. 4 and applying the LOSO validation procedure. The accuracy for kNN is 66%, which is a 27% point increase over the baseline. This implies that we are able to correctly identify 66% of the macro-activities in the dataset. Figure 6 shows the multi-label classification results for the baseline classifier and the kNN classifier trained using all the features described in Sect. 4 and applying the LOSO validation procedure. The accuracy for kNN is 81%, which is a 50% point increase over the baseline classifier that always predicts the most frequent micro-activity. These results imply that we can recognize 81% of the micro-activities that occur while preparing a macro-activity.

5.2 Interpretation of Cooking Activity Features We then investigate the difference between the features for different macro- and micro-activities. In this section, we present some exemplary features and their difference among different classes.


Fig. 7 Distribution of the mean acceleration magnitude from the right wrist for sandwich, fruit salad and cereal macro-activities

Fig. 8 Distribution of the mean acceleration magnitude from the right wrist for different micro-activities explored in this work

Macro-activity features. Figure 7 shows the distribution of the mean acceleration magnitude of the right hand for the three classes. In particular, we can observe that while preparing a salad there are fewer right-hand movements than when preparing a sandwich. This is because when we prepare a salad, we perform more wrist movements such as cut, peel and put (as also shown in Fig. 3), whereas when preparing a sandwich and cereal we perform more full-body and intense movements such as take and wash, as shown in Figs. 2 and 4. Micro-activity features. Figure 8 shows the distribution of the mean acceleration magnitude of the right wrist for the different micro-activities. We observe that the average movement magnitude is higher for take and wash. This is expected as these two activities require full-body movement or movement of both hands in comparison to others such as open, for which we mainly use the wrist. It is also interesting to see that the activity 'other' contains many outliers. This might be due to the nature of this activity, which may include several micro-activities ranging from light to vigorous movements.

5.3 Performance with Different Imputation Strategies We also investigate the impact of the imputation strategies adopted in the cooking activity recognition. Table 2 shows the multi-label classification results using the imputation strategies explored in this work. We achieve 81% accuracy when using the iterative imputation strategy, which is 3% points higher than when not using an


Table 2 Multi-label classification results for iterative, mean, constant, most frequent and no imputation strategies

Imputation strategy   Accuracy
None                  0.78
Most frequent         0.79
Constant              0.80
Mean value            0.80
Iterative             0.81

Fig. 9 Multi-label classification results using features extracted from all modalities (single- and multi-sensor), from wristband, mocap and smartphone only and for using only multi-sensor features

imputation strategy. The increment by 3% points in the performance hints at the importance of using data imputation and the suitability of iterative imputation for improving recognition performance. The performance when using other imputation strategies is 80% for the mean and constant, and 79% for most frequent, which is still slightly higher than no data imputation. We observe similar results for the macro-activity recognition but for simplicity we decide to report only the multi-label classification performance.

5.4 Single-Sensor and Multi-Sensor Feature Performance We then evaluate the performance of micro-activity recognition while cooking using single- and multi-sensor features. Figure 9 shows the multi-label classification results using features extracted from one sensor modality alone or combining multiple sensor modalities. We obtain the best results using the features extracted from all sensors with an accuracy of 81%, which is in line with findings from the literature [13]. This confirms the necessity to have multiple sources of data to capture the characteristics of different and more complex activities, as also discussed in [17]. The performance using only the data from the motion capture system is comparable to using the data from all the sensors. This implies that in case of missing sensors, the motion capture system could be used alone.


6 Limitations and Future Work While this work shows promising results in the identification of macro- and micro cooking activities, future research is needed to overcome the limitations of our approach. One limitation stems from the reliance on sensors from multiple devices. While using multiple devices enhances the recognition performance and data quality, in a real system all the devices might not be available all the time. Future work should focus on optimizing the performance with the fewest number of devices. Additionally, our macro- and micro-activity identification pipelines rely on hand-crafted features and do not capture the temporal nature of the sensor data. Future work should explore deep learning methods, as in [2, 3], to automatically extract features (i.e., using convolutional neural networks) and to consider the sequential nature of the data (i.e., using long short-term memory neural networks), which might more effectively identify micro-activities that occur in a sequential manner.

7 Conclusions In this paper, we present our approach for the automatic recognition of macro- and micro-activities while cooking, using acceleration and motion data from multiple sensors. We show that it is feasible to distinguish between macro-activities with 66% accuracy using a multi-class k-nearest neighbours classifier. We further show that our multi-label classification approach can recognize micro-activities while cooking with an accuracy of 81%. We then show that using data from other sensors to predict missing sensor data increases the performance by 3% points and could be an interesting direction for future research. We confirm findings from related work that using data from multiple sensor modalities yields significantly higher performance than using only some of the sensors alone. Overall, our findings enable new possibilities in the design and development of automatic systems for supporting people in their daily activities.

References 1. Bulling, A., Blanke, U., Schiele, B.: A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput. Surv. (CSUR) 46(3), 1–33 (2014) 2. Radu, V., Tong, C., Bhattacharya, S., Lane, N.D., Mascolo, C., Marina, M.K., Kawsar, F.: Multimodal deep learning for activity and context recognition. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 4, pp. 1–27 (2018) 3. Guan, Yu., Plötz, T.: Ensembles of deep LSTM learners for activity recognition using wearables. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 2, pp. 1–28 (2017)


4. Pham, C., Plötz, T., Oliver, P.: Real-time activity recognition for food preparation. In: Proceedings of the IEEE International Conference on Computing and Communication Technologies, Nagercoil, Tamil Nadu, India (2010) 5. Lago, P., Takeda, S., Adachi, K., Alia, S.S., Matsuki, M., Benai, B., Inoue, S., Charpillet, F.: Cooking activity dataset with Macro and Micro activities. IEEE Dataport (2020). https://doi. org/10.21227/hyzg-9m49 6. Lago, P., Takeda, S., Alia, S.S., Adachi, K., Benaissa, B., Charpillet, F., Inoue, S.: A dataset for complex activity recognition with Micro and Macro activities in a cooking scenario (2020) 7. Alia, S.S., Lago, P., Takeda, S., Adachi, K., Benaissa, B., Rahman Ahad, Md A., Inoue, S.: Summary of the cooking activity recognition challenge. Human Activity Recognition Challenge, Smart Innovation, Systems and Technologies. Springer Nature, Berlin (2020) 8. Géron, A.: Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, Sebastopol (2019) 9. Ahuja, K., Kim, D., Xhakaj, F., Varga, V., Xie, A., Zhang, S., Townsend, J.E., Harrison, C., Ogan, A., Agarwal, Y.: EduSense: practical classroom sensing at scale. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 3, pp. 1–26 (2019) 10. Saha, K., Reddy, M.D., das Swain, V., Gregg, J.M., Grover, T., Lin, S., Martinez, G.J., et al.: Imputing missing social media data stream in multisensor studies of human behavior. In: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 178–184. IEEE (2019) 11. Jaques, N., Taylor, S., Sano, A., Picard, R.: Multimodal autoencoder: a deep learning approach to filling in missing sensor data and enabling better mood prediction. In: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 202–208. IEEE (2017) 12. Janko, V., Rešçiç, N., Mlakar, M., Drobni, V., Gams, M., Slapniar, G., Gjoreski, M., Bizjak, J., Marinko, M., Luštrek, M.: A new frontier for activity recognition: the Sussex-Huawei locomotion challenge. In: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 1511–1520 (2018) 13. Lago, P., Matsuki, M., Inoue, S.: Achieving single-sensor complex activity recognition from multi-sensor training data (2020). arXiv:2002.11284 14. Meurisch, C., Gogel, A., Schmidt, B., Nolle, T., Janssen, F., Schweizer, I., Mühlhäuser, M.: Capturing daily student life by recognizing complex activities using smartphones. In: Proceedings of the 14th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 156–165 (2017) 15. Sorower, M.S.: A Literature Survey on Algorithms for Multi-label Learning, vol. 18, pp. 1-25. Oregon State University, Corvallis (2010) 16. Rohrbach, M., Amin, S., Andriluka, M., Schiele, B.: A database for fine grained activity detection of cooking activities. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1194–1201. IEEE (2012) 17. Zinnen, A., Blanke, U., Schiele, B.: An analysis of sensor-oriented vs. model-based activity recognition. In: 2009 International Symposium on Wearable Computers, pp. 93–100. IEEE (2009) 18. Tenorth, M., Bandouch, J., Beetz, M.: The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition. 
In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 1089–1096. IEEE (2009) 19. De la Torre, F., Hodgins, J., Bargteil, A., Martin, X., Macey, J., Collado, A., Beltran, P.: Guide to the Carnegie Mellon University multimodal activity (CMU-MMAC) database (2009) 20. Roggen, D., Calatroni, A., Rossi, M., Holleczek, T., Förster, K., Tröster, G., Lukowicz, P., et al.: Collecting complex activity datasets in highly rich networked sensor environments. In: 2010 Seventh International Conference on Networked Sensing Systems (INSS), pp. 233–240. IEEE (2010)


21. Whitehouse, S., Yordanova, K., Paiement, A., Mirmehdi, M.: Recognition of unscripted kitchen activities and eating behaviour for health monitoring, pp. 1–6 (2016) 22. Yordanova, K., Whitehouse, S., Paiement, A., Mirmehdi, M., Kirste, T., Craddock, I.: What’s cooking and why? Behaviour recognition during unscripted cooking tasks for health monitoring. In: 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), pp. 18-21. IEEE (2017) 23. Yordanova, K., Lüdtke, S., Whitehouse, S., Krüger, F., Paiement, A., Mirmehdi, M., Craddock, I., Kirste, T.: Analysing cooking behaviour in home settings: towards health monitoring. Sensors 19(3), 646 (2019) 24. Rohrbach, M., Rohrbach, A., Regneri, M., Amin, S., Andriluka, M., Pinkal, M., Schiele, B.: Recognizing fine-grained and composite activities using hand-centric features and script data. Int. J. Comput. Vis. 119(3), 346–373 (2016) 25. Bolaños, M., Ferrà, A., Radeva, P.: Food ingredients recognition through multi-label learning. In: International Conference on Image Analysis and Processing, pp. 394-402. Springer, Cham (2017) 26. Mohamed, R.: Multi-label classification for physical activity recognition from various accelerometer sensor positions. J. Inf. Commun. Technol. 17(2), 209–231 (2020) 27. Leeger-Aschmann, C.S., Schmutz, E.A., Zysset, A.E., Kakebeeke, T.H., Messerli-Bürgy, N., Stülb, K., Arhab, A. et al.: Accelerometer-derived Physical Activity Estimation in Preschoolers–comparison of Cut-point Sets Incorporating the Vector Magnitude vs the Vertical Axis. BMC public health 19, no. 1, p. 513 (2019) 28. Burkov, A.: The Hundred-page Machine Learning Book. In: Burkov, A. (ed.) Quebec City (2019)

Cooking Activity Recognition with Convolutional LSTM Using Multi-label Loss Function and Majority Vote

Atsuhiro Fujii, Daiki Kajiwara, and Kazuya Murao

Abstract This paper reports the entry of team Rits's cooking to the Cooking Activity Recognition Challenge held at the International Conference on Activity and Behavior Computing (ABC 2020). Our approach leverages a convolutional layer and an LSTM to recognize macro activities (recipes) and micro activities (body motions). For micro activities, which consist of multiple labels in a segment, the loss is calculated using the BCEWithLogitsLoss function in PyTorch for each body part, and the final decision is made by majority vote over the classification results of the body parts.

1 Introduction This paper reports the solution of our team “Rits’s cooking” to Cooking Activity Recognition Challenge held at International Conference on Activity and Behavior Computing (ABC2020). Activity recognition can enrich our lives by identifying the characteristics of human behavior, therefore, there have been a lot of research on human activity recognition (HAR). Pierluigi et al. [5] proposed HAR method based on accelerometer data using a wearable device. Atallah et al. [4] conducted a study on sensor positioning for HAR using wearable accelerometers. HAR using built-in sensors in smartphones is also popular. Bayat et al. [1] conducted a study on HAR using accelerometer data from smartphones. Wang et al. [2] conducted a comparative study on HAR using inertial sensors in a smartphone. Chen et al. [11] conducted a performance analysis of smartphone-sensor behavior for HAR. In addition, a device-free HAR method has been proposed by Wang et al. [10]. Recently, a lot of research on activity recognition using neural networks have been conducted, and a high degree of accuracy has been achieved. Chen et al. [13] proposed A. Fujii · D. Kajiwara · K. Murao (B) Graduate School of Information Science and Engineering, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan e-mail: [email protected]

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 M. A. R. Ahad et al. (eds.), Human Activity Recognition Challenge, Smart Innovation, Systems and Technologies 199, https://doi.org/10.1007/978-981-15-8269-1_8

91

92

A. Fujii et al.

a deep learning approach to HAR based on a single accelerometer. Wenchao et al. [9] conducted a study on HAR using wearable sensors and deep convolutional neural networks. Chen et al. [12] proposed an LSTM-based feature extraction approach to recognize human activities using tri-axial accelerometer data. Ordóñez et al. proposed deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition [3]. Most conventional methods consider single-label activities, which means only one non-overlapping label is given to the input data. However, the cooking activity dataset we handle in this challenge includes micro and macro activities. The micro activities come as single segments carrying multiple labels. In addition, the number of samples in a segment differs for each sensor. In this paper, we constructed a network with a convolutional layer and an LSTM layer for the sensors at each body part. Handcrafted features are employed as the input of the network. For micro activities, BCEWithLogitsLoss is used as the loss function to evaluate the multi-label data. The four decisions obtained with the networks are then merged into one final output by majority vote.

2 Challenge In this challenge, each team competes with the others on the recognition accuracy of cooking activities. This section introduces the challenge goal, the dataset, and the evaluation criteria.

2.1 Challenge Goal The goal of the Cooking Activity Recognition Challenge is to recognize both the macro activity (recipe) and the micro activities taking place in a 30 s window based on acceleration data and motion capture data. The training dataset contains data from 3 subjects together with all activity labels. The test dataset contains data from the remaining subject and is not labeled. Participants must submit their predicted macro and micro activities on the test dataset using their models.

2.2 Dataset This section introduces the dataset used for this challenge. For more details, please refer to these articles [6–8].

2.2.1 Sensors and Subjects

The data has been collected from four subjects who had attached two smartphones on the right arm and left hip, two smart watches on both wrists, and one motion capture system with 29 markers. The subjects cooked three recipes (sandwich, fruit salad, and cereal) five times each by following a script for each recipe, but acted as naturally as possible.

2.2.2 Data Structure

Training data contains data from three subjects (subjects 1, 2, and 3) out of the four subjects, and test data contains the data from the fourth subject (subject 4). Each recording has been segmented into 30 s segments. Each segment was assigned a random identifier, so the order of the segments is unknown. Each sensor data segment is stored in a separate file, with the segment-id used to identify related files; segments of the four sensors at the same time frame were assigned the same identifier. Ground truth for all the segments is stored in one file. This file contains one row per segment, and each row contains the file name, the macro activity, and the micro activities, all separated by commas; for example, [subject1_file_939,fruitsalad,Take,Peel,], which means that in segment 939, subject 1 took something and peeled something while making the fruit salad. The micro activity is a multi-label recognition task. The macro activity is of three classes: sandwich, fruitsalad, and cereal; and the micro activity is of ten classes: Cut, Peel, Open, Take, Put, Pour, Wash, Add, Mix, and other.
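A small parser for this ground-truth format, turning each row into a macro class index and a multi-hot micro-label vector suitable for a multi-label loss, might look as follows; the file name and label orderings used here are illustrative assumptions.

```python
import numpy as np

MICRO_LABELS = ["Cut", "Peel", "Open", "Take", "Put", "Pour", "Wash", "Add", "Mix", "other"]
MACRO_LABELS = ["sandwich", "fruitsalad", "cereal"]

def parse_groundtruth(path):
    """Parse the ground-truth file: each row holds the segment file name, the macro
    activity, and a variable number of micro activities, separated by commas."""
    segments = {}
    with open(path) as f:
        for line in f:
            fields = [v for v in line.strip().split(",") if v]
            name, macro, micros = fields[0], fields[1], fields[2:]
            macro_id = MACRO_LABELS.index(macro)
            micro_vec = np.zeros(len(MICRO_LABELS), dtype=np.float32)
            for m in micros:
                micro_vec[MICRO_LABELS.index(m)] = 1.0   # multi-hot target for BCE loss
            segments[name] = (macro_id, micro_vec)
    return segments

# e.g. parse_groundtruth("training_groundtruth.csv")["subject1_file_939"]
# -> (1, array([0., 1., 0., 1., 0., 0., 0., 0., 0., 0.], dtype=float32))
```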

2.2.3 Statistics

Table 1 shows the number of segments for each subject, the number of annotated classes of macro activity (one in this challenge), max, mean, and min number of annotated classes of micro activities, max, mean, and min length of the segments.

2.3 Evaluation Criteria Submissions will be evaluated by the average of the accuracy of macro-activity classification (ma) and the average accuracy of micro-activity classification (mi), that is, (ma + mi)/2. The average accuracy of micro-activity classification is based on the multi-label accuracy formula. The accuracy of one sample is given by accuracy = |P ∩ G| / |P ∪ G|: the number of correct labels predicted (logical product of the prediction set P and the ground-truth set G) divided by the number of total true and predicted labels (logical sum of P and G).


Table 1 Statistics of the dataset

Subject | # of segments | # of macro | # of micro (Max / Mean / Min) | Body part | Length (Max / Mean / Min)
1 | 80 | 1 | 5 / 2.09 / 1 | Left hip | 159 / 131.9 / 1
  |    |   |              | Left wrist | 8191 / 2945 / 0
  |    |   |              | Right arm | 1470 / 1309 / 8
  |    |   |              | Right wrist | 8257 / 4484 / 0
2 | 105 | 1 | 6 / 2.30 / 1 | Left hip | 505 / 428.3 / 10
  |     |   |              | Left wrist | 5986 / 2171 / 0
  |     |   |              | Right arm | 1500 / 1272 / 8
  |     |   |              | Right wrist | 2992 / 2465 / 0
3 | 103 | 1 | 6 / 2.26 / 1 | Left hip | 519 / 429.3 / 32
  |     |   |              | Left wrist | 5529 / 774.6 / 0
  |     |   |              | Right arm | 1594 / 1182 / 164
  |     |   |              | Right wrist | 5938 / 3559 / 0
4 | 180 | 1 | Unknown / Unknown / Unknown | Left hip | 534 / 406.7 / 46
  |     |   |                             | Left wrist | 7143 / 1126 / 0
  |     |   |                             | Right arm | 1479 / 1233 / 86
  |     |   |                             | Right wrist | 8761 / 2080 / 0

3 Method

This section describes the preprocessing that obtains features from the raw data, the structure of the model, the loss function and the optimizer, and the process of obtaining the activity labels from the predicted one-hot vectors. Note that our method does not use the motion capture data.


3.1 Preprocessing

Handcrafted feature values are extracted from the raw data [[x_1, ..., x_N], [y_1, ..., y_N], [z_1, ..., z_N]], where x, y, z are the raw data of the x, y, and z axes and N is the number of samples in a 30 s segment. The features are mean, variance, max, min, root mean square (RMS), interquartile range (IQR), and zero crossing rate (ZCR) for the x, y, and z axes, respectively. These features are calculated over a 3 s window slid in steps of 50 ms. This preprocessing yields a 7 features × 3 axes = 21-dimensional feature time series for one sensor. The dataset includes data from four body parts; therefore, this process is conducted for each sensor.
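A rough sketch of this windowed feature extraction for a single sensor might look as follows; the handling of the sampling rate is ours, since the actual rate varies per sensor.

```python
import numpy as np

def zero_crossing_rate(v):
    """Fraction of consecutive samples whose sign changes."""
    return float(np.mean(np.abs(np.diff(np.sign(v))) > 0))

def axis_features(v):
    """The seven per-axis features: mean, variance, max, min, RMS, IQR, ZCR."""
    return [
        np.mean(v), np.var(v), np.max(v), np.min(v),
        np.sqrt(np.mean(v ** 2)),                        # RMS
        np.percentile(v, 75) - np.percentile(v, 25),     # IQR
        zero_crossing_rate(v),
    ]

def extract_feature_series(acc, fs, win_s=3.0, step_s=0.05):
    """acc: (N, 3) raw accelerometer segment; returns an (N', 21) feature time series."""
    win, step = int(win_s * fs), max(int(step_s * fs), 1)
    feats = []
    for start in range(0, len(acc) - win + 1, step):
        w = acc[start:start + win]
        feats.append(np.concatenate([axis_features(w[:, k]) for k in range(3)]))
    return np.asarray(feats)
```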

3.2 Model

Figure 1 shows the structure of our model. The 21-dimensional feature time-series data is fed into our model, which consists of a 1D convolutional layer, an LSTM layer, a linear layer, and a sigmoid layer. The process at each layer is as follows:

• The 1D convolutional layer has an input of sequence length N′ × 21 channels and an output of sequence length N″ × map size M. N′ is the length of the time series after feature extraction, which is smaller than that of the original raw data. N″ is the length of the time series after the convolution, which is N′ − K + 1, where K is the kernel size. N′ and N″ are variable lengths because the dataset has missing values and the sampling frequency differs between sensors. Segments whose length is less than 10 are discarded and not fed into the model. The kernel size K is set to 10. The map size M is the number of filters and is set to 6 × 21 = 126. The convolution is depth-wise, i.e., it is conducted for each channel, with 6 filters per channel.
• The LSTM layer has an input of sequence length × 126 channels and an output of a 24-dimensional tensor. The LSTM is many-to-one: the number of hidden units is 24, and the last output of the LSTM layer is taken. At this point, the output is no longer a time series but a single tensor.
• The linear layer has an input of a 24-dimensional flattened tensor and an output of a 10- or 3-dimensional tensor: 3 dimensions for macro (recipe) recognition and 10 dimensions for micro activities.
• The sigmoid layer applies the sigmoid activation function to the 10- or 3-dimensional tensor, which represents the likelihood of the classes.
• The activation layer takes the 10-dimensional tensor and outputs a 10-dimensional one-hot vector for micro-activity recognition. This layer activates the predicted classes whose values exceed the threshold Th. The output vector can have multiple 1 elements since the data is multi-labeled, e.g., [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] or [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]. The threshold Th is determined in the training phase by searching for the best accuracy while varying the threshold from 0 to 1. For recipe recognition, the output of the sigmoid layer is used directly for the final activation.

Fig. 1 Our model. The first layer is the raw data as provided. The second layer is the handcrafted features consisting of 21 channels. The Conv1d layer is a one-dimensional convolutional layer with 6 maps per channel, 126 maps in total. The 126-channel time-series data is fed into the next LSTM layer with 24 hidden units. The 24-dimensional tensor is then shrunk to a 10-dimensional tensor, the sigmoid function is applied, and the result is one-hot encoded with a predetermined threshold. Finally, the four one-hot encoded vectors are merged and the final multi-label prediction is obtained
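The per-sensor network can be sketched in PyTorch roughly as follows. The class name and the shape handling are ours; the sigmoid and thresholding are applied outside the module because BCEWithLogitsLoss (Sect. 3.3) expects raw logits.

```python
import torch
import torch.nn as nn

class ConvLSTMNet(nn.Module):
    """Sketch of the per-sensor model: depth-wise Conv1d -> many-to-one LSTM -> Linear.
    Sizes follow the text: 21 input channels, 6 filters per channel (126 maps),
    kernel size 10, 24 LSTM hidden units, 10 (micro) or 3 (macro) output classes."""
    def __init__(self, n_classes, in_ch=21, filters_per_ch=6, kernel=10, hidden=24):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, in_ch * filters_per_ch, kernel, groups=in_ch)
        self.lstm = nn.LSTM(in_ch * filters_per_ch, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, 21, N')
        h = self.conv(x)               # (batch, 126, N'') with N'' = N' - K + 1
        h = h.transpose(1, 2)          # (batch, N'', 126) for the LSTM
        _, (h_n, _) = self.lstm(h)     # last hidden state: (1, batch, 24)
        return self.fc(h_n[-1])        # class logits: (batch, n_classes)
```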

3.3 Loss Function and Optimizer

The models for the four sensors are trained separately. For micro activities, the model is trained with BCEWithLogitsLoss in PyTorch, with the weight set to one for all classes. For the macro activity, CrossEntropyLoss in PyTorch is used as the loss function. Adam is used as the optimizer for both macro- and micro-activity models.
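Assuming the model sketch above, the corresponding training setup would look roughly like this:

```python
# One micro model and one macro model per body part (sketch).
micro_model = ConvLSTMNet(n_classes=10)    # multi-label micro activities
macro_model = ConvLSTMNet(n_classes=3)     # single-label macro activity (recipe)

micro_criterion = nn.BCEWithLogitsLoss()   # targets: multi-hot float vectors
macro_criterion = nn.CrossEntropyLoss()    # targets: class indices
micro_optimizer = torch.optim.Adam(micro_model.parameters())
macro_optimizer = torch.optim.Adam(macro_model.parameters())
```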

3.4 Final Prediction Classes Activation

Through the process above, up to four predictions are obtained. Finally, our method merges the predictions and outputs the final prediction. For micro-activity recognition, the four one-hot vectors are summed, and the final prediction is made as follows. Note that segments whose length is less than 10 are not fed into the system, in which case the system does not output a prediction; therefore, the cases in which the number of predictions is one, two, or three must also be considered. When the number of predictions is one or two (i.e., the segments of three or two sensors were too short to be fed into the system), indices whose summed value is greater than or equal to 1 are activated as the final prediction. When the number of predictions is three or four, indices whose summed value is greater than or equal to 2 are activated. For example, suppose that the micro activities ["Cut", "Peel", "Open", "Take", "Put", "Pour", "Wash", "Add", "Mix", "other"] are one-hot encoded and the predictions of the four sensors are [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] for the left hip, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0] for the left wrist, [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] for the right arm, and [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] for the right wrist. The summed vector is [2, 1, 2, 0, 0, 0, 0, 0, 0, 0] and the number of predictions is four, so the indices whose values are greater than or equal to 2, i.e., indices 0 and 2, are activated, and our method outputs Cut and Open as the micro-activity prediction for the segment. Index 1 (Peel) is not activated. For macro-activity recognition, the four vectors from the sigmoid layer are summed, and the index with the maximal value is activated, since the macro activity is a single label. For example, suppose the macro activities are ["sandwich", "fruitsalad", "cereal"] and the four sigmoid vectors are [0.1, 0.5, 0.9] for the left hip, [0.1, 0.2, 0.6] for the left wrist, [0.1, 0.6, 0.8] for the right arm, and [0.3, 0.2, 0.7] for the right wrist. The summed vector is [0.6, 1.5, 3.0]; index 2, which has the greatest value, is activated, and our method outputs cereal as the macro-activity prediction for the segment. No threshold is used for macro-activity recognition since the macro activity is a single label.
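The merging rule described above can be expressed compactly; this is a sketch of the vote logic only, with the thresholds taken from the text.

```python
import numpy as np

def merge_micro(onehot_predictions):
    """Majority vote over the available per-sensor one-hot vectors (up to four).
    The vote threshold is 1 when only one or two sensors produced a prediction,
    and 2 when three or four did."""
    votes = np.sum(onehot_predictions, axis=0)
    needed = 1 if len(onehot_predictions) <= 2 else 2
    return (votes >= needed).astype(int)

def merge_macro(sigmoid_vectors):
    """Sum the per-sensor sigmoid outputs and take the arg-max class."""
    return int(np.argmax(np.sum(sigmoid_vectors, axis=0)))

# Example from the text: indices 0 (Cut) and 2 (Open) are activated.
sensor_preds = [[1,0,0,0,0,0,0,0,0,0], [1,0,1,0,0,0,0,0,0,0],
                [0,0,1,0,0,0,0,0,0,0], [0,1,0,0,0,0,0,0,0,0]]
print(merge_micro(sensor_preds))   # -> [1 0 1 0 0 0 0 0 0 0]
```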

4 Evaluation

This section describes the evaluation environment, the loss and accuracy in the training phase, and the processing time in the training and testing phases.

4.1 Environment

We implemented the program in Python 3.6.7 with PyTorch 1.4.0, CUDA 10.0, and cuDNN 7402. The specification of the computer used for the evaluation is as follows: OS, Windows 10 Pro; CPU, Intel Core i7-8700K 3.7 GHz; RAM, DDR4 64 GB; GPU, NVIDIA GeForce RTX 2080Ti GDDR6 11 GB. All the data were stored on a local HDD. In the training phase, all the data of subjects 1, 2, and 3 (288 segments) were used for training in each epoch, and training ran for 1,000 epochs.

4.2 Result

Table 2 shows the maximum accuracy and minimum loss of micro and macro activities over 1,000 epochs for the four sensor positions, obtained by changing the training and test data. The loss was calculated using the vectors in the sigmoid layer in Fig. 1.


Table 2 Maximum accuracy and minimum loss of micro- and macro activities over 1,000 epochs for four sensor positions by changing training data and test data

Activity type | Train data | Test data | Sensor position | Max. accuracy | Min. loss
Micro | Subjects 1, 2 | Subject 3 | Left hip | 0.593 | 0.396
      |               |           | Left wrist | 0.556 | 0.454
      |               |           | Right arm | 0.591 | 0.394
      |               |           | Right wrist | 0.405 | 0.498
      | Subjects 2, 3 | Subject 1 | Left hip | 0.597 | 0.370
      |               |           | Left wrist | 0.432 | 0.536
      |               |           | Right arm | 0.516 | 0.393
      |               |           | Right wrist | 0.441 | 0.479
      | Subjects 1, 3 | Subject 2 | Left hip | 0.564 | 0.381
      |               |           | Left wrist | 0.534 | 0.490
      |               |           | Right arm | 0.596 | 0.374
      |               |           | Right wrist | 0.432 | 0.452
      | Subjects 1, 2, 3 | Subjects 1, 2, 3 | Left hip | 0.717 | 0.260
      |                  |                  | Left wrist | 0.761 | 0.245
      |                  |                  | Right arm | 0.769 | 0.208
      |                  |                  | Right wrist | 0.688 | 0.261
Macro | Subjects 1, 2 | Subject 3 | Left hip | 0.520 | 1.051
      |               |           | Left wrist | 0.519 | 1.008
      |               |           | Right arm | 0.539 | 1.078
      |               |           | Right wrist | 0.510 | 1.090
      | Subjects 2, 3 | Subject 1 | Left hip | 0.494 | 1.097
      |               |           | Left wrist | 0.298 | 1.108
      |               |           | Right arm | 0.603 | 1.029
      |               |           | Right wrist | 0.268 | 1.140
      | Subjects 1, 3 | Subject 2 | Left hip | 0.535 | 1.062
      |               |           | Left wrist | 0.596 | 1.008
      |               |           | Right arm | 0.520 | 1.047
      |               |           | Right wrist | 0.489 | 1.081
      | Subjects 1, 2, 3 | Subjects 1, 2, 3 | Left hip | 0.904 | 0.280
      |                  |                  | Left wrist | 0.905 | 0.366
      |                  |                  | Right arm | 0.911 | 0.271
      |                  |                  | Right wrist | 0.992 | 0.065


Table 3 CPU and GPU memory usage and time taken in training and testing. These figures are when data of four body parts are processed at once

Resource | Macro | Micro
CPU memory | 2391 MB | 2391 MB
GPU memory | 1.6 GB | 1.6 GB
Training time (1,000 epochs) | 21.554 s | 28.891 s
Testing time (1,000 epochs) | 58.042 s | 59.299 s

From these results, average accuracies of 0.521 and 0.491 were achieved among subjects 1, 2, and 3 in a leave-one-subject-out manner for micro and macro activities, respectively. Considering that there are ten multi-label micro activities, an accuracy of 0.521 can be considered good, while the accuracy of 0.491 for the 3-class macro activity leaves room for improvement. Comparing the results obtained with the test data of subjects 1, 2, and 3, there is little difference among them, suggesting that the data of the three subjects are similar. However, the results for both wrists are lower than those for the hip and the arm. This suggests that individual characteristics appear in hand movements while cooking, and the constructed model does not generalize well for them. When the data of subjects 1, 2, and 3 were used for both training and testing, the accuracies for micro and macro activities improved to 0.734 and 0.928, respectively. Note that for the results submitted for the data of subject 4, our models were trained separately for each body part with the data of subjects 1, 2, and 3, and the models at the 1,000th epoch were used for testing. Table 3 shows the memory usage on the CPU and GPU and the processing time taken in the training and testing phases.

5 Conclusion

This paper reported the solution of our team “Rits’s cooking” to the Cooking Activity Recognition Challenge held at the International Conference on Activity and Behavior Computing (ABC2020). Our approach leverages a convolutional layer and an LSTM to recognize macro activities (recipes) and micro activities (body motions). The evaluation results showed that average accuracies of 0.521 and 0.491 were achieved among subjects 1, 2, and 3 in a leave-one-subject-out manner for micro and macro activities, respectively. We plan to construct a streamlined end-to-end model that does not rely on handcrafted features or the majority vote.


6 Appendix

6.1 Used Sensor Modalities

Four acceleration sensors at the left hip, left wrist, right arm, and right wrist from three subjects were used. Mocap data was NOT used.

6.2 Features Used

Seven kinds of features were used: mean, variance, max, min, root mean square (RMS), interquartile range (IQR), and zero crossing rate (ZCR). These features were extracted for the x, y, and z axes, respectively.

6.3 Programming Language and Libraries Used

Python 3.6.7 was used. For the network implementation, PyTorch 1.4.0 was used.

6.4 Window Size and Post-processing

The window size is 3 s and the step size is 50 ms.

6.5 Training and Testing Time

Training time (1,000 epochs) was 21.554 s for the macro activity and 28.891 s for micro activities. Testing time (1,000 epochs) was 58.042 s for the macro activity and 59.299 s for micro activities.

6.6 Machine Specification

OS: Windows 10 Pro. CPU: Intel Core i7-8700K 3.7 GHz. RAM: DDR4 64 GB. GPU: NVIDIA GeForce RTX 2080Ti GDDR6 11 GB.


References

1. Bayat, A., Pomplun, M., Tran, D.A.: A study on human activity recognition using accelerometer data from smartphones 34, 450–457 (2014)
2. Wang, A., Chen, G., Yang, J., Zhao, S., Chang, C.: A comparative study on human activity recognition using inertial sensors in a smartphone 16(11), 4566–4578 (2016)
3. Ordóñez, F.J., Roggen, D.: Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition 16, 1–25 (2016)
4. Atallah, L., Lo, B., King, R., Yang, G.: Sensor positioning for activity recognition using wearable accelerometers 5(4), 320–329 (2011)
5. Casale, P., Pujol, O., Radeva, P.: Human activity recognition from accelerometer data using a wearable device. In: Pattern Recognition and Image Analysis, pp. 289–296 (2011)
6. Lago, P., Takeda, S., Adachi, K., Shamma Alia, S., Matsuki, M., Benaissa, B., Inoue, S., Charpillet, F.: Cooking activity dataset with macro and micro activities (2020). https://doi.org/10.21227/hyzg-9m49
7. Lago, P., Takeda, S., Shamma Alia, S., Adachi, K., Benaissa, B., Charpillet, F., Inoue, S.: A dataset for complex activity recognition with micro and macro activities in a cooking scenario (2020)
8. Shamma Alia, S., Lago, P., Takeda, S., Adachi, K., Benaissa, B., Ahad, Md A.R., Inoue, S.: Summary of the cooking activity recognition challenge (2020)
9. Jiang, W., Yin, Z.: Human activity recognition using wearable sensors by deep convolutional neural networks. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1307–1310 (2015)
10. Wang, W., Liu, A.X., Shahzad, M., Ling, K., Lu, S.: Device-free human activity recognition using commercial WiFi devices 35(5), 1118–1131 (2017)
11. Chen, Y., Shen, C.: Performance analysis of smartphone-sensor behavior for human activity recognition 5, 3095–3110 (2017)
12. Chen, Y., Zhong, K., Zhang, J., Sun, Q., Zhao, X.: LSTM networks for mobile human activity recognition. In: 2016 International Conference on Artificial Intelligence: Technologies and Applications, pp. 50–53 (2016)
13. Chen, Y., Xue, Y.: A deep learning approach to human activity recognition based on single accelerometer. In: 2015 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1488–1492 (2015)

Identification of Cooking Preparation Using Motion Capture Data: A Submission to the Cooking Activity Recognition Challenge

Clément Picard, Vito Janko, Nina Reščič, Martin Gjoreski, and Mitja Luštrek

C. Picard: École normale supérieure de Rennes, Bruz, France. C. Picard, V. Janko, N. Reščič, M. Gjoreski, M. Luštrek: Jožef Stefan Institute, Ljubljana, Slovenia.

Abstract The Cooking Activity Recognition Challenge tasked the competitors with recognizing food preparation using motion capture and acceleration sensors. This paper summarizes our submission to this competition, describing how we reordered and relabeled the training data and how we handcrafted features for this dataset. Our classification pipeline first detected basic user actions (micro-activities); using them it recognized the recipe, and then used the recipe to refine the original micro-activity predictions. After a post-processing step using a Hidden Markov Model, we achieved a competition score of 95% on the training data with cross-validation.

1 Introduction

Being able to perform basic daily activities such as cooking, dressing, bathing, and moving in and out of a bed is essential to older adults’ quality of life and health. Loss of independence in these activities is strongly associated with higher use of health services, nursing home placement, and death [1]. Ambient assisted technology can support older adults in performing their daily activities, for example, by giving them advice, reminding them of key tasks, or calling for help if needed. An important first step, however, is to recognize these activities. This can be accomplished by wearable and ambient sensors combined with machine learning, a combination that has already been adopted in some domains [2]. The Cooking Activity Recognition Challenge [3] aims to further push the state of the art in using wearable and ambient sensors for the recognition of food preparation. The organizers created a dataset [4, 5] captured


by wearable accelerometers and motion capture (mocap) sensors, and the competitors had to detect both the recipe being prepared and the micro-activities within this preparation (e.g., cutting, taking, washing, and peeling). One could then use such a system to detect whether the observed user is regularly cooking or whether they are making obvious mistakes in the preparation (e.g., skipping an important step). Each of these sensor types is suitable for the recognition of human activities [6, 7]. Having both of them available thus offers an interesting insight into how they compare, and should allow highly accurate recognition of food-preparation activities.

1.1 Method Overview

The summary of our method is presented in Fig. 1. First, we ordered the data segments that were originally shuffled. This allowed us to take into account temporal dependencies between different micro-activities. Then we used the mocap data to visualize the micro-activities in the training set and added handmade labels to them. From the same data, we then derived additional sensor streams (e.g., the velocity of different body parts) and from them calculated a wide array of different features. The classification process began with one micro-activity classifier that made the first, rough predictions. Using these predictions, we could infer the underlying recipe of each sequence. This allowed us to use a specialized micro-activity classifier for each of the three recipes; more precisely, we used two different classifiers for each recipe and then merged their predictions. The final step was using a Hidden Markov Model (HMM) to smooth out the predictions. This model can learn the expected sequence of micro-activities for each recipe and can thus correct parts of sequences that look very atypical, most likely due to a misclassification error.

2 Challenge Data

The Cooking Activity Recognition Challenge presented us with data captured from four subjects, each preparing three different recipes (Sandwich, Fruit salad, and Cereal) five times. Each recipe was composed of different actions (Cut, Peel, etc.) that we will call micro-activities; there were 10 different micro-activities in total. All subjects had to follow the same script for preparing each recipe, resulting in very similar sequences of micro-activities. The training dataset contained the data of three subjects and the appropriate labels, and labeling the last subject’s data was the goal of the competition. The presented dataset had data of two different types: mocap data collected using a set of 29 markers, and acceleration data from four worn devices: two smartphones positioned on the right arm and left hip, and two smart watches on both wrists.


Fig. 1 The pipeline for the proposed method. The step numbers correspond to the numbers of the sections that describe them: (2) data pre-processing (re-ordering and re-labeling of the mocap data), (3) per-stream feature extraction from derived sensor streams, (4.1) classification (a general micro-activity classifier, a recipe classifier, and base and precise micro-activity classifiers specialized for Sandwich, Fruit salad, and Cereal), and (4.2) post-processing with a Hidden Markov Model

The mocap data had a sample rate of 100 Hz and no missing data. On the other hand, the sample rate of the acceleration data differed between the devices and also frequently varied during the recording. The average sampling rate was 100 Hz for the smart watches and 50 Hz for the smartphones. In addition, the accelerometer data had several gaps, with 20–80% of the data missing, depending on the device. After recording, the organizers segmented the data into 30 s segments. Each segment was then given two labels: the recipe being performed and the list of all micro-activities that happened in that segment. Notably, the start and end of each micro-activity were omitted. These segments were then shuffled and their original order was not given to the competitors.

2.1 Data Preprocessing

The micro-activity aggregation (having only a list of micro-activities, not their times) presents a big problem for classical machine-learning methods, which expect a single label for each time window. In addition, the shuffling complicates the recipe recognition, as it is hard to determine the recipe when seeing only a part of it. For example, taking ingredients from the cupboard (the Take micro-activity) looks the same no matter what preparation procedure follows. To solve both problems, we preprocessed the dataset to make it look more “standard” and more similar to a dataset that we would acquire in a real-life setting.

First, we reordered all the segments to their original order. To do so, we leveraged the fact that if two segments are subsequent in a recording, then the end of one segment must be very similar to the beginning of the next one. We calculated the difference between the mocap marker values at the end of one segment and at the beginning of the other for each pair of segments. This was done for the x, y, z coordinates of all 29 markers, and then all the differences were summed together. The pairs with the smallest differences indicated subsequent segments. If a segment did not have any segment that preceded it, it was considered the first segment in the current sequence, and vice versa for the last segment (one sequence being the preparation of one recipe).

After reordering, we also relabeled all the segments, with the goal of precisely determining the start and end of each micro-activity. To do that, we used the mocap data to visualize the marker positions in 3D space. The visualization was done using Unity [8] (Fig. 2). We then used the same program to create a simple labeling tool, with which we could watch the motion of a subject and try to visually infer their activity. During the relabeling process, we created two sets of labels, so that each frame had two different labels attached to it. For the first set (base labeling), we used only the labels of the challenge. For the second set (precise labeling), we additionally used the label Undefined when the activity performed was not one the organizers required us to recognize. For instance, for the cutting motion, we labeled as Cut the frames when the subject was cutting food, and as Undefined all the frames when they were doing something else (taking the knife, putting it on the table, etc.). This allowed us to later train and recognize the very specific motions of each micro-activity with high accuracy.

Fig. 2 (left) Visualization of the subject as a “stickman” figure using the mocap data. The layout of the room was approximated by looking at where different activities were performed. (right) Using the visualization to determine which index in the data file corresponds to which body part
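The greedy re-ordering step can be sketched as follows. This is our own simplification of the described procedure, and the distance threshold is an assumed hyperparameter rather than a value from the paper.

```python
import numpy as np

def order_segments(segments, gap_threshold=50.0):
    """Greedy chaining of 30 s segments. `segments` maps a segment id to a mocap
    array of shape (n_frames, 29 * 3). Two segments are treated as consecutive
    when the last frame of one is close to the first frame of the other; a new
    chain (one recipe run) starts when no remaining segment is close enough."""
    def gap(a, b):
        return float(np.sum(np.abs(segments[a][-1] - segments[b][0])))

    remaining = set(segments)
    chains = []
    while remaining:
        chain = [remaining.pop()]              # arbitrary starting segment
        while remaining:
            best = min(remaining, key=lambda b: gap(chain[-1], b))
            if gap(chain[-1], best) > gap_threshold:
                break
            chain.append(best)
            remaining.remove(best)
        chains.append(chain)
    return chains
```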

3 Feature Extraction

Using the raw mocap data, we first used the existing sensor streams (e.g., each mocap sensor position) to create additional, derived sensor streams. These streams included the speed of movement, the acceleration, and the distances and angles between selected joints (e.g., the distances between hands, elbows, and shoulders, the distance between ankles, the angles between different hand parts, and the distance to the


floor). The latter were chosen based on the expert knowledge acquired by looking at the visualizations and determining the relevant specifics of motion. After this procedure, we had 129 sensor streams. All the sensor streams (base and derived) were split into two-second windows, and from each window features were calculated. Larger windows were found impractical as they did not capture short micro-activities and micro-activities that started/ended at the border of the 30 s segments. The features include basic ones such as mean, variance, standard deviation, minimum, maximum, and lower and upper quartiles, but also features computed by Fast Fourier Transform and some other features frequently used in similar domains (e.g., count above and below mean, and absolute sum of change). While we also calculated features from the acceleration data that have proven themselves in our previous work [10], we found that including them does not increase the classification accuracy. We believe this is due to the variable sampling rate and missing values in the acceleration data, especially compared to the high-quality mocap data. In addition, since data from the accelerometer is missing a large proportion of the time, we had to make two classifiers for each task (one that used both sensor modalities, and one that only used mocap data). For these two reasons, we decided against using acceleration data in our final submission, and it will be thus omitted in the rest of the paper. For the feature selection step, we used a simple approach of ordering the features by the mutual information between the feature and the label. Then we trained models using the best n features, where n was a variable cutoff. All the features described so far were used for the recognition of micro-activities. In order to recognize the recipe, we created another set of features. Since the reordering assembled all parts of each recipe into one sequence, we could use this full sequence as one instance. We used the general micro-activity recognizer (trained on base labels for all three recipes) on each sequence, and then computed the proportions of micro-activities in the first eighth of the sequence, second eighth, and so on. Only the most well-recognized micro-activities were used: Take, Wash, Put, and Cut. These proportions alongside the length of the sequence became the features for the recipe recognition.
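The recipe features described above can be sketched as follows; the function and argument names are ours, and the exact normalization details are assumptions.

```python
import numpy as np

def recipe_features(window_predictions, used=("Take", "Wash", "Put", "Cut")):
    """Features for recipe recognition from one re-assembled sequence of per-window
    micro-activity predictions: the proportion of each well-recognized micro-activity
    in each eighth of the sequence, plus the sequence length."""
    preds = np.asarray(window_predictions)
    features = []
    for part in np.array_split(preds, 8):
        for activity in used:
            features.append(float(np.mean(part == activity)) if len(part) else 0.0)
    features.append(len(preds))
    return np.asarray(features)
```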

4 Classification

4.1 Classifiers

As described in Sect. 1.1, we started by using one general micro-activity classifier to classify all two-second windows. Next, we collected all the classifications for one sequence and from them calculated the features for that sequence, which we then classified into a recipe. Depending on which recipe was detected, two recipe-specific


micro-activity classifiers were used to classify the same two-second windows as before (hopefully more accurately than with the general micro-activity classifier). The reasoning for having two micro-activity classifiers per recipe is the following: as explained in Sect. 2.1, many of the actions performed by the recorded subjects did not fit into any of the available micro-activities, and the labels for those actions were essentially noise. We feared that, as a consequence, some micro-activities would be hard to learn. To mitigate this, we had another set of labels (precise) that labeled all those actions as Undefined. One classifier was trained on the original labels (base) and the other on the precise labels. When classifying, the precise classifier made its predictions first; if the prediction was Undefined, the base classifier made another prediction to substitute it. This whole pipeline therefore has 7 different micro-activity classifiers (one general and two specialized for each recipe) and one recipe classifier. The recipe classifier was a simple Random Forest with default parameters. For all 7 micro-activity classifiers, we decided to use the same parameters: the same features, the same classifier type (Random Forest), and the same number of estimators in the Random Forest. The only difference between them was the data used for training (all data or only data from one recipe) and the type of labels used (base or precise). The classifier and its hyperparameters were selected empirically, as shown in Sect. 5. We tested different classical classifiers and a custom deep learning network. For the latter, a deep Multi-Task Learning (MTL) architecture was utilized, where each micro-activity was represented as a separate task (one vs. all). The architecture had two fully connected layers shared across all the tasks and one task-specific layer. The final output of the model was provided by concatenating the outputs of the task-specific layers.
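A minimal sketch of the per-recipe classifier pair and the way their predictions are combined might look as follows; the 1000-tree setting follows the parameter study in Sect. 5, while the function names are ours.

```python
from sklearn.ensemble import RandomForestClassifier

def train_recipe_pair(X, y_base, y_precise, n_trees=1000):
    """Train the base and precise micro-activity classifiers for one recipe."""
    base = RandomForestClassifier(n_estimators=n_trees).fit(X, y_base)
    precise = RandomForestClassifier(n_estimators=n_trees).fit(X, y_precise)
    return base, precise

def combine_predictions(precise_pred, base_pred):
    """The precise classifier answers first; the base classifier substitutes its
    prediction whenever the precise one outputs Undefined."""
    return [b if p == "Undefined" else p for p, b in zip(precise_pred, base_pred)]
```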

4.2 Post-Processing with a Hidden Markov Model

Using only classical classification, all the windows are classified independently from one another. This approach discards all the information on the temporal dependencies between them. If a subject is currently taking food from the cupboard (Take), for example, but the next window is classified as Cut, followed by another Take classification, it is far more likely that Cut is a misclassification than an actual micro-activity switch. In addition, the order of the micro-activities is more or less fixed, so if Mix is classified before Pour we can be certain that the recognition is wrong and that their order must be changed. This motivated us to use an extra step after each classification, in which the temporal information was taken into account using an HMM. In this model, we assume that we are moving through a number of hidden states, generating stochastic but visible emissions. In our case, the hidden states represented the actual micro-activities, while the observed emissions represented the classified micro-activities. The parameters of this model are the transition probabilities between the states (transition

Table 1 Accuracy for different micro-activity classifiers

Algorithm | Accuracy (%) | Algorithm | Accuracy (%)
Decision tree | 52.7 | k-NN | 52.0
Bagging | 61.7 | SVM | 47.3
Gradient boosting | 55.0 | XGB | 62.0
Deep learning | 65.7 | MLP | 52.0
Random forest | 68.0 | Ensemble | 67.7

matrix) and the probabilities of the observed emissions in each state (essentially a normalized confusion matrix). The input to the model is an entire classified sequence (one run of the recipe from start to finish). This observed sequence could be generated by many different sequences of actual micro-activities, but the HMM can determine the most likely one of them and return it as output. This output was our submission to the competition.
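A generic Viterbi decoder of the kind used for this smoothing is sketched below. This is not the authors' implementation; the transition, emission, and prior matrices are assumed to have been estimated from the training sequences (the emission matrix being, as noted above, essentially a normalized confusion matrix).

```python
import numpy as np

def viterbi(observations, trans, emit, prior):
    """Most likely hidden micro-activity sequence for a classified sequence.
    observations: list of predicted class indices; trans[i, j] = P(j follows i);
    emit[i, k] = P(classifier outputs k while the true state is i); prior[i] = P(start in i)."""
    n_states, T = len(prior), len(observations)
    with np.errstate(divide="ignore"):           # allow log(0) -> -inf
        lt, le, lp = np.log(trans), np.log(emit), np.log(prior)
    score = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    score[0] = lp + le[:, observations[0]]
    for t in range(1, T):
        for j in range(n_states):
            cand = score[t - 1] + lt[:, j]
            back[t, j] = int(np.argmax(cand))
            score[t, j] = cand[back[t, j]] + le[j, observations[t]]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):                # backtrack the best predecessors
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```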

5 Results

First, we tested which machine-learning algorithm is the most suited for the micro-activity recognition task. We took several classic machine-learning algorithms from the sklearn library [9], in addition to the Extreme Gradient Boosting (XGB) algorithm [11] and deep learning (Sect. 4.1). For this experiment, we used the base labeling, data from all recipes, and two-second windows. We used the leave-one-subject-out scheme, where the data from two subjects was used for training and the data from the remaining one for testing. All the reported results in this section are the average over all three runs. Additionally, we chose to report the results using accuracy, as it is the most common and well-known metric. However, for the final result we also give the score as defined by the competition. This score averages the mean accuracy of recipe classification and the mean accuracy of micro-activity classification. From the results in Table 1, we can see that Random Forest was the most accurate, surpassing both the deep learning approach and, somewhat surprisingly, the ensemble of all approaches (implemented by majority vote). Having decided on the Random Forest classifier, we tested the impact of two parameters: the number of trees in the forest and the number of available features. For the number of trees, we sampled numbers between 1 and 2000 and observed that accuracy slowly increases up to around 1000 trees. Some sample results from this test are shown in Table 2. The feature selection process was described in Sect. 3, and some sample results using different numbers of features are again found in Table 2. They show, interestingly, that using all features gives better performance than any tested subset.


Table 2 Accuracy (%) for different number of trees in the Random Forest and for different number of features used. When testing different number of trees, all features were used. Conversely, when testing different number of features, the maximum (1000) number of trees was used

# Trees | Accuracy (%) | # Features | Accuracy (%)
50 Trees | 67.7 | 10 Features | 65.4
100 Trees | 68.0 | 100 Features | 67.8
500 Trees | 68.9 | 1000 Features | 67.7
1000 Trees | 69.0 | 1994 Features | 69.0

Table 3 Accuracy (%) when using specialized classifiers for each recipe. The table shows the results for both sets of labels and their combination

 | Sandwich | Cereal | Fruit salad | All
Base | 76.7 | 80.3 | 82.3 | 79.8
Precise | 83.6 | 90.0 | 82.0 | 85.2
Combined | 76.0 | 81.3 | 83.3 | 80.2

The outputs from this micro-activity classifier were then used as features for the recipe classifier, implemented as a Random Forest. Each individual run of each recipe was one instance. We achieved 100% accuracy for this task, so for the further steps we could assume that we always knew to which recipe any time window belongs. The next step was to train two classifiers for each recipe as described in Sect. 4. We used the Random Forest classifier with the same parameters and features as in the first general micro-activity classifier. A case could be made for using different parameters or even classifiers for each recipe, but we chose the simpler approach to avoid over-fitting. The results for the base classifier for each recipe are shown in Table 3 and show that the accuracy increases substantially (from 69 to 80%) when the classifiers are specialized. Using the precise labels, the accuracy is even higher, but those labels contain the Undefined activity, which appears roughly 35% of the time, making the problem easier. When combining the predictions, the overall accuracy increases, but only by a negligible amount. We decided to still use this combination for our final competition predictions as it increased the recall of short micro-activities, but it is possible that this was not a crucial step in the pipeline. Finally, we used HMM smoothing as described in Sect. 4.2. This again substantially boosted the results, as can be seen in Table 4. Both the competition score for micro-activities and the accuracy are around 90%. When combining this score with the recipe detection accuracy, we achieved the competition score of 95%.


Table 4 Accuracy and the competition score after using HMM to smooth out the predictions

 | Sandwich | Cereal | Fruit salad | All
Accuracy (%) | 92.3 | 92.3 | 90.0 | 91.5
Competition score (%) | 83.0 | 94.7 | 93.7 | 90.4

6 Discussion

We believe that the high accuracy achieved by our approach stems from three main advantages. The first is that reordering and relabeling the data adds a lot of temporal information to the dataset, both on the order of the 30 s segments and on the exact timing of the micro-activities in them. This allowed us to use conventional machine learning techniques, which turned out to be completely adequate for this research problem. Another advantage is that we interleaved the recognition of micro-activities and recipes: the first helped us determine the second, and once the recipe was recognized it helped us refine the micro-activity recognition. To do so, we had to use multiple machine learning models, but in return we significantly improved our results. Finally, the last big advantage of our approach is the use of the HMM. As all subjects follow the same cooking procedure, the challenge data was very consistent and predictable, increasing the HMM's effectiveness.

However, we can also point out some negative aspects of our approach. Firstly, while manual data labeling adds information to the dataset, it is very time-consuming. Finding a way to automate labeling could make it possible to process a larger amount of data, and thus obtain better results. Secondly, we chose to use the same model (Random Forest) for all our classifiers. This could be suboptimal, as different models and/or parameters for each step of the classification could lead to more accurate results. Lastly, we chose not to use the acceleration data at all, as it did not increase the accuracy of our models; however, finding a use for it, despite its inconsistencies, could be an interesting research problem.

We also want to mention that the competition score felt very "noisy" at times, and perhaps a different metric could be used in the future. The problem is that it is really difficult to determine the exact start and end of each micro-activity period. If a micro-activity started at the very end of a segment (or ended at the beginning of one), it would not be recognized for that segment, resulting in a massive drop in accuracy for that segment even if most of it was correctly classified. While the competition was challenging and interesting, it did not reflect well the challenges one would encounter when developing a system to support cooking in real life. The first issue was the dataset, which made the machine learning task artificially more difficult. The second issue was the sensors: the mocap data was of very high quality, completely outclassing the accelerometers, which are otherwise the more realistic sensors in ambient assisted living. The third issue was the regularity of the preparation procedures, which made the learning easier than it would be on more naturalistic data. It would be interesting to repeat the competition next year with a more realistic dataset.


7 Conclusion

This paper describes our approach to the problem presented by the Cooking Activity Recognition Challenge. We identified three main problems with this task. The first was the unknown ordering of the data, the second was the type of labels given by the organizers, and the final one was the micro-activities themselves: some were very similar to each other, and some were very short and thus hard to detect. We solved the first problem by reordering the data, which allowed us to employ the HMM model that substantially boosted our results (by roughly 10 percentage points). The second problem was solved by relabeling the whole training dataset by hand, which enabled the use of conventional machine learning techniques. Our approach to the machine learning itself (solving the third problem) was fairly conventional, using a well-tuned Random Forest. Nonetheless, we list some key insights from the tuning process: we used a small window size (2 s) in order to better detect short activities, we trained specialized classifiers for each recipe, which improved the accuracy by roughly 10%, and we used only the mocap data, as the acceleration data was too inconsistent. Our final approach had a competition score of 95% when doing cross-validation on the training set. Given that the test set did not exhibit any major statistical difference from the training one, we hope that our final submission will receive a similar score.

8 Appendix: General Information

See Table 5.

Table 5 General information for the used approach

Sensor modalities | Mocap data
Features used | Section 3
Programming language | Python
Library used | sklearn, xgb, hmm, tsfresh
Window size | 2 s
Post-processing | Hidden Markov Model
Training/testing time |
Hardware |