Vol. 29, No. 4, DECEMBER 2022 
IEEE Robotics & Automation Magazine


Displaying the world in your hands. Inventing new ways to interact. Force Dimension designs and manufactures the finest master haptic devices for leading-edge applications in research, medical, industry, and human exploration.

The ScanTrainer from Intelligent Ultrasound is an educational tool that uses real patient scans and curriculum-based teaching across obstetrics, gynecology, general medicine, and emergency medicine. The system integrated a customized delta.3 haptic device. Force Dimension Switzerland www.forcedimension.com [email protected]

Vol. 29, No. 4 DECEMBER 2022 ISSN 1070-9932 http://www.ieee-ras.org/publications/ram

FEATURES

10 A Multimodal Fusion Model for Estimating Human Hand Force
Comparing Surface Electromyography and Ultrasound Signals
By Yongxiang Zou, Long Cheng, and Zhengwei Li

25 A Hybrid Visual–Haptic Framework for Motion Synchronization in Human–Robot Cotransporting
A Human Motion Prediction Method
By Xinbo Yu, Sisi Liu, Wei He, Yifan Wu, Hui Zhang, and Yaonan Wang

36 Toward Holistic Scene Understanding
A Transfer of Human Scene Perception to Mobile Robots
By Florenz Graf, Jochen Lindermayr, Çağatay Odabaşi, and Marco F. Huber

50 Biomimetic Electric Sense-Based Localization
A Solution for Small Underwater Robots in a Large-Scale Environment
By Junzheng Zheng, Jingxian Wang, Xin Guo, Chayutpon Huntrakul, Chen Wang, and Guangming Xie

66 Biomimetic Force and Impedance Adaptation Based on Broad Learning System in Stable and Unstable Tasks
Creating an Incremental and Explainable Neural Network With Functional Linkage
By Zhenyu Lu and Ning Wang

78 Controlling Maneuverability of a Bio-Inspired Swimming Robot Through Morphological Transformation
Morphology Driven Control of a Swimming Robot
By Kai Junge, Nana Obayashi, Francesco Stella, Cosimo Della Santina, and Josie Hughes

92 Simulation to Real
Learning Energy-Efficient Slithering Gaits for a Snake-Like Robot
By Zhenshan Bing, Long Cheng, Kai Huang, and Alois Knoll

104 A Robust Visual Servoing Controller for Anthropomorphic Manipulators With Field-of-View Constraints and Swivel-Angle Motion
Overcoming System Uncertainty and Improving Control Performance
By Jiao Jiang, Yaonan Wang, Yiming Jiang, He Xie, Haoran Tan, and Hui Zhang
(Continued)

ON THE COVER: This issue of IEEE Robotics and Automation Magazine focuses on the state of the art in biomimetic perception, cognition, and control research. ©SHUTTERSTOCK.COM/MONIKA WISNIEWSKA

Digital Object Identifier 10.1109/MRA.2022.3218353


FEATURES (Continued)

115 A Small-Scale, Rat-Inspired Whisker Sensor for the Perception of a Biomimetic Robot
Design, Fabrication, Modeling, and Experimental Characterization
By Yulai Zhang, Shurui Yan, Zihou Wei, Xuechao Chen, Toshio Fukuda, and Qing Shi

127 Contact Shape and Pose Recognition Utilizing a Multipole Magnetic Tactile Sensor With a Metalearning Model
By Ziwei Xia, Bin Fang, Fuchun Sun, Huaping Liu, Weifeng Xu, Ling Fu, and Yiyong Yang

A Publication of the IEEE ROBOTICS AND AUTOMATION SOCIETY Vol. 29, No. 4 December 2022 ISSN 1070-9932 http://www.ieee-ras.org/publications/ram

COLUMNS & DEPARTMENTS
4 FROM THE EDITOR'S DESK
6 PRESIDENT'S MESSAGE
8 FROM THE GUEST EDITORS
138 WOMEN IN ENGINEERING
141 INDUSTRY ACTIVITIES
145 TC SPOTLIGHT
148 COMPETITIONS
158 SOCIETY NEWS
164 CALENDAR

EDITORIAL BOARD
Editor-in-Chief: Yi Guo ([email protected]), Stevens Institute of Technology (USA)
Editors: Elena De Momi, Politecnico di Milano (Italy); Jindong Tan, University of Tennessee (USA)
Associate Editors: Ming Cao, University of Groningen (The Netherlands); Feifei Chen, Jiao Tong University (China); Carlos A. Cifuentes, Bristol Robotics Laboratory (UK); Kingsley Fregene, Lockheed Martin (USA); Antonio Frisoli, Scuola Superiore Sant'Anna (Italy); Jonathan Kelly, University of Toronto (Canada); Ka-Wai Kwok, The University of Hong Kong (Hong Kong); Surya G. Nurzaman, Monash University (Malaysia); Weihua Sheng, Oklahoma State University (USA); Yue Wang, Clemson University (USA); Enrica Zereik, CNR-INM (Italy); Houxiang Zhang, Norwegian University of Science and Technology (Norway)
Past Editor-in-Chief: Bram Vanderborght, Vrije Universiteit Brussel (Belgium)
RAM Column Manager: Amy Reeder (USA)
RAM Editorial Assistant: Joyce Arnold (USA)

COLUMNS
Competitions: Hyungpil Moon (Korea)
Education: Andreas Müller (Austria)
From the Editor's Desk: Yi Guo (USA)
Industry News: Tamas Haidegger (Hungary)
Humanitarian Technology: Vacant
Standards: Craig Schlenoff (USA)
President's Message: Frank Park (Korea)
Regional Spotlight: Megan Emmons (USA)
Student's Corner: Francesco Missiroli (Germany)
TC Spotlight: Yasuhisa Hirata (Japan)
Women in Engineering: Karinne Ramirez Amaro (Sweden)

IEEE RAS Vice-President of Publication Activities: Todd Murphey (USA)
RAM home page: http://www.ieee-ras.org/publications/ram
IEEE Robotics and Automation Society Executive Office: Amy Reeder, Program Specialist, [email protected]
Advertising Sales: Mark David, Director, Business Development, Media & Advertising, Tel: +1 732 465 6473, Fax: +1 732 981 1855, [email protected]
IEEE Periodicals Magazines Department: Kristin Falco LaFleur, Senior Journals Production Manager; Patrick Kempf, Senior Manager, Journals Production; Janet Dudar, Senior Art Director; Gail A. Schnitzer, Associate Art Director; Theresa L. Smith, Production Coordinator; Felicia Spagnoli, Advertising Production Manager; Peter M. Tuohy, Production Director; Kevin Lisankie, Editorial Services Director; Dawn M. Melley, Staff Director, Publishing Operations
IEEE-RAS Membership and Subscription Information: +1 800 678 IEEE (4333), Fax: +1 732 463 3657, http://www.ieee.org/membership_services/membership/societies/ras.html

Digital Object Identifier 10.1109/MRA.2022.3218352

IEEE prohibits discrimination, harassment, and bullying. For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

IEEE Robotics and Automation Magazine (ISSN 1070-9932) (IRAMEB) is published quarterly by the Institute of Electrical and Electronics Engineers, Inc. Headquarters: 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, Telephone: +1 212 419 7900. Responsibility for the content rests upon the authors and not upon the IEEE, the Society, or its members. IEEE Service Center (for orders, subscriptions, address changes): 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855 USA. Telephone: +1 732 981 0060. Individual copies: IEEE Members US$20.00 (first copy only), non-Members US$140 per copy. Subscription rates: Annual subscription rates included in IEEE Robotics and Automation Society member dues. Subscription rates available on request. Copyright and reprint permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. Copyright law for the private use of patrons: 1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA; 2) pre-1978 articles without a fee. For other copying, reprint, or republication permission, write Copyrights and Permissions Department, IEEE Service Center, 445 Hoes Lane, Piscataway, NJ 08854. Copyright © 2022 by the Institute of Electrical and Electronics Engineers Inc. All rights reserved. Periodicals postage paid at New York and additional mailing offices. Postmaster: Send address changes to IEEE Robotics and Automation Magazine, IEEE, 445 Hoes Lane, Piscataway, NJ 08854 USA. Canadian GST #125634188. PRINTED IN THE U.S.A.


Say Hello_to the next robotic innovator. Research teams from around the globe were asked to submit their concepts for the "Open Platform Challenge." Three teams made it to the finals and will present their projects at the automatica fair in June 2023. The award comes with a 20,000-euro prize. Meet our finalists.

KUKA Innovation Award 2023

MeRLIn Team SPIRIT, German Aerospace Center, Germany A single large refinery often requires more than 50,000 maintenance and inspection routines. The team from the German Aerospace Center is working on a telepresence solution that enables safe and intuitive operation of aerial manipulators for industrial applications in the oil and gas industry.

www.kuka.com/InnovationAward2023

Team Fashion & Robotics, University of Arts and Industrial Design Linz, Austria. The fashion and textile industry is under great pressure to reduce its environmental footprint by producing more durable, high-quality products and developing circular material flows. The departments for "Creative Robotics" and "Fashion and Technology" at UfG Linz are working on a way for textile companies and designers to increase their production by setting up micro factories with collaborative robot systems.

Team JARVIS, Politecnico di Milano, Italy. The goal of the team from Politecnico di Milano is to develop a complete plug-and-play method for programming collaborative robotic applications that is fully compatible with the new iiQKA ecosystem. This will facilitate their introduction in small and medium-sized enterprises, enabling unskilled operators to program the robot for a new task and, thanks to the system's AI-enhanced capabilities, to generalize to unknown situations, new tasks, and product variants.

FROM THE EDITOR’S DESK

Humans and the Environment By Yi Guo 

The COVID-19 pandemic has changed a lot of things, one of which is human behavior. For me, I found a new hobby of hiking during the first year of the pandemic. I hiked in dozens of state parks around me during the fall and winter seasons, some of which I did not even know existed before the pandemic. I felt relieved both physically and mentally after the weekend hiking trips, and it helped me reduce the Zoom fatigue built up during workdays.

Nature creates the environment that we live in. Environmental damage can lead to new diseases. Nearly two-thirds of the hundreds of diseases that emerged in the past century were transmitted from animals to humans, including HIV/AIDS and probably COVID-19. Scientists have also found a correlation between the loss of forests in Africa and outbreaks of Ebola. Altering land use threatens biodiversity, and deforestation and intensive farming are linked to outbreaks of transmitted diseases. Scientists have also studied how people have altered living organisms and exerted an evolutionary pressure on other species.

We need to nurture the environment we live in and avoid environmental damage that essentially endangers human health. The role nature plays in human welfare and economic activity has been overlooked. Complex natural systems can flip from one equilibrium to another when under pressure. The demands humans currently place on nature may not be sustainable by Earth's ecosystem. Natural capital was recently included in an analysis of the sustainability of current rates of economic growth. As we learn more through interaction with nature, it is necessary to be aware of the sustainability of the ecosystem around us and to pay attention to the impact of human behavior, and of new technology, on the environment.

Robots can help protect the environment by adding monitoring capabilities in extreme conditions, such as robotic vehicles for supply transport, under-ice exploration to support climate change studies in the Arctic, and surface and underwater robots to monitor chemical and oil pollution. Robotics can also help fight climate change by reducing carbon emissions through renewable energy resources, helping crops survive droughts, and planting trees. We need more and more innovative robots to join humans' effort to protect the environment. We have no choice.

Bioinspired robotics is a field of robotics that studies biological systems to achieve engineering goals. Among the well-studied topics, locomotion concepts involve principles found in nature for creating robot locomotive capabilities; multirobotic systems study how animal swarms communicate and exchange information and the hierarchical and territorial structure of animal societies; and robot learning draws inspiration from human learning and neural network-based brain cognition. The field of soft robotics stems from bioinspired concepts and research on deformable structures, soft materials, and morphological computation. While bioinspired robotics is a very broad research area, challenges exist on many frontiers, including the integration of science and technology advances toward enhanced robotic capabilities in perception, cognition, learning, and control.

This December issue is a special issue (SI) on biomimetic perception, cognition, and control: "From Nature to Robots." We received more than 30 submissions to this themed issue, which is the highest number of SI submissions in the past two years. I'd like to thank the lead guest editor, Dr. Chenguang Yang, and his team, Dr. Shan Luo, Dr. Nathan Lepora, Dr. Fanny Ficuciello, Dr. Dongheui Lee, Dr. Weiwei Wan, and Dr. Chun-Yi Su, for their hard work managing the peer review process within the constrained time frame of this SI. I'd also like to thank the assisting IEEE Robotics and Automation Magazine associate editor, Surya Nurzaman, for his help on the SI. I hope you find inspiration in the feature articles. Enjoy reading!

Digital Object Identifier 10.1109/MRA.2022.3213198
Date of current version: 2 December 2022

PRESIDENT’S MESSAGE

Rethinking the Research Paper By Frank Park

Like many of you, I've spent a good part of the summer months harried by paper deadlines. For each paper, the story unfolds something like this: the triumph and euphoria that one experiences immediately after submitting the paper quickly wear off, to be replaced by an encroaching regret that, with more time, the paper could have been so much better. Then the reviews arrive: poring over the comments, decisions are quickly made on whether to accept or reject a criticism (should I hold firm or show some contrition for not explaining more clearly?) and whether to comply with all the requests (demands?) for additional experiments and comparisons to existing work. The road to publication is always a bumpy one, it seems.

Lately I've noticed that reviewers seem to be asking for more and more elaborate experimental comparisons against the state of the art. At one level, it's a sign that robotics is maturing as a technical field. Our work is more and more embodied and validated by code, data, and benchmarks, and like the machine learning and data science research communities, there is growing acceptance within our community of the need to make code and data available. What we still seem to lack in robotics, however, are benchmarks. To be fair, benchmarks have been developed for grasping, pick-and-place, and various other robotics tasks, but these tend to be narrowly defined for specific hardware and environment requirements and are difficult to implement. Simulation benchmarks have been proposed as an alternative, but even the best simulators today lack the ability to model friction, contact, deformations, and other complex physical interactions in a realistic way. I think it's fair to say that the inherent challenges of developing benchmarks for robotics are much harder than, say, those for vision or natural language.

Returning to the review of our paper, not surprisingly a reviewer had asked for comparisons of our method against another recently published method, claiming it to be the state of the art. Since no code was provided for this state-of-the-art method, we spent a great deal of effort implementing the algorithm (whose description turned out to be lacking some small but crucial details) and performed the comparison experiments as faithfully as we could. We've not yet received feedback on this latest set of experiments, but already I can anticipate heated discussions on the experimental setup, the algorithm implementation, and how truly meaningful these comparisons are in the absence of accessible, reproducible benchmarks.

I have also noticed that several of our conferences are now experimenting with double-anonymous (or double-blind) reviews, open discussion phases between reviewers and authors, and several other new practices. All these efforts to reduce bias in our reviews, be it by gender, nationality, author or institutional reputation, or any number of factors, are highly welcome and in keeping with our community's spirit of innovation and fairness. Some members of our community are currently studying the advantages and challenges of implementing double-anonymous reviews, as well as the experiences of other research communities. While more evidence needs to be collected before any definitive conclusions can be drawn, experimenting with new review practices is a very welcome development. I encourage you to let your voices be heard in the ongoing community discussion on how we measure research progress and on ways to further reduce bias in our review process.

Digital Object Identifier 10.1109/MRA.2022.3214390
Date of current version: 2 December 2022

FROM THE GUEST EDITORS

Biomimetic Perception, Cognition, and Control: From Nature to Robots By Chenguang Yang , Shan Luo, Nathan Lepora, Fanny Ficuciello, Dongheui Lee, Weiwei Wan, and Chun-Yi Su

A wide range of technological developments are inspired by biological individuals, spanning advanced synthetic materials, cognitive sensors, control algorithms, artificial intelligence technology, and intelligent systems. One of the major challenges is to create a comprehensive study by integrating different techniques into robotic systems so that the performance of robots can be improved and applied to more complex and diverse scenarios. The articles in this issue focus on the state of the art in biomimetic perception, cognition, and control research and aim to explore related technical avenues in the multimodal bioinformation perception framework, intelligent cognition and learning, robotic systems and control, and new biomimetic sensors.

In the research on multimodal bioinformation perception frameworks, Zou et al. study the clinical readiness of a multimodal fusion model that estimates hand force based on surface electromyography signals and A-mode ultrasound signals of the forearm muscles. Yu et al. introduce a hybrid visual–haptic framework enabling a robot to achieve motion synchronization in human–robot cotransporting. Graf et al. build a framework for transferring ideas from human scene perception to robot scene perception to contribute toward robots' holistic scene understanding, based on a wide survey and comparison of robotic scene perception approaches with neuroscience theories and studies of human perception.

In the area of intelligent cognition and learning, Zheng et al. propose a biomimetic electric sense-based localization scheme, including an electric sense-based localization scheme and three model-based perception methods for large-scale underwater localization. Lu and Wang introduce a novel biomimetic force and impedance adaptation framework based on the Broad Learning System for robot control in stable and unstable environments. The connections of the created neural network layers and the settings of the feature nodes are explainable by human motor control and learning principles.

On the topic of robot systems and control, Stella et al. develop a feather star-like robot that can actuate layers of flexible feathers and detach them at will. Based on this optimized feather and theoretical framework, the new robotic setup can change its motion path by using the detachment of feathers while maintaining the same low-level controller. Bing et al. apply reinforcement learning (RL) to snake-like robot control that can fully exploit the hyper-redundant body of the robot. Simulations and experiments show that RL can generate substantially more energy-efficient gaits than those generated by conventional model-based controllers. Jiang et al. propose a humanoid control method based on visual servoing by utilizing a swivel angle derived from the human arm to realize human-like behavior of anthropomorphic robot manipulators. The proposed method is substantiated on a 7-degree-of-freedom Sawyer robot and a constructed visual servo physical platform.

In the field of biomimetic sensors, Zhang et al. design a rat-inspired whisker sensor for a biomimetic robotic rat and demonstrate its superior tactile perception performance. Experimental results demonstrate its outstanding texture discrimination ability and excellent performance on contour reconstruction. Xia et al. develop a soft magnetic tactile sensor and a multipole magnetization method for extracting contact surface features.

We sincerely thank all the authors for their contributions and the editors for their efforts, and we hope that the contents of this issue bring information and inspiration to researchers in related fields.

Digital Object Identifier 10.1109/MRA.2022.3213199
Date of current version: 2 December 2022

Introducing IEEE Collabratec™ The premier networking and collaboration site for technology professionals around the world.

IEEE Collabratec is a new, integrated online community where IEEE members, researchers, authors, and technology professionals with similar fields of interest can network and collaborate, as well as create and manage content. Featuring a suite of powerful online networking and collaboration tools, IEEE Collabratec allows you to connect according to geographic location, technical interests, or career pursuits. You can also create and share a professional identity that showcases key accomplishments and participate in groups focused around mutual interests, actively learning from and contributing to knowledgeable communities. All in one place!

Learn about IEEE Collabratec at ieee-collabratec.ieee.org

Network. Collaborate. Create.

A Multimodal Fusion Model for Estimating Human Hand Force Comparing surface electromyography and ultrasound signals By Yongxiang Zou, Long Cheng , and Zhengwei Li

Biomimetic robots have received significant attention in recent years. Among them, the wearable exoskeleton, which imitates the functions of the musculoskeletal system to assist humans, is a typical biomimetic robot. Given that safe human–robot interaction plays a critical role in the successful application of wearable exoskeletons, this work studies the clinical readiness of a multimodal fusion model that estimates hand force based on the surface electromyography (sEMG) and A-mode ultrasound signals of the forearm muscles. The proposed multimodal fusion model affords a biomimetic hand exoskeleton assisting the elderly in completing daily tasks or quantitatively assessing the recovery level of poststroke patients. The suggested fusion model is called Optimization of Latent Representation for the Self-Attention Convolutional Neural Network (OLR-SACNN); it utilizes a common component extraction module (CCEM) and a complementary component retention module (CCRM) to optimize the latent representation of the multiple modalities. The optimized latent representations are then fused with the self-attention mechanism. Experiments conducted on a self-collected multimodal data set verify the performance of the proposed OLR-SACNN model. Specifically, compared to solely employing sEMG or A-mode ultrasound signals, the force estimation's normalized mean-square error (NMSE) based on the multiple modalities decreases by 97.7% and 38.92%, respectively. Furthermore, the OLR-SACNN model has been used to estimate the hand force of some poststroke patients and attained the desired performance.


Digital Object Identifier 10.1109/MRA.2022.3177486 Date of current version: 10 June 2022


Overview and Motivation

The growing trend in developing biomimetic robots has attracted significant research interest in applications such as biomedical operation, rehabilitation engineering, and robotic manipulation [1]. The hand exoskeleton is a typical biomimetic robot [2]. As illustrated in Figure 1, the hand is controlled by the forearm muscles in a natural state, while with a hand exoskeleton, the hand's movement can also be controlled by the worn exoskeleton according to specific forearm muscle signals. For example, the hand exoskeleton robot can assist the patient's paralyzed hand to output sufficient force and complete some daily tasks by amplifying the hand's remaining force. Therefore, accurately estimating the hand force or torque is a significant challenge [3], providing a valid control signal in general human–robot interaction [4]. Furthermore, accurately estimating the patient's hand force can provide a quantitative index for assessing the patient's recovery level [5], alleviating the labor-intensive and time-consuming issues encountered in the traditional clinical assessment using the modified Ashworth scale.

Hand force estimation through sEMG or ultrasound signals has been investigated in recent years. Current research has demonstrated that high-density sEMG is highly appealing for the force-estimation process, as the absolute estimation error of the patient's wrist force can reach 1.6% [6]. Compared to high-density sEMG, sparse sEMG consumes fewer computation resources, with the absolute estimation error reaching 2.06% and 2.04% on nonamputees and amputees, respectively [7]. An alternative hand force monitoring solution relies on ultrasound signals due to their ability to sense muscle morphology changes in deep tissues.

Typically, a B-mode ultrasound captures 2D images of the muscle using an array of ultrasonic sensing units, quantitatively estimating muscle stiffness [8]; i.e., a B-mode ultrasound can estimate the muscle output force because muscle stiffness is proportional to muscle output force when the muscle contraction is constant. However, deformations of the nearby muscles are strongly correlated during hand movement, resulting in the recording of considerable redundant information for describing the muscle's state. In nature, humans can even roughly judge the output force of the hand by observing changes at a few very sparse points on the muscle. To verify this point, Akhlaghi et al. used a 128-scanline image and an equally spaced four-scanline image of the B-mode ultrasound to recognize five motion classes and achieved an appealing classification accuracy [9], highlighting that sparse information on the hand muscle morphology can sufficiently represent hand muscle states. Spurred by these findings, we fully exploit this observation and utilize an A-mode ultrasound. Unlike a B-mode ultrasound, the A-mode one only detects the echoes of sparse ultrasonic channels in some

Figure 1. A biomimetic hand exoskeleton that can be controlled by the user's intended hand force. (a) How the hand exoskeleton imitates the function of the hand musculoskeletal system and (b) how the intended force of the healthy hand can control the hand exoskeleton to help the hemiplegic hand's rehabilitation training.


Figure 2. The framework of the proposed multimodal fusion model OLR-SACNN. FFT: fast Fourier transform; LR-EMG: latent representation of sEMG; LR-UI: latent representation of ultrasound.


critical positions of the human muscles. Additionally, an A-mode ultrasound can be designed as a lightweight device with a thickness of only 0.6 mm [10], broadening its practical applications. However, the sEMG and A-mode ultrasound signals describe partial attributes of the forearm muscles: the sEMG records the activation intensity of the muscle nerve, while the ultrasound records changes of muscle morphology. Therefore, a multimodal information fusion process can better estimate hand force, and thus we design a wearable system that can synchronously collect the sEMG and A-mode ultrasound signals to complete the hand-force-estimation task.

This study proposes a multimodal fusion model named OLR-SACNN, whose architecture is shown in Figure 2. To build this model, we first extract deep features of each modality separately, which are then optimized by employing two custom modules. Finally, the optimized latent representation is fused with the self-attention module and then used for hand force estimation. This study is divided into the following three stages: 1) A multimodal data-acquisition system aimed at hand force estimation is designed. 2) Deep learning models are studied to estimate hand force with either the single sEMG or the single A-mode ultrasound. 3) A multimodal fusion model is further investigated to obtain a more accurate estimation of hand force with experimental verification.

Related Work

Learning Methods on a Single Modality
In recent years, sEMG signals have been studied for hand gesture recognition, hand (or wrist) force estimation [11], [12], and knee-ankle-joint motion regression [13]. Traditional machine learning methods have analyzed sEMGs by extracting handcrafted time- and frequency-domain features to complete sEMG-based recognition tasks [14], [15]. However, selecting these handcrafted features depends highly on the practitioner's experience. Recent studies have confirmed that deep learning models such as long short-term memory, CNNs, and transformers outperform traditional machine learning methods. Additionally, the automatic feature-extraction abilities of these deep learning models allow sEMG signals to be easily preprocessed by some digital filters [16] without involving a complicated, handcrafted feature-extraction process. Nevertheless, A-mode ultrasound signals have rarely been studied for hand force estimation. One exception is the work in [17], where the absolute estimation error reached 1.04%, but in this method, preprocessing the A-mode ultrasound is complicated, including time-gain compensation, digital filtering, envelope detection, log compression, and signal segmentation. After that, the mean and standard deviation are selected as features for the estimation task [18]. To the best of our knowledge, a deep learning method has never been applied to the A-mode ultrasound for hand force estimation, which is the research goal of this study.

Multimodal Fusion Methods
In the multimodal fusion scenario, each modality describes the object from a partial view, while multiple modalities involve multiple views. Learning a latent representation that preserves the complementary information and extracts the consistent information from these different views is challenging. Given that the learned latent representation should reach a satisfactory accuracy in realizing downstream tasks,

Figure 3. The proposed multimodal data-acquisition system. (a) A display of the data-collection process from one healthy subject, with key components of the system: 1) software for the sEMG, force, and ultrasound collection; 2) a force-collection device; 3) an eight-channel stabilizer; 4) an NI data-acquisition unit; 5) the sEMG armband; 6) the ultrasound armband; and 7) the ultrasound instrument. (b) A display of the scenario in which one patient was instructed about the force-estimation study. (c) A snapshot of the acquisition system in operation.

this study employs a canonical correlation analysis to learn the correlated components of different deep features from multiple modalities. To avoid discarding the complementary information of different modalities while extracting consistent information, it is necessary to build a function that measures the information loss between the latent representation and its corresponding input signals, which is then used to preserve each modality's complementary components. To this end, Wang et al. utilized an autoencoder to restore the input with its extracted deep features [19]. However, minimizing the restoration error in the autoencoder may lead to the latent representation preserving noise information, which is helpful for reconstruction but unnecessary for characterizing the input signal. This work utilizes mutual information to retain the complementary components from multiple modalities to solve this problem.

Materials and Methods

Architecture of the Multimodal System
The overall architecture of the multimodal data-acquisition platform comprises eight parts: one sEMG armband (OYMotion Technologies, China), one ultrasound instrument (Shantou Ultrasonic Electronics Company), one data-acquisition unit [NI USB-6008, National Instruments (NI), USA], one eight-channel stabilizer (Junyi Technology, China), one customized hand-force-collection device, one customized ultrasound armband, and one customized data-collection software tool. The data-acquisition system and its components are shown in Figure 3(a). The sEMG armband, ultrasound instrument, and force-measurement device are connected to the computer through a Bluetooth adapter, Ethernet, and a serial port, respectively. The sEMG and ultrasound armbands have eight channels each and are worn on the forearm. Moreover, the NI data-acquisition unit and the stabilizer collect the data, while the software records the time stamp of each modality for follow-up data alignment and reading, and monitors the multiple modalities during data collection.

The ultrasound armband transducers are each placed in a 3D-printed bracket and are worn around the forearm; the structure is depicted in Figure 4(c). To ensure the bracket's stability and adaptability to different individuals, the eight brackets are connected using an elastic-woven material, and the transducers are pressed hermetically against the skin by preloaded springs. The main parameters of the ultrasound device and ultrasound armband are listed in Table 1. The structure of the force-measurement device is shown in Figure 4(b). Specifically, two force sensors are used to measure the hand force (0–110 N). By the structure of the measurement device, the hand force is the sum of the measured values of these two force sensors. The guide rail and sliding block ensure that the gripping force is exerted on these two force sensors through the force-bearing link.

Data Processing
The designed multimodal system involves three signal types. The measured sEMG and ultrasound signals are used to estimate hand force, and the measured hand force is treated as the ground truth. The signals from these multiple modalities should be aligned during the data preprocessing stage.

Table 1. Key parameters of the ultrasound armband.
Horizontal linearity: ≤4
Bandwidth: 0.5–15 MHz
Repetition frequency: 25–10 kHz
Pulsewidth: 55 ns
Damp: 400 Ω
Voltage of excitation: –250 V
Gain: 80 dB
Measurement depth: 50 mm
Center frequency: 5 MHz
Velocity of ultrasound: 1,400 m/s
Vertical linearity: ≤3%
Size of transducer: ⌀5 × 20 mm


Considering that the sampling rates of the sEMG, ultrasound, and force signals are 1,000, 100, and 1,000 Hz, respectively, the data-alignment scheme is determined accordingly and presented in Figure 5. The hand force at time t is a scalar (red bar in Figure 5). The noise interference is eliminated by filtering the force with an eighth-order low-pass filter, preserving the band below 50 Hz, and the force is normalized to a range from 0 to 1.
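To make the alignment concrete, the following is a minimal preprocessing sketch in Python (not the authors' released code). The filter order, cutoff, sampling rates, and sample shapes follow the text, while the window hop, helper names, and array layouts are illustrative assumptions.

```python
# Minimal alignment/preprocessing sketch (assumed helper names and layouts).
# Sampling rates per the article: sEMG 1,000 Hz, A-mode ultrasound 100 Hz, force 1,000 Hz.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_force(force_1khz):
    """Eighth-order low-pass filter (<50 Hz) and 0-1 normalization of the force signal."""
    b, a = butter(8, 50, btype="low", fs=1000)
    f = filtfilt(b, a, force_1khz)
    return (f - f.min()) / (f.max() - f.min() + 1e-8)

def build_samples(semg, ultra, force, win=80, hop=10):
    """semg: (T, 8) at 1 kHz; ultra: (T//10, 800, 8) at 100 Hz; force: (T,) at 1 kHz.
    Returns sEMG windows (N, 1, 80, 8), reshaped ultrasound frames (N, 10, 80, 8),
    and the ground-truth force value at the end of each 80-ms window."""
    X_e, X_u, y = [], [], []
    for t in range(win, semg.shape[0], hop):
        X_e.append(semg[t - win:t][None])                   # 1 x 80 x 8 sEMG window
        X_u.append(ultra[t // 10 - 1].reshape(10, 80, 8))   # nearest 800 x 8 frame -> 10 x 80 x 8
        y.append(force[t - 1])                              # force aligned with time t
    return np.stack(X_e), np.stack(X_u), np.stack(y)
```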

Figure 4. The structure of the self-designed ultrasound armband and the hand-force-measurement device. (a) The positions in which the ultrasound and sEMG armbands were placed; (b) the structure of the force-measurement device, which is a symmetrical structure (the red dot-dashed line is the line of symmetry); and (c) details of the self-designed ultrasound armband. APL: abductor pollicis longus; BR: brachioradialis; Ch: channel; ECRB: extensor carpi radialis brevis; ECRL: extensor carpi radialis longus; ECU: extensor carpi ulnaris; EDC: extensor digitorum communis; FCR: flexor carpi radialis; FCU: flexor carpi ulnaris; FDP: flexor digitorum profundus; FDS: flexor digitorum superficialis; FPL: flexor pollicis longus; PL: palmaris longus; PT: pronator teres.


At each instant, the size of an ultrasound sample is 800 × 8, where "800" represents the number of detection dots along the detected ultrasonic echo direction and "8" denotes the number of transducers. Each transducer can detect information from specific muscles; e.g., the first transducer mainly detects the flexor carpi radialis, palmaris longus tendon, and flexor pollicis longus muscles. In this study, we reshape the collected ultrasound sample to 10 × 80 × 8 so that the signal on each layer (of size 80 × 8) reflects the thickness of the forearm muscles at different depths. Moreover, this reshaping significantly reduces the parameter cardinality in the following feature-extraction model. The sEMG signal at time t is an eight-dimensional vector. Considering that an sEMG is a typical time-sequence signal whose information is contained within one period, the sEMG sample corresponding to the force at time t is the sEMG signal within the time-sliding window, resulting in each sEMG sample having a size of 1 × 80 × 8.

Feature Extraction
Multichannel sEMG signals represent the activation information of different muscles, coupled after the biological electrical signals are transmitted from the deep muscles to the surface skin. In this study, a multiview multikernel CNN (MMCNN) model is proposed to extract deep features of the sEMG signals from the time and frequency domains, respectively. The MMCNN structure is depicted in Figure 6(a). Considering submodel 1 in Figure 6(a), we use a CNN to extract spatiotemporal information from the raw sEMG signals. As the length of an sEMG sample is much larger than its width (the length is 80, and the width is 8), a convolution kernel of only one size can barely extract appropriate deep features. To solve this problem, CNN block 1 involves five convolution layers with different kernel sizes of 11 × 3, 21 × 3, 31 × 3, 41 × 3, and 51 × 3 to extract deep features at different time scales. The sizes of all the pooling layers in this block are 5 × 1. The five outputs of this block have the same-sized deep features. Accordingly, in CNN block 2, five convolution layers are employed to further extract features from the aforementioned multi-time-scale deep features; their kernel sizes are 3 × 3, and all the pooling layers in this block are 2 × 2, reducing the deep-feature dimension. For submodel 2 in Figure 6(a), spectrograms of the sEMG signals are transformed from the corresponding time-domain sEMG signals through a fast Fourier transform (FFT), and the sEMG signal of each channel is transformed separately. Thus, the size of the sEMG spectrogram sample is also 1 × 80 × 8. After the FFT, the sEMG spectrograms are regarded as images, with the convolution layers extracting deep features from the frequency domain.
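To illustrate the multikernel idea, here is a minimal PyTorch sketch of a block in the spirit of CNN block 1. The kernel heights (11–51) and the 5 × 1 pooling follow the text, whereas the channel counts, padding, and batch normalization are assumptions rather than the authors' exact architecture.

```python
# Sketch of the multikernel idea in CNN block 1 (assumed channel counts and padding).
import torch
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    """Five parallel convolutions with kernel sizes 11x3 ... 51x3 over a 1 x 80 x 8 input."""
    def __init__(self, in_ch=1, out_ch=8):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=(k, 3), padding=(k // 2, 1)),
                nn.BatchNorm2d(out_ch),
                nn.ELU(),
                nn.MaxPool2d(kernel_size=(5, 1)),   # 5 x 1 pooling, as described in the text
            )
            for k in (11, 21, 31, 41, 51)
        ])

    def forward(self, x):                           # x: (batch, 1, 80, 8)
        # Each branch produces the same spatial size, so the outputs can be
        # concatenated and passed to a second block (3 x 3 kernels, 2 x 2 pooling).
        return torch.cat([b(x) for b in self.branches], dim=1)

# Example: MultiKernelBlock()(torch.randn(4, 1, 80, 8)) -> tensor of shape (4, 40, 16, 8)
```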

Figure 5. The alignment scheme of signals measured from multiple modalities.


Figure 6. (a) The MMCNN model structure for extracting deep features of sEMG signals; (b) the MCNN model for extracting deep features of ultrasound signals; (c)–(f) the detailed structure of CNN blocks 1 and 2 and of the convolution and pooling layers, respectively. Conv: convolutional.


Deep features of the sEMG signals from both the time and frequency domains are thereby obtained, concatenated, and fused via a convolution layer with a kernel size of 1 × 1. Deeper convolution layers are then applied to obtain more complex information about the sEMG signals. It is noted that all the activation functions in the convolution layers are the exponential linear unit (ELU).

The ultrasound signals from the eight channels are also coupled, similarly to the sEMG: one transducer captures its own ultrasonic echo as well as the echoes from other transducers. In addition, there is a complex coordination mechanism among the forearm muscles when the hand exerts force, imposing a complicated mapping between muscle state and hand force. Therefore, we fully exploit the automatic feature-extraction ability of deep learning to analyze the ultrasound signals and note that the feature extraction of such signals suffers from the same problem as for sEMGs: the length is much larger than the width. Hence, we propose a multikernel CNN (MCNN) model for the ultrasound, which is depicted in Figure 6(b). The MCNN has the same structure as the MMCNN, except that submodel 2 is omitted because the ultrasound sample is an instantaneous signal.

Optimization of Latent Representation
After the deep features of the sEMG and ultrasound signals are extracted, they need to be further optimized to achieve better performance on hand force estimation. To this end, the following CCEM is proposed, whose optimization objective is

$$\max_{\theta_1,\theta_2,U,V}\ \frac{1}{m}\,\mathrm{trace}\big(U^{T}f_{1}^{T}(x_{1})f_{2}(x_{2})V\big),\tag{1}$$

subject to

$$U^{T}\Big(\frac{1}{m}f_{1}^{T}(x_{1})f_{1}(x_{1})+rI_{64\times64}\Big)U=I_{o\times o},\tag{2}$$

$$V^{T}\Big(\frac{1}{m}f_{2}^{T}(x_{2})f_{2}(x_{2})+rI_{64\times64}\Big)V=I_{o\times o},\tag{3}$$

$$u_{i}^{T}f_{1}^{T}(x_{1})f_{2}(x_{2})v_{j}=0,\quad\text{for } i\neq j,\tag{4}$$

where $m$ is the number of samples in a batch during model training, and $o$ is the dimension of the optimized deep feature generated by the CCEM. $x_{1}\in\mathbb{R}^{m\times1\times80\times8}$ represents the input sEMG sample, and $x_{2}\in\mathbb{R}^{m\times10\times80\times8}$ symbolizes the input ultrasound sample. $f_{1}(\cdot)\in\mathbb{R}^{m\times64}$ denotes the model used for extracting the deep features of the sEMG signals, and $f_{2}(\cdot)\in\mathbb{R}^{m\times64}$ represents the model utilized for obtaining the deep features of the ultrasound signals. $\theta_{1}$ and $\theta_{2}$ are the parameters of $f_{1}(\cdot)$ and $f_{2}(\cdot)$, respectively. Both $U\in\mathbb{R}^{64\times o}$ and $V\in\mathbb{R}^{64\times o}$ are matrices to be optimized; $u_{i}$ is the $i$th column of $U$, and $v_{j}$ is the $j$th column of $V$. To find the optimal parameter set $(\theta_{1}^{*},\theta_{2}^{*})$, the output deep features of $f_{1}(\cdot)$ and $f_{2}(\cdot)$ are projected as $H_{1}\in\mathbb{R}^{m\times o}$ and $H_{2}\in\mathbb{R}^{m\times o}$ by $U$ and $V$, respectively:

$$\bar{H}_{1}=H_{1}-\frac{1}{m}H_{1}\mathbf{1}_{o\times o},\qquad \bar{H}_{2}=H_{2}-\frac{1}{m}H_{2}\mathbf{1}_{o\times o}.\tag{5}$$

The covariance $R_{11}$ of $\bar{H}_{1}$, the covariance $R_{22}$ of $\bar{H}_{2}$, and their cross covariance $R_{12}$ are defined as

$$\hat{R}_{ij}=\frac{1}{m-1}\bar{H}_{i}^{T}\bar{H}_{j}+rI_{o\times o},\quad i,j\in\{1,2\},\tag{6}$$

where $r>0$ is a regularization parameter. Estimating the covariance matrices with regularization also reduces the detection of spurious correlations in the training data. By defining $T\triangleq\hat{R}_{11}^{-1/2}\hat{R}_{12}\hat{R}_{22}^{-1/2}$, the training objective of the deep networks $f_{1}(\cdot)$ and $f_{2}(\cdot)$ is to make the correlation $\mathrm{corr}(H_{1},H_{2})$ as large as possible:

$$\mathrm{corr}(H_{1},H_{2})=\mathrm{trace}\big((T^{T}T)^{\frac{1}{2}}\big).\tag{7}$$
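For readers who prefer code, the following is a minimal PyTorch sketch of the correlation objective in (1)–(7) in the spirit of deep canonical correlation analysis, computed on already-projected features H1 and H2; the batch-mean centering and the value of the regularizer r are simplifying assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the CCEM correlation objective for projected features of shape (m, o).
import torch

def ccem_correlation(H1, H2, r=1e-3):
    m = H1.shape[0]
    H1c = H1 - H1.mean(dim=0, keepdim=True)          # center each latent dimension over the batch
    H2c = H2 - H2.mean(dim=0, keepdim=True)
    R11 = H1c.T @ H1c / (m - 1) + r * torch.eye(H1.shape[1], device=H1.device)
    R22 = H2c.T @ H2c / (m - 1) + r * torch.eye(H2.shape[1], device=H2.device)
    R12 = H1c.T @ H2c / (m - 1)

    def inv_sqrt(R):                                  # R^(-1/2) via eigendecomposition
        w, V = torch.linalg.eigh(R)
        return V @ torch.diag(w.clamp_min(1e-12).rsqrt()) @ V.T

    T = inv_sqrt(R11) @ R12 @ inv_sqrt(R22)           # T = R11^(-1/2) R12 R22^(-1/2)
    # corr = trace((T^T T)^(1/2)) = sum of singular values of T; maximize it
    # (or minimize its negative) during training.
    return torch.linalg.svdvals(T).sum()
```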

The CCEM alone would lead to the complementary components of the multiple modalities being discarded while their common components are extracted. Generally, an autoencoder can preserve the complementary information of multiple modalities by reconstructing the input sEMG and ultrasonic signals from the extracted deep features. However, the autoencoder may extract deep noise features, which decrease the reconstruction error but are useless for estimating hand force. In this view, the CCRM is proposed in the following, which employs mutual information to extract the essential information from the sEMG and ultrasonic signals. Let the original data set (the sEMG or ultrasonic signals) be denoted by $X_{i}$, $i\in\{1,2\}$, and let $x_{i}\in X_{i}$ be one sample within this set. The extracted deep feature set from $X_{i}$ is denoted by $Z_{i}$, and $z_{i}\in Z_{i}$ is the corresponding deep feature of $x_{i}$. The mutual information of $X_{i}$ and $Z_{i}$ is given by

$$I(X_{i},Z_{i})=\iint p(x_{i},z_{i})\log\frac{p(x_{i},z_{i})}{p(x_{i})p(z_{i})}\,dx\,dz=\mathrm{KL}\big(p(z_{i}|x_{i})\tilde{p}(x_{i})\,\big\|\,p(z_{i})\tilde{p}(x_{i})\big),\tag{8}$$

where $p(\cdot)$ represents the probability density function. It is required that the mutual information $I(X_{i},Z_{i})$ be maximized so that the deep feature $z_{i}$ describes the corresponding $x_{i}$ more exactly, but not the other original samples. Equation (8) indicates that maximizing the mutual information of $x_{i}$ and $z_{i}$ is equivalent to maximizing the distance between the distributions $p(z_{i}|x_{i})\tilde{p}(x_{i})$ and $p(z_{i})\tilde{p}(x_{i})$. As the Kullback–Leibler divergence has no upper bound while being maximized, which may lead to the failure of model training, the Jensen–Shannon divergence is utilized to maximize the mutual information. Hence, (8) is transformed into

$$\max_{p(z_{i}|x_{i})}I(X_{i},Z_{i})\;\Leftrightarrow\;\max_{p(z_{i}|x_{i})}\mathrm{JS}\big(p(z_{i}|x_{i})\tilde{p}(x_{i})\,\big\|\,p(z_{i})\tilde{p}(x_{i})\big)=\max_{F}\ \mathbb{E}_{(x_{i},z_{i})\sim p(z_{i}|x_{i})\tilde{p}(x_{i})}\big[\log\sigma(F(x_{i},z_{i}))\big]+\mathbb{E}_{(x_{i},z_{i})\sim p(z_{i})\tilde{p}(x_{i})}\big[\log\big(1-\sigma(F(x_{i},z_{i}))\big)\big].\tag{9}$$

In (9), $\mathbb{E}_{(x_{i},z_{i})\sim p(z_{i}|x_{i})\tilde{p}(x_{i})}$ indicates that the deep feature $z_{i}$ in this term is extracted from the corresponding $x_{i}$, while $\mathbb{E}_{(x_{i},z_{i})\sim p(z_{i})\tilde{p}(x_{i})}$ specifies that the deep feature $z_{i}$ in this term is extracted from a random sample $x_{i}$ in $X_{i}$. $F(\cdot)$ represents a three-layer perceptron network, and $\sigma(\cdot)$ is the activation function.
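A minimal sketch of how the bound in (9) can be estimated in PyTorch: a small critic network scores matched (x, z) pairs against pairs whose z is shuffled within the batch, which plays the role of drawing z from its marginal. The critic width and the use of flattened inputs are assumptions, not the authors' exact network.

```python
# Sketch of the CCRM objective (9): a three-layer critic scores matched vs. shuffled pairs.
import torch
import torch.nn as nn
import torch.nn.functional as fn

class Critic(nn.Module):
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):                          # x: flattened input, z: deep feature
        return self.net(torch.cat([x, z], dim=1))

def ccrm_js_bound(critic, x, z):
    """Jensen-Shannon-style bound: matched pairs should score high, shuffled pairs low."""
    z_shuffled = z[torch.randperm(z.shape[0])]        # approximates sampling z from its marginal
    pos = fn.logsigmoid(critic(x, z)).mean()          # E[log sigma(F(x, z))]
    neg = fn.logsigmoid(-critic(x, z_shuffled)).mean()  # log(1 - sigma(F)) in a stable form
    return pos + neg                                  # maximize to retain modality-specific detail
```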

Feature Fusion
We obtain the complementary and consistent information of the multiple modalities by optimizing the latent representation through (1) and (9). However, not all of the extracted deep features are related to hand force estimation. In this study, a self-attention layer is adopted to select deep features from the optimized latent representation. The latent representation of an sEMG sample is denoted by $X_{e}=[e_{1},e_{2},\ldots,e_{k}]^{T}$, and that of the ultrasound is denoted by $X_{u}=[u_{1},u_{2},\ldots,u_{L-k}]^{T}$. Their concatenated feature is an $L$-dimensional vector, denoted by $X_{s}=[X_{e}^{T},X_{u}^{T}]^{T}$. Considering that the deep features in the latent representation have no definitive order, a position-encoding layer is not necessary before they are input into the self-attention layer. Therefore, the output of the self-attention layer is given by

$$Q=W_{Q}X_{s},\quad K=W_{K}X_{s},\quad V=W_{V}X_{s},\qquad Z=\mathrm{softmax}\Big(\frac{K^{T}Q}{\sqrt{d}}\Big)V,\tag{10}$$

where $W_{Q}$, $W_{K}$, and $W_{V}$ are the three weight matrices in the self-attention layer to be trained. The output $Z$ is the ultimate latent representation of the multiple modalities, which is used for hand force estimation after being connected to two fully connected layers, as shown in Figure 2. The input and output dimensions of "FC1" in Figure 2 are 128 and 32, respectively, and the input and output dimensions of "FC2" are 32 and 1, respectively. The activation functions of "FC1" and "FC2" are the ELU and the unit mapping, respectively. From the description of each component of the proposed OLR-SACNN model, it is evident that its overall loss function comprises the CCEM loss (1), the CCRM loss (9), and the hand-force-prediction loss (the mean-square error between the output of the OLR-SACNN model and the ground-truth hand force). In the model training stage, these three loss items should be scaled to the same order of magnitude to balance their effects.

Experiments
To verify the OLR-SACNN model's performance, we consider experiments divided into three steps: 1) collect and process multimodal data; 2) verify the performance of the MMCNN, MCNN, and OLR-SACNN models on hand force estimation; and 3) test the performance of the OLR-SACNN model with data from clinical poststroke patients. It should be noted that all the models are implemented on Ubuntu 18.04, utilizing an Nvidia GP102 Titan XP graphics card.
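Before turning to the experiments, the following short sketch summarizes how the three loss terms described above could be combined during training; the balancing weights and function names are illustrative assumptions rather than values reported in the article.

```python
# Illustrative combination of the three training terms: prediction MSE, the CCEM
# correlation (7), and the CCRM bound (9). The weights stand in for the scaling the
# article describes and are assumptions, not reported values.
import torch.nn.functional as F

def olr_sacnn_loss(force_pred, force_true, ccem_corr, ccrm_bound,
                   w_pred=1.0, w_ccem=1.0, w_ccrm=1.0):
    """Total loss to minimize: the correlation and the mutual-information bound are
    maximized, so they enter with negative signs after being scaled to comparable magnitudes."""
    pred_loss = F.mse_loss(force_pred, force_true)
    return w_pred * pred_loss - w_ccem * ccem_corr - w_ccrm * ccrm_bound

# Usage sketch: loss = olr_sacnn_loss(pred, y, ccem_correlation(H1, H2),
#                                     ccrm_js_bound(critic, x_flat, z))
```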


Data Collection
The performance of the MMCNN model is evaluated on the NinaPro DB2 benchmark database, as it contains the finger output force and the corresponding sEMG signals [20]. This database sampled the sEMG signals of the biceps, finger extensor, and triceps at 2,000 Hz using the Delsys Trigno wireless system from 40 subjects. After being filtered by a third-order Butterworth filter to preserve the 20–50-Hz components, the sEMG signals are segmented by a time-sliding window with an 80-ms length, creating 160 × 12 × 1 sEMG samples. In addition to the sEMG signal, the finger force signals are processed by an eighth-order low-pass Butterworth filter to preserve the 0–100-Hz component.

In addition, we built a multimodal data set containing sEMG, ultrasound, and output force signals from 10 healthy subjects. This experiment was approved by the Ethics Committee of the Institute of Automation, Chinese Academy of Sciences on 5 April 2020 (IA-201931). According to the experiment's protocol, the test subjects are required to sit in front of the experiment table, wear the sEMG and ultrasound armbands, put their forearms on the table, and complete the hand grasp action according to the guidance displayed on the laptop screen. During the task, the subjects slowly increase their hand force on the force-collection device and then slowly release their hands from the device. Each subject repeats this process five times.

Model Validation
In this study, the performance of hand force estimation is assessed based on the NMSE and the coefficient of determination $R^{2}$.

MMCNN Model Verification
The experiments compare the proposed MMCNN model with some representative traditional machine learning models: support vector regression (SVR), k-nearest neighbor (KNN), and the gradient-boosting regression tree (GBRT). All the models are tested on the NinaPro DB2 benchmark database, and the sEMG samples from each subject are shuffled and divided into two parts: 70% of the samples are regarded as the training set and 30% as the test set. For the traditional machine learning models, four typical handcrafted sEMG features are extracted, i.e., integrated EMG, root-mean square, zero crossing, and channel energy percentage [14]. The dimension of each handcrafted feature vector is 48 because each channel in one sEMG sample (12 channels) can generate four features. The comparison results illustrated in Figure 7(a) demonstrate that the force-estimation performance of the MMCNN is better than that of SVR, KNN, and GBRT. The average $R^{2}$ of the MMCNN reaches 87.4%, and its average NMSE is 0.0634. The specific force-estimation profiles of one subject are presented in Figure 7(b). The MMCNN's performance is also evaluated on the self-collected data set. As illustrated in Figure 7(c), the

Figure 7. Experimental results of the hand force estimation based on either the MMCNN or the MCNN. (a) The NMSE and R² of the sEMG-based force-estimation results of different models tested on NinaPro DB2. (b) The estimated force profiles of six degrees of freedom (DoF) of one subject of NinaPro DB2 by the MMCNN. (Continued)


Figure 7. (Continued) (c) The NMSE and R² of the sEMG-based force-estimation results of the MMCNN on the self-collected data set. (d) The NMSE and R² of the ultrasound-based hand-force-estimation results of traditional machine learning methods with different segment numbers n. (e) The NMSE and R² of the ultrasound-based hand-force-estimation results of the MCNN.

NMSE of the MMCNN is 0.0088, and the R² of the MMCNN is 66.71%.

MCNN Model Verification
As there is no public database for estimating hand force with A-mode ultrasound signals, the proposed MCNN model's performance is verified on the self-collected data set by comparing the MCNN against the traditional SVR, KNN, and GBRT machine learning methods. The ultrasound samples from each subject are shuffled and divided into two parts: 70% of the samples are regarded as the training set and 30% as the test set. For the traditional machine learning methods, the following features are selected: each ultrasound sample is divided into n segments, and the mean value and standard deviation of each segment are extracted as features; thus, the feature dimension is 16n for one ultrasound sample. In this experiment, n is set to 16 through a trial-and-error method. The experimental results in Figure 7(d) highlight that GBRT achieves the best performance among the selected machine learning models. However, according to Figure 7(e), the MCNN outperforms all the competing traditional machine learning methods (the NMSE of the MCNN is 0.00037, and the R² of the MCNN is 98.59%).
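To make the baseline feature construction concrete, here is a short NumPy sketch of the segment-wise mean/standard-deviation features (16n dimensions for n segments over eight channels); the function name and the segmentation along the depth axis are assumptions consistent with the description above.

```python
# Handcrafted baseline features for one 800 x 8 A-mode ultrasound sample: each channel
# is split into n depth segments described by mean and standard deviation, giving a
# 16n-dimensional vector (8 channels x n segments x 2 statistics).
import numpy as np

def ultrasound_baseline_features(sample, n=16):
    """sample: (800, 8) A-mode frame -> (16 * n,) feature vector for SVR/KNN/GBRT."""
    segments = sample.reshape(n, 800 // n, 8)   # n segments of 800/n depth points per channel
    means = segments.mean(axis=1)               # (n, 8)
    stds = segments.std(axis=1)                 # (n, 8)
    return np.concatenate([means.ravel(), stds.ravel()])
```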




OLR-SACNN Model Verification
First, we verify the performance of the OLR-SACNN without the CCEM and CCRM modules. In this case, the deep features of the sEMG and ultrasound signals are directly concatenated without being optimized. As illustrated in Figures 8(b), 7(c), and 7(e), even without being optimized [F-Concatenation in Figure 8(b)], the OLR-SACNN based on multimodal information still attains a better force-estimation result than the single-modal methods (MMCNN and MCNN). The R2 of the OLR-SACNN reaches 98.9%, increasing the R2 by 0.31% and 32.2% compared to the ultrasound-based MCNN and the sEMG-based MMCNN, respectively; the NMSE is 0.00029, decreasing by 21.62% and 96.7%, respectively. Second, the CCEM and CCRM modules are used in the OLR-SACNN to optimize the latent representation. Considering that the output dimension o of the CCEM in (1) affects the OLR-SACNN's performance, we evaluated the OLR-SACNN with different o values. The experiment results in Figure 8(a) indicate that the OLR-SACNN reaches the best performance for o = 4, with the corresponding force-estimation results on all subjects provided in Figure 8(b). Compared to the case neglecting the CCEM and CCRM modules [F-Concatenation in Figure 8(b)], the OLR-SACNN still improves the performance, as the NMSE decreases by 11.54% and the R2 increases by 0.24%. Third, as discussed in the "Data Processing" section, the CCRM adopts the mutual information mechanism rather than the autoencoder to optimize the latent representation. To verify that the OLR-SACNN adopting the mutual information mechanism can obtain a better result, we perform comparative experiments against the autoencoder-based variant [AE-SACNN in Figure 8(b)], revealing that the NMSE decreases by 0.27%. Fourth, we test the estimation frequency of the OLR-SACNN. For each multimodal sample, the average processing time is roughly 0.3 ms. Considering that the time consumed for data collection is 10 ms (the sampling rate is 100 Hz), the total time required is 10.3 ms. Therefore, the OLR-SACNN's estimation frequency can reach 99.1 Hz. Thus, it is easy to see that the data sampling rate mainly limits the estimation frequency, and it can be dramatically improved if an advanced data-collection device is employed.

Figure 8. Experiment results of the hand force estimation based on the OLR-SACNN with different parameters and structures. (a) The NMSE and R2 of the force estimation based on the OLR-SACNN with different values of o in (1). (b) The NMSE and R2 of the hand force estimation based on the OLR-SACNN with different structures: "OLR-SACNN" represents the case when the deep feature is optimized by the CCEM and CCRM modules, "F-Concatenation" denotes the case without feature optimization, and "AE-SACNN" signifies the case when the deep feature is optimized by the autoencoder. AVG: average.

Analysis of OLR-SACNN
We extract and compare the deep features of the sEMG and ultrasound signals to analyze how the OLR-SACNN balances information from multiple modalities. As illustrated in Figure 9(a), the solid regression line is obtained by fitting the scatter points (each point represents one sample) with the least-squares method. The slope of the regression line is 3.8427, indicating that, in the OLR-SACNN's latent representation, ultrasound plays the more critical role. This result is consistent with the previous experiments, confirming that force estimation based on the ultrasound signal is better than that based on the sEMG signal.
To analyze the function of the CCEM, the correlation between the deep features of the sEMG and the ultrasound is calculated. Specifically, both deep features ("latent representation of sEMG" and "latent representation of ultrasound" in Figure 2) are processed by the principal component analysis method to generate simplified features with a dimension of 16. Then the correlation matrix of these two 16-dimensional features is calculated, considering whether the CCEM is adopted or not. As presented in Figure 9(b), the right correlation matrix (adopting the CCEM) has larger values than the left one (not adopting the CCEM), indicating that the CCEM module prompts the OLR-SACNN to extract more relevant deep features from the different modalities.
To analyze whether the CCRM can guide the OLR-SACNN to extract essential features from the samples rather than noise, the correlation between the input samples and their corresponding deep features is determined by calculating their distance. In Figure 9(c), the horizontal axis represents the distance between the extracted deep feature and the input sample when the CCRM module is adopted, while the vertical axis represents the distance when the CCRM module is not adopted. Each scatter point corresponds to one sample, and the solid regression line is calculated by fitting all samples. It is clear that the slopes of the solid lines are much less than one, indicating that the proposed OLR-SACNN can extract deep features that are sensitive to sample changes, i.e., the extracted deep features are closely related to the samples.
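The correlation analysis described above can be reproduced with standard tools. The sketch below is our illustration rather than the authors' code; the feature dimensions and sample counts are placeholders. It reduces the deep features of each modality to 16 principal components and computes their cross-correlation matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

def cross_correlation(emg_feats, us_feats, n_components=16):
    """Cross-correlation between PCA-reduced sEMG and ultrasound deep features.

    emg_feats, us_feats: arrays of shape (n_samples, feature_dim), one row per
    sample, holding the latent representations of the two branches.
    Returns an (n_components, n_components) matrix of Pearson correlations.
    """
    emg_pc = PCA(n_components).fit_transform(emg_feats)
    us_pc = PCA(n_components).fit_transform(us_feats)
    # np.corrcoef stacks both sets of components; the off-diagonal block is the
    # sEMG-versus-ultrasound correlation matrix.
    full = np.corrcoef(emg_pc.T, us_pc.T)
    return full[:n_components, n_components:]

# Hypothetical latent features for 500 samples with 128-dimensional deep features.
rng = np.random.default_rng(0)
corr = cross_correlation(rng.normal(size=(500, 128)), rng.normal(size=(500, 128)))
```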

Figure 9. A function analysis of different modules of the OLR-SACNN model. (a) A 2D distribution of the norms of the deep features from the sEMG and the ultrasound. (b) The correlation matrix of deep features from the sEMG and the ultrasound (the horizontal axis is the simplified deep feature of the sEMG, and the vertical axis is that of the ultrasound). (c) The distance between the original samples and their corresponding deep features.


Table 2. The 12 recruited patients' information.

Patient   Age   Cause         Side   BS
P1        48    Hemorrhagic   R      5
P2        63    Hemorrhagic   L      4
P3        49    Ischemic      L      5
P4        63    Hemorrhagic   L      6
P5        64    Ischemic      R      4
P6        61    Hemorrhagic   R      5
P7        78    Ischemic      L      4
P8        22    Ischemic      R      4
P9        43    Ischemic      R      6
P10       60    Hemorrhagic   L      5
P11       47    Ischemic      L      6
P12       73    Hemorrhagic   R      5

Side: paralyzed side of the patient; L: left; R: right; BS: Brunnstrom stage of the patient.

Performance Test of OLR-SACNN by Clinical Data
This section applies the proposed OLR-SACNN model to estimate poststroke patients' hand force. We collected clinical data from poststroke patients at the China Rehabilitation Research Center, and this clinical experiment was approved by the China Rehabilitation Research Center (2021-108-1) on 12 July 2021. The patient recruitment criteria are 1) between 18 and 80 years of age; 2) at Brunnstrom stage 4, 5, or 6 (the hemiplegic hand has voluntary strength); and 3) without severe cognitive impairment. The exclusion criteria are 1) other neurological diseases, 2) orthopedic diseases, and 3) inability to learn how to use the hand-force-collection device. Based on these criteria, we recruited 12 qualified patients, whose information is listed in Table 2. All patients were given written and verbal information about this experiment, and signed informed consent statements were received from the patients before the experiment. The experiment protocol for the poststroke patients is the same as for the healthy subjects, and the collection process is illustrated in Figure 3(b) and (c). According to the experiment results presented in Figure 10, the average NMSE of the hand force estimation is 0.002349, and the average R2 is 0.79794. Compared to the subjective assessment using the modified Ashworth scale, this estimated hand force has the potential to provide a quantitative strength rehabilitation assessment for poststroke patients.

Figure 10. The NMSE and R2 of the estimated hand force by the OLR-SACNN based on the poststroke patients' clinical data.

Discussion and Conclusion
This study proposed a multimodal fusion model named OLR-SACNN to estimate hand force. An information-acquisition system capable of simultaneously measuring sEMG, ultrasound, and hand force was also presented. The proposed OLR-SACNN attains satisfactory performance on the self-collected data set; compared to the estimation methods based on a single modality (sEMG or ultrasound), the NMSE of the force estimation decreases by 97.7% (sEMG) and 38.92% (ultrasound), respectively. The proposed OLR-SACNN has also been utilized to estimate the hand force of poststroke patients, which shows its potential for robot-assisted rehabilitation and quantitative assessment.
Our research shows that ultrasound-based estimation of hand force outperforms sEMG-based estimation. Given that human muscle can be modeled as a unit-like spring, muscle force is strongly related to the muscle's morphology; therefore, ultrasound, which directly reflects this morphology, achieves better performance, consistent with the nature of muscle activity. However, A-mode ultrasound has been less prevalent than sEMG for force estimation in recent years. For additional details, the reader is referred to the literature related to ultrasound-based human–robot interaction and the further broadening of its application domain. This study also illustrated that force estimation based on multiple modalities has a better accuracy than a single


modality. Although ultrasound dominates force estimation, an sEMG is mainly affected by the central nervous system, which is less susceptible to interference with muscle deformation caused by external factors. Combining these complementary modalities in a wearable device increases estimation accuracy, which plays an essential role in medical assessment and other fields, like human-dexterous skill learning. Acknowledgments This work is supported in part by the National Natural Science Foundation of China (grant numbers 62025307 and U1913209), the Beijing Municipal Natural Science Foundation (grant number JQ19020), and the Department of Mathematics and Theories, Peng Cheng Laboratory, China. References [1] F. Wang, Z. Qian, Y. Lin, and W. Zhang, “Design and rapid construction of a cost-effective virtual haptic device,” IEEE/ASME Trans. Mechatron., vol. 26, no. 1, pp. 66–77, 2021, doi: 10.1109/TMECH.2020.3001205. [2] S. Crea et al., “Feasibility and safety of shared EEG/EOG and visionguided autonomous whole-arm exoskeleton control to perform activities of daily living,” Sci. Rep., vol. 8, no. 1, pp. 1–9, 2018, doi: 10.1038/ s41598-018-29091-5. [3] D. Leonardis et al., “An EMG-controlled robotic hand exoskeleton for bilateral rehabilitation,” IEEE Trans. Haptics, vol. 8, no. 2, pp. 140– 151, 2015, doi: 10.1109/TOH.2015.2417570. [4] L. Rinaldi et al., “Adapting to the mechanical properties and active force of an exoskeleton by altering muscle synergies in chronic stroke survivors,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 28, no. 10, pp. 2203–2213, 2020, doi: 10.1109/TNSRE.2020.3017128. [5] B.-S. Lin, I.-J. Lee, and J. L. Chen, “Novel assembled sensorized glove platform for comprehensive hand function assessment by using inertial sensors and force sensing resistors,” IEEE Sensors J., vol. 20, no. 6, pp. 3379–3389, 2020, doi: 10.1109/JSEN.2019.2958533. [6] G. Hajian, A. Etemad, and E. Morin, “Generalized EMG-based isometric contact force estimation using a deep learning approach,” Biomed. Signal Process. Control, vol. 70, 2021, Art. no. 103012, doi: 10.1016/j.bspc. 2021.103012. [7] I. J. R. Martínez, A. Mannini, F. Clemente, and C. Cipriani, “Online grasp force estimation from the transient EMG,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 28, no. 10, pp. 2333–2341, 2020, doi: 10.1109/TNSRE. 2020.3022587. [8] C. H. Wu, Y. C. Ho, M. Y. Hsiao, W. S. Chen, and T. G. Wang, “Evaluation of post-stroke spastic muscle stiffness using shear wave u ­ ltrasound elastography,” Ultrasound Med. Biol., vol. 43, no. 6, pp. 1105–1111, 2017, doi: 10.1016/j.ultrasmedbio.2016.12.008. [9] N. Akhlaghi et al., “Sparsity analysis of a sonomyographic musclecomputer interface,” IEEE Trans. Biomed. Eng., vol. 67, no. 3, pp. 688– 696, 2020, doi: 10.1109/TBME.2019.2919488. [10] J. Yan, X. Yang, X. Sun, Z. Chen, and H. Liu, “A lightweight ultrasound probe for wearable human-machine interfaces,” IEEE Sensors J., vol. 19, no. 14, pp. 5895–5903, 2019, doi: 10.1109/JSEN.2019.2905243. [11] C. Ahmadizadeh, L. Merhi, B. Pousett, S. Sangha, and C. Menon, “Toward intuitive prosthetic control: Solving common issues using force myography, surface electromyography, and pattern recognition


in a pilot case study,” IEEE Robot. Automat. Mag., vol. 24, no. 4, pp. 102– 111, 2017, doi: 10.1109/MRA.2017.2747899. [12] H. Mao, P. Fang, and G. Li, “Simultaneous estimation of multi-finger forces by surface electromyography and accelerometry signals,” Biomed. Signal Process. Control, vol. 70, 2021, Art. no. 103005, doi: 10.1016/j.bspc.2021.103005. [13] X. Xi et al., “Simultaneous and continuous estimation of joint angles based on surface electromyography state-space model,” IEEE Sensors J., vol. 21, no. 6, pp. 8089–8099, 2021, doi: 10.1109/JSEN.2020. 3048983. [14] L. Cheng, Y. Liu, Z. G. Hou, M. Tan, D. Du, and M. Fei, “A rapid spiking neural network approach with an application on hand gesture recognition,” IEEE Trans. Cogn. Develop. Syst., vol. 13, no. 1, pp. 151–161, 2021, doi: 10.1109/TCDS.2019.2918228. [15] M. Jochumsen, A. Waris, and E. N. Kamavuako, “The effect of arm position on classification of hand gestures with intramuscular EMG,” Biomed. Signal Process. Control, vol. 43, pp. 1–8, May 2018, doi: 10.1016/j. bspc.2018.02.013. [16] W. Wei, Q. Dai, Y. Wong, Y. Hu, M. Kankanhalli, and W. Geng, “Surface-electromyography-based gesture recognition by multi-view deep learning,” IEEE Trans. Biomed. Eng., vol. 66, no. 10, pp. 2964–2973, 2019, doi: 10.1109/TBME.2019.2899222. [17] X. Yang, J. Yan, Z. Chen, H. Ding, and H. Liu, “A proportional pattern recognition control scheme for wearable A-mode ultrasound sensing,” IEEE Trans. Ind. Electron., vol. 67, no. 1, pp. 800–808, 2020, doi: 10.1109/TIE.2019.2898614. [18] X. Yang, J. Yan, and H. Liu, “Comparative analysis of wearable A-mode ultrasound and sEMG for muscle-computer interface,” IEEE Trans. Biomed. Eng., vol. 67, no. 9, pp. 2434–2442, 2020, doi: 10.1109/ TBME.2019.2962499. [19] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning: Objectives and optimization,” in Proc. 32nd Int. Conf. Mach. Learn., Lille, France, Dec. 2015, pp. 1083–1092. [20] M. Atzori et al., “Electromyography data for non-invasive naturally-controlled robotic hand prostheses,” Sci. Data, vol. 1, no. 1, p. 140, 053, 2014, doi: 10.1038/sdata.2014.53.

Yongxiang Zou, School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China, and State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China. Email: zouyongxiang2019@ ia.ac.cn. Long Cheng, State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China, and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China. Email: long.cheng@ ia.ac.cn. Zhengwei Li, State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China. Email: lizheng [email protected]. 

A Hybrid Visual–Haptic Framework for Motion Synchronization in Human–Robot Cotransporting

A Human Motion Prediction Method

By Xinbo Yu, Sisi Liu, Wei He, Yifan Wu, Hui Zhang, and Yaonan Wang

Digital Object Identifier 10.1109/MRA.2022.3210565 Date of current version: 28 October 2022

In this article, we propose a hybrid visual–haptic framework enabling a robot to achieve motion synchronization in human–robot cotransporting. Visual sensing is employed in capturing human motion in real time. To deal with the inherent delays between the human's initiative motion and the robot's responsive motion in cotransporting, a human motion prediction method is developed to make the robot follow human motion proactively. Motion synchronization is achieved when the robot accurately tracks the filtered and predicted human motion. Force sensing is

utilized to regulate interaction forces to ensure compliance when motion error between the human and robot is generated. A neural network (NN)-based control is proposed to achieve precise trajectory tracking. Comparative experimental results show that our proposed framework is effective in such cotransporting tasks.

Introduction
In recent years, a vast variety of robots has come into our daily life. Robots can physically interact with humans to accomplish many human–robot cooperative tasks, such as cotransporting, collaborative assembly, teleoperation, cosawing, and so on. Endowing robots with the ability to learn




human complex skills has become an important method in physical human–robot interaction [1]. Robots are expected to have the ability to control both positions and forces to regulate cooperative behaviors. Achieving motion synchronization is a key issue in human–robot cotransporting. Considering a human–human cotransporting scenario, a leader and a follower cotransport an object and move it to a target position, with the leader aware of the target position and moving the object actively and the follower sharing the load of the object and cooperating with the leader passively, so there exists an inherent delay between the leader and the follower. As shown in Figure 1, the human knows the task flow and makes decisions to perform the task, while the robot measures

Figure 1. A human–robot cotransporting task. (a) Delays in a human–robot collaborative task. (b) An illustration of the human–robot cotransporting task.


the human motion and plans its own motion to follow the human’s. As illustrated by the black arrows, there are delays between the human’s initiative motion and the robot’s responsive motion, which will decrease the collaboration efficiency. If the robot can predict human motion, it can plan its own motion under the designed controller to decrease the delays, and motion synchronization between humans and robots can be achieved [see the red arrows in Figure 1(a)]. Many human motion prediction methods have been developed in recent years, such as the hidden Markov model, linear regression (LR), the autoregressive integrated moving average model, recurrent NNs (RNNs), and so on. In [2], NNs are utilized to estimate human motion intent. In [3], human intent is estimated without force sensors by observing robot control efforts. In [4], an intentional reaching direction is used to describe the human motion intention of the wearer’s upper limb, and a novel human–robot interface is designed to estimate it in real time. Researchers in the field of computer vision are also working on human motion prediction, and [5] proposes a simple and scalable RNN architecture for predicting human motion. Other methods also can be used to accomplish the human–robot cotransporting task, such as the vision method (V method, based only on visual recognition of human motion), the force method (F method, based on an impedance control framework), and so on. The visual– force method (VF method) combines visual and force sensing together in such tasks, and the robot will adapt to human motion actively, which can be applied to the scenario of human–robot cotransporting. However, it is not synchronized between human and robot motion, and there exist delays of robot motion that generate a large interaction force in applications. Yu et al. [6] propose a VF framework in the scenario of human–robot cocarrying. Recently, some novel and important control methods for robot control were proposed in [7], [8], [9], and [10]. From a control perspective, controller design in human– robot cotransporting tasks is challenging because of uncertainties in the coupled dynamics and kinematics. The performance of human–robot collaboration with the current controller design is still far from expected, especially when compared to the performance of human–human collaboration. Many types of research combine some learning method [11] and focus on only force or position control design to improve efficiency in such human–robot collaboration tasks. Yang et al. 2018 [12] introduce a physical haptic feedback mechanism to realize the human adaptive impedance transfer and apply it to the physical human–robot interaction. Dong et al. 2021 [13] study the impedance control of a robot interacting with an unknown environment, so that the interaction performance between the robot and the environment is improved. In [14], a novel hierarchical human-in-the-loop control is proposed that includes impedance learning and human-in-the-loop adaptive management.

Several recent works have been aimed at human–robot cotransporting and have explored human–robot joint manipulation. Huang et al. [15] propose a method for single-leader–dual-follower teleoperation based on force to estimate the position in the presence of dynamic uncertainties. Humanoids and mobile robots are utilized to transport the object in collaboration with humans. In [16], a machine learning approach based on sensory information is proposed without force sensors, and a statistical model of human behavior is developed. Some works focus on the estimation of unknown cotransported object dynamics. Cehajic et al. [17] proposes a method to identify an unknown grasp pose of a human, where the leastsquares method is utilized to achieve an online estimation of relative kinematics. Task assignment is also an important research content in human–robot collaboration tasks. Some research works achieve role exchanging by granting the control effort of a human regarding human motion intentions or systematically deriving role assignment from the geometric and dynamic task properties. Some strategies on the assignment of task effort are proposed in [18], in which a controller is introduced that enables humanoid robots to perform a cotransporting task with the human. Robots can estimate the human motion intentions to assist the human, and then robots take over the leadership of the task to perform it. In [19], the authors propose a method by which the robot can learn reactive and proactive behaviors in the joint task by learning from demonstrations. Most of the aforementioned control approaches use force information to generate passive behaviors for the robot, which inherently brings a burden to the human. Some works have considered a hybrid visual–haptic framework, which combines vision and haptic information, with the state of the task obtained from visual sensing. Inspired by these works, we have integrated vision and haptic sensing to perform a cotransporting task [20]. However, the aforementioned issues of delays have not been considered. In this article, we propose a hybrid visual–haptic framework for a human–robot cotransporting task. A novel method is proposed to predict human motion for reducing delays to achieve motion synchronization. Visual sensing is utilized in capturing human motion in real time, and robot motion is refined based on force sensing. Adaptive NN-based control is proposed to achieve precise trajectory tracking. Comparative experimental results show that our proposed framework is effective in a cotransporting task, motion synchronization can be achieved, and the human participants find it easy to accomplish the task. Based on these discussions, we highlight our contributions as follows: 1) A human motion prediction method is proposed to reduce delays in a cotransporting task, and a composite filter is designed to obtain human motion data.

2) A hybrid visual–haptic framework is proposed that integrates impedance control and visual servoing.
The rest of the article is presented as follows. The next section presents the dynamics of human–robot collaboration systems, and the section "Hybrid Visual–Haptic Framework" introduces visual signal sensing, predicting, and filtering. The section "Control Framework" proposes a hybrid visual–haptic control framework considering uncertainties in a system's dynamics and also analyzes the system stability. The proposed control framework is evaluated in the section "Experiment" by comparative experiments. The section "Conclusion" concludes our work.

Problem Formulation

System Description
Figure 1(b) shows a scenario including two agents whose goal is to transport an object from a "starting position" to a "storage position." Because of the length and weight of the object, this task cannot be accomplished individually by a single agent. Inspired by human–human cotransporting, we introduce a robot collaborator in the task. In this work, the human plays a leader role by initiating the movement. The robot plays a follower role to share the load and collaborates with the human. We see that large delays between the human and robot may lead to task failure. We assume that the robot is equipped with visual and haptic sensors, e.g., a camera and a force torque (FT) sensor.

Definition 1
Human motion is defined as the position of the human hand, which contacts the transported object. Robot motion is defined as the position of the robotic gripper, which grasps the transported object.

If the robot can predict human motion and track it precisely, motion synchronization can be achieved. Visual servoing has been extensively studied, but it is not applicable to the cotransporting task under study because 1) a large resistance force will be generated if the human and robot have asynchronized motion, and 2) human motion prediction based on vision only is always subject to model overfitting. Force-based control methods are also utilized in related works, where robots can regulate the interaction force and generate different behaviors to cooperate with humans. However, delays in cooperation are inevitable based on force feedback only, and this leads to large interaction forces.




In this work, we complement the advantages of visual and force sensing and propose a hybrid framework combining visual servoing, human motion prediction, and force feedback. The purpose of our proposed control is to solve the issue that delays caused by various situations lead to unsynchronized motion and large interaction forces in the scenario of a human–robot cotransporting task. The application of our proposed control can reduce the delay time, achieve human–robot synchronization, and reduce the interaction force between the human and the robot.

Dynamics of the Human–Robot Cotransporting System
In the task under study, the human hand and the robotic gripper hold the object jointly. We consider the dynamic model of an n-link robot and the object in the joint space as follows:

$$M(\theta)\ddot{\theta} + C(\theta,\dot{\theta})\dot{\theta} + G(\theta) = \tau - \tau_r, \qquad (1)$$

where $\theta$, $\dot{\theta}$, and $\ddot{\theta} \in \mathbb{R}^n$ are the robot joint angle, velocity, and acceleration vectors, respectively. $M(\theta) \in \mathbb{R}^{n \times n}$ is the inertia matrix, and it is symmetric and uniformly positive definite. $C(\theta,\dot{\theta})\dot{\theta} \in \mathbb{R}^n$ is the vector of centripetal and Coriolis forces. $G(\theta) \in \mathbb{R}^n$ denotes the vector of the bounded gravitational force. $\tau \in \mathbb{R}^n$ and $\tau_r \in \mathbb{R}^n$ denote the vectors of the control input torque and the external torque.

Figure 2. The human elbow joint angle $\theta_h$ obtained by visual sensing.

Hybrid Visual–Haptic Framework

Visual Sensing of Human Motion
In the "pick and place" task, the real-time angle of a human elbow joint can be obtained by the robot's visual sensing. A Kinect 2 depth camera is utilized in the task, and the project "BodyBasics-D2D" of the Kinect 2 depth camera is utilized in obtaining position information of the human body. Pixel points of human joints can be obtained in real time, and depth maps are matched with the color camera picture to get 3D positions in the camera coordinate system. Joint points of the human body are first drawn in real time and then connected by line segments through Kinect's internal functions to obtain a real-time human skeleton image. As seen in Figure 2, the locations of the right shoulder (A), elbow (B), and wrist (C) in the task space can be identified by the Kinect 2. Right shoulder location A and right elbow location B generate one space vector $\overrightarrow{AB}$, while right elbow location B and right wrist location C form another space vector $\overrightarrow{BC}$. These vectors define the elbow joint angle, which is confined to the xy-plane, and we define it as $\theta_h$.
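As a concrete illustration of this step, the following sketch (ours, not the authors' implementation; the joint coordinates are placeholders) computes the elbow angle from the shoulder, elbow, and wrist positions reported by the skeleton tracker, following the AB/BC vector definition and xy-plane restriction given above:

```python
import numpy as np

def elbow_angle(shoulder, elbow, wrist):
    """Elbow joint angle (rad) from the 3D positions of the right shoulder (A),
    elbow (B), and wrist (C): the angle between the vectors AB and BC,
    restricted to the camera xy-plane as described in the text."""
    a, b, c = (np.asarray(p, dtype=float) for p in (shoulder, elbow, wrist))
    ab = (b - a)[:2]   # vector A -> B, projected onto the xy-plane
    bc = (c - b)[:2]   # vector B -> C, projected onto the xy-plane
    cos_angle = np.dot(ab, bc) / (np.linalg.norm(ab) * np.linalg.norm(bc))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Example with arbitrary camera-frame coordinates (meters).
theta_h = elbow_angle([0.10, 0.40, 1.2], [0.12, 0.15, 1.2], [0.30, 0.05, 1.1])
```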



Human Motion Prediction
Human behavior is continuous when executing a transportation task, and this article designs a linear predictor to estimate the human elbow angle in the future, so that the robot can be controlled in advance to reduce the delay between the robot and the human. In this part, we define $\theta_h(t)$ as the actual angle of the human elbow at time $t$, $\hat{\theta}_h(t-iT)$ as the fitted value of the human elbow angle at time $t-iT$ given by the human motion predictor at time $t$, and $\hat{\theta}_{ho}(t)$ as the predicted future angle of the human elbow computed at time $t$. LR is utilized for fitting the historical data $\theta_h(t-iT)$, where $i$ is the serial number of past measured values, $x_i$ is the past time series, and we always set $x_i = i$. $a$ and $b$ are unknown time-varying parameters of the linear predictor. Square loss is a commonly used index to evaluate the fitting performance, and we use the error of the past data fitting value $\tilde{\theta}_h(t-iT) = \theta_h(t-iT) - \hat{\theta}_h(t-iT)$ to find the estimators of $a$ and $b$, which are expressed as $\hat{a}$ and $\hat{b}$:

$$E(\hat{a},\hat{b}) = \sum_{i=0}^{m} \big(\theta_h(t-iT) - \hat{\theta}_h(t-iT)\big)^2,$$
$$\frac{\partial E(\hat{a},\hat{b})}{\partial \hat{a}} = 2\Big(\hat{a}\sum_{i=0}^{m} x_i^2 - \sum_{i=0}^{m}\big(\theta_h(t-(m-i)T) - \hat{b}\big)x_i\Big) = 0,$$
$$\frac{\partial E(\hat{a},\hat{b})}{\partial \hat{b}} = 2\Big(m\hat{b} - \sum_{i=0}^{m}\big(\theta_h(t-(m-i)T) - \hat{a}x_i\big)\Big) = 0, \qquad (2)$$

where we consider the partial derivatives of $E(\hat{a},\hat{b})$ with respect to $\hat{a}$ and $\hat{b}$ to be zero. The predicted angle value $\hat{\theta}_{ho}(t)$ is obtained by $\hat{\theta}_{ho}(t) = \hat{a}x_p(t) + \hat{b}$ after getting $\hat{a}$ and $\hat{b}$, so as to minimize $|\theta_h(t+pT) - \hat{\theta}_{ho}(t)|$, where $p$ denotes the prediction serial number, and $x_p$ denotes the adaptation weight determined by $p$, which is updated continuously by the gradient descent method:

$$x_p(t+T) = x_p(t) - \epsilon\,\frac{d\tilde{\theta}_{ho}(t)}{dx_p(t)} = x_p(t) - \epsilon\,\frac{d\big(\hat{\theta}_{ho}(t-pT) - \theta_h(t)\big)}{dx_p(t)}, \qquad (3)$$

where $\tilde{\theta}_{ho}(t)$ represents the prediction error, and $\epsilon \in (0,1)$ represents the adaptation rate of $x_p$. As seen from Figure 3(a), generated from simulations, we set the original curve $\theta_h(t)$ as $0.675\sin(t) + 1.508$ and choose different prediction serial numbers $p = 10, 13, 15, 18, 20$ to evaluate the effectiveness of the fitting method. We observe that when the value of $p$ gets larger, the prediction step is farther, but the error between the original value $\theta_h(t+pT)$ and the predicted value $\hat{\theta}_{ho}(t)$ also becomes larger. Therefore, choosing a proper prediction serial number is very important in practice. From Figure 3(b) and (c), we can see that the adaptation weight $x_p$ can handle different initial values and different prediction serial numbers and converge to the ideal $x_p$.

Figure 3. Simulation results of the LR prediction method. (a) Prediction performance with different prediction steps p. (b) xp for different prediction serial numbers p. (c) xp for different initial values with the same p.
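A minimal Python sketch of this predictor is given below. It is our illustration rather than the authors' implementation: the window length, prediction step, adaptation rate, and indexing convention are placeholders, and the x_p update is our squared-error reading of the gradient rule in (3). The fit of a and b follows the least-squares conditions in (2):

```python
import numpy as np

class LinearPredictor:
    """Linear extrapolation of the elbow angle with an adaptive weight x_p."""

    def __init__(self, m=20, p=15, eps=0.05):
        self.m = m                  # window uses the last m+1 samples
        self.p = p                  # prediction serial number (steps ahead)
        self.eps = eps              # adaptation rate of x_p, in (0, 1)
        self.x_p = float(m + p)     # adaptive abscissa (newest sample sits at index m)
        self.history = []           # past measured angles theta_h
        self.delayed = {}           # predictions indexed by the step they were made at
        self.step = 0

    def update(self, theta_h):
        self.history.append(theta_h)
        self.history = self.history[-(self.m + 1):]
        if len(self.history) < 2:
            a, pred = 0.0, theta_h
        else:
            x = np.arange(len(self.history))          # x_i = i, oldest to newest
            a, b = np.polyfit(x, self.history, 1)     # least-squares fit of a and b
            pred = a * self.x_p + b                   # evaluate at the adaptive abscissa
        # Gradient-style update of x_p driven by the prediction made p steps ago;
        # d(prediction)/dx_p equals the slope a used at that time.
        if self.step - self.p in self.delayed:
            old_pred, old_a = self.delayed.pop(self.step - self.p)
            self.x_p -= self.eps * (old_pred - theta_h) * old_a
        self.delayed[self.step] = (pred, a)
        self.step += 1
        return pred

# Example on the simulated trajectory used in Figure 3(a).
predictor = LinearPredictor()
for t in np.arange(0, 30, 0.05):
    predicted = predictor.update(0.675 * np.sin(t) + 1.508)
```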

Signal Filtering
Since the predicted elbow joint angle may include noise, we propose a composite filter to filter the prediction value, which combines a moving average filter and a dither removal filter. The moving average filter is utilized to calculate the mean $\bar{\hat{\theta}}_{ho}(t)$ of the data. The first output value of the filter is the average of the first $n$ frames, and the moving average value $\bar{\hat{\theta}}_{ho}(t)$ of each frame is compared with the output value of the previous frame $\hat{\theta}_{ho}(t-T)$. If the difference is less than the threshold $c$, the value of the previous frame is kept as the output $\hat{\theta}_{ho}(t)$. Otherwise, the output of the filter is the weighted sum of the previous frame $\hat{\theta}_{ho}(t-T)$ and the current moving average $\bar{\hat{\theta}}_{ho}(t)$. The mathematical form of the filter is described as follows:

$$\bar{\hat{\theta}}_{ho}(t) = \frac{1}{n+1}\sum_{i=0}^{n}\hat{\theta}_{ho}(t-iT),$$
$$\hat{\theta}_{ho}(t) = \begin{cases} \hat{\theta}_{ho}(t-T), & \big|\bar{\hat{\theta}}_{ho}(t) - \hat{\theta}_{ho}(t-T)\big| < c, \\ d\,\bar{\hat{\theta}}_{ho}(t) + (1-d)\,\hat{\theta}_{ho}(t-T), & \text{otherwise,} \end{cases} \qquad (4)$$

where $d$ denotes the weight. Human motion predictions and filtering are always subject to model overfitting, and this will influence the control performance. Therefore, we propose a hybrid visual–haptic framework to complement the advantages of visual and force sensing to solve this issue, as introduced in the next section.
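The composite filter in (4) can be sketched as follows (our illustration; the window length n, threshold c, and weight d are placeholder values):

```python
import numpy as np

class CompositeFilter:
    """Moving-average plus dither-removal filter for the predicted angle, as in (4)."""

    def __init__(self, n=10, c=0.02, d=0.6):
        self.n = n          # moving-average window length
        self.c = c          # dither threshold (rad)
        self.d = d          # blending weight for the current moving average
        self.window = []    # recent predicted angles
        self.prev = None    # previous filter output

    def update(self, predicted):
        self.window.append(predicted)
        self.window = self.window[-self.n:]
        avg = float(np.mean(self.window))       # moving average of the prediction
        if self.prev is None:
            self.prev = avg                     # first output: plain average
        elif abs(avg - self.prev) < self.c:
            pass                                # small change: keep the previous output
        else:
            self.prev = self.d * avg + (1 - self.d) * self.prev
        return self.prev

filt = CompositeFilter()
smoothed = [filt.update(v) for v in np.random.default_rng(0).normal(1.5, 0.05, 200)]
```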

Impedance Behavior
We consider impedance control for the robot with a desired mass–damping–stiffness impedance model as follows:

$$\tau_r = M_r(\ddot{\theta}_r - \ddot{\theta}_d) + D_r(\dot{\theta}_r - \dot{\theta}_d) + K_r(\theta_r - \theta_d), \qquad (5)$$

where $M_r$, $D_r$, and $K_r$ denote the desired mass, damping, and stiffness matrices, $\theta_d$ denotes the desired joint angle of the robot, and $\theta_r$ denotes the reference joint angle. We design the desired trajectory $\theta_d$ as $\dot{\theta}_d(t) = \dot{\hat{\theta}}_{ho}(t) - \alpha\tau_r$, where $\alpha$ is a positive scalar.
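To make the use of the impedance model concrete, the sketch below (ours; the gains, time step, and the assumption that the desired acceleration is negligible over one step are not from the article) updates the desired trajectory from the filtered prediction and the measured external torque and integrates (5) to obtain the reference angle that the low-level controller tracks:

```python
def desired_trajectory_step(theta_d, dtheta_ho_hat, tau_r, alpha=0.1, dt=0.01):
    """Desired-trajectory update: d_theta_d = d_theta_hat_ho - alpha * tau_r."""
    dtheta_d = dtheta_ho_hat - alpha * tau_r
    return theta_d + dtheta_d * dt, dtheta_d

def impedance_reference(theta_r, dtheta_r, theta_d, dtheta_d, tau_r,
                        M_r=1.0, D_r=5.0, K_r=20.0, dt=0.01):
    """One Euler step of the desired impedance model (5), solved for theta_r.
    The desired acceleration is assumed negligible over one step (an assumption
    of this sketch, not stated in the article)."""
    ddtheta_r = (tau_r - D_r * (dtheta_r - dtheta_d) - K_r * (theta_r - theta_d)) / M_r
    dtheta_r += ddtheta_r * dt
    theta_r += dtheta_r * dt
    return theta_r, dtheta_r
```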


Remark 1
We note that the adaptation law $\dot{\theta}_d$ of the desired trajectory in $\dot{\theta}_d(t) = \dot{\hat{\theta}}_{ho}(t) - \alpha\tau_r$ includes two parts: the signal $\dot{\hat{\theta}}_{ho}$ is obtained by visual sensing (introduced in the section "Visual Sensing of Human Motion"), motion prediction (introduced in the section "Human Motion Prediction"), and filtering (introduced in the section "Signal Filtering"), and $-\alpha\tau_r$ is the force feedback for correcting the motion error between the human and robot. If there exists no external torque, i.e., $\tau_r = 0$, the adaptation relies only on vision prediction, and this method is always subject to model overfitting, as explained before. If vision prediction is not involved in $\dot{\theta}_d(t) = \dot{\hat{\theta}}_{ho}(t) - \alpha\tau_r$, i.e., $\dot{\hat{\theta}}_{ho} = 0$, delays in cooperation are inevitable based on only force feedback. Based on (5), with $M_r$, $D_r$, $K_r$ and $\tau_r$ measured by the force sensor, $\theta_r$ can be calculated. If $\theta$ tracks $\theta_r$ precisely, (5) can be rewritten as $\tau_r = M_r(\ddot{\theta} - \ddot{\theta}_d) + D_r(\dot{\theta} - \dot{\theta}_d) + K_r(\theta - \theta_d)$. For the later stability analysis, we consider that the human motor control can be described by an impedance model as $-\tau_r = K_o(\theta - \theta_h)$, where $K_o$ denotes the human arm stiffness gain.

Control Framework

Hybrid Visual–Haptic Control
In this section, we design a hybrid control framework including visual servoing and impedance control. This framework combines the complementary advantages of these two methods, as will be illustrated by the experimental results. We design the robot control torque as follows:

$$\tau = \tau_{ff} + \tau_{fb} + \tau_r, \quad \tau_{ff} = M(\theta)\dot{c} + C(\theta,\dot{\theta})c + G(\theta), \quad \tau_{fb} = -K_P(\theta - \theta_r) - K_D(\dot{\theta} - \dot{\theta}_r), \qquad (6)$$

where $\tau_{ff}$ denotes the feedforward torque for compensating for the robot's dynamics, and $\tau_{fb}$ denotes the feedback torque for tracking $\theta_r$. In $\tau_{fb}$, $K_P$ and $K_D$ denote the proportional and differential gains. We define the virtual vector $c = -K_c(\theta - \theta_r) + \dot{\theta}_r$, where $K_c$ denotes a positive gain. Considering that there exist uncertainties in the robot's dynamics, we utilize a radial basis function NN (RBFNN) to solve this issue, so that $\tau_{ff}$ can be redesigned as

$$\tau_{ff} = \hat{\omega}^T S(Z), \quad \omega^{*T} S(Z) - w(Z) = M(\theta)\dot{c} + C(\theta,\dot{\theta})c + G(\theta), \quad \dot{\hat{\omega}} = -\Gamma\big[S(Z)(\dot{\theta} - c) + \sigma\hat{\omega}\big], \qquad (7)$$

where the input vector is $Z = [\theta^T, \dot{\theta}^T, c^T, \dot{c}^T]$, $S(Z)$ denotes the basis function, and $\hat{\omega}^T S(Z)$ is the approximate value of $\omega^{*T} S(Z)$. $\dot{\hat{\omega}}$ denotes the weight adaptation law of the RBFNN, $\Gamma$ denotes a gain matrix, and the positive constant $\sigma$ is designed to improve the robustness.
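A compact single-joint sketch of the control law (6) with an RBFNN feedforward in the spirit of (7) is shown below. It is our illustration: the Gaussian basis functions, network size, gains, and discrete-time weight update are chosen only for demonstration and are not the authors' implementation:

```python
import numpy as np

class RBFNNController:
    """Torque control tau = tau_ff + tau_fb + tau_r with an adaptive RBFNN feedforward
    approximating M(theta) c_dot + C(theta, theta_dot) c + G(theta) (single joint)."""

    def __init__(self, n_nodes=64, dim=4, kp=50.0, kd=5.0, kc=2.0,
                 gamma=0.5, sigma=0.01, dt=0.01):
        rng = np.random.default_rng(0)
        self.centers = rng.uniform(-2.0, 2.0, size=(n_nodes, dim))  # RBF centers over Z
        self.width = 1.0
        self.w_hat = np.zeros(n_nodes)   # adaptive NN weights
        self.kp, self.kd, self.kc = kp, kd, kc
        self.gamma, self.sigma, self.dt = gamma, sigma, dt

    def basis(self, z):
        d2 = np.sum((self.centers - z) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.width ** 2))      # Gaussian basis S(Z)

    def control(self, theta, dtheta, theta_r, dtheta_r, ddtheta_r, tau_r):
        c = -self.kc * (theta - theta_r) + dtheta_r           # virtual vector c
        c_dot = -self.kc * (dtheta - dtheta_r) + ddtheta_r    # its time derivative
        s = self.basis(np.array([theta, dtheta, c, c_dot]))   # Z = [theta, dtheta, c, c_dot]
        tau_ff = self.w_hat @ s                               # NN feedforward
        tau_fb = -self.kp * (theta - theta_r) - self.kd * (dtheta - dtheta_r)
        # Discrete-time weight adaptation with sigma modification, following (7).
        self.w_hat += -self.gamma * (s * (dtheta - c) + self.sigma * self.w_hat) * self.dt
        return tau_ff + tau_fb + tau_r
```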




Stability Analysis
In this section, we analyze the stability of our proposed method. In particular, we consider a Lyapunov function candidate $L$ as follows:

$$L = L_e + L_p + L_h, \quad L_e = \tfrac{1}{2}\big(M_r\dot{e}^2 + K_r e^2\big), \quad L_p = \tfrac{1}{2}NP^{+}\tilde{\theta}_h^2, \quad L_h = \tfrac{1}{2}K_o\tilde{\theta}^2, \qquad (8)$$

where $L_e$ is used to verify the stability of the tracking error, $L_p$ is used to verify the stability of the prediction error, and $L_h$ is used to verify the stability of the synchronization error between the human and robot. $\tilde{\theta}_h(t) = \hat{\theta}_{ho}(t) - \theta_h(t+pT)$ represents the prediction error, $e = \theta - \theta_d$ represents the angle tracking error, and $\tilde{\theta} = \theta - \theta_h$ represents the motion error between the human and robot. We define $Q = [a, b]$ and $X = [x_p, 1]^T$, so $\theta_h(t+pT) = ax_p + b$ can be rewritten as $\theta_h(t+pT) = QX$, and $\hat{\theta}_{ho}(t) = \hat{a}x_p + \hat{b}$ can be rewritten as $\hat{\theta}_{ho}(t) = \hat{Q}X$. According to the gradient descent method, $x_p$ will eventually converge to a positive constant when $p$ is given; therefore, $X$ is a positive definite matrix. Also, the parameter $\hat{Q}$ adaptation law can be expressed as

$$\dot{\hat{Q}} = -N\tilde{\theta}_h + P\tau_r, \qquad (9)$$

where $N, P \in \mathbb{R}^{1\times 2}$ are positive matrices, and $P^{+}$ denotes the generalized inverse matrix of $P$.

Remark 2
$Q$ can be estimated according to the LS method in (2) or the gradient descent method in (9). To facilitate the stability analysis, the gradient descent method is utilized in this section.

According to (5), the desired impedance model can be rewritten as $M_r\ddot{e} + D_r\dot{e} + K_r e = \tau_r$. Considering $\dot{Q} = 0$ and further taking the derivative of $L$ and substituting (9), we obtain

$$\begin{aligned}
\dot{L} &= \dot{L}_e + \dot{L}_p + \dot{L}_h = \dot{e}\big(M_r\ddot{e} + K_r e\big) + NP^{+}\tilde{\theta}_h\dot{\tilde{\theta}}_h + \tilde{\theta}K_o\dot{\tilde{\theta}} \\
&= \dot{e}\big(-D_r\dot{e} - \tau_r\big) + NP^{+}\tilde{\theta}_h\big(\dot{\hat{\theta}}_{ho} - \dot{\theta}_h\big) + \tau_r\big(\dot{\theta} - \dot{\theta}_h\big) \\
&= -D_r\dot{e}^2 - \dot{e}\tau_r + NP^{+}\tilde{\theta}_h\big(-N\tilde{\theta}_h + P\tau_r\big)X + \tau_r\big(\dot{\theta} - \dot{\hat{\theta}}_{ho} + \dot{\tilde{\theta}}_h\big) \\
&= -D_r\dot{e}^2 - \dot{e}\tau_r - NP^{+}NX\tilde{\theta}_h^2 + NX\tilde{\theta}_h\tau_r + \tau_r\big(\dot{\theta} - \dot{\theta}_d - \alpha\tau_r + \dot{\hat{\theta}}_{ho} - \dot{\theta}_h\big) \\
&= -D_r\dot{e}^2 - \dot{e}\tau_r - NP^{+}NX\tilde{\theta}_h^2 + NX\tilde{\theta}_h\tau_r + \tau_r\dot{e} - \alpha\tau_r^2 + \tau_r\big(-N\tilde{\theta}_h + P\tau_r\big)X \\
&= -D_r\dot{e}^2 - \dot{e}\tau_r - NP^{+}NX\tilde{\theta}_h^2 + NX\tilde{\theta}_h\tau_r + \tau_r\dot{e} + (PX - \alpha)\tau_r^2 - NX\tilde{\theta}_h\tau_r \\
&= -NP^{+}NX\tilde{\theta}_h^2 - D_r\dot{e}^2 - (\alpha - PX)\tau_r^2 \le 0.
\end{aligned} \qquad (10)$$

When we design $\alpha$ to make $\alpha - PX$ positive definite, (10) indicates that $\tilde{\theta}_h$, $e$, and $\tau_r$ converge to 0 when $t \to \infty$, which means that $\hat{\theta}_{ho}(t)$ can estimate $\theta_h(t+pT)$, $\theta$ can track $\theta_d$, and motion synchronization can be achieved.

Experiment

Experimental Setup
We considered an experimental scenario where a human and a Baxter robot cocarry a wooden stick and "pick" and "place" the wooden stick from a starting position to a target position (see Figure 4). Detailed information about the Baxter robot can be found in our previous work [6]. Joint "E1" of the Baxter robot's right arm is used to perform the cotransporting tasks. Angles and angular velocities can be obtained by the encoder in the joint. A Kinect 2 3D depth camera is employed for obtaining the elbow joint angle $\theta_h$. The Kinect 2 3D depth camera contains a color camera and a depth sensor, which provides capabilities in 3D motion capture.

Figure 4. The experimental setup. (a) An illustration of the experimental setup. (b) Human–robot cotransporting tasks.


In this experiment, we mount it on the hand of the Baxter robot. A Robotiq FT sensor is mounted on the robot wrist for measuring the external force, which is converted to $\tau_r$ for obtaining the external torque on joint "E1." Joint "E1" of the robot's arm from full flexion to extension spans 0–1.8 rad, and the human elbow joint value $\theta_h$ from flexion to extension spans 0.7–2.4 rad. A linear transformation is used to normalize the value range of the human elbow joint angle into the angle range of joint "E1" of the robot's arm. We also transform the human joint angle from the camera coordinates to the robot reference coordinates and transform the force sensory information from the force sensor coordinates to the robot reference coordinates, as described in our previous work [6]. Two computers are employed in the experiment. One computer is used to calculate the feedforward input $\tau_{ff}$ of the NN compensation in (7) by MATLAB Simulink. This computer is also employed to obtain visual and force sensory information from the Kinect 2 3D depth camera and the Robotiq FT sensor and to transfer compensation values and sensory information to the other computer by User Datagram Protocol communications. The other computer is used to receive sensory information from the Baxter robot and the first computer, and then it calculates the feedback control input $\tau_{fb}$ in (6) and

Figure 5. Elbow joint, predicted, and filtered angle.

generates the control input to control the robot by the Baxter Robot Operating System software development kit in Ubuntu 14.04 LTS.

Experimental Objectives
As mentioned in the preceding section, a human and a robot will cooperate to carry an object in this experiment. Considering human–robot cotransporting, we set the experimental objectives as follows: 1) Minimize the position error between the human's motion and the robot's motion during cotransporting, which can prove the effectiveness of the proposed prediction method for reducing delay. 2) Minimize the interaction force between the human and robot, which means that the human can exert less force to transport the object. In the experimental results, we will analyze the mean square error (MSE) and the interaction force between the human and robot and verify the effectiveness of the proposed method.

Experimental Results
In this section, we compare our hybrid method proposed in the section "Control Framework" with other state-of-the-art methods: the V method, the vision-prediction (VP) method, and the VF method. In the V method, the robot tracks the human motion $\theta_h$ under a visual servoing controller. After obtaining the elbow joint angle $\theta_h$ as in the section "Visual Sensing of Human Motion," we use a visual servoing torque control method for the robot to track $\theta_h$ for comparison. This visual servoing controller is $\tau_v = -K_{vP}(\theta - \theta_h) - K_{vD}(\dot{\theta} - \dot{\theta}_h)$, as in (6), where $K_{vP}$ and $K_{vD}$ denote proportional and differential gains. In the VP method, the robot tracks the predicted angle $\hat{\theta}_{ho}$ under the visual servoing controller. In the VF method, our proposed hybrid controller (6) is employed, but the predicted value $\hat{\theta}_{ho}$ is not involved, which means that we design the update law of the desired trajectory $\theta_d$ as $\dot{\theta}_d(t) = \dot{\theta}_h(t) - \alpha\tau_r$.

Composite Filter Evaluation
In this part, we evaluate the effectiveness of the composite filter proposed in the section "Signal Filtering." From Figure 5, we can see that, under our prediction method proposed in the section "Human Motion Prediction," the curve of the predicted angle (light blue) is similar to the curve of the actual one, but there exists noise in the predicted signals. After filtering by the filters proposed in the section "Signal Filtering," the filtered signals (deep blue) become smoother. The predicted angle shows a prediction of the actual elbow joint angle (red).

Comparison Between VP Method and V Method
We compare the VP and V methods to show the effectiveness of our proposed prediction method. As shown

in Figure 6, while one human subject performs a similar task (blue) in cooperation with the robot under the V and VP methods, respectively (similar trajectories of the elbow joint), we find that a large delay exists between the human and robot when only visual sensing is involved (dotted red line). In comparison, delays can be reduced by using our proposed prediction method (solid red line).

Figure 6. Comparison of performance between the V method and the VP method.

Comparison Between VF Method and Our Proposed Method
In addition, we compare the VF method and our proposed method. The experiment has been performed by three human subjects who are blind to the experiment setups. They perform the cotransporting task in cooperation with the robot under the VF method and our proposed method, respectively. The elbow joint angle of the human subjects and the robotic joint angle are recorded during the experimental processes. As shown in Figure 7, one human subject and the robot perform the cotransporting task under the VF method and our proposed method, and there exist large delays when prediction is not involved in the VF method. From Table 1, we can see that the average external torques under our proposed method are smaller than those under the VF method for all subjects, which indicates that all human subjects find it easier to accomplish these tasks under our proposed method. A detailed statistical analysis can be seen in Figure 8. In the figure, the box plot in dark blue is obtained by the VF method, and the box plot in light blue is obtained by our proposed method. We can see that the maximum forces and average forces are smaller under our proposed method.

Figure 7. Comparison under the (a) V method, (b) VF method, (c) VP method, and (d) our proposed method.

Figure 8. The compared performance between the VF method and our proposed method. IQR: interquartile range.

Table 1. Average values of the external torque $\tau_r$ under the VF method and our proposed method.

Human Subject     1       2       3
VF (Nm)           0.269   0.324   0.380
Proposed (Nm)     0.221   0.184   0.126

Comparison Between VP Method and Our Proposed Method
Finally, we compare the VP method and our proposed method. The experimental process is the same as that in the third experiment. The experiment has also been performed by three human subjects. The MSE, $\mathrm{MSE} = \frac{1}{n}\sum_{k=1}^{n}(\theta - \theta_h)^2$, is employed to evaluate the robustness of our proposed method. In our experiment, we set the number of sampling points to n = 500. From Table 2, we can see that the motion errors under our proposed method are smaller compared with those under the VP method. From Figure 9, one human subject and the robot perform the cotransporting task under the VP method and our proposed method, respectively. We find that the external torque is smaller under our proposed method, indicating that the human subject finds it easier to perform these tasks. From Figure 10, the position tracking under the four methods and the averaged MSE of the three subjects under the different methods are utilized to evaluate the performance. We can see that the motion errors under our proposed method are smaller.

Table 2. MSEs under the VP method and our proposed method.

Human Subject       1       2       3
VP (rad2)           1.836   1.626   2.029
Proposed (rad2)     0.902   1.318   0.906
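The two metrics reported above and in Tables 1 and 2 can be computed from the logged trajectories as in the following sketch (the array names are ours):

```python
import numpy as np

def mse(theta_robot, theta_human):
    """MSE = (1/n) * sum_k (theta - theta_h)^2 over the n sampled points."""
    theta_robot, theta_human = np.asarray(theta_robot), np.asarray(theta_human)
    return float(np.mean((theta_robot - theta_human) ** 2))

def average_external_torque(tau_r):
    """Average magnitude of the measured external torque tau_r (Nm)."""
    return float(np.mean(np.abs(tau_r)))
```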

Figure 9. External torques with the VP method and our proposed method.

Figure 10. MSEs of the V method, VF method, VP method, and our proposed method.

Conclusion
In this article, we proposed a hybrid visual–haptic framework for a human–robot cotransporting task. A motion prediction method was proposed to predict human motion for reducing delays to achieve motion synchronization. Visual sensing was utilized in capturing human motion in real time, and force sensing was employed to regulate interaction forces to improve compliance when a motion error between the human and robot existed. An adaptive NN-based control was developed to compensate for system uncertainties and achieve precise tracking. In the experiments, human subjects and a Baxter robot cotransported a wooden stick from a starting position to a target position. Comparative experimental results showed that our proposed framework was effective in this cotransporting task, and motion synchronization could be achieved, while the human subjects found it easier to accomplish the task.

Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant 62225304, Grant 62061160371, and Grant 62003032; in part by the Beijing Natural Science Foundation under Grant JQ20026; in part by the China Postdoctoral Science Foundation under Grant 2020TQ0031 and Grant 2021M690358; in part by the Beijing Top Discipline for Artificial Intelligent Science and Engineering, University of Science and Technology Beijing; in part by the Technological Innovation Foundation of Shunde Graduate School of University of Science and Technology Beijing under Grant BK20BE013; and in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2020B1515120071. The corresponding author is Wei He.

References [1] C. Zeng, C. Yang, H. Cheng, Y. Li, and S.-L. Dai, “Simultaneously encoding movement and sEMG-based stiffness for robotic skill learning,” IEEE Trans. Ind. Informat., vol. 17, no. 2, pp. 1244–1252, Feb. 2021, doi: 10.1109/TII.2020.2984482. [2] Y. Li and S. S. Ge, “Human–robot collaboration based on motion intention estimation,” IEEE/ASME Trans. Mechatronics, vol. 19, no. 3, pp. 1007–1014, Jun. 2014, doi: 10.1109/TMECH.2013.2264533. [3] M. S. Erden and T. Tomiyama, “Human-intent detection and physically interactive control of a robot without force sensors,” IEEE Trans. Robot., vol. 26, no. 2, pp. 370–382, Apr. 2010, doi: 10.1109/TRO.2010.2040202. [4] J. Huang, W. Huo, W. Xu, S. Mohammed, and Y. Amirat, “Control of upper-limb power-assist exoskeleton using a human-robot interface based on motion intention recognition,” IEEE Trans. Autom. Sci. Eng. (from July 2004), vol. 12, no. 4, pp. 1257–1270, Oct. 2015, doi: 10.1109/ TASE.2015.2466634. [5] J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2891–2900. [6] X. Yu, W. He, Q. Li, Y. Li, and B. Li, “Human-robot co-carrying using visual and force sensing,” IEEE Trans. Ind. Electron., vol. 68, no. 9, pp. 8657–8666, Sep. 2021, doi: 10.1109/TIE.2020.3016271. [7] Y. Dong, B. Ren, and Q.-C. Zhong, “Bounded universal droop control to enable the operation of power inverters under some abnormal conditions and maintain voltage and frequency within predetermined ranges,” IEEE Trans. Ind. Electron., vol. 69, no. 11, pp. 11,633–11643, Nov. 2022, doi: 10.1109/TIE.2021.3125660. [8] M. Yuan, Z. Chen, B. Yao, and X. Liu, “Fast and accurate motion tracking of a linear motor system under kinematic and dynamic constraints: An integrated planning and control approach,” IEEE Trans. Control Syst. Technol., vol. 29, no. 2, pp. 804–811, Mar. 2021, doi: 10.1109/ TCST.2019.2955658. [9] M. Yuan, Z. Chen, B. Yao, and J. Hu, “An improved online trajectory planner with stability-guaranteed critical test curve algorithm for generalized parametric constraints,” IEEE/ASME Trans. Mechatronics, vol.  23, no. 5, pp. 2459–2469, Oct. 2018, doi: 10.1109/TMECH.2018. 2862144. [10] Z. Liu, Z. Han, Z. Zhao, and W. He, “Modeling and adaptive control for a spatial flexible spacecraft with unknown actuator failures,” Sci. China Inf. Sci., vol. 64, no. 5, p. 152,208, Apr. 2021, doi: 10.1007/s11432 -020-3109-x. [11] J. Xie, S. Y. Liu, and J. X. Chen, “A framework for distributed semisupervised learning using single-layer feedforward networks,” Mach. Intell. Res., vol. 19, no. 1, pp. 63–74, 2022, doi: 10.1007/s11633-022-1315-6. [12] C. Yang, C. Zeng, P. Liang, Z. Li, R. Li, and C.-Y. Su, “Interface design of a physical human-robot interaction system for human impedance adaptive skill transfer,” IEEE Trans. Autom. Sci. Eng. (from July 2004), vol. 15, no. 1, pp. 329–340, Jan. 2018, doi: 10.1109/TASE. 2017.2743000. [13] Y. Dong, W. He, L. Kong, and X. Hua, “Impedance control for coordinated robots by state and output feedback,” IEEE Trans. Syst., Man, Cybern. Syst., vol. 51, no. 8, pp. 5056–5066, Aug. 2021, doi: 10.1109/TSMC.2019.2947453. [14] Z. Li, X. Li, Q. Li, H. Su, Z. Kan, and W. He, “Human-in-the-loop control of soft exosuits using impedance learning on different terrains,” IEEE Trans. Robot., early access, Apr. 22, 2022, doi: 10.1109/ TRO.2022.3160052.

[15] D. Huang, B. Li, Y. Li, and C. Yang, “Cooperative manipulation of deformable objects by single-leader-dual-follower teleoperation,” IEEE Trans. Ind. Electron., vol. 69, no. 12, pp. 13,162–13,170, Dec. 2022, doi: 10.1109/TIE.2021.3139228. [16] E. Berger, D. Vogt, N. Haji-Ghassemi, B. Jung, and H. B. Amor, “Inferring guidance information in cooperative human-robot tasks,” in Proc. 2013 13th IEEE-RAS Int. Conf. Humanoid Robots (Humanoids), pp. 124–129, doi: 10.1109/HUMANOIDS.2013.7029966. [17] D. Cehajic, S. Erhart, and S. Hirche, “Grasp pose estimation in human-robot manipulation tasks using wearable motion sensors,” in Proc. 2015 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 1031– 1036, doi: 10.1109/IROS.2015.7353497. [18] C. Yang, J. Luo, C. Liu, M. Li, and S.-L. Dai, “Haptics electromyography perception and learning enhanced intelligence for teleoperated robot,” IEEE Trans. Autom. Sci. Eng. (from July 2004), vol. 16, no. 4, pp. 1512–1521, Oct. 2019, doi: 10.1109/TASE.2018.2874454. [19] L. Rozo, S. Calinon, D. G. Caldwell, P. Jimenez, and C. Torras, “Learning physical collaborative robot behaviors from human demonstrations,” IEEE Trans. Robot., vol. 32, no. 3, pp. 513–527, Jun. 2016, doi: 10.1109/TRO.2016.2540623. [20] C. Zeng, C. Yang, and Z. Chen, “Bio-inspired robotic impedance adaptation for human-robot collaborative tasks,” Sci. China Inf. Sci., vol. 63, no. 7, pp. 1–10, 2020, doi: 10.1007/s11432-019-2748-x.

Xinbo Yu, School of Intelligence Science and Technology, Institute of Artificial Intelligence, and Beijing Advanced Innovation Center for Materials Genome Engineering, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected]. Sisi Liu, School of Intelligence Science and Technology and Institute of Artificial Intelligence, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected]. Wei He, School of Intelligence Science and Technology, Institute of Artificial Intelligence, and Beijing Advanced Innovation Center for Materials Genome Engineering, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected]. Yifan Wu, School of Intelligence Science and Technology and Institute of Artificial Intelligence, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected]. Hui Zhang, School of Robotics and the National Engineering Laboratory of Robot Visual Perception and Control Technology, Hunan University, Changsha 410082, China. E-mail: [email protected]. Yaonan Wang, School of Robotics and the National Engineering Laboratory of Robot Visual Perception and Control Technology, Hunan University, Changsha 410082, China. E-mail: [email protected].




Toward Holistic Scene Understanding

A Transfer of Human Scene Perception to Mobile Robots

By Florenz Graf, Jochen Lindermayr, Çağatay Odabaşi, and Marco F. Huber

The long-term vision for robotics is to have fully autonomous mobile robots that perceive the environment as humans do or even better. This article transfers the core ideas from human scene perception onto robot scene perception to contribute toward a holistic scene understanding of robots. The first contribution is to extensively survey and compare state-of-the-art robot scene perception approaches with neuroscience theories and studies of human perception. A step-by-step transfer of the perceptual process reveals similarities and differences between robots and humans. The second contribution represents an analysis of the status quo of holistic robot perception approaches to extract to what extent the perceptual capabilities of humans have been reached. Building on this, the gaps and potentials of robot perception are illustrated to address future research directions.

Digital Object Identifier 10.1109/MRA.2022.3210587
Date of current version: 13 October 2022


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Introduction

The last few years have seen fast technological improvements in the artificial intelligence of robots. The International Federation of Robotics recorded a market growth of 12% in 2020 for professional robots used in various applications, such as transport, inspection, cleaning, medical, or hospitality [1]. The report forecasts exponential growth for the upcoming years. Robots will become a part of human society. Acting side by side and collaborating within the same environments motivates making robot perception consistent with human perception.

In the past, robots mainly fulfilled highly customized tasks in industrial applications independently and separated from humans [2]. People adapted the environments to the application requirements. However, adaptations for the robots' needs are especially undesirable in nonindustrial environments. Current products on the market mirror this issue through specialized applications. These products either fulfill a single task autonomously, such as vacuum cleaners and lawnmower robots executing their routine independently from humans, or interact with humans physically with a low autonomy level [1]. Therefore, special attention must be directed to scene perception as an enabling technology to break this tradeoff between robot proximity to humans and autonomy. Already in 1993, the psychologist Ulric Neisser noted that "without perception there is no knowledge" [3]. The goal of holistic scene perception is to understand the scene by "considering the geometric and semantic context of its contents and the intrinsic relationships between them" [4]. Scene perception is making sense of real-world scenes as a whole by enabling the interaction with the scene [5]. On the one hand, it offers new opportunities for multiuse applications regarding the cost per function and physical assistance systems. On the other hand, it increases autonomous capabilities by overcoming challenges such as the occlusion of objects, instance-specific handling of dynamics, or spatial–temporal reasoning about complex situations. Holistic scene perception will enable safe and complex behaviors for any collaboration.

Increasing attention has been given to specific areas of perception. These approaches achieve excellent results in semantic extraction, such as object detection or image classification [6]. However, in the real world, it is not sufficient that robots solely understand single pieces of the environment. Rather, various pieces of scene information need to be understood concurrently and in context to reason about dynamics, uncertainties, and incompleteness for high-level control [7, Ch. 23]. Little is known about holistic scene perception, which aims to understand the scene in the integrated manner needed for a large spectrum of real-world applications.

Motivated by the performance of human perception, we stepped back from robotic approaches in this research and studied theories and experiments on human perception. Neuroscience research has been extensive since Potter [8] and Biederman [9] in the 1970s and has gained international breadth since the 2000s [10]. The latest research provides a comprehensive overview of human scene perception [5], [10]. This article transfers the core theories of psychological studies on human scene perception to robots and thereby reveals similarities and differences. The comparison of human perception with artificial perception is not new [11], [12], [13]. However, we focus on scene perception as a whole, using a top-down view of the robots' status quo.

Figure 1. A methodological approach with section references. III: The "Transferring Human Scene Perception to Mobile Robots" section. IV: The "The Status Quo of Holistic Scene Perception" section. V: The "The Gap to Human Scene Perception" section. VI: The "The Potentials of Robots" section. VII: The "Future Directions" section. (The figure depicts the flow from the initial problem of identifying real-world challenges in robotic scene perception, through parallel literature surveys on human and robot scene perception and the step-by-step transfer, to the identification of the status quo, gaps, and potentials, and finally to future directions.)




Methodology

Figure 1 describes the methodology of this study. First, we analyzed current mobile robot applications, where we identified a trend toward robots in everyday life covering multiple use cases. For future applications, scene perception will play a key role as it provides comprehensive scene knowledge to the cognitive intelligence of robots. Therefore, we analyzed state-of-the-art robot scene perception approaches and theories of human scene perception to elaborate on a transfer. Research in both domains offers a comprehensive overview providing fundamental knowledge as the starting point. The "Transferring Human Scene Perception to Mobile Robots" section summarizes the transfer with a step-by-step analysis of the similarities and differences between human and robot scene perception. Based on this, we addressed the status quo of holistic scene perception approaches of robots within a qualitative evaluation (see the "The Status Quo of Holistic Scene Perception" section) to extract how far robotic scene perception has reached the performance of human perception and to reveal gaps (see the "The Gap to Human Scene Perception" section) and potentials (see the "The Potentials of Robots" section). Finally, we propose future directions for robotic scene perception (see the "Future Directions" section).

Transferring Human Scene Perception to Mobile Robots

Humans perceive the environment by recognizing scene information through different senses, such as sight, hearing, smell, touch, or taste. Herewith, researchers discovered that people primarily rely on their visual perception for perceiving the environment due to the richness of information [14]. Although the process of human perception is unknown in detail, most theories define perception as the process of "recognizing (being aware of), organizing (gathering and storing), and interpreting (binding to knowledge) sensory information" [15, Ch. 3]. We transfer these three process steps to robot scene perception to reveal similarities and differences to the latest robotic research (Figure 2).

The Recognition of Information

The recognition step processes sensory information to make it understandable for the subsequent perception tasks. Using the sensory input, humans convert the observations from the scene into understandable information. Whereas vision benefits from a high amount of information, other modalities, such as haptics, are robust to noise and independent of light exposure and provide properties that are insufficiently recognized by other senses [16]. For example, the eye serves as a transducer that converts light into optic nerve signals via five layers of cells (particularly photoreceptors) within the retina [17]. The nerve cells transmit information by electric signals to the brain for processing [18]. The processing itself divides into preattentive and postattentive recognition. In preattentive processing, activities related to low-level vision are usually associated with the extraction of certain physical properties of the visible environment, such as depth, shape, color, object boundaries, or surface material properties [10], [19], [20]. Humans are not capable of processing all sensory information at once. Therefore, postattentive processing extracts scene information based on attention areas to overcome this issue [15]. It compresses the input into high-level features, such as object classification or redetection [21], [22]. Herewith, humans receive information from hierarchical levels of abstraction [23], [24] corresponding to a region of interest, triggered by the attention that provides the semantics of the sensory input.

Similar to humans, the recognition of robots can be divided into pre- and postattentive recognition. Equivalent to the human senses, robots use various sensors to recognize information from the environment. The most common sensor modalities are visual sensors, such as cameras, lidar, ultrasonic, and radar. They are capable of providing the color and/or range information of the surrounding. The recognition of visual features is the most popular modality for robots due to the information richness.

Figure 2. The process of human perception transferred to mobile robots (recognition of information, knowledge representation, and knowledge interpretation; realized by neural processing, a knowledge base, and association in humans, and by scene recognition, a database, and scene analysis in robots).


However, few approaches investigate other modalities, such as acoustic perception to recognize objects [25], map the environment [26], [27], or avoid obstacles [28], and haptic perception to actively touch objects [16]. The linking of action to perceiving the information of interest is known as interactive perception, which is a common human exploration strategy. For instance, if humans cannot recognize an object by vision and touch, they will take different interactions to obtain information from other sensory channels [25]. The usage of smelling and tasting for robots is not popular. One rare example presents a navigation approach using smell for odor source localization [29]. A different niche develops a tasting sense using IR spectroscopic technologies or chemical sensors to discriminate food [30], [31].

Driven by computer vision (CV), 2D image processing is the fundamental pillar for robot recognition tasks. However, 3D methods for scene and object reconstruction have gained importance for robots. Various surveys provide an overview of state-of-the-art robot recognition tasks [32], [33]. Established robotics approaches focus on specific recognition subareas. The preattentive recognition of primitives comprises methods for normal estimation [34], [35], segmentation [36], [37], edge detection [38], and simple shape (such as planes and cylinders) detection [39]. The methods usually make use of the entire sensory input. However, robots prefilter or scale the data to achieve appropriate results given their capabilities. For instance, image-based feature recognition can process high-resolution images by making use of image pyramids [40]. Herewith, the extracted features of a lower image resolution deliver discrete regions of interest while keeping the accuracy of the original scale.
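As an illustration of this coarse-to-fine strategy, the following minimal sketch (assuming OpenCV 4.x is available; the edge cue, thresholds, and pyramid depth are our own illustrative choices, not taken from [40]) extracts cheap candidate regions on a downsampled pyramid level and maps them back to full-resolution coordinates:

```python
import cv2

def coarse_regions_of_interest(image, levels=2, thresh=40):
    """Find cheap candidate regions on a coarse pyramid level and map them
    back to full-resolution pixel coordinates (OpenCV 4.x API)."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(cv2.pyrDown(pyramid[-1]))        # halve resolution per level

    coarse = cv2.cvtColor(pyramid[-1], cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(coarse, thresh, 3 * thresh)        # cheap preattentive cue
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)

    scale = 2 ** levels                                   # pyrDown halves each level
    rois = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        rois.append((x * scale, y * scale, w * scale, h * scale))
    return rois
```

Only the returned boxes would then be handed to the more expensive postattentive stages at the original resolution, which is the point of the pyramid-based attention step.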

Postattentive processing comprises higher-level recognition capabilities based on preattentive processing. Examples are extracting semantics by the (re)detection of objects [41], [42], humans [43], [44], and places [45]; 3D reconstruction of the scene by Simultaneous Localization and Mapping (SLAM) [32], [46], [47]; or enriching metric information with semantics, known as semantic segmentation [33], [48], [49]. As with humans, robots use preprocessed attention areas within postattentive processing. For instance, feature-based SLAM methods use sparse feature points to postattentively register a set of sensor data and to detect loop closures to compensate for the mapping drift [46], [47]. Besides, this example shows how humans and robots use different sensory sources to achieve highly accurate localization within the scene. Humans use walked steps combined with eye information for perceiving ego-motion, which improves localization. For illustration, imagine walking a distance with closed eyes. The motion drift sooner or later leads to a loss of localization. Therefore, a series of interconnected spatial–visual features provides the absolute localization within the known scene [50]. Equivalently, the latest robotic SLAM approaches provide similar techniques. Wheel odometry, visual odometry, or an inertial measurement unit (IMU) provides the ego-motion. Laser scanners or camera-based methods simultaneously estimate the pose of landmarks to compensate for the ego-motion drift. Also, some approaches use visual landmarks for place redetection to close the trajectory loop while mapping [46]. The usage of high-level features enables robots to add semantics to metric information. Knowing the semantics of an area in space helps robots to interpret the scene. It allows, e.g., the exclusion of dynamic objects like people while mapping.

A special challenge of robot perception is the recognition of scene information from different locations and times. For instance, a robot recognizes an object in the scene and simultaneously tracks the object in its field of view (FOV) as long as it is visible. While previous research proposed using a Kalman filter [51] or particle filter [52], the latest research utilizes deep learning-based tracking, such as with a convolutional neural network (CNN) [53], [54] or a Vision Transformer (ViT) [55]. ViTs, coming from natural language processing (NLP), split images into fixed-size patches; they have gained popularity due to their superior performance on continuous data streams, as needed for mobile robots [56]. Since the object and the robot could move, occlusions, truncation, or invisibility due to sensor noise (as mentioned previously) must be handled. When the same object appears again, a reidentification (Re-ID) that reallocates the ID is beneficial to better understand the scene. Modern Re-ID approaches have been proposed that, similar to tracking, use a CNN [57], [58], a ViT [59], [60], or end-to-end approaches [61].
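For the classic filtering variant mentioned above, the following self-contained sketch shows a generic constant-velocity Kalman filter in the image plane (an illustrative textbook formulation, not the specific trackers of [51] or [52]); predicting without an update is what bridges short occlusions or missed detections:

```python
import numpy as np

# State x = [u, v, du, dv] in pixels and pixels/frame; measurement z = [u, v].
dt = 1.0 / 30.0                                   # assumed frame period
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)          # constant-velocity transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)          # only the position is observed
Q = np.eye(4) * 1e-2                               # process noise (hand-tuned)
R = np.eye(2) * 4.0                                # measurement noise (pixels^2)

x = np.zeros(4)                                    # initial state
P = np.eye(4) * 100.0                              # initial uncertainty

def step(z=None):
    """Predict one frame ahead; update only if a detection z = [u, v] arrived."""
    global x, P
    x = F @ x                                      # predict
    P = F @ P @ F.T + Q
    if z is not None:                              # update with the detection
        y = np.asarray(z, dtype=float) - H @ x     # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
        x = x + K @ y
        P = (np.eye(4) - K @ H) @ P
    return x[:2]                                   # filtered position estimate
```

Deep learning-based trackers essentially replace the hand-tuned motion and appearance assumptions of such filters with learned ones, which is where their advantage on continuous streams comes from.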




Knowledge Representation

The second step of the perception process represents environmental information. A knowledge base manages recognized information from different sources and levels of abstraction, times, and places in a centralized and ordered structure. This structure includes understandable information about the scene. The function of knowledge representation within the human brain has been a controversial topic since the so-called gestalt theory. Modern attempts such as Wagemans and Kimchi [24] or Hommel et al. [62] reveal spatial layouts, organized hierarchically, that represent the human perception memory. A relationship-focused multilevel hierarchical structure of parts represents environmental information.

The transfer of the main functionalities of human knowledge representation to robots requires a complex memory structure focused on flexibility. The knowledge representation must be capable of merging observations and interpretations from different sources and times. For instance, recognized information, such as shape, texture, posture, state, probabilities, and trajectory (compare the "The Recognition of Information" section), must be managed in real time within the knowledge base. This sets high requirements for the underlying knowledge base, as every piece of environmental information needs a known, structured representation. Furthermore, the knowledge base includes initial and postprocessed knowledge. Robots usually store and represent the scene knowledge in a database [63], [64], [65], allowing them to deploy a human-understandable ontology that conceptualizes multiple entities within a domain and their relationships [66]. The scene representation comprises various perception-related information, such as extracted spatial features for navigation, manipulation, and semantically enhanced maps [67].

The requirements for robotic knowledge representation are high. Ideally, it must be real-time capable; generic; scalable; flexible in structure; able to update and extend during the robot's lifetime; and easy to connect for access and data sharing. Generally, there are two categories of databases: graph based and document based. Both seem suitable for this task as they provide comprehensive features to cover these requirements [68], [69]. Graph-based databases, also known as relational database management systems, represent knowledge through relations. Herewith, it is necessary to explicitly define relations through a common format to link semantics through an ontology [70]. In contrast to relational databases, document-based databases represent data in JavaScript Object Notation-like documents without the need for relations or predefined structures. Nevertheless, these databases provide features for querying or indexing the data to model dependencies and relations implicitly and dynamically. For instance, Kunze et al. [63] propose spatial and temporal indexing for query relations within their document-based database. Besides, linking the knowledge base to decision making [71] and providing the represented knowledge to robot actions, such as manipulation or moving, enables reactive behaviors.

The comparison of scene knowledge representation indicates that robots have advantages compared to humans. First, the artificial scene knowledge representation has no memory loss due to an almost unlimited storage capacity. Second, robots can easily share and make use of foreign perception, while humans have to transfer knowledge into an appropriate modality, such as verbal communication. Additionally, humans are limited in the range of information exchange without technical assistance. In contrast, robots can share data in their original format over networks. The sharing of perceptional information enables robots to directly exchange data with the infrastructure or other robots. Another difference between human and robot knowledge representation is the capability of robots to start with a preinstalled environment perception model. Robots can use the prior information of a building information model [72] or a partially or fully premapped environment [64]. Using prior scene knowledge reduces the setup time, especially when using multiple agents.
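To make the document-based option discussed above concrete, the sketch below shows one observation as a schema-free document together with a naive spatio-temporal query. The field names and the query function are illustrative assumptions on our side; a real document store such as the one used in [63] would answer such queries through its spatial and temporal indexes rather than a Python loop:

```python
from datetime import datetime

# One observation as a schema-free document, roughly as it could be stored
# in a document database; the field names are illustrative, not from [63].
observation = {
    "type": "object",
    "label": "cup",
    "confidence": 0.87,
    "pose": {"frame": "map", "x": 3.2, "y": 1.4, "z": 0.9},
    "stamp": datetime(2022, 5, 10, 9, 15, 0),
    "source": "rgbd_camera_front",
}

scene_log = [observation]   # stands in for an indexed collection

def query(log, label, center, radius, since):
    """Naive spatio-temporal query: documents with a label seen near a point
    within a given time window."""
    cx, cy = center
    hits = []
    for doc in log:
        dx = doc["pose"]["x"] - cx
        dy = doc["pose"]["y"] - cy
        if (doc["label"] == label
                and (dx * dx + dy * dy) ** 0.5 <= radius
                and doc["stamp"] >= since):
            hits.append(doc)
    return hits

recent_cups = query(scene_log, "cup", center=(3.0, 1.5), radius=1.0,
                    since=datetime(2022, 5, 10, 9, 0, 0))
```

The same record could equally be expressed as nodes and edges in a graph database; the document form simply postpones the decision about which relations to make explicit.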




Knowledge Interpretation

Based on the available scene knowledge, this perception step interprets existing knowledge to make sense of it using cognitive capabilities. It has been proven that humans interpret the scene; however, it is unclear how this is executed within the brain. The research by Isik et al. [73] found that the human brain starts recognizing view-invariant observations, such as human actions, quickly, in around 200 ms. This suggests that the brain uses the form, as well as the motions, to represent states. Furthermore, previous work [74] proposes that the human brain benefits from causal relations, such as temporal continuity and spatial relations, among objects, humans, and their actions. The prerequisite is that a known structure represents the knowledge (see the "Knowledge Representation" section). A high-level scene analysis encodes spatial and temporal relations between instances [75].

For most examples, the interpretation of scene knowledge is a trivial task for humans due to lifelong learning. The interpretation of perceived information has been trained with preknowledge, dependent on the culture [76, Ch. 14], context, and situation itself [77]. Therefore, all people have a unique perception system based on and enriched by their environment. Thus, human interpretation is neither predictable nor always the same. The famous duck–rabbit illusion [78] indicates that even the season can alter the interpretation of the scene. Therefore, perception is influenced by environmental factors [77].

For robots, the interpretation of the scene is still an unsolved problem [79]. The challenge is to generate and make use of high-level semantic knowledge to reason about the present scene. There is no commonsense model that could be applied to every environment without adaptation of the interpretations. The association of scene knowledge across multiple dimensions, such as time and space, with or without relational dependencies between single pieces of information, leads to new knowledge that improves scene understanding. For instance, a spatial–temporal scene analysis could reveal daily habits, such as when, how, and how often we go to the kitchen to fetch a coffee (a toy version of such an analysis is sketched below). This example shows that the kind of required high-level information is environment specific as well as use case specific. The goal is to understand what is involved and what to do when and with which objects.

The use of high-level semantics within the scene is fundamental for complex robot behavior tasks. Herewith, we identify two types of interpretations. On the one hand, there is research aimed at reconstructing and interpreting the structured part of the scene, such as room segmentation, junction detection [32], or occlusion reasoning for simple shapes [80]. On the other hand, there is research focusing on the unstructured part that goes deeper into handling dynamics. The approach presented in [81] is one of the rare examples in which already collected scene information is used to gather new information. Observations of objects are anchored in the scene model, which provides basic tracking functionality. In combination with knowledge about the whole scene, including other objects and their spatial and semantic relations, this is used for reasoning about the state of occluded objects, which improves tracking and hence the whole scene state.
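The following toy sketch illustrates the kind of spatial–temporal analysis meant above: counting at which hours a person was observed in the kitchen from timestamped place detections. The log format is our own assumption and is not taken from any of the cited frameworks:

```python
from collections import Counter
from datetime import datetime

# Timestamped place detections, e.g., produced by person and room recognition.
log = [
    ("kitchen", datetime(2022, 5, 9, 7, 55)),
    ("kitchen", datetime(2022, 5, 9, 13, 10)),
    ("living_room", datetime(2022, 5, 9, 20, 5)),
    ("kitchen", datetime(2022, 5, 10, 8, 2)),
]

# Aggregate over time: at which hours is the kitchen typically visited?
kitchen_hours = Counter(stamp.hour for place, stamp in log if place == "kitchen")
print(kitchen_hours.most_common(3))   # e.g., a morning peak hints at a coffee habit
```

The interesting step for a robot is not the counting itself but deciding which of such aggregates are worth computing for a given environment and use case.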

The tracking of instances over multiple observations enables further interpretations, such as detecting their actions [82]. The actions of people, together with environmental semantics, are valuable input for a robot since robots and people usually share an environment. So, robots need to understand the actions and fulfill their tasks proactively. For instance, if a person is cooking in a room, it is very likely that this room is a kitchen in which the robot can act accordingly [83]. Previous work [84] focuses on learning human actions by observing humans in their daily life. They merge joint motions and locations concerning the landmark points on the map. Kostavelis et al. [85] propose using object recognition and skeleton-based action recognition to make their social robot understand human actions. They deploy the robot in real home environments to test their system. The previous work [86] employs a long short-term memory [87] network on the robot to greet the user. The background may be misleading for algorithms that use appearance features. That is why generating action proposals may improve performance, and this is essential for mobile robots because the background in the robot view is dynamic [88].

In contrast to the spatial–temporal interpretations of a single instance, there are a few approaches that interpret relations within multiple instances. For example, Philipp et al. [89] propose to use Bayesian networks to estimate on which object the user's attention focuses. Such attention-sensitive functionalities as high-level scene interpretations are essential for improving human–robot interaction. More complex interpretations concern affordance estimation [90]. Herewith, robots know what can be done with objects based on past observations, probabilities, and personal data (such as emotions, preferences, and relations). For instance, knowing that a pot can be used for cooking offers novel capabilities for robots. Affordances link perception to the cognitive capabilities that are fundamental for interpreting scenes.
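As a toy stand-in for the probabilistic reasoning mentioned for attention estimation, the sketch below applies plain Bayes' rule over a handful of objects given a noisy gaze bearing. It is not the Bayesian-network model of [89]; the likelihood shape, object bearings, and priors are illustrative assumptions:

```python
import math

def gaze_likelihood(gaze_angle, object_angle, sigma=0.2):
    """Likelihood of the measured gaze bearing if this object were attended."""
    d = gaze_angle - object_angle
    return math.exp(-0.5 * (d / sigma) ** 2)

objects = {"cup": 0.10, "laptop": 0.55, "door": 1.40}   # bearings in radians
prior = {"cup": 0.4, "laptop": 0.4, "door": 0.2}        # e.g., from past interactions

gaze = 0.50                                              # measured gaze bearing
posterior = {k: gaze_likelihood(gaze, a) * prior[k] for k, a in objects.items()}
total = sum(posterior.values())
posterior = {k: v / total for k, v in posterior.items()}
print(max(posterior, key=posterior.get))                 # -> "laptop"
```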

Implications

Recent research on robot perception displays similarities to fundamental theories on human perception. Figure 3 visualizes the transferred scene perception process. Humans and robots recognize information from the scene by making sensor data understandable. The human brain and the robot storage, respectively, represent the knowledge in multiple layers in a known structure. Compared to humans, robots are capable of using externally perceived data within their original format, whereas humans have to exchange perceived data through verbal, visual, or written communication. The interpretation of knowledge over multiple dimensions, such as time, space, and relations, enables a high-level scene understanding that improves the cognitive intelligence of robots. Herewith, humans highly benefit from lifelong learning, whereas robots can benefit from shared and initial data.

In recent years, robotic recognition and interpretation have used deep learning, such as CNNs and generative adversarial networks [91], based on artificial NNs (ANNs), to solve very specific tasks [92], as described previously. Inspired by biological NNs [93], ANNs consist of multiple layers of connected artificial neurons [94]. The weights assigned to these neurons rely on an initial ANN training using a high amount of labeled data. This in turn enables the ANN to compute labels of unknown data via inference. In particular, deep learning-based approaches, such as ViTs, have gained popularity in robotics due to their superior performance on continuous data streams. ViTs, coming from NLP, split images into fixed-size patches [56]. Machine learning, such as deep learning, achieves outstanding results for many recognition tasks [95], [96]. However, the division of preattentive and postattentive feature processing can neither be strictly adhered to nor easily extended. The inseparability of ANNs challenges their flexibility and reusability due to the "black box" characteristic. Additionally, they require enhanced tensor computation power that needs special attention when pushing mobile robots to the real world [92].
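The patch-splitting step that makes ViTs applicable to images is simple enough to sketch directly. This is a minimal version for illustration only; real ViTs additionally project each patch with a learned linear layer and add position embeddings:

```python
import numpy as np

def to_patches(image, patch=16):
    """Split an H x W x C image into non-overlapping patch tokens, the first
    step of a ViT-style pipeline. Assumes H and W are multiples of `patch`."""
    h, w, c = image.shape
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                   # group the two patch axes
    return x.reshape(-1, patch * patch * c)          # (num_patches, patch_dim)

tokens = to_patches(np.zeros((224, 224, 3)))         # -> shape (196, 768)
```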

Figure 3. A summary of the three steps of the transferred scene perception: recognition of information by making sensor data understandable; representing knowledge in multiple layers; interpreting knowledge by space, time, and relations; and external data exchange (robots only).

The Status Quo of Holistic Scene Perception

The transfer reveals similarities and differences in the perception process between humans and robots. But how much does the status quo of robotic scene perception cover holistic capabilities? To answer this question, we first deal with the challenges of fundamental technologies to extract possible boundaries for human-like perception. Afterward, we analyze the scope of the most holistic robotic approaches for everyday environments to answer how much they already cover the holistic scope.

From Narrow Toward Holistic Scene Perception

Research on robot scene perception focuses on sensor-close processing by affordable sensors.


The availability of cheap red-green-blue cameras and 2D laser scanners sets the entry barrier in robotic developments, e.g., for student or hobby projects, relatively low. Investigations mainly focus on the deep learning-based detection of common objects in images or on mapping the environment using SLAM. 2D methods in object detection benefit from matured research in CV. They reached high accuracy due to challenges, such as the Pascal Visual Object Classes Challenge [97] and the Large Scale Visual Recognition Challenge [98], backed by public datasets of labeled data, such as the Common Objects in Context (COCO) dataset [99]. For the application on robots, these 2D image-based approaches must be transferred to time-continuous 3D processing merged with SLAM techniques providing the ego-motion of the robot to reconstruct the environment. Only comparatively new research focuses on 3D multiframe scene recognition, addressing the high computational demand with mobile graphics accelerators [49], [100], [101]. However, both the representation and the high-level interpretation of scene knowledge are not in focus. Indeed, as described by Neisser [3], solely the knowledge that was recognized can be represented or interpreted. The focus on sensor-close recognition, in combination with visual and mobile robotic challenges, such as changing visual appearance, as well as the limited computation resources, explains this observation. These aspects also explain why research on the representation and interpretation of scene knowledge is comparatively rare.

As a result, robotic scene perception in real-world applications is focused on a concrete use case utilizing highly optimized bottom-up perception pipelines with narrow functionality. For instance, industrial environments have adapted to the perception capabilities of robots. Herewith, robots are enabled to fulfill narrow perception tasks with high accuracy. In particular, deep learning-based recognition systems are often used as monolithic black box systems, being hard to combine efficiently without probing deeply into a technical level. Research articles mostly bury these challenges by describing specific techniques [21]. However, robots in everyday environments need to fulfill multiple recognition tasks simultaneously as a requirement for complex and extensive behaviors. Open source frameworks such as the Robot Operating System (ROS) [102] provide, thanks to their large community, many tutorials as well as open source basic functionalities that are suitable for creating a powerful overall performance based on single software pieces. Standardized communication between various software components enables robots to use, fuse, and analyze multiple data streams.
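As a minimal sketch of such standardized communication (ROS 1 with rospy; the topic names, message type, and trivial fusion rule are illustrative assumptions, not part of [102]), one node can subscribe to two perception streams and republish a combined result:

```python
#!/usr/bin/env python
import rospy
from std_msgs.msg import String

latest = {"detections": None, "map_pose": None}

def on_detections(msg):
    latest["detections"] = msg.data
    publish_fused()

def on_map_pose(msg):
    latest["map_pose"] = msg.data
    publish_fused()

def publish_fused():
    # Republish only once both streams have delivered at least one message.
    if None not in latest.values():
        fused.publish(String(data="{} | {}".format(latest["map_pose"],
                                                   latest["detections"])))

rospy.init_node("scene_fusion_sketch")
fused = rospy.Publisher("/scene/fused", String, queue_size=10)
rospy.Subscriber("/perception/detections", String, on_detections)
rospy.Subscriber("/localization/map_pose", String, on_map_pose)
rospy.spin()
```

In practice, the fusion logic is of course more elaborate, but the message-based decoupling shown here is what lets independently developed perception modules be combined at all.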




However, on the one hand, parallelizing multiple narrow perception pipelines is challenging regarding performance. On the other hand, splitting pipelines into multiple steps, to reuse, e.g., preattentive information, is not easy when using monolithic black box models. Nevertheless, a few researchers have worked on integrated top-down approaches that aim to perceive the scene as a whole by combining multiple methods.

The Holistic Scope of the Latest Integrated Approaches

There are a few approaches in research that integrate multiple scene perception tasks into a single framework. In contrast to the narrow perception techniques, they achieve a more holistic understanding of the scene. These approaches simultaneously recognize environmental information over a long time. Storing these data in a known structure offers new potential for a more complex spatial and temporal interpretation of the scene. We looked deeper into these approaches to extract how much they cover a holistic scene understanding. In the following, a comparison of the perceptual capabilities should answer this question. As criteria, we selected research approaches that cover more than a tabletop scene, reconstruct in real time, and represent scene data by multiple types of instances, such as semantics. However, each approach sets a different focus, starting at sensor-close processing, such as parallelizing, fusing, and handling dynamics, and going to ontology-based reasoning. Table 1 shows the perceptual properties of these approaches, divided into the three steps of the perception transfer presented in the "Transferring Human Scene Perception to Mobile Robots" section. If we could not find the details for a criterion, we marked it either as not available (N/A) or not specified (NS).

The approaches of Table 1 use a 3D camera as sensory input providing the color and depth information of its FOV. Additionally, some approaches use an IMU to support visual odometry for a more precise estimate of the ego-motion, e.g., needed for drones and wheeled robots. Based on the sensory input stream, the presented approaches combine several recognition techniques to reconstruct a virtual scene model. However, they cover the recognition differently. KnowRob [65], a knowledge representation and reasoning framework, solely offers an interface for individual visual recognition systems. Wyatt et al. [67] limit the reconstruction to a metric map without recognizing semantics by vision sensors. Their scene recognition is trained from visual properties by a human tutor using supervised learning. Similarly, the SOMA framework restricts the reconstruction to a metric map but enhances objects by a CNN for color image-based object detection. SOMA aims at understanding changes in everyday environments by perceiving geometries and semantics. Alternatively, the approach of Suchan et al. [103] enhances the metric map by detecting walls, which are used by a clustering algorithm to generate a floor plan. The other approaches investigate further into a fully 3D metric–semantic reconstruction of the scene.

The SLAM++ project of Salas-Moreno et al. [104] is an early approach from 2013 that concentrates on semantic mapping. It consists of an object-based SLAM that uses object recognition trained on a database of scanned object models. It is capable of detecting changes in the environment, such as moving objects. Fusion++ [100] also sets its focus on semantic mapping. It is similar to SLAM++ but runs a mask region-based CNN (R-CNN) object segmentation to initialize a truncated signed distance field (TSDF) reconstruction for each object. Rosinol et al. [105] recently published Kimera, a multilayer spatial scene perception framework aiming to close the gap between human and robot scene perception. Kimera uses a metric–semantic SLAM to perform a full mesh reconstruction by TSDF volumes with semantics on top of the localization. It recognizes building structures as well as objects from a CAD model match. In addition, human detection and pose estimation extend the dynamic scene information [106].
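The per-voxel fusion behind such TSDF volumes reduces to a running weighted average. The sketch below is a schematic version of that update, in the spirit of the per-object volumes used by Fusion++ and the volumes used by Kimera, not their actual implementations; the truncation distance and weighting are illustrative choices:

```python
import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, w_obs=1.0, trunc=0.04):
    """Fold one depth observation into a voxel grid by a weighted average.
    `sdf_obs` holds the signed distance of each voxel to the observed
    surface (meters); values are truncated and normalized to [-1, 1]."""
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)
    new_weight = weight + w_obs
    tsdf = (weight * tsdf + w_obs * d) / new_weight   # running weighted mean
    return tsdf, new_weight

grid = np.zeros((64, 64, 64))                          # one small object volume
w = np.zeros_like(grid)
grid, w = tsdf_update(grid, w, sdf_obs=np.full(grid.shape, 0.02))
```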

These recognition techniques feed their scene information into the knowledge base, where it is represented and connected in different ways. Fusion++ and SLAM++, which concentrate on semantic mapping, do not provide details of their knowledge representation. In contrast, the approach of Wyatt et al. [67] focuses on the representation of knowledge gaps and uncertainties. A layered structure of proxies, unions, and beliefs represents the spatial scene inside a relational database. They validate by experiments in a lab that a human tutor is capable of helping a robot fill a knowledge gap through verbal conversation. The robot asks for missing visual features, such as color and shape, to verify an object belief and thereby close a knowledge gap. Kimera, which extensively covers the recognition, also concentrates on spatial knowledge representation by multiple hierarchical layers, separated by semantics. The spatial layers comprise the metric–semantic mesh, objects, structures, rooms, and buildings.
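A toy data structure in the spirit of such a layered scene graph might look as follows; the class names, layer labels, and fields are our own illustration and not Kimera's API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    layer: str                                  # e.g., "object", "room", "building"
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (parent_id, child_id, relation)

    def add(self, node, parent_id=None, relation="contains"):
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.edges.append((parent_id, node.node_id, relation))

g = SceneGraph()
g.add(Node("building_1", "building"))
g.add(Node("kitchen", "room", {"area_m2": 12.0}), parent_id="building_1")
g.add(Node("cup_3", "object", {"pose": (3.2, 1.4, 0.9)}), parent_id="kitchen")
```

The value of the hierarchy is that updates at one layer (an object moves) can be propagated to, or reasoned about at, the layers above it (its room, its building).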

Table 1. An overview of integrative robotic scene perception approaches, organized by the three transferred perception steps (N/A: not available; NS: not specified).

Wyatt et al. [67]. Recognition: 3D camera; metric reconstruction; static instances: objects from user input; dynamic instances: N/A. Knowledge representation: relational database; multilayer spatial representation by proxies, unions, and beliefs; scene representation: point map. Knowledge interpretation: spatial–temporal: place classification; reasoning: belief verification by a human.

SLAM++ (Salas-Moreno et al. [104]). Recognition: 3D camera; metric–semantic reconstruction; static instances: N/A; dynamic instances: objects by scan model match. Knowledge representation: database NS; single-layer spatial object graph; scene representation: metric–semantic mesh. Knowledge interpretation: spatial–temporal: N/A; reasoning: N/A.

Suchan et al. [103]. Recognition: 3D camera; metric–semantic reconstruction; static instances: walls by planes; dynamic instances: human pose detection. Knowledge representation: database NS; spatial–temporal representation of entities by primitives; scene representation: metric map and semantics. Knowledge interpretation: spatial–temporal: human activities, spatial relations, and patterns; reasoning: human-centered common sense.

Fusion++ (McCormac et al. [100]). Recognition: 3D camera; metric–semantic reconstruction; static instances: objects by R-CNN; dynamic instances: N/A. Knowledge representation: database NS; single-layer spatial object graph; scene representation: metric–semantic mesh. Knowledge interpretation: spatial–temporal: N/A; reasoning: N/A.

SOMA (Kunze et al. [63]). Recognition: 3D camera, 2D laser scanner; metric reconstruction; static instances: N/A; dynamic instances: objects and people by CNN. Knowledge representation: document-based database; observation, semantic, and interpretation layers; scene representation: point map, objects by pose and bounding box. Knowledge interpretation: spatial–temporal: human activities; reasoning: N/A.

KnowRob (Beetz et al. [65], Beßler et al. [90]). Recognition: sensory input NS; reconstruction NS; static instances NS; dynamic instances NS. Knowledge representation: relational database; ontology graph, multilevel metric–semantic, logic, and episodic memories; scene representation: mesh and poses of objects. Knowledge interpretation: spatial–temporal: episodic memory for reasoning; reasoning: hypotheses verification, inner world, and motion control.

Kimera (Rosinol et al. [105], [106]). Recognition: 3D camera, IMU; metric–semantic reconstruction; static instances: building structures and objects by CAD model match; dynamic instances: people by pose detection. Knowledge representation: database NS; hierarchical graph connecting spatial layers; scene representation: metric–semantic mesh. Knowledge interpretation: spatial–temporal: place and room classification; reasoning: N/A.


Dynamic scene graphs simultaneously update scene information by linking the spatial scene information within the layers. The knowledge representation of SOMA comprises three layers inside a document-based database: the observation layer, the semantic layer, and the interpretation layer. The scene information is organized similarly to Kimera, with a spatial layout of hierarchical structures connected by graphs. Suchan et al. [103] propose using ontologies and formal characterization to represent knowledge by space and motion. The knowledge representation of KnowRob is the most powerful due to the usage of description logic for representing and connecting knowledge within multiple levels by the Web Ontology Language (OWL). The framework links an inner world and virtual, logical, and episodic memories [65].

Based on the represented knowledge, a few approaches look deeper into the interpretation. For instance, the second pillar of KnowRob is knowledge-based reasoning to learn general knowledge. An object class-related affordance extraction reasons about what to do with objects [90]. Suchan et al. [103] show by temporal human detection and object recognition how to use reasoning to enhance the scene understanding. The use of logical spatial relations between instances (e.g., an object is located to the left of another object) and the detection of human activities represent valuable information about the scene.
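Deriving such symbolic spatial relations from metric scene knowledge can be as simple as thresholding coordinate differences, as in the sketch below. The thresholds and the frame convention (x to the right in the map frame) are illustrative assumptions, not the formalization used in [103]:

```python
def spatial_relations(a, b, eps=0.05):
    """a, b: (x, y, z) object centroids in a shared map frame.
    Returns symbolic relations of a with respect to b for a reasoner."""
    rel = []
    if a[0] < b[0] - eps:
        rel.append("left_of")
    elif a[0] > b[0] + eps:
        rel.append("right_of")
    if a[2] > b[2] + eps and abs(a[0] - b[0]) < 0.3 and abs(a[1] - b[1]) < 0.3:
        rel.append("above")
    return rel

print(spatial_relations((3.2, 1.4, 0.9), (3.6, 1.4, 0.9)))   # -> ['left_of']
```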




The presented approaches indicate that they already perform more than one perception process step on an advanced level. Due to the different focus of every approach, none sufficiently covers holistic scene perception. The degree of coverage is difficult to quantify due to missing metrics in research. However, the overview (see Table 1) offers a starting point to identify gaps and potentials compared to human perception.

The Gap to Human Scene Perception

How well does robot scene perception mimic human-level performance nowadays? Previous research compared the perceptual performance of robots with children's age, such as Szeliski [107], who claimed that computer vision reached the level of a two-year-old child. This comparison might not be beneficial since robot perception is not quantifiable in human terms, as there is a specific rather than a broad perceptual skill development of robots. Following this hypothesis, this article details major gaps in robot perception in everyday scenes.

The Nonusage of Sensory Modalities

The first identified gap in robotic scene perception is the missing usage of multiple sensor modalities as the input source. Humans use different senses, providing the opportunity to fuse and rely on the optimal sensor, since each sensor modality is affected differently, such as vision by illumination, color, occlusion, and the posture of objects [25] (see the "The Recognition of Information" section). The hardware design of robots allows them an optimal sensory input due to flexible sensor choice, amount, and alignment. On the one hand, this underlines the statement of Premebida et al. [33] that robotic recognition tasks, such as vision-based object classification, could deliver higher performance than humans. However, on the other hand, all the robotic approaches of the "The Holistic Scope of the Latest Integrated Approaches" section limit the scene perception to visual sensors, except for ego-motion sensors, as a single modality. Although vision sensors provide the highest information content of the scene, using a single modality is insufficient for the situations mentioned previously (the localization of acoustics, haptic feedback of obstacles, and the darkroom problem). Therefore, current integrated scene perception approaches are not able to deploy different senses in the way that humans do.

Different Perceptual Learning

A fundamental difference between robots and humans is the acquisition of perceptual understanding. Perceptual learning was first defined by Gibson [108] as "any relatively permanent and consistent change in the perception of a stimulus array, following practice or experience with this array." Thus, humans learn an individual perception without initial knowledge that is constantly adjusted and influenced by society and the environment. This learning enables humans to achieve an optimal perception within their surroundings since, in particular, high-level scene interpretations depend on the subjective impression that is difficult to generalize (see the duck–rabbit illusion [78]).

In contrast to humans, popular perceptual techniques of robots use rigid learning strategies, which do not offer adaptation. It would require retraining or parameter adjustment of the initially deployed models. Moreover, these approaches provide neither the flexibility for extension nor adaptation of the perception capabilities over time. All presented integrated robotic scene perception approaches (see Table 1) build on prelearned skills except for the method of Wyatt et al. [67], which allows modifications of the perception after deployment (see the "The Holistic Scope of the Latest Integrated Approaches" section). Especially in the future, when robots will be deployed for long periods of time, the initial perception will become obsolete unless the perception is adjusted continuously to the environment. Therefore, a gap concerns the learning of perceptual capabilities during runtime. Although research on robot learning, such as domain adaptation [109] and continuous learning (lifelong learning, perceptual learning, and never-ending learning) [110], reaches back almost 30 years [111], it is still rarely used in practical applications. A common strategy is to learn from demonstration [112]. For instance, a human could teach the robot to adjust a generic perception to changes in the environment. The human can approve or decline estimates of the perception system via a user interface, such as classifying new objects or teaching novel actions.
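One lightweight way to realize such runtime adjustment is sketched below: a nearest-class-mean classifier over feature embeddings whose prototypes are updated whenever a tutor confirms or corrects a label. This is an illustrative assumption on our side, not the mechanism used in [67] or [112]:

```python
import numpy as np

prototypes = {}      # label -> (mean_embedding, sample_count)

def confirm(label, embedding):
    """Fold one tutor-approved sample into the running class prototype."""
    emb = np.asarray(embedding, dtype=float)
    if label not in prototypes:
        prototypes[label] = (emb, 1)
    else:
        mean, n = prototypes[label]
        prototypes[label] = ((mean * n + emb) / (n + 1), n + 1)

def classify(embedding):
    """Assign the label of the nearest class prototype."""
    emb = np.asarray(embedding, dtype=float)
    return min(prototypes, key=lambda k: np.linalg.norm(prototypes[k][0] - emb))

confirm("mug", [0.9, 0.1])
confirm("plate", [0.1, 0.8])
print(classify([0.8, 0.2]))        # -> "mug"
```

The appeal of such incremental schemes is that they extend the label set and adapt to the deployment environment without retraining the underlying feature extractor.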

A possible solution for supervised learning by a human tutor has been presented by Wyatt et al. [67]. However, enabling robots to adapt to the environment requires new techniques. On the one hand, as proposed by Wyatt et al. [67], such a technique could close knowledge gaps. On the other hand, learning environment-adapted perception skills once with a tutor does not remove the need to adjust the perception within the deployed environment.

The Lack of Commonsense Perception

Our third identified gap is the lack of commonsense perception caused by specific recognition capabilities. Fed by a high amount of training data, an artificial system becomes an expert for narrow recognition tasks. The previous sections revealed several applications that reach high performance for specialized tasks. Although the recognition performance of a specific perception task can outperform humans, the fundamental issue is the lack of flexibility due to the required retraining or the missing capability to extend the recognition. Therefore, even the latest approaches achieve just a low overall recognition performance (see the "The Holistic Scope of the Latest Integrated Approaches" section).

Humans have individual perception skills depending on various influences, such as profession, culture, and age, since humans learn from practice and experience [108]. However, in a society, there are perception skills simplifying a commonsense understanding. This enables humans to execute trivial tasks, such as knowing how to open a door or how to use an elevator. Nowadays, trivial commonsense perception skills, such as detecting a door and using its handle to move through the door or detecting the elevator buttons, are not default functionalities of mobile robots. Thus, robot recognition is not making sense of the whole scene. Similarly, the interpretation of scene data indicates specific capabilities. There are a few advanced approaches for interpreting a complex scene, such as KnowRob [65] (see the "Knowledge Interpretation" section). However, they provide narrow interpretations independent of the available scene information, and robots cannot instantly interpret an unseen scene. Therefore, the knowledge interpretation may not retrieve sufficient scene knowledge.

The Potentials of Robots

This section highlights the potentials of robot scene perception going beyond the perceptual capabilities of humans.

Flexible Sensor Design

The first presented gap in robot scene perception (using few or solely a single sensor modality) can be overcome by the usage of a flexible sensor design. Human perception is limited to the recognized information of the senses and their range and accuracy; e.g., fog or darkness negatively influences scene recognition. In contrast, robots benefit from scene-adjusted sensor modalities and their flexible configuration. They can overcome the limitations of the human senses, such as through ultrasonic or radar recognition techniques [113]. Robots can be equipped with sensor modalities, such as radar, that enable them to freely adapt to their environment. The flexibility of the robot design allows adaptation to the application. Thus, robots can use sensors with the desired amount, properties, and alignments. For instance, multiple visual sensors could enable robots to recognize the scene in 360°. Herewith, blind spots can be avoided, which is especially important for safe usage.

Initial Perception Capabilities

In contrast to the highlighted gap of missing perceptual learning, robots can be deployed with prelearned perception capabilities and with initial scene knowledge. This trivial fact allows reducing the setup time of the robot to a minimum. For instance, mirroring the perception skills of an existing robot in an environment enables a new robot to perceive the scene equivalently. In contrast to robots, humans would have to learn from scratch.

Cooperative Perception

Speaking the "same language" in terms of understanding and exchanging information enables the sharing of perceptual data independent of the robot, assistive system, or infrastructure. Therefore, artificial systems have fewer restrictions than humans, as the number of collaborators and the communication range and bandwidth are not hard restricted. The sharing of scene information in real time with multiple artificial agents enables the fusion of scene observations from various perspectives. The data exchange enhances the scene knowledge of every agent, which offers a comprehensive scene perception from its scene overview. A popular research area presents cooperative approaches that share navigation data, e.g., to distribute the mapping [64] or for the reactive path planning of multiple robots [114]. Moreover, possible decentralized computation, e.g., cloud computing, can save the resources of mobile robots [115]. Therefore, cooperative perception is a key technique for reducing the setup time in large environments and for providing the safe navigation and better economy of larger robot fleets through intercommunication [114], [116].
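A minimal sketch of such cooperative fusion is shown below: two robots exchange object detections expressed in a shared map frame, and duplicate observations of the same physical object are merged by distance. The message format, labels, and distance threshold are illustrative assumptions:

```python
import math

robot_a = [{"label": "chair", "x": 2.0, "y": 5.1},
           {"label": "table", "x": 4.0, "y": 1.0}]
robot_b = [{"label": "chair", "x": 2.1, "y": 5.0},
           {"label": "plant", "x": 7.5, "y": 3.2}]

def merge(*object_lists, same_object_dist=0.3):
    """Combine detections from several agents, deduplicating observations
    of the same physical object by label and map-frame distance."""
    merged = []
    for obj in (o for lst in object_lists for o in lst):
        for m in merged:
            if (m["label"] == obj["label"] and
                    math.hypot(m["x"] - obj["x"], m["y"] - obj["y"]) < same_object_dist):
                break                      # duplicate of an already merged object
        else:
            merged.append(dict(obj))
    return merged

print(len(merge(robot_a, robot_b)))        # -> 3 (the chair is fused, not duplicated)
```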




Future Directions

We propose two comparatively nonpopular future directions that contribute to deploying robots within our society. First, we propose developing frameworks to combine multiple perception modules. The scope of robotic scene perception is low since fundamental commonsense skills are missing. Commonsense skills can be generated with scene perception frameworks that combine several specific methods to achieve a holistic understanding. Second, although existing research on robot perception achieves good performance on specific tasks, the flexibility for adjusting the perception during the robot's usage is missing. It must be possible to extend initially deployed perception skills and to adjust the perception at any time to the scene. Therefore, we address in particular our second and third identified gaps to future research, which underlines the importance of perceptual learning being fundamental in everyday life. It is a central technique toward and beyond human-like perceptual performance.

Conclusion

This article transfers human scene perception to mobile robots for comparison since the performance of human perception is superior in many tasks. Current research in robotics presents specific perception skills evaluated in a predefined and constricted manner, whereby they already achieve promising results for everyday applications. However, for lifelong unsupervised and autonomous usage, multifunctional mobile robots in particular need to perceive the scene holistically to reliably handle unknown and changing situations as well as uncertainties and dynamics. Human perception has been extensively studied since the 1970s, offering mature neuroscience studies and theories that define human perception as the process of recognizing, representing, and interpreting scene information. This article uses this threefold division for a transfer from human to robot scene perception to identify similarities and differences in the process. The transfer revealed that the robotic approaches partly mirror human-like perception. However, much research investigates specific methods, such as object classification, that outperform human perception. This prerequisite toward and beyond human-like perceptual performance is promising. A new research area integrates multiple state-of-the-art methods in frameworks that aim toward a holistic scene understanding. However, these frameworks lack trivial commonsense perception skills and therefore cannot substantiate a holistic scene understanding. Moreover, only nonpopular research contributes to the perceptual learning of robots, which is needed to learn, adjust, and customize their perception. Therefore, these two major gaps need to be addressed by future research.




Acknowledgment This research has received funding from the German Ministry of Economic Affairs, Labor and Tourism Baden-Württemberg by the AI Innovation Center Learning Systems and Cognitive Robotics under Grant Agreement 017–180036 (841022). References [1] C. Müller, B. Graf, and K. Pfeiffer, “World robotics 2021 – Service robots,” International Federation of Robotics, Frankfurt, Germany, Tech. Rep., 2021. [Online]. Available: https://ifr.org/ifr-press-releases/news/service-robots-hit -double-digit-growth-worldwide [2] A. Gasparetto and L. Scalera, “A brief history of industrial robotics in the 20th century,” Adv. Historical Stud., vol. 8, no. 1, pp. 24–35, 2019, doi: 10.4236/ ahs.2019.81002. [3] U. Neisser, “Without perception, there is no knowledge: Implications for artificial intelligence,” in Natural and Artificial Minds, R. G. Burton, Ed. Albany, NY, USA: SUNY Press, 1993, pp. 174–164. [4] M. Naseer, S. Khan, and F. Porikli, “Indoor scene understanding in 2.5/3d for autonomous agents: A survey,” IEEE Access, vol. 7, pp. 1859–1887, 2019, doi: 10.1109/ACCESS.2018.2886133. [5] G. L. Malcolm, I. I. A. Groen, and C. I. Baker, “Making sense of real-world scenes,” Trends Cogn. Sci., vol. 20, no. 11, pp. 843–856, 2016, doi: 10.1016/j. tics.2016.09.003. [6] S. Garg et al., “Semantics for robotic mapping, perception and interaction: A survey,” Found. Trends® Robot., vol. 8, nos. 1–2, pp. 1–224, 2020, doi: 10.1561/2300000059. [7] H. Levesque and G. Lakemeyer, “Cognitive robotics,” in Handbook of Knowledge Representation (Foundations of Artificial Intelligence), vol. 3, F. v. Harmelen, V. Lifschitz, and B. Porter, Eds. New York, NY, USA: Elsevier, 2008, pp. 869–886. [8] M. Potter, “Meaning in visual search,” Science, vol. 187, no. 4180, pp. 965– 966, 1975, doi: 10.1126/science.1145183. [9] I. Biederman, “Perceiving real-world scenes,” Science, vol. 177, no. 4043, pp. 77–80, 1972, doi: 10.1126/science.177.4043.77. [10] R. A. Epstein and C. I. Baker, “Scene perception in the human brain,” Annu. Rev. Vis. Sci., vol. 5, no. 1, pp. 373–397, 2019, doi: 10.1146/annurevvision-091718-014809. [11] J. Beck, B. Hope, and A. Rosenfeld, Human and Machine Vision, 1st ed. New York, NY, USA: Springer-Verlag, 1983. [12] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001, doi: 10.1023/A:1011139631724. [13] C. M. Funke, J. Borowski, K. Stosio, W. Brendel, T. S. A. Wallis, and M. Bethge, “Five points to check when comparing visual perception in humans and machines,” J. Vis., vol. 21, no. 3, p. 16, 2021, doi: 10.1167/ jov.21.3.16. [14] L. San Roque et al., “Vision verbs dominate in conversation across cultures, but the ranking of non-visual verbs varies,” Cogn. Linguistics, vol. 26, no. 1, pp. 31–60, 2015, doi: 10.1515/cog-2014-0089. [15] M. Ward, G. Grinstein, and D. Keim, Interactive Data Visualization: Foundations, Techniques, and Applications. Natick, MA, USA: A. K. Peters, 2015. [16] L. Seminara, P. Gastaldo, S. Watt, K. Valyear, F. Zuher, and F. Mastrogiovanni, “Active haptic perception in robots: A review,” Frontiers Neurorobot., vol. 13, p. 53, Jul. 2019, doi: 10.3389/fnbot.2019.00053.


Florenz Graf, Department of Robot and Assistive Systems, Fraunhofer IPA, Stuttgart 70569, Germany. E-mail: [email protected].
Jochen Lindermayr, Department of Robot and Assistive Systems, Fraunhofer IPA, Stuttgart 70569, Germany. E-mail: [email protected].
Çağatay Odabaşi, Department of Robot and Assistive Systems, Fraunhofer IPA, Stuttgart 70569, Germany. E-mail: [email protected].
Marco F. Huber, Center for Cyber Cognitive Intelligence, Fraunhofer IPA, and Institute of Industrial Manufacturing and Management IFF, University of Stuttgart, Stuttgart 70569, Germany. E-mail: [email protected].





Biomimetic Electric Sense-Based Localization

A Solution for Small Underwater Robots in a Large-Scale Environment

By Junzheng Zheng, Jingxian Wang, Xin Guo, Chayutpon Huntrakul, Chen Wang, and Guangming Xie

Digital Object Identifier 10.1109/MRA.2022.3202432 Date of current version: 22 September 2022


This article presents a novel localization scheme for free-swimming small underwater robots in a large-scale environment. Accurate localization technology has always been a challenge for small underwater robots since the underwater lighting conditions can limit their vision, while the multipath interference issue occurring near a wall plagues the sonars. By contrast, some kinds of fish perceive their positions by sensing the electric field in their environment, giving them the ability to accurately localize in underwater environments. Inspired by the electric sense behavior in fish, this article proposes a large-scale localization scheme based on electric sense for small underwater robots. Our scheme includes an electric sense-based hardware solution and localization methods. Specifically, we first design a hardware solution, including electric emitters placed in an underwater environment and an electric receiver that can be furnished on a small underwater robot. Then, we propose distributed emitter architectures for large-scale localization. Finally, we propose three localization methods to estimate the position and orientation of the robot. We have conducted four types of localization experiments for a small underwater robot, demonstrating the robustness and effectiveness of our proposed electric sense-based scheme. Our study provides a novel solution to the localization of free-swimming underwater robots with a limited payload and also helps provide insights into large-scale underwater localization.

Underwater Biomimetic Perception
As one of the most popular underwater perception methods, sonar is inspired by underwater creatures, such as dolphins. Although sonars can take measurements across long distances (~2 km), they are sensitive to reflections and reverberations occurring near a wall or in turbid waters [1]. Another important underwater perception method is vision, which is informative. However, vision is more sensitive to lighting conditions and can fail in dark or turbid waters. To overcome these shortcomings, researchers and engineers keep trying to develop new technologies, especially from the perspective of bionics. In this process, the lateral line system, which has been found in many aquatic vertebrates, has attracted attention. As an important sensory system, the lateral line system is composed of mechanoreceptors and/or electroreceptors [2]. The mechanoreceptors are direction-sensitive neuromasts containing hair cells, and they are more sensitive to flow [3]. The electroreceptors are sensitive to electrical fields and have been discovered in some fish species that we call electric fish. Further research has revealed that electric sense is closely related to various behaviors of electric fish, such as prey, defense, perception, and communication [4]. Both kinds of receptors are suitable for turbid and dark water, and both are suitable for carrying on small underwater robots. Table 1 summarizes the advantages and disadvantages of the biomimetic perception technologies we have discussed.

Biomimetic Electric Sense in Underwater Robots
Among the various kinds of electric fish, electric eels (Electrophorus electricus) are well known for their powerful discharge capacity of hundreds of volts. They use these powerful electric currents for catching prey or in defense. Actually, weakly electric fish (gymnotiforms and mormyriforms) are more common in nature. They generate only about 10 V for short-distance electrolocalization and electrocommunication [4]. For some underwater creatures that have electroreceptors but cannot emit an electric field, their electric sense is used for navigation, enabling these creatures to follow the electric field generated by external electric sources (this is also called electrotaxis) [6].

Inspired by the weakly electric fish in nature, various underwater robots with electric sense have been designed for object estimation, localization, navigation, and so on. In 2008, Solberg et al. realized electric sense-based localization and size estimation of a sphere using a particle filter (PF) [7]. In 2015 and 2017, Bai et al. and Lanneau et al. independently realized the estimation of the pose and the shape of an ellipsoid [8], [9]. In 2013, Lebastard et al. fused the kinematic model and electric measurements to estimate the robot's pose and then used the estimated pose for navigation close to a wall [10]. In 2014, Dimble et al. modeled the electric field in a straight water tank based on the multiple mirror method and then proposed a state (lateral position and orientation) estimation method for realizing navigation in the tank [11]. In 2015, Boyer et al. designed an electric sense-based control law for their robot "Slender probes" to move toward an external electric emitter. They then activated multiple external emitters in series, creating a dynamic electric field to further realize navigation of a single robot [12]. Recently, we proposed an electric sense-based localization method for a small underwater robot, where a single electric emitter based on dc is adopted. Due to the large dc noise in the environment and the use of the single-emitter architecture, the effective sensing distance is only twice the body length of the signal emitter [13].

However, we observe that most existing electric sense-based studies are realized by dragging robots on guide rails, and there are relatively few studies based on free-swimming robots [13], [14], [15]. Also, most of the studies are focused on the localization of an external target based on a self-generated electric field [7], [8], [9], rather than the localization of the robot itself based on an external electric field. In this article, inspired by the previous observations, we aim to design a biomimetic electric sense-based localization scheme for free-swimming underwater robots in a large-scale environment (Figure 1). To this end, we extend our previous work [13] by 1) designing an ac signal-based scheme instead of the dc one, 2) constructing the distributed electric-emitter architectures for large-scale localization, 3) exploring and analyzing the contributing factors of localization in different scenarios, and 4) verifying our localization scheme in more complex scenarios.

Table 1. Underwater biomimetic perception technologies.

Methods               Bionic Objects         Pros            Cons
Sonar [1]             Dolphin                Long distance   Big size, turbid, near wall
Vision [5]            Many fish              Informative     Short distance, turbid, dark
Mechanoreceptors [3]  Goldfish, boxfish      Turbid, dark    Short distance, flow sensitive
Electroreceptors [6]  Weakly electric fish   Turbid, dark    Short distance


Specifically, we propose an electric sense-based localization scheme, including a hardware solution and model-based perception methods. First, the hardware solution is composed of an ac-based electric receiver implemented on a robot and ac-based electric emitters fixed on the seabed (Figure 2). The robot uses the measurements of its receiver to estimate its relative position and orientation to the emitters on the seabed. Second, to enable localization in a large-scale environment, we construct a distributed electric-emitter architecture (Figure 3). Because an ac signal is used for localization, multiple emitters can work together without interfering with each other using frequency-division multiplexing. Then, to explore the contributing factors of localization, we design three localization methods by selectively fusing the dynamic model, inertial measurement unit (IMU), and electric sense-based measurements (Figure 4). Finally, to fully verify our proposed localization scheme, we conduct four types of localization experiments in different scenarios, including the localization of a stationary robot or a robot under towing, the position tracking problem for an autonomous swimming robot, and the kidnapped robot problem, which is much more complicated.

Figure 1. A concept map depicting our biomimetic electric sense-based localization framework applied to different kinds of underwater robots in the sea. The emitters are fixed on the seabed, while the receivers are equipped on the robots.


Figure 2. The hardware solution to the underwater electric sense, including the electric emitter and the electric receiver. DDS: direct digital synthesizer; FFT: fast Fourier transform.


This article may provide insights into the localization of small underwater robots in a large-scale environment and help explore the contributing factors of this new localization technology. Our main contributions are summarized as follows.
1) We construct an electric sense-based localization scheme, including the ac-based receiver, ac-based emitters, and distributed emitter architectures. The localization scheme enables a small underwater robot to locate itself in a large-scale environment.
2) We propose three localization methods to explore the contributing factors of biomimetic electric sense-based perception. The best one is successfully applied to a more complex localization issue (the kidnapped robot problem).
3) As far as the authors are aware, few studies have considered the electric sense for free-swimming robots or the localization of the robot itself by the external electric field.

Hardware Solution to Biomimetic Electric Sense
To realize the underwater electric sense, we first design a hardware solution, including an electric emitter and an electric receiver (Figure 2). The emitter located in the underwater environment generates signals, while the receiver equipped on a robot continuously measures the electric field. According to the receiver's measurements, the robot can obtain its relative position and orientation to the emitter. When there is more than one emitter, the receiver detects the frequencies of the signals to tell which emitter(s) the signals come from. Then, based on the single emitter, we construct electric-emitter architectures suitable for a large-scale environment and demonstrate three typical architectures (Figure 3).

Figure 3. Three typical distributed electric-emitter architectures. Columns from left to right denote each configuration's simple form, configuration, and the schematic diagram of the distribution of electric field intensity. In the schematic diagrams, the yellow (respectively, blue) area represents the high (respectively, low) electric field intensity.

Electric Emitter
The electric emitter located in the underwater environment consists of two copper electrodes and a circuit. To generate signals at a certain frequency, we adopt a direct digital synthesizer (DDS) in the circuit for the emitter. The DDS is a low-power, programmable waveform generator capable of producing sine, triangular, and square wave outputs. The output frequency and phase are software programmable, allowing easy tuning. In our circuit, we use three DDSs (the chip model is AD9833 from Analog Devices), each of which can generate a sinusoidal wave with a set frequency. Therefore, an emitter is able to generate single-, dual-, or three-frequency signals when one, two, or three DDSs are activated, respectively. In Figure 2, an example of the single-frequency case is shown, where two emitters generate single-frequency signals at the frequencies of f1 and f2, respectively.

Electric Receiver
The electric receiver on a robot is composed of four copper electrodes and a receiving circuit. To eliminate the influence of the electric potential reference on the measurements, we take the electric potential differences between each pair of electrodes as the measurements. Since the strength of the original signal measured by the receiver attenuates with the third power of the distance between the receiver and the emitter [16], it needs to be amplified and filtered. To this end, first, the original signal is processed through a bandpass filter with a maximum gain of 100, which includes a high-pass second-order Chebyshev filter, a low-pass fourth-order Butterworth filter, a low-pass second-order Chebyshev filter, and a high-pass second-order Chebyshev filter. Figure 5 shows the amplitude–frequency response characteristics of the bandpass filter. It has a stable gain in the frequency range of 2–7 kHz and suppresses the indoor 50/60-Hz ac signal very well. Table 2 summarizes the main parameters of the high-pass and low-pass filters. Second, considering that the signal amplitude may still not be enough to meet the needs of engineering applications, a dynamic gain with an amplification range of 1–64 times is further added based on the sampling feedback. In this way, the total amplification of the original signal is 100–6,400 times. Finally, the fast Fourier transform (FFT) technique is applied to the amplified signal to extract its amplitude and phase, which are then sent to the CPU (i.e., a Raspberry Pi) of the circuit to be further calculated to obtain the perception information.
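To make this last processing step concrete, the following is a minimal numeric sketch of FFT-based amplitude extraction for one receiver channel; the sampling rate, block length, signal amplitudes, and emitter frequencies are illustrative assumptions and not the values used on the actual circuit.

```python
# A minimal sketch (not the authors' firmware) of the receiver-side processing
# described above: sample one differential channel, take an FFT, and read the
# amplitude at each known emitter frequency.
import numpy as np

def channel_amplitudes(samples, fs, freqs):
    """Amplitude of each target frequency (Hz) in a real-valued sample block."""
    n = len(samples)
    window = np.hanning(n)
    spectrum = np.fft.rfft(samples * window)
    # Scale so a unit-amplitude sine reports roughly 1.0
    # (2/N for the one-sided FFT, divided by the mean window gain).
    scale = 2.0 / (n * np.mean(window))
    bins = np.fft.rfftfreq(n, d=1.0 / fs)
    return {f: scale * np.abs(spectrum[np.argmin(np.abs(bins - f))]) for f in freqs}

# Synthetic test: two emitters at f1 = 3 kHz and f2 = 4 kHz plus broadband noise
# on the order of the 2-10 uV level reported for the ac design.
fs, n = 50_000, 4096
t = np.arange(n) / fs
x = 8e-4 * np.sin(2 * np.pi * 3000 * t) + 3e-4 * np.sin(2 * np.pi * 4000 * t)
x += 1e-5 * np.random.randn(n)
print(channel_amplitudes(x, fs, [3000.0, 4000.0]))
```

Because every emitter frequency maps to its own FFT bin, the same block of samples yields one amplitude per emitter without any time multiplexing, which is what allows several emitters to operate simultaneously.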

AC-Based Electric Sense
Now we introduce how to realize the underwater electric sense using our designed hardware solution. As we have mentioned, the electric emitter including two electrodes is placed in the underwater environment, while the electric receiver, which consists of four electrodes, is equipped on the robot. Relative to one emitter, we build a coordinate system in which we are able to describe the pose [including the position (x, y) and the orientation ψ] of the robot. When the robot moves, the four electrodes of the receiver fixed on the robot continuously measure the electric field generated by the emitter(s). Then, according to the theoretical model of the electric field (see the section "Model of the Electric Field") and the known physical parameters of the hardware, one can get the position information of the electrode from its measurement. Since the receiver has four electrodes, we get three independent measurements, V_i, i = 1, 2, 3, that further determine the three variables (x, y, ψ), i.e., the pose of the robot.

We also show the advantages of our ac-based solution by calculating its signal-to-noise ratio (SNR) and comparing it to that of the dc-based solution [13]. From the theoretical model of the electric field, we know that, under the same conditions, the strength of the signal measured by the receiver is identical under both ac- and dc-based solutions. Then we just need to compare the noise of the signals. To this end, the standard deviation of the receiver's signal V_i when the robot is stationary is calculated as the signal's noise indoors in our laboratory. Then we have that the noise of the signal under the ac-based solution is about 2–10 µV, while the noise of the signal under the dc-based solution is about 1–3 mV [13]. That is, in terms of the SNR, the ac-based solution improves the dc-based solution by two to three orders of magnitude, thereby significantly improving the quality of the signal and further improving the detectable range of the emitter.

Figure 4. Three localization methods. (a) Electric sense + IMU + dynamic model, (b) electric sense + IMU, and (c) IMU + dynamic model.

Table 2. The main parameters of the filters.

Filter                              Gain    Passband  Stopband
High-pass second-order Chebyshev    10 dB   2 kHz     –40 dB@50 Hz
Low-pass fourth-order Butterworth   30 dB   10 kHz    –30 dB@40 kHz
Low-pass second-order Chebyshev     0 dB    7 kHz     –15 dB@40 kHz
High-pass second-order Chebyshev    0 dB    2 kHz     –40 dB@50 Hz

Figure 5. Amplitude–frequency response characteristics of the bandpass filter. The resistance's tolerance is 1%, and the capacitance's tolerance is 5%. The blue line is the result of the standard parameters, the red line is the upper limit, and the yellow line is the lower limit.
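For readers who want to prototype the front end in software, the sketch below builds a comparable digital cascade with scipy.signal from the corner frequencies in Table 2; the passband ripple values and the sampling rate are assumptions of this sketch, the 10-dB and 30-dB stage gains are not reproduced, and the result only approximates the authors' analog circuit.

```python
# A rough digital analogue of the four-stage bandpass chain in Table 2.
import numpy as np
from scipy import signal

fs = 50_000  # Hz, assumed sampling rate

stages = [
    signal.cheby1(2, 1, 2_000, btype='highpass', fs=fs, output='sos'),
    signal.butter(4, 10_000, btype='lowpass', fs=fs, output='sos'),
    signal.cheby1(2, 1, 7_000, btype='lowpass', fs=fs, output='sos'),
    signal.cheby1(2, 1, 2_000, btype='highpass', fs=fs, output='sos'),
]
bandpass = np.vstack(stages)  # cascade all second-order sections

def filter_channel(x):
    """Apply the cascaded bandpass to one receiver channel."""
    return signal.sosfilt(bandpass, x)

# Quick check: a 50-Hz mains component should be strongly attenuated,
# while a 3-kHz emitter tone should pass with little loss.
t = np.arange(0, 0.1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.01 * np.sin(2 * np.pi * 3000 * t)
y = filter_channel(x)
```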

Electric-Emitter Architectures
Although our ac-based solution has successfully improved the detectable range compared with that of the dc-based solution, the detectable range of a single emitter is always limited. To enable the small underwater robots to locate in a large-scale environment, we further construct distributed electric-emitter architectures and show three typical architectures in Figure 3. To construct such architectures, first of all, we need to decide on the configurations, including the positions and orientations of the emitters. For ease of expression, we assume that each of the three typical architectures shown in Figure 3 is distributed in a rectangular form that has M rows and N columns, i.e., MN emitters in total. Specifically, in Architecture 1, every two emitters occupy the same position in an orthogonal form, while in Architectures 2 and 3, the emitters are uniformly distributed so that the distances between each pair of adjacent emitters are equal. Besides, each pair of the adjacent emitters is orthogonal in Architecture 2, while all of the emitters are parallel in Architecture 3. It is worth emphasizing that the rectangular form of configurations we choose is just for ease of expression and comparison. In real applications, one can construct various configurations and even space them in a sparse configuration (see our discussions in the section "Real-World Applications").

Having a configuration of the emitters, it is more important to assign different frequencies to different emitters appropriately. In this way, by detecting the frequency of the signals it receives, the robot can tell which emitters the signals come from, and then it superimposes the known pose information of the corresponding emitters and obtains its own global localization information. To this end, a simple idea is to assign a unique frequency to each emitter. However, doing so would obviously waste too many frequency-band resources, especially when the architecture is large. To solve this problem, we apply the technique of multifrequency signals; that is, each emitter generates a signal whose frequency is a combination of multiple frequencies, while the frequency combinations of the emitters in the architecture are different. Obviously, with the help of the multifrequency technique, the frequency-band resources can be greatly conserved. For our three typical architectures in Figure 3, we apply the dual-frequency technique to the MN emitters distributed in a rectangular form of M rows and N columns. For the emitter (i, j) lying in the ith row and jth column, its dual frequency f_{i,j} combines the two frequencies f_i^row and f_j^col. Now, we only need (M + N) different frequencies to enable the robots to distinguish the MN emitters.
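As a concrete illustration of this dual-frequency labeling, the short sketch below builds the (M + N)-frequency table and decodes a detected frequency pair back to an emitter index. The frequency values, the matching tolerance, and the helper names build_frequency_table and identify_emitter are hypothetical and only stand in for the scheme described above.

```python
# Dual-frequency labeling: emitter (i, j) is assigned the pair
# (f_row[i], f_col[j]), so M x N emitters need only M + N frequencies.

def build_frequency_table(f_row, f_col):
    """Map each emitter (i, j) to its dual-frequency pair."""
    return {(i, j): (fr, fc)
            for i, fr in enumerate(f_row)
            for j, fc in enumerate(f_col)}

def identify_emitter(detected, f_row, f_col, tol=5.0):
    """Return (i, j) of the emitter whose pair matches the detected frequencies (Hz)."""
    def seen(target):
        return any(abs(f - target) <= tol for f in detected)
    for (i, j), (fr, fc) in build_frequency_table(f_row, f_col).items():
        if seen(fr) and seen(fc):
            return (i, j)
    return None

# Example: a 2 x 3 grid served by 2 + 3 = 5 frequencies (values are illustrative).
f_row = [2000.0, 2500.0]          # Hz, one per row
f_col = [3000.0, 3500.0, 4000.0]  # Hz, one per column
print(identify_emitter([2498.0, 3503.0], f_row, f_col))  # -> (1, 1)
```

When the robot hears more than one emitter at a time, the same table can be queried for every matching pair rather than stopping at the first hit; the sketch keeps only the single-emitter case for brevity.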

Biomimetic Electric Sense-Based Localization
On the basis of our proposed hardware solution, we design localization approaches based on the biomimetic electric sense. To this end, we first build a theoretical model of the electric field generated by the electric emitter. Then, we construct three modules, including the dynamic-model module, the IMU module, and the electric sense module, each of which separately estimates the position and/or the orientation of the robot. Finally, by selectively fusing these modules, we propose three localization methods.

Model of the Electric Field
We first model the electric field generated by an emitter under the nonboundary case in a 3D environment, where the electric potential V(s) is calculated by

V(s) = \frac{V_0}{2\left(\frac{1}{R} - \frac{1}{d_e}\right)} \left( \frac{1}{\lVert s - r_+ \rVert_2} - \frac{1}{\lVert s - r_- \rVert_2} \right),   (1)

where s = (x, y, z) is a position in the coordinate system attached to the emitter, V_0 is the voltage of the emitter, R is the radius of the emitter's electrode, d_e is the distance between the emitter's two electrodes, and r_+ = (x_+, 0, 0) and r_- = (x_-, 0, 0) are the positions of the positive and negative electrodes of the emitter, respectively. ‖·‖_2 denotes the Euclidean distance. Then, based on the preceding model for the nonboundary case, repeatedly using the mirror method [16], we build the theoretical model of the electric field for the six-plane boundary case corresponding to our experimental environment [i.e., the pool shown in Figure 6(b)]. For more details about the model for the six-plane boundary case, refer to [13]. According to the theoretical model of the electric field generated by an emitter, we are able to obtain the pose information (including the position and the orientation) of the robot from its receiver's three independent measurements V_i, i = 1, 2, 3. Furthermore, we discuss the relationship among the physical parameters of the hardware and the receiver's measurements V_i(s_r), i = 1, 2, 3, where s_r is the relative pose of the receiver to the emitter in a 3D environment, and dis(s_r) denotes the distance from the receiver to the emitter. Specifically, one can see that the measurement V_i(s_r) has a positive correlation with the voltage of the emitter V_0, the radius of the emitter's electrode R, and the distance between the emitter's two electrodes d_e, respectively. In particular, when the distance d_e is five times larger than the radius R (i.e., d_e > 5R), the measurement V_i(s_r) is approximately directly proportional to V_0, R, and d_e, respectively. Also, the receiver's measurement V_i(s_r) has a positive correlation with the distance between the receiver's electrodes d_r, and this correlation becomes an approximately directly proportional relationship when the receiver is far away from the emitter [dis(s_r) > 5d_e]. Moreover, it is obvious that V_i(s_r) has a negative correlation with dis(s_r).
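A minimal numeric sketch of (1) is given below, together with the differential measurements formed between electrode pairs. The electrode placement at ±d_e/2 on the emitter axis, the choice of electrode 1 as the common reference, and the example receiver geometry are assumptions of this sketch, and the six-plane mirror corrections used for the pool are omitted; the default V_0, R, and d_e values follow the experimental parameters quoted later in the article.

```python
# Nonboundary dipole-emitter potential per (1) and receiver differential measurements.
import numpy as np

def potential(s, V0=5.0, R=0.0075, d_e=0.10):
    """Electric potential of one emitter at point s (emitter frame, meters)."""
    s = np.asarray(s, dtype=float)
    r_plus = np.array([+d_e / 2.0, 0.0, 0.0])   # assumed electrode positions
    r_minus = np.array([-d_e / 2.0, 0.0, 0.0])
    k = V0 / (2.0 * (1.0 / R - 1.0 / d_e))
    return k * (1.0 / np.linalg.norm(s - r_plus) - 1.0 / np.linalg.norm(s - r_minus))

def differential_measurements(electrodes):
    """V_i, i = 1..3: potentials of electrodes 2..4 relative to electrode 1."""
    v = [potential(e) for e in electrodes]
    return [v[i] - v[0] for i in range(1, 4)]

# Four receiver electrodes at the corners of a small square, about 1 m from the emitter.
electrodes = [(1.0, 0.1, 0.0), (1.0, -0.1, 0.0), (1.2, 0.1, 0.0), (1.2, -0.1, 0.0)]
print(differential_measurements(electrodes))
```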




Perception Modules
For a robot equipped with our electric sense-based hardware as well as an IMU sensor, we construct three perception modules, including the dynamic-model module, the IMU module, and the electric sense module (Figure 4). These three modules separately estimate the position and/or orientation of the robot, which provides a basis for us to design localization methods. Here we assume that the robot swims in two dimensions because our experimental pool is not deep enough. Thus, the three perception modules focus on the perception in a 2D environment. It is worth emphasizing that our proposed electric sense hardware solution is able to work in a 3D environment (see our discussions in the section "Real-World Applications").

Dynamic-Model Module (the Blue Module in Figure 4)
This module updates the position estimation [x_{k+1}, y_{k+1}] based on the previous estimation η_k and the control input u_k:

[x_{k+1},\, y_{k+1}]^T = \mathrm{Dynamic}(\eta_k, u_k),   (2)

where η_k = [x_k  y_k  ψ_k]^T and Dynamic(·) represents the robot's dynamic model composed of a series of equations.

IMU Module (the Green Module in Figure 4)
Since the IMU sensor has very high accuracy in the short term, the IMU module updates the orientation estimation ψ_{k+1} based on the previous estimation ψ_k and the IMU data Δψ_k^IMU:

\psi_{k+1} = \psi_k + \Delta\psi_k^{\mathrm{IMU}},   (3)

where Δψ_k^IMU is the increment of the yaw angle of the IMU.

Electric Sense Module (the Red Module in Figure 4)
This module is based on the PF to fuse the estimation information, where the sampling importance resampling (SIR) algorithm [17] is adopted. Using the SIR algorithm, the electric sense module updates the pose estimation η_{k+1} by fusing the electric receiver's measurements V = [V_1  V_2  V_3]^T together with the estimation output from its previous module(s), i.e., the dynamic-model module and/or the IMU module. Specifically, first, for each particle i, denoted by the superscript (i), the pose estimation η_{k+1}^{(i)} is updated by adding a random noise e_k^{(i)}, which represents the external disturbance, the measurement errors, and so on. Then, we sequentially update the particle importance weight w_k^{(i)}. Third, we normalize the importance weights of all of the particles and calculate the final value of the pose estimation. That is,

\eta_{k+1}^{(i)} = \eta_{k+1} + e_k^{(i)}, \quad
w_{k+1}^{(i)} = p\big(V_{k+1} \mid \eta_{k+1}^{(i)}\big)\, w_k^{(i)}, \quad
\eta_{k+1} = \sum_{i=0}^{N-1} \frac{w_{k+1}^{(i)}}{\sum_{i=0}^{N-1} w_{k+1}^{(i)}}\, \eta_{k+1}^{(i)},   (4)

where p(V | η^{(i)}) = e^{-\lVert V - V(\eta^{(i)}) \rVert_2^2 / V_{\mathrm{std}}^2} is the importance weight of the measurements, and V_std is the factor of the importance weight. We would like to mention that, according to the principle of the electric sense, our proposed electric sense-based scheme can provide not only the position (x, y) but also the orientation ψ of the robot. However, specific to the electric sense module, the robot's position is updated using the electric sense, while the robot's orientation information is mainly contributed from the IMU module, where the electric sense just serves as an auxiliary. The reason is that the IMU sensor is a very cheap and established way to estimate the orientation. Thus, we focus on estimating the position of the robot based on the electric sense when designing the localization algorithm of the electric sense module.

Figure 6. Small underwater robot and experimental platform. The electric receiver is equipped on the robot, while the electric emitters are located in the pool.
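The sketch below illustrates one SIR update in the spirit of (4). The noise scales, the V_std value, and the measure() hook standing in for the electric field model evaluated at the receiver electrodes are assumptions, not the authors' implementation.

```python
# One sampling-importance-resampling (SIR) step: jitter the predicted pose,
# weight each particle by how well its predicted measurements match V = [V1, V2, V3],
# normalize, take the weighted mean, and resample.
import numpy as np

rng = np.random.default_rng(0)

def sir_update(particles, weights, predicted_shift, V, measure, v_std=1e-4,
               noise=(0.02, 0.02, np.deg2rad(2.0))):
    # 1) Propagate every particle with the prediction from the previous module(s)
    #    and add random noise e_k^(i) for disturbances and model error.
    particles = particles + predicted_shift + rng.normal(0.0, noise, particles.shape)
    # 2) Importance weights: p(V | eta^(i)) = exp(-||V - V(eta^(i))||^2 / v_std^2).
    residuals = np.array([V - measure(p) for p in particles])
    weights = weights * np.exp(-np.sum(residuals**2, axis=1) / v_std**2)
    weights = weights / np.sum(weights)
    # 3) Weighted mean is the pose estimate; then resample to fight degeneracy.
    estimate = np.sum(weights[:, None] * particles, axis=0)
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles)), estimate
```

A caller would keep an (N, 3) array of pose particles and uniform weights, feed in the shift predicted by the dynamic-model and/or IMU modules at every step, and supply a measure(pose) function built from the field model of (1).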

Multi-Information Fusion-Based Localization Methods
By selectively fusing the previous three perception modules, we propose three localization methods (Figure 4): 1) electric sense + IMU + dynamic model, 2) electric sense + IMU, and 3) IMU + dynamic model. Specifically, the IMU module is chosen as a necessary part of all three localization methods. Since the emitters are fixed in the underwater environment, their orientations are known ahead of time by the robot. Thus, the IMU on the robot can be calibrated before the localization starts, so that the robot obtains a more accurate initial orientation. Then, for the dynamic-model module and the electric sense module, we choose one or both of them to integrate with the IMU module. These designs enable us to evaluate the contributing factors of the electric sense-based localization.

Our Robot and Experimental Platform
To verify and evaluate our proposed hardware solution to the electric sense and the corresponding localization methods, we design a robot equipped with an electric receiver for conducting the experiments. Thus, in this section, we first introduce the robot's design and its dynamic model and then present the experimental platform including the physical parameters of the electric sense hardware.

Design of the Robot
The small underwater robot that we have developed is shown in Figure 6(a) and (c); its size is about 34 × 27 × 24 cm³ (length × width × height). The robot has two propellers and a caudal fin. Each propeller is steered by a servo motor so that the direction of propulsion can be flexibly changed. The caudal fin is designed to increase the steering resistance of the robot, allowing it to swim straighter, so it is fixed on the robot. Moreover, an electric receiver consisting of four electrodes is equipped on the bottom of the robot to realize electric sense. For our experiments in this article, we only need the robot to swim in two dimensions because of the limitation of the experimental pool, so we just change the thrust of the two propellers but fix their thrust directions.

The Dynamic Model
Based on the design of the robot as well as the specific actuation mode used in our experiments, we construct a 2D dynamic model of the robot. Following the notation developed by Fossen [18], the dynamic model of an underwater robot is generally described by the equation

M\dot{v} + C(v)\,v + D(v)\,v = \tau,   (5)

where v denotes the vehicle velocity in the body-fixed frame, which contains the vehicle surge velocity, sway velocity, and yaw velocity. M is the added mass and inertia matrix. C(v) ∈ R^{3×3} is the matrix of Coriolis and centripetal terms. D(v) is the drag matrix. The propulsion forces and moments constitute the vector of body-frame forces τ ∈ R^{3×1}, which can be written as

\tau = BF = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \frac{a}{2} & -\frac{a}{2} \end{bmatrix} \begin{bmatrix} f_1(u_1) \\ f_2(u_2) \end{bmatrix},   (6)

where a is the distance between the robot's two propellers, and u_1, u_2 ∈ [0, 1] are the normalized control outputs of the left and right propellers, respectively. f_1(·), f_2(·) are the mapping relationships between the control input u and the thrust vector F, and F to τ is a linear mapping. Based on the robot's dynamic model, we use the model predictive control (MPC) method as the control law for the free-swimming robot. Specifically, we establish a rolling optimization model and perform continuous convex optimization [19]. We would like to emphasize that the control law will not affect the performance of our electric sense.
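To illustrate (5) and (6), the following sketch maps normalized propeller commands to a body-frame wrench via τ = BF and then takes one explicit Euler step of the Fossen-form dynamics. The thrust curve and the M, C(v), and D(v) matrices are placeholders, not identified parameters of the authors' robot.

```python
# Thrust allocation per (6) and a one-step Euler integration of (5).
import numpy as np

a = 0.20                                   # m, distance between the two propellers (assumed)

def thrust(u):
    """Assumed static thrust map, identical for both propellers (N)."""
    return 2.0 * u

def body_wrench(u1, u2):
    B = np.array([[1.0, 1.0],
                  [0.0, 0.0],
                  [a / 2.0, -a / 2.0]])
    F = np.array([thrust(u1), thrust(u2)])
    return B @ F                           # [surge force, sway force, yaw moment]

def euler_step(v, u1, u2, dt=0.05):
    M = np.diag([5.0, 6.0, 0.3])           # added mass + inertia (placeholder)
    D = np.diag([2.0, 3.0, 0.1])           # linear drag (placeholder)
    C = np.zeros((3, 3))                   # Coriolis/centripetal terms neglected here
    v_dot = np.linalg.solve(M, body_wrench(u1, u2) - C @ v - D @ v)
    return v + dt * v_dot

v = np.zeros(3)                            # [surge, sway, yaw] velocity
for _ in range(20):
    v = euler_step(v, u1=0.6, u2=0.4)      # slightly right-turning command
print(v)
```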



The Experimental Platform
As shown in Figure 6(b), we conduct the experiments in a pool of size 3 × 2 × 0.38 m³ (length × width × height). There is an overhead camera to capture the position and orientation of the robot, which is used as the ground truth of our localization experiments. For the electric sense-based hardware used in our experiments, the physical parameters of the emitters and the receiver are as follows: the emitter's voltage V_0 is 5 V, the radius of the emitter's electrode R is 0.75 cm, the distance between the emitter's two electrodes d_e is 10 cm, and the distance between the receiver's two electrodes d_r is 20 cm.

Furthermore, to evaluate the three typical emitter architectures, considering the limited size of the pool, we use the simple forms of their configurations, as shown in Figure 3 (left column). Specifically, each of the three architectures contains two emitters. The position and orientation of emitter 1 for all three architectures are (1.5 m, 0.5 m, 0°). The position and orientation of emitter 2 are (1.5 m, 0.5 m, 90°), (1.5 m, 1.5 m, 90°), and (1.5 m, 1.5 m, 0°) for Architecture 1, Architecture 2, and Architecture 3, respectively.

Localization Experiments
Now we conduct the localization experiments for the small underwater robot in four scenarios to fully demonstrate the effectiveness of our proposed hardware solution and localization methods. Specifically, in the first two scenarios, we separately conduct localization experiments for a stationary robot (Experiment 1) and a robot under towing (Experiment 2); that is, the robot is set to be without self-motion (its control output u = 0). Then, in the last two scenarios, we conduct localization experiments for an autonomous swimming robot (u ≠ 0), including the position tracking problem (Experiment 3) and the kidnapped robot problem, which is much more complicated (Experiment 4).

Experiment 1: Stationary Robot
In this scenario, we fix the pose of the robot and then test our electric sense-based localization scheme. Since the dynamic model of the robot does not work (u = 0), the "electric sense + IMU + dynamic-model" method is equivalent to the "electric sense + IMU" method, and the "IMU + dynamic-model" method is invalid. Therefore, in this scenario, we only need to verify the localization ability of the "electric sense + IMU" method. Note that this is a global localization problem since the robot's initial position is unknown to the robot. To this end, the initial positions of the particles, which are used in the SIR algorithm in the electric sense module, are randomly distributed throughout the experimental pool; that is, x^(i) ∈ [0, 3] m and y^(i) ∈ [0, 2] m. Meanwhile, the angle of the IMU equipped on the robot is calibrated according to the coordinate system before the experiment starts, thereby giving a rough actual value as the initial orientation of the robot. As shown in Figure 7(a), the robot is fixed at nine poses [x(m), y(m), ψ(°)] of [0.5, 0.4, 0], [0.5, 0.4, 45], [0.5, 0.4, 90], [2.6, 1.6, 180], [2.6, 1.6, 225], [2.6, 1.6, 270], [2.0, 0.5, 270], [2.0, 0.5, 135], and [2.0, 0.5, 0], respectively. The robot's actual pose is recognized by the overhead camera as the ground truth. The position error is defined by the Euclidean distance between the position estimation and the ground truth, and the orientation error is defined by the difference between the orientation estimation and the ground truth. The experiments under each pose are repeated three times. The experimental results are shown in Figure 7.

As shown in Figure 7(b) and (c), one can see that the position errors and the orientation errors under Architecture 3 are worse than those under Architectures 1 and 2. Specifically, in terms of the position estimation, there are failures under Architecture 3 for three poses (poses 1, 3, and 5). The errors of Architecture 1 and the errors of Architecture 2 are on the whole equivalent. Comparing the emitters' poses among the three architectures, one can conclude that the architectures with orthogonal emitters (Architectures 1 and 2) perform better in this static localization scenario.

Figure 7. Experiment 1: Localization of a stationary robot under the "electric sense + IMU" method. (a) The robot's nine poses, where the arrow direction represents the robot's orientation and the start of the arrow corresponds to the robot's position. (b) The average position errors for the nine poses under each emitter architecture. For Architecture 3, three of the average position errors (for poses 1, 3, and 5) are out of range of the figure. (c) The average orientation errors for the nine poses under each architecture. Archi.: Architecture.

Experiment 2: Robot Under Towing
Unlike the stationary state used in Experiment 1, in this scenario, we pull the robot from one position to another with a rope to verify the electric sense-based localization method under passive motion, as shown in Figure 8 (left column). While similar to Experiment 1, the localization task in this scenario is also a global localization problem, and we only need to verify the "electric sense + IMU" method. The robot's pose is separately estimated under the three architectures, and the experiments under each architecture are repeated three times. The experimental results are shown in Figure 8.

As shown in Figure 8, one can see that Architecture 1 has the fastest convergence speed, Architecture 2 is the second fastest, and Architecture 3 is the slowest. Further, we calculate the average position errors and the orientation errors under the electric sense-based localization method during the time period 10–30 s. The average position errors under the three architectures are 6.0 cm, 7.9 cm, and 11.4 cm, respectively. Meanwhile, the average orientation errors under the three architectures are 7.0°, 6.7°, and 6.8°, respectively. Thus, we can also draw conclusions similar to those of Experiment 1 that, in the towing scenario, the architectures with orthogonal emitters (Architectures 1 and 2) perform better. Besides, the initial position errors are quite different for each experiment since the initial positions of all particles are randomly generated, and the algorithm converges very fast in the first iteration.

Experiment 3: Position Tracking of Autonomous Swimming Robot
Unlike the passive motion in Experiment 2, in this localization scenario, the robot is steered by the MPC law to follow a "3"-shaped path. We verify the three localization methods in this scenario. The robot equipped with the receiver starts from the position [0.5, 0.4] m for each run. The experimental results are shown in Figure 9, and their statistical results are summarized in Table 3.

As shown in Figure 9, one can see that the position errors under the "electric sense + IMU + dynamic-model" method keep a good boundary, making it clearly superior to the other methods and appropriate for long-term localization. It is further shown that the electric sense module plays an important role, compared with the drifts under the "IMU + dynamic-model" method. On the other hand, together with the unstable errors under the "electric sense + IMU" method, one can conclude that the dynamic-model module plays a key role. In addition, together with the results in Table 3, we can draw two more conclusions. First, under the "electric sense + IMU + dynamic-model" method, there is no significant difference in the results of the three architectures. Second, Architecture 3 is not necessarily inferior to Architectures 1 and 2 in this autonomous swimming scenario.

Experiment 4: Kidnapped Robot Problem
For mobile robot localization, the kidnapped robot problem refers to a situation in which the robot is instantly moved to another position without being told. Although such a scenario rarely happens in practice, it is often used to test the ability of the localization method to recover from failures [20]. In the position tracking problem (Experiment 3), the initial position is roughly true. However, the initial position is wrong in the kidnapped robot problem. Thus, the kidnapped robot problem is more difficult than the global localization problem. In the experiment, the robot is first given a wrong initial estimation, and then we evaluate whether the robot can recover from the failure. The errors of the initial position are [0.4, 0.2] m, [–0.4, –0.2] m, [–0.6, –0.3] m, [0.6, 0.3] m, and [0.5, 0.5] m, respectively. The errors of the initial orientation are 0°, –20°, 0°, –20°, and –20°, respectively. Besides, we choose Architecture 2 and the above-mentioned "3"-shaped path. The experimental results are shown in Figure 10, and their statistical results are summarized in Table 4.

From Figure 10, one can see that the position errors of the "electric sense + IMU + dynamic-model" method converge faster. Specifically, the position errors of the "electric sense + IMU + dynamic-model" method converge to within 5 cm in the first 20 s, while the position errors of the "IMU + dynamic-model" method drift. Our "electric sense + IMU + dynamic-model" method performs well in this scenario. Similarly, we can also see from Table 4 that the average errors of the "electric sense + IMU + dynamic-model" method are better than those of the "IMU + dynamic-model" method. In conclusion, the "electric sense + IMU + dynamic-model" method under Architecture 2 is effective even in this very demanding scenario, i.e., the kidnapped robot problem.

Discussion
To gain insight into the effectiveness of our proposed electric sense-based scheme, in this section, we first evaluate the detectable range of a single emitter and propose measures to increase it. Then we discuss what contributes to the performance differences among different electric-emitter architectures. Further, we present the possible extensions of our scheme to real-world applications.

Detectable Range
As a sensing technology, the working scope of our electric sense-based scheme is one of its key properties. Therefore, we evaluate the detectable range of the electric emitter and then discuss how to increase it. It is known that the detectable range is directly related to the strength of the signal measured by the receiver as well as the noise at the receiver. As introduced in the section "Model of the Electric Field," the theoretical model of the electric field gives the relationship between the receiver's measurements Vi(sr), i = 1, 2, 3, and the receiver's relative pose sr to the emitter as well as other physical parameters of the electric sense hardware.
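The recovery criterion used in Experiment 4 above (the position error converging to within 5 cm and staying there) can be checked with a few lines of code. A minimal sketch with illustrative names, not the authors' evaluation script:

```python
import numpy as np

def convergence_time(t, pos_err, threshold=0.05, hold=5.0):
    """Return the first time at which the position error drops below
    `threshold` (m) and stays below it for at least `hold` seconds;
    return None if it never converges.

    t: (N,) monotonically increasing timestamps in seconds;
    pos_err: (N,) position errors in meters.
    (If the hold window runs past the end of the log, the tail counts as held.)
    """
    below = pos_err < threshold
    for i in np.flatnonzero(below):
        j = np.searchsorted(t, t[i] + hold)   # index just past the hold window
        if np.all(below[i:j]):
            return float(t[i])
    return None
```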





Figure 8. Experiment 2: Localization of a robot under towing under the “electric sense + IMU” method. From top to bottom: (a) Architecture 1, (b) Architecture 2, (c) Architecture 3. (Left column) The robot’s trajectories of ground truth (black dotted lines), emitter 1 (blue circles), and emitter 2 (red circles). (Middle column) The position errors compared to the ground truth. The different lines represent repeated experiments. (Right column) The orientation errors compared to the ground truth.



Figure 9. Experiment 3: Position tracking of a free-swimming robot under our three localization methods. From top to bottom: (a) Architecture 1, (b) Architecture 2, (c) Architecture 3. (Left column) The robot’s trajectories of the ground truth (black dotted lines), the estimated trajectories under the “electric sense + IMU + dynamic-model” method (red lines), the “electric sense + IMU” method (blue lines), and the “IMU + dynamic-model” method (yellow lines). (Middle column) The position errors compared to the ground truth. (Right column) The orientation errors compared to the ground truth.

Table 3. The experimental results of Experiment 3.

| Method | Emitter | errpos (cm) | errori |
| --- | --- | --- | --- |
| Electric + IMU + Dynamic | Archi. 1 | 16.6 | 14.8° |
| Electric + IMU + Dynamic | Archi. 2 | 15.5 | 10° |
| Electric + IMU + Dynamic | Archi. 3 | 15.3 | 10.3° |
| Electric + IMU | Archi. 1 | 35.5 | 14.3° |
| Electric + IMU | Archi. 2 | 132 | 17.9° |
| Electric + IMU | Archi. 3 | 44.7 | 7.2° |
| IMU + Dynamic | Archi. 1 | 114.4 | 6.8° |
| IMU + Dynamic | Archi. 2 | 113.7 | – |
| IMU + Dynamic | Archi. 3 | 135 | – |

Archi.: Architecture; errpos (respectively, errori) is the average position error (respectively, orientation error) under the three localization methods.

Table 4. Experimental results of Experiment 4.

| Initial error | Electric + IMU + Dynamic: errpos (cm) / errori | IMU + Dynamic: errpos (cm) / errori |
| --- | --- | --- |
| [0.4 m, 0.2 m, 0°] | 13.2 / 9.3° | 38.5 / 7.2° |
| [–0.4 m, –0.2 m, –20°] | 16.4 / – | 78.9 / 7.6° |
| [–0.6 m, –0.3 m, 0°] | 19.2 / – | 76.3 / 7.6° |
| [0.6 m, 0.3 m, –20°] | 15.4 / – | 63.9 / 7.4° |
| [0.5 m, 0.5 m, –20°] | 16.3 / 10.3° | 60.8 / – |

Then, we consider the noise at the receiver. Since our proposed electric sense-based scheme can be applied to different underwater robots or vehicles, the noise we focus on refers to the environmental noise at the receiver, excluding the electrical noise from robotic components, such as propellers. When working indoors, the noise in different rooms varies greatly because the environmental noise mainly comes


from the noise of electrical appliances and experimental equipment. However, when working in the wild, interference factors, such as electrical appliances are reduced; thus, the environmental noise is always much smaller than that indoors. Therefore, we believe that our following evaluation of the detectable range is relatively conservative since our experiments are conducted indoors in the laboratory and are inevitably affected by various indoor appliances. Now we are ready to evaluate the detectable range of a single emitter by calculating the voltage of the receiver and its SNR level. Specifically, for the electric sense-based hardware used in our experiments, taking its physical parameters given in the section “The Experimental Platform,” the receiver’s


Figure 10. Experiment 4: Kidnapped robot problem of a free-swimming robot under two localization methods. The errors of the initial pose are about [–0.4 m, –0.2 m, –20∘]. (a) The robot’s trajectories of the ground truth (black dotted lines), the estimated trajectories under the “electric sense + IMU + dynamic-model” method (red lines), and the “IMU + dynamic-model” method (yellow lines). (b) The position errors compared to the ground truth. (c) The orientation errors compared to the ground truth.


measurements Vi(sr) are calculated by the theoretical model of the electric field. Then we calculate the SNR (:= 20 log10(Vrms/Vnoise)) of the receiver's signal Vi, where Vrms is the root mean square of the receiver's voltage obtained by the quadrature method, and the signal's noise Vnoise is denoted as the standard deviation of the receiver's signal Vi when the robot is stationary. Thus, the signal's noise is about 2–10 µV, whose median is 5 µV.

Taking these calculations, we can obtain the receiver's measurement Vi and its SNR at every pose sr of the receiver relative to the emitter. Since Vi(sr) has a negative correlation with the distance dis(sr) from the receiver to the emitter, one can check that, for a given voltage threshold Vthr, the set of positions at which Vrms ≥ Vthr forms a region whose boundary line (also called the equipotential line) consists of the positions at which Vrms = Vthr. To provide an intuitive demonstration, we select four pairs of the typical voltage threshold Vthr and its SNR as (0.5 mV, 40 dB), (0.25 mV, 34 dB), (0.1 mV, 26 dB), and (0.05 mV, 20 dB) and draw the corresponding boundary lines in Figure 11. It is shown that each boundary line is an approximate ellipse. The extreme values of the distances from the emitter's center to each boundary line are denoted as Dmin and Dmax, and their values are also shown in Figure 11. That is, if it is required that the receiver's signal is valid when its SNR exceeds 20 dB, the emitter's maximum detectable range is around 2.32–2.76 m. In testing experiments, we found that a receiver signal with an SNR of 20 dB is enough for localization.

To sum up, for the electric sense-based hardware with the physical parameters of V0 = 5 V, R = 0.75 cm, de = 10 cm, and dr = 20 cm, the emitters' maximum detectable range is around 2.32–2.76 m. Until now, we have evaluated the detectable range of a single emitter. We believe that the detectable range is not small compared with the physical parameters of the hardware.

Moreover, we discuss how to increase the detectable range. From the theoretical model of the electric field, one can see that the receiver's measurement Vi has a positive correlation with four physical parameters separately: the voltage of the emitter V0, the radius of the emitter's electrode R, the distance between the emitter's two electrodes de, and the distance between the receiver's electrodes dr. Considering that high voltages could pose a variety of risks (e.g., to aquatic animals), we can increase the receiver's measurement, and thus improve the detectable range, by increasing one or more of the other three physical parameters, R, de, and dr.

Figure 11. A schematic diagram of the distribution of electric field intensity for a single emitter. The yellow (respectively, blue) area represents the high (respectively, low) electric field intensity. The black lines denote the equipotential lines under the different voltage thresholds Vthr and the corresponding SNR (Vthr, SNR, Dmin–Dmax): 0.5 mV, 40 dB, 1.02–1.32 m; 0.25 mV, 34 dB, 1.32–1.66 m; 0.1 mV, 26 dB, 1.82–2.22 m; 0.05 mV, 20 dB, 2.32–2.76 m.

Electric-Emitter Architectures
Now we gain insight into what contributes to the performance differences among the three typical electric-emitter architectures, Architecture 1, Architecture 2, and Architecture 3. First of all, we mention that the emitter density of the three architectures is identical, and each emitter has the same physical parameters (i.e., the voltage of the emitter V0, the radius of the emitter's electrode R, and the distance between the emitter's two electrodes de). Therefore, a comparison of the three architectures is essentially a comparison of the uniformity of the distribution of the electric field, including the uniformity of the intensity and the uniformity of the direction.

Ideally, our expectation is that the electric field intensity is equal everywhere in space, so that the localization accuracy of the robot in different spatial positions is relatively constant. However, the emitter's physical properties have already determined the nonuniformity of its generated field (i.e., electric field intensity), as shown in Figure 11. Meanwhile, the architecture of the emitters, including both the position and the orientation of each emitter, further affects the uniformity of the electric field intensity. Therefore, our expectation becomes making the area above a certain voltage threshold as large as possible (e.g., the yellow area in Figure 3). On the other hand, the uniformity of the electric field direction is affected by the orientation of each emitter in the architecture. If a robot can obtain a certain strength of the signal in more orientations, the robustness of its localization will be increased compared with the case when the signal is obtained in only one orientation.

Following the previous discussions, using the theoretical model of the electric field, we calculate the spatial distribution of the electric field intensity for the three architectures separately, as shown in the right column of Figure 3.
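The SNR bookkeeping above reduces to a one-line formula. The sketch below, which assumes a user-supplied field model v_rms_at(distance) rather than the authors' code, finds the largest emitter-receiver distance at which the received signal still clears a chosen SNR.

```python
import numpy as np

def snr_db(v_rms, v_noise=5e-6):
    """SNR = 20 log10(Vrms / Vnoise); the median noise level is about 5 uV."""
    return 20.0 * np.log10(v_rms / v_noise)

def detectable_range(v_rms_at, min_snr_db=20.0, d_max=5.0, step=0.01):
    """Scan outward from the emitter and return the largest distance (m)
    whose received rms voltage still meets the SNR requirement.
    `v_rms_at(d)` maps emitter-receiver distance to the receiver's rms voltage."""
    best = 0.0
    for d in np.arange(step, d_max, step):
        if snr_db(v_rms_at(d)) >= min_snr_db:
            best = d
    return best

# Toy dipole-like falloff (V ~ 1/d^3) purely for illustration, scaled so the
# result lands near the 2-3 m range reported in the article.
example_range = detectable_range(lambda d: 7e-4 / d**3)  # about 2.4 m
```

Raising the emitter voltage, electrode radius, or electrode spacing in the field model shifts the whole SNR curve upward, which is why the text recommends adjusting R, de, and dr to enlarge the detectable range.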




When such architectures are used in a specific localization task for the robot, their performance can be analyzed according to the uniformity of the distribution of the electric field, as we have discussed previously. Specific to the localization tasks in the section "Localization Experiments," we comment further on the performance of the three architectures. In the static scenario, the localization may fail since the information obtained by the electric receiver is partial and limited, which is very obvious in Architecture 3. On the other hand, in the towing scenario, the localization can be successful since the robot is continuously acquiring new electric measurements; the experimental results under Architecture 3 also reflect this. The towing scenario is equivalent to the displacement caused by an external flow or an external force. This means that our biomimetic electric sense-based localization method can be applied to an unknown flow environment, which is very meaningful for practical applications. Besides, the two types of scenarios without self-propelled motion show that Architecture 3 may be a poor architecture, while Architectures 1 and 2 perform better. This phenomenon implies that the localization results are better when adjacent emitters are orthogonal. The previous discussion may provide a guide for constructing electric-emitter architectures for different working tasks.

Real-World Applications
Based on the previous evaluations and discussions, we now summarize the appropriate real-world application environments for our proposed electric sense-based scheme. First, the hardware of our scheme can be customized for different sizes of robots/vehicles or for different scales of working spaces since the detectable range of a single emitter can be easily and flexibly changed by adjusting the physical parameters of the emitter, such as the distance between its two electrodes. Second, our localization scheme can be used in a large-scale 2D (respectively, 3D) environment by deploying a 2D (respectively, 3D) emitter architecture, especially considering the good property that the architecture is allowed to be built in a sparse manner. Specifically, we now discuss the issue of emitter density. On one hand, according to our previous evaluation of the maximum detectable range (2.32–2.76 m) of a single emitter, the emitters can be placed at intervals of about 4.64–5.52 m to make sure the robots realize continuous real-time localization. On the other hand, if a lower emitter density is still needed, the emitters can also be placed sparsely (e.g., at intervals larger than 5.52 m and even up to 10–20 m) since the convergence of our localization method in such a sparse situation has been verified by Experiment 4 (the kidnapped robot problem). Once the robot is within the detectable range of the emitters, the estimation can converge quickly, so the robot doesn't need to always stay in the detectable range. We want to point out that, although our evaluations of the maximum detectable range, and further the densities of the emitter array, are all carried out in a 2D environment, the situation is similar in a 3D environment since the electric field generated by the emitter is 3D.
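The interval figures quoted above are simply twice the single-emitter detectable range, so that the coverage regions of adjacent emitters touch. A minimal sketch of laying out such a grid (illustrative only, not the deployment used in the article):

```python
def emitter_grid(area_x, area_y, detectable_range):
    """Place emitters on a square grid with spacing 2 * detectable_range so
    adjacent coverage circles meet and the robot is never out of range for
    long. Returns a list of (x, y) emitter positions in meters."""
    spacing = 2.0 * detectable_range          # e.g., 2 * 2.32 m = 4.64 m
    xs = [spacing / 2 + i * spacing for i in range(int(area_x // spacing) + 1)]
    ys = [spacing / 2 + j * spacing for j in range(int(area_y // spacing) + 1)]
    return [(x, y) for x in xs for y in ys]

# A 20 m x 20 m working area with the conservative 2.32 m range gives a
# 5 x 5 layout (25 emitter positions) at 4.64 m spacing.
grid = emitter_grid(20.0, 20.0, 2.32)
```

A sparser layout, as discussed above, only delays convergence until the robot re-enters some emitter's detectable range; it does not break the method.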




Note also that the emitter densities are evaluated based on the maximum detectable range, which is determined by the physical parameters of the hardware used in our experiments, and the maximum detectable range can easily be increased by adjusting these parameters, thus further decreasing the required density of the emitter array. Third, by integrating other low-cost sensors (e.g., pressure sensors and IMUs), the working scope of our localization scheme could be further extended at low cost. Concretely, one can equip the robot with a pressure sensor (e.g., the CPS131 from Consensic, Inc. or the MS5803-01BA from TE Connectivity Ltd., used frequently in our previous robots [21]). The pressure sensor can obtain the ground truth of the robot's depth by calibrating against the current atmospheric pressure, which may be more reliable than using an extra camera to calculate the depth. Based on the pressure sensor, the change in depth can be obtained at millimeter-level accuracy. Besides, the robot's pitch and roll can be obtained through the IMU. When the robot works within the coverage of the 2D or 3D emitter architecture, the pressure sensor and the IMU are integrated with the electric sense for perception and localization. When the robot is out of the coverage of the emitter architecture, the pressure sensor and the IMU keep working. Once the robot is back within the architecture's coverage, the electric sense works again, and the estimation is corrected quickly. In summary, although the experiments conducted in this article are in a 2D setting and the experimental pool is not large, our proposed electric sense-based scheme is easily extended to large-scale 3D working environments. We also mention that, as far as the authors are aware, few studies have considered electric sense for free-swimming robots even in two dimensions.

Conclusion and Future Work
In this article, for free-swimming small underwater robots in a large-scale environment, we propose a novel electric sense-based localization scheme, including a hardware solution and three model-based perception methods. First, our hardware solution is composed of an ac-based electric receiver furnished on the robot and ac-based electric emitters fixed on the seabed. The robot uses the measurements of its receiver to estimate its relative position and orientation to the emitters on the seabed. Second, we construct distributed electric-emitter architectures to enable localization in a large-scale environment. Third, we design three localization methods by selectively fusing the dynamic model, IMU, and electric sense to explore the contributing factors of the localization methods. Finally, to fully verify our proposed electric sense-based scheme, we conduct four types of localization experiments, including the very complicated scenario of the kidnapped robot problem. Three typical electric-emitter architectures are compared in these experiments, and we find that architectures with orthogonal features are stable in all four localization scenarios. Our study provides a novel solution to the localization of autonomous underwater vehicles

with limited payload and also helps provide insights into largescale underwater localization. Acknowledgment This work was supported in part by the National Natural Science Foundation of China under Grant 61973007 and in part by the Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou) under Grant K19313901. The corresponding author is Chen Wang. This article has supplementary downloadable material available at https://doi.org/10.1109/ MRA.2022.3202432, provided by the authors. References [1] A. M. Yazdani, K. Sammut, O. Yakimenko, and A. Lammas, “A survey of underwater docking guidance systems,” Robot. Auton. Syst., vol. 124, p. 103,382, Feb. 2020, doi: 10.1016/j.robot.2019.103382. [2] M. A. Gibbs, “Lateral line receptors: Where do they come from developmentally and where is our research going?” Brain Behav. Evol., vol. 64, no. 3, pp. 163–181, Feb. 2004, doi: 10.1159/000079745. [3] S. Coombs, H. Bleckmann, R. R. Fay, and A. N. Popper, The Lateral Line System. New York, NY, USA: Springer Science & Business Media, 2014. [4] C. Hopkins, “Electrical perception and communication,” Encyclopedia Neurosci., vol. 3, pp. 813–831, Dec. 2009, doi: 10.1016/B978-008045046-9.01827-1. [5] I. Mahon, S. B. Williams, O. Pizarro, and M. Johnson-Roberson, “Efficient view-based SLAM using visual loop closures,” IEEE Trans. Robot., vol. 24, no. 5, pp. 1002–1014, Oct. 2008, doi: 10.1109/TRO.2008.2004888. [6] A. A. Caputi, “The bioinspiring potential of weakly electric fish,” Bioinspiration Biomimetics, vol. 12, no. 2, p. 25,004, Feb. 2017, doi: 10.1088/1748-3190/12/2/025004. [7] J. R. Solberg, K. M. Lynch, and M. A. MacIver, “Active electrolocation for underwater target localization,” Int. J. Robot. Res., vol. 27, no. 5, pp. 529–548, May 2008, doi: 10.1177/0278364908090538. [8] Y. Bai, J. B. Snyder, M. Peshkin, and M. A. MacIver, “Finding and identifying simple objects underwater with active electrosense,” Int. J. Robot. Res., vol. 34, no. 10, pp. 1255–1277, Apr. 2015, doi: 10.1177/0278364915569813. [9] S. Lanneau, F. Boyer, V. Lebastard, and S. Bazeille, “Model based estimation of ellipsoidal object using artificial electric sense,” Int. J. Robot. Res., vol. 36, no. 9, pp. 1022–1041, Jun. 2017, doi: 10.1177/0278364917709942. [10] V. Lebastard, C. Chevallereau, A. Girin, N. Servagent, P.-B. Gossiaux, and F. Boyer, “Environment reconstruction and navigation with electric sense based on a Kalman filter,” Int. J. Robot. Res., vol. 32, no. 2, pp. 172–188, Feb. 2013, doi: 10.1177/0278364912470181. [11] K. D. Dimble, J. M. Faddy, and J. S. Humbert, “Electrolocationbased underwater obstacle avoidance using wide-field integration methods,” Bioinspiration Biomimetics, vol. 9, no. 1, p. 016012, Jan. 2014, doi: 10.1088/1748-3182/9/1/016012. [12] F. Boyer, V. Lebastard, C. Chevallereau, S. Mintchev, and C. Stefanini, “Underwater navigation based on passive electric sense: New perspectives for underwater docking,” Int. J. Robot. Res., vol. 34, no. 9, pp. 1228–1250, May 2015, doi: 10.1177/0278364915572071. [13] J. Zheng, C. Huntrakul, X. Guo, C. Wang, and G. Xie, “Electric sense based pose estimation and localization for small underwater robots,” IEEE Robot. Automat. Lett., vol. 7, no. 2, pp. 2835–2842, Apr. 2022, doi: 10.1109/LRA.2022.3145094. [14] M. Porez, V. Lebastard, A. J. Ijspeert, and F. Boyer, “Multi-physics model of an electric fish-like robot: Numerical aspects and application

to obstacle avoidance,” in Proc. 2011 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 1901–1906, doi: 10.1109/IROS.2011.6094636. [15] S. Mintchev et al., “An underwater reconfigurable robot with bioinspired electric sense,” in Proc. 2012 IEEE Int. Conf. Robot. Automat. (ICRA), pp. 1149–1154, doi: 10.1109/ICRA.2012.6224956. [16] D. J. Griffiths, Introduction to Electrodynamics, 4th ed. Cambridge, U.K.: Cambridge Univ. Press, 2017. [17] T. Li, M. Bolic, and P. M. Djuric, “Resampling methods for particle filtering: Classification, implementation, and strategies,” IEEE Signal Process. Mag., vol. 32, no. 3, pp. 70–86, May 2015, doi: 10.1109/ MSP.2014.2330626. [18] T. I. Fossen, Guidance and Control of Ocean Vehicles. New York, NY, USA: Wiley, 1994. [19] M. Rubagotti, P. Patrinos, and A. Bemporad, “Stabilizing linear model predictive control under inexact numerical optimization,” IEEE Trans. Autom. Control, vol. 59, no. 6, pp. 1660–1666, Jun. 2014, doi: 10.1109/TAC.2013.2293451. [20] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA, USA: MIT Press, 2006. [21] J. Zheng, T. Zhang, C. Wang, M. Xiong, and G. Xie, “Learning for attitude holding of a robotic fish: An end-to-end approach with simto-real transfer,” IEEE Trans. Robot., vol. 38, no. 2, pp. 1287–1303, Apr. 2022, doi: 10.1109/TRO.2021.3098239.

Junzheng Zheng, State Key Laboratory for Turbulence and Complex Systems, Intelligent Biomimetic Design Lab, College of Engineering, Peking University, Beijing 100871 China. E-mail: [email protected].
Jingxian Wang, Center for Robotics and Biosystems, Northwestern University, IL 60208 USA. E-mail: [email protected].
Xin Guo, State Key Laboratory for Turbulence and Complex Systems, Intelligent Biomimetic Design Lab, College of Engineering, Peking University, Beijing 100871 China. E-mail: [email protected].
Chayutpon Huntrakul, State Key Laboratory for Turbulence and Complex Systems, Intelligent Biomimetic Design Lab, College of Engineering, Peking University, Beijing 100871 China. E-mail: [email protected].
Chen Wang, National Engineering Research Center of Software Engineering, Peking University, Beijing 100871 China; State Key Laboratory for Turbulence and Complex Systems, Intelligent Biomimetic Design Lab, College of Engineering, Peking University, Beijing 100871 China. E-mail: [email protected].
Guangming Xie, State Key Laboratory for Turbulence and Complex Systems, Intelligent Biomimetic Design Lab, College of Engineering, Peking University, Beijing 100871 China; Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou 511458 China. E-mail: [email protected].





Biomimetic Force and Impedance Adaptation Based on Broad Learning System in Stable and Unstable Tasks
Creating an Incremental and Explainable Neural Network With Functional Linkage
By Zhenyu Lu and Ning Wang

This article presents a novel biomimetic force and impedance adaptation framework based on the broad learning system (BLS) for robot control in stable and unstable environments. Different from iterative learning control, the adaptation process is

Digital Object Identifier 10.1109/MRA.2022.3188218 Date of current version: 26 August 2022


realized by a neural network (NN)-based framework, similar to a BLS, to realize a varying learning rate for feedforward force and impedance factors. The connections of NN layers and settings of the feature nodes are related to the human motor control and learning principle that is described as a relationship among feedforward force, impedance, reflex and position errors, and so on to make the NN explainable. Some comparative simulations are created and tested in five force

fields to verify advantages of the proposed framework in terms of force and trajectory tracking efficiency and accuracy, robust responses to different force situations, and continuity of force application in a mixed stable and unstable environment. Finally, an experiment is conducted to verify the effectiveness of the proposed method.

Introduction
Variable force and impedance control mechanisms are widespread in human daily movements and work, such as drilling and grasping, requiring the central nervous system (CNS) to control muscle reflexes and learning processes against intrinsic instability caused by unknown external forces, motor noises, and raw material properties when interacting with objects in our environment. In recent years, research on motor skills learning has been used for robotic applications, such as teleoperation [1], transferring skills from humans to robots [2], and haptic interaction [3], [4], integrating electromyography perception and learning technology. To realize the human motor learning process, a mechanism used by the CNS is proposed, involving both feedforward and feedback control, to represent the relationship between motor commands and movement so that the CNS can adapt the dynamics of the limb to compensate for mechanical instability and interaction forces from the environment [6]. However, even based on the common mechanism, there are several models that describe human motor control and learning, and they differ slightly from each other. Franklin et al. built a model in which both inverse dynamics control and impedance control are active during the motor learning process to counteract mechanical instability rather than just predicting final learning outcomes [5]. In [6], a V-shaped learning function is created to specify exactly how feedforward commands are adjusted to individual muscles based on tracking error. Tee et al. claimed the shortcomings of the aforementioned models and defined a model specifying the activation of each muscle that is adapted from one movement to the next and explained how humans learn to perform movements in environments with the novel dynamics of tool use [7]. By comparison, the model shown in [8] is based on a stability measurement corresponding to Lyapunov exponents,

Learning Planned Motions

Target DF Start y

Feedforward Delay

which explores the learning mechanism of human arms for unstable interaction, and each muscle is not modeled separately as in [7]. As shown in Figure 1, this diagram includes feedforward forces, motor noise, delayed reflex, muscle impedance, human body dynamics, and external forces. The feedforward forces are determined by the planned trajectory and delayed reflexes. The motor noise, inheriting in-motion generation, is considered an extension of the previous model in [9]. The impedance depends on the torque caused by muscle activation and time delays of the reflexes. Finally, the feedforward and feedback control units are combined to generate the muscle force acting on the human arm to encounter external forces. This model fully shows the human motor learning process and greatly influences subsequent studies [10]–[16], such as in [12], where Yang et al. proposed a sliding term to update the feedforward torque and impedance factors (stiffness and damping) to replace the effect of muscle contraction. Li et al. created a controller that takes into account not only feedforward force and impedance but also adjustment of the reference trajectory during interactions with an unknown environment [13]. This model is also used for impedance control of robotic upper-limb exoskeletons [15], [16]. However, most of the previous adaptation methods are based on iterative learning control and adaptive control, where the actuators’ torques are updated at fixed rates (e.g., a and b in [7]) to minimize force and position tracking errors. The rateof-force decrease in the simulations is somehow different from that observed in reality in [17]. Some recent learning methods successfully used in human activities’ monitoring [18] and robot manipulation [19], [20] are helpful to recognize and minimize the difference. The other case is that the learning rate is different in different force fields. The experimental results in [21] showed that absolute errors decrease more slowly in the divergent force (DF) field than in the velocity-dependent force (VF) field, which normally converge exponentially within 10 trails. For the aforementioned properties in experiments, this article proposes a new learning and control framework based on BLS in which the learning process is realized by increasing the number of feature or enhancement nodes and linking between different layers to increase or decrease the learning

Noise + +

+ Motor + Order

Reflex

+

Contractile Component

+

Muscle Body Force

Actual Motions

Learning

+ [0,0]

External Force

Muscle System

Impedance



x (a)

(b)

Figure 1. A simulation diagram of human movements to investigate motor control and learning (a) Point-to-point movements with lateral instability produced by a parallel-link direct drive air-magnet floating manipulandum (PFM) to show human feedforward force and impedance adaptation [12]. (b) A control diagram of the controller with neural control and feedback error learning in [8]. DF: divergent force.



q (t)

Figure 2. The learning and control diagram of a BLS-based human-like torque and impedance adaptation.

rate of the force and impedance, which is fully different from previous research. Referring to the models from [5]–[13], especially [8] and [12], we construct the motor learning and control diagram for torque and impedance, as displayed in Figure 2. Referring to [12], a sliding tracking error term is built for updating feedforward force. The difference between the proposed framework and other iterative learning methods is that the feedforward force is estimated using feature expressions and their networks, which is similar to the feedforward model based on the radial basis function (RBF) NN in [22]. As introduced in [24] and [8], the joint stiffness matrix increases with torque and muscle activation. Then, the neural feature terms in the feedforward torque block are transformed by adding noise terms to describe the contractile effect of the muscle. The impedance is expressed by the linear combination of enhancement nodes (the idea described by BLS), which is achieved by transformations of feature nodes from the feedforward torque. The impedance changes are realized by increasing the number of enhancement nodes and updating the weights according to the tracking errors. Feedback torque is then achieved based on the newly updated impedance, and both the feedforward and feedback torque are combined to generate muscle forces, as presented in Figure 1(b). Following [12], the inverse dynamics model is also used for robot control. In Figure 2, the learning process is colored with a gray background and calculated

Y W

Z1

m

Z2

H1

Zn

Hn–1

Hn

φ(XWei + βei), i = 1, ..., n ξ([Z1Z2...Zn] Whj + βhj), j = 1, ..., m X Figure 3. An illustration of BLS.

68



IEEE ROBOTICS & AUTOMATION MAGAZINE



DECEMBER 2022

by BLS, but the rules for updating connections come from the biomimetic models. Broad Learning-Based Force and Impedance Adaptation BLS A BLS is proposed based on a random vector functional-link NN and eliminates the disadvantage of a long training process [23] and so on. A BLS is a flat network where the original inputs are transferred to the feature layer as “mapped features” and the structure is extended in a wide sense in the “enhancement nodes” H i, i = 1, ..., n, as depicted in Figure 3. Recently, the BLS has been used to control robots and microair vehicles [25] and system identification [26]. In Figure 3, X is the input vector and Y ! R N # C is the output variable of the network, where N is the number of feature mapping and C is the dimension of the network’s output. To explore hidden features of the input data, feature mappings are expressed as Z i = z i ^X T W ei + b eih, i = 1, 2, ..., n, where z i is a transformation function and W ei and b ei are sampled randomly from the distribution density. Setting Z n = 6Z 1, Z 2, ..., Z n@, and the jth enhancement term H j of the functional linking networks is generated by a linear transformation function p j of Z n, similar to z i, set as H j = p j ^ Z n T W hj + b hjh, j = 1, ..., m, where W hj and b hj are randomly generated sampling data from distribution density. The linked network is expended by Y = 6Z 1,..., Z n p 1 ^Z n W h1 + b h1h,..., p m ^Z n W hm + b hmh@T W = 6Z 1, ..., Z n H 1, ..., H m@T W n m T  = 6Z H @ W (1) = ^A n +mhT W,

Where X N = 6X x@ represents that matrix X is extended in the row by x to achieve a new matrix X N . Fitting accuracy of the network is improved by inserting additional feature nodes and enhancement nodes that approach desired output Y ) [23].

BLS-Based Force and Impedance Adaptation Framework The robot system is described as M ^ q h qp + C ^q, qo h qo + G = x - x f + x v , (2)



where q ! R n represents the simplification of joint q (t) at time t ! R +, M (q) ! R n # n is the inertia matrix, C ^q, qo h ! R n # n is the Coriolis and centrifugal torque matrix, and G ! R n is the gravitational torque. x represents control torque and x f is the external torque affected by the environmental force Fe that satisfies x f = J T (q) Fe, where J T (q) is a Jacobian matrix, and x v is the noise term satisfying x v # vr 1 3 [12]. Set q ) as the reference trajectory to q, then the position error is e = q ) - q, the velocity error is eo = qo ) - qo and the sliding term in Figure 2, f / e + leo , l 2 0, is the tracking error commonly used in robot control [8], [12]. The control torque x is designed as

x

= Mqp ) + Cqo ) + G - sign (f) vr + xa , (3) 1444444 42444444 43 Torque5adaptive Robot dynamics compensation

where is output torque calculated by the BLS, and qp ) and ) qo are acceleration and velocity, respectively, of the bounded periodic reference joint q ) satisfying q ) (t) = q ) (t - T ) 1 3, T 2 0, and sign ()) is the sign function that is defined componentwise. Torque x a is created in an expression similar to that of [12, eq. (3)] in that it consists of two parts: the former is for compensating robot dynamics based on q ) (t) and bounded noise x v , and x a is composed of feedforward and feedback torques. Feature nodes in the feedforward torque are calculated by feature mapping function s i = z i ^W ei f + b eih, which is a linear or Gaussian function s i = exp 6-^f - n ihT ^f - n ih h 2i @, i = 1, 2, ..., l, where l is the number of functions, and m n i = [n i1, n i2, ..., n im] ! R is a vector consisting of the center of each receptive field, and h i is the variance. Defining S ^ f h = 6s 1, s 2, ..., s l@, ideal-weight vector W ) is defined as the optimal value of Wl that could minimize approximation errors as def T W ) = arg min l" sup Fr - S ^ f h Wl ,, (4) xa

Wl ! R

where Fr is desired feedback torque, which can be described as the reflexes in [8] or descending feedforward motor command from the CNS [7]. Kadiallah et al. utilized RBFNN learning for learning feedforward muscle activity in muscle coordinates by creating a cost function for combining the cost for movement feedback error and activation [22] The use of (4) is similar to the RBFNN-based feedforward model in [17], but the muscle visco-elasticity part is based on constant stiffness and a constant damping factor and does not reveal the relationship between feedforward torque and impedance. For this problem, the authors of [7] built a model based on the assumption that intrinsic stiffness increases linearly with the motor command. In [12] and [13], the stiffness and damping matrices are adapted based on the function of f, e,

and eo , with a varying forgetting factor of learning. In [24], stiffness in the impedance model is determined by the muscle activation. Following [27], feedback force is determined by each muscle’s stiffness and the number of activated muscles. Therefore, impedance is calculated in two steps. In the first step, stiffness K j (t) and damping factor D j (t) are created for each enhancement node (like a muscle fiber):

)



K j (t) = { kj (S (f)) + c{ kj (TS (f)) + v kj , (5) D j (t) = { dj (S (f)) + c{ dj (TS (f)) + v dj

where { kj ^ ) h is the transformation function for the jth enhancement node calculated by S ^ f h from the feedforward torque. DS ^ f h = S ^ f h t - S ^ f h t -T represents changes of S ^ f h during the learning process in (4), c is a constant factor that ensures stability of learning results, and v ij, i = k, d are noise terms. Each enhancement node generates feedback f torque x j , which is expressed in an impedance form as f

xj



= K j (t) e + D j (t) eo  = { j ^S ^ f h, DS ^ f h, v kj , v dj , c, e, eo h ,

(6)

where { j ()) is a nonlinear transformation for the term S ^ f h and its modification DS ^ f h . We can get the enhancement group containing q feedback force terms as F q = 6F1, F2, ..., Fq@ . The second step is modifying the number of enhancement node and weights of the exist feature and enhancement nodes. As the torque adaptation is the combined effect of feedforward and feedback torque, we can use (7), similar to (1), to express the torque x a as xa



= 6s 1 ^ f h, ..., s l ^ f h

{ 1 ^ S ^ f h, DS ^ f h, v 1, v 1 , c, e, eo h, ..., k

{ q ^ S ^ f h, DS ^ f h, v q, v q , c, e, eo h@

= 6S = ^A

^ f h x 1f l +q T

, ...,

h W

A

f T x q@

k

WA

d

T

d

6Wl W q@

(7)

for minimizing cost function

W A = arg min l" sup xr f - A l +q W Al ,, (8) def

Wl ! R

based on the calculation of W ) in (4), where xr f represents external torques calculated by the inverse robot dynamics at time t - T, t ! 60, T h, where T is the periodic time span. Comments Here we will comment on the new learning framework from the following two aspects. ●●  Impedance and feedback torque learning: In [7], feedforward force is described as the common effect of all the muscle fibers. In the current article, we mainly use connections of feature nodes and enhancement nodes to imitate muscle contraction mechanism, as shown in Figure 1. First, the impedance is affected by feedforward torques and their noises, just as enhancement nodes are transferred from the feature nodes. Second, each enhancement node is considered a muscle fiber and generates separate stiffness, damping, and contractile force in (5) and (6). Then, the product DECEMBER 2022



IEEE ROBOTICS & AUTOMATION MAGAZINE



69

6x 1f , x 2f , ..., x qf @ W q in (7) can be seen as the combined effect of all activated muscles. Finally, but most importantly, the number of enhancement nodes increases with tracking demands. According to the BLS, accuracy is improved by inserting additional feature nodes and enhancement nodes, which is similar to the contractile mechanism in which impedance is increased by activating more muscle fibers. Learning process: In previous studies, e.g., in [12] and [13], stiffness and damping are updated by a decayed factor and tracking errors, while in (5), K j (t) and D j (t) are calculated by using different transformation f functions for each muscle. TThen we can get x = f q q f f f T q 6x 1 , x 2 , f, x q @ W = R j = 1 ` x j j W j , and the equivaq q l e nt stif f ness Kr (t) = R j = 1 ^K j (t)hT W j and dampq q T k r (t) = R j = 1 ^D j (t)h W j . Due to { i and { kj , { di and ing D d { j can be different functions for i ! j, and the impedance factors satisfy K i (t) ! K j (t) and D i (t) ! D j (t) to imitate the difference in muscle fiber activations. The BLSbased learning structure takes advantage of sparse autoencoder characteristics to obtain better features, and the stiffness and damping adaptation is realized by adding new enhancement nodes and updating weights by the linear inverse functions in [23], which causes a nonlinear learning process. Moreover, referring to the control structure in [12], the proposed framework offers some advantages: no force sensor, control in the joint and Cartesian space, and adaptive control with unknown parameters of the robot dynamics model. T

●●

Simulations and Experiment There are three groups of comparative simulations. The first two simulations are based on the data set provided in [27] Table 1. Definitions of different force fields. Force Field Expressions DF

100 0 x ; E; E , x ! ^-0 . 2, 0 . 2 h FDF = * 0 0 y 0 , otherwise

CF

0 FCF = ; E 10

VF P-DF MF

70



13 -18 E FVF = K v xo , K v =- ; 18 13 x xc FPDF =100 ^ DX - 0.12h DX , DX / ; E - ; E y y DX Z 100 0 x x ! ^0, 0 . 2 h 1 ]; E; E , ] 0 0 x 2 y ! ^0 . 45, 0 . 6 h ] ] x ! ^0, 0 . 2 h ] 0 ];10E , y ! ^0 . 2, 0 . 45@ ] ] FMF = [ ^ ] -;13 -18E= xo G , x ! -0 . 2 , 0@ ] 18 13 yo y ! ^0 . 45, 0 . 6 h ] ] x ! ^-0 . 2 , 0@ ] DX ] 100 ^ DX - 0 . 12 h DX , y ! ^0 . 2, 0 . 45@ ]] T c \ X = 6 x, y@ , DX / X - X

IEEE ROBOTICS & AUTOMATION MAGAZINE



DECEMBER 2022

and on https://www.imperial.ac.uk/human-robotics/ software/, which are carried out on a planar arm using the two-joint model of a human arm/robot, which is detailed in [8]. In [8], there are two kinds of force field: a position-dependent DF field and a velocity dependent external force (VF) field. In [5], the authors defined two other force fields: a constant interaction force (CF) field and a position-dependent DF (P-DF). In the third simulation, we introduce a new mixed force (MF) field that combines the aforementioned four force fields to compare the effect of position tracking and force matching in a complex environment. The expressions of these force fields are listed in Table 1. Line-Tracking Task in the VF field The first simulation takes place in the VF field, and the comparison method is from [8]. The reference trajectory starts at x s = [0, 0.31] and ends at x d = [0, 0.55] with duration T = 1.3 s. The parameters are l = 5, c = 2, the feature mapping function is z i ^ x h = x, and transformation functions for impedance factors are { kj ^ x h = 2xw x and { dj ^ x h = 10xw x, where w x ! ^0.5, 1h is a randomly selected factor. The iteration times are 50. As the robust term is specific to robot control, it is not considered in the simulation as in [12]. Each iteration is completed within a periodic time and refines the parameters and calculations based on the results of the previous iteration. The simulation results are presented in Figure 4. Figure 4(a) shows the evolution of the trajectories in the first and the last three trials. The final trajectory is almost a direct line between the start and the end as planned before adaptation. Figure 4(b) and (c) show that joint torques change with time and the feedforward torque eventually approaches that which is exerted by external forces. The adjustment process of the feedforward torque and impedance of the shoulder joint with the iterations is displayed in Figure 4(d), (i), and (j). These variables can approach the final states rapidly but don’t converge to the final state within 50 periods, which is inconsistent with the experimental results that converge in exponential form within 10 trials [21]. Figure 4(e) shows trajectories under the control of the proposed method. We can see that the trajectories converge within the first three iterations and the final trajectories are straighter, benefiting from the faster and more efficient convergence rate of the feedforward torque and impedance factors, as illustrated in Figure 4(h), (k), and (l) (fewer than 10 iterations). Similar to the results in [8], impedance factors increase initially and then decrease to the final value within the next trials. Tracking Tasks in Different Force Fields The second simulation group is taken within four kinds of force fields in Table 1: DF, CF, VF, and P-DF to complete the same task that the actuator moves along the following four line segments first: 1) from x s = [-0.12, 0.43] to x d = [0.12, 0.43] 2) from x s = [-0.085, 0.345] to x d = [0.085, 0.515]

0.5

0.5

0.45

0.45

0.4 0.35 0.3

1

Torque (m.rad)

0.55

Y (m)

Y (m)

0.55

0.4 0.35

–0.05

0.3

0 0.05 X (m)

0 –1 –2 –3

–0.05

External Torque Feedforward Torque Feedback Torque

0 0.05 X (m)

0

(a)

0 Torque (N.rad)

Torque (N.rad)

0 –0.2 –0.4

1

Feedforward Torque External Torque

External Torque Feedforward Torque Feedback Torque

0.2

0.5 Time (s) (b)

–1

–2

–0.6 –3 0

0.5 Time (s) (c)

0.55

Y (m)

Y (m)

0

0.55

1

0.5

0

0.5 0.45

1

Torque (m.rad)

–0.8

0.45

0.4 0.4

0.5 Time (s) (d)

1

External Torque Feedforward Torque Feedback Torque

–1 –2

0.35 0.3

–3

0.35 –0.05

0 0.05 X (m)

–0.05 (e)

0 0.05 X (m)

0.5 Time (s)

1

Feedforward Torque External Torque

–1 –2 –3 –4

0

1

0 Torque (N.rad)

Torque (N.rad)

–0.8

0.5 Time (s) (f)

1

External Torque Feedforward Torque Feedback Torque

–0.2

0

0

0.5 Time (s)

1

(h)

(g)

Figure 4. A simulation of adaptation to unstable dynamics as in a velocity dependent external force (VF) field. (a) Trajectories in the first and last three iterations in [8]. (b) and (c) Feedback and feedforward joint torques and the torque affected by external forces in [8]. (d) The evolution of feedforward torque in [8]. (e) Trajectories in the first and last three iterations in the proposed method. (f) and (g) Feedback and feedforward joint torques and the torque affected by external forces in the proposed method. (h) The evolution of feedforward torque in the proposed method. (Continued)

DECEMBER 2022



IEEE ROBOTICS & AUTOMATION MAGAZINE



71

40 30 20 10

10

5

0 0

10

20 30 Iterations

40

D11 D12(D21) D22

15

K11 K12(K21) K22

Damping (N.s/rad)

Stiffness (N/m)

50

50

0

10

(i)

50

30 20

D11 D12(D21) D22

15 Damping (N.s/rad)

Stiffness (N.s/rad)

40

40

(j) K11 K12(K21) K22

50

20 30 Iterations

10

5

10 0

10

20 30 Iterations

40

50

(k)

0

10

20 30 Iterations

40

50

(l)

Figure 4. (Continued) (i) and (j) The stiffness and damping adaptation in [8]. (k) and (i) The stiffness and damping adaptation in the proposed method.

3) from x s = [0, 0.31] to x d = [0, 0.55] 4) from x s = [0.085, 0.345] to x d = [-0.085, 0.515] . Then the actuator will track the circular trajectory

)

x =-r sin (kv (t)) (9) y = r cos (kv (t)) + 0.43

with reference velocity profile as

2 2 v (t) = 30t3 ` 1 - 2 ` t j + ` t j j, (10) T T T

where r = 0.12, k = 15.2 with period T = 4.5 s and 50 iterations. The results are shown in Figure 5. Each row shows trajectories generated in the same environment and each column compares those with different conditions. From Figure 5(b), (f), (j), and (n) and (d), (h), (l), and (p), it can be seen that the proposed controller responds well in different cases. After the first three trials, the trajectories are very close to the desired values and the final trajectories match the planned curves in every case. However, the final errors for the DF and VF are slightly larger than those for the CF and P-DF, which are mainly affected by the absolute magnitude of the external forces. The initial tracking effects are also different from each other. In the line-tracking task, initial position 72



IEEE ROBOTICS & AUTOMATION MAGAZINE



DECEMBER 2022

errors in the CF field are smaller than in the other three fields. Absolute errors in the DF field decrease more slowly than those in the VF and P-DF force fields, which coincide with the experimental situations in [21]. Position errors to the trajectory ends in the P-DF force field are somewhat larger than in other fields. The circular-tracking task achieves a similar conclusion. In the first trail, there is not a trajectory that can complete a circle. In the third trail, the trajectories in the CF and P-DF force fields can realize end-to-end connections and are close to the circles. The worst performance is shown in Figure 5(c) in the VF field, so more time is needed for trajectory convergence, and even after 50 periods there are still some overlapping trajectory points. Circular-Tracking Tasks in Complex Environments In this simulation, we first set xo = 0.2 m/s, yo = 0.3 m/s as an example and plot the force vectors of the MF field within the area x ! ^-0.2, 0.2h, y ! ^-0.3, 0.6h, as shown in Figure 6(a), which contains four force fields with four different background colors. The simulation uses the reference trajectory in (9), starting and ending at ^0, 0.55h . The adaptation is simulated in 50 iterations and T = 4.5 s. Each trajectory passes sequentially through four force fields, numbered from phase 1 to 4, as indicated by the same colors in Figure 6(d), (e), (h), and (i).

Divergent Force Field

Constant Interaction Force Field

Velocity Dependent External Force Field



0.1

0.1

0.2

0

0.1

X (m)

(n)

(m)

0

0.1

Last Three Trajectories

X (m)

0.3

0

0.3

0.5

–0.1

–0.1

0.1

Last Three Trajectories

(f)

X (m)

0

(j) First Trajectory Second Trajectory Third Trajectory

0.6

0.3

0.1

Last Three Trajectories

(b)

X (m)

0

(i)

0.2

–0.1

–0.1

X (m)

0.1

0.4

0.5

0.6

0.3

0.4

0.5

0.6

0.3

Last Three Trajectories

X (m)

0

0.2

First Trajectory Second Trajectory Third Trajectory

0.1

First Trajectory Second Trajectory Third Trajectory

(e)

X (m)

0

(a)

X (m)

0

0.4

–0.1

–0.1

–0.1

–0.1

0.4

0.5

0.6

0.4

0.5

0.6

0.3

0.4

0.5

0.6

0.3

0.4

0.5

0.6

0.3

0.35

0.4

0.45

0.5

0.55

First Trajectory Second Trajectory Third Trajectory

0.3

0.4

0.5

0.6

0.3

0.4

0.5

0.6

0.3

0.4

0.5

0.6

0.3

0.4

0.5

0.6

–0.2

–0.1

–0.1

–0.1

–0.1

0.1

First Trajectory Second Trajectory Third Trajectory

0.1

0.1

(o)

X (m)

0

0.1

0.2

First Trajectory Second Trajectory Third Trajectory

(k)

X (m)

0

First Trajectory Second Trajectory Third Trajectory

(g)

X (m)

0

(c)

X (m)

0

First Trajectory Second Trajectory Third Trajectory

–0.2

0.3

0.4

0.5

0.6

0.3

0.4

0.5

0.6

0.3

0.4

0.5

0.6

0.3

0.4

0.5

0.6

–0.1

–0.1

–0.1

–0.1

0.1

0.1

0.1

(p)

X (m)

0

0.1

0.2

Last Three Trajectories

(l)

X (m)

0

Last Three Trajectories

(h)

X (m)

0

Last Three Trajectories

(d)

X (m)

0

Last Three Trajectories

Figure 5. A simulation of trajectory adaptation in different conditions in Table 1. (a)–(d) Trajectories in the first and last three iterations that follow line segments and (9) in the DF field. (e)–(h) Trajectories in the first and last three iterations that follow line segments and (9) in the CF field. (i)–(l) Trajectories in the first and last three iterations that follow line segments and (9) in the VF field. (m)–(p) Trajectories in the first and last three iterations that follow line segments and (9) in P-DF field.

Y (m)

Y (m) Y (m) Y (m) Y (m)

Y (m) Y (m) Y (m) Y (m)

Y (m) Y (m) Y (m) Y (m)

Y (m)

Y (m)

Y (m)

Position-Dependent Divergent Force DF Field

DECEMBER 2022

IEEE ROBOTICS & AUTOMATION MAGAZINE



73

Simulation results of the controller in [8] for the circulartracking task are shown in Figure 6(b)–(e). In phases 2 and 4, the position errors are the largest compared with the results in phases 1 and 3, which are consistent with the conclusion in simulation 2. Figure 6(d) and (e) show the torque changes in different phases. Feedforward torque has a large jump at the phase-changing time, which is affected by the changes of external torques, and some values are larger than 1 N.rad per

step so that they will bring instability to system actuators. In the proposed method, the trajectories quickly converge to the desired values in all the phases except for phase 4, and the smoothness of the final trajectory is similar to that in Figure 6(c). On the other hand, compared with Figure 6(d) and (e), continuity and uniformity of the feedforward force in Figure 6(h) and (i) are much better, especially at the phaseshifting moments between phases 1 and 2, and phases 3 and

0.65 0.6 0.65

Y (m)

0.65 0.45 0.4 0.35 0.3 0.25

–0.2

–0.1

0 X (m)

0.1

0.2

(a) First Trajectory Second Trajectory Third Trajectory

0.6

0.6 Last Three Trajectories 0.5

Y (m)

Y (m)

0.5

0.4

0.4

0.3

0.3 0 X (m) (b)

–0.2

External Torque Feedforward Torque Feedback Torque State Switching Point

8 Torque (N .rad)

0.1

6 4 2 0

–0.1

0 X (m) (c)

0.1

0.2

External Torque Feedforward Torque Feedback Torque State Switching Point

6 Torque (N .rad)

–0.1

4 2 0 –2

–2

Phase 1

0

Phase 2

1

Phase 3

2 3 Time (s) (d)

Phase 4

4

–4

Phase 1

0

1

Phase 2

2 Time (s) (e)

Phase 3

Phase 4

3

4

Figure 6. A simulation of the trajectory and torque adaptation in the MF field in Table 1. (a) The force field in four phases. (b) and (c) Trajectories in the first and last three periods in [8]. (d) and (e) Joint and external torques in the final period in [8]. (f) and (g) Trajectories in the first and last three periods in the proposed method. (h) and (i) Joint torques and external torques in the final period in the proposed method.

This is consistent with the fact that the reactions of human muscles are smooth and continuous, and it shows that the proposed method is more suitable for processing tasks in a complex environment.
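As a minimal sketch of the per-step smoothness criterion discussed above, the following Python snippet measures the largest step-to-step change in a feedforward torque trace and applies a simple rate limiter. The 1.0 N·rad-per-step bound comes from the text; the function names and the rate-limiting strategy itself are illustrative assumptions, not the controller used in this work.

import numpy as np

def max_step_increment(tau_ff):
    """Largest per-step change in a feedforward torque trace (N*rad per step)."""
    return np.max(np.abs(np.diff(tau_ff)))

def rate_limit(tau_ff, max_delta=1.0):
    """Clip per-step increments so the commanded feedforward torque never
    jumps by more than max_delta between consecutive control steps."""
    limited = np.empty_like(tau_ff)
    limited[0] = tau_ff[0]
    for k in range(1, len(tau_ff)):
        delta = np.clip(tau_ff[k] - limited[k - 1], -max_delta, max_delta)
        limited[k] = limited[k - 1] + delta
    return limited

# Example: a torque trace with a jump at a phase switch
tau = np.concatenate([np.zeros(50), 3.0 * np.ones(50)])
print(max_step_increment(tau))              # 3.0 -> exceeds the 1 N*rad-per-step bound
print(max_step_increment(rate_limit(tau)))  # 1.0 after rate limiting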


Experiment
In this experiment, a robot arm is to follow a trajectory using a pen that is placed in a forcing environment. As shown in Figure 7, the experiment is performed in an unstable environment with four elastic ropes attached to the corner rods of a wooden board. A pen is fixed to the end of the Franka robot arm, with a crank connecting the other ends of the elastic ropes, to record the trajectory in the unstable environment. The Franka robot can record the contact force through its embedded force sensors, as well as the joint and Cartesian-space positions. As the forces provided by the elastic ropes are determined by their deformations along the pulling directions, the force field in the X–Y plane is shown in the zoomed view in Figure 7. We can see that the pen and the robot arm, except at the center point, will be affected by an extra force that pushes the pen toward a balanced center position. If we want the robot to draw a picture on the paper, the robot has to resist this extra force to complete the drawings.
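For intuition about this unstable force field, the following sketch evaluates the planar force that four corner-anchored elastic ropes would exert on the pen, assuming each rope behaves as a linear spring that pulls only when stretched. The anchor positions, stiffness, and rest length are hypothetical values for illustration, not the measured parameters of the setup.

import numpy as np

# Hypothetical geometry: four rope anchors at the corners of the board (m),
# with an illustrative rope stiffness and rest length.
ANCHORS = np.array([[0.3, 0.3], [0.3, -0.3], [-0.3, 0.3], [-0.3, -0.3]])
K_ROPE = 50.0    # N/m (illustrative)
REST_LEN = 0.25  # m  (illustrative)

def rope_force(p):
    """Net planar force on the pen at position p = (x, y).

    Each rope is modeled as a linear spring that only pulls when stretched
    beyond its rest length, so the net force pushes the pen back toward the
    balanced center position, as observed in the experiment."""
    p = np.asarray(p, dtype=float)
    total = np.zeros(2)
    for a in ANCHORS:
        vec = a - p
        dist = np.linalg.norm(vec)
        stretch = max(dist - REST_LEN, 0.0)
        if dist > 1e-9:
            total += K_ROPE * stretch * vec / dist
    return total

print(rope_force([0.0, 0.0]))   # ~[0, 0] at the balanced center
print(rope_force([0.05, 0.0]))  # restoring force pointing back toward the center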


Figure 7. An experimental setup with the force field.



Figure 8. Human demonstrations and robot force and impedance adaptation control. (a)–(c) A demonstration through kinesthetic teaching. (d) The teaching result. (e)–(g) Robot force and impedance adaptation to track the demonstrated trajectory. (h) The trajectory adaptation results.




Figure 8(a)–(d) shows the process and the final drawing of the demonstration. The human operator held the pen to draw a square while the robot arm recorded the position during the drawing. Then the robot tried to follow the human demonstration by increasing the feedforward force and impedance factors. Here, we use the robot to record the trajectory in one trial and send the data to the simulation environment to generate the force and impedance for the next trial, completing the trajectory tracking in the unstable force field. In Figure 8(e) and (f), the robot tried several times to finally follow the trajectory demonstrated by the human operator.
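The trial-by-trial procedure described above, recording a trajectory and then regenerating the force and impedance for the next attempt, can be sketched as follows. The update law, gains, and toy data are illustrative assumptions and are not the BLS-based scheme proposed in this article.

import numpy as np

def adapt_trial(u_ff, K, x_d, x_meas, alpha=0.5, beta=20.0, gamma=0.05):
    """One trial of a simple biomimetic-style adaptation (illustrative).

    u_ff   : feedforward force profile from the previous trial (N), shape (T,)
    K      : stiffness profile from the previous trial (N/m), shape (T,)
    x_d    : desired trajectory, shape (T,)
    x_meas : trajectory recorded on the robot in the previous trial, shape (T,)

    The feedforward term is corrected by the tracking error, and the stiffness
    grows with the error magnitude while slowly relaxing when the error is small."""
    e = x_d - x_meas
    u_next = u_ff + alpha * e                                  # error-driven feedforward update
    K_next = np.clip(K + beta * np.abs(e) - gamma * K, 0.0, None)
    return u_next, K_next

# Example: repeat trials until the (stand-in) tracking error shrinks
T = 100
x_d = np.linspace(0.0, 0.1, T)
u_ff, K = np.zeros(T), np.full(T, 100.0)
for trial in range(5):
    x_meas = x_d - 0.02 * np.exp(-0.1 * trial * np.ones(T))   # placeholder for a recorded trial
    u_ff, K = adapt_trial(u_ff, K, x_d, x_meas)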

Discussion
The proposed learning and control framework transfers iterative learning control into a full NN, and the neural node linkage is endowed with biomimetic meanings; i.e., the impedances are affected by the feedforward torques by adding several enhancement nodes that have nonlinear relationships with the feature-mapping functions. The aforementioned quantized simulation results show that the proposed framework has a faster convergence rate to the desired effect, a more robust response to different environmental conditions, and better continuity and uniformity of the feedforward force in a complex force field. Nevertheless, there is room for improvement in the current framework. For example, the learning rates of the torques and impedances are not controllable, while in [8] they can be changed by choosing appropriate gain matrices. In addition, the difference between the force drop in the simulation and the force drop observed in reality is difficult to diminish through this framework. Today, deep learning methods can transfer motions and parameters from human demonstrations to robot manipulation [15], [16] to minimize the error between simulation and reality. In the future, it is desirable to combine this work with deep learning methods to develop a system for human skill learning and generalization.
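For readers unfamiliar with the BLS structure referenced here, the following is a generic, minimal broad learning system in the spirit of [23]: random feature-mapping nodes, nonlinear enhancement nodes linked to them, and output weights solved in closed form by ridge regression. The layer sizes, random maps, and toy data are placeholders; this is not the specific force/impedance network of this article.

import numpy as np

rng = np.random.default_rng(0)

def bls_fit(X, Y, n_feature=40, n_enhance=60, lam=1e-3):
    """Train a minimal broad learning system (flat network) by ridge regression.

    Feature nodes Z are random linear maps of the input; enhancement nodes H are
    nonlinear (tanh) maps of Z, mirroring the functional linkage between feature
    and enhancement nodes described in the text. The output weights W are solved
    in closed form, so new nodes could be appended incrementally without deep training."""
    Wf = rng.normal(size=(X.shape[1] + 1, n_feature))
    We = rng.normal(size=(n_feature + 1, n_enhance))
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Z = Xb @ Wf                                                  # feature-mapping nodes
    H = np.tanh(np.hstack([Z, np.ones((Z.shape[0], 1))]) @ We)   # enhancement nodes
    A = np.hstack([Z, H])
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return Wf, We, W

def bls_predict(X, Wf, We, W):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Z = Xb @ Wf
    H = np.tanh(np.hstack([Z, np.ones((Z.shape[0], 1))]) @ We)
    return np.hstack([Z, H]) @ W

# Toy usage: map a state/error vector to feedforward torque and stiffness targets
X = rng.normal(size=(200, 4))                         # placeholder inputs
Y = np.hstack([np.sin(X[:, :1]), np.abs(X[:, 1:2])])  # placeholder targets
params = bls_fit(X, Y)
print(bls_predict(X[:5], *params).shape)              # (5, 2)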

Conclusion
In this article, a new biomimetic learning and control framework for force and impedance was proposed based on the theory of BLS. Following the CNS and the human motor working mechanism, an incremental and explainable NN with functional linkage was constructed for force and impedance adaptation and trajectory tracking in stable and unstable force fields. Some principles, such as the impedance of muscle fibers and the amount of activated muscle fibers, were reflected by enhancement nodes in a flat NN; these have hardly been modeled in previous research. Three groups of comparative simulations based on the open-access data set were performed to verify the merits of tracking speed and


accuracy, robustness to different force-field environments, and smoothness and continuity of force changes in complex unstable and stable environments. Our simulation results even prove, to some extent, the accuracy and effectiveness of the proposed method for some experimental phenomena reported in previous research. In future work, we will explore optimization of the current framework and combine this research with human-like skill learning and generalization and imitation learning.

Acknowledgments
This work was supported by the Horizon 2020 Marie Skłodowska-Curie Actions Individual Fellowship under grant 101030691. Ning Wang is the corresponding author.

References
[1] C. Yang, J. Luo, Y. Pan, Z. Liu, and C.-Y. Su, "Personalized variable gain control with tremor attenuation for robot teleoperation," IEEE Trans. Syst., Man, Cybern. Syst., vol. 48, no. 10, pp. 1759–1770, Oct. 2018, doi: 10.1109/TSMC.2017.2694020.
[2] C. Yang, C. Zeng, Y. Cong, N. Wang, and M. Wang, "A learning framework of adaptive manipulative skills from human to robot," IEEE Trans. Ind. Informat., vol. 15, no. 2, pp. 1153–1161, 2019, doi: 10.1109/TII.2018.2826064.
[3] C. Yang, K. Huang, H. Cheng, Y. Li, and C.-Y. Su, "Haptic identification by ELM-controlled uncertain manipulator," IEEE Trans. Syst., Man, Cybern. Syst., vol. 47, no. 8, pp. 2398–2409, Aug. 2017, doi: 10.1109/TSMC.2017.2676022.
[4] C. Yang, J. Luo, C. Liu, M. Li, and S. Dai, "Haptics electromyography perception and learning enhanced intelligence for teleoperated robot," IEEE Trans. Autom. Sci. Eng., vol. 16, no. 4, pp. 1512–1521, Oct. 2019, doi: 10.1109/TASE.2018.2874454.
[5] D. W. Franklin, R. Osu, E. Burdet, M. Kawato, and T. E. Milner, "Adaptation to stable and unstable dynamics achieved by combined impedance control and inverse dynamics model," J. Neurophysiol., vol. 90, no. 5, pp. 3270–3282, 2003, doi: 10.1152/jn.01112.2002.
[6] D. W. Franklin et al., "CNS learns stable, accurate, and efficient movements using a simple algorithm," J. Neurosci., vol. 28, no. 44, pp. 11,165–11,173, 2008, doi: 10.1523/JNEUROSCI.3099-08.2008.
[7] K. P. Tee, D. W. Franklin, M. Kawato, T. E. Milner, and E. Burdet, "Concurrent adaptation of force and impedance in the redundant muscle system," Biol. Cybern., vol. 102, no. 1, pp. 31–44, 2010, doi: 10.1007/s00422-009-0348-z.
[8] E. Burdet et al., "Stability and motor adaptation in human arm movements," Biol. Cybern., vol. 94, no. 1, pp. 20–32, 2006, doi: 10.1007/s00422-005-0025-9.
[9] E. Burdet, K. P. Tee, C. M. Chew, J. Peters, and V. L. Bt, "Hybrid IDM/impedance learning in human movements," in Proc. Int. Symp. Meas., Analysis Model. Hum. Functions, Sapporo, Japan, 2001, pp. 340–345.
[10] G. Ganesh, A. Albu-Schäffer, M. Haruno, M. Kawato, and E. Burdet, "Biomimetic motor behavior for simultaneous adaptation of force, impedance and trajectory in interaction tasks," in Proc. IEEE Int. Conf. Robot. Automat., 2010, pp. 2705–2711, doi: 10.1109/ROBOT.2010.5509994.
[11] C. Zeng, C. Yang, and Z. Chen, "Bio-inspired robotic impedance adaptation for human-robot collaborative tasks," Sci. China Inf. Sci., vol. 63, no. 7, pp. 1–10, 2020, doi: 10.1007/s11432-019-2748-x.
[12] C. Yang, G. Ganesh, S. Haddadin, S. Parusel, A. Albu-Schaeffer, and E. Burdet, "Human-like adaptation of force and impedance in stable and unstable interactions," IEEE Trans. Robot., vol. 27, no. 5, pp. 918–930, 2011, doi: 10.1109/TRO.2011.2158251.

[13] Y. Li, G. Ganesh, N. Jarrassé, S. Haddadin, A. Albu-Schaeffer, and E. Burdet, "Force, impedance, and trajectory learning for contact tooling and haptic identification," IEEE Trans. Robot., vol. 34, no. 5, pp. 1170–1182, 2018, doi: 10.1109/TRO.2018.2830405.
[14] C. Yang, Y. Jiang, W. He, J. Na, Z. Li, and B. Xu, "Adaptive parameter estimation and control design for robot manipulators with finite-time convergence," IEEE Trans. Ind. Electron., vol. 65, no. 10, pp. 8112–8123, 2018, doi: 10.1109/TIE.2018.2803773.
[15] Z. Li, Z. Huang, W. He, and C.-Y. Su, "Adaptive impedance control for an upper limb robotic exoskeleton using biological signals," IEEE Trans. Ind. Electron., vol. 64, no. 2, pp. 1664–1674, 2016, doi: 10.1109/TIE.2016.2538741.
[16] Z. Li, C.-Y. Su, G. Li, and H. Su, "Fuzzy approximation-based adaptive backstepping control of an exoskeleton for human upper limbs," IEEE Trans. Fuzzy Syst., vol. 23, no. 3, pp. 555–566, 2014, doi: 10.1109/TFUZZ.2014.2317511.
[17] R. A. Scheidt, D. J. Reinkensmeyer, M. A. Conditt, W. Z. Rymer, and F. A. Mussa-Ivaldi, "Persistence of motor adaptation during constrained, multijoint, arm movements," J. Neurophysiol., vol. 84, no. 2, pp. 853–862, 2000, doi: 10.1152/jn.2000.84.2.853.
[18] W. Qi, H. Su, and A. Aliverti, "A smartphone-based adaptive recognition and real-time monitoring system for human activities," IEEE Trans. Human-Mach. Syst., vol. 50, no. 5, pp. 414–423, Oct. 2020, doi: 10.1109/THMS.2020.2984181.
[19] H. Su, W. Qi, Y. Hu, H. R. Karimi, G. Ferrigno, and E. De Momi, "An incremental learning framework for human-like redundancy optimization of anthropomorphic manipulators," IEEE Trans. Ind. Informat., vol. 18, no. 3, pp. 1864–1872, 2020, doi: 10.1109/TII.2020.3036693.
[20] H. Su, Y. Hu, H. R. Karimi, A. Knoll, G. Ferrigno, and E. De Momi, "Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results," Neural Netw., vol. 131, pp. 291–299, Nov. 2020, doi: 10.1016/j.neunet.2020.07.033.
[21] R. Osu, E. Burdet, D. W. Franklin, T. E. Milner, and M. Kawato, "Different mechanisms involved in adaptation to stable and unstable dynamics," J. Neurophysiol., vol. 90, no. 5, pp. 3255–3269, 2003, doi: 10.1152/jn.00073.2003.
[22] A. Kadiallah, D. W. Franklin, and E. Burdet, "Generalization in adaptation to stable and unstable dynamics," PLoS One, vol. 7, no. 10, p. e45075, 2012, doi: 10.1371/journal.pone.0045075.
[23] C. P. Chen and Z. Liu, "Broad learning system: An effective and efficient incremental learning system without the need for deep architecture," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 1, pp. 10–24, 2017, doi: 10.1109/TNNLS.2017.2716952.
[24] K. P. Tee, E. Burdet, C.-M. Chew, and T. E. Milner, "A model of force and impedance in human arm movements," Biol. Cybern., vol. 90, no. 5, pp. 368–375, 2004, doi: 10.1007/s00422-004-0484-4.
[25] H. Huang, T. Zhang, C. Yang, and C. P. Chen, "Motor learning and generalization using broad learning adaptive neural control," IEEE Trans. Ind. Electron., vol. 67, no. 10, pp. 8608–8617, 2019, doi: 10.1109/TIE.2019.2950853.
[26] Z. Lu, N. Wang, and C. Yang, "A novel iterative identification based on the optimised topology for common state monitoring in wireless sensor networks," Int. J. Syst. Sci., vol. 53, no. 1, pp. 25–39, doi: 10.1080/00207721.2021.1936275.
[27] E. Burdet, D. W. Franklin, and T. E. Milner, Human Robotics: Neuro-Mechanics and Motor Control. Cambridge, MA, USA: MIT Press, 2013.

Zhenyu Lu, Bristol Robotics Laboratory, University of the West of England, BS16 1QY, U.K. Email: [email protected].
Ning Wang, Bristol Robotics Laboratory, University of the West of England, BS16 1QY, U.K. Email: [email protected].




Controlling Maneuverability of a Bio-Inspired Swimming Robot Through Morphological Transformation
Morphology Driven Control of a Swimming Robot

By Kai Junge, Nana Obayashi, Francesco Stella, Cosimo Della Santina, and Josie Hughes

Digital Object Identifier 10.1109/MRA.2022.3198821
Date of current version: 1 September 2022

Biology provides many examples of how body adaptation can be used to achieve a change in functionality. The feather star, an underwater crinoid that uses feather arms to locomote and feed, is one such system; it releases its arms to


distract prey and vary its maneuverability to help escape predators. Using this crinoid as inspiration, we develop a robotic system that can alter its interaction with the environment by changing its morphology. We propose a robot that can actuate layers of flexible feathers and detach them at will. We first optimize the geometric and control parameters for a flexible feather using a hydrodynamic simulation followed by physical experiments. Second, we provide a theoretical framework for understanding how body change affects controllability. Third, we present a novel design of a soft swimming robot (Figure 1) with the ability to change its morphology. Using this optimized feather and theoretical framework, we demonstrate, on a robotic setup, how the detachment of feathers can be used to change the motion path while maintaining the same low-level controller.

Overview
The feather star is a marine crinoid, an invertebrate with multiple soft "feathery" arms that enable swimming and maneuvering to avoid prey and feeding on drifting microorganisms [1]. These animals show many fascinating properties, including their deformable feather-like structure and cyclically actuated muscles. One of their properties, which is believed to be unique to echinoderms, is mutable collagenous tissue [2]. This enables them to drastically alter their body structure within a timescale of seconds, under direct control of the nervous system. In the case of feather stars, they use this tissue to detach their feathered arms. It is believed that this mechanism is used to distract prey and also to change feather stars' dynamics to assist with evading predators [3]. This demonstrated ability to drastically alter the body morphology and passive properties to alter maneuverability is of keen interest to the robotics community. It provides inspiration for the development of robots that can utilize or change their body structure to aid their end goal or, indeed, their survival [4]. Thus, the goal of this work is to develop a feather star-inspired robot that uses an artificial equivalent of this "mutable tissue" to change its body structure. Furthermore, we present a theoretical model to investigate how the detachment of the arms affects the control and maneuverability of the design.
Within the domain of underwater bioinspired robots, there have been a number of notable examples where limbs (similar to the feathers on the feather star) and their controllers have been optimized to maximize the generated thrust [6]. This includes an octopus-inspired robot [7] and a starfish robot [8]. While these examples consider the optimization of the design of the structure to maximize thrust or behavioral range, there are limited examples of underwater robots that show considerable changes in body structure to aid control. Developing and designing robots that can utilize change in the passive properties or morphology is a key quest for embodied intelligence researchers. The role of morphology-driven control has been previously formalized [9] and shown to aid in achieving stability in legged underwater vehicles [10] and shaping the behavioral landscape of complex systems [11].

Figure 1. The (a, c) developed feather star robot with multiple actuated rings and (b) detachable feathers and its biological inspiration [5].

This previous work has highlighted the potential for morphology-driven control, which could be particularly beneficial in aquatic environments where fluid–structure interactions can be complex and challenging to control and exploit. To explore these capabilities, we must first create robots or structures that show significant variation in their physical structure or passive properties. To date, this has mostly been demonstrated through stiffness change in robotic systems [12] or modular reconfigurable robotic systems [13]. Using these new capabilities, we must then optimize the morphology for optimal thrust generation [14], [15] and address how we should design the global structure before and after body changes to achieve morphology-driven control. By developing a feather star-inspired robot with detachable feathers, we introduce a new approach to achieving significant morphological transformation in a swimming robot, which we then use to explore how body adaptation can




be used to assist with maneuvering. The novel robotic system is formed from multiple layers of actuated rings of feathers that it uses as a means of thrust generation. All the feathers in a single layer are actuated collectively, and maneuverability is achieved through adaptation of the body, as opposed to control of individual feathers. A mechanical system for rapid detachment has been integrated into the actuated feather rings to allow for the detachment of individual feathers. Due to the complex interactions between the deformable feather and water, we utilize simulation to perform a wide sweep of the control and design landscape to identify a small range of feather structures and controllers that are likely to maximize the thrust generation. By developing a custom measurement setup, we validate a small subset of these results to find the optimal feather and controller. To understand how to design the initial configuration of the robot and the choice of feathers to detach, we have developed an algorithm that utilizes the state-space representation to evaluate and determine how to change and restore the degree of controllability. To demonstrate the contributions of this work, we experimentally validate the optimized robot structure and controllers on the robot hardware. The maneuverability of the robot is shown to alter with different configurations of the robot's feathers, following which the ability to detach the feathers on demand to alter the heading and path is shown. In the remainder of this article, we first present the methods to systematically address this problem. The novel robotic hardware is then shown, followed by the experimental results. We finish with a discussion and conclusion.

Problem Statement
Using the feather star as biological inspiration, we aim to develop a robot that utilizes feather-like structures to swim. By providing the robot with the ability to alter, or morph, its body, we want to show how changes in maneuverability can be achieved through altering the body structure. To achieve this aim, we subdivide the problem into three key goals that all seek to explore how these bioinspired components can be used to improve the capabilities of robots, as follows:
● We must explore the role of feather structures and periodic controllers in the generation of thrust through embodied interactions with the water and optimize the




feathers' thrust generation through the codesign of the morphology and the controller.
● We must develop a framework for selecting the robot structure before and after the detachment of limbs to optimize the performance of the robot. In particular, we present methods for the optimization of the thrust in a particular direction and for maximizing the degree of controllability of the robot.
● We must develop robotic hardware that mimics the behavior of the feather star by considering a mechanism that allows multiple feathers to be actuated simultaneously and detached at will.
The following three sections present the methods developed to address these three aspects of the problem.

Modeling and Optimization of Bioinspired Feathers
To achieve the best maneuverability, we wish to optimize the feather design parameters to maximize the thrust generated. The thrust is generated through complex interactions with the fluid and depends on the geometry of the feather and the periodic motion at its root. To begin the optimization process, we first define the feather design parameters. Then, we use a hydrodynamics simulation model to explore the large design space, and we identify a subset of parameters that produce the largest thrust. Finally, the subset identified in simulation is further explored through real-world experiments with a custom physical experimental setup.

Parametric Feather Design
The parameterized feather design and controller are shown in Figure 2(a) and (b). The geometry is defined as a rectangle, with its width w and length l as parameters. The control motion corresponds to the angular displacement at the root of the feather. We evaluate only periodic motions to mimic the movements of the feather star. Consequently, the signal is fully described by the rise time t_rise, fall time t_fall, and hold time t_hold. We keep the amplitude constant at A = 40° to limit the size of the design search space. The 40° value is the maximum amplitude of the mechanical setup. Hence, the parameters p_des = [w, l, t_rise, t_fall, t_hold] define the feather design. Polypropylene sheets of 0.4-mm thickness were chosen as the base material for the feathers, due to their flexibility and ease of fabrication using a carbon dioxide laser cutter.
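To make the control parameterization concrete, the following sketch generates the periodic root-angle command from the timing parameters in p_des. Only the rise, fall, and hold times and the ±40° amplitude are specified in the text, so the exact ramp-and-hold profile used here is an assumption based on Figure 2(b), and the function and variable names are illustrative.

import numpy as np

def feather_signal(t, t_rise, t_fall, t_hold, A=40.0):
    """Periodic root-angle command (degrees) for one feather.

    One period consists of a linear rise from -A to +A over t_rise, a hold at +A
    for t_hold, a linear fall back to -A over t_fall, and a hold at -A for t_hold,
    which is one plausible reading of the parameterized signal in Figure 2(b)."""
    period = t_rise + t_fall + 2.0 * t_hold
    tau = np.mod(t, period)
    if tau < t_rise:                                   # rising ramp
        return -A + 2.0 * A * tau / t_rise
    if tau < t_rise + t_hold:                          # hold at +A
        return A
    if tau < t_rise + t_hold + t_fall:                 # falling ramp
        return A - 2.0 * A * (tau - t_rise - t_hold) / t_fall
    return -A                                          # hold at -A

# Example design vector p_des = [w, l, t_rise, t_fall, t_hold] (illustrative values)
p_des = [0.03, 0.12, 0.3, 0.2, 0.25]   # m, m, s, s, s
ts = np.linspace(0.0, 3.0, 301)
omega = np.array([feather_signal(t, *p_des[2:]) for t in ts])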

Hydrodynamics Modeling
In this section, we present the model of the interaction of a single feather with the water. In particular, we first develop a discretized model for the feather, and then we define the forces exchanged between the structure and the fluid. We model a single feather as a collection of discrete flexible elements, using Simscape Multibody, where a single feather is approximated by 10 flexible beam elements [Figure 2(a)]. Each flexible element, i, consists of two masses (a and b) of identical shape and mass, which are joined by internal springs and dampers, allowing bending [Figure 2(c)]. The first mass element, a, of the first beam element (i = 1) is rigidly fixed to an angular actuator at the feather's base, while the second mass element, b, of the last beam unit (i = 10) is a free end. All flexible elements are connected by rigid rotational joints. The influence of the rest of the feather is imposed on each flexible element at the leading and trailing rigid joint locations as a combination of the force, F_r, and bending moment, M_r, acting on each joint, as defined in [Figure 2(c)]. The bending deformation of the feather is captured by imposing a structural bending stiffness, k_i, and damping, d_i, between the masses, a and b, which results in the moments M_{k,i} and M_{d,i}. The rotational stiffness, k_i, was calculated as E I_i / l_i, where E is the elastic modulus, I_i is the area moment of inertia, and l_i is the length of the section. The elastic modulus was tuned by comparing the simulation's visual output against a physical rectangular polypropylene feather in water. The damping was approximated to be zero for all simulated joints. Each of the masses is subject to a set of lumped external forces. Although, in the simulation, the forces are solved for each mass, for this derivation the formulations are expressed for the generalized flexible element, i. For each flexible element, i, the total lumped external force, F_{ext,i}, consists of the gravitational force, F_{g,i}; buoyancy force, F_{b,i}; hydrodynamic force, F_{hyd,i}; and added mass force, F_{a,i}:

$$F_{\mathrm{ext},i} = F_{g,i} + F_{b,i} + F_{\mathrm{hyd},i} + F_{a,i}. \qquad (1)$$



Since gravity and buoyancy always oppose each other, they can be combined as $\rho_f V_i \left(1 - \frac{\rho_w}{\rho_f}\right) g\, \hat{u}_g$, where $\rho_f$ and $\rho_w$ are the densities of the polypropylene feather and water, respectively; $V_i$ is the volume; $g$ is the gravitational acceleration; and $\hat{u}_g$ is the unit vector in the direction of gravity. The hydrodynamic force, F_{hyd}, is the total force due to the viscous interaction between the fluid and the structure. In some literature, the hydrodynamic force is decomposed into its lift and drag components, where the drag is in the direction of the relative velocity between the fluid and the body. In our simulation, we decompose the hydrodynamic force into forces in the normal and longitudinal directions of each feather element, F_norm and F_long, where these individual elements can be approximated as a rectangular prism.

Figure 2. The (a) simulation of an actuated parameterized feather in Simscape Multibody, (b) plot of the parameterized control signal, and (c) underlying multibody model of a single feather and its interaction with water, with a close-up free-body diagram of a flexible element.

The hydrodynamic opposing forces are approximated using the following equations [16]:


$$F_{\mathrm{norm},i} = \tfrac{1}{2}\, C_{\mathrm{norm},i}\, A_i\, \rho_w\, U_i^2\, \hat{u}_{\mathrm{norm},i}, \qquad F_{\mathrm{long},i} = \tfrac{1}{2}\, C_{\mathrm{long},i}\, A_i\, \rho_w\, U_i^2\, \hat{u}_{\mathrm{long},i}$$
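A compact way to evaluate these lumped forces for one flexible element is sketched below. The drag-type coefficients, the densities, and the use of signed per-direction velocity components to obtain opposing forces are assumptions made for illustration; they are not the tuned values of the simulation reported here.

import numpy as np

RHO_W = 1000.0   # water density (kg/m^3)
RHO_F = 905.0    # polypropylene density (kg/m^3), approximate
G = np.array([0.0, 0.0, -9.81])

def element_forces(volume, area, u_rel, n_hat, l_hat, c_norm=2.0, c_long=0.1):
    """Lumped external forces on one flexible element (illustrative coefficients).

    volume : element volume (m^3)
    area   : reference area of the element (m^2)
    u_rel  : relative velocity of the element with respect to the water (m/s)
    n_hat, l_hat : unit vectors normal and longitudinal to the element

    Returns the combined gravity/buoyancy force and the normal and longitudinal
    hydrodynamic forces, following (1) and the drag-type expressions above."""
    f_grav_buoy = RHO_F * volume * (1.0 - RHO_W / RHO_F) * G
    u_n = np.dot(u_rel, n_hat)      # normal component of the relative velocity
    u_l = np.dot(u_rel, l_hat)      # longitudinal component
    f_norm = -0.5 * c_norm * area * RHO_W * abs(u_n) * u_n * n_hat
    f_long = -0.5 * c_long * area * RHO_W * abs(u_l) * u_l * l_hat
    return f_grav_buoy, f_norm, f_long

# Example: a thin element moving mostly along its normal direction
forces = element_forces(volume=2e-7, area=5e-4,
                        u_rel=np.array([0.2, 0.0, 0.05]),
                        n_hat=np.array([1.0, 0.0, 0.0]),
                        l_hat=np.array([0.0, 0.0, 1.0]))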