Visual Speech Animation Lei Xie, Lijuan Wang, and Shan Yang
Contents
Introduction
State of the Art
A Typical VSA System
  Data Collection
  Face/Mouth Model
  Input and Feature Extraction
  Mapping Methods
A Deep BLSTM-RNN-Based Approach
  RNN
  LSTM-RNN
  The Talking Head System
  Performances
Selected Applications
  Motivation
  Karaoke Function
  Technology Outlook
Summary
References
L. Xie (*)
School of Computer Science, Northwestern Polytechnical University (NWPU), Xi'an, P. R. China

L. Wang
Microsoft Research, Redmond, WA, USA

S. Yang
School of Computer Science, Northwestern Polytechnical University, Xi'an, China

© Springer International Publishing AG 2016
B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_1-1
Abstract
Visual speech animation (VSA) has many potential applications in human-computer interaction, assisted language learning, entertainment, and other areas. It is also one of the most challenging tasks in human motion animation because of the complex mechanisms of speech production and facial motion. This chapter surveys the basic principles, state-of-the-art technologies, and featured applications in this area. Specifically, after introducing the basic concepts and the building blocks of a typical VSA system, we showcase a state-of-the-art approach based on deep bidirectional long short-term memory (DBLSTM) recurrent neural networks (RNN) for audio-to-visual mapping, which aims to create a video-realistic talking head. Finally, the Engkoo project from Microsoft is highlighted as a practical application of visual speech animation in language learning.

Keywords
Visual speech animation • Visual speech synthesis • Talking head • Talking face • Talking avatar • Facial animation • Audio visual speech • Audio-to-visual mapping • Deep learning • Deep neural network
Introduction

Speech production and perception are both bimodal in nature. Visual speech, i.e., speech-evoked facial motion, plays an indispensable role in speech communication. Plenty of evidence shows that voice and face reinforce and complement each other in human-human communication (McGurk and MacDonald 1976). Viewing the speaker's face (and mouth) provides valuable information for speech perception. Visible speech is particularly effective when auditory speech is degraded or contaminated due to acoustic noise, bandwidth limitation, or hearing impairment. In an early study, Breeuwer and Plomp (1985) showed that the recognition of band-pass-filtered short sentences improves significantly when subjects are allowed to watch the speaker. The same level of improvement can be observed in hearing-impaired listeners and cochlear implant patients (Massaro and Simpson 2014); in these experiments, lipreading provides essential speech perceptual information. The influence of visual speech is not limited to situations with degraded auditory input. In fact, Sumby and Pollack (1954) found that seeing the speaker's face is equivalent to an improvement of about 15 dB in the signal-to-noise ratio (SNR) of the acoustic signal. Because of the influence of visual speech in human-human speech communication, researchers have become interested in its impact on human-machine interaction. Ostermann and Weissenfeld (2004) have shown that trust and attention of humans toward machines increase by 30% when they communicate with a talking face instead of text only. That is to say, visual speech is able to attract the attention of a user, making the human-machine interface more engaging.
Hence, visual speech animation (VSA), also called visual speech synthesis, talking face, talking head, talking avatar, speech animation, or mouth animation, aims to animate the lips/mouth/articulators/face in synchrony with speech for different purposes. In a broad sense, VSA may include facial expressions (Jia et al. 2011; Cao et al. 2005) and visual prosody (Cosatto et al. 2003) such as head (Ben Youssef et al. 2013; Le et al. 2012; Busso et al. 2007; Ding et al. 2015; Jia et al. 2014) and eye (Le et al. 2012; Dziemianko et al. 2009; Raidt et al. 2007; Deng et al. 2005) motions, which naturally accompany human speech. Readers can refer to Chapter Eye Motion and Chapter Head Motion Generation for more details. Applications of VSA can be found across many domains, such as technical support and customer service, communication aids, speech therapy, virtual reality, gaming, film special effects, education, and training (Hura et al. 2010). Specific applications may include a virtual storyteller for children, a virtual guide or presenter for a personal or commercial Web site, a representative of the user in computer games, and a funny puppet for computer-mediated human communication. It is clearly promising that VSA will become an essential multimodal interface in many applications. Speech-evoked face animation is one of the most challenging tasks in human motion animation. The human face has an extremely complex geometric form (Pighin et al. 2006), and speech-originated facial movements are the result of a complicated interaction between a number of anatomical layers that include the bone, muscle, fat, and skin. As a result, humans are extremely sensitive to the slightest artifacts in an animated face, and even small, subtle changes can lead to an unrealistic appearance. To achieve realistic visual speech animation, tremendous efforts have been made by the speech, image, computer graphics, pattern recognition, and machine learning communities over the past several decades (Parke 1972). Those efforts have been summarized in the proceedings of the visual speech synthesis challenge (LIPS) (Theobald et al. 2008), surveys (Cosatto et al. 2003; Ostermann and Weissenfeld 2004), featured books (Pandzic and Forchheimer 2002; Deng and Neumann 2008), and several journal special issues (Xie et al. 2015; Fagel et al. 2010). This book chapter aims to introduce the basic principles, survey the state-of-the-art technologies, and discuss featured applications.
State of the Art

After decades of research, current state-of-the-art visual speech animation systems can achieve lifelike or video-realistic performance through 2D, 2.5D, or 3D face modeling and a statistical/parametric text/speech-to-visual mapping strategy. For instance, in Fan et al. (2016), an image-based 2D video-realistic talking head is introduced. The lower face region of a speaker is modeled by a compact model learned from a set of facial images, called an active appearance model (AAM). Given pairs of audio and visual parameter sequences, a deep neural network model is trained to learn
the sequence mapping from audio to visual space. To further improve the realism of the talking head, a trajectory tiling method is adopted: the predicted AAM trajectory is used as a guide to select a smooth sequence of real sample images from the recorded database. Based on similar techniques, Microsoft has released an online visual speech animation system that helps users learn English (Wang et al. 2012c).

Fig. 1 The building blocks of a typical VSA system
A Typical VSA System

As shown in Fig. 1, a typical visual speech animation system is usually composed of several modules: data collection, a face/mouth model, feature extraction, and learning a mapping model.
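To make the decomposition in Fig. 1 more concrete, the following is a minimal, hypothetical Python skeleton of how these modules could be wired together; every class, function, and attribute name here is an illustrative placeholder rather than part of any published system.

```python
# Hypothetical skeleton of the pipeline in Fig. 1; every name is illustrative.
from dataclasses import dataclass

@dataclass
class Corpus:
    audio_or_text: list    # one entry per utterance (audio samples or text)
    visual_frames: list    # synchronized face/mouth recordings

def extract_input_features(audio_or_text):
    """Audio features (e.g., MFCCs) or textual labels (e.g., phoneme context)."""
    raise NotImplementedError

def extract_visual_parameters(visual_frames, face_model):
    """Project raw frames onto a compact face/mouth model (e.g., AAM)."""
    raise NotImplementedError

def learn_mapping(input_features, visual_parameters):
    """Learn the audio/text-to-visual mapping (rules, HMM, DNN, ...)."""
    raise NotImplementedError

def animate(new_audio_or_text, mapping, face_model):
    """Drive the face model with visual parameters predicted for new input."""
    features = extract_input_features(new_audio_or_text)
    visual_params = mapping(features)            # predicted visual trajectory
    return face_model.render(visual_params)      # rendered animation frames
```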
Data Collection

According to the source of data used for face/mouth/articulator modeling, a VSA system can be built from images, video recordings, and various motion capture equipment such as optical motion capture (mocap), electromagnetic articulography (EMA), magnetic resonance imaging (MRI), and X-ray. What type of data is collected essentially depends on the cost, the desired appearance of the face/head, and the application needs. Many approaches choose the most straightforward way of data collection: videos of a speaker are recorded by a camera, and the image sequences are used as the source for 2D or 3D face/head modeling (Theobald et al. 2008; Bregler et al. 1997; Cosatto et al. 2003; Cosatto and Graf 1998; Xie and Liu 2007a; Wang et al. 2010a; Cosatto and Graf 2000; Anderson et al. 2013; Ezzat et al. 2002; Ezzat and Poggio 2000; Xie and Liu 2007b), as shown in Fig. 2a. A recent trend for producing quality facial animation is to use 3D motion-captured data (Deng and Neumann 2008), which have been successfully used in movie special effects to drive virtual characters. As shown in Fig. 2b, to record facial movements, an array of high-performance cameras is utilized
to reconstruct the 3D marker locations on a subject's face. Although a mocap system is quite expensive and difficult to set up, the reconstructed data provide accurate timing and motion information. Once the data are collected, facial animation can be created by controlling the underlying muscle structure or blend shapes (see Chapter Blendshape Facial Animation for details). Another data collection system, EMA, as shown in Fig. 2c, is often used to record the complex movements of the lips, jaw, tongue, and even intraoral articulators (Richmond et al. 2011). The sensors, called coils, are attached to different positions on a speaker's face or in the mouth, and the 3D movements of the sensors are collected at a high frame rate (e.g., 200 frames per second) while the speaker is talking. Visual speech animation generated from EMA data is usually used for articulation visualization (Huang et al. 2013; Fagel and Clemens 2004; Wik and Hjalmarsson 2009). In Wang et al. (2012a), an animated talking head is created based on EMA articulatory data for pronunciation training purposes.

Fig. 2 Various data collection methods in building a visual speech animation system. (a) Camera video (Theobald et al. 2008), (b) motion capture (Busso et al. 2007), and (c) EMA from http://www.gipsa-lab.grenoble-inp.fr/
Face/Mouth Model

The appearance of a visual speech animation system is determined by the underlying face/mouth model, and generating animated talking heads that look like real people is challenging. Existing approaches to talking heads use either image-based 2D models (Seidlhofer 2009; Zhang et al. 2009) or geometry-based 3D ones (Musti et al. 2014). Cartoon avatars are relatively easy to build. The more humanlike, realistic avatars, which can be seen in some games or movies, are much harder to build. Traditionally, expensive motion capture systems are required to track a real person's motion or, in an even more expensive way, artists manually hand-touch every frame. Some desirable features of the next-generation avatar are as follows: it should be a 3D avatar that can be easily integrated into a versatile 3D virtual world; it should be photo-realistic; it should be customizable to any user; and, last but not least, it should be created automatically with a small amount of recorded data. That is to say, the next-generation avatar should be 3D, photo-realistic, personalized or customized, and easy to create with little bootstrapping data.
In the facial animation world, a great variety of animation techniques based on 3D models exist (Seidlhofer 2009). In general, these techniques first generate a 3D face model consisting of a 3D mesh, which defines the geometric shape of a face. For this, many different hardware systems are available, ranging from 3D laser scanners to multi-camera systems. In a second step, either a human-like or cartoon-like texture may be mapped onto the 3D mesh. Besides generating a 3D model, animation parameters have to be determined for the later animation. A traditional 3D avatar requires a highly accurate geometric model to render soft tissues like the lips, tongue, wrinkles, etc. It is both computationally intensive and mathematically challenging to build or run such a model. Moreover, any unnatural deformation will make the resultant output fall into the uncanny valley of human rejection; that is, it will be rejected as unnatural. Image-based facial animation techniques achieve great realism in synthesized videos by combining different facial parts of recorded 2D images (Massaro 1998; Zhang et al. 2009; Eskenazi 2009; Scott et al. 2011; Badin et al. 2010). In general, image-based facial animation consists of two main steps: audiovisual analysis of a recorded human subject and synthesis of the facial animation. In the analysis step, a database with images of deformable facial parts of the human subject is collected, while the time-aligned audio file is segmented into phonemes. In the synthesis step, a face is animated by first generating the audio from the text using a text-to-speech (TTS) synthesizer. The TTS synthesizer sends phonemes and their timing to the face animation engine, which overlays facial parts corresponding to the generated speech over a background video sequence. Massaro (1998), Zhang et al. (2009), Eskenazi (2009), and Scott et al. (2011) show image-based speech animations that cannot be distinguished from recorded video. However, it is challenging to change the head pose freely or to render different facial expressions, and it is hard to blend such animations seamlessly into 3D scenes. Image-based approaches have the advantage that a photo-realistic appearance is guaranteed. However, a talking head needs to be not just photo-realistic in its static appearance but also to exhibit convincing plastic deformations of the lips synchronized with the corresponding speech, realistic head movements, and natural facial expressions. An ideal 3D talking head can mimic the realistic motion of a real human face in 3D space. One challenge in rendering realistic 3D facial animation is the mouth area. The lips, teeth, and tongue are nonrigid tissues, sometimes with occlusions. This means that accurate geometric modeling is difficult, and it is also hard to deform them properly. Moreover, they need to move in sync with the spoken audio; otherwise, people observe the asynchrony and perceive it as unnatural. In the real world, when people talk, driven by the vocal organs and facial muscles, both the 3D geometry and the texture appearance of the face are constantly changing. Ideally, both geometry change and texture change should be captured simultaneously. There is a lot of ongoing research on this problem. For example, with the help of motion-sensing devices such as the Microsoft Kinect, captured 3D depth information can be used to better acquire the 3D geometry model. On the other hand, researchers try to recover the 3D face shape from single or multiple camera views (Wang et al. 2011; Sako et al. 2000; Yan et al. 2010). In the 2.5D talking head approach described here, as there is no captured 3D geometry information available, the work in Sako et al. (2000), which reconstructs a 3D face model from a single frontal face image, is adopted. The only required input to the 2D-to-3D system is a
frontal face image of the subject with normal illumination and a neutral expression. A semi-supervised ranking prior likelihood model for accurate local search and a robust parameter estimation approach are used for face alignment. Based on this 2D alignment algorithm, 87 key feature points are automatically located, as shown in Fig. 3. The feature points are accurate enough for face reconstruction in most cases. A general 3D face model is applied for personalized 3D face reconstruction. The 3D shapes are compressed by PCA. After the 2D face alignment, the key feature points are used to compute the coefficients of the 3D shape eigenvectors. Then, the coefficients are used to reconstruct the 3D face shape. Finally, the face texture is extracted from the input image. By mapping the texture onto the 3D face geometry, the 3D face model for the input 2D face image is reconstructed. A 3D face model is reconstructed for each 2D image sample in the recordings, with examples shown in Fig. 3. Thus a 3D sample library is formed, where each 3D sample has a 3D geometry mesh, a texture, and the corresponding UV mapping, which defines how the texture is projected onto the 3D model. After the 2D-to-3D transformation, the original 2D sample recordings turn into 3D sample sequences, which consist of three synchronous streams: geometry mesh sequences depicting the dynamic shape, texture image sequences for the changing appearance, and the corresponding speech audio. This method combines the best of both 2D image sample-based and 3D model-based facial animation technologies. It renders realistic articulator animation by wrapping 2D video images around a simple and smooth 3D face model. The 2D video sequence captures the natural movement of soft tissues, and it helps the new talking head bypass the difficulties of rendering occluded articulators (e.g., tongue and teeth). Moreover, with the versatile 3D geometry model, different head poses and facial expressions can be freely controlled. The 2.5D talking head can be customized to any user by using a 2D video of that user. In summary, techniques based on 3D models impress with their high degree of automation and flexibility while lacking realism, whereas image-based facial animation achieves photo-realism with little flexibility and a lower degree of automation. The image-based techniques seem to be the best candidates for leading facial animation to new applications, since they achieve photo-realism, and an image-based technique combined with a 3D model generates photo-realistic facial animation while providing some flexibility to the user.
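The PCA-based shape reconstruction step above can be illustrated with a short sketch. This is not the authors' exact 2D-to-3D algorithm: it assumes a pre-built 3D shape basis (mean shape plus eigenvectors), a simple orthographic projection, and fits the shape coefficients to the located key feature points by least squares; all names and array shapes are assumptions.

```python
# Minimal sketch of PCA-based 3D shape recovery from located key feature
# points, in the spirit of the 2D-to-3D step described above. The mean shape,
# eigenvectors, and landmark-to-vertex mapping are assumed to come from an
# offline 3D face model; all dimensions are illustrative.
import numpy as np

def fit_shape_coefficients(landmarks_2d, mean_shape, eigvecs, landmark_idx):
    """
    landmarks_2d : (87, 2) detected key feature points
    mean_shape   : (V, 3) mean 3D face shape with V vertices
    eigvecs      : (K, V, 3) PCA shape basis
    landmark_idx : (87,) mesh vertex indices corresponding to the landmarks
    Returns the K PCA coefficients minimizing the 2D fitting error under an
    assumed orthographic projection (simply drop the z coordinate).
    """
    mean_lm = mean_shape[landmark_idx, :2].reshape(-1)                  # (174,)
    basis_lm = eigvecs[:, landmark_idx, :2].reshape(len(eigvecs), -1)   # (K, 174)
    target = landmarks_2d.reshape(-1)                                   # (174,)
    coeffs, *_ = np.linalg.lstsq(basis_lm.T, target - mean_lm, rcond=None)
    return coeffs

def reconstruct_shape(coeffs, mean_shape, eigvecs):
    """Rebuild the full 3D face shape from the fitted PCA coefficients."""
    return mean_shape + np.tensordot(coeffs, eigvecs, axes=1)           # (V, 3)
```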
Input and Feature Extraction

According to the input signal, a visual speech animation system can be driven by text, speech, or performance. The simplest VSA visualizes speech pronunciations with an avatar driven by tracked markers of a human performance. Currently, performance-based facial animation can be quite realistic (Thies et al. 2016; Wang and Soong 2012; Weise et al. 2011), and the aim of such a system is usually not only speech visualization. For example, in Thies et al. (2016), an interesting application for real-time facial reenactment is introduced. Readers can go through Chapter Video-based Performance Driven Facial Animation for more details. During the facial data collection process, speech and text are always collected as well. Hence, visual speech can also be driven by new voice or text input through a learned text/audio-to-visual mapping model, which will be introduced in the next section.
Fig. 3 Auto-reconstructed 3D face model in different mouth shapes and in different view angles (w/o and w/ texture)
To learn such a mapping, a feature extraction module is first used to obtain representative text or audio features. The textual features are often similar to those used in a TTS system (Taylor 2009) and may include information about phonemes, syllables, stresses, prosodic boundaries, and part-of-speech (POS) labels. Audio features can be typical spectral features (e.g., MFCC (Fan et al. 2016)), pitch, and other acoustic features.
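As an illustration of the audio side of feature extraction, the sketch below computes frame-level MFCCs and their deltas with librosa; the chapter does not prescribe a particular toolkit, so the library choice and parameter values are assumptions.

```python
# Illustrative audio feature extraction for the mapping model; librosa is used
# here purely as an example, with common but assumed parameter values.
import librosa
import numpy as np

def audio_features(wav_path, sr=16000, hop_s=0.01):
    """Frame-level MFCCs plus their deltas, one row per analysis frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(sr * hop_s)                        # 10 ms hop, a common choice
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    delta = librosa.feature.delta(mfcc)          # dynamic ("delta") features
    return np.vstack([mfcc, delta]).T            # (frames, 26)
```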
Mapping Methods

Both text- and speech-driven visual speech animation systems require an input-to-visual feature conversion or mapping algorithm. That is to say, the lip/mouth/facial movements must be naturally synchronized with the audio speech (this task is sometimes called lip synchronization, or lip sync for short). The conversion is not trivial because of the coarticulation phenomenon of the human speech production mechanism, which causes a given phoneme to be pronounced differently depending on the surrounding phonemes. Due to this phenomenon, learning an audio/text-to-visual mapping becomes an essential task in visual speech animation. Researchers have devoted much effort to this task, and the developed approaches can be roughly categorized into rule-based, concatenation, parametric, and hybrid.
Rule Based

Due to the limitations of data collection and learning methods, early approaches were mainly based on hand-crafted mapping rules. In these approaches, the viseme, the visual counterpart of the audio phoneme, is defined as the basic visual unit. Typically, visemes are manually designed as key images of mouth shapes, as shown in Fig. 4, and empirical smoothing functions or coarticulation rules are used to synthesize novel speech animations. Ezzat and Poggio propose a simple approach that morphs between key viseme images (Ezzat and Poggio 2000). Due to the coarticulation phenomenon, morphing between a set of mouth images is apparently not natural. Cohen and Massaro (1993) propose a coarticulation model in which a viseme shape is specified via dominance functions defined in terms of each facial measurement, such as the lips, tongue tip, etc., and the weighted sum of dominance values determines the final mouth shapes. In a more recent approach, Taylor et al. (2012) argue that static mouth shapes are not enough, so they redefine visemes as clustered temporal units that describe distinctive speech movements of the visual speech articulators, called dynamic visemes.

Concatenation/Unit Selection

To achieve photo- or video-realistic animation effects, concatenation of real video clips from a recorded database has been considered (Bregler et al. 1997; Cosatto et al. 2003; Cosatto and Graf 1998, 2000). The idea is quite similar to that of concatenative TTS (Hunt and Black 1996). In the off-line stage, a database of recorded videos is cut into short clips, e.g., triphone units. In the online stage, given a novel text or speech target, a unit selection process is used to select appropriate units and assemble them in an optimal way to produce the desired target, as shown in Fig. 5. To achieve speech synchronization and a smooth video, the concatenation algorithm has to be elaborately designed. In Cosatto et al. (2003), a phonetically labeled target is first produced by a TTS system or by a labeler or aligner from the recorded audio. From the phonetic target, a graph is created with states corresponding to the frames of the final animation.
Fig. 4 Several defined visemes from Ezzat and Poggio (2000)
Each state of the final animation (a video frame) is populated with a list of candidate nodes (recorded video samples from the database). Each state is fully connected to the next, concatenation costs are assigned to each arc, and target costs are assigned to each node. A Viterbi search on the graph finds the optimal path, i.e., the path that generates the lowest total cost. The balance between the two costs is critical to the final performance, and its weighting is empirically tuned in real applications. The video clips for unit selection are usually limited to the lower part of the face, which exhibits most of the speech-evoked facial motion. After selection, the concatenated lower face clips are stitched onto a background whole-face video, resulting in the synthesized whole-face video, as shown in Fig. 6. To achieve seamless stitches, much effort has been devoted to image processing. With a relatively large video database, the concatenation approach is able to achieve video-realistic performance, but it is difficult to add different expressions, and the flexibility of the generated visual speech animation is also limited.
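The unit selection search described above can be sketched as a standard Viterbi dynamic program over candidate samples. The cost functions, candidate lists, and the weight balancing target and concatenation costs are placeholders; this is a generic sketch rather than the exact algorithm of Cosatto et al. (2003).

```python
# Minimal Viterbi-style unit selection: each animation frame has a list of
# candidate samples; target_cost and concat_cost are placeholder callables,
# and the weight w balancing them is tuned empirically.
import numpy as np

def select_units(candidates, target_cost, concat_cost, w=1.0):
    """
    candidates  : list over frames; candidates[t] is a list of sample IDs
    target_cost : f(t, sample) -> float, how well a sample fits frame t
    concat_cost : f(prev_sample, sample) -> float, smoothness of the join
    Returns the lowest-total-cost sequence of samples (one per frame).
    """
    T = len(candidates)
    cum = [np.array([target_cost(0, c) for c in candidates[0]])]  # best cost so far
    back = []                                                     # backpointers
    for t in range(1, T):
        costs = np.empty(len(candidates[t]))
        ptrs = np.empty(len(candidates[t]), dtype=int)
        for j, c in enumerate(candidates[t]):
            trans = cum[t - 1] + w * np.array(
                [concat_cost(p, c) for p in candidates[t - 1]])
            ptrs[j] = int(np.argmin(trans))
            costs[j] = trans[ptrs[j]] + target_cost(t, c)
        cum.append(costs)
        back.append(ptrs)
    # Backtrack the optimal path from the last frame.
    j = int(np.argmin(cum[-1]))
    path = [candidates[-1][j]]
    for t in range(T - 2, -1, -1):
        j = back[t][j]
        path.append(candidates[t][j])
    return path[::-1]
```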
Fig. 5 Unit selection approach for visual speech animation (Fan et al. 2016)

Fig. 6 Illustration of the image stitching process in a video-realistic talking head (Fan et al. 2016)
Parametric/Statistical

Recently, parametric methods have gained much attention because their mappings are learned automatically from data. Numerous attempts have been made to model the relationship between audio and visual signals, and many are based on generative probabilistic models, where the underlying probability distributions of audiovisual data are estimated. Typical models include the Gaussian mixture model (GMM), the hidden Markov model (HMM) (Xie and Liu 2007a; Fu et al. 2005), the dynamic Bayesian network (DBN) (Xie and Liu 2007b), and the switching linear dynamical system (SLDS) (Englebienne et al. 2007).
Hidden Markov model-based statistical parametric speech synthesis (SPSS) has made significant progress (Tokuda et al. 2007). Hence, the HMM approach has also been intensively investigated for visual speech synthesis (Sako et al. 2000; Masuko et al. 1998). In HMM-based visual speech synthesis, auditory speech and visual speech are jointly modeled by HMMs, and the visual parameters are generated from the HMMs by using the dynamic ("delta") constraints of the features (Breeuwer and Plomp 1985). Convincing mouth video can be rendered from the predicted visual parameter trajectories. This approach is called the trajectory HMM. Usually, maximum likelihood (ML) is used as the criterion for HMM training. However, ML training does not directly optimize the visual generation error. To compensate for this deficiency, a minimum generated trajectory error (MGE) method is proposed in Wang et al. (2011) to further refine the audiovisual joint model by minimizing the error between the generated and the real target trajectories in the training set. Although HMMs can model sequential data efficiently, there are still some limitations, such as the model assumptions made out of necessity (e.g., GMMs with diagonal covariance) and the greedy, hence suboptimal, decision-tree-based contextual state clustering. Motivated by the superior performance of deep neural networks (DNN) in automatic speech recognition (Hinton et al. 2012) and speech synthesis (Zen et al. 2013), a neural network-based photo-realistic talking head is proposed in Fan et al. (2015). Specifically, a deep bidirectional long short-term memory recurrent neural network (BLSTM-RNN) is adopted to learn a direct regression model by minimizing the sum of square error (SSE) in predicting the visual sequence from the label sequence. Experiments have confirmed that the BLSTM approach significantly outperforms the HMM approach (Fan et al. 2015). The BLSTM approach is introduced in detail later in this chapter.
Hybrid

Although parametric approaches have many merits, such as a small footprint, flexibility, and controllability, one obvious drawback is blurred animation due to feature dimension reduction and imperfect learning methods. Hybrid visual speech animation approaches therefore use the predicted trajectory to guide the sample selection process (Wang et al. 2010b), which combines the advantages of both the video-based concatenation and the parametric statistical modeling approaches. In a recent approach (Fan et al. 2016), the visual parameter trajectory predicted by a BLSTM-RNN is used as a guide to select a smooth real sample image sequence from the recorded database.
A Deep BLSTM-RNN-Based Approach

In the past several years, deep neural networks (DNN) and deep learning methods (Deng and Yu 2014) have been successfully used in many tasks, such as speech recognition (Hinton et al. 2012), natural language processing, and computer vision. For example, the DNN-HMM approach has boosted speech recognition accuracy
significantly (Deng and Yu 2014). Deep neural networks have also been investigated for regression/mapping tasks, e.g., text-to-speech (Zen et al. 2013), learning clean speech from noisy speech for speech enhancement (Du et al. 2014), and articulatory movement prediction from text and speech (Zhu et al. 2015). There are several advantages of DNN approaches: they can model long-span, high-dimensional input features and their correlations; they are able to learn nonlinear mappings between input and output with deep-layered, hierarchical, feed-forward, and recurrent structures; and, with appropriate cost functions (e.g., generation error), they have discriminative and predictive capability in the generation sense. Recently, recurrent neural networks (RNNs) (Williams and Zipser 1989) and their bidirectional variant, bidirectional RNNs (BRNNs) (Schuster and Paliwal 1997), have become popular because they are able to incorporate contextual information that is essential for sequential data modeling. Conventional RNNs cannot model long-span relations in sequential data well because of the vanishing gradient problem (Hochreiter 1998). Hochreiter and Schmidhuber (1997) found that the LSTM architecture, which uses purpose-built memory cells to store information, is better at exploiting long-range context. Combining BRNNs with LSTM gives BLSTM, which can access long-range context in both directions. Speech, in both its auditory and visual forms, is typical sequential data. In a recent study, BLSTM has shown state-of-the-art performance in audio-to-visual sequential mapping (Fan et al. 2015).
RNN

Allowing cyclical connections in a feed-forward neural network forms a recurrent neural network (RNN) (Williams and Zipser 1989). RNNs are able to incorporate contextual information from previous input vectors, which allows past inputs to persist in the network's internal state. This property makes them an attractive model for sequence-to-sequence learning. For a given input vector sequence x = (x_1, x_2, ..., x_T), the forward pass of an RNN is as follows:

h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h),    (1)

y_t = W_{hy} h_t + b_y,    (2)

where t = 1, ..., T and T is the length of the sequence; h = (h_1, ..., h_T) is the hidden state vector sequence computed from x; y = (y_1, ..., y_T) is the output vector sequence; W_{xh}, W_{hh}, and W_{hy} are the input-hidden, hidden-hidden, and hidden-output weight matrices, respectively; b_h and b_y are the hidden and output bias vectors, respectively; and \mathcal{H} denotes the nonlinear activation function of the hidden layer.
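A minimal NumPy sketch of the forward pass in Eqs. (1) and (2) is given below; the dimensions and the choice of tanh for the hidden nonlinearity \mathcal{H} are illustrative assumptions.

```python
# Forward pass of a simple RNN following Eqs. (1)-(2); tanh stands in for the
# hidden nonlinearity H, and all shapes are illustrative.
import numpy as np

def rnn_forward(x, W_xh, W_hh, W_hy, b_h, b_y):
    """x: (T, D_in) input sequence -> hidden states (T, H) and outputs (T, D_out)."""
    h_prev = np.zeros(W_hh.shape[0])
    h, y = [], []
    for t in range(len(x)):
        h_t = np.tanh(W_xh @ x[t] + W_hh @ h_prev + b_h)   # Eq. (1)
        y.append(W_hy @ h_t + b_y)                          # Eq. (2)
        h.append(h_t)
        h_prev = h_t
    return np.stack(h), np.stack(y)
```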
Fig. 7 Bidirectional recurrent neural networks (BRNNs)
For the visual speech animation task, because of the speech coarticulation phenomenon, the model needs access to both past and future contexts. Bidirectional recurrent neural networks (BRNNs), as shown in Fig. 7, fit this task well. A BRNN computes both a forward state sequence \overrightarrow{h} and a backward state sequence \overleftarrow{h}, as formulated below:

\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}),    (3)

\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}),    (4)

y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y.    (5)
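The bidirectional computation of Eqs. (3)-(5) can be sketched in the same style: one recurrence runs forward in time, one runs backward over the reversed input, and the output combines both state sequences. Parameter names and the tanh nonlinearity are again illustrative.

```python
# Bidirectional forward pass following Eqs. (3)-(5); fwd and bwd hold the
# direction-specific parameters, and all shapes are illustrative.
import numpy as np

def brnn_forward(x, fwd, bwd, W_fy, W_by, b_y):
    """fwd/bwd are dicts with keys 'W_xh', 'W_hh', 'b_h' for each direction."""
    def recur(seq, p):
        h_prev = np.zeros(p["W_hh"].shape[0])
        states = []
        for x_t in seq:
            h_prev = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ h_prev + p["b_h"])
            states.append(h_prev)
        return states

    h_f = recur(x, fwd)                      # Eq. (3): t = 1 ... T
    h_b = recur(x[::-1], bwd)[::-1]          # Eq. (4): t = T ... 1
    y = [W_fy @ hf + W_by @ hb + b_y         # Eq. (5)
         for hf, hb in zip(h_f, h_b)]
    return np.stack(y)
```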
LSTM-RNN

Conventional RNNs can access only a limited range of context because of the vanishing gradient problem. Long short-term memory (LSTM) uses purpose-built memory cells, as shown in Fig. 8, to store information and is designed to overcome this limitation. In sequence-to-sequence mapping tasks, LSTM has been shown to be capable of bridging very long time lags between input and output sequences by enforcing constant error flow.

Fig. 8 The long short-term memory (LSTM) memory block, with input, forget, and output gates around the memory cell

For LSTM, the recurrent hidden layer function \mathcal{H} is implemented as follows:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i),    (6)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f),    (7)

a_t = \tau(W_{xc} x_t + W_{hc} h_{t-1} + b_c),    (8)

c_t = f_t \odot c_{t-1} + i_t \odot a_t,    (9)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o),    (10)

h_t = o_t \odot \theta(c_t),    (11)

where \sigma is the sigmoid function; i, f, o, a, and c are the input gate, forget gate, output gate, cell input activation, and cell memory, respectively; \tau and \theta are the cell input and output nonlinear activation functions, for which tanh is generally chosen; and \odot denotes element-wise multiplication. The multiplicative gates allow LSTM memory cells to store and access information over long periods of time, thereby avoiding the vanishing gradient problem. Combining BRNNs with LSTM gives rise to BLSTM, which can access long-range context in both directions. Motivated by the success of deep neural network architectures, deep BLSTM-RNNs (DBLSTM-RNNs) are used to establish the audio-to-visual mapping for visual speech animation. A deep BLSTM-RNN is created by stacking multiple BLSTM hidden layers.
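For concreteness, a single LSTM time step implementing Eqs. (6)-(11) can be sketched as follows; the peephole weights are taken as diagonal (entering as element-wise products), and both \tau and \theta are tanh, as suggested above. Names and shapes are illustrative.

```python
# One LSTM time step per Eqs. (6)-(11).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W and b are dicts keyed by gate name; the peephole weights W['ci'],
    W['cf'], W['co'] are assumed diagonal, so they multiply the cell state
    element-wise."""
    i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])  # Eq. (6)
    f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])  # Eq. (7)
    a = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])                     # Eq. (8)
    c = f * c_prev + i * a                                                     # Eq. (9)
    o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c + b["o"])       # Eq. (10)
    h = o * np.tanh(c)                                                         # Eq. (11)
    return h, c
```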
The Talking Head System

Figure 9 shows the diagram of an image-based talking head system that uses a DBLSTM as the mapping function (Fan et al. 2015). The diagram follows the basic structure of a typical visual speech animation system in Fig. 1. The aim of the system is to achieve speech animation with video-realistic effects. First, an audio/visual database of a subject talking to a camera with a frontal view of his/her face is recorded as the training data. In the training stage, the audio is converted into a sequence of
contextual phoneme labels L using forced alignment, and the corresponding lower face image sequence is transformed into active appearance model (AAM) feature vectors V. Then a deep BLSTM neural network is used to learn a regression model between the two parallel audio and visual sequences by minimizing the SSE of the prediction, in which the input layer takes the label sequence L and the output prediction layer produces the visual feature sequence V. In the synthesis stage, for any input text with natural speech or speech synthesized by TTS, the label sequence L is extracted, and the visual AAM parameters V̂ are predicted using the well-trained deep BLSTM network. Finally, the predicted AAM visual parameter sequence V̂ is reconstructed into high-quality photo-realistic face images, and the full-face talking head is rendered with lip-synced animation.
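A possible realization of such a DBLSTM regression model is sketched below with the Keras functional API; the chapter does not tie the system to any particular toolkit, so the framework, optimizer, layer sizes, and example dimensions are assumptions, with the mean-squared-error objective standing in for the SSE criterion.

```python
# Sketch of a DBLSTM sequence regression model: one-hot label sequences L in,
# AAM visual parameter sequences V out. All sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def build_dblstm(label_dim, visual_dim, hidden=128, n_blstm_layers=2):
    inputs = tf.keras.Input(shape=(None, label_dim))   # variable-length sequences
    x = layers.TimeDistributed(layers.Dense(hidden, activation="tanh"))(inputs)
    for _ in range(n_blstm_layers):                    # stacked BLSTM hidden layers
        x = layers.Bidirectional(layers.LSTM(hidden, return_sequences=True))(x)
    outputs = layers.TimeDistributed(layers.Dense(visual_dim))(x)  # linear output
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")        # squared-error criterion
    return model

# Hypothetical usage: 3 * 41 phoneme slots + 3 state bits in, 60-D AAM vectors out.
# model = build_dblstm(label_dim=3 * 41 + 3, visual_dim=60)
# model.fit(L_batch, V_batch, epochs=20)  # L_batch: (B, T, label_dim), V_batch: (B, T, visual_dim)
```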
Label Extraction

The input sequence L and the output feature sequence V are two time-varying parallel sequences. The input to the desired talking head system can be arbitrary text along with natural audio recordings or speech synthesized by TTS. For natural recordings, the phoneme/state time alignment can be obtained by forced alignment using a trained speech recognition model. For TTS-synthesized speech, the phoneme/state sequence and time offsets are a by-product of the synthesis process. Therefore, for each speech utterance, the phoneme/state sequence and the corresponding time offsets are converted into a label sequence, denoted as L = (i_1, ..., i_t, ..., i_T), where T is the number of frames in the sequence. The frame-level label i_t uses a one-hot representation, i.e., one vector for each frame:

\left[\ \underbrace{0, \ldots, 0, \ldots, 1}_{K};\ \underbrace{1, \ldots, 0, \ldots, 0}_{K};\ \underbrace{0, 0, 1, \ldots, 0}_{K};\ \underbrace{0, 1, 0}_{3}\ \right],
where K denotes the number of phonemes. In Fan et al. (2015), the triphone identity and a three-state position indicator are used to identify i_t. The first three K-element sub-vectors denote the identities of the left, current, and right phonemes of the triphone, respectively, and the last three elements represent the phoneme state. Note that the contextual label can easily be extended to contain richer information, such as position in syllable, position in word, stress, part of speech, etc. But if the training data is limited, only phoneme- and state-level labels may be considered.
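A small sketch of how such a frame-level label vector might be assembled is shown below; the phoneme inventory and function name are illustrative.

```python
# Building the frame-level one-hot label described above: three K-element
# sub-vectors for the left/current/right phonemes of the triphone, plus a
# 3-element state indicator. The phoneme inventory here is a toy example.
import numpy as np

def frame_label(left, current, right, state, phone_set):
    """left/current/right: phoneme symbols; state: 0, 1, or 2."""
    K = len(phone_set)
    idx = {p: i for i, p in enumerate(phone_set)}
    vec = np.zeros(3 * K + 3)
    for block, phone in enumerate((left, current, right)):
        vec[block * K + idx[phone]] = 1.0
    vec[3 * K + state] = 1.0
    return vec

# Example with a toy inventory:
# frame_label("w", "ih", "n", state=1, phone_set=["w", "ih", "n", "d", "ow"])
```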
Face Model and Visual Feature Extraction

In the system of Fan et al. (2015), the visual stream is a sequence of lower face images which are strongly correlated with the underlying speech. As the raw face image is hard to model directly due to its high dimensionality, an active appearance model (AAM) (Cootes et al. 2001) is used as the face model for visual feature extraction. An AAM is a joint statistical model compactly representing both the shape and the texture variations and the correlation between them.
Fig. 9 Diagram of an image-based talking head system using DBLSTM-RNN as the mapping (Fan et al. 2015)
Since the speaker moves his/her head naturally during recording, head pose normalization is performed on all the face images before AAM modeling. With the help of an effective 3D model-based head pose tracking algorithm, the head pose in each image frame is normalized to a fully frontal view and further aligned. The facial feature points and the lower face texture used in Fan et al. (2015) are shown in Fig. 10. The shape of the jth lower face, s_j, can be represented by the concatenation of the x and y coordinates of N facial feature points:

s_j = (x_{j1}, x_{j2}, \ldots, x_{jN}, y_{j1}, y_{j2}, \ldots, y_{jN}),    (12)

where j = 1, 2, ..., J and J is the total number of face images. In this work, a set of 51 facial feature points is used, as shown in Fig. 10a. The mean shape is simply defined by

s_0 = \sum_{j=1}^{J} s_j / J.    (13)

Applying principal component analysis (PCA) to all J shapes, s_j can be given approximately by

s_j = s_0 + \sum_{i=1}^{N_{shape}} a_{ji} \tilde{s}_i = s_0 + a_j P_s,    (14)

Fig. 10 Facial feature points and the texture of a lower face used in Fan et al. (2015). (a) 51 facial feature points. (b) The texture of a lower face

where P_s = (\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{N_{shape}})^T denotes the eigenvectors corresponding to the N_{shape} largest eigenvalues and a_j = (a_{j1}, a_{j2}, \ldots, a_{j N_{shape}}) is the jth shape parameter vector. Accordingly, the texture of the jth face image, t_j, is defined by a vector concatenating the R/G/B values of every pixel that lies inside the mean shape:

t_j = (r_{j1}, \ldots, r_{jU}, g_{j1}, \ldots, g_{jU}, b_{j1}, \ldots, b_{jU}),    (15)

where j = 1, 2, ..., J and U is the total number of pixels. As the dimensionality of the texture vector is too high to apply PCA directly, EM-PCA (Roweis 1998) is applied instead to all J textures. As a result, the jth texture t_j can be given approximately by

t_j = t_0 + \sum_{i=1}^{N_{texture}} b_{ji} \tilde{t}_i = t_0 + b_j P_t,    (16)

where t_0 is the mean texture, P_t contains the eigenvectors corresponding to the N_{texture} largest eigenvalues, and b_j is the jth texture parameter vector. The above shape and texture models can only control the shape and texture separately. In order to recover the correlation between the shape and the texture, a_j and b_j are combined in another round of PCA:

(a_j, b_j) = \sum_{i=1}^{N_{appearance}} v_{ji} \tilde{v}_i = v_j P_v,    (17)

and assuming that P_{vs} and P_{vt} are formed by extracting the first N_{shape} and the last N_{texture} values from each component of P_v, simply combining the above equations gives

s_j = s_0 + v_j P_{vs} P_s = s_0 + v_j Q_s,    (18)

t_j = t_0 + v_j P_{vt} P_t = t_0 + v_j Q_t.    (19)
Now, the shape and texture of the jth lower face image can be constructed by a single vector vj. vj is the jth appearance parameter vector which is used as the AAM visual feature. Subsequently, the lower face sequence with T frames can be represented by the visual feature sequence V = (v1,...,vt,...,vT).
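The AAM parameter computation of Eqs. (12)-(19) can be sketched with plain PCA as below; note that the chapter uses EM-PCA for the high-dimensional textures, whereas ordinary SVD-based PCA is substituted here only to keep the example short, and all dimensions are illustrative.

```python
# Sketch of AAM feature computation: PCA on shapes, PCA on textures, then a
# combined PCA over the concatenated shape/texture parameters (Eqs. (12)-(19)).
import numpy as np

def pca(X, n_components):
    """Rows of X are samples. Returns (mean, components), with components
    stacked row-wise so that X ~ mean + coeffs @ components."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def aam_features(shapes, textures, n_shape, n_texture, n_appearance):
    """shapes: (J, 2N) landmark vectors; textures: (J, 3U) RGB vectors."""
    s0, Ps = pca(shapes, n_shape)                    # Eqs. (13)-(14)
    t0, Pt = pca(textures, n_texture)                # Eq. (16), EM-PCA in the chapter
    a = (shapes - s0) @ Ps.T                         # shape parameters a_j
    b = (textures - t0) @ Pt.T                       # texture parameters b_j
    _, Pv = pca(np.hstack([a, b]), n_appearance)     # Eq. (17), combined PCA
    v = np.hstack([a, b]) @ Pv.T                     # appearance parameters v_j
    return v, (s0, Ps, t0, Pt, Pv)
```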
DBLSTM-RNN Model Training

In the training stage, multiple sequence pairs of L and V are available. As both sequences are represented as continuous numerical vectors, the network is treated as a regression model minimizing the SSE of predicting V from L. In the synthesis stage, any arbitrary text along with natural or synthesized speech is first converted into a sequence of input features and then fed into the trained network. The output of the network is the predicted visual AAM feature sequence. After reconstructing the AAM feature vectors into RGB images, a photo-realistic image sequence of the lower face is generated. Finally, the lower face is stitched onto a background face and the facial animation of the talking head is rendered. Learning a deep BLSTM network can be regarded as optimizing a differentiable error function

E(w) = \sum_{k=1}^{M_{train}} E^k(w),    (20)

where M_{train} represents the number of sequences in the training data and w denotes the network internode weights. In this task, the training criterion is to minimize the SSE between the predicted visual features V̂ = (v̂_1, v̂_2, ..., v̂_T) and the ground truth V = (v_1, v_2, ..., v_T). For a particular input sequence k, the error function takes the form

E^k(w) = \sum_{t=1}^{T_k} E^k_t = \frac{1}{2} \sum_{t=1}^{T_k} \left\| \hat{v}^k_t - v^k_t \right\|^2,    (21)

where T_k is the total number of frames in the kth sequence. In every iteration, the weight update is computed as

\Delta w(r) = m \Delta w(r-1) - \alpha \frac{\partial E(w(r))}{\partial w(r)},    (22)

where 0 \le \alpha \le 1 is the learning rate, 0 \le m \le 1 is the momentum parameter, and w(r) represents the vector of weights after the rth update iteration. The convergence condition is that the validation error shows no obvious change after R iterations. The backpropagation through time (BPTT) algorithm is usually adopted to train the network. In the BLSTM hidden layer, BPTT is applied to both the forward and backward hidden nodes and back-propagates layer by layer. Taking the error function derivatives with respect to the output of the network as an example, for \hat{v}^k_t = (\hat{v}^k_{t1}, \ldots, \hat{v}^k_{tj}, \ldots, \hat{v}^k_{t N_{appearance}}) in the kth V̂, because the activation function used in the output layer is an identity function, we have

\hat{v}^k_{tj} = \sum_{h} w_{oh} z^k_{ht},    (23)

where o is the index of an output node, z^k_{ht} is the activation of a node in the hidden layer connected to node o, and w_{oh} is the weight associated with this connection. By applying the chain rule for partial derivatives, we obtain

\frac{\partial E^k_t}{\partial w_{oh}} = \sum_{j=1}^{N_{appearance}} \frac{\partial E^k_t}{\partial \hat{v}^k_{tj}} \frac{\partial \hat{v}^k_{tj}}{\partial w_{oh}},    (24)

and according to Eqs. (21) and (23), we can derive

\frac{\partial E^k_t}{\partial w_{oh}} = \sum_{j=1}^{N_{appearance}} \left( \hat{v}^k_{tj} - v^k_{tj} \right) z^k_{ht},    (25)

\frac{\partial E^k}{\partial w_{oh}} = \sum_{t=1}^{T_k} \frac{\partial E^k_t}{\partial w_{oh}}.    (26)
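In isolation, the momentum update of Eq. (22) looks as follows; the gradient itself would come from BPTT (Eqs. (23)-(26) for the output layer), and the toy target and hyperparameter values here are illustrative.

```python
# The momentum update of Eq. (22), applied to a toy quadratic (SSE-like)
# objective; learning rate, momentum, and the target vector are illustrative.
import numpy as np

def momentum_step(w, delta_prev, grad, lr=0.05, momentum=0.9):
    """One update: delta(r) = m * delta(r-1) - lr * dE/dw; returns (w, delta)."""
    delta = momentum * delta_prev - lr * grad   # Eq. (22)
    return w + delta, delta

# Toy usage: keep the running delta between iterations.
w = np.zeros(4)
delta = np.zeros_like(w)
target = np.array([1.0, 2.0, 3.0, 4.0])
for _ in range(100):
    grad = 2 * (w - target)          # gradient of the toy SSE ||w - target||^2
    w, delta = momentum_step(w, delta, grad)
# w now approaches the target vector.
```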
Performances

The performance of the DBLSTM-based talking head is evaluated on an A/V database with 593 English utterances spoken by a female in a neutral style (Fan et al. 2015). The DBLSTM approach is compared with the previous HMM-based approach (Wang and Soong 2015). The results for the FBB128 DBLSTM (two BLSTM layers on top of one feed-forward layer, with 128 nodes per layer) and the HMM are shown in Table 1. The deep BLSTM approach clearly outperforms the HMM approach by a large margin in terms of the four objective metrics. A subjective evaluation is also carried out in Fan et al. (2015). Ten label sequences are randomly selected from the test set as the input.
Table 1 Performance comparison between deep BLSTM and HMM

Comparison         HMM      DBLSTM
RMSE (shape)       1.223    1.122
RMSE (texture)     6.602    6.286
RMSE (appearance)  167.540  156.502
CORR               0.582    0.647

Fig. 11 The percentage preference of the DBLSTM-based and HMM-based photo-real talking heads (DBLSTM-RNNs: 45.7%, neutral: 28.6%, HMM: 25.7%)
The deep BLSTM-based and the HMM-based talking head videos are rendered, respectively. For each test sequence, the two talking head videos are played side by side in random order with the original speech. A group of 20 subjects is asked to perform an A/B preference test according to naturalness. The percentage preference is shown in Fig. 11. It can be seen clearly that the DBLSTM-based talking head is significantly preferred over the HMM-based one. Most subjects prefer the BLSTM-based talking head because its lip movement is smoother than that of the HMM-based one. Some video clips of the synthesized talking head can be found in Microsoft Research (2015).
Selected Applications

Avatars with lively visual speech animation are increasingly being used to communicate with users on a variety of electronic devices, such as computers, mobile phones, PDAs, kiosks, and game consoles. Avatars can be found across many domains, such as customer service and technical support, as well as in entertainment. Some of the many uses of avatars include the following:

• Reading news and other information to users
• Guiding users through Web sites by providing instructions and advice
• Presenting personalized messages on social Web sites
• Catching users' attention in advertisements and announcements
• Acting as digital assistants and automated agents for self-service, contact centers, and help desks
• Representing character roles in games
• Training users to perform complex tasks
• Providing new branding opportunities for organizations

Here, we focus on one application that uses a talking head avatar for audio/visual computer-assisted language learning (CALL). Imagine a child learning from his favorite TV star, who appears to be personally teaching him English on his handheld device. Another youngster might show off her own avatar that tells mystery stories in a foreign language to her classmates.
The "talking head" speech processing technology is notable for its potential to enable such scenarios. These features have been successfully tested in a large-scale DDR project called Engkoo (Wang et al. 2012c) from Microsoft Research Asia. It is used by ten million English learners in China per month and was the winner of the Wall Street Journal 2010 Asian Innovation Readers' Choice Award (Scott et al. 2011). The talking head generates short, karaoke-style synthetic videos demonstrating oral English. The videos consist of a photo-realistic person speaking English sentences crawled from the Internet. The technology leverages a computer-generated voice with native-speaker-like quality and synchronized subtitles at the bottom of the video; it emulates popular karaoke-style videos specifically designed for a Chinese audience in order to increase user engagement. Compared to using prerecorded human voice and video in English education tools, these videos not only create a realistic look and feel but also greatly reduce the cost of content creation by generating arbitrary content synthetically and automatically. The potential for personalization is there as well. For example, a voice can be chosen based on preferred gender, age, speaking rate, or pitch range and dynamics, and the selected type of voice can be used to adapt a pre-trained TTS so that the synthesized voice is customized.
Motivation

Language teachers have been avid users of technology for a while now. The arrival of the multimedia computer in the early 1990s was a major breakthrough because it combined text, images, sound, and video in one device and permitted the integration of the four basic skills of listening, speaking, reading, and writing. Nowadays, as personal computers become more pervasive, smaller, and more portable, and with devices such as smartphones and tablet computers dominating the market, multimedia and multimodal language learning can be ubiquitous and more self-paced. For foreign language learners, acquiring correct pronunciation is considered by many to be one of the most arduous tasks if one does not have access to a personal tutor. The reason is that the most common method for learning pronunciation, using audio tapes, severely lacks completeness and engagement. Audio data alone may not offer users complete instruction on how to move their mouth/lips to sound out phonemes that may be nonexistent in their mother tongue, and audio as a tool of instruction is less motivating and personalized for learners. As supported by studies in cognitive informatics, information is processed by humans more efficiently when both audio and visual techniques are utilized in unison. Computer-assisted audiovisual language learning increases user engagement when compared to audio alone. There are many existing bodies of work that use visualized information and talking heads to help language learning. For example, Massaro (1998) used visual articulation to show the internal structure of the mouth, enabling learners to visualize the position and movement of the tongue. Badin et al. (2010) inferred learners' tongue position and shape to provide visual articulatory
corrective feedback in second language learning. Additionally, a number of studies done in Eskenazi (2009) focused on overall pronunciation assessment and segmental/prosodic error detection to help learners improve their pronunciation with computer feedback. In the project in Wang et al. (2012c), the focus is on generating a photo-realistic, lip-synced talking head as a language assistant for multimodal, web-based, and low-cost language learning. The authors feel that a lifelike assistant offers a more authoritative metaphor for engaging language learners, particularly younger demographics. The long-term goal is to create a technology that can ubiquitously help users anywhere, anytime, from detailed pronunciation training to conversational practice. Such a service is especially important as a tool for augmenting human teachers in areas of the world where native, high-quality instructors are scarce.
Karaoke Function

Karaoke, also known as KTV, is a major pastime among Chinese people, with numerous KTV clubs found in major cities in China. A karaoke-like feature has been added to Engkoo, which enables English learners to practice their pronunciation online by lip-synchronously mimicking a photo-realistic talking head within a search and discovery ecosystem. This "KTV function" is exposed as videos generated from a vast set of sample sentences mined from the web. Users can easily launch the videos with a single click on the sentence of their choosing. Similar to the karaoke format, the videos display the sentence on the screen while a model speaker says it aloud, teaching the users how to enunciate the words, as shown in Fig. 12. Figure 13 shows the building blocks of the KTV system. While the subtitles of karaoke are useful, it should be emphasized that the pacing offered is especially valuable when learning a language. Concretely, the rhythm and the prosody embedded in the KTV function offer users the timing cues to utter a given sentence properly. Although pacing can be learned from listening to a native speaker, what is offered uniquely in this system is the ability to get this content at scale and on demand. The KTV function offers a low-cost method for creating highly engaging, personalizable learning material utilizing state-of-the-art talking head rendering technology. One of the key benefits is the generation of lifelike video as opposed to cartoon-based animations. This is important from a pedagogical perspective because the content appears closer in nature to a human teacher, which reduces the perceptive gap that students, particularly younger pupils, need to bridge from the physical classroom to the virtual learning experience. The technology can drastically reduce language learning video production costs in scenarios where the material requires a human native speaker. Rather than repeatedly taping an actor speaking, the technique can synthesize the audio and video content automatically. This has the potential for further bridging the classroom and e-learning scenarios, where a teacher can generate his or her own talking head for students to take home and learn from.
Fig. 12 Screenshots of karaoke-like talking heads on Engkoo. The service is accessible at http://dict.bing.com.cn
Fig. 13 Using talking head synthesis technology for KTV function on Engkoo
Technology Outlook

The current karaoke function, despite its popularity with web users, can be further enhanced to reach the long-term goal: creating an indiscernibly lifelike, low-cost, web-based computer assistant that is helpful in many language learning scenarios, such as interactive pronunciation drills and conversational training. To make the talking head more lifelike and natural, a new 3D photo-realistic, real-time talking head with a personalized appearance is proposed (Wang and Soong 2012). It extends the prior 2D photo-realistic talking head to 3D. First, approximately 20 minutes of audiovisual 2D video with prompted sentences spoken by a human speaker is recorded. A 2D-to-3D reconstruction algorithm is adopted to automatically wrap the 3D geometric mesh with 2D video frames to construct a training database, as shown in Fig. 14. In training, super feature vectors consisting of 3D geometry, texture, and speech are formed to train a statistical, multi-streamed HMM. The model is then used to synthesize both the trajectories of the geometry animation and the dynamic texture. As far as the synthesized audio (speech) output is concerned, the research direction is to make it more personalized, adaptive, and flexible. For example, a new algorithm that can teach the talking head to speak authentic English sentences that sound like a Chinese ESL learner has been proposed and successfully tested. Also, synthesizing more natural and dynamic prosody patterns for ESL learners to mimic is highly desirable as an enhanced feature of the talking head. The 3D talking head animation can be controlled by the rendered geometric trajectory, while the facial expressions and articulator movements are rendered with the dynamic 2D image sequences. Head motions and facial expressions can also be controlled separately by manipulating the corresponding parameters. A talking head of a movie star or celebrity can be created from their video recordings. With the new 3D, photo-realistic talking head, the era of lifelike, web-based, and interactive learning assistants is on the horizon. The phonetic search can be further improved by collecting more data, both text and speech, to generate phonetic candidates that cover the generic and localized spelling/pronunciation errors committed by language learners at different levels.
Fig. 14 A 3D photo-realistic talking head by combining 2D image samples with a 3D face model
When such a database is available, a more powerful LTS can be trained discriminatively such that the errors observed in the database can be predicted and recovered gracefully. In future work, with regard to the interactivity of the computer assistant, it can hear (via speech recognition) and speak (TTS synthesis), read and compose, correct and suggest, or even guess or read the learner’s intention.
Summary This chapter surveys the basic principles, state-of-the-art technologies, and featured applications in the visual speech animation area. Data collection, face/mouth model, feature extraction, and learning a mapping model are the central building blocks of a VSA system. The technologies used in different blocks depend on the application needs and affect the desired appearance of the system. During the past decades, much effort in this area has been devoted to the audio/text-to-visual mapping problem, and approaches can be roughly categorized into rule-based, concatenation, parametric, and hybrid. We showcase a state-of-the-art approach, based on deep bidirectional long short-term memory (DBLSTM) recurrent neural networks (RNN), for audio-to-visual mapping in a video-realistic talking head. We also present the Engkoo project from Microsoft as a practical application of visual speech animation in language learning. We believe that with the fast development of computer graphics, speech technology, machine learning, and human behavior studies, the
future visual speech animation systems will become more flexible, expressive, and conversational. Subsequently, applications can be found across many domains.
References Anderson R, Stenger B, Wan V, Cipolla R (2013) Expressive visual text-to-speech using active appearance models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE. p 3382 Badin P, Ben Youssef A, Bailly G et al (2010) Visual articulatory feedback for phonetic correction in second language learning. In: Proceedings of Second Language learning Studies: Acquisition, Learning, Education and Technology, 2010 Ben Youssef A, Shimodaira H, Braude DA (2013) Articulatory features for speech-driven head motion synthesis. In: Proceedings of the International Speech Communication Association, IEEE, 2013 Breeuwer M, Plomp R (1985) Speechreading supplemented with formant frequency information from voiced speech. J Acoust Soc Am 77(1):314–317 Bregler C, Covell M, Slaney M (1997) Video rewrite: driving visual speech with audio. In: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, ACM Press, p 353 Busso C, Deng Z, Grimm M, Neumann U et al (2007) Rigid head motion in expressive speech animation: Analysis and synthesis. IEEE Trans Audio, Speech, Language Process 15 (3):1075–1086 Cao Y, Tien WC, Faloutsos P et al(2005) Expressive speech-driven facial animation. In: ACM Transactions on Graphics, ACM, p 1283 Cohen MM, Massaro DW (1993) Modeling coarticulation in synthetic visual speech. In: Models and techniques in computer animation. Springer, Japan, p 139 Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681–685 Cosatto E, Graf HP (1998) Sample-based synthesis of photo-realistic talking heads. In: Proceedings of Computer Animation, IEEE, p 103 Cosatto E, Graf HP (2000) Photo-realistic talking-heads from image samples. IEEE Trans Multimed 2(3):152–163 Cosatto E, Ostermann J, Graf HP et al (2003) Lifelike talking faces for interactive services. Proc IEEE 91(9):1406–1429 Deng Z, Neumann U (2008) Data-driven 3D facial animation. Springer Deng L, Yu D (2014) Deep learning methods and applications. Foundations and Trends in Signal Processing, 2014 Deng Z, Lewis JP, Neumann U (2005) Automated eye motion using texture synthesis. IEEE Comput Graph Appl 25(2):24–30 Ding C, Xie L, Zhu P (2015) Head motion synthesis from speech using deep neural networks. Multimed Tools Appl 74(22):9871–9888 Du J, Wang Q, Gao T et al (2014) Robust speech recognition with speech enhanced deep neural networks. In: Proceedings of the International Speech Communication Association, IEEE, p 616 Dziemianko M, Hofer G, Shimodaira H (2009). HMM-based automatic eye-blink synthesis from speech. In: Proceedings of the International Speech Communication Association, IEEE, p 1799 Englebienne G, Cootes T, Rattray M (2007) A probabilistic model for generating realistic lip movements from speech. In: Advances in neural information processing systems, p 401 Eskenazi M (2009) An overview of spoken language technology for education. Speech Commun 51 (10):832–844 Ezzat T, Poggio T (2000) Visual speech synthesis by morphing visemes. Int J Comput Vision 38 (1):45–57
Ezzat T, Geiger G, Poggio T (2002) Trainable videorealistic speech animation. In: ACM SIGGRAPH 2006 Courses, ACM, p 388 Fagel S, Clemens C (2004) An articulation model for audiovisual speech synthesis: determination, adjustment, evaluation. Speech Commun 44(1):141–154 Fagel S, Bailly G, Theobald BJ (2010) Animating virtual speakers or singers from audio: lip-synching facial animation. EURASIP J Audio, Speech, Music Process 2009(1):1–2 Fan B, Wang L, Soong FK et al (2015) Photo-real talking head with deep bidirectional LSTM. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 4884 Fan B, Xie L, Yang S, Wang L et al (2016) A deep bidirectional LSTM approach for video- realistic talking head. Multimed Tools Appl 75:5287–5309 Fu S, Gutierrez-Osuna R, Esposito A et al (2005) Audio/visual mapping with cross-modal hidden Markov models. IEEE Trans Multimed 7(2):243–252 Hinton G, Deng L, Yu D, Dahl GE et al (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process Mag 29 (6):82–97 Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzz 6(02):107–116 Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780 Huang D, Wu X, Wei J et al (2013) Visualization of Mandarin articulation by using a physiological articulatory model. In: Signal and Information Processing Association Annual Summit and Conference, IEEE, p 1 Hunt AJ, Black AW (1996) Unit selection in a concatenative speech synthesis system using a large speech database. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 373 Hura S, Leathem C, Shaked N (2010) Avatars meet the challenge. Speech Technol, 303217 Jia J, Zhang S, Meng F et al (2011) Emotional audio-visual speech synthesis based on PAD. IEEE Trans Audio, Speech, Language Process 19(3):570–582 Jia J, Wu Z, Zhang S et al (2014) Head and facial gestures synthesis using PAD model for an expressive talking avatar. Multimed Tools Appl 73(1):439–461 Kukich K (1992) Techniques for automatically correcting words in text. ACM Comput Surv 24 (4):377–439 Le BH, Ma X, Deng Z (2012) Live speech driven head-and-eye motion generators. IEEE Trans Vis Comput Graph 18(11):1902–1914 Liu P, Soong FK (2005) Kullback-Leibler divergence between two hidden Markov models. Microsoft Research Asia, Technical Report Massaro DW (1998) Perceiving talking faces: from speech perception to a behavioral principle. Mit Press, Cambridge Massaro DW, Simpson JA (2014) Speech perception by ear and eye: a paradigm for psychological inquiry. Psychology Press Masuko T, Kobayashi T, Tamura, M et al (1998) Text-to-visual speech synthesis based on parameter generation from HMM. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 3745 McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748 Microsoft Research (2015) http://research.microsoft.com/en-us/projects/voice_driven_talking_head/ Mikolov T, Chen K, Corrado G et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 Musti U, Zhou Z, Pietikinen M (2014) Facial 3D shape estimation from images for visual speech animation. In: Proceedings of the Pattern Recognition, IEEE, p 40 Ostermann J, Weissenfeld A (2004) Talking faces-technologies and applications. 
In: Proceedings of the 17th International Conference on Pattern Recognition, IEEE, p 826 Pandzic IS, Forchheimer R (2002) MPEG-4 facial animation. The standard, implementation and applications. John Wiley and Sons, Chichester
Parke FI (1972) Computer generated animation of faces. In: Proceedings of the ACM annual conference-Volume, ACM, p 451 Peng B, Qian Y, Soong FK et al (2011) A new phonetic candidate generator for improving search query efficiency. In: Twelfth Annual Conference of the International Speech Communication Association Pighin F, Hecker J, Lischinski D et al (2006) Synthesizing realistic facial expressions from photographs. In: ACM SIGGRAPH 2006 Courses, ACM, p 19 Qian Y, Yan ZJ, Wu YJ et al (2010) An HMM trajectory tiling (HTT) approach to high quality TTS. In: Proceedings of the International Speech Communication Association, IEEE, p 422 Raidt S, Bailly G, Elisei F (2007) Analyzing gaze during face-to-face interaction. In: International Workshop on Intelligent Virtual Agents. Springer, Berlin/Heidelberg, p 403 Microsoft Research (2015) http://research.microsoft.com/en-us/projects/blstmtalkinghead/ Richmond K, Hoole P, King S (2011) Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In: Proceedings of the International Speech Communication Association, IEEE, p 1505 Roweis S (1998) EM algorithms for PCA and SPCA. Adv Neural Inf Process Syst:626–632 Sako S, Tokuda K, Masuko T et al(2000) HMM-based text-to-audio-visual speech synthesis. In: Proceedings of the International Speech Communication Association, IEEE, p 25 Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681 Scott MR, Liu X, Zhou M (2011) Towards a Specialized Search Engine for Language Learners [Point of View]. Proc IEEE 99(9):1462–1465 Seidlhofer B (2009) Common ground and different realities: World Englishes and English as a lingua franca. World Englishes 28(2):236–245 Sumby WH, Pollack I (1954) Erratum: visual contribution to speech intelligibility in noise [J. Acoust. Soc. Am. 26, 212 (1954)]. J Acoust Soc Am 26(4):583–583 Taylor P (2009) Text-to-speech synthesis. Cambridge university press, Cambridge Taylor SL, Mahler M, Theobald BJ et al (2012) Dynamic units of visual speech. In: Proceedings of the 11th ACM SIGGRAPH/Eurographics conference on Computer Animation, ACM, p 275 Theobald BJ, Fagel S, Bailly G et al (2008) LIPS2008: Visual speech synthesis challenge. In: Proceedings of the International Speech Communication Association, IEEE, p 2310 Thies J, Zollhfer M, Stamminger M et al(2016) Face2face: Real-time face capture and reenactment of rgb videos. In: Proceedings of Computer Vision and Pattern Recognition, IEEE, p 1 Tokuda K, Yoshimura T, Masuko T et al (2000) Speech parameter generation algorithms for HMM-based speech synthesis. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 1615 Tokuda K, Oura K, Hashimoto K et al (2007) The HMM-based speech synthesis system. Online: http://hts.ics.nitech.ac.jp Wang D, King S (2011) Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process Lett 18(2):122–125 Wang L, Soong FK (2012) High quality lips animation with speech and captured facial action unit as A/V input. In: Signal and Information Processing Association Annual Summit and Conference, IEEE, p 1 Wang L, Soong FK (2015) HMM trajectory-guided sample selection for photo-realistic talking head. MultimedTools Appl 74(22):9849–9869 Wang L, Han W, Qian X, Soong FK (2010a) Rendering a personalized photo-real talking head from short video footage. 
In: 7th International Symposium on Chinese Spoken Language Processing, IEEE, p 129 Wang L, Qian X, Han W, Soong FK (2010b) Synthesizing photo-real talking head via trajectoryguided sample selection. In: Proceedings of the International Speech Communication Association, IEEE, p 446
Wang L, Wu YJ, Zhuang X et al (2011) Synthesizing visual speech trajectory with minimum generation error. In: IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, p 4580 Wang L, Chen H, Li S et al (2012a) Phoneme-level articulatory animation in pronunciation training. Speech Commun 54(7):845–856 Wang L, Han W, Soong FK (2012b) High quality lip-sync animation for 3D photo-realistic talking head. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 4529 Wang LJ, Qian Y, Scott M, Chen G, Soong FK (2012c) Computer-assisted Audiovisual Language Learning, IEEE Computer, p 38 Weise T, Bouaziz S, Li H et al (2011) Realtime performance-based facial animation. In: ACM Transactions on Graphics, ACM, p 77 Wik P, Hjalmarsson A (2009) Embodied conversational agents in computer assisted language learning. Speech Commun 51(10):1024–1037 Williams RJ, Zipser D (1989) A learning algorithm for continually running fully recurrent neural networks. l Comput 1(2):270–280 Xie L, Liu ZQ (2007a) A coupled HMM approach to video-realistic speech animation. Pattern Recogn 40(8):2325–2340 Xie L, Liu ZQ (2007b) Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Trans Multimed 9(3):500–510 Xie L, Jia J, Meng H et al (2015) Expressive talking avatar synthesis and animation. Multimed Tools Appl 74(22):9845–9848 Yan ZJ, Qian Y, Soong FK (2010) Rich-context unit selection (RUS) approach to high quality TTS. In: IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, p 4798 Zen H, Senior A, Schuster M (2013) Statistical parametric speech synthesis using deep neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, p 7962 Zhang LJ, Rubdy R, Alsagoff L (2009) Englishes and literatures-in-English in a globalised world. In: Proceedings of the 13th International Conference on English in Southeast Asia, p 42 Zhu P, Xie, L, Chen Y (2015) Articulatory movement prediction using deep bidirectional long short-term memory based recurrent neural networks and word/phone embeddings. In: Sixteenth Annual Conference of the International Speech Communication Association
Blendshape Facial Animation Ken Anjyo
Contents State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Blendshape Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Blendshape Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Examples and Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Techniques for Efficient Animation Production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Direct Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Use of PCA Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Blendshape Creation, Retargeting, and Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
K. Anjyo (*) OLM Digital, Setagaya, Tokyo, Japan e-mail: [email protected] # Springer International Publishing AG 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_2-1
Abstract
Blendshapes are a standard approach for making expressive facial animations in the digital production industry. The blendshape model is represented as a linear weighted sum of the target faces, which exemplify user-defined facial expressions or approximate facial muscle actions. Blendshapes are therefore quite popular because of their simplicity, expressiveness, and interpretability. For example, unlike generic mesh editing tools, blendshapes approximate a space of valid facial expressions. This article provides the basic concepts and technical development of the blendshape model. First, we briefly describe a general face rig framework and thereafter introduce the concept of blendshapes as an established face rigging approach. Next, we illustrate how to use this model in animation practice, while
clarifying the mathematical framework for blendshapes. We also demonstrate a few technical applications developed in the blendshape framework.
Keywords
Computer facial animation • Face rig • Blendshapes • Retarget • Deformer • Facial motion capture • Performance capture
State of the Art Digital characters now appear not only in films and video games but also in various digital contents. In particular, facial animation of a digital character should then convey emotions to it, which plays a crucial role for visual storytelling. This requires a digital character animation process as well as its face rigging process (i.e., the setup process) to be very intensive and laborious. In this article, we define face rig as the pair of a deformer and its user interface (manipulator). The deformer means a mathematical model of deforming a face model’s geometry for making animation. The user interface provides animators a toolset of manipulating the face model, based on the deformer. In a production workplace, however, they usually use several deformers at a time, so that the user interface in practice should be more complicated, yet sophisticated, rather than the user interface that we will mention in later sections for blendshapes. A variety of the face rig approaches have been developed. Physics-based models provide the rigorous and natural approaches, having several applications not only in the digital production industry but also in medical sciences, including surgery simulations. The physics-based approaches for computer graphic applications approximate the mechanical properties of the face, such as skin layers, muscles, fatty tissues, bones, etc. Although the physics-based methods may be powerful in making realistic facial animations, artists are then required to have a certain amount of knowledge and experiences regarding background physics. This is not an easy task. On one hand, several commercial 3D CG packages provide proprietary face rig approaches, such as “cluster deformers” (see Tickoo (2009)), which allow the artist to specify the motion space using a painting operation for making 3D faces at key frames. The blendshapes offer a completely different face rig approach. A blendshape model generates face geometry as a linear combination of a number of face poses, each of which is called a blendshape target. These targets typically mean individual facial expressions, shapes that approximate facial muscle actions or FACS (Facial Action Coding System (Ekman and Friesen 1978)) motions. These targets are predefined (designed) by the artist. The blendshapes are therefore parameterized with the weights of the targets, which gives an intuitive and simple way for the artist to make animation. The interface is called sliders and used to control the weights.
Fig. 1 Blendshapes user interface example. Left: The slider box and a 3D face model under editing, where the slider box gives a partial view of the blendshape sliders. This is because, in general, the number of sliders is too large to see all sliders at a time. Instead a desired slider can be reached by scrolling the slider box. The 3D face model shows an edited result with the slider operation for right eye blink, right: the face model before the slider operation
Figure 1 presents such a slider interface example and a simple editing result for a blendshape model. The use of motion capture data has become a common approach to make animation of a digital character. As is well known, the original development of motion capture techniques was driven by the needs of the life science community, where the techniques are mainly used for the analysis of a subject's movement. In the digital production industry, facial motion capture data may be used as an input for the synthesis of realistic animations. The original data will then be converted to a digital face model and edited to obtain desired facial animations. Some of the face rig techniques are therefore indispensable in the converting (retargeting) and editing processes.
Blendshape Applications As mentioned earlier, several face rig techniques are used together for practical situations. Even when more sophisticated approaches to facial modeling are used, blendshapes are often employed as a base layer over which physically based or functional (parameterized) deformations are layered.
In digital production studios and video game companies, they need to develop a sophisticated system that should fully support the artists for efficient and highquality production of visual effects and character animation. The role of blendshape techniques may therefore be a small portion of the system, but is still crucial. Here we briefly describe a few state-of-the-art applications that use blendshape techniques: • Voodoo. This system has been developed in Rhythm & Hues Studios over years, which deals mainly with animation, rigging, matchmove, crowds, fur grooming, and computer vision (see Fxguide (2014)). The system provides several prodigious face rigging tools using blendshapes. For example, many great shots in the film Life of Pi in 2012 were created with this system. • Fez. This is the facial animation system developed in ILM (Bhat et al. 2013; Cantwell et al. 2016; CGW 2014), which involves an FACS implementation using blendshape techniques. It has contributed to recent films, such as Warcraft and Teenage Mutant Ninja Turtles, in 2016. • Face Plus. This is a plug-in for Unity, which is a cross-platform game engine. This plug-in enables us to construct a facial capture and animation system using a web camera (see Mixamo (2013) for details). Based on the blendshape character model created by an artist, the system provides real-time facial animation of the character. In the following sections, we describe the basic practice and mathematical background of the blendshape model.
Blendshape Practice The term “blendshapes” was introduced in the computer graphics industry, and we follow the definition: blendshapes are linear facial models in which the individual basis vectors are not orthogonal but instead represent individual facial expressions. The individual basis vectors have been referred to as blendshape targets and morph targets or (more roughly) as shapes or blendshapes. The corresponding weights are often called sliders, since this is how they appear in the user interface (as shown in Fig. 1). Creating a blendshape facial animation thus requires specifying weights for each frame of the animation, which has traditionally been achieved with key frame animation or by motion capture. In the above discussion, we use a basic mathematical term “vectors.” This section starts with explaining what the vectors mean in making 3D facial models and animations. We then illustrate how to use the blendshapes in practice.
Formulation We represent the face model as a column vector f containing all the model vertex coordinates in some order that is arbitrary (such as xyzxyzxyz, or alternately xxxyyyzzz) but consistent across the individual blendshapes. For example, let us consider a face model composed of n = 100 blendshapes, each having p = 10,000 vertices, with each vertex having three components x, y, z. Similarly, we denote the blendshape targets as vectors b_k, so the blendshape model is represented as

f = \sum_{k=0}^{n} w_k b_k,    (1)

where f is the resulting face, in the form of an m × 1 = 30,000 × 1 vector (m = 3p); the individual blendshapes b_0, b_1, ..., b_n are 30,000 × 1 vectors; and w_k denotes the weight for b_k (1 ≤ k ≤ n). We then take b_0 as the neutral face. Blendshape synthesis can therefore be considered as simply adding vectors. Equation (1) may be referred to as the global or "whole-face" blendshape approach. The carefully sculpted blendshape targets appearing in Eq. (1) then serve as interpretable controls; the span of these targets strictly defines the valid range of expressions for the modeled face. These characteristics differentiate the blendshape approach from those that involve linear combinations of uninterpretable shapes (see a later section) or algorithmically recombine the target shapes using a method other than that in Eq. (1). In particular, from an artist's point of view, the interpretability of the blendshape basis is a definitive feature of the approach. In the whole-face approach, scaling all the weights by a multiplier causes the whole head to scale, while scaling of the head is more conveniently handled with a separate transformation. To eliminate undesired scaling, the weights in Eq. (1) may be constrained to sum to one. Additionally, the weights can be constrained to the interval [0,1] in practice. In the local or "delta" blendshape formulation, one face model b_0 (typically the resting face expression) is designated as the neutral face shape, while the remaining targets b_k (1 ≤ k ≤ n) in Eq. (1) are replaced with the difference b_k - b_0 between the k-th face target and the neutral face:

f = b_0 + \sum_{k=1}^{n} w_k (b_k - b_0).    (2)

Or, if we use matrix notation, Eq. (2) can be expressed as

f = Bw + b_0,    (3)

where B is an m × n matrix having b_k - b_0 as the k-th column vector, and w = (w_1, w_2, ..., w_n)^T is the weight vector.
Fig. 2 Target face examples. From left: neutral, smile, disaffected, and sad
In this formulation, the weights are conventionally limited to the range [0,1], while there are exceptions to this convention. For example, the Maya blendshape interface allows the [0,1] limits to be overridden by the artist if needed. If the difference between a particular blendshape bk and the neutral shape is confined to a small region, such as the left eyebrow, then the resulting parameterization offers intuitive localized control. The delta blendshape formulation is used in popular packages such as Maya (see Tickoo (2009)), and our discussion will assume this variant if not otherwise specified. Many comments apply equally (or with straightforward conversion) to the whole-face variant.
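To make the delta formulation concrete, the following is a minimal sketch of evaluating Eqs. (2) and (3) in Python with NumPy. The function name, array layout, and random test data are illustrative assumptions and do not correspond to any particular production rig described in this chapter.

```python
import numpy as np

def delta_blendshape(b0, targets, w):
    """Evaluate Eq. (2)/(3): f = b0 + B w, with B[:, k] = b_k - b0.

    b0      : (m,) neutral face, m = 3 * number of vertices
    targets : (n, m) blendshape targets b_1 ... b_n
    w       : (n,) slider weights, conventionally in [0, 1]
    """
    B = (targets - b0).T          # m x n delta basis
    return b0 + B @ w

# Tiny synthetic example: 4 vertices (m = 12), 3 targets.
rng = np.random.default_rng(0)
b0 = rng.standard_normal(12)
targets = b0 + 0.1 * rng.standard_normal((3, 12))
w = np.array([0.5, 0.0, 1.0])     # mix half of target 1 and all of target 3

face = delta_blendshape(b0, targets, w)
assert face.shape == b0.shape
```

Setting all weights to zero reproduces the neutral face b_0, which is the property that distinguishes the delta formulation from the whole-face form of Eq. (1).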
Examples and Practice Next, we show a simple example of the blendshape model, which has 50 target faces. The facial expressions in Fig. 1 were also made with this simple model. A few target shapes of the model are demonstrated in Fig. 2, where the leftmost image shows its neutral face. Using the 50 target shapes, the blendshape model provides a mixture of such targets. As mentioned above, the blendshape model is conceptually simple and intuitive. Nevertheless, professional use of this model further requires a large and laborintensive effort of the artists, some of which are listed as follows: • Target shape construction – To express a complete range of realistic expressions, digital modelers often have to create large libraries of blendshape targets. For example, the character of Gollum in The Lord of the Rings had 946 targets (Raitt 2004). Generating a reasonably detailed model can be as much as a year of work for a skilled modeler, involving many iterations of refinement. – A skilled digital artist can deform a base mesh into the different shapes needed to cover the desired range of expressions. Alternatively, the blendshapes can be directly scanned from a real actor or a sculpted model. A common template
model can be registered to each scan in order to obtain vertex-wise correspondences across the blendshape targets. • Slider control (see Fig. 1) – To skillfully and efficiently use the targets, animators need to memorize the function of 50 to 100 commonly used sliders. Then locating a desired slider isn’t immediate. – A substantial number of sliders are needed for high-quality facial animation. Therefore the complete set of sliders does not fit on the computer display. • Animation editing – As a traditional way, blendshapes have been animated by key frame animation of the weights. Commercial packages provide spline curve interpolation of the weights and allow the tangents to be specified at key frames. – Performance-driven facial animation is an alternative way to make animation. Since blendshapes are the common approach for realistic facial models, blendshapes and performance-driven animation are frequently used together (see section “Use of PCA Models,” for instance). We then may need an additional process where the motion captured from a real face is “retargeted” to a 3D face model.
Techniques for Efficient Animation Production In previous sections, we have shown that blendshapes are a conceptually simple, common, yet laborious facial animation approach. Therefore, a number of developments have been made to greatly improve efficiency in making blendshape facial animation. In this section, however, we restrict ourselves to describing only a few of our own works, while also mentioning some techniques related to blendshapes and facial animation. To learn more about the mathematical aspects of blendshape algorithms, we recommend referring to the survey (Lewis et al. 2014).
Direct Manipulation In general, interfaces should provide both direct manipulation and editing of underlying parameters. While direct manipulation usually provides more natural and efficient results, parameter editing can be more exact and reproducible. Artists might therefore prefer it in some cases. While inverse kinematic approaches to posing human figures have been used for many years, analogous inverse or direct manipulation approaches for posing faces and setting key frames have emerged quite recently. In these approaches, the artist directly moves points on the face surface model, and the software must solve for the underlying weights or parameters that best reproduce that motion, rather than tuning the underlying parameters. Here we consider the cases where the number of sliders is considerably large (i.e., well over 100) for a professional use of the blendshape model. Introducing a direct
Fig. 3 Example of direct manipulation interface for blendshapes
manipulation approach would then be a legitimate requirement. To achieve this, we solve the inverse problem of finding the weights for given point movements and constraints. In Lewis and Anjyo (2010), this problem is regularized by considering the fact that facial pose changes are proportional to slider position changes. The resulting approach is easy to implement and can cope with existing blendshape models. Figure 3 shows such a direct manipulation interface example, where selecting a point on the face model surface creates a manipulator object termed a pin, and the pins can be dragged into desired positions. According to the pin and drag operations, the system solves for the slider values (the right panel in Fig. 3) for the face to best match the pinned positions. It should then be noted that the direct manipulation developed in Lewis and Anjyo (2010) can interoperate with the traditional parameter-based key frame editing. As demonstrated in Lewis and Anjyo (2010), both direct manipulation and parameter editing are indispensable for blendshape animation practice. There are several extensions of the direct manipulation approach. For instance, a direct manipulation system suitable for use in animation production has been demonstrated in Seo et al. (2011), including treatment of combination blendshapes and non-blendshape deformers. Another extension in Anjyo et al. (2012) describes a direct manipulation system that allows more efficient edits using a simple prior learned from facial motion capture.
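As a rough illustration of the inverse problem described above, the snippet below solves for slider weights that move a few pinned coordinates toward their dragged positions, using a simple ridge (Tikhonov) regularizer that keeps the solution near the current weights. This is only a generic least-squares sketch with assumed variable names and an assumed regularization strength; it is not the specific regularization or solver of Lewis and Anjyo (2010).

```python
import numpy as np

def solve_pins(B, b0, w_curr, pin_rows, pin_targets, lam=0.1):
    """Find weights w so that the pinned coordinates of b0 + B w approach
    pin_targets while staying close to w_curr.

    B           : (m, n) delta blendshape basis (columns b_k - b0)
    b0          : (m,) neutral face
    w_curr      : (n,) current slider values
    pin_rows    : indices of the m coordinates constrained by pins
    pin_targets : desired values of those coordinates after dragging
    lam         : regularization strength (assumed value, to be tuned)
    """
    S = B[pin_rows, :]                          # constrained sub-basis
    d = pin_targets - b0[pin_rows]              # desired delta at the pins
    A = S.T @ S + lam * np.eye(B.shape[1])
    rhs = S.T @ d + lam * w_curr
    w = np.linalg.solve(A, rhs)
    return np.clip(w, 0.0, 1.0)                 # keep sliders in [0, 1]
```

In an interactive tool, such a solve would run per drag event, and the resulting weights would also be written back to the sliders, which is what keeps direct manipulation and conventional slider editing interoperable, as emphasized above.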
Use of PCA Models In performance-driven facial animation, the motion of a human actor is used to drive the face model. Whereas face tracking is a key technology for the performance-driven approaches, this article focuses on performance capture methods that drive a
face rig. The performance capture methods mostly use a PCA basis or a blendshape basis. We use principal component analysis (PCA) to obtain a PCA model for the given database of facial expression examples. As usual, each element of the database is represented as an m × 1 vector x. Let U be an m × r matrix consisting of the r eigenvectors corresponding to the largest eigenvalues of the data covariance matrix. The PCA model is then given as

f = Uc + e_0,    (4)

where the vector c means the coefficients of those eigenvectors and e_0 denotes the mean vector of all elements x in the database. Since we usually have r ≪ m, the PCA model gives a good low-dimensional representation of the facial models x. This also leads us to solutions to statistical estimation problems in a maximum a posteriori (MAP) framework. For example, in Lau et al. (2009), direct dragging and stroke-based expression editing are developed in this framework to find an appropriate c in Eq. (4). The PCA approaches are useful if the face model is manipulated only with direct manipulation. Professional animation may also require slider operations, so that the underlying basis should be of blendshapes, rather than a PCA representation. This is due to the lack of interpretability of the PCA basis (Lewis and Anjyo 2010). A blendshape representation (3) can be equated to a PCA model (4) that spans the same space:

Bw + b_0 = Uc + e_0.    (5)

We know from Eq. (5) that the weight vector w and the coefficient vector c can be interconverted:

w = (B^T B)^{-1} B^T (Uc + e_0 - b_0)    (6)

c = U^T (Bw + b_0 - e_0),    (7)

where we use the fact that U^T U is an r × r identity matrix in deriving the second equation, Eq. (7). We note that the matrices and vectors in Eqs. (6) and (7), such as (B^T B)^{-1} B^T U and (B^T B)^{-1} B^T (e_0 - b_0), can be precomputed. Converting from weights to coefficients or vice versa is thus a simple affine transform that can easily be done at interactive rates. This will provide us a useful direct manipulation method for a PCA model, if the model can also be represented with a blendshape model.
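The following sketch precomputes the affine maps of Eqs. (6) and (7) and applies them at interactive rates. Matrix names follow the equations above, but the data here are random stand-ins; a real system would build U, e_0, B, and b_0 from its own capture database and rig, so treat this only as an illustration of the algebra.

```python
import numpy as np

def make_converters(B, b0, U, e0):
    """Precompute the affine maps of Eqs. (6) and (7)."""
    B_pinv = np.linalg.pinv(B)        # equals (B^T B)^{-1} B^T when B has full column rank

    def to_weights(c):                # Eq. (6): PCA coefficients -> blendshape weights
        return B_pinv @ (U @ c + e0 - b0)

    def to_coeffs(w):                 # Eq. (7): blendshape weights -> PCA coefficients
        return U.T @ (B @ w + b0 - e0)

    return to_weights, to_coeffs

# Random stand-in data: m coordinates, n targets, r PCA components.
m, n, r = 300, 20, 20
rng = np.random.default_rng(1)
B = rng.standard_normal((m, n)); b0 = rng.standard_normal(m)
U, _ = np.linalg.qr(rng.standard_normal((m, r)))   # orthonormal columns, as PCA provides
e0 = rng.standard_normal(m)

to_weights, to_coeffs = make_converters(B, b0, U, e0)
c = to_coeffs(rng.uniform(0.0, 1.0, n))            # weights -> PCA coefficients
w = to_weights(c)                                  # and back (exact only if the bases span the same space)
```

Both conversions are a single matrix-vector product plus a constant offset, which is what makes switching between slider editing and PCA-based direct manipulation cheap at runtime.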
Blendshape Creation, Retargeting, and Transfer Creating a blendshape model for professional animation requires sculpting on the order of 100 blendshape targets and adding hundreds more shapes in several ways
(see Lewis et al. (2014), for instance). Ideally, the use of dense motion capture of a sufficiently varied performance should contribute to efficiently create such a large number of blendshape targets. To achieve this, several approaches have been proposed, including a PCA-based approach (Lewis et al. 2004) and a sparse matrix decomposition method (Neumann et al. 2013). Expression cloning approaches (Noh and Neumann (2001); Sumner and Popović (2004), for instance) are developed for retargeting the motion from one facial model (the “source”) to drive a face (the “target”) with significantly different proportions. The expression cloning problem was posed in Noh and Neumann (2001), where the solution was given as a mapping by finding corresponding pairs of points on the source and target faces using face-specific heuristics. The early expression cloning algorithms do not consider adapting the temporal dynamics of the motion to the target, which means that they work well if the source and target are of similar proportions. The movement matching principle in Seol et al. (2012) provides an expression cloning algorithm that can cope with the temporal dynamics of face movement by solving a space-time Poisson equation for the target blendshape motion. Relating to expression cloning, we also mention model transfer briefly. This is the case where the source is a fully constructed blendshape model and the target consists of only a neutral face (or a few targets). Deformation transfer (Sumner and Popović 2004) then provides a method of constructing the target blendshape model, which is mathematically equivalent to solving a certain Poisson equation (Botsch et al. 2006). We also have more recent progresses for the blendshape model transfer, including the one treating with a self-collision issue (Saito 2013) and the technique allowing the user to iteratively add more training poses for blendshape expression refinement (Li et al. 2010).
Conclusion While the origin of blendshapes may lie outside academic forums, blendshape models have evolved over the years along with a variety of advanced techniques including those described in this article. We expect more scientific insights from visual perception, psychology, and biology will strengthen the theory and practice of the blendshape facial models. In a digital production workplace, we should also promote seamless integration of the blendshape models with other software tools to establish a more creative and efficient production environment. Acknowledgments I would like to thank J. P. Lewis for mentoring me over the years in the field of computer facial animation research and practice. Many thanks go to Ayumi Kimura for her fruitful discussions and warm encouragements in preparing and writing this article. I also thank Gengdai Liu and Hideki Todo for their helpful comments and creation of the images in Figs. 1, 2, and 3.
References Anjyo K, Todo H, Lewis JP (2012) A practical approach to direct manipulation blendshapes. J Graph Tools 16(3):160–176 Bhat K, Goldenthal R, Ye Y, Mallet R, Koperwas M (2013) High fidelity facial animation capture and retargeting with contours. In: Proceedings of the 12th ACM SIG-GRAPH/Eurographics Symposium on Computer Animation, 7–14 Botsch M, Sumner R, Pauly M, Gross M (2006) Deformation transfer for detail-preserving surface editing. In: Proceedings of Vision, Modeling, and Visualization (VMV), 357–364 Cantwell B, Warner P, Koperwas M, Bhat K (2016) ILM facial performance capture, In ACM SIGGRAPH2016 Talks, 26:1–26:2 CGW web page (2014) http://www.cgw.com/Publications/CGW/2014/Volume-37-Issue-4-JulAug-2014-/Turtle-Talk.aspx Ekman P, Friesen W (1978) Facial action coding system: manual. Consulting Psychologists Press, Palo Alto Fxguide web page (2014) https://www.fxguide.com/featured/voodoo-magic/ Lau M, Chai J, Xu Y-Q, Shum H-Y (2009) Face poser: interactive modeling of 3D facial expressions using facial priors. ACM Trans Graph 29(1), 3:1–3:17 Lewis JP, Anjyo K (2010) Direct manipulation blendshapes. IEEE Comput Graph Appl 30 (4):42–50 Lewis JP, Mo Z, Neumann U (2004) Ripple-free local bases by design. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 684–688 Lewis JP, Anjyo K, Rhee T, Zhang M, Poghin F, Deng Z (2014) Practice and theory of blend-shape facial models. Eurographics 2014 (State of the Art Reports), 199–218 Li H, Weise T, Pauly M (2010) Example-based facial rigging. ACM Trans Graph 29(3), 32:1–32:6 Mixamo web page (2013) https://www.mixamo.com/faceplus Neumann T, Varanasi K, Wenger S, Wacker M, Magnor M, Theobalt C (2013) Sparse localized deformation components. ACM Trans Graph 32(6), 179:1–179:10 Noh J, Neumann U (2001) Expression Cloning. In: SIGGRAPH2001, Computer Graphics Proceedings, ACM Press/ACM SIGGRAPH, 277–288 Raitt B (2004) The making of Gollum. Presentation at U. Southern California Institute for Creative Technologies’s Frontiers of Facial Animation Workshop, August 2004 Saito J (2013) Smooth contact-aware facial blendshape transfer. In: Proceedings of Digital Production Symposium 2013 (DigiPro2013), ACM. 7–12 Seo J, Irving J, Lewis JP, Noh J (2011) Compression and direct manipulation of complex blendshape models. ACM Trans Graph 30(6), 164:1–164:10 Seol Y, Lewis JP, Seo J, Choi B, Anjyo K, Noh J (2012) Spacetime expression cloning for blendshapes. ACM Trans Graph 31(2), 14:1–14:12 Sumner RW, Popović J (2004) Deformation transfer for triangle meshes. ACM Trans Graph 23 (3):399–405 Tickoo S (2009) Autodesk maya 2010: a comprehensive guide. CADCIM Technologies, Schererville
Eye Animation Andrew T. Duchowski and Sophie Jörg
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eye Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Listing’s and Bonders’ Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Modeling Physiologically Plausible Eye Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Modeling Induced Torsion During Vergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fixations and Saccades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Modeling Microsaccadic Jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Modeling Saccadic Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pupil Dilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Procedural Eye Movement Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Periocular Motions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Running the Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary: Listing the Sources of Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 3 4 4 5 6 8 9 12 13 13 14 14 15 16 16
A.T. Duchowski (*) Clemson University, Clemson, SC, USA e-mail: [email protected] S. Jörg School of Computing, Clemson University, Clemson, SC, USA e-mail: [email protected] # Springer International Publishing AG 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_3-1
Abstract
The synthesis of eye movements involves modeling saccades (the rapid shifts of gaze), smooth pursuits (object tracking motions), binocular rotations implicated in vergence, and the coupling of eye and head rotations. More detailed movements include dilation and constriction of the pupil (pupil unrest) as well as small fluctuations (microsaccades, tremor, and drift, which we collectively call
microsaccadic jitter) made during fixations, when gaze is held nearly steady. In this chapter, we focus on synthesizing physiologically plausible eye rotations, microsaccadic jitter, and pupil unrest. We review concepts relevant to the animation of eye motions and provide a procedural model of gaze that incorporates rotations adhering to Donders' and Listing's laws, the saccadic main sequence, along with gaze jitter and pupil unrest. We model microsaccadic jitter and pupil unrest by 1/f^α or pink noise.
Keywords
Eye movements • Saccades • Fixations • Microsaccadic jitter
Introduction Engaging virtual characters are highly relevant in many applications, from entertainment to virtual reality (e.g., training). Realistic eye motions are important for increasing the perceived believability of virtual actors (avatars) and physical humanoid robots. Looser and Wheatley [36] show that people are influenced by the eyes more than by other facial features when rating the animacy of virtual characters. In film, eye movements are important for conveying the character’s emotions and thoughts, e.g., as in Neil Burger’s feature film Limitless. In virtual environments, eye gaze is of vital importance for the correct identification of deictic reference – what is being looked at [42]. As characters become more realistic and humanlike, correct gaze animation becomes even more important. Garau et al. [20] found a strong subjective interaction effect between the realism of a character and their gaze; for a more realistic character, more elaborate gaze behavior is preferred, whereas for a less realistic character, random gaze patterns received better ratings. The dynamics of eye motions, however, have received little attention since Lee et al.’s [35] seminal Eyes Alive model which focused largely on saccadic eye movements, implementing what is known as the saccadic main sequence [5]. The rapid advancement of eye tracking technology has revitalized interest in recording eye movements for inclusion in computer graphics and interactive systems [17, 57, 65]. Ruhland et al. [53] survey the state of the art in eye and gaze animation, where efforts aimed at modeling the appearance and movement of the human eye are reviewed. Beyond the synthesis of saccades (the rapid shifts of gaze), their report also considers tracking motions known as smooth pursuits, binocular rotations implicated in vergence (used for depth perception), and the coupling of eye and head rotations (e.g., the vestibulo-ocular reflex (VOR)). Ruhland et al. [53] furthermore review high-level aspects of gaze behavior including past efforts to model visual attention, expression of emotion, nonverbal interaction, conversation and listening behavior, verbal interaction, and speech-driven gaze. In this chapter, we extend their review by focusing on several important aspects missing from their survey, namely, oculomotor rotations of the eyeball adhering to
Donders' and Listing's laws [59], the detailed motions of the eye during fixations [38] that can be modeled with microsaccadic jitter, and rapid fluctuations of the pupil (pupil unrest) [54]. We present previous findings on these topics and derive a physiologically plausible procedural eye movement model where microsaccadic jitter and pupil unrest are modeled by 1/f^α pink noise. Our chapter contribution is based on prior publications presented at Computer Graphics International (CGI) [14], Motion in Games (MIG) [15], and the Symposium on Eye Tracking Research & Applications (ETRA) [18].
Eye Rotation Almost all normal primate eye movements used to reposition the fovea result as combinations of five basic types: saccadic, smooth pursuit, vergence, vestibular, and small movements associated with fixations [51]. These smaller motions consist of drift, tremor, and microsaccades [52]. Other movements such as adaptation and accommodation refer to nonpositional aspects of eye movements (i.e., pupil dilation, lens focusing). In general, the eyes move within six degrees of freedom: three translations within the socket and three rotations, although physical displacement is required for translations to occur (e.g., a push of a finger). There are six muscles responsible for movement of the eyeball: the medial and lateral recti (sideway movements), the superior and inferior recti (up/down movements), and the superior and inferior obliques (twist) [12]. The neural system involved in generating eye movements is known as the oculomotor plant [51]. Eye movement control signals emanate from several functionally distinct brain regions. Areas in the occipital cortex are thought to be responsible for high-level visual functions such as recognition. The superior colliculus bears afferents emanating directly from the retina, particularly from peripheral regions conveyed through the magnocellular pathway. The semicircular canals react to head movements in three-dimensional space. All three areas (i.e., the occipital cortex, the superior colliculus, and the semicircular canals) convey efferents to the eye muscles through the mesencephalic and pontine reticular formations. Classification of observed eye movement signals relies in part on the known functional characteristics of these cortical regions [16]. Eye movement models typically do not consider the oculomotor plant for the purposes of animation; rather, signal characteristics are of greater importance. For example, Komogortsev et al. have developed a sophisticated model of the oculomotor plant but for biometric identification purposes rather than for animation [30, 31]. Prior models of eye rotation have been developed from the perspective of capturing and modeling observed gaze behavior but do not necessarily take into account their synthetic reproduction, i.e., animation [24, 50]. What are also often overlooked are constraints of orbital rotations following Listing’s and Donders’ laws. In this chapter we discuss previous models based on quaternion rotation and show how they can be implemented in a straightforward manner to ensure physiologically plausible eye rotation.
Listing's and Donders' Laws Listing's and Donders' laws state that eyeball rotations can effectively be modeled as compositions of rotations exclusively about the vertical and horizontal axes, with negligible torsion when head direction is fixed. Further implications arise during ocular vergence movement, as discussed below. Using recorded gaze movements from monkeys and humans, Tweed et al. [59] define the eyeball's primary position as one in which the gaze vector is orthogonal to Listing's plane, the plane of displacement, which essentially models the tilt of the head. Listing's law states that the eye assumes only those orientations that can be reached from the primary position by a single rotation about an axis lying in Listing's plane. For modeling purposes, Listing's law, a refinement of Donders' law, states that in the absence of head tilt and with static visual surroundings, we can effectively ignore the eyeball's torsional component when modeling saccades [48]. In practice, torsion fluctuations of up to about 5° have been observed [19]. Interestingly, Tweed and Vilis [60] show that the primary gaze direction varies between primates (e.g., preferred head tilt varies within humans as well as within monkeys, and between the two groups, with monkeys generally carrying their heads tilted slightly more back than humans, on average). We believe that this variability in preferred primary gaze direction is a factor in the believability of virtual actors which may not have been previously exploited. Ma and Deng [37], for example, describe a model of gaze driven by head direction/rotation, but their gaze-head coupling model is designed in a fashion that seems contrary to physiology: head motion triggers eye motion. Instead, because the eyes are mechanically "cheaper" and faster to rotate, the brain usually triggers head motion when the eyes exceed a certain saccadic amplitude threshold (about 30°; see Murphy and Duchowski [41] for an introductory note on this topic). Nevertheless, Ma and Deng and then later Peters and Qureshi [47] both provide useful models of gaze/head coupling with a good "linkage" between gaze and head vectors. Our model currently focuses only on gaze direction and assumes a stationary head but eventually could offer extended functionality, expressed in terms of quaternions, which are generally better suited for expressing rotations than vectors and Euler angles. In our model, the eyes are the primary indicators of attention, with head rotation following gaze rotation when a rotational threshold (e.g., 30°) is exceeded.
Modeling Physiologically Plausible Eye Rotation A coordinated model of the movement of the head and eyes relies on a plausible description of the eyeball’s rotation within its orbit. Eyeball rotation, at any given instant in time t, is described by the deviation of the eyeball from its primary position. This rotation can be described by the familiar Euler angles used to denote roll, pitch, and yaw. Mathematically, a concise and convenient representation of all
three angles in one construct is afforded by a quaternion that describes the eyeball's orientation, i.e., the direction of the vector g_e emanating from the eyeball center and terminating at the 3D position of gaze in space, p_t = (x_t, y_t, z_t). Tweed et al. [59] specify the quaternion in question in relation to the bisector V_e of the current reference gaze direction g_e and primary gaze vector g_p. To precisely model Listing's law, assuming V_e is a normalized forward-pointing vector orthogonal to the displacement plane (which may be tilted back), the quaternion q expressing the rotation between g_p and g_e is q = [V_e · g_e, (V_e × g_e)] in a right-handed coordinate system. The quaternion q is a vector with four components, q = (q_0, q_τ, q_V, q_H), with q_0 the scalar part of q and q_τ, q_V, and q_H the torsional, vertical, and horizontal rotational components, respectively. Listing's and Donders' laws are important for setting up traditional computer graphics rotations of the eyeball because together they not only specify a convenient simplification but they also specify a physiologically observable constraint, namely, q_τ = 0, negligible visible torsion. Moreover, head/eye movement coordination is made implicit since Listing's plane can be used to model the orientation of the head with quaternion q fixed to lie in Listing's plane. This is accomplished by first setting the gaze direction vector g_r to point at reference point p_t, then specifying the rotation quaternion's plane by parameters f, f_V, and f_H, which are used to express q_τ as a function of q_H and q_V: q_τ = f + f_V q_V + f_H q_H. If f is not 0, then the reference position p_t does not satisfy Listing's law (see Fig. 1). Quaternion e = (√(1 - f²), f, 0, 0), however, does and has the same direction as g_r. To force q to adhere to Listing's law, we set up the quaternion e⁻¹ = (√(1 - f²), -f, 0, 0) and right-multiply q. This fixes the reference position, adjusting the quaternion's torsional component. To find Listing's plane, the normal vector is computed by specifying the quaternion of the primary position relative to e as p = (V_1, V_0, V_3, V_2), where V = (1, f_V, f_H)/|(1, f_V, f_H)|. Quaternion q is then left-multiplied by p⁻¹, giving p⁻¹ q e⁻¹ as the corrected rotation quaternion satisfying Listing's law such that q_τ = 0. It is important to note that g_r describes the orientation of the head. That is, g_r can be used to model an avatar's preferred head tilt, and thus primary gaze direction, or rotation undergone during vergence eye movements (see below).
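To make the quaternion bookkeeping concrete, here is a minimal numerical sketch of the bisector construction q = [V_e · g_e, V_e × g_e] in Python with NumPy. It assumes a (w, x, y, z) quaternion layout, a right-handed frame with the primary gaze direction along +z, and Listing's plane equal to the x-y plane; it is an illustration of the construction above, not the authors' implementation.

```python
import numpy as np

def quat_from_bisector(g_p, g_e):
    # Quaternion rotating primary gaze g_p onto gaze g_e, built from the
    # normalized bisector V_e as in the text: q = [V_e . g_e, V_e x g_e].
    g_p = g_p / np.linalg.norm(g_p)
    g_e = g_e / np.linalg.norm(g_e)
    v_e = g_p + g_e
    v_e = v_e / np.linalg.norm(v_e)
    return np.concatenate(([np.dot(v_e, g_e)], np.cross(v_e, g_e)))

def quat_mul(a, b):
    w1, x1, y1, z1 = a; w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2])

def rotate(q, v):
    # Rotate 3-vector v by unit quaternion q: q v q^{-1}.
    qv = np.concatenate(([0.0], v))
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, qv), q_conj)[1:]

# Primary gaze straight ahead (+z); Listing's plane is then the x-y plane.
g_p = np.array([0.0, 0.0, 1.0])
g_e = np.array([0.3, -0.2, 1.0]); g_e /= np.linalg.norm(g_e)
q = quat_from_bisector(g_p, g_e)
print(rotate(q, g_p))        # ~ g_e: the quaternion reproduces the gaze shift
print(np.dot(q[1:], g_p))    # ~ 0: torsional component q_tau vanishes (Listing's law)
```

Because the rotation axis V_e × g_e is proportional to g_p × g_e, it always lies in the plane perpendicular to the primary direction, which is exactly the constraint q_τ = 0 expressed by Listing's law.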
Modeling Induced Torsion During Vergence Torsional eye movements are associated with vergence or can occur in response to specific visual stimuli [62]. Tweed et al.'s [59] quaternion framework models Listing's law by negating cyclotorsion (q_τ = 0). The shape of the surface to which the eye position quaternions are constrained resembles a plane when only the eyes move and the head is stationary. When gaze shifts involve both the eye and head
[Fig. 1 panels, left to right: (a) looking ahead; (b) no torsion; (c) slight torsion; (d) 180° torsion]
Fig. 1 Eye rotation from primary position (a): (b) no torsion modeling Listing’s and Donders’ laws; (c) slight torsion due to Listing’s plane tilt; (d) implausible torsion though mathematically possible if quaternions are not constrained
(e.g., during VOR), the rotation surface twists and becomes non-planar [21]. This twist is similar to that produced by a Fick gimbal model of rotations in which the horizontal axis is nested within a fixed vertical axis.¹ Tweed et al.'s [59] quaternion framework does not explicitly consider vergence movements. Mok et al. [39] suggest that eye positions during vergence remain restricted to a planar surface (Listing's plane), but that surface is rotated relative to that observed for far targets. The rotation is such that during convergence both eyes undergo extorsion during downward gaze shifts and intorsion during upward gaze shifts, i.e., Listing's plane of the left eye is rotated to the left about the vertical axis (qV > 0) while that of the right eye is rotated to the right (qV < 0). For example, vergence of degree θ can be modeled by constructing the quaternion q̂ = (cos(θ/2), sin(θ/2)v), where v denotes the vertical axis (0, 1, 0), and then rotating gr via q̂ gr q̂⁻¹, as illustrated by Fig. 2. During convergence, the primary position q̂ gr q̂⁻¹ is rotated temporally by approximately two-thirds the angle of vergence. Our model currently does not take into account refinements concerning VOR within the basic quaternion framework, but it produces correct vergence eye movements in the context of head-free (instead of head-fixed) rotations and allows targeting of the eye at a look point pt. This point may in turn be animated to rotate the eye, e.g., via a procedural model (see below).
Fixations and Saccades Fixations, the stationary eye movements required to maintain steady gaze when looking at a visual target, are never perfectly still but instead are composed of small involuntary eye movements: microsaccades, drift, and tremor [52].
¹ The non-commutativity of rotations leads to false torsions from equivalent rotations around eye-fixed and head-fixed axes; under normal circumstances, the eye assumes orientations given by Euler rotations satisfying Listing's law [50].
Fig. 2 Eye rotation during 30° vergence (qV = 0.07): (a) no torsion (f = 0.00) when gaze is level (looking in, level); (b) slight intorsion (f = 0.06) looking in and up; (c) slight extorsion (f = −0.06) looking in and down. The plane drawn behind the (left) eye is a visualization of the rotation of gaze direction gr during convergence
Microsaccades play a vital role in maintaining visibility during fixation but are perhaps the least understood of all eye movement types, despite their critical importance to normal vision [38]. If microsaccadic eye movements were perfectly counteracted, visual perception would rapidly fade due to adaptation [22, 27]. Microsaccades contribute to maintaining visibility during fixation by shifting the retinal image in a fashion that overcomes adaptation, generating neural responses to stationary stimuli in visual neurons. Martinez-Conde et al. [38] note that microsaccades go unnoticed, but this generally refers to oneself: it is not possible to detect one's own eye movements when looking in a mirror. Because the perceptual system is sensitive to, and amplifies, small fluctuations [61], noticing others' eye movements, even subtle ones, may be important, especially during conversation, turn-taking, etc. (see Vertegaal [63]). Even though microsaccades are the largest and fastest of the fixational eye movements, they are relatively small in amplitude, carrying the retinal image across a range of several dozen to several hundred photoreceptor widths [38]. Microsaccades and saccades share many physical and functional characteristics, suggesting that both eye movements have a common oculomotor origin, i.e., a common neural generator (current evidence points to a key role of the superior colliculus). While microsaccades play a crucial role in maintaining visibility during fixation, they may also betray our emotional state, as they reflect human brain activities during attention and cognitive tasks [28]. Laretzaki et al. [34] show that the fixational gaze distribution is more widespread during periods of psychological threat than during periods of psychological safety. That is, the dispersion of microsaccades is larger under perception of threat than under perception of safety. Di Stasi et al. [13] also note that saccadic and microsaccadic velocity decrease with time-on-task, whereas
drift velocity increases, suggesting that ocular instability increases with mental fatigue. Thus, dispersion of the microsaccadic position distribution can be made to increase with (simulated) increased fatigue. Microsaccades and saccades follow the main sequence, which describes the relationship between their amplitude (θ) and duration (Δt) and can be modeled by the linear equation

Δt = 2.2θ + 21 (milliseconds)   (1)
for saccadic amplitudes up to about 20° [5, 29].² The main sequence gives us a plausible range of durations and corresponding eyeball rotations that are intuitively understood: the larger the eye rotation (θ), the more time required to rotate that eye. All these insights can be used to develop parametric models for the synthesis of fixations and saccades.
Parametric Models Animating gaze shifts of virtual humans often involves the use of parametric models of human gaze behavior [4, 46]. While these types of models enable virtual humans to perform natural and accurate gaze shifts, signal characteristics, and in particular noise, are rarely addressed, if at all. Noise, however, although a nuisance from a signal processing perspective, is a key component of natural eye movements. To generate a stream of synthetic gaze points resembling captured data pt = (xt, yt) (the z-coordinate can be dropped if the points are projected onto a 2D viewing plane), a reasonable strategy is to guide synthetic gaze to a sequence of known points, e.g., a grid of points used to calibrate the eye tracker to human viewers, a model of reading behavior where gaze is directed toward as yet unread or previously read words (regressions or revisits) and to lines above and below the current line of text [10], or a set of points selected by an artist. Given such a sequence (e.g., see Fig. 3), several characteristics need to be added, namely:
1. A model of the spatiotemporal fixation perturbation (microsaccadic jitter) [15]
2. A model of saccadic velocity (i.e., position and duration)
3. Control of the simulation time step and sampling rates (see section "Running the Simulation")
We suggest modeling the spatiotemporal perturbation of gaze points at a fixation, which arises from microsaccades, drift, and tremor, with 1/f^α or pink noise, which we call microsaccadic jitter.
² In their development of Eyes Alive, Lee et al. [35] (see also Gu et al. [23]) expressed the main sequence as Δt = d·θ + D0 (milliseconds) with d ∈ [2, 2.7] ms/deg and D0 ∈ [20, 30] ms.
Modeling Microsaccadic Jitter A key aspect for the realistic synthesis of eye motion is the inclusion of microsaccadic gaze jitter. While the recorded eye movement signal is well understood from the point of view of analysis, surprisingly little work exists on its detailed synthesis. Most analytical approaches are concerned with gaze data filtering, e.g., signal smoothing and/or processing for the detection of specific events such as saccades, fixations, or, more recently, the further distinction between ambient and focal fixations [32]. During analysis of recorded eye movements, gaze data is commonly filtered, especially when detecting fixations (e.g., see Duchowski et al. [18], who advocate the use of the Savitzky-Golay filter for signal analysis and detection of fixations). The signal processing approach (filtering) still dominates even in very recent approaches to synthesis, e.g., Yeo et al.'s [65] Eyecatch simulation, which used the Kalman filter to produce gaze but focused primarily on saccades and smooth pursuits (see below). Microsaccades, tremor, or drift were not modeled. As noted by Yeo et al., simulated gaze behavior looked qualitatively similar to gaze data captured by an eye tracker, but comparison of synthesized trajectory plots showed an absence of the gaze jitter that was evident in the raw data. The distribution of microsaccade amplitudes tends to a 1° asymptote, making it a convenient upper amplitude threshold, although the microsaccade amplitude distribution tends to peak at about 12 arcmin [38]. The amplitude distribution can be modeled by the Poisson probability mass function P(x, λ) = λ^x e^(−λ)/x! with λ = 6, shifted by x − 8.5 and scaled by 5.5, and approximated by the normal distribution 5.5 N(x − 8.0, μ = λ, σ = √λ). The resultant normal distribution resembles the microsaccade distribution reported by Martinez-Conde et al. [38] and provides a starting point for modeling microsaccadic jitter, suggesting that perturbation about the point of fixation can be modeled by the normal distribution N(μ = 0, σ = 12/60°), i.e., 12 arcmin, for each of the x- and y-coordinate offsets to the fixation modeled during simulation (setting σ = 0 yields no jitter during fixation and can be used to simulate keyframed saccades). Modeling microsaccadic jitter by the normal distribution yields white noise perturbation. White noise perturbation is a logical starting point for modeling microsaccadic jitter, but it is uncorrelated and therefore not necessarily a good choice. Recorded neural spikes are superimposed with noise that exhibits non-Gaussian characteristics and can be approximated as 1/f^α noise [64]. Pulse trains of nerve cells belonging to various brain structures have been observed and characterized as 1/f noise [61]. The 1/f regime accomplishes a tradeoff: the perceptual system is sensitive to and amplifies small fluctuations; simultaneously, the system preserves a memory of past stimuli in the long-time correlation tails. The memory inherent in the 1/f system can be used to achieve a priming effect: the succession of two stimuli separated by 50–100 ms at the same location results in a stronger response to the second stimulus.
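A minimal sketch of this white-noise starting point (names and frame count are ours; the pink-noise refinement follows later in the chapter):

```python
# Minimal sketch: white-noise microsaccadic jitter as a starting point, with
# per-frame x/y offsets about a fixation point drawn from N(0, sigma), where
# sigma = 12 arcmin expressed in degrees of visual angle.
import numpy as np

def white_jitter(n_frames, sigma_deg=12.0 / 60.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Independent horizontal and vertical offsets; sigma_deg = 0 disables
    # jitter (useful for purely keyframed saccades, as noted above).
    return rng.normal(0.0, sigma_deg, size=(n_frames, 2))

fixation = np.array([2.0, -1.0])            # fixation point, degrees
gaze = fixation + white_jitter(300)          # 300 perturbed gaze samples
```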
[Fig. 3, three panels of gaze point data plotted by x- and y-coordinate (pixels): (a) actual gaze data; (b) synthetic for rendering; (c) synthetic data with noise]
To model microsaccades by 1/f^α (pink) noise, the white noise perturbations modeled by the normal distribution N(0, σ = 12/60°) (i.e., 12 arcmin of visual angle) are digitally filtered by a digital pink noise filter with system function

Hn(z) = (1/Gn(α)) ∏_{k=1}^{n} (z − qk)/(z + pk),   Gn(α) = ∏_{k=1}^{n} (α + ak)/(ak α + 1)   (2)

where Gn(α) is the nth-order approximation to an ideal analog pink noise filter with system function G(s) = 1/√s, ak = tan²(kθ), and

qk = (1 − α ak)/(1 + α ak),   pk = (α − ak)/(α + ak)

with f0 the unity gain frequency, α = tan(π f0 T), and T the sampling period, for filter order n ∈ ℤ. With α = 1.0, the filter produces pink noise given white noise as input [25]. For other values of α, a very good approximation for θ is θ = π/(2n + 2 − 2α). We chose a 4th-order filter for reshaping the microsaccadic jitter modeled by Gaussian noise, N(0, σ = 12/60°). More formally, we define the pink noise filter as a function of two parameters, P(α, f0), where 1/f^α describes the pink noise power spectral distribution and f0 the filter's unity gain frequency (or, more simply, its gain). Setting α = 1 produces 1/f noise. Setting α = 0 produces white, uncorrelated noise with a flat power spectral distribution, likely a poor choice for modeling biological motion such as microsaccades. We found that α = 0.6 and f0 = 0.85 gave fairly natural microsaccadic jitter [15]. In practice, a look point drives the rotation of the eyeball. We can therefore model microsaccades as separate x- and y-directional offsets to the main view vector. This requires two pink noise filters, one for each of the two dimensions. Setting the simulation up this way allows independent control of horizontal and vertical microsaccades so that, for example, by controlling α, horizontal microsaccades can be made noisier (more noise devoted to the high-frequency portion of the spectrum) than vertical microsaccades. The above model of microsaccadic jitter does not consider where fixations are made, i.e., it can be used to add perturbations to randomly distributed fixations. We can guide gaze to a sequence of fixation points, specified as a series of 2D look point coordinates. Microsaccadic jitter is then used to perturb the look point about each fixation point. More formally, we simulate a sequence of look points via the following fixation simulation, developed by Duchowski et al. [18]:

pt+h = pt + P(α, f0)   (3)
Fig. 3 Generation of a sequence of synthetic gaze points based on raw gaze data captured by an eye tracker in (a): (b) microsaccadic jitter at fixation points identified from the raw gaze data; (c) addition of simulated eye tracker noise, which obscures microsaccadic jitter but produces gaze distributed about calibration points resembling the raw gaze data
Fig. 4 Parametric saccade position model derived from an idealized model of the saccadic force-time function assumed by Abrams et al.'s [1] symmetric-impulse variability model: scaled position 60H(t), velocity 31Ḣ(t), and acceleration 10Ḧ(t)
where pt is the look point at simulation time t and h is the simulation time step.
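The fixation simulation of (3) can be sketched as follows. Rather than re-implementing the recursive filter of (2), this sketch generates 1/f^α noise by spectral shaping of white noise, a standard alternative technique; the gain argument is only a crude stand-in for the unity gain frequency f0, and the function names, per-step amplitude, and step count are our own illustrative choices, not the authors' implementation.

```python
# Sketch of Eq. (3): perturb a look point with 1/f^alpha noise. The noise is
# produced by FFT-based spectral shaping (a stand-in for the recursive filter
# of Eq. (2)); `gain` loosely plays the role of f0.
import numpy as np

def pink_noise(n, alpha=0.6, gain=1.0, rng=None):
    """Zero-mean noise whose power spectrum falls off roughly as 1/f^alpha."""
    if rng is None:
        rng = np.random.default_rng()
    white = rng.normal(0.0, 1.0, n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n)
    shaping = np.zeros_like(freqs)
    shaping[1:] = freqs[1:] ** (-alpha / 2.0)   # amplitude ~ f^(-alpha/2), DC removed
    noise = np.fft.irfft(spectrum * shaping, n)
    return gain * noise / np.std(noise)

def simulate_fixation(p0, n_steps, scale_deg=12.0 / 60.0, alpha=0.6, gain=0.85, rng=None):
    """Eq. (3): p_{t+h} = p_t + P(alpha, f0), with independent pink-noise
    streams for the x and y offsets; `scale_deg` is an illustrative step size."""
    steps = np.stack([pink_noise(n_steps, alpha, gain, rng),
                      pink_noise(n_steps, alpha, gain, rng)], axis=1) * scale_deg
    return np.asarray(p0, float) + np.cumsum(steps, axis=0)

gaze = simulate_fixation([0.0, 0.0], n_steps=1000)   # jittered look points (degrees)
```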
Modeling Saccadic Velocity To effect movement of the look point pt between fixation points, a model of saccades is required, specifying both the movement and the duration of the gaze point. We start with an approximation to the force-time function assumed by a symmetric-impulse variability model [1]. This function, qualitatively similar to symmetric limb movement trajectories, describes an acceleration profile that rises to a maximum, returns to zero about halfway through the movement, and is then followed by an almost mirror-image deceleration phase. To model a symmetric acceleration function, we can choose a combination of Hermite blending functions h11(t) and h10(t), so that Ḧ(t) = h10(t) + h11(t), where h10(t) = t³ − 2t² + t, h11(t) = t³ − t², t ∈ [0, 1], and Ḧ(t) is the acceleration of the gaze point over the normalized time interval t ∈ [0, 1]. Integrating acceleration produces velocity, Ḣ(t) = (1/2)t⁴ − t³ + (1/2)t², which when integrated once more produces position H(t) = (1/10)t⁵ − (1/4)t⁴ + (1/6)t³ on the normalized interval t ∈ [0, 1] (see Fig. 4). Given an equation for position over a normalized time window (t ∈ [0, 1]), we can now stretch this time window at will to any given length t ∈ [0, Δt]. Because the distance between gaze target points is known a priori, we can use these distances (pixel distances converted to amplitudes in degrees of visual angle) as input to the main sequence (1) to obtain the saccade duration.
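Writing out the integration behind these profiles as a check:

```latex
% Worked check: sum the Hermite blending functions, then integrate twice with
% zero initial conditions.
\begin{aligned}
\ddot H(t) &= h_{10}(t) + h_{11}(t) = (t^3 - 2t^2 + t) + (t^3 - t^2) = 2t^3 - 3t^2 + t,\\
\dot H(t)  &= \int_0^t \ddot H(u)\,du = \tfrac{1}{2}t^4 - t^3 + \tfrac{1}{2}t^2,\\
H(t)       &= \int_0^t \dot H(u)\,du = \tfrac{1}{10}t^5 - \tfrac{1}{4}t^4 + \tfrac{1}{6}t^3,
\qquad H(1) = \tfrac{6 - 15 + 10}{60} = \tfrac{1}{60}.
\end{aligned}
```

The endpoint H(1) = 1/60 is why Fig. 4 plots the scaled position 60H(t), which normalizes the position curve to peak at 1.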
Assuming data collected from the eye tracker does not deviate greatly from the main sequence found in the literature [5, 29], we set the expected saccade duration to that given by (1) but augmented with a 10° targeting error. We also add in a slight temporal perturbation to the predicted saccade duration, based on empirical observations. Saccade duration is thus modeled as

Δt = 2.2 N(θ, σ = 10°) + 21 + N(0, 0.01) (milliseconds)   (4)
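A minimal sketch of this duration model (names are ours; it simply evaluates (4), with (1) as its deterministic core):

```python
# Sketch of the saccade-duration model: the main sequence of Eq. (1) plus the
# targeting-error and temporal perturbations of Eq. (4).
import numpy as np

def saccade_duration_ms(amplitude_deg, targeting_error_deg=10.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    noisy_amplitude = rng.normal(amplitude_deg, targeting_error_deg)  # ~10 deg targeting error
    # Note: for small amplitudes this sketch can occasionally draw a negative
    # value; the chapter does not discuss clamping, so none is applied here.
    return 2.2 * noisy_amplitude + 21.0 + rng.normal(0.0, 0.01)       # Eq. (4), milliseconds

print(saccade_duration_ms(10.0))   # roughly 2.2 * 10 + 21 = 43 ms, plus noise
```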
Pupil Dilation Pupil dilation is controlled by top-down and bottom-up processes. There is evidence that it responds to cognitive load [2, 6], ambient light changes [7], and visual detection [49]. However, for the purposes of animation, pupil unrest, the slight oscillation between pupil dilation and constriction, is perhaps more interesting to model. Stark et al. [54] describe pupil diameter fluctuations as noise in a biological system, with the major component of the pupil unrest being random noise in the 0.05–0.3 Hz range, with transfer function G(s) = 0.16 exp(−0.18s)/(1 + 0.1s)³ and gain equal to 0.16. This transfer function can be modeled by a third-order Butterworth filter with system function G3(s) = 1/(s³ + 2s² + 2s + 1) with cutoff frequency set to 1.5915 (see Hollos and Hollos [26]). Such a filter can thus be used to smooth Gaussian noise (e.g., N(0, 0.5)) but will result in uncorrelated noise. In recent work on eye capture, Bérard et al. [9] model pupil constriction/dilation but only via linear interpolation of keyframes in response to light conditions. They did not, however, procedurally animate the pupil as a function of pupil unrest. Pamplona et al. [45] modeled pupil unrest (hippus), but via small random variations to light intensity, likely white noise, although they did not specify this directly. We model pupil unrest directly, via pink noise perturbation. We can model pupil diameter oscillation with pink noise by once again filtering white noise with the same digital pink noise filter as for the microsaccadic perturbations. For pupil oscillation, we chose a 4th-order filter for reshaping pupil oscillation modeled as Gaussian noise, N(0, σ = 0.5). We found that pink noise parameters α = 1.6 and f0 = 0.35 produced a fairly natural simulation of pupil unrest [15].
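A minimal sketch of the Butterworth variant described above (assuming SciPy is available; the sampling rate and the exact cutoff, chosen inside the stated 0.05–0.3 Hz band, are our own choices). The chapter's preferred pink-noise version, P(α = 1.6, f0 = 0.35), could instead reuse the pink_noise sketch given earlier.

```python
# Sketch of pupil unrest (hippus) as low-pass-filtered Gaussian noise, in the
# spirit of the third-order Butterworth smoothing described above.
import numpy as np
from scipy.signal import butter, lfilter

def pupil_unrest(duration_s, fs=60.0, cutoff_hz=0.25, sigma=0.5, rng=None):
    """Relative pupil-diameter fluctuation sampled at fs Hz."""
    if rng is None:
        rng = np.random.default_rng()
    white = rng.normal(0.0, sigma, int(duration_s * fs))
    b, a = butter(3, cutoff_hz, btype="low", fs=fs)   # third-order Butterworth low-pass
    return lfilter(b, a, white)

diameter = 4.0 * (1.0 + 0.05 * pupil_unrest(10.0))     # e.g., around a 4 mm baseline
```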
Procedural Eye Movement Model In the previous sections, we reviewed important concepts relevant to the animation of eye movements and presented components of a procedural model for eye movement synthesis, consisting of rotations that adhere to Donders' and Listing's laws, and a model of gaze that incorporates the saccadic main sequence, along with gaze jitter and pupil unrest. We model both microsaccades and pupil unrest by 1/f^α or pink noise, where 0 < α < 2, with exponent α usually close to 1. The use of 1/f^α pink noise to model
microsaccadic jitter and pupil oscillation is a key aspect of our simulation. Various colors of noise have appeared in the computer graphics literature, e.g., blue noise for anti-aliasing, green noise for halftoning, and white noise for common random number generation [66]. Pink noise is regarded as suitable for describing physical and biological distributions, e.g., plants [11, 43] and galaxies [33], as well as the behavior of biosystems in general [56]. Aks et al. [3] suggest that memory across eye movements may serve to facilitate selection of information from the visual environment, leading to a complex and self-organizing (saccadic) search pattern produced by the oculomotor system reflecting 1/f pink noise. To complete the model, we add periocular motions and run the simulation.
Periocular Motions The motions of the upper and lower eyelids comprise saccadic lid and smooth pursuit movements, where the eyelid motion is closely related to the motion of the corresponding eyeball, and blinks. We create the saccadic lid and smooth pursuit movements by rigging the eyelids to the eyeball. To model blinks, we approximate the eyelid closure function proposed by Trutoiu et al. [58]. We use a piecewise function to model the eyelid blink in two temporal components, the faster closure followed by a slower opening:

C(t) = a(t − μ)² for t ≤ μ, and C(t) = b e^(c log(t − μ + 1)) otherwise   (5)
with C = 1 indicating the lid fully open and C = 0 the lid fully closed, where t ∈ [0, 100] represents normalized percent blink duration (scalable to an arbitrary duration), μ = 37 the point at which the lid should reach full (or nearly full) closure, a = 0.98 indicating percent lid closure at the start of the blink, and b = 1.18 and c = μ/100 parameters used to shape the asymptotic lid opening function. Trutoiu et al. [58] recorded blink frequencies from their actors of 6.6, 8.2, and 27.0 blinks per minute, or 14 blinks per minute on average. These rates appear to be within the normal limits reported by Bentivoglio et al. [8], namely, 17 blinks per minute, ranging from 4.5 while reading to 26 during conversation. Simulating conversation, our procedural model uses 25 blinks per minute as the average, with a mean duration of 120 ms. Unlike Steptoe et al. [55], we do not use kinematics to model blinks; rather, we use a simplified stochastic model of blink duration.
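Because the extracted form of (5) is ambiguous, the sketch below only reproduces the two-phase shape it describes: a fast parabolic closure until t = μ followed by a slower, asymptotic reopening. The normalization by μ² and the clamping to [0, 1] are our assumptions, not a verbatim reconstruction of Trutoiu et al.'s fit.

```python
# Sketch of a two-phase blink-closure curve in the spirit of Eq. (5): fast
# parabolic closure until t = mu, then slower asymptotic reopening. The mu**2
# normalization and the [0, 1] clamp are our own assumptions.
import numpy as np

def blink_closure(t, mu=37.0, a=0.98, b=1.18, c=0.37):
    """Lid openness C(t) in [0, 1] for t in [0, 100] (percent blink duration)."""
    t = np.asarray(t, float)
    closing = a * (t - mu) ** 2 / mu ** 2                              # C(0) ~ a, C(mu) = 0
    opening = b * (1.0 - np.exp(-c * np.log(np.maximum(t - mu, 0.0) + 1.0)))
    return np.clip(np.where(t <= mu, closing, opening), 0.0, 1.0)

t = np.linspace(0.0, 100.0, 101)
curve = blink_closure(t)   # rescale t to an actual blink duration (~120 ms) when animating
```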
Running the Simulation When running the simulation, it is important to keep the simulation time step (h) small, e.g., h = 0.0001. When about to execute a saccade, set the saccade clock t = 0, and then, while t < Δt, perform the following simulation steps:
1. τ = t/Δt (scale the interpolant to the saccade time window)
2. pt = Ci−1 + H(τ)Ci (advance position)
3. t = t + h (advance time by the time step h)
where Ci denotes the ith 2D look point coordinates in the sequence and pt is the saccade position, both in vector form. Setting the time step h to an arbitrarily small value allows dissociation of the simulation clock from the sampling rate. We can thus sample the synthetic eye tracking data at arbitrary sampling periods, e.g., d = 1, d = 16, or d = 33 ms for sampling rates of 1000 Hz, 60 Hz, or 30 Hz, respectively. Unfortunately, eye trackers' sampling rates are not precise, or rather, eye trackers' sampling periods are generally non-uniform, most likely due to competing processes on the computer used to run the eye tracking software and/or due to network latencies. The simulation sampling period can be modeled by adding in a slight random temporal perturbation, e.g., N(0, σ = 0.5) milliseconds.
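A minimal sketch of this loop, together with resampling at an eye-tracker-like rate, is given below. We read step 2 as Hermite interpolation between consecutive look points Ci−1 and Ci (one possible interpretation of the extracted pseudocode); the function names, units, and the fixed 60 Hz resampling rate are our own.

```python
# Sketch of the saccade-execution loop above, plus resampling with a jittered
# sampling period. Step 2 is read as interpolation between look points.
import numpy as np

def H(tau):
    """Normalized saccade position profile, H(1) = 1 (i.e., 60 * H(t) above)."""
    return 60.0 * (tau ** 5 / 10.0 - tau ** 4 / 4.0 + tau ** 3 / 6.0)

def execute_saccade(c_prev, c_next, duration_s, h=1e-4):
    c_prev, c_next = np.asarray(c_prev, float), np.asarray(c_next, float)
    t, samples = 0.0, []
    while t < duration_s:
        tau = t / duration_s                                  # 1: scale interpolant
        samples.append(c_prev + H(tau) * (c_next - c_prev))   # 2: advance position
        t += h                                                # 3: advance time
    return np.array(samples)

def resample(samples, h=1e-4, rate_hz=60.0, rng=None):
    """Pick samples at roughly 1/rate_hz intervals with ~N(0, 0.5 ms) period jitter."""
    if rng is None:
        rng = np.random.default_rng()
    out, t = [], 0.0
    while t < len(samples) * h:
        out.append(samples[int(t / h)])
        t += 1.0 / rate_hz + rng.normal(0.0, 0.0005)
    return np.array(out)

trajectory = resample(execute_saccade([0.0, 0.0], [8.0, 3.0], duration_s=0.043))
```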
Summary: Listing the Sources of Variation To recount, the stochastic model of eye movements is based on the infusion of probabilistic noise at various points in the simulation:
• Fixation durations, modeled in this instance by N(1.081, σ = 2.9016) (seconds), the average and standard deviation from Duchowski et al. [18],
• Microsaccadic fixation jitter, modeled by pink noise P(α = 0.6, f0 = 0.85) (degrees visual angle),
• Saccade durations, modeled by (4), and
• Sampling period, N(1000/F, σ = 0.5) (milliseconds), with F the sampling frequency (Hz).
For rendering purposes, the eye movement data stream is appended with:
• Blink duration, modeled as N(120, σ = 70) (ms), and
• Pupil unrest, modeled by pink noise P(α = 1.6, f0 = 0.35) (relative diameter).
Collectively, the above sources of error can be considered as a stochastic perturbation of the gaze point about its current location (collected in the parameter sketch at the end of this section), i.e.,

pt+h = pt + P(α, f0) + η   (6)
where the primary source of microsaccadic jitter is represented by pink noise P(α, f0) and η represents the various other sources of variation listed above. See Duchowski et al. [18] for further details. For affective eye movement synthesis, modulating α will result in modulation of the jitter. Modulation of f0 controls the amount of dispersion of the fixational points. Both
model parameters can thus be used to control the expected appearance of emotional state. What remains is tuning these parameters to effect emotional expression. Results from our perceptual experiments thus far have shown that animations based on the procedural model with pink noise jitter were consistently perceived as highly natural when evaluated alongside alternative animations [15]. We have also found that some jitter, but not too much, captures visual attention better than when jitter is excessive or not present.
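For convenience, the noise sources above can be gathered in a single configuration object. The container below is our own; the numeric values are copied from the chapter.

```python
# Convenience container (ours) for the stochastic parameters of the procedural
# model, with values taken from the summary above.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EyeModelNoise:
    fixation_duration_s: Tuple[float, float] = (1.081, 2.9016)  # N(mu, sigma), seconds [18]
    jitter_pink: Tuple[float, float] = (0.6, 0.85)              # P(alpha, f0), microsaccadic jitter
    saccade_targeting_error_deg: float = 10.0                   # sigma in Eq. (4)
    saccade_temporal_jitter_ms: float = 0.01                    # sigma in Eq. (4)
    sampling_period_jitter_ms: float = 0.5                      # N(1000/F, 0.5 ms)
    blink_duration_ms: Tuple[float, float] = (120.0, 70.0)      # N(mu, sigma)
    blink_rate_per_min: float = 25.0                            # conversational setting
    pupil_pink: Tuple[float, float] = (1.6, 0.35)               # P(alpha, f0), relative diameter
```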
Conclusion In this chapter, we have summarized some of the physiological characteristics of eye motions and presented a physiologically plausible procedural model of eye movements, complete with blinks, saccades, and fixations augmented by microsaccadic jitter and pupil unrest, both modeled by 1/f^α or pink noise. Our procedural model of gaze motion constitutes the basis of a "bottom-up" model of gaze rotations, modeled by quaternions which are used to effect eyeball rotation in response to a "look point" in space projected onto a 2D plane in front of the eye. The location of this look point is determined by the procedural model simulated over time, which is tasked with producing a characteristic fixation/saccade signal that models recorded gaze data. The procedural model differs from others, e.g., those driven by saliency such as Oyekoya et al.'s [44], or those driven by head movement propensity [47]. These latter models can be considered "top-down" models as they are more concerned with prediction of locations of gaze that, say, an autonomous avatar is likely to make. Our model is concerned with low-level signal characteristics of the fixation point regardless of how it was determined. Subjective evaluations have shown that the absence of noise is clearly unnatural. Microsaccadic jitter therefore appears to be a crucial ingredient in the quest toward natural eye movement rendering. Gaze jitter is naturally always present since the eyes are never perfectly still. We believe that correctly modeling the jitter that characterizes gaze fixation is a key factor in promoting the believability and acceptance of synthetic actors and avatars, thereby bridging the Uncanny Valley [40]. Acknowledgments This material is based in part upon work supported by the US National Science Foundation under Grant No. IIS-1423189. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
References 1. Abrams RA, Meyer DE, Kornblum S (1989) Speed and accuracy of saccadic eye movements: characteristics of impulse variability in the oculomotor system. J Exp Psychol Hum Percept Perform 15(3):529–543 2. Ahern S, Beatty J (1979) Pupillary responses during information processing vary with scholastic aptitude test scores. Science 205(4412):1289–1292
3. Aks DJ, Zelinsky GJ, Sprott JC (2002) Memory across eye-movements: 1 / f dynamic in visual search. Nonlinear Dynamics Psychol Life Sci 6(1):1–25 4. Andrist S, Pejsa T, Mutlu B, Gleicher M (2012) Designing effective gaze mechanisms for virtual agents. In: Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems, CHI’12. ACM, New York, pp 705–714. doi:10.1145/2207676.2207777, http://doi.acm.org/10.1145/2207676.2207777 5. Bahill AT, Clark M, Stark L (1975) The main sequence. A tool for studying human eye movements. Math Biosci 24(3/4):191–204 6. Beatty J (1982) Task-evoked pupillary responses, processing load, and the structure of processing resources. Psychol Bull 91(2):276–292 7. Beatty J, Lucero-Wagoner B (2000) The pupillary system. In: Cacioppo JT, Tassinary LG, Bernston GG (eds) Handbook of psychophysiology, 2nd edn. Cambridge, Cambridge University Press, pp 142–162 8. Bentivoglio AR, Bressman SB, Cassetta E, Carretta D, Tonali P, Albanese A (1997) Analysis of blink rate patterns in normal subjects. Mov Disord 12(6):1028–1034 9. Bérard P, Bradley D, Nitti M, Beeler T, Gross M (2014) High-quality capture of eyes. ACM Trans Graph 33(6):2231–22312. doi:10.1145/2661229.2661285 10. Campbell CS, Maglio PP (2001) A robust algorithm for reading detection. In: ACM workshop on perceptive user interfaces. ACM Press, New York, pp 1–7 11. Condit R, Ashton PS, Baker P, Bunyavejchewin S, Gunatilleke S, Gunatilleke N, Hubbell SP, Foster RB, Itoh A, LaFrankie JV, Lee HS, Losos E, Manokaran N, Sukumar R, Yamakura T (2000) Spatial patterns in the distribution of tropical tree species. Science 288(5470):1414–1418 12. Davson H (1980) Physiology of the eye, 4th edn. Academic, New York, NY 13. Di Stasi LL, McCamy MB, Catena A, Macknik SL, Cañas JJ, Martinez-Conde S (2013) Microsaccade and drift dynamics reflect mental fatigue. Eur J Neurosci 38(3):2389–2398 14. Duchowski A, Jörg S (2015) Modeling physiologically plausible eye rotations: adhering to Donders’ and listing’s laws. In: Proceedings of computer graphics international (short papers) (2015) 15. Duchowski A, Jörg S, Lawson A, Bolte T, Świrski L, Krejtz K (2015) Eye movement synthesis with 1/f pink noise. In: Motion in Games (MIG) 2015, Paris, France 16. Duchowski AT (2007) Eye tracking methodology: theory & practice, 2nd edn. Springer, London, UK 17. Duchowski AT, House DH, Gestring J, Wang RI, Krejtz K, Krejtz I, Mantiuk R, Bazyluk B (2014) Reducing visual discomfort of 3D stereoscopic displays with gaze-contingent depth-offield. In: Proceedings of the ACM symposium on applied perception, SAP’14. ACM, New York, NY, pp 39–46. doi:10.1145/2628257.2628259, http://doi.acm.org/10.1145/2628257.2628259 18. Duchowski AT, Jörg S, Allen TN, Giannopoulos I, Krejtz K (2016) Eye movement synthesis. In: Proceedings of the ninth biennial acm symposium on eye tracking research & applications, ETRA’16. ACM, New York, NY, pp 147–154. doi:10.1145/2857491.2857528, http://doi.acm. org/10.1145/2857491.2857528 19. Ferman L, Collewijn H, Van den Berg AV (1987) A direct test of listing’s law – I. human ocular torsion measured in static tertiary positions. Vision Res 27(6):929–938 20. Garau M, Slater M, Vinayagamoorthy V, Brogni A, Steed A, Sasse MA (2003) The impact of avatar realism and eye gaze control on perceived quality of communication in a shared immersive virtual environment. In: Human factors in computing systems: CHI 03 conference proceedings. ACM Press, New York, pp 529–536 21. 
Glenn B, Vilis T (1992) Violations of listing’s law after large eye and head gaze shifts. J Neurophysiol 68(1):309–318 22. Grzywacz NM, Norcia AM (1995) Directional selectivity in the cortex. In: Arbib MA (ed) The handbook of brain theory and neural networks. Cambridge, MA, The MIT Press, pp 309–311 23. Gu E, Lee SP, Badler JB, Badler NI (2008) Eye movements, saccades, and multi-party conversations. In: Deng Z, Neumann U (eds) Data-driven 3D facial animation. Springer, London, UK, pp 79–97. doi:10.1007/978-1-84628-907-1_4
24. Haslwanter T (1995) Mathematics of three-dimensional eye rotations. Vision Res 35 (12):1727–1739 25. Hollos S, Hollos JR (2015) Creating noise. Exstrom Laboratories, LLC, Longmont, CO, http://www.abrazol.com/books/noise/ (last accessed Jan. 2015). ISBN 9781887187268 (ebook) 26. Hollos S, Hollos JR (2015) Recursive Digital Filters: A Concise Guide. Exstrom Laboratories, LLC, Longmont, CO, http://www.abrazol.com/books/filter1/ (last accessed Jan. 2015). ISBN 9781887187244 (ebook) 27. Hubel DH (1988) Eye, brain, and vision. Scientific American Library, New York, NY 28. Kashihara K, Okanoya K, Kawai N (2014) Emotional attention modulates microsaccadic rate and direction. Psychol Res 78:166–179 29. Knox PC (2012) The parameters of eye movement (2001). Lecture Notes, URL: http://www.liv. ac.uk/~pcknox/teaching/Eymovs/params.htm (last accessed November 2012) 30. Komogortsev OV, Karpov A (2013) Liveness detection via oculomotor plant characteristics: attack of mechanical replicas. In: Proceedings of the IEEE/IARP international conference on biometrics (ICB), pp 1–8 31. Komogortsev OV, Karpov A, Holland CD (2015) Attack of mechanical replicas: liveness detection with eye movements. IEEE Trans Inform Forensics Secur 10(4):716–725 32. Krejtz K, Duchowski AT, Çöltekin A (2014) High-level gaze metrics from map viewing: charting ambient/focal visual attention. In: Kiefer P, Giannopoulos I, Raubal M, Krüger A (eds) 2nd international workshop in eye tracking for spatial research (ET4S) 33. Landy SD (1999) Mapping the universe. Sci Am 224:38–45 34. Laretzaki G, Plainis S, Vrettos I, Chrisoulakis A, Pallikaris I, Bitsios P (2011) Threat and trait anxiety affect stability of gaze fixation. Biol Psychol 86(3):330–336 35. Lee SP, Badler JB, Badler NI (2002) Eyes alive. ACM Trans Graph 21(3):637–644. doi:10.1145/566654.566629, http://doi.acm.org/10.1145/566654.566629 36. Looser CE, Wheatley T (2010) The tipping point of animacy. How, when, and where we perceive life in a face. Psychol Sci 21(12):1854–62 37. Ma X, Deng Z (2009) Natural eye motion synthesis by modeling gaze-head coupling. In: IEEE virtual reality, pp 143–150. Lafayette, LA 38. Martinez-Conde S, Macknik SL, Troncoso Xoana G, Hubel DH (2009) Microsaccades: a neurophysiological analysis. Trends Neurosci 32(9):463–475 39. Mok D, Ro A, Cadera W, Crawford JD, Vilis T (1992) Rotation of listing’s plane during vergence. Vision Res 32(11):2055–2064 40. Mori M (1970) The uncanny valley. Energy 7(4):33–35 41. Murphy H, Duchowski AT (2002) Perceptual gaze extent & level of detail in VR: looking outside the box. In: Conference abstracts and applications (sketches & applications), Computer graphics (SIGGRAPH) annual conference series. ACM, San Antonio, TX 42. Murray N, Roberts D, Steed A, Sharkey P, Dickerson P, Rae J, Wolff R (2009) Eye gaze in virtual environments: evaluating the need and initial work on implementation. Concurr Comput 21:1437–1449 43. Ostling A, Harte J, Green J (2000) Self-similarity and clustering in the spatial distribution of species. Science 27(5492):671 44. Oyekoya O, Steptoe W, Steed A (2009) A saliency-based method of simulating visual attention in virtual scenes. In: Reality V (ed) Software and technology. New York, ACM, pp 199–206 45. Pamplona VF, Oliveira MM, Baranoski GVG (2009) Photorealistic models for pupil light reflex and iridal pattern deformation. ACM Trans Graph 28(4):106:1–106:12. doi:10.1145/ 1559755.1559763, http://doi.acm.org/10.1145/1559755.1559763 46. 
Pejsa T, Mutlu B, Gleicher M (2013) Stylized and performative gaze for character animation. In: Navazo I, Poulin P (eds) Proceedings of EuroGrpahics. EuroGraphics 47. Peters C, Qureshi A (2010) A head movement propensity model for animating gaze shifts and blinks of virtual characters. Comput Graph 34:677–687
48. Porrill J, Ivins JP, Frisby JP (1999) The variation of torsion with vergence and elevation. Vision Res 39:3934–3950 49. Privitera CM, Renninger LW, Carney T, Klein S, Aguilar M (2008) The pupil dilation response to visual detection. In: Rogowitz BE, Pappas T (eds) Human vision and electronic imaging, vol 6806. SPIE, Bellingham, WA 50. Quaia C, Optican LM (2003) Three-dimensional Rotations of the Eye. In: Kaufman PL, Alm A (eds) Adler’s phsyiology of the eye: clinical application, 10th edn. C. V. Mosby Co., St. Louis, pp 818–829 51. Robinson DA (1968) The oculomotor control system: a review. Proc IEEE 56(6):1032–1049 52. Rolfs M (2009) Microsaccades: Small steps on a long way. Vision Res 49(20):2415–2441. doi:10.1016/j.visres.2009.08.010, http://www.sciencedirect.com/science/article/pii/ S0042698909003691 53. Ruhland K, Andrist S, Badler JB, Peters CE, Badler NI, Gleicher M, Mutlu B, McDonnell R (2014) Look me in the eyes: a survey of eye and gaze animation for virtual agents and artificial systems. In: Lefebvre S, Spagnuolo M (ed) Computer graphics forum. EuroGraphics STAR – State of the Art Report. EuroGraphics. 54. Stark L, Campbell FW, Atwood J (1958) Pupil unrest: an example of noise in a biological servomechanism. Nature 182(4639):857–858 55. Steptoe W, Oyekoya O, Steed A (2010) Eyelid kinematics for virtual characters. Comput Animat Virtual World 21(3–4):161–171 56. Szendro P, Vincze G, Szasz A (2001) Pink-noise behaviour of biosystems. Eur Biophys J 30 (3):227–231 57. Templin K, Didyk P, Myszkowski K, Hefeeda MM, Seidel HP, Matusik W (2014) Modeling and optimizing eye vergence response to stereoscopic cuts. ACM Trans. Graph 33(4):8 Article 145 (July 2014), DOI = http://dx.doi.org/10.1145/2601097.2601148 58. Trutoiu LC, Carter EJ, Matthews I, Hodgins JK (2011) Modeling and animating eye blinks. ACM Trans Appl Percept (TAP) 2(3):17:1–17:17 59. Tweed D, Cadera W, Vilis T (1990) Computing three-dimensional eye position quaternions and eye velocity from search coil signals. Vision Res 30(1):97–110 60. Tweed D, Vilis T (1990) Geometric relations of eye position and velocity vectors during saccades. Vision Res 30(1):111–127 61. Usher M, Stemmler M, Olami Z (1995) Dynamic pattern formation leads to 1/f noise in neural populations. Phys Rev Lett 74(2):326–330 62. van Rijn LJ (1994) Torsional eye movements in humans. Ph.D. thesis, Erasmus Universiteit Rotterdam, Rotterdam, The Netherlands 63. Vertegaal R (1999) The GAZE groupware system: mediating joint attention in mutiparty communication and collaboration. In: Human factors in computing systems: CHI’99 conference proceedings. ACM Press, New York, pp 294–301 64. Yang Z, Zhao Q, Keefer E, Liu W (2009) Noise characterization, modeling, and reduction for in vivo neural recording. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A (eds) Advances in neural information processing systems, vol 22., pp 2160–2168 65. Yeo SH, Lesmana M, Neog DR, Pai DK (2012) Eyecatch: simulating visuomotor coordination for object interception. ACM Trans Graph 31(4):42:1–42:10 66. Zhou Y, Huang H, Wei LY, Wang R (2012) Point sampling with general noise spectrum. ACM Trans Graph 31(4):76:1–76:11. doi:10.1145/2185520.2185572, URL: http://doi.acm.org/10. 1145/2185520.2185572
Head Motion Generation Najmeh Sadoughi and Carlos Busso
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Role of Head Motion in Human Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Relation Between Head Motion and Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Head Movement Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rule-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Data-Driven Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hybrid Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Open Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Speech-Driven Models Using Synthetic Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Exploring Entrainment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Modeling Personality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joint Models to Integrate Head Motion with Other Gestures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 4 4 7 8 9 12 15 19 19 20 20 21 21 22
Abstract
Head movement is an important part of body language. Head motion plays a role in communicating lexical and syntactic information. It conveys emotional and personality traits. It plays an important role in acknowledging active listening. Given these communicative functions, it is important to synthesize Conversational Agents (CAs) with meaningful human-like head motion sequences, which are timely synchronized with speech. There are several studies that have focused on synthesizing head movements. Most studies can be categorized as rule-based or data-driven frameworks. On the one hand, rule-based methods define rules that map semantic labels or communicative goals to specific head motion sequences,
N. Sadoughi (*) • C. Busso (*) Multimodal Signal Processing Lab, University of Texas at Dallas, Dallas, TX, USA e-mail: [email protected]; [email protected]
# Springer International Publishing Switzerland 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_4-1
which are appropriate for the underlying message (e.g., nodding for affirmation). However, the range of head motion sequences generated by these systems is usually limited, resulting in repetitive behaviors. On the other hand, data-driven methods rely on recorded head motion sequences, which are used either to concatenate existing sequences, creating new realizations of head movements, or to build statistical frameworks that are able to synthesize novel realizations of head motion behaviors. Due to the strong correlation between head movements and speech prosody, these approaches usually rely on speech to drive the head movements. These methods can capture a broader range of movements displayed during human interaction. However, even when the generated head movements are tightly synchronized with speech, they may not convey the underlying discourse function or intention of the message. The advantages of rule-based and data-driven methods have inspired several studies to create hybrid methods that overcome the aforementioned limitations. These hybrid approaches generate the movements using parametric or nonparametric models, conditioning them not only on speech but also on the semantic content. This chapter reviews the most influential frameworks to generate head motion. It also discusses open challenges that can move this research area forward. Keywords
Conversational agent • Rule-based animation • Data-driven animation • Speech-driven animation • Head movement generation • Semantic content • Backchannel • Nonverbal behaviors • Expressive head motion • Rapport • Embodied conversational agents • Visual prosody
Introduction Head movement is an integral part of our body language used during human interactions. Head motion can play a communicative role, displaying emblems (i.e., gestures conveying specific meaning) and regulators (i.e., gestures to control the turn-taking sequence) (Heylen 2005). It is also important for establishing rapport by providing suitable backchannels while listening to conversational partners (Gratch et al. 2006; Huang et al. 2011). Having rhythmic head movements coupled with speech prosodic features increases speech intelligibility by signaling syntactic boundaries (Munhall et al. 2004). Head movements are also used to convey the mood of the speaker (Busso et al. 2007a, b). They can also be used to express uncertainty (Marsi and van Rooden 2007). Given the key role of head motion during human interaction, it is not surprising that there is a need to model and capture all these aspects to generate believable conversational agents (CAs). CAs without head movements are perceived as less warm (Welbergen et al. 2015) and less natural (Busso et al. 2007a; Mariooryad and Busso 2012). This chapter describes influential frameworks proposed to synthesize head motion sequences for CAs, describing the main challenges. For head movement generation, previous studies have proposed frameworks that can be categorized into two main approaches: rule-based methods (Cassell et al.
1994; Kopp et al. 2006; Liu et al. 2012; Pelachaud et al. 1996) and data-driven frameworks (Busso et al. 2005; Chiu and Marsella 2011; Chuang and Bregler 2005; Deng et al. 2004; Mariooryad and Busso 2012; Sadoughi et al. 2014; Taylor et al. 2006). The predominant approach to generate head motion is rule-based systems, where hand crafted head movements such as shaking and nodding are carefully selected and stored. The semantic content is then analyzed generating head motion sequences following the selected rules. These methods usually define several heuristic rules derived from previous psychological and observational studies, mapping the syntactic and semantic structure of the utterance to prototypical head motion sequences. These methods also define rules to synchronize head movements with the underlying speech. The second category corresponds to data-driven methods, where head motion sequences are generated from existing recordings. These data-driven methods either concatenate existing head motion sequences according to given criteria or learn statistical models to capture the distribution of head motion. A prevalent modality used in previous data-driven studies is a set of speech prosodic features, leveraging the strong coupling between head motion and speech (Busso and Narayanan 2007). The two main approaches for head movement generation have advantages and disadvantages. On the one hand, rule-based systems have the advantage of considering the meaning of the message to choose appropriate movements. However, the head movements may seem repetitive, since the range and variability of head motions are usually limited to the predefined sequences per type of movement stored in the system. Under similar conditions, the system will tend to generate similar behaviors oversimplifying the complex relationship between verbal and nonverbal information. Furthermore, forcing synchronization between behaviors and speech is challenging (e.g., the coupling between speech and head motion). On the other hand, data-driven frameworks have the potential to capture the range of behaviors seen in real human interaction, creating novel realizations that resemble natural head motions. When speech features are used to generate head motion, the models can automatically learn the synchronization between the prosodic structure and head movements. However, using solely data-driven models may disregard the semantic content of the message, resulting in movements that are not aligned with the message. These systems may generate perfectly synchronized emblems contradicting the message (e.g., shaking the head during affirmation). To balance the tradeoff between naturalness and appropriateness, studies have attempted to bridge the gap between both methods creating hybrid approaches that leverage the advantages of both methods, overcoming their limitations (Chiu et al. 2015; Sadoughi and Busso 2015; Sadoughi et al. 2014; Stone et al. 2004). This chapter describes the role of head motion in human interaction, emphasizing the importance of synthesizing behaviors that properly convey the relation between head motion and other verbal and nonverbal channels. We review influential studies proposing rule-based, data-driven, and hybrid frameworks. The chapter also discusses open challenges that can lead to new advances in this research area.
Table 1 Some of the head motion roles identified by Heylen (2005)
Head motion functions:
• Show affirmation or negation
• Show inclusivity or intensification
• Organize the interaction
• Mark the listing
• Mark the contrast
• Show the level of understanding
• Mark uncertain statements
• Facilitate turn taking/giving
• Signal ground holding
• Signal the mood
• Signal shyness and hesitation
• Backchannel requests
State of the Art Head motion plays an important role during human communication. This section summarizes relevant studies describing the function of head motion during human interaction (section “Role of Head Motion in Human Interaction”), emphasizing the strong relationship with other verbal and nonverbal channels (section “Relation between Head Motion and Speech”).
Role of Head Motion in Human Interaction Heylen (2005) surveyed studies analyzing the role of head motion during human conversation, listing 25 different roles, including enhancing communicative attention, marking contrast between sentences, and communicating the degree of understanding. Table 1 lists some of these functions. It is clear that head movement is an essential part of body language, which facilitates human-human interaction not only while speaking, but also while listening (McClave 2000). Speakers use head movements to reinforce the meaning of the message. We often use emblems such as head nods for affirmations, head shakes for negation, and head tilt with words like “um,” “uh,” and “well” (Lee and Marsella 2006; Liu et al. 2012). Lee and Marsella (2006) investigated the nonverbal behaviors of individuals during dyadic interactions. They annotated the videos in terms of a set of discourse functions including affirmation, negation, contrast, intensification, inclusivity, obligation, listing, assumption, possibility, response, request, and word search. They found that generally there are nonverbal behavior patterns related to head motion accompanying these labels (e.g., head shake during negation, head nod during affirmation, head shake during the use of words such as “really,” and lateral head sweep during the use of words such as “everything,” “all,” and “whole”). Head
motion is also used to parse syntactic information, creating visual markers to segment phrases within an utterance. Hadar et al. (1983) recorded head movements from four subjects during conversation, reporting that, after removing pauses of more than 1 s, 58.8% of the still head poses occurred during speech pauses. Graf et al. (2002) frequently observed an initial head movement after a pause, followed by speech. Another important communicative function of head motion is to stress words, functioning as a visual marker for intonation (Graf et al. 2002; Moubayed et al. 2010). These aspects are important during human interaction. For example, Munhall et al. (2004) studied the effect of head movements on speech intelligibility. They conducted an evaluation where an animated face replicated the original head and face motions of recorded sentences. The task was to recognize speech in the presence of noisy audio. They evaluated two conditions for the animated face: with head motion and without head motion. They counted the number of correctly identified syllables, showing improved performance when the animated face contained head motion. Head motion also plays a key role while listening, where people provide nonverbal feedback to the speaker using primarily their head movements. A common behavior is to nod to acknowledge active listening (McClave 2000). Ishi et al. (2014) analyzed the occurrences of head movements during listeners' backchannels such as "yes" and "uhm," observing one or multiple head nods which were timely synchronized with the verbal backchannel. Head movements also convey the affective state of the speaker (Busso et al. 2007b; Busso and Narayanan 2007). In our previous work, we studied the displacement of head motions of an individual expressing different emotions: happiness, anger, sadness, and neutrality (Busso and Narayanan 2007). Our results showed significant differences in head motions across all emotions, except between happiness and anger. In another study (Busso et al. 2007a), we demonstrated that head motion behaviors are discriminative for emotion recognition. Using only global statistics derived from head motion trajectories at the sentence level, we were able to recognize these four emotional states with 65.5% accuracy (performance at chance was 25%). This study also demonstrated the contribution of head motion to emotional perception. We generated animations of expressive sentences. The novelty of the approach was that we purposely created mismatches between the emotion in the sentence (e.g., happiness) and the emotion on the head motion sequence (e.g., sadness). The corpus used in this study contains the same sentences read by an actor under each of the four emotions. Therefore, we were able to create these mismatches by temporally aligning the corresponding frames across emotions. The evaluators rated the emotional content in terms of activation, valence, and dominance, using a five-point Likert-like scale for each emotional dimension. Figure 1 shows the results for valence (1: very positive; 5: very negative). The first bar in each plot represents the matched condition, where the emotion of the sentence matches the emotion of the head motion sequence. The next three bars provide the perception achieved in mismatched conditions by changing the emotion of the head motion sequences. The bar "FIX" represents the perception achieved without any head motion. Finally, the bar "WAV" gives the perception achieved when the stimuli
[Fig. 1, four bar-chart panels of valence ratings (1–5): (a) Happiness, (b) Anger, (c) Sadness, (d) Neutral; in each panel the bars correspond to the matched emotion, the three mismatched emotions, the fixed head (FIX), and the audio-only (WAV) conditions]
Fig. 1 These figures show the perceived valence (1: positive, 5: negative) for four emotional categories, when the head movements are generated with the same emotional class (i.e., matched condition, first bar), with three other emotions (i.e., mismatched condition, second to fourth bars), when the head is fixed (FIX), and when the evaluators only listened to the audio (WAV)
only included speech. These figures show that expressive head motion sequences change the emotional perception of the animation. For the neutral sentences, adding an angry head motion sequence makes the animation more negative, and adding a happy head motion sequence makes the animation more positive. Similarly, Lance and Marsella (2007) proposed to include emotional head movements during gaze shifts when synthesizing the animations. The results of their study showed that people distinguished between high and low levels of arousal and between high and low levels of dominance. These results indicate that modeling expressive behaviors when synthesizing head motion sequences is important. Head motion also affects the perception of personality traits. Arellano et al. (2011) performed a perceptual evaluation on static images of a character with various head orientations and gazes. The results showed that people's perception of personality
traits such as agreeableness and emotional stability was affected by head orientation, while no significant effect was found for gaze. Arya et al. (2006) analyzed the effect of head movement and facial actions on the personality perceived by others. They performed perceptual evaluations on a set of animated videos to see the effect of visual cues on the perception of personality based on two parameters: affiliation and dominance. They created several videos, each with a specific facial action, and asked the evaluators to rank them with a set of attributes. The results of this study showed that dynamic head movements such as head tilt and eye gaze aversion communicate a sense of dominance for the character. Moreover, the results showed that the higher the frequency and intensity of the head movements, the higher the perceived level of dominance.
Relation Between Head Motion and Speech Data-driven models have the potential of capturing naturalistic variations of the behaviors (Foster 2007). One useful and accessible modality that can be used to drive facial behaviors is speech. Spoken language carries important information beyond the verbal message that a CA engine should capitalize on. Therefore, this chapter focuses the discussion on data-driven frameworks relying on speech features. Head motion and speech are intrinsically connected at various levels (Busso and Narayanan 2007). As mentioned in section "Role of Head Motion in Human Interaction," head motion conveys visual markers of intonation, defining syntactic boundaries and stressed segments. As a result, speech features and head motion sequences are highly correlated. Several studies have reported a strong correlation between speech prosody features and head movements. Munhall et al. (2004) analyzed the correlation between head motion and prosodic features including fundamental frequency and RMS energy. The study focused on recordings from a single subject. The correlation at the sentence level between head motion and the fundamental frequency was ρ = 0.63, and between head motion and the RMS energy was ρ = 0.324. Kuratate et al. (1999) showed a correlation of ρ = 0.88 between head motion and the fundamental frequency for an American English speaker. We also reported similar results in Busso et al. (2005) using pitch, intensity, and their first- and second-order derivatives as speech prosodic features. We evaluated the relationship between head and speech features using canonical correlation analysis (CCA). CCA projects two modalities with similar or different dimensions into a common space where their correlation is maximized. The CCA for head and speech features was ρ = 0.7 at the sentence level, highlighting the strong connection between them. The study was further extended in Busso and Narayanan (2007), observing similar results. Studies have also shown co-occurrence of head movements and speech prosody events. Graf et al. (2002) showed that although the amplitude and direction of the movements may vary according to idiosyncratic characteristics, semantic content of the message, and affective state of the speaker, there is a common synchrony of the timings between pitch accents and head events. McClave (2000) reported that there is
Table 2 Perception of CAs synthesized with and without head motion reported in previous studies, ranging from 1 (bad) to either 5 or 7 (great). Some of these values are approximated from figures in their corresponding publications

Study | Criterion | With head movement | Without head movement
Busso et al. (2007a) | Naturalness (1–5) | 3.61 | 3.10
Mariooryad and Busso (2012) | Naturalness (1–5) | ~2.90 | 2.32
Welbergen et al. (2015) | Warmth (1–7) | ~5.10 | ~4.55
Welbergen et al. (2015) | Competence (1–7) | ~6.10 | ~5.90
Welbergen et al. (2015) | Human-likeness (1–7) | ~4.50 | ~4.50
co-occurrence between head movement patterns and the meaning of speech. For instance, head shakes happen during expressions of inclusivity and intensification. Lee and Marsella (2009) proposed a hidden Markov model (HMM) classifier to detect head nods based on features selected from speech including part of speech (PoS) (e.g., conjunction, proper noun, adverb, and interjection), dialog acts (e.g., backchannel, inform, suggest), phrases, and verb boundaries. Their classifier showed high performance, indicating a close connection between head nods, dialog acts, and timing of the uttered words.
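To make the correlation analysis above concrete, the following minimal sketch computes the sentence-level canonical correlation between a 6D prosodic feature vector (pitch, energy, and their first and second order derivatives) and the three head rotation angles. The use of scikit-learn and the array names are illustrative assumptions, not the setup of the cited studies.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def sentence_cca(pitch, energy, head_angles, n_components=1):
    """Estimate the canonical correlation between prosody and head rotation.

    pitch, energy: 1-D arrays of frame-level F0 and RMS energy for one sentence.
    head_angles:   (n_frames, 3) array of head rotation angles.
    Returns the first canonical correlation coefficient.
    """
    # 6-D prosodic vector: pitch, energy, and their first/second derivatives
    prosody = np.column_stack([
        pitch, energy,
        np.gradient(pitch), np.gradient(energy),
        np.gradient(np.gradient(pitch)), np.gradient(np.gradient(energy)),
    ])
    cca = CCA(n_components=n_components)
    u, v = cca.fit_transform(prosody, head_angles)
    # Correlation of the first pair of canonical variates
    return np.corrcoef(u[:, 0], v[:, 0])[0, 1]
```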
Head Movement Synthesis It is important to generate head movements for CAs, due to their role in conveying the intended message, emotion, and level of rapport displayed by speakers. Table 2 lists some studies which have compared facial animations synthesized with and without head motion, in terms of naturalness, warmth, competence, and human-likeness. These studies show the clear benefit of using head motion. Head motion has three degrees of freedom (DOF) for rotation and three DOF for translation. While some methods consider all six DOF, most studies rely only on the three rotation angles (Fig. 2). The studies focused on the synthesis of head motion are usually based on rule-based or data-driven methods. Rule-based systems define several rules about the shape and timing of the head movements and use a predefined handcrafted dictionary of head gestures to synthesize them. While these gestures are usually selected based on the meaning of the message, their variations are limited to the list of gestures defined in the system. Also, local synchronization and timing of these gestures with speech is challenging. Data-driven methods utilize prerecorded motion databases. These methods usually concatenate the prerecorded motions to create a new realization or create them by sampling from the models trained on the recordings. Due to the correlation between head movements and speech prosody features, these methods usually consider speech prosody features in generating the
Fig. 2 Three degrees of freedom for head motion rotation. Some studies also include three degrees of freedom for head translation
movements. This approach also facilitates the synchronization between speech and gestures, capturing subtle timing relations between prosody and head motion. Also, these methods have the potential to capture the range of motions seen in real recordings. However, their main drawback is that these methods disregard the meaning of the message while creating the movements. Therefore, the movements are not constrained to convey the same meaning as the speech and may even contradict the message (e.g., head nods for negations). Foster (2007) compared rule-based and data-driven generation of head movements. The result of this evaluation showed that people preferred facial animations generated with data-driven methods more than rule-based methods; however, the difference was not statistically significant. The study also concluded that the range of the displays for data-driven method was more similar to the original recordings than the displays obtained with rule-based systems. Rule-based systems and data-driven methods have key features that are ideal to synthesize human-like head motions. This section describes influential studies for rule-based systems (section “Rule-Based Methods”) and data-driven methods (section “Data-Driven Models”). It also summarizes efforts to create hybrid approaches which leverage the benefits of both of these methods (section “Hybrid Approaches”).
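As a small illustration of the three rotational DOF in Fig. 2, the sketch below composes a head rotation matrix from yaw, pitch, and roll angles; the axis convention and composition order are assumptions that vary across animation engines.

```python
import numpy as np

def head_rotation(yaw, pitch, roll):
    """Compose a 3x3 head rotation matrix from the three rotational DOF (radians).

    Axis convention (an illustrative assumption): x points to the side, y points
    up, z points out of the face; yaw is rotation about y, pitch about x, and
    roll about z, applied in that order.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll
    return Ry @ Rx @ Rz
```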
Rule-Based Methods Rule-based methods define rules for head movements to communicate the meaning more clearly. Table 3 summarizes some of the rules defined by previous studies. One of the most influential studies on rule-based systems was presented by Cassell et al. (1994). They designed a system to generate synchronized speech, intonations, and gestures (including head motion) by defining rules. For example, they generated head nods during emphatic segments or backchannels. Their system approximated gaze with head orientation, and, therefore, all the rules for gaze behaviors involved specific head motions. For example, the CA would look up during question, look at the listener during answer, look away at the beginning of a long turn, and look at the
Table 3 This table provides a brief summary of the rules proposed in previous studies which provide a mapping from the discourse function or intention to specific head movements (mapping → head movement/pose)

Cassell et al. (1994): Lexical/emotional affiliate → Head nods; Backchannel → Head nods; Emphasis → Head nods; Question → Look up; Answer → Look away; Beginning of turn → Look away; Turn request → Look up
Pelachaud et al. (1996): Anger → Forward pose; Sadness → Downward; Disgust → Backward and up; Fear → Backward; Sadness → Downward; Surprise → Backward
Liu et al. (2012): Backchannel → Head nod; End of question → Head nod; End of turn when giving the turn to the interlocutor → Head nod; Keeping a turn by a short pause → Head nod; Thinking, but keeping the turn → Head tilt; Thinking and preparing the next utterance, e.g., "uhmm" → Head tilt
Gratch et al. (2006): Lowering of pitch of interlocutor → Head nod; Raised loudness of interlocutor → Head nod; Speech disfluency of interlocutor → Posture/gaze shift; Posture/gaze shift of interlocutor → Mimic; Nods or shakes of interlocutor → Mimic
Marsella et al. (2013): Affirmation → Big nod, tilt left nod, tilt right nod; Negation → Shake, small shake; Contrast → Tilt right, tilt left; Mental state → Tilt half nod left; Emphasis → Small nod
listener for short turns. The extension of this framework resulted in REA, a CA which responded with gestures to different discourse functions (Cassell et al. 1999). DeCarlo et al. (2004) presented RUTH, a platform architecture for embodied CAs. The inputs of this platform are enriched transcriptions with prosodic and gestural markers at the word level. The prosodic markers correspond to the tones and break indices (ToBI), which define pitch accents and boundary tones (Silverman
et al. 1992). The gestural markers are predefined behaviors. For head motion, they defined 14 types of head motions. They considered variations of head nods (upward, downward, upward with some rightward, upward with some leftward, downward with some rightward, and downward with some leftward) and head tilts (clockwise, counter clockwise, clockwise with downward nodding, counter clockwise with downward nodding). They also defined gestures to move (forward, backward) or turn (to the right or to the left) the head. These gestures are then rendered, synchronizing the behaviors at the points specified by the tags. There are studies that have attempted to incorporate head motions conveying emotions using rule-based systems. Pelachaud et al. (1996) developed a system to generate expressive facial and head movements, as well as eye movements. As input, they used the transcriptions tagged with accents, the desired emotional state, and its intensity. They used head and eye movements as regulators, which facilitate the communication between the speaker and listener, in a rule-based manner. They also defined rules specifying the head direction depending on the target emotional state, following the results from previous psychological studies. The emotional head motion rules include moving forward during anger, downward during sadness, and backward during surprise. Marsella et al. (2013) proposed an emotionally aware rule-based system. Their framework relied on syntactic and acoustic analysis on the input speech consisting of either natural or synthesized speech. They used the acoustic analysis to find the emotional state and word emphasis and the syntactic analysis to find the appropriate category of behaviors to be synthesized for each communicative goal. Their proposed system also handles co-articulation for consecutive behaviors that are close in time, leading to novel realizations while transitioning from one gesture to another. To define the rules, there are studies investigating video recordings of human interaction, aiming to identify consistent patterns between movements (including head motion) and discourse features. Kipp (2003) analyzed human gestures in 23 clips of a TV show. They developed ANVIL, a video annotation toolkit to annotate the gestures. They found a common set of 15 gestures occurring across the two speakers. They defined these gesture profiles, including the position and orientation of the head and hands during the gestures. For synthesis, they automatically annotate the transcription with words, PoS, what the utterance is about (theme, rheme, and focus), and discourse relations (opposition, repetition, and listing). They used carefully designed rules to map these tags to a set of semantic tags. Using these semantic tags, and the statistics derived from their annotated corpus for these tags, they choose the most probable gesture considering local and global constraints. Following a similar approach, Liu et al. (2012) proposed a rule-based approach to appropriately generate head tilts and head nods, where the rules were derived by observing and analyzing human interaction data. First, they annotated the phrases in their database with a list of dialog acts, along with head nods and tilts. Second, they created a mapping between dialog acts and the corresponding head movements. They found frequent occurrences of head nods during backchannels and the last syllable of strong phrase boundaries. 
They also found head tilts during weak phrase boundaries and segments when the individual was either thinking or embarrassed.
They exploited these relations in the generation of head nods and tilts for human-robot interaction (HRI). They used fixed shape trajectories for head nods and tilts, driven by the rules learnt from their corpus. They used perceptual evaluations to measure the perceived naturalness of the head movements visualized on robots. Their results showed improved naturalness when both head nods and tilts were incorporated in the system, compared with using only head nods or the original sequences. Generating head motion for CAs is not only important while speaking but also while listening. Showing rapport is one of the aspects that needs to be considered for generating believable CAs. Gratch et al. (2006) proposed a virtual listener called virtual rapport, which aims to create a sense of rapport with the user, by defining heuristic rules from previous psychological studies. For example, the CA nods whenever it senses that the users lower their pitch or raise their loudness. It also nods when the speaker nods. Similar rule-based systems include the work of Rickel and Johnson (1998) (Steven), André et al. (1996) (the DFKI Persona), Beskow and McGlashan (1997) (Olga), Lester et al. (1999) (pedagogical agents), and Smid et al. (2004). An important drawback of rule-based systems is that they cannot easily capture the rich and complex variability observed in natural recordings, often resulting in repetitive behaviors (Foster 2007).
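A minimal sketch of how the kind of discourse-function-to-gesture mapping summarized in Table 3 could be encoded in a rule-based system is shown below; the tag names and the simplified rule set are illustrative and do not reproduce any of the cited systems.

```python
# Illustrative rule table in the spirit of Table 3 (not a reproduction of any cited system).
HEAD_RULES = {
    "backchannel": "head_nod",
    "end_of_question": "head_nod",
    "end_of_turn": "head_nod",
    "thinking": "head_tilt",
    "weak_phrase_boundary": "head_tilt",
    "emphasis": "small_nod",
    "negation": "head_shake",
}

def select_head_gesture(discourse_function, default="rest"):
    """Map an annotated discourse function to a head gesture label."""
    return HEAD_RULES.get(discourse_function, default)

# Example: tags produced by a dialog-act tagger for one utterance
print([select_head_gesture(tag) for tag in ["emphasis", "backchannel", "negation"]])
```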
Data-Driven Models The second category of approaches to generate head motion corresponds to data-driven frameworks. Data-driven approaches usually utilize motion capture recordings of head motion trajectories. Table 4 summarizes some of these frameworks, highlighting the input that drives the approaches. Studies have created head movements by blending segments of recorded motion capture data. Chuang and Bregler (2005) designed a system to create emotional facial gestures. For head motion, they stored segments of pitch contours and their corresponding head motions. During synthesis, they searched for a combination of pitch contours in their stored libraries, finding the best matching contours. Then, they connected the corresponding sequence of head motion segments, re-sampling the sequence to match the timing of the input. Deng et al. (2004) proposed a similar approach using K Nearest Neighbors (KNNs). They stored the training audio and head motion trajectories indexed by the audio features and used dynamic programming to search for the most appropriate set of motion segments using seven nearest neighbors. They allowed the user to specify head poses for key frames, which were added as constraints in the dynamic programming search. They considered smoothness of the trajectory as one of the factors in their optimization process, avoiding sudden transitions. Le et al. (2012) proposed a framework based on Gaussian mixture model (GMM) to generate head motion driven by prosodic features (loudness and fundamental frequency). Their framework learns three separate joint GMMs, modeling the
Table 4 Brief summary of data-driven methods proposed in previous studies. The table lists the corresponding input during testing and the approach used to synthesize the head motion sequences

Study | Input | Method
Chuang and Bregler (2005) | Pitch, target expressive style | KNN, path searching
Deng et al. (2004) | Pitch, five formants, 13-MFCC, 12-LPC | KNN, path optimization
Busso et al. (2005) | Pitch, intensity | HMMs
Busso et al. (2007a) | Pitch, intensity, emotion | HMMs
Sargin et al. (2008) | Pitch, intensity | PHMMs
Mariooryad and Busso (2012) | Pitch, intensity | DBNs
Chiu and Marsella (2011) | Pitch, intensity | CRBMs
Levine et al. (2010) | Pitch, intensity, syllable length | HCRFs, reinforcement learning
Le et al. (2012) | Pitch, intensity | GMMs
relation between speech prosody features and (1) head poses, (2) the velocity of head motion, and (3) the acceleration of head motion. They approximate their joint distribution by assuming that they are independent (product of the probabilities provided by the GMMs). Having the head pose at the two previous frames, and the prosodic features for the current frame, they find the current head poses by maximizing the final joint distribution using gradient descent. A key advantage of this approach is that it can run online, facilitating real-time implementations. There are other data-driven studies that use probabilistic modeling strategies that capture the temporal dynamics of head motion. Examples of these frameworks include HMMs and dynamic Bayesian networks (DBNs). For example, we presented a framework based on HMMs for modeling the relationship between speech prosodic features and head motion (Busso et al. 2005). We used vector quantization to quantize the space of head motion and designed HMMs to learn the joint representation of head movements and speech prosodic features. Figure 3 gives the block diagram of this study, where the RMS energy, the fundamental frequency, and their first and second order derivatives were used to create a 6D feature vector. The HMMs represent head poses, with transitions learned during training. The HMMs decode the most likely sequence of head poses given a speech signal, where common head pose transitions are rewarded and uncommon transitions are penalized. Given the discrete representation of head poses used in this study, we smoothed the angular trajectories of the generated head poses. This framework was very effective in generating head motion sequences that are temporally aligned with prosodic information. Following this study, we extended the HMM approach to incorporate the relationship between prosody and head motion under different emotional states (Busso et al. 2007a, b). The results showed that the models were able to generate expressive head motions accompanying speech. Other studies have also proposed speech-driven models to synthesize head motion. Sargin et al. (2008) used parallel HMMs (PHMMs) to jointly model speech and head movements
Fig. 3 The block diagram of the approach proposed by Busso et al. (2005), where HMMs are used to synthesize head motion sequences driven from speech. The diagram comprises feature extraction, vector quantization, an HMM sequence generator, noise generation, and spherical cubic interpolation for head motion synthesis
Fig. 4 (a) PHMM proposed by Sargin et al. (2008) to model the relationship between head movement primitives and speech prosody, (b) the DBN model proposed by Mariooryad and Busso (2012) to jointly model the relationship between head and eyebrow movements with speech prosody features, and (c) the CRBM proposed by Taylor et al. (2006) to learn human motion
by simultaneously clustering and segmenting the two modalities. PHMM consists of several left-to-right HMMs, where each branch models a head motion primitive automatically extracted from the data (see Fig. 4a). PHMM jointly solves the segmentation and clustering of head motion sequences. In their study, they found the most probable state sequence and their corresponding head motion values for a given speech signal. DBNs are another suitable framework to capture the relation between head motion and speech. DBN is a generative model that provides the flexibility to impose different structures by introducing nodes representing variables and direct links representing conditional dependencies between the variables. Therefore, it can model the dependencies between two temporal sequences in a principled way. Notice that HMM is a particular type of DBN. We have demonstrated the potential of DBNs to model the relation between head motion and speech. Mariooryad and Busso (2012) designed several structures of DBNs to capture the joint representation
of speech with not only head movements but also eyebrow movements. Figure 4b shows an example, where the Head&Eyebrow node represents a jointly discrete state describing eyebrow and head motion. During training, all the variables are available. During synthesis, the Head&Eyebrow node is not available but is approximated by propagating the evidence from the Speech node. The animations generated with this method were compared with subjective and objective metrics, demonstrating the need to jointly model eyebrow and head motion. There are data-driven methods relying on conditional restricted Boltzmann machines (CRBMs) (Chiu and Marsella 2011). A CRBM provides an efficient nonlinear tool for modeling the global dynamics and local constraints of a temporal signal (see Fig. 4c). Given N + 1 frames, the model learns the mapping between the visible units and hidden layers, which will reconstruct missing observations during synthesis. For synthesis, the model takes the first N frames, aiming to estimate the (N + 1)th frame. In addition, the autoregressive connections between the previous frames and the current frame learn the temporal constraints of the data. During synthesis, this model generates the (N + 1)th sample using contrastive divergence, based on the previous N frames. These properties make CRBMs very useful for predicting and generating temporal sequences. Taylor et al. (2006) demonstrated the benefits of using this framework for modeling human motion trajectories (e.g., walking). They used a CRBM with autoregressive connections to predict the human motion pose for the next frame, given the previous N frames. Following this study, Taylor and Hinton (2009) proposed to add an extra variable to constrain the CRBM's generation based on specific stylized walk sequences such as drunk, strong, and graceful. The success of this framework in this domain motivated Chiu and Marsella (2011) to use a variation of the CRBM to generate head motion sequences. They proposed a hierarchical factored conditional RBM (HFCRBM), which predicts the current head pose based on the previous two poses, conditioned on speech prosody features. The aforementioned studies utilized either the concatenation approach or statistical models to learn the relation between head motion and speech. Levine et al. (2010) combined both strategies by using hidden conditional random fields (HCRFs) to model the relationship between a set of kinematic features of joint movements and speech prosodic features. The premise of the study is that prosody is related to the head motion kinematics rather than the actual head motion. For synthesis, they inferred the kinematic features based on the prosodic features. Next, they searched through the recordings, in an online manner (forward path), using a cost function that incorporates the inferred kinematic features. They used a Markov decision process (MDP) to ensure smoothness in the head motion trajectories.
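As a schematic illustration of the speech-driven decoding idea behind the HMM framework in Fig. 3, the sketch below runs a Viterbi-style search over quantized head-pose states, scoring frame-level prosodic features with per-state Gaussians and penalizing uncommon pose transitions. The Gaussian emission model and all parameter names are simplifying assumptions rather than the published implementation.

```python
import numpy as np

def decode_head_poses(prosody, means, covs, log_trans, log_prior):
    """Viterbi decoding of quantized head-pose states from prosodic features.

    prosody:    (T, d) frame-level prosodic features (e.g., pitch, energy, deltas).
    means/covs: per-state Gaussian emission parameters, shapes (K, d) and (K, d, d).
    log_trans:  (K, K) log transition probabilities between pose clusters.
    log_prior:  (K,) log initial state probabilities.
    Returns the most likely sequence of pose-cluster indices.
    """
    T, K = prosody.shape[0], means.shape[0]

    def log_gauss(x, mu, cov):
        d = x.shape[0]
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (d * np.log(2 * np.pi) + logdet
                       + diff @ np.linalg.solve(cov, diff))

    delta = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    delta[0] = log_prior + [log_gauss(prosody[0], means[k], covs[k]) for k in range(K)]
    for t in range(1, T):
        for k in range(K):
            scores = delta[t - 1] + log_trans[:, k]
            back[t, k] = np.argmax(scores)
            delta[t, k] = scores[back[t, k]] + log_gauss(prosody[t], means[k], covs[k])
    # Backtrack the best pose-cluster sequence; the discrete poses would then be
    # smoothed (e.g., via interpolation) before rendering, as in Fig. 3.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```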
Hybrid Approaches Combining rule-based and data-driven approaches to exploit the benefits from each method results in an enhanced system. Several studies have focused on bridging the gap between these two methods (Huang et al. 2011; Sadoughi and Busso 2015;
Fig. 5 Dynamic Bayesian networks of the proposed constrained models (Sadoughi and Busso 2015; Sadoughi et al. 2014). The systems generate behaviors constrained by (a) the underlying discourse function or (b) the target gesture
Sadoughi et al. 2014; Stone et al. 2004). We describe some of these studies in this section. Stone et al. (2004) proposed a system to generate head and body movements, by concatenating prerecorded audio and motion units. The key aspect of the approach is that the units are associated with communicative functions or intents. Therefore, they can decompose any new utterance into units and solve a dynamic search through their data to find the best combination matching the intended communicative function. Since the approach uses a concatenative framework, the dynamic search also smoothes transitions between speech segments and between motion sequences. It also synchronizes emphasis points across speech and gestures. To achieve this goal, they annotate the emphasis segments on their recordings. During testing, they provide tags describing emphasis on the transcriptions, coordinating the emphasis on motion sequences to start and end at the corresponding frames. The intermediate frames, which are the frames in between the emphatic points, are derived by interpolation. The limitation of this work is that the variations of speech and motion sequences are limited to the indexed phrases found in the recordings. There are other studies that have combined rule-based and data-driven approaches by adding meaningful constraints to their models. We have designed a speech-driven model to synthesize eyebrow and head movements constrained by discourse functions (Sadoughi and Busso 2015; Sadoughi et al. 2014). Figure 5a describes the structure of our first model (Sadoughi et al. 2014), built upon the DBN model proposed by Mariooryad and Busso (2012) (see Fig. 4b). In this structure, the Constraint node is added as a child of the hidden state Hh&e, which controls the dependency between speech and head and eyebrow motion. During training and synthesis, the Constraint node is given as input, which dictates the behaviors generated by the system. This study used the IEMOCAP database (Busso et al. 2008), which consists of dyadic interactions between two actors. We manually annotated two discourse functions corresponding to affirmation and question. The models were trained and tested with data from a single subject. The results showed
Fig. 6 MSP-AVATAR, a corpus designed to generate behaviors constrained by the communicative function of the message. The figure shows the placement of the reflective markers, the skeleton used to reconstruct the data, and the setting of the recordings
that evaluators preferred the constrained models for questions. For affirmation, the results were not conclusive. A challenge in creating models constrained by the semantic meaning of the sentence is the lack of motion capture databases with appropriate annotations for discourse functions. To overcome this limitation, we recorded the MSP-AVATAR corpus (Sadoughi et al. 2015), a motion capture corpus of dyadic interactions, capturing facial expressions and upper body motion, including head motion (Fig. 6). In each session, two actors improvised scenarios carefully designed to include a set of the following communicative functions: contrast, affirmation, negation, question, uncertainty, suggest, warn, and inform. We also considered scenarios to include iconic gestures for words such as large and small, and deictic gestures for pronouns such as you and I. This corpus contains the audio and video of both actors and motion capture recordings from one of them. Figure 6 shows the placement of the markers, the skeleton used to reconstruct the data, and the setting of the recordings of the corpus. Using this corpus, we are currently extending our framework to combine rule-based and data-driven models by considering these discourse functions. An alternative framework to bridge rule-based and data-driven models is to generate the behaviors dictated by the predefined rules using data-driven models. To understand how this framework works, consider the SAIBA framework proposed by Kopp et al. (2006). SAIBA is a behavior generation framework for embodied conversational agents (ECAs) composed of three layers: intent planning, behavior planning, and behavior realization. The first two layers define the intent of the message and the gestures required to convey the communicative goal. We envision rule-based systems to create these layers. The last layer generates the intended behavior by setting the amplitude and timing constraints. We envision data-driven models to create this layer. Data-driven models will generate novel realizations of specific gestures defined by the behavior planning layer. We have explored this approach for head and hand gestures (Sadoughi and Busso 2015). Figure 5b illustrates the proposed system, where the Constraint node is placed as a parent of the
Fig. 7 This figure shows the overall block diagram of the method proposed by Sadoughi and Busso (2015) to retrieve arbitrary prototypical head movements. The approach only requires few examples of the target behaviors
hidden state Hh&e. The key novelty is that the constraints correspond to specific behaviors. For head motion, we only considered head nods and head shakes, but the system is flexible enough to incorporate other behaviors. Notice that we need several examples of the target behaviors to train the proposed model. We addressed this key problem with a semi-supervised approach to retrieve examples of the target behaviors from the database. This framework, illustrated in Fig. 7, requires few samples for training, which are used to automatically retrieve similar examples from the database. The first step searches for possible matches using a one-class support vector machine (SVM). We use temporal reduction and multi-scale windows to handle similar gestures with different durations. The classifiers are fast. They are set to identify many candidate segments conveying the target gesture. The second step uses a dynamic time alignment kernel (DTAK) to improve the precision of the system by removing samples that are not similar to the given examples. We use the retrieved samples to train the speech-driven framework described in Fig. 5b, generating novel data-driven realizations of the target behaviors (e.g., head shakes and head nods). Another interesting domain to combine rule-based and data-driven systems is in the generation of behaviors while the CAs are listening. As we mentioned in section "Rule-Based Methods," Gratch et al. (2006) proposed a rule-based system to generate virtual rapport. Their team extended their framework, entitled virtual rapport 2.0, by shifting their approach towards a more data-driven approach (Huang et al. 2011). The approach relies on an interesting data collection design described in Huang et al. (2010) to analyze human responses during an interaction. They collected data from subjects watching a storyteller in a prerecorded video. Their task was to press a key each time they felt backchannels (verbal and nonverbal feedback, such as head nods, "uh-huh," or "OK") were appropriate. The subjects were informed of the interaction goal, which was to promote rapport. They collected multiple subjects under the same interaction to separate the idiosyncratic responses from essential responses. Using these recordings, they trained a conditional random field (CRF) model, which uses these recorded videos to predict when and how to generate backchannels. The input of their system includes pauses and the user's eye gaze, generating different types of nodding as output. For predicting the end of the speaking turn, they defined rules using the verbal and nonverbal cues observed in their data. To make the CA system more friendly, they embedded a smile detector,
smiling whenever the system detects that the user smiles. This version created a higher sense of rapport in the speakers.
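The two-step retrieval idea in Fig. 7 can be illustrated with the condensed sketch below: a one-class SVM flags candidate windows resembling a few seed examples of the target gesture, and a dynamic-time-warping distance prunes weak matches. The fixed window length, sliding step, and threshold are illustrative assumptions; the published system additionally uses temporal reduction, multi-scale windows, and a dynamic time alignment kernel.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def dtw_distance(a, b):
    """Plain dynamic time warping distance between two (T, d) trajectories."""
    na, nb = len(a), len(b)
    D = np.full((na + 1, nb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[na, nb]

def retrieve_candidates(seeds, database, win=60, dtw_thresh=50.0):
    """Two-step retrieval of gesture-like segments from a motion database.

    seeds:    list of (win, d) example trajectories of the target gesture.
    database: (N, d) continuous head-motion recording.
    Returns start frames of windows that pass both the SVM and the DTW filter.
    """
    flat_seeds = np.stack([s.reshape(-1) for s in seeds])
    svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(flat_seeds)

    hits = []
    for start in range(0, len(database) - win, win // 2):            # sliding windows
        segment = database[start:start + win]
        if svm.predict(segment.reshape(1, -1))[0] == 1:               # step 1: coarse filter
            if min(dtw_distance(segment, s) for s in seeds) < dtw_thresh:  # step 2: prune
                hits.append(start)
    return hits
```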
Open Challenges Generating meaningful head motion sequences conveying the range of behaviors observed during human interaction is an important problem. This area offers interesting challenges that future research should address. We describe some of these challenges in this section.
Speech-Driven Models Using Synthetic Speech An important limitation for speech-driven methods is the assumption that natural speech is available to synthesize head motion. Having prerecorded audio for each sentence spoken by the CA is not realistic in many domains. Instead, text-to-speech (TTS) systems provide the flexibility to scale the system beyond prerecorded sentences. An advantage for rule-based systems is that the rules are generally derived from transcriptions instead of speech features. Therefore, they can easily handle CAs using synthetic speech. For speech-driven frameworks, however, the models rely on acoustic features derived from natural speech. Using synthetic speech is a major limitation. There are very few studies that have addressed this problem. Welbergen et al. (2015) provided a framework to generate head movements for a CA driven by synthetic speech. They used the probabilistic model proposed by Le et al. (2012). They tested their framework with synthetic speech, performing subjective evaluation to assess the warmth, competence, and human-likeness of their animations. The results showed that adding head movements by using an online implementation of their framework increases the perception level of these social attributes. Although the approach proposed by Welbergen et al. (2015) uses synthetic speech, their system has a mismatch between train and test conditions. During training, the models are built with original speech. During synthesis, the models are driven by features extracted from synthetic speech. Features extracted from synthetic speech do not have the same dynamic range as features derived from original speech. Given these clear differences in the feature space between natural and synthetic speech, this mismatch produces very limited range of behaviors. We are investigating systematic approaches to address this problem by using adaptation techniques that reduce the mismatch between train and test conditions and increase the range of behaviors generated by the models (Sadoughi and Busso 2016). Solving this problem can dramatically increase the application domain where speech-driven animation can be used.
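One way to illustrate the train/test mismatch discussed above is a global mean-and-variance normalization that maps synthetic-speech features toward the statistics of the natural-speech training data. This is only a toy illustration of the adaptation idea, not the technique studied in Sadoughi and Busso (2016).

```python
import numpy as np

def adapt_features(synthetic_feats, natural_feats):
    """Map synthetic-speech features toward the natural-speech training statistics.

    Both inputs are (n_frames, d) arrays of prosodic features. Each dimension of
    the synthetic features is shifted and scaled so its global mean and variance
    match those observed in the natural training data, reducing the range mismatch
    described above. (A toy global transform; real adaptation is more involved.)
    """
    mu_s, std_s = synthetic_feats.mean(axis=0), synthetic_feats.std(axis=0) + 1e-8
    mu_n, std_n = natural_feats.mean(axis=0), natural_feats.std(axis=0)
    return (synthetic_feats - mu_s) / std_s * std_n + mu_n
```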
Exploring Entrainment As discussed in section "Role of Head Motion in Human Interaction," head motion conveys the emotional state of the message (Busso et al. 2007a). An open challenge is to identify effective frameworks to generate head motion sequences that elicit a target emotion. While predefined rules can be used (Pelachaud et al. 1996), data-driven frameworks may provide more realistic sequences (Busso et al. 2007a). These systems, which are able to convey expressive behaviors, open opportunities to explore entrainment effects between the user and the CA. Entrainment is the phenomenon during human interaction where interlocutors mirror the behaviors of each other. This phenomenon, which affects lexical, prosodic, and gestural cues, has also been observed during human-computer/robot interaction (Bell et al. 2003; Breazeal 2002). Interestingly, we have also observed entrainment effects on emotional behaviors during dyadic interactions (Mariooryad and Busso 2013). Can a CA manipulate its emotional reactions to affect the affective state of the user? The head movement of the CA can be modulated with the appropriate emotional cues to increase the emotional entrainment with the user. Attempting to capture this subtle communicative aspect can lead to more effective CAs with better rapport with the user. Although Huang et al. (2011) proposed to make the CA more friendly by producing smiles in response to smiles from users, this area offers opportunities to systematically design interfaces beyond that, leveraging the findings from entrainment studies (Jakkam and Busso 2016). The first step in this direction requires investigation of emotional entrainment in human conversations (Mariooryad and Busso 2013; Xiao et al. 2015). The investigation can be used to create appropriate affective cues for the CAs, increasing the emotional entrainment by developing models that incorporate relevant factors.
Modeling Personality Since individual differences play an important role in the range of head motion shown during human interaction (Youssef et al. 2013), models to synthesize head motion sequences should carefully consider personality and idiosyncratic differences. Note that personality and emotional displays are interconnected. For instance, an introverted and an extroverted person will express their emotions differently under the same circumstances. There are studies proposing frameworks to incorporate personality traits in their CAs, especially for rule-based methods. Using a rule-based strategy, Poggi et al. (2005) proposed to modulate the goals of the message according to the personality traits of the ECA. Kipp (2003) proposed to investigate the gesture profiles displayed by two speakers. They found that the gestures of the speakers were different in important ways. There were some gestures only used by one of the speakers. They also found important differences in the frequency of their gestures, the timing patterns, and the mapping functions used to link semantic tags to actual gestures. They utilized all these aspects to personalize their animated characters. However, this was a limited study, and more effort is required to generalize the models to a broader range of personalities.
Joint Models to Integrate Head Motion with Other Gestures Another remaining challenge is how to integrate the generated head movements with the movements of other parts of the body. There is a high synchrony between head movements and facial gestures. For example, we have reported a high CCA between head and eyebrow movements (ρ = 0.89) (Mariooryad and Busso 2012). When individual speech-driven models are separately used to synthesize individual behaviors, the relationship between these behaviors may not be preserved. For example, we can perfectly capture the timing relationship between speech and head motion and between speech and eyebrow motion. However, the generated head and eyebrow motion may not be perceived as realistic when rendering the CA, as these behaviors may fail to capture the relation between head and eyebrow motion. In Mariooryad and Busso (2012), we proposed to jointly model head and eyebrow motion in a speech-driven framework. The result of this study showed that people preferred the animations synthesized by the joint model rather than the ones where the behaviors were independently generated. Capturing these subtle dependencies is not only important for generating realistic behaviors but also for conveying synthesized behaviors that increase speech intelligibility (Munhall et al. 2004). Having a data-driven model which incorporates all these relations will result in a more convincing animation. The challenge is that modeling more modalities will increase the complexity of the model. Extending data-driven approaches without significantly increasing their complexity is an open challenge.
Conclusions This chapter gives an overview of the studies relevant to head motion generation. We started by reviewing the importance of head motion in human interactions. Head movements play an important role in face-to-face communication. They provide semantic and syntactic cues while speaking. We use head motion as a backchannel while listening to others. They play an important role in conveying personality and emotional traits. These functions are important for communication, so realistic CAs should have well-designed head motion sequences that are temporally synchronized with speech. The chapter overviewed different methods to generate head movements, which can be categorized into two main approaches: rule-based and data-driven frameworks. Rule-based methods rely on heuristic rules to generate the head movements based on the underlying communicative goal of the message. Data-driven methods rely on recorded head motion sequences to generate new instances. Within data-driven methods, we focused the review on speech-driven frameworks which leverage the close relationship between prosody and head motion. Rule-based and data-driven methods have their own advantages and disadvantages. We reviewed hybrid approaches which have attempted to bridge the gap between these methods, overcoming their limitations.
There are still open challenges in generating realistic head motion sequences. We discussed opportunities which we believe can result in head motion sequences that are more effective and engaging. Previous studies have built the foundation to understand better the role of head motion. They have also provided convincing frameworks to generate human-like head motion sequences. They offer a perfect platform for future studies to advance even more this research area. Acknowledgments This work was funded by National Science Foundation under grant IIS-1352950.
References André E, Müller J, Rist T (1996) The PPP persona: a multipurpose animated presentation agent. In: Workshop on advanced visual interfaces, Gubbio, pp 245–247 Arellano D, Varona J, Perales FJ, Bee N, Janowski K, André EE (2011) Influence of head orientation in perception of personality traits in virtual agents. In: The 10th international conference on autonomous agents and multiagent systems-Volume 3, Taipei, pp 1093–1094 Arya A, Jefferies L, Enns J, DiPaola S (2006) Facial actions as visual cues for personality. Comput Anim Virtual Worlds 17(3–4):371–382 Bell L, Gustafson J, Heldner M (2003) Prosodic adaptation in human-computer interaction. In: 15th international congress of phonetic sciences (ICPhS 03), Barcelona, pp 2453–2456 Beskow J, McGlashan S (1997) Olga – a conversational agent with gestures. In: Proceedings of the IJCAI 1997 workshop on animated interface agents: making them intelligent, Nagoya Breazeal C (2002) Regulation and entrainment in human-robot interaction. Int J Robot Res 21 (10–11):883–902 Busso C, Narayanan S (2007) Interrelation between speech and facial gestures in emotional utterances: a single subject study. IEEE Trans Audio, Speech Lang Process 15(8):2331–2347 Busso C, Deng Z, Neumann U, Narayanan S (2005) Natural head motion synthesis driven by acoustic prosodic features. Comput Anim Virtual Worlds 16(3–4):283–290 Busso C, Deng Z, Grimm M, Neumann U, Narayanan S (2007a) Rigid head motion in expressive speech animation: analysis and synthesis. IEEE Trans Audio, Speech Lang Process 15 (3):1075–1086 Busso C, Deng Z, Neumann U, Narayanan S (2007b) Learning expressive human-like head motion sequences from speech. In: Deng Z, Neumann U (eds) Data-driven 3D facial animations. Springer-Verlag London Ltd, Surrey, pp 113–131 Busso C, Bulut M, Lee C, Kazemzadeh A, Mower E, Kim S, Chang J, Lee S, Narayanan S (2008) IEMOCAP: Interactive emotional dyadic motion capture database. J Lang Resour Eval 42 (4):335–359 Cassell J, Pelachaud C, Badler N, Steedman M, Achorn B, Bechet T, Douville B, Prevost S, Stone M (1994) Animated conversation: rule-based generation of facial expression gesture and spoken intonation for multiple conversational agents. In: Computer graphics (Proc. of ACM SIGGRAPH’94), Orlando, pp 413–420 Cassell J, Bickmore T, Billinghurst M, Campbell L, Chang K, Vilhjalmsson H, Yan H (1999) Embodiment in conversational interfaces: Rea. In: International conference on human factors in computing systems (CHI-99), Pittsburgh, pp 520–527 Chiu C-C, Marsella S (2011) How to train your avatar: a data driven approach to gesture generation. In: Intelligent virtual agents, Reykjavik, pp 127–140 Chiu C-C, Morency L-P, Marsella S (2015) Predicting co-verbal gestures: a deep and temporal modeling approach. In: Intelligent virtual agents, Delft, pp 152–166
Chuang E, Bregler C (2005) Mood swings: expressive speech animation. ACM Trans Graph 24 (2):331–347 DeCarlo D, Stone M, Revilla C, Venditti JJ (2004) Specifying and animating facial signals for discourse in embodied conversational agents. Comput Anim Virtual Worlds 15(1):27–38 Deng Z, Busso C, Narayanan S, Neumann U (2004) Audio-based head motion synthesis for avatarbased telepresence systems. In: ACM SIGMM 2004 workshop on effective telepresence (ETP 2004). ACM Press, New York, pp 24–30 Foster ME (2007) Comparing rule-based and data-driven selection of facial displays. In: Workshop on embodied language processing, association for computational linguistics, Prague, pp 1–8 Graf HP, Cosatto E, Strom V, Huang FJ (2002) Visual prosody: facial movements accompanying speech. In: Proceedings of IEEE international conference on automatic faces and gesture recognition, Washington, DC, pp 396–401 Gratch J, Okhmatovskaia A, Lamothe F, Marsella S, Morales M, van der Werf R, Morency L (2006) Virtual rapport. In: 6th international conference on intelligent virtual agents (IVA 2006), Marina del Rey Hadar U, Steiner TJ, Grant EC, Rose FC (1983) Kinematics of head movements accompanying speech during conversation. Hum Mov Sci 2(1):35–46 Heylen D (2005) Challenges ahead head movements and other social acts in conversation. In: Artificial intelligence and simulation of behaviour (AISB 2005), social presence cues for virtual humanoids symposium, page 8, Hertfordshire Huang L, Morency L-P, Gratch J (2010) Parasocial consensus sampling: combining multiple perspectives to learn virtual human behavior. In: Proceedings of the 9th international conference on autonomous agents and multiagent systems: volume 1-volume 1, Toronto, pp 1265–1272 Huang L, Morency L-P, Gratch J (2011) Virtual rapport 2.0. In: Intelligent virtual agents, Reykjavik, pp 68–79 Ishi CT, Ishiguro H, Hagita N (2014) Analysis of relationship between head motion events and speech in dialogue conversations. Speech Commun 57:233–243 Jakkam A, Busso C (2016) A multimodal analysis of synchrony during dyadic interaction using a metric based on sequential pattern mining. In: IEEE international conference on acoustics, speech and signal processing (ICASSP 2016), Shanghai, pp 6085–6089 Kipp M (2003) Gesture generation by imitation: from human behavior to computer character animation. PhD thesis, Universität des Saarlandes, Saarbrücken Kopp S, Krenn B, Marsella S, Marshall AN, Pelachaud C, Pirker H, Thórisson KR, Vilhjálmsson H (2006) Towards a common framework for multimodal generation: the behavior markup language. In: International conference on intelligent virtual agents (IVA 2006), Marina Del Rey, pp 205–217 Kuratate T, Munhall KG, Rubin PE, Vatikiotis-Bateson E, Yehia H (1999) Audio-visual synthesis of talking faces from speech production correlates. In: Sixth European conference on speech communication and technology, Eurospeech 1999, Budapest, pp 1279–1282 Lance B, Marsella SC (2007) Emotionally expressive head and body movement during gaze shifts. In: Intelligent virtual agents, Paris, pp 72–85 Le BH, Ma X, Deng Z (2012) Live speech driven head-and-eye motion generators. IEEE Trans Vis Comput Graph 18(11):1902–1914 Lee J, Marsella S (2006) Nonverbal behavior generator for embodied conversational agents. Intell Virtual Agents 4133:243–255 Lee JJ, Marsella S (2009) Learning a model of speaker head nods using gesture corpora. 
In: Proceedings of the 8th international conference on autonomous agents and multiagent systems-volume 1, volume 1, Budapest, pp 289–296 Lester J, Stone B, Stelling G (1999) Lifelike pedagogical agents for mixed-initiative problem solving in constructivist learning environments. User Model User-Adap Inter 9(1–2):1–44 Levine S, Krähenbühl P, Thrun S, Koltun V (2010) Gesture controllers. ACM Trans Graph 29 (4):1–124
Liu C, Ishi CT, Ishiguro H, Hagita N (2012) Generation of nodding, head tilting and eye gazing for human-robot dialogue interaction. In: Human-Robot interaction (HRI), 2012 7th ACM/IEEE international conference on, Boston, pp 285–292 Mariooryad S, Busso C (2012) Generating human-like behaviors using joint, speech-driven models for conversational agents. IEEE Trans Audio, Speech Lang Process 20(8):2329–2340 Mariooryad S, Busso C (2013) Exploring cross-modality affective reactions for audiovisual emotion recognition. IEEE Trans Affect Comput 4(2):183–196 Marsella S, Xu Y, Lhommet M, Feng A, Scherer S, Shapiro A (2013) Virtual character performance from speech. In ACM SIGGRAPH/Eurographics symposium on computer animation (SCA 2013), Anaheim, pp 25–35 Marsi E, van Rooden F (2007) Expressing uncertainty with a talking head. In: Workshop on multimodal output generation (MOG 2007), Aberdeen, pp 105–116 McClave EZ (2000) Linguistic functions of head movements in the context of speech. J Pragmat 32 (7):855–878 Moubayed SA, Beskow J, Granström B, House D (2010) Audio-visual prosody: perception, detection, and synthesis of prominence. In: COST 2102 training school, pp 55–71 Munhall KG, Jones JA, Callan DE, Kuratate T, Vatikiotis-Bateson E (2004) Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychol Sci 15 (2):133–137 Pelachaud C, Badler N, Steedman M (1996) Generating facial expressions for speech. Cognit Sci 20 (1):1–46 Poggi I, Pelachaud C, de Rosis F, Carofiglio V, de Carolis B (2005) Greta. a believable embodied conversational agent. In: Stock O, Zancanaro M (eds) Multimodal intelligent information presentation, Text, speech and language technology. Springer Netherlands, Dordrecht, pp 3–25 Rickel J, Johnson WL (1998) Task-oriented dialogs with animated agents in virtual reality. In: Workshop on embodied conversational characters, Tahoe City, pp 39–46 Sadoughi N, Busso C (2015) Retrieving target gestures toward speech driven animation with meaningful behaviors. In: International conference on Multimodal interaction (ICMI 2015), Seattle, pp 115–122 Sadoughi N, Busso C (2016) Head motion generation with synthetic speech: a data driven approach. In: Interspeech 2016, San Francisco, pp 52–56 Sadoughi N, Liu Y, Busso C (2014) Speech-driven animation constrained by appropriate discourse functions. In: International conference on multimodal interaction (ICMI 2014), Istanbul, pp 148–155 Sadoughi N, Liu Y, Busso C (2015) MSP-AVATAR corpus: motion capture recordings to study the role of discourse functions in the design of intelligent virtual agents. In: 1st international workshop on understanding human activities through 3D sensors (UHA3DS 2015), Ljubljana Sargin ME, Yemez Y, Erzin E, Tekalp AM (2008) Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation. IEEE Trans Pattern Anal Mach Intell 30(8):1330–1345 Silverman K, Beckman M, Pitrelli J, Ostendorf M, Wightman C, Price P, Pierrehumbert J, Hirschberg J (1992) ToBI: a standard for labelling english prosody. In: 2th international conference on spoken language processing (ICSLP 1992), Banff, pp 867–870 Smid K, Pandzic I, Radman V (2004) Autonomous speaker agent. In: IEEE 17th international conference on computer animation and social agents (CASA 2004), Geneva, pp 259–266 Stone M, DeCarlo D, Oh I, Rodriguez C, Stere A, Lees A, Bregler C (2004) Speaking with hands: creating animated conversational characters from recordings of human performance. 
ACM Trans Graph (TOG) 23(3):506–513 Taylor GW, Hinton GE (2009) Factored conditional restricted Boltzmann machines for modeling motion style. In: Proceedings of the 26th annual international conference on machine learning, Montreal, pp 1025–1032 Taylor GW, Hinton GE, Roweis ST (2006) Modeling human motion using binary latent variables. Adv Neural Inf Process Syst 1345–1352
Welbergen H, Ding Y, Sattler K, Pelachaud C, Kopp S (2015) Real-time visual prosody for interactive virtual agents. In: Intelligent virtual agents, Delft, pp 139–151 Xiao B, Georgiou P, Baucom B, Narayanan S (2015) Modeling head motion entrainment for prediction of couples’ behavioral characteristics. In: Affective computing and intelligent interaction (ACII), 2015 international conference on, Xi’an, pp 91–97 Youssef AB, Shimodaira H, Braude DA (2013) Head motion analysis and synthesis over different tasks. Intell Virtual Agents 8108:285–294
Hand Gesture Synthesis for Conversational Characters Michael Neff
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gesture Generation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gesture Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gesture Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Additional Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstract
This chapter focuses on the generation of animated gesticulations, co-verbal gestures that are designed to accompany speech. It begins with a survey of research on human gesture, discussing the various forms of gesture, their structure, and timing requirements relative to speech. The two main problems for synthesizing gesture animation are determining what gestures a character should perform (the specification problem) and then generating appropriate motion (the animation problem). The specification problem has used a range of input, including speech prosody, spoken text, and a communicative intent. Both rule-based and statistical approaches are employed to determine gestures. Animation has also used a range of procedural, physics-based, and data-driven approaches in order to solve a significant set of expressive and coordination requirements. Fluid gesture animation must also reflect the context and include listener behavior and floor management. This chapter concludes with a discussion of future challenges.
M. Neff (*) Department of Computer Science & Program for Cinema and Digital Media, University of California – Davis, Davis, CA, USA e-mail: [email protected] # Springer International Publishing Switzerland 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_5-1
Keywords
Gesture • Character animation • Nonverbal communication • Virtual agents • Embodied conversational agents
Introduction Do gestures communicate? Yes, they do. This has been the conclusion of several meta-studies on the impact of gesture (Goldin-Meadow 2005; Hostetter 2011; Kendon 1994). It is also one of the distinguishing features of gestures in animation. While all movement communicates to some degree, gestures often play a role that is explicitly communicative. Another distinguishing feature for the gestures that we are most often interested in is that they are co-verbal. That is, they occur with speech and they are inextricably linked to that speech in both content and timing. McNeill argues that gestures and language are not separate, but gestures are part of language (McNeill 2005). There are different forms of movement that can broadly be called “gesture.” Building on the categories of Kendon (1988), McNeill defined “Kendon’s Continuum” (McNeill 1992, 2005) to capture the range of gesture types people employ: • Gesticulation: gesture that conveys a meaning related to the accompanying speech. • Speechlike gestures: gestures that take the place of a word(s) in a sentence. • Emblems: conventionalized signs, like a thumbs-up. • Pantomime: gestures with a story and are produced without speech. • Sign language: signs are lexical words. As you move along the continuum, the degree to which speech is obligatory decreases, and the degree to which gestures themselves have the properties of a language increases. This chapter will focus on gesticulations, which are gestures that co-occur with speech as they are most relevant to conversational characters. Synthesis of the whole spectrum, however, presents worthwhile animation problems. Emblems and pantomimes are useful in situations where speech may not be possible. Sign languages are the native language of many members of the deaf community, and sign synthesis can increase their access to computational sources. The problems of gesticulations are unique, however, since they are co-present with speech and do not have linguistic structure on their own. Kendon introduced a three-level hierarchy to describe the structure of gestures (Kendon 1972). The largest structure is the gesture unit. Gesture units start in a retraction or rest pose, continue with a series of gestures, and then return to a rest pose, potentially different from the initial rest pose. A gesture phrase encapsulates an individual gesture in this sequence. Each gesture phrase can in turn be broken down into a sequence of gesture phases. A preparation is a motion that takes the hands to the required position and orientation for the start of the gesture stroke. A
prestroke hold is a period of time in which the hands are held in this configuration. The stroke is the main meaning-carrying movement of the gesture and has the most focused energy. It may be followed by a poststroke hold in which the hands are held at the end position. The final phase is a retraction that returns the hands to a rest pose. All phases are optional except the stroke. There are some gestures in which the stroke does not involve any movement (e.g., a raised index finger). These are variously called an independent hold (Kita et al. 1998) or a stroke hold (McNeill 2005). The pre- and poststroke holds were proposed by Kita (1990) and act to synchronize the gesture with speech. The prestroke hold delays the gesture stroke until the corresponding speech begins, and the poststroke hold occurs while the corresponding speech is completing. Much like they allow mental processing in humans, they can be used in synthesis systems to allow time for planning or other processing to take place. The existence of gesture units is important for animation systems as it indicates a potential need to avoid generating a sequence of singleton gestures that return to a rest pose after each gesture. While this would offer the simplest synthesis solution, people are quite sensitive to the structure of gestural communication. A study (Kipp et al. 2007) showed that people found a character that used multiple-phrase gesture units more natural, friendly, and trustworthy than a character that performed singleton gestures, which was viewed as more nervous. These significant differences in appraisal occurred despite only 1 of 25 subjects being able to actually identify the difference between the multiphrase g-unit clips and single-phrase g-unit clips. This illustrates what appears to be a common occurrence in our gesture research: people will react to differences in gesture performance without being consciously aware of what those differences are. Gestures are synchronized in time with their co-expressive speech. About 90% of the time, the gesture occurs slightly before the co-expressive speech (Nobe 2000) and rarely occurs after (Kendon 1972). Research on animated characters does indicate a preference for this slightly earlier timing of gesture, but also suggests that people may not be particularly sensitive to errors in timing, at least within a ±0.6 second range (Wang and Neff 2013). A number of categorizations of gesture have been proposed. One of the best known is from McNeill and Levy (McNeill 1992; McNeill and Levy 1982) and contains the classes iconics, metaphorics, deictics, and beats. Iconic gestures create images of concrete objects or actions, such as illustrating the size of a box. Metaphorics create images of the abstract. For instance, a metaphoric gesture could make a cup shape with the hand, but refer to holding an idea rather than an actual object. Metaphoric gestures are also used to locate ideas spatially, for instance, putting positive things on the left and negative to the right and then using this space to categorize future entities in the conversation. Deictics locate objects and entities in space, as with pointing, creating a reference and context for the conversation. They are often performed with a hand that is closed except for an extended index finger, but can be performed with a wide range of body parts. Deixis can be abstract or concrete.
Concrete deixis points to an existing reference (e.g., an object or person) in space, whereas abstract deixis creates a reference point in space for an idea or
concept. Beats are small back-and-forth or up-and-down movements of the hand, performed in rhythm to the speech. They serve to emphasize important sections of the speech. In later work, McNeill (2005) argued that it is inappropriate to think of gesture in terms of categories, but that the categories should instead be considered dimensions. This reflects the fact that any individual gesture may contain several of these properties (e.g., deixis and iconicity). He suggests additional dimensions of temporal highlighting (the function of beats) and social interactivity, which helps to manage turn-taking and the flow of conversation.
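Before moving on, it is worth noting that Kendon's unit–phrase–phase hierarchy described above maps naturally onto a small data model inside an animation system. The sketch below is one minimal, hypothetical representation of that structure: the class and field names are illustrative rather than taken from any cited system, and times are assumed to be in seconds from the start of the utterance.

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PhaseType(Enum):
    PREPARATION = "preparation"
    PRESTROKE_HOLD = "prestroke_hold"
    STROKE = "stroke"              # the only obligatory phase
    POSTSTROKE_HOLD = "poststroke_hold"
    RETRACTION = "retraction"


@dataclass
class GesturePhase:
    phase_type: PhaseType
    start: float   # seconds from utterance start (assumed convention)
    end: float


@dataclass
class GesturePhrase:
    """A single gesture: a stroke plus optional surrounding phases."""
    phases: List[GesturePhase]

    def stroke(self) -> Optional[GesturePhase]:
        # Return the stroke phase if present; stroke holds would also use STROKE.
        return next((p for p in self.phases if p.phase_type is PhaseType.STROKE), None)


@dataclass
class GestureUnit:
    """A run of gesture phrases between two rest poses."""
    phrases: List[GesturePhrase] = field(default_factory=list)

    def duration(self) -> float:
        if not self.phrases:
            return 0.0
        return self.phrases[-1].phases[-1].end - self.phrases[0].phases[0].start

Representing multi-phrase units explicitly, rather than only individual gestures, is what allows a synthesis system to avoid the singleton-gesture pattern that viewers reacted against in the study cited above.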
State of the Art

Generation of conversational characters has achieved substantial progress, but the bar for success is extremely high. People are keen observers of human motion and will make judgments based on subtle details. By way of analogy, people will make judgments between good and bad actors, and actors being good in a particular role, but not another – and actors are human, with all the capacity for naturalness and expressivity that comes with that. The bar for conversational characters is that of a good actor, effectively performing a particular role. The field remains a long way from being able to do this automatically, for a range of different characters and over prolonged interactions with multiple subjects.
Gesture Generation Tasks

Gesture Specification

When generating virtual conversational characters, one of the primary challenges is determining what gestures a character should perform. Different approaches have trade-offs in terms of the type of input information they require, the amount of processing time needed to determine a gesture, and the quality of the gesture selection, both on grounds of accurately reflecting a particular character personality and being appropriate for the co-expressed utterance. One approach is to generate gestures based on prosody variations in the spoken audio signal. Prosody includes changes in volume and pitch. Such approaches have been applied for head nods and movement (Morency et al. 2008), as well as gesture generation (Levine et al. 2009, 2010). A main advantage of the approach is that good-quality audio can be highly expressive, and using it as an input for gesture specification allows the gestures to match the expressive style of the audio. Points of emphasis in the audio appear to be good landmarks for placing gesture, and their use will provide uniform emphasis across the channels. Prosody-based approaches have been used to generate gesture in real time as a user speaks (Levine et al. 2009, 2010). The drawback of only using prosody is that it does not capture semantics, so the gestures will likely not match the meaning of the audio and certainly not supplement
the underlying meaning that is being conveyed in the utterance with information not present in the audio. This concern can be at least partially addressed by also parsing the spoken text (Marsella et al. 2013). It is believed that in human communication, the brain is co-planning the gesture and the utterance (McNeill 2005), so approaches that do not use future information about the planned utterance may be unlikely to match the sophistication of human gesture-speech coordination. Another approach generates gesture based on the text of the dialogue that is to be spoken. A chief benefit of these techniques is that text captures much of the information being conveyed, so these techniques can generate gestures that aid the semantics of the utterance. Text can also be analyzed for emotional content and rhetorical style, providing a rich basis for gesture generation. Rule-based approaches (Cassell et al. 2001; Lee and Marsella 2006; Lhommet and Marsella 2013; Marsella et al. 2013) can determine both the gesture locations and the type of gestures to be performed. Advantages of these techniques are that they can handle any text covered by their knowledge bases and are extensible in flexible and straightforward ways. Disadvantages include that some amount of manual work is normally required to create the rules and it is difficult to know how to author the rules to create a particular character, so behavior tends to be generic. Other work uses statistical approaches to predict the gestures that a particular person would employ (Bergmann et al. 2010; Kipp 2005; Neff et al. 2008). These techniques support the creation of individualized characters, which are essential for many applications, such as anything involving storytelling. Individualized behavior may also outperform averaged behavior (Bergmann et al. 2010), as would be contained in generic rules. These approaches, however, are largely limited to reproducing characters like the subjects modeled, and creating arbitrary characters remains an open challenge. Recent work has begun applying deep learning to the mapping from text and prosody to gesture (Chiu et al. 2015). This is a potentially powerful approach, but it requires a large quantity of data, and ways to produce specific characters must be developed. While the divide between prosody-driven and rule-based approaches is useful for understanding techniques, current approaches are increasingly relying on a combination of text and prosody information (e.g., (Lhommet and Marsella 2013; Marsella et al. 2013)). Techniques based on generating gesture from text are limited to ideas expressed in the text. The information we convey through gesture is sometimes redundant with speech, although expressed in a different form, but often expresses information that is different from that in speech (McNeill 2005). For example, I might say "I saw a [monster]," with the square brackets indicating the location of a gesture that holds my hand above my head, with my fingers bent 90° at the first knuckle and then held straight. The gesture indicates the height of the monster, information completely lacking from the verbal utterance. Evidence suggests that gestures are most effective when they are nonredundant (Goldin-Meadow 2006; Hostetter 2011; Singer and Goldin-Meadow 2005). This implies the need to base gesture generation on a deeper notion of a "communicative intent," which may not solely be contained in the text and describes the full message to be delivered.
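As a deliberately simplified illustration of the rule-based idea, the sketch below maps keywords in the input text to candidate gesture types. It is a toy example only: real systems such as those cited above use much richer linguistic and semantic analysis, and the patterns, gesture labels, and function names here are invented for illustration.

import re
from typing import List, Tuple

# Toy rules mapping word patterns to gesture types; invented for illustration.
RULES: List[Tuple[str, str]] = [
    (r"\b(this|that|there|here)\b", "deictic"),     # spatial reference -> pointing
    (r"\b(huge|tiny|big|small|tall)\b", "iconic"),  # size/shape words -> iconic
    (r"\b(everyone|we|you)\b", "metaphoric"),       # groups/ideas -> conduit-like
]

def specify_gestures(words: List[str]) -> List[Tuple[int, str]]:
    """Return (word_index, gesture_type) pairs; word timings from a TTS engine
    or forced alignment would then anchor each stroke to the matched word."""
    specs = []
    for i, w in enumerate(words):
        for pattern, gesture_type in RULES:
            if re.search(pattern, w.lower()):
                specs.append((i, gesture_type))
                break
    return specs

print(specify_gestures("I saw a huge monster over there".split()))
# -> [(3, 'iconic'), (6, 'deictic')]

Even this toy version makes the limitation discussed above concrete: the rules can only react to what is written in the text, so information carried exclusively in the imagined communicative intent (such as the monster's height) never reaches the gesture.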
The SAIBA (situation, agent, intention, behavior, animation) framework represents a step toward establishing a computational architecture to tackle the
fundamental multimodal communication problem of moving from a communicative intent to output across the various agent channels of gesture, text, prosody, facial expressions, and posture (SAIBA. Working group website 2012). The approach defines stages in production and markup languages to connect them. The first stage is planning the communicative intent. This is communicated using the Function Markup Language (Heylen et al. 2008) to the behavior planner, which decides how to achieve the desired functions using the agent modalities available. The final behavior is then sent to a behavior realizer for generation using the Behavior Markup Language (Kopp et al. 2006; Vilhjalmsson et al. 2007). Such approaches echo, at least at the broad conceptual level, theories of communication like McNeill’s growth point hypothesis that argue gesture and language emerge in a shared process from a communicative intent (McNeill 2005). Recent work has sought to develop cognitive (Kopp et al. 2013) and combined cognitive and linguistic models (Bergmann et al. 2013) to explore the distribution of communicative content across output modalities.
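To make the pipeline more concrete, the sketch below assembles the kind of message a behavior planner might hand to a behavior realizer: a single gesture whose stroke is synchronized to a word in the accompanying speech. The element and attribute names follow the general shape of BML, but exact syntax (namespaces, sync-point references, lexeme names such as the "ICONIC_SIZE" used here) varies between BML versions and realizers, so this should be read as a schematic rather than a verbatim BML document.

import xml.etree.ElementTree as ET

# Build a small BML-like block: an iconic gesture whose stroke is aligned with
# a time marker placed inside the speech text.
bml = ET.Element("bml", id="bml1")

speech = ET.SubElement(bml, "speech", id="s1")
text = ET.SubElement(speech, "text")
text.text = "I saw a huge "
mark = ET.SubElement(text, "sync", id="tm1")   # time marker before the key word
mark.tail = "monster."

# Request a gesture whose stroke starts at the speech time marker.
ET.SubElement(bml, "gesture", id="g1", lexeme="ICONIC_SIZE", stroke="s1:tm1")

print(ET.tostring(bml, encoding="unicode"))

The point of the markup layers in SAIBA is exactly this decoupling: the planner only states what should happen and when the channels must coincide, while the realizer decides how the arm actually moves.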
Gesture Animation

Generation of high-quality gesture animation must satisfy a rich set of requirements:
• Match the gesture timing to that of the speech.
• Connect individual gestures into fluent gesture units.
• Adjust the gesture to the character's context (e.g., to point to a person or object in the scene).
• Generate appropriate gesture forms for the utterance (e.g., show the shape of an object, mime an action being performed, point).
• Vary the gesture based on the personality of the character.
• Vary the gesture to reflect the character's current mood and tone of the speech.
While a wide set of techniques have been used for gesture animation, the need for precise agent control, especially in interactive systems, has often favored the use of kinematic procedural techniques (e.g., (Chi et al. 2000; Hartmann et al. 2006; Kopp and Wachsmuth 2004)). For example, Kopp and Wachsmuth (2004) present a system that uses curves derived from neurophysiological research to drive the trajectory of gesturing arm motions. Procedural techniques allow full control of the motion, making it easy to adjust the gesture to the requirements of the speech, both for matching spatial and timing demands. While gesture is less constrained by physics than motions like tumbling, physical simulation has still been used for gesture animation and can add important nuance to the motion (Neff and Fiume 2002, 2005; Neff et al. 2008; Van Welbergen et al. 2010). These approaches generally include balance control and a basic approximation of muscle, such as a proportional-derivative controller. The balance control will add full-body movement to compensate for arm movements, and the
controllers can add subtle oscillations and arm swings. These effects require proper tuning. Motion capture data has seen increasing use in an attempt to improve the realism of character motion. These techniques often employ versions of motion graphs (Arikan and Forsyth 2002; Kovar et al. 2002; Lee et al. 2002), which concatenate segments of motion to create a sequence, such as in Fernández-Baena et al. (2014) and Stone et al. (2004). The motion capture data can provide very high-quality motion, but control is more limited, so it can be a challenge to adapt the motion to novel speech or to generate different characters. Gesture relies heavily on hand shape, and it can be a challenge to capture good-quality hand motion while simultaneously capturing body motion. Some techniques seek to synthesize acceptable hand motion using the body motion alone (Jörg et al. 2012). For a fuller discussion of the issues around hand animation, please refer to Wheatland et al. (2015). As part of the SAIBA effort, several research groups have developed "behavior realizers," animation engines capable of realizing commands in the Behavior Markup Language (Vilhjalmsson et al. 2007) that is supplied by a higher level in an agent architecture. These systems emphasize control and use a combination of procedural data and motion clips (e.g., (Heloir and Kipp 2009; Kallmann and Marsella 2005; Shapiro 2011; Thiebaux et al. 2008; Van Welbergen et al. 2010)). The SmartBody system, for example, uses a layering approach based on a hierarchy of controllers for different tasks (e.g., idle motion, locomotion, reach, breathing). These controllers may control different or overlapping parts of the body, which creates a coordination challenge. They can be combined or one controller may override another (Shapiro 2011). Often gesture specification systems will indicate a particular gesture form that is required, e.g., a conduit gesture in which the hand is cupped and moves forward. Systems often employ a dictionary of gesture forms that can be used in synthesis. These gestures have been encoded using motion capture clips, hand animation, or numerical spatial specifications. Some techniques (Kopp et al. 2004) have sought to generate the correct forms automatically, for example, based on a description of the image the gesture is trying to create. Gesture animation is normally deployed in scenarios where it is desirable for the characters to portray clear personalities and show variations in emotion and mood. For these reasons, controlling expressive variation of the motion has been an important focus. A set of challenges must be solved. These include determining how to parameterize a motion to give expressive control, understanding what aspects of motion must be varied to generate a desired impact, ensuring consistency over time, determining how to expose appropriate control structures to the user or character control system, and, finally, synthesizing the motion to contain the desired properties. Chi et al. (2000) use the Effort and Shape components of Laban Movement Analysis to provide an expressive parameterization of motion. Changing any of the four Effort qualities (Weight, Space, Time, and Flow) or the Shape qualities (Rising-Sinking, Spreading-Enclosing, Advancing-Retreating) will vary the timing and path of the gesture, along with the engagement of the torso. Hartmann
et al. (2006) use tension, continuity, and bias splines (Kochanek and Bartels 1984) to control arm trajectories and provide expressive control through parameters for activation, spatial and temporal extent, and fluidity and repetition. Neff and Fiume (2005) develop an extensible set of movement properties that can be varied and a system that allows users to write character sketches that reflect a particular character's movement tendencies and then layer additional edits on top. While gestures are often largely thought of as movements of the arms and hands, and are often represented this way in computational systems, they can indeed use the whole body. A character can nod its head, gesture with its toe, etc. More importantly, while the arms are the dominant appendages for a motion, engaging the entire body can lead to clearer and more effective animation. Lamb called this engagement of the whole body during gesturing Posture-Gesture Merger and argued that it led to a more fluid and attractive motion (Lamb 1965).
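As a concrete example of this kind of trajectory-level control, the sketch below samples a wrist path through a few key positions using Kochanek–Bartels (tension, continuity, bias) splines, the curve family cited above; varying the three parameters changes the character of the stroke. The key positions and parameter values are made-up placeholders, and a production system would combine such a path with timing control and inverse kinematics for the arm.

import numpy as np

def kb_tangents(points, tension=0.0, continuity=0.0, bias=0.0):
    """Outgoing (d) and incoming (s) tangents of a Kochanek-Bartels spline."""
    n = len(points)
    d = np.zeros_like(points)
    s = np.zeros_like(points)
    for i in range(n):
        p_prev = points[max(i - 1, 0)]
        p_next = points[min(i + 1, n - 1)]
        dm, dp = points[i] - p_prev, p_next - points[i]
        d[i] = ((1 - tension) * (1 + bias) * (1 + continuity) / 2) * dm \
             + ((1 - tension) * (1 - bias) * (1 - continuity) / 2) * dp
        s[i] = ((1 - tension) * (1 + bias) * (1 - continuity) / 2) * dm \
             + ((1 - tension) * (1 - bias) * (1 + continuity) / 2) * dp
    return d, s

def kb_spline(points, samples_per_segment=20, **kw):
    """Sample a smooth trajectory through `points` (N x 3) with TCB control."""
    points = np.asarray(points, dtype=float)
    d, s = kb_tangents(points, **kw)
    out = []
    for i in range(len(points) - 1):
        u = np.linspace(0, 1, samples_per_segment, endpoint=False)[:, None]
        h00, h10 = 2*u**3 - 3*u**2 + 1, u**3 - 2*u**2 + u
        h01, h11 = -2*u**3 + 3*u**2, u**3 - u**2
        # Cubic Hermite segment between key i and i+1 with TCB tangents.
        out.append(h00*points[i] + h10*d[i] + h01*points[i+1] + h11*s[i+1])
    out.append(points[-1][None, :])
    return np.vstack(out)

# Example: a wrist path for a small sweeping stroke (made-up key positions, meters).
keys = [[0.2, 1.0, 0.3], [0.4, 1.2, 0.4], [0.6, 1.1, 0.3]]
smooth = kb_spline(keys, tension=-0.3)   # negative tension -> rounder, freer path
print(smooth.shape)                      # (41, 3)

Exposing tension, continuity, and bias (or higher-level parameters mapped onto them) is one design option for giving animators expressive handles without editing individual frames.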
Additional Considerations

Conversations are interactions between people and this must be reflected in the animation. Both the speaker(s) and listener(s) have roles to play. Visual attention must be managed through appropriate gaze behavior to indicate who is paying attention and how actively, along with indicating who is thinking or distracted. Attentive listeners will provide back channel cues, like head nods, to indicate that they are listening and understanding. These must be appropriately timed with the speaker's dialogue. Holding the floor is also actively managed. Speakers may decide to yield their turn to another. Listeners may interrupt, and the speaker may yield in response or refuse to do so. Floor management relies on both vocal and gestural cues. Proxemics are also highly communicative to an audience and must be managed appropriately. This creates additional animation challenges in terms of small-scale locomotion in order to fluidly manage character placement. Gestural behavior must adapt to the context. Gestures will be adjusted based on the number of people in the conversation and their physical locations relative to one another. As characters interact, they may also begin to mirror each other's behavior and postures. Gestures are also often used to refer to items in the environment and hence must be adapted based on the character's location. Finally, characters will engage in conversations while also simultaneously performing other activities, such as walking, jogging, or cleaning the house. The gesture behavior must be adapted to the constraints of this other behavior; for example, gestures performed while jogging tend to be done with more bent arms and are less frequent than standing gestures (Wang et al. 2016).
Future Directions

While significant progress has been made, the bar for conversational gesture animation is very high. We are a long way from being able to easily create synthetic characters that match the expressive quality, range, and realism of a skilled actor, and
applications that rely on synthetic characters are impoverished by this gap. Some of the key issues to address include:
Characters with large gesture repertoires: It currently takes a great deal of work to build a movement set for a character, generally involving recording, cleaning, and retargeting motion capture or hand animating movements. This places a practical limitation on the number of gestures that characters can perform. Methods that allow large sets of gestures to be rapidly generated are needed. A particular challenge is being able to synthesize novel gestures on the fly to react to the character's current context.
Motion quality: While motion quality has improved, it remains well short of photo-realism, particularly for interactive characters. Hand motion remains a particular challenge, as is appropriate full-body engagement. Most systems focus on standing characters, whereas people engage in a wide range of activities while simultaneously gesturing. A significant challenge is correctly orchestrating a performance across the various movement modalities (breath, arm movements, body movements, facial expressions, etc.), especially when the motion diverges from playback of a recording or hand-animated sequence.
Planning from communicative intent: Systems that can represent an arbitrary communicative intent and distribute it across various communication modes, and do so in different ways for different speakers, remain a long-term goal. This will likely require both improved computational models and a more thorough understanding of how humans formulate communication.
Customization for characters and mood: While people tend to have their own, unique gesturing style, it is a challenge to imbue synthetic characters with this expressive range without an enormous amount of manual labor. It is also a challenge to accurately reflect a character's current mood: anger, sadness, irritation, excitement, etc.
Authoring controls: If a user wishes to create a particular character with a given role, personality, etc., there must be tools to allow this to be authored. Substantial work is required to allow authors to go from an imagined character to an effective realization.
References

Arikan O, Forsyth DA (2002) Interactive motion generation from examples. ACM Trans Graph 21 (3):483–490 Bergmann K, Kopp S, Eyssel F (2010) Individualized gesturing outperforms average gesturing–evaluating gesture production in virtual humans. In: International conference on intelligent virtual agents. Springer, Berlin/Heidelberg, pp 104–117 Bergmann K, Kahl S, Kopp S (2013) Modeling the semantic coordination of speech and gesture under cognitive and linguistic constraints. In: Intelligent virtual agents. Springer, Berlin, Heidelberg, pp 203–216 Cassell J, Vilhjálmsson H, Bickmore T (2001) BEAT: the behavior expression animation toolkit. In: Proceedings of SIGGRAPH 2001. ACM, New York, NY, pp 477–486 Chi DM, Costa M, Zhao L, Badler NI (2000) The EMOTE model for effort and shape. In: Proceedings of SIGGRAPH 2000. ACM, New York, NY, pp 173–182
Chiu C-C,Morency L-P, Marsella S (2015) Predicting co-verbal gestures: a deep and temporal modeling approach. In: International conference on intelligent virtual agents. Springer, Cham, pp 152–166. Fernández-Baena A, Montaño R, Antonijoan M, Roversi A, Miralles D, Alas F (2014) Gesture synthesis adapted to speech emphasis. Speech Comm 57:331–350 Goldin-Meadow S (2005) Hearing gesture: how our hands help us think. Harvard University Press, Massachusetts Goldin-Meadow S (2006) Talking and thinking with our hands. Curr Dir Psychol Sci 15(1):34–39 Hartmann B, Mancini M, Pelachaud C (2006) Implementing expressive gesture synthesis for embodied conversational agents. In Proc. Gesture Workshop 2005, vol 3881 of LNAI. Springer, Berlin\Heidelberg, pp 45–55 Heloir A, Kipp M (2009) EMBR–A Realtime Animation Engine for Interactive Embodied Agents. In: Intelligent virtual agents 09. Springer, Berlin, Heidelberg, pp 393–404 Heylen D, Kopp S, Marsella SC, Pelachaud C, Vilhjálmsson H (2008) The next step towards a function markup language. In: International workshop on intelligent virtual agents. Springer, Berlin, Heidelberg, pp 270–280 Hostetter AB (2011) When do gestures communicate? A meta-analysis. Psychol Bull 137(2):297 Jörg S, Hodgins J, Safonova A (2012) Data-driven finger motion synthesis for gesturing characters. ACM Trans Graph 31(6):189 Kallmann M, Marsella S (2005) Hierarchical motion controllers for real-time autonomous virtual humans. In: Proceedings of the 5th International working conference on intelligent virtual agents (IVA’05), pp 243–265, Kos, Greece, 12–14 September 2005 Kendon A (1972) Some relationships between body motion and speech. Stud dyadic commun 7 (177):90 Kendon A (1988) How gestures can become like words. Cross-cult perspect nonverbal commun 1:131–141 Kendon A (1994) Do gestures communicate? A review. Res lang soc interact 27(3):175–200 Kipp M (2005) Gesture generation by imitation: from human behavior to computer character animation. Universal-Publishers, Boca Raton, Fl, USA Kipp M, Neff M, Kipp K, Albrecht I (2007) Towards natural gesture synthesis: evaluating gesture units in a data-driven approach to gesture synthesis. In Proceedings of intelligent virtual agents (IVA07), vol 4722 of LNAI, Association for Computational Linguistics, Berlin, Heidelberg, pp 15–28 Kita S (1990) The temporal relationship between gesture and speech: a study of Japanese-English bilinguals. MS Dep Psychol Univ Chic 90:91–94 Kita S, Van Gijn I, Van Der Hulst H (1998) Movement phase in signs and co-speech gestures, and their transcriptions by human coders. In: Proceedings of the International Gesture Workshop on Gesture and Sign Language in Human-Computer Interaction. Springer-Verlag, Berlin, Heidelberg, pp 23–35 Kochanek DHU, Bartels RH (1984) Interpolating splines with local tension, continuity, and bias control. Comput Graph 18(3):33–41 Kopp S, Wachsmuth I (2004) Synthesizing multimodal utterances for conversational agents. Comput Anim Virtual Worlds 15:39–52 Kopp S, Tepper P, Cassell J (2004) Towards integrated microplanning of language and iconic gesture for multimodal output. In: Proceedings of the 6th international conference on multimodal interfaces. ACM, New York, NY, pp 97–104 Kopp S, Krenn B, Marsella S, Marshall AN, Pelachaud C, Pirker H, Thórisson KR, Vilhjálmsson H (2006) Towards a common framework for multimodal generation: the behavior markup language. In: International workshop on intelligent virtual agents. 
Springer, Berlin, Heidelberg, pp 205–217 Kopp S, Bergmann K, Kahl S (2013) A spreading-activation model of the semantic coordination of speech and gesture. In: Proceedings of the 35th annual conference of the cognitive science society (CogSci 2013). Cognitive Science Society, Austin (in press, 2013)
Kovar L, Gleicher M, Pighin F (2002) Motion graphs. ACM Trans Graph 21(3):473–482 Lamb W (1965) Posture and gesture: an introduction to the study of physical behavior. Duckworth, London Lee J, Marsella S (2006) Nonverbal behavior generator for embodied conversational agents. In: Intelligent virtual agents. Springer, Berlin, Heidelberg, pp 243–255 Lee J, Chai J, Reitsma PSA, Hodgins JK, Pollard NS (2002) Interactive control of avatars animated with human motion data. ACM Trans Graph 21(3):491–500 Levine S, Theobalt C, Koltun V (2009) Real-time prosody-driven synthesis of body language. ACM Trans Graph 28(5):1–10 Levine S, Krahenbuhl P, Thrun S, Koltun V (2010) Gesture controllers. ACM Trans Graph 29 (4):1–11 Lhommet M, Marsella SC (2013) Gesture with meaning. In: Intelligent Virtual Agents. Springer, Berlin, Heidelberg, pp 303–312 Marsella S, Xu Y, Lhommet M, Feng A, Scherer S, Shapiro A (2013) Virtual character performance from speech. In: Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, ACM, New York, NY, pp 25–35 McNeill D (1992) Hand and mind: what gestures reveal about thought. University of Chicago Press, Chicago McNeill D (2005) Gesture and thought. University of Chicago Press, Chicago McNeill D, Levy E (1982) Conceptual representations in language activity and gesture. In: Jarvella RJ, Klein W (eds) Speech, place, and action. Wiley, Chichester, pp 271–295 Morency L-P, de Kok I, Gratch J (2008) Predicting listener backchannels: a probabilistic multimodal approach. In: International workshop on intelligent virtual agents. Springer, Berlin/ Heidelberg, pp 176–190 Neff M, Fiume E (2002) Modeling tension and relaxation for computer animation. In Proc. ACM SIGGRAPH Symposium on Computer Animation 2002, ACM, New York, NY, pp 81–88 Neff M, Fiume E (2005) AER: aesthetic exploration and refinement for expressive character animation. In: Proceeding of ACM SIGGRAPH / Eurographics Symposium on Computer Animation 2005, ACM, New York, NY, pp 161–170 Neff M, Kipp M, Albrecht I, Seidel H-P (2008) Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Trans Graph 27(1):5:1–5:24 Nobe S (2000) Where do most spontaneous representational gestures actually occur with respect to speech. Lang gesture 2:186 SAIBA. Working group website, 2012. http://wiki.mindmakers.org/projects:saiba:main Shapiro A (2011) Building a character animation system. In: International conference on motion in games, Springer, Berlin\Heidelberg, pp 98–109 Singer MA, Goldin-Meadow S (2005) Children learn when their teacher’s gestures and speech differ. Psychol Sci 16(2):85–89 Stone M, DeCarlo D, Oh I, Rodriguez C, Stere A, Lees A, Bregler C (2004) Speaking with hands: creating animated conversational characters from recordings of human performance. ACM Trans Graph 23(3):506–513 Thiebaux M, Marshall A, Marsella S, Kallman M (2008) Smartbody: behavior realization for embodied conversational agents. In: Proceedings of 7th International Conference on autonomous agents and multiagent systems (AAMAS 2008), International Foundation for Autonomous Agents and Multiagent Systems Richland, SC, pp 151–158 Van Welbergen H, Reidsma D, Ruttkay Z, Zwiers J (2010) Elckerlyc-A BML realizer for continuous, multimodal interaction with a virtual human. 
Journal on Multimodal User Interfaces 4 (2):97–118 Vilhjalmsson H, Cantelmo N, Cassell J, Chafai NE, Kipp M, Kopp S, Mancini M, Marsella S, Marshall A, Pelachaud C et al (2007) The behavior markup language: recent developments and challenges. In: Intelligent virtual agents. Springer, Berlin/New York, pp 99–111 Wang Y, Neff M (2013) The influence of prosody on the requirements for gesture-text alignment. In: Intelligent virtual agents. Springer, Berlin/New York, pp 180–188
Wang Y, Ruhland K, Neff M, O’Sullivan C (2016) Walk the talk: coordinating gesture with locomotion for conversational characters. Comput Anim Virtual Worlds 27(3–4):369–377 Wheatland N, Wang Y, Song H, Neff M, Zordan V, Jörg S (2015) State of the art in hand and finger modeling and animation. Comput Graphics Forum. The Eurographs Association and John Wiley & Sons, Ltd., Chichester, 34(2):735–760
Depth Sensor-Based Facial and Body Animation Control Yijun Shen, Jingtian Zhang, Longzhi Yang, and Hubert P. H. Shum
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Extracting Facial and Body Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Facial Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Body Posture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Human Environment Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dealing with Noisy Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Face Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Posture Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prior Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Depth Camera-Based Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstract
Depth sensors have become one of the most popular means of generating human facial and posture information in the past decade. By coupling a depth camera with computer vision-based recognition algorithms, these sensors can detect human facial and body features in real time. Such a breakthrough has opened up many new research directions in animation creation and control, and it has also introduced new challenges. In this chapter, we explain how depth sensors obtain human facial and body information. We then discuss the main challenge of depth sensor-based systems, which is the inaccuracy of the obtained data, and explain how the problem is tackled. Finally, we point out the emerging applications in the
Y. Shen (*) • J. Zhang (*) • L. Yang (*) • H.P.H. Shum (*) Northumbria University, Newcastle upon Tyne, UK e-mail: [email protected]; [email protected]; [email protected]; [email protected] # Springer International Publishing Switzerland 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_7-1
field, in which human facial and body feature modeling and understanding is a key research problem.

Keywords
Depth sensors • Kinect • Facial features • Body postures • Reconstruction • Machine learning • Computer animation
Introduction

In the past decade, depth sensors have become a very popular means of generating character animation. In particular, since these sensors can obtain human facial and body information in real time, they are used heavily in real-time graphics and games. While it is expensive to use human motion to interact with computer applications using traditional motion capture systems, depth sensors provide an affordable alternative. Due to their low cost and high robustness, depth sensors can be applied in a wide range of application domains with easy setup. Apart from popular applications such as motion-based gaming, depth sensors are also applied in emerging applications such as virtual reality, sport training, serious games, smart environments, etc. In order to work with depth sensors, it is important to understand their working principle, as well as their strengths and weaknesses. In this chapter, we provide comprehensive information on how depth sensors track human facial and body features using computer vision and pattern recognition based techniques, and identify their strengths in computational cost and robustness. Then, we focus on the major weakness of depth sensors, that is, the loss of accuracy that occurs during occlusion, and explain possible solutions to improve recognition quality in detail. In particular, we discuss in depth machine learning-based reconstruction methods that utilize prior knowledge to correct corrupted data obtained by the sensors. Finally, we give some examples of depth sensor-based applications, especially in the field of animation creation, to show how these sensors can improve existing methods in human-computer interaction. In the rest of this chapter, we review the state of the art in section “State of the Art.” We explain in more detail how depth sensors obtain and process human facial and body movement information in section “Extracting Facial and Body Information.” We then discuss the main challenge of depth sensor-based systems, which is the relatively low accuracy of the obtained data, and explain how this challenge can be tackled in section “Dealing with Noisy Data.” We finally point out various emerging applications developed with depth sensors in section “Depth Camera-Based Applications” and conclude this chapter in section “Conclusion.”
State of the Art

Typical depth sensors utilize a depth camera to obtain a depth image. The main advantage of the depth camera over traditional color cameras is that instead of obtaining color information, it estimates the distance of the objects seen by the
camera using an infrared sensor. The images taken from a depth camera are called depth images. In these images, the pixels represent distance instead of color. The nature of depth images provides a huge advantage in automatic recognition using computer vision and machine learning algorithms. With traditional color images, recognizing objects requires segmenting them based on color information. This is challenging in situations in which the background has a similar color to the foreground objects (Fernandez-Sanchez et al. 2013). Moreover, color values are easily affected by lighting conditions, which reduces the robustness of object recognition (Kakumanu et al. 2007). In contrast, with depth images, since the pixel value represents distance, automatic object segmentation becomes independent of the color of the object. As long as the object is geometrically separated from the background, accurate segmentation can be performed. Such improved segmentation in turn supports improved object recognition, which identifies the nature of the objects using accurate geometric features. This advancement in accuracy and robustness has allowed depth sensors to become popular commercial products that lead to many new applications. The Microsoft Kinect (https://developer.microsoft.com/en-us/windows/kinect), which utilizes both color and depth cameras, is one of the most popular depth sensors. Because it combines color and depth cameras, the Kinect can create a 3D point cloud based on the obtained images. Figure 1 shows the images obtained by the two cameras, as well as two views of the corresponding point cloud.
Fig. 1 (From left to right) The color and depth images obtained by a Microsoft Kinect, as well as two views of the 3D point cloud rendered by combining the color and depth information
Kinect gaming usually involves players controlling the gameplay with body movement. Virtual characters in the game are then synthesized on the fly based on the movement information obtained. Such applications involve several domains of research. First, computer vision and machine learning techniques are applied to analyze the depth images obtained by the depth sensor. This typically involves recognizing different human features, such as the human body parts (Shotton et al. 2012). Then, human-computer interaction research is applied to translate the body movement into gameplay control signals. Computer graphics and animation algorithms are used to create real-time rendering, which usually includes character animation synthesized from the movement of the player. In some situations, virtual reality (Kyan et al. 2015) or augmented reality (Vera et al. 2011) research is adopted to enhance the immersiveness of the game. However, depth sensors are not without their weaknesses. Compared to traditional capturing devices such as accelerometers, the accuracy of depth sensors is
considerably lower. This is mainly because these sensors usually consist of a single depth camera. When occlusions occur, the sensors cannot obtain information from the shielded area. This results in a significant drop in recognition accuracy. While it is possible to utilize multiple depth cameras to obtain better results, one has to deal with cross-talk, the interference of infrared signals among multiple cameras (Alex Butler et al. 2012). It also diminishes the advantage of depth sensors in terms of easy setup and efficient capture. Therefore, it is preferable to enhance the sensor accuracy using software algorithms, instead of introducing more hardware. To enhance the quality of the obtained data, machine learning approaches using prior knowledge of face and body features have shown great success (Shum et al. 2013). The main idea is to apply prior knowledge to the tracked data and correct the less reliable parts or introduce more details into the data. Such knowledge can either be defined manually or learned from examples. The key is to represent the prior knowledge in a way that can be used efficiently and effectively at run-time.
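The depth-based segmentation advantage described earlier in this section can be illustrated with a simple range test: assuming the user stands between the sensor and the background, a foreground mask falls out of a threshold on distance alone. The sketch below is a toy example; the near/far limits and the synthetic depth image are arbitrary illustrative values.

import numpy as np

def segment_by_depth(depth, near=0.5, far=2.5):
    """Return a boolean foreground mask for pixels within [near, far] meters.

    Because each pixel value is a distance, a user in front of a wall can be
    separated with a range test, independent of clothing color or lighting.
    Zero depth values (no sensor return) fall outside the range and are treated
    as background.
    """
    return (depth > near) & (depth < far)

# Toy depth image: a "user" at ~1.5 m in front of a wall at ~3 m.
depth = np.full((6, 8), 3.0)
depth[1:5, 2:6] = 1.5
print(segment_by_depth(depth).astype(int))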
Extracting Facial and Body Information

There is a large body of research on obtaining facial and body information from depth cameras. In this section, we explain some of the main methods and discuss their performance.
Facial Feature

Facial feature detection usually involves face segmentation and landmark detection. The former segments the face from the background, while the latter detects key regions and feature points. To segment the face area from the background and the rest of the human body, one can detect the skin color and perform segmentation (Bronstein et al. 2005). However, such a method is easily affected by illumination. Using the histogram of depth information from the depth image can improve the system robustness (Segundo et al. 2010). Since human faces have the same topology, it is possible to apply geometric rules to identify landmarks on the face. A simple example is to approximate the face with an ellipse and divide the ellipse into different slices based on predefined angles (Segundo et al. 2010). For each slice, corresponding features can be searched based on the 3D height of the face. For example, the eyes are the lowest points on the corresponding slice, while the nose is the highest. Similarly, it is possible to use local curvature to represent different features on the face, so as to determine different facial regions (Chang et al. 2006). For example, the eye regions are usually a valley and can be represented by specific values of mean curvature and Gaussian curvature. The disadvantage of these methods is that manually defined geometric rules may not be robust for different users, especially for users coming from different countries. A better solution is to apply a data-driven approach. For example, one can construct a database with
segmented facial regions and train a random forest that can automatically identify facial regions on a face (Kazemi et al. 2014). Another direction for representing facial features is to use a predefined facial template (Li et al. 2013; Weise et al. 2011). Such a template is a high-quality 3D mesh with controllable parameters. At run-time, the system deforms the 3D template to align with the geometric structure of the segmented face from the depth image. Such a deformation process is usually done by numerical optimization due to the high number of degrees of freedom. Upon successful alignment, the system can interpret the observed face in the depth image through the deformed template. It can also represent the face with a set of deformation parameters so as to control animation in real time. Microsoft Kinect also provides support for 3D landmark detection, as shown in Fig. 2.
Fig. 2 3D facial landmarks identified by Kinect overlapped on 2D color images
Different expressions can be identified based on the arrangement of the 3D landmarks. Such understanding of facial orientation and expression is useful for real-time animation control.
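Template-based tracking of this kind usually comes down to solving for deformation parameters that best explain the detected landmarks. The sketch below shows the core step for a linear blendshape-style template as a regularized least-squares problem; the neutral face, blendshape displacements, and "detected" landmarks are random placeholders rather than real sensor output, and real trackers add temporal and sparsity priors on top of this.

import numpy as np

def fit_blendshape_weights(neutral, blendshapes, detected, reg=1e-2):
    """Solve detected ~= neutral + B @ w for blendshape weights w.

    neutral:     (3L,)    stacked xyz of L landmark positions on the template
    blendshapes: (3L, K)  per-blendshape landmark displacements
    detected:    (3L,)    landmark positions observed by the depth sensor
    A small Tikhonov term keeps the solution stable when landmarks are noisy.
    """
    B = blendshapes
    A = B.T @ B + reg * np.eye(B.shape[1])
    b = B.T @ (detected - neutral)
    w = np.linalg.solve(A, b)
    return np.clip(w, 0.0, 1.0)   # blendshape weights are conventionally in [0, 1]

# Toy example with L = 30 landmarks and K = 8 expression blendshapes.
rng = np.random.default_rng(0)
neutral = rng.normal(size=90)
B = rng.normal(size=(90, 8))
true_w = rng.uniform(0, 1, size=8)
detected = neutral + B @ true_w + rng.normal(scale=0.01, size=90)  # noisy observation
print(np.round(fit_blendshape_weights(neutral, B, detected), 2))

The recovered weight vector is exactly the compact set of deformation parameters mentioned above, which can be retargeted to drive a character face in real time.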
Body Posture

The mainstream approach in depth sensor-based body recognition is to apply pattern recognition and machine learning techniques to identify the human subject. By training a classifier that can identify how individual body parts appear in the depth image, one can recognize these parts using real-time depth camera input (Girshick et al. 2011; Shotton et al. 2012; Sun et al. 2012). There are several major challenges in this approach. Chief among them is the availability of training data. In order to train a classifier, a large number of depth images with annotations indicating the body parts are needed. Since body parts appear differently based on the viewing angle, the training database should capture such parts from different viewpoints. Moreover, since users of different body sizes appear differently in depth images, to train a robust classifier that can handle all users, training images consisting of body variation are
needed. As a result, hundreds of thousands of annotated depth images are required, which exceeds what human labelers can generate. To solve the problem, it has been proposed to synthesize depth images using different humanoid models and 3D motion capture data. Since the body part information of these humanoid models is known in advance, it becomes possible to automatically annotate the positions of the body parts in the synthesized depth images. With these training images, one can train a decision forest to classify depth pixels into the corresponding body parts. Different designs of decision forest have resulted in different levels of success, and they are all capable of identifying body parts in real time. Microsoft Kinect also applies a pattern recognition approach to recognize body parts from the depth images (Shotton et al. 2012). Figure 3 shows the results of Kinect posture recognition, shown as the yellow skeleton.
Fig. 3 (Left) The 3D skeleton obtained by Microsoft Kinect with the corresponding depth and color images. (Middle and Right) Two views of the 3D point cloud together with the obtained 3D skeleton
By overlapping the skeleton with the 3D point cloud, it can be observed that the Kinect performs reasonably accurately under normal circumstances. Another stream of methods in body identification and modeling takes advantage of the geometry of the human body and utilizes a body template model (Liu et al. 2016a; Zhang et al. 2014a). First, the pixels in the depth image that belong to the human body are extracted. Since the pixel value represents distance, one can project them into a 3D space and create a point cloud of the human body. Then, the system fits a 3D humanoid mesh model to such a point cloud, so as to estimate the body posture. This process involves deforming the 3D mesh model such that the surface of the model aligns with the point cloud. Since the template model contains human information such as body parts, when deforming the model to fit the point cloud, we identify the corresponding body information in the point cloud. The main challenge in this method is to deform the mesh properly to avoid unrealistic postures and over-deformed surfaces, which is still a challenging research problem. Physics-based motion optimization can ensure the physical correctness of the generated postures (Zhang et al. 2014a). Utilizing a simplified, intermediate template for deformation optimization can enhance the optimization performance (Liu et al. 2016a). This method can potentially provide richer body information depending on the template used. However, a major drawback of such an optimization-based approach is the higher run-time computational cost, making it inefficient for real-time systems.
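The "project them into a 3D space" step is a standard pinhole back-projection. The sketch below converts a depth image in meters into a point cloud; the intrinsic parameters (fx, fy, cx, cy) are placeholder values rather than the calibration of any particular sensor.

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, meters) into an (N, 3) point cloud.

    For pixel (u, v) with depth z:  x = (u - cx) * z / fx,  y = (v - cy) * z / fy.
    Pixels with zero depth (no return from the sensor) are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Toy 4x4 depth image; the intrinsics below are illustrative placeholders.
depth = np.full((4, 4), 2.0)
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=1.5, cy=1.5)
print(cloud.shape)  # (16, 3)

A template-fitting method would then deform the humanoid mesh toward such a cloud, whereas the classification approach above works directly on the depth pixels.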
Human Environment Interaction

The captured depth images contain information not only about the user but also about the surrounding environment. Therefore, it is possible to identify high-level information about how the user interacts with the environment. Unlike the human body, the environment does not have a uniform structure, and therefore it is not possible to fit a predefined template or apply prior knowledge. Geometric information based on planes and shapes becomes the next available information to extract. The RANdom SAmple Consensus (RANSAC) algorithm can be used to identify planar objects in the scene such as walls and floors, which can help to understand how the human user moves around in open areas (Mackay et al. 2012). It is also possible to compare successive depth images to identify the moving parts, in order to understand how the user interacts with external objects (Shum 2013). Depth cameras can be used for 3D scanning in order to obtain surface information of the environment or even the human user. While one depth image only provides information about a partial surface, which we call a 2.5D point cloud, multiple depth images taken from different viewing angles can be combined to form a full 3D surface. One of the most representative systems in this area is KinectFusion (Newcombe et al. 2011). Such a system requires the user to carry a Kinect and capture depth images continuously over a static environment. Real-time registration is performed to recover the 3D translation and rotation of the depth camera. This allows alignment of multiple depth images to form a complete 3D surface. Apart from scanning the environment, it is possible to scan the face and body of a human user (Cui et al. 2013) and apply real-time posture deformation to the Kinect-tracked skeleton (Iwamoto et al. 2015). Finally, because single-view depth cameras suffer from the occlusion problem, it has been proposed to capture how human users interact with objects by combining KinectFusion, color cameras, and an accelerometer-based motion capture system (Sandilands et al. 2012, 2013). Since depth sensors can obtain both environment and human information, they support the argument that human information can enhance the understanding of unstructured environments (Jiang and Saxena 2013; Jiang et al. 2013). Take a chair as an example. A chair can come in different shapes and designs, which makes recognition extremely difficult. However, the general purpose of a chair is for a human to rest on. Therefore, with the human movement information obtained by depth cameras, we can identify a chair not just by its shape but also by the way the human interacts with it. Similarly, human movement may sometimes be ambiguous. Understanding the environment helps us to identify the correct meaning of the human motion. Depth sensors open up new directions in recognition by considering human and environment information together.
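A minimal version of the RANSAC plane detection mentioned above can be written in a few lines. The sketch below fits a dominant plane such as the floor to a point cloud; the iteration count, inlier threshold, and synthetic data are illustrative assumptions, and production systems typically refine the plane with a least-squares fit over the inliers.

import numpy as np

def ransac_plane(points, n_iters=200, threshold=0.02, rng=None):
    """Fit a dominant plane (e.g., the floor) to an (N, 3) point cloud with RANSAC.

    Returns (normal, d) for the plane normal . p + d = 0, and the inlier mask.
    `threshold` is the inlier distance in meters.
    """
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_model = None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:               # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ sample[0]
        dist = np.abs(points @ normal + d)
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (normal, d)
    return best_model, best_inliers

# Toy cloud: a noisy floor plane at y = 0 plus some scattered "user" points.
rng = np.random.default_rng(1)
floor = np.column_stack([rng.uniform(-2, 2, 500),
                         rng.normal(0, 0.005, 500),
                         rng.uniform(0, 4, 500)])
user = rng.uniform([-0.3, 0.0, 1.0], [0.3, 1.8, 1.4], size=(200, 3))
model, inliers = ransac_plane(np.vstack([floor, user]))
print(model[0], inliers.sum())   # normal close to (0, +/-1, 0); roughly 500 inliers

Once the floor plane is known, the remaining points above it can be attributed to the user and to movable objects, which is what makes the interaction analysis described above possible.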
Dealing with Noisy Data

The main problem of using depth sensors is dealing with the noise in the obtained data. In particular, most depth sensor-based applications rely on a single point of view to obtain the depth image. As a result, the detected face and posture are of
low resolution and suffer heavily from occlusion. It is possible to apply machine learning algorithms to enhance the quality of the data. The idea is to introduce a quality enhancement process that considers prior knowledge of the human body, which is typically a database of high-quality faces or postures, as shown in Fig. 4.
Fig. 4 An overview of depth sensor data enhancement
In this section, we discuss how body and facial information can be reconstructed from noisy data.
Face Enhancement

While depth sensors can obtain facial features, due to the relatively low resolution, the quality of the features is not always satisfactory. The 3D face obtained usually lacks detail and looks unrealistic. In this section, we explain how the quality of 3D faces obtained from depth sensors can be enhanced. Since a single depth image is usually noisy and of low resolution, the generated 3D facial surface is rough. By obtaining high-quality 3D faces through 3D scanners together with their corresponding color textures, one can construct a face database and extract the corresponding prior knowledge (Liang et al. 2014; Wang et al. 2014). The faces in the database are divided into patches such as the eyes, nose, etc. Since color texture is available, one can take advantage of color features to enhance the segmentation accuracy. Given the low-quality depth and color images of a face obtained from the sensor, facial regions are extracted at run-time. For each region obtained, a set of similar patches is found in the database. The region is then approximated by a weighted sum of the database patches. By replacing different parts of the run-time face image with their corresponding approximations, a high-quality 3D face surface can be generated. This method depends heavily on the quality and variety of the faces in the database, as well as the way those faces are abstracted to represent the one observed by the depth sensor at run-time. Constructing a database for prior knowledge is costly. It has therefore been proposed to scan the face of the user from different angles and apply such a face to enhance the face detected at run-time (Zollhöfer et al. 2014). The system first asks the user to rotate in front of a depth sensor and obtains a higher-quality 3D mesh, using registration methods similar to the KinectFusion approach mentioned in the last section (Newcombe et al. 2011). Then, given a lower-quality run-time depth image of the face, the system deforms the high-quality 3D mesh such that it aligns with the depth image
pixels. As a result, a high-quality mesh with the run-time facial expression can be generated. The core problem here is to deform the high-quality facial mesh smoothly and avoid generating visual artifacts. It has been shown that by dividing the face into multiple facial regions to strengthen the feature correspondence, the deformation quality can be improved (Kazemi et al. 2014).
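A simplified version of the patch-based enhancement described above is sketched below: a noisy facial patch is replaced by a least-squares weighted sum of its nearest high-quality database patches. The data here are random placeholders, and the cited methods additionally exploit color features and blend overlapping patches.

import numpy as np

def enhance_patch(noisy_patch, db_patches, k=5, reg=1e-3):
    """Replace a noisy facial patch by a weighted sum of similar database patches.

    noisy_patch: (D,)   flattened depth patch from the sensor
    db_patches:  (M, D) high-quality patches of the same facial region
    The k nearest patches are combined with regularized least-squares weights.
    """
    dists = np.linalg.norm(db_patches - noisy_patch, axis=1)
    idx = np.argsort(dists)[:k]
    P = db_patches[idx].T                      # (D, k)
    A = P.T @ P + reg * np.eye(k)
    w = np.linalg.solve(A, P.T @ noisy_patch)  # weights of the k patches
    return P @ w, idx, w

# Toy data: 50 database patches of dimension 64, and a noisy query patch.
rng = np.random.default_rng(2)
db = rng.normal(size=(50, 64))
query = db[7] + rng.normal(scale=0.05, size=64)
clean, idx, w = enhance_patch(query, db)
print(idx[0], np.round(w[:3], 2))   # the nearest patch should be index 7

Because the output is constrained to be a combination of high-quality patches, the reconstruction inherits their detail even when the sensor input is rough.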
Posture Enhancement

The body postures tracked by depth sensors may contain inaccurate body parts due to different types of error. Simple sensor error can be caused by the geometric shape of body parts and the viewing angle. It has been proposed to apply a Butterworth filter (Bailey and Bodenheimer 2012) or a simple low-pass filter (Fernández-Baena et al. 2012) to smooth out the vibration of tracked positions caused by this type of error. However, when occlusions occur, in which a particular body part is shielded from the camera, the tracked body position can contain a large amount of error. Simple filters are not sufficient to correct these postures. As a solution, it has been proposed to utilize accurately captured 3D human motion as prior knowledge and reconstruct the inaccurate postures from the depth sensor. In this method, a motion database is constructed using carefully captured 3D motion, usually obtained with optical motion capture systems. Given a depth sensor posture, one can search for a similar posture in the database. The missing or erroneous body parts from the depth sensor can then be replaced by those in the corresponding database posture (Shum and Ho 2012). However, such a naive method cannot perform well for complex postures, as a single posture from the database cannot always generalize the posture performed by the user, and therefore cannot effectively reconstruct it. More advanced posture reconstruction algorithms utilize machine learning to generalize posture information from the motion database (Chai and Hodgins 2005; Liu et al. 2011; Tautges et al. 2011). In particular, the motion database is used to create a low-dimensional latent space by dimensionality reduction techniques. Since the low-dimensional space is generated using data from real humans, each point in the space represents a valid natural posture. Given a partially mistracked posture from a depth camera, one can project the posture into the learned low-dimensional space and apply numerical optimization to enhance the quality of the posture. The optimized result is finally back-projected into a full-body posture. Since the optimization is performed in the low-dimensional latent space, the solution found should also be a natural posture. In other words, the unnatural elements due to sensor error can be removed. The major problem of this method is that the system has no information about which part of the body posture is incorrect. Therefore, while one would expect the system to correct the erroneous parts of the posture using information from the accurate parts, the system may do the opposite. As a result, the optimized posture may no longer be similar to the original depth sensor input. To solve the problem, an optimization process that considers the reliability of individual body parts has been proposed (Shum et al. 2013). The major difference between
this method and prior ones is that it divides the posture reconstruction process into two steps. In the first step, a procedural algorithm is used to evaluate the degree of reliability of individual body parts. This is done by assessing the behavior of a tracked body part to see if its position is inconsistent over time, as well as assessing the part with respect to its neighboring body parts to see if it creates inconsistent bone lengths. In the second step, posture reconstruction is performed with reference to this reliability information, such that the system relies more on the parts with higher reliability. Essentially, the reliability information helps the system to explicitly use the correct body parts and reconstruct the incorrect ones. Such a system can be further improved by using Gaussian processes to model the motion database, which helps to reduce the amount of motion data needed to reconstruct the posture (Liu et al. 2016b; Zhou et al. 2014). Better rules to estimate the reliability of the body parts can also enhance the system performance (Ho et al. 2016). Figure 5 shows the result of applying posture reconstruction.
Fig. 5 Applying posture reconstruction to enhance the quality of the obtained data
The color and depth images show that the user is occluded by a chair and the surrounding environment. The yellow skeleton on the left is the raw posture obtained by the Kinect, in which less reliable body parts are highlighted in red. The right character shows the reconstructed posture using the method proposed by Shum et al. (2013). The awkward body parts are identified and corrected using the knowledge learned from a motion database.
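A much simplified version of the first step, the reliability estimate, is sketched below: a joint is down-weighted when the bones attached to it deviate from their expected lengths or when it jumps implausibly far between frames. The skeleton, bone lengths, and thresholds are invented for illustration and are not the values used in the cited work.

import numpy as np

# Skeleton edges as (parent, child) joint indices with expected bone lengths in
# meters -- illustrative values for a simplified five-joint arm chain.
BONES = [((0, 1), 0.20), ((1, 2), 0.30), ((2, 3), 0.28), ((3, 4), 0.10)]

def joint_reliability(joints, prev_joints, len_tol=0.05, jump_tol=0.15):
    """Heuristic per-joint reliability in [0, 1] for one tracked frame.

    A joint is penalized if the bones attached to it deviate from their expected
    lengths, or if it jumps implausibly far from its position in the last frame.
    """
    rel = np.ones(len(joints))
    for (a, b), expected in BONES:
        err = abs(np.linalg.norm(joints[a] - joints[b]) - expected)
        if err > len_tol:
            rel[a] *= 0.5
            rel[b] *= 0.5
    jumps = np.linalg.norm(joints - prev_joints, axis=1)
    rel[jumps > jump_tol] *= 0.5
    return rel

# Toy frame: joint 4 (the hand) is badly mistracked due to occlusion.
prev = np.array([[0, 1.4, 0], [0, 1.2, 0], [0, 0.9, 0],
                 [0.28, 0.9, 0], [0.38, 0.9, 0]], float)
cur = prev.copy()
cur[4] = [0.9, 0.2, 0.5]          # the occluded hand snaps to a wrong position
print(np.round(joint_reliability(cur, prev), 2))   # -> [1. 1. 1. 0.5 0.25]

The second step would then feed these weights into the database-driven reconstruction so that the mistracked hand is rebuilt from the reliable joints rather than the other way around.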
Prior Knowledge

The major research focus of face and posture enhancement is to apply appropriate prior knowledge to improve the data obtained at run-time. In machine learning-based
algorithms, such prior knowledge is usually learned from a database and represented in a format that can be used efficiently at run-time. For motion enhancement, since human motion is highly nonlinear with large variation, it is not effective to represent the database using a single model. Instead, much of the existing research applies multiple local models to represent the database, such as a mixture of Gaussians (Liu et al. 2016b). It has also been proposed to apply deep learning to learn a set of manifolds that represents a motion database (Holden et al. 2015). Precomputing these models and manifolds is time-consuming, as it involves abstracting the whole database. Therefore, a lazy learning algorithm can be adopted, in which modeling of the database is not done as a preprocess but as a run-time process using run-time information (Chai and Hodgins 2005; Shum et al. 2013). During run-time, based on the user-performed posture, the system retrieves a number of relevant postures from the database and models only this subset of postures. This method has two advantages. First, by modeling only a small number of postures that are relevant to the performed posture, one can reduce the computational cost of constructing a latent space. Second, since the subset of postures is relatively similar, one can assume that they all lie in a locally linear space and apply simpler linear dimensionality reduction to generate the latent space. This allows real-time generation of the latent space. With improved database organization, the database search time can be further reduced and the relevancy of the retrieved results can be enhanced (Plantard et al. 2016a, b), such that real-time ergonomic and motion analysis applications can be performed (Plantard et al. 2016b). Figure 6 visualizes how prior knowledge can be estimated from a database.
Fig. 6 (Upper) Traditional machine learning that represents the prior from the whole database. (Lower) Lazy learning that represents the prior from a subset of the database based on the online query
Each blue circle in the figure represents a database entry, and the fill color represents its value. The prior knowledge obtained from the scattered database entries is represented by the shaded area, which enables one to understand the change of value within the considered space. The upper part of the figure shows a traditional machine learning algorithm, in which prior knowledge is obtained as a preprocess, considering all database entries. During run-time, when a query arrives, the system uses the knowledge to estimate the corresponding value of the query. The lower part shows the case of lazy learning, in which prior knowledge is obtained during run-time. This allows the system to extract database entries that are more similar to the query and estimate the prior knowledge with only such a subset of data.
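The sketch below combines the two ideas of this section, run-time neighbor retrieval (lazy learning) and a local linear latent space, into one simplified reconstruction routine. It uses a plain SVD in place of the more sophisticated models cited above, and the weighting scheme, database, and reliability values are illustrative assumptions rather than the published formulation.

import numpy as np

def lazy_reconstruct(query, database, reliability, k=30, n_components=5):
    """Reconstruct a noisy posture with a local linear model built at run-time.

    query:       (D,)   flattened joint positions from the depth sensor
    database:    (N, D) accurately captured postures
    reliability: (D,)   weights in [0, 1]; low values mark mistracked coordinates
    Only the k most similar database postures are modeled (lazy learning), and
    the reliable coordinates of the query dominate both the retrieval and the fit.
    """
    w = reliability
    dists = np.sqrt((((database - query) ** 2) * w).sum(axis=1))
    local = database[np.argsort(dists)[:k]]          # run-time subset of the database
    mean = local.mean(axis=0)
    U, S, Vt = np.linalg.svd(local - mean, full_matrices=False)
    basis = Vt[:n_components]                        # local linear latent space
    # Weighted least squares: explain mainly the reliable coordinates of the query.
    A = basis * w
    coeffs = np.linalg.solve(A @ basis.T, A @ (query - mean))
    return mean + coeffs @ basis

# Toy database with low-dimensional structure: 500 "postures" of dimension 20.
rng = np.random.default_rng(3)
db = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 20))
truth = db[42]
noisy = truth.copy(); noisy[15:] += 1.0              # corrupt the last 5 coordinates
rel = np.ones(20); rel[15:] = 0.1                    # and mark them as unreliable
rec = lazy_reconstruct(noisy, db, rel)
# Error on the mistracked coordinates before and after reconstruction.
print(np.round(np.abs(noisy - truth)[15:].mean(), 2),
      np.round(np.abs(rec - truth)[15:].mean(), 2))

Because the latent space is rebuilt around each query, the linearity assumption only has to hold locally, which is what makes the simple SVD model workable at interactive rates.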
Depth Camera-Based Applications
With depth sensors, it becomes possible to consider the user posture as part of an animation system and create real-time animation. Here, we discuss some depth sensor-based animation control systems and point out the challenges and solutions. Producing real-time facial animation with depth sensors is efficient. By representing the facial features with a deformed template, it is possible to drive the facial expression of virtual 3D faces (Li et al. 2013; Weise et al. 2011). Due to the different dimensions between the faces of the user and the character, directly
Fig. 6 (Upper) Traditional machine learning that represents the prior from the whole database. (Lower) Lazy learning that represents the prior from a subset of the database based on the online query
applying facial features such as landmark locations generates suboptimal results. The proposed common template acts as a bridge to connect the two ends. Such a template is a parametric representation of the face, which is more robust against differences in dimensions. With the template, it becomes possible to retarget the user's expression onto the character's face. Typical real-time animation systems such as games utilize a motion database to understand what the user performs and render the scenario accordingly. For example, one can compare the user-performed motion obtained from depth sensors with a set of motions in the database and understand the nature of the motion as well as how it should affect the real-time rendering (Bleiweiss et al. 2010). Alternatively, with an interaction database, one can generate a virtual character that acts according to the posture of the user, in order to create a two-character dancing animation, which is difficult to capture due to hardware limitations (Ho et al. 2013). While it is possible to utilize the posture captured by depth sensors for driving the animation of virtual characters, the generated animation may not be physically correct or dynamically plausible. On the one hand, since the depth sensors track kinematic positions only, there is no information about the forces exerted. It has been proposed to combine the use of depth cameras with pressure sensors and estimate the internal joint torques using inverse dynamics (Zhang et al. 2014b). This allows simulating virtual characters with physically correct movement. On the other hand, while depth sensors can track the body part positions, it is relatively difficult to track
Fig. 7 Real-time dynamic deformed character generated from Kinect postures
how the body deforms dynamically during the movement. Therefore, it has been proposed to enhance the realism of the generated character by applying real-time physical simulation to Kinect postures (Iwamoto et al. 2015). This allows the system to synthesize real-time dynamic deformation, such as the jiggling of flesh, based on the movement obtained in real-time, as shown in Fig. 7. Utilizing depth cameras, users can interact with virtual objects with body motion. On the one hand, predefined hand and arm gestures can be used to control virtual objects. Once the Kinect has detected a set of specific gestures, a 3D virtual object can be fitted onto the user and moved according to the user's gesture (Soh et al. 2013). On the other hand, the virtual objects can be attached to the user's body and move with the user's posture, such as carrying a virtual handbag (Wang et al. 2012). Depth sensors also enable a new application known as virtual fitting, in which the shopping experience can be facilitated by letting customers try on virtual clothing. This allows mix-and-match of clothes and accessories in real-time without being physically present in the retail shop. The system involves building a 2D segmented clothing database indexed by the postures of the user. During run-time, the system searches for suitable database entries and overlays them on the customer's fitting image (Zhou et al. 2012). Another clothes fitting method is to utilize a 3D clothing database and obtain 3D models of the user with depth sensors. This allows the system to recommend items that fit the user's body (Pachoulakis and Kapetanakis 2012).
Conclusion
In this chapter, we explained how depth sensors are applied to gather human facial and body posture information for generating and controlling animations. Depth sensors obtain human information in real-time and provide a cheaper alternative for human motion capturing. However, there is still room to improve the quality of the obtained data. In particular, depth sensors suffer heavily from occlusions, in which part of the human body is shielded. Machine learning algorithms can reconstruct the data and improve the quality, but more research is needed to solve the problem. Depth sensors enable many interesting applications in the computer animation and
games domains, providing real-time control over animation creation. Given the rate of new depth sensor research and applications, such technology can become an important part of daily life in the near future.
Acknowledgment This work is supported by the Engineering and Physical Sciences Research Council (EPSRC) (Ref: EP/M002632/1).
References Alex Butler D, Izadi S, Hilliges O, Molyneaux D, Hodges S, Kim D (2012) Shake’n’sense: reducing interference for overlapping structured light depth cameras. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI’12. ACM, New York, pp 1933–1936 Bailey SW, Bodenheimer B (2012) A comparison of motion capture data recorded from a vicon system and a Microsoft Kinect sensor. In: Proceedings of the ACM symposium on applied perception, SAP’12. ACM, New York, pp 121–121 Bleiweiss A, Eshar D, Kutliroff G, Lerner A, Oshrat Y, Yanai Y (2010) Enhanced interactive gaming by blending full-body tracking and gesture animation. In: ACM SIGGRAPH ASIA 2010 Sketches. Seoul, South Korea. ACM, p 34 Bronstein AM, Bronstein MM, Kimmel R (2005) Three-dimensional face recognition. Int J Comput Vision 64(1):5–30 Chai J, Hodgins JK (2005) Performance animation from low-dimensional control signals. In SIGGRAPH’05: ACM SIGGRAPH 2005 Papers. ACM, New York, pp 686–696 Chang KI, Bowyer KW, Flynn PJ (2006) Multiple nose region matching for 3d face recognition under varying facial expression. IEEE Trans Pattern Anal Mach Intell 28(10):1695–700 Cui Y, Chang W, Nöll T, Stricker D (2013) Kinectavatar: fully automatic body capture using a single Kinect. In: Proceedings of the 11th international conference on computer vision, vol 2, ACCV’12. Springer-Verlag, Berlin/Heidelberg, pp 133–147 Fern’ndez-Baena A, SusÃn A, Lligadas X (2012) Biomechanical validation of upper-body and lower-body joint movements of Kinect motion capture data for rehabilitation treatments. In: Intelligent Networking and Collaborative Systems (INCoS), 2012 4th International Conference on, pp 656–661 Fernandez-Sanchez EJ, Diaz J, Ros E (2013) Background subtraction based on color and depth using active sensors. Sensors 13(7):8895–915 Girshick R, Shotton J, Kohli P, Criminisi A, Fitzgibbon A (2011) Efficient regression of generalactivity human poses from depth images. In: Computer Vision (ICCV), 2011 I.E. international conference on. Barcelona, Spain. pp 415–422 Ho ESL, Chan JCP, Komura T, Leung H (2013) Interactive partner control in close interactions for real-time applications. ACM Trans Multimedia Comput Commun Appl 9(3):21:1–21:19 Ho ES, Chan JC, Chan DC, Shum HP, Cheung YM, Yuen PC (2016) Improving posture classification accuracy for depth sensor-based human activity monitoring in smart environments. Comput Vis Image Underst 148:97–110. doi:10.1111/cgf.12735 Holden D, Saito J, Komura T, Joyce T (2015) Learning motion manifolds with convolutional autoencoders. In ACM SIGGRAPH ASIA 2015 technical briefs. ACM, Kobe, Japan. 2015 SIGGRAPH ASIA Iwamoto N, Shum HPH, Yang L, Morishima S (2015) Multi-layer lattice model for real-time dynamic character animation. Comput Graph Forum 34(7):99–109 Jiang Y, Saxena A (2013) Hallucinating humans for learning robotic placement of objects. In: Proceedings of the 13th international symposium on experimental robotics. Springer International Publishing, Heidelberg, pp 921–937
Jiang Y, Koppula H, Saxena A (2013) Hallucinated humans as the hidden context for labeling 3d scenes. In: Proceedings of the 2013 I.E. conference on computer vision and pattern recognition, CVPR’13. IEEE Computer Society, Washington, DC, pp 2993–3000 Kakumanu P, Makrogiannis S, Bourbakis N (2007) A survey of skin-color modeling and detection methods. Pattern Recogn 40(3):1106–22 Kazemi V, Keskin C, Taylor J, Kohli P, Izadi S (2014) Real-time face reconstruction from a single depth image. In: 3D Vision (3DV), 2014 2nd international conference on, vol 1. IEEE, Lyon, France. 2014 3DV. pp 369–376 Kinect sdk. https://developer.microsoft.com/en-us/windows/kinect Kyan M, Sun G, Li H, Zhong L, Muneesawang P, Dong N, Elder B, Guan L (2015) An approach to ballet dance training through ms Kinect and visualization in a cave virtual reality environment. ACM Trans Intell Syst Technol (TIST) 6(2):23 Li H, Yu J, Ye Y, Bregler C (2013) Realtime facial animation with on-the-fly correctives. ACM Trans Graph 32(4):42–1 Liang S, Kemelmacher-Shlizerman I, Shapiro LG (2014) 3d face hallucination from a single depth frame. In: 3D Vision (3DV), 2014 2nd international conference on, vol 1. IEEE, Lyon, France. 2014 3DV. pp 31–38 Liu H, Wei X, Chai J, Ha I, Rhee T (2011) Realtime human motion control with a small number of inertial sensors. In: Symposium on interactive 3D graphics and games, I3D’11. ACM, New York, pp 133–140 Liu Z, Huang J, Bu S, Han J, Tang X, Li X (2016a) Template deformation-based 3-d reconstruction of full human body scans from low-cost depth cameras. IEEE Trans Cybern PP(99):1–14 Liu Z, Zhou L, Leung H, Shum HPH (2016b) Kinect posture reconstruction based on a local mixture of gaussian process models. IEEE Trans Vis Comput Graph 14 pp. doi:10.1109/ TVCG.2015.2510000 Mackay K, Shum HPH, Komura T (2012) Environment capturing with Microsoft Kinect. In: Proceedings of the 2012 international conference on software knowledge information management and applications, SKIMA’12. Chengdu, China. 2012 SKIMA Newcombe RA, Izadi S, Hilliges O, Molyneaux D, Kim D, Davison AJ, Kohli P, Shotton J, Hodges S, Fitzgibbon A (2011) Kinectfusion: real-time dense surface mapping and tracking. In: Proceedings of the 2011 10th IEEE international symposium on mixed and augmented reality, ISMAR’11. IEEE Computer Society, Washington, DC, pp 127–136 Pachoulakis I, Kapetanakis K (2012) Augmented reality platforms for virtual fitting rooms. Int J Multimedia Appl 4(4):35 Plantard P, Shum HP, Multon F (2016a) Filtered pose graph for efficient kinect pose reconstruction. Multimed Tools Appl 1–22. doi:10.1007/s11042-016-3546-4 Plantard P, Shum HPH, Multon F (2016b) Ergonomics measurements using Kinect with a pose correction framework. In: Proceedings of the 2016 international digital human modeling symposium, DHM ’16, Montreal, 8 p Sandilands P, Choi MG, Komura T (2012) Capturing close interactions with objects using a magnetic motion capture system and a rgbd sensor. In: Proceedings of the 2012 motion in games. Springer, Berlin/Heidelberg, pp 220–231 Sandilands P, Choi MG, Komura T (2013) Interaction capture using magnetic sensors. Comput Anim Virtual Worlds 24(6):527–38 Segundo MP, Silva L, Bellon ORP, Queirolo CC (2010) Automatic face segmentation and facial landmark detection in range images. Systems Man Cybern Part B Cybern IEEE Trans 40 (5):1319–30 Shotton J, Girshick R, Fitzgibbon A, Sharp T, Cook M, Finocchio M, . . . Blake A (2013) Efficient human pose estimation from single depth images. 
IEEE Trans Pattern Anal Machine Intell 35 (12):2821–2840 Shum HPH (2013) Serious games with human-object interactions using rgb-d camera. In: Proceedings of the 6th international conference on motion in games, MIG’13. Springer-Verlag, Berlin/Heidelberg
Shum HPH, Ho ESL (2012) Real-time physical modelling of character movements with Microsoft Kinect. In: Proceedings of the 18th ACM symposium on virtual reality software and technology, VRST’12. ACM, New York, pp 17–24 Shum HPH, Ho ESL, Jiang Y, Takagi S (2013) Real-time posture reconstruction for Microsoft Kinect. IEEE Trans Cybern 43(5):1357–69 Soh J, Choi Y, Park Y, Yang HS (2013) User-friendly 3d object manipulation gesture using Kinect. In: Proceedings of the 12th ACM SIGGRAPH international conference on virtual-reality continuum and its applications in industry, VRCAI’13. ACM, New York, pp 231–234 Sun M, Kohli P, Shotton J (2012) Conditional regression forests for human pose estimation. In: Computer Vision and Pattern Recognition (CVPR), 2012 I.E. conference on. Providence, Rhode Island. pp 3394–3401 Tautges J, Zinke A, Krüger B, Baumann J, Weber A, Helten T, Müller M, Seidel H-P, Eberhardt B (2011) Motion reconstruction using sparse accelerometer data. ACM Trans Graph 30 (3):18:1–18:12 Vera L, Gimeno J, Coma I, Fernández M (2011) Augmented mirror: interactive augmented reality system based on Kinect. In: Human-Computer Interaction–INTERACT 2011. Springer, Lisbon, Portugal. 2011 INTERACT. pp 483–486 Wang L, Villamil R, Samarasekera S, Kumar R (2012) Magic mirror: a virtual handbag shopping system. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 I.E. computer society conference on. IEEE, Rhode Island. 2012 CVPR. pp 19–24 Wang K, Wang X, Pan Z, Liu K (2014) A two-stage framework for 3d facereconstruction from rgbd images. Pattern Anal Mach Intell IEEE Trans 36(8):1493–504 Weise T, Bouaziz S, Li H, Pauly M (2011) Realtime performance-based facial animation. ACM Trans Graph (TOG) 30:77, ACM Zhang P, Siu K, Jianjie Z, Liu CK, Chai J (2014a) Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture. ACM Trans Graph 33(6):221:1–221:14 Zhang P, Siu K, Jianjie Z, Liu CK, Chai J (2014b) Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture. ACM Trans Graph (TOG) 33(6):221 Zhou Z, Shu B, Zhuo S, Deng X, Tan P, Lin S (2012) Image-based clothes animation for virtual fitting. In: SIGGRAPH Asia 2012 technical briefs. ACM, Singapore. 2012 SIGGRAPH ASIA. p 33 Zhou L, Liu Z, Leung H, Shum HPH (2014) Posture reconstruction using Kinect with a probabilistic model. In: Proceedings of the 20th ACM symposium on virtual reality software and technology, VRST’14. ACM, New York, pp 117–125 Zollhöfer M, Nießner M, Izadi S, Rehmann C, Zach C, Fisher M, Wu C, Fitzgibbon A, Loop C, Theobalt C et al (2014) Real-time non-rigid reconstruction using an rgb-d camera. ACM Trans Graph (TOG) 33(4):156
Real-Time Full-Body Pose Synthesis and Editing HO Edmond S. L. and Yuen Pong C.
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Editing and Synthesizing Full-Body Pose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Editing Poses by Inverse Kinematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Data-Driven Pose Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 User Interface for Full-Body Posing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Abstract
Posing characters has always played an important role in character animation and interactive applications such as computer games. However, such a task is time-consuming and labor-intensive. In order to improve the efficiency of character posing, researchers in computer graphics have been working on a wide variety of semi- or fully automatic approaches to creating full-body poses, ranging from traditional approaches such as inverse kinematics (IK), to data-driven approaches which make use of captured motion data, to direct pose manipulation through intuitive interfaces. In this book chapter, we will introduce the aforementioned techniques and also discuss their applications in animation production.
H. Edmond S. L. Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne, UK e-mail: [email protected] Y. Pong C. (*) Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong e-mail: [email protected] # Springer International Publishing Switzerland 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_8-1
Keywords
Pose synthesis • Pose editing • Inverse kinematics • Motion retargeting • Motion blending • Jacobian-based IK • Cyclic coordinate descent • Collision avoidance • Data-driven pose synthesis
Introduction
A recent study by Kyto et al. (2015) found that 61% of the time in animation production was spent on character posing by professional animators. Such a time-consuming task has motivated tremendous effort in improving the efficiency of human character posing over the last three decades. Producing full-body poses has always played an important role in character animation. For example, a long motion sequence can be represented by key poses in keyframe animation, and the in-between poses can be interpolated from the key poses. Such an approach is still widely used in animation production nowadays. While only a small number of poses have to be created in keyframe animation, creating a full-body pose manually (e.g., controlling the locations of all body parts) is a tedious and time-consuming task. As a result, researchers have been working on fully or semiautomatic methods for full-body pose creation. In this chapter, we will review methods in character posing, including those creating new poses from scratch and those creating new poses by modifying existing motion data. In addition to introducing the pose creation techniques, the applications of those techniques in animation production, such as reaching, retargeting, foot-skate cleanup, etc., will be discussed.
State of the Art
Jacobian-based inverse kinematics (IK) algorithms have been widely used for pose editing in computer animation over the last three decades. A recent work by Harish et al. (2016) presents a new parallel numerical IK approach for multithreaded architectures. The new approach can handle complex articulated structures (e.g., with more than 600 degrees of freedom) with multiple constraints in real time. Besides the traditional approaches, many intuitive real-time posing approaches have been proposed. Guay et al. (2013) proposed to synthesize the 3D pose of a character by drawing a single aesthetic line called the line of action (LOA). An example is shown in Fig. 1. The main idea of the new method is to edit the pose of the character such that the selected body line becomes similar to the simple curve sketched by the user. Ten body lines which connect every pair of end effectors (e.g., head to right hand, left hand to right leg, etc.) are defined, and the most appropriate one will be selected according to the location of the sketched line and the viewpoint. Another interesting real-time posing approach, recently proposed by Rhodin et al. (2014), enables the user to control the poses of a virtual character by a wide range of
Fig. 1 3D poses generated by a single aesthetic line (colored in red) proposed by Guay et al. (2013) (Reproduced with permission from Guay et al. (2013))
Fig. 2 Posing characters with different articulated structures with the tangible input device proposed in Jacobson et al. (2014) (Reproduced with permission from Jacobson et al. (2014))
input motion such as full-body, hand, or face motion from any motion-sensing device. Instead of controlling the skeleton of the character, the new method defines a vertex-to-vertex correspondence between the meshes representing the input motion and the virtual character and deforms the mesh of the character accordingly. Furthermore, researchers have been trying to provide users with a more natural way of posing characters. In Jacobson et al. (2014), a tangible puppet-like input device is proposed for interactive pose manipulation. By measuring the relative bone orientations on the input device, the pose of the virtual character is updated accordingly. Different from previously developed puppet-like input devices (e.g., Esposito et al. 1995), the new device can be used for articulated structures with different topologies, as the device is assembled from modular, interchangeable, and hot-pluggable parts. Examples of posing are shown in Fig. 2. To further enhance the realism of the resultant animation, physics-based approaches have been adopted in animation production. However, due to the high computational cost of previous approaches, most of them cannot be used in interactive applications. A recent work by Hämäläinen et al. (2015) synthesizes physically valid poses of humanlike characters at interactive frame rates for a wide variety of motions such as balancing on a ball, recovering from disturbances, reaching, and juggling a ball.
Editing and Synthesizing Full-Body Pose
In this section, techniques for synthesizing and editing full-body character poses will be introduced. Firstly, a traditional posing approach, inverse kinematics (IK), which is widely used in computer animation and robotics, will be discussed in Section 3.1. In particular, different types of IK solvers (Sections 3.1.1 and 3.1.3) and examples of their applications (Section 3.1.2) will be presented. Secondly, data-driven approaches which make use of existing motion data to produce natural-looking poses will be explained in Section 3.2. Finally, direct manipulation of the pose of the character using different kinds of intuitive interfaces, such as puppet-based, natural user interface, and sketch-based approaches, will be reviewed in Section 3.3.
Editing Poses by Inverse Kinematics
With the advancement of motion capture technology, more and more motion data are available nowadays. Reusing the captured or existing poses in new applications can be more efficient (in terms of production time and labor cost) than creating new poses from scratch. However, the number of captured or existing poses is limited, and it is necessary to edit the available poses according to the user's new requirements. In addition, there are demands for editing the full-body pose at runtime in interactive applications such as computer games. In computer animation, characters are usually represented as articulated structures in which the body segments are connected at the joint locations. The pose of the character is then controlled by the joint parameters (e.g., joint angles) and the global translation of the root joint (e.g., the hip of a humanlike character). Given the skeletal structure (e.g., bone lengths), the joint parameters of the current pose, and the changes in the joint parameters, the changes in location or orientation of every joint can be computed:

$\dot{x}_i = J_i \dot{\theta}$   (1)

where $\dot{x}_i$ can be the changes in position or orientation of the i-th joint and $J_i$ is the Jacobian matrix mapping the changes in the joint parameters $\dot{\theta}$ to the changes $\dot{x}_i$ of the i-th joint:

$J_i = \dfrac{\partial x_i}{\partial \theta}$   (2)
However, it is a tedious task for the user to control the pose of a character by specifying all joint parameters, as a humanlike character is usually composed of more than 40 joints. Instead of editing a pose by specifying all of the joint parameters as in Eq. 1, the new pose can be produced by specifying the target location and/or orientation of the selected joint(s), and the required changes in the joint parameters will be computed by inverse kinematics (IK).
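To make the role of the Jacobian in Eqs. 1 and 2 concrete, the sketch below builds the analytic Jacobian of a planar two-link arm and uses Eq. 1 to map a small change in joint angles to the resulting change in end-effector position. The two-link geometry and link lengths are illustrative assumptions, not part of the chapter's formulation.

```python
import numpy as np

L1, L2 = 1.0, 0.8   # assumed link lengths of a planar two-link arm

def end_effector(theta):
    """Forward kinematics: end-effector position x(theta) for joint angles theta."""
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def jacobian(theta):
    """Analytic Jacobian J = dx/dtheta (Eq. 2) for the two-link arm."""
    t1, t2 = theta
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)]])

theta = np.array([0.4, 0.6])
dtheta = np.array([0.01, -0.02])            # small change in joint parameters
dx_linear = jacobian(theta) @ dtheta        # Eq. 1: predicted change of the end effector
dx_exact = end_effector(theta + dtheta) - end_effector(theta)
print(dx_linear, dx_exact)                  # the two agree to first order
```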
IK has been widely used in robotics and computer animation for controlling robots and characters. Poses of articulated characters such as human and animal figures can be edited by such an approach to produce computer animations. IK can be applied to a wide range of applications in animation production, for example, keyframe posture editing, interactive posture editing, and editing pre-captured motion data. An early analytic approach proposed by Lee and Shin (1999) determines the posture based on the positions of the hands and feet relative to the positions of the shoulders and hips. High performance in pose editing has been demonstrated in their experiments. However, one of the major limitations of analytic approaches is that the analytic solvers must be designed specifically for each individual system and cannot be applied to arbitrary articulated structures. On the other hand, numerical IK solvers are more general and will be discussed below.
Numerical Inverse Kinematics Approaches
Numerical solvers linearize the relationship between the joint parameters and the positions and/or orientations of the end effectors around the current posture, in order to obtain the IK solution for a new end-effector position and/or orientation close to the current one:

$\dot{\theta} = J_i^{-1} \dot{x}_i$   (3)

where $J_i^{-1}$ is the inverse of the Jacobian matrix (Eq. 2) of the i-th joint. Since the Jacobian matrix may not be a square matrix, the pseudoinverse of the Jacobian matrix can be used for solving the IK problem. The pseudoinverse $J_i^{+}$ can be calculated by:

$J_i^{+} = J_i^{T} \left( J_i J_i^{T} \right)^{-1}$   (4)

and the IK problem can be solved by:

$\dot{\theta} = J_i^{+} \dot{x}_i$   (5)
There are three main advantages of numerical solvers:
• They can be applied to arbitrary chain structures.
• Various types of constraints, such as positional or planar constraints, can be handled in the same platform.
• Constraints can be easily switched on and off.
Therefore, many numerical IK approaches have been proposed. The most practical and commonly used numerical solver is based on the least-squares method (Whitney 1969). One of the major problems of the original least-squares method is that it becomes unstable near singularity points, which results in large changes in the solution (i.e., the joint parameters $\dot{\theta}$). The singularity problem usually occurs
when there is no physically feasible solution to the IK problem; for example, the target location of a controlled joint is unreachable or there are multiple constraints which conflict with each other. To tackle this problem, various methodologies such as the singularity-robust (SR) inverse (Nakamura and Hanafusa 1986) have been developed to stabilize the system near such singularity postures for generating full-body postures in graphics applications (Yamane and Nakamura 2003). The main idea of using the SR inverse is to introduce a weighting parameter to balance between satisfying the constraints and stabilizing the changes in the joint parameters. The SR inverse $J_i^{*}$ is calculated by:

$J_i^{*} = J_i^{T} \left( J_i J_i^{T} + kI \right)^{-1}$   (6)

where $k$ is the weighting parameter and $I$ is the identity matrix. The larger the value of $k$, the larger the error in satisfying the constraints, but the more stable the solution becomes. The bottleneck of these methods is the cost of computing the pseudoinverse matrix, which grows cubically with the number of constraints. Baraff (1996) proposes a method of forward dynamics for articulated body structures, which can be used for solving IK problems. Instead of calculating the pseudoinverse matrix, an equation of Lagrange multipliers is solved. Since the matrix used in his method is sparse, efficient solvers for sparse matrices can be used. However, the method can only handle equality constraints, and the cost still increases cubically with the number of auxiliary constraints.
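The following sketch contrasts a plain pseudoinverse step (Eqs. 4 and 5) with an SR-inverse step (Eq. 6) near a singular, almost fully stretched configuration of the two-link arm used earlier. It only illustrates the damping effect of the weighting parameter k; the choice of k and the toy arm are assumptions.

```python
import numpy as np

def jacobian(theta, L1=1.0, L2=0.8):
    t1, t2 = theta
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)]])

def sr_inverse(J, k):
    """Eq. 6: J^T (J J^T + k I)^-1; k = 0 recovers the pseudoinverse of Eq. 4."""
    m = J.shape[0]
    return J.T @ np.linalg.inv(J @ J.T + k * np.eye(m))

theta = np.array([0.3, 1e-3])        # almost fully stretched: near-singular Jacobian
dx = np.array([0.05, 0.0])           # desired small end-effector displacement

J = jacobian(theta)
dtheta_pinv = np.linalg.pinv(J) @ dx       # Eq. 5: may explode near the singularity
dtheta_sr = sr_inverse(J, k=0.01) @ dx     # Eq. 6: damped, bounded joint update
print(np.linalg.norm(dtheta_pinv), np.linalg.norm(dtheta_sr))
```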
Application of Numerical Inverse Kinematics Approaches
As numerical IK approaches can be applied to arbitrary articulated structures, a wide range of applications has been developed for real-time full-body pose creation. Four types of applications are briefly discussed below.
Footskate Cleanup
In animation production, a motion sequence can be produced by interpolating key poses or by directly using MOCAP data. However, when interpolating key poses, artifacts such as footskate can be produced. Footskate occurs when the motion cannot reproduce the footplants the animation intended to create. For example, in a walking motion, footskate occurs when the character slides over the surface of the ground. When reusing MOCAP data, footskate will occur if the motion data were not well captured due to noise or tracking errors. This problem can be solved by analyzing the motion sequence (e.g., the state in a walk cycle, ankle height and rotation, etc.) and determining the position of the feet in every pose (in each frame). Then, IK can be applied to edit the poses accordingly (Kovar et al. 2002; Kulpa et al. 2005; Lee and Shin 1999; Lu and Liu 2014).
Character Retargeting
While reusing existing motions, including both MOCAP data and previously created motions, is common practice in animation production, it is a very challenging task.
Fig. 3 Examples of retargeting the original Judo poses (colored in red) to characters with different sizes using interaction mesh (Ho et al. 2010b) (Reproduced with permission from Ho et al. (2010b))
This is because the articulated character used in the existing motions may differ from the new character(s) in body segment lengths and sizes. As a result, animating new characters with existing motion data may result in loss of contacts (e.g., wrong footplants, being unable to reach an object). To tackle this problem, IK can be applied to every pose (in each frame) to constrain some of the body parts (such as the hands and feet) to preserve the contacts as in the original motion while editing the other parts of the body accordingly (Gleicher 1998). Motion retargeting can also be applied to transferring the live performance of a human subject to the movement of a virtual character (Shin et al. 2001). Ho et al. (2010b) proposed the interaction mesh, which represents the spatial relations between closely interacting body parts. By preserving the shape of the interaction mesh while retargeting characters to new sizes, penetration-free postures will be produced. Examples of retargeting closely interacting characters to different sizes are illustrated in Fig. 3. The spatial relation-based representation can also be used for controlling the movement of humanoid robots in highly constrained environments (Ho and Shum 2013) and synthesizing a virtual partner in VR applications (Ho et al. 2013a).
Collision Avoidance in Pose Editing
Collision and interpenetration of body segments can significantly degrade the realism of the resultant character animation. Collisions may occur when the character is interacting with other objects and characters in the scene. To solve this problem, collision detection algorithms are applied to determine which body part(s) are colliding with other body parts or objects. Next, the colliding segments are moved away from each other to avoid the interpenetration by applying IK to edit the poses. Various approaches have been proposed (Kallmann 2008; Lyard and Magnenat-Thalmann 2008).
Editing Poses for Interaction Between Characters
For interactive applications such as computer games and virtual reality applications, virtual characters respond to the avatars or objects controlled by the user. When handling virtual scenes with close interactions between the characters, such as fighting and dancing, the poses of the characters have to be edited at runtime to preserve the context of the interaction. Imagine that the attacking character is trying to punch the head of the defending character. The punching trajectories have to be
edited in order to reach the target (i.e., the head) when the defending character is moving around (e.g., controlled by the user). For example, Shum et al. (2007, 2008) edit the positions of the interacting body parts at every frame, and Ho and Komura (Ho and Komura 2009, 2011; Ho et al. 2010a) edit the way the body parts tangle with each other. IK can also be used for creating reactive motion when an external perturbation is applied to the character (Komura et al. 2005).
Heuristic Inverse Kinematics Approach
One of the representative heuristic search approaches for solving IK problems is the cyclic coordinate descent (CCD) method (Wang and Chen 1991). The CCD method is a simple and fast approach for iteratively computing the joint parameters that satisfy the constraints. The IK problem is solved in cycles, in which each joint parameter is computed once per cycle. Starting with the outermost joint (i.e., the one closest to the end effector) (Welman 1993) in the articulated structure, the joint parameters are updated sequentially to bring the end effector $E$ to the target location $T$. When editing a joint parameter of joint $i$, the positions of joint $i$, $E$, and $T$ in Cartesian coordinates are projected to $P_i$, $E_a$, and $T_a$ according to the axis of rotation $a$. Next, the rotation $\Delta\theta_{joint_i}$ required to bring $E_a$ closer to $T_a$ can be found by calculating the angle between the two vectors from $P_i$ to $E_a$ and from $P_i$ to $T_a$:

$\Delta\theta_{joint_i} = \arccos\left( \dfrac{(E_a - P_i) \cdot (T_a - P_i)}{\lVert E_a - P_i \rVert \, \lVert T_a - P_i \rVert} \right)$   (7)
The direction of rotation is determined by the cross product of the two vectors from $P_i$ to $E_a$ and from $P_i$ to $T_a$. By iteratively updating the joint parameters in every cycle, the difference between the positions of $E$ and $T$ can be minimized. Since the joint parameters can be computed analytically as in Eq. 7 in each step, computationally expensive calculations such as matrix manipulations are not needed, resulting in a lower computational cost than the numerical approaches introduced in Section 3.1.1. This makes the CCD method suitable for real-time full-body pose editing applications.
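A minimal CCD sketch for a planar chain is given below. It applies Eq. 7 with the sign taken from the cross product, iterating from the joint closest to the end effector toward the root. The chain definition, iteration count, and convergence tolerance are illustrative assumptions.

```python
import numpy as np

def ccd_ik(joints, target, iterations=20, tol=1e-3):
    """Cyclic coordinate descent for a planar chain of 2D joint positions.

    joints[-1] is treated as the end effector E; joints[i] plays the role of P_i.
    """
    joints = [np.asarray(j, dtype=float) for j in joints]
    target = np.asarray(target, dtype=float)
    for _ in range(iterations):
        # Sweep from the joint closest to the end effector back toward the root.
        for i in range(len(joints) - 2, -1, -1):
            to_e = joints[-1] - joints[i]            # P_i -> E
            to_t = target - joints[i]                # P_i -> T
            cos_a = np.dot(to_e, to_t) / (np.linalg.norm(to_e) * np.linalg.norm(to_t))
            angle = np.arccos(np.clip(cos_a, -1.0, 1.0))             # Eq. 7
            cross_z = to_e[0] * to_t[1] - to_e[1] * to_t[0]
            angle = np.copysign(angle, cross_z)                      # direction of rotation
            c, s = np.cos(angle), np.sin(angle)
            rot = np.array([[c, -s], [s, c]])
            # Rotate all joints distal to joint i about P_i.
            for j in range(i + 1, len(joints)):
                joints[j] = joints[i] + rot @ (joints[j] - joints[i])
        if np.linalg.norm(joints[-1] - target) < tol:
            break
    return joints

# Toy usage: a three-joint chain reaching toward a target.
chain = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
print(ccd_ik(chain, target=[1.2, 1.2])[-1])
```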
Summary of IK Approaches
A general problem of traditional IK algorithms is the difficulty of ensuring the naturalness of the synthesized motion. This is because natural human motion involves many subtle behaviors, such as balancing and the correlation of body parts, which are difficult to model mathematically. In the next section, we will introduce methodologies that use precaptured human motion to improve the solution for pose editing.
Fig. 4 Results obtained in Rose et al. (1998), the sample motions (green) and the blended motions (yellow) (Reproduced with permission from Rose et al. (1998))
Data-Driven Pose Synthesis
The idea of data-driven motion synthesis is to make use of captured motion data to create the required postures, such that natural and humanlike movement can be created by specifying a relatively small number of constraints. An early work by Rose et al. (1998) edits poses by interpolating collected poses to satisfy the constraints. This is based on an established technique in computer animation called motion blending. In Rose et al. (1998), a concept of verbs and adverbs is proposed to generate new poses from examples. In their work, verbs refer to parameterized motions constructed from sets of similar motions, and adverbs are parameters that control the verbs. For each verb, the sample motions are time-aligned by manually specifying the key-times for every motion. Then, the motion clips are placed in a parameter space based on the characteristics of the motion clips. Motion blending is done by computing the weights of the sample motions in the corresponding verb using radial basis functions (RBFs). By specifying the adverbs, a new motion is created. In addition, users can create a verb graph so that transition motions between verbs can be generated. Figure 4 shows the sample motions (green) and the blended motions (yellow) created by their method.
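The RBF-weighted blending at the heart of the verbs-and-adverbs idea can be sketched as follows. The snippet interpolates time-aligned example poses from their adverb coordinates; the Gaussian kernel, its width, and the toy examples are assumptions made for illustration and not the exact formulation of Rose et al. (1998).

```python
import numpy as np

def rbf_blend(query_adverb, adverbs, example_poses, sigma=0.5):
    """Blend example poses with radial basis function weights in adverb space."""
    d2 = np.sum((adverbs - query_adverb) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))        # Gaussian RBF kernel (assumed)
    w /= w.sum()                                # normalize so the weights sum to 1
    return w @ example_poses                    # weighted combination of poses

# Toy example: three time-aligned poses parameterized by (speed, turning) adverbs.
adverbs = np.array([[0.0, 0.0],   # slow, straight
                    [1.0, 0.0],   # fast, straight
                    [0.0, 1.0]])  # slow, turning
example_poses = np.array([[0.1, 0.2, 0.3],
                          [0.4, 0.1, 0.0],
                          [0.2, 0.5, 0.6]])    # e.g., three joint angles per pose
blended = rbf_blend(np.array([0.5, 0.2]), adverbs, example_poses)
print(blended)
```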
In the approaches proposed in recent years, the collected human motions have to be analyzed first, and machine learning tasks are often required to learn a model for pose synthesis at a later stage. In the rest of this section, we roughly divide the data-driven approaches into two categories: offline training and online modeling approaches.
Offline Training Approaches
Synthesizing a natural-looking pose can be viewed as finding a solution (i.e., joint parameters) in a natural movement space created using captured motions. Grochow et al. (2004) propose to use the scaled Gaussian process latent variable model (SGPLVM) (Lawrence 2004) to create such a natural pose space. While the process of learning the pose model is done offline, the learned model can be used for real-time full-body pose synthesis. By specifying constraints such as the positions of the hands and feet, natural-looking full-body poses can be synthesized. However, due to the complexity of the learning process, the model cannot be trained with a large number of poses. In Wu et al. (2011), Wu et al. further propose to select a subset of distinctive postures from a large pose database for learning a natural pose space for pose synthesis. Wei and Chai (2011) solved the same problem by constructing a mixture of factor analyzers. The algorithm segments the motion database into local regions and models each of them individually. Nevertheless, the training cost and system complexity increase with the amount of source data, and the effectiveness of dimensionality reduction decreases as the variety of the motion data increases.
Online Modeling Approaches
As opposed to offline training approaches, online modeling has been shown to be effective for real-time applications with large motion datasets. The idea is to select a small subset of postures based on run-time information to synthesize the required posture. For example, Chai and Hodgins (2005) use a lazy learning approach to learn low-dimensional local linear models (via principal component analysis (PCA)) that approximate, at runtime, the high-dimensional manifold containing the natural and valid poses. Given the current pose of the character and the target positions of the selected joint(s) as constraints, a set of postures that are similar to the current one is used to learn the local linear model. Natural-looking full-body motion can then be synthesized by interpolating the poses in the low-dimensional space while minimizing energy terms that ensure the synthesized pose is smooth (i.e., in terms of joint velocities), satisfies the constraints given by the user, and agrees with the probability distribution of the captured motions in the training data. Liu et al. extended the idea by using a maximum a posteriori framework to reconstruct the motion, which enhanced the consistency of the movement in the temporal domain (Liu et al. 2011). The general problem of these methods is that it is difficult to ensure that the set of extracted postures is logically similar, as a kinematics metric is used. Ho et al. (2013b) also use a lazy learning approach to learn local linear models while taking into account the spatial relationship between body parts. The topology-based approach computes and represents the tangling of body parts using a subset of topology coordinates (Ho and Komura 2009). As interpolating
Fig. 5 Posing characters with close interactions while avoiding penetration of body parts by the method proposed in Ho et al. (2013b) (Reproduced with permission from Ho et al. (2013b))
poses with significant differences can easily result in interpenetration of the body parts, their method only selects topologically similar poses to learn the local model and ensures that the changes in spatial relationship are small when editing the pose. In this way, penetration-free postures can be created (Fig. 5).
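The online (lazy learning) synthesis strategy discussed above can be summarized in a small sketch: retrieve similar example poses at runtime, build a local PCA space, and optimize in that space so that a chosen joint reaches a user-given target while staying close to the examples. The simple objective, the gradient-free optimizer, and the toy data are assumptions; they stand in for the energy terms and solvers used in the cited works.

```python
import numpy as np
from scipy.optimize import minimize

def synthesize(query, target_fn, target, database, k=50, n_components=4, w_prior=0.1):
    """Online local-model pose synthesis: optimize in a run-time PCA latent space."""
    # Retrieve the k nearest example poses and build a local linear (PCA) model.
    idx = np.argsort(np.linalg.norm(database - query, axis=1))[:k]
    subset = database[idx]
    mean = subset.mean(axis=0)
    _, _, vt = np.linalg.svd(subset - mean, full_matrices=False)
    basis = vt[:n_components]

    def objective(latent):
        pose = mean + latent @ basis
        constraint_err = np.sum((target_fn(pose) - target) ** 2)   # reach the target
        prior_err = np.sum(latent ** 2)                            # stay near the examples
        return constraint_err + w_prior * prior_err

    res = minimize(objective, np.zeros(n_components), method="Nelder-Mead")
    return mean + res.x @ basis

# Toy usage: the "end effector" is simply the last three pose dimensions.
rng = np.random.default_rng(1)
db = rng.normal(size=(400, 12))
pose = synthesize(db[0], lambda p: p[-3:], target=np.array([0.2, 0.1, 0.0]), database=db)
```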
User Interface for Full-Body Posing
Besides the pose synthesis and editing approaches introduced above, another stream of research lies in providing a more natural and intuitive interface for users to pose characters. In this subsection, three types of character posing interfaces, namely the puppet-based interface, the natural user interface, and the sketch-based interface, will be introduced.
Puppet-Based Interface
An early work by Esposito et al. (1995) provides the users with a puppet called Monkey, as shown in Fig. 6. Monkey is a humanlike puppet with 32 degrees of freedom, approximately 18 in. tall and about 6 in. wide. Rotational sensors are located at the joints of the puppet to measure the joint angles when the user manipulates the puppet. The pose sequence produced by puppet-based input devices can also be used as an example motion to retrieve similar movements from a database (Numaguchi et al. 2011). The recently proposed tangible input device (Jacobson et al. 2014) introduced in Section 2 further enables users to create articulated characters with different topologies for real-time full-body posing.
Natural User Interface
With the advancement of motion-sensing technologies, a wide variety of natural user interface (NUI) applications have been proposed. In human posing, a recent work by
Fig. 6 An input device, called Monkey, for interactive pose manipulation (Reproduced with permission from Esposito et al. (1995))
Oshita et al. (2013) captures the movements of the fingers and hands of the user when manipulating an intangible puppet. The design of the puppet control is inspired by traditional puppet controlling mechanisms, in which the head/body rotation and body translation are controlled by the right hand while the legs of the character are controlled by the left hand. The hand and finger movement is captured at runtime using the Leap Motion controller (Leap Motion 2016). Unlike previous sensor-based approaches such as the data glove-based methods (Isrozaidi et al. 2010; Komura and Lam 2006), no sensor or marker is required to be attached to the hand, as the Leap Motion controller tracks the finger and hand movement by emitting infrared (IR) light and analyzing the reflected IR light to calculate the 3D positions of different parts of the hand(s) over time (Fig. 7).
Sketch-Based Interface
Another popular type of intuitive posing interface is the sketch-based approach, which is inspired by the pose design process in traditional 2D hand-drawn animation production. Mapping a 2D sketch to a 3D pose is an under-constrained problem. An early work proposed by Igarashi et al. (1999) enables the user to sketch the 2D silhouette of the character, and the corresponding 3D mesh model will be generated. In addition, methods for posing 3D characters using sketches of the skeleton as 2D
Fig. 7 Character motion control interface with hand manipulation (Reproduced with permission from Oshita et al. (2013))
stick figures (Choi et al. 2012; Davis et al. 2003; Wei and Chai 2011) are also an active research topic. In Lin et al. (2012), the user can sketch the sitting pose of a stick figure in 2D. By taking into account the interaction between the sketched pose and the environment and preserving physical correctness, such as the balance of the character, the 3D pose is produced at an interactive rate (with a GPU implementation). To further simplify the input from the user, highly abstracted sketches such as the line of action approach (Guay et al. 2013) introduced in Section 2 have been proposed. A recent work by Hahn et al. (2015) further allows the user to define a custom sketch abstraction, for example, by sketching the outline or the skeleton, and the system maps the sketch to the rigging parameters to edit the pose of the 3D character by deforming the mesh model.
Conclusion
In this chapter, various kinds of real-time full-body posing approaches have been discussed. Traditional posing approaches such as IK automatically create new poses according to a small number of constraints, which reduces the workload of animators in posing characters. Data-driven approaches produce natural-looking poses by constraining the produced poses to lie in a natural pose space. Finally, a wide range of intuitive interfaces has been developed for directly manipulating the pose of the character to further simplify the posing process. While the research interest in full-body posing has been shifting from traditional methods toward more intuitive controls, we believe IK will still play an important role in character posing.
References Baraff D (1996) Linear-time dynamics using lagrange multipliers. In: SIGGRAPH ’96: Proceedings of the 23rd annual conference on computer graphics and interactive techniques. ACM, New York, pp 137–146. doi:10.1145/237170.237226
Chai J, Hodgins JK (2005) Performance animation from low-dimensional control signals. In: SIGGRAPH ’05: ACM SIGGRAPH 2005 papers. ACM, New York, pp 686–696. doi:10.1145/1186822.1073248 Choi MG, Yang K, Igarashi T, Mitani J, Lee J (2012) Retrieval and visualization of human motion data via stick figures. Comput Graph Forum 31(7pt1):2057–2065. doi:10.1111/j.14678659.2012.03198.x Davis J, Agrawala M, Chuang E, Popović Z, Salesin D (2003) A sketching interface for articulated figure animation. In: Proceedings of the 2003 ACM SIGGRAPH/eurographics symposium on computer animation, SCA ’03. Eurographics Association, Aire-la-Ville, pp 320–328 http://dl. acm.org/citation.cfm?id=846276.846322 Esposito C, Paley WB, Ong J (1995) Of mice and monkeys: a specialized input device for virtual body animation. In: Proceedings of the 1995 symposium on interactive 3D graphics, I3D ’95. ACM, New York, p 109–ff. doi:10.1145/199404.199424 Gleicher M (1998) Retargeting motion to new characters. In: SIGGRAPH ’98: Proceedings of the 25th annual conference on computer graphics and interactive techniques. ACM Press, New York, pp 33–42. doi:10.1145/280814.280820 Grochow K, Martin SL, Hertzmann A, Popović Z (2004) Style-based inverse kinematics. ACM Trans Graph 23(3):522–531. doi:10.1145/1015706.1015755 Guay M, Cani MP, Ronfard R (2013) The line of action: an intuitive interface for expressive character posing. ACM Trans Graph 32(6):205:1–205:8. doi:10.1145/2508363.2508397 Hahn F, Mutzel F, Coros S, Thomaszewski B, Nitti M, Gross M, Sumner RW (2015) Sketch abstractions for character posing. In: Proceedings of the 14th ACM SIGGRAPH/eurographics symposium on computer animation, SCA ’15. ACM, New York, pp 185–191. doi:10.1145/ 2786784.2786785 Hämäläinen P, Rajamäki J, Liu CK (2015) Online control of simulated humanoids using particle belief propagation. ACM Trans Graph 34(4):81:1–81:13. doi:10.1145/2767002 Harish P, Mahmudi M, Callennec BL, Boulic R (2016) Parallel inverse kinematics for multithreaded architectures. ACM Trans Graph 35(2):19:1–19:13. doi:10.1145/2887740 Ho ESL, Komura T (2009) Character motion synthesis by topology coordinates. In: Dutr’e P, Stamminger M (eds) Computer graphics forum (Proceedings of Eurographics 2009), Munich, vol 28, pp 299–308 Ho ESL, Komura T (2011) A finite state machine based on topology coordinates for wrestling games. Comput Animat Virtual Worlds 22(5):435–443. doi:10.1002/cav.376 Ho ESL, Shum HPH (2013) Motion adaptation for humanoid robots in constrained environments. In: Robotics and automation (ICRA), 2013 I.E. international conference on, pp 3813–3818. doi:10.1109/ICRA.2013.6631113 Ho ESL, Komura T, Ramamoorthy S, Vijayakumar S (2010a) Controlling humanoid robots in topology coordinates. In: Intelligent robots and systems (IROS), 2010 IEEE/RSJ international conference on, pp 178–182. doi:10.1109/IROS.2010.5652787 Ho ESL, Komura T, Tai CL (2010b) Spatial relationship preserving character motion adaptation. ACM Trans Graph 29(4):1–8. doi:10.1145/1778765.1778770 Ho ESL, Chan JCP, Komura T, Leung H (2013a) Interactive partner control in close interactions for real-time applications. ACM Trans Multimed Comput Commun Appl 9(3):21:1–21:19. doi:10.1145/2487268.2487274 Ho ESL, Shum HPH, Ym C, PC Y (2013b) Topology aware data-driven inverse kinematics. Comput Graph Forum 32(7):61–70. doi:10.1111/cgf.12212 Igarashi T, Matsuoka S, Tanaka H (1999) Teddy: a sketching interface for 3d freeform design. 
In: Proceedings of the 26th annual conference on computer graphics and interactive techniques, SIGGRAPH ’99. ACM Press/Addison-Wesley, New York, pp 409–416. doi:10.1145/ 311535.311602 Isrozaidi N, Ismail N, Oshita M (2010) Data glove-based interface for real-time character motion control. In: ACM SIGGRAPH ASIA 2010 Posters, SA ’10. ACM, New York, p 5:1. doi:10.1145/1900354.1900360
Jacobson A, Panozzo D, Glauser O, Pradalier C, Hilliges O, Sorkine-Hornung O (2014) Tangible and modular input device for character articulation. ACM Trans Graph 33(4):82:1–82:12. doi:10.1145/2601097.2601112 Kallmann M (2008) Analytical inverse kinematics with body posture control. Comput Animat Virtual Worlds 19(2):79–91 Komura T, Lam WC (2006) Real-time locomotion control by sensing gloves. Comput Animat Virtual Worlds 17(5):513–525. doi:10.1002/cav.114 Komura T, Ho ESL, Lau RW (2005) Animating reactive motion using momentum-based inverse kinematics: motion capture and retrieval. J Vis Comput Animat 16(3–4):213–223. doi:10.1002/ cav.v16:3/4 Kovar L, Schreiner J, Gleicher M (2002) Footskate cleanup for motion capture editing. In: SCA ’02: Proceedings of the 2002 ACM SIGGRAPH/Eurographics symposium on Computer animation, pp 97–104. doi:10.1145/545261.545277 Kulpa R, Multon F, Arnaldi B (2005) Morphology-independent representation of motions for interactive human-like animation. Computer Graphics Forum 24(3):343–351. doi:10.1111/ j.1467-8659.2005.00859.x Kyto M, Dhinakaran K, Martikainen A, Hamalainen P (2015) Improving 3d character posing with a gestural interface. IEEE Comput Graph Appl. doi:10.1109/MCG.2015.117 Lawrence ND (2004) Gaussian process latent variable models for visualisation of high dimensional data. In: Advances in neural information processing systems (Proceedings of NIPS 2003). MIT Press, Cambridge, MA, pp 329–336 Lee J, Shin SY (1999) A hierarchical approach to interactive motion editing for human-like figures. In: Proceedings of the 26th annual conference on computer graphics and interactive techniques, SIGGRAPH ’99. ACM Press/Addison-Wesley Publishing, New York, pp 39–48. doi:10.1145/ 311535.311539 Lin J, Igarashi T, Mitani J, Liao M, He Y (2012) A sketching interface for sitting pose design in the virtual environment. IEEE Trans Vis Comput Graph 18(11):1979–1991. doi:10.1109/ TVCG.2012.61 Liu H, Wei X, Chai J, Ha I, Rhee T (2011) Realtime human motion control with a small number of inertial sensors. In: Symposium on interactive 3D graphics and games, I3D ’11. ACM, New York, pp 133–140. doi:10.1145/1944745.1944768 Lu J, Liu X (2014) Foot plant detection for motion capture data by curve saliency. In: Computing, Communication and Networking Technologies (ICCCNT), 2014 international conference on, pp 1–6. doi:10.1109/ICCCNT.2014.6963001 Lyard E, Magnenat-Thalmann N (2008) Motion adaptation based on character shape. Comput Animat Virtual Worlds 19(3–4):189–198. doi:10.1002/cav.v19:3/4 Leap Motion (2016, n.d.) https://www.leapmotion.com/ Nakamura Y, Hanafusa H (1986) Inverse kinematics solutions with singularity robustness for robot manipulator control. J Dyn Syst Meas Control 108:163–171 Numaguchi N, Nakazawa A, Shiratori T, Hodgins JK (2011) A puppet interface for retrieval of motion capture data. In: Proceedings of the 2011 ACM SIGGRAPH/Eurographics symposium on computer animation, SCA ’11. ACM, New York, pp 157–166. doi:10.1145/ 2019406.2019427 Oshita M, Senju Y, Morishige S (2013) Character motion control interface with hand manipulation inspired by puppet mechanism. In: Proceedings of the 12th ACM SIGGRAPH international conference on virtual-reality continuum and its applications in industry, VRCAI ’13. ACM, New York, pp 131–138. doi:10.1145/2534329.2534360 Rhodin H, Tompkin J, Kim KI, Kiran V, Seidel HP, Theobalt C (2014) Interactive motion mapping for real-time character control. Comput Graph Forum (Proc Eurograph) 33(2):273–282. 
doi:10.1111/cgf.12325 Rose C, Cohen MF, Bodenheimer B (1998) Verbs and adverbs: multidimensional motion interpolation. IEEE Comput Graph Appl 18:32–40. doi:10.1109/38.708559
Shin HJ, Lee J, Shin SY, Gleicher M (2001) Computer puppetry: an importance-based approach. ACM Trans Graph 20(2):67–94. doi:10.1145/502122.502123 Shum HPH, Komura T, Yamazaki S (2007) Simulating competitive interactions using singly captured motions. In: Proceedings of ACM virtual reality software technology 2007, pp 65–72 Shum HPH, Komura T, Yamazaki S (2008) Simulating interactions of avatars in high dimensional state space. In: ACM SIGGRAPH symposium on interactive 3D graphics (i3D) 2008, pp 131–138 Wang LCT, Chen CC (1991) A combined optimization method for solving the inverse kinematics problems of mechanical manipulators. IEEE Trans Robot Autom 7(4):489–499. doi:10.1109/ 70.86079 Wei XK, Chai J (2011) Intuitive interactive human-character posing with millions of example poses. IEEE Comput Graph Appl 31:78–88. doi:10.1109/MCG.2009.132 Welman C (1993) Inverse kinematics and geometric constraints for articulated figure manipulation. Master’s thesis, Simon Frasor University Whitney D (1969) Resolved motion rate control of manipulators and human prostheses. Man-Machine Syst IEEE Trans 10(2):47–53. doi:10.1109/TMMS.1969.299896 Wu X, Tournier M, Reveret L (2011) Natural character posing from a large motion database. IEEE Comput Graph Appl 31(3):69–77. doi:10.1109/MCG.2009.111 Yamane K, Nakamura Y (2003) Natural motion animation through constraining and deconstraining at will. IEEE Trans Vis Comput Graph 9(3):352–360. doi:10.1109/TVCG.2003.1207443
Real-Time Full Body Motion Control John Collomosse and Adrian Hilton
Abstract
This chapter surveys techniques for interactive character animation, exploring data-driven and physical simulation-based methods. Interactive character animation is increasingly data driven, with animation produced through the sampling, concatenation, and blending of pre-captured motion fragments to create movement. The chapter therefore begins by surveying commercial technologies and academic research into performance capture. Physically based simulations for interactive character animation are briefly surveyed, with a focus upon techniques proven to run in real time. The chapter focuses upon concatenative synthesis approaches to animation, particularly upon motion graphs and their parametric extensions for planning skeletal and surface motion for interactive character animation.
Keywords
Animation • 3D motion capture • Real-time motion • Virtual reality • Augmented reality • 4D mesh calculation • Parametric motion
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Commercial Technologies for Performance Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marker-Less Human Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Interactive Character Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Real-Time Physics-Based Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Concatenative Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 2 3 5 8 10 14 22 23 24
J. Collomosse (*) • A. Hilton Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey, Surrey, UK e-mail: [email protected]; [email protected] # Springer International Publishing AG 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_9-1
Introduction Compelling visuals and high-quality character animation are cornerstones of modern video games and immersive experiences. Yet character animation remains an expensive process. It can take a digital artist weeks to skin a character model (design its 3D surface representation) and then rig it with a skeleton to facilitate full body control and animation. Animation is often expedited by retargeting human performance capture data to drive the character’s movement. Yet creativity and artistic input remain in the loop, blending handcrafted animation with motion capture data which itself may be an amalgam of multiple takes (e.g., it is common for separate passes to be used for the face, head, and hands). Performance capture itself is expensive; equipment hire, operation, and studio/actor time can approach millions of US dollars on a high-end production. The recent resurgence of virtual and augmented reality (VR/AR) experiences, in which character interaction takes place at very close quarters, is further driving up expectations of visual realism. Creating believable interactive digital characters is therefore a trade-off between project budget and quality. Better tool support inevitably leads to efficiency and so a rebalancing toward higher quality. In this chapter, we survey state-of-the-art technologies and algorithms (as of 2015) for efficient interactive character animation. While a common goal is a drive toward increased automation, which in some cases can produce interactive characters with near-complete automation, one should not lose sight of the fact that these are tools only; the need for the creative artist in the loop remains essential to reach the high-quality bar demanded by modern production. As such this chapter takes a practical view on animation, first surveying the commercial technologies and academic research into performance capture and then surveying the two complementary approaches to real-time animation – physically based approaches (examined further in chapter C-2) and data-driven approaches. Although character animation is frequently used within other domains of the Creative Industries (movies, broadcast), its use within games requires new animation sequences to be generated on the fly, responding in real time to user interaction and game events. This places some design restrictions on the underpinning algorithms (efficient data structures, no temporal look ahead for kinematics). This chapter therefore focuses upon algorithms for interactive, rather than more general offline, character animation covered elsewhere in this book.
State of the Art Historically character animation has been underpinned by meticulous observations of movement in nature, for example, the gait cycles of people or animals. This link has been made explicit by contemporary character animation, which is trending toward a data-driven process in which sampled physical performance is the basis for synthesizing realistic movement in real time. This chapter therefore begins by surveying commercial technologies, and state-of-the-art Computer Vision algorithms, for capturing human motion data.
Commercial Technologies for Performance Capture Motion Capture (mocap) technology was initially developed within the Life Sciences for human movement analysis. The adoption of mocap for digital entertainment, commonly referred to as performance capture (PC), is now widespread. PC accounts for 46% of the total 3D motion capture system market, which is growing annually at a rate of around 10% and is expected to reach 142.5 million US dollars by 2020 (Rohan 2015). Indeed many of the innovations in mocap (e.g., marker-less capture) are now being developed within the Creative Industries and transferred back into domains such as biomechanics and healthcare. PC systems enable sequences of skeletal joint angles to be recorded from one or several actors. The key distinction between PC systems is the kind of physical marker or wearable device (if any) required to be attached to the actors. The predominant form of PC in the Creative Industries is marker based, using passive markers that are tracked visually using synchronized multiple viewpoint video (MVV). Popular systems for passive marker PC are manufactured by Vicon (UK) and Optitrack (US), which require the actor to wear retroreflective spheres (approximately 20–30 are typically used for full body capture). A region of the studio (capture volume) is surrounded by several infrared (IR) cameras in known locations and illuminated by several diffuse IR light sources. Prior to capture of performance data, a calibration process is performed to learn the relative locations (extrinsic parameters) of the cameras. This enables the world location of the markers attached to the actor to be triangulated, resulting in a 3D point cloud from which a skeletal pose is inferred using physical and kinematic constraints. Modern software (e.g., Blade or MotionBuilder) can perform this inference in real time, providing immediate availability of a pose estimate for each actor in the scene. PC service providers (e.g., The Imaginarium Studios, London UK) have begun to harness this technology to pre-visualize the appearance of digital characters for movie or game production during live actor performance. Such facilities provide immediate visual feedback to both the actor and director on set, removing trial and error and so improving efficiency in the capture process (Fig. 1, top). Other forms of passive PC in regular use include the fractal suits patented by Industrial Light and Magic for full body motion capture (Fig. 1, bottom). The suits are tracked using visible light and so are more amenable to deployment in outdoor sets where strong natural light makes IR impractical. Active marker-based systems include offerings from CodaMotion (UK) and PhaseSpace (US). Markers are bright IR light-emitting diodes (LEDs) that pulse with unique signatures that identify the marker to one or several observing cameras. Since markers are uniquely labeled at source, automated tracking of markers is trivial, making marker confusion highly unlikely. By contrast, the labeling of triangulated markers in a passive system is performed during pose inference and may be incorrect in the presence of clutter (e.g., multiple occluding actors). Marker mislabeling causes errors in pose estimation, which can only be removed through the addition of more witness cameras (reducing the chance of occlusion) or by manually correcting the data post-capture. An advantage of active marker-based systems is therefore the need for
Fig. 1 Performance capture technologies. Top: Vicon IR-based system being used to pre-visualize character performance in real time within the UNREAL Games engine (EPIC). Bottom: Industrial Light and Magic’s fractal suit enabling visual light-based tracking outdoors
fewer cameras and reduced data correction. Active systems tend to perform better outdoors, again due to obviating the need for large area IR illumination. The disadvantage is the additional expense and time required for actor setup (wires and batteries) due to the complexity of the markers. The workflow to produce a skeletal pose estimate from active marker data is identical to passive systems, since the capture again results in a sequence of 3D point movements. Inertial motion capture systems use inertial motion-sensing units (IMUs) to detect changes in joint orientation and movement, providing an alternative to visual
tracking and so removing the problem of marker occlusion. IMUs are worn on each limb (around 12–14 for full body capture) and connected wirelessly to a hub which forwards the data for software processing. Common IMU capture solutions include AnimeZoo (UK), XSens (Netherlands), and most recently the crowd-funded PerceptionNeuron (US) system. All of these solutions again rely upon a robust back-end software product to infer a skeletal pose estimate using physical and kinematic constraints. The disadvantage of inertial capture is drift, since the IMUs output only a stream of relative joint angles. For this reason, IMU mocap is sometimes combined with a secondary modality, e.g., laser ranging or passive video, to capture the world position of the actor. An emerging form of PC is marker-less mocap, using Computer Vision to track the actor without the need for wearables. Although the accuracy of commercial marker-less systems has yet to reach parity with marker-based solutions, the greatly reduced setup time and the flexibility to use only regular video cameras for capture make such systems a cost-effective option. For the purposes of teaching data-driven animation production, marker-less technologies are therefore attractive. Solutions include the OrganicMotion stage (US), a cube arrangement of around 20 machine vision cameras that calculates human pose using the silhouette of the performer against a uniform background from the multiple camera angles. More recently The Captury (Germany) launched a software-only product for skeletal PC that estimates pose against an arbitrary background using a possibly heterogeneous array of cameras. Yet although commercial solutions to marker-less PC remain in their infancy, academic research is making good progress as we next discuss.
Marker-Less Human Motion Estimation Passive estimation of human pose from video is a long-standing Computer Vision challenge, particularly when visual fiducials (markers) are not present. Methods can be partitioned into those considering monocular (single-view) video or multiple viewpoint video.
Monocular Human Pose Estimation Human pose estimation (HPE) often requires the regions of interest (ROIs) representing people to be identified within the video. This person localization problem can be solved using background (Zhao and Nevatia 2003) or motion (Agarwal and Triggs 2006) subtraction in the case of simple backgrounds. In more cluttered scenarios, supervised machine learning can be applied to detect the presence of a person within a sliding window swept over the video frame. Within each position of the window, pretrained classifiers based on Histogram of Gradient (HoG) descriptors can robustly identify the torso (Eichner and Ferrari 2009), face (Viola and Jones 2004), or entire body (Dalal and Triggs 2005). Once the subject is localized within the frame, the majority of monocular HPE algorithms attempt to infer only a 2D, i.e., apparent, pose of the performer. These adopt either (a) top-down fitting of a person model, optimizing limb parameters and
projecting to image space to evaluate correlation with image data, or (b) individually segmenting parts and integrating their positions in a bottom-up manner to produce a maximum likelihood pose. Bottom-up approaches dominated early research into HPE, over one decade ago. Srinivasan and Shi (2007) used an image segmentation algorithm (graph cut) to parse a subset of salient shapes from an image and group these into a shape resembling a person using a set of learned rules. However the approach was limited to a single person, and background clutter was reported to interfere with the initial segmentation and so the eventual accuracy of the approach. Ren et al. proposed an alternative algorithm in which Canny edge contours were recursively split into segments, each of which was classified as a putative body part using shape cues such as parallelism (Ren et al. 2005). Ning et al. (2008) similarly attempted to label body parts individually, applying a Bag of Visual Words (BoVW) framework to learn codewords for body zone labeling – segmenting 2D body parts to infer pose. Mori and Malik described the first bottom-up algorithm capable of estimating a 3D pose in world space, identifying the position of individual joints in a 2D image using scale and symmetry constraints, and then matching those 2D joint positions to a set of many “training images” each of which had been manually annotated a priori with 2D joint positions (Mori et al. 2004) and was also associated with a 3D ground truth. Once the closest training image had been identified by matching query and training joint positions in 2D, the relevant 3D pose was returned as the result. Top-down approaches, in which the entire 2D image is used as evidence to fit a model, are more contemporary. The most common form of model fitted to the image is a “pictorial structure,” essentially a collection of 2D limbs (regions) articulated by springs, that can be iteratively deformed to fit to evidence in the image under an optimization process (Andriluka et al. 2009; Eichner and Ferrari 2009). However such approaches do not recover a 3D pose estimate, or where they do, the estimate is unstable due to ambiguity in reasoning from a single image.
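The person localization step described above is straightforward to prototype. The following minimal Python sketch uses OpenCV's stock HOG-plus-linear-SVM pedestrian detector; it is illustrative only (the cited works train their own HoG-based torso/face/body classifiers), and the input filename and confidence threshold are assumptions.

# Minimal person localization sketch using OpenCV's built-in HOG + SVM
# pedestrian detector (Dalal-Triggs style); illustrative only.
import cv2

def detect_people(frame):
    """Return (x, y, w, h) boxes around detected people."""
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
    # Sliding-window detection over an image pyramid.
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8),
                                          padding=(8, 8), scale=1.05)
    # Keep reasonably confident detections only (threshold is illustrative).
    return [tuple(b) for b, w in zip(boxes, weights) if float(w) > 0.5]

if __name__ == "__main__":
    img = cv2.imread("frame.png")        # hypothetical input frame
    for (x, y, w, h) in detect_people(img):
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("detections.png", img)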
Multi-View Human Pose Estimation A 3D estimate of human pose may be inferred with less ambiguity using footage captured from multiple viewpoints. In such a setup, a configuration of cameras (typically surrounding a subject in a 180° or 360° arc) observes a capture volume within which a performance is enacted. The cameras are typically calibrated, i.e., for a subject observed by C camera views c = [1, C] the extrinsic parameters {R_c, COP_c} (camera orientation and focal point) and intrinsic parameters {f_c, o^x_c, o^y_c} (focal length and 2D optical center) are known. Two categories of approach exist: (a) those estimating 2D pose from each view independently and fusing these to deduce a 3D pose and (b) those inferring a 3D pose from a 3D geometric proxy of the performer recovered through volumetric reconstruction. Computer Vision has undergone a revolution in recent years, with deep convolutional neural networks (CNNs) previously popular in text recognition being extended and applied to solve many open problems including human pose
Fig. 2 Convolutional neural networks (CNNs) used for pose estimation in multi-viewpoint video. (a) Using 2D detections of body parts fused in a 3D probabilistic model (from (Elhayek et al. 2015)), (b) recognition of pose from 3D volumetric data recovered from multiple views (from (Trumble et al. 2016))
estimation. CNNs have shown particular strengths in general object detection, with some state-of-the-art networks, e.g., GoogLeNet (Google Inc.), surpassing human performance in certain scenarios. Most recently CNNs have also been used to detect human body parts in single and multiple viewpoint video and to infer human pose from these. Elhayek et al. (2015) estimate human body parts from individual video viewpoints using CNN detectors and then fuse these under a probabilistic model incorporating color and motion constraints from a body part tracker to create a 3D pose estimate. The CNN detection step is robust to clutter, making the system suitable for estimation of 3D pose in complex scenes including outdoors (Fig. 2a). In volumetric approaches, a geometric proxy of the performer is built using a visual hull (Grauman et al. 2003) computed from foreground mattes extracted from each camera image I_c using a chroma key or more sophisticated image segmentation algorithm. To compute the visual hull, the capture volume is coarsely decimated into
a set of voxels at locations V = {V_1, . . ., V_m}; a resolution of 1 cm³ is commonly used for a capture volume of approximately 6 × 2 × 6 m. The probability of the voxel being part of the performer in a given view c is:

$$ p(V_i \mid c) = B\bigl(I_c(x[V_i],\, y[V_i])\bigr), \qquad (1) $$

where B(.) is a simple blue dominance term derived from the RGB components of I_c(x, y), i.e., 1 − B/(R+G+B), and (x, y) is the point within I_c that V_i projects to:

$$ x[V_i] = \frac{f_c\, v_x}{v_z} + o^x_c \quad \text{and} \quad y[V_i] = \frac{f_c\, v_y}{v_z} + o^y_c, \qquad (2) $$

where

$$ (v_x,\, v_y,\, v_z)^{\mathsf{T}} = \mathrm{COP}_c - R_c^{-1} V_i. \qquad (3) $$

The overall probability of occupancy for a given voxel p(V_i) is:

$$ p(V_i) = \prod_{c=1}^{C} \frac{1}{1 + e^{-p(V_i \mid c)}}. \qquad (4) $$
We compute p(V_i) for all V_i ∈ V to create a volumetric representation of the performer for subsequent processing. An iso-contour extraction algorithm such as marching cubes (Lorensen and Cline 1987) is used to extract a triangular mesh model from the voxel-based visual hull (Fig. 3). The result is a topologically independent 3D mesh for each frame of video. This can be converted into a so-called “4D” representation using a mesh tracking process to conform these individual meshes to a single mesh that deforms over time (Budd et al. 2013). Once obtained, it is trivial to mark up a single frame of the performance to embed a skeleton (e.g., marking each joint or limb as an average of a subset of mesh vertices) and have the skeleton track with the performance as the mesh deforms. As we explain in subsection “Concatenative Synthesis,” either the skeletal or surface representations from such a 4D performance capture may be used to drive character animation interactively. CNNs have also been applied to volumetric approaches, with a spherical histogram (c.f. subsection “Surface Motion Graphs”) derived from the visual hull being fed into a CNN to directly identify human pose (Trumble et al. 2016). The system contrasts with Elhayek et al. (2015) in that the CNN operates in 3D rather than 2D space, and it similarly adds robustness to visual clutter in the scene.
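To make the visual hull computation of Eqs. 1–4 concrete, the following Python sketch evaluates per-voxel occupancy from a set of calibrated views. The camera dictionary fields, the blue-screen foreground term, and the array shapes are illustrative assumptions; a practical implementation would vectorize this over the whole voxel grid.

import numpy as np

def foreground_probability(img, x, y):
    # B(.) of Eq. 1: one minus the blue dominance of the RGB pixel,
    # assuming a blue-screen (chroma key) background.
    r, g, b = img[int(y), int(x)].astype(float)
    return 1.0 - b / max(r + g + b, 1e-6)

def project_voxel(V, cam):
    # Eqs. 2-3: project a world-space voxel centre V into camera 'cam'.
    # cam holds the calibration: 'Rinv' (R_c^-1), 'COP', 'f', 'ox', 'oy'.
    vx, vy, vz = cam["COP"] - cam["Rinv"] @ V
    return cam["f"] * vx / vz + cam["ox"], cam["f"] * vy / vz + cam["oy"]

def voxel_occupancy(voxels, cams, images):
    # Eq. 4: fuse per-view foreground probabilities into p(V_i).
    p = np.zeros(len(voxels))
    for i, V in enumerate(voxels):
        prob = 1.0
        for cam, img in zip(cams, images):
            x, y = project_voxel(np.asarray(V, dtype=float), cam)
            if 0 <= x < img.shape[1] and 0 <= y < img.shape[0]:
                p_vc = foreground_probability(img, x, y)
            else:
                p_vc = 0.0  # voxel falls outside this view
            prob *= 1.0 / (1.0 + np.exp(-p_vc))
        p[i] = prob
    return p  # thresholding and marching cubes follow downstream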
Interactive Character Animation Interactive character animation often takes place within complex digital environments, such as games, in which multiple entities (characters, moveable objects, and static scene elements) interact continuously. Since these interactions are a function of user input, they cannot be predicted or scripted a priori, and enumerating all possible
Fig. 3 4D performance capture. Multiple video views (top) are fused to create a volumetric representation of the performance which is meshed (bottom). The per-frame meshes are conformed to a single deforming mesh over time, into which a skeleton may be embedded and tracked (right)
eventualities is intractable. It is therefore necessary to plan animation in real time using fast, online algorithms (i.e., algorithms using data from the current and previous timesteps only). Two distinct categories of algorithm exist. First, algorithms drawing upon a pre-supplied database of motion for the character, usually obtained via PC and/or manual scripting. Several fragments of motion data (“motion fragments”) are stitched and blended together to create a seamless piece of animation. A trivial example is a single cycle of a walk, which can be repeatedly concatenated to create a character walking forward in perpetuity. However more complex behavior (e.g., walks along an arbitrary path) can be created by carefully selecting and interpolating between a set of motion fragments (e.g., three walk cycles, one veering left, one veering right, and one straight-ahead) such that no jarring movement occurs. This form of motion synthesis, formed by concatenating (and in some cases interpolating between) several motion fragments, is referred to as “concatenative synthesis.” The challenge is therefore in selecting and sequencing appropriate motion fragments to react to planning requirements (move from A to B) under environmental (e.g., occlusion) and physical (e.g., kinematic) constraints. This is usually performed via a graph optimization process, with the motion fragments and valid transitions between these encoded in the nodes and edges of a directed graph referred to as a “move tree” or “motion graph” (Kovar et al. 2002). The key advantages of a motion graph are predictability of movement and artistic control over the motion fragments, qualities that are challenging to embody within a physical simulation. The disadvantage is that motion cannot generalize far beyond the motion fragments, i.e., character movement obtained via PC in the studio. We discuss concatenative synthesis in detail within subsection “Concatenative Synthesis.”
Second, algorithms that do not require prescripted or directly captured animation but instead simulate the movement under physical laws. Physics simulation is now commonly included within game engines (e.g., Havok, PhysX) but used primarily to determine the motion of objects or particles or the animation of secondary characteristics such as cloth attached to characters (Armstrong and Green 1985). Yet more recently, physics-based character animation has been explored by integrating such engines into the animation loop of principal characters (Geijtenbeek et al. 2010). Physics-based simulation offers the significant advantage of generalization; characters modeled in this manner can react to any situation within the virtual world and are not bound to a database of preordained movements. Nevertheless, the high computational cost of simulation forces accuracy-performance trade-offs for real-time use. Simplifying assumptions, such as articulated rigid bodies for the skeletal structure, are very common. It is therefore inaccurate to consider physically simulated animation as being more “natural”; indeed the tendency of simulation to produce “robotic” movements lacking expressivity has limited practical uptake of these methods for interactive character animation until comparatively recently. We briefly discuss physics-based character control in the next section, restricting discussion to the context of real-time animation for interactive applications. A detailed discussion of physics-based character animation in a broader context can be found in chapter C-2.
Real-Time Physics-Based Simulation Physically simulated characters are usually modeled as a single articulated structure of rigid limb components, interconnected by two basic forms of joint mimicking anatomy in nature. Characters modeled under physical simulation are typically humanoid (Hodgins 1991; Raibert and Hodjins 1991) or animal (Wampler and Popovic 2009) consisting predominantly of hinge joints, with hip and shoulder joints implemented as ball-socket joints. Depending on the purpose of the simulation, limbs may be amalgamated for computational efficiency (e.g., a single component for the head, neck, and torso) (Tin et al. 2008). More complex simulations can include sliding joints in place of some hinge joints that serve to model shock absorption within the ligaments of the leg (Kwon and Hodgins 2010).
Character Model Actuation The essence of the physical simulation is to solve for the forces and torques that should be applied to each limb, in order to bring about a desired motion. This solve is performed by a “motion controller” algorithm (subsection “Character Motion Control”). The locations at each limb where forces are to be applied are a further design consideration of the modeler. The most common strategy is to consider torque about each joint (degree of freedom), a method known as servo actuation. While intuitive, servo actuation is not natural – effectively assuming each joint to contain a motor capable of rotating its counterpart – careful motion planning is necessary to guard against unnatural motion arising under this simplified model.
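As a concrete illustration of servo actuation, the following Python sketch computes per-joint torques with a proportional-derivative (PD) servo driving each joint toward a target angle. The gains, torque limit, and the simulator method names in the usage comment are illustrative assumptions rather than values or APIs from any cited system.

import numpy as np

def pd_servo_torques(q, dq, q_target, kp=300.0, kd=30.0, tau_max=200.0):
    # Servo actuation: one PD 'motor' per degree of freedom.
    #   q, dq    - current joint angles (rad) and angular velocities (rad/s)
    #   q_target - desired joint angles from the motion controller
    #   kp, kd   - proportional and derivative gains (illustrative values)
    #   tau_max  - torque limit guarding against implausibly strong joints
    q, dq, q_target = map(np.asarray, (q, dq, q_target))
    tau = kp * (q_target - q) - kd * dq        # spring toward target, damped
    return np.clip(tau, -tau_max, tau_max)

# Typical closed-loop use inside one physics timestep (hypothetical API):
#   tau = pd_servo_torques(sim.joint_angles(), sim.joint_velocities(), q_target)
#   sim.apply_joint_torques(tau)
#   sim.step(dt)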
Biologically inspired models include simulated muscles that actuate through tendons attached to limbs, effecting a torque upon the connected joints. Muscle-actuated models are more challenging to design motion controllers for, since the maximum torque that can be applied by a muscle is limited by the turning moment of the limb, which is dependent on the current pose of the model. Furthermore the number of degrees of freedom in such models tends to be higher than in servo-actuated models, since muscles tend to operate in an antagonistic manner, with a pair of muscles per joint enabling “push” and “pull” about the joint. Moreover such models cannot be considered natural unless the tendons themselves are modeled as nonrigid structures, capable of stretching and compressing to store and release energy in the movement. The high computational complexity of motion controllers that solve for muscle-actuated models therefore remains a barrier to their use in real-time character animation, whose applications to digital entertainment (rather than, say, biomechanics) rarely require biologically accurate simulation. We therefore do not consider them further in this chapter.
Character Motion Control Use cases for character animation rarely demand direct, fine-grain control of each degree of freedom in the model. Rather, character control is directed at a higher level, e.g., “move from A to B at a particular speed, in a particular style.” Such directives are issued by game AI, narrative, or other higher level controllers. Motion controllers are therefore a mid-layer component in the control stack bridging the semantic gap between high-level control and low-level actuation parameters. In interactive scenarios, simple servo-based actuation (i.e., independent, direct control over joint torques) is adopted to ensure computation of the mapping is tractable in real time. Solving for the movement is performed iteratively, over many small timesteps, each incorporating feedback supplied by the physics engine from each actuation of the model at the previous timestep under closed-loop control. This obviates the need to model complex outcomes of movements within the controller itself. Feedback comprises not only global torso position and orientation but also individual joint orientation and velocity post-simulation of the movement. It is common for controllers to reason about the stability (balance) of the character when planning movement. The center of mass (COM) of the character should correspond to the zero-moment point (ZMP), i.e., the point at which the reaction force from the world surface results in a zero net moment. When the COM and ZMP coincide, the model is stable. We outline two common strategies to motion control that are applicable to physically based real-time interactive character animation. Control in Joint-Space via Pose Graphs Some of the earliest animation engines comprised carefully engineered software routines, to procedurally generate motion according to mechanics models embedded within kinematics solvers and key-framed poses. This approach is derived from real-time motion planning in the robotics literature. Such approaches model the desired end-position of a limb (or limbs) as a “keypose.” Using a kinematics engine, the animation rig (i.e., joint angles) is gradually
adjusted to bring the character’s pose closer to that desired key-pose. The adjustment is an iterative process of actuation and feedback from the environment to determine the actual position of the character and subsequent motions. For example, the COM and ZMP as well as the physical difference between the current and intended joint positions are monitored to ensure that the intended motion does not unbalance the character unduly and that progress is not impeded by itself or other scene elements. A sequence of such key-poses is defined within a “Pose Space Graph” (PSG), where the nodes in the graph are procedurally defined poses, i.e., designed by the animator, but the movements between poses are solved using an inverse kinematics engine (IK). A motion, such as a walk, is performed by transitioning through states in the PSG (Fig. 4 illustrates a walk cycle in a PSG). Due to physical or timing constraints, a character often will not reach a desired pose within the PSG before being directed toward the next state. Indeed it is often unhelpful for the character to decelerate and pause (i.e., obtain a ZMP of zero) and become stable at a given state before moving on to the next; a degree of perpetual instability, for example, exists within the human walk cycle. Therefore key-poses in the PSG are often exaggerated on the expectation that the system will approximate rather than interpolate the key-poses within it. The operation of PSGs is somewhat analogous to motion graphs (c.f. subsection “Skeletal Motion Graphs”), except that IK is used to plan motion under physical models, rather than pre-captured performance fragments being concatenated and played back. Control via Machine Learned Models Although expensive to train, machine learning approaches offer a run-time efficiency unrivaled by other real-time motion controller strategies. Most commonly, neural networks (NN) are used to learn a mapping between high-level control parameters and low-level joint torques, rather than manually identified full body poses (subsection “Character Motion Control”). Notwithstanding design of the fitness function of the network and its overall architecture, the training process itself is fully automatic, using a process of trial and error via feedback from the physics simulation. A further appeal of such approaches is that such training is akin to the biological process of learned actuation in nature. Controllers usually adopt a shallow feed-forward network such as a multi-layer perceptron (MLP) (Pollack et al. 2000), although the growing popularity of deeply learned networks has prompted some recent research into their use as motion controllers (Holden et al. 2016). Training the MLP proceeds from an initially randomized (via Gaussian or “white” noise) set of network weights, using a fitness function derived from some success metric, typically the duration for which the controlled model can execute a movement (e.g., a walk) without destabilizing and falling over. Many thousands of networks (weight configurations) are evaluated to drive character locomotion, and the most successful networks are modified and explored further in an iterative optimization process to train the network (Sims 1994). Optimization of the NN weights is commonly performed by an evolutionary algorithm (EA) in which populations of network configurations (i.e., sets of weights) are evaluated in batches. The more successful configurations are propagated with the
Fig. 4 Physically based interactive character animation. (a) Pose Space Graph used to drive high-level goals for kinematics solvers which direct joint movements (from (Laszlo and Fiume 1996)); (b) ambulatory motion of a creature and person learned by optimization processes mimicking nature (from (Holden et al. 2016; Sims 1994)), respectively
subsequent batch and spliced with other promising configurations, to produce batches of increasingly fit networks (Yao 1999). In complex networks with many weights and complex movements, it can be challenging for EAs to converge during training. In such cases, weight configurations for the NN can be bootstrapped by training the same network over simpler problems. This improves upon white-noise initialization for more complex tasks. In practice, training a NN can take tens of thousands of iterations to learn an acceptable controller (Sims 1994) for even very simple movements. Yet, once learned, the controller is trivial to evaluate quickly and can be readily deployed into a real-time system. Even with bootstrapped training, however, shallow NNs cannot learn complex movements, and it was not until the recent advent of more sophisticated (deeper) NNs that locomotion of a fully articulated body was
demonstrated using an entirely machine-learned motion controller (Holden et al. 2016).
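The training loop described above can be sketched as follows: a shallow MLP maps the sensed character state to joint torques, and an evolutionary loop keeps, splices, and perturbs the fittest weight vectors. Network sizes, the mutation scale, and the simulate_episode fitness hook (e.g., seconds walked before falling) are illustrative assumptions rather than settings from any cited system.

import numpy as np

def mlp_policy(weights, state, n_in, n_hidden, n_out):
    # Shallow feed-forward controller: state -> joint torques (tanh units).
    split = n_in * n_hidden
    W1 = weights[:split].reshape(n_in, n_hidden)
    W2 = weights[split:split + n_hidden * n_out].reshape(n_hidden, n_out)
    return np.tanh(np.tanh(state @ W1) @ W2)

def evolve_controller(simulate_episode, n_in, n_hidden, n_out,
                      pop_size=64, generations=500, sigma=0.1, seed=0):
    # simulate_episode(policy) must run the physics simulation under the
    # given policy and return a fitness score (e.g. seconds before falling).
    rng = np.random.default_rng(seed)
    n_w = n_in * n_hidden + n_hidden * n_out
    pop = rng.normal(0.0, sigma, size=(pop_size, n_w))      # white-noise init
    for _ in range(generations):
        fitness = np.array([simulate_episode(
            lambda s, w=w: mlp_policy(w, s, n_in, n_hidden, n_out))
            for w in pop])
        elite = pop[np.argsort(fitness)[-pop_size // 4:]]   # keep the fittest
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = elite[rng.integers(len(elite), size=2)]
            mask = rng.random(n_w) < 0.5                    # splice two elites
            children.append(np.where(mask, a, b)
                            + rng.normal(0.0, sigma, n_w))  # mutate
        pop = np.vstack([elite] + children)
    scores = [simulate_episode(lambda s, w=w: mlp_policy(w, s, n_in, n_hidden, n_out))
              for w in pop]
    return pop[int(np.argmax(scores))]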
Concatenative Synthesis Motion concatenation is a common method for synthesizing interactive animation without the complexity and computational expense of physical simulation. In a concatenative synthesis pipeline, short fragments of motion capture are joined (and often blended) together to create a single seamless movement. In the simplest example, a single walk cycle may be repeated with appropriately chosen in-out points to create a perpetual gait. A more complex example may concatenate walk cycles turning slightly left, slightly right, or advancing straight-ahead to create locomotion along an arbitrary path.
Skeletal Motion Graphs Concatenative synthesis is dependent on the ability to seamlessly join together pairs of pre-captured motion fragments – subsequences of performance capture – to build complex animations. An initial step when synthesizing animation is therefore to identify the transition points within performance-captured footage, at which motion fragments may be spliced together. Typically the entire capture (which may in practice consist of several movements, e.g., walking, turning) is considered as a single long sequence of t = [1, N] frames, and pairs of frames {1..N, 1..N} are identified that could be transitioned between without the viewer perceiving a discontinuity. A measure of similarity is defined, computable from and to any time instant in the sequence, and that measure is thresholded to identify all potential transition points. Figure 5 visualizes both the concept and an example of such a comparison computed exhaustively over all frames of a motion capture sequence – brighter cells indicating closer matching frame pairs. Measures of Pose Similarity Pose similarity measures (which, in practice, often compute the dissimilarity between frames) should exhibit three important properties:

1. Be invariant to global rotation and translation – similar poses should be identified as similar regardless of the subject’s position in world space at both time instants. Otherwise, few transition points will be detected.
2. Exhibit spatiotemporal consistency – poses should not only appear similar at the pair of time instants considered but also move similarly. Otherwise, motion will appear discontinuous.
3. Reflect the importance of certain joints over others. Otherwise, a difference in position of, e.g., a finger might outweigh a difference in position of a leg.

Common similarity measures include direct comparison of joint angles (in quaternion form) or, more commonly, direct comparisons of limb spatial position in 3D. A set of 3D points p_1..m is computed either from limb end-points or from the vertices of a coarse mesh approximating the form of the
Fig. 5 An example of an animation (top) generated by a motion graph (left) comprising four actions (hit, stand, walk, jog). A visualization of the inter-frame distance comparison used to compute a motion graph (right)
model, and a sum of squared differences is used to evaluate the dissimilarity D(p, p′) between point sets from a pair of frames p and p′ at times t and t′, respectively:

$$ D(p, p') = \min_{\theta, x_0, z_0} \sum_{i=1}^{m} \omega_i \,\bigl\| p_i - M_{\theta, x_0, z_0}\, p'_i \bigr\|^2, \qquad (5) $$
where ‖.‖ is the Euclidean distance in world space, p_i is the ith point in the set, and M is a Euclidean transformation best aligning the two sets of point clouds, via a translation (x_0, z_0) on the ground plane and a rotation θ about the vertical (y) axis – so satisfying property (1). In order to embed spatiotemporal coherency (2), the score is computed over point sets not just from a given pair of times {t, t′} but over a k-frame window [t − k/2, t + k/2]. This is effectively a low-pass filter over time and explains the blurred appearance of Fig. 5 (right). For efficiency, pair-wise scores are computed and the resulting matrix low-pass filtered. The relative importance of each point (associated with the limb from which the point was derived) is set manually via ω_i, satisfying property (3). Motion Graph Construction Local thresholding is applied to the resulting similarity matrix, identifying nonadjacent frames (t, t′) that could be concatenated together to produce smooth transitions according to properties (1–3). For example, if the mocap sequence contains several cycles of a walk, it is likely that corresponding points in the walk cycles (e.g., left foot down at the start of each cycle) would be identified as transitions. Playing one walk cycle to time t and then “seeking” forward or backward by several hundred frames to the corresponding time t′ in another walk cycle will not produce a visual discontinuity despite the nonlinear temporal playback.
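A sketch of Eq. 5 and of the windowed filtering used to build the frame-to-frame cost matrix is given below. The rotation about the vertical axis is found here by a dense angular sweep rather than the closed-form solution used in practice, and the point weights, window size, and frame representation ((m, 3) point arrays) are illustrative assumptions.

import numpy as np

def rot_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def pose_distance(p, p2, w, n_angles=180):
    # Eq. 5: weighted SSD between point sets after the best rigid alignment
    # on the ground plane (rotation about y plus x/z translation).
    best = np.inf
    for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        q = p2 @ rot_y(theta).T
        # Optimal ground-plane translation aligns the weighted centroids.
        d = np.average(p - q, axis=0, weights=w)
        d[1] = 0.0                                    # no vertical translation
        cost = np.sum(w * np.sum((p - (q + d)) ** 2, axis=1))
        best = min(best, float(cost))
    return best

def similarity_matrix(frames, w, window=5):
    # Pairwise distances for all frame pairs, followed by a temporal box
    # filter (the low-pass step enforcing spatio-temporal consistency).
    n = len(frames)
    D = np.array([[pose_distance(frames[i], frames[j], w) for j in range(n)]
                  for i in range(n)])
    pad = window // 2
    Dp = np.pad(D, pad, mode="edge")
    return np.array([[Dp[i:i + window, j:j + window].mean()
                      for j in range(n)] for i in range(n)])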
The “in” (t) and “out” (t′) frames of these transition points are identified and represented as nodes in a graph structure (the motion graph). Edges in the graph correspond to clips of motion, i.e., motion fragments between these frames in linear time. Additional edges are introduced to connect the “in” and “out” frames of each transition. Usually the pose of the performer at the “in” and “out” points differs slightly, and so this additional edge itself comprises a short sequence of frames constructed by interpolating the poses at “in” and “out,” respectively, e.g., using quaternion-based joint angle interpolation.

Motion Path Optimization Random walks over the motion graph representation can provide a useful visualization to confirm that sensible transitions have been identified. However interactive character control requires careful planning of routes through the motion graph, to produce movement satisfying constraints, the most fundamental of which are the desired end pose (and position in the world, p_v), the distance that the character should walk (d_v), and the time it should take the character to get there (t_v). Under the motion graph representation, this corresponds to computing the optimal path routing us from the current frame of animation (i.e., the current motion capture frame being rendered) to a frame corresponding to the target key-pose elsewhere in the graph. Since motion graphs are often cyclic, there is a potentially unbounded number of possible paths. The optimal path is the one minimizing a cost function, expressed in terms of these four animation constraints (C_trans, C_time, C_dist, and C_space):

$$ C(P) = C_{\mathrm{trans}}(P) + \omega_{\mathrm{time}} C_{\mathrm{time}}(P) + \omega_{\mathrm{dist}} C_{\mathrm{dist}}(P) + \omega_{\mathrm{space}} C_{\mathrm{space}}(P). \qquad (6) $$

Studying each of these terms in turn, the cost of a path P is influenced by C_trans, reflecting the cost of performing all animation transitions along the path P. Writing the sequence of N_f edges (motion fragments) along this path as {f_j} where j = [1, N_f], this cost is a summation of the cost of transitioning at each motion graph node along that path:

$$ C_{\mathrm{trans}}(P) = \sum_{j=1}^{N_f - 1} D\bigl(f_j \mapsto f_{j+1}\bigr), \qquad (7) $$

where D(f_j ↦ f_{j+1}) expresses the cost of transitioning from the last frame of f_j to the first frame of f_{j+1}, computed by measuring the alignment of their respective point clouds p and p′ via D(p, p′) (Eq. 5). The timing cost C_time(P) is computed as the absolute difference between the target time t_v for the animation sequence and the absolute time time(P) taken to transition along the path P:

$$ C_{\mathrm{time}}(P) = \lvert \mathrm{time}(P) - t_v \rvert, \qquad \mathrm{time}(P) = N_f\, \Delta t, \qquad (8) $$

where Δt is the time taken to display a single frame of animation, e.g., Δt = 1/25 for 25 frames per second.

Similarly, the cost C_dist(P) is computed as the absolute difference between the target distance d_v for the character to travel and the absolute distance traveled dist(P), computed by summing the distance traveled for each frame comprising P:

$$ C_{\mathrm{dist}}(P) = \lvert \mathrm{dist}(P) - d_v \rvert, \qquad \mathrm{dist}(P) = \sum_{j=1}^{N_f - 1} \bigl\| \mathrm{P}(f_j) - \mathrm{P}(f_{j+1}) \bigr\|, \qquad (9) $$

where P(.) is a 2D projection operation, projecting the 3D point clouds p and p′ corresponding to the end frame of f_j and the start frame of f_{j+1}, respectively, to the 2D ground (x–z) plane and computing the centroid. The final cost C_space is computed similarly via centroid projection of the animation end-point, penalizing a large distance between the target end-point of the character and the end-point arising from the animation described by P:

$$ C_{\mathrm{space}}(P) = \bigl\| \mathrm{P}(f_{N_f}) - p_v \bigr\|. \qquad (10) $$

The three parameters ω_time, ω_dist, and ω_space are normalizing weights, typical values of which are ω_time = 1/10, ω_dist = 1/3, and ω_space = 1 (Arikan and Forsyth 2002). The optimal path P_opt for a given set of constraints is found by minimizing the combined cost function C(P) (Eq. 6):

$$ P_{\mathrm{opt}} = \operatorname*{argmin}_{P} C(P). \qquad (11) $$

An efficient approach using integer programming to search for the optimal path that best satisfies the animation constraints can be found in Huang et al. (2009) and is capable of running in real time for motion fragment datasets of several minutes. Note that C_trans can be precomputed for all possible motion fragment pairs, enabling run-time efficiencies – the total transition cost for a candidate path P is simply summed during search.
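The cost terms of Eqs. 6–10 translate directly into code. The sketch below scores a candidate path and finds a low-cost route by bounded exhaustive enumeration; the integer-programming search of Huang et al. (2009) is far more efficient but optimizes the same kind of objective. The fragment dictionary fields, the graph encoding, the search depth, and the reading of time(P) as the total frame count times Δt are illustrative assumptions.

import numpy as np

def path_cost(frags, trans_costs, p_target, d_target, t_target,
              dt=1.0 / 25.0, w_time=0.1, w_dist=1.0 / 3.0, w_space=1.0):
    # Eqs. 6-10 for one candidate path.
    #   frags       - motion fragments along the path; each is a dict with
    #                 'frames', 'start_xz', 'end_xz' (projected centroids)
    #   trans_costs - D(f_j -> f_{j+1}) values, one per transition (Eq. 7)
    c_trans = float(sum(trans_costs))                                     # Eq. 7
    c_time = abs(sum(f["frames"] for f in frags) * dt - t_target)         # Eq. 8
    c_dist = abs(sum(np.linalg.norm(np.subtract(a["end_xz"], b["start_xz"]))
                     for a, b in zip(frags[:-1], frags[1:])) - d_target)  # Eq. 9
    c_space = float(np.linalg.norm(np.subtract(frags[-1]["end_xz"],
                                               p_target)))                # Eq. 10
    return c_trans + w_time * c_time + w_dist * c_dist + w_space * c_space  # Eq. 6

def best_path(fragments, edges, start, p_target, d_target, t_target, max_len=8):
    # Eq. 11: exhaustive search over paths of up to max_len fragments.
    #   fragments - list of fragment dicts, indexed by integer id
    #   edges     - dict: fragment id -> list of (next_id, D(f_j -> f_{j+1}))
    best, best_cost = None, float("inf")

    def dfs(ids, costs):
        nonlocal best, best_cost
        c = path_cost([fragments[i] for i in ids], costs,
                      p_target, d_target, t_target)
        if c < best_cost:
            best, best_cost = list(ids), c
        if len(ids) < max_len:
            for nxt, d in edges.get(ids[-1], []):
                dfs(ids + [nxt], costs + [d])

    dfs([start], [])
    return best, best_cost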
Surface Motion Graphs Surface motion graphs (SMGs) extend the skeletal motion graph concept beyond joint angles, to additionally consider the 3D volume of the performer. This is important since the movement of 3D surfaces attached to the skeleton (e.g., hair or flowing clothing) is often complex, and simple concatenation of pre-animated or captured motion fragments without considering the movement of this surface geometry can lead to visual discontinuities between motion fragments. Consideration of surfaces, rather than skeletons, requires the motion graph pipeline to change only in two main areas. First, the definition of frame similarity, i.e., Eq. 5 must be modified to consider volume rather than joint positions. Second, the algorithm for interpolating similar frames to create smooth transitions must be substituted for a surface interpolation algorithm.
Fig. 6 Visualization of a spherical histogram computed from a character volume. Multiple video views (left) are combined to produce a volumetric estimate of the character (middle) which is quantized into a spherical (long-lat) representation at multiple radii from the volume centroid
3D Shape Similarity To construct a SMG, an alternative measure of frame similarity using 3D surface information is adopted, reflecting the same three desirable properties of similarity measures outlined in subsection “Skeletal Motion Graphs.” A spherical histogram representation is calculated from the 3D character volume within the frame. The space local to the character’s centroid is decimated into sub-volumes, divided by equispaced lines of longitude and latitude – yielding a 2D array (histogram) that encodes the volume occupied by the character. Spherical histograms are computed over a variety of radii, as depicted in Fig. 6 (right), yielding a three-dimensional stack of 2D spherical histograms. The SMG is computed as with skeletal motion graphs, through an optimization process that attempts to align each video frame to every other – resulting in a matrix of similarity measurements between frames. The similarity between the spherical histograms H_r(.) at radius r of the 3D character meshes Q_a and Q_b is computed by:

$$ D(Q_a, Q_b) = \min_{\phi} \frac{1}{R} \sum_{r=1}^{R} \omega_r \bigl\| H_r(Q_a, 0) - H_r(Q_b, \phi) \bigr\|, \qquad (12) $$
where H(x, φ) indicates a spherical histogram computed over a given mesh x, rotated about the y axis (i.e., the axis of longitude) by φ degrees. In practice this rotation can be performed by cycling the columns of the 2D histogram, obviating any expensive geometric transformations; an exhaustive search across φ = [0, 359] degrees is recommended in Huang et al. (2009). The use of the model centroid, followed by optimization for φ, fulfills property (1), i.e., rotational and translational invariance in the comparison. The resulting 2D matrix of inter-frame comparisons is low-pass filtered as before to introduce temporal coherence, satisfying property (2). The weights ω_r, set for each radial layer of the spherical histogram, control the importance of detail as distance from the centroid increases, satisfying (3). Transition Generation Due to the comparatively high number of degrees of freedom of a 3D surface, it is much more likely that the start and end frames of a pair of
motion fragments f_j and f_{j+1} selected on an optimal path P_opt will not exactly match. To mitigate any visual discontinuities on playback, a short transition sequence is introduced to morph the former surface (S_j) into the latter (S_{j+1}). This transition sequence is substituted in for a small number of frames (L) before and after the transition point. Writing this time interval k = [−L, L], a blending weight α(k) is computed:

$$ \alpha(k) = \frac{k + L}{2L}, \qquad (13) $$

and a nonlinear mesh blending algorithm (such as the Laplacian deformation scheme of Tejera et al. (2013)) is applied to blend S_j ↦ S_{j+1} weighted by α(k).
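Both Eq. 12 and Eq. 13 are compact to implement once the spherical histograms are available; the rotation search below uses the column-cycling trick described above (np.roll over the longitude axis). The histogram layout (radius × latitude × longitude), the choice of an absolute-difference norm, and the radial weights are illustrative assumptions.

import numpy as np

def shape_distance(H_a, H_b, w_r):
    # Eq. 12: minimum over rotations phi of the weighted per-radius histogram
    # difference. H_a, H_b are (R, n_lat, n_lon) stacks of spherical
    # histograms; rotating the mesh about the vertical axis is equivalent to
    # cycling the longitude columns of every layer.
    R, _, n_lon = H_a.shape
    best = np.inf
    for shift in range(n_lon):                           # exhaustive phi search
        diff = np.abs(H_a - np.roll(H_b, shift, axis=2))
        per_radius = diff.sum(axis=(1, 2))               # one value per radius
        best = min(best, float(np.dot(w_r, per_radius)) / R)
    return best

def blend_weight(k, L):
    # Eq. 13: linear cross-fade weight over the transition window, k in [-L, L].
    return (k + L) / (2.0 * L)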
Parametric Motion Graphs Parametric motion graphs (PMG) extend classical motion graphs (subsection “Skeletal Motion Graphs”) by considering not only the concatenation but also the blending of motion fragments to synthesize animation. A simple example is a captured sequence of a walk and a run cycle. By blending these two motion fragments together, one can create a cycle of a walk, a jog, a run, or anything in-between. Combined with the concatenation of cycles, this leads to a flexibility and storage efficiency not available via classic methods – a PMG requires only a single example of each kind of motion fragment, whereas a classical approach would require pre-captured fragments of walks and runs at several discrete speeds. Parametric extensions have been applied to both skeletal (Heck and Gleicher 2007) and surface motion graphs (Casas et al. 2013). Provided a mechanism exists for interpolating a character model (joint angles or 3D surface) between two frames, the method can be applied. Without loss of generality, we consider surface motion graphs (SMGs) here. SMGs assume the availability of 4D performance capture data, i.e., a single 3D mesh of constant topology deforming over time to create character motion (subsection “Multi-View Human Pose Estimation”). We consider a set of N temporally aligned 4D mesh sequences Q = {Q_i(t)} for i = [1, N] motion fragments. Since vertices are in correspondence, it is possible to interpolate frames from such sequences directly by interpolating vertex positions in linear or piecewise linear form. We define such an interpolating function b(Q, w) yielding an interpolated mesh Q_B(t, w) at time t given a vector of weights w expressing how much influence each of the meshes from the N motion fragments at time t should contribute to that interpolated mesh:

$$ Q_B(t, \mathbf{w}) = b(Q, \mathbf{w}), \qquad (14) $$

where w = {w_i} is a set of normalized weights w_i ∈ [0, 1] driving a mesh blending function b(.) capable of combining meshes at above 25 frames per second for interactive character animation.
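Because all frames share one mesh topology, the blend function b(Q, w) of Eq. 14 can be realized, in its simplest linear form, as a weighted sum of corresponding vertex positions. The array shapes below are illustrative assumptions, and the piecewise-linear scheme discussed later in this subsection improves on this basic blend.

import numpy as np

def blend_meshes(vertex_sets, w):
    # Eq. 14 (linear form): Q_B(t, w) = b(Q, w) as a weighted vertex blend.
    #   vertex_sets - (N, v, 3) array of vertex positions of the N motion
    #                 fragments at (time-warped) time t, identical topology
    #   w           - (N,) blend weights, each in [0, 1]
    # Returns the (v, 3) blended vertex positions.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # guard: renormalize the weights
    return np.tensordot(w, np.asarray(vertex_sets, dtype=float), axes=1)

# Example: a 30%/70% walk-run blend of two time-aligned frames.
# blended = blend_meshes(np.stack([walk_frame, run_frame]), [0.3, 0.7])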
Fig. 7 Three animations each generated from a parametric motion graph (Casas et al. 2013). First row: walk to run (speed control). Second row: short to long horizontal leap (distance, i.e., leap length control). Third row: short to high jump (height control). Final row: animation from a parametric motion graph embedded within an outdoor scene under interactive user control (Casas et al. 2014) (time lapse view)
Several steps are necessary to deliver parametric control of the motion fragments: time warping to align pairs of mesh sequences (which may differ in length) so that they can be meaningfully interpolated, the blending function b(.) to perform the interpolation, and a mapping from high-level “user” parameters supplied by the motion controller to low-level blend weights w. Considerations such as path planning through the graph remain, as with classical motion graphs, but must be extended since the solution space now includes arbitrary blendings of motion fragments, as well as the concatenation of those blended fragments. Exhaustively searching this solution space is expensive, motivating real-time methods to make PMGs feasible for interactive character animation.
Mesh Sequence Alignment Mesh sequences are aligned using a continuous time warping function t = f(t_u) where the captured timebase t_u is mapped in a nonlinear fashion to a normalized range t = [0, 1] so as to align poses. The technique is described in Witkin and Popovic (1995). Although coarse results are obtainable without mesh alignment, failure to properly align sequences can lead to artifacts such as foot skate. Real-Time Mesh Blending Several interpolation schemes can be employed to blend a pair of captured poses. Assuming a single mesh has been deformed to track throughout the 4D performance capture source data (i.e., all frames have constant topology), a simple linear interpolation between 3D positions of corresponding vertices is a good first approximation to a mesh blend. Particularly in the presence of rotation, however, such approximations yield unrealistic results. A high-quality solution is to use differential coordinates, i.e., a Laplacian mesh blend (Botsch and Sorkine 2008); however solution of the linear system comprising a 3v × 3v matrix of vertex positions, where v is of the order of 10^5, is currently impractical for interactive animation. Therefore a good compromise can be obtained using a piecewise linear interpolation (Casas et al. 2013), which precomputes offline a set of nonlinearly interpolated meshes (e.g., via Botsch and Sorkine (2008)); any requested parametric mesh is then produced by weighted linear blending of the closest two precomputed meshes. The solution produces more realistic output, in general, than linear interpolation at the same computational cost. High-Level Control High-level parametric control is achieved by learning a mapping function f(w) between the blend weights w and the high-level motion parameters p, e.g., from the motion controller. A mapping function w = f^{-1}(p) is learned from the high-level parameters to the blend weights required to generate the desired motion. This is necessary as the blend weights w do not provide an intuitive parameterization of the motion. Motion parameters p are high-level user-specified controls for a particular class of motions, such as speed and direction for a walk or run and height and distance for a jump. The inverse mapping function f^{-1} from parameters to weights can be constructed by a discrete sampling of the weight space w and evaluation of the corresponding motion parameters p. Parametric Motion Planning PMGs dispense with the notion of precomputed transition points, since offline computation of all possible transition and blend possibilities between, e.g., a pair of mesh sequences would yield an impractical number of permutations to permit real-time path finding. We consider instead a continuous “weight-time” space with the weight modeling the blend between one mesh sequence (e.g., a walk) and another (e.g., a run). We consider motion planning as the problem of finding a route through this space, taking us from a source time and pose (i.e., weight combination) to a target time and pose. Figure 8 illustrates such a route finding process. The requirement for smooth motion dictates we may only modify the weight or time in small steps, yielding a “fanning out” or trellis of possible paths from the source point in weight-time space. The
Fig. 8 Real-time motion planning under a parametric motion graph. Routes are identified between trellises fanned out from the source pose Qs and end pose Qd. The possible (red) and optimal (green) paths are indicated (illustration only)
optimal path P_opt between two parametric points in that space is the one minimizing a cost function balancing mesh similarity E_S(P) and the time taken, i.e., latency E_L(P), to reach that pose:

$$ P_{\mathrm{opt}} = \operatorname*{argmin}_{P \in \Omega} \; E_S(P) + \lambda E_L(P), \qquad (15) $$

where λ defines the trade-off between transition similarity and latency. The transition path P is optimized over a trellis of frames as in Fig. 8, starting at frame Q_s(t_s, w_s) and ending at Q_d(t_d, w_d), where Q_s and Q_d are interpolated meshes (Eq. 14). The trellis is sampled forward and backward in time at discrete intervals in time Δt and parameters Δw up to a threshold depth in the weight-time space. This defines a set of candidate paths Ω comprising the transitions between each possible pair of frames in the source and target trellis. For a candidate path P, the latency cost E_L(P) is measured as the number of frames in the path P between the source and target frames. The transition similarity cost E_S(P) is measured as the similarity in mesh shape and motion at the transition point between the source and target motion space for the path P, computable via Eq. 12 for mesh data (or, if using purely skeletal mocap data, via Eq. 5). Casas et al. (2012) proposed a method based on precomputing a set of similarities between the input data and interpolating these at run-time to solve routing between the two trellises at interactive speeds. Figure 7 provides examples of animation generated under this parametric framework.
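A sketch of the trellis search behind Eq. 15 is given below: candidate transition frames are fanned out from the source and target points in weight-time space in small steps Δt and Δw, and every source/target pair is scored by a shape-similarity term plus a latency penalty. The mesh_similarity hook stands in for Eq. 12 (or Eq. 5 for skeletal data); the step sizes, depth, and λ are illustrative, and the run-time precomputation scheme of Casas et al. (2012) is not reproduced here.

import numpy as np
from itertools import product

def trellis(t0, w0, dt, dw, depth, direction):
    # Fan out candidate (time, weight, latency) nodes from a point in
    # weight-time space; direction = +1 samples forward in time (source),
    # -1 samples backward (target).
    nodes = []
    for i in range(1, depth + 1):
        for j in range(-i, i + 1):
            t = float(np.clip(t0 + direction * i * dt, 0.0, 1.0))
            w = float(np.clip(w0 + j * dw, 0.0, 1.0))
            nodes.append((t, w, i))          # i = latency in steps
    return nodes

def plan_transition(src, dst, mesh_similarity, dt=0.04, dw=0.1,
                    depth=5, lam=0.5):
    # Eq. 15: choose the (source node, target node) pair minimizing
    # E_S + lambda * E_L, where mesh_similarity(ts, ws, td, wd) evaluates the
    # similarity of the two interpolated meshes (e.g. via Eq. 12).
    best, best_cost = None, np.inf
    for (ts, ws, ls), (td, wd, ld) in product(
            trellis(src[0], src[1], dt, dw, depth, +1),
            trellis(dst[0], dst[1], dt, dw, depth, -1)):
        cost = mesh_similarity(ts, ws, td, wd) + lam * (ls + ld)
        if cost < best_cost:
            best, best_cost = ((ts, ws), (td, wd)), cost
    return best, best_cost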
Summary and Future Directions This chapter has surveyed techniques for interactive character animation, broadly categorizing these as either data driven or physical simulation based. Arguably the major use cases for interactive character animation are video games and immersive
virtual experiences (VR/AR). In these domains, computational power is at a premium – developers must seek highly efficient real-time algorithms, maintaining high frame rates (especially for VR/AR) without compromising on animation quality. This has led interactive animation to trend toward data-driven techniques that sample, blend, and concatenate fragments of performance capture rather than spend cycles performing expensive online physical simulations. This chapter has therefore focused upon synthesis techniques that sample, concatenate, and blend motion fragments to create animation. The chapter began by surveying commercial technologies and academic research into performance capture. Although commercial systems predominantly focus upon skeletal motion capture, research in 4D performance capture is maturing toward practical solutions for simultaneous capture of skeletal and surface detail. The discussion of motion graphs focused upon their original use for skeletal data and their more recent extensions to support 4D surface capture, as well as their parametric variants that enable blending of sequences in addition to their concatenation. Physical simulation-based approaches for character animation were examined within the context of interactive animation, deferring broader discussion of this topic to chapter C-2. Open challenges remain for interactive character animation, particularly around expressivity and artistic control. Artistic directors will often request editing of animation to move in a particular style (jaunty, sad), adjustments that can be performed manually in professional tools such as Maya or MotionBuilder (both Autodesk) but that cannot be applied automatically in an interactive character animation engine. While work such as Brand and Hertzmann’s Style Machines (Brand and Hertzmann 2000) enables stylization of stand-alone skeletal mocap sequences, algorithms have yet to deliver the ability to modulate animation interactively, e.g., to react to emotional context in a game. An interesting direction for future research would be to integrate stylization and other high-level behavioral attributes into the motion graph optimization process.
Cross-References ▶ Biped Controller for Character Animation ▶ Blendshape Facial Animation ▶ Data-Driven Character Animation Synthesis ▶ Data-Driven Hand Animation Synthesis ▶ Depth Sensor Based Facial and Body Animation Control ▶ Example-Based Skinning Animation ▶ Eye Animation ▶ Hand Gesture Synthesis for Conversational Characters ▶ Head Motion Generation ▶ Laughter Animation Generation ▶ Physically-Based Character Animation Synthesis ▶ Real-Time Full Body Motion Control ▶ Real-Time Full Body Pose Synthesis and Editing
▶ Video-Based Performance Driven Facial Animation ▶ Visual Speech Animation
Physically Based Character Animation Synthesis Jie Tan
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Physical Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simulation in Maximal Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simulation in Generalized Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Contact Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simulation Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Motion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Trajectory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Improving Realism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reducing Prior Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bringing the Character to the Real World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstract
Understanding and synthesizing human motion is an important scientific quest. It also has broad applications in computer animation. Research on physically based character animation in the last two decades has achieved impressive advances. A large variety of human activities can be synthesized automatically in a physically simulated environment. The two key components of physically based character animation are (1) physical simulation, which models the dynamics of humans and their environment, and (2) controller optimization, which optimizes the character's motions in the simulation. This approach has an inherent realism because we all live in a world that obeys physical laws, and we evolved to survive in this physical environment. In this chapter, we will review the state of the art of physically based character animation, introduce a few established methods in physical simulation and motor control, and discuss promising future directions.

J. Tan (*) Georgia Institute of Technology, Atlanta, GA, USA e-mail: [email protected]
# Springer International Publishing Switzerland 2016
B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_11-1

Keywords
Character animation • Physical simulation • Trajectory optimization • Reinforcement learning
Introduction

Mother Nature has created a diverse set of awe-inspiring motions in the animal kingdom: birds can fly in the sky, fish can swim in the water, geckos can crawl on vertical surfaces, and cats can reorient themselves in midair. Similarly, human motions exhibit efficiency (locomotion), agility (kung fu), gracefulness (ballet), and dexterity (hand manipulation). Studying these motions is not only a scientific quest that quenches our curiosity but also an important step toward synthesizing them in a way that can fundamentally change our lives. Character animation aims to faithfully synthesize the motions of humans and animals and display them to an audience for the purpose of entertainment, storytelling, and education. The synthesized motions need to appear realistic to give the audience an immersive experience.

In the last few decades, we have seen tremendous advances in character animation. Some of the most breathtaking movies, such as Harry Potter, Avatar, and Life of Pi, rely heavily on computer-generated animations. Nowadays, it is almost impossible for the audience to tell apart computer-synthesized motions from real footage. Behind these realistic animations lie countless hours of tedious manual work by highly specialized experts. For example, producing a 100-min feature film at Pixar can take dozens of artists and engineers more than 5 years of development. In today's animation pipeline, the most popular techniques are keyframing and motion capture, both of which require artistic expertise and laborious manual work. Even worse, the knowledge and effort that are put into one animation sequence are not necessarily generalizable to other motions. In my view, these are not efficient or principled ways of animation synthesis.

A principled way to synthesize character animation is to study the fundamental factors that have shaped our motions. Instead of focusing on the appearance of our motions, we need to dig deeper to understand why we move the way we do today. After understanding the root causes that have shaped our movements, we can then synthesize them naturally and automatically. Our motions are shaped through millions of years of optimization (evolution) in a world that obeys physical laws. This insight has motivated a new paradigm of physically based character animation. The two key components of this paradigm are physical simulation and motion control. We first build a physical simulation to model the physical world and then perform optimization to control the motions of characters so that they can move purposefully, naturally, and robustly in the simulated environment.
Although we often take our motions for granted since we can perform them so effortlessly, physically based character animation is a notoriously difficult problem because our motions involve sophisticated neuromuscular control, sensory information processing, motion planning, coordinated muscle activation, and complicated interactions with the surrounding physical environment. Even though we are still far from fully understanding the underlying control mechanisms that govern our motions, two decades of research in physically based character animation has brought us new insights, effective methodologies, and impressive results. The purpose of this chapter is to review the state of the art (section “State of the Art”), introduce some of the established algorithms (sections “Physical Simulation” and “Motion Control”), and discuss promising future research directions (section “Future Directions”) in physically based character animation.
State of the Art

Starting from the seminal work of Hodgins et al. (1995), controlling a physically simulated human character has been extensively studied in computer animation. A wide variety of human activities, including walking (Yin et al. 2007), running (Kwon and Hodgins 2010), swimming (Kwatra et al. 2009; Si et al. 2014), biking (Tan et al. 2014), dressing (Clegg et al. 2015), gymnastics (Hodgins et al. 1995), reacting to perturbations (Wang et al. 2010), falling and landing (Ha et al. 2012), and manipulating objects with hands (Liu 2009; Ye and Liu 2012; Bai and Liu 2014), have been realistically synthesized in physically simulated environments (Fig. 1).

Two widely used techniques in physically based character animation are trajectory optimization and reinforcement learning. Trajectory optimization formulates a constrained optimization to minimize a task-related objective function subject to physical constraints. It has been applied to control the iconic jumping Luxo Jr. lamp (Witkin and Kass 1988), humanoid characters (Liu and Popović 2002; Jain et al. 2009; Ye and Liu 2010), and characters with arbitrary morphologies (Wampler and Popović 2009). The resulting motions are physically plausible and follow animation principles such as anticipation and follow-through (Thomas and Johnston 1995). Reinforcement learning algorithms solve a Markov decision process (MDP) to find optimal actions at different states. When the MDP has moderate dimensionality, (fitted) value function iteration has been successfully applied to generalize motion capture data (Treuille et al. 2007; Levine et al. 2012), to carry out locomotion tasks (Coros et al. 2009), and to manipulate objects with hands (Andrews and Kry 2013). When the dimensionality is high, policy search (Ng and Jordan 2000) can directly search for a control policy without the need to construct a value function. Many studies on locomotion control (Yin et al. 2008; Wang et al. 2009, 2012; Coros et al. 2011; Geijtenbeek et al. 2013) performed policy search on parameterized controllers.

Although we have seen impressive advances over the last two decades, the gracefulness, agility, and versatility of real human motions remain unmatched.
Fig. 1 Various human activities, such as running, swimming, dressing, performing bicycle stunts, interacting with the environment, and manipulating clothes, are modeled in a physically simulated environment (Image courtesy of Hodgins et al. 1995; Si et al. 2014; Clegg et al. 2015; Tan et al. 2014; Coros et al. 2010; Bai and Liu 2014)
There are challenges in physically based character animation that need further investigation. First, controlling balance is a key problem when synthesizing human motions in a physically simulated environment. Balance can be maintained by exerting virtual forces (Pratt et al. 2001; Coros et al. 2010), applying linear feedback (Laszlo et al. 1996; Yin et al. 2007; da Silva et al. 2008; Coros et al. 2010), using nonlinear control policies (Muico et al. 2009), planning the contact forces (Muico et al. 2009; Tan et al. 2012b), employing reduced models (Tsai et al. 2010; Kwon and Hodgins 2010; Mordatch et al. 2010; Coros et al. 2010; Ye and Liu 2010), and training in stochastic environments (Wang et al. 2010). Although the balance problem in simple locomotion tasks, such as walking and running, has been solved, maintaining balance in tasks that require agile motions remains an open problem.

Another challenge is to effectively plan the contacts. We humans can only move ourselves and other objects through contacts. However, contact events (contact breakage, sliding, etc.) introduce nonsmooth forces into the dynamics, which break the control space into fragmented feasible regions. As a result, a small change in control parameters can easily generate bifurcated consequences. For this reason, many previous methods explicitly assumed that the contacts remain static (Abe et al. 2007; Jain et al. 2009; Kim and Pollard 2011) while optimizing controllers. This assumption significantly restricts the effectiveness of the controller because the controller is not allowed to actively exploit contact breakage, slipping contacts, or rolling contacts to achieve control goals.
Three promising research directions to tackle this challenge are contact-invariant optimization (Mordatch et al. 2012, 2013), QPCC (Tan et al. 2012b), and policy search with stochastic optimization (Wu and Popović 2010; Wang et al. 2010; Mordatch et al. 2010).

An important criterion in character animation is the realism of the synthesized motions. There is still large room to improve the quality of physically based character animation. One possible cause of the unnatural motions is the vast simplification of the human models. To improve the realism, prior work has simulated the dynamics of muscles and demonstrated the complex interplay among bones, muscles, ligaments, and other soft tissues for individual body parts, including the neck (Lee and Terzopoulos 2006), upper body (Zordan et al. 2006; DiLorenzo et al. 2008; Lee et al. 2009), lower body (Wang et al. 2012), and hands (Tsang et al. 2005; Sueda et al. 2008). However, constructing such a sophisticated biological model for a full human character is computationally prohibitive. An alternative solution is to augment a physically controlled character with realistic motion capture streams (da Silva et al. 2008; Muico et al. 2009; Liu et al. 2010).
Physical Simulation

Physically based character animation consists of two parts, simulation and control. This section concentrates on simulation, while the next section focuses on control. Although the majority of research in physically based character animation focuses on control, a good understanding of physical simulation is essential for designing effective controllers, because complex human behaviors often require sophisticated controllers that exploit the dynamics of a multi-body system.

In physically based character animation, a human character is often represented as an articulated rigid-body system (Fig. 2 left), a group of rigid bodies chained together through rotational joints. These joints can have different numbers of degrees of freedom (DOFs). For example, the shoulder is a ball joint (three DOFs), the wrist is a universal joint (two DOFs), and the elbow is a hinge joint (one DOF). In some cases, if the character's motion involves dexterous hand manipulation, a detailed hand model (Fig. 2 right) is attached to each wrist. Note that the articulated rigid-body system is a dramatic but necessary simplification, since simulating each bone, muscle, and tendon that a real human has would require a prohibitively huge amount of computational resources.

In the simulation, the articulated figure is represented as a tree structure. Each node is a rigid body and each edge is a joint. One node can have multiple children but at most one parent. The root node has no parent. While any body can be selected as the root node, a common choice is to use the torso as the root. In this tree structure, loops are not allowed. Although it is possible to simulate loops, such cases are rare in character animation and will not be discussed here. There are two major methods to simulate the dynamics of an articulated rigid-body system: simulation in maximal coordinates (Cartesian space) and simulation in generalized coordinates (joint space).
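As a concrete illustration of this tree structure, the following minimal Python sketch stores bodies as nodes and the joints to their parents as edges. The class layout, joint names, and mass values are illustrative choices, not part of any particular simulator:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Body:
    """A node of the articulated figure: one rigid body."""
    name: str
    mass: float
    joint_type: str = "ball"        # joint connecting this body to its parent
    parent: Optional["Body"] = None
    children: List["Body"] = field(default_factory=list)

    def add_child(self, child: "Body") -> "Body":
        child.parent = self
        self.children.append(child)
        return child

# A tiny humanoid fragment: the torso is the root, as suggested in the text.
torso = Body("torso", mass=30.0, joint_type="free")                 # 6-DOF root
upper_arm = torso.add_child(Body("upper_arm", 2.0, "ball"))         # shoulder
lower_arm = upper_arm.add_child(Body("lower_arm", 1.5, "hinge"))    # elbow
hand = lower_arm.add_child(Body("hand", 0.5, "universal"))          # wrist

def count_dofs(body: Body) -> int:
    """Sum the joint DOFs over the tree rooted at `body`."""
    dofs = {"free": 6, "ball": 3, "universal": 2, "hinge": 1}
    return dofs[body.joint_type] + sum(count_dofs(c) for c in body.children)

print(count_dofs(torso))  # 6 + 3 + 1 + 2 = 12
```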
Fig. 2 Articulated figures in character animation to represent a human character (left) and a hand (right)
Simulation in Maximal Coordinates

In maximal coordinates, the physical state of the articulated figure is defined for each node (rigid body). Each body has six degrees of freedom: three translational and three rotational. The dynamics of each rigid body is considered independently, and a list of additional joint constraints is imposed to ensure that adjacent bodies stay together at the joint locations. The dynamic equation for each body is
$$
\begin{bmatrix} m\,\mathbf{I}_{3\times 3} & \mathbf{0} \\ \mathbf{0} & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \dot{\mathbf{v}} \\ \dot{\boldsymbol{\omega}} \end{bmatrix}
=
\begin{bmatrix} m\mathbf{g} \\ -\boldsymbol{\omega} \times \mathbf{I}\boldsymbol{\omega} \end{bmatrix}
+
\begin{bmatrix} \mathbf{f} \\ \boldsymbol{\tau} \end{bmatrix}
+
\begin{bmatrix} \mathbf{0} \\ \boldsymbol{\tau}^{a} \end{bmatrix}
\qquad (1)
$$
where $m$ and $\mathbf{I}$ are the mass and the inertia tensor of the body; $\mathbf{I}_{3\times 3}$ is a $3 \times 3$ identity matrix; $\mathbf{v}$ and $\boldsymbol{\omega}$ are the linear and angular velocities; $[\mathbf{f}, \boldsymbol{\tau}]^T$ are the passive forces and torques from joint constraints, contacts, gravity, and other external sources; and $\boldsymbol{\tau}^a$ are the torques actively exerted by the controllers, which are the focus of section "Motion Control."

Joints that connect two rigid bodies constrain their relative motions. A different number of constraints is imposed according to the type of joint. For example, a hinge joint has only one DOF; thus it has five constraints that eliminate all but the rotation about the hinge axis. A ball joint has three DOFs; its constraints eliminate the relative translation at the joint location. Suppose a joint connects body A and body B; the translational constraints are
$$
\begin{bmatrix} \mathbf{I}_{3\times 3} & -[\mathbf{r}_A] & -\mathbf{I}_{3\times 3} & [\mathbf{r}_B] \end{bmatrix}
\begin{bmatrix} \mathbf{v}_A \\ \boldsymbol{\omega}_A \\ \mathbf{v}_B \\ \boldsymbol{\omega}_B \end{bmatrix}
= \mathbf{0}
$$
where $[\mathbf{r}]$ is the skew-symmetric matrix of $\mathbf{r}$, the vector from the center of mass (COM) of the body to the joint location. The rotational constraints are $\mathbf{d}_i^{T}(\boldsymbol{\omega}_A - \boldsymbol{\omega}_B) = 0$, where $\mathbf{d}_i$ is an axis perpendicular to the rotational DOFs and the index $i$ ranges over a subset of {0, 1, 2}, depending on the type of joint.

To allow a character to actively control its motion, actuators are attached to the joints. According to Newton's third law, the two bodies connected to a common actuator receive equal and opposite joint torques, $\boldsymbol{\tau}^a_A + \boldsymbol{\tau}^a_B = \mathbf{0}$, where $\boldsymbol{\tau}^a_A$ and $\boldsymbol{\tau}^a_B$ are the torques exerted by the actuator on body A and body B, respectively.

Although simple to understand and implement, simulating characters in maximal coordinates has a few drawbacks. First, the state representation is redundant: it is not efficient to use all six DOFs of a rigid body and then eliminate most of them with joint constraints. Second, accumulating numerical errors in the simulation cause the joint constraints not to be satisfied exactly; eventually, joints will dislocate and adjacent bodies can drift apart. Both of these shortcomings can be overcome by simulation in generalized coordinates.
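To make Eq. 1 concrete, here is a minimal sketch that integrates a single unconstrained rigid body with semi-implicit Euler. The mass, inertia values, and time step are illustrative, the inertia tensor is treated as constant, and the joint-constraint forces of a full maximal-coordinate simulator are omitted:

```python
import numpy as np

m = 2.0                                    # mass (illustrative value)
I = np.diag([0.02, 0.05, 0.04])            # inertia tensor (illustrative)
g = np.array([0.0, -9.81, 0.0])

x = np.zeros(3)                            # position
v = np.zeros(3)                            # linear velocity
w = np.array([0.0, 3.0, 0.5])              # angular velocity
dt = 1.0 / 240.0                           # a typical small simulation step

def step(x, v, w, f_ext=np.zeros(3), tau_ext=np.zeros(3), tau_a=np.zeros(3)):
    # Linear part of Eq. 1:  m * dv/dt = m*g + f
    v_dot = g + f_ext / m
    # Angular part of Eq. 1:  I * dw/dt = -w x (I w) + tau + tau^a
    w_dot = np.linalg.solve(I, -np.cross(w, I @ w) + tau_ext + tau_a)
    v_new = v + dt * v_dot
    w_new = w + dt * w_dot
    x_new = x + dt * v_new                 # semi-implicit: use updated velocity
    return x_new, v_new, w_new

for _ in range(240):                       # simulate one second of free motion
    x, v, w = step(x, v, w)
print(x, v)
```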
Simulation in Generalized Coordinates

In generalized coordinates, the physical state $\mathbf{q}, \dot{\mathbf{q}}$ of the articulated figure is defined on the edges of the tree (the joint angles). Note that the root node is attached to the world space via a 6-DOF joint that can translate and rotate freely. Each DOF is one component of $\mathbf{q}$, and the number of DOFs equals the dimensionality of $\mathbf{q}$. In other words, there is no redundancy in this representation. The dynamic equation in generalized coordinates for an articulated rigid-body system is

$$ \mathbf{M}(\mathbf{q})\ddot{\mathbf{q}} + \mathbf{C}(\mathbf{q},\dot{\mathbf{q}}) = \mathbf{Q} + \boldsymbol{\tau}^{a} \qquad (2) $$
where $\mathbf{q}$, $\dot{\mathbf{q}}$, and $\ddot{\mathbf{q}}$ are the position, velocity, and acceleration in generalized coordinates; $\mathbf{M}(\mathbf{q})$ is the mass matrix; $\mathbf{C}(\mathbf{q}, \dot{\mathbf{q}})$ accounts for the Coriolis and centrifugal forces; $\mathbf{Q}$ is the external generalized force, including gravity and contact forces; and $\boldsymbol{\tau}^a$ is the generalized force exerted by the controller (section "Motion Control"). This equation can be derived from Lagrangian dynamics. The detailed derivation is omitted here but can be found in Liu and Jain (2012).

When articulated rigid bodies are simulated in generalized coordinates, it is often necessary to convert physical quantities back and forth between generalized and maximal coordinates. For example, we need to compute the velocity at a certain point on the articulated figure in Cartesian space for collision detection.
We also need to convert forces from Cartesian space to generalized coordinates in order to apply contact forces. The Jacobian matrix $\mathbf{J}$ bridges these two coordinate systems:

$$ \mathbf{J} = \frac{\partial \mathbf{x}}{\partial \mathbf{q}} \qquad (3) $$
It captures how much a point $\mathbf{x}$ moves in Cartesian space when the joint angles $\mathbf{q}$ change slightly. The two most frequently used conversion formulas are $\mathbf{v} = \mathbf{J}\dot{\mathbf{q}}$ for velocities and $\mathbf{Q} = \mathbf{J}^{T}\mathbf{f}$ for forces; more conversion formulas can be found in Liu and Jain (2012).

Simulation in generalized coordinates is widely used in physically based character animation. Although it takes more effort to walk through the derivation, it has important advantages over simulation in maximal coordinates. Most obviously, the representation is more compact: there is no redundancy and thus no need to use constraints to eliminate redundant states. More importantly, it ensures that the joint constraints are satisfied exactly. Two connected bodies can never drift apart, even with numerical errors, because dislocated joints are simply not representable in the generalized-coordinate state space.
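The sketch below illustrates these conversions on a planar two-link arm, computing the Jacobian of Eq. 3 by finite differences. The link lengths, joint angles, and applied force are illustrative:

```python
import numpy as np

L1, L2 = 0.4, 0.3   # link lengths (illustrative)

def end_effector(q):
    """Cartesian position x of the arm tip for joint angles q = [q1, q2]."""
    q1, q2 = q
    return np.array([L1 * np.cos(q1) + L2 * np.cos(q1 + q2),
                     L1 * np.sin(q1) + L2 * np.sin(q1 + q2)])

def jacobian(q, eps=1e-6):
    """Finite-difference approximation of J = dx/dq."""
    J = np.zeros((2, 2))
    for i in range(2):
        dq = np.zeros(2); dq[i] = eps
        J[:, i] = (end_effector(q + dq) - end_effector(q - dq)) / (2 * eps)
    return J

q = np.array([0.3, 0.8])
q_dot = np.array([1.0, -0.5])
J = jacobian(q)

v = J @ q_dot                  # Cartesian velocity of the tip (v = J q_dot)
f = np.array([0.0, -10.0])     # a downward force applied at the tip
Q = J.T @ f                    # equivalent generalized forces (Q = J^T f)
print(v, Q)
```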
Contact Modeling

Most of our daily activities, such as locomotion and hand manipulation, involve interacting with our surrounding environments through contacts. Accurately simulating contacts and computing contact forces are crucial to physically based character animation. The penalty method and the linear complementarity problem (LCP) are two widely used methods to model contacts.
Penalty Method

When a body A penetrates another body B, a repulsive penalty force $\mathbf{f}_c$ is exerted to separate the two bodies:

$$ \mathbf{f}_c = \begin{cases} k\,d\,\mathbf{n} & \text{if } d > 0, \\ \mathbf{0} & \text{otherwise,} \end{cases} \qquad (4) $$
where k is the stiffness, d is the penetration depth, and n is the contact normal. The penalty method is trivial to implement. However, to make it work properly, tedious manual tuning is often needed: if k is too small, the force cannot effectively stop the penetration, while if k is too large, it leads to an undesirably bouncy collision response. Even worse, when simulating with large time steps, the penalty method can make the
simulation unstable. In addition, it is not clear how to accurately model static friction using penalty methods.
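As a simple illustration of Eq. 4 (and of why k needs tuning), the sketch below computes the penalty force for a sphere against the ground plane; the stiffness value is purely illustrative:

```python
import numpy as np

k = 5.0e4                      # penalty stiffness (illustrative, hand-tuned)
n = np.array([0.0, 1.0, 0.0])  # contact normal of the ground plane y = 0

def penalty_force(center, radius):
    d = radius - center[1]     # penetration depth of the sphere into the ground
    if d > 0.0:
        return k * d * n       # Eq. 4: f_c = k d n
    return np.zeros(3)

print(penalty_force(np.array([0.0, 0.08, 0.0]), radius=0.1))  # penetrating
print(penalty_force(np.array([0.0, 0.50, 0.0]), radius=0.1))  # separated
```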
Linear Complementarity Problem

The linear complementarity problem (LCP) is a more accurate and stable way to model contacts. A contact force $\mathbf{f}_c$ can be decomposed into normal and tangential (frictional) components,

$$ \mathbf{f}_c = f_{\perp}\,\mathbf{n} + \mathbf{B}\,\mathbf{f}_{\parallel} $$

where $\mathbf{n}$ is the contact normal; $f_{\perp}$ and $\mathbf{f}_{\parallel}$ are the normal and tangential components, respectively; and $\mathbf{B}$ is a set of bases that span the tangential plane (Fig. 3). The more bases $\mathbf{b}_i$ are used, the more accurate the approximation of the friction cone, but the more computation is needed to solve the resulting LCP. The LCP imposes a set of constraints to satisfy the conditions of Coulomb friction:

1. In the normal direction, only repulsive forces are exerted to stop penetration.
2. In a static contact situation, the contact force lies within the friction cone.
3. In a sliding contact situation, the contact force lies at the boundary of the friction cone, and the friction direction is opposite to the sliding direction.

I will illustrate the concept of LCP using the formulation in the normal direction. The formulation in the tangential directions is beyond the scope of this chapter; it can be found in the tutorials by Lloyd (2005) and Tan et al. (2012a). In a physical simulation, after the collisions are resolved, the relative velocity between the contact points of two colliding bodies can only be zero (resting) or positive (separating), but not negative (penetrating):

$$ v_{\perp} \ge 0 \qquad (5) $$
Fig. 3 A linearized friction cone used in LCP formulation. Left: a foot is in contact with the ground. Right: the friction cone at the contact point. n is the contact normal, and bi are a set of tangential bases
Similarly, the normal contact force can be zero (no force) or positive (repulsive force), but not negative (sticking force):

$$ f_{\perp} \ge 0 \qquad (6) $$
The repulsive normal force exists ($f_{\perp} > 0$) if and only if the two bodies are in contact ($v_{\perp} = 0$). In contrast, when they are separating ($v_{\perp} > 0$), there is no contact force ($f_{\perp} = 0$). In other words, a complementarity condition needs to be satisfied:

$$ v_{\perp}\, f_{\perp} = 0 \qquad (7) $$
Combining the dynamic equations (Eq. 1 or 2) and the LCP constraints (Eqs. 5, 6, and 7) forms a mixed LCP problem. It can be solved efficiently by direct (Lloyd 2005) or iterative solvers (Erleben 2007; Kaufman et al. 2008; Otaduy et al. 2009).
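For intuition, the sketch below solves a tiny normal-direction LCP of the form of Eqs. 5, 6, and 7 with projected Gauss-Seidel. The relation v = A f + b between normal impulses and post-step normal velocities would normally be assembled from Eq. 1 or 2; here A and b are illustrative numbers:

```python
import numpy as np

def solve_lcp_pgs(A, b, iterations=100):
    """Find f >= 0 with v = A f + b >= 0 and f * v = 0 (projected Gauss-Seidel)."""
    f = np.zeros(len(b))
    for _ in range(iterations):
        for i in range(len(b)):
            # Solve row i for f[i] with the other impulses fixed,
            # then clamp to f[i] >= 0 (Eq. 6).
            r = b[i] + A[i] @ f - A[i, i] * f[i]
            f[i] = max(0.0, -r / A[i, i])
    return f

A = np.array([[2.0, 0.5],
              [0.5, 1.5]])
b = np.array([-1.0, 0.3])      # the negative entry means this contact would penetrate
f = solve_lcp_pgs(A, b)
v = A @ f + b                  # resulting normal velocities
print(f, v)                    # f >= 0, v >= 0, and f * v ~= 0 (Eqs. 5-7)
```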
Simulation Software

There is a growing need for simulation software that can accurately simulate the complex dynamics of virtual humans and their interactions with their surrounding environment. A number of open-source physical simulators are readily available for research in physically based character animation. The popular ones include the Open Dynamics Engine (ODE) (http://www.ode.org/), Bullet (http://www.bulletphysics.org/), the Dynamic Animation and Robotics Toolkit (DART) (http://dartsim.github.io/), and MuJoCo (http://www.mujoco.org). All of them can simulate articulated rigid bodies with an LCP-based contact model in real time. These simulators allow the user to specify the structure of the articulated figure, the shape and the physical properties of each body, the types of joints, and other parameters describing the environment. Different simulators may offer different features, speed, and accuracy. Erez et al. (2015) provided an up-to-date review and an in-depth comparison of these modern physics engines.
Motion Control

We humans can carefully plan our motions, coordinately and purposefully, and exercise our muscles to achieve a wide variety of high-level tasks, ranging from simple locomotion and dexterous hand manipulation to highly skillful stunts. To model them in animation, simulating the passive dynamics is not enough. The key challenge that motion control tackles is to find controllers that can achieve high-level motion tasks (e.g., walk at 1 m/s, or grasp a bottle and open the cap). In character animation, a controller is the character's "algorithmic brain" that decides how much torque ($\boldsymbol{\tau}^a$ in Eqs. 1 and 2) is needed at each joint to successfully fulfill the task in a way that mimics human behavior. Optimization-based motion control is the most extensively researched topic in physically based character animation.
Fig. 4 Different stages of walking in SIMBICON (Image courtesy of Yin et al. 2007)
(Diagram: a four-state cycle in which state 0 is left stance and state 2 is right stance, each lasting 0.3 s, with the remaining transitions triggered by right and left foot strike.)
The optimization searches for a controller that minimizes a task-related cost function, subject to dynamical constraints. One common misunderstanding is that one can formulate a single large optimization for arbitrary tasks. Due to the complexity of human motions and the nonlinearity of the dynamics, a large optimization may have competing objectives and many local minima, and to date there are no efficient optimization algorithms that can reliably find meaningful controllers in such cases. For this reason, a common practice in this field is to decompose a high-level task into multiple lower-level subtasks and to formulate a smaller optimization for each of the simpler subtasks. For example, in SIMBICON (Yin et al. 2007), a walking cycle is decomposed into multiple stages (Fig. 4). Within each stage, separate optimizations can be used for controlling the two legs, the upper body, the balance, and the style. After solving all the optimizations, these low-level controllers can be combined so that the character can walk naturally and robustly. Controller decomposition depends on the task and requires domain knowledge; we refer the readers to the research literature to learn controller decomposition on a case-by-case basis. In this section, we will discuss two generic optimization-based methods of motion control: trajectory optimization and reinforcement learning.
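As a small illustration of such stage decomposition, the following sketch implements a SIMBICON-style finite state machine for the walking cycle of Fig. 4. It is heavily simplified: only the state switching is shown, and in a real controller each state would select its own target poses, feedback gains, and balance strategy:

```python
STATE_DURATION = 0.3   # seconds spent in each timed stance state (Fig. 4)

class WalkingStateMachine:
    def __init__(self):
        self.state = 0          # 0/1: left stance side, 2/3: right stance side
        self.time_in_state = 0.0

    def update(self, dt, left_foot_strike=False, right_foot_strike=False):
        self.time_in_state += dt
        if self.state in (0, 2) and self.time_in_state >= STATE_DURATION:
            self._transition()                      # timed transition
        elif self.state == 1 and right_foot_strike:
            self._transition()                      # contact-driven transition
        elif self.state == 3 and left_foot_strike:
            self._transition()
        return self.state       # the state selects the per-stage controllers

    def _transition(self):
        self.state = (self.state + 1) % 4
        self.time_in_state = 0.0

fsm = WalkingStateMachine()
print(fsm.update(0.31))                              # leaves state 0 after 0.3 s
print(fsm.update(0.01, right_foot_strike=True))      # foot strike: state 1 -> 2
```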
Trajectory Optimization

Starting from the classical paper "Spacetime Constraints" (Witkin and Kass 1988), trajectory optimization has become a mainstream technique in physically based character animation. It searches for a controller that minimizes a cost function subject to physical constraints. The general form of the optimization is
$$
\min_{\mathbf{x},\,\mathbf{u}} \; \sum_{t=0}^{N} g(\mathbf{x}_t, \mathbf{u}_t)
\quad \text{subject to} \quad
\mathbf{x}_{t+1} = h(\mathbf{x}_t) + \mathbf{B}_t \mathbf{u}_t
\qquad (8)
$$
where $\mathbf{x}$ is the physical state and $\mathbf{u}$ is the control. In character animation, the state is usually defined as $\mathbf{x} := [\mathbf{q}, \dot{\mathbf{q}}]^T$, and the control as $\mathbf{u} := \boldsymbol{\tau}^a$. $g$ is the cost function, which is handcrafted to reflect the high-level task. For example, if the task is to walk at 1 m/s, one term in the cost function could be the distance between the current COM of the character and a desired COM position that moves at 1 m/s. The constraint usually consists of the dynamic equation $h$. Note that in most applications of character animation, the dynamics are nonlinear in the state $\mathbf{x}$ but linear in the control $\mathbf{u}$ (see Eqs. 1 and 2). In addition, the constraints can also include joint limits, torque limits, and other task-related requirements.

To make this more concrete, we will revisit the simple example in the original "Spacetime Constraints" paper: controlling a single particle. The task of the particle is to fly from point $\mathbf{a}$ to point $\mathbf{b}$ in $T$ seconds using a time-varying jet force $\mathbf{f}(t)$. The dynamics of the particle are $m\ddot{\mathbf{x}} - \mathbf{f} - m\mathbf{g} = 0$, where $\mathbf{x}$ is its position, $m$ is its mass, and $\mathbf{g}$ is gravity. The goal of the flight is to minimize the total fuel consumption $\int_0^T |\mathbf{f}|^2\,dt$. After discretization in time, the optimization has the following form:
$$
\min_{\mathbf{x},\,\mathbf{f}} \; \sum_{t=0}^{N} |\mathbf{f}_t|^2
\quad \text{subject to} \quad
\mathbf{x}_{t+1} = 2\mathbf{x}_t - \mathbf{x}_{t-1} + \frac{\Delta t^2}{m}\mathbf{f}_t + \Delta t^2\,\mathbf{g}, \qquad
\mathbf{x}_0 = \mathbf{a}, \quad \mathbf{x}_N = \mathbf{b}
$$
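This small problem can be handed directly to an off-the-shelf solver. The sketch below uses SciPy's SLSQP routine (an SQP-type solver); the mass, horizon, and endpoints are illustrative, and the choice of which time steps carry a jet force is one of several possible discretizations:

```python
import numpy as np
from scipy.optimize import minimize

m, N, T = 1.0, 20, 2.0
dt = T / N
g = np.array([0.0, -9.81])
a, b = np.array([0.0, 0.0]), np.array([3.0, 1.0])

def unpack(z):
    x = z[: 2 * (N + 1)].reshape(N + 1, 2)   # positions x_0 .. x_N
    f = z[2 * (N + 1):].reshape(N - 1, 2)    # jet forces at the interior steps
    return x, f

def objective(z):
    _, f = unpack(z)
    return np.sum(f ** 2)                    # total fuel consumption

def constraints(z):
    x, f = unpack(z)
    c = [x[0] - a, x[N] - b]                 # boundary conditions x_0 = a, x_N = b
    for t in range(1, N):                    # discretized particle dynamics
        c.append(x[t + 1] - 2 * x[t] + x[t - 1]
                 - (dt ** 2 / m) * f[t - 1] - dt ** 2 * g)
    return np.concatenate(c)

z0 = np.zeros(2 * (N + 1) + 2 * (N - 1))
res = minimize(objective, z0, method="SLSQP",
               constraints={"type": "eq", "fun": constraints},
               options={"maxiter": 500})
x_opt, f_opt = unpack(res.x)
print(res.success, x_opt[0], x_opt[-1])      # trajectory starts at a, ends at b
```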
It is not too difficult to extend the above derivation to control a human character: we need to change the control force $\mathbf{f}(t)$ to joint torques $\boldsymbol{\tau}^a(t)$, the physical constraint to the dynamic equation of articulated rigid bodies (Eq. 1 or 2), and the cost function to a relevant function specific to the task.

There are different options for solving the optimization, depending on the structure of the problem. Assuming the cost function and the dynamic equations are smooth, Witkin and Kass (1988) applied a generic nonlinear optimizer, sequential quadratic programming (SQP), to solve the problem. It is an iterative optimization technique that solves a sequence of quadratic programs that approximate the original problem. The solution of SQP is an optimal trajectory of state $\mathbf{x}(t)$ and control $\mathbf{u}(t)$. Note that this method produces a feedforward (open-loop) controller: a trajectory over time. It cannot be generalized to the neighboring regions of the state space. As a result, the controller will fail the task even with a slight disturbance to the motion.

When the cost function is quadratic and the dynamic equation is linear,
$$
\min_{\mathbf{x},\,\mathbf{u}} \; \sum_{t=0}^{N} \left( \mathbf{x}_t^T \mathbf{Q}_t \mathbf{x}_t + \mathbf{u}_t^T \mathbf{R}_t \mathbf{u}_t \right)
\quad \text{subject to} \quad
\mathbf{x}_{t+1} = \mathbf{A}_t \mathbf{x}_t + \mathbf{B}_t \mathbf{u}_t
\qquad (9)
$$
the trajectory optimization is called an LQ problem. This problem can be solved very efficiently by a linear-quadratic regulator (LQR). The derivation of LQR can be found in most optimal control textbooks (Todorov 2006), so we will not repeat it here. The solution is a feedback (closed-loop) controller $\mathbf{u}_t = \mathbf{K}_t \mathbf{x}_t$.

Although the requirement of linear dynamics seems restrictive, LQR still plays an important role in physically based character animation. One important application is to design a physically based controller to track motion capture data, which is an effective way to increase the realism of the synthesized motions. Given a motion capture sequence $\bar{\mathbf{x}}$, we can linearize the dynamic equation in its vicinity:

$$ \Delta\mathbf{x}_{t+1} = \frac{\partial h}{\partial \mathbf{x}}\,\Delta\mathbf{x}_t + \mathbf{B}_t \mathbf{u}_t + h(\bar{\mathbf{x}}_t) - \bar{\mathbf{x}}_{t+1} $$
where $\Delta\mathbf{x} = \mathbf{x} - \bar{\mathbf{x}}$. This gives an LQ problem that seeks a feedback controller $\mathbf{u}_t = \mathbf{K}_t \Delta\mathbf{x}_t$ that minimizes the difference between the actual and the reference motion over the entire trajectory.

More importantly, LQR is a building block for solving the more general trajectory optimization problem (Eq. 8). Given an initial trajectory $\mathbf{x}_0, \mathbf{u}_0, \mathbf{x}_1, \mathbf{u}_1, \ldots, \mathbf{u}_N, \mathbf{x}_N$, we can perform the following steps iteratively:

1. Compute the LQ approximation of the original problem (Eq. 8) around the current trajectory by computing a first-order Taylor expansion of the dynamics and a second-order expansion of the cost function.
2. Use LQR to solve the LQ approximation to get an optimal controller.
3. Apply the current optimal controller to generate a new trajectory.
4. Go to step 1 until convergence.

This iterative-LQR process is similar to the core idea behind differential dynamic programming (DDP). We refer the interested reader to Todorov (2006) for a more thorough discussion of LQR and DDP. The key advantage of DDP is that it provides not only a feedforward trajectory but also an optimal feedback controller near that trajectory.

To sum up, trajectory optimization is an effective way to synthesize character animation. The synthesized motion is not only physically correct but, more importantly, demonstrates an important animation principle: anticipation. Because the objective of trajectory optimization is to minimize a long-term cost, the character can move intelligently, giving up short-term gains to minimize the long-term cost. For example, for a jump-up task, trajectory optimization could produce a controller that sacrifices the height of the character's COM first for a much higher jump later. However, there are a few shortcomings of trajectory optimization. First, it often leads to a high-dimensional optimization that is expensive to solve. Another
problem of high-dimensional nonlinear optimization is that the solver is more likely to get stuck at bad local minima. Thus, a good initialization is extremely important. Last but not least, trajectory optimization exploits the mathematical form of dynamic equations to design the optimal controller. If the dynamics is not smooth, too complicated, or unknown, it is not clear how to apply trajectory optimization methods.
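Before moving on, here is a sketch of the finite-horizon LQR solution mentioned above: a backward Riccati recursion that produces the time-varying gains K_t, demonstrated on an illustrative double-integrator system (the dynamics and cost weights are not taken from any particular character model):

```python
import numpy as np

dt = 0.05
A = np.array([[1.0, dt], [0.0, 1.0]])     # A_t: position/velocity integrator
B = np.array([[0.0], [dt]])               # B_t
Q = np.diag([10.0, 1.0])                  # Q_t: penalize state deviation
R = np.array([[0.1]])                     # R_t: penalize control effort
N = 100

# Backward pass: P_N = Q, then recurse to obtain K_t for t = N-1, ..., 0.
P = Q.copy()
gains = [None] * N
for t in reversed(range(N)):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains[t] = K

# Forward pass: apply the feedback controller u_t = -K_t x_t
# (the text's u_t = K_t x_t simply absorbs the sign into K_t).
x = np.array([1.0, 0.0])                  # start one unit away from the target
for t in range(N):
    u = -gains[t] @ x
    x = A @ x + B @ u
print(x)                                  # the state is driven toward zero
```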
Reinforcement Learning

Reinforcement learning is motivated by the learning process of humans. It optimizes a controller by interacting with the physical environment over numerous trials. Initially, the controller tries out random moves. If a desired behavior is observed, a reward is provided as positive reinforcement. This reward system gradually shapes the controller until it eventually fulfills the high-level task. Reinforcement learning is an active research area, with a large number of algorithms proposed every year. We refer readers to Kaelbling et al. (1996) and Kober et al. (2013) for a thorough review. We will focus on policy search in the remainder of this section. Policy search is a popular reinforcement learning algorithm in physically based character animation. It performs extremely well in this field because it can solve control problems with high-dimensional continuous state and action spaces, which is essential for controlling a human character.

Mathematically, reinforcement learning solves a Markov decision process (MDP). An MDP is a tuple $(S, A, R, D, \gamma, P_{sa s'})$, where $S$ is the state space and $A$ is the action space. The states reflect what the current situation is, and the actions are what a character can perform to achieve the specified task. $R$ is the reward function, a mapping from the state-action space to a real number, $R: S \times A \mapsto \mathbb{R}$, which evaluates how good the current state and action are. $D$ is the distribution of the initial state ($s_0 \sim D$), and $\gamma \in [0, 1]$ is the discount factor of the reward over time. $P_{sa s'}$ is the transition probability: it gives the probability that the next state is $s'$ if an action $a$ is taken at the current state $s$. In physically based character animation, the transition probability is computed by physical simulation. Although most physical simulations are deterministic, random noise can be added in simulation to increase the robustness of the learned controller (Wang et al. 2010).

The solution of an MDP is a control policy $\pi$ that maps the state space to the action space, $\pi: S \mapsto A$. It decides what action to take in each situation. The return of a policy is the accumulated reward along the state trajectory starting at $s_0$ and following the policy $\pi$ for $N$ steps:

$$ V^{\pi}(s_0) = \sum_{i=0}^{N} \gamma^{\,N-i}\, R\bigl(s_i, \pi(s_i)\bigr) $$
The reward at earlier states can be exponentially discounted over time. The value of a policy is the expected return with respect to the random initial state $s_0$ drawn from $D$:

$$ V(\pi) = \mathbb{E}_{s_0 \sim D}\left[ V^{\pi}(s_0) \right] \qquad (10) $$
Note that the goal of the MDP is not to maximize the short-term reward $R$ at the next state, but a long-term value function $V$. Using $V$ instead of $R$ as the optimization target prevents the controller from applying short-sighted greedy strategies. This agrees with our ability to plan over the long term when executing our motions.

To formulate an MDP, we need to design the state space $S$, the action space $A$, and the reward function $R$ for a given task. Ideally, the state space should contain all the possible states of the articulated rigid-body system, including the joint angles $\mathbf{q}$, joint velocities $\dot{\mathbf{q}}$, and time $t$, and the actions should include all the joint torques $\boldsymbol{\tau}^a$. However, this means that the state and action spaces can have hundreds of dimensions. Due to the curse of dimensionality, solving an MDP in such a high-dimensional continuous space is computationally infeasible. In practice, researchers often carefully select states and actions specifically for the task in question to make the computation tractable. For example, if the task is to keep balance while standing, the state space only needs to include important features for balance, such as the character's COM and the center of the ground contact polygon. Similarly, given well-known balance strategies, such as the ankle strategy and the hip strategy, the action space can be as simple as a few torques at the lower-body joints. Using prior knowledge of the task can greatly simplify the problem, which is a common practice in physically based character animation.

A reward function for character animation usually consists of two parts: a task-related component that measures how far the current state is from the goal state and a component that evaluates the naturalness of the motion. Designing a good reward function is essential to the success of the entire learning algorithm. A good reward should be a smooth function that gives continuous positive reinforcement whenever progress is made. Mathematically, this design provides gradient information that can guide optimization solvers. In contrast, a common mistake is to give a reward only when the task is achieved, which makes the reward function a narrow spike surrounded by flat zeros. This should be avoided because nearly all optimization algorithms would have trouble finding such a spike.

To apply policy search to solve the MDP, we need to parameterize the policy. A policy can be an arbitrary function; a practical way to optimize it is to parameterize it and then search for the optimal policy parameters. Commonly used parameterizations include lookup tables, linear functions, splines, and neural networks. The parameterization determines the potential quality of the final policy. However, there is no consensus on the best way to parameterize a policy; it is decided on a case-by-case basis. Once the policy parameterization is decided, an initial policy is iteratively evaluated and improved until the optimization converges or a user-specified maximum number of iterations is reached.
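The following sketch contrasts the two reward designs discussed above for a hypothetical "walk at 1 m/s" task. The feature choices and weights are illustrative, not taken from any published controller:

```python
import numpy as np

TARGET_SPEED = 1.0   # desired COM speed (m/s)

def smooth_reward(com_velocity, joint_torques):
    # Task term: continuous positive reinforcement as the speed approaches the
    # target; effort term: a mild penalty encouraging less extreme torques.
    task = np.exp(-(com_velocity - TARGET_SPEED) ** 2)
    effort = 1e-3 * np.sum(np.square(joint_torques))
    return task - effort

def spiky_reward(com_velocity, joint_torques):
    # The design to avoid: reward only when the task is (almost) exactly
    # achieved, flat zero elsewhere, giving the optimizer no guidance.
    return 1.0 if abs(com_velocity - TARGET_SPEED) < 0.01 else 0.0

torques = np.zeros(12)
for v in (0.2, 0.8, 1.0):
    print(v, smooth_reward(v, torques), spiky_reward(v, torques))
```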
To evaluate a policy, we can execute it in the simulation for $N$ time steps with different initial states $s_0 \sim D$, accumulate the rewards, and average the returns to compute the value of the policy. Policy improvement adjusts the policy parameters to increase this value. A straightforward way is to follow the policy gradient (Ng and Jordan 2000). However, in many character animation tasks, such as locomotion and hand manipulation, contact events happen frequently, which introduces nonsmooth contact forces that invalidate the policy gradient. For this reason, sample-based stochastic optimization techniques are particularly suited to physically based character animation. The covariance matrix adaptation evolution strategy (CMA-ES) (Hansen 2006) is the most frequently applied optimization method for motion control. CMA works as long as we can evaluate the value of the policy: it does not need to compute gradients and does not rely on good initializations. More importantly, CMA is a "global" search algorithm that explores multiple local optima. Although there is no guarantee that CMA will converge to the global optimum, in practice it often finds good local optima in moderately high-dimensional control spaces (e.g., 20–30 dimensions).

For the completeness of the chapter, we briefly describe the CMA algorithm; readers can refer to the original paper (Hansen 2006) for additional details. CMA starts with an initial underlying Gaussian distribution in the policy parameter space with a large covariance matrix. A population of samples is drawn from this distribution; because of the large covariance matrix, the first generation of samples is not biased in the parameter space. Each CMA sample represents a control policy. The policies are evaluated through simulation, sorted according to their values, and a certain percentage of the inferior samples is discarded. The underlying Gaussian distribution is updated according to the remaining good samples and is used to generate the next generation of samples. This process is performed iteratively. Over the iterations, the underlying distribution is shifted and narrowed, and eventually it converges to a good region of the policy space. The best CMA sample over all iterations is selected as the optimal control policy.

In summary, reinforcement learning is a generic method for motion control. It can automatically learn a wide range of behaviors through simulation trials. Reinforcement learning, more specifically policy search, is becoming one of the most popular approaches in character animation synthesis. Reinforcement learning does not assume any mathematical form of the dynamic equation: it treats the physical simulation as a black box, as long as the simulation can output the next state given the current state and action. Thus, it is not bound to a particular dynamics model, and the same learning algorithm can still work even if the simulation software is upgraded. The main challenge of reinforcement learning is to design the states, the actions, the reward, and the policy parameterization. We need to inject enough prior knowledge into the design so that the search space is small enough to be computationally feasible, but not so small that it no longer contains effective policies. This requires a lot of experience and manual tweaking, especially for challenging motion tasks.
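The sketch below captures the sampling loop just described. It is a simplified evolution strategy in the spirit of CMA-ES, without CMA's full covariance-adaptation machinery, and the policy evaluation is a toy stand-in for running the physics simulation and accumulating rewards:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, POP, ELITE, ITERS = 20, 32, 8, 50   # illustrative sizes

def evaluate_policy(theta):
    # Placeholder for V(pi_theta): in practice, simulate the character with the
    # parameterized controller and average the returns over initial states.
    return -np.sum((theta - 0.5) ** 2)

mean = np.zeros(DIM)
cov = np.eye(DIM) * 1.0            # start with a large covariance (broad search)

best_theta, best_value = mean, -np.inf
for _ in range(ITERS):
    samples = rng.multivariate_normal(mean, cov, size=POP)
    values = np.array([evaluate_policy(s) for s in samples])
    elites = samples[np.argsort(values)[-ELITE:]]            # keep the best samples
    mean = elites.mean(axis=0)                               # shift the distribution
    cov = np.cov(elites, rowvar=False) + 1e-6 * np.eye(DIM)  # narrow it
    if values.max() > best_value:
        best_value = values.max()
        best_theta = samples[np.argmax(values)]

print(best_value)   # approaches 0 as theta approaches the optimum
```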
Future Directions

The research on physically based character animation has achieved stunning results in the past two decades. However, we are still far from the ultimate goal of character animation: a fully automatic system that can synthesize visually realistic motions that are comparable to those of real humans. Many interesting and challenging research problems remain. Here I list a few promising future research directions.
Improving Realism

One of the biggest issues of physically based character animation is its quality. The synthesized motions are not yet realistic enough for broader applications, and they are currently not comparable to the quality of animations that are hand-tuned by artists. One important reason is that the articulated rigid-body systems widely used today are a vast simplification of the real human body. A real human has 206 bones and over 600 muscles, which amounts to far more degrees of freedom. In addition, the joints of the articulated rigid body are controlled independently, but human joints move in coordination due to the intricate arrangement of muscles and tendons. A recent trend is to build more sophisticated human models based on biological musculotendon structures (Lee and Terzopoulos 2006; Lee et al. 2009; Wang et al. 2012). It has been demonstrated that an accurate human model can dramatically improve the realism of the synthesized motions. As more computational power becomes available in the next decade, I expect that we can soon afford to use highly detailed human models to synthesize character animation with high fidelity.

Another source of unnaturalness is the handcrafted objective functions used in motion control. These objective functions focus mostly on the energy-efficiency aspect of motion. However, efficient motions do not equal natural motions. Although minimizing energy expenditure is one important factor that governs our motion, it is not the only factor. Our motions are also governed by personal habits, emotion, task, environment, and many other external factors, and it is extremely challenging to handcraft objective functions for all of them. Assuming that we have abundant motion data, which is a realistic assumption given the large number of motion sensors installed in phones and other wearable computing devices, it is promising to extract objective functions from these data using inverse reinforcement learning (Ng and Russell 2000).
Reducing Prior Knowledge

Although physically based character animation frees us from much manual work of traditional animation pipelines, it still requires some high-level prior knowledge to work effectively. For example, we know that regulating the COM of a character relative to the contact points is important for balance tasks. We can inject this prior
knowledge by manually choosing the COM and the ground contact points as features and including them in the state space of reinforcement learning. Selecting the right features (prior knowledge) is crucial for many of the current control algorithms. However, good features for one task may not carry over to different tasks. Manually selecting features would not scale to more sophisticated characters, more complicated environments, or more challenging tasks. We need algorithms that can discover control strategies with less or even no prior knowledge. This reminds me of the recent success of deep learning. The way that we use hand-engineered features in reinforcement learning today is analogous to using HOG or SIFT features in computer vision a few years ago. Recent advances in computer vision have demonstrated that deep neural networks, such as autoencoders (Vincent et al. 2008) or restricted Boltzmann machines (Hinton 2012), can learn features automatically. I believe that the next breakthrough in reinforcement learning is to employ similar techniques to automatically discover important features for different motion tasks.
Bringing the Character to the Real World

The recent development of physically based character animation has introduced a set of powerful computational tools. With these tools, natural, agile, and robust motions can be synthesized efficiently and autonomously in a simulation. However, creating lifelike robots is still an extremely challenging, trial-and-error process that is restricted to experts. The fast evolution of 3D printing technology will soon trigger a shift in the robotics industry from mass production to personalized design and fabrication, which will result in an immediate need for a faster, cheaper, and more intuitive way to design robotic controllers. I believe that the computational tools developed in physically based character animation can potentially automate and streamline this process if we can transfer the controllers from the virtual simulation to the real world.

Transferring controllers optimized in a simulation onto a real robot is a nontrivial task. An optimal controller that works in a state-of-the-art simulation often fails in a real environment. This is known as the Reality Gap. The gap is caused by various simplifications in the simulation, including inaccurate physical models, unmodeled actuator dynamics, assumptions of perfect sensing, and zero latency. To fully tap into the power of the computational tools, we need to develop more accurate physical simulations to bridge the Reality Gap. Researchers in physically based character animation have started to investigate this problem (Bharaj et al. 2015; Megaro et al. 2015). I believe that with further research and development, the Reality Gap will shrink rapidly, which will make it easier to transfer controllers from the simulation to the real world. As a result, I envision that the two separate research fields of character animation and robotics will eventually start to merge. This will inevitably trigger a fundamental revolution in both character animation and robotics.
References Abe Y, da Silva M, Popovic’ J (2007) Multiobjective control with frictional contacts. In: Proceedings of the 2007 ACM SIGGRAPH/Eurographics symposium on Computer animation. SCA”07. pp 249–258 Andrews S, Kry P (2013) Goal directed multi-finger manipulation: control policies and analysis. Comput Graph 37(7):830–839 Bai Y, Liu CK (2014) Coupling cloth and rigid bodies for dexterous manipulation. In: Proceedings of the seventh international conference on motion in games. MIG”14. ACM, pp 139–145 Bharaj G, Coros S, Thomaszewski B, Tompkin J, Bickel B, Pfister H (2015) Computational design of walking automata. In: Proceedings of the 14th ACM SIGGRAPH/eurographics symposium on computer animation. SCA”15. ACM, pp 93–100 Clegg A, Tan J, Turk G, Liu CK (2015) Animating human dressing. ACM Trans Graph 34 (4):116:1–116:9 Coros S, Beaudoin P, van de Panne M (2009) Robust task-based control policies for physics-based characters. ACM Trans Graph 28(5):170:1–170:9 Coros S, Beaudoin P, van de Panne M (2010) Generalized biped walking control. ACM Trans Graph 29(4):130, Article 130 Coros S, Karpathy A, Jones B, Reveret L, van de Panne M (2011) Locomotion skills for simulated quadrupeds. ACM Trans Graph 30(4):59 Da Silva M, Abe Y, Popovic’ J (2008) Interactive simulation of stylized human locomotion. In: ACM SIGGRAPH 2008 Papers. SIGGRAPH”08. ACM, pp 82:1–82:10 DiLorenzo PC, Zordan VB, Sanders BL (2008) Laughing out loud: control for modeling anatomically inspired laughter using audio. In: ACM SIGGRAPH Asia 2008 papers. SIGGRAPH Asia”08. pp 125:1–125:8 Erez T, Tassa Y, Todorov E (2015) Simulation tools for model-based robotics: comparison of bullet, havok, mujoco, ode and physx. In: ‘ICRA’, IEEE. pp 4397–4404 Erleben K (2007) Velocity-based shock propagation for multibody dynamics animation. ACM Transactions on Graphics (TOG), 26(2), Article No. 12. Geijtenbeek T, van de Panne M, van der Stappen AF (2013) Flexible muscle-based locomotion for bipedal creatures. ACM Trans Graph 32(6) Ha S, Ye Y, Liu CK (2012) Falling and landing motion control for character animation. ACM Trans Graph 31(6):1 Hansen N (2006) The cma evolution strategy: a comparing review. In: Towards a new evolutionary computation. Springer, New York, pp 75–102 Hinton GE (2012) A practical guide to training restricted Boltzmann machines. In: Montavon G, Orr GB, Mller K-R (eds) Neural networks: tricks of the trade, 2nd edn, Lecture notes in computer science. Springer, New York, pp 599–619 Hodgins JK, Wooten WL, Brogan DC, O’Brien JF (1995) Animating human athletics. In: SIGGRAPH. pp 71–78 Jain S, Ye Y, Liu CK (2009) Optimization-based interactive motion synthesis. ACM Trans Graph 28 (1):1–10 Kaelbling LP, Littman ML, Moore AP (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237–285 Kaufman DM, Sueda S, James DL, Pai DK (2008) Staggered projections for frictional contact in multibody systems. ACM Trans Graph 27:164:1–164:11 Kim J, Pollard NS (2011) Direct control of simulated non-human characters. IEEE Comput Graph Appl 31(4):56–65 Kober J, Bagnell JAD, Peters J (2013) Reinforcement learning in robotics: a survey. Int J Robot Res 32:1238 Kwatra N, Wojtan C, Carlson M, Essa I, Mucha P, Turk G (2009) Fluid simulation with articulated bodies. IEEE Trans Vis Comput Graph 16(1):70–80
Kwon T, Hodgins J (2010) Control systems for human running using an inverted pendulum model and a reference motion capture sequence. In: Proceedings of the 2010 ACM SIGGRAPH/eurographics symposium on computer animation. SCA”10. Eurographics Association, pp 129–138 Laszlo J, van de Panne M, Fiume E (1996) Limit cycle control and its application to the animation of balancing and walking. In: Proceedings of the 23rd annual conference on computer graphics and interactive techniques. SIGGRAPH”96. ACM, pp 155–162 Lee S-H, Terzopoulos D (2006) Heads up! Biomechanical modeling and neuromuscular control of the neck. ACM Trans Graph 25(3):1188–1198 Lee S-H, Sifakis E, Terzopoulos D (2009) Comprehensive biomechanical modeling and simulation of the upper body. ACM Trans Graph 28:99:1–99:17 Levine S, Wang JM, Haraux A, Popovic’ Z, Koltun V (2012) Continuous character control with low-dimensional embeddings. ACM Trans Graph 31(4):28:1–28:10 Liu CK (2009) Dextrous manipulation from a grasping pose. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2009, 28(3), Article No. 59. Liu CK, Jain S (2012) A short tutorial on multibody dynamics, Technical report GIT-GVU-15-01-1, Georgia Institute of Technology, School of Interactive Computing Liu CK, Popovic’ Z (2002) Synthesis of complex dynamic character motion from simple animations. In: Proceedings of the 29th annual conference on computer graphics and interactive techniques. SIGGRAPH”02. ACM, pp 408–416 Liu L, Yin K, van de Panne M, Shao T, Xu W (2010) Sampling-based contact-rich motion control. ACM Trans Graph 29(4), Article 128 Lloyd J (2005) Fast implementation of Lemke’s algorithm for rigid body contact simulation. In: Proceedings of the 2005 I.E. international conference on robotics and automation. ICRA 2005. pp 4538–4543 Megaro V, Thomaszewski B, Nitti M, Hilliges O, Gross M, Coros S (2015) Interactive design of 3d-printable robotic creatures. ACM Trans Graph 34(6):216:1–216:9 Mordatch I, de Lasa M, Hertzmann A (2010) Robust physics-based locomotion using low-dimensional planning. In: ACM SIGGRAPH 2010 papers. SIG- GRAPH”10. ACM, pp 71:1–71:8 Mordatch I, Popovic’ Z, Todorov E (2012) Contact-invariant optimization for hand manipulation. In: Proceedings of the ACM SIGGRAPH/eurographics symposium on computer animation. SCA”12. Eurographics Association, pp 137–144 Mordatch I, Wang JM, Todorov E, Koltun V (2013) Animating human lower limbs using contactinvariant optimization. ACM Trans Graph 32(6):203:1–203:8 Muico U, Lee Y, Popovic’ J, Popovic’ Z (2009) Contact-aware nonlinear control of dynamic characters. In: ACM SIGGRAPH 2009 papers. SIGGRAPH”09. ACM, pp 81:1–81:9 Ng AY, Jordan M (2000) Pegasus: a policy search method for large MDPs and POMDPs. In: Proceedings of the sixteenth conference on uncertainty in artificial intelligence, UAI’00. Morgan Kaufmann Publishers, San Francisco, pp 406–415 Ng AY, Russell SJ (2000) Algorithms for inverse reinforcement learning. In: Proceedings of the seventeenth international conference on machine learning, ICML”00. Morgan Kaufmann Publishers, San Francisco, pp 663–670 Otaduy MA, Tamstorf R, Steinemann D, Gross M (2009) Implicit contact handling for deformable objects. Comput Graph Forum (Proc. of Euro- graphics) 28(2):559-568 Pratt JE, Chew C-M, Torres A, Dilworth P, Pratt GA (2001) Virtual model control: an intuitive approach for bipedal locomotion. Int J Robot Res 20(2):129–143 Si W, Lee S-H, Sifakis E, Terzopoulos D (2014) Realistic biomechanical simulation and control of human swimming. 
ACM Trans Graph 34(1):10:1–10:15 Sueda S, Kaufman A, Pai DK (2008) Musculotendon simulation for hand animation. ACM Trans Graph 27:83:1–83:8 Tan J, Siu K, Liu CK (2012a) Contact handling for articulated rigid bodies using lcp. Technical report GIT-GVU-15-01-2, Georgia Institute of Technology, School of Interactive Computing Tan J, Turk G, Liu CK (2012a) Soft body locomotion. ACM Trans Graph 31(4):26:1–26:11
Physically Based Character Animation Synthesis
21
Tan J, Gu Y, Liu CK, Turk G (2014) Learning bicycle stunts. ACM Trans Graph 33(4):50:1–50:12 Thomas F, Johnston O (1995) The illusion of life: Disney animation, Hyperion. Abbeville Press, New York, NY. Todorov E (2006) Optimal control theory. In: Bayesian brain: probabilistic approaches to neural coding. MIT Press, Cambridge, MA. pp 269–298 Treuille A, Lee Y, Popovic’ Z (2007) Near-optimal character animation with continuous control. ACM Trans Graph 26(3):7 Tsai Y-Y, Lin W-C, Cheng KB, Lee J, Lee T-Y (2010) Real-time physics-based 3D biped character animation using an inverted pendulum model. IEEE Trans Vis Comput Graph 16(2):325–337 Tsang W, Singh K, Eugene F (2005) Helping hand: an anatomically accurate inverse dynamics solution for unconstrained hand motion. In: Proceedings of the 2005 ACM SIGGRAPH/ eurographics symposium on computer animation. SCA”05. pp 319–328 Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning. ICML”08. ACM, pp 1096–1103 Wampler K, Popovic’ Z (2009) Optimal gait and form for animal locomotion. ACM Trans Graph 28 (3):60:1–60:8 Wang JM, Fleet DJ, Hertzmann A (2009) Optimizing walking controllers. ACM Trans Graph 28 (5):168:1–168:8 Wang JM, Fleet DJ, Hertzmann A (2010) Optimizing walking controllers for uncertain inputs and environments. ACM Trans Graph 29(4):73:1–73:8 Wang JM, Hamner SR, Delp SL, Koltun V (2012) Optimizing locomotion controllers using biologically-based actuators and objectives. ACM Trans Graph 31(4):25:1–25:11 Witkin A, Kass M (1988) Spacetime constraints. In: Proceedings of the 15th annual Conference on computer graphics and interactive techniques. SIG- GRAPH”88. ACM, pp 159–168 Wu J-C, Popovic’ Z (2010) Terrain-adaptive bipedal locomotion control. In: ACM SIGGRAPH 2010 papers. SIGGRAPH”10. ACM, pp 72:1–72:10 Ye Y, Liu CK (2010) Optimal feedback control for character animation using an abstract model. In: SIGGRAPH”10: ACM SIGGRAPH 2010 papers. ACM, New York, pp 1–9 Ye Y, Liu CK (2012) Synthesis of detailed hand manipulations using contact sampling. ACM Trans Graph 31(4):41:1–41:10 Yin K, Loken K, van de Panne M (2007) SIMBICON: simple biped locomotion control. In: ACM SIGGRAPH 2007 papers. SIGGRAPH”07 Yin K, Coros S, Beaudoin P, van de Panne M (2008) Continuation methods for adapting simulated skills. ACM Trans Graph 27(3):81 Zordan VB, Celly B, Chiu B, DiLorenzo PC (2006) Breathe easy: model and control of human respiration for computer animation. Graph Models 68:113–132
Data-Driven Hand Animation Synthesis Sophie Jörg
Abstract
As virtual characters are becoming more and more realistic, the need for recording and synthesizing detailed animations for their hands is increasing. Whether we watch virtual characters in a movie, communicate with an embodied conversational agent in real time, or steer an agent ourselves in a virtual reality application or in a game, detailed hand motions have an impact on how we perceive the character. In this chapter, we give an overview of current methods to record and synthesize the subtleties of hand and finger motions. The approaches we present include marker-based and markerless optical systems, depth sensors, and sensored gloves to capture and record hand motions and data-driven algorithms to synthesize movements when only the body or arm motions are known. We furthermore describe the complex anatomy of the hand and how it is being simplified and give insights into our perception of hand motions to convey why creating realistic hand motions is challenging.
Keywords
Hand motions • Fingers • Character animation • Data-driven animation • Virtual characters • Motion capture
Introduction In recent years, character animation has made tremendous steps toward realistic virtual agents, with increasingly better solutions for motion capturing body motions, creating highly realistic facial animation, and simulating cloth and hair. With these more and more realistic components, providing plausible hand and finger motions
has become highly important. We use our hands to explore our environment by touching and manipulating objects, to conduct basic tasks such as eating or writing, to handle complex tools, to create art pieces, or to play musical instruments (Napier 1980). Hand movements also play a crucial role in communicating information and can even take the main role in conveying meaning in sign languages. However, the complexity of the hand anatomy and our sensitive perception of small details in hand motions make it challenging to record or synthesize accurate hand motions. Furthermore, the difference in size between the hand and the rest of the body complicates the process of capturing both at the same time. Therefore, finger motions are typically animated manually, which is a cumbersome and time-intensive process for the animator. This chapter describes how detailed hand movements can be recorded and synthesized. We first explain why we need detailed hand motions for virtual characters and why these subtle motions are challenging to create, giving further details on the anatomy of the hands and on our perception of hand motions. After a brief state of the art, we then delve into the different ways to record and to synthesize hand motions, focusing on optical systems, depth sensors, and sensored gloves to capture hand motions and on data-driven algorithms to synthesize hand movements depending on the motions of the body. We conclude by describing some of the next challenges in the field.
Applications Why do we need detailed hand motions for virtual characters? Hand and finger motions are ubiquitous in our lives. They are such an integral part of our daily lives that we take those intricate motions for granted. We see and interpret them effortlessly without the need to think about them. However, even small differences in hand motions can change our interpretation of a scene (Jörg et al. 2010). Especially as the realism of virtual characters is increasing, the lack of detailed hand motions becomes disturbing. More details on our perception of hand motions are given in section "Perception of Hand Motions." The importance of hand motions varies with the application, the task, and the type of character. We start by describing the most common applications and tasks for which we need virtual hand motions. Of course, virtual hand motions are typically required for animated characters, be it for entertainment, education, or any other application. Specific situations when hand motions are crucial are during conversations, when manipulating objects, when playing instruments, or when using American Sign Language (ASL). Approaches have been suggested to create finger motions for each of these tasks. Jörg et al. (2012) synthesize hand motions for known body movements for conversational characters with a data-driven approach. Their method is presented in more detail in section "Synthesizing Data-driven Hand Motions." Manipulation tasks are bound by the physical constraints of the manipulated objects. Physics-based and data-driven approaches and combinations of both are very effective in that area
(Liu 2009; Pollard and Zordan 2005; Ye and Liu 2012). Algorithms have been developed to allow virtual characters to play the guitar (ElKoura and Singh 2003) or the piano (Zhu et al. 2013), and techniques have been refined to improve how to capture ASL (Huenerfauth and Lu 2010; Lu and Huenerfauth 2009). For the previously listed applications, hand and finger animations are mostly created without a time constraint. Real-time applications present additional challenges. For embodied conversational agents (ECAs), hand motions are particularly important. In conversations, finger motions can convey meaning and emphasis (McNeill 1992) and even personality and mood (Wang et al. 2016). In many approaches generating gestures for conversational characters, finger motions are not considered separately from the hand, and only a small number of noticeable hand shapes are synthesized, such as pointing with the index finger. As the quality of animation for embodied conversational agents is continuing to rise, the need for more accurate hand motions is increasing. The commercialization of new technologies in virtual reality (VR) produced further applications. Hand motions need to be tracked in real time so that a person in VR can see and use their hands. Furthermore, if multiple persons communicate in VR we need methods to create movements that accurately convey the meaning of their conversations. Finally, more realistic hand animations in games could increase immersion and presence. If a virtual character has to grab and manipulate a wide range of different objects, not all necessary hand shapes can be created in advance, and adjustments are needed in real time. After describing in which applications and for which tasks detailed hand and finger motions are most important, we will review what the challenges are when creating them.
Challenges What are the challenges when capturing or synthesizing hand and finger motions? The main difficulties come from the complex structure of the hand, which allows for intricate motions, its smaller size relative to the body, and people's impressive abilities to recognize and interpret subtleties in hand and finger motions.
Structure of the Hand In their reference work on anatomy and human movement, Palastanga and Soames (2012) characterize “the development of the hand as a sensitive instrument of precision, power and delicacy” as “the acme of human evolution.” The intricate structure of the hand that allows us to perform a wide range of actions is often taken for granted. The hand consists of 27 bones, not counting sesamoid bones. The arrangement of muscles, tendons, and ligaments enables a large variety of possible poses. The phalanges of the five digits – index, middle, ring, and little or pinky finger and the thumb – with their 14 bones in total are connected to the palm. The five metacarpal bones form the palm, and the eight carpal bones connect the metacarpals to the wrist (Jörg 2011; Napier 1980; Palastanga and Soames 2012).
For animation, in many cases, this skeleton is simplified. The eight carpal bones at the wrist are summarized into one wrist joint, and the metacarpals that form the palm might be represented by a simple, rigid structure. A further approximation concerns the joints. The complex spinning, rolling, and sliding motions of the joints are typically approximated with rotations around a point. The number of degrees of freedom of those joints can also be reduced. For example, the metacarpophalangeal joints are represented with Hardy Spicer joints with two degrees of freedom. In reality, a slight medial or lateral rotation of the fingers, so a rotation around the axis that goes through the length of the finger, is possible with the largest angle achievable passively for the little finger. A resulting simple hand skeleton has around 24 degrees of freedom (Jörg 2011; Liu 2008; Parent 2012). However, there is no single standard skeleton when it comes to character animation, and the exact characteristics vary widely based on the application and the desired level of realism. As a consequence of the anatomy of the hand and the arrangement of the tendons, different joints tend to be moved together such as the interphalangeal joints of each finger or the ring finger and the little finger (Häger-Ross and Schieber 2000; Jörg and O’Sullivan 2009). When animating or synthesizing hand motions, we can take advantage of this property by representing hand motions in a reduced set of dimensions (Braido and Zhang 2004; Ciocarlie et al. 2007; Santello et al. 1998). The joints do not move in perfect synchrony, so the dimensionality reduction is not lossless. How much a motion can be simplified and the dimensionality reduced depends on the required quality and detail of the motions. In summary, the structure of the hand is very complex, but can be simplified for animation depending on how accurate and natural the resulting motion should look. How much simplification is possible also depends on our ability to perceive and interpret hand motions.
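The reduced dimensionality of finger motion described above can be exploited directly in code. The following Python sketch is only a minimal illustration, assuming that recorded hand poses are available as a hypothetical N x D array of joint angles; the variable names and the number of components are placeholders rather than values from the cited studies.

    import numpy as np

    def build_pose_basis(poses, n_components=8):
        """poses: (N, D) array of recorded hand poses (D joint angles each)."""
        mean = poses.mean(axis=0)
        # Principal components via SVD of the mean-centered data matrix.
        _, _, vt = np.linalg.svd(poses - mean, full_matrices=False)
        return mean, vt[:n_components]          # (D,), (n_components, D)

    def reduce_pose(pose, mean, basis):
        return basis @ (pose - mean)            # low-dimensional coefficients

    def reconstruct_pose(coeffs, mean, basis):
        return mean + basis.T @ coeffs          # approximate full pose

In line with the studies cited above, a small number of components usually captures most of the variance because neighboring joints move together, although the reduction is not lossless.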
Perception of Hand Motions People are extremely skilled when it comes to recognizing and interpreting human full body motions. We can recognize friends from far away by their walks and posture before we can see their faces, and we can make a reasonable guess of characteristics such as the sex of a person just based on their motions (Cutting and Kozlowski 1977; Kozlowski and Cutting 1977). This process is effortless; it happens automatically without actively analyzing the motion. In a similar way, our interpretation of gestures during communication is mostly automatic – as is our usage of those gestures. A wide range of insights has been gained on the meaning and interpretation of hand and arm motions, mostly by observation of how we gesture when we communicate (Kendon 2004; McNeill 1992). The subtle motions of the fingers are an inherent part of gestures, but their exact use and perception are rarely examined separately. But as the detailed finger motions are difficult to capture and might be created separately from the body motions, it becomes important to find out when these details affect our impression of a character or a scenario. A perceptual experiment investigating the effect of small delays in finger motion compared to body motion found that viewers could even detect small
synchronization errors of 0.1 s in short motion clips of a few seconds. A 0.5 s delay in finger motion altered the interpretation of a 30 s long scenario (Jörg et al. 2010). It has been shown that animated hands and handlike structures can convey emotions (Samadani et al. 2011). More interestingly, hand poses and motions influence the perceived personality of a virtual character with and without the presence of body motions. For example, spreading motions are seen as more extraverted and open than flexion, and small hand motions are regarded as more emotionally stable and agreeable than large motions (Wang et al. 2016). Hand animation is thus essential when conveying meaning and creating convincing virtual characters. Our perception of virtual characters also depends on their appearance. The same body motions on a more realistic humanlike character are rated to be biological (in contrast to artificial) less often than when they are shown on a less detailed and more abstract character (Chaminade et al. 2009). There are fewer studies on this subject when it comes to hand motions, but it has been shown for hands as well that different brain areas are activated when we watch real and when we watch virtual hand actions (Perani et al. 2001). The appearance of virtual hands also has an impact on our perception in virtual reality applications, notably on the virtual hand illusion. The virtual hand illusion is a body ownership illusion: When one sees a virtual hand in virtual reality that is controlled by and moves in synchrony with one's own hand, after a short conditioning phase, a threat to the virtual hand can trigger an affective response as if the virtual hand was seen as a part of one's own body. That means that if a virtual knife hits the virtual hand, a user can get startled and quickly pull away their hand even if the virtual knife cannot do any real damage. Users feel to a large degree as if the virtual hand was their own. This illusion can be induced for a surprisingly large variety of models. It has been shown to occur for realistic hands, cartoony hands, a zombie and a robot hand, an abstract hand made of a torus and ellipsoids, for a cat claw, and for objects or shapes such as a square, a sphere, a balloon, and a wooden block. However, the illusion is much weaker for objects and strongest for realistic hands (Argelaguet et al. 2016; Lin and Jörg 2016; Ma and Hommel 2015a, b; Yuan and Steed 2010; Zhang and Hommel 2016). While the motion of the hand plays a crucial role in inducing the virtual hand illusion, it is not yet known how much offset, latency, or error is possible without destroying the illusion of ownership. While many questions remain to be solved when it comes to our perception of hand motions, evidence suggests that viewers are able to notice small details that can be crucial for our interpretation of a situation. These perceptual skills contribute to the challenges when creating detailed, realistic finger motions.
State of the Art Motion capturing has become a standard technique when it comes to creating highly realistic body motions for movies, games, or similar applications. While more effective and less expensive systems are still being developed, it has been possible for about two decades to capture body motions with sufficient accuracy for typical
applications with virtual characters. For finger motions, however, it is still not possible to capture accurate data in real time without restrictions. Many technologies and algorithms have been suggested to create detailed finger motions, such as automatically synthesizing them based on the body motions or using sensored gloves to measure the fingers' joint angles. With the rise of virtual reality applications in recent years, new demand for capturing hand motions has emerged. Consumer-market devices that capture only the hands have been developed, but these devices still lack high reliability and accuracy. A recent, detailed survey of research literature on finger and hand modeling and animation has been compiled by Wheatland et al. (2015).
Capturing Hand Motions Motion capture has become a widely adopted technology to animate realistic virtual characters for movies and games (Menache 1999). The detailed movements of the fingers are, however, challenging to capture. Several methods have been developed to accomplish this task, each with its advantages and drawbacks. Optical marker-based motion capture, sensored gloves, markerless optical motion capture, and depth sensors are the most popular solutions and are described in this section.
Optical Marker-Based Motion Capture Optical marker-based motion capture records the positions of retroreflective or LED markers and computes a skeleton based on that data. It provides a high accuracy compared to other capturing techniques. The optical motion capturing of finger movements requires careful planning. The denser the markers are placed, the smaller the markers need to be to avoid mislabelings and occurrences where multiple markers are mistaken as one by the cameras. For a rather comprehensive marker set with 20 markers (three on each digit and five on the palm), 6.5 mm spherical or 3/4 spherical retroreflective markers are recommended. The optical cameras used in such a system need to be placed closer to the performer than if only body motions were captured as a higher resolution is needed. Furthermore, cameras on the ground level should be added as the back of the hand where the markers are placed is directed downward during many gestures (Jörg 2011). Even with a careful setup, occlusions, where individual markers are hidden from the cameras, cannot be avoided. For example, when the hand forms a fist, the ends of the fingertips are hidden. Therefore, adequate post-processing is required, which typically involves arduous manual labor. These hurdles are the reason that in applications that do not require real-time animations, such as for animated movies, finger animations are typically created manually. A small number of markers or sensored gloves can be used for previsualization (Kitagawa and Windsor 2008). Several approaches have been suggested to compute an optimal sparse marker set to ensure better marker separation and identification. The detailed finger motions are
reconstructed based on that subset of markers, thus reducing the time required for post-processing. It is possible to use dimensionality reduction techniques such as principal component analysis to take advantage of the approximate redundancy in hand motions (Wheatland et al. 2013). Another approach tests which markers can be left out by reconstructing the hand pose from similar poses found in a database and verifying how different the resulting pose is (Mousas et al. 2014). The computed optimal marker sets vary with the approach and the example databases used. Common to most of them is a marker on the thumb and one marker at most on each finger, except for the index finger, where there might be two markers (Kang et al. 2012; Mousas et al. 2014; Schröder et al. 2015; Wheatland et al. 2013). Once the motions with the reduced marker set are recorded, the full-resolution motions are reconstructed using a database and searching for correspondences. Hoyet et al. (2012) evaluate the perception of a diverse set of finger motions including grasping, opening a bottle, and playing the flute, recorded with reduced marker sets. They use simple methods such as inverse kinematics and interpolation techniques to reconstruct the hand motions. They show that the required number of markers depends on the type of motion. For motions where the fingers only display secondary motions, a static hand pose might be good enough. For a majority of cases, a simple eight-marker hand model with six markers to capture the four fingers (four on the fingertips and two on the finger base of the index and pinky) and two markers to capture the thumb produces motions with sufficiently high quality so that viewers do not notice a difference from a full marker set. Still, for some motions a full marker set using forward kinematics is needed.
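As a rough sketch of the database lookup underlying the reduced-marker-set methods above, the reconstruction step can be written as follows. The array layout, variable names, and plain nearest-neighbor matching are simplifying assumptions; the published methods use more elaborate matching, blending, and inverse kinematics.

    import numpy as np

    def lookup_full_pose(markers, db_markers, db_poses):
        """Return the full joint-angle pose of the database frame whose
        reduced marker positions are closest to the observed ones.

        markers:    (M,)   observed reduced marker positions, flattened
        db_markers: (N, M) reduced marker positions of N database frames
        db_poses:   (N, D) corresponding full hand poses (joint angles)
        """
        dists = np.linalg.norm(db_markers - markers, axis=1)
        return db_poses[np.argmin(dists)]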
Sensored Gloves As recording finger motions with optical, marker-based motion capture systems is challenging, alternative methods have been developed. Sensored gloves directly measure the bending angle of different joints (Sturman and Zeltzer 1994). The number of bend sensors varies between 5 and 24, and their configuration differs between devices. Commercially available technologies include fiber optic sensors, piezoresistive sensors, and inertial sensors. The main advantage of sensored gloves is that they create a continuous signal without any occlusions that can be used in real time. However, the accuracy is lower than for marker-based optical motion capture systems. The number of sensors of the gloves is typically small compared to the number of degrees of freedom of the hand. Furthermore, the sensors do not measure the global position of the hand. The gloves might move relative to the skin during the capture; therefore, regular recalibrations are necessary if a high accuracy is required, which might interrupt capture sessions. Finally, cross-coupling between sensors adds further challenges, and more complex calibration methods have been developed (Kahlesz et al. 2004; Wang and Neff 2013; Wheatland et al. 2015). Sensored gloves are therefore most useful in applications where a continuous signal is important but accuracy is not crucial. These properties explain why gloves were used, for example, for virtual reality applications or as a baseline for movies where the motions can be adjusted in
post-processing. A survey of glove-based systems has been compiled by Dipietro et al. (2008).
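To make the calibration issue concrete, the sketch below fits a simple affine map from raw bend-sensor readings to joint angles using a few calibration poses with known angles. This is only the most basic scheme under assumed array shapes; the cross-coupling-aware calibration methods cited above are considerably more involved.

    import numpy as np

    def fit_glove_calibration(raw_readings, reference_angles):
        """Fit angles ~ raw @ A + b from calibration poses.

        raw_readings:     (N, S) raw sensor values for N calibration poses
        reference_angles: (N, D) corresponding joint angles (e.g., measured
                          optically or taken from prescribed calibration poses)
        """
        X = np.hstack([raw_readings, np.ones((len(raw_readings), 1))])
        coeffs, *_ = np.linalg.lstsq(X, reference_angles, rcond=None)
        return coeffs[:-1], coeffs[-1]          # A: (S, D), b: (D,)

    def apply_calibration(raw, A, b):
        return raw @ A + b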
Markerless Optical Systems and Depth Sensors Markerless optical systems and depth sensors have become very popular in the past years. These systems, examples being the Microsoft Kinect and the Leap Motion sensor, cover only a small capture volume, but they are small, light, and inexpensive. Their accuracy depends largely on the captured hand poses and the hand orientations in relation to the sensor. Detailed silhouettes can be recognized with a much higher accuracy than poses where fingers are touching each other and might be hidden from the sensor by the palm or by other fingers or the thumb. Fast motions might also not be recognized accurately. However, algorithms for these systems are currently being developed at a rapid pace, so that further progress is likely in the near future. The Microsoft Kinect is an RGB-D camera, which records color like a standard camera and, in addition, depth information. The Leap Motion sensor, on the other hand, uses only depth information. Other approaches only use information from a regular camera. Wang and Popović presented a method where the user wears a cloth glove with a colored pattern (Wang and Popović 2009). With the pattern, the pose of the hand can be estimated in single frames by looking for the nearest neighbor hand pose in a database. As a result of several improvements such as approximating the nearest neighbor lookup, the search can be conducted in real time. Many suggestions for algorithms using only monocular videos exist (de La Gorce et al. 2008). As of now, they are typically not very accurate and are computationally too intensive for real-time tasks. When the recognition problem with or without depth information is reduced to recognizing a specified subset of poses and gestures in a controlled environment, more reliable approaches exist.
Synthesizing Data-Driven Hand Motions An alternative approach to capturing finger motions is to synthesize the complete hand movements. Once a database of motions has been created, data-driven methods can learn from this data and reuse and adapt it as required. An example of how to synthesize hand and finger motions based on the method developed by Jörg et al. (2012) is elaborated in the next paragraph. The goal of the approach is to automatically create motions for all fingers and the thumb (not including the wrist) for conversational situations, given a full body motion clip without hand movements as input and a database of motions that includes both hand and body movements. To this end, the algorithm finds the best hand motion clip from the database, taking into account features such as the similarity of the arm motions and the smoothness of consecutive finger motions. The synthesis process consists of the following steps: First, the input motion and the database are segmented based on the wrist velocity. Second, the algorithm searches
the database for segments with wrist motions similar to those of the input motion, applying dynamic time warping to adapt the length of each segment within certain limits. Third, a weighted motion graph is computed. The start node of the graph is connected to the k segments from the database that are most similar to the first input motion segment. Each of these k segments is then connected to the k segments that are most similar to the second input motion segment and so on. For each transition, a cost is calculated by comparing the orientations and angular velocities of the fingers at the last frame of a segment and at the first frame of the next segment. A weighted sum of the corresponding transition and segment costs is applied to each connection, and the shortest path is computed with Dijkstra's algorithm, resulting in a choice of motion segments. Finally, when combining these segments, transitions are created where necessary. This algorithm creates plausible finger motions for conversational situations but excludes any interactions with objects or self-collisions with the virtual character itself and does not take into account any partial information of finger positions. Enhancements to this approach have been developed by Mousas et al. (2015). Ye and Liu's approach (Ye and Liu 2012), in contrast, creates detailed and physically plausible hand manipulations when presented with a full body motion and the movements of objects that are being manipulated. They determine feasible hand-object contact points using a database and create the hand movements according to these contact positions. They select visually diverse solutions that result in intricate motion strategies such as the relocation of contact points. Many further methods have been suggested to create hand and finger motions when starting with body motions, such as capturing them separately from the body and synchronizing them in a post-processing step (Majkowska et al. 2006), procedural algorithms using databases (Aydin and Nakajima 1999), determining key hand poses based on the body motion with a support vector machine (Oshita and Senju 2014), or approaches taking advantage of data-driven and physics-based methods (Kry and Pai 2006; Neff and Seidel 2006), to cite just a few examples. Each method has its advantages and drawbacks. A more exhaustive review of the research literature on hand and finger motion synthesis can be found in Wheatland et al.'s state-of-the-art report (Wheatland et al. 2015).
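The graph-based selection step of Jörg et al. (2012) described above can be pictured as a shortest path through layers of candidate segments. The following sketch is a strong simplification under assumed data structures (one dictionary of candidate costs per input segment and a user-supplied transition cost function); it omits the segmentation, time warping, and transition generation of the original method and is not the authors' implementation.

    def select_segments(segment_costs, transition_cost):
        """Choose one database candidate per input segment so that the sum of
        segment costs and transition costs is minimal (layered shortest path).

        segment_costs: list (one entry per input segment) of dicts
                       {candidate_id: segment_cost} for the k best candidates.
        transition_cost(a, b): cost of playing candidate b right after a.
        """
        # best[i][c] = (total cost of the best path reaching candidate c at
        #               segment i, id of the predecessor candidate)
        best = [{c: (cost, None) for c, cost in segment_costs[0].items()}]
        for layer in segment_costs[1:]:
            current = {}
            for c, seg_cost in layer.items():
                prev, prev_total = min(
                    ((p, t[0] + transition_cost(p, c)) for p, t in best[-1].items()),
                    key=lambda x: x[1],
                )
                current[c] = (prev_total + seg_cost, prev)
            best.append(current)
        # Backtrack from the cheapest candidate in the last layer.
        c = min(best[-1], key=lambda cand: best[-1][cand][0])
        path = [c]
        for i in range(len(best) - 1, 0, -1):
            c = best[i][c][1]
            path.append(c)
        return list(reversed(path))

For instance, select_segments([{'a': 0.2, 'b': 0.5}, {'c': 0.1, 'd': 0.3}], lambda p, c: 0.0) returns ['a', 'c'], the cheapest chain of candidates for two input segments.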
Conclusion and Future Directions
Detailed hand motions are highly important, especially as our expectations toward realistic virtual characters increase. But high-quality, accurate hand motions are still time consuming to capture or synthesize. While methods and techniques in this field are improving at a fast pace, there are still many open questions and processes which need improvements. Future directions include the following topics:
• Approaches for a variety of applications: Many approaches that have been suggested specialize on a specific task. One next step would be to develop approaches that are effective for multiple tasks or to combine approaches and use the optimal approach based on an automatic assessment of the situation.
• Interpretation of subtle hand motions: Many questions are unsolved when it comes to our interpretation of subtle hand motions. How do details in finger motions influence our understanding and our interpretation when communicating? How much error or latency in hand motions is tolerable?
• Efficient methods for real-time capturing or synthesis: Recording or synthesizing movement in real time has its own challenges, for example, optimizations over longer segments of motion are not possible. The accuracy and reliability of current devices need to be improved.
• Improve our understanding of the virtual hand illusion: How much error is allowed? Why are some people more prone to the illusion than others? Our understanding of the reasons for the illusion and the conditions in which it occurs is still limited. Insights could allow for more appropriate feedback to improve communication and manipulation in virtual reality applications (Ebrahimi et al. 2016; Prachyabrued and Borst 2014).
• Details of hand motions: For animations that aim to look realistic, details such as skin deformations and wrinkles based on the anatomy of the hand or on contacts need to be synthesized. While progress has been made in this area (Andrews et al. 2013; Li and Kry 2014), automatic photo-realism has not been reached yet.
• Control tools for animators: Finally, methods need to be made accessible and usable for animators, which includes the development of intuitive controls to allow for a more efficient workflow.
Cross-References
▶ 3D Dynamic Pose Estimation Using Cameras and No Markers
▶ 3D Dynamic Pose Estimation Using Reflective Markers or Electromagnetic Sensors
▶ 3D Dynamic Probabilistic Pose Estimation Using Cameras and Reflective Markers
▶ Body Movements in Music Performances: On the Example of Clarinet Players
▶ Data-Driven Character Animation Synthesis
▶ Hand Gesture Synthesis for Conversational Characters
▶ Movement Efficiency in Piano Performances
▶ Perceptual Evaluation of Human Animation
▶ Postural Movements of Violin Players
References Andrews S, Jarvis M, Kry PG (2013) Data-driven fingertip appearance for interactive hand simulation. In: Proceedings of motion on games, MIG ‘13, Dublin, pp 155:177–155:186
Argelaguet F, Hoyet L, Trico M, Lecuyer A (2016) The role of interaction in virtual embodiment: effects of the virtual hand representation. In: IEEE virtual reality (VR), Greenville, pp 3–10 Aydin Y, Nakajima M (1999) Database guided computer animation of human grasping using forward and inverse kinematics. Comput Graph 23(1):145–154. doi:10.1016/S0097-8493(98) 00122-8 Braido P, Zhang X (2004) Quantitative analysis of finger motion coordination in hand manipulative and gestic acts. Hum Mov Sci 22(6):661–678. doi:10.1016/j.humov.2003.10.001 Chaminade T, Hodgins J, Kawato M (2009) Anthropomorphism influences perception of computeranimated characters’ actions. Soc Cogn Affect Neurosci 2(3):206–216 Ciocarlie M, Goldfeder C, Goldfeder C (2007) Dimensionality reduction for hand-independent dexterous robotic grasping. In: IEEE/RSJ international conference on intelligent robots and systems, IROS 2007, San Diego, pp 3270–3275 Cutting J, Kozlowski L (1977) Recognizing friends by their walk: gait perception without familiarity cues. Bull Psychon Soc 9(5):353–356 de La Gorce M, Paragios N, Fleet DJ (2008) Model-based hand tracking with texture, shading and self-occlusions. In: IEEE conference on computer vision and pattern recognition, Anchorage, pp 1–8 Dipietro L, Sabatini A, Dario P (2008) A survey of glove-based systems and their applications. IEEE Trans Syst Man Cybern Part C Appl Rev 38(4):461–482 Ebrahimi E, Babu SV, Pagano CC, Jörg S (2016) An empirical evaluation of visuo-haptic feedback on physical reaching behaviors during 3D interaction in real and immersive virtual environments. ACM Trans Appl Percept 13(4):19:1–19:21 ElKoura G, Singh K (2003) Handrix: animating the human hand. In: Proceedings of the ACM SIGGRAPH/Eurographics symposium on computer animation, San Diego, pp 110–119 Häger-Ross C, Schieber MH (2000) Quantifying the independence of human finger movements: comparisons of digits, hands, and movement frequencies. J Neurosci 20(22):8542–8550 Hoyet L, Ryall K, McDonnell R, O’Sullivan C (2012) Sleight of hand: perception of finger motion from reduced marker sets. In: Proceedings of the ACM SIGGRAPH symposium on interactive 3D graphics and games, I3D ‘12, Costa Mesa, pp 79–86 Huenerfauth M, Lu P (2010) Accurate and accessible motion-capture glove calibration for sign language data collection. ACM Trans Access Comput 3(1):2:1–2:32 Jörg S (2011) Perception of body and hand animations for realistic virtual characters. Ph thesis, University of Dublin, Trinity College, Dublin Jörg S, O’Sullivan C (2009) Exploring the dimensionality of finger motion. In: Proceedings of the 9th Eurographics Ireland workshop (EGIE 2009), Dublin, pp 1–11 Jörg S, Hodgins J, O’Sullivan C (2010) The perception of finger motions. In: Proceedings of the 7th symposium on applied perception in graphics and visualization (APGV 2010), Los Angeles, pp 129–133 Jörg S, Hodgins JK, Safonova A (2012) Data-driven finger motion synthesis for gesturing characters. ACM Trans Graph 31(6):189:1–189:7 Kahlesz F, Zachmann G, Klein R (2004) Visual-fidelity dataglove calibration. In: Computer graphics international. IEEE Computer Society, Crete, pp 403–410 Kang C, Wheatland N, Neff M, Zordan V (2012) Automatic hand-over animation for free-hand motions from low resolution input. In: Motion in games. Lecture notes in computer science, vol 7660. Springer, Berlin/Heidelberg, pp 244–253 Kendon A (2004) Gesture – visible action as utterance. 
Cambridge University Press, Cambridge Kitagawa M, Windsor B (2008) MoCap for artists: workflow and techniques for motion capture. Focal Press, Amsterdam/Boston Kozlowski LT, Cutting JE (1977) Recognizing the sex of a walker from a dynamic point-light display. Percept Psychophys 21(6):575–580 Kry PG, Pai DK (2006) Interaction capture and synthesis. ACM Trans Graph 25(3):872–880 Li P, Kry PG (2014) Multi-layer skin simulation with adaptive constraints. In: Proceedings of the 7th international conference on motion in games, MIG ‘14, Playa Vista, pp 171–176
Lin L, Jörg S (2016) Need a hand?: how appearance affects the virtual hand illusion. In: Proceedings of the ACM symposium on applied perception, SAP ‘16, Anaheim, pp 69–76 Liu CK (2008) Synthesis of interactive hand manipulation. In: Proceedings of the ACM SIGGRAPH/Eurographics symposium on computer animation, Dublin, pp 163–171 Liu CK (2009) Dextrous manipulation from a grasping pose. ACM Trans Graph 28(3):3:1–3:6 Lu P, Huenerfauth M (2009) Accessible motion-capture glove calibration protocol for recording sign language data from deaf subjects. In: Proceedings of the 11th international ACM SIGACCESS conference on computers and accessibility, pp 83–90 Ma K, Hommel B (2015a) Body-ownership for actively operated non-corporeal objects. Conscious Cogn 36:75–86 Ma K, Hommel B (2015b) The role of agency for perceived ownership in the virtual hand illusion. Conscious Cogn 36:277–288 Majkowska A, Zordan VB, Faloutsos P (2006) Automatic splicing for hand and body animations. In: Proceedings of the ACM SIGGRAPH/Eurographics symposium on computer animation. Boston, MA, USA, pp 309–316 McNeill D (1992) Hand and mind: what gestures reveal about thought. The University of Chicago Press, Chicago Menache A (1999) Understanding motion capture for computer animation and video games. Morgan Kaufmann Publishers Inc., San Francisco Mousas C, Newbury P, Anagnostopoulos CN (2014) Efficient hand-over motion reconstruction. In: Proceedings of the 22nd international conference in Central Europe on computer graphics, visualization and computer vision, WSCG ‘14. Plzen, Czech Republic, pp 111–120 Mousas C, Anagnostopoulos CN, Newbury P (2015) Finger motion estimation and synthesis for gesturing characters. In: Proceedings of the 31st spring conference on computer graphics, SCCG ‘15. Smolenice, Slovakia, pp 97–104 Napier J (1980) Hands. Pantheon Books, New York Neff M, Seidel HP (2006) Modeling relaxed hand shape for character animation. In: Articulated Motion and deformable objects. Lecture notes in computer science, vol 4069. Springer, Berlin/ Heidelberg, pp 262–270 Oshita M, Senju Y (2014) Generating hand motion from body motion using key hand poses. In: Proceedings of the 7th international conference on motion in games, MIG ‘14. Playa Vista, CA, USA, pp 147–151 Palastanga N, Soames R (2012) Anatomy and human movement – structure and function, 6th edn. Butterworth Heinemann/Elsevier, Edinburgh/New York Parent R (2012) Computer animation: algorithms and techniques, 3rd edn. Morgan Kaufmann, Burlington Perani D, Fazio F, Borghese NA, Tettamanti M, Ferrari S, Decety J, Gilardi MC (2001) Different brain correlates for watching real and virtual hand actions. Neuroimage 14:749–758 Pollard NS, Zordan VB (2005) Physically based grasping control from example. In: Proceedings of the ACM SIGGRAPH/Eurographics symposium on computer animation. Los Angeles, CA, USA, pp 311–318 Prachyabrued M, Borst CW (2014) Visual feedback for virtual grasping. In: IEEE symposium on 3D User Interfaces, 3DUI, 2014. Minneapolis, MN, USA, pp 19–26 Samadani AA, DeHart BJ, Robinson K, Kulic D, Kubica E, Gorbet R (2011) A study of human performance in recognizing expressive hand movements. In: IEEE international symposium on robot and human interaction communication. Atlanta, GA, USA Santello M, Flanders M, Soechting JF (1998) Postural hand synergies for tool use. J Neurosci 18 (23):10,105–10,115 Schröder M, Maycock J, Botsch M (2015) Reduced marker layouts for optical motion capture of hands. 
In: Proceedings of the 8th ACM SIGGRAPH conference on motion in games, MIG ‘15. Paris, France, pp 7–16 Sturman DJ, Zeltzer D (1994) A survey of glove-based input. IEEE Comput Graph Appl 14 (1):30–39
Wang Y, Neff M (2013) Data-driven glove calibration for hand motion capture. In: Proceedings of the 12th ACM SIGGRAPH/Eurographics symposium on computer animation, SCA '13. Anaheim, CA, USA, pp 15–24 Wang RY, Popović J (2009) Real-time hand-tracking with a color glove. ACM Trans Graph 28(3):63 Wang Y, Tree JEF, Walker M, Neff M (2016) Assessing the impact of hand motion on virtual character personality. ACM Trans Appl Percept 13(2):9:1–9:23 Wheatland N, Jörg S, Zordan V (2013) Automatic hand-over animation using principal component analysis. In: Proceedings of motion on games, MIG '13. Zürich, Switzerland, pp 175:197–175:202. ACM Wheatland N, Wang Y, Song H, Neff M, Zordan V, Jörg S (2015) State of the art in hand and finger modeling and animation. Comput Graph Forum 34(2):735–760 Ye Y, Liu CK (2012) Synthesis of detailed hand manipulations using contact sampling. ACM Trans Graph 31(4):245–254 Yuan Y, Steed A (2010) Is the rubber hand illusion induced by immersive virtual reality? Virtual Reality Conference (VR). IEEE Computer Soc. Waltham, MA, USA, pp 95–102 Zhang J, Hommel B (2016) Body ownership and response to threat. Psychol Res 80(6):1020–1029 Zhu Y, Ramakrishnan AS, Hamann B, Neff M (2013) A system for automatic animation of piano performances. Comput Anim Virtual Worlds 24(5):445–457
Example-Based Skinning Animation State of the Art Tomohiko Mukai
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State-of-the-Art Skinning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Example-Based Skinning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Linear Blend Skinning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Skinning Weight Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Skinning Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Example-Based Helper Bone Rigging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Per-Example Optimization of Helper Bone Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Helper Bone Controller Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstract
The skinning technique has been widely used for synthesizing the natural skin deformation of human-like characters in a broad range of computer graphics applications. Many skinning methods have been proposed to improve the deformation quality while achieving real-time computational performance. The design of skinned character models, however, requires heavy manual labor even for experienced digital artists with professional software and tools. This chapter presents an introduction to an example-based skinning method, which builds a
skinned character model using an example sequence of handcrafted or physically simulated skin deformations. Various types of machine learning techniques and statistical analysis methods have been proposed for example-based skinning. In this chapter, we first review state-of-the-art skinning techniques, especially for a standard skinning model called linear blend skinning that uses a virtual skeleton hierarchy to drive the skin deformation. Next, we describe several automated methods for building a skeleton-based skinned character model using example skin shapes. We introduce skinning decomposition methods that convert a shape animation sequence into a skinned character and its skeleton motion. We also explain a practical application of skinning decomposition, which builds a so-called helper bone rig from an example animation sequence. We finally discuss the future directions of example-based skinning techniques.
Keywords
Animation • Rigging • Linear blend skinning • Helper bone
Introduction In computer animation, natural skin deformation is vital for producing lifelike characters, and there are many techniques and professional software packages for creating expressive skin animations. For example, a physics-based volumetric simulation is a typical approach for generating physically valid skin deformations. A physics simulation, however, requires large computational costs and careful design of a character's musculoskeletal model. Moreover, a numerical simulation is inferior to kinematic or analytical models in terms of controllability and stability. The animation of a human character in interactive graphics applications is often created using a skeleton-based skinning method that embeds a virtual skeletal structure into the solid geometry of the character's surface model. This technique deforms the skin surface according to the movement of the virtual skeleton on the basis of simple geometric operations and has become a de facto standard skinning method because of its simplicity and efficiency. The typical production procedure of skinning animations is composed of the following three processes:
Modeling A surface geometry of a character model is created as a polygonal mesh or parametric surface, and its material appearance is designed via shading models and texture images. The character model is often created in the rest pose such as the so-called T-stance shown in Fig. 1a. In this chapter, we assume that the skin geometry is constructed as a polygonal mesh.
Rigging A virtual skeleton hierarchy, typically composed of rigid bones and rotational joints, is bound to the character geometry. This process requires specifying the skinning weights that describe the relative influence of each bone over each vertex of the skin mesh (Fig. 1b). This process is called rigging, character setup, or simply setup.
[Fig. 1 Typical process for creating skeleton-driven skin animation: (a) modeling in the T-stance, (b) rigging (skeleton and skin), (c) animation]
Animation An animation of the skinned character is created as a time series of rotations of the skeleton joints. The deformation of the skin surface is driven by the joint movements (Fig. 1c). In this chapter, we especially focus on the rigging of the skeleton-driven character. Building a good character rig is a key requirement for synthesizing natural and fine skin deformation while providing a full range of motion control for animators, i.e., an ill-designed skeleton and rig cause unnatural deformation even if the artists carefully design the character's geometry and skeleton movement. Moreover, the skeleton rig should be as simple and intuitive to manipulate as possible so that the animators can easily create the character motion. Owing to this trade-off between quality and manipulability, the rigging of a complex human-like character is still a challenging task even for skilled riggers and animators using professional tools, which requires trial and error and artistic experience and intuition. Many researchers have tried to develop automatic and semiautomatic methods of building a character rig. In particular, recent studies have primarily focused on data-driven approaches for constructing optimal skeleton-based character rigs by using example data. Example-based skinning methods optimize the structure of the skeleton hierarchy and skinning weights so that the skinned animation approximates the example shapes well. Various types of machine learning techniques and statistical analysis methods have been proposed for stably and efficiently obtaining an accurate approximation. The rest of this chapter is structured as follows. In the following section, we review state-of-the-art skinning techniques. Next, we explain skinning decomposition methods that build skinned character models from example skin shapes. We also introduce a practical skinning model called a helper bone rig and its example-based construction algorithm based on the skinning decomposition method. Finally, we will discuss the future prospects of example-based skinning techniques.
State-of-the-Art Skinning Techniques Accurate skin deformation is often generated using physics-based musculoskeletal (Li et al. 2013) or volumetric (Fan et al. 2014) simulations. These simulation-based methods are unsuitable for intuitively or intentionally changing the style of deformations and real-time applications because of their high computational cost. Moreover, the manual rigging of a character’s musculoskeletal model is a very challenging task. A data-driven approach learns the dynamical properties of softtissue materials from the example data (Shi et al. 2008) but still suffers from the computational complexity. The kinematic skinning method efficiently computes the vertex positions of the skin mesh on the basis of the pose of the internal skeleton structure. Linear blend skinning (LBS) is a standard technique for synthesizing skin deformation in realtime applications, which computes a deformed vertex position by transforming each vertex through a weighted combination of bone transformation matrices (MagnenatThalmann et al. 1988). Multiweight enveloping (Merry et al. 2006; Wang and Phillips 2002) extends the LBS model by adding a weight to each matrix element. The nonlinear skinning technique uses a dual quaternion instead of a transformation matrix to overcome LBS artifacts, but a side effect called a bulging artifact is still caused while bending (Kavan et al. 2007). Several hybrid approaches have been proposed to blend bone transformations with fewer artifacts. Kavan and Sorkine (2012) proposed a blending scheme that decomposes a bone rotation into swing and twist components and separately blends each component using different algorithms. The stretchable and twistable bone model (Jacobson and Sorkine 2011) uses different weighting functions for scaling and bone twisting, respectively. These methods successfully synthesize artifact-free skin deformation and do not discuss stylized skin deformation such as muscle–skin deformation. EigenSkin constructs an efficient model of the additive vertex displacement for LBS using a principal component analysis (Kry et al. 2002). These methods successfully synthesize artifact-free skin deformation. Naive LBS, however, is still a de facto standard skinning model in interactive graphics applications because of its efficiency and simplicity. One practical solution for minimizing LBS artifacts is to add extra bones called helper bones. The helper bone rig has become a practical real-time technology for synthesizing stylized skin deformation based on LBS. The helper bone is a secondary rig that influences skin deformation, and its pose is procedurally controlled according to the pose of the primary skeleton. Mohr and Gleicher (2003) first introduced the basic concept of a helper bone system. In their work, helper bones are generated by subdividing primary bones, and a scaling parameter is procedurally controlled according to the twist angle of the primary bone thus minimizing the candy-wrapper artifact. This technique has been widely used in many products because of its efficiency, flexibility, and compatibility with the standard graphics pipeline (Kim and Kim 2011; Parks 2005). Although this technique provides a flexible yet efficient synthesis of a variety of expressive skin deformations, rigging with helper bones is still a labor-intensive process. We have developed an example-
based technique to build helper bone rigs, as explained in Section "Example-Based Helper Bone Rigging." Scattered data interpolation such as pose-space deformation (PSD) is another approach for synthesizing skin deformation from example shapes (Kurihara and Miyata 2004; Lewis et al. 2000; Sloan et al. 2001). PSD uses radial basis function interpolation for blending example shapes according to the skeleton pose. This technique produces high-quality skin animation via intuitive design operations. However, the PSD model requires a runtime engine to store all example data in memory. Furthermore, the computational cost of PSD increases in proportion to the number of examples. Consequently, many example shapes cannot be used in real-time systems with a limited memory capacity, such as mobile devices. The machine learning-based approach for skin deformation analyzes the relationship between a skeleton pose and its corresponding skin shape using a large set of samples. A regression technique was proposed to estimate a linear mapping from the skeletal pose to the deformation gradient of the skin surface polygons (Pulli and Popović 2007). The seminal work of Park and Hodgins (2008) predicts an optimal mapping from the skeletal motion to the dynamic motion of several markers placed on a skin surface. Neumann et al. (2013) proposed a statistical model of skin deformation that is learned from human skin shapes captured with range scan devices. These methods construct a regression model from a set of example skeleton poses and skin shapes.
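To illustrate the PSD idea mentioned above, the sketch below interpolates per-vertex corrective offsets with Gaussian radial basis functions over a pose feature vector. The kernel, its width, and the array shapes are illustrative assumptions and do not reproduce the formulations of the cited papers.

    import numpy as np

    def fit_rbf_psd(example_poses, example_offsets, sigma=1.0):
        """example_poses:   (N, P)    pose features (e.g., joint angles)
        example_offsets: (N, V, 3) sculpted corrective offsets per example"""
        d = np.linalg.norm(example_poses[:, None] - example_poses[None, :], axis=-1)
        K = np.exp(-(d / sigma) ** 2)                  # (N, N) RBF kernel matrix
        B = example_offsets.reshape(len(example_poses), -1)
        # Least squares is used instead of a direct solve for robustness.
        W, *_ = np.linalg.lstsq(K, B, rcond=None)      # (N, V*3) RBF weights
        return W

    def evaluate_psd(pose, example_poses, W, sigma=1.0):
        k = np.exp(-(np.linalg.norm(example_poses - pose, axis=-1) / sigma) ** 2)
        return (k @ W).reshape(-1, 3)                  # corrective vertex offsets

The runtime cost and memory footprint of evaluate_psd grow with the number of examples, which is exactly the limitation noted above for memory-constrained targets.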
Example-Based Skinning
Linear Blend Skinning
LBS (Magnenat-Thalmann et al. 1988) is a standard method that has been widely used for a broad range of interactive applications such as games. Most real-time graphics engines support the LBS-based rig because of its simplicity and efficiency. The LBS model computes a deformed vertex position by transforming each vertex through a weighted combination of bone transformation matrices. Given a skeleton with $P$ bones, the global transformation matrix of the $p$-th bone is denoted as a $4 \times 4$ homogeneous matrix $G_p$, $p \in \{1, \dots, P\}$. Let $\bar{G}_p$ and $\bar{v}_j$, $j \in \{1, \dots, J\}$, denote the corresponding matrix and the position of the $j$-th vertex of the skin in the initial T-stance pose, respectively, and let the skinning transformation matrix be represented by $M_p = G_p \bar{G}_p^{-1}$. The global transformation matrix $G_p$ can be decomposed into a product of the local transformation matrix $L_p$ and the parent's global transformation matrix as $G_p = G_{\phi(p)} L_p$, where $\phi(p) \in \{1, \dots, P\}$ is the parent of the $p$-th bone. The deformed vertex position $v_j$ is computed with nonnegative skinning weights $w_{j,p}$ as

$$v_j = \sum_{p} w_{j,p} M_p \bar{v}_j, \qquad (1)$$

where the affinity constraint $\sum_p w_{j,p} = 1, \forall j$, is satisfied. The number of nonzero skinning weights at each vertex is assumed to be less than a constant number $k$, which is expressed as $\|\mathbf{w}_j\|_0 \le k, \forall j$, where $\|\cdot\|_\alpha$ denotes the $L_\alpha$ norm. This sparsity assumption can be interpreted as meaning that each vertex moves according to the transformations of a few spatially neighboring bones. Moreover, this constraint ensures the efficient computation of a skin animation regardless of the total number of skeleton bones since the uninfluenced bones can be eliminated in the computation of the vertex deformation.
T. Mukai
where the affinity constraint p wj, p = 1, 8j is satisfied. The number of nonzero skinning weights at each vertex is assumed to be less than a constant number k, which is expressed as p |wj, p|1 k, 8p, where ||α denotes the Lα norm. This sparsity assumption can be interpreted as that each vertex moves according to the transformations of a few spatially neighboring bones. Moreover, this constraint ensures the efficient computation of a skin animation regardless of the total number of skeleton bones since the uninfluenced bones can be eliminated in the computation of the vertex deformation.
Skinning Weight Optimization

An LBS rig is built by designing a skeleton hierarchy, including the initial bone transformation $\bar{G}_p$ and the parent–child relation $\phi(p)$ of each bone, and the corresponding skinning weights $w_{j,p}$ of each vertex. Since the skinning weights are more difficult to design than the skeleton structure owing to the larger number of free parameters, several methods have been proposed to optimize the skinning weights for an arbitrary skeleton structure and skin geometry. The Pinocchio system (Baran and Popović 2007) uses an analogy to heat diffusion over the skin mesh for estimating shape-aware skinning weights. The bounded biharmonic weight model (Jacobson et al. 2011) produces a smooth distribution of the skinning weights that minimizes the Laplacian energy over the character's volumetric structure. The deformation-aware method (Kavan and Sorkine 2012) optimizes the skinning weights to minimize an elastic deformation energy over a certain range of skeletal poses. These methods make an assumption regarding the material properties of the skin surface, e.g., they assume that the physical properties of the character skin, such as the elasticity, stiffness, and friction, do not change over the entire body surface. However, this assumption is somewhat optimistic because many characters have a heterogeneous distribution of material properties that includes those of bones, muscles, and fat. Further investigation is therefore required to improve the quality of the weight optimization.

An alternative approach uses a set of example data of the skin shape and skeleton poses (Miller et al. 2011; Mohr and Gleicher 2003; Wang and Phillips 2002). Given the prior information of the skeleton hierarchy and N examples of a pair of skeleton pose $\hat{M}_{p,n}$ and skin shape $\hat{v}_{j,n}$, $n \in \{1, \ldots, N\}$, the optimal skinning weights $w_{j,p}$ are obtained by solving a constrained problem to minimize the squared error of the vertex positions between the example shape $\hat{v}_{j,n}$ and the deformed skin $v_j$ as

$\{w^*\} = \arg\min_{w} \sum_n \sum_j \left\| \hat{v}_{j,n} - \sum_p w_{j,p} \hat{M}_{p,n} v_j \right\|_2^2$   (2)
subject to

$\sum_p w_{j,p} = 1, \quad \forall j;$   (3)

$w_{j,p} \ge 0, \quad \forall j, p;$   (4)

$\|w_j\|_0 \le k, \quad \forall j;$   (5)
where the three constraints are (Eq. 3) the affinity constraint, (Eq. 4) nonnegativity constraint, and (Eq. 5) sparsity constraint. This constrained least-squares problem can be approximately solved using a quadratic programming (QP) solver by relaxing the sparsity constraint, as detailed in Section “Weight Update Step.” We can alternatively use the simpler nonnegative least-squares method (James and Twigg 2005) if no sparsity constraint is imposed. These weight optimization techniques produce a good initial guess for the skinning weights for an arbitrary shape of the skin mesh and skeleton hierarchy. Manual refinement, however, is still required in practice to eliminate undesirable artifacts and to add artist-directed stylized skin behavior.
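As an illustration of the example-based weight fit, the sketch below solves the per-vertex problem with nonnegative least squares in the spirit of the simpler method of James and Twigg (2005); the affinity constraint is only approximated by renormalizing afterwards, and the function and variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def fit_vertex_weights(rest_vertex, example_positions, example_matrices):
    """Fit nonnegative skinning weights for one vertex from examples.

    rest_vertex:       (3,) rest-pose position v_j
    example_positions: (N, 3) observed positions v^_{j,n}
    example_matrices:  (N, P, 4, 4) skinning matrices M^_{p,n}
    Returns weights of shape (P,) that are >= 0 and (approximately) sum to 1.
    """
    N, P = example_matrices.shape[:2]
    v_h = np.append(rest_vertex, 1.0)                 # homogeneous coordinate
    # Column p of A stacks M^_{p,n} v_j over all examples n (cf. Eq. 7).
    A = np.zeros((3 * N, P))
    for n in range(N):
        for p in range(P):
            A[3 * n:3 * n + 3, p] = (example_matrices[n, p] @ v_h)[:3]
    b = example_positions.reshape(-1)
    w, _ = nnls(A, b)                                 # nonnegativity constraint
    s = w.sum()
    # Renormalization approximates the affinity constraint; an exact solve
    # would use a QP with the equality constraint instead.
    return w / s if s > 0 else w
```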
Skinning Decomposition

Several algorithms have been proposed to extract both the skinning weights and optimal bone transformations from a set of example shapes, which is called the skinning decomposition problem. In other words, the goal of skinning decomposition is to convert a shape animation into a bone- or skeleton-based skinned animation. Given a set of N example shapes $\hat{v}_{j,n}$, the goal of skinning decomposition is to find the optimal skinning weights $w_{j,p}$ and skinning matrices $M_{p,n}$ that best approximate the example shapes in a least-squares sense as

$\{w^*, M^*\} = \arg\min_{w, M} \sum_n \sum_j \left\| \hat{v}_{j,n} - \sum_p w_{j,p} M_{p,n} v_j \right\|_2^2$   (6)
subject to the affinity, nonnegativity, and sparsity constraints on the skinning weights. The skinning mesh animation algorithm (James and Twigg 2005) uses a mean-shift clustering algorithm for identifying rigid or near-rigid bone transformations and applies the nonnegative least-squares method to estimate the skinning weights. Kavan et al. (2010) proposed a least-squares method with a dimensionality reduction to efficiently extract the nonrigid bone movements and skinning weights from an example shape animation. The smooth skinning decomposition with rigid bones (SSDR) algorithm (Le and Deng 2012) introduced a rigidity constraint on the bone
transformation M, which requires that M be a product of a rotation matrix R and a translation matrix T as M = T R, where $R^T R = I$ and $\det(R) = 1$ are satisfied. This algorithm was later extended to identify hierarchically structured bone transformations from a shape animation sequence (Le and Deng 2014). The SSDR algorithm is designed to meet the requirements of interactive graphics production: the three types of constraints on the skinning weights are always assumed in many graphics software packages, and the rigidity constraint on the bone transformations is also necessary for most game engines. The SSDR algorithm uses a block coordinate descent algorithm that optimizes the transformation of each bone or the skinning weights at each subiteration while fixing the other variables. For instance, the weight update step optimizes the skinning weights while fixing all bone transformations, and the transformation update step optimizes the transformation of each bone while fixing the transformations of the remaining bones and the skinning weights. These alternating processes are repeated until the objective function converges. The details of each subiteration are described below.
Initialization

In the first step, each vertex of the skin mesh is bound to one bone with a skinning weight of one. The initialization problem then becomes the clustering of vertices into the specified number of clusters, where vertices in the same cluster have similar rigid transformations. For each cluster, a rigid bone transformation is fitted to relate the vertex positions in the rest pose to the vertex positions at each example. Since the quality of this motion-driven clustering has a great effect on the remaining skinning decomposition steps, several clustering algorithms have been explored, such as mean-shift clustering (James and Twigg 2005), K-means clustering (Le and Deng 2012), and the Linde–Buzo–Gray algorithm (Le and Deng 2014), for stably obtaining an accurate result. It is possible to apply a more sophisticated algorithm for enhancing the stability and efficiency of the clustering, which is an important open question.

Weight Update Step

The optimal skinning weights $w_{j,p}$ are updated while fixing all bone transformations $M_{p,n}$ in Eq. 6. The resulting optimization problem is rewritten as the following per-vertex constrained least-squares problem:

$w_j^* = \arg\min_{w_j} \left\| A w_j - b \right\|_2^2$   (7)

subject to $w_j \ge 0$, $\|w_j\|_1 = 1$, and $\|w_j\|_0 \le k$, $\forall j$, where

$w_j = \begin{bmatrix} w_{j,1} & \cdots & w_{j,P} \end{bmatrix}^T, \quad A = \begin{bmatrix} M_{1,1} v_j & \cdots & M_{P,1} v_j \\ \vdots & \ddots & \vdots \\ M_{1,N} v_j & \cdots & M_{P,N} v_j \end{bmatrix}, \quad b = \begin{bmatrix} \hat{v}_{j,1}^T & \cdots & \hat{v}_{j,N}^T \end{bmatrix}^T.$
This problem is difficult to solve directly using a standard numerical solver owing to the $L_0$-norm constraint $\|w_j\|_0 \le k$. Hence, an approximate solution is used that relaxes the sparsity constraint (Le and Deng 2012). Specifically, the $L_0$-norm constraint is first excluded from Eq. 7, and the resulting QP problem is solved using a stock numerical solver. When the solution does not satisfy the $L_0$-norm constraint, the k most significant bones (those with the largest weights) are selected, and the weights for the other bones are set to zero. The final solution is obtained by solving the QP problem again with only the selected k bones.
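The relax-and-reselect procedure can be sketched as follows; nonnegative least squares with renormalization stands in here for the QP solve described above, which is a simplification, and all names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def solve_sparse_weights(A, b, k):
    """Approximate solution of Eq. 7 with the L0 constraint relaxed:
    solve without the sparsity constraint, keep the k largest weights,
    then re-solve using only the selected bones."""
    def solve(A_sub):
        w, _ = nnls(A_sub, b)              # nonnegativity
        s = w.sum()
        return w / s if s > 0 else w       # approximate affinity

    w_full = solve(A)
    if np.count_nonzero(w_full) <= k:
        return w_full
    selected = np.argsort(w_full)[-k:]     # k most significant bones
    w_sel = solve(A[:, selected])
    w = np.zeros(A.shape[1])
    w[selected] = w_sel
    return w
```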
Bone Transformation Update Step

The transformation of the p-th bone for each example shape is optimized while fixing the skinning weights and the transformations of the remaining P − 1 bones at each subiteration. Each subproblem becomes a per-example weighted absolute orientation problem given by

$\{R_{p,n}^*, T_{p,n}^*\} = \arg\min_{R_{p,n}, T_{p,n}} \sum_j \left\| \hat{v}_{j,n} - \sum_p w_{j,p} T_{p,n} R_{p,n} v_j \right\|_2^2$   (8)

subject to $R_{p,n}^T R_{p,n} = I$, $\det R_{p,n} = 1$, $\forall p, n$;
where the optimal $R_{p,n}$ and $T_{p,n}$ are obtained by a closed-form method. Please refer to Le and Deng (2012) for further details.
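Horn (1987) derives the closed-form solution with unit quaternions; the sketch below instead uses the equivalent SVD-based (Kabsch) formulation, so it is a substitution rather than the exact method cited.

```python
import numpy as np

def weighted_rigid_fit(source, target, weights):
    """Weighted absolute orientation: find rotation R and translation t
    minimizing sum_j w_j * ||target_j - (R @ source_j + t)||^2.

    source, target: (J, 3) point sets; weights: (J,) nonnegative weights.
    """
    w = weights / weights.sum()
    mu_s = w @ source                                   # weighted centroids
    mu_t = w @ target
    # 3x3 weighted cross-covariance of the centered point sets.
    S = (source - mu_s).T @ np.diag(w) @ (target - mu_t)
    U, _, Vt = np.linalg.svd(S)
    # Correction term enforces det(R) = +1 (a proper rotation).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return R, t
```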
Applications

The skinning decomposition algorithm allows for the conversion of any type of shape animation sequence into a skeleton- or bone-based skinned animation. For example, the animation of soft-body objects is often created using numerical simulations such as finite element methods or mass–spring network models. A facial expression is created using a blendshape animation technique. These advanced techniques, however, are not supported by some graphics engines, especially those for mobile devices. Therefore, a complex skin behavior created using professional content creation tools can be converted into an LBS-based skinning animation that is widely supported by most engines. This procedure is fully compatible with a standard production workflow. The main drawback is the lack of detail in the skin deformation caused by wrinkles, self-collisions, etc., because the sparse set of rigid bones merely linearly approximates the example deformations.
Example-Based Helper Bone Rigging

The helper bone rig has become a practical real-time technology for synthesizing stylized skin deformation based on LBS (Kim and Kim 2011; Mohr and Gleicher 2003; Parks 2005). The helper bone is a secondary rig that influences the skin
deformation, and its pose is procedurally controlled according to the pose of the primary skeleton, as illustrated in Fig. 2. Although the helper bone rig is manually designed in common practice, it requires a labor-intensive process for developing a procedural bone controller and the skinning weights. We have proposed an example-based rigging method (Mukai 2015). Our method uses a two-step algorithm to add helper bones to a predesigned primary skeleton rig using example pairs of the primary skeletal pose and desirable skin shape. In the first step, our system estimates the optimal skinning weights and helper bone transformation for each example. We used a modified version of the SSDR algorithm to incrementally insert rigid helper bones into the character rig. In the second step, the helper bone controller is constructed as a polynomial function of the primary skeleton movement. Here, we first formulate LBS with P primary bones and H helper bones as

$v_j = \left( \sum_p w_{j,p} M_p + \sum_h w_{j,h} S_h \right) v_j,$   (9)
where $S_h$ and $w_{j,h}$ denote the skinning matrix of the h-th helper bone and the corresponding skinning weight, respectively. The first term represents the skin deformations driven by the primary skeleton, and the second term contributes additional control of the deformations using helper bones. The number of helper bones H is manually set to balance the deformation quality and computational cost. In common practice, helper bones are procedurally controlled with simple expressions according to the pose of the primary skeleton. We use a polynomial function $f_h$ that maps the primary skeleton pose $L_p$ to the helper bone transformation $S_h$ as
Fig. 2 Linear blend skinning with procedural control of helper bones (illustration: the initial pose with primary bones, helper bones, and skin mesh; under rotation of the primary bone, linear blend skinning alone produces a candy-wrap artifact, whereas linear blend skinning with procedurally controlled helper bones avoids it)
$S_h \approx f_h(L_1, L_2, \ldots, L_P).$   (10)
Our helper bone rigging technique builds the regression function $f_h$ and the skinning weights using example shapes and skeleton poses. Given a set of N pairs of an example shape and a primary skeleton pose, $\hat{v}_{j,n}$ and $\hat{M}_{p,n}$, our problem is formulated as a constrained least-squares problem that minimizes the squared reconstruction error between the example shape and the skin mesh with respect to the skinning weights $w_{j,p}$, $w_{j,h}$ and the skinning matrices $S_{h,n}$ as

$\{w^*, S^*\} = \arg\min_{w, S} \sum_n \sum_j \left\| \hat{v}_{j,n} - \sum_p w_{j,p} \hat{M}_{p,n} v_j - \sum_h w_{j,h} S_{h,n} v_j \right\|_2^2$   (11)

subject to the affinity, nonnegativity, and sparsity constraints on the skinning weights and the rigidity constraint on the skinning matrices $S_{h,n}$.
Per-Example Optimization of Helper Bone Transformations

The optimal rigid transformations of the helper bones are first estimated for each example shape using the optimization procedure summarized in Algorithm 1. Our system inserts the specified number of helper bones into the character rig in an incremental manner. Then, the helper bone transformations for each example and the skinning weights are optimized using an iterative method. The overall procedure is similar to the SSDR algorithm, where the skinning weights and bone transformations are alternately optimized by subdividing the optimization problem (Eq. 11) into subproblems of bone insertion, skinning weight optimization, and bone transformation optimization. We used the optimization techniques from the SSDR algorithm to solve these three subproblems. The main difference here is that the SSDR algorithm has no prior information about the transformable bones other than their number. Hence, the SSDR algorithm applies a clustering technique to simultaneously estimate an initial bone configuration. In our method, the primary skeleton and its example poses are given in the problem. This method inserts helper bones using incremental optimization with a hard constraint on the primary bone transformations.

Algorithm 1 Optimization of helper bone transformations and skinning weights
Input: rest-pose vertices $\{v_j\}$, rest-pose primary bone matrices $\{\bar{G}_p\}$, example shapes $\{\hat{v}_{j,n}\}$, example primary bone matrices $\{\hat{G}_{p,n}\}$, and the number of helper bones H
Output: $\{S_{h,n}\}$, $\{w_{j,p}\}$, $\{w_{j,h}\}$
1: $\{S_{h,n}\} = I$, $\forall h, n$; $\{w_{j,h}\} = 0$, $\forall j, h$
2: Initialize $\{w_{j,p}\}$
3: repeat
4:   Insert a new helper bone
5:   Update helper bone transformations $\{S_{h,n}\}$
6:   Update skinning weights $\{w_{j,p}\}$ and $\{w_{j,h}\}$
7:   Remove insignificant helper bones
8: until the number of inserted helper bones is reached
9: repeat
10:   Update helper bone transformations $\{S_{h,n}\}$
11:   Update skinning weights $\{w_{j,p}\}$ and $\{w_{j,h}\}$
12: until the error threshold is reached
Incremental bone insertion

Our technique uses an incremental method to insert a new helper bone into the region where the largest reconstruction errors occur. For example, if the current LBS rig causes an elbow-collapse artifact, a helper bone is generated around the elbow to minimize this artifact. First, our system searches for the vertex with the largest reconstruction error, which is computed as

$j^* = \arg\max_j \sum_n \left\| \hat{v}_{j,n} - \sum_p w_{j,p} \hat{M}_{p,n} v_j - \sum_h w_{j,h} S_{h,n} v_j \right\|_2^2$   (12)
Second, we compute a rigid transformation that closely approximates the displacement of the identified vertex and its one-ring neighbors from their initial positions by solving an absolute orientation problem (Horn 1987). Then, a new helper bone is generated using the estimated transformation as its own transformation. Next, the skinning weights $w_{j,p}$ and $w_{j,h}$ and the transformation matrices of all helper bones are updated by solving constrained least-squares problems. Finally, the system removes the insignificant helper bones that have little influence on the skin deformation. Our current implementation removes the helper bones that influence fewer than four vertices. This process is repeated until the specified number of helper bones is reached.

Weight and bone transformation update

After inserting the specified number of helper bones, the skinning weight update step (Section "Weight Update Step") and the bone transformation update step (Section "Bone Transformation Update Step"), except for the primary bones, are alternately iterated until the approximation error converges.
Helper Bone Controller Construction

The helper bone controller is constructed by learning a mapping from the primary bone transformations to the helper bone transformations. We use a linear regression model to represent the mapping from the local transformation of primary bones $L_p$ to that of helper bones $L_h$.

Transformation parameterization

The local transformation matrix $L_h$ is extracted from the estimated skinning matrix $S_h$. By definition, $S_h$ is decomposed into a product of the transformation matrices as
$S_h = G_{\phi(h)} L_h \bar{G}_h^{-1},$   (13)
where $\phi(h) \in \{1, \ldots, P\}$ is the parent primary bone of the h-th helper bone, which is selected to minimize the approximation error as detailed later. The initial transformation matrix $\bar{G}_h$ is an unknown rigid transformation matrix. Assuming that the local transformation matrix of the helper bones is the identity matrix in the initial stance pose, $\bar{G}_h$ is equal to that of the parent primary bone, $\bar{G}_{\phi(h)}$, by the definition of forward kinematics. Therefore, we can uniquely extract the local transformation matrix by

$L_h = G_{\phi(h)}^{-1} S_h \bar{G}_{\phi(h)}.$   (14)
The extracted local matrix is parameterized with fewer variables to reduce the dimensionality of the regression problem. Being a rigid transformation, $L_h$ can be parameterized using a combination of a translation vector $t_h \in \mathbb{R}^3$ and bone rotation variables $r_h \in SO(3)$. We used exponential maps for $r_h$ (Grassia 1998). This results in the transformation of $L_h$ into a six-dimensional vector $[t_h^T\ r_h^T]^T \in \mathbb{R}^6$. In addition, the local transformation of the primary bone $L_p$ is parameterized by its animating variables. For simplicity, we have assumed that each primary bone does not have a translation or scale key and that a bone rotation is always represented by exponential maps.

Regression model construction

We have used a χ-th-order polynomial function as the regression model. The transformation parameter of each helper bone is approximated by
$\begin{bmatrix} t_h \\ r_h \end{bmatrix} \approx f_h(L_1, L_2, \ldots, L_P) = F_h \begin{bmatrix} x_1^T & \cdots & x_P^T & 1 \end{bmatrix}^T,$   (15)
where $x_p$ is an independent variable vector composed of all of the variables of the χ-th-order polynomial of $r_p$. For example, if we take χ = 2, the independent variable vector from $r = [r_1, r_2, r_3]$ is $x = [r_1, r_2, r_3, r_1^2, r_2^2, r_3^2, r_1 r_2, r_1 r_3, r_2 r_3]$. The regression matrix for the h-th helper bone, $F_h \in \mathbb{R}^{6 \times (1 + \sum_p \dim(x_p))}$, is estimated from the examples using the least-squares technique. In addition, we add a sparsity constraint to minimize the number of nonzero regression coefficients in order to generate a simpler model. The least-squares problem with the sparsity constraint can be formulated as a Lasso problem (Tibshirani 2011) given by

$F_h^* = \arg\min_{F_h} \left\| Y_h - F_h X \right\|_2^2 + \lambda \left\| F_h \right\|_1,$   (16)

where
$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,N} \\ \vdots & \ddots & \vdots \\ x_{P,1} & \cdots & x_{P,N} \\ 1 & \cdots & 1 \end{bmatrix}, \quad Y_h = \begin{bmatrix} t_{h,1} & \cdots & t_{h,N} \\ r_{h,1} & \cdots & r_{h,N} \end{bmatrix},$

and λ is the positive shrinkage parameter that controls the trade-off between the model accuracy and the number of nonzero coefficients. Using a stock Lasso solver, we can efficiently solve this problem.

Parent bone selection

There is only one problem that remains: the selection of an appropriate parent bone $\phi(h)$ for each helper bone. This is a discrete optimization problem. Since the number of primary bones is generally small, we can perform an exhaustive search to find the optimal one. Each primary bone is used in turn as $\phi(h)$ and evaluated via Eqs. 14 and 15, and the one that minimizes the objective function (Eq. 16) is selected.
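A hedged sketch of fitting one helper bone controller with a stock Lasso solver is shown below; the monomial feature construction, the use of scikit-learn, and all names and parameter values are assumptions, and scikit-learn's alpha plays the role of λ only up to the library's internal scaling.

```python
import numpy as np
from itertools import combinations_with_replacement
from sklearn.linear_model import Lasso

def polynomial_features(r, order=2):
    """All monomials of degree 1..order of the exponential-map rotation
    r = [r1, r2, r3], e.g. order 2 yields r1, r2, r3, r1^2, r1*r2, ..."""
    feats = []
    for d in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(r)), d):
            feats.append(np.prod([r[i] for i in idx]))
    return np.array(feats)

def fit_helper_bone_controller(primary_rotvecs, helper_params, order=2, lam=1.0):
    """primary_rotvecs: (N, P, 3) exponential-map rotations of primary bones.
    helper_params:     (N, 6) stacked [t_h, r_h] of one helper bone per example.
    Returns an approximation of F_h with one Lasso fit per output dimension."""
    N = primary_rotvecs.shape[0]
    # Row n of X concatenates x_1 .. x_P and a constant 1 (cf. Eq. 15).
    X = np.array([np.concatenate([polynomial_features(r, order)
                                  for r in primary_rotvecs[n]] + [[1.0]])
                  for n in range(N)])
    F = []
    for d in range(6):
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        model.fit(X, helper_params[:, d])
        F.append(model.coef_)
    return np.vstack(F)                               # shape (6, n_features)
```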
Experimental Results

We evaluated the approximation capability and computational performance of our helper bone rigging system. For all experiments, the parameter k, which is the maximum number of transformations to be blended, was fixed at 4. The reconstruction error was evaluated using the root mean square (RMS) error of the vertex positions. The optimization procedure of the helper bone transformations is parallelized over vertices, helper bones, or examples using Intel Threading Building Blocks. The computational timing was measured on a 3.4 GHz Core i7-4770 CPU (eight logical processors) with 16 GB of RAM.
Test Dataset

We used a muscle function from Autodesk Maya to synthesize an example skin shape from a skeleton pose. The muscle system emulates static muscle–skin deformation with a skeletal pose. The muscle system also produces a dynamic deformation that is caused by bone acceleration and the inertia of the muscles. For our experiment, we used only the static deformation because our method supports only a static mapping from a skeleton pose to a skin shape. The test character model is a sample asset of a Maya tutorial, as shown in Fig. 3. The height of the leg model is 200 cm in the initial pose. The skeleton has P = 3 animating bones and five degrees of freedom (DOFs), including hip swing and twist (3 DOFs), knee bend (1 DOF), and ankle bend (1 DOF). The eleven muscles expand and contract according to the movement of the primary skeleton. They drive the deformation of 663 vertices using a proprietary algorithm. A test dataset was created by uniformly sampling the bone rotations of the primary skeleton every 20° within each range of joint motion. Consequently, we created 6750
Fig. 3 Character model used to create an example pose and skin shape (panels: primary skeleton, muscle, and skin mesh). The skin deformation is driven by a primary skeleton and virtual muscle, which is a built-in function of Autodesk Maya
example pairs consisting of a skeleton pose and skin shape by discretizing the DOFs of the hip swing, hip twist, knee bend, and ankle bend into 6 × 6, 9, 5, and 5 levels, respectively.
Evaluating the Optimized Bone Transformations

In the first experiment, different numbers of helper bones were inserted into the character rig while fixing the polynomial order χ = 2 and the shrinkage parameter λ = 0. Figure 4 shows the convergence of the reconstruction error with the number of helper bones and the number of iterations. The reconstruction error decreased according to the number of helper bones. In addition, there were no significant differences between the reconstruction error of four helper bones and that of five helper bones. This result indicates that the approximation almost converged at four helper bones. The reconstruction error monotonically decreased with the number of iterations, which demonstrates the stability of our SSDR-based rigging system. Figure 5 shows optimized models using different numbers of helper bones. The center image of each screenshot shows the initial pose, and the left and right images show a leg stretching pose and a bending pose, respectively. The helper bones a, b, and c are located near the hip, knee, and ankle to minimize LBS artifacts. Helper bone d is located in the thigh to emulate the muscle bulge. The skinning weight map for each helper bone is visualized in Fig. 6. Helper bone a had a significant influence on a large area of the thigh, whereas the other helper bones had a lesser influence. This is the inevitable result of our incremental bone insertion algorithm, where the first helper bone is inserted to offset the largest reconstruction error.
Fig. 4 Convergence of the reconstruction error according to the numbers of helper bones and iterations
(Plot: RMS error [cm] versus iteration count, with curves for H = 1–5 helper bones and a baseline without helper bones.)
(Screenshots: rigs with different numbers of helper bones, each shown in a leg extension pose, the initial pose, and a leg bend pose; helper bones are labeled a–d.)
Fig. 5 Optimized character rigs using different numbers of helper bones. Each model shows a different helper bone behavior
Fig. 6 Skinning weight map for each helper bone. The larger weight is indicated by a darker area
To build the rig with one helper bone, our system consumed 0.17, 0.51, and 0.17 s per iteration for the bone insertion step, weight update step, and transformation update step, respectively. The total optimization time was about 15 s for 20 iterations. For the rig with four helper bones, the time recorded was 0.17, 0.82, and 0.72 s per iteration. The total time was about 32 s.
Evaluating the Accuracy of the Bone Controller

In the second experiment, we examined the approximation capability of the helper bone controller. We evaluated the increase in the RMS error caused by approximating the bone transformations with the regression model. We also counted the number of nonzero polynomial coefficients using different polynomial orders χ and shrinkage parameters λ while fixing H = 4. The experimental results are summarized in Table 1. The baseline RMS reconstruction error, which was measured after the per-example transformation optimization, was 1.36 cm. The ratio of the increase in the approximation error was within the range of 150–190%. The reconstruction error decreased according to the polynomial order, and there was no significant difference between the quadratic and cubic polynomials. On the other hand, the redundant polynomial terms were removed through the shrinkage parameter λ while minimizing the increase in the approximation error. In this experiment, our prototype system consumed about 5 μs per frame to compute all skinning matrices $S_h$ from the primary skeleton pose $L_p$. In detail, 1 μs was consumed to compose the independent variables $x_p$ from the local transformation matrices $L_p$, and the computation of the regression model using Eq. 15 consumed 1 μs for each helper bone. The former time increases in proportion to the number of primary bones, and the latter increases with the number of helper bones. The computational speed is sufficiently fast, although we could further improve the performance by parallelizing the execution of the bone controllers.
Table 1 Statistics of the reconstruction error and the number of nonzero polynomial coefficients with respect to the polynomial order χ and the shrinkage parameter λ

              Average # of nonzeros           RMS error [cm]
              λ = 0    λ = 10   λ = 20        λ = 0    λ = 10   λ = 20
  Linear      6        5.3      5.1           2.57     2.58     2.59
  Quadratic   14       10.9     9.1           2.11     2.12     2.17
  Cubic       26       18.4     15.1          2.03     2.07     2.11
Limitations

Currently, our system does not provide any guidelines for creating an example dataset. Even though we have used uniform sampling of the joint DOFs to create example poses in the experiments, this simple method might generate many redundant examples. It may even fail to sample important poses and shapes. We plan to perform further studies to identify a more artist-friendly workflow that can create a minimal example dataset. We believe that an active learning method (Cooper et al. 2007) could be a possible solution that allows artists to design example shapes in a step-by-step manner. Our method does not ensure global optimality for the skinning weights and the helper bone controller. We have found that an increase in the number of helper bones often degrades the reconstruction accuracy, because numerical errors accumulate separately when solving the optimizations for the skinning decomposition and the bone controller construction.
Future Directions

In this chapter, we have described an example-based rigging technique for building an LBS rig using example data of skin deformation. Although the example-based method requires a large amount of example data to construct a character rig, several simulation methods have been proposed to generate physically valid skin deformation at a heavy computational cost. Moreover, the recent development of 3D scanning devices allows for the acquisition of the skin deformations of actual human beings. These state-of-the-art shape acquisition techniques will enable the mass production of example skin shapes within a short period of time and significantly reduce the amount of tedious manual labor for rig construction. Although most skinning decomposition techniques and helper bone rigs have been developed for LBS, popular nonlinear blend skinning techniques such as dual quaternion skinning (Kavan et al. 2007), stretchable and twistable bones (Jacobson and Sorkine 2011), and elasticity-inspired joint deformers (Kavan and Sorkine 2012) are worth investigating, which remains an interesting open question. Dynamic skinning is another promising future direction. The kinodynamic skinning technique (Angelidis and Singh 2007) provides volume-preserving deformation
based on proxy muscles. A rig-space physics technique optimizes the free parameters of a handcrafted kinematic rig to approximate physically simulated skin deformation (Hahn et al. 2012, 2013). The position-based dynamics (PBD) method has been used to synthesize the skin deformation caused by self-collisions and the secondary effects of soft tissues (Rumman and Fratarcangeli 2015). This method provides plausible and stable soft-body motion at interactive rates but requires the elaborate construction of a PBD-based rig. The MoSh model (Loper et al. 2014) estimates the dynamic skin deformation from a sparse set of motion capture markers using a statistical model of human skin shapes. This method generates the skin shape in a low-dimensional subspace to meet the movement of the markers. The Dyna model (Pons-Moll et al. 2015) also constructs a dynamic skin deformation model using a subspace analysis of 4D scans of human subjects. Its skin deformation is generated using a second-order autoregressive model with exogenous input in the low-dimensional subspace. The SMPL model (Loper et al. 2015) learns corrective blendshape models from shape samples and has been extended to synthesize dynamic deformation by incorporating the autoregressive model. These methods successfully produce realistic deformation of human skin. We have proposed an example-based method for controlling the helper bones to mimic the secondary dynamics of soft tissues (Mukai and Kuriyama 2016) while neglecting the effect of gravity and the interactions with other objects.
References

Angelidis A, Singh K (2007) Kinodynamic skinning using volume-preserving deformations. In: Proceedings of ACM SIGGRAPH/Eurographics symposium on computer animation 2007, pp 129–140
Baran I, Popović J (2007) Automatic rigging and animation of 3d characters. ACM Trans Graph 26(3):72:1–72:8
Cooper S, Hertzmann A, Popović Z (2007) Active learning for real-time motion controllers. ACM Trans Graph 26(3):5
Fan Y, Litven J, Pai DK (2014) Active volumetric musculoskeletal systems. ACM Trans Graph 33(4):152
Grassia FS (1998) Practical parameterization of rotations using the exponential map. J Graph Tools 3(3):29–48
Hahn F, Martin S, Thomaszewski B, Sumner R, Coros S, Gross M (2012) Rig-space physics. ACM Trans Graph 31(4):72:1–72:8
Hahn F, Thomaszewski B, Coros S, Sumner R, Gross M (2013) Efficient simulation of secondary motion in rig-space. In: Proceedings of ACM SIGGRAPH/Eurographics symposium on computer animation 2013, pp 165–171
Horn BKP (1987) Closed-form solution of absolute orientation using unit quaternions. J Opt Soc Am A 4(4):629–642
Jacobson A, Sorkine O (2011) Stretchable and twistable bones for skeletal shape deformation. ACM Trans Graph 30(6):Article 165
Jacobson A, Baran I, Popović J, Sorkine O (2011) Bounded biharmonic weights for real-time deformation. ACM Trans Graph 30(4):78:1–78:8
James DL, Twigg CD (2005) Skinning mesh animations. ACM Trans Graph 24(3):399–407
Kavan L, Sorkine O (2012) Elasticity-inspired deformers for character articulation. ACM Trans Graph 31(6):Article 196
Kavan L, Collins S, Zara J, O'Sullivan C (2007) Skinning with dual quaternions. In: Proceedings of ACM SIGGRAPH symposium on interactive 3D graphics 2007, pp 39–46
Kavan L, Sloan PP, O'Sullivan C (2010) Fast and efficient skinning of animated meshes. Comput Graph Forum 29(2):327–336
Kim J, Kim CH (2011) Implementation and application of the real-time helper-joint system. In: Game developers conference 2011
Kry PG, James DL, Pai DK (2002) Eigenskin: real time large deformation character skinning in hardware. In: Proceedings of ACM SIGGRAPH/Eurographics symposium on computer animation 2002, pp 153–159
Kurihara T, Miyata N (2004) Modeling deformable human hands from medical images. In: Proceedings of ACM SIGGRAPH/Eurographics symposium on computer animation 2004, pp 355–363
Le BH, Deng Z (2012) Smooth skinning decomposition with rigid bones. ACM Trans Graph 31(6):Article 199
Le BH, Deng Z (2014) Robust and accurate skeletal rigging from mesh sequences. ACM Trans Graph 33(4):1–10
Lewis JP, Cordner M, Fong N (2000) Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In: Proceedings of SIGGRAPH 2000, pp 165–172
Li D, Sueda S, Neog DR, Pai DK (2013) Thin skin elastodynamics. ACM Trans Graph 32(4):49
Loper M, Mahmood N, Black MJ (2014) MoSh: motion and shape capture from sparse markers. ACM Trans Graph 33(6):220:1–220:13
Loper M, Mahmood N, Romero J, Pons-Moll G, Black MJ (2015) SMPL: a skinned multi-person linear model. ACM Trans Graph 34(6):248:1–248:16
Magnenat-Thalmann N, Laperrière R, Thalmann D (1988) Joint-dependent local deformations for hand animation and object grasping. In: Proceedings on graphics interface '88, pp 26–33
Merry B, Marais P, Gain J (2006) Animation space: a truly linear framework for character animation. ACM Trans Graph 25(6):1400–1423
Miller C, Arikan O, Fussell DS (2011) Frankenrigs: building character rigs from multiple sources. IEEE Trans Vis Comput Graph 17(8):1060–1070
Mohr A, Gleicher M (2003) Building efficient, accurate character skins from examples. ACM Trans Graph 22(3):562–568
Mukai T (2015) Building helper bone rigs from examples. In: Proceedings of ACM SIGGRAPH symposium on interactive 3D graphics and games 2015, pp 77–84
Mukai T, Kuriyama S (2016) Efficient dynamic skinning with low-rank helper bone controllers. ACM Trans Graph 35(4):1
Neumann T, Varanasi K, Hasler N, Wacker M, Magnor M, Theobalt C (2013) Capture and statistical modeling of arm-muscle deformations. Comput Graph Forum 32(2):285–294
Park SI, Hodgins JK (2008) Data-driven modeling of skin and muscle deformation. ACM Trans Graph 27(3):Article 96
Parks J (2005) Helper joints: advanced deformations on runtime characters. In: Game developers conference 2005
Pons-Moll G, Romero J, Mahmood N, Black MJ (2015) Dyna: a model of dynamic human shape in motion. ACM Trans Graph 34(4):120:1–120:10
Rumman NA, Fratarcangeli M (2015) Position-based skinning for soft articulated characters. Comput Graph Forum 34(6):240–250
Shi X, Zhou K, Tong Y, Desbrun M, Bao H, Guo B (2008) Example-based dynamic skinning in real time. ACM Trans Graph 27(3):29:1–29:8
Sloan PPJ, Rose CF, Cohen MF (2001) Shape by example. In: Proceedings of ACM SIGGRAPH symposium on interactive 3D graphics 2001, pp 135–143
Tibshirani R (2011) Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc Ser B (Stat Methodol) 73(3):273–282
Wang RY, Pulli K, Popović J (2007) Real-time enveloping with rotational regression. ACM Trans Graph 26(3):73
Wang XC, Phillips C (2002) Multi-weight enveloping: least-squares approximation techniques for skin animation. In: Proceedings of ACM SIGGRAPH/Eurographics symposium on computer animation, pp 129–138
Crowd Formation Generation and Control Jiaping Ren, Xiaogang Jin, and Zhigang Deng
Abstract
Crowd formation transformation simulates crowd behaviors from one formation to another. This kind of transformation has often been used in animation films, group calisthenics performances, video games, and other special-effect applications. Given a source formation and a target formation, one intuitive approach to achieve the transformation between the two formations is to establish a source point and a destination point for each agent and to plan each agent's trajectory while maintaining collision-free maneuvers. Crowd formation generation and control usually consists of five different parts: formation sampling, pair assignment, trajectory generation, motion control, and evaluation. In this chapter, we will describe the involved techniques from abstract user input to collective crowd formation transformations.

Keywords
Crowd simulation • Motion control • Motion transition • Evaluation
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Formation Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Formation Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pair Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Trajectory Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Motion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
J. Ren (*) • X. Jin State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China e-mail: [email protected]; [email protected] Z. Deng Department of Computer Science, University of Houston, Houston, TX, USA e-mail: [email protected] # Springer International Publishing AG 2017 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_15-1
Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Introduction

In recent years, the simulation of group formation transformation has been increasingly used in feature animation films, video games, mass performance rehearsal, tactical arrangements of players for sports team training, and so on. Furthermore, group formation generation and control also finds wide application in many other scientific and engineering fields, including but not limited to robot control, multiagent systems, and behavioral biology. In most crowd simulation systems, each agent intelligently moves toward its destination through navigational pathfinding algorithms and avoids collisions with other agents and obstacles through local behavior control models. In collective crowd formation transformation, we should also consider the behaviors of the entire group. When simulating a crowd, one intuitive way to form a target formation is to provide each agent's desired position at a particular moment and generate transitions between that position and the destination. However, users must manually specify many spatial, temporal, and correspondence constraints, which is time-consuming and nontrivial, particularly when the crowd includes many agents that change their locations frequently. Similarly, when a group of people perform a collective action simultaneously in the real world, it is generally impossible for their commander or team leaders to convey detailed movement information such as every group member's position at every time instance. So there is a need for automatic methods to solve these problems. How do we transform abstract user inputs into a representation that can be handled easily? How do we choose the destination for each agent? When finding the path for each agent, which path is the best? And how do we control the formation efficiently? To answer these questions, we decompose collective crowd formation transformation into five different steps. Supposing that the user specifies the source and target formation shapes with sketches or contours, we first sample the agents in the formations. Then, we assign the source position and the target position for each agent. We use crowd simulation methods to generate realistic collision-avoidance trajectories of agents. We also introduce some methods to control the movements of agents and high-level transformations of crowds. Finally, we consider approaches to measure the crowd simulation and transformation results.

Organization

The rest of the chapter is organized as follows. We give an overview of prior work in section "State of the Art." In section "Formation Sampling," we introduce how to sample the source and target formations given by users. In section "Pair Assignment," we assign corresponding positions in both the source and target
formations for each agent. In section "Trajectory Generation," we generate trajectories for each agent from the source formation to the target formation. In section "Motion Control," five motion control methods are introduced. In section "Evaluation," we describe six approaches to evaluate the results of crowd formation transformation. We conclude and discuss future work in section "Summary."
State of the Art

Numerous crowd simulation and modeling approaches have been developed during the past several decades. Here we briefly review recent efforts on crowd simulation, formation transformation, and evaluation. For crowd simulation, there are two major kinds of models: rule-based models and force-based models. Rule-based crowd models are flexible enough to simulate various crowd agents through a set of carefully designed rules. The seminal work by Reynolds (1987) presented the concept of Boids, which simulates flocks of birds and schools of fish via several simple yet effective steering behavioral rules to keep the group cohesion, alignment, and separation, as well as avoid collisions between group members. Recently, Klotsman and Tal (2012) provided a biologically motivated rule-based artificial bird model, which produces plausible and realistic line formations of birds. A distinct research line of crowd simulation is force-based models, originally developed from the human social force study by Helbing and Molnar (1995). Later, it was further applied and generalized to other simulation scenarios such as densely populated crowds (Pelechano et al. 2007), simulation of pedestrian evolution (Lakoba et al. 2005), and escape panic (Helbing et al. 2000).

Group formation control is a vital collective characteristic of many crowds. Existing approaches typically combine heuristic rules with explicit hard constraints to produce and control sophisticated group formations. For example, Kwon et al. (2008) proposed a framework to generate aesthetic transitions between key crowd formation configurations. A spectral-based group formation control scheme (Takahashi et al. 2009) was also proposed. However, in these approaches, exact agent group distributions at a number of key frames need to be specified by users. Gu and Deng (Gu and Deng 2011, 2013; Xu et al. 2015) proposed an interactive and scalable framework to generate arbitrary group formations with controllable transitions in a crowd. Henry and colleagues (Henry et al. 2012, 2014) proposed a single-pass algorithm to control crowds using a deformable mesh, and this approach can be used to control crowd–environment interaction and obstacle avoidance. In addition, they proposed an alternative metric for use in a pair assignment approach for formation control that incorporates environment information. These approaches either need nontrivial manual involvement (Kwon et al. 2008; Takahashi et al. 2009) or are focused on intuitive user interfaces for formation control and interaction (Gu and Deng 2011, 2013; Henry et al. 2012, 2014; Xu et al. 2015).

Many approaches have been proposed to evaluate the results or improve the accuracy of multiagent and crowd simulation algorithms. Most of them perform evaluation by
comparing the algorithms’ output with real-world sensor data. Pettré et al. (2009) compute appropriate parameters based on Maximum Likelihood Estimation. Lerner et al. (2009) annotate pedestrian agent trajectories with action-tags to enhance their natural appearance or realism. Guy et al. (2012) propose an entropy-based evaluation approach to quantify the similarity between real-world and simulated trajectories.
Formation Generation

Formation Sampling

In this section, we describe how to generate the source and target formations of agents from user inputs. Different assumptions about the user inputs correspond to different sampling strategies. Here we mainly introduce two formation sampling strategies based on different user inputs. One assumption supposes that the user gives contour shapes, such as squares or circles, or brush paintings (Gu and Deng 2013). This work proposes a unified formation shape representation called a formation template, i.e., an oversampled point space with a roughly even distribution, and it offers an interactive way for users to draw the input formations. In this case, to generate a visually balanced target formation template, it first evenly samples the points on the boundaries and then fills the area between the inclusive and exclusive boundaries through an extended scanline flood-fill algorithm. In addition, it uses a filling algorithm to discretize the grid space. To avoid sampling points too close to the boundaries, it checks four points with constant offsets (top, bottom, left, and right) from the current checkpoint. Another assumption supposes that the user specifies the source and target formation shapes with sketches together with the number of agents in the formation (Xu et al. 2015). In order to automatically use a user-specified number of agents to form a specified formation shape with a visually natural distribution, the sampling process from the formation shape is mainly divided into the following two stages. First, the method tentatively samples approximately the specified number of formation points in the formation shape with a roughly even distribution. However, the sampled result is rather rigid and lacks aesthetic appeal; more importantly, it is difficult for this simple method to sample exactly the user-specified number of formation points. Therefore, the number of sampled points has to be tuned to exactly match the user-specified number by randomly deleting some sampled points or adding some formation points according to the Roulette Wheel Selection strategy. Then a corresponding agent is located at the location of each formation point.
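A minimal sketch of the second strategy (grid sampling inside a sketched boundary followed by count adjustment) might look as follows; the use of matplotlib's Path for the inside test, the jittered duplication for adding points, and all parameter values are assumptions.

```python
import numpy as np
from matplotlib.path import Path

def sample_formation(boundary, num_agents, rng=None):
    """Roughly even sampling of num_agents points inside a user-sketched
    closed boundary (a (B, 2) array of polygon vertices), followed by a
    simple count adjustment by random deletion or jittered duplication.
    Assumes the sketch encloses a reasonably large area."""
    rng = rng or np.random.default_rng(0)
    poly = Path(boundary)
    xmin, ymin = boundary.min(axis=0)
    xmax, ymax = boundary.max(axis=0)
    # Grid spacing chosen so the bounding box holds roughly num_agents samples.
    step = np.sqrt((xmax - xmin) * (ymax - ymin) / max(num_agents, 1))
    xs = np.arange(xmin + 0.5 * step, xmax, step)
    ys = np.arange(ymin + 0.5 * step, ymax, step)
    grid = np.array([[x, y] for x in xs for y in ys])
    pts = grid[poly.contains_points(grid)]
    # Delete random points if there are too many ...
    if len(pts) > num_agents:
        pts = pts[rng.choice(len(pts), num_agents, replace=False)]
    # ... or add jittered copies of existing points if there are too few.
    while 0 < len(pts) < num_agents:
        base = pts[rng.integers(len(pts))]
        pts = np.vstack([pts, base + rng.normal(scale=0.25 * step, size=2)])
    return pts
```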
Pair Assignment

After the sampling process, to generate the formation transformation, we need to pair the agents in the source formation with those in the target formation. We introduce two methods to solve this pair assignment problem in this section.
Fig. 1 Generating arbitrary formations from random agent distributions (left-most) through inclusion sketch (white boundaries) and exclusion sketch (red boundaries) in Gu and Deng (2011)
The method presented in Gu and Deng (2011, 2013) estimates the agent distribution in the target formation to find the correspondence between any agent in the initial formation and its appropriate candidate position in the formation template (see Fig. 1). Using formation coordinates, it designs a pair assignment algorithm based on two key heuristics. First, in the target formation, boundary agents should closely fit the boundary curves to clearly exhibit the user-specified formation shape. Second, each nonboundary agent should keep its adjacency condition as much as possible. This algorithm first finds correspondences for the boundary agents and then finds correspondences for the nonboundary agents. To find correspondences for the boundary agents, it converts the positions of all the agents in the initial distribution into formation coordinates and subtracts the formation orientation from each agent's relative direction to yield the relative agent direction. It stores this direction along with the relative agent distance in a KD-tree data structure. This approach performs the same operations for each point on the target formation template boundaries. Thus it can efficiently compute the agent corresponding to each boundary point by finding the nearest neighbor in the KD-tree. To find correspondences for the nonboundary agents, it identifies the corresponding template point for each nonboundary agent that was not selected in the previous step. Similarly, it uses each agent's formation coordinate to find the closest inner template point. The approach further transforms that point to its world coordinate representation, i.e., the agent's target position.

In the method presented in Xu et al. (2015), Delaunay triangulation is employed to represent the relationship among adjacent agents. Pair assignment can be formulated as the problem of building a one-to-one correspondence between the source point set and the target point set (see Fig. 2). It can be further formulated as finding an optimal assignment in a weighted bipartite graph. In the matching process, they apply a novel match measure to effectively minimize the overall disorder, including the variations of both time synchronization and local structure; that is, the measure minimizes the distance from source to target for each agent while keeping the average distances to neighbors similar between the source and target formations. Finally, this method applies the classical Kuhn–Munkres algorithm (Kuhn 1955; Munkres 1957) to solve the pair assignment problem.
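Since the assignment is a weighted bipartite matching, it can be prototyped with SciPy's Hungarian-algorithm (Kuhn–Munkres) solver as sketched below; the simple squared-distance-plus-neighbor-distance cost is only a stand-in for the match measure of Xu et al. (2015), and all names and weights are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial import cKDTree

def assign_pairs(source_pts, target_pts, structure_weight=1.0):
    """Build a one-to-one correspondence between (A, 2) source and target
    point sets by minimizing travel distance plus a crude local-structure
    term (difference in mean distance to the four nearest neighbors).
    Returns the target index assigned to each source agent."""
    def mean_neighbor_dist(pts):
        d, _ = cKDTree(pts).query(pts, k=5)      # self + four neighbors
        return d[:, 1:].mean(axis=1)

    travel = np.sum((source_pts[:, None, :] - target_pts[None, :, :]) ** 2, axis=2)
    ns = mean_neighbor_dist(source_pts)
    nt = mean_neighbor_dist(target_pts)
    structure = (ns[:, None] - nt[None, :]) ** 2
    cost = travel + structure_weight * structure
    _, cols = linear_sum_assignment(cost)        # Hungarian / Kuhn-Munkres
    return cols
```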
Fig. 2 Delaunay triangulation and pair assignment in Xu et al. (2015)
Trajectory Generation

In this section, we address how each agent reaches its destination at every time step after determining its target position. Six different simulation approaches are described in the following.

Reciprocal Velocity Obstacles (RVO) (van den Berg et al. 2008), which extends the Velocity Obstacle concept (Fiorini and Shiller 1998), is a widely used velocity-based crowd simulation model. This method introduces a new concept for local reactive collision avoidance. The only information each agent is required to have about the other agents is their current position, their current velocity, and their exact shape. The basic idea is that, instead of choosing a new velocity for each agent that is outside the other agent's velocity obstacle, this method chooses a new velocity that is the average of its current velocity and a velocity that lies outside the other agent's velocity obstacle. Optimal reciprocal collision avoidance (ORCA) (van den Berg et al. 2011) is a revised model derived from RVO, and it presents a rigorous approach for reciprocal n-body collision avoidance that provides a sufficient condition for each agent to be collision-free for at least a fixed amount of time into the future, only assuming that the other robots use the same collision-avoidance protocol. There are infinitely many pairs of velocity sets that allow two agents to avoid collision, but among those, it selects the pair maximizing the amount of permitted velocities "close" to the optimized velocities for the two agents. For n-body collision avoidance, each agent performs a continuous cycle of sensing and acting within each time step. In each iteration, the agent acquires the radius, the current position, and the current optimization velocity of the other robots. Based on this information, the agent infers the permitted half-plane of velocities that allow it to avoid collision with each other agent. The set of velocities that are permitted for the agent with respect to all agents is the intersection of the half-planes of permitted velocities induced by each other agent. Then the agent chooses the new velocity that is closest to its preferred velocity among all velocities inside the region of permitted velocities.

The method in Gu and Deng (2011) can automatically compute the desired position of each agent in the target formation and generate the agent correspondences between key frames. The force that drives an agent from its original position to its estimated target position can simply be the direction vector between the two positions. However, this force only considers the group formation factor. In a
dense group, such a pure formation-driven strategy cannot fully avoid agent collisions. As such, a local collision model is needed to refine within-group collision avoidance. To this end, the authors employ a force-based model (Pelechano et al. 2007) for the collision avoidance task due to its capability of handling very high-density crowds, because this model takes into account collision-avoidance forces and repulsion forces from neighboring group members and obstacles. In Gu and Deng (2013), local formation transition is the transition from one formation to another without considering the whole group's general locomotion. In addition to computing a linear interpolation from an agent's initial position to its estimated target position, this method also considers an extra repulsion force to avoid collisions. Without user interactions, each agent would go straight to the target position with minor transition adjustments on the way to avoid local collisions with other agents.

For each agent, the method of Xu et al. (2015) locally adjusts its trajectory by applying social forces such as driving and repulsive forces to navigate and avoid collisions. A mutual-information-based method is introduced; mutual information is a well-known concept in the field of information theory designed to quantify the mutual dependence between two random variables, and it is somewhat correlated with the fluency and stability of an agent subgroup's localized movements in a crowd. In their method, the mutual information between direction and position and the mutual information between velocity and position are used to adjust the ideal heading and desired velocity in the basic social force model. The online real-time motion synthesis method (Han et al. 2016) transforms the initial motion automatically using the following parameters: the target turning angle of the agent, the target scaling factor for the moving speed with respect to that of the source formation, and the time required to achieve the target formation. This method builds interpolation functions of the turning angle and scaling factor for each agent, thus yielding the velocity and position of each agent in every frame.
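For illustration, a basic social-force update in the spirit of Helbing and Molnar (1995), as used by the formation-driven methods above, might be sketched as follows; the exponential repulsion term and all parameter values are assumptions rather than the settings of any particular paper.

```python
import numpy as np

def social_force_step(pos, vel, targets, dt=0.05, v_pref=1.4,
                      tau=0.5, rep_strength=2.0, rep_range=0.6):
    """One explicit integration step for all agents.

    pos, vel, targets: (A, 2) arrays of positions, velocities, goal positions.
    Returns the updated positions and velocities.
    """
    to_goal = targets - pos
    dist = np.linalg.norm(to_goal, axis=1, keepdims=True) + 1e-9
    desired_vel = v_pref * to_goal / dist
    drive = (desired_vel - vel) / tau                 # driving force toward goal
    # Pairwise repulsion that decays exponentially with distance.
    diff = pos[:, None, :] - pos[None, :, :]          # (A, A, 2)
    d = np.linalg.norm(diff, axis=2) + 1e-9
    np.fill_diagonal(d, np.inf)                       # ignore self-interaction
    rep = (rep_strength * np.exp(-d / rep_range) / d)[:, :, None] * diff
    repulse = rep.sum(axis=1)
    vel = vel + dt * (drive + repulse)
    pos = pos + dt * vel
    return pos, vel
```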
Motion Control

As we can generate the trajectory for each agent from one formation to another, in this section we address the following problem: how can the transformation be made more reliable and more controllable? Here, we describe five different methods to solve this problem. The method presented in Gu and Deng (2011) is a two-level hierarchical group control, that is, it breaks the full group dynamics into within-group dynamics and intergroup dynamics. Xu et al. (2015) extend the adapted social force method with a subgroup formation constraint (see Fig. 3). This method clusters individual agents in a crowd into subgroups to maximally maintain the formation of the collective subgroups during the formation transition. An Affinity Propagation (AP) clustering algorithm (Frey and Dueck 2007) is used by Xu et al. (2015). The AP algorithm identifies a set of exemplars to best represent agents' positions in the formation. They choose the AP algorithm to cluster agents since an exemplar can be
Fig. 3 The adapted social force method with a subgroup formation constraint of Xu et al. (2015) (diagram components: mutual information, movement control, and subgroup clustering)
conceptually considered to represent the overall movement of its corresponding agent subgroup, and the cluster number is determined adaptively and automatically. The measure for similarity is the local relative distance variance for the collective subgroup clustering. In the method presented in Gu and Deng (2013), the authors construct a second virtual local grid field to evaluate a flow vector that guides transitions, in order to implement splitting and merging transitions. To make the group form a user-defined formation (see Fig. 4) while moving as a whole to another location, this method introduces three factors at the group level: the global navigation vector heading to the target formation's location, the velocity driven by global collision avoidance between different groups, and the user-guided velocity computed from the sketching interface. A vector field is introduced by Jin et al. (2008) to guide agents' movements (see Fig. 5); a vector field can be considered as a position-to-vector mapping in the problem domain. A physics-based predictive motion control method is described in Han et al. (2014). It first generates a reference motion automatically at run-time based on existing data-driven motion synthesis methods. Given a reference motion, it repeatedly generates an optimal control policy for a small time window that spans a couple of footsteps in the reference motion through model predictive control while shifting the window along the time axis, which supports online performance.
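A hedged sketch of the subgroup clustering step using scikit-learn's Affinity Propagation is given below; it clusters raw positions with the library's default negative-squared-Euclidean similarity rather than the local relative distance variance used in the paper, and the parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_subgroups(positions, damping=0.9, random_state=0):
    """positions: (A, 2) agent positions.

    Returns (labels, exemplar_indices); the number of subgroups is
    determined by the algorithm itself rather than specified in advance.
    """
    ap = AffinityPropagation(damping=damping, random_state=random_state)
    labels = ap.fit_predict(positions)
    return labels, ap.cluster_centers_indices_
```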
Evaluation

As there are many methods for crowd formation transformation, which one is better in a given situation? Here we introduce six different approaches to evaluate transformation results. The visual method is the most direct and basic way to judge whether an animation is aesthetically pleasing or not. However, the results of visual methods are subjective by nature, and quantitative approaches are more reliable and objective for researchers.
Fig. 4 Formation transitions with trajectory controls of Gu and Deng (2013)
Fig. 5 Agents’ movements guided by a vector field of Jin et al. (2008)
The simulation time consumption is a commonly used measure for quantifying the performance of simulation methods. Among methods with the same functionality, the one with lower time consumption is preferred. Especially in interactive applications, run-time performance is required for a good user experience. The mutual information measure introduced by Xu et al. (2015) can be adapted to assess the aesthetic aspect of a crowd formation transformation by computing the mutual information of a dynamic crowd. The stability of local structure (Xu et al. 2015), just as its name implies, is a measure of the local stability of a transformation. It uses the standard deviation and the average value to quantify the stability property of the local structure during a crowd formation transformation. The stability of local structure is the standard deviation of the minimum neighbor distance over agents; clearly, a lower value of the standard deviation indicates that the agents have more similar distances from their neighbors, and vice versa. The effort balancing metric (Xu et al. 2015) measures the synchronization of the transformation and employs the standard deviation and the average value to estimate the balancing of the agents' efforts. When a formation transformation is smooth and visually pleasing, we anticipate that, for any agent, the effort from its source position to its current position is not only small but also balanced. Effort balancing is the standard deviation of the distance between each agent's source position and its current position.
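The two statistics can be computed directly from their verbal definitions, as in the following sketch; the use of the single nearest neighbor and the per-frame evaluation are assumptions about details the text leaves open.

```python
import numpy as np
from scipy.spatial import cKDTree

def stability_of_local_structure(positions):
    """Standard deviation of each agent's distance to its nearest neighbor
    at one frame; lower values indicate a more even local structure."""
    d, _ = cKDTree(positions).query(positions, k=2)   # self + nearest neighbor
    return d[:, 1].std()

def effort_balancing(source_positions, current_positions):
    """Standard deviation of the distances agents have covered from their
    source positions; lower values indicate better-synchronized effort."""
    effort = np.linalg.norm(current_positions - source_positions, axis=1)
    return effort.std()
```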
A data-driven quantitative approach (Ren et al. 2016; Wang et al. 2015) has been presented to evaluate collective dynamics models using real-world trajectory datasets. Two different groups with noisy trajectories may exhibit similar behaviors even when their trajectory positions are quite different. This approach uses discrete probability density functions (PDFs) that are generated from time-varying metrics and reflect the global characteristics of groups, so the influence of a small amount of data abnormality or noise can be ignored. It introduces seven time-varying metrics: the velocity, the acceleration, the angular velocity, the angular acceleration, the Cartesian jerk, the shortest distance, and the velocity difference. The evaluation model is related to the sum of the differences in the discrete PDFs between the real-world data and the simulation data for the seven metrics. To compare different simulation methods, the overall evaluation iterates between two components: optimizing the dynamics model parameters and optimizing the weights of the seven energy terms. This method uses a genetic algorithm to compute the optimal parameters by maximizing the evaluation model and introduces entropy to compute the weights of the seven time-varying metrics.
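The core comparison can be sketched as follows, assuming each metric has already been computed per agent and per frame; the bin count, the L1 distance, and the weighting scheme are illustrative assumptions rather than the exact formulation of Ren et al. (2016).

```python
import numpy as np

def discrete_pdf(samples, bins=32, value_range=None):
    """Histogram a time-varying metric (e.g., speed per agent per frame) into a discrete PDF."""
    hist, _ = np.histogram(samples, bins=bins, range=value_range, density=False)
    return hist / max(hist.sum(), 1)

def pdf_difference(real_samples, sim_samples, bins=32):
    """L1 difference between the discrete PDFs of one metric on real and simulated data."""
    lo = min(np.min(real_samples), np.min(sim_samples))
    hi = max(np.max(real_samples), np.max(sim_samples))
    p = discrete_pdf(real_samples, bins, (lo, hi))
    q = discrete_pdf(sim_samples, bins, (lo, hi))
    return np.abs(p - q).sum()

def evaluation_score(real_metrics, sim_metrics, weights):
    """Weighted sum of per-metric PDF differences; lower means the simulation is closer to the data.
    real_metrics/sim_metrics: dict of metric name -> sample array; weights: dict of metric name -> weight."""
    return sum(w * pdf_difference(real_metrics[k], sim_metrics[k]) for k, w in weights.items())
```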
Summary

We have discussed the processes involved in collective crowd formation transformation in this chapter. At the formation sampling level, we describe two formation sampling strategies. The first strategy samples the input sketches with a roughly even distribution and then adopts roulette wheel selection to assign the agents to their positions. The second strategy fills the boundary first and then considers the inner area. At the pair assignment level, we introduce two strategies to find the correspondence between an agent's original position and its new position. One method finds correspondences for the boundary agents followed by the nonboundary agents. The other method presents a measure that minimizes the overall disorder, including the variations of both time synchronization and local structure, and transforms the pair assignment problem into an optimization problem. At the trajectory generation level, we introduce six methods to guide the movement of each agent in the formation; the main concern is collision avoidance. At the motion control level, we describe five methods that make the results more reliable and controllable. At the evaluation level, we introduce six different approaches to evaluate the simulation methods, the transformation results, and the execution performance, most of them quantitative.

Limitation and Future Work

Although various methods can be applied to the crowd formation transformation problem, these approaches still cannot handle complex scenarios with obstacles or the mixing and separation of multiple formations. Moreover, current approaches assume that the collective crowds move in two dimensions. In the future, the existing methods can be improved by dealing with complex obstacles, considering interactions among different groups, extending the 2D methods to 3D, and changing the target formation dynamically.
Cross-References ▶ Crowd Evacuation Simulation ▶ Functional Crowds ▶ Optimal Control Modeling of Human Movement ▶ Segmental Movements in Cycling
References van den Berg J, Lin M, Manocha D (2008) Reciprocal velocity obstacles for real-time multi-agent navigation. In: IEEE international conference on robotics and automation (ICRA 2008), IEEE, pp 1928–1935 van den Berg J, Guy SJ, Lin M, Manocha D (2011) Reciprocal n-body collision avoidance. In: Robotics research, Springer, pp 3–19 Fiorini P, Shiller Z (1998) Motion planning in dynamic environments using velocity obstacles. Int J Rob Res 17(7):760–772 Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976 Gu Q, Deng Z (2011) Formation sketching: an approach to stylize groups in crowd simulation. In: Proceedings of graphics interface 2011, Canadian Human-Computer Communications Society, pp 1–8 Gu Q, Deng Z (2013) Generating freestyle group formations in agent-based crowd simulations. IEEE Comput Graph Appl 33(1):20–31 Guy SJ, van den Berg J, Liu W, Lau R, Lin MC, Manocha D (2012) A statistical similarity measure for aggregate crowd dynamics. ACM Trans Graph 31(6):190:1–190:11 Han D, Noh J, Jin X, Shin JS, Shin SY (2014) On-line real-time physics-based predictive motion control with balance recovery. Comput Graphics Forum 33:245–254. Wiley Online Library Han D, Hong S, Noh J, Jin X, Shin JS (2016) Online real-time locomotive motion transformation based on biomechanical observations. Comput Anim Virtual Worlds 27(3–4):378–384 Helbing D, Molnar P (1995) Social force model for pedestrian dynamics. Phys Rev E 51(5):4282 Helbing D, Farkas I, Vicsek T (2000) Simulating dynamical features of escape panic. Nature 407(6803):487–490 Henry J, Shum HP, Komura T (2012) Environment-aware real-time crowd control. In: Proceedings of the 11th ACM SIGGRAPH/Eurographics conference on computer animation, Eurographics Association, pp 193–200 Henry J, Shum HP, Komura T (2014) Interactive formation control in complex environments. IEEE Trans Vis Comput Graph 20(2):211–222 Jin X, Xu J, Wang CC, Huang S, Zhang J (2008) Interactive control of large-crowd navigation in virtual environments using vector fields. IEEE Comput Graph Appl 28(6):37–46 Klotsman M, Tal A (2012) Animation of flocks flying in line formations. Artif Life 18(1):91–105 Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2(1–2):83–97 Kwon T, Lee KH, Lee J, Takahashi S (2008) Group motion editing. ACM Trans Graph 27:80 Lakoba TI, Kaup DJ, Finkelstein NM (2005) Modifications of the Helbing-Molnar-Farkas-Vicsek social force model for pedestrian evolution. Simulation 81(5):339–352 Lerner A, Fitusi E, Chrysanthou Y, Cohen-Or D (2009) Fitting behaviors to pedestrian simulations. In: Proceedings of the 2009 ACM SIGGRAPH/Eurographics symposium on computer animation, ACM, pp 199–208 Munkres J (1957) Algorithms for the assignment and transportation problems. J Soc Ind Appl Math 5(1):32–38
Pelechano N, Allbeck JM, Badler NI (2007) Controlling individual agents in high-density crowd simulation. In: Proceedings of the 2007 ACM SIGGRAPH/Eurographics symposium on computer animation, Eurographics Association, pp 99–108 Pettré J, Ondřej J, Olivier AH, Cretual A, Donikian S (2009) Experiment-based modeling, simulation and validation of interactions between virtual walkers. In: Proceedings of the 2009 ACM SIGGRAPH/Eurographics symposium on computer animation, ACM, pp 189–198 Ren J, Wang X, Jin X, Manocha D (2016) Simulating flying insects using dynamics and data-driven noise modeling to generate diverse collective behaviors. PLoS One 11(5):e0155698 Reynolds CW (1987) Flocks, herds and schools: a distributed behavioral model. ACM SIGGRAPH Comput Graph 21(4):25–34 Takahashi S, Yoshida K, Kwon T, Lee KH, Lee J, Shin SY (2009) Spectral-based group formation control. Comput Graphics Forum 28: 639–648. Wiley Online Library Wang X, Ren J, Jin X, Manocha D (2015) Bswarm: biologically-plausible dynamics model of insect swarms. In: Proceedings of the 14th ACM SIGGRAPH/Eurographics symposium on computer animation, ACM, pp 111–118 Xu M, Wu Y, Ye Y, Farkas I, Jiang H, Deng Z (2015) Collective crowd formation transform with mutual information–based runtime feedback. Comput Graphics Forum 34:60–73. Wiley Online Library
Functional Crowds Jan M. Allbeck
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Animation to AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . HCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 3 4 6 6 7 8 9
Abstract
Most crowd simulation research either focuses on navigating characters through an environment while avoiding collisions or on simulating very large crowds. Functional crowds research focuses on creating populations that inhabit a space as opposed to passing through it. Characters exhibit behaviors that are typical for their setting, including interactions with objects in the environment and each other. A key element of this work is ensuring that these large-scale simulations are easy to create and modify. Automating the inclusion of action and object semantics can increase the level at which instructions are given. To scale to large populations, behavior selection mechanisms must be kept relatively simple and, to demonstrate typical human behavior, must be based on sound psychological models. The creation of roles, groups, and demographics can also facilitate behavior selection. The simulation of functional crowds necessitates research in animation, artificial intelligence, psychology, and human-computer interaction (HCI). This chapter provides a brief introduction to each of these elements and their application to functional crowds.

J.M. Allbeck (*) George Mason University, Fairfax, VA, USA e-mail: [email protected] # Springer International Publishing AG 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_16-1
Keywords
Crowd simulation • Virtual humans • Patterns of life • Computer animation • AI
Introduction

Virtual humans can be used as stand-ins when using real humans would be too dangerous or cost-prohibitive or when precise control is required. Virtual humans are often used as extras or background characters in movies and games (see Fig. 1). They are similarly used in virtual training scenarios for military personnel and first responders. They can also be used to analyze urban and architectural design as well as various policies and procedures. For many of these applications and others, the virtual humans must both reflect typical or normal human behavior and also be controllable or directable. Furthermore, in order to create sizeable crowds of virtual humans functioning in rich virtual environments, they must have relatively simple behavior selection mechanisms.

Functional crowds, in contrast to more typical crowd simulations, depict animated characters interacting with the environment in meaningful ways. They do not simply walk from one location to another avoiding obstacles. They perform the same behaviors we see from real humans every day, as well as less typical behaviors that might be required for the application.

The first element needed to achieve functional crowds is animation. Traditional crowd simulations focus on walking animation clips, perhaps with a few idle behaviors or, depending on the application, some battle moves. There is little or no interaction with objects in the environment. Animating virtual humans manipulating objects can be quite challenging. It involves detection of collisions and fine motor movements. We will give an overview of some of these challenges and approaches for solving them in this chapter.
Fig. 1 Virtual characters in a scene in the Unreal game engine
Another required element relates to providing the virtual humans with the knowledge of what actions can be performed and what objects are required to perform them. If we are going to eat, we need an object corresponding to food to eat. We may optionally need instruments such as utensils. A lot of this needed information could be considered commonsense, but unless it is explicitly supplied to the virtual humans, they lack it. This information is also needed as input to higher-level artificial intelligence mechanisms such as behavior selectors, planners, and narrative generators.

Functional crowds should also depict a heterogeneous population. In real life, not everyone does the same thing; people do not have the same priorities, and they do not all perform tasks in exactly the same way. Some of these variations stem from prior observations and experiences. They are learned. Others stem from psychological states and traits, such as personalities and emotions.

Finally, many, if not all, applications of functional crowds require some human-computer interaction (HCI). This interaction may come during the authoring of the crowd behavior. The application or scenario may require some of the behaviors to be more tightly controlled or even scripted. The application may also require users (e.g., players, trainees, evaluators, etc.) to interact with the crowd during the simulation. These interactions may simply require the virtual humans to avoid collisions with the real human's avatar, or they may require communication and perhaps even cooperation between the real and virtual humans.

This chapter will present these various elements of functional crowds and discuss challenges and approaches to address them. We will start by providing a snapshot of the current state of the art in related research fields. Then we will in turn discuss issues related to AI and animation, psychological models, and HCI. Finally, we will conclude with a brief summary and potential future directions.
State of the Art

In the past decade or so, crowd simulations have made enormous progress in the number of characters that can be simulated and in creating more natural behaviors. More detailed analysis of crowd simulation research can be found in a number of published volumes (Kapadia et al. 2015; Pelechano et al. 2008, 2016; Thalmann et al. 2007). It is now possible to simulate over one million characters in real time in high-density crowds. Crowd simulations can also be more heterogeneous. Not every character looks or behaves exactly the same. Certainly some variations stem from differences in appearance and motion clips (Feng et al. 2015; McDonnell et al. 2008). Others come from psychological models such as emotion and personality (Balint and Allbeck 2014; Durupinar et al. 2016; Li et al. 2012). Most crowd simulations assign, fairly randomly, starting positions and ending destinations for the characters in the simulation. While this appears fine for short durations at a distance, if a player follows a character for a period of time, it quickly appears false. Sunshine-Hill and
Badler have created a framework for generating plausible destinations for characters on the fly to provide reasonable "alibis" for them (Sunshine-Hill and Badler 2010). Simulating functional crowds also requires other advanced computer graphics techniques. Commercial game engines, such as Unity® and Unreal®, provide much of the technology necessary. In the past couple of years, they have both changed their licensing structures in ways that enable researchers to take advantage of and add to their capabilities. Other needed advancements come from the animation research community. A key feature of functional crowds is the ability of characters to interact with objects in their environment in meaningful ways. We require animations of characters sitting and eating food, getting in and out of vehicles, conversing with one another, displaying emotions, getting dressed, etc. (Bai et al. 2012; Clegg et al. 2015; Hyde et al. 2016; Shapiro 2011).
Animation to AI

To simulate a functional crowd, we need the characters to interact with their object-rich environments and with each other. While great work has been done in pathfinding, navigation, and path following, additional advancements are still needed (Kapadia et al. 2015; Pelechano et al. 2016). Characters still struggle to get through cluttered environments with narrow walkways. We need to give characters enhanced abilities to turn sideways, sidestep, and even back up in seamless natural motions. Furthermore, characters need to be able to grab, carry, place, and use objects of different shapes and sizes and do so when the objects are placed at various locations in the world and approached from any direction. The core motions for characters are generally generated in one of three ways: key framing, motion capture, or procedural generation. Artist-created key-framed and motion-captured motions tend to look natural and expressive, but lack the flexibility needed for most object interactions. Procedurally generated motions use algorithms such as inverse kinematics that work well to target object locations (e.g., for a reach and grab), but often lack a natural look and feel and require objects to be labeled with sites, regions, and directions referenced in the code. While progress continues in virtual human animation research, natural-looking functional crowds will require even more advancement to make authoring and animating large populations of characters more feasible. Once the characters can be animated interacting with objects in the environment, they need to possess an understanding of what can be done with objects and what objects are needed in order to perform various actions. In other words, they need to understand object and action semantics. This includes knowing what world state must exist prior to the start of an action (i.e., applicability conditions and preparatory specifications), what state must hold during the action, what the execution steps of the action are, and finally what the new world state is after the successful execution of the action. As indicated previously, there also needs to be information about the parts and various locations of the objects (e.g., grasp locations, regions to sit on (see Fig. 2), etc.) so that animation components can be effective. Representations, such as the Parameterized Action Representation (PAR), are designed to hold this
information, but authoring them is time-consuming and error prone (Bindiganavale et al. 2000). In order to scale to the level needed to simulate functional crowds in large, complex environments, the creation of action and object semantics needs to be automated. Automating action and object semantics would also help to ensure some consistency within and between scenarios, whereas ad hoc, hand authoring tends to be sloppy and error prone. Online lexical databases, such as WordNet, VerbNet, and FrameNet, have been shown to provide a viable foundation for action and object semantics for virtual worlds (Balint and Allbeck 2015; Pelkey and Allbeck 2014). Additional work is needed to ensure the information represented is what is needed for the applications in virtual worlds and to ensure that mechanisms for searching and retrieval are fast enough.

Fig. 2 Regions indicating places where characters could sit

Given that characters have some basic understanding of the virtual world they are inhabiting, the next question is, at any given time, how should characters select their behaviors? Planning and other sophisticated AI techniques can be computationally intensive and difficult to control. For functional crowds, it would be better to start with simple techniques both in authoring and execution (Allbeck 2009, 2010). Human behaviors stem from a variety of different impetuses. Some behaviors, such as going to work or school or attending a meeting, are scheduled. These actions provide some structure to our lives and the lives of our virtual counterparts. They are selected based on the simulated time of day. Reactive actions are responses to the world around us. They add life and variation to the behaviors of virtual characters. They are selected based on the objects, people, and events around the character. Aleatoric or stochastic actions include sub-actions with different distributions. For example, we may want a character to appear to be working in her office, but we are not very concerned with the details. Our WorkInOffice action would include sub-actions like talking on the phone, filing papers, and using the computer. The character would switch between these actions for the specified period of time at the specified distribution, but what exact sub-action is being performed at any point in time would not need to be specified. Need-based actions add humanity to the virtual characters. Needs grow over time and are satisfied by performing certain actions with the necessary object participants (e.g., eat food). As a need grows, the priority of selecting a behavior that would fulfill it also grows. These needs could correspond to established psychological models, such as Maslow's hierarchy of needs, or they could be specific to the scenario (e.g., drug addiction). These are just a few examples of simple behavior selection mechanisms. Certainly, others are possible and may be more applicable to some scenarios. Practically speaking, it may be best to completely script the behaviors of some key characters in a scenario. Background characters could then have variations in their schedules, reactions, needs, and distributions. More sophisticated AI techniques could be included when and where needed, as long as the overall framework remains fast enough for human interaction.
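As a concrete illustration of the action and object semantics discussed above, the sketch below defines a small, PAR-inspired record with participants, preconditions, execution steps, and effects. The field names and the example action are illustrative assumptions, not the actual PAR schema or its authoring tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical, simplified action-semantics record loosely inspired by PAR-style
# representations; field names are illustrative, not the actual PAR schema.
@dataclass
class ActionSemantics:
    name: str                                    # e.g., "Eat"
    participants: List[str]                      # required object roles, e.g., ["food"]
    preconditions: List[Callable[[dict], bool]]  # world-state checks before the action may start
    execution_steps: List[str]                   # ordered sub-steps, e.g., ["approach", "grasp", "consume"]
    effects: Dict[str, object]                   # world-state changes after successful execution

def applicable(action: ActionSemantics, world_state: dict) -> bool:
    """An action is applicable when all of its preconditions hold in the current world state."""
    return all(check(world_state) for check in action.preconditions)

# Example: eating requires reachable food and a sufficiently hungry agent.
eat = ActionSemantics(
    name="Eat",
    participants=["food"],
    preconditions=[lambda w: w.get("food_reachable", False), lambda w: w.get("hunger", 0.0) > 0.5],
    execution_steps=["approach_food", "grasp_food", "consume"],
    effects={"hunger": 0.0},
)
```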
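The four action types just described (scheduled, reactive, aleatoric, and need-based) can be combined in a lightweight selector such as the following sketch; the thresholds, growth rates, and action names are assumptions chosen for illustration.

```python
import random

# Minimal sketch of mixing scheduled, reactive, aleatoric, and need-based action
# selection; thresholds, growth rates, and action names are illustrative assumptions.
class Character:
    def __init__(self, schedule, needs):
        self.schedule = schedule      # {hour: action}, e.g., {9: "GoToWork"}
        self.needs = dict(needs)      # {"hunger": 0.2, "rest": 0.1}, each grows over time

    def select_action(self, sim_hour, nearby_events, rng=random):
        # 1. Reactive actions respond to events perceived around the character.
        if "alarm" in nearby_events:
            return "Evacuate"
        # 2. Need-based actions: the most pressing need wins once it crosses a threshold.
        need, level = max(self.needs.items(), key=lambda kv: kv[1])
        if level > 0.8:
            return f"Satisfy:{need}"
        # 3. Scheduled actions provide the daily structure.
        if sim_hour in self.schedule:
            return self.schedule[sim_hour]
        # 4. Aleatoric fallback: pick a sub-action of a loose activity by its distribution.
        return rng.choices(["TalkOnPhone", "FilePapers", "UseComputer"], weights=[0.2, 0.3, 0.5])[0]

    def tick(self, dt_hours):
        # Needs grow over time and are reset when the corresponding action is performed.
        for k in self.needs:
            self.needs[k] = min(1.0, self.needs[k] + 0.05 * dt_hours)
```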
Heterogeneity

In real human populations, not everyone is doing the same thing at the same time. There are variations in behaviors that stem from different factors. The psychological research community has spent decades positing numerous models of personality, emotions, roles, status, culture, and more. The virtual human research community has taken these models as inspiration for computational models of virtual human behaviors (Allbeck and Badler 2001; IVA 1998; Li and Allbeck 2011). Variations in behavior and behavior selection can also evolve as the characters learn about and from their environment and each other (Li and Allbeck 2012). All of this research needs additional attention and revision. In particular, how these different traits are manifested in expressive animation needs continuing work, as does the interplay of psychological models. How does personality affect emotion and the display of emotion? How do a character's roles, and changes in those roles, affect emotional displays? Certainly culture and its impacts are not well modeled in virtual humans. How do all of these psychological models influence a character's priorities? At any point in time, a character's behavior selection should reflect what is most important for it to achieve at that time. Its priorities can be influenced by any number of factors. For functional crowds, it is important that priorities be weighed quickly and that behavior selection is not delayed by an overly complex psychological framework. An open question for most scenarios is what parts of human behavior are really important to model and what can be left out. It is possible that a fair amount of purely random choice would suffice for the majority of the characters a lot of the time, but this depends on the duration of the simulation and how often the same character or characters are encountered by the viewer.
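One simple way such traits could feed into priorities is sketched below, where two personality values in [0, 1] scale the base priorities of candidate actions; the trait set, the linear weighting, and the action labels are assumptions for illustration only, not an established psychological mapping.

```python
# Illustrative sketch only: scaling candidate-action priorities by two personality
# traits (values in [0, 1]); the traits and weights are assumptions, not an
# established mapping from the psychology literature.
def choose_by_traits(base_priorities, extraversion, conscientiousness):
    adjusted = {}
    for action, priority in base_priorities.items():
        w = 1.0
        if action.startswith("Social:"):
            w += 0.5 * (extraversion - 0.5)        # extraverts favor social actions
        if action.startswith("Work:"):
            w += 0.5 * (conscientiousness - 0.5)   # conscientious characters favor duties
        adjusted[action] = priority * max(w, 0.1)
    return max(adjusted, key=adjusted.get)

# The same candidate set yields different choices for different trait values.
candidates = {"Social:Chat": 0.6, "Work:FileReport": 0.55, "Rest:Sit": 0.3}
print(choose_by_traits(candidates, extraversion=0.9, conscientiousness=0.2))  # "Social:Chat"
print(choose_by_traits(candidates, extraversion=0.1, conscientiousness=0.9))  # "Work:FileReport"
```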
HCI

Most applications of functional crowds require them to have some interaction with real humans, either during the authoring process or while the simulation unfolds, or both. Authoring the behavior of an entire population of characters from the ground up would be infeasible. Providing a layer of automatically generated common
understanding (i.e., action and object semantics) does help. Simple, yet robust, behavior selection mechanisms are also helpful. Furthermore, the action types described earlier can be linked to even higher-level constructs, such as groups and roles (Li and Allbeck 2011). When authoring behaviors, it is important to balance autonomy and control. To accomplish the objectives of the scenario, authors need to have control over the characters and their behaviors. However, authoring every element of every behavior of every character would be overwhelming even for short-duration simulations of forty or fifty characters. The characters need to have some level of autonomy. They need to be able to decide what to do and how to do it on their own. Then, when and if they receive instruction from the simulation author, they need to suspend or preempt their current behaviors to follow those instructions. There may also be times when authors have an overall narrative in mind for the simulations, but are less concerned about some of the details of the characters' behaviors. This is one place where more sophisticated AI methods like partial planners may play a role (Kapadia et al. 2015).

HCI also comes into play as one or more humans interact with the functional crowds during the simulation. They may be using a standard keyboard, mouse, and monitor. They may be using a mobile device. They may be using a gaming console. Or they may be using more advanced virtual reality (VR) devices. VR devices can provide a higher fidelity and therefore enable the subjects to see the virtual world in more detail. Using head-mounted displays (HMDs) or CAVE systems allows the subject to view the virtual characters in a life-size format. The movements of subjects can also be motion captured in real time and displayed on their avatars, providing more realistic interaction with the virtual characters. Hardware interfaces can impact the level of a subject's immersion into a virtual world and potentially their level of presence in the virtual world.

Another aspect of HCI with virtual characters and functional crowds is a kind of history. If a subject spends longer durations in the virtual world and/or has repeated exposure to it, he or she may become familiar with some of the characters and form expectations about them. Subjects may learn their personalities and behavioral quirks. Subjects will expect some consistency in these behaviors. They may also expect the virtual characters to have some level of memory of past interactions. While these expectations can be met, it is still a challenge to provide the virtual characters with techniques that make these memories compact, efficient, and plausibly imperfect (Li et al. 2013). More research is needed.
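A compact, plausibly imperfect interaction memory might look like the sketch below, in which entries decay exponentially and are dropped when too weak or when capacity is exceeded; this is an illustrative assumption, not the parameterized memory model of Li et al. (2013).

```python
import math
import time

# Sketch of a bounded interaction memory: each entry decays exponentially and is
# forgotten below a threshold. Illustrative assumption only, not the model of
# Li et al. (2013).
class InteractionMemory:
    def __init__(self, half_life_s=600.0, forget_below=0.05, capacity=50):
        self.half_life = half_life_s
        self.forget_below = forget_below
        self.capacity = capacity
        self.entries = {}   # key: (who, what) -> (strength, timestamp)

    def remember(self, who, what, salience=1.0, now=None):
        now = time.time() if now is None else now
        self.entries[(who, what)] = (salience, now)
        if len(self.entries) > self.capacity:     # keep the memory compact
            weakest = min(self.entries, key=lambda k: self.recall(*k, now=now))
            del self.entries[weakest]

    def recall(self, who, what, now=None):
        """Return the decayed strength of a memory, or 0.0 if it has faded away."""
        now = time.time() if now is None else now
        strength, t = self.entries.get((who, what), (0.0, now))
        decayed = strength * math.exp(-math.log(2) * (now - t) / self.half_life)
        return decayed if decayed >= self.forget_below else 0.0
```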
Conclusions

Functional crowds can increase the number of applications of crowd simulations and increase their accuracy, but as this chapter has discussed, there is additional research needed from character animation to AI to psychological models to HCI. Increased computing power will help, but is not an overall solution. Attempting to simulate realistic human behaviors is difficult. It is even more challenging at a large scale. When attempting to simulate realistic human behavior, we can end up losing focus.
One model or technique leads us to another and another until we have lost sight of our original goal. Too often researchers also design and implement a method and then go in search of a problem it might address. We might be better served to keep focused on an application or scenario and then determine what is and is not most critical to achieving its goals. Does the application really require a sophisticated planner or emotion model? How closely and for how long is the subject going to be observing the characters' behaviors? Also, do we really need to simulate 500,000 characters at a time? At ground level in the center of a village or even a large city, how many people can be seen at one time? Are there existing tools, open source or commercial, that can be used or modified? Too often researchers feel they have to construct their own models from scratch, ignoring years of effort by others. In terms of both human effort and computation, use available resources wisely and do not put a large amount of effort into areas that will have little impact on the application. In this area of research, another question that is often asked is how you validate your model. How can one validate human behavior? We could show videos of functional crowds to hundreds of people and ask them a variety of questions to try to determine if they think the character behaviors are realistic, reasonable, or even plausible, but we all have rather different ideas of what is reasonable behavior. Instead, we choose to frame work in this area as the construction of a toolset to be used by subject matter experts to achieve their own goals. For example, an urban planner may wish to use functional crowds to analyze a proposed transportation system. Evaluation then becomes about whether or not the urban planner can use the functional crowds toolset to do the desired analysis. Does it have the parameters required? Is it usable by nonprogrammers? Can they increase or decrease fidelity relative to the input effort? As a research area, functional crowds is a young but promising direction. It sits at the overlap of several other research communities, namely, computer graphics and animation, artificial intelligence, human-computer interaction, and psychology. As advances are made in each one of these disciplines, functional crowds can benefit.
Cross-References ▶ Biped Controller for Character Animation ▶ Blendshape Facial Animation ▶ Comparative Evaluation of Crowd Animation ▶ Crowd Evacuation Simulation ▶ Crowd Formation Generation and Control ▶ Data-Driven Character Animation Synthesis ▶ Data-Driven Hand Animation Synthesis ▶ Depth Sensor Based Facial and Body Animation Control ▶ Example-Based Skinning Animation ▶ Eye Animation ▶ Hand Gesture Synthesis for Conversational Characters
▶ Head Motion Generation ▶ Laughter Animation Generation ▶ Perceptual Evaluation of Human Animation ▶ Perceptual Study on Facial Expressions ▶ Perceptual Understanding of Virtual Patient Appearance and Motion ▶ Physically-Based Character Animation Synthesis ▶ Real-time Full Body Motion Control ▶ Real-Time Full Body Pose Synthesis and Editing ▶ Video-Based Performance Driven Facial Animation ▶ Visual Speech Animation
References Allbeck JM (2009) Creating 3D animated human behaviors for virtual worlds. University of Pennsylvania, Philadelphia, PA Allbeck JM (2010) CAROSA: a tool for authoring NPCs. In: Presented at the international conference on motion in games. Springer, pp 182–193. Allbeck J, Badler NI (2001) Consistent Communication with Control. In: Workshop on non-verbal and Verbal Communicative Acts to achieve contextual embodied agents at autonomous agents. Bai Y, Siu K, Liu CK (2012) Synthesis of concurrent object manipulation tasks. ACM Trans Graph 31(6):156 Balint T, Allbeck JM (2014) Is that how everyone really feels? emotional contagion with masking for virtual crowds. In: Presented at the international conference on intelligent virtual agents. Springer, pp 26–35 Balint T, Allbeck JM (2015) Automated generation of plausible agent object interactions. In: Presented at the international conference on intelligent virtual agents. Springer, pp 295–309 Bindiganavale R, Schuler W, Allbeck JM, Badler NI, Joshi AK, Palmer M (2000) Dynamically altering agent behaviors using natural language instructions. In: Proceedings of the fourth international conference on autonomous agents. ACM, New York, pp 293–300. doi:10.1145/ 336595.337503 Clegg A, Tan J, Turk G, Liu CK (2015) Animating human dressing. ACM Trans Graph 34(4):116 Durupinar F, Gudukbay U, Aman A, Badler N (2016) Psychological parameters for crowd simulation: from audiences to Mobs. IEEE Trans Vis Comput Graph 22(9):2145–2159 Feng A, Casas D, Shapiro A (2015) Avatar reshaping and automatic rigging using a deformable model. In: Presented at the proceedings of the 8th ACM SIGGRAPH conference on motion in games. ACM, pp 57–64 Hyde J, Carter E, Kiesler S, Hodgins J (2016) Evaluating animated characters: facial motion magnitude influences personality perceptions. ACM Trans Appl Percept 13(2):8:1–8:17 IVA (1998) International conference on intelligent virtual humans. Springer, Berlin Kapadia M, Pelechano N, Allbeck J, Badler N (2015) Virtual crowds: steps toward behavioral realism. Morgan & Claypool Publishers, San Rafael, California Li W, Allbeck JM (2011) Populations with purpose. In: Motion in games. Springer, Berlin/ Heidelberg, pp 132–143 Li W, Allbeck JM (2012) Virtual humans: evolving with common sense. In: Presented at the international conference on motion in games. Springer, pp 182–193 Li W, Di Z, Allbeck JM (2012) Crowd distribution and location preference. Comput Anim Virtual Worlds 23(3–4):343–351 Li WP, Balint T, Allbeck JM (2013). Using a parameterized memory model to modulate NPC AI. In: Presented at the intelligent virtual agents: 13th international conference, IVA 2013, Edinburgh, August 29–31, 2013, Proceedings, vol 8108. Springer, p 1
McDonnell R, Larkin M, Dobbyn S, Collins S, O’Sullivan C (2008) Clone attack! perception of crowd variety. In: Presented at the ACM Transactions on Graphics (TOG), vol 27. ACM, p 26 Pelechano N, Allbeck JM, Badler NI (2008) Virtual crowds: methods, simulation, and control. Synth Lect Comput Graph Animation 3(1):1–176 Pelechano N, Allbeck JM, Kapadia M, Badler NI (eds) (2016) Simulating heterogeneous crowds with interactive behaviors. CRC Press, Boca Raton, FL Pelkey CD, Allbeck JM (2014) Populating semantic virtual environments. Computer Animation and Virtual Worlds 25(3–4):403–410 Shapiro A (2011) Building a character animation system. In: Presented at the international conference on motion in games. Springer, pp 98–109 Sunshine-Hill B, Badler NI (2010) Perceptually realistic behavior through alibi generation. In Proceedings of the Sixth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE’10), G. Michael Youngblood and Vadim Bulitko (Eds.). AAAI Press 83–88 Thalmann D, Musse SR, Braun A (2007) Crowd simulation, vol 1. Springer, Berlin
Crowd Evacuation Simulation Tomoichi Takahashi
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Lessons from the Past and Requirements for Simulation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Agent-Based Approach to Evacuation Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Crowd Evacuation Using Agent-Based Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Evacuation Scenarios and Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Agent Mental States and their Action Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Pedestrian Dynamics Model and the Mentality of Individuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Guidance to Agents and Communication During Evacuation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Abstract
Evacuation simulation systems simulate the evacuation behaviors of people during emergencies. In an emergency, people are upset and hence do not behave as they do during evacuation drills. Reports on past disasters reveal various unusual human behaviors. An agent-based system enables an evacuation simulation to consider these human behaviors, including their mental and social status. Simulation results that take the human factor into consideration seem to be a good tool for creating and improving prevention plans. However, it is important to verify and validate the simulation results for evacuations in unusual scenarios that have not yet occurred. This chapter shows that the combination of an agent's physical and mental status and pedestrian dynamics is the key to replicating various human behaviors in crowd evacuation simulation. This realistic crowd evacuation simulation has the potential for practical application in the field.

T. Takahashi (*) Department of Information Engineering, Meijo University, Nagoya, Japan e-mail: [email protected] # Springer International Publishing Switzerland 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_17-1

Keywords
Evacuation behavior • Emergency scenario • Agent-based simulation • Cognitive map • Psychological factor • Belief-desire-intention • Information transfer and sharing model • Verification and validation
Introduction

Emergencies such as fires, earthquakes, or terrorist attacks can occur at any time in any location. Human lives are at risk from both man-made and natural disasters. The importance of emergency management has been reaffirmed by a number of reports related to various disasters. The September 11, 2001 World Trade Center (WTC) attacks and the Great East Japan Earthquake (GEJE) and ensuing tsunami that occurred on March 11, 2011 took many lives and caused serious injuries. Detailed reports that focus on occupant behavior during the WTC disaster and evacuation behavior after the tsunami alarm indicate that safety measures implemented beforehand and evacuation announcements on site can exert significant influence on individual evacuation behaviors (de Walle and Murray 2007; Averill et al. 2005; Cabinet Office Government of Japan).

Many organizations engage in emergency preparation and provide training to save human lives during emergencies and reduce damage during future disasters (Cabinet Office of UK 2011; Turoff 2002). The disaster-prevention departments of governments, buildings, and other organizations develop these training programs. This training, executed beforehand, is useful to check whether people are well prepared for unseen emergencies, can operate according to prevention plans, and can evacuate quickly to safer locations. It is difficult to replicate emergency situations in the real world and drill for these situations while involving real humans. It is well known that humans behave differently during training and during emergencies. Sometimes, a drill can itself cause accidents. In December 2015, a university in Nairobi executed an antiterror exercise. The drill included the use of gunshots, and this caused students and staff to panic. A number of people jumped from windows of the university buildings and were injured (News). Even statutory training in real situations can create risks for disabled people and some vulnerable groups.

Simulation of the movement of people has been studied in various fields including computer graphics, movie special effects, and evacuations (Hawe et al. 2012; Dridi 2015). This technology allows a prevention center to simulate crowd evacuation behavior in multiple emergency scenarios that cannot be executed in the real world. Computer simulations help the prevention center to assess their plans for all emergencies that need to be considered. Crowd evacuation simulation is a key technology for making safety plans for future emergencies.
State of the Art

Lessons from the Past and Requirements for Simulation Systems

A number of studies have focused on human behavior during past disasters. The National Institute of Standards and Technology (NIST) examined occupant behavior during the attacks on the WTC buildings (Averill et al. 2005; Galea et al. 2008). The Cabinet Office of Japan also reported on evacuations of individuals during the GEJE (Cabinet Office Government of Japan). Common types of evacuation behaviors have been discovered: some individuals evacuated immediately when the disasters occurred, but others did not evacuate, even though they heard emergency alarms provided by the authorities. These people consisted of individuals who had family members located in remote areas, individuals who attempted to contact their families by phone, and individuals who continued to work because they believed they were safe. It is interesting to note that the individuals' behaviors during these two disasters were similar to the behaviors of individuals during a flood in Denver, USA, on June 16, 1965, even though communication methods have changed over the past 50 years (Drabek 2013). Approximately 3,700 families were suddenly told to evacuate from their homes. The family behaviors that occurred following the warnings were categorized as follows: those who evacuated immediately, those who attempted to confirm the threat of disaster, and those who ignored the initial warning and continued with routine activities.

Other features of human behaviors have been reported in other disasters. (1) In the 2003 fire in the Station nightclub, Rhode Island, most building occupants attempted to retrace their steps to the entrance rather than follow emergency signs, even though the emergency exit was adequately signposted (Grosshandler et al. 2005). (2) In emergencies, humans tend to fulfill the roles assigned to them beforehand. For instance, trained people promptly led the others in their offices to safe places in the WTC attacks (Ripley 2008). (3) In contrast, a tragedy that occurred at the Okawa Elementary School during the GEJE demonstrates how untrained leaders may lead to tragedies (Saijo 2014). The school was located 5 km from the sea and had never practiced evacuation drills. When the earthquake occurred, an hour elapsed before teachers decided on an evacuation location. When moving to that location, they were informed that the tsunami was imminent and that their evacuation location was unsafe. They tried to evacuate to a higher location, but their efforts were too late. Most of the students and staff of the Okawa Elementary School were engulfed by the tsunami and died.

The human behaviors that typically occur during emergencies vary by individual, and the behaviors may be different from those that were planned. The fluctuations in these behaviors are the key features that must be simulated in evacuation simulations. Evacuation behaviors depend on the individual, who makes decisions and changes his/her actions according to his/her conditions and information. This information includes signs and public announcements (PAs) and is thought to affect
human behavior and be useful for guiding people quickly to safe places during dynamically changing situations.
Agent-Based Approach to Evacuation Simulations

NIST simulated some evacuation scenarios to estimate the evacuation time from the WTC buildings (Kuligowski 2005). The travel times of several cases were simulated using several evacuation simulation systems, all of which assume the following:
• People are equal mentally and functionally. In some simulators, sex and age are taken into consideration as parameters for walking speed in pedestrian dynamics models. To address roles in society, only behaviors such as leaders in an office guiding people out of the building immediately were modeled.
• All people start their evacuation simultaneously. In fact, some people evacuate only after they finish their jobs; the differences in the premovement times of individuals are not considered in these simulations.
• All people have the same knowledge about the building and use one route when they evacuate from the building. Indeed, knowledge about the evacuation route differs among people, and the evacuation routes can be different.

An agent-based approach provides a platform that relaxes these assumptions. An agent-based simulation system (ABSS) models and computes individuals' behaviors related to evacuation (Musse and Thalmann 2007). Various types of human behavior have been studied using the ABSS platform, for example, simulation of human behavior in a hypothetical human-initiated crisis in the center of Washington, DC, and a simulation tool incorporating different agent types and three kinds of interaction: emotion, information, and behavior (Tsai et al. 2011; Parikh et al. 2013).

An ABSS consists of three parts: the agents, the interaction methods among agents and environments, and the surrounding environment. Agents perceive data from the environment and determine their actions according to their goals. An agent has the properties of physical features, social roles, mentality, and others. The actions are interactions with other agents and the environment. Information exchanges among agents and starting to evacuate are examples of actions. The interactions with the environment are simulated by sub-simulators and affect the status of the environment. The ABSS repeats these simulation steps: agent perception, agent decision-making, and environment calculations. The environment involves CAD models of buildings and scenarios of disaster situations. The following example demonstrates the ABSS process applied to an evacuation from a building during a fire. Agents hear alarms and PAs directing them to evacuate the building. The alarm noise and announcements can increase the anxiety of the agents, which is calculated using a psychological status model. The mental status and individual knowledge of the agent determine its actions. When it decides to go to a safe place, it visualizes the route to that place and moves. One sub-simulator
calculates the agent locations and the status of pedestrian jams inside the building, and the other sub-simulator calculates the spread of the fire.
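The repeated cycle described above can be sketched as a simulation loop like the one below; the environment interface, the sub-simulator API, and the time step are illustrative assumptions rather than the interfaces of any particular ABSS.

```python
# Minimal sketch of the repeated ABSS cycle: agents perceive, decide, and act, and
# sub-simulators (pedestrian dynamics, fire spread, ...) update the shared
# environment. Class and method names are illustrative assumptions.
def run_simulation(agents, environment, sub_simulators, steps, dt=0.5):
    for _ in range(steps):
        # 1. Perception: each agent samples what it can see and hear locally.
        percepts = {a: environment.sense(a) for a in agents}
        # 2. Decision-making: mental state plus knowledge yields an intended action.
        actions = {a: a.decide(percepts[a]) for a in agents}
        # 3. Action/interaction: actions are applied to the shared environment
        #    (start evacuating, exchange information with nearby agents, ...).
        for a, act in actions.items():
            environment.apply(a, act)
        # 4. Environment update: sub-simulators advance pedestrian positions,
        #    congestion, and hazard spread for this time step.
        for sim in sub_simulators:
            sim.step(environment, dt)
```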
Crowd Evacuation Using Agent-Based Simulations

Evacuation Scenarios and Environment

The environment corresponds to the tasks that the ABSS is applied to. The parameters in the environment affect the results of the simulations. Table 1 lists the categories of building evacuation scenarios. Case 1 is a situation in everyday life and the scenario corresponds to an emergency drill. The other four cases correspond to emergency situations in which some accident happens, but people do not have all the information they need. The conditions of each situation worsen from Cases 2 to 5. Providing a real-time evacuation guide for dynamically changing situations is thought to effectively reduce evacuation time. Case 2 corresponds to a minor emergency such as a small fire inside a building. The layout of the floor inside the buildings remains the same during the evacuation, as in Case 1. People also keep calm in this case. Cases 3, 4, and 5 correspond to situations where some people become distressed and may have trouble evacuating safely to exits. Case 3 is a situation where an earthquake causes furniture to fall to the floor, which hinders or prevents evacuation. A case in which fire spreads and humans operate fire shutters to prevent the fire from spreading further is modeled in Case 4. This operation may block the evacuation routes and cause differences between the cognitive map of the evacuees and the real situation. Case 5 is the situation in extreme disasters, where large earthquakes cause so much destruction to parts of the building that the floor layout is completely changed. In Cases 3, 4, and 5, it is necessary to improve prevention plans in terms of available safe-escape time and required safe-escape time (ISO TR16738 2009). However, it is difficult to execute evacuation drills for such situations, as the case in Nairobi demonstrated. Evacuation simulation systems are instead proposed to simulate the evacuation behaviors of people in such situations.

Table 1 Category of changing situations at evacuations

Case  Situation   Map (3D)              Layout     Agent mental state   Interaction mode   Fitness for drills
1     Everyday    Static environment    Same       Normal               Normal             Fit
2     Emergency   Static environment    Same       Normal               Normal             Fit
3     Emergency   Dynamic environment   Different  More anxious         More confusing     More unexpected
4     Emergency   Dynamic environment   Different  More anxious         More confusing     More unexpected
5     Emergency   Dynamic environment   Unknown    Distressed           Crisis             Beyond the scope of drill
(From Case 2 to Case 5, agents become increasingly anxious, interaction becomes increasingly confusing, and the situations become increasingly unexpected with respect to drills.)
Agent Mental States and Their Action Selection

People's state of distress is reflected in the motions of agents during emergencies. As a result, the agents take various actions according to the information that they have. Some people may prefer to trust only information from an authority figure, but others will trust their neighbors or heed messages sent from their acquaintances. These individual behaviors form into crowd behavior in emergencies. During the GEJE, about 34 % of 496 evacuees began their evacuation by taking the advice of acquaintances who themselves took the evacuation guidance seriously (Cabinet Office Government of Japan). The value of 34 % is the average of three prefectures, Iwate, Miyagi, and Fukushima; their respective averages are 44 %, 30 %, and 3 %. The question then arises as to where and how people evacuate during emergencies. Abe et al. conducted a questionnaire survey with individuals who shopped at a Tokyo department store (Abe 1986). Three hundred subjects were selected from shoppers in the department store. The number of male and female participants was equal, and participants ranged in age from teenagers to adults in their 60s. The questions addressed the following factors that occur during emergencies: the provision of evacuation instructions during emergencies, knowledge of emergency exit locations, an individual's ability to evacuate safely, and other factors. The results in Table 2 reveal that:
• Individuals' intentions during emergencies were diverse. Differences were apparent between the sexes and between age groups.
• Half of all surveyed individuals stated they would follow the authorities' instructions. The other half stated they would select directions by themselves, and individuals who chose the fourth and fifth strategies (in Table 2) tended to choose opposite directions.

Table 2 Responses to "In which direction would you evacuate?" (Abe 1986)

    Selected action                                     All (%)   Male (%)   Female (%)
1   Follow instructions from clerks or announcements   48.7      38         54.7
2   Hide from smoke                                     26.3      30.7       22
3   Go to the nearest staircase or emergency exit      16.7      20.7       12.7
4   Follow other individuals' movements                 3         1.3        4.7
5   Go in the direction that has fewer people           3         2.7        3.3
6   Go to bright windows                                2.3       2.7        2
7   Retrace his/her path                                1.7       2.7        0.7
8   Other                                               0.3       0.7        –

Agents act according to their code of conduct or will, and social psychological factors affect human behavior. The implementation of autonomous agents includes modeling the process of an individual's perception, planning, and decision-making. Modeling the mental state of an agent is key to simulating the evacuation behavior of people. The psychological factors affect human actions, which include selfish movements, altruistic movements, and others. The following cases demonstrate some properties of human behavior. These actions also change the behavior of crowd evacuations.
• People swerve when they come close to colliding with each other. When people see responders approaching, they make way to pass them automatically. The two behaviors are similar; however, they are different at the conscious level of an agent. Agents categorize the agents around them into normal or high-priority groups depending on common beliefs in the agent's community. For example, the agent gives consideration to rescuers and the disabled, both of whom are categorized as agents with high priority.
• Families evacuate together. When parents are separated from their children during emergencies, they become anxious and go to their children at the risk of their own safety. For instance, the child might be in a toy section in a department store and have no ability to ask others about his/her parents.
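If an agent's preferred evacuation strategy is drawn from the survey distribution in Table 2, the sampling can be sketched as below; treating the reported percentages directly as per-agent probabilities is a modeling assumption, and the strategy labels are shorthand for the table rows.

```python
import random

# Illustrative use of Table 2 (Abe 1986): sample an agent's intended evacuation
# strategy from the sex-specific response percentages. Using the percentages
# directly as probabilities is a modeling assumption.
STRATEGIES = [
    "follow_instructions", "hide_from_smoke", "nearest_exit", "follow_others",
    "fewer_people", "bright_windows", "retrace_path", "other",
]
WEIGHTS = {
    "male":   [38.0, 30.7, 20.7, 1.3, 2.7, 2.7, 2.7, 0.7],
    "female": [54.7, 22.0, 12.7, 4.7, 3.3, 2.0, 0.7, 0.0],  # "Other" was not reported for females
}

def sample_strategy(sex, rng=random):
    return rng.choices(STRATEGIES, weights=WEIGHTS[sex])[0]

# Sampling many agents roughly reproduces the female column of Table 2.
counts = {s: 0 for s in STRATEGIES}
for _ in range(1000):
    counts[sample_strategy("female")] += 1
print(counts)
```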
Pedestrian Dynamics Model and the Mentality of Individuals

The belief-desire-intention (BDI) model is one method for representing how agents select actions according to the situation during the sense-reason-act cycle (Weiss 2000). Belief represents the information that the agent obtains from the environment and other agents. Desire represents the objectives or situations that the agent would like to accomplish or bring about, and the actions, which are selected after deliberation, are represented by intention. In the case of evacuation in emergency situations, the desires are to move quickly to a safe place, to know what happened, or to join one's family. The associated actions are to move to specific places. These actions are represented as a sequence of target points. The target points are the places where people go to satisfy their desires. Movements, including bidirectional movements in a crowd, can be microsimulated in one step using pedestrian dynamics models (Helbing et al. 2000). The models are composed of geometrical information and a force model that resembles the behaviors of real people. The behaviors of individuals may block others who are hurrying to refuges and hence cause pedestrian jams during evacuation (Pelechano et al. 2008; Okaya and Takahashi 2014).
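A simplified social-force-style update is sketched below: each pedestrian is driven toward its current target point and repelled exponentially by nearby pedestrians. The full model of Helbing et al. (2000) also includes body-contact and wall terms with calibrated constants; the parameter values here are illustrative assumptions.

```python
import numpy as np

# Simplified social-force-style step (driving force toward a target plus exponential
# repulsion from other pedestrians); parameter values are illustrative only.
def social_force_step(pos, vel, targets, dt=0.05, v0=1.3, tau=0.5, A=2.0, B=0.3, radius=0.4):
    pos, vel = np.asarray(pos, float), np.asarray(vel, float)
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n):
        # Driving force: relax toward the desired velocity pointing at the target point.
        e = np.asarray(targets[i], float) - pos[i]
        e = e / (np.linalg.norm(e) + 1e-9)
        forces[i] += (v0 * e - vel[i]) / tau
        # Repulsive social force from every other pedestrian.
        for j in range(n):
            if i == j:
                continue
            d_vec = pos[i] - pos[j]
            d = np.linalg.norm(d_vec) + 1e-9
            forces[i] += A * np.exp((2 * radius - d) / B) * (d_vec / d)
    vel = vel + forces * dt
    pos = pos + vel * dt
    return pos, vel
```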
Guidance to Agents and Communication During Evacuation

The NIST report showed differences in evacuation behaviors between the two buildings, WTC1 and WTC2. The buildings were similar in size and layout, and similar numbers of individuals were present in the buildings during the attacks. Individuals in both buildings began to evacuate when WTC1 was attacked, and WTC2 was attacked 17 min later. At that time, about 83 % of survivors from WTC1 remained inside the tower, and about 60 % of survivors remained inside WTC2. The
difference in evacuation rates between the two buildings under similar conditions indicates that there are other interactive and social issues that should be taken into consideration to simulate crowd evacuation behavior. A PA gives evacuation guidance to people. According to the GEJE report, only 56 % of evacuees heard the emergency alert warning from a loudspeaker. Of these, 77 % recognized the urgent need for evacuation, and the remaining 23 % did not understand the announcement because of noisy and confused situations. Nowadays, people communicate with others in public using cellular phones, and this behavior is assumed to happen during emergencies as well. Indeed, during the GEJE in 2011, it was reported that people obtained and shared information using SNS and personal communications (Okumura 2014). In the case of a family evacuation, communication such as the following often occurs between parents and children when they are apart: "Where are you?" "I am at location X." "All right, I will be there soon; stay there."
Information regarding the situation and personal circumstances play an important role when determining actions. The information affects both the premovement and travel times of evacuation behaviors. With respect to the information or knowledge of people, whether broadcast or communicated personally, the evacuation process has the following phases:
1. When emergencies occur, people either perceive the occurrence themselves or authorities make announcements. The alarm contains urgent messages conveying that an emergency situation has occurred and gives evacuation instructions.
2. People confirm and share the information that they obtain by communicating with people nearby.
3. After that, people perform actions according to their personal reasons: some evacuate to a safe place, others hurry to their families, and still others join rescue operations. People who are unfamiliar with the building follow guidance from authorities or employees who act according to the prescribed rules or manuals of the buildings. The information that authorities and employees have may vary with time.
The information transfer and sharing model enables the announcement of proper guidance to people or information sharing during evacuation (Okaya et al. 2013). The difference in agents' information and style of communication causes the diversity of human behavior and affects the behavior of evacuations (Niwa et al. 2015).
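The premovement phase of this process can be sketched as a simple information-propagation model, where agents who have heard the alarm share the news with neighbors at each step. The sharing probability, the neighbor graph, and the step length are illustrative assumptions; only the 56 % alarm-hearing rate is taken from the GEJE report cited above.

```python
import random

# Minimal sketch of information transfer during the premovement phase: an alarm is
# heard with some probability, and agents who know about the emergency share it with
# nearby agents each step. p_share and the "nearby" notion are assumptions.
def spread_information(agents, neighbors, p_hear_alarm=0.56, p_share=0.3, steps=20, rng=random):
    informed = {a for a in agents if rng.random() < p_hear_alarm}   # heard the loudspeaker
    start_step = {a: 0 for a in informed}
    for t in range(1, steps + 1):
        newly = set()
        for a in informed:
            for b in neighbors[a]:                                  # person-to-person communication
                if b not in informed and rng.random() < p_share:
                    newly.add(b)
                    start_step.setdefault(b, t)
        informed |= newly
    # Per-agent step at which the evacuation information was received;
    # agents missing from the dict never received it within the simulated steps.
    return start_step
```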
Future Directions

An ABSS is expected to simulate the behaviors of agents in unusual scenarios that are difficult to test in the real world. We learn how people behave and evacuate during disasters from media stories and reports published by those in authority.
These reports cover evacuation from airplanes, ships, theaters, sports stadiums, stations, underground transport systems, and other settings (Wanger and Agrawal 2014; Peacock et al. 2011; Weidmann et al. 2014). Behavior models have been formulated to capture the innate human features described in these reports, and they are key components of evacuation simulations. Table 3 shows the parameters of evacuation models in which human behaviors are taken into account. The parameters represent the features of the agents, the environment, and the interactions among agents or with others during the scenarios. In addition, the parameters specify the evacuation scenarios. Some of the parameters are related to each other; for example, parameters related to pedestrian dynamics are personal spaces, speed, and avoidance sides, and some of them depend on the country (Natalie Fridman and Kaminka 2013).

Table 3 Evacuation simulation parameters

Agent
  Physical: Age; Sex; Impaired/unimpaired
  Mental/social: State of mind; Human relationships (family, office member, etc.); Role (teacher, leader, rescue responder, etc.)
  Perception: Visual data; Auditory data
  Action: Evacuate (walk/run); Communicate (hear, talk, share information among agents); Others (altruistic behavior, rescue operation)
  Preference: Culture; Nationality

Environment
  Map/buildings: 2D/3D; Elevator
  Subsystem: Pedestrian dynamics; Disaster effects (fire, smoke, etc.)

Interaction
  Communication: Announcement (guidance from PA); Information sharing
  Human relationship: Personal; Community

In scientific and engineering fields, the principle of hypothesis → compute consequence → compare results has been used to build models and to increase the fidelity of simulations (Feynman 1967). Fundamentally, this principle also applies to crowd evacuation simulation. The following points are assumed when modeling crowd evacuation behaviors:

Whole-part relations assumption. A crowd evacuation simulation system is composed of subsystems: an agent’s action planning, pedestrian dynamics, and disaster situations. A model for evacuation behavior is implemented in each
agent, and the pedestrian dynamics models calculate the positions of the agents. The movements of the agents are integrated into crowd behaviors.

Subsystem causality assumption. The agent's behavior is simulated by formulas or rules within each agent at every simulation step. In each step, the status of the system changes to a new status according to the parameters, models, and formulas. These may be refined to cover more phenomena or to make the results of subsystem simulations more consistent with experimental data or empirical rules.

Total system validity assumption. The simulation results of the subsystems and the positions of all agents are integrated into the results of the crowd evacuation simulation. The results of the simulation are checked against empirical rules or previous data.

Under the second assumption, the model of each subsystem is verified with respect to real data, and the parameters are tuned to the conditions of the scenarios (Peacock et al. 2011; Weidmann et al. 2014; Ronchi et al. 2013). The Tokyo Fire Department publishes a guide for building fire safety certificates based on simulation results (Tokyo Fire Department). The results predict the evacuation time in a fire under the department's specified method and can be used to certify the likelihood of a safe evacuation. These simulations correspond to Case 1 of Table 1, which is equivalent to evacuation drills under everyday conditions.

Under the third assumption, people evaluate the simulation results from their personal and organizational perspectives. An ABSS with the functions mentioned in section "Introduction" can simulate more realistic conditions such as those in Cases 2 to 4. Even when the integrated simulation results appear reasonable for unexpected situations, there is no evidence to endorse whether or not the results can be used in real applications. When the results do not fit the empirical rules, even though they may involve a significant predictor, it is difficult to adopt the simulation results in a prevention plan according to scientific and engineering principles, because we do not have enough real data and cannot perform experiments in real situations. It is therefore important to verify the results of evacuation simulations for emergency situations that have not yet occurred and to affirm that planning based on the simulation results will work well in a possible emergency.

Verification and validation (V&V) of simulation tools and results has been one of the most important issues in crowd evacuation simulation. V&V problems can be expressed through the following questions:

1. How do we judge if a tool is accurate enough?
2. How many and which tests should be performed to assess the accuracy of the model predictions?
3. Who should perform the tests, i.e., the model developers, the model users, or a third party?
4. Does the model accurately represent the source system?
5. Does the model accommodate the experimental frame?
6. Is the simulator correct?
These questions are essential to ABSS. Questions 1 to 3 come from test methods suggested for assessing behavioral uncertainty from quantitative and qualitative points of view (Ronchi et al. 2013). Questions 4 to 6 come from a validation study of evacuation drills in a ten-story building (Isenhour and Löhner 2014). A method of quantitatively comparing simulation results to real scenarios as macroscopic patterns has been proposed for validation (Banerjee and Kraemer 2010). Interactions among agents and dynamically changing environments also affect crowd evacuation behavior. A verification test has been suggested to check evacuation plans under the dynamic availability of exits (Ronchi et al. 2013). The following qualitative standards have been proposed for simulations of situations for which no real-world data, such as real evacuation data or experimental data, are available (Takahashi 2015):

Consistency with data. The simulation results, or their variations after changing parameters or modifying subsystems, are compatible with past anecdotal reports.

Generation of new findings. The results involve something that was not recognized as important before the simulations but that is reasonable given empirical rules.

Accountability of results. The causes of the changes can be explained systematically from the simulation data.

While we do not yet have answers to these questions, ABSS has been applied to increasingly realistic situations. For example, evacuation from a building with fire shutters is a realistic case (Takahashi et al. 2015). Fire shutters are installed in buildings by law to prevent fire and smoke from spreading inside. Some agents evacuate instantly, and others evacuate after finishing their jobs. Operators at the prevention center close the fire shutters at time t1 to prevent the fire from spreading. If there is no announcement regarding the shutter closing, the agents do not know that the environment has changed. They evacuate according to their own cognitive maps, which might not be updated until they notice the closed fire shutter at time t2. As a result, the evacuation time from t1 to t2 is wasted, even for an agent that starts evacuating instantly. This simulation demonstrates that evacuation times change across scenarios in dynamically changing environments, corresponding to Cases 3, 4, and 5, and shows the potential of evacuation simulation for future applications.

In this chapter, we presented some features of crowd evacuation simulations: the role of human mental conditions during emergencies, the representation of agent mental states, and the role of information during evacuations. We also showed that the combination of an agent's physical and mental status with pedestrian dynamics is the key to simulating crowd evacuation and replicating various human behaviors. Simulating crowd evacuation more realistically introduces additional human-related factors, which makes it difficult to systematically analyze the simulation results and compare them with data from the real world. At present, the simulation results are not so much objectively measured as subjectively interpreted by humans. Future research and model development will focus on the study of agent interactions, human mental models, and verification and validation problems.
References Abe K (1986) Panic and human science: prevention and safety in disaster management. Buren Shuppan. in Japanese, Japan Averill JD, Mileti DS, Peacock RD, Kuligowski ED, Groner NE (2005) Occupant behavior, egress, and emergency communications (NIST NCSTAR 1–7). Technical report, National Institute of Standards and Technology, Gaitherburg B. news Woman dies after ‘terror drill’ at Kenya’s strathmore university. http://www.bbc.com/news/ world-africa-34969266. Date:16 Mar 2016 Banerjee B, Kraemer L (2010) Validation of agent based crowd egress simulation (extended abstract). In: International conference on autonomous agents and multiAgent systems (AAMAS’10). pp 1551–1552. http://www.aamas-conference.org/proceeding.html Cabinet Office Government of Japan. Prevention Disaster Conference, the Great West Japan Earthquake and Tsunami. Report on evacuation behavior of people (in Japanese). http://www. bousai.go.jp/kaigirep/chousakai/tohokukyokun/7/index.html. Date: 16 Mar 2016. in Japanese. Cabinet Office of UK (2011) Understanding crowd behaviours: documents. https://www.gov.uk/ government/publications/understanding-crowd-behaviours-documents. 20 Mar 2016 de Walle BV, Murray T (2007) Emergency response information systems: emerging trends and technologies. Commun ACM 50(3):28–65 Drabek TE (2013) The human side of disaster, 2nd edn. CRC Press, Boca Raton Dridi M (2015) Simulation of high density pedestrian flow: a microscopic model. Open J Model Simul 3(4):81–95 Feynman RP (1967) Seeking new laws. In: The character of physical law. The MIT Press, Cambridge Galea ER, Hulse L, Day R, Siddiqui A, Sharp G, Boyce K, Summerfield L, Canter D, Marselle M, Greenall PV (2008) The uk wtc9/11 evacuation study: an overview of the methodologies employed and some preliminary analysis. In: Schreckenberg A, Klingsch WWF, Rogsch C, Schreckenberg M (eds) Pedestrian and evacuation dynamics 2008 (pp. 3–24). Springer, Heidelberg Grosshandler WL, Bryner NP, Madrzykowski D, Kuntz K (2005) Report of the technical investigation of the station nightclub fire (NIST NCSTAR 2). Technical report, National Institute of Standards and Technology, Gaitherburg Hawe GI, Coates G, Wilson DT, Crouch RS (2012) Agent-based simulation for large-scale emergency response. ACM Comput Surv 45(1):1–51 Helbing D, Farkas I, Vicsek T (2000) Simulating dynamical features of escape panic. Nature 407:487–490 Isenhour ML, Löhner R (2014) Validation of a pedestrian simulation tool using the {NIST} stairwell evacuation data. Transp Res Procedia 2:739–744, The Conference on Pedestrian and Evacuation Dynamics 2014 (PED 2014), 22–24 October 2014, Delft, The Netherlands ISO:TR16738:2009. Fire-safety engineering – technical information on methods for evaluating behaviour and movement of people Kuligowski ED (2005) Review of 28 egress models. In: NIST SP 1032; Workshop on building occupant movement during fire emergencies. Musse SR, Thalmann D (2007) Crowd simulation. Springer-Verlag, London Natalie Fridman AZ, Kaminka GA (2013) The impact of culture on crowd dynamics: an empirical approach. In: International conference on autonomous agents and multiagent systems, AAMAS’13, p 143–150 Niwa T, Okaya M, Takahash T (2015) TENDENKO: agent-based evacuation drill and emergency planning system. Lecture Notes in Computer Science 9002. Springer, Heidelberg Okaya M, Takahashi T (2014) Effect of guidance information and human relations among agents on crowd evacuation behavior. 
In: Kirsch U, Weidmann U, Schreckenberg M (eds) Pedestrian and evacuation dynamics 2012. Springer, Cham
Okaya M, Southern M, Takahashi T (2013) Dynamic information transfer and sharing model in agent based evacuation simulations. In: International conference on autonomous agents and multiagent systems, AAMAS 13. pp 1295–1296 Okumura H (2014) The 3.11 disaster and data. J Inf Process 22(4):566–573 Parikh N, Swarup S, Stretz PE, Rivers CM, Bryan MVM, Lewis L, Eubank SG, Barrett CL, Lum K, Chungbaek Y (2013) Modeling human behavior in the aftermath of a hypothetical improvised nuclear detonation. In: International conference on autonomous agents and multiagent systems, AAMAS’13, pp 949–956 Peacock RD, Kuligowski ED, Averill JD (2011) Pedestrian and evacuaion dynamics. Springer, Heidelberg Pelechano N, Allbeck J, Badler N (2008) Virtual crowds: methods, simulation, and control. Morgan & Claypool Publishers series, California, New York Ripley A (2008) The unthinkable: who survives when disaster strikes – and why. New York: Three Rivers Press Ronchi E, Kuligowski ED, Reneke PA, Peacock RD, Nilsson D (2013) The process of verification and validation of building fire evacuation models. Technical report. National Institute of Standards and Technology, Gaitherburg. Technical Note 1822 Saijo T (2014) Be a tsunami survivor. http://wallpaper.fumbaro.org/survivor/tsunami_en_sspj.pdf. Date:17 Mar 2016 Takahashi T (2015) Qualitative methods of validating evacuation behaviors. In Takayasu H, Ito N, Noda I, Takayasu M (eds) Proceedings of the international conference on social modeling and simulation, plus econophysics colloquium 2014. Springer proceedings in complexity. Springer International Publishing, pp 231–242 Takahashi T, Niwa T, Isono R (2015) Method for simulating the evacuation behaviours of people in dynamically changing situations. In: Proceedings of TGF2015. Springer. To be published in 2016 Fall Tokyo Fire Department. Excellence mark -certified fire safety building indication system. http:// www.tfd.metro.tokyo.jp/eng/inf/excellence_mark.html. Date: 25 Jan 2016 Tsai J, Fridman N, Bowring E, Brown M, Epstein S, Kaminka G, Marsella S, Ogden A, Rika I, Sheel A, Taylor ME, Wang X, Zilka A, Tambe M (2011) Escapes: evacuation simulation with children, authorities, parents, emotions, and social comparison. In: The 10th international conference on autonomous agents and multiagent systems, vol 2, AAMAS’11. International Foundation for Autonomous Agents and Multiagent Systems, Richland, pp 457–464 Turoff M (2002) Past and future emergency response information systems. Commun ACM 45 (4):29–32 Wanger N, Agrawal V (2014) An agent-based simulation system for concert venue crowd evacuation modeling in the presence of a fire disaster. Expert Syst Appl 41:2807–2815 Weidmann U, Kirch U, Schreckenberg M (eds) (2014) Pedestrain and evacuation dynamics 2012. Springer, Heidelberg Weiss G (2000) Multiagent systems. The MIT Press, Massachusets
Perceptual Study on Facial Expressions Eva G. Krumhuber and Lina Skora
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Early Beginnings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Dynamic Advantage in Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Temporal Characteristics: Directionality and Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Effects of Facial Motion on Perception and Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Ratings of Authenticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Person Judgments and Behavioral Responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Facial Mimicry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Neuroscientific Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Abstract
Facial expressions play a paramount role in character animation since they reveal much of a person’s emotions and intentions. Although animation techniques have become more sophisticated over time, there is still need for knowledge in terms of what behavior appears emotionally convincing and believable. The present chapter examines how motion contributes to the perception and interpretation of facial expressions. This includes a description of the early beginnings in research on facial motion and more recent work, pointing toward a dynamic advantage in facial expression recognition. Attention is further drawn to the potential characteristics (i.e., directionality and speed) that facilitate such dynamic advantage. This is followed by a review on how facial motion affects perception and
E.G. Krumhuber (*) • L. Skora University College London, London, UK e-mail: [email protected] # Springer International Publishing AG 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_18-1
behavior more generally, together with the neural systems that underlie the processing of dynamic emotions. The chapter concludes by discussing remaining challenges and future directions for the animation of naturally occurring emotional expressions in dynamic faces.

Keywords
Motion • Dynamic • Facial expression • Emotion • Perception
Introduction

Among the 12 principles of animation developed in the early 1930s, animators at the Walt Disney Studios considered motion to be fundamental for creating believable characters. They were convinced that the type and speed of an action help define a character’s intentions and personality (Kerlow 2004). Since the early days of character animation, much has changed in animation techniques and styles. From hand-drawn cartoon characters to real-time three-dimensional computer animation, the field has seen a major shift toward near-realistic characters that exhibit humanlike behavior. Whether those are used for entertainment, therapy, or education, the original principle of motion continues to be of interest in research and design. This particularly applies to the topic of facial animation, as subtle elements of the character’s thoughts and emotions are conveyed through the face (Kappas et al. 2013). Facial expressions provide clues and insight about what the character thinks and feels. They act as a powerful medium in conveying emotions. Although the tools for facial animation have become more sophisticated over time (i.e., techniques for capturing and synthesizing facial expressions), there is still a need for knowledge about how humans respond to emotional displays in moving faces. Only if the character appears emotionally convincing and believable will the user/audience feel comfortable in interaction. The present chapter aims to help with this task by providing an overview of the existing literature on the perception of dynamic facial expressions. Given the predominant focus on static features of the face in past research, we seek to highlight the beneficial role of facial dynamics in the attribution of emotional states. This includes a description of the early beginnings in research on facial motion and more recent work pointing toward a dynamic advantage in facial expression recognition. The next section draws attention to the potential characteristics that facilitate such a dynamic advantage. This is followed by a review on how facial motion affects perception and behavior more generally. Neural systems in the processing of dynamic emotions and their implications for action representation are also outlined. The final section concludes the chapter by discussing remaining challenges and future directions for the animation of naturally occurring emotional expressions in dynamic faces.
State of the Art

Early Beginnings

In everyday settings, human motion and corresponding properties (e.g., shapes, texture) interact to produce a coherent percept. Yet, motion conveys important cues for recognition even in isolation from the supportive information. The human visual system, having evolved in dynamic conditions, is highly attuned to dynamic signals within the environment (Gibson 1966). It can use this information to identify an agent or infer its actions purely by the motion patterns inherent to living organisms, called biological motion (Johansson 1973). Investigations of biological motion of the face suggest that the perception of faces is aided by the presence of nonrigid facial movements, such as stretching, bulging, or flexing of the muscles and the skin. In an early and now seminal point-light paradigm (Bassili 1978), all static features of actors’ faces, such as texture, shape, and configuration, were obscured with the use of black makeup. Subsequently, the darkened faces were covered with approximately 100 luminescent white dots and video recorded in a dark room displaying a range of nonrigid motion, from grimaces to the basic emotional expressions (happiness, sadness, fear, anger, surprise, and disgust). The dark setup resulted in only the bright points being visible to the observer, moving as a result of facial motion. In a recognition experiment, the moving dots were recognized as faces significantly better than when the stimulus was shown as a sequence of static frames or as a static image. Similarly, moving point-light faces enabled above-chance recognition of the six basic emotional expressions in comparison to motionless point-light displays (Bassili 1979; Bruce and Valentine 1988). This suggests that when static information about the face is absent, biological motion alone is distinctive enough to provide important cues for recognition.
Dynamic Advantage in Facial Expression Recognition

Subsequent research has pointed toward a motion advantage, especially when static facial features are compromised. This is of particular relevance for computer-generated, synthetic faces (e.g., online avatars, game characters). In comparison to natural human faces, synthetic faces are still inferior in terms of their realistic representation of the finer-grained features, such as textures, skin stretching, or skin wrinkling. Such impairment in the quality of static information can be remedied by motion. Numerous studies have shown that expression recognition in dynamic synthetic faces consistently outperforms recognition in static synthetic faces (Ehrlich et al. 2000; Wallraven et al. 2008; Wehrle et al. 2000; Weyers et al. 2006). This suggests that motion is able to add a relevant layer of information when synthetic features fail to provide sufficient cues for recognition. The effect is found both under uniform viewing quality and when the featural or textural information is degraded (e.g., blurred).
For natural human faces, however, the dynamic advantage is weaker or inexistent when the quality of both static and dynamic displays is comparably good (Fiorentini and Viviani 2011; Kamachi et al. 2001; Kätsyri and Sams 2008). As such, motion is likely to provide additional cues for recognition when key static information is missing (i.e., in degraded and obscured expressions). Its benefits may be redundant when the observer can draw enough information from the static properties of the face. This applies to static stimuli that typically portray expressions at the peak of emotionality. Such stimuli, prominently used in face perception research, are characterized by their stereotypical depiction of a narrow range of basic emotions. They are often also posed upon instructions by the researcher and follow a set of prototypical criteria (e.g., Facial Action Coding System, FACS; Ekman and Friesen 1978). In this light, it is likely that stylized static expressions contain the prototypical markers of specific emotions, thereby facilitating recognition. Yet, everyday emotional expressions are spontaneous and often include non-prototypical emotion blends or patterns. They are normally also of lower intensity, potentially becoming more difficult to identify without supportive cues such as motion. For instance, low-intensity expressions, which tend to be more difficult to identify the less intense they get, are recognized significantly better in a dynamic than static form (Ambadar et al. 2005; Bould and Morris 2008). In this context, motion appears to provide additional perceptual cues, making up for insufficient informative signals.
Temporal Characteristics: Directionality and Speed

How can we explain the motion advantage in expression recognition? Could it simply derive from an increase in the number of cues in a dynamic sequence? Early hypotheses point out that a moving sequence contains a greater amount of static information from which to infer emotion judgments than a single static portrayal (Ekman and Friesen 1978). Arguably, as a dynamic sequence unfolds, it provides multiple samples of the developing expression compared to a single sample in static displays. To test this assumption, Ambadar et al. (2005) compared emotion recognition performance between dynamic, static, and multi-static expressions. In the multi-static condition, the static frames constituting a video were interspersed with visual noise masks disrupting the fluidity of motion. Of these, dynamic expressions were recognized with significantly greater accuracy than both multi-static and static portrayals (see also Bould and Morris 2008). This suggests that the intrinsic temporal quality of the unfolding expression is what helps to disambiguate its content rather than a mere increase in static frames. A likely candidate that facilitates the dynamic advantage is the directionality of change in the expression over time. Research shows that humans are highly sensitive to the direction in which the expression unfolds. For example, they are able to accurately detect the directionality in a set of scrambled images and arrange them into a temporally correct sequence (Edwards 1998). Similarly, disrupting the natural temporal direction of the expression results in worse recognition accuracy than when
the expressions unfold naturally. In a series of experiments, Cunningham and Wallraven (2009b) demonstrated this by applying various manipulation techniques to the direction of unfolding, such as scrambling the frames in a dynamic sequence or playing them backward. Their results indicate that the identification of emotional expressions suffers considerably when natural motion is interrupted. Recognition performance also appears to be better in sequences in which the temporal unfolding is preserved, thereby allowing the directionality of change to be observed as the expression emerges (Bould et al. 2008, but see Ambadar et al. 2005 for a contrasting result). Yet, it is noteworthy that this effect might not apply to all emotions equally. For example, happiness is typically recognized better than other basic emotions regardless of condition. In addition to the movement direction, the velocity of unfolding plays a crucial role in emotion perception. Changes in viewing speed, such as slowing down or speeding up the dynamic sequences, significantly affect expression recognition accuracy. This effect appears to differ between emotions, based on the differences in their intrinsic optimum velocities. For example, sadness is naturally slow, so slowed-down viewing conditions do not impair its recognition as much as they impair recognition accuracy for the other tested emotions (Kamachi et al. 2001). Conversely, surprise is naturally fast, and it could be its natural velocity that distinguishes it from the morphologically similar but slower expression of fear (Sato and Yoshikawa 2004). Importantly, changing the speed throughout an entire expression results in different effects than changes to the duration of the peak. This suggests that the beneficial effects of natural movements cannot simply be explained by the mere exposure time to the expression (Kamachi et al. 2001; Recio et al. 2013). Overall, altering the speed of expression unfolding appears to influence perception without affecting the direction of change. As such, the intrinsic velocities of particular emotional expressions are likely to provide stronger cues for recognition than the perception of change alone (Bould et al. 2008). Finally, the perception of dynamic faces is also linked to the quality of motion. While expressions in real faces unfold in a biologically natural manner (i.e., nonlinearly), facial animations have often been characterized by linear techniques. Such linearly unfolding facial expressions (e.g., dynamic displays morphed from individual static displays) yield slower and less accurate recognition in comparison to natural, nonlinear unfolding, as well as lower naturalness and genuineness ratings (Cosker et al. 2010; Wallraven et al. 2008). As a result, linear morphs might not constitute a good representation of the real-life quality of facial motion, which is particularly relevant to the construction of realistic synthetic faces. However, recent developments within the field of affective computing identify multiple parameters linked to naturalistic expression unfolding that can improve the quality of motion in computer-generated faces and raise their recognition rates, such as appropriate speeds, action unit (AU) activations, intensities, asymmetries, and textures (Krumhuber et al. 2012; Recio et al. 2013; Yang et al. 2013). As such, the benefits provided by motion appear to be more than the perception of motion itself. Instead, they derive from a comprehensive set of information reflecting the temporal characteristics
including the perception of change, intrinsic velocity of an expression, and the quality of motion.
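To relate the linear versus nonlinear unfolding discussed above to animation practice, the short sketch below contrasts a constant-velocity morph with a smoothstep easing applied to a single expression parameter. It is an illustrative example only; the function names and the choice of smoothstep as the nonlinear profile are assumptions, not a method taken from the studies cited here.

```python
# Illustrative sketch (not from the cited studies): two ways to drive a single
# expression parameter (e.g., a blendshape weight or AU intensity) from neutral
# (0) to apex (1).  Linear interpolation corresponds to a simple static-to-static
# morph; the smoothstep curve is one common way to approximate the non-linear,
# ease-in/ease-out quality of natural facial motion.
import numpy as np

def linear_unfold(n_frames):
    """Constant-velocity morph between a neutral and an apex frame."""
    return np.linspace(0.0, 1.0, n_frames)

def smoothstep_unfold(n_frames):
    """Non-linear unfolding: slow start, faster middle, slow settle."""
    t = np.linspace(0.0, 1.0, n_frames)
    return t * t * (3.0 - 2.0 * t)        # classic smoothstep easing

if __name__ == "__main__":
    # With 30 frames at 30 fps this parameterizes a one-second onset.
    print(linear_unfold(30)[:5])
    print(smoothstep_unfold(30)[:5])
```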
Effects of Facial Motion on Perception and Behavior

In addition to its supportive role in expression recognition, motion also affects a number of perceptual and behavioral factors. These include expression judgments such as intensity and authenticity, as well as behavioral responses and even mimicry. First, emotions expressed in a dynamic form are perceived to be more intense than the same emotions in a static form (Biele and Grabowska 2006; Cunningham and Wallraven 2009a). Motion appears to enhance intensity estimates because of the changes in the expression as it develops from neutral to fully emotive. While static portrayals retain the same intensity level throughout the presentation time, dynamic changes highlight the contrast between the neutral and the fully emotional expression. As such, the contrast makes the expression seem more intense (Biele and Grabowska 2006). Another explanation for this effect has been offered in terms of representational momentum (RM). RM is a visual perception phenomenon in which the observer exaggerates the final position of a gradually moving stimulus, often involving a forward displacement. For example, when a moving object disappears from the visual field, observers tend to report its final position as displaced further along its trajectory than it objectively was. In a study of dynamic facial expressions and RM, Yoshikawa and Sato (2008) found that participants exaggerated the last, fully emotive frame of the dynamic sequence and remembered it as more intense than it was in reality. The effect also became more pronounced with increasing velocity of expression unfolding. As such, it seems that the gradual shift from neutral to emotional in dynamic expressions generates a forward displacement, inducing an exaggerated and intensified perception of the final frame in the sequence.
Ratings of Authenticity

Motion also appears to help observers assess the authenticity of an expression better than static portrayals can. Authenticity refers to more than correct identification of the emotional expression observed. It is a quality telling us whether the emotion is genuinely experienced or not. Smiles have been prominently used to study this dimension. Being universal and widespread in everyday interactions, smiles can indicate a range of feelings, from happiness and amusement to politeness and embarrassment (Ambadar et al. 2009). However, smiles can also be easily used to mask real emotions or to deceive others (e.g., Ekman 1985). As such, they constitute a good stimulus to study the genuineness of the underlying feeling. Traditionally, the so-called Duchenne marker has been considered as an indicator of smile authenticity (Ekman et al. 1990), where its presence signals that a smile is genuine (“felt”) as opposed to false (“unfelt”). The Duchenne marker involves, in addition to the lip corner puller (zygomaticus major muscle), the activation of the
orbicularis oculi muscle surrounding the eye. This results in wrinkling on the sides of the eyes, commonly referred to as crow’s feet. While the validity of the Duchenne marker in the perception of static expressions is well documented, motion properties are crucial for assessing smile authenticity in dynamic displays (e.g., Korb et al. 2014; Krumhuber and Manstead 2009). For example, genuine smiles differ in lip corner and eyebrow movements from deliberate, false smiles (Schmidt et al. 2006; Schmidt et al. 2009). More specifically, Frank et al. (1993) highlighted three dynamic markers of genuine smiles: expression duration, synchrony in muscle activation (between the zygomaticus major and orbicularis oculi muscles), and smoothness of mouth movements. Overall, genuine smiles last between 500 and 4000 ms, whereas false smiles tend to be shorter or longer (Ekman and Friesen 1982). Furthermore, the smoothness and duration of the expressive components of smiles are meaningful indicators. Bugental (1986) and Weiss et al. (1987) were the first to show that the onset and offset in false smiles tend to be faster in comparison to felt smiles (see also Hess and Kleck 1990). To investigate whether these differences affect expression perception, Krumhuber and Kappas (2005) manipulated the onset, apex, and offset timings of computer-generated smiles. Their results confirmed the proposition that each dynamic element of a smiling expression has an intrinsic duration range at which it looks genuine. In particular, expressions are perceived as more authentic the longer their onsets and offsets, while a long apex is linked to lower genuineness ratings.
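Purely as a toy illustration of how these temporal markers could be operationalized, the heuristic below scores a smile's timing from its onset, apex, and offset durations. The thresholds echo the findings summarized above (an overall duration of roughly 500–4000 ms for felt smiles; longer onsets and offsets and shorter apexes reading as more genuine), but the scoring scheme itself is an invented assumption, not a validated classifier from the cited work.

```python
# Toy heuristic, for illustration only: scores how "genuine" a smile's timing
# looks using the temporal markers summarized above (Ekman and Friesen 1982;
# Krumhuber and Kappas 2005).  The weights and caps are assumptions, not a
# validated model.
def smile_timing_score(onset_ms, apex_ms, offset_ms):
    total = onset_ms + apex_ms + offset_ms
    score = 0.0
    if 500 <= total <= 4000:                 # felt smiles tend to fall in this range
        score += 1.0
    score += min(onset_ms, 1000) / 1000.0    # longer onsets look more authentic
    score += min(offset_ms, 1000) / 1000.0   # ... as do longer offsets
    score -= min(apex_ms, 2000) / 2000.0     # a long frozen apex looks posed
    return score

# Slow onset/offset with a short apex vs. an abrupt onset with a long apex.
print(smile_timing_score(800, 600, 900))     # higher score
print(smile_timing_score(100, 2500, 150))    # lower score
```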
Person Judgments and Behavioral Responses

Besides their effects on authenticity ratings, dynamic signals influence trait attributions and behavioral responses to the target expressing an emotion. For instance, people displaying dynamic genuine smiles (long onset and offset) are rated as more trustworthy, more attractive, and less dominant than those who show smile expressions without those characteristics (Krumhuber et al. 2007b). In addition, facial movement helps to regulate interpersonal relations by shaping someone’s intention to approach or cooperate with another person. In economic trust games, participants can receive a financial gain if their counterpart cooperates but incur a loss if the counter-player fails to cooperate. As such, their performance depends on accurate assessment of the counterpart’s intentions. Krumhuber and colleagues (Krumhuber et al. 2007a) showed that people are more likely to trust and engage in an interaction with a counterpart who displays a dynamic authentic smile than a dynamic false smile or neutral expression. Participants with genuinely smiling counterparts also ascribe more positive emotions and are more inclined to meet them again. Furthermore, people showing dynamic genuine smiles are evaluated more favorably and considered more suitable candidates in a job interview than those who do not smile or smile falsely (Krumhuber et al. 2009). Notably, this effect applies to real human faces as well as to computer-generated ones. When comparing static and dynamic facial features, it appears that they contribute to different evaluations and social decisions. Static and morphological features,
such as bone structure or width, have been found to affect judgments of ability and competence. In turn, features that are dynamic and malleable, like muscular patterns in emotional expressions, affect judgments of intentions (Hehman et al. 2015). Given that these facial signals are also linked to evaluations of trustworthiness and likeability, they are likely to drive decision-making in social interactions. In line with this argument, participants were shown to choose a financial advisor, a role requiring trust, based on dynamic rather than static facial properties (Hehman et al. 2015).
Facial Mimicry

Existing evidence suggests that dynamic facial displays elicit involuntary and subtle imitative responses more evidently than do static versions of the same expression (Rymarczyk et al. 2011; Sato et al. 2008; Weyers et al. 2006). Those responses, interpretable as mimicry, are a result of activity in facial muscles corresponding to a given perceived expression (e.g., lowering the eyebrows in anger, pulling the lip corners in happiness). They occur spontaneously and swiftly (within about 800–900 ms) after detecting a change in the observed face. While involuntary facial mimicry is a subtle rather than full-blown replication of a witnessed emotion, it is evident enough to be distinguished in terms of its valence (positive or negative quality) by independent observers (Sato and Yoshikawa 2007a). Crucially, the presence of mimicry has a supporting role in emotion perception. For example, being able to mimic helps observers to recognize the emotional valence of expressions (Sato et al. 2013). Happiness and disgust are less well identified when the corresponding muscles are engaged by biting on a pen, which effectively blocks mimicry in the lower part of the face (Oberman et al. 2007; Ponari et al. 2012). In a similar vein, blocking mimicry in the upper part of the face by drawing together two stickers placed above the eyebrows impairs the recognition of anger. Mimicry also appears useful in detecting changes in expressions. Having to identify the point at which an expression transforms from one emotion into another (e.g., happiness to sadness) proves more difficult when mimicry is blocked by holding a pen sideways between the teeth. In this task, participants who are free to mimic are quicker in spotting changes in the dynamic trajectory of facial expressions (Niedenthal et al. 2001). Furthermore, mimicry aids emotion judgments, particularly in the context of smile authenticity. Dynamic felt smiles are more easily distinguished from dynamic false ones when expressions can be freely mimicked compared to when mimicry is blocked by a mouth-held pen (Maringer et al. 2011; Rychlowska et al. 2014). Overall, these findings suggest that facial mimicry helps to make inferences about dynamic emotional faces, such as recognizing emotions, detecting trajectory changes, and judging authenticity. As such, it adds to the evidence that facial motion conveys information that is essential to comprehensive expression perception, while also driving behavioral responses.
Neuroscientific Evidence

Evidence from neuroscience suggests that differences in the processing of dynamic and static facial stimuli begin at a neural level. For example, studies of patients with brain lesions or neurological disorders point toward a dissociation in the neural routes for processing dynamic and static faces. In the most notable cases, patients who are unable to recognize emotions from static displays can easily do so from moving displays (Adolphs et al. 2003; Humphreys et al. 1993). In healthy people, dynamic facial expressions evoke significantly larger and more widespread activation patterns in the brain than static expressions (LaBar et al. 2003; Sato et al. 2004). This enhanced activation is apparent in a range of brain regions, starting with the visual area V5, which subserves motion perception. It has also been observed in the fusiform face area (FFA), a number of frontal and parietal regions, and the superior temporal sulcus (STS), areas implicated in the processing of faces, emotion, and biological motion, respectively (Kessler et al. 2011; Trautmann et al. 2009). The STS has been given particular consideration due to its involvement in interpreting social signals, in addition to biological motion. As such, enhanced activation in the STS in response to dynamic facial stimuli could be related to extracting socially relevant information (e.g., intentions) from the changeable features of the face (Arsalidou et al. 2011; Kilts et al. 2003). Additionally, in an electroencephalography (EEG) study, attention-related brain activity was found to be greater and longer lasting when participants observed dynamic compared to static stimuli (Recio et al. 2011). This higher activity continued throughout the duration of an expression, contributing to more elaborate processing of dynamic faces. Such enhanced and more widespread brain activation in response to facial motion could be caused by the fact that dynamic expressions are inherently more complex to process. Equally, it could derive from greater evolutionary experience with moving faces and the need to extract social meaning from them for effective communication. In this light, neurological evidence lends support to the behavioral findings. Improved recognition accuracy, sensitivity to the temporal characteristics, and the ability to make inferences about genuineness, trustworthiness, or approachability could be an effect of enhanced processing of dynamic faces. Besides phenomena of neural adaptation, there is work suggesting that brain activity while observing facial movements may encompass regions which are linked to one’s own experience of emotional states, as well as areas reported to contain mirror neurons (Dapretto et al. 2006). Initially observed in macaque monkeys, mirror neurons fire both when performing an action and when watching the action in others (Rizzolatti et al. 1996). Emotion perception may therefore be partially subserved by the mirror neuron system (i.e., premotor and parietal regions, superior temporal sulcus; Iacoboni and Dapretto 2006; Rizzolatti and Craighero 2004), which activates an internal representation of the observed state almost as if it were felt by oneself. Supportive evidence comes from research showing that facial mimicry in response to observed expressions activates similar patterns in the brain of the perceiver (Lee et al. 2006). Also, observing someone’s emotional experience elicits
corresponding subjective arousal in oneself (Lundqvist and Dimberg 1995), and this arousal is found to be stronger for dynamic than for static faces (Sato and Yoshikawa 2007b). Importantly, it has been proposed that this mirror neuron system has evolved to produce an implicit internal understanding of others’ mental states and intentions (Dimberg 1988; Gallese 2001). Following from this assumption, mirroring brain activity in response to facial expressions could be the driving force behind higher-order cognitive processes such as empathy or mentalizing (Iacoboni 2009). For example, witnessing a painful expression on someone’s face and feeling pain oneself activate largely overlapping neural pathways that are correlated with regions linked to empathy (Botvinick et al. 2005; Singer et al. 2004). The ability to mimic expressions was also shown to cause greater prosocial behavior, arguably mediated by greater empathy derived from mimicry and shared activations (Stel et al. 2008). Overall, this has been taken to suggest that humans understand, empathize with, and make inferences about the mental states of others because the action-perception overlap activates internal experiences of the same state (Schulte-Rüther et al. 2007).
Future Directions

The literature reviewed above provides converging evidence that humans have remarkable abilities to perceive and understand the actions of others. Driven by the universal need for social connection, the efficient detection and interpretation of social signals appears essential for successful interaction. Given the rapid advances in technology, these uniquely adaptive skills are likely to be transferred to a new form of social partners in the near future. With the move of computing into the social domain, nonhuman agents are envisaged to become integral parts of our daily lives, from the workplace to social and private applications (Küster et al. 2014). As a result, many interactions will not occur in their traditional form (i.e., human to human) but will instead involve computer-generated avatars and social robots. In order to build animated systems that emit appropriate social cues and behavior, it is imperative to understand the factors that influence perception. Facial expressions prove to play a vital part in this process since they reveal much of a character’s emotions and intentions. While animation techniques offer more control than ever over visual elements, subtle imperfections in the timing of facial expressions could evoke negative reactions from the viewer. In 1970, Masahiro Mori described a phenomenon called the “uncanny valley” (UV) in which human-realistic characters are viewed negatively if they are almost but not quite perfectly human. As such, increased human-likeness may result in unease when appearance or behavior falls short of emulating that of real human beings. Classic examples can be found in computer-animated films, such as The Polar Express and Final Fantasy: The Spirits Within, which many viewers find disturbing due to their human-realistic but eerie characters (Geller 2008). According to Mori, this perceived deviation from normal human behavior is further pronounced when movement is added. In particular, if the appearance is more advanced than the behavior, violated perceptual expectations could make the moving character less
acceptable. In line with this argument, Saygin et al. (2012) showed that androids that look human but do not move in a humanlike (biological) manner elicit a prediction error that leads to stronger brain activity in the perceiver. Furthermore, virtual characters are more likely to be rated as uncanny when their facial expressions lack movement in the forehead and eyelids (Tinwell et al. 2011). Although the exact role of motion in the UV remains an issue of debate (see Kätsyri et al. 2015), there is increasing evidence suggesting that natural human motion positively influences the acceptability of characters, particularly those that would fall appearance-wise into the UV (e.g., zombies; Piwek et al. 2014). As such, high-quality motion has the potential to improve ratings of familiarity and human-likeness by eliciting higher affinity (McDonnnell et al. 2012; Thompson et al. 2011). In order for natural motion to become the standard in animation, it is essential to rely on behavior representative of the real world. At the moment, databases depicting dynamic emotional expressions are still limited in the range and type of facial movements being captured. The majority of them contain deliberately posed affective displays recorded under highly constrained conditions (for a review see Krumhuber et al. in press). Such acted portrayals may not provide an optimal basis for the modeling of naturally occurring emotions. For progress to occur in the future, efforts that target the dynamic analysis and synthesis of spontaneous behavior will prove fruitful. This also includes the study of how multiple dynamic cues interact to produce a coherent percept. Only once the dynamic nature of facial expressions is fully understood will it be possible to successfully incorporate this knowledge into animation models. The present chapter underscores the importance of this task by showing that perceivers are highly sensitive to motion dynamics in the perceptual study of facial expressions.
Cross-References

▶ Blendshape Facial Animation
▶ Real-Time Full Body (or face) Posing
▶ Video-Based Performance Driven Facial Animation
References Adolphs R, Tranel D, Damasio AR (2003) Dissociable neural systems for recognizing emotions. Brain Cogn 52:61–69. doi:10.1016/S0278-2626(03)00009-5 Ambadar Z, Cohn JF, Reed LI (2009) All smiles are not created equal: morphology and timing of smiles perceived as amused, polite, and embarrassed/nervous. J Nonverbal Behav 33:17–34. doi:10.1007/s10919-008-0059-5 Ambadar Z, Schooler JW, Cohn JF (2005) Deciphering the enigmatic face. The importance of facial dynamics in interpreting subtle facial expressions. Psychol Sci 16:403–410. doi:10.1111/j.09567976.2005.01548.x Arsalidou M, Morris D, Taylor MJ (2011) Converging evidence for the advantage of dynamic facial expressions. Brain Topogr 24:149–163. doi:10.1007/s10548-011-0171-4
Bassili JN (1978) Facial motion in the perception of faces and of emotional expression. J Exp Psychol Hum Percept Perform 4:373–379. doi:10.1037/0096-1523.4.3.373 Bassili JN (1979) Emotion recognition: the role of facial movement and the relative importance of upper and lower face. J Pers Soc Psychol 37:2049–2058. doi:10.1037//0022-3514.37.11.2049 Biele C, Grabowska A (2006) Sex differences in perception of emotion intensity in dynamic and static facial expressions. Exp Brain Res 171:1–6. doi:10.1007/s00221-005-0254-0 Botvinick M, Jha AP, Bylsma LM, Fabian SA, Solomon PE, Prkachin KM (2005) Viewing facial expressions of pain engages cortical areas involved in the direct experience of pain. Neuroimage 25:312–319. doi:10.1016/j.neuroimage.2004.11.043 Bould E, Morris N (2008) Role of motion signals in recognizing subtle facial expressions of emotion. Br J Psychol 99:167–189. doi:10.1348/000712607X206702 Bould E, Morris N, Wink B (2008) Recognising subtle emotional expressions: the role of facial movements. Cognit Emot 22:1569–1587. doi:10.1080/02699930801921156 Bruce V, Valentine T (1988) When a nod’s as good as a wink: the role of dynamic information in facial recognition. In: Gruneberg MM, Morris PE, Sykes RN (eds) Practical aspects of memory: current research and issues, vol 1. John Wiley and Sons, New York, pp 169–174 Bugental DB (1986) Unmasking the “polite smile”. Situational and personal determinants of managed affect in adult-child interaction. Pers Soc Psychol Bull 12:7–16. doi:10.1177/ 0146167286121001 Cosker D, Krumhuber EG, Hilton A (2010) Perception of linear and nonlinear motion properties using a FACS validated 3D facial model. In: Proceedings of the symposium on applied Perception in graphics and visualization (APGV), Los Angeles Cunningham DW, Wallraven C (2009a) The interaction between motion and form in expression recognition. In: Bodenheimer B, O’Sullivan C (eds) proceedings of the 6th symposium on applied perception in graphics and visualization (APGV2009), New York Cunningham DW, Wallraven C (2009b) Dynamic information for the recognition of conversational expressions. J Vis 9:1–17. doi:10.1167/9.13.7 Dapretto M, Davies MS, Pfeifer JH, Scott AA, Sigman M, Bookheimer SY, Iacoboni M (2006) Understanding emotions in others: mirror neuron dysfunction in children with autism spectrum disorders. Nat Neurosci 9:28–30. doi:10.1038/nn1611 Dimberg U (1988) Facial electromyography and the experience of emotion. J Psychophysiol 2:277–282 Edwards K (1998) The face of time: temporal cues in facial expressions of emotion. Psychol Sci 9:270–276. doi:10.1111/1467-9280.00054 Ehrlich SM, Schiano DJ, Sheridan K (2000) Communicating facial affect: it’s not the realism, it’s the motion. In: Proceedings of ACM CHI 2000 conference on human factors in computing systems, New York Ekman P (1985) Telling lies. Norton, New York Ekman P, Friesen WV (1982) Felt, false, and miserable smiles. J Nonverbal Behav 6:238–252. doi:10.1007/BF00987191 Ekman P, Friesen WV (1978) Facial action coding system: a technique for the measurement of facial movement. Consulting Psychologists Press, Palo Alto Ekman P, Davidson RJ, Friesen WV (1990) The Duchenne smile: emotional expression and brain physiology: II. J Pers Soc Psychol 58:342–353. doi:10.1037/0022-3514.58.2.342 Fiorentini C, Viviani P (2011) Is there a dynamic advantage for facial expressions. J Vis 11:1–15. doi:10.1167/11.3.17 Frank MG, Ekman P, Friesen WV (1993) Behavioral markers and recognizability of the smile of enjoyment. J Pers Soc Psychol 64:83–93. 
doi:10.1037/0022-3514.64.1.83 Gallese V (2001) The ‘shared manifold’ hypothesis. From mirror neurons to empathy. J Conscious Stud 8:33–50 Geller T (2008) Overcoming the uncanny valley. IEEE Comput Graph Appl 28:11–17. doi:10.1109/ MCG.2008.79
Gibson JJ (1966) The senses considered as perceptual systems. Houghton Mifflin, Boston doi:10.1080/00043079.1969.10790296 Hehman E, Flake JK, Freeman JB (2015) Static and dynamic facial cues differentially affect the consistency of social evaluations. Pers Soc Psychol Bull 41:1123–1134. doi:10.1177/ 0146167215591495 Hess U, Kleck RE (1990) Differentiating emotion elicited and deliberate emotional facial expressions. Eur J Soc Psychol 20:369–385. doi:10.1002/ejsp.2420200502 Humphreys GW, Donnely N, Riddoch MJ (1993) Expression is computed separately from facial identity, and is computed separately for moving and static faces: neuropsychological evidence. Neuropsychologia 31:173–181. doi:10.1016/0028-3932(93)90045-2 Iacoboni M (2009) Imitation, empathy, and mirror neurons. Annu Rev Psychol 60:653–670. doi:10.1146/annurev.psych.60.110707.163604 Iacoboni M, Dapretto M (2006) The mirror neuron system and the consequences of its dysfunction. Nat Rev Neurosci 7:942–951. doi:10.1038/nrn2024 Johansson G (1973) Visual perception of biological motion and a model for its analysis. Percept Psychophys 14:201–211. doi:10.3758/BF03212378 Kamachi M, Bruce V, Mukaida S, Gyoba J, Yoshikawa S, Akamatsu S (2001) Dynamic properties influence the perception of facial expressions. Perception 30:875–887. doi:10.1068/p3131 Kappas A, Krumhuber EG, Küster D (2013) Facial behavior. In: Hall JA, Knapp ML (eds) Nonverbal communication (Handbooks of Communication Science, HOCS 2). Mouton de Gruyter, Berlin, pp 131–165 Kätsyri J, Sams M (2008) The effect of dynamics on identifying basic emotions from synthetic and natural faces. Int J Hum Comput Stud 66:233–242. doi:10.1016/j.ijhcs.2007.10.001 Kätsyri J, Förger K, Mäkäräinen M, Takala T (2015) A review of empirical evidence on different uncanny valley hypotheses: support for perceptual mismatch as one road to the valley of eeriness. Front Psychol 6:390. doi:10.3389/fpsyg.2015.00390 Kerlow IV (2004) The art of 3D computer animation and effects, 3rd edn. John Wiley and Sons, Hoboken Kessler H, Doyen-Waldecker C, Hofer C, Hoffmann H, Traue HC, Abler B (2011) Neural correlates of the perception of dynamic versus static facial expressions of emotion. GMS Psychosoc Med 8:1–8. doi:10.3205/psm000072 Kilts CD, Egan G, Gideon DA, Ely TD, Hoffmann JM (2003) Dissociable neural pathways are involved in the recognition of emotion in static and dynamic facial expressions. Neuroimage 18:156–168. doi:10.1006/nimg.2002.1323 Korb S, With S, Niedenthal PM, Kaiser S, Grandjean D (2014) The perception and mimicry of facial movements predict judgments of smile authenticity. PLoS One 9:e99194. doi:10.1371/ journal.pone.0099194 Krumhuber EG, Tamarit L, Roesch EB, Scherer KR (2012) FACSGen 2.0 animation software: generating three-dimensional FACS-valid facial expressions for emotion research. Emotion 12:351–363. doi:10.1037/a0026632 Krumhuber EG, Kappas (2005) Moving smiles: the role of dynamic components for the perception of the genuineness of smiles. J Nonverbal Behav 29:3–24. doi:10.1007/s10919-004-0887-x Krumhuber EG, Manstead ASR (2009) Can Duchenne smiles be feigned? New evidence on felt and false smiles. Emotion 9:807–820. doi:10.1037/a0017844 Krumhuber EG, Manstead ASR, Cosker D, Marshall D, Rosin PL (2009) Effects of dynamic attributes of smiles in human and synthetic faces: a simulated job interview setting. J Nonverbal Behav 33:1–15. 
doi:10.1007/s10919-008-0056-8 Krumhuber EG, Manstead ASR, Cosker D, Marshall D, Rosin PL, Kappas A (2007a) Facial dynamics as indicators of trustworthiness and cooperative behavior. Emotion 7:730–735. doi:10.1037/1528-3542.7.4.730 Krumhuber EG, Manstead ASR, Kappas A (2007b) Temporal aspects of facial displays in person and expression perception: the effects of smile dynamics, head-tilt and gender. J Nonverbal Behav 31:39–56. doi:10.1007/s10919-006-0019-x
14
E.G. Krumhuber and L. Skora
Krumhuber EG, Skora P, Küster D, Fou L (in press) A review of dynamic datasets for facial expression research. Emotion Rev Küster D, Krumhuber EG, Kappas A (2014) Nonverbal behavior online: a focus on interactions with and via artificial agents and avatars. In: Kostic A, Chadee D (eds) Social psychology of nonverbal communications. Palgrave MacMillan, New York, pp 272–302 LaBar KS, Crupain MJ, Vovodic JT, McCarthy G (2003) Dynamic perception of facial affect and identity in the human brain. Cereb Cortex 13:1023–1033. doi:10.1093/cercor/13.10.1023 Lee TW, Josephs O, Dolan RJ, Critchley HD (2006) Imitating expressions: emotion specific neural substrates in facial mimicry. Soc Cogn Affect Neurosci 1:122–135. doi:10.1093/scan/nsl012 Lundqvist L, Dimberg U (1995) Facial expressions are contagious. J Psychophysiol 9:203–211 Maringer M, Krumhuber EG, Fischer AH, Niedenthal P (2011) Beyond smile dynamics: mimicry and beliefs in judgments of smiles. Emotion 11:181–187. doi:10.1037/a0022596 McDonnnell R, Breidt M, Buelthoff HH (2012) Render me real? Investigating the effect of render style on the perception of animated virtual humans. ACM Trans Graph 31:1–11. doi:10.1145/ 2185520.2185587 Mori M (1970) Bukimi No Tani. The Uncanny Valley (MacDorman KF and Minato T, Trans). Energy 7:33–35 Niedenthal P, Brauer M, Halberstadt JB, Innes-Ker AH (2001) When did her smile drop? Facial mimicry and the influences of emotional state on the detection of change in emotional expression. Cognit Emot 15:853–864. doi:10.1080/02699930143000194 Oberman LM, Winkielman P, Ramachandran VS (2007) Face to face: blocking facial mimicry can selectively impair recognition of emotional expressions. Soc Neurosci 2:167–178. doi:10.1080/ 17470910701391943 Piwek L, McKay LS, Pollick FE (2014) Empirical evaluation of the uncanny valley hypothesis fails to confirm the predicted effect of motion. Cognition 130:271–277. doi:10.1016/j. cognition.2013.11.001 Ponari M, Conson M, D’Amico NP, Grossi D, Trojano L (2012) Mapping correspondence between facial mimicry and emotion recognition in healthy subjects. Emotion 12:1398–1403. doi:10.1037/a0028588 Recio G, Schacht A, Sommer W (2013) Classification of dynamic facial expressions of emotion presented briefly. Cognit Emot 27:1486–1494. doi:10.1080/02699931.2013.794128 Recio G, Sommer W, Schacht A (2011) Electrophysiological correlates of perceiving and evaluating static and dynamic facial emotional expressions. Brain Res 1376:66–75. doi:10.1016/j. brainres.2010.12.041 Rizzolatti G, Craighero L (2004) The mirror-neuron system. Annu Rev Neurosci 27:169–192. doi:10.1146/annurev.neuro.27.070203.144230 Rizzolatti G, Fadiga L, Gallese V, Fogassi L (1996) Premotor cortex and the recognition of motor actions. Cogn Brain Res 3:131–141. doi:10.1016/0926-6410(95)00038-0 Rychlowska M, Canadas E, Wood A, Krumhuber EG, Niedenthal P (2014) Blocking mimicry makes true and false smiles look the same. PLoS One 9:e90876. doi:10.1371/journal. pone.0090876 Rymarczyk K, Biele C, Grabowska A, Majczynski H (2011) EMG activity in response to static and dynamic facial expressions. Int J Psychophysiol 79:330–333. doi:10.1016/j. ijpsycho.2010.11.001 Sato W, Yoshikawa S (2004) Brief report. The dynamic aspects of emotional facial expressions. Cognit Emot 18:701–710. doi:10.1080/02699930341000176 Sato W, Yoshikawa S (2007a) Spontaneous facial mimicry in response to dynamic facial expressions. Cognition 104:1–18. 
doi:10.1109/DEVLRN.2005.1490936v Sato W, Yoshikawa S (2007b) Enhanced experience of emotional arousal in response to dynamic facial expressions. J Nonverbal Behav 31:119–135. doi:10.1007/s10919-007-0025-7 Sato W, Fujimura T, Suzuki N (2008) Enhanced facial EMG activity in response to dynamic facial expressions. Int J Psychophysiol 70:70–74. doi:10.1016/j.ijpsycho.2008.06.001
Perceptual Study on Facial Expressions
15
Sato W, Fujimura T, Kochiyama T, Suzuki N (2013) Relationships among facial mimicry, emotional experience, and emotion recognition. PLoS One 8:e57889. doi:10.1371/journal.pone.0057889 Sato W, Kochiyama T, Yoshikawa S, Naito E, Matsumura M (2004) Enhanced neural activity in response to dynamic facial expressions of emotion: an fMRI study. Cogn Brain Res 20:81–91. doi:10.1016/S0926-6410(04)00039-4 Saygin AP, Chaminade T, Ishiguro H, Driver J, Frith C (2012) The thing that should not be: predictive coding and the uncanny valley in perceiving human and humanoid robot actions. Soc Cogn Affect Neurosci 7:413–422. doi:10.1093/scan/nsr025 Schmidt KL, Ambadar Z, Cohn J, Reed LI (2006) Movement differences between deliberate and spontaneous facial expressions: zygomaticus major action in smiling. J Nonverbal Behav 301:37–52. doi:10.1007/s10919-005-0003-x Schmidt KL, Bhattacharya S, Delinger R (2009) Comparison of deliberate and spontaneous facial movement in smiles and eyebrow raises. J Nonverbal Behav 33:35–45. doi:10.1007/s10919008-0058-6 Schulte-Rüther M, Markowitsch HJ, Fink GR, Piefke M (2007) Mirror neuron and theory of mind mechanisms involved in face-to-face interactions: a functional magnetic resonance imaging approach to empathy. J Cogn Neurosci 19:1354–1372. doi:10.1162/jocn.2007.19.8.1354 Singer T, Seymour B, O’Doherty JP, Frith CD (2004) Empathy for pain involves the affective but not the sensory components of pain. Science 303:1157–1162. doi:10.1126/science.1093535 Stel M, Van Baaren RB, Vonk R (2008) Effects of mimicking: acting prosocially by being emotionally moved. Eur J Soc Psychol 38:965–976. doi:10.1002/ejsp.472 Thompson JC, Trafton JG, McKnight P (2011) The perception of humanness from the movements of synthetic agents. Perception 40:695–704. doi:10.1068/p6900 Tinwell A, Grimshaw M, Nabi DA, Williams A (2011) Facial expression of emotion and perception of the Uncanny Valley in virtual characters. Comput Hum Behav 27:741–749. doi:10.1016/j. chb.2010.10.018 Trautmann SA, Fehr T, Hermann M (2009) Emotions in motion: dynamic compared to static facial expressions of disgust and happiness reveal more widespread emotion-specific activations. Brain Res 1284:100–115. doi:10.1016/j.brainres.2009.05.075 Wallraven C, Breidt M, Cunningham DW, Bülthoff H (2008) Evaluating the perceptual realism of animated facial expressions. ACM Trans Appl Percept 4:1–20. doi:10.1145/1278760.1278764 Wehrle T, Kaiser S, Schmidt S, Scherer K (2000) Studying the dynamics of emotional expression using synthetized facial muscle movement. J Pers Soc Psychol 78:105–119. doi:10.1037/00223514.78.1.105 Weiss F, Blum GS, Gleberman L (1987) Anatomically based measurement of facial expressions in simulated versus hypnotically induced affect. Motiv Emot 11:67–81. doi:10.1007/BF00992214 Weyers P, Mühlberger A, Hefele C, Pauli P (2006) Electromyographic responses to static and dynamic avatar emotional facial expressions. Psychophysiology 43:45–453. doi:10.1111/ j.1469-8986.2006.00451.x Yang M, Wang K, Zhang L (2013) Realistic real-time facial expression animation via 3D morphing target. J Softw 8:418–425. doi:10.4304/jsw.8.2.418-425 Yoshikawa S, Sato W (2008) Dynamic facial expressions of emotion induce representational momentum. Cogn Affect Behav Neurosci 8:25–31. doi:10.3758/CABN.8.1.25
Utilizing Unsupervised Crowdsourcing to Develop a Machine Learning Model for Virtual Human Animation Prediction Michael Borish and Benjamin Lok
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CB Framework and VPF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Model Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Application and Real-Time Prediction Adjustments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Crowdsourcing Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Analysis and Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . User Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Crowdsourcing and Expert Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstract
One type of experiential learning in the medical domain is chat interactions with a virtual human. These virtual humans play the role of a patient and allow students to practice skills such as communication and empathy in a safe, but realistic sandbox. These interactions last 10–15 min, and the typical virtual human has approximately 200 responses. Part of the realism of the virtual human’s response is the associated animation. These animations can be time consuming to create and associate with each response. M. Borish (*) • B. Lok Computer and Information Sciences and Engineering Department, University of Florida, Gainesville, FL, USA e-mail: mborish@ufl.edu; [email protected]fl.edu # Springer International Publishing Switzerland 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_21-1
We turned to crowdsourcing to assist with this problem. We decomposed the process of creating basic animations into a simple task that nonexpert workers can complete. We provided workers with a set of predefined basic animations: six focused on head animation and nine focused on body animation. These animations could be mixed and matched for each question/response pair. Then, we used this unsupervised process to create machine learning models for animation prediction: one for head animation and one for body animation. Multiple models were evaluated and their performance was assessed. In an experiment, we evaluated participant perception of multiple versions of a virtual human suffering from dyspepsia (heartburn-like symptoms). For the version of the virtual human that utilized our machine learning approach, participants rated the character's animation on par with that of a commercial expert. Head animation specifically was rated as more natural and more typical than in the other versions. Additionally, analysis of time and cost shows the machine learning approach to be quicker and cheaper than an expert alternative. Keywords
Crowdsourcing • Machine Learning • Virtual Human • Animation Pipeline
Introduction One style of experiential learning in the medical domain is chat interactions with virtual humans. Virtual human interactions allow students to learn by interacting with and observing the response of a virtual human. A typical example of this style of interaction is shown in Fig. 1. Here, students type in questions to the virtual patient and receive responses. Students usually interact with multiple versions of a character that differ in certain details. A single interaction lasts 10–15 min and trains the student on appropriate questions to ask in order to diagnose a patient. To facilitate these interactions, the virtual human typically contains approximately 200 responses. Part of the realism of a virtual human response is animation. Reasonable response animations contribute to the believability of a virtual human and fit our natural expectations of how a human should behave. Animations also play a role in emotion and personality expression (Cassell and Thorisson 1999). However, developing animations for such a large number of responses adds significantly to the logistical, technological, and content requirements of delivering an effective interaction (Triola et al. 2007, 2012). There are numerous approaches for creating the necessary animations for a virtual human. Approaches include automated and procedural algorithms, motion capture pipelines, and animation experts. All of these approaches have trade-offs when considering cost, time, and quality. We propose the use of crowdsourcing as an alternative to create a machine learning model for animation prediction. This prediction model is specifically meant for virtual humans playing the roles of patients for medical interviews.
Fig. 1 Example interaction page – student interacts with a virtual patient by typing questions and receiving responses while also tracking their interview progress
We first decomposed the process of adding animations to a virtual human into a simple task that nonexperts could complete. We then leveraged nonexperts recruited from Amazon’s Mechanical Turk Marketplace to provide creative input to generate complex animations from a small, generic set of 15 basic animations. These animations then provided the basis for a machine learning model to predict future animations. In this chapter, we describe the machine learning models created via our crowdsourcing process. We then apply the prediction models to a virtual human and report on an experiment. In this experiment, participants reported improved perceptions of the virtual human using these prediction models. We also provide a detailed analysis of the prediction models separate from the experiment as well as a time and cost analysis.
State of the Art The effort and people required to construct a virtual human scenario are significant (Triola et al. 2007, 2012). One area of significant effort is the realism of virtual humans afforded through animations. Expert modelers and animators represent a
gold standard in providing realism as the skills these experts utilize can take many years to acquire. An expert was compared to our machine learning models, and details of this comparison will be discussed in both the Experiment and Results sections. This expert has over 15 years of modeling and animation experience in the video game industry with over a dozen games to his name. While the contributions of this individual are of high quality, the effort and time provided by this expert are substantial. When faced with limited resources, various attempts to automate the creation and application of animations have been made. Relative success with procedural algorithms has been found in facial animation. Work such as Hoon et al. (2014) has shown automatic generation of facial expressions to be possible and effective. In work by Brand and Hertzmann (2000), reference motion capture clips were used as the basis of new animation synthesis. Additionally, Deng et al. (2009) presented a method of example-based animation selection and creation for virtual characters. Similarly, Min and Chai (2012) developed a methodology for procedurally generated animation via short descriptions such as "walk four steps to green destination." In this work too, motion capture clips were used as reference. Our work builds upon a similar structure as reference animation blocks were used to construct more complex animations. However, unlike this work, our model does not rely exclusively on motion capture data or expert coding. Rather, our model could be applied to a variety of animations and all construction is handled via crowdsourcing. Generation of animation from user cues has also been explored (Cassell and Thorisson 1999; Sargin et al. 2006). While this work has been successful, typically, the user is evaluated during the interaction for behavior such as posture and intonation. While such analysis provides additional features to evaluate, we limit ourselves to the text of a virtual human's response and basic parameters of the audio. This allows our model to be used for virtual humans in large classroom settings, typical of many medical schools. In these settings, individual simulation presentation is not feasible, and a student is likely to interact via laptop in informal surroundings. In-depth analysis of audio and visual components has also been used with success. In both Marsella et al. (2013) and Levine and Theobalt (2009), in-depth analysis of audio cues was used to predict appropriate gestures for a virtual human response. These gestures were created using motion capture clips as a ground truth to establish timings and constraints on the gestures. Similarly, in Xu et al. (2014), audio and visual analysis was conducted to structure the gesture generation into ideational units. Ideational units are conceptual units that bind verbal and nonverbal behavior together as well as provide constraints on various attributes of interaction such as transitions and rhythm. All of this work assumes the existence of databases of motion capture information or video clips tagged by experts with a variety of potentially complex pieces of information. While our system does require annotation information, the information is simpler than in similar systems and can be provided on demand by crowdsourced workers. Additionally, this simple information can produce the same affect as an expertly animated character.
CB Framework and VPF Our animation process builds upon both the Crowdsource Bootstrapping (CB) Framework (Borish and Lok 2016) and Virtual People Factory (VPF) (Rossen et al. 2010; Rossen and Lok 2012). VPF is a web-based tool for the creation and improvement of virtual humans. VPF also facilitates online virtual human interactions and has been integrated into multiple classes at several universities. VPF has been used by thousands of medical, health, and pharmacy students to practice interpersonal skills and develop diagnostic reasoning. The CB Framework is a gateway tool for VPF that allows an educator to rapidly develop a new virtual experience once a need is identified. The CB Framework decomposes the process of virtual human creation into several discrete steps that utilize crowdsourced nonexperts. The completion of these steps results in a basic virtual human corpus. The corpus is the structured set of text that comprises the knowledge of the virtual human. These stages can be completed in a matter of hours with minimal commitment from the author. With the initial stages of the CB Framework complete, our animation process can be applied as a subsequent stage in the CB Framework. Alternatively, this process can be applied to already existing virtual humans as well. However, the crowdsourcing and machine learning models are framework agnostic, and the implementation described in the subsequent section can be applied to any creation pipeline or virtual human.
Implementation In order to rapidly create animation predictions for the virtual human’s responses, several discrete steps are necessary. The dataflow outlining these steps is shown in Fig. 2. The animation predictions for each response occur before our virtual human begins an interaction. These predictions are stored as part of the virtual human’s corpus. Any additional crowdsourcing also occurs as part of this process. Once the interaction begins, real-time adjustments to the predictions are needed. First, we will describe feature selection and prediction model specifics. We will then discuss real-time adjustments needed to combat repetition in the application of the models. Lastly, we will discuss details related to the crowdsourcing task used to create the prediction models.
Model Metrics In order to create the animation prediction models, we focused on two sets of related features: sentiment and lexical similarity. We reasoned that an accurate animation choice for a specific response should be similar in both sentiment and lexical content. To facilitate these features, we utilized N-grams. N-grams are often used in NLP analysis, and bigrams have proven effective when utilized for sentiment analysis
Fig. 2 Example dataflow for a response that utilizes our animation prediction system and the role of crowdsourcing within it
(Wang and Cardie 2014). Since multiple sentiment metrics were used as part of the animation prediction model, we utilized bigrams as well as unigrams as features for prediction. All feature and machine learning model analyses were carried out using Weka (Hall et al. 2009). A list of the features used is as follows:

• Sentence Sentiment – overall sentiment for the entire sentence was provided as part of the crowdsourcing task. This will be discussed in more detail in section "Crowdsourcing Task".
• Bigram Sentiment – automated calculations of sentiment were provided by the Stanford NLP Sentiment Pipeline (Socher et al. 2013). Sentiment analysis consisted of five categories: very negative, negative, neutral, positive, and very positive. These five categories form a distribution of overall sentiment opinion. Further, five additional metrics were calculated. These metrics included kurtosis, skewness, minimum, maximum, and range of the sentiment distribution. These metrics were calculated to describe the overall shape and agreement of the distribution. Bigram and sentence sentiment have previously been used to create virtual human animation (Hoque et al. 2013). Many previous systems are generally concerned with facial animation resulting from specific emotions. Our system is concerned with body animation; however, like facial animation, body animation is informed in part by sentiment.
• Bigram Position and Total – the position of a bigram in the sentence as well as the total number of bigrams in a sentence were included. We found bigrams near the beginning of sentences to be of relatively higher importance. Typically, crowdsourced workers would favor the beginning of sentences to assign an animation even in multi-sentence responses. This makes intuitive sense as animations would be expected to begin when the speech for a response does. Sentence and clause boundaries have also been shown to be important locations for information in a sentence including head motion (Toshinori et al. 2014).
• Bigram Part of Speech (POS) – bigram POS was also included. POS tagging for individual words was provided by the Stanford CoreNLP Pipeline. Each bigram POS was an aggregation of the POS tagging of the individual words that comprise the bigram.
• Bigram and Bag of Words – the actual bigram as well as a "bag of words" representation of the sentence the bigram was drawn from was included in the model. We reasoned that, beyond sentiment similarity and location, any prediction should be based on lexically similar bigrams and sentences. Thus, we broke each sentence into unigrams for a "bag of words" approach commonly used for lexical similarity.
• Head Animation – this feature was only included in the body animation prediction model. Feature evaluation indicated that the head animation was the single best predictor. Thus, the body prediction model forms a small predictor chain whereby head animations are predicted first. Then, all listed features including the predicted head animation are used to predict body animation (a sketch of this feature assembly is shown below).
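The feature assembly can be sketched compactly. The snippet below is a minimal illustration, not the authors' code: sentiment_distribution stands in for the Stanford sentiment pipeline (assumed to return the five-way distribution for a bigram), and pos_tags and the worker-assigned sentence_sentiment are assumed to be supplied by earlier steps; SciPy provides the distribution-shape metrics.

# Minimal sketch of the per-bigram feature assembly (illustrative names only).
from scipy.stats import kurtosis, skew

def bigram_features(sentence, sentence_sentiment, sentiment_distribution, pos_tags):
    tokens = sentence.split()
    bigrams = list(zip(tokens, tokens[1:]))
    bag_of_words = sorted(set(tokens))        # unigram "bag of words" for lexical similarity
    rows = []
    for position, bigram in enumerate(bigrams):
        dist = sentiment_distribution(bigram)  # assumed 5-way sentiment distribution
        rows.append({
            "sentence_sentiment": sentence_sentiment,   # worker-assigned (neg/neutral/pos)
            "bigram": " ".join(bigram),
            "bigram_pos": "-".join(pos_tags[position:position + 2]),  # aggregated word-level POS
            "position": position,
            "total_bigrams": len(bigrams),
            "bag_of_words": bag_of_words,
            # shape/agreement metrics over the sentiment distribution
            "kurtosis": kurtosis(dist),
            "skewness": skew(dist),
            "minimum": min(dist),
            "maximum": max(dist),
            "range": max(dist) - min(dist),
        })
    return rows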
Application and Real-Time Prediction Adjustments The animation prediction models previously described can suffer from repetition. For example, if two subsequent responses that occur during an interaction were “No, I don’t drink.” and “No, I don’t smoke.” similar predictions of “Shake No Once” and “Arms Crossed” might be predicted. While both predictions would be correct from a machine learning perspective, the repetition would hurt user perception during the actual interaction. So, our system also adds an additional layer of animation selection logic at interaction time. This selection logic is a simple shuffled deck algorithm. A shuffled deck algorithm randomly shuffles all items into a list and iterates over that list. Then, all items are reshuffled and the process repeated. This process creates a pseudo-random selection. For head animations, the shuffled deck algorithm only shuffles animations that have the same general sentiment grouping. Body animations were simply shuffled regardless of sentiment. As will be shown in Tables 3 and 5 in the Results section, there is a clear grouping for head animations while body animations do not show the same pattern. Before an interaction begins, animation prediction is applied to each response in a virtual human’s corpus. First, each bigram in the input response has head and body animations predicted. Then, these predictions are culled. Animation predictions for bigrams at the start of sentences are prioritized due to the importance of sentence boundaries as previously explained. Once these predictions are selected, remaining time is greedily filled. Remaining time is calculated based on access to the audio file associated with the input response and the length of the animations already predicted. Additionally, no animations that extend past the end of the audio file will be suggested. This restriction is to prevent the character from performing gestures after the response to a question is complete. A similarity evaluation takes place separately and simultaneously from animation prediction. This evaluation compares the full input response to the responses used in construction of the animation prediction model. The probability that the input response is a paraphrase of any of the other responses is calculated. This comparison is exactly the same as the NLP algorithm that performs paraphrase selection during a conversation in VPF and is based on the work of McClendon et al. (2014). While
animation predictions will still be returned regardless of score, responses scoring below a certain threshold are sent to the crowdsourcing task to be evaluated by workers and included in future predictions. In this way, we increase the size of the prediction model whenever a lexically dissimilar response is encountered during the prediction process.
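As a rough illustration of this selection logic (a sketch under assumed names, not the authors' implementation), a shuffled deck can be kept per sentiment grouping for head animations and a single deck, shuffled regardless of sentiment, for body animations; the groupings shown follow the head-animation sentiment pattern reported later in Table 3.

import random

class ShuffledDeck:
    """Draws items in a shuffled order and reshuffles once the deck is exhausted."""
    def __init__(self, items):
        self.items = list(items)
        self.deck = []

    def draw(self):
        if not self.deck:                        # reshuffle when the deck is empty
            self.deck = random.sample(self.items, len(self.items))
        return self.deck.pop()

# One deck per sentiment grouping for head animations (illustrative grouping),
# one deck for body animations.
head_decks = {
    "positive": ShuffledDeck(["NodYesOnce", "NodYesTwice"]),
    "neutral":  ShuffledDeck(["TiltHeadLeft", "TiltHeadRight"]),
    "negative": ShuffledDeck(["ShakeNoOnce", "ShakeNoTwice"]),
}
body_deck = ShuffledDeck(["HandsInLap", "ArmsCrossed", "HandGesture", "Shrug"])

def next_head_animation(response_sentiment):
    return head_decks[response_sentiment].draw()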
Crowdsourcing Task As previously mentioned, whenever a lexically dissimilar response is encountered during the prediction process, animation predictions are still provided. However, that input response is sent to a crowdsourcing task for future inclusion in the prediction model. The interface for the crowdsourcing task is shown in Fig. 3. Here, at the top, workers are shown a simple Unity scene with the character. Below the scene, workers are presented with the current question/response pair for which animation assignment will take place. The words that comprise the response are spread across two timelines: one for head and one for body animations. These timelines represent the length of the audio and all animations are sized proportional to this length for each question/response pair. Proportional sizing of the animations was done so that animation length would be commensurate with the speech and not be assigned or playing for a significant period of time afterwards. In this area, there is also a button to play the entire timeline. This button allows workers to see how their animation selections appear in context with the audio and facial animations that the virtual character normally has for the given response. Workers can also click on any of the individual animations to have the virtual character perform a single, simple animation in order to better visualize the resulting timeline selection. On the bottom right are the animation lists and each animation can be dragged to the corresponding timeline. For example, in Fig. 3, for the question/response pair of "Can you read the lines?" and "I can't read any of the lines." a worker might drag a "Shake No Once" and "Hand Gesture" animation to the head and body timelines, respectively. By providing a defined set of animations, overall task difficulty can be reduced. Low task difficulty leads to lower variability in responses by workers and overall better quality data (Callison-Burch and Dredze 2010). Additionally, workers can be confident they are suggesting an animation that is worthwhile since a majority of question/response pairs should be covered by the predefined set of generic animation building blocks. Further, confidence of the workers also plays a role in the quality of data that is provided (Madirolas and de Polavieja 2014). Once animation selections were complete, workers were also asked to provide a rating for the overall sentiment of the response. Worker choices were limited to negative, neutral, and positive. The ultimate sentiment of a response was determined by simple majority vote. This vote occurred after at least three workers provided animation and sentiment selections for a question/response pair.
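A minimal sketch of this majority vote (illustrative only; how ties are broken is not specified in the chapter):

from collections import Counter

def majority_sentiment(votes):
    """Return the most common of 'negative'/'neutral'/'positive' once at least three workers voted."""
    if len(votes) < 3:
        return None                               # wait for at least three workers
    label, _ = Counter(votes).most_common(1)[0]
    return label

# e.g. majority_sentiment(["negative", "neutral", "negative"]) -> "negative"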
Fig. 3 Crowdsourced worker task page – workers suggest animations for a question/response pair from a predefined, base set
Compensation payment for this task was set at $0.40 per Human-Intelligence Task (HIT) on Amazon's Mechanical Turk service and workers were asked to complete five question/response pairs. The HIT compensation was selected at this level due to previous pilot study findings related to the effect of payment on quality and completion time. Paying workers additional money does not improve the overall quality of data as there has been no link found between compensation and quality of work (Mason et al. 2009). As a result, keeping compensation reasonable will result in a lower overall cost for the task without a detriment to quality. In contrast, low compensation reduces the incentive for mainstream workers to complete the task. Additionally, as Adda et al. (2013) note, high compensation can attract malicious workers who game the task in order to maximize payment, which reduces data quality.
Analysis and Experiment We conducted an experiment to evaluate user perception of a virtual human utilizing these head and body animation prediction models. Four versions of a virtual patient suffering from dyspepsia (heartburn-like symptoms) were created. The first version is our control and consisted of a virtual patient whose only animation was lip syncing and an idle “breathing” animation. This version will be referred to as Dcontrol. The
second version added response animations randomly selected from the set of possible animations supplied to crowdsourced workers. This version will be Drandom. The third version had animations suggested by two predictive models: one for head animation and one for body animation as described in the last section. This version will be referred to as DML. Finally, DPro was created by an animation expert. This expert was provided the character and audio files and was allowed to animate the character as he saw fit. This animator regularly produces animations for commercial projects and serves as a gold standard comparison. We also conducted an in-depth analysis of the predictive models in isolation as well as a comparison of cost/time for the crowdsourcing process and expert animator. These will be presented separately.
Procedure Participants were recruited and paid US$1.00 to view a video showing a previous interaction between a student and one of the virtual humans previously described. Participants then filled out a survey. Previous interaction logs were used to create the interaction and the same interaction was used for all four versions of the dyspepsia virtual human. The video was approximately 5 min in length. The survey consisted of questions based on assessments for naturalness of virtual human interactions (Huang et al. 2011; Ho and MacDorman 2010). The questions were answered on a 7-point scale from 1 (Not at all) to 7 (Absolutely) and were as follows:

Q1 Do you think the virtual agent's overall behavior is natural?
Q2 Do you think the virtual agent's overall behavior is sincere?
Q3 Do you think the virtual agent's overall behavior is typical?
Q4 Do you think the virtual agent's head behavior is natural?
Q5 Do you think the virtual agent's head behavior is sincere?
Q6 Do you think the virtual agent's head behavior is typical?
Q7 Do you think the virtual agent's body behavior is natural?
Q8 Do you think the virtual agent's body behavior is sincere?
Q9 Do you think the virtual agent's body behavior is typical?
The participants also described the virtual human on a number of attributes. Each attribute was rated on a 7-point bipolar scale. Those attributes were:

Q10 Artificial – Natural
Q11 Synthetic – Real
Q12 Human-made – Humanlike
Q13 Mechanical – Biological movement
Q14 Predictable – Thrilling
Q15 Passive – Animated
Q16 Smooth/graceful – Sudden/jerky movement
Results and Discussion In total, N = 89 participants were recruited and were distributed as follows: N = 16 viewed Dcontrol, N = 23 viewed Drandom, N = 25 viewed DML, and N = 25 viewed DPro.
Prediction Model The animations used in the prediction models were taken from previous virtual humans. A simple description of the animations is provided below. Animations A–F represent the head animations, while animations G–O represent the body animations.

A NodYesOnce – Single head nod
B NodYesTwice – Multiple head nods
C TiltHeadLeft – Head tilt looking toward the left side
D TiltHeadRight – Head tilt looking toward the right side
E ShakeNoOnce – Single head shake
F ShakeNoTwice – Multiple head shakes
G HandsInLap – Both hands are placed in the lap
H ArmsSweepOut – Arms progress from low to high position sweeping out from the body
I ArmsCrossed – Both arms are crossed in front of body
J HandFlickGestureA – Both hands are raised and in motion in front of the body. The animation ends with the hands flicking away from the body
K HandFlickGestureB – Similar to the previous gesture, however, the length of arm motion is shortened and the flick is less pronounced
L ScratchHeadLeft – Left hand is used to scratch the head
M ScratchHeadRight – Right hand is used to scratch the head
N HandGesture – Arms are up in front of the body at alternate times
O Shrug – Simple shoulder shrug
Multiple models were evaluated using both test sets created from 10 % of the data set and 10-fold cross-validation. The head prediction model contained 992 entries, while the body prediction model contained 939. Due to the relatively small size of the test sets, we present 10-fold cross-validation as more representative of performance. The overall accuracy of several different models is shown in Table 1. Ultimately, we settled on the use of a Bayesian network. The Bayesian net outperformed simpler models and was also on par with more computationally expensive models. We also investigated how often specific animations were applied by crowdsourcing workers according to the sentiment of the response. Tables 2 and 3 show the confusion matrix and sentiment distribution for the head prediction model. Tables 4 and 5 show the confusion matrix and sentiment distribution for the body prediction model.
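For readers who want to reproduce this style of evaluation, the sketch below runs 10-fold cross-validation over a few comparable classifier families. It is only an approximation of the setup: the authors used Weka, and scikit-learn has no direct Bayesian-network equivalent, so only the simpler baselines are shown; X and y stand for the feature matrix and animation labels from the preceding steps.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def compare_models(X, y):
    candidates = {
        "decision tree (J48-like)": DecisionTreeClassifier(),
        "naive Bayes": GaussianNB(),
        "multilayer perceptron": MLPClassifier(max_iter=1000),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
        print(f"{name}: mean accuracy {scores.mean():.1%}")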
Table 1 Model accuracies – percentage accuracies for a selection of different machine learning models for head and body animation

                        Head model   Body model
J48                     42.9         16.4
Naive Bayes             37.8         14.8
Bayesian Net            48.4         19.8
Multilayer Perceptron   21.7         16.6

Table 2 Head confusion matrix

      A    B    C   D    E    F
A   131   37   48   1    1    1
B    77   22   19   0    0    0
C    20    7  157   0   16    4
D    10    6  110   0   13    3
E     0    0   10   0  153   15
F     0    0   11   0  103   17

Table 3 Sentiment of head animations according to crowdsourced workers (all values are percentages)

Animation   Negative   Neutral   Positive
A            1.0       20.4      78.6
B            1.0       14.4      84.6
C            8.6       78.0      13.4
D           12.4       76.1      11.5
E           99.4        0.6       0
F           94.2        5.8       0

Table 4 Body confusion matrix

      G    H   I   J   K   L   M    N    O
G    32   58   0   0   0   0   0   45   38
H    16   63   1   0   0   0   0   25   28
I    27   50   1   0   0   0   0   21   34
J    15   26   0   0   0   0   0   24   26
K    10   14   2   0   0   0   0   14   16
L    12   18   0   0   0   0   0   16   28
M     7   13   1   0   0   0   0    7   10
N    35   28   1   0   0   0   0   42   35
O    12   31   1   0   0   0   0   13   43

Table 5 Sentiment of body animations according to crowdsourced workers (all values are percentages)

Animation   Negative   Neutral   Positive
G           36.5       25.8      37.7
H           50.8       22.7      26.6
I           37.5       29.7      32.8
J           26.3       40.0      33.8
K           31.3       29.2      39.6
L           28.0       40.0      32.0
M           36.1       36.1      27.8
N           21.4       40.5      38.2
O           34.4       46.9      18.8

As can be seen from the confusion matrix in Table 2, the head prediction is reasonably accurate overall and achieves an accuracy of 48.4 %. This is significantly better than a random chance of 16.7 % for a classification with six alternatives. Further, when the prediction model is incorrect, it is generally incorrect in a reasonable way. Most of the errors in the confusion matrix are with interchangeable animations. For example, if the participant were to ask "Do you do any drugs?" and receive a response of "No, I don't do any illegal drugs." then the typical response would include an animation in which the virtual human is shaking its head no. Whether the model predicts shaking the head no once or multiple times, both
are reasonable. The same holds true for the other head animations. The interchanging of certain animations is also shown in the sentiment distribution. There are three clear groupings in the distribution as assigned by crowdsourced workers. The highly skewed distributions indicate agreement among workers as to when certain head animations are expected based on the sentiment of the response. Indeed, feature evaluation shows that worker-assigned sentiment is the best predictor of head animation out of all features. While the head animation prediction model showed clear patterns and groupings, the same is not true for the body prediction model. The confusion matrix and sentiment distributions are shown in Tables 4 and 5, respectively. As can be seen in Table 4, the overall accuracy is lower than that of the head prediction model and is around 20 %. This is still higher than random chance at 11 %, but several of the animations are simply never predicted. Alternative simpler models such as naive Bayes and J48 do capture some of these predictions; however, the overall accuracy of these models is even lower. Again, similar to head prediction, more complex models did not produce an increase in overall accuracy. The reason for this performance becomes clearer when looking at the sentiment distribution for body animations. As shown in Table 5, the sentiment is roughly evenly distributed for each of the body animations. The body animations do not show any clear groupings that were evident for head animations. This relatively equal use of a body animation regardless of response sentiment indicates crowdsourced workers could not truly arrive at an agreement on what a "correct" body animation was for a given response. Our use of a neutral medical interview scenario could be one factor. Many of our virtual humans such as the dyspepsia scenario are meant to allow students to practice basic interviewing skills. The students are learning what questions to ask and how to gather information. A typical example of this interview would be an exchange similar to the following: "Do you do any drugs?", "No, I don't do any illegal drugs.", "Do you drink at all?", "Yeah, I have a couple of beers a week." For both of these responses, there are expected head behaviors. The first response would have some variation of shaking no, while the second response would contain some variation of nodding yes. However, what is the "correct" body animation?
In such an example, a patient might simply need to be animated to be believable without any pattern. An alternative scenario such as revealing a diagnosis of cancer may contain emotional moments where specific body motions would be expected and represents one avenue of future work. Another factor might have been the coarse grained nature of control provided to crowdsourced workers. Workers were allowed only to place animations on the appropriate timeline. However, there are multiple ways to provide more fine-grained control. Two examples would be specific gaze targets to accompany the head animation and speed control to allow fine tuning of an animation. With additional controls, a more notable pattern to the body animations might present itself. The expansion of the controls is still another opportunity for future investigation.
User Perception An ANOVA was calculated for each survey response, Q1–Q16, listed in the previous section. For each question for which statistical significance was found, a Tukey analysis was conducted. While the ANOVA was significant for questions Q7, Q8, Q10, Q11, Q13, Q14, and Q15, the Tukey analysis did not show any interesting results. For all these questions, statistical significance for a difference in means was between DControl and one of the other models. This makes sense as any type of animation would increase user perception over a virtual human with no animation. Tukey analysis showed results of interest for Q4 and Q6. These questions asked whether or not the virtual human head behavior is natural or typical, respectively. The results are summarized in Table 6. Users found the head animation of DML more natural and more typical than both DControl and DRandom. Further, DML was found to be comparable to DPro. However, no difference was found for the overall or body animations. These differences make sense in the context of the model analysis from the previous section. With clear sentiment groupings and patterns for the head prediction model, crowdsourced workers generally agree that there exists a "correct" head animation for specific responses. This is reflected in the improved ratings for the virtual human. Additionally, the machine learning models produced from the crowdsourced workers produced the same affect as an expert animator. As we will discuss in the next section, these models were created more cheaply and quickly than the expert version. These savings make this an attractive alternative. In contrast, the body animation model consisted of animations whose sentiment was evenly distributed across all categories. There were no clear patterns and crowdsourced workers could not agree on a "correct" body animation for a response. This lack of agreement aligns with user perception as body predictions for DML were not perceived any differently from the other versions. As participant perception shows, appropriate animation for a virtual human's response is important. In the medical domain, realistic virtual human interactions are an increasingly used educational component. This realism is directly affected by the appropriate choice of head animation and a reasonable choice for body animation.
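As an illustration of this analysis pipeline (a sketch with placeholder data structures, not the study's scripts), each question's ratings can be tested with a one-way ANOVA across the four conditions followed by Tukey's HSD:

import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze_question(ratings_by_condition):
    # ratings_by_condition: e.g. {"DControl": [...], "DRandom": [...], "DML": [...], "DPro": [...]}
    groups = list(ratings_by_condition.values())
    f_stat, p_value = f_oneway(*groups)           # one-way ANOVA across conditions
    print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.3f}")
    if p_value < 0.05:                            # pairwise follow-up only if significant
        scores = np.concatenate(groups)
        labels = np.concatenate([[name] * len(vals)
                                 for name, vals in ratings_by_condition.items()])
        print(pairwise_tukeyhsd(scores, labels, alpha=0.05))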
Table 6 User perception results with significant Tukey analysis Question Do you think the virtual agent’s head behavior is natural?
FAnova 7.593
pAnova .000
Do you think the virtual agent’s head behavior is typical?
7.194
.000
Tukey significance DMLDpro = .062 DMLDRandom = .025 DMLDControl = .000 DMLDPro = .105 DMLDRandom = .049 DMLDControl = .000
Mean DControl = 3.50 DRandom = 4.360 Dml = 5.38 DPro = 4.48
SD DControl = 1.50 DRandom = 1.15 D ml = 1.01 DPro = 1.32
DControl = 3.44 DRandom = 4.40 Dml =5.45 DPro = 4.56
DControl = 1.75 DRandom = 1.29 Dml = 0.93 DPro = 1.50
Additionally, the animations predicted by our machine learning approach produce the same affect as an expertly animated character.
Crowdsourcing and Expert Cost As shown in Table 7, the cost and time per response for the machine learning model construction are much lower than for the use of an expert. The model construction required 15.75 h of effort and cost $75.00, while the expert animator required 25.75 h of effort and $901.25 to complete their work. While time will always be variable in a crowdsourced approach, the construction effort occurred over an approximately 24 h period. We are confident that 24–48 h is plenty of time to accomplish model construction in the future. The expert required one week to accomplish animation of the responses. Additionally, the machine learning models covered a larger breadth of responses and were constructed from 314 responses, while the expert animator worked on only 34 responses. The animator focused on these 34 responses because those were the responses necessary for the 5 min video shown to participants. This creates a cost and time per response as shown in Table 7. Based on these results, if every response in the 314 responses used in the machine learning model required a unique animation, an expert animator would require approximately six weeks of full-time work and $8,000.00 to complete the work. The machine learning models utilized for DML have several clear advantages. The models cover a larger breadth of responses and were evaluated much more quickly. Further, while there will certainly be some reuse of the animations created by the expert, the animations were not intended to be generic and reusable. These results suggest a shift in the role of expert animators from covering all responses to covering only necessary responses. As previously described, our system performs a similarity comparison. This comparison is a paraphrase identification algorithm. This algorithm determines if a particular response is a paraphrase of any
Table 7 Time and cost estimates

                        Time per response (minutes)   Cost per response (US dollars)
ML model construction   3                              $0.24
Animation expert        45                             $26.51
other response previously encountered. For those responses that are scored low, an expert could create the necessary animation. Alternatively, different approaches that specifically address unfamiliar content could be used. For responses that score highly, the expert’s time is better spent elsewhere as the machine learning models can produce the same affect quicker and cheaper. These highly scoring responses are likely to make up the bulk of any virtual human corpus for a chat interaction in the medical domain. As mentioned at the beginning of this chapter, these virtual humans have several hundred responses but contain numerous similarities as multiple versions of the same scenario are often created. For example, most virtual humans would be asked whether or not they smoke, drink, or do drugs. Expert animators’ time can be utilized more efficiently by targeting information that requires their attention rather than these common responses. Our crowdsourcing approach also offers the benefit of continuous improvement. Whenever an unfamiliar response is encountered, regardless of whether an animation expert provides a new animation, the crowdsourcing algorithm can quickly incorporate new information on demand. The crowdsourcing algorithm can send the response to workers who provide timeline selections and sentiment scores. Our system also avoids requiring experts to provide complex information tagging as in related systems previously described.
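A quick back-of-the-envelope check of these figures (numbers taken from the text; the six-week estimate assumes a 40 h work week):

# Per-response time/cost and the projected expert effort for all 314 responses.
ml_hours, ml_cost, ml_responses = 15.75, 75.00, 314
expert_hours, expert_cost, expert_responses = 25.75, 901.25, 34

print(round(ml_hours * 60 / ml_responses, 1))            # ~3.0 min per response
print(round(ml_cost / ml_responses, 2))                   # ~$0.24 per response
print(round(expert_hours * 60 / expert_responses, 1))     # ~45.4 min per response
print(round(expert_cost / expert_responses, 2))           # ~$26.51 per response

# If all 314 responses needed unique expert animation:
print(round(45 * 314 / 60 / 40, 1))                       # ~5.9 work weeks
print(round(26.51 * 314))                                  # ~$8,324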
Limitations Our prediction models have been applied to virtual humans in the medical domain for a dialogue-type interaction with a relatively short conversation time of 10–15 min. While this process would generalize to similar dialogue domains, it may not be applicable to other types of interactions such as instruction or open-ended conversations. Such interaction styles may not have the same type of information overlap present here and may suffer from repetition from this model, even with the mitigation described. However, this does present an opportunity for future improvement as crowdsourced workers could be given a larger set of animations to work with and additional controls. This does need to be balanced with the increase in task difficulty and the potential decrease in agreement among workers. Our prediction models are also biased by the interaction context and domain. Virtual human chat interactions in the medical domain are meant to teach students how to ask questions and retrieve information. A majority of these questions have muted emotional context as they are usually matter of fact. When emotional context is involved, the context is skewed negative toward sad emotions as the virtual humans are suffering from some medical issue. These biases must be accounted
for and the models trained on appropriate data if the domain differs from what is described here.
Future Directions Our process has demonstrated the unsupervised creation of an animation prediction model via crowdsourced workers to be a viable alternative to more resource-intensive creation methods. A virtual human whose animations are assigned by our prediction models has head animation that is regarded as more natural and typical by participants. Based on crowdsourced workers' input, any reasonable body animation will suffice as no difference in perception occurred. This was in alignment with the data collected for the machine learning model that did not find a clear agreement on what constitutes a "correct" body animation for a general virtual human response. The virtual human animated using our prediction model produces similar affect to a version of the virtual human animated by an expert. Importantly, this crowdsourced model is cheaper and faster to create, and can also be updated on demand through the use of additional crowdsourcing. These benefits highlight the need to refocus experts only on the information that requires their attention while leaving mundane responses to be animated automatically by our models. By freeing experts from repetitious work, our system aims to reduce the barrier to creation to allow virtual humans to be utilized more widely in medical education. We intend to pursue this research with additional studies. One such study we are planning is expansion of the crowdsourcing task. With additional controls such as animation speed and gaze targets, additional patterns may present themselves. We also intend to continue development of the body prediction model. As mentioned, specific emotional events or scenarios may play a role in the correctness of body animation, and we plan to investigate such cases with models specifically tuned to those situations. Additionally, we plan to further investigate whether other features may be required to identify an already existing pattern in body animations.
References Adda G, Mariani J, Besacier L, Gelas H (2013) Economic and ethical background of crowdsourcing for speech. In: Crowdsourcing for speech processing: applications to data collection, pp 303–334 Borish M, Lok B (2016) Rapid low-cost virtual human bootstrapping via the crowd. Trans Intell Syst Technol 7(4):47 Brand M, Hertzmann A (2000) Style machines. In: 27th SIGGRAPH, pp 183–192 Callison-Burch C, Dredze M (2010) Creating speech and language data with Amazons Mechanical Turk. In: Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, number June, pp 1–12 Cassell J, Thorisson KR (1999) The power of a nod and a glance: envelope vs. emotional feedback in animated conversational agents. Appl Artif Intell 13(4–5):519–538
Deng Z, Gu Q, Li Q (2009) Perceptually consistent example-based human motion retrieval. In: Interactive 3D graphics and games, vol 1, pp 191–198 Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software. ACM SIGKDD Explor Newsl 11(1):10 Ho C-C, MacDorman KF (2010) Revisiting the uncanny valley theory: developing and validating an alternative to the Godspeed indices. Comput Hum Behav 26(6):1508–1518 Hoon LN, Chai WY, Aidil K, Abd A (2014) Development of real-time lip sync animation framework based on viseme human speech. Arch Des Res 27(4):19–29 Hoque ME, Courgeon M, Mutlu B, Picard RW, Link C, Martin JC (2013) MACH: My Automated Conversation coacH. In: Pervasive and ubiquitous computing, pp 697–706 Huang L, Morency LP, Gratch J (2011) Virtual rapport 2.0. In: Intelligent virtual agents, pp 68–79 Levine S, Theobalt C (2009) Real-time prosody-driven synthesis of body language. ACM Trans Graph 28(5):17 Madirolas G, de Polavieja G (2014) Wisdom of the confident: using social interactions to eliminate the bias in wisdom of the crowds. In: Collective intelligence, pp 2012–2015 Marsella S, Lhommet M, Feng A (2013) Virtual character performance from speech. In: 12th SIGGRAPH/Eurographics symposium on computer animation, pp 25–35 Mason W, Street W, Watts DJ (2009) Financial incentives and the performance of crowds. SIGKDD 11(2):100–108 Mcclendon JL, Mack NA, Hodges LF (2014) The use of paraphrase identification in the retrieval of appropriate responses for script based conversational agents. In: Twenty-seventh international flairs conference, pp 19–201 Min J, Chai J (2012) Motion graphs++. ACM Trans Graph 31(6):153 Rossen B, Lok B (2012) A crowdsourcing method to develop virtual human conversational agents. IJHCS 70(4):301–319 Rossen B, Cendan J, Lok B (2010) Using virtual humans to bootstrap the creation of other virtual humans. In: Intelligent virtual agents, pp 392–398 Sargin ME, Aran O, Karpov A, Ofli F, Yasinnik Y, Wilson S, Erzin E, Yemez Y, Tekalp AM (2006) Combined gesture-speech analysis and speech driven gesture synthesis. In: Multimedia and Expo, number Jan 2016, pp 893–896 Socher R, Perelygin A, Wu JY, Chuang J, Manning CD, Ng AY, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: EMNLP, p 1642 Toshinori C, Ishiguro H, Hagita N (2014) Analysis of relationship between head motion events and speech in dialogue conversations. Speech Comm 57:233–243 Triola MM, Campion N, Mcgee JB, Albright S, Greene P, Smothers V, Ellaway R (2007) An XML standard for virtual patients: exchanging case-based simulations in medical education. In: AMIA, pp 741–745 Triola MM, Huwendiek S, Levinson AJ, Cook DA (2012) New directions in e-learning research in health professions education: report of two symposia. Med Teach 34(1):15–20 Wang L, Cardie C (2014) Improving agreement and disagreement identification in online discussions with a socially-tuned sentiment lexicon. In: ACL, vol 97, p 97 Xu Y, Pelachaud C, Marsella S (2014) Compound gesture generation: a model based on ideational units. In: IVA, pp 477–491
Clinical Gait Assessment by Video Observation and 2D Techniques Andreas Kranzl
Abstract
Observational gait analysis, in particular video-based gait analysis, is extremely valuable in the daily clinical routine. Certain requirements are necessary in order to be able to perform a high-quality analysis. The walking distance must be sufficiently long (depending on the type of patient), the utilized equipment should meet the requirements, and there should be a recording log. The quality of the videos for evaluation depends on the recording conditions of the video cameras. Exposure time, additional lighting, and camera position all need to be adjusted properly for sagittal and frontal imaging. Filming the video in a room designated for this purpose will help to ensure constant recording conditions and quality. The recordings should always be carried out based on a recording log. The test form can act as a guide for the evaluation of the video. This provides an objective description of the gait. It is important to always keep in mind that the evaluation must remain subjective to a certain degree. Depending on the gait parameter, the reproducibility of its rating (intra- and inter-rater reliability) is moderate to good. In addition to a database function, current video recording software is able to measure angles and distances. It should also be possible to play back two videos in parallel, for example, to view the presurgical and postsurgical gait simultaneously. Despite the implementation of three-dimensional measurement systems for gait analysis, observational or video-supported gait analysis is justified in daily clinical operations.
Database • Observational gait • Reliability • Room size • Video camera
A. Kranzl (*) Laboratory for Gait and Human Motion Analysis, Orthopedic Hospital Speising, Vienna, Austria e-mail: [email protected] # Springer International Publishing AG 2017 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_24-1
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Observational Gait Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Treadmill . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Patient Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Structured Analysis of the Gait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reliability of the Evaluation of Observational Video-Based Gait Analysis . . . . . . . . . . . . . . . . . . . . Video Recording Log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Clinical Use of 2D Motion Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Database Software and Capturing Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion/Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Introduction Apart from purely observational (visual) gait analysis, recording of gait with a video camera is the most common type of analysis. Modern video cameras for the consumer market are readily available and fairly inexpensive. Recording is mostly performed directly in the camera or stored on a PC using suitable software. The advantages of video recording are that the footage can be replayed several times, paused, and viewed in slow motion; it is also easy to use. Video recordings are often used in addition to three-dimensional motion capture systems, on the one hand to visually document how the patient walks and, on the other hand, to document movements that the implemented biomechanical model does not capture in the analysis. The size of the room, the position of the cameras, the lighting in the room, the marking of the walking distance, and the recording log are all important for producing high-quality video for gait evaluation.
State of the Art Video recordings are used in the clinical movement analysis laboratory as well as in other institutions such as medical offices and physiotherapy offices, to evaluate movement processes (in particular in walking). The use of gait analysis as an additional component in treatment plans is uncontested. Recordings are usually carried out using two video cameras (sagittal and frontal planes). The advantage of video recordings without additional measurement instruments is that the patient is not burdened with additional material (markers, measurement equipment, etc.). Video recordings are stored in a database and are therefore easy to find. This allows simple comparisons to be carried out before and after therapy.
Fig. 1 Room for video recording. Optimal use of a room 6 m wide and 10 m long. The walking distance is color coded (gray), and cross lines help determine stride length
Observational Gait Analysis Observational gait analysis has an invaluable role to play in clinical routine. However, human vision can only resolve a frequency of about 12–18 Hz. Due to this low temporal resolution, not all movement details can be recognized with the naked eye. A further problem is that gait disturbances do not occur in only one plane of one joint, but in numerous planes and numerous joints simultaneously. The recording of gait with a video camera significantly simplifies analysis. With the possibility of watching the video several times and at various speeds, a detailed analysis of individual joints and planes is possible. A video analysis can be carried out quickly; however, preparation is necessary in order to perform a high-quality video analysis. To obtain a high-quality recording, a suitable recording room is required and certain technical requirements need to be fulfilled by the camera. It is not always easy to find a room of suitable size. For recording the sagittal plane, the room should be at least 10 m long to ensure sufficient room for walking, enabling the patient to walk at their normal speed with enough steps recorded for evaluation of the gait (Fig. 1). This walking distance is sufficient for patients with mild gait disturbances; the distance can be shorter for patients with more severe gait disturbances or for children, due to the shorter step length. For better guidance of the patient, it is useful to mark the walking distance in a color that contrasts with the rest of the floor (Fig. 1). The start and end of the walking distance should be marked for the patient. Subtle lines or predefined distances along the walking distance help determine step length in the subsequent
Fig. 2 Treadmill wide angle. Due to the size of the room, a video camera cannot be positioned far enough from the treadmill to be able to see the entire patient in the image (image within the red section), whereas a wide-angle lens allows imaging of the entire patient from the same position. The camera in the left image is oriented at exactly 90° to the axis of movement, the central image is positioned at an angle of 30° to the patient, and the right image at 45°. The measured knee angles are 30°, 28°, and 25°, although the recordings were performed simultaneously with three cameras. These different angle measurements are caused by parallax error. It is necessary to be aware of this error if the patient is not centered in the image
evaluation by video. The spatial depth of the room is important for the lateral recording; the required depth is determined by the number of double steps to be captured as well as by the camera optics. In the sagittal recording, in order to perform an adequate evaluation, the entire patient needs to be seen, from head to toe. The camera should be positioned at a right angle to the walking distance. It is possible to have the camera follow the patient; however, this can create difficulties in the measurement of joint positions and distances. Parallax error can lead to errors in the determination of the joint angles. Therefore, care must be taken to ensure that the plane of movement is at a right angle to the camera axis. For evaluation of joint positions, the camera should be aimed centrally at the patient and not at an angle. The evaluation of joint angles in the border areas of the video image is only possible to a limited extent due to parallax error. A wide-angle lens can be of assistance in a room with insufficient depth (Fig. 2), as it allows video recordings to be carried out in rooms with limited depth. It should be noted, though, that distortion can occur in the border areas of the video; depending on the quality of the wide-angle lens, this can be more or less pronounced. For frontal recordings, it is necessary to ensure that the patient is in the center of the image. For all recordings, the patient should be imaged as large as possible. This can be achieved with the zoom function of the video camera. The video format 16:9
Fig. 3 Patient recorded with three cameras simultaneously
Fig. 4 Accompanying camera
also has the advantage that the patient can be recorded in portrait layout. It is also possible to record numerous gait sequences in the sagittal recording with the wider video image. In some centers, the side camera is mounted on a motorized or manually operated track system (Fig. 4). This has the advantage that the camera can follow the patient at a 90° angle and thereby record numerous gait sequences. Numerous gait sequences can also be recorded by moving the camera, but these extra gait sequences are more difficult to evaluate due to parallax error (Fig. 3). A camera that accompanies the patient is shown in Fig. 4. This camera position allows a right-angled view of the patient. The camera is mounted on a track system and is controlled with a motion capture system. It is useful to record the frontal image of the patient in full size, from the pelvis downward and from the knees downward (focus on the ankle joint). The detailed view allows a more precise evaluation of movements. This can be performed either via the positioning of the camera on a height-adjustable tripod or on a height-adjustable wall
Fig. 5 Optimized view of the relevant segments
Fig. 6 Camera position for frontal recordings. With a height-adjustable frontal camera, an optimized position can be achieved to image the relevant body segments as large as possible. From left to right: entire body, pelvis downward, and foot area (Fig. 5)
mount (manual or electric) (Fig. 6). In general, it is useful to attach the video camera to a tripod, which allows optimal horizontal positioning. A fixed position on the wall is even better, so that the camera position is the same for all recordings and does not need to be adjusted or checked for each recording. In the left of the image (Fig. 6), a manual height-adjustable camera can be seen, which is focused on the walking distance of the instrumented 3D gait analysis. On the right, a camera can be seen which is automatically height adjusted with the PC and focused on the walking distance for the video recording. Apart from the room measurements, lighting conditions and the color of the floor and walls are also important. There should be no interfering light from windows,
Fig. 7 Influence of illumination on the imaging quality
since this can lead to lighting problems. Furthermore, the background of the video recording should have calm colors. Equipment that is standing around should also not be visible in the video. The room lighting should provide even illumination from above, and the room should be evenly lit. The light source should not flicker, since, especially in high-resolution video recording, a flickering light can be very disturbing. Conventional video cameras have an automatic feature that adjusts exposure time and shutter depending on the available light. For high-quality recordings, however, a manual adjustment of the exposure time is useful. The exposure time should be as short as possible without the video image becoming too dark (Fig. 7). Illumination of 4000 lx or more is useful. Additional lighting, which is appropriate for the camera,
can be helpful to improve recording conditions. It is necessary to ensure, however, that no reflections on the skin are created by the additional lighting. Figure 7 shows a patient walking at moderate speed on a treadmill, recorded with varying exposure time and additional lighting. Top left: automatic setting of the camera without additional lighting; top right: automatic setting of the camera without additional lighting; left center: exposure time 1/100 with additional lighting; right center: exposure time 1/500 with additional lighting; bottom left: exposure time 1/1000 with additional lighting. It can be seen that the exposure time of 1/500 with additional lighting provides the clearest image; a shorter exposure time creates an image that is too dark. Some centers also perform video recordings from above and/or from below in order to better record transversal movement in the gait. The transversal view, however, does not provide imaging of all body parts while walking. From below, the feet block the pelvis; from above, the upper body is well recognized, but from the pelvis downward there is almost no visibility. In order to evaluate femur rotation, marking the patella is useful. The foot angle is evaluated from behind in the frontal video image. The sagittal and frontal video recordings are performed either sequentially or simultaneously. The advantage of a simultaneous recording is that gait events can be viewed in the frontal and sagittal video recordings at the same time. Videos can also be compiled into a single video image (split screen), or videos can be recorded in parallel and synchronized on a computer screen. Most video cameras have a built-in storage option for recording the video. This has the disadvantage that it can be difficult to find a particular recording. It is better to record directly to the computer via suitable recording software, which should include a database function so that patient videos can be found quickly and without difficulty. The connection of the video camera to the computer depends on the requirements of the camera. Common connections are USB, HDMI, or the component connection. The FireWire/IEEE 1394 connection is seldom found in video cameras today, if at all. If possible, when using two or more video cameras, care should be taken that the hardware can be synchronized. If the hardware cannot be synchronized, at least the software should allow subsequent synchronization. Technical requirements for a video camera:
• Lens size
• Manual adjustment of shutter and illumination
• Automatic and manual focus
• 16:9 image sensor ratio
• Optical zoom
• USB, HDMI, or component connection
• Recording frequency of 50 Hz or more, depending on the speed of motion
The costs for a video camera range from € 200 to € 1500, depending on the quality. The higher purchase price of a high-quality camera quickly pays for itself through easier evaluation of the video recordings. High-quality cameras usually have a larger lens and sensor, which means better imaging quality under the same light conditions. The recording frequency should be at least 25 images per second for gait evaluation. For running analyses, the recording speed should be at least 100 Hz. Most operating systems include simple software for video recording on the computer; however, the playback options are usually limited. It is better to purchase suitable software for video recordings, and numerous manufacturers have developed such software. The requirements for the software include the possibility of recording one or more video sources simultaneously, as well as a database function. The following points are important for playback: real time, slow motion, exact selection of still frames, as well as moving forward and backward frame by frame. The ability to play back numerous videos for comparison of various conditions and examination times is extremely helpful.
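To make the exposure-time and frame-rate recommendations above more concrete, the following sketch estimates motion blur and the number of frames available per gait cycle. The numbers used (marker speed, field of view, resolution, cycle duration) are illustrative assumptions, not values prescribed in this chapter.

```python
# Rough estimate of motion blur and temporal sampling for video gait recording.
# Illustrative assumptions: a point on the foot moving at up to 3 m/s during swing,
# a horizontal field of view of 4 m imaged onto 1920 pixels, gait cycle of ~1.1 s.

def blur_pixels(speed_m_s, exposure_s, fov_m=4.0, h_pixels=1920):
    """Distance travelled during one exposure, expressed in image pixels."""
    return speed_m_s * exposure_s * h_pixels / fov_m

def frames_per_cycle(frame_rate_hz, cycle_duration_s=1.1):
    """Number of video frames captured during one gait cycle."""
    return frame_rate_hz * cycle_duration_s

for exposure in (1/100, 1/500, 1/1000):
    print(f"exposure 1/{round(1/exposure)} s -> blur ~ {blur_pixels(3.0, exposure):.1f} px")

for rate in (25, 50, 100):
    print(f"{rate} Hz -> ~ {frames_per_cycle(rate):.0f} frames per gait cycle")
```

Under these assumptions, an exposure of 1/100 s smears a fast-moving foot over roughly 14 pixels, whereas 1/500 s reduces this to about 3 pixels, which is consistent with the observation above that 1/500 s with additional lighting gives the clearest image.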
Treadmill The use of a treadmill in combination with video cameras allows optimal positioning of the camera as well as the ability to record a larger number of gait sequences in a short time. With the exact positioning of the camera parallel or at a right angle to the axis of motion, the measurement of angles and distances is simplified. However, not all patients are used to walking on a treadmill. Studies show that healthy individuals require a familiarization period of about 6 min for walking and running, whereas older individuals require at least 14 min (Wass et al. 2005). Even then, there are differences in the kinematics and kinetics of the gait compared with overground walking (Alton et al. 1998; Wass et al. 2005). Therefore, the use of a treadmill for recording gait is only sensible if the patient is used to it. Usually, there is insufficient time and staff to give the patient enough time for adaptation.
Patient Preparation It is important for the quality of the recording that the patient is appropriately prepared. For this, it is best if the patient walks in his/her undergarments during the recording. Shorts and a t-shirt are also possible; however, it must be ensured that the pelvis can be well observed and that the t-shirt does not cover the pelvis. Marking the joint points (ankle, knee, hip) as well as the patella is helpful in video recordings in order to be able to better evaluate and/or measure the joint angles during walking. The markings should be applied while the patient is standing (not while lying down, because of skin movement); otherwise, the marked spots may no longer represent the desired skeletal reference points.
Structured Analysis of the Gait It is important for a high-quality evaluation that every plane and every joint is observed in a structured manner. This should ensure that gait disturbances or gait patterns are recognized even if they are not the main focus of the pathology. Numerous gait cycles should be observed, as experts tend to concentrate on certain parameters of the gait (Toro et al. 2003). Prefabricated examination forms which guide the examiner through the analysis are useful. One of the most well-known examination forms is the form from Jacqueline Perry (1992), in which values can be entered for each gait phase and for each joint. Other structured examination forms go a step further and introduce an additional scoring system (Visual Gait Assessment Scale, Edinburgh Visual Gait Scale, Observational Gait Scale, Physician’s Rating Scale). Such scoring systems allow the degree of a gait disturbance to be determined and show whether the gait has improved in a second analysis following therapeutic measures (Viehweger et al. 2010). In particular, in the therapeutic environment, observational gait analysis is suitable for documenting therapeutic progress (Coutts 1999). The score describes the gait with one or a few values. However, it needs to be taken into account that this reduction of data may no longer allow an exact description of the disturbance.
Reliability of the Evaluation of Observational Video-Based Gait Analysis With respect to study data on the validity and reliability of gait evaluation via observational gait analysis (including the use of video), the picture differs depending on the parameter analyzed. Determination of initial floor contact, for example, shows good interobserver reliability (Mackey et al. 2003). The reproducibility of the results depends on the experience of the reviewer; experienced persons tend to demonstrate higher reproducibility. These evaluations, however, remain subjective and mostly show only moderate reproducibility (Borel et al. 2011; Brunnekreef et al. 2005; Eastlack et al. 1991; Hillman et al. 2010; Krebs et al. 1985; Rathinam et al. 2014). Force cannot be recorded with video alone, although force plates can be integrated via video recording software products. Following calibration of the position of the force plates with respect to the video image, an overlay function adds the force vector to the video image. In this way, force and leverage can be visualized, which is especially helpful in the fine adjustment of lower leg orthotics and prosthetics.
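Observer agreement in such reliability studies is typically quantified with chance-corrected statistics such as Cohen's kappa (for categorical ratings) or intraclass correlation coefficients. The snippet below is a minimal illustration of Cohen's kappa for two hypothetical raters scoring a gait feature as normal/mild/severe; the ratings are invented purely for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters scoring the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of knee position at initial contact for ten patients
rater_1 = ["normal", "mild", "mild", "severe", "normal", "mild", "normal", "severe", "mild", "normal"]
rater_2 = ["normal", "mild", "normal", "severe", "normal", "mild", "mild", "severe", "mild", "normal"]
print(f"Cohen's kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # ~0.69 for these invented ratings
```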
Video Recording Log It is useful to be guided by a standardized recording log so that a basic set of recordings is available for each patient. The following log (a and b) is used for each video recording at the Orthopaedic Hospital Speising in Vienna. These recordings are usually carried out barefoot, if it is possible for the patient to perform the examination barefoot. In video recordings with numerous auxiliary means (shoes, orthoses), it needs to be ensured that the patient does not become tired during the recording. The gait speed should be determined by the patient. Prior to the actual recording, the patient should have time to become familiar with the laboratory surroundings and the requested performance. If possible, the patient should walk the walking distance a few times prior to the actual recording.
(a) While walking:
1. Sagittal and frontal recording
2. Entire body in the frontal recording
3. From the pelvis downward in the frontal recording
4. Lower legs and feet in the frontal recording
The video image is zoomed in the frontal recording in order to continuously have the entire person or the relevant body segments in the image (Fig. 5). The recording during walking is performed at normal speed, and additionally at a quicker speed and running. Stopping and turning of the patient is also recorded in the frontal recording. If the patient requires an aid or various aids, the recording is performed with each of these aids individually. It should be noted that recording with additional aids also requires more recording and analysis time.
(b) While standing:
1. Standing on one leg, both left and right
2. Standing on toes
3. Standing on heels
4. Knee bending with both legs
A further possibility is the recording of standing up from a chair and sitting down again. The 2D analysis can be used relatively easily to get an overview of the gait pattern. As already mentioned, 2D video analyses show only moderate reproducibility. If the motion occurs strictly in a plane perpendicular to the camera axis, there is no parallax error. Data presented by Davis et al. (1991) at the International Symposium on 3D Analysis of Human Movement in 1991 compared 62 normal subjects (124 sides) and 5 patients (10 sides) with 2D- and 3D-captured data. For normal subjects, they found good accordance for the hip (sagittal, mean relative % difference 1% ± 1%; frontal, mean relative % difference 9% ± 7%) and knee joint angles (sagittal, mean relative % difference 4% ± 2%). For the ankle joint angle, the accordance appears to be less good (sagittal, mean relative % difference 13% ± 5%). For the impaired gait pattern, these values increase in all joints: hip joint angle
(sagittal, mean relative % difference 8% ± 8%; frontal, mean relative % difference 28% ± 17%), knee joint angles (sagittal, mean relative % difference 8% ± 5%), and ankle joint angle (sagittal, mean relative % difference 54% ± 120%). The authors conclude the following: “Moreover, the utilization of 2D gait analysis strategies in clinical settings where the pathology can result in significant ‘out-of-plane’ motion is not appropriate and ill-advised. For gait analysis, there is no substitute for 3D motion analysis if we have ‘out-of-plane’ motion even though the 2D method is simpler and less expensive, it may produce results which are wrong.” Clarke and Murphy (2014) were also able to show, for healthy participants, that there is excellent agreement between 2D and 3D measurements of sagittal knee joint motion. This is supported by Nielsen and Daugaard (2008) and Fatone and Stine (2015).
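The effect of out-of-plane motion (and of an off-axis camera, as in Fig. 2) on a 2D measurement can be illustrated with a simple projection experiment: the same hip–knee–ankle configuration is projected onto the image plane of a camera that is either exactly perpendicular to the plane of motion or rotated away from it. The sketch below uses an orthographic projection and invented coordinates purely for illustration; real cameras add perspective distortion on top of this effect.

```python
import numpy as np

def apparent_angle_deg(hip, knee, ankle, yaw_deg):
    """2D angle at the knee after projecting onto the image plane of a camera
    rotated by yaw_deg about the vertical axis (orthographic projection)."""
    yaw = np.radians(yaw_deg)
    u = np.array([np.cos(yaw), np.sin(yaw), 0.0])   # image horizontal axis
    v = np.array([0.0, 0.0, 1.0])                   # image vertical axis
    project = lambda p: np.array([p @ u, p @ v])
    thigh = project(hip) - project(knee)
    shank = project(ankle) - project(knee)
    cos_a = thigh @ shank / (np.linalg.norm(thigh) * np.linalg.norm(shank))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# Invented sagittal-plane (x-z) limb configuration with a slightly flexed knee
knee  = np.array([0.00, 0.0, 0.50])
hip   = knee + 0.45 * np.array([-np.sin(np.radians(10)), 0.0,  np.cos(np.radians(10))])
ankle = knee + 0.42 * np.array([-np.sin(np.radians(25)), 0.0, -np.cos(np.radians(25))])

for yaw in (0, 15, 30, 45):   # 0 deg = camera exactly perpendicular to the plane of motion
    print(f"camera rotated {yaw:2d} deg -> apparent knee angle "
          f"{apparent_angle_deg(hip, knee, ankle, yaw):.1f} deg")
```

With these invented coordinates the apparent knee angle drifts by several degrees as the camera rotates away from the perpendicular view, which is the same mechanism behind the differences reported in Fig. 2 and behind the larger 2D/3D discrepancies seen in impaired, out-of-plane gait.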
Clinical Use of 2D Motion Capture The literature shows that observational gait analysis is used in a wide variety of diseases to analyze gait disorders. A large area of use is the assessment of musculoskeletal abnormalities and of gait disorders in cerebral palsy (Chan et al. 2014; Chantraine et al. 2016; Deltombe et al. 2015; Esposito and Venuti 2008; Maathuis et al. 2005; Satila et al. 2008). In addition to the quantification of gait disorders (Moseley et al. 2008), it is also used for the examination of therapeutic programs and therapeutic devices (Taylor et al. 2014). In patients with amputation of the lower extremity, it is used for checking and determining the function of prostheses; through the standardized use of video analysis, an objective documentation of the advantages and disadvantages of prosthesis components is achieved (Lura et al. 2015; Vrieling et al. 2007). In the area of neurology, such as in Parkinson’s patients (Guzik et al. 2017; Johnson et al. 2013; Obembe et al. 2014) or traumatic brain injury (Williams et al. 2009), video analysis serves as a tool to describe gait patterns. Button et al. (2008) used two-dimensional gait analysis to analyze patients with anterior cruciate ligament rupture after rehabilitation.
Database Software and Capturing Software The use of a database for video recordings is useful since it makes it easier to find previous recordings for gait comparison. Specialized software products for recording usually include a database. Depending on the software product, however, the search function for already recorded videos may be missing or only rudimentary. If the software allows keywords to be assigned to a video, this should be done; a keyword makes it significantly easier to find, for example, a video recorded with a specific aid. There are many programs on the market that allow you to record your videos on a PC and use a database function at the same time (e.g., the open-source KINOVEA or
Fig. 8 Knee joint angle measurement at initial contact, manual digitization
Tracker) or commercial products like TEMPLO (Contemplas), SIMI Motion (SIMI), myDartfish (Dartfish), MyoMotion (Noraxon), and others. The requirements for video recording software are relatively low. However, additional functions are essential for the quality of the gait analysis itself. Besides playback at real-time speed, slow motion should be available. In addition to the exact selection of a still image, a frame-by-frame function should also be available. For a more accurate analysis, it is helpful to measure distances and/or joint angles (Fig. 8) from the video. This is supported by most analysis programs, and the results can be transferred relatively easily into a report. When it comes to comparing gait under two conditions (e.g., walking barefoot and walking with shoes, or preoperative and postoperative), the software should allow you to play two or more videos in parallel (Fig. 9). Thus, a direct comparison between the conditions is possible. Another good option for comparing conditions is the overlay function; here two videos are superimposed and thus changes in the gait are easier to recognize. Newer systems also offer a tracking function (Fig. 10), which makes it possible to output joint angles over the full gait cycle. For this purpose, the use of markers (white/black circular stickers) is helpful.
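As an illustration of the angle and distance measurements mentioned above, the following sketch computes a sagittal knee angle from three manually digitized image points (as in Fig. 8) and converts a pixel distance to real units using a calibration object of known length. All pixel coordinates and the calibration length are invented for illustration and do not come from any of the software products listed.

```python
import math

def angle_at(knee, hip, ankle):
    """Angle (degrees) at the knee between the knee->hip and knee->ankle directions,
    from 2D image coordinates given as (x, y) pixel pairs."""
    v1 = (hip[0] - knee[0], hip[1] - knee[1])
    v2 = (ankle[0] - knee[0], ankle[1] - knee[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.acos(dot / (math.hypot(*v1) * math.hypot(*v2))))

def mm_per_pixel(p1, p2, known_length_mm):
    """Scale factor from a calibration object of known length digitized at p1 and p2."""
    return known_length_mm / math.hypot(p2[0] - p1[0], p2[1] - p1[1])

# Hypothetical digitized points (pixels) at initial contact
hip, knee, ankle = (612, 310), (598, 640), (655, 935)
included = angle_at(knee, hip, ankle)
print(f"included knee angle: {included:.1f} deg (flexion = {180 - included:.1f} deg)")

# Hypothetical 1 m calibration stick spanning 480 pixels in the image
scale = mm_per_pixel((100, 900), (580, 900), 1000.0)
step_px = math.hypot(990 - 655, 940 - 935)   # e.g. ankle-to-ankle distance at double support
print(f"scale: {scale:.2f} mm/px, step length ~ {step_px * scale / 10:.1f} cm")
```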
Conclusion/Summary Observational gait analysis is an important part of clinical practice. The use of video cameras for the documentation and evaluation of gait disturbances supports diagnosis and the selection of treatment. Therefore, optimal recording
Fig. 9 Follow-up control, comparison of changes over years (screenshot TEMPLO software from Contemplas)
Fig. 10 Automatic tracking option to calculate the sagittal knee joint angle (screen capture, myoVIDEO™ (MR3) software, © Noraxon USA. Reprinted with permission)
conditions as well as structured processes in the analysis of the videos are important. In addition, optimal room size as well as optimum adjustment of equipment is a key component in obtaining high-quality video recordings.
Cross-References
▶ Activity Monitoring in Orthopaedic Patients
▶ Ankle Foot Orthoses and Their Influence on Gait
▶ Assessing Club Foot and Cerebral Palsy by Pedobarography
▶ Assessing Pediatric Foot Deformities by Pedobarography
▶ Gait Scores – Interpretations and Limitations
▶ Motion Analysis Through Video – How Does Dance Change with the Visual Feedback
▶ The Use of Low Resolution Pedobarographs
▶ Upper Extremity Activities of Daily Living
References
Alton F, Baldey L, Caplan S, Morrissey MC (1998) A kinematic comparison of overground and treadmill walking. Clin Biomech 13:434–440
Borel S, Schneider P, Newman CJ (2011) Video analysis software increases the interrater reliability of video gait assessments in children with cerebral palsy. Gait Posture 33:727–729
Brunnekreef JJ, van Uden CJ, van Moorsel S, Kooloos JG (2005) Reliability of videotaped observational gait analysis in patients with orthopedic impairments. BMC Musculoskelet Disord 6:17
Button K, van Deursen R, Price P (2008) Recovery in functional non-copers following anterior cruciate ligament rupture as detected by gait kinematics. Phys Ther Sport 9:97–104
Chan MO, Sen ES, Hardy E, Hensman P, Wraith E, Jones S, Rapley T, Foster HE (2014) Assessment of musculoskeletal abnormalities in children with mucopolysaccharidoses using pGALS. Pediatr Rheumatol Online J 12:32
Chantraine F, Filipetti P, Schreiber C, Remacle A, Kolanowski E, Moissenet F (2016) Proposition of a classification of adult patients with hemiparesis in chronic phase. PLoS One 11:e0156726
Clarke L, Murphy A (2014) Validation of a novel 2D motion analysis system to the gold standard in 3D motion analysis for calculation of sagittal plane kinematics. Gait Posture 39(Suppl 1):S44–S45
Coutts F (1999) Gait analysis in the therapeutic environment. Man Ther 4:2–10
Davis R, Ounpuu S, Tyburski D, Deluca P (1991) A comparison of two dimensional and three dimensional techniques for the determination of joint rotation angles. In: Proceedings of the international symposium on 3D analysis of human movement, pp 67–70
Deltombe T, Bleyenheuft C, Gustin T (2015) Comparison between tibial nerve block with anaesthetics and neurotomy in hemiplegic adults with spastic equinovarus foot. Ann Phys Rehabil Med 58:54–59
Eastlack ME, Arvidson J, Snyder-Mackler L, Danoff JV, McGarvey CL (1991) Interrater reliability of videotaped observational gait-analysis assessments. Phys Ther 71:465–472
Esposito G, Venuti P (2008) Analysis of toddlers’ gait after six months of independent walking to identify autism: a preliminary study. Percept Mot Skills 106:259–269
Fatone S, Stine R (2015) Capturing quality clinical videos for two-dimensional motion analysis. J Prosthet Orthot 27:27–32
Guzik A, Druzbicki M, Przysada G, Kwolek A, Brzozowska-Magon A, Wolan-Nieroda A (2017) Analysis of consistency between temporospatial gait parameters and gait assessment with the use of Wisconsin gait scale in post-stroke patients. Neurol Neurochir Pol 51:60–65
Hillman SJ, Donald SC, Herman J, McCurrach E, McGarry A, Richardson AM, Robb JE (2010) Repeatability of a new observational gait score for unilateral lower limb amputees. Gait Posture 32:39–45
Johnson L, Burridge JH, Demain SH (2013) Internal and external focus of attention during gait re-education: an observational study of physical therapist practice in stroke rehabilitation. Phys Ther 93:957–966
Krebs DE, Edelstein JE, Fishman S (1985) Reliability of observational kinematic gait analysis. Phys Ther 65:1027–1033
Lura DJ, Wernke MM, Carey SL, Kahle JT, Miro RM, Highsmith MJ (2015) Differences in knee flexion between the Genium and C-Leg microprocessor knees while walking on level ground and ramps. Clin Biomech (Bristol, Avon) 30:175–181
Maathuis KG, van der Schans CP, van Iperen A, Rietman HS, Geertzen JH (2005) Gait in children with cerebral palsy: observer reliability of physician rating scale and Edinburgh visual gait analysis interval testing scale. J Pediatr Orthop 25:268–272
Mackey AH, Lobb GL, Walt SE, Stott NS (2003) Reliability and validity of the observational gait scale in children with spastic diplegia. Dev Med Child Neurol 45:4–11
Moseley AM, Descatoire A, Adams RD (2008) Observation of high and low passive ankle flexibility in stair descent. Percept Mot Skills 106:328–340
Nielsen D, Daugaard M (2008) Comparison of angular measurements by 2D and 3D gait analysis. Dissertation, Jonkoping University
Obembe AO, Olaogun MO, Adedoyin R (2014) Gait and balance performance of stroke survivors in south-western Nigeria – a cross-sectional study. Pan Afr Med J 17(Suppl 1):6
Perry J (1992) Gait analysis, normal and pathological function. SLACK, Thorofare
Rathinam C, Bateman A, Peirson J, Skinner J (2014) Observational gait assessment tools in paediatrics – a systematic review. Gait Posture 40:279–285
Satila H, Pietikainen T, Iisalo T, Lehtonen-Raty P, Salo M, Haataja R, Koivikko M, Autti-Ramo I (2008) Botulinum toxin type A injections into the calf muscles for treatment of spastic equinus in cerebral palsy: a randomized trial comparing single and multiple injection sites. Am J Phys Med Rehabil 87:386–394
Taylor P, Barrett C, Mann G, Wareham W, Swain I (2014) A feasibility study to investigate the effect of functional electrical stimulation and physiotherapy exercise on the quality of gait of people with multiple sclerosis. Neuromodulation 17:75–84. Discussion 84
Toro B, Nester CJ, Farren PC (2003) The status of gait assessment among physiotherapists in the United Kingdom. Arch Phys Med Rehabil 84:1878–1884
Viehweger E, Zurcher Pfund L, Helix M, Rohon MA, Jacquemier M, Scavarda D, Jouve JL, Bollini G, Loundou A, Simeoni MC (2010) Influence of clinical and gait analysis experience on reliability of observational gait analysis (Edinburgh gait score reliability). Ann Phys Rehabil Med 53:535–546
Vrieling AH, van Keeken HG, Schoppen T, Otten E, Halbertsma JP, Hof AL, Postema K (2007) Obstacle crossing in lower limb amputees. Gait Posture 26:587–594
Wass E, Taylor NF, Matsas A (2005) Familiarisation to treadmill walking in unimpaired older people. Gait Posture 21:72–79
Williams G, Morris ME, Schache A, Mccrory P (2009) Observational gait analysis in traumatic brain injury: accuracy of clinical judgment. Gait Posture 29:454–459
The Conventional Gait Model - Success and Limitations Richard Baker, Fabien Leboeuf, Julie Reay, and Morgan Sangeux
Abstract
The Conventional Gait Model (CGM) is a generic name for a family of closely related and very widely used biomechanical models for gait analysis. After a description of its history, the core attributes of the model are outlined, followed by an evaluation of its strengths and weaknesses. An analysis of the current and future requirements for practical, rigorously calibrated biomechanical models for clinical and other gait analysis purposes suggests that the CGM is better suited to this purpose than any other currently available model. Modifications are required, however, and a number are proposed. Keywords
Clinical Gait Analysis • Biomechanical Modeling
R. Baker (*) University of Salford, Salford, UK e-mail: [email protected] F. Leboeuf • J. Reay School of Health Sciences, University of Salford, Salford, UK e-mail: [email protected]; [email protected] M. Sangeux Hugh Williamson Gait Analysis Laboratory, The Royal Children’s Hospital, Parkville/Melbourne, VIC, Australia Gait laboratory and Orthopaedics, The Murdoch Childrens Research Institute, Parkville/Melbourne, VIC, Australia e-mail: [email protected] # Springer International Publishing AG 2017 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_25-2
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Model Structure and Anatomical Segment Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marker Placement to Estimate Anatomical Segment Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kinematic Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kinetic Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Introduction The Conventional Gait Model (CGM) is a generic name for a family of biomechanical models which emerged in the 1980s based on very similar principles and giving very similar results. It has a rather complex history (outlined below) and as a consequence has been referred to by a range of different names. The use of the name Conventional Gait Model is an attempt to emphasize the essential similarity of these models despite those different names. For a number of reasons, the CGM became the de facto standard for gait analysis in the 1990s, particularly in clinical and clinical research applications. Despite considerable strengths, technological advances have left aspects of the CGM looking quite outdated. The model, as originally formulated, also has a number of intrinsic limitations and, as these have become more widely appreciated, a variety of modifications and alternatives have been developed. Although the model can no longer be regarded as an industry-wide standard as was once the case, many of the more established and respected clinical centers still prefer to use the model considering its strengths to outweigh its limitations. After a brief summary of the historical development of the CGM, this chapter will describe its characteristics and then assess its strengths and limitations, concluding with some suggestions as to how the model could be developed in future in order to address those limitations while preserving its strengths.
History (Italicized words in this section are names that are sometimes used to refer to the CGM.) The origins of the model can be traced to the work of John Hagy in the laboratory established by David Sutherland (Sutherland and Hagy 1972) who digitized the
positions of skin markings indicating anatomical landmarks from bi-planar movie stills. The coordinates were then used to compute a number of joint angles. Patrick Shoemaker extended this approach (Shoemaker 1978) to incorporate Ed Chao’s ideas on representing three-dimensional joint motion as Euler angles (Chao 1980). Jim Gage, on a visit to San Diego prior to developing his own gait analysis laboratory at the Newington Hospital in Connecticut, and a succession of engineers including Scott Tashman, Dennis Tyburski, and Roy Davis (Davis et al. 1991) further developed the ideas in a number of ways. Perhaps the most important of these were the calculation of joint angles on the basis of estimated joint centers (rather than directly from marker locations) and the incorporation of three-dimensional inverse dynamics to estimate joint moments (Ounpuu et al. 1991) based on the approach of David Winter (Winter and Robertson 1978). At about this time, Murali Kadaba developed a very similar model at the Helen Hayes Hospital (Kadaba et al. 1989, 1990). There was communication between the two groups over this period, but there are now different memories as to the extent of this collaboration and the precise role of the different individuals involved. Although some minor modifications have been proposed since, the subsequent history is largely about how the model was distributed. The Helen Hayes Model was developed as a package and distributed across seven American hospitals. A little later, Oxford Metrics (now Vicon), the manufacturers of Vicon movement analysis systems, chose to develop their own version of the model (with support from individuals at both Newington and Helen Hayes). This was embedded within a package known as the Vicon Clinical Manager (VCM) and later developed as the Plug-in Gait (PiG) model for Workstation software. Most manufacturers of gait analysis systems produce some version of the model which goes under a variety of names. Perhaps because of commercial sensitivities, it is generally rather unclear what level of agreement there is between data processed with these alternative models. Perhaps the most important factor leading to the widespread adoption of the CGM was the prominence of Vicon measurement systems in clinical and academic gait analysis at this time, with VCM and PiG being delivered alongside their hardware. Many of the more established clinical services were founded at this time and most adopted VCM and continued to use PiG. Jim Gage became a strong advocate for clinical gait analysis and, with Roy Davis and Silvia Ounpuu, established extremely well-regarded teaching courses, first at Newington and then at Gillette Children’s Hospital, which were based on what they regarded as the Newington Model. The model was also explained and validated in a number of key papers (Kadaba et al. 1989, 1990; Davis et al. 1991; Ounpuu et al. 1991, 1996) in considerably more detail than any other model at the time. Thus by the early 2000s, the CGM had become established as the predominant gait model for clinical and clinical research purposes, and a large community of users had developed embodying a solid understanding of its strengths and limitations. Since that time, this status has diminished somewhat. A larger number of suppliers to the gait analysis market and the increasing ease of integrating different software have widened the options for data processing. There have been
considerable and often justified criticisms of the limitations of the CGM and a general failure of the CGM community to develop the model to address these issues. Despite this, the model is still almost certainly the most widely used and understood single model within the clinical and clinical research community.
State of the Art As stated above, the CGM is actually a family of closely related models but for simplicity this section will be limited to a description of that embodied in the VCM and PiG which are identical and the most commonly used versions. It is arguable whether the CGM is a model at all as the word is now understood in biomechanics and it was originally described as “an algorithm for computing lower extremity joint motion” (Kadaba et al. 1990) and “a data collection and reduction technique” (Davis et al. 1991) when first described. In the sections below, however, a modern understanding of biomechanical modeling will be used to describe the underlying concepts.
Model Structure and Anatomical Segment Definitions The model has seven segments linked in a chain by ball joints (three rotational degrees of freedom) in the sequence left foot, left tibia, left femur, pelvis, right femur, right tibia, right foot. An orthogonal coordinate system is associated with each segment. While the three segment axes are mathematically equivalent, clinical convention is to define the segment alignment in terms of the alignment of a primary axis and the rotation about this axis as defined by some off-axis reference point. The primary axis for each segment is taken to be that linking the joints which attach it to the two neighboring segments in the kinematic chain. Conceptually, the segment axis systems are thus defined by specifying a primary axis and a reference point for each. These are defined in Table 1.
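A segment frame of this kind can be constructed from a primary axis and a single off-axis reference point with a Gram–Schmidt-style orthogonalization. The sketch below is a generic illustration of that idea only; the axis naming and sign conventions are chosen arbitrarily here and differ in detail between CGM implementations.

```python
import numpy as np

def segment_frame(proximal_joint, distal_joint, reference_point):
    """Build an orthonormal segment coordinate system (3x3 matrix, columns = axes)
    from a primary axis (distal -> proximal joint centre) and an off-axis reference
    point that fixes the rotation about that axis. Conventions here are illustrative."""
    e3 = proximal_joint - distal_joint          # primary (longitudinal) axis
    e3 = e3 / np.linalg.norm(e3)
    ref = reference_point - distal_joint
    e1 = ref - (ref @ e3) * e3                  # in-plane axis, orthogonal to e3
    e1 = e1 / np.linalg.norm(e1)
    e2 = np.cross(e3, e1)                       # completes the right-handed triad
    return np.column_stack([e1, e2, e3])

# Invented femur example: hip and knee joint centres plus a lateral reference point (m)
hip_centre  = np.array([0.10, 0.00, 0.90])
knee_centre = np.array([0.12, 0.02, 0.48])
lateral_ref = np.array([0.12, 0.10, 0.49])
R_femur = segment_frame(hip_centre, knee_centre, lateral_ref)
print(np.round(R_femur, 3))
print("orthonormal:", np.allclose(R_femur.T @ R_femur, np.eye(3)))
```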
Marker Placement to Estimate Anatomical Segment Position Markers are placed in such a way that the segment orientations can be estimated. When the model was developed, optoelectronic measurement systems were limited to resolving a small number of markers and thus the minimum number of markers possible is used. This is based on the assumption that the location of the proximal joint of any leg segment (all those other than the pelvis) is known from the position and orientation of the segment to which it is linked proximally. More distal segment orientations are dependent on the orientation of the more proximal segments and the model is thus often described as being hierarchical. Because of the difficulty in resolving more than two markers on the foot at the time when the model was developed, it defined
Table 1 Anatomical segment definition for the CGM
Pelvis
The primary axis is the mediolateral axis running from one hip joint center to the other. In most clinical applications, it is assumed that the pelvis is symmetrical and that this axis is thus parallel to the line running from one anterior superior iliac spine (ASIS) to the other. The reference point for rotation about this axis is the mid-point of the posterior superior iliac spines (PSIS).
Femur
The primary axis is that running from the hip joint center to the knee joint center. The reference point is the lateral epicondyle. For validation purposes:
• The hip joint center will be taken as the geometrical center of a sphere fitted to the articular surface of the femoral head.
• The knee joint center will be taken as the mid-point of the medial and lateral epicondyles. These are often difficult to palpate, however, and for some purposes the line between these landmarks will be assumed to be parallel to that linking the most posterior aspects of the femoral condyles.
Tibia
The primary axis is that running from the knee joint center to the ankle joint center. The reference point is the lateral malleolus. For validation purposes:
• The ankle joint center will be assumed to be the mid-point of the medial and lateral malleoli.
Foot
The primary axis is that running from the most posterior aspect of the calcaneus along the second ray and parallel to the plantar surface of the foot. Rotation about this axis is not defined.
the orientation of its primary axis but not any rotation about this. The locations of markers are given in Table 2. The hierarchical process requires a method for determining the location of the joints within each segment. The hip joint location within the pelvis coordinate system is specified by three equations (Davis et al. 1991) which are functions of leg length and ASIS to ASIS distance. These are measured during physical examination (although ASIS to ASIS distance can also be calculated from the marker positions during a static trial). The knee joint center in the femur coordinate system is assumed to lie in the coronal plane at the point at which the lines from it to the hip joint center and lateral femoral epicondyle are perpendicular and the distance between joint center and epicondyle is half the measured knee width. The ankle
Table 2 Marker placement for the CGM
Pelvis
Markers are placed over both ASIS and PSIS in order that they lie in the plane containing the anatomical landmarks. A set of equations is used to estimate the location of the hip joint within the pelvic coordinate system.
Femur
The hip joint center within the femur is coincident with that within the pelvis. A marker is placed over the lateral femoral epicondyle and another on a wand on the lateral thigh in such a way that the two markers and the hip joint center lie within the coronal plane of the femur. The knee joint center is to be defined such that it, the hip joint center, and the epicondyle marker form a right angle triangle within the coronal plane of the femur with a base of half the measured knee width.
Tibia
The knee joint center within the tibia is coincident with that within the femur. A marker is placed over the lateral malleolus and another on a wand on the lateral leg in such a way that the two markers and the knee joint center lie within the coronal plane of the tibia. The ankle joint center is to be defined such that it, the knee joint center, and the malleolar marker form a right angle triangle within the coronal plane of the tibia with a base of half the measured ankle width.
Foot
The ankle joint center in the foot is defined to be coincident with that within the tibia. A marker is placed on the forefoot. Another marker is placed on the posterior aspect of the heel for the static trial such that the line between the two markers is parallel to the long axis of the foot. The angles between this and the line from the ankle joint center to the forefoot marker in the sagittal and horizontal planes are calculated. The heel marker is not used in walking trials but the offsets are used to estimate the alignment of the long axis of the foot based on the line between ankle joint center and forefoot marker.
joint center within the tibia is specified analogously with respect to the lateral malleolus. The wand markers (on both femur and tibia) are thus important to define the segmental coronal plane. Use of the wand (rather than a surface mounted marker) has two main purposes. The first is that wands (particularly those with a moveable ball and socket joint at the base) can be adjusted easily to define the correct plane. At least as important, however, is that by moving the marker away from the primary axis of the segment they make definition of the coronal plane much less sensitive to marker placement error or soft tissue artifact. Concerns have been expressed that the markers
wobble but there is little evidence of this in gait data (it would appear as fluctuation in the hip rotation graph) if they are taped or strapped securely to the thigh. The foot segment uses the ankle joint center (which has already been defined in the tibia coordinate system) and one forefoot or toe marker. The placement of this marker varies considerably with some centers placing quite distally (typically at the level of the metatarsophalangeal joint) in which case it indicates overall foot alignment. Other centers, particularly those dealing with clinical populations who often have foot deformities, choose a more proximal placement (typically at the level of the cuneiforms) in order to give a better indication of hind foot alignment. Placement of a heel marker during the static trial also allows for offsets to ensure that ankle measurements were aligned with the long axis of the foot rather than simply by the line from the ankle joint center to the toe marker. A common variant is to calculate the plantar flexion offset on the assumption that the foot is flat and thus that the long axis of the foot is in the horizontal plane, during the static trial.
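The knee and ankle joint centre constructions described above can be expressed as a small geometric routine: the joint centre lies in the plane defined by the proximal joint centre, the joint marker, and the wand marker, at a given distance from the joint marker, with its line to the proximal centre perpendicular to its line to the marker. The sketch below is a simplified, illustrative version of that geometry, not a transcription of any particular implementation; in particular, the choice of which side of the primary axis the centre falls on, and the omission of a marker-radius correction, are assumptions made here.

```python
import numpy as np

def chord_joint_centre(prox_centre, joint_marker, wand_marker, offset):
    """Joint centre J in the plane (prox_centre, joint_marker, wand_marker) with
    |J - joint_marker| = offset and (J - prox_centre) perpendicular to (J - joint_marker).
    Assumption: J is placed on the opposite side of the prox-to-marker line from the wand."""
    x = prox_centre - joint_marker
    d = np.linalg.norm(x)
    x = x / d
    w = wand_marker - joint_marker
    y = w - (w @ x) * x                        # in-plane direction towards the wand
    y = y / np.linalg.norm(y)
    # Perpendicularity and distance conditions give the two in-plane components below
    along = offset**2 / d
    across = offset * np.sqrt(max(1.0 - (offset / d) ** 2, 0.0))
    return joint_marker + along * x - across * y

# Invented numbers: hip centre, lateral knee marker, thigh wand, half knee width (m)
hjc  = np.array([0.10, 0.00, 0.90])
knee = np.array([0.12, 0.06, 0.48])
wand = np.array([0.11, 0.10, 0.70])
kjc = chord_joint_centre(hjc, knee, wand, offset=0.05)   # in practice a marker-radius term may be added
print("knee joint centre:", np.round(kjc, 3))
print("perpendicularity check:", round(float((kjc - hjc) @ (kjc - knee)), 6))
print("distance check (m):", round(float(np.linalg.norm(kjc - knee)), 3))
```

The same construction, applied with the knee joint centre, malleolar marker, and tibial wand, yields the ankle joint centre; the hip joint centre itself comes from the regression equations of Davis et al. (1991) mentioned above.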
Kinematic Outputs Kinematic outputs are mainly joint angles describing the orientation of the distal segment with respect to that of the proximal segment. The orientation of the pelvis is output as segment angles (with respect to the laboratory-based axis system) as is the transverse plane alignment of the foot (called foot progression). In three dimensions, the orientation of one segment with respect to another must be represented by three numbers. The CGM uses Cardan angles which represent the set of sequential rotations about three different and mutually perpendicular axes that would rotate the distal segment from being aligned with the proximal segment (or the laboratory-based coordinate system) to its actual position. In the original model, the rotation sequence was about the medial-lateral, then the anterior-posterior and finally the proximal-distal axis for all joints (and segments). Although this sequence maps onto the conventional clinical understanding of the angles for most joints, it does not for the pelvis (Baker 2001). This is because with this rotation sequence, pelvic tilt is calculated as the rotation around the medial-lateral axis of the laboratory coordinate system, rather than the medial-lateral axis of the pelvis segment, as per conventional understanding. Baker (2001) proposed to reverse the rotation sequence which results in pelvic angles that more closely map onto the conventional clinical understanding of these terms (confirmed by Foti et al. 2001). Following Baker’s recommendation to use globographic angles (Baker 2011), these can be interpreted exactly as listed in Table 3. While not formally a part of the model, the CGM is closely associated with a particular format of gait graph (see Fig. 1). All data is time normalized to one gait cycle and the left side data plotted in one color (often red) and the right side data in another (often green, but blue reduces the risk of confusion by those who are color blind). The time of toe off is denoted by a vertical line across the full height of the graph and opposite foot off and contact by tick marks at either the top or bottom of the graphs (in the appropriate color). Normative data is often plotted as a grey band
Table 3 Definition of joint angles as commonly used with the CGM
Pelvis (with respect to global coordinate system)
Internal/external rotation: rotation of the mediolateral axis about the vertical axis
Obliquity (up/down): rotation of the mediolateral axis out of the horizontal plane
Anterior/posterior tilt: rotation around the mediolateral axis
Hip (femur with respect to pelvis coordinate system)
Flexion/extension: rotation of the proximal-distal axis about the medio-lateral axis
Ad/abduction: rotation of the proximal-distal axis out of the sagittal plane
Internal/external rotation: rotation around the proximal-distal axis
Knee (tibia with respect to femur coordinate system)
Flexion/extension: rotation of the proximal-distal axis about the medio-lateral axis
Ad/abduction: rotation of the proximal-distal axis out of the sagittal plane
Internal/external rotation: rotation around the proximal-distal axis
Ankle (foot with respect to tibia coordinate system)
Dorsiflexion/plantarflexion: rotation of the proximal-distal axis about the medio-lateral axis
Internal/external rotation: rotation of the proximal-distal axis out of the sagittal plane
Foot (with respect to global coordinate system)
Foot progression (in/out): rotation of the proximal-distal axis out of the “sagittal” plane
in the background (typically one standard deviation about the mean). The graphs are then commonly displayed as arrays with the columns representing the different anatomical planes and the rows representing the different joints.
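The Cardan angle calculation can be illustrated with a short routine that composes a relative rotation from three known angles and then recovers them using a stated sequence (first about a medio-lateral axis, then an anterior–posterior axis, then a longitudinal axis). The axis labelling below is an arbitrary illustrative convention and does not reproduce the exact axis definitions or sequences of any particular CGM implementation.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def cardan_yxz(R):
    """Recover (alpha, beta, gamma) such that R = rot_y(alpha) @ rot_x(beta) @ rot_z(gamma).
    Illustrative labelling: y = medio-lateral (alpha ~ flexion), x = anterior-posterior
    (beta ~ ab/adduction), z = longitudinal (gamma ~ internal/external rotation)."""
    beta = np.arcsin(-R[1, 2])
    alpha = np.arctan2(R[0, 2], R[2, 2])
    gamma = np.arctan2(R[1, 0], R[1, 1])
    return alpha, beta, gamma

# Compose a relative segment rotation from known angles, then recover them
flexion, adduction, rotation = np.radians([40.0, 5.0, -10.0])
R_rel = rot_y(flexion) @ rot_x(adduction) @ rot_z(rotation)
print(np.degrees(cardan_yxz(R_rel)).round(3))   # -> [ 40.   5. -10.]
```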
Kinetic Outputs The CGM is commonly used to calculate kinetic as well as kinematic outputs (Davis et al. 1991; Kadaba et al. 1989). Both the Newington and Helen Hayes approaches used inverse dynamics to estimate joint moments from force plate measurements of the ground reaction, an estimate of segment accelerations from kinematic data and estimates of segment inertial parameters. The main difference was that the Newington group took segment inertial parameters from the work of Dempster (1955) whereas the Helen Hayes group (Kadaba et al. 1989) took them from Hindrichs (1985) based on Clauser et al. (1969). Joint moments are fairly insensitive to these parameters (Rao et al. 2006; Pearsall and Costigan 1999), and it is unlikely that this would have led to noticeable differences in output. VCM and PiG used values from Dempster (1955). Joint moments were presented in the segment coordinate systems. The early papers do not specify whether the proximal or distal segment was used for this. PiG and VCM allowed the user to select which (or to use the global coordinate system) and the default setting of the distal system is probably most widely used. Joint power is also calculated as the vector dot product of the joint moment and angular velocity (note that this should be the true angular velocity vector and not that of the time derivatives of the Cardan angles). Power is a scalar quantity and there is thus no biomechanical justification for presenting “components” of power.
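The power calculation is simply the scalar (dot) product of the joint moment and joint angular velocity vectors, which is why presenting "components" of power has no biomechanical justification. A minimal numerical illustration with invented values, assuming the angular velocity is the true angular velocity vector expressed in the same frame as the moment:

```python
import numpy as np

def joint_power(moment_nm, angular_velocity_rad_s):
    """Joint power (W) as the dot product of joint moment and angular velocity vectors."""
    return float(np.dot(moment_nm, angular_velocity_rad_s))

# Invented instantaneous values, e.g. for an ankle during push-off
moment = np.array([1.2, 95.0, -4.0])   # N*m, expressed in some common frame
omega  = np.array([0.05, 3.1, -0.2])   # rad/s, true angular velocity in the same frame

print(f"joint power = {joint_power(moment, omega):.1f} W")   # a single scalar value
```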
Fig. 1 A standard gait graph (knee flexion in degrees plotted against % gait cycle). The curves represent how a single gait variable varies over the gait cycle. The vertical lines across the full height of the graph represent foot-off, and the tick marks represent opposite foot off (to the left of the graph) and opposite foot contact (to the right). The red line is for the left side and the blue line for the right side. The grey areas represent the range of variability in a reference population as one standard deviation either side of the mean value
Variants
Over the years a number of variants of the CGM have been implemented by particular groups. Most of these have not been formally described in the academic literature.
• The original papers describing the model assumed that the femur and tibia wand markers could be placed accurately. Early experience was that this was challenging, and an alternative technique was developed in which the markers were only positioned approximately and a Knee Alignment Device (KAD) was used during static trials to indicate the orientation of the knee flexion axis and hence the coronal plane of the femur. This allowed rotational offsets to be calculated to correct for any misalignment of the wand markers (with the tibial offset requiring an estimate of tibial torsion from the physical examination).
• A development within PiG allowed a medial malleolar marker to indicate the position of the transmalleolar axis during the static trial and hence to calculate a value of tibial torsion rather than requiring this to be measured separately.
• A method of allowing for the thickness of soft tissues over the ASIS was provided by allowing measurement of the ASIS-to-greater-trochanter distance, which is an estimate of the distance by which the hip joint center lies posterior to the base plate of the ASIS marker.
• A technique called DynaKAD has been proposed (Baker et al. 1999) to define the thigh rotation offset by minimizing the varus-valgus movement during the walking trial. Other techniques have been suggested to define this from functional calibration trials (Schwartz and Rozumalski 2005; Sauret et al. 2016; Passmore and Sangeux 2016).
• VCM and PiG introduced an angular offset along the tibia such that knee rotation is defined as being zero during a static trial when the KAD is used and the orientation of the ankle joint axis is defined by a measurement of tibial torsion made during the physical exam (rather than by the tibial wand marker).
• Another development of PiG allowed the heel marker to be used to give an indication of inversion/eversion of the foot (rotation about the long axis) if it was left in place during the walking trial.
• A further development allowed an angular offset to be applied to account for the foot being pitched forward by a known amount during a static trial (to take account of the pitch of a shoe, for example).
• An upper body model was developed by Vicon which, though widely used, has never been rigorously validated.
Strengths
Recent opinion has tended to emphasize the weaknesses of the CGM, but it is also important to acknowledge its many strengths. In a world in which clinical governance is increasingly important, the CGM has been more extensively validated than any other model in routine clinical use. The early papers of Kadaba et al. were considerably ahead of their time in their approach to validation. The basic description of the model (Kadaba et al. 1990) includes a presentation of normative data, a comparison of this against normative data from a range of previous papers, and a sensitivity analysis of the most common measurement artifact, arising from the difficulty in placing thigh wands accurately. The follow-up paper (Kadaba et al. 1989, which was actually published first!) is also a definitive repeatability study. Fifteen of the 23 papers identified in the classic systematic review of repeatability studies of kinematic models (McGinley et al. 2009) used a variant of the CGM, and a more recent study (Pinzone et al. 2014) has demonstrated the essential similarity of normative kinematic data collected with the CGM by gait analysis services on different sides of the world.
This body of formal validation literature is strongly reinforced by a large number of papers reporting use of the CGM in a very wide range of clinical and research applications. The CGM is thus particularly appropriate as a standardized and validated model for users who are more interested in interpreting what the results mean than in further model development and validation.
Although the implementation of the model is not trivial, the basic concepts are about as simple as possible for a clinically useful model. It uses a minimal marker set which can be applied efficiently in routine clinical practice. The model is deterministic (it does not require any optimized fitting process), and thus the effects of marker misplacement and/or soft tissue artifact are entirely predictable (Table 4 illustrates how a given movement of each marker affects the outputs). It is thus possible to develop a comprehensive understanding of how the model behaves without being an expert in biomechanics. This can be logically extended to give clear indications of how marker placement can best be adapted in order to obtain clinically meaningful outputs in the presence of bone and joint deformities or devices such as orthoses and prostheses.
It is unfortunate, therefore, that in the early years the model developed a reputation for behaving as a "black box." This probably arose because the most commonly available implementation, in the VCM, incorporated some refinements to the previously published versions (e.g., the thigh and shank offsets) which were only described conceptually in the accompanying product documentation. Many people assumed that there was insufficient information to fully understand the model, an assumption proved false by a number of exact clones emerging (Baker et al. 1999 is an example).
Weaknesses
Accuracy
While the CGM has been subjected to several studies to investigate its repeatability, there have been very few studies of its accuracy, and those have focused on very specific issues such as the location of the hip joint center (Sangeux et al. 2011, 2014; Peters et al. 2012) and the orientation of the knee flexion axis (Sauret et al. 2016; Passmore and Sangeux 2016) in standing. The model is intended to track the movements of the bones, and there have been no studies performed to establish how accurately it can do this. This is principally because gold standard methods for tracking bone movement during walking are challenging (although a range of techniques are available – see section on "Future Directions" below). It should be emphasized, however, that this is a weakness of all commonly used biomechanical models for gait analysis and not just the CGM.
Hip Joint Center Position
A considerable body of knowledge now suggests that there are better methods for specifying the location of the hip joint center within the pelvic coordinate system than those used within the CGM (Leardini et al. 1999; Sangeux et al. 2011, 2014; Harrington et al. 2007; Peters et al. 2012). While the first of these (Leardini et al. 1999) suggested that functional calibration methods were superior to equations, more recent studies suggest that alternative equations can give results at least as good as functional methods in healthy adults (Sangeux et al. 2011, 2014; Harrington et al. 2007) and better in children with disabilities (Peters et al. 2012).
Table 4 Effects of moving a marker 5 mm in the specified direction on the outputs of the CGM. Note that because of the hierarchical basis of the model, movements can only affect segments on or below that to which a given marker is attached. Changes in angle of less than 0.1° are left blank. Rows: the markers moved (RASI up, RASI out, SACR up, SACR out, RTHI up, RKAD int, RKNE up, RKNE ant, RTIB up, RTIB ant, RANK up, RANK ant, RTOE out, RTOE ant). Columns: pelvic tilt, obliquity, and rotation; hip flexion, adduction, and internal rotation; knee flexion, varus, and internal rotation; ankle dorsiflexion and internal rotation; and foot internal progression.
Notes: (a) Data are unaffected by the location of the tibial wand marker as a KAD was used for the static trial. (b) Moving the toe marker anteriorly or posteriorly has no effect on outputs as a "foot flat" option was used for the static trial.
Defining the Coronal Plane of the Femur
The first of the papers of Kadaba et al. (1990) highlighted the sensitivity of the CGM to misplacement of the thigh markers, leading to erroneous definition of the coronal plane of the femur. This leads to a well-known artifact in which the coronal plane knee kinematics show cross-talk from knee flexion-extension, which is generally of little clinical significance but highlights uncertainty in hip rotation, which is a major limitation of the model. Use of the KAD (which is very poorly documented in the literature) led to some improvements, but this is still generally regarded as one of the most significant limitations of the model.
Over-Simplistic Foot Modeling
Modeling the foot as a single axis rather than a three-dimensional segment arose from the difficulty early systems had in detecting more than one marker placed on a small foot. While reliable detection of many markers on the foot has been possible for many years now, a formal extension of the model has never been proposed to model the foot more comprehensively. The Oxford Foot Model (Carson et al. 2001), which is probably now the most widely used in clinical and research practice, differs markedly from the CGM in that it allows translations between the forefoot, hind foot, and tibia (rather than the spherical joints that are a characteristic of the CGM).
Unconstrained Segment Dimensions
The CGM does not require the segments to be of a fixed length, and soft-tissue artifact generally acts in such a way that the distance between the hip and knee joint centers can vary by as much as 2 cm over the gait cycle during walking. While this probably has a small effect on kinematic and kinetic outputs, it does prevent the use of the model with more advanced modeling techniques, such as muscle length modeling and forward dynamics, for which a rigid linked segment model is required. Modern inverse kinematic techniques (Lu and O'Connor 1999), which depend on rigid linked segment models, also offer the potential to incorporate modeling of soft-tissue artifact (Leardini et al. 2005) based on data such as fluoroscopy studies (Tsai et al. 2009; Akbarshahi et al. 2010) in a manner that is not possible within the CGM.
Inadequate Compensation for Soft Tissues over Pelvic Landmarks
While methods have been proposed for measuring and taking into account the soft tissues over pelvic landmarks, none are particularly convincing or validated. As populations, particularly those with limited walking abilities, become increasingly overweight, this becomes a more important problem.
Poorly Validated Upper Body Model
While Davis et al. (1991) did suggest placement of markers on the shoulders to give an indication of trunk alignment, this has not been widely implemented. Vicon developed an upper body model for PiG but, despite this being quite widely used, there have been no published validations of its outputs. It is still not clear how important upper limb movements are in relation to clinical gait analysis, but knowledge of trunk alignment and dynamics is clearly important to understand the mechanics of the gait patterns of many people with a range of conditions.
Alternatives
Perhaps the most commonly used alternatives to the CGM are six degree-of-freedom (6DoF) models. These can be traced back to the work of Cappozzo et al. (1995) and have been popularized through the Visual3D software (C-Motion, Kingston, Canada). They track the segments independently (without constraining the joints) and can be based on skin-mounted markers (as implied by the illustration in the original paper) or rigid marker clusters (as is more common nowadays). Perhaps the most important limitation of this approach is that it refers to a modeling technique rather than any specific model (CAST is an abbreviation for the calibrated anatomical system technique), and no specific model has been widely used and rigorously validated. The Cleveland Clinic marker set was an early example which achieved popularity when it was implemented in the OrthoTrak software (Motion Analysis Corporation, Santa Rosa, USA) but has never been validated (or even fully described) in the peer-reviewed literature. More recently, Leardini et al. (2007) published and validated the IOR model, but there are only limited reports of use outside Bologna in the literature (and it is worth noting that the IOR model, in using skin-mounted markers, differs quite markedly from most contemporary 6DoF modeling, which uses rigid clusters).
6DoF models are sometimes presented as addressing the known limitations of the CGM. Sometimes there is justification in these claims (e.g., the segments are of fixed length), but often corresponding issues are overlooked (e.g., nonphysiological translations between the proximal and distal bones at some joints). Soft tissue artifact between markers is certainly eliminated by using rigid clusters, but a different form of soft tissue artifact will affect the orientation and position of the whole cluster in relation to the bones (Barre et al. 2013). Other issues, such as the difficulty in estimating the hip joint center or knee axis alignment, affect all models. One advantage of most 6DoF models is that they use medial and lateral epicondyle markers during a static trial to define the knee joint axis. This may be more repeatable than precise alignment of thigh wands or KADs. It is also worth noting that this is only a difference of knee calibration technique which could easily be incorporated into the CGM.
Inverse kinematic (often referred to as kinematic fitting or global optimization) models have also been reported (Lu and O'Connor 1999; Reinbolt et al. 2007; Charlton et al. 2004), and this approach has become more popular since it was incorporated within OpenSim (Seth et al. 2011) as the default technique for tracking marker data. In this approach, a linked rigid segment model is defined and an optimization technique is used to fit the model to the measured marker positions, generally using some weighted least-squares cost function. As with 6DoF models, this approach has advantages and disadvantages with respect to the CGM. It is also similar to the 6DoF approach in that no single model has received widespread use or been subject to rigorous validation. The approach is inherently compatible with advanced modeling techniques (e.g., muscle length modeling and forward dynamics) and is well suited to either stochastic or predictive approaches to modeling soft tissue artifact. Its most notable weakness is that it is nondeterministic.
On occasions, artifacts can arise in the data from soft-tissue artifact, marker misplacement, or erroneous model definition that can be extremely difficult to source. On balance, however, it is likely that future developments will be based on an inverse kinematic approach.
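To illustrate the kind of weighted least-squares fit described above, here is a minimal Python sketch (an assumption-laden toy, not OpenSim or any published model): a planar two-segment chain is fitted to measured marker positions by optimizing its joint angles. Segment lengths, marker offsets, and weights are invented for the example.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy linked rigid segment model: a planar two-segment chain (e.g., "thigh" + "shank"),
# each carrying one tracking marker at a fixed local position.
L1, L2 = 0.40, 0.40                        # segment lengths (m), illustrative
local_markers = [np.array([0.20, 0.05]),   # marker fixed to segment 1 (local coords)
                 np.array([0.25, -0.04])]  # marker fixed to segment 2
weights = np.array([1.0, 1.0])             # per-marker weights

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def model_markers(q):
    """Marker positions predicted by the rigid chain for joint angles q = (q1, q2)."""
    q1, q2 = q
    j1 = np.zeros(2)                       # proximal joint fixed at the origin
    m1 = j1 + rot(q1) @ local_markers[0]
    j2 = j1 + rot(q1) @ np.array([L1, 0.0])
    m2 = j2 + rot(q1 + q2) @ local_markers[1]
    return np.vstack([m1, m2])

def residuals(q, measured):
    """Weighted marker residuals, flattened for the least-squares solver."""
    diff = model_markers(q) - measured
    return (np.sqrt(weights)[:, None] * diff).ravel()

# Synthetic "measured" markers generated from known angles plus a little noise
true_q = np.array([0.3, -0.6])
measured = model_markers(true_q) + 0.002 * np.random.default_rng(0).standard_normal((2, 2))

fit = least_squares(residuals, x0=np.zeros(2), args=(measured,))
print("estimated joint angles (rad):", fit.x)   # close to [0.3, -0.6]
```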
Future Directions
Over the lifetime of the CGM, the nature of gait analysis has changed considerably in at least two important ways. The first is the growing importance of clinical governance (Scally and Donaldson 1998) and evidence-based practice within healthcare organizations. This requires increasing standardization of all operations based upon well-validated procedures. The emergence of accreditation schemes, such as those now operated by the Clinical Movement Analysis Society (CMAS; UK and Ireland) or the Committee for Motion Laboratory Accreditation (USA), is a consequence of this. At present the focus is on whether written protocols exist at all, but it is inevitable, as this minimal standard becomes universally implemented, that more attention will be paid to ensuring that any procedures are appropriately validated. This may be reinforced by more rigorous application of medical device legislation to gait analysis software, which should require manufacturers to ensure that clinically relevant outputs (such as joint angles from a specific biomechanical model) are reproducible (rather than just the technical outputs such as marker locations).
The other change, which has implications beyond gait analysis for purely clinical purposes, is that gait analysis systems are getting much cheaper and more user friendly. It can no longer be assumed that laboratories will have a staff member suitably qualified in biomechanics to create and adapt their own models. People using current technology generally want to implement standardized techniques, allowing them to focus on the interpretation of data rather than on developing individualized solutions and being distracted by the challenge of their validation. Such users will require a model that is simple enough to be understood conceptually in sufficient detail to guide quality assurance and interpretation of the data produced. In scientific research, it would also be useful to have a widely accepted standardized approach to capturing data to ensure that results from different centers are as comparable as possible.
For clinical users, and those in other fields who want to focus on the interpretation of data rather than the mechanics of data capture, there is therefore a real need for a widely accepted, standardized, and validated approach to data capture (including biomechanical modeling) which is efficient and robust in application and sufficiently simple to be understood by the users themselves (rather than relying on biomechanical experts). To be useful in this context, it needs to be widely applicable to all people who are old enough to walk and who have a range of different health conditions (or none). There needs to be a strong evidence base for the reproducibility of measurements, specific training for staff involved in the capture and processing of data, and appropriate metrics to assure the quality of measurements in routine practice.
The CGM satisfies all of these requirements at least as well as, and in most cases considerably better than, the alternatives. Despite this, many users are frustrated by its limitations, while potential users are often put off by its commonly perceived weaknesses (some justified, some not). It is clear that if the CGM is to have a future, it will require modifications to address these.
A particular issue for the CGM is that many older laboratories have databases stretching back over considerable periods of time (several decades in many cases), and backward compatibility is perceived as extremely important. Ensuring rigorous backward compatibility is, of course, incompatible with improving the modeling, so a compromise is required. The most obvious is to ensure that any new model uses the same anatomical segment definitions (see Table 1) as the original. It may be that modifications lead to systematic differences with the original CGM, but it will be clear that these are consequences of improvements in the modeling rather than a redefinition of what is being measured. It will also be important to quantify any such systematic changes so that they can be accounted for when data processed using different versions of the model are compared.
Another specific issue with the CGM is the perception of it as a "black box" processing technique which cannot be properly understood. This has persisted despite increasingly good documentation being produced but will be best addressed by publishing the actual computer code through which the model is implemented. Implementing the code in an open source language (such as Python) which is available to all users will also be important. Training and education packages will also be required for those less technically minded.
The specific modifications that are indicated would be:
• Adoption of a robust inverse kinematic fitting approach based around a linked rigid segment model that is compatible with advanced musculoskeletal modeling techniques.
• Replacement of wand markers with a limited number of skin-mounted tracking markers on the femur and tibia, positioned to minimize sensitivity to soft tissue artifact (Peters et al. 2009) or marker misplacement.
• Incorporation of more accurate equations for estimating the hip joint center and techniques for accounting for the depth of soft tissues over anatomical landmarks on the pelvis.
• Improved methods for determining the orientation of the coronal plane of the femur. Basing this upon the position of medial and lateral femoral epicondyle markers during a calibration trial may be an improvement, and functional calibration of the knee should be implemented as a quality assurance measure.
• Improvement of foot modeling by formalizing the PiG approach of using the heel marker to give an indication of inversion and eversion about the long axis of the foot. There is a lack of standardization in where the forefoot (toe) marker is placed. Opting for a more proximal placement (at about the level of the tarsometatarsal joints) would lead to the foot segment representing movement of the hind foot and open the possibility of some indication of forefoot alignment in relation to this using markers placed on the metatarsophalangeal joints.
• Validation of an appropriate trunk model should be regarded as essential. Doing so on the basis of force plate measurements of center of mass displacement during walking (Eames et al. 1999) would be useful to establish just how important measuring upper limb movement is in gait analysis.
Future versions should be adequately validated in line with a modern understanding of clinical best practice. At a minimum, this should include evidence of reproducibility of results, but it would also be useful to have accuracy established with reference to a variety of static and dynamic imaging techniques such as three-dimensional ultrasound (Peters et al. 2010; Hicks and Richards 2005; Passmore and Sangeux 2016), low-intensity biplanar x-rays (Pillet et al. 2014; Sangeux et al. 2014; Sauret et al. 2016), or fluoroscopy (Tsai et al. 2009; Akbarshahi et al. 2010). There should also be publication of benchmark data with which services can compare their own to ensure consistency (Pinzone et al. 2014), and streamlined processes for conducting in-house repeatability studies would also be extremely useful.
Cross-References
▶ 3D Dynamic Pose Estimation Using Reflective Markers or Electromagnetic Sensors
▶ 3D Dynamic Probabilistic Pose Estimation From Data Collected Using Cameras and Reflective Markers
▶ 3D Kinematics of Human Motion
▶ Next Generation Models Using Optimized Joint Center Location
▶ Observing and revealing the hidden structure of the human form in motion throughout the centuries
▶ Physics-based Models for Human Gait Analysis
▶ Rigid Body Models of the Musculoskeletal System
▶ Variations of Marker-sets and Models for Standard Gait Analysis
References
Akbarshahi M, Schache AG, Fernandez JW, Baker R, Banks S, Pandy MG (2010) Non-invasive assessment of soft-tissue artifact and its effect on knee joint kinematics during functional activity. J Biomech 43(7):1292–1301. doi:10.1016/j.jbiomech.2010.01.002
Baker R (2001) Pelvic angles: a mathematically rigorous definition which is consistent with a conventional clinical understanding of the terms. Gait Posture 13(1):1–6. doi:10.1016/S0966-6362(00)00083-7
Baker R (2011) Globographic visualisation of three dimensional joint angles. J Biomech 44(10):1885–1891. doi:10.1016/j.jbiomech.2011.04.031
Baker R, Finney L, Orr J (1999) A new approach to determine the hip rotations profile from clinical gait analysis data. Hum Mov Sci 18:655–667. doi:10.1016/S0167-9457(99)00027-5
Barre A, Thiran JP, Jolles BM, Theumann N, Aminian K (2013) Soft tissue artifact assessment during treadmill walking in subjects with total knee arthroplasty. IEEE Trans Biomed Eng 60(11):3131–3140. doi:10.1109/TBME.2013.2268938
Cappozzo A, Catani F, Croce UD, Leardini A (1995) Position and orientation in space of bones during movement: anatomical frame definition and determination. Clin Biomech 10(4):171–178. doi:10.1016/0268-0033(95)91394-T
Carson MC, Harrington ME, Thompson N, O'Connor JJ, Theologis TN (2001) Kinematic analysis of a multi-segment foot model for research and clinical applications: a repeatability analysis. J Biomech 34(10):1299–1307. doi:10.1016/S0021-9290(01)00101-4
Chao EY (1980) Justification of triaxial goniometer for the measurement of joint rotation. J Biomech 13:989–1006. doi:10.1016/0021-9290(80)90044-5
Charlton IW, Tate P, Smyth P, Roren L (2004) Repeatability of an optimised lower body model. Gait Posture 20(2):213–221. doi:10.1016/j.gaitpost.2003.09.004
Clauser C, McConville J, Young J (1969) Weight, volume and centre of mass of segments of the human body (AMRL Technical Report). Wright-Patterson Air Force Base, Ohio
Davis RB, Ounpuu S, Tyburski D, Gage J (1991) A gait analysis data collection and reduction technique. Hum Mov Sci 10:575–587. doi:10.1016/0167-9457(91)90046-Z
Dempster W (1955) Space requirements of the seated operator (WADC Technical Report 55–159). Wright-Patterson Air Force Base, Ohio
Eames M, Cosgrove A, Baker R (1999) Comparing methods of estimating the total body centre of mass in three-dimensions in normal and pathological gait. Hum Mov Sci 18:637–646. doi:10.1016/S0167-9457(99)00022-6
Foti T, Davis RB, Davids JR, Farrell ME (2001) Assessment of methods to describe the angular position of the pelvis during gait in children with hemiplegic cerebral palsy. Gait Posture 13:270
Harrington ME, Zavatsky AB, Lawson SE, Yuan Z, Theologis TN (2007) Prediction of the hip joint centre in adults, children, and patients with cerebral palsy based on magnetic resonance imaging. J Biomech 40(3):595–602. doi:10.1016/j.jbiomech.2006.02.003
Hicks JL, Richards JG (2005) Clinical applicability of using spherical fitting to find hip joint centers. Gait Posture 22(2):138–145. doi:10.1016/j.gaitpost.2004.08.004
Hinrichs RN (1985) Regression equations to predict segmental moments of inertia from anthropometric measurements: an extension of the data of Chandler et al. (1975). J Biomech 18(8):621–624. doi:10.1016/0021-9290(85)90016-8
Kadaba MP, Ramakrishnan HK, Wootten ME, Gainey J, Gorton G, Cochran GV (1989) Repeatability of kinematic, kinetic, and electromyographic data in normal adult gait. J Orthop Res 7(6):849–860. doi:10.1002/jor.1100070611
Kadaba MP, Ramakrishnan HK, Wootten ME (1990) Measurement of lower extremity kinematics during level walking. J Orthop Res 8(3):383–392. doi:10.1002/jor.1100080310
Leardini A, Cappozzo A, Catani F, Toksvig-Larsen S, Petitto A, Sforza V, Cassanelli G, Giannini S (1999) Validation of a functional method for the estimation of hip joint centre location. J Biomech 32(1):99–103. doi:10.1016/S0021-9290(98)00148-1
Leardini A, Chiari L, Della Croce U, Cappozzo A (2005) Human movement analysis using stereophotogrammetry. Part 3. Soft tissue artifact assessment and compensation. Gait Posture 21(2):212–225. doi:10.1016/j.gaitpost.2004.05.002
Leardini A, Sawacha Z, Paolini G, Ingrosso S, Nativo R, Benedetti MG (2007) A new anatomically based protocol for gait analysis in children. Gait Posture 26(4):560–571. doi:10.1016/j.gaitpost.2006.12.018
Lu TW, O'Connor JJ (1999) Bone position estimation from skin marker co-ordinates using global optimisation with joint constraints. J Biomech 32(2):129–134. doi:10.1016/S0021-9290(98)00158-4
McGinley JL, Baker R, Wolfe R, Morris ME (2009) The reliability of three-dimensional kinematic gait measurements: a systematic review. Gait Posture 29(3):360–369. doi:10.1016/j.gaitpost.2008.09.003
Ounpuu S, Gage J, Davis R (1991) Three-dimensional lower extremity joint kinetics in normal pediatric gait. J Pediatr Orthop 11:341–349
Ounpuu S, Davis R, Deluca P (1996) Joint kinetics: methods, interpretation and treatment decision-making in children with cerebral palsy and myelomeningocele. Gait Posture 4:62–78. doi:10.1016/0966-6362(95)01044-0
Passmore E, Sangeux M (2016) Defining the medial-lateral axis of an anatomical femur coordinate system using freehand 3D ultrasound imaging. Gait Posture 45:211–216. doi:10.1016/j.gaitpost.2016.02.006
Pearsall DJ, Costigan PA (1999) The effect of segment parameter error on gait analysis results. Gait Posture 9(3):173–183
Peters A, Sangeux M, Morris ME, Baker R (2009) Determination of the optimal locations of surface-mounted markers on the tibial segment. Gait Posture 29(1):42–48. doi:10.1016/j.gaitpost.2008.06.007
Peters A, Baker R, Sangeux M (2010) Validation of 3-D freehand ultrasound for the determination of the hip joint centre. Gait Posture 31:530–532. doi:10.1016/j.gaitpost.2010.01.014
Peters A, Baker R, Morris ME, Sangeux M (2012) A comparison of hip joint centre localisation techniques with 3-DUS for clinical gait analysis in children with cerebral palsy. Gait Posture 36(2):282–286. doi:10.1016/j.gaitpost.2012.03.011
Pillet H, Sangeux M, Hausselle J, El Rachkidi R, Skalli W (2014) A reference method for the evaluation of femoral head joint center location technique based on external markers. Gait Posture 39(1):655–658. doi:10.1016/j.gaitpost.2013.08.020
Pinzone O, Schwartz MH, Thomason P, Baker R (2014) The comparison of normative reference data from different gait analysis services. Gait Posture 40(2):286–290. doi:10.1016/j.gaitpost.2014.03.185
Rao G, Amarantini D, Berton E, Favier D (2006) Influence of body segments' parameters estimation models on inverse dynamics solutions during gait. J Biomech 39(8):1531–1536. doi:10.1016/j.jbiomech.2005.04.014
Reinbolt JA, Haftka RT, Chmielewski TL, Fregly BJ (2007) Are patient-specific joint and inertial parameters necessary for accurate inverse dynamics analyses of gait? IEEE Trans Biomed Eng 54(5):782–793. doi:10.1109/TBME.2006.889187
Sangeux M, Peters A, Baker R (2011) Hip joint centre localization: evaluation on normal subjects in the context of gait analysis. Gait Posture 34(3):324–328. doi:10.1016/j.gaitpost.2011.05.019
Sangeux M, Pillet H, Skalli W (2014) Which method of hip joint centre localisation should be used in gait analysis? Gait Posture 40(1):20–25. doi:10.1016/j.gaitpost.2014.01.024
Sauret C, Pillet H, Skalli W, Sangeux M (2016) On the use of knee functional calibration to determine the medio-lateral axis of the femur in gait analysis: comparison with EOS biplanar radiographs as reference. Gait Posture 50:180–184. doi:10.1016/j.gaitpost.2016.09.008
Scally G, Donaldson L (1998) Clinical governance and the drive for quality improvement in the new NHS in England. Br Med J 317:61–65. doi:10.1136/bmj.317.7150.61
Schwartz MH, Rozumalski A (2005) A new method for estimating joint parameters from motion data. J Biomech 38(1):107–116. doi:10.1016/j.jbiomech.2004.03.009
Seth A, Sherman M, Reinbolt JA, Delp SL (2011) OpenSim: a musculoskeletal modeling and simulation framework for in silico investigations and exchange. Procedia IUTAM 2:212–232. doi:10.1016/j.piutam.2011.04.021
Shoemaker P (1978) Measurements of relative lower body segment positions in gait analysis. University of California, San Diego
Sutherland D, Hagy J (1972) Measurement of gait movements from motion picture film. J Bone Joint Surg 54A(4):787–797
Tsai T-Y, Lu T-W, Kuo M-Y, Hsu H-C (2009) Quantification of three-dimensional movement of skin markers relative to the underlying bones during functional activities. Biomed Eng: Appl Basis Commun 21(3):223–232. doi:10.4015/S1016237209001283
Winter D, Robertson D (1978) Joint torque and energy patterns in normal gait. Biol Cybern 29:137–142. doi:10.1007/BF00337349
Variations of Marker Sets and Models for Standard Gait Analysis
Felix Stief
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anatomical and Technical Markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marker Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Definition of a Segment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prediction Approach or the Conventional Gait Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Functional Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Impact of Marker Set and Joint Angle Calculation on Gait Analysis Results . . . . . . . . . . . . . . . . . . Errors Involved with Marker Placement and Soft-Tissue Artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . Errors Associated with the Regression Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . How to Address the Measurement Error and What is the Extent of This Error? . . . . . . . . . . . Accuracy for Marker-Based Gait Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparison of Marker Sets and Models for Standard Gait Analysis . . . . . . . . . . . . . . . . . . . . . . . Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Importance of Repeatability Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Removing the Effects of Marker Misplacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstract
A variety of different approaches is used in 3D clinical gait analysis. This chapter provides an overview of common terms, different marker sets, underlying anatomical models, as well as a fundamental understanding of measurement techniques commonly used in clinical gait analysis and the consideration of possible errors associated with these different techniques.
F. Stief (*) Movement Analysis Lab, Orthopedic University Hospital Friedrichsheim gGmbH, Frankfurt/Main, Germany e-mail: [email protected]; [email protected] # Springer International Publishing AG 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_26-1
Besides the different marker sets, two main approaches can be used to quantify marker-based joint angles: a prediction approach based on regression equations and a functional approach. The prediction approach uses anatomical assumptions and anthropometric reference data to define the locations of joint centers/axes relative to specific anatomical landmarks. In the functional approach, joint centers are determined via optimization of marker movement. The accuracy of determining skeletal kinematics is limited by ambiguity in landmark identification and soft-tissue artifacts. When the intersubject variability of control data becomes greater than the expected change due to pathology, the clinical usefulness of the data becomes doubtful. To allow a practical interpretation of a comparison of approaches, differences and the measurement error should be quantified in the unit of interest (i.e., degrees or percent). The highest reliability indices occurred in the hip and knee in the sagittal plane, with the lowest reliability and highest errors for hip and knee rotation in the transverse plane. In addition, possible sources of error should be understood before an approach is applied in practice.
Keywords
Marker sets • Anatomical markers • Technical markers • Clusters • Modeling • Segment definition • Prediction approach • Functional approach • Regression equations • Conventional Gait Model • Measurement error • Soft-tissue artifacts • Reliability • Accuracy
Introduction
This chapter provides an overview of common terms, different marker sets, underlying anatomical models, as well as a fundamental understanding of measurement techniques commonly used in clinical gait analysis and the consideration of possible errors associated with these different techniques.
It is possible for a clinician or physician to subjectively study gait; however, the value and repeatability of this type of assessment is questionable due to poor inter- and intra-tester reliability. For instance, it is impossible for one individual to study, by observation alone, the movement patterns of all the main joints involved in an activity like walking simultaneously. Therefore, skeletal movements in three dimensions during gait are typically recorded using markers placed on the surface of the skin over various anatomical landmarks to represent body segments. The marker-based analysis of human movement helps to better understand normal and pathological function and results in a detailed and objective clinical assessment of therapeutic and surgical interventions.
A variety of different anatomical models and marker sets have been used for clinical gait analysis. While a certain amount of standardization has been established in recent years for the marker placement on anatomical points and the definition of most of the rigid body segments (pelvis, thigh, shank, foot), protocols differ in the underlying biomechanical model, the definition of joint centers and axes, and the number of markers used.
These differences have an effect on the outcome measures (e.g., joint angles and moments). The main focus of this chapter is to demonstrate the impact of marker sets and joint angle calculations on gait analysis results.
State of the Art
Markers are either described as passive or active ("▶ Estimating Hierarchical Rigid Body Models of the Musculoskeletal System"). Passive markers for camera-based systems are generally made of a retroreflective material. This material is used to reflect light emitted from around the camera back to the camera lens. Some camera-based systems use a stroboscopic light, while others use light from synchronized infrared light-emitting diodes mounted around the camera lens. In contrast, active markers produce light at a given frequency, so these systems do not require illumination, and, as such, the markers are more easily identified and tracked (Chiari et al. 2005). These light-emitting diodes (LEDs) are attached to a body segment in the same way as passive markers, but with the addition of a power source and a control unit for each LED. Active markers can have their own specific frequency, which allows them to be automatically detected. This leads to very stable real-time three-dimensional motion tracking, as no markers can be misidentified as adjacent markers. Regardless of whether they are passive or active, the use of markers should not significantly modify the movement pattern being measured.
Anatomical and Technical Markers
Anatomical markers are used to set up the segment reference frame. This is generally done during a static trial with the subject standing still. Anatomical markers may be attached directly to the skin over bony landmarks or fixed to a pointer. These markers are not required for the dynamic trials as long as at least three fixed points are available on each segment. Technical markers have no specific location and are chosen purely to meet the other requirements above. Additional technical markers can be used to create a technical coordinate system from data collected in a static calibration trial during which both anatomical and technical markers are present. In subsequent dynamic trials, absent anatomical markers can be expressed in relation to the technical coordinate system. Technical markers can also be used to avoid areas of adipose tissue in obese patients, to accommodate walking aids, or to replace markers that are obscured dynamically.
Two approaches are commonly used. Technical markers may be used to replace only those anatomical markers that cannot be used dynamically. In this case, the majority of anatomical markers remain in place for the walking trials. Alternatively, clusters of technical markers attached to a plate (see "Marker Clusters" below) may be used to provide all the dynamic information needed. Anatomical markers are then only used for the static trial to allow segment reconstruction.
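The sketch below illustrates the idea of expressing an anatomical landmark in a technical (cluster) coordinate system during the static trial and reconstructing it in dynamic trials. It is a minimal Python example under simplifying assumptions (rigid cluster, least-squares pose fit); it is not the implementation of any specific commercial protocol, and the marker coordinates are invented.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def fit_rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping point set P onto Q (Kabsch)."""
    Pc, Qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - Pc).T @ (Q - Qc)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ D @ U.T
    return R, Qc - R @ Pc

# Static calibration: cluster markers and an anatomical marker, both in lab coordinates
cluster_static = np.array([[0.00, 0.00, 0.00],
                           [0.08, 0.00, 0.01],
                           [0.04, 0.07, 0.00],
                           [0.03, 0.03, 0.06]])
asis_static = np.array([0.10, -0.05, 0.02])     # illustrative landmark position

# Dynamic frame: the cluster has moved rigidly (here simulated with a known pose)
R_true = Rotation.from_euler("zyx", [20, 5, -10], degrees=True).as_matrix()
t_true = np.array([0.5, 0.2, 0.1])
cluster_dynamic = cluster_static @ R_true.T + t_true

# Recover the cluster pose and reconstruct the (absent) anatomical marker
R, t = fit_rigid_transform(cluster_static, cluster_dynamic)
asis_reconstructed = R @ asis_static + t
print(asis_reconstructed, R_true @ asis_static + t_true)   # the two should agree
```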
Fig. 1 Rigid marker cluster with four retroreflective markers
Marker Clusters
Another technique for minimizing soft-tissue artifacts and reducing intersubject variability is the use of marker clusters (arrays of markers) (Cappozzo et al. 1995). They must be in place during the static anatomical calibration. The exact placement of the clusters is less critical, as this technique uses their positions relative to the anatomical landmarks identified in the static calibration. The purpose is to define the plane of each segment with 3–5 markers and then track its movement through the basic reference planes. Clusters can be directly attached to the skin or mounted on rigid fixtures (Fig. 1); the choice depends upon the anatomy, the activity, and the nature of the analysis. In a rigid body or cluster, the distance between any two points within the body or cluster does not change.
In general, tracking of marker clusters helps to reduce noise within the motion signal and improve the accuracy of kinematic data. When the markers are fixed to rigid plates, the markers never move independently with deformation of the skin. It has been shown that the absolute and relative variance in out-of-sagittal-plane rotations tended to be higher for the Conventional Gait Model ("▶ The Conventional Gait Model: Success and Limitations") compared with a cluster technique (Duffell et al. 2014) and that a cluster marker set overcomes a number of theoretical limitations compared to the conventional set (Collins et al. 2009) when both models were compared simultaneously. Much work has been carried out to determine the optimal configuration of marker clusters, and it is now widely accepted that a rigid shell with a cluster of four markers is a good practical solution (Cappozzo et al. 1997; Manal et al. 2000).
However, even when the cluster markers are fixed to a rigid plate, these methods are not able to address absolute errors and can still result in inaccurate identification of joint centers (Holden and Stanhope 1998). Although an extended version of this method has been reported to improve estimation of the position of the underlying bones (Alexander and Andriacchi 2001), it can only model skin deformations and has limited use in clinical applications due to the number of additional markers required.
The Definition of a Segment
In general, three markers are needed to fix a rigid body in space. When using motion capture to define the pelvic segment ("▶ Estimating Hierarchical Rigid Body Models of the Musculoskeletal System") and measure pelvic motion, the International Society of Biomechanics (ISB) recommends that the pelvic anatomical coordinate system be defined by surface markers placed on the right and left anterior superior iliac spines (ASISs) and on the right and left posterior superior iliac spines (PSISs). The pelvic anatomical coordinate system can be described as follows: the origin is at the midpoint between the right ASIS and the left ASIS; the Z-axis points from the origin to the right ASIS; the X-axis lies in the plane defined by the right ASIS, the left ASIS, and the midpoint of the right PSIS and left PSIS markers and points ventrally, orthogonal to the Z-axis; and the Y-axis is orthogonal to these two axes (Wu et al. 2002). These markers would ideally be used to track the pelvis during gait or clinical assessment protocols that involve movement.
However, situations in which the ASIS or PSIS markers are obscured from view require that alternative technical marker sets be used. Occlusion of the ASIS markers can result from soft tissue around the anterior abdomen (a common issue in overweight and obese subjects), arm movement, or activities that require high degrees of hip and trunk flexion, such as running, stair climbing, or level walking. It has been shown that pelvic models that include markers placed on the ASISs and the iliac crests (ICs), or on the PSISs and ICs, are suitable alternatives to the standard pelvic model (ASISs and PSISs) for tracking pelvic motion during gait (Bruno and Barden 2015). Alternatively, a rigid cluster of three orthogonal markers attached to the sacrum can be used as technical markers (Borhani et al. 2013). Using the calibrated anatomical system technique (Benedetti et al. 1998; Cappello et al. 2005), the position of the ASIS is defined relative to the cluster in a static trial; during the dynamic trials, the ASIS position is then reconstructed from the cluster and is thus affected by the same skin movement artifact that affects the cluster. Another alternative to address skin artifacts is to use the right and left hip joint centers, described in the technical coordinate systems of the right and left thighs, together with the right PSIS and left PSIS markers, as technical markers for tracking pelvis movement (Kisho Fukuchi et al. 2010).
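The ISB-style construction just described can be written out directly. The following Python sketch builds the pelvic axes from the four pelvic markers; the variable names and example coordinates are illustrative assumptions, not data from any protocol.

```python
import numpy as np

def pelvic_axes(rasis, lasis, rpsis, lpsis):
    """Pelvic anatomical coordinate system following the construction described above:
    origin midway between the ASISs, Z toward the right ASIS, X pointing ventrally in
    the plane of the ASISs and the PSIS midpoint, Y orthogonal to both."""
    origin = 0.5 * (rasis + lasis)
    mid_psis = 0.5 * (rpsis + lpsis)

    z = rasis - origin
    z /= np.linalg.norm(z)

    # Provisional anterior direction within the ASIS/PSIS-midpoint plane
    x = origin - mid_psis
    x -= np.dot(x, z) * z            # make it orthogonal to Z (it stays in the plane)
    x /= np.linalg.norm(x)

    y = np.cross(z, x)               # completes the right-handed system
    return origin, np.column_stack([x, y, z])   # axes as columns of a rotation matrix

# Illustrative marker positions in lab coordinates (meters)
rasis = np.array([0.13, 1.00, 0.95])
lasis = np.array([-0.13, 1.00, 0.95])
rpsis = np.array([0.05, 0.85, 0.98])
lpsis = np.array([-0.05, 0.85, 0.98])
origin, R_pelvis = pelvic_axes(rasis, lasis, rpsis, lpsis)
print(origin)
print(R_pelvis)   # columns: anterior (X), roughly superior (Y), right-lateral (Z)
```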
Prediction Approach or the Conventional Gait Model
Besides the different marker sets, two main approaches can be used to quantify joint angles: a prediction approach based on regression equations and a functional approach. The prediction approach uses anatomical assumptions and anthropometric reference data to define the locations of joint centers/axes relative to specific anatomical landmarks (Isman and Inman 1969; Weidow et al. 2006). In the functional approach, joint centers are determined via optimization of marker movement. The advantages and disadvantages of both approaches are described below in detail.
Most biomechanical analysis systems use regression equations based on predictive methods to calculate joint centers. Kadaba et al. (1989), Davis III et al. (1991), and Vaughan et al. (1992) provided detailed descriptions of a marker-based system to calculate joint centers in the lower extremities. This marker setup has become one of the most commonly used models in gait analysis. It is referred to as the Helen Hayes Hospital marker setup, and the regression equations are referred to as the Plug-in Gait (PiG) model or the Conventional Gait Model ("▶ The Conventional Gait Model: Success and Limitations").
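A regression-based (predictive) joint center estimate has the general form of a landmark-relative offset scaled by anthropometric measures. The Python sketch below shows this pattern for a hip joint center expressed in a pelvic coordinate frame; the coefficients are deliberately made-up placeholders for illustration and are not the published values of Davis, Bell, or Harrington.

```python
import numpy as np

# Placeholder regression coefficients (illustrative only, not published values):
# each offset component is modeled as a * pelvic_width + b * leg_length + c (meters)
COEFFS = {
    "anterior": (-0.20, 0.00, -0.010),   # negative values => posterior offset
    "superior": (-0.30, 0.00, -0.010),   # negative values => inferior offset
    "lateral":  ( 0.33, 0.00,  0.007),
}

def predicted_hjc_offset(pelvic_width, leg_length, side="right"):
    """Hip joint center offset from the pelvic origin, in a pelvic frame assumed to be
    X anterior, Y superior, Z to the right, using the placeholder coefficients."""
    terms = (pelvic_width, leg_length, 1.0)
    x = sum(c * v for c, v in zip(COEFFS["anterior"], terms))
    y = sum(c * v for c, v in zip(COEFFS["superior"], terms))
    z = sum(c * v for c, v in zip(COEFFS["lateral"], terms))
    if side == "left":
        z = -z                          # mirror across the sagittal plane
    return np.array([x, y, z])

print(predicted_hjc_offset(pelvic_width=0.24, leg_length=0.86, side="right"))
```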
Functional Approach
In general, technical marker sets require data capture in a static standing trial to determine rotation values (offsets) to place these markers into the anatomical coordinate system. If a marker does not, for instance, accurately represent the position of the hip during the standing data capture, the technical markers will not be placed into the correct anatomical plane for the dynamic trial. This is particularly problematic if the static and dynamic positions of the hip vary from one another. It has been shown that static standing posture greatly affects the dynamic hip rotation kinematics when using a thigh wand in the typical clinical gait analysis process for the Conventional Gait Model (McMulkin and Gordon 2009). Therefore, if a thigh wand is to be used in clinical practice, it is necessary that patients stand in a hip rotation posture that is equivalent to the hip rotation position used in gait. This can be very difficult because it requires clinicians to have a priori knowledge of the hip rotation in gait before testing. Also, patients may use different strategies in static standing than in walking.
One way of addressing this issue is to use functional joint center techniques (Ehrig et al. 2006; Leardini et al. 1999; Schwartz and Rozumalski 2005). This approach is termed functional because subject-specific joint centers/axes are calculated from movement data of adjacent segments recorded during basic motion tasks. With a focus on assessing motion patterns in a subject-specific manner, functional methods rely on the relative motion between the marker clusters of neighboring segments to identify joint centers and axes (Cappozzo et al. 1997; Ehrig et al. 2006). Previously developed functional methods have been demonstrated to be precise (Ehrig et al. 2006; Kornaropoulos et al. 2010; Kratzenstein et al. 2012) as well as rapid and robust (Schwartz and Rozumalski 2005) in estimating joint centers.
Nevertheless, in many patient groups, functional calibration has been reported to be difficult (Sangeux et al. 2011) because the range of motion (ROM) of the affected joints is restricted. In addition, functional methods have not been able to demonstrate consistent advantages over more traditional regression-based approaches (Assi et al. 2016; Bell et al. 1990; Davis III et al. 1991; Harrington et al. 2007), possibly due to issues of marker placement and the nonlinear distribution of soft-tissue artifacts across a segment (Gao and Zheng 2008; Stagni et al. 2005).
Kratzenstein et al. (2012) presented an approach for understanding the contribution of different regions of marker attachment on the thigh toward the precise determination of the hip joint center. This working group used a combination of established approaches (Taylor et al. 2010) to reduce skin marker artifacts (Taylor et al. 2005), determine joint centers of rotation (Ehrig et al. 2006), and quantify the weighting of each of a large number of markers (Heller et al. 2011) attached to the thigh. Consequently, markers that are suboptimally located, and therefore strongly affected by soft-tissue artifacts, are assigned a lower weighting compared to markers that follow spherical trajectories around the joint. Based on these methods, six regions of high importance were determined that produced a symmetrical center of rotation estimation (Ehrig et al. 2011) almost as low as that obtained using a marker set covering the entire thigh. Such approaches could be used to optimize marker sets, targeting more accurate and robust motion capture to aid clinical diagnosis and improve the reliability of longitudinal studies.
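One common functional strategy, reflected in the reference above to markers following spherical trajectories, is to fit a sphere to the path of a distal-segment marker expressed in the proximal (e.g., pelvis) frame and take the sphere center as the joint center. Below is a minimal linear least-squares sphere fit in Python, an illustrative sketch with synthetic data rather than the SCoRE or any other published algorithm.

```python
import numpy as np

def fit_sphere(points):
    """Algebraic least-squares sphere fit: returns (center, radius).

    Uses the linearization |p|^2 = 2 p.c + (r^2 - |c|^2) and solves one linear system
    for the center c and the constant term.
    """
    A = np.hstack([2.0 * points, np.ones((len(points), 1))])
    b = np.sum(points ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center, k = sol[:3], sol[3]
    radius = np.sqrt(k + center @ center)
    return center, radius

# Synthetic "functional trial": a thigh marker moving on a sphere around a known
# joint center (expressed in the pelvis frame), with a little measurement noise.
rng = np.random.default_rng(1)
true_center, true_radius = np.array([0.09, -0.07, 0.08]), 0.25
theta, phi = rng.uniform(0.2, 1.2, 300), rng.uniform(-0.8, 0.8, 300)
pts = true_center + true_radius * np.column_stack([
    np.sin(theta) * np.cos(phi), np.sin(theta) * np.sin(phi), np.cos(theta)])
pts += 0.002 * rng.standard_normal(pts.shape)

center, radius = fit_sphere(pts)
print(center, radius)   # close to the true joint center and radius
```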
Impact of Marker Set and Joint Angle Calculation on Gait Analysis Results
Errors Involved with Marker Placement and Soft-Tissue Artifacts
The accuracy of determining skeletal kinematics is limited by ambiguity in landmark identification and by soft-tissue artifacts, that is, the motion of markers over the underlying bones due to skin elasticity, muscle contraction, or synchronous shifting of the soft tissues (Leardini et al. 2005; Taylor et al. 2005). Generally, two types of error are attributed to soft-tissue artifacts. Relative errors are defined as the relative movement between two or more markers that define a rigid segment. Absolute errors are defined as the movement of a marker with respect to the bony landmark it is representing (Richards 2008). Relative and absolute errors are often caused by movement of the soft tissue on which the markers are placed (Cappozzo et al. 1996). The magnitude of these errors has been studied by using pins secured directly into the bone and comparing the data collected from skin-mounted markers to markers attached to the bone pins. These data give a direct measure of soft-tissue movement with respect to the skeletal system (Cappozzo 1991; Cappozzo et al. 1996; Reinschmidt et al. 1997a, b). However, the applicability of this method is limited due to its invasive nature.
The amount and the effects of soft-tissue artifacts from skin markers are discussed controversially, with relative skin-to-bone marker movements in the range of 3 mm up to 40 mm, depending upon the specific body segment and soft-tissue coverage (Cappozzo et al. 1996; Holden et al. 1997; Manal et al. 2000, 2003; Reinschmidt et al. 1997b). Differences can be accounted for by variation in marker placement and configuration, differences in techniques, intersubject differences, and differences in the task performed (Leardini et al. 2005). Inaccuracies in lower limb motion, and in particular knee kinematics, are present mainly because of soft-tissue artifacts at the thigh segment (Alexander and Andriacchi 2001; Cappello et al. 1997; Fuller et al. 1997; Leardini et al. 2005; Lucchetti et al. 1998). Conversely, soft-tissue movement on the shank has only a small effect on three-dimensional kinematics and moments at the knee (Holden et al. 1997; Manal et al. 2002). In addition, substantial angular variability has been noted mainly in the frontal and transverse planes (Ferrari et al. 2008; Miana et al. 2009), due to the small ROM in these planes compared to sagittal plane movements. This reasoning agrees with the results of Leardini et al. (2005), who assert that out-of-sagittal-plane angles should be regarded with much more caution, as the soft-tissue artifact produces spurious effects with magnitudes comparable to the amount of motion actually occurring in the joints. In addition, an increase in velocity (for instance, during running) produces an increased variability of the joint center distances and increases the maximum differences between the joint angles obtained with different protocols (Miana et al. 2009).
Errors Associated with the Regression Equations
Besides soft-tissue artifacts and variability in marker placement, errors associated with the regression equations used to calculate the joint center locations are also considerable (Harrington et al. 2007; Leardini et al. 1999; Sangeux et al. 2011). Clinically, the definition of the joint center is generally achieved by using palpable anatomical landmarks to define the medial-lateral axis of the joint. From these anatomical landmarks, the center of rotation is generally calculated in one of two ways: through the use of regression equations based on standard radiographic evidence, or simply as a percentage offset from an anatomical marker based on some kind of anatomical landmark (Bell et al. 1990; Cappozzo et al. 1995; Davis III et al. 1991; Kadaba et al. 1989). The issue of hip joint center (HJC) identification is one that has been covered in much depth, and there are still many debates around this area. The location of this joint center is one of the most difficult anatomic reference points to define. The center of the femoral head is the center of the hip joint and is located within the acetabulum on the obliquely aligned and tilted lateral side of the pelvis. Therefore, common approaches have used landmarks on the pelvis as the anatomical reference (Perry and Burnfield 2010). The regression equations in the Conventional Gait Model are based on the HJC regression equations by Davis et al. (1991) and chord functions to predict the knee and the ankle joint centers.
The HJC regression equation was based on 25 male subjects and has been evaluated in later studies (Harrington et al. 2007; Leardini et al. 1999; Sangeux et al. 2011) showing significant errors, which were corrected with new regression equations (Sandau et al. 2015). In the chord function, the HJC, the thigh wand marker, and the epicondyle marker are used to define a plane. The knee joint center (KJC) is then found such that the epicondyle marker is at a half knee-diameter distance from the KJC, in a direction perpendicular to the line from the HJC to the KJC. The ankle joint center (AJC) is predicted in the same way as the knee, where the chord function is used to predict the joint center based on the KJC, the calf wand marker, and the malleolus marker. The chord functions predict the KJC and the AJC under the assumption that the joint centers lie on the transepicondylar axis and the transmalleolar axis in the frontal plane, respectively. This assumption seems reasonable for the knee (Asano et al. 2005; Most et al. 2004), but less so for the ankle joint (Lundberg et al. 1989). The exact position of the joint centers influences the joint angles as well as the joint angular velocity and acceleration, which are part of the inverse dynamics. Likewise, the location of the segmental center of mass will influence the inverse dynamics calculations via the moment arms acting together with both proximal and distal joint reaction forces.
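The chord construction just described amounts to three constraints on the unknown joint center: it lies in the plane of the three given points, it sits at a prescribed distance from the epicondyle marker, and the marker-to-center direction is perpendicular to the proximal-to-center line. The Python sketch below solves these constraints numerically; it is an illustrative reconstruction of the geometry as described in the text, not the Plug-in Gait source code, and the choice between the two geometrically valid solutions is left to the initial guess.

```python
import numpy as np
from scipy.optimize import fsolve

def chord_joint_center(prox_center, wand, marker, offset, guess):
    """Solve for a joint center P such that:
       1) P lies in the plane of (prox_center, wand, marker),
       2) |P - marker| = offset (e.g., knee width / 2 + marker radius),
       3) (marker - P) is perpendicular to (prox_center - P)."""
    n = np.cross(wand - prox_center, marker - prox_center)
    n /= np.linalg.norm(n)

    def equations(p):
        return [
            np.dot(n, p - marker),                          # in-plane constraint
            np.dot(p - marker, p - marker) - offset ** 2,   # distance from the marker
            np.dot(marker - p, prox_center - p),            # perpendicularity
        ]

    return fsolve(equations, guess)

# Illustrative inputs only (meters): hip center, thigh wand, lateral knee marker
hjc = np.array([0.10, 0.90, 0.95])
thi = np.array([0.18, 0.55, 0.97])
kne = np.array([0.15, 0.48, 0.93])
kjc = chord_joint_center(hjc, thi, kne, offset=0.055,
                         guess=kne + np.array([-0.05, 0.0, 0.0]))
print(kjc, np.linalg.norm(kjc - kne))   # the second value should equal the offset
```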
How to Address the Measurement Error and What Is the Extent of This Error?
In general, when addressing the measurement error in marker-based movement analysis, it is helpful to provide an absolute measure of reliability, for instance, the root mean square error or standard error of measurement (SEM). It is thus possible to express the variability in a manner that can be directly related to the measurement itself, in the same measurement units (e.g., degrees). Furthermore, with the transformation of the absolute error into a relative error, one can obtain the error expressed as a percentage of the total ROM of the variable to be analyzed. This is of particular importance for the between-plane comparison of the measurement error, given the different amplitudes of the kinematic and kinetic parameters (Stief et al. 2013). In contrast, the commonly reported intraclass correlation coefficient, coefficient of variation, and coefficient of multiple correlation provide only limited information, as high coefficient values can result from a low mean value of the variable of interest and thus could hide measurement errors of clinical importance (Luiz and Szklo 2005). Furthermore, expressing data variability as a coefficient results in units that are difficult to interpret clinically (Leardini et al. 2007). In the literature, kinematic measurement errors of less than 4° and 6° were reported for the intertrial and intersession variability, respectively (Stief et al. 2013). A systematic review by McGinley et al. (2009) identified that the highest reliability indices occurred in the hip and knee in the sagittal plane, with lowest reliability and highest errors for hip and knee rotation in the transverse plane. Most studies included in this review article providing estimates of data error reported values of less than 5°, with the exception of hip and knee rotation. Fukaya et al. (2013) investigated the interrater reliability of knee movement analyses during the
stance phase using a rigid marker set with three markers affixed to the thigh and shank. Each of three testers independently attached the infrared reflective markers to four subjects. The SEM values for reliability ranged from 0.68° to 1.13° for flexion-extension, 0.78°–1.60° for external-internal rotation, and 1.43°–3.33° for abduction-adduction. In general, the measurement errors between testers are considered to be greater than the measurement errors between sessions and within testers (Schwartz et al. 2004).
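To make these absolute and relative error measures concrete, here is a minimal sketch (function and variable names are illustrative assumptions; the SEM is computed here as the square root of the pooled within-session variance, one simple formulation among several in use):

```python
import numpy as np

def sem(measurements):
    """Standard error of measurement from a (sessions x trials) array of, e.g.,
    peak knee flexion angles in degrees. Computed as the square root of the
    pooled within-session variance, so the result keeps the unit of the data."""
    x = np.asarray(measurements, float)
    within_var = np.mean(np.var(x, axis=1, ddof=1))
    return float(np.sqrt(within_var))

def relative_error(absolute_error, total_rom):
    """Express an absolute error (e.g., the SEM in degrees) as a percentage of
    the total ROM of the variable, which makes errors comparable across planes
    with very different amplitudes."""
    return 100.0 * absolute_error / total_rom

# Illustrative usage: 3 sessions x 5 trials of peak knee flexion (degrees)
peaks = [[58.1, 59.0, 57.6, 58.4, 58.9],
         [60.2, 59.8, 60.5, 59.9, 60.1],
         [58.8, 59.2, 58.5, 59.0, 58.7]]
print(sem(peaks), relative_error(sem(peaks), total_rom=60.0))
```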
Accuracy for Marker-Based Gait Analysis
The accuracy of marker-based protocols can hardly be assessed in clinical routine, since invasive methods such as radiographic imaging (Garling et al. 2007) or bone pins (Taylor et al. 2005) are required in order to provide sufficient access to the skeletal anatomy but are generally not available. Ultrasound assessment of the joint provides one noninvasive opportunity (Sangeux et al. 2011), but assessment of the images can be somewhat subjective. According to Schwartz and Rozumalski (2005), the following indirect indicators of accuracy can be computed instead:
1. Knee varus/valgus ROM during gait: An accurate knee flexion axis alignment minimizes the varus/valgus ROM resulting from cross-talk, that is, one joint rotation (e.g., flexion) being interpreted as another (e.g., adduction or varus) due to axis malalignment (Piazza and Cavanagh 2000).
2. Knee flexion/extension ROM during gait: An accurate knee flexion axis alignment maximizes knee flexion/extension ROM by reducing cross-talk.
In general, the knee varus/valgus curve can be evaluated for signs of marker misplacement or Knee Alignment Device misalignment. Moreover, it has been shown that for the stable knee joint, the physiological ROM of knee varus/valgus only varies between 5° and 10° (Reinschmidt et al. 1997a). Minimization of the knee joint angle cross-talk can therefore be considered a valid criterion to evaluate the relative merits of different protocols and marker sets.
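As a rough illustration of how these indirect indicators can be screened, the sketch below computes both ROM values from joint-angle waveforms; the function name and the added flexion/varus-valgus correlation term are assumptions for illustration and not part of the cited method:

```python
import numpy as np

def crosstalk_indicators(knee_flexion, knee_varus_valgus):
    """Indirect accuracy indicators for knee axis alignment: a well-aligned
    flexion axis minimises varus/valgus ROM and maximises flexion/extension ROM.
    Inputs are joint-angle waveforms in degrees over one gait cycle."""
    fe = np.asarray(knee_flexion, float)
    vv = np.asarray(knee_varus_valgus, float)
    return {
        "flexion_extension_rom": float(np.ptp(fe)),
        "varus_valgus_rom": float(np.ptp(vv)),  # physiologically roughly 5-10 degrees
        "fe_vv_correlation": float(np.corrcoef(fe, vv)[0, 1]),  # high |r| suggests cross-talk
    }
```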
Comparison of Marker Sets and Models for Standard Gait Analysis
There is still a variety of different approaches being used in clinical gait analysis. Protocols differ in the underlying biomechanical model, the associated marker sets, and the data recording and processing. The biomechanical model defines the properties of the modeled joints, the number of involved segments, the definitions of joint centers and axes, the anatomical and technical reference frames used, and the angular decomposition technique applied to calculate joint angles. Despite apparent differences in the outcome measures derived from different gait protocols (Ferrari et al. 2008; Gorton et al. 2009), specifically for out-of-sagittal plane rotations (Ferrari et al. 2008), data from different studies are nevertheless compared and interpreted.
Any protocol for movement analysis will only prove useful if it displays adequate reliability (Cappozzo 1984). Moreover, and as stated before, the placement of the markers has considerable influence on the accuracy of gait studies (Gorton et al. 2009). One of the first protocols, proposed by Davis et al. (1991) and known as the Conventional Gait Model or PiG model, is still used by a vast majority of gait laboratories (Schwartz and Rozumalski 2005). Although the protocol is practicable and has been established over the years, some main disadvantages exist. It has been shown that intersession and interexaminer reliability are low for this protocol, especially at the hip and knee joint in the frontal and transverse plane (McGinley et al. 2009). The errors in the PiG protocol, for example, knee varus/valgus ROM up to 35° (Ferrari et al. 2008), are very likely caused by inconsistent anatomical landmark identification and marker positioning by the examiner. This leads to well-documented errors of skin movement (Leardini et al. 2005) and kinematic cross-talk. Moreover, accurate placement of the wand markers on the shank and the thigh is difficult (Karlsson and Tranberg 1999). Wands on the lateral aspect of the thighs and shanks are also likely to enlarge skin motion artifact effects (Manal et al. 2000) and the variability of the gait results (Gorton et al. 2009). One way of addressing these errors is the use of additional medial malleolus and medial femoral condyle markers to determine joint centers. This eliminates the reliance on the difficult, subjective palpation required for placing the thigh and tibia wand markers of the PiG model, which has been shown to have large variability between laboratories (Gorton et al. 2009) and to enlarge skin motion artifact effects (Manal et al. 2002), especially when markers are placed proximally, where the greatest soft-tissue artifact of any lower-limb segment is found (Stagni et al. 2005). Besides that, it has been shown that thigh wand markers capture approximately half of the actual femoral axial rotation (Schache et al. 2008; Schulz and Kimmel 2010; Wren et al. 2008). The reason for this may be that substantial proportions of hip external-internal rotations are detected as knee motions by the marker sets using thigh markers (Schulz and Kimmel 2010). Wren et al. (2008) have suggested using a patella marker (placed in the center of the patella), which was reported to detect 98% of the actual hip rotation ROM. Indeed, dynamic hip rotation during gait when utilizing a patella marker in lieu of a thigh wand was not affected by static hip posture (McMulkin and Gordon 2009). In a comparative study, the reliability and accuracy of the PiG model and an advanced protocol (MA) with additional medial malleolus and medial femoral condyle markers were estimated (Stief et al. 2013) (Fig. 2). For the MA, neither anthropometric measurements nor joint alignment devices are necessary. Knowledge of the spatial location of the anatomical landmarks enables automatic calculation of the anthropometric measurements necessary for joint center determination. In both protocols, the center of the hip joint was calculated using a geometrical prediction method (Davis III et al. 1991). The PiG model derived the rotational axis of the knee joint from the position of the pelvic, knee, and thigh markers and the rotational axis of the ankle joint from the position of the knee, ankle, and tibia markers.
Abbreviation | Placement | Required for protocol
SACR | On the skin midway between the posterior superior iliac spines | PiG / MA
LASI (RASI) | On the left (right) anterior superior iliac spine | PiG / MA
LTRO (RTRO) | On the prominent point of the left (right) trochanter major | MA
LTHI (RTHI) | Rigid wand marker mounted on the skin over the distal and lateral aspect of the left (right) thigh, aligned in the plane that contains the hip and knee joint centers and the knee flexion/extension axis | PiG
LKNEL (RKNEL) | On the left (right) lateral femoral condyle | PiG / MA
LKNEM (RKNEM) | On the left (right) medial femoral condyle | MA
LTIB (RTIB) | Rigid wand marker mounted on the skin over the distal and lateral aspect of the left (right) shank, aligned in the plane that contains the knee and ankle joint centers and the ankle flexion/extension axis | PiG
LANKL (RANKL) | On the left (right) lateral malleolus aligned with the bimalleolar axis | PiG / MA
LANKM (RANKM) | On the left (right) medial malleolus aligned with the bimalleolar axis | MA
LTOE (RTOE) | On the left (right) second metatarsal head, on the mid-foot side of the equinus break between forefoot and mid-foot | PiG / MA
LHEE (RHEE) | On the left (right) aspect of the Achilles tendon insertion, on the calcaneus at the same height above the plantar surface of the foot as the LTOE (RTOE) marker | PiG / MA

Fig. 2 Marker set of both lower body protocols. The markers indicated by circles are part of the standard Plug-in-Gait (PiG) marker set (Conventional Gait Model); those indicated by triangles are the additional markers used in the custom-made protocol (MA)
In contrast to the PiG model, the centers of the knee and ankle joints in the MA were statically defined as the midpoint between the medial and lateral femoral condyle and malleolus markers, respectively. The anatomical medial malleolus and femoral condyle markers can then be removed for the dynamic trials. The results of this comparative study (PiG model vs. MA) show, for both protocols and healthy subjects, good intersession reliability for all ankle, knee, and hip joint angles in the sagittal plane. Nevertheless, the lower intersession errors for the
MA compared to the PiG model regarding frontal plane knee angles and moments and transverse plane motion in the knee and hip joint suggest that the error in repeated palpation of the landmarks is lower using the MA. Moreover, the MA significantly reduced the knee axis cross-talk phenomenon, suggesting improved accuracy of knee axis alignment compared to the PiG model. These results are comparable to those reported by Schwartz and Rozumalski (2005) using a functional approach in comparison with the PiG model. The MA eliminates the reliance on the subjective palpation of the thigh and tibia wand markers and the application of the Knee Alignment Device method (Davis and DeLuca 1996), which is difficult to handle and less reliable within or between therapists than manual palpation, especially for non-experienced investigators (Serfling et al. 2009). Nevertheless, correct marker placement based on the exact identification of the characteristic bony landmarks is still required. The position of the knee markers is especially important, because it influences not only knee joint kinematics but also those of the hip and ankle joints. It has been shown that simultaneous knee hyperextension, internal hip rotation, and external ankle rotation can be caused by posterior misplacement of the lateral knee marker, and simultaneous knee overflexion, external hip rotation, and internal ankle rotation may be influenced by anterior knee marker misplacement (Szczerbik and Kalinowska 2011). Therefore, if such phenomena are represented in the kinematic graphs, their presence should be confirmed by video registration prior to the formulation of clinical conclusions.
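For completeness, the static midpoint definition used in the MA can be written down in a few lines (a minimal sketch; marker names follow Fig. 2, while the function name and data layout are assumptions):

```python
import numpy as np

def static_joint_centres(markers):
    """Knee and ankle joint centres of the left leg, statically defined as the
    midpoints of the medial/lateral femoral condyle and malleolus markers
    captured in the standing calibration trial. `markers` maps marker names
    (as in Fig. 2) to 3D positions."""
    kjc = 0.5 * (np.asarray(markers["LKNEL"], float) + np.asarray(markers["LKNEM"], float))
    ajc = 0.5 * (np.asarray(markers["LANKL"], float) + np.asarray(markers["LANKM"], float))
    return kjc, ajc
```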
Future Directions
There are many anatomical models and marker sets reported in the literature. The increase in the complexity of the models relates not only to the ability of movement analysis systems to track more and more markers but also to the increase in knowledge about modeling human movement.
The Importance of Repeatability Studies
Some of the theoretical aspects of marker placement have been presented in this chapter. The practical implications are best explored in the gait laboratory by repeat marker placement. Repeated testing of a single subject will give some insight into the variability for a single person placing markers (intra-tester reliability) and between different people placing markers (inter-tester reliability). Intersubject variability would additionally be affected by differences in each subject's walking style and by between-subject differences in marker placement and motion relative to bony landmarks. When the intersubject variability of control data becomes greater than the expected change due to pathology, the clinical usefulness of the data becomes doubtful. To allow a practical interpretation of a comparison of approaches, differences and their variability should be quantified in the unit of interest (i.e., degrees or percent).
Removing the Effects of Marker Misplacement
The placement of markers is not easy, and there is a limit to the accuracy we can realistically achieve. Even if the markers are in the right place, the effects of skin movement and oscillation will introduce errors once the subject is walking. One possibility is that the marker placement is "corrected" as part of the data processing. Complex algorithms are now becoming available for performing such corrections. A simpler approach has been used for some time to increase the accuracy of joint center determination: in addition to the Conventional Gait Model, at least the use of medial malleolus and medial femoral condyle markers is recommended when analyzing frontal and transverse plane gait data. This should lead to lower measurement errors for most of the gait variables and to a more accurate determination of the knee joint axis. Nevertheless, gait variables in the transverse plane are poorly reproducible (Ferber et al. 2002; Krauss et al. 2012), and their variability associated with the underlying biomechanical protocol is substantial (Ferrari et al. 2008; Krauss et al. 2012; Noonan et al. 2003). In the future, approaches that combine key characteristics of proven methods (functional and/or predictive methods) for the assessment of skeletal kinematics could be used to optimize marker sets, targeting more accurate and robust motion capture to aid clinical diagnosis and improve the reliability of longitudinal studies. On the other hand, procedural distress should be minimized. Children in particular cannot always stand still for a long time, walk wearing a large number of markers, or perform additional motion trials. The marker set and any associated anatomical landmark calibration or anthropometric measurement procedures must therefore be minimized to contain the time taken for subject preparation and data collection (Leardini et al. 2007).
Conclusion
When comparing movement data, care must be taken where different marker sets have been used. Whatever approach is used, the problem is separating patterns produced by errors from those produced by pathology. To this day, it is, for instance, not clear how different marker configurations impact hip rotation in the typical clinical gait analysis process. For this reason, the "true" values for rotation often remain unknown. Therefore, gait protocols have to be described precisely, and comparisons with other studies should be made critically. In addition, the sources of error should be known before an approach is applied in practice. Learning and training of the examiners, which is considered to be a critical issue (Gorton et al. 2009), is important to ensure exact anatomical landmark location, which may also reduce intra- and inter-examiner variability. Moreover, the graphs from instrumented gait analysis should be confirmed by video registration prior to the formulation of clinical conclusions.
References Alexander EJ, Andriacchi TP (2001) Correcting for deformation in skin-based marker systems. J Biomech 34(3):355–361 Asano T, Akagi M, Nakamura T (2005) The functional flexion-extension axis of the knee corresponds to the surgical epicondylar axis: in vivo analysis using a biplanar image-matching technique. J Arthroplast 20(8):1060–1067. doi:10.1016/j.arth.2004.08.005 Assi A, Sauret C, Massaad A, Bakouny Z, Pillet H, Skalli W, Ghanem I (2016) Validation of hip joint center localization methods during gait analysis using 3D EOS imaging in typically developing and cerebral palsy children. Gait Posture 48:30–35. doi:10.1016/j. gaitpost.2016.04.028 Bell AL, Pedersen DR, Brand RA (1990) A comparison of the accuracy of several hip center location prediction methods. J Biomech 23(6):617–621 Benedetti MG, Catani F, Leardini A, Pignotti E, Giannini S (1998) Data management in gait analysis for clinical applications. Clin Biomech (Bristol, Avon) 13(3):204–215 Borhani M, McGregor AH, Bull AM (2013) An alternative technical marker set for the pelvis is more repeatable than the standard pelvic marker set. Gait Posture 38(4):1032–1037. doi:10.1016/j.gaitpost.2013.05.019 Bruno P, Barden J (2015) Comparison of two alternative technical marker sets for measuring 3D pelvic motion during gait. J Biomech 48(14):3876–3882. doi:10.1016/j.jbiomech.2015.09.031 Cappello A, Cappozzo A, La Palombara PF, Lucchetti L, Leardini A (1997) Multiple anatomical landmark calibration for optimal bone pose estimation. Hum Mov Sci 16(2–3):259–274. doi:10.1016/S0167-9457(96)00055-3 Cappello A, Stagni R, Fantozzi S, Leardini A (2005) Soft tissue artifact compensation in knee kinematics by double anatomical landmark calibration: performance of a novel method during selected motor tasks. IEEE Trans Biomed Eng 52(6):992–998. doi:10.1109/tbme.2005.846728 Cappozzo A (1984) Gait analysis methodology. Hum Mov Sci 3(1–2):27–50. doi:10.1016/01679457(84)90004-6 Cappozzo A (1991) Three-dimensional analysis of human walking: experimental methods and associated artifacts. Hum Mov Sci 10(5):589–602. doi:10.1016/0167-9457(91)90047-2 Cappozzo A, Catani F, Croce UD, Leardini A (1995) Position and orientation in space of bones during movement: anatomical frame definition and determination. Clin Biomech (Bristol, Avon) 10(4):171–178 Cappozzo A, Catani F, Leardini A, Benedetti MG, Croce UD (1996) Position and orientation in space of bones during movement: experimental artefacts. Clin Biomech (Bristol, Avon) 11(2): 90–100 Cappozzo A, Cappello A, Della Croce U, Pensalfini F (1997) Surface-marker cluster design criteria for 3-D bone movement reconstruction. IEEE Trans Biomed Eng 44(12):1165–1174. doi:10.1109/10.649988 Chiari L, Della Croce U, Leardini A, Cappozzo A (2005) Human movement analysis using stereophotogrammetry. Part 2: instrumental errors. Gait Posture 21(2):197–211. doi:10.1016/j. gaitpost.2004.04.004 Collins TD, Ghoussayni SN, Ewins DJ, Kent JA (2009) A six degrees-of-freedom marker set for gait analysis: repeatability and comparison with a modified Helen Hayes set. Gait Posture 30(2): 173–180. doi:10.1016/j.gaitpost.2009.04.004 Davis RB, DeLuca PA (1996) Clinical gait analysis: current methods and future directions. In: Harris GF, Smith PA (eds) Human motion analysis: current applications and future directions. The Institute of Electrical and Electronic Engineers Press, New York, pp 17–42 Davis RB III, Õunpuu S, Tyburski D, Gage JR (1991) A gait analysis data collection and reduction technique. Hum Mov Sci 10(5):575–587. 
doi:10.1016/0167-9457(91)90046-Z Duffell LD, Hope N, McGregor AH (2014) Comparison of kinematic and kinetic parameters calculated using a cluster-based model and Vicon’s plug-in gait. Proc Inst Mech Eng H 228 (2):206–210. doi:10.1177/0954411913518747
Ehrig RM, Taylor WR, Duda GN, Heller MO (2006) A survey of formal methods for determining the centre of rotation of ball joints. J Biomech 39(15):2798–2809. doi:10.1016/j. jbiomech.2005.10.002 Ehrig RM, Heller MO, Kratzenstein S, Duda GN, Trepczynski A, Taylor WR (2011) The SCoRE residual: a quality index to assess the accuracy of joint estimations. J Biomech 44 (7):1400–1404. doi:10.1016/j.jbiomech.2010.12.009 Ferber R, McClay Davis I, Williams DS 3rd, Laughton C (2002) A comparison of within- and between-day reliability of discrete 3D lower extremity variables in runners. J Orthop Res 20 (6):1139–1145. doi:10.1016/s0736-0266(02)00077-3 Ferrari A, Benedetti MG, Pavan E, Frigo C, Bettinelli D, Rabuffetti M, Crenna P, Leardini A (2008) Quantitative comparison of five current protocols in gait analysis. Gait Posture 28(2):207–216. doi:10.1016/j.gaitpost.2007.11.009 Fukaya T, Mutsuzaki H, Wadano Y (2013) Interrater reproducibility of knee movement analyses during the stance phase: use of anatomical landmark calibration with a rigid marker set. Rehabil Res Pract 2013:692624. doi:10.1155/2013/692624 Fuller J, Liu LJ, Murphy MC, Mann RW (1997) A comparison of lower-extremity skeletal kinematics measured using skin- and pin-mounted markers. Hum Mov Sci 16(2–3):219–242. doi:10.1016/S0167-9457(96)00053-X Gao B, Zheng NN (2008) Investigation of soft tissue movement during level walking: translations and rotations of skin markers. J Biomech 41(15):3189–3195. doi:10.1016/j. jbiomech.2008.08.028 Garling EH, Kaptein BL, Mertens B, Barendregt W, Veeger HE, Nelissen RG, Valstar ER (2007) Soft-tissue artefact assessment during step-up using fluoroscopy and skin-mounted markers. J Biomech 40(Suppl 1):S18–S24. doi:10.1016/j.jbiomech.2007.03.003 Gorton GE 3rd, Hebert DA, Gannotti ME (2009) Assessment of the kinematic variability among 12 motion analysis laboratories. Gait Posture 29(3):398–402. doi:10.1016/j. gaitpost.2008.10.060 Harrington ME, Zavatsky AB, Lawson SE, Yuan Z, Theologis TN (2007) Prediction of the hip joint centre in adults, children, and patients with cerebral palsy based on magnetic resonance imaging. J Biomech 40(3):595–602. doi:10.1016/j.jbiomech.2006.02.003 Heller MO, Kratzenstein S, Ehrig RM, Wassilew G, Duda GN, Taylor WR (2011) The weighted optimal common shape technique improves identification of the hip joint center of rotation in vivo. J Orthop Res 29(10):1470–1475. doi:10.1002/jor.21426 Holden JP, Stanhope SJ (1998) The effect of variation in knee center location estimates on net knee joint moments. Gait Posture 7(1):1–6 Holden JP, Orsini JA, Siegel KL, Kepple TM, Gerber LH, Stanhope SJ (1997) Surface movement errors in shank kinematics and knee kinetics during gait. Gait & Posture 5(3):217–227. doi:10.1016/S0966-6362(96)01088-0 Isman RE, Inman VT (1969) Anthropometric studies of the human foot and ankle. Bull Prosthet Res 10(11):97–219 Kadaba MP, Ramakrishnan HK, Wootten ME, Gainey J, Gorton G, Cochran GV (1989) Repeatability of kinematic, kinetic, and electromyographic data in normal adult gait. J Orthop Res 7 (6):849–860. doi:10.1002/jor.1100070611 Karlsson D, Tranberg R (1999) On skin movement artefact-resonant frequencies of skin markers attached to the leg. Hum Mov Sci 18(5):627–635. doi:10.1016/S0167-9457(99)00025-1 Kisho Fukuchi R, Arakaki C, Veras Orselli MI, Duarte M (2010) Evaluation of alternative technical markers for the pelvic coordinate system. J Biomech 43(3):592–594. doi:10.1016/j. 
jbiomech.2009.09.050 Kornaropoulos EI, Taylor WR, Duda GN, Ehrig RM, Matziolis G, Muller M, Wassilew G, Asbach P, Perka C, Heller MO (2010) Frontal plane alignment: an imageless method to predict the mechanical femoral-tibial angle (mFTA) based on functional determination of joint centres and axes. Gait Posture 31(2):204–208. doi:10.1016/j.gaitpost.2009.10.006
Kratzenstein S, Kornaropoulos EI, Ehrig RM, Heller MO, Popplau BM, Taylor WR (2012) Effective marker placement for functional identification of the centre of rotation at the hip. Gait Posture 36(3):482–486. doi:10.1016/j.gaitpost.2012.04.011 Krauss I, List R, Janssen P, Grau S, Horstmann T, Stacoff A (2012) Comparison of distinctive gait variables using two different biomechanical models for knee joint kinematics in subjects with knee osteoarthritis and healthy controls. Clin Biomech (Bristol, Avon) 27(3):281–286. doi:10.1016/j.clinbiomech.2011.09.013 Leardini A, Cappozzo A, Catani F, Toksvig-Larsen S, Petitto A, Sforza V, Cassanelli G, Giannini S (1999) Validation of a functional method for the estimation of hip joint centre location. J Biomech 32(1):99–103 Leardini A, Chiari L, Della Croce U, Cappozzo A (2005) Human movement analysis using stereophotogrammetry. Part 3. Soft tissue artifact assessment and compensation. Gait Posture 21(2):212–225. doi:10.1016/j.gaitpost.2004.05.002 Leardini A, Sawacha Z, Paolini G, Ingrosso S, Nativo R, Benedetti MG (2007) A new anatomically based protocol for gait analysis in children. Gait Posture 26(4):560–571. doi:10.1016/j. gaitpost.2006.12.018 Lucchetti L, Cappozzo A, Cappello A, Della Croce U (1998) Skin movement artefact assessment and compensation in the estimation of knee-joint kinematics. J Biomech 31(11):977–984 Luiz RR, Szklo M (2005) More than one statistical strategy to assess agreement of quantitative measurements may usefully be reported. J Clin Epidemiol 58(3):215–216. doi:10.1016/j. jclinepi.2004.07.007 Lundberg A, Svensson OK, Nemeth G, Selvik G (1989) The axis of rotation of the ankle joint. J Bone Joint Surg (Br) 71(1):94–99 Manal K, McClay I, Stanhope S, Richards J, Galinat B (2000) Comparison of surface mounted markers and attachment methods in estimating tibial rotations during walking: an in vivo study. Gait Posture 11(1):38–45 Manal K, McClay I, Richards J, Galinat B, Stanhope S (2002) Knee moment profiles during walking: errors due to soft tissue movement of the shank and the influence of the reference coordinate system. Gait Posture 15(1):10–17 Manal K, McClay Davis I, Galinat B, Stanhope S (2003) The accuracy of estimating proximal tibial translation during natural cadence walking: bone vs. skin mounted targets. Clin Biomech (Bristol, Avon) 18(2):126–131 McGinley JL, Baker R, Wolfe R, Morris ME (2009) The reliability of three-dimensional kinematic gait measurements: a systematic review. Gait Posture 29(3):360–369. doi:10.1016/j. gaitpost.2008.09.003 McMulkin ML, Gordon AB (2009) The effect of static standing posture on dynamic walking kinematics: comparison of a thigh wand versus a patella marker. Gait Posture 30(3):375–378. doi:10.1016/j.gaitpost.2009.06.010 Miana AN, Prudencio MV, Barros RM (2009) Comparison of protocols for walking and running kinematics based on skin surface markers and rigid clusters of markers. Int J Sports Med 30 (11):827–833. doi:10.1055/s-0029-1234054 Most E, Axe J, Rubash H, Li G (2004) Sensitivity of the knee joint kinematics calculation to selection of flexion axes. J Biomech 37(11):1743–1748. doi:10.1016/j.jbiomech.2004.01.025 Noonan KJ, Halliday S, Browne R, O’Brien S, Kayes K, Feinberg J (2003) Interobserver variability of gait analysis in patients with cerebral palsy. J Pediatr Orthop 23(3):279–287 discussion 288-291 Perry J, Burnfield JM (2010) Gait Analysis. Normal and pathological function, 2nd edn. 
SLACK Incorporated, Thorofare Piazza SJ, Cavanagh PR (2000) Measurement of the screw-home motion of the knee is sensitive to errors in axis alignment. J Biomech 33(8):1029–1034 Reinschmidt C, van den Bogert AJ, Lundberg A, Nigg BM, Murphy N, Stacoff A, Stano A (1997a) Tibiofemoral and tibiocalcaneal motion during walking: external vs. skeletal markers. Gait Posture 6(2):98–109. doi:10.1016/S0966-6362(97)01110-7
Reinschmidt C, van den Bogert AJ, Nigg BM, Lundberg A, Murphy N (1997b) Effect of skin movement on the analysis of skeletal knee joint motion during running. J Biomech 30 (7):729–732 Richards J (2008) Biomechanics in clinic and research. Elsevier, Philadelphia Sandau M, Heimburger RV, Villa C, Jensen KE, Moeslund TB, Aanaes H, Alkjaer T, Simonsen EB (2015) New equations to calculate 3D joint centres in the lower extremities. Med Eng Phys 37 (10):948–955. doi:10.1016/j.medengphy.2015.07.001 Sangeux M, Peters A, Baker R (2011) Hip joint centre localization: evaluation on normal subjects in the context of gait analysis. Gait Posture 34(3):324–328. doi:10.1016/j.gaitpost.2011.05.019 Schache AG, Baker R, Lamoreux LW (2008) Influence of thigh cluster configuration on the estimation of hip axial rotation. Gait Posture 27(1):60–69. doi:10.1016/j.gaitpost.2007.01.002 Schulz BW, Kimmel WL (2010) Can hip and knee kinematics be improved by eliminating thigh markers? Clin Biomech (Bristol, Avon) 25(7):687–692. doi:10.1016/j.clinbiomech.2010.04.002 Schwartz MH, Rozumalski A (2005) A new method for estimating joint parameters from motion data. J Biomech 38(1):107–116. doi:10.1016/j.jbiomech.2004.03.009 Schwartz MH, Trost JP, Wervey RA (2004) Measurement and management of errors in quantitative gait data. Gait Posture 20(2):196–203. doi:10.1016/j.gaitpost.2003.09.011 Serfling DM, Hooke AW, Bernhardt KA, Kaufman KR, 2009 Comparison of techniques for finding the knee joint center. In: Proceedings of the gait and clinical movement analysis society. p 43 Stagni R, Fantozzi S, Cappello A, Leardini A (2005) Quantification of soft tissue artefact in motion analysis by combining 3D fluoroscopy and stereophotogrammetry: a study on two subjects. Clin Biomech (Bristol, Avon) 20(3):320–329. doi:10.1016/j.clinbiomech.2004.11.012 Stief F, Bohm H, Michel K, Schwirtz A, Doderlein L (2013) Reliability and accuracy in threedimensional gait analysis: a comparison of two lower body protocols. J Appl Biomech 29 (1):105–111 Szczerbik E, Kalinowska M (2011) The influence of knee marker placement error on evaluation of gait kinematic parameters. Acta Bioeng Biomech 13(3):43–46 Taylor WR, Ehrig RM, Duda GN, Schell H, Seebeck P, Heller MO (2005) On the influence of soft tissue coverage in the determination of bone kinematics using skin markers. J Orthop Res 23 (4):726–734. doi:10.1016/j.orthres.2005.02.006 Taylor WR, Kornaropoulos EI, Duda GN, Kratzenstein S, Ehrig RM, Arampatzis A, Heller MO (2010) Repeatability and reproducibility of OSSCA, a functional approach for assessing the kinematics of the lower limb. Gait Posture 32(2):231–236. doi:10.1016/j.gaitpost.2010.05.005 Vaughan CL, Davis BL, O’Conner JC (1992) Dynamics of human gait. Human Kinetics Publishers, Champaign Weidow J, Tranberg R, Saari T, Karrholm J (2006) Hip and knee joint rotations differ between patients with medial and lateral knee osteoarthritis: gait analysis of 30 patients and 15 controls. J Orthop Res 24(9):1890–1899. doi:10.1002/jor.20194 Wren TA, Do KP, Hara R, Rethlefsen SA (2008) Use of a patella marker to improve tracking of dynamic hip rotation range of motion. Gait Posture 27(3):530–534. doi:10.1016/j. gaitpost.2007.07.006 Wu G, Siegler S, Allard P, Kirtley C, Leardini A, Rosenbaum D, Whittle M, D’Lima DD, Cristofolini L, Witte H, Schmid O, Stokes I (2002) ISB recommendation on definitions of joint coordinate system of various joints for the reporting of human joint motion – part I: ankle, hip, and spine. International Society of Biomechanics. 
J Biomech 35(4):543–548
Next-Generation Models Using Optimized Joint Center Location
Ayman Assi, Wafa Skalli, and Ismat Ghanem
Contents
State of the Art
Motion Analysis
Purpose
Motion Capture Techniques
Joint Kinematics and Kinetics
Hip Joint Center
Predictive Methods
Functional Methods
Knee Joint Center
Ankle Joint Center
Glenohumeral Joint Center
Validation of the Joint Center Localization Methods
X-Rays and Stereophotogrammetry
Magnetic Resonance Imaging
3D Ultrasound
Low-Dose Biplanar X-Rays
Effect of Errors on JC Localization
Errors on Kinematics and Kinetics
Errors on Musculoskeletal Simulations
Future Directions
Correction of 3D Positioning of the JC
Registration Techniques for the Use of Exact Joint Center Location
Estimation from External Information
References
A. Assi (*) Laboratory of Biomechanics and Medical Imaging, Faculty of Medicine, University of Saint-Joseph, Mar Mikhael, Beirut, Lebanon Institut de Biomécanique Humaine Georges Charpak, Arts et Métiers ParisTech, Paris, France e-mail: [email protected] W. Skalli Institut de Biomécanique Humaine Georges Charpak, Arts et Métiers ParisTech, Paris, France e-mail: [email protected] I. Ghanem Laboratory of Biomechanics and Medical Imaging, Faculty of Medicine, University of Saint-Joseph, Mar Mikhael, Beirut, Lebanon Hôtel-Dieu de France Hospital, University of Saint-Joseph, Beirut, Lebanon e-mail: [email protected] # Springer International Publishing AG 2016 B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_27-1
Abstract
Joint center location is essential in order to define the anatomical axes of skeletal segments and is therefore clinically significant for the calculation of joint kinematics during motion analysis. Different methods exist to localize joint centers, using either predictive methods, based on anthropometric measurements, or functional methods, based on the relative movement of the segments adjacent to the joint. Validations of these methods using medical imaging have been extensively studied in the literature on different groups of subjects. Consequently, methods to correct the calculated location of the joint center toward the exact one, found by medical imaging, have been suggested by several authors. Recent studies showed that new age-specific predictive methods could be computed in order to better locate joint coordinate systems. In the future, new techniques could use the exact locations of joint centers, localized by medical imaging, in combination with motion capture through registration techniques; thus, exact kinematics and kinetics of the joints could be computed.

Keywords
Joint center • Predictive • Functional • Medical imaging • Validation
State of the Art
Joint center location is essential in order to calculate anatomical 3D joint kinematics, kinetics, and muscle lever arms during musculoskeletal simulations. Several methods can be used in order to localize the joint center. These methods can be either predictive or functional. While the predictive methods are mainly based on regression equations that use anthropometric measurements, the functional ones require the performance of a joint movement in order to estimate the center of rotation between the two adjacent segments. Most of the software packages used with motion capture systems are based on the predictive methods. These methods have the advantage of being more rapid and easier to use, especially on subjects with disabilities who require assistance in performing the ranges of motion needed for the functional methods. Several authors have attempted to compare different methods or to validate these methods using 3D medical imaging as a gold standard. The CT scan has been used as a validation method since it is known to be precise for 3D reconstruction; however, its use is problematic because of the high dose of
radiation which it entails. Magnetic resonance imaging could solve the problem of radiation, but this technique is known to be time-consuming because of the acquisition time and the necessity of image segmentation during the post-processing phase. More recently, new techniques, using 3D ultrasound to obtain the joint center location, have been explored and were found to have a precision of 4 mm (Peters et al. 2010). The major inconvenience of this technique is the calibration needed prior to acquisition. Moreover, the skeletal segments, joint center, and external markers needed for motion analysis cannot be captured in the same image. In the past years, the low-dose biplanar X-ray technique has shown very high potential in the clinical diagnosis of skeletal deformities through accurate 3D reconstructions of the spine and the lower limbs (Dubousset et al. 2005; Humbert et al. 2009; Chaibi et al. 2012). In addition to this feature, it allows 3D reconstruction of points of interest, such as joint centers. This technique has been shown to be precise (2.9 mm) in localizing external markers and joint centers that appear in the same image along with the skeletal segments. This technique was recently used in the validation of joint center localization techniques which are commonly used in the literature (Sangeux et al. 2014; Assi et al. 2016). The validation studies have shown that the functional methods are more precise than the predictive methods in localizing the joint center in the adult population. However, special attention should be given to the range of motion performed by the subject during the calibration trial (i.e., flexion-extension, abduction-adduction, and internal-external rotation should be > 30°). It was surprising to find that this was not the case for children, where the predictive methods were found to be more precise than the functional ones (Peters et al. 2012; Assi et al. 2016). This chapter will first discuss the need for joint center localization in motion analysis and then the current techniques of joint center localization that can be used during motion analysis processing. The validation process of these techniques will be reviewed at a later stage, along with the effect of errors in joint center placement on kinematics, kinetics, and model simulations. Future directions will be discussed at the end of this chapter.
Motion Analysis

Purpose
Medical imaging is widely used in the diagnosis of musculoskeletal diseases through the use of images of human body anatomy. While different medical images can be used in order to visualize the musculoskeletal system, such as X-rays, CT scan, or MRI, these modalities only allow images in a static position. Apparent dynamic images can be obtained when the patient or subject is asked to perform a certain motion (e.g., of the shoulder) and then hold still in a given position during image acquisition. Thereafter, images are collected at different joint positions in order to obtain pseudo-dynamic images. The same images can be obtained using fluoroscopy, but
this technique is known to be highly irradiating (since it is essentially an X-ray video) and has small image dimensions. The technique of motion analysis has been widely used since the early 1990s in order to assess the joint motions of the musculoskeletal system, especially in patients with orthopedic disorders such as cerebral palsy (Gage 1993). This technique is based on the 3D reconstruction by stereophotogrammetry of external markers positioned on the skin of a subject (Cappozzo et al. 2005).
Motion Capture Techniques
Different motion capture techniques exist (check section on “Methods and Models: Dynamic Pose Estimation”). In this chapter, we will be focusing on infrared wave-based systems. The markers that are fixed on the subject’s skin can be either active or passive. Active markers send waves to the cameras fixed in the acquisition room, whereas passive markers only reflect waves back to the transmitting-receiving cameras. These cameras send their waves at a high frequency (i.e., 50 Hz and more) in order to reconstruct the movement.
Joint Kinematics and Kinetics
Since a 3D coordinate system can be obtained from three non-collinear points, marker placement on the skin respects this rule by placing at least three markers on each skeletal segment. A local coordinate system is then calculated for each skeletal segment at each frame of the movement. These local coordinate systems are expressed in a global coordinate system defined in the acquisition room during a calibration process performed prior to the motion trials (Cappozzo et al. 1995, 2005) (check section on “Methods and Models: Data Analysis”). In a second step, the angles between adjacent local coordinate systems are calculated by applying either the Euler method or the Cardan method, which require the specification of the axis sequence. A consensus on joint angle definitions was set by the International Society of Biomechanics (Wu et al. 2002, 2005). An illustration of motion capture and kinematic curves is presented in Fig. 1.
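As a minimal illustration of these two steps, the sketch below builds a local frame from three markers and decomposes the relative rotation of two adjacent frames with one Cardan sequence (the axis conventions, names, and the chosen x-y'-z'' sequence are illustrative assumptions, not the ISB definitions):

```python
import numpy as np

def segment_frame(p1, p2, p3):
    """Right-handed local coordinate system from three non-collinear markers.
    Returns a 3x3 rotation matrix (columns = x, y, z axes in global coordinates);
    the origin would be p1. Axis conventions are illustrative only."""
    p1, p2, p3 = (np.asarray(p, float) for p in (p1, p2, p3))
    x = p2 - p1
    x /= np.linalg.norm(x)
    z = np.cross(x, p3 - p1)
    z /= np.linalg.norm(z)
    y = np.cross(z, x)
    return np.column_stack((x, y, z))

def cardan_angles(R_proximal, R_distal):
    """Joint angles (degrees) from the rotation of the distal relative to the
    proximal segment, decomposed with an x-y'-z'' Cardan sequence."""
    R = R_proximal.T @ R_distal
    a = np.degrees(np.arctan2(-R[1, 2], R[2, 2]))           # about x
    b = np.degrees(np.arcsin(np.clip(R[0, 2], -1.0, 1.0)))  # about y'
    c = np.degrees(np.arctan2(-R[0, 1], R[0, 0]))           # about z''
    return a, b, c
```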
Need for a Joint Center
The markers are usually placed on palpable bony landmarks, and this is for two main reasons: (1) to avoid displacement of the marker during motion because of the underlying soft-tissue movement and (2) to ensure repeatability of marker placement between operators and between subjects. Moreover, the skeletal coordinate system obtained from these reflective markers should be anatomically relevant in order to obtain anatomical angles between adjacent segments; e.g., a local coordinate system of the humerus should have one axis that represents the diaphysis, a second axis related to the line joining the two epicondyles, and a third axis derived from the other two.
Fig. 1 Motion analysis: (a) subject equipped with reflective markers, (b) 3D capture of subject’s movement, (c) example of kinematic curves
In some cases, a rigid body on which three or more markers are fixed, called a cluster, is attached to the segment. This method serves to reduce soft-tissue artifacts. The local coordinate system obtained from a cluster is not anatomically representative of the segment. Thus, the performance of a static calibration trial, in which a transformation matrix is calculated between the anatomical and cluster coordinate systems, is recommended prior to the movement trials (a minimal sketch of this calibration step is given at the end of this subsection).

Since the local coordinate system of a skeletal segment should be anatomically relevant and representative of the geometry of the bone, in some cases the placement of a marker on the joint center is required, i.e., the hip joint center for the femoral segment, the knee joint center for the tibial segment, and the glenohumeral joint center for the humeral segment. Since it is impossible to place a marker in the joint center, several techniques exist to approximate the location of this point.

The kinematics calculated during gait analysis usually include nine graphs. These graphs represent the joint angular waveforms during a gait cycle in the three planes: sagittal, frontal, and horizontal. Six of these nine graphs, the hip and the knee joints in the three planes each, are based on the use of the local coordinate system of the femur, which necessitates hip joint center localization. Thus, the hip joint center is one of the most important joint centers to be localized in gait analysis. In the following section, we discuss the hip joint center in detail, followed by the methods used for other joints.
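The static cluster-to-anatomy calibration mentioned above can be sketched as follows (rotation part only, with function names as assumptions; in practice the origin offset is stored and reapplied in the same way):

```python
import numpy as np

def cluster_calibration(R_cluster_static, R_anatomical_static):
    """Static trial: constant rotation taking the technical (cluster) frame to the
    anatomical frame, both given as 3x3 rotation matrices in global coordinates."""
    return R_cluster_static.T @ R_anatomical_static

def anatomical_frame(R_cluster_dynamic, R_calibration):
    """Movement trial: recover the anatomical frame from the tracked cluster frame,
    assuming the cluster stays rigidly attached to the segment."""
    return R_cluster_dynamic @ R_calibration
```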
Hip Joint Center
Different methods are available for hip joint center localization. These methods can be either predictive, based on cadaveric or anthropometric measurements, or functional, based on the movement of a segment relative to the adjacent one.
Predictive Methods
The predictive methods usually use anthropometric regression-based equations established from cadaveric specimens or from in vivo medical imaging of the skeletal segments. Some examples of predictive methods are presented in this section (more information can be found in previous chapters in the section “Medical Application: Assessment of Kinematics”).
The Davis Method
This is the most used model both in the literature and in many laboratories around the world, since it is implemented in most motion analysis software packages (Davis et al. 1991). It is based on the examination of 25 hips. The predictive equations use the inter-ASIS (anterior superior iliac spine) distance, the anteroposterior distance between the ASIS, the marker radius, and the leg length. Thus, the location of the hip joint center is obtained in the pelvic coordinate system.
Fig. 2 Example of predictors (pelvic depth, pelvic width, and lower limb length) used in regression equations to localize the hip joint center
The Bell Method
In 1982, Tylkowski defined a method for predicting the hip joint center in the lateral plane, while Andriacchi predicted the position of the hip joint center in the frontal plane (Tylkowski et al. 1982; Andriacchi and Strickland 1985). Bell et al. used these two techniques to develop a new predictive method to localize the hip joint center (Bell et al. 1989). The method was based on anterior-posterior (AP) radiographs of children and adults; however, validation was only performed using AP and lateral radiographs of dry adult pelvises.

The Harrington Method
In 2007, Harrington presented an image-based validation, using MRI, of previous predictive methods (Davis, Bell, and the method implemented in motion analysis software) together with a new predictive method (Harrington et al. 2007). The new method is derived from 8 adults, 14 healthy children, and 10 children with cerebral palsy (CP). The hip joint center location in a pelvic coordinate system was found by fitting a sphere to points identified on the femoral head. The new predictive method was based on pelvic width, pelvic depth, and leg length (Fig. 2).
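The mechanics of applying such regression equations can be sketched as follows: build a pelvic coordinate system from the ASIS and sacral markers, then add offsets that are linear functions of the predictors shown in Fig. 2. All names, axis conventions, and especially the numerical coefficients below are placeholders for illustration; they are not the published Davis, Bell, or Harrington coefficients.

```python
import numpy as np

def pelvic_frame(lasi, rasi, sacr):
    """Pelvic coordinate system sketch: origin midway between the ASIS markers,
    z to the right along the inter-ASIS line, x pointing anteriorly (away from
    the sacral marker), y superior. Axis conventions are an assumption."""
    lasi, rasi, sacr = (np.asarray(p, float) for p in (lasi, rasi, sacr))
    origin = 0.5 * (lasi + rasi)
    z = rasi - lasi
    z /= np.linalg.norm(z)
    x = origin - sacr
    x -= (x @ z) * z          # make x orthogonal to z
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return origin, np.column_stack((x, y, z))

def predictive_hjc(origin, R, pelvic_width, pelvic_depth, leg_length, side="right"):
    """Regression-style HJC prediction expressed in the pelvic frame.
    The coefficients below are PLACEHOLDERS for illustration only; use the
    published regression equations in practice."""
    a, b, c = 0.1, 0.2, 0.3   # placeholder regression coefficients
    x_off = -a * pelvic_depth                              # posterior offset
    y_off = -b * leg_length                                # inferior offset
    z_off = (+1 if side == "right" else -1) * c * pelvic_width  # lateral offset
    return origin + R @ np.array([x_off, y_off, z_off])
```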
Functional Methods
The functional methods derive from the movement of a skeletal segment relative to the adjacent one. The subject is asked to perform a movement of the joint; i.e., for the hip joint, a movement of flexion-extension and/or abduction-adduction is performed. Since the hip is approximated as a ball-and-socket joint, each marker attached to the moving segment (i.e., the thigh) moves on the surface of a sphere whose center is the center of the hip joint. Thus, the joint center is assumed to be the center of rotation (CoR) defined by the movement between two adjacent segments. The first to describe this technique was Cappozzo in 1984 (Cappozzo 1984). Several algorithms or mathematical techniques for computing the CoR exist; the most common ones are discussed in the following section.
Sphere Fit Methods
This technique assumes that the CoR is stationary; this assumption can hold if one segment (segment 1) is at rest. The markers on the adjacent segment (segment 2) move on the surfaces of spheres with specific radii around one common CoR. The frequent approach uses the minimization of the sum of the squared Euclidean distances between the sphere and the marker positions. The selected cost function determines whether the optimal solution can be calculated exactly or only approximately by successive iterative steps toward the optimal CoR. Some authors use the least squares method, which gives an exact CoR estimation (Pratt 1987; Gamage and Lasenby 2002). An example is displayed in Fig. 3. Alternative approaches are iterative (Halvorsen 2003). In the geometric method, an initial guess of the CoR is required, whereas the algebraic methods do not require a starting estimate. The major disadvantage of these techniques lies in the possible convergence to local minima of the cost function and in their poor accuracy in estimating the CoR when a reduced ROM is performed. Chang et al. (2007) proposed a new numerical sphere fit technique that can be used for reduced ROM.

Center Transformation Technique
The center transformation technique (CTT) assumes that at least three markers are present on the moving segment; it is then possible to define a rigid-body transformation (rotations and translations) which transforms a given reference marker configuration at one frame into another frame (Piazza et al. 2004). The appropriate transformation of these local systems for all time frames into a common reference system enables the approximation of the joint center at a fixed position. Another approach, called the two-sided approach, which does not require the assumption of a stationary CoR, can alternatively be used (Schwartz and Rozumalski 2005).

Score Technique
This algorithm is a continuation of the CTT method, with the assumption that the coordinates of the CoR must remain constant relative to both segments, without requiring that one segment remain at rest (Ehrig et al. 2006).
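A minimal sketch of the algebraic (least squares) sphere fit is given below; fitting one sphere per thigh marker and averaging the centres mirrors the construction illustrated in Fig. 3, but the function names and the averaging step are assumptions, not a specific published algorithm:

```python
import numpy as np

def fit_sphere(points):
    """Algebraic least-squares sphere fit to one marker trajectory (Nx3 array of
    positions expressed in the frame of the non-moving segment).
    Returns (centre, radius)."""
    P = np.asarray(points, float)
    A = np.column_stack((2.0 * P, np.ones(len(P))))
    b = np.sum(P**2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    centre, k = sol[:3], sol[3]
    radius = np.sqrt(k + centre @ centre)
    return centre, radius

def functional_cor(trajectories):
    """Functional centre of rotation: fit one sphere per marker trajectory and
    average the sphere centres (cf. Fig. 3)."""
    centres = np.array([fit_sphere(t)[0] for t in trajectories])
    return centres.mean(axis=0)
```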
Fig. 3 Sphere fit method: trajectories of four different markers (each marker with a different color) during calibration trials. The star points represent the center of the spheres. The black star point represents the middle of the different points
Calibration Trial
While different movements can be performed, especially in a ball-and-socket joint (i.e., the hip or the glenohumeral joint), such as flexion-extension, abduction-adduction, internal-external rotation, or circumduction, Camomilla et al. showed that the movement that gives the most accurate results for calculating the hip joint center using the functional technique is the star-arc movement (Camomilla et al. 2006). This calibration trial consists of several flexion-extension/abduction-adduction movements performed on vertical planes of different orientations, followed by a circumduction movement.

Uncertainties Related to Low ROM
It has been shown that the errors in the localization of a joint center increase when the range of motion (ROM) performed during the calibration trial decreases. In Ehrig et al.'s study, the CTT, score, and sphere fit approaches were tested in a simulation model while noise was added to the markers (Ehrig et al. 2006). The RMS distance between the calculated CoR and the exact one was calculated. RMS errors decreased exponentially with increasing ROM. Theoretical accuracy on the position of the CoR was within 1 cm using all approaches when the ROM increased beyond 20°. Accuracy was within 0.3 cm as long as the ROM of the joint was 45° or more. Some of these simulations have taken into account the skin-marker movement, which is an additional source of uncertainty.
Therefore, some authors prefer to assist the patients/subjects when they are performing the functional calibration movement, in order to make sure that the ROM is adequate for functional localization of the joint center (Peters et al. 2012). Piazza in 2001 used a mechanical model to simulate the errors in the localization of the hip joint center using functional methods (Piazza et al. 2001). Significant increases in the magnitude of HJC location errors (4–9 mm) were noted when the range of hip motion was reduced from 30° to 15°. The same result was found by Camomilla et al. (2006): the accuracy of the HJC estimate improved, at an increasing rate, as a function of the amplitude of the movements performed at the hip.
Knee Joint Center
In the particular case of the knee, a center of rotation is usually calculated with the predictive method, using external markers located on the condyles. The Davis model, presented above, also predicts the knee joint center, which is approximated as lying in the middle of the knee width (measured during clinical examination), from the lateral to the medial condyle, and defined relative to the thigh marker (Davis et al. 1991). In the functional method, a knee axis is usually calculated that represents the complexity of the flexion-extension movement. Knee flexion is a combination of the femoral condyles rolling (rotation) over the tibial plateau and the posterior gliding (translation) of the condyles along the plateau (Ramsey and Wretenberg 1999). Two mathematical methods are usually used to calculate the axis of rotation (AoR) of the knee. The first method fits cylindrical arcs to the moving segment, while assuming that the adjacent segment is at rest (Gamage and Lasenby 2002; Halvorsen 2003). The second method is based on the transformation techniques (CTT) presented above, where the helical axes technique is used, based on the work of Woltring et al. (1985). More recently, another algorithm for the localization of the AoR was presented by Ehrig et al. (2007); the symmetrical axis of rotation approach determines a unique axis of rotation that considers the movement of two dynamic segments simultaneously.
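As a rough sketch of the first (arc-fitting) idea, the axis can be estimated from the near-circular trajectories that the shank markers trace in the thigh-fixed frame; the plane-plus-circle fit below is a simplified stand-in (names and the averaging across markers are assumptions), not the cited cylinder-fit, helical-axis, or symmetrical-axis algorithms:

```python
import numpy as np

def fit_axis_of_rotation(trajectories):
    """For each marker trajectory (Nx3, expressed in the frame of the segment
    assumed at rest), fit a plane (its normal approximates the axis direction)
    and a circle in that plane (its centre is a point near the axis), then
    average over markers. Returns (unit direction, point on axis)."""
    directions, points = [], []
    for traj in trajectories:
        P = np.asarray(traj, float)
        c = P.mean(axis=0)
        # plane normal = direction of least variance of the centred trajectory
        _, _, vt = np.linalg.svd(P - c)
        n, u, v = vt[-1], vt[0], vt[1]
        # project onto the plane and fit a circle algebraically (2D sphere fit)
        xy = np.column_stack(((P - c) @ u, (P - c) @ v))
        A = np.column_stack((2.0 * xy, np.ones(len(xy))))
        b = np.sum(xy**2, axis=1)
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        centre2d = sol[:2]
        directions.append(n)
        points.append(c + centre2d[0] * u + centre2d[1] * v)
    # normals are only defined up to sign, so align them before averaging
    d0 = directions[0]
    d = np.mean([di if di @ d0 > 0 else -di for di in directions], axis=0)
    return d / np.linalg.norm(d), np.mean(points, axis=0)
```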
Ankle Joint Center
The most frequently used methods to localize the ankle joint center are predictive, using external markers located on the malleoli. The Davis method (Davis et al. 1991) is the most common. A strategy similar to the knee localization method is applied to obtain the ankle joint center. In the Davis model, which is the most commonly used for clinical gait analysis, the foot is represented as a single segment, and only ankle joint motion is quantified. In order to quantify the dynamic adaptability of the different foot segments, several
models have been described. The most commonly used is the Oxford foot model, where three segments are defined in the foot (hindfoot, forefoot, hallux) in addition to the tibial segment (Stebbins et al. 2006). Other models described in the literature use four segments, such as the Leardini model (calcaneus, midfoot, first metatarsal, and hallux) (Leardini et al. 1999b) and the Jenkyn model (hindfoot, midfoot, medial forefoot, lateral forefoot) (Jenkyn and Nicol 2007). Further information on this topic can be found in the preceding chapter (▶ Variations of Marker-Sets and Models for Standard Gait Analysis).
Glenohumeral Joint Center

Different predictive and functional methods exist to localize the glenohumeral joint center. In the predictive Meskers method, a linear regression is used to predict the glenohumeral joint center from specific points on the scapula, the acromioclavicular joint, and the coracoid process (Meskers et al. 1997). This method was developed by digitizing 36 sets of cadaver scapulae and adjacent humeri. The functional methods are based on the movement of the humerus relative to the scapula or the thorax; the same algorithms as those used for the hip joint center can be applied. The same finding as for the hip joint center was also reported regarding a low ROM performed during the calibration trial: Lempereur et al. showed that a high amplitude of movement (>60°) should be performed in order to improve reliability when functional methods are used for the localization of the glenohumeral joint center (Lempereur et al. 2011). Further information on this topic can be found in the chapter on “▶ Upper Extremity Models for Clinical Movement Analysis.”
Validation of the Joint Center Localization Methods

Several authors have attempted to assess the accuracy of both predictive and functional methods, using the joint center obtained by medical imaging as a gold standard. The technique consists of obtaining the joint center in 3D, expressed in the local coordinate system of the adjacent segment; the latter is built from the external markers placed on the skin. The joint centers calculated through predictive and functional methods are also expressed in the same local coordinate system of the adjacent segment. Thus, once all calculated joint centers are expressed in the same coordinate system, the distance from each joint center to the gold standard can be calculated and a comparison between methods can be performed.
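Computationally, this comparison reduces to building a marker-based segment frame, expressing each joint center estimate in it, and taking Euclidean distances. The sketch below does this for a hip joint center in a pelvic frame; the landmark choice, axis convention, and numerical values are hypothetical and do not reproduce any specific protocol.

```python
import numpy as np

def pelvis_frame(rasis, lasis, sacrum):
    """Segment-fixed frame from three pelvic landmarks.

    Returns (origin, R), with the columns of R being the unit axes expressed
    in the laboratory frame. The axis convention here is only illustrative.
    """
    origin = 0.5 * (rasis + lasis)
    z = rasis - lasis                      # roughly medio-lateral
    z /= np.linalg.norm(z)
    temp = origin - sacrum                 # roughly anterior
    x = temp - (temp @ z) * z              # anterior, orthogonal to z
    x /= np.linalg.norm(x)
    y = np.cross(z, x)                     # completes the right-handed frame
    return origin, np.column_stack([x, y, z])

def to_local(point, origin, R):
    """Express a laboratory-frame point in the segment-fixed frame."""
    return R.T @ (point - origin)

# Hypothetical static-trial landmark positions (meters, laboratory frame).
rasis, lasis = np.array([0.12, 0.0, 0.95]), np.array([-0.12, 0.0, 0.95])
sacrum = np.array([0.0, -0.15, 0.97])
origin, R = pelvis_frame(rasis, lasis, sacrum)

hjc_imaging = to_local(np.array([0.080, -0.020, 0.880]), origin, R)  # gold standard
hjc_method = to_local(np.array([0.085, -0.035, 0.885]), origin, R)   # e.g., a regression estimate
print(f"distance to gold standard: {np.linalg.norm(hjc_method - hjc_imaging) * 1000:.1f} mm")
```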
X-Rays and Stereophotogrammetry

The method developed by Bell et al. in 1989 was based on the use of digitized AP radiographs, with localization of specific bony landmarks on the radiograph as well
as digitization of the center of a circle matching the size of the femoral head (Bell et al. 1989). In 1990, Bell et al. were the first to use pairs of orthogonal radiographs (Bell et al. 1990); by knowing the exact distances between the X-ray sources and the film cassette locations, it was possible to estimate the 3D location of the bony landmarks and the pelvic skin markers. The accuracy of HJC localization was thus assessed for methods such as the functional method described by Cappozzo in 1984 as well as the predictive methods described by Tylkowski and Andriacchi (Tylkowski et al. 1982; Andriacchi and Strickland 1983; Cappozzo 1984). In another study, Leardini et al. assessed the validity of functional and predictive methods in calculating the HJC on 11 healthy adults (Leardini et al. 1999a). The average root mean square (RMS) distance to the gold standard was 25–30 mm for the predictive methods and 13 mm for the functional method. The major disadvantage of the stereoradiography technique is that it exposes the patient to radiation.
Magnetic Resonance Imaging

In a study led by Harrington et al., MRI was used as the gold standard for obtaining the hip joint center in a population of healthy adults, healthy children, and children with cerebral palsy (Harrington et al. 2007). The validity of existing predictive methods was assessed, in addition to a new method presented by the authors. In a study performed by Lempereur et al., the authors used MRI acquisitions to validate several functional methods for the localization of the glenohumeral JC (Lempereur et al. 2010). This technique required covering the scapula with 120 reflective markers in order to match the surface of the scapula obtained by motion capture with the MRI reconstruction. The major disadvantages of the MRI technique are the acquisition time and the image-processing time, which is increased by the need for manual segmentation.
3D Ultrasound

The ultrasound (US) technique has been widely used to validate JC localization methods since it involves no radiation and is easier to perform than MRI. However, the US method requires a calibration process in order to obtain 3D reconstructions of the JC. In a study performed by Peters et al., the authors described the calibration process required to obtain 3D US reconstructions (Peters et al. 2010). The repeatability of the technique was assessed, as well as the accuracy of the localization of a reference object within a water basin; the accuracy was about 4 ± 2 mm. After the validation of this technique, the same authors performed different studies on the validation of both predictive and functional HJC localization techniques in adults (Sangeux et al. 2011) and in both typically
developing (TD) children and children with cerebral palsy (CP) (Peters et al. 2012). In the study on adults, it was shown that the functional method, and more precisely the geometric sphere-fitting method, was the most accurate in localizing the HJC (mean absolute distance error of about 15 mm), followed by the Harrington predictive method. In the study on TD and CP children, the Harrington method was the closest to the 3D US technique (14 ± 8 mm), whereas the functional techniques performed much worse (22–33 mm). It should be noted that the functional calibration trials of the hip had been assisted by an external operator. The 3D ultrasound technique has also been used for the localization of the glenohumeral joint (Lempereur et al. 2013).
Low-Dose Biplanar X-Rays

More recently, the low-dose biplanar X-ray technique (Dubousset et al. 2005; Humbert et al. 2009; Chaibi et al. 2012) has been applied to validate JC localization techniques, with the EOS system used as the image-based reference. The localization of external markers was reliable within 0.15 mm for trained operators, and the mean accuracy for HJC localization was 2.9 ± 1.3 mm (Pillet et al. 2014), even lower than the values obtained with the 3D US method. The EOS system allows the acquisition, in the same image, of the external markers, the skeletal segments, and the joint centers. Thus, a joint center can be located directly in the local coordinate system of the adjacent segment, based on the location of the external markers (Fig. 4). The EOS system was used to compare the accuracy of several predictive and functional techniques in localizing the HJC in healthy adults (Sangeux et al. 2014). Different scenarios were applied when the functional methods were assessed: different algorithms, different ranges of hip motion (30°), and self-performed or assisted movements. The best results were obtained for a comfortable ROM self-performed by the subjects, and the best method was the functional geometric sphere-fitting method, which localized the hips 1.1 cm from the EOS reference. The worst results were obtained for the functional methods when the ROM was reduced; in the latter case, the best method was the Harrington predictive method, which localized the HJC at 1.7 cm from the EOS reference. In a more recent study, the EOS system was used to evaluate the accuracy of both predictive and functional methods in TD and CP children (Assi et al. 2016). Contrary to the findings in adults, the functional methods performed much worse (>60 mm) than the predictive methods, among which the Harrington method showed the best results (18 ± 9 mm). The authors explained the differences in results between adults and children by the shorter length of the thigh segment in children, which could increase the noise when the algorithms of the functional methods are applied to locate the CoR. It was also shown that children with CP performed significantly lower hip ROM during calibration than TD children; however, the average ROM in both groups was >30°, and the ROM was not a confounding factor for the errors in the HJC calculated by the functional methods.
Fig. 4 Frontal and lateral X-rays of the lower limbs obtained by low-dose biplanar X-rays, with the external markers fixed on the skin as well as the 3D reconstruction of the femur and the hip joint center, expressed in the local coordinate system of the pelvis
In a study performed by Lempereur et al., different functional methods as well as the 3D ultrasound technique were compared in localizing the glenohumeral joint center relative to the one obtained by the 3D EOS reconstruction, considered as the reference (Lempereur et al. 2013). The 3D ultrasound technique placed the glenohumeral joint center at 14 mm from the EOS image-based reference, while the functional methods varied from 15.4 mm (the helical axis method) to 34 mm using iterative methods (Halvorsen 2003).
Effect of Errors on JC Localization

Errors on Kinematics and Kinetics

Misplacement of the joint centers can distort the kinematics and kinetics of the hip and knee in gait analysis, since the thigh local coordinate system is affected. The effects of hip JC misplacement on gait analysis were studied by Stagni et al. (2000), who found errors in the joint moments that can reach 22% in
flexion-extension and 15% in abduction-adduction, with a delay in the flexion-to-extension timing of 25% of the stride duration. In a study performed by Kiernan et al., the authors assessed the clinical agreement of the Bell, Davis, and Orthotrak methods in localizing the HJC, using the Harrington method as the gold standard (Kiernan et al. 2015). This was applied to 18 healthy children, and kinematics, kinetics, the Gait Profile Score, and the Gait Deviation Index were calculated. The authors found that the errors arising when the Davis or Orthotrak methods were used are clinically meaningful, especially for kinetics. The results for the glenohumeral joint differ from those obtained for the hip: Lempereur et al. showed that misplacement of the glenohumeral joint center propagates to the kinematics of the shoulder, but the errors do not exceed 4.8° in the elevation angle during shoulder flexion and 4.3° in the elevation plane during shoulder abduction (Lempereur et al. 2014). The authors related this difference in propagated errors between movements of the arm and of the thigh to the difference in mass between the two segments.
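The sensitivity of the kinetic output to JC misplacement can be illustrated with the external moment calculation itself: neglecting the inertial and gravitational contributions of the intervening segments, the moment of the ground reaction force about a joint is the cross product of the lever arm from the joint center to the center of pressure and the force, so any shift of the assumed center changes the computed moment accordingly. A minimal sketch with purely hypothetical numbers:

```python
import numpy as np

def external_joint_moment(jc, cop, grf):
    """External moment of the ground reaction force about a joint center:
    M = (cop - jc) x grf."""
    return np.cross(cop - jc, grf)

grf = np.array([20.0, 700.0, -40.0])   # N, hypothetical mid-stance ground reaction force
cop = np.array([0.10, 0.00, 0.15])     # m, center of pressure
hjc = np.array([0.12, 0.90, 0.10])     # m, reference hip joint center

m_ref = external_joint_moment(hjc, cop, grf)
# The same calculation with the assumed joint center shifted by 20 mm along one axis.
m_shifted = external_joint_moment(hjc + np.array([0.0, 0.0, 0.02]), cop, grf)

print("reference moment [Nm]:", np.round(m_ref, 1))
print("moment change due to the 20 mm shift [Nm]:", np.round(m_shifted - m_ref, 1))
print("relative change [%]:",
      np.round(100 * np.linalg.norm(m_shifted - m_ref) / np.linalg.norm(m_ref), 1))
```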
Errors on Musculoskeletal Simulations

Inaccurate localization of the JC can also influence the results obtained in musculoskeletal simulations. In a study performed by Scheys et al., moment arms and muscle-tendon lengths were computed using three kinds of musculoskeletal models: a personalized model based on MRI data, an isotropically rescaled generic model, and an anisotropically rescaled generic model (Scheys et al. 2008). Different hip joint center techniques were used in each of the models, and the simulations were applied to the gait of an asymptomatic adult. Compared with the personalized model, the generic model simulations showed large offsets in the moment arms and muscle-tendon lengths of most of the major muscles of the lower limbs.
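Moment arms such as those compared in that study are commonly obtained with the tendon-excursion method, in which the moment arm equals the negative derivative of the muscle-tendon length with respect to the joint angle. The sketch below applies this numerically to a hypothetical straight-line planar muscle; the geometry is invented for illustration and does not correspond to any of the models discussed.

```python
import numpy as np

ORIGIN = np.array([0.00, 0.10])            # pelvis-fixed origin, hypothetical (m)
INSERTION_LOCAL = np.array([0.03, -0.15])  # thigh-fixed insertion, hypothetical (m)

def muscle_length(theta):
    """Length of a straight-line muscle when the thigh is rotated by theta
    (radians) about a planar hip joint located at (0, 0)."""
    c, s = np.cos(theta), np.sin(theta)
    insertion = np.array([[c, -s], [s, c]]) @ INSERTION_LOCAL
    return np.linalg.norm(insertion - ORIGIN)

def moment_arm(theta, h=1e-5):
    """Tendon-excursion method: r(theta) = -dL/dtheta (central difference)."""
    return -(muscle_length(theta + h) - muscle_length(theta - h)) / (2 * h)

for deg in (0, 30, 60, 90):
    print(f"hip angle {deg:2d} deg -> moment arm {moment_arm(np.radians(deg)) * 1000:6.1f} mm")
```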
Future Directions

Joint center localization is essential in order to obtain anatomically accurate kinematics and kinetics. Joint center localization techniques can be either predictive or functional. While predictive techniques are based on regression equations that use anthropometric measurements, functional techniques require the subject to move the joint of interest through a range of motion in order to calculate the center of rotation between the two adjacent segments. Several authors have validated these techniques in children and adults by comparing the location of the joint center obtained by these methods to the joint center obtained by 3D medical imaging. The medical imaging systems most frequently used in these validation processes have been stereoradiography, CT scan, MRI, 3D ultrasound, and, more recently, low-dose biplanar X-rays. It was shown that the functional methods were more accurate than the predictive ones in locating the joint center in the adult population, which was not the
case in children. This could be due to the shorter segments in children, which bring the moving markers closer to the joint center and thus increase the noise in the calculation of the joint center. Moreover, the ROM performed during functional calibration should be kept within certain limits: a low ROM may not be sufficient for the calibration process, while a high ROM could induce more soft tissue artifact. Errors in the localization of the joint center have been shown to directly affect both kinematic and kinetic calculations; they also affect the computation of muscle lever arms when running musculoskeletal simulations.
Correction of 3D Positioning of the JC

Several authors have quantified the deviation of the joint center calculated by either predictive or functional methods from the exact joint center obtained by 3D medical imaging in each direction: anterior-posterior, medial-lateral, and superior-inferior. A first solution could therefore be the correction of this location prior to the calculation of kinematics and kinetics or to the computation of musculoskeletal simulations (Sangeux et al. 2014; Assi et al. 2016).
Registration Techniques for the Use of Exact Joint Center Location

It was shown in the validation studies that both predictive and functional methods localize the joint center at 11–18 mm from its exact location. In an ideal setting, when a medical imaging tool is available in the same laboratory as the motion capture equipment, the exact location of the joint center should be used in the calculation of kinematics and kinetics, as well as in musculoskeletal simulations. Markers would be placed on the patient, and an image acquisition, such as EOS biplanar X-rays, would be performed in order to obtain the exact 3D location of the joint center in the local coordinate system of the adjacent segment (i.e., the hip joint center expressed in the local coordinate system of the pelvis, or the glenohumeral joint center expressed in the local coordinate system of the scapula or thorax). These 3D coordinates would then be used after the motion analysis acquisition for kinematic/kinetic calculations or for musculoskeletal model simulations.
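In practice, this registration amounts to storing the imaged joint center as constant coordinates in the marker-based segment frame and reconstructing its laboratory position at every motion capture frame from the tracked pose of that segment. A minimal sketch is given below; the frame construction must be the same one used when the local coordinates were derived from the images, and all marker and joint center values here are hypothetical.

```python
import numpy as np

def segment_pose(m1, m2, m3):
    """Right-handed segment frame from three non-aligned markers.

    Returns (origin, R); the identical construction must be applied to the
    imaging acquisition and to every motion capture frame.
    """
    origin = m1
    x = m2 - m1
    x /= np.linalg.norm(x)
    z = np.cross(x, m3 - m1)
    z /= np.linalg.norm(z)
    y = np.cross(z, x)
    return origin, np.column_stack([x, y, z])

def jc_in_lab(jc_local, m1, m2, m3):
    """Reconstruct the imaged joint center in the laboratory frame."""
    origin, R = segment_pose(m1, m2, m3)
    return origin + R @ jc_local

# Local coordinates of the hip joint center, obtained once from the biplanar
# images and expressed in the pelvic marker-based frame (hypothetical, meters).
hjc_local = np.array([0.07, -0.09, 0.06])

# One motion capture frame of the three pelvic markers (hypothetical, meters).
m1, m2, m3 = (np.array([0.02, 0.96, 0.10]),
              np.array([0.26, 0.95, 0.11]),
              np.array([0.14, 0.93, -0.05]))
print("HJC in the laboratory frame [m]:", np.round(jc_in_lab(hjc_local, m1, m2, m3), 3))
```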
Estimation from External Information

Another solution could be to optimize the regression equations of joint center localization techniques. Since the validation studies showed different results depending on the population, new regression equations could be tailored to each population (e.g., children and adults). The new biplanar low-dose X-ray technique allows the acquisition, in the same image, of the 3D reconstruction of both the external markers and the joint center. Thus, a large cohort of subjects/patients of different age
Fig. 5 Estimation of the hip joint center using the 3D reconstruction of the skin and the skeleton and based on morphological and barycentermetric predictors (Nerot et al. 2016)
intervals could allow age-specific regression equations based on anthropometric measurements to be derived. The possibility of obtaining both the external envelope and the internal skeleton (Nérot et al. 2015a, b) opens the way to large-scale analysis and improvement of the regression equations for joint center localization by combining morphological and barycentermetric predictors (Fig. 5).
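From a computational standpoint, building such population-specific equations is an ordinary least-squares problem: each coordinate of the image-based joint center, expressed in the segment frame, is regressed on the chosen predictors measured in the cohort. The sketch below uses two hypothetical anthropometric predictors and synthetic data; the coefficients only generate example data and are not real regression equations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120                                        # hypothetical cohort size
pelvic_width = rng.normal(0.24, 0.02, n)       # m
leg_length = rng.normal(0.85, 0.06, n)         # m

# Synthetic "gold standard" HJC coordinates in the pelvic frame (m).
hjc = np.column_stack([
    0.38 * pelvic_width + rng.normal(0, 0.004, n),
    -0.30 * pelvic_width - 0.01 * leg_length + rng.normal(0, 0.004, n),
    -0.22 * pelvic_width + rng.normal(0, 0.004, n),
])

X = np.column_stack([np.ones(n), pelvic_width, leg_length])   # design matrix
coef, *_ = np.linalg.lstsq(X, hjc, rcond=None)                # one column per coordinate

residual = X @ coef - hjc
rmse = np.sqrt(np.mean(np.sum(residual ** 2, axis=1)))
print("fitted coefficients (rows: intercept, pelvic width, leg length):\n", np.round(coef, 3))
print(f"RMS 3D residual: {rmse * 1000:.1f} mm")
```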
References

Andriacchi T, Strickland A (1985) Gait analysis as a tool to assess joint kinetics. In: Berme N, Engin A, Correia Da Silva K (eds) Biomechanics of Normal and Pathological Human Articulating Joints. NATO ASI Series. Martinus Nijhoff, Dordrecht, pp 83–102
Assi A, Sauret C, Massaad A, Bakouny Z, Pillet H, Skalli W, et al (2016) Validation of hip joint center localization methods during gait analysis using 3D EOS imaging in typically developing and cerebral palsy children. Gait Posture [Internet] 42:30–5. Available
from http://dx.doi.org/10.1016/j.gaitpost.2016.04.028%5Cn, http://linkinghub.elsevier.com/ retrieve/pii/S0966636216300455%5Cn, http://dx.doi.org/10.1016/j.gaitpost.2015.06.089 Bell AL, Brand RA, Pedersen DR (1989) Prediction of hip joint centre location from external landmarks. Hum Mov Sci [Internet] 8(1):3–16. Available from: http://www.sciencedirect.com/ science/article/pii/0167945789900201. [cited 2015 Oct 27] Bell AL, Pedersen DR, Brand RA (1990) A comparison of the accuracy of several hip center. J Biomech 23:6–8 Camomilla V, Cereatti A, Vannozzi G, Cappozzo A (2006) An optimized protocol for hip joint centre determination using the functional method. J Biomech [Internet] 39(6):1096–1106. Available from: http://linkinghub.elsevier.com/retrieve/pii/S0021929005001004 Cappozzo A (1984) Gait analysis methodology. Hum Mov Sci 3:27–50 Cappozzo A, Catani F, Della Croce U, Leardini A (1995) Position and orientation in space of bones during movement: anatomical frame definition and determination. Clin Biomech [Internet] 10(4):171–178. Available from: http://www.sciencedirect.com/science/article/pii/ 026800339591394T Cappozzo A, Della Croce U, Leardini A, Chiari L (2005) Human movement analysis using stereophotogrammetry. Part 1: theoretical background. Gait Posture [Internet] 21(2):186–196. Available from: http://www.sciencedirect.com/science/article/pii/S0966636204000256. [cited 2015 Nov 4] Chaibi Y, Cresson T, Aubert B, Hausselle J, Neyret P, Hauger O et al (2012) Fast 3D reconstruction of the lower limb using a parametric model and statistical inferences and clinical measurements calculation from biplanar X-rays. Comput Methods Biomech Biomed Eng 15(5):457–466 Davis RB, Ounpuu S, Tyburski D, Gage JR (1991) A gait analysis data collection and reduction technique. Hum Mov Sci 10(5):575–587 Dubousset J, Charpak G, Dorion I, Skalli W, Lavaste F, Deguise J et al (2005) A new 2D and 3D imaging approach to musculoskeletal physiology and pathology with low-dose radiation and the standing position: the EOS system. Bull Acad Natl Med [Internet]. 189(2):287–297. Available from: http://www.ncbi.nlm.nih.gov/pubmed/16114859 Ehrig RM, Taylor WR, Duda GN, Heller MO (2006) A survey of formal methods for determining the centre of rotation of ball joints. J Biomech [Internet] 39(15):2798–2809. Available from: http://linkinghub.elsevier.com/retrieve/pii/S002192900500446X Ehrig RM, Taylor WR, Duda GN, Heller MO (2007) A survey of formal methods for determining functional joint axes. J Biomech 40(10):2150–2157 Gage JR (1993) Gait analysis. An essential tool in the treatment of cerebral palsy. Clin Orthop Relat Res [Internet] (288):126–134. Available from: http://www.ncbi.nlm.nih.gov/pubmed/8458125 Gamage SSHU, Lasenby J (2002) New least squares solutions for estimating the average centre of rotation and the axis of rotation. J Biomech 35(1):87–93 Halvorsen K (2003) Bias compensated least squares estimate of the center of rotation. J Biomech [Internet] 36(7):999–1008. Available from: http://www.sciencedirect.com/science/article/pii/ S0021929003000708. [cited 2016 Jun 21] Harrington ME, Zavatsky AB, Lawson SEM, Yuan Z, Theologis TN (2007) Prediction of the hip joint centre in adults, children, and patients with cerebral palsy based on magnetic resonance imaging. J Biomech [Internet] 40(3):595–602. 
Available from: http://linkinghub.elsevier.com/ retrieve/pii/S0021929006000583 Humbert L, De Guise JA, Aubert B, Godbout B, Skalli W (2009) 3D reconstruction of the spine from biplanar X-rays using parametric models based on transversal and longitudinal inferences. Med Eng Phys 31(6):681–687 Jenkyn TR, Nicol AC (2007) A multi-segment kinematic model of the foot with a novel definition of forefoot motion for use in clinical gait analysis during walking. J Biomech 40(14):3271–3278 Kiernan D, Malone A, O’Brien T, Simms CK (2015) The clinical impact of hip joint centre regression equation error on kinematics and kinetics during paediatric gait. Gait Posture [Internet] 41(1):175–179. Available from: http://www.sciencedirect.com/science/article/pii/ S0966636214007255
Leardini A, Cappozzo A, Catani F, Toksvig-Larsen S, Petitto A, Sforza V et al (1999a) Validation of a functional method for the estimation of hip joint centre location. J Biomech 32(1):99–103 Leardini A, O’Connor JJ, Catani F, Giannini S (1999b) Kinematics of the human ankle complex in passive flexion; a single degree of freedom system. J Biomech 32(2):111–118 Lempereur M, Leboeuf F, Brochard S, Rousset J, Burdin V, Rémy-Néris O (2010) In vivo estimation of the glenohumeral joint centre by functional methods: accuracy and repeatability assessment. J Biomech 43(2):370–374 Lempereur M, Brochard S, Rémy-Néris O (2011) Repeatability assessment of functional methods to estimate the glenohumeral joint centre. Comput Methods Biomech Biomed Eng 5842:1–6 Lempereur M, Kostur L, Leboucher J, Brochard S, Rémy-Néris O (2013) 3D freehand ultrasound to estimate the glenohumeral rotation centre. Comput Methods Biomech Biomed Eng [Internet] 16(Suppl 1):214–215. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23923914 Lempereur M, Leboeuf F, Brochard S, Rémy-Néris O (2014) Effects of glenohumeral joint centre mislocation on shoulder kinematics and kinetics. Comput Methods Biomech Biomed Eng [Internet] 17(Suppl 1):130–131. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25074199 Meskers CGM, Van Der Helm FCT, Rozendaal LA, Rozing PM (1997) In vivo estimation of the glenohumeral joint rotation center from scapular bony landmarks by linear regression. J Biomech 31(1):93–96 Nérot A, Choisne J, Amabile C, Travert C, Pillet H, Wang X, et al (2015a) A 3D reconstruction method of the body envelope from biplanar X-rays: evaluation of its accuracy and reliability. J Biomech [Internet] 48(16):4322–4326. Available from: http://dx.doi.org/10.1016/j. jbiomech.2015.10.044 Nérot A, Wang X, Pillet H, Skalli W (2015b) Estimation of hip joint center from the external body shape: a preliminary study. Comput Methods Biomech Biomed Eng [Internet] 5842:1–2. Available from: http://www.tandfonline.com/doi/full/10.1080/10255842.2015.1069603 Peters A, Baker R, Sangeux M (2010) Validation of 3-D freehand ultrasound for the determination of the hip joint centre. Gait Posture [Internet] 31(4):530–2. Available from: http://linkinghub. elsevier.com/retrieve/pii/S0966636210000299 Peters A, Baker R, Morris ME, Sangeux M (2012) A comparison of hip joint centre localisation techniques with 3-DUS for clinical gait analysis in children with cerebral palsy. Gait Posture [Internet] 36(2):282–286. Available from: http://linkinghub.elsevier.com/retrieve/pii/ S0966636212000999 Piazza SJ, Okita N, Cavanagh PR (2001) Accuracy of the functional method of hip joint center location: effects of limited motion and varied implementation. J Biomech 34(7):967–973 Piazza SJ, Erdemir A, Okita N, Cavanagh PR (2004) Assessment of the functional method of hip joint center location subject to reduced range of hip motion. J Biomech 37:349–356 Pillet H, Sangeux M, Hausselle J, El Rachkidi R, Skalli W (2014) A reference method for the evaluation of femoral head joint center location technique based on external markers. Gait Posture [Internet] 39(1):655–658. Available from: http://linkinghub.elsevier.com/retrieve/pii/ S096663621300578X Pratt V (1987) Direct least-squares fitting of algebraic surfaces. Comput Graph (ACM) 21:145–152 Ramsey DK, Wretenberg PF (1999) Biomechanics of the knee: methodological considerations in the in vivo kinematic analysis of the tibiofemoral and patellofemoral joint. 
Clin Biomech 14 (9):595–611 Sangeux M, Peters A, Baker R (2011) Hip joint centre localization: evaluation on normal subjects in the context of gait analysis. Gait Posture [Internet] 34(3):324–328. Available from: http://dx.doi. org/10.1016/j.gaitpost.2011.05.019 Sangeux M, Pillet H, Skalli W (2014) Which method of hip joint centre localisation should be used in gait analysis? Gait Posture [Internet] 40(1):20–25. Available from: http://linkinghub.elsevier. com/retrieve/pii/S0966636214000642 Scheys L, Spaepen A, Suetens P, Jonkers I (2008) Calculated moment-arm and muscle-tendon lengths during gait differ substantially using MR based versus rescaled generic lower-limb musculoskeletal models. Gait Posture 28(4):640–648
Schwartz MH, Rozumalski A (2005) A new method for estimating joint parameters from motion data. J Biomech [Internet] 38(1):107–116. Available from: http://linkinghub.elsevier.com/ retrieve/pii/S002192900400137X Stagni R, Leardini A, Cappozzo A, Grazia Benedetti M, Cappello A (2000) Effects of hip joint centre mislocation on gait analysis results. J Biomech 33(11):1479–1487 Stebbins J, Harrington M, Thompson N, Zavatsky A, Theologis T (2006) Repeatability of a model for measuring multi-segment foot kinematics in children. Gait Posture. 23(4):401–410 Tylkowski C, Simon S, Mansour J (1982) Internal rotation gait in spastic cerebral palsy. In: Nelson JP (ed) Proceedings of the 10th Open Scientific Meeting of the Hip Society. C. V. Mosby, St Louis, pp 89–125 Woltring H, Huiskes R, de Lange A, Veldpaus F (1985) Finite centroid and helical axis estimation from noisy landmark measurements in the study of human joint kinematics. J Biomech 18 (5):379–389 Wu G, Siegler S, Allard P, Kirtley C, Leardini A, Rosenbaum D et al (2002) ISB recommendation on definitions of joint coordinate system of various joints for the reporting of human joint motion – Part I: ankle, hip, and spine. J Biomech [Internet] 35(4):543–548. Available from: http://www.sciencedirect.com/science/article/pii/S0021929001002226 Wu G, Van Der Helm FCT, Veeger HEJ, Makhsous M, Van Roy P, Anglin C et al (2005) ISB recommendation on definitions of joint coordinate systems of various joints for the reporting of human joint motion – Part II: shoulder, elbow, wrist and hand. J Biomech 38(5):981–992
Kinematic Foot Models for Instrumented Gait Analysis Alberto Leardini and Paolo Caravaggi
Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Validation and Application of Foot Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Validation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Musculoskeletal Multi-Segment Foot Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kinetic Analysis Including Foot Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstract
In many clinical and biomechanical contexts of human motion analysis the model assumption of the foot as a single rigid segment is no longer acceptable. This has given rise to a large number of proposals for multi-segment foot models. The relevant experimental and analytical techniques differ for many aspects: the number of foot segments; the bony landmarks involved; the type of marker clusters; the definition of the anatomical frames; and the convention for the calculation of joint rotations. Different definitions of neutral reference posture have also been adopted, along with their utilization to offset kinematic data. Following previous partial review papers, the present chapter aims at introducing the current methodological studies for in vivo analysis of multi-segment foot kinematics. The survey has found more than 30 different techniques; however, only a limited number of these have reported convincing validation activities and
have been exploited in a clinical context. A number of papers have also compared the experimental performance of different multi-segment foot models and highlighted the main advantages and disadvantages of each of them. Important biomechanical applications of musculoskeletal models for reliable estimation of biomechanical parameters are also discussed. In addition, we report on the feasibility and limitations of kinetic analysis applied to multi-segment foot models from ground reaction force data. The chapter ends with recommendations both for the selection of the most suitable technique among those already available and for the design of an original one to address the needs of the specific application.
Keywords
Foot joint mobility • 3D joint motion • Multi-segment kinematics • Ankle complex • Chopart joint • Lisfranc joint • Metatarsophalangeal joint • Foot arches • Stereophotogrammetry • Marker clusters • Skin motion artifact
Introduction

In standard clinical gait analysis body segments are tracked in three dimensions (3D) by a stereophotogrammetric system, and their relative positions are calculated to assess patterns of joint rotations during the execution of motor activities. Full-body gait analysis requires several passive reflective markers to be fixated to the trunk, pelvis, thigh, shank, and foot. Kinematics of the segments are assessed with respect to the laboratory coordinate system, i.e., the absolute motion, and with respect to the adjoining segments for the calculation of relative joint rotations. Frequently, motion of the trunk and pelvis is reported in the laboratory reference frame, possibly with respect to the line of progression, together with hip, knee, and ankle joint motion. The moment of the external forces can also be calculated at these joints as the product of the lever arm about the joint center and the ground reaction force recorded by the force plate. Together with the spatiotemporal parameters (e.g., walking speed and stride length), these are the standard kinematic parameters necessary to assess and characterize most of the pathological conditions investigated in gait analysis. The use of skin-based reflective markers to track body segments comprised of single bones (e.g., the femur or the humerus) can result in a fairly accurate representation of their real motion. From a kinematic perspective, these segments can reasonably be assumed to move as rigid bodies, thus the position of at least three nonaligned points is required for their motion to be tracked in 3D space. In gait analysis, these points are mostly palpable bony landmarks, the temporal position of which is tracked via reflective markers attached to the skin. With skin markers, however, the rigid body assumption is violated and accuracy is lower when multiple small bones are connected in a reduced volume (Nester et al. 2010). On the other hand, the prospect of continuously tracking different segments is of particular relevance in the evaluation of pathologies affecting the shank and foot. Although the tibia and fibula present very small relative motion and thus can reasonably be considered as a single segment for
kinematic analysis (Arndt et al. 2007), the foot is made up of 26 bones and several joints connecting them. Therefore, standard kinematic protocols based on only three markers appear inadequate for describing the complex biomechanics of the foot. In addition, foot bones are rather small and some of them, e.g., the talus, have no clear palpable landmarks, making it very difficult for these to be tracked in 3D space. The importance of multi-segment foot models (MFMs) rather than single-segment foot tracking has been largely discussed in the literature. Benedetti et al. (2011) demonstrated the value of 3D motion analysis of the ankle joint in the clinical context. De Ridder et al. (2015) compared results from an MFM (“Ghent” in Table 1, by De Mits et al. 2012) and a single-segment foot model and showed the value of distal factors in chronic ankle instability, in particular the deviation in kinematics at the midfoot, which simply cannot be detected with a rigid foot model. Pothrat et al. (2015) reported significant differences and even opposite results for the same variables when the multi-segment Oxford Foot Model (OFM, see Table 1) and the Plug-in-Gait model (modeling the foot as a single segment) were used to characterize normal and flat feet, concluding that the type of foot model strongly affects the measured ankle joint kinematics. Dixon et al. (2012) performed a similar study, i.e., OFM versus Plug-in-Gait albeit on kinetic data, and revealed that the latter overestimated ankle power by about 40% with respect to the OFM and neglected the important midfoot loading. The authors of these papers shared the same recommendation of using caution when foot and ankle kinematics are measured with a single-segment foot model. Interestingly, the value of multi-segment foot kinematic analysis has been praised also in studies related to the analysis of more proximal lower limb joints (Arnold et al. 2014b). In 2003, a special technical session of the Gait and Clinical Movement Analysis Society (GCMAS) agreed that the shank, rearfoot, forefoot, and hallux are clinically meaningful foot segments to be tracked. These segments are in fact found in most of the multi-segment foot techniques reported in the literature.
Many basic foot biomechanics studies and clinical investigations employing various MFMs can be found in the literature. A number of relevant review papers have also been published, which represent valuable sources for an overview of foot modeling in kinematic analysis. Rankine et al. (2008) first reported a systematic analysis of 25 papers on foot kinematic modeling, thoroughly classified in terms of number of bony segments and joint rotations; all major technical and exploitation-related issues were discussed systematically. Later, Deschamps et al. (2011) reported and assessed many of these techniques in relation to their exploitation in the clinical context. It was shown that whereas many foot joint rotations can be tracked in a consistent and repeatable way, some measures are still critical, and several of these techniques have yet to be used to address clinical problems. According to Bishop et al. (2012) this is the consequence of poorly described or flawed methodologies, preventing readers from obtaining the same algorithms and programs to replicate the analysis. A minimum of five reporting standards were proposed in that paper, aimed at guaranteeing full access to the most relevant modeling concepts and at providing a common platform for sharing and comparing foot kinematic data, so as to improve their interpretation and usability.
Table 1 Papers on multi-segment foot techniques and models (i.e., methodological studies). The column Number of segments counts all foot and shank segments. The model name is indicated when it was recognized somehow in the following literature or cited frequently in that way. Some of these studies were taken from a previous review paper (Rankine et al. 2008). Papers that reported further assessments of the technique/model are cited in the last column. Models and marker sets designed for bone pin analysis are excluded from this Table

| Authors | Year | Number of segments | Number of subjects | Model name (best known as) | Following papers, with developments and technical assessments |
|---|---|---|---|---|---|
| Scott and Winter | 1991 | 3 | 3 | | |
| Kepple et al. | 1990 | 3 | 5 | | |
| Moseley et al. | 1996 | 3 | 14 | | |
| Kidder et al. | 1996 | 5 | 1 | Milwaukee Foot Model | Myers et al. (2004), Long et al. (2010) |
| Cornwall and McPoil | 1999 | 3 | 43 | Cornwall I | Cornwall and McPoil (1999b), Cornwall and McPoil (2002) |
| Woodburn et al. | 1999 | 3 | 10 | Woodburn I | Hetsroni et al. (2011) |
| Rattanaprasert et al. | 1999 | 5 | 10 | Rattanaprasert | |
| Leardini et al. | 1999 | 6 | 9 | Leardini Foot Model I | Atkinson et al. (2010) |
| Wu et al. | 2000 | 4 | 10 | | |
| Hunt et al. | 2001 | 4 | 18 | | |
| Carson et al. | 2001 | 5 | 1 | Oxford Foot Model (OFM) | Stebbins et al. (2006), Curtis et al. (2009), Levinger et al. (2010), Wright et al. (2011), van Hoeve et al. (2015), Carty et al. (2015), Lucareli et al. (2016), Milner and Brindle (2016), Halstead et al. (2016) |
| Arampatzis et al. | 2002 | 7 | 6 | | |
| MacWilliams et al. | 2003 | 10 | 18 | Kinfoot | |
| Hwang et al. | 2004 | 10 | 5 | | |
| Davis et al. | 2006 | 3 | 1 | Shriners Hospital for Children Greenville Foot Model (SHCG) | Maurer et al. (2013), Saraswat et al. (2013) |
| Pohl et al. | 2006 | 12 | 3 | | |
| Kitaoka et al. | 2006 | 4 | 20 | | |
| Rao et al. | 2006 | 4 | 10 | | |
| Simon et al. | 2006 | 11 | 10 | Heidelberg Foot Measurement Method | Kalkum et al. (2016) |
| Tome et al. | 2006 | 5 | 14 | | |
| Jenkyn and Nicol | 2007 | 6 | 12 | | Jenkyn et al. (2010) |
| Leardini et al. | 2007 | 5 | 10 | Rizzoli Foot Model (RFM) | Caravaggi et al. (2011), Deschamps et al. (2012a, b), Arnold et al. (2013), Portinaro et al. (2014), Van den Herrewegen et al. (2014) |
| Wolf et al. | 2008a | 4 | 6 | | |
| Sawacha et al. | 2009 | 4 | 10 | | |
| Cobb et al. | 2009 | 4 | 11 | | |
| Hyslop et al. | 2010 | 6 | 9 | | |
| Oosterwaal et al. | 2011 | 26 | 25 | Glasgow-Maastricht foot model | Oosterwaal et al. (2016) |
| Bruening et al. | 2012a | 4 | 10 | | Bruening et al. (2012b) |
| De Mits et al. | 2012 | 6 | 10 | Ghent Foot Model | |
| Saraswat et al. | 2012 | 4 | 15 | Saraswat | Saraswat et al. (2013) |
| Bishop et al. | 2013 | 4 | 18 | | |
| Nester et al. | 2014 | 6 | 100 | Salford Foot Model | |
| Seo et al. | 2014 | 5 | 20 | | |
| Souza et al. | 2014 | 3 | 10 | | |
The association between foot posture and lower limb kinematics has been the objective of another interesting review analysis of twelve papers (Buldt et al. 2013). Evidence was found for increased frontal plane motion of the rearfoot during walking in individuals with pes planus. The latest review thus far, by Novak et al. (2014), has highlighted the strengths and weaknesses of the most widely used and best-known MFMs, including an insight into their kinetic analyses. While joint rotations have been thoroughly addressed in the literature, joint translations have been studied and discussed very rarely: generally these are within 2 mm (Bruening et al. 2012a) in any anatomical direction. Because this is of the order of magnitude of the skin motion artifact, this topic will not be discussed further in this chapter. For the reconstruction of foot joint kinematics, the so-called “global
Table 2 Multi-segment foot models most used in clinical context. For each model (first column) the relevant clinical papers are reported (second column) Model Milwaukee Foot Model (1996)
Oxford Foot Model (2001)
Heidelberg Foot Measurement Method (2006) Rao et al. (2006) Rizzoli Foot Model (2007 and 2014)
Clinical papers Khazzam et al. (2007), Ness et al. (2008), Canseco et al. (2008), Marks et al. (2009), Brodsky et al. (2009), Canseco et al. (2009), Graff et al. (2010), Canseco et al. (2012), Krzak et al. (2015) Theologis et al. (2003), Woodburn et al. (2004), Turner et al. (2006), Turner and Woodburn (2008), Alonso-Vázquez et al. (2009), Wang et al. (2010), Deschamps et al. (2010), Stebbins et al. (2010), Bartonet al. (2011a, b, c), Hösl et al. (2014), Merker et al. (2015) Houck et al. (2009), Twomey et al. (2010), Dubbeldam et al. (2013) Nawoczenski et al. (2008), Neville et al. (2009), Rao et al. (2009) Chang et al. (2008), Deschamps et al. (2013), Portinaro et al. (2014), Chang et al. (2014), Deschamps et al. (2016), Arnold et al. (2014a, b), Lin et al. (2013), Hsu et al. (2014), Kelly et al. (2014), Deschamps et al. (2016)
optimization” has also been used recently (Arnold et al. 2013; Bishop et al. 2016). This basically entails an iterative search for the best estimate of the position and orientation of each foot segment, together referred to as its “pose”. The procedure starts from the skin marker trajectories, but the optimal poses must also be compatible with predetermined kinematic models for all the joints (hence “global”), following an original technique developed for the lower limbs (Lu and O’Connor 1999).

The present chapter aims at introducing the current full series of methodological studies on this topic, in order to provide the basic knowledge for either the selection or the design of the most appropriate technique, according to the specific populations and hypotheses of the foot kinematic study to be performed.
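The building block of the global optimization approach described above is the least-squares estimation of a single segment pose from a redundant marker set; the global method then couples these single-segment fits through the joint constraints of the kinematic model. The sketch below shows only the single-segment step, solved with the standard singular value decomposition solution of the orthogonal Procrustes problem; the marker cluster and measurements are hypothetical.

```python
import numpy as np

def best_fit_pose(local, measured):
    """Least-squares rigid transform (R, t) mapping model-frame marker
    positions `local` (n x 3) onto measured laboratory positions (n x 3)."""
    cl, cm = local.mean(axis=0), measured.mean(axis=0)
    H = (local - cl).T @ (measured - cm)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = cm - R @ cl
    return R, t

# Hypothetical calcaneus marker cluster in its segment (model) frame, meters.
local = np.array([[0.00, 0.00, 0.00],
                  [0.05, 0.01, 0.00],
                  [0.02, 0.04, 0.01],
                  [0.03, 0.00, 0.03]])

# Noisy measurements of the same markers in the laboratory frame at one sample.
rng = np.random.default_rng(0)
true_R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
measured = local @ true_R.T + np.array([0.3, 0.1, 0.05]) + rng.normal(0, 0.001, local.shape)

R, t = best_fit_pose(local, measured)
residuals = np.linalg.norm(measured - (local @ R.T + t), axis=1)
print("mean marker residual [mm]:", np.round(residuals.mean() * 1000, 2))
```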
State of the Art

An extensive survey of the currently available multi-segment foot techniques and models is reported in Table 1. Several differences can be found between multi-segment foot techniques in the following factors:

– Foot segments
– Bony landmarks
– Type of marker clusters
– Definition of the anatomical frames
– Joint convention, including 2D versus 3D measurements
– Neutral reference posture
– Offsets
The major difference between MFMs is found in the number and selection of foot segments (Fig. 1). While the tibia, rearfoot, and forefoot are tracked by most techniques, the hallux – or the first metatarso-phalangeal joint – is seldom tracked, and the midfoot is tracked by only a few models (MacWilliams et al. 2003; Leardini et al. 2007; Rouhani et al. 2011; Portinaro et al. 2014). Medial and lateral forefoot subdivisions have also been proposed (MacWilliams et al. 2003; Hwang et al. 2004; Buczek et al. 2006; Rouhani et al. 2011). The models currently available include up to 12 segments (Table 1); even a 26-segment foot model has been proposed (Oosterwaal et al. 2011, 2016), but its application is limited to advanced musculoskeletal modeling studies. The number and selection of foot segments to be tracked – in a sense, the resolution of the model – is usually defined according to the field of application and the clinical interest, but also to the number, quality, and location of the available cameras of the stereophotogrammetric system. While the kinematic analysis of foot segments has been devised mostly for barefoot gait analysis, a number of techniques were explicitly designed for the analysis of shod feet (Wolf et al. 2008b; Cobb et al. 2009; Shultz et al. 2011b; Bishop et al. 2013). Moreover, the effect of foot and ankle orthoses has been investigated with established models (Lin et al. 2013; Leardini et al. 2014). The overall results, in terms of patterns of foot joint kinematics, can be confusing and difficult to interpret because of the differences mentioned above. Also, the varying populations analyzed (Table 1), in terms of physical status, size, age, gender, etc., make it difficult to compare data across different studies.
While the latter usually provide a more accurate 3D location, they also require a wired external power which can result in uncomfortable
Fig. 1 Diagrammatic representation of the foot segment subdivisions (different grey tones) for the main MFMs
setups that also restrain the movement of the subject. The number of markers used in an MFM can be as high as 35, as in Oosterwaal et al. (2011, 2016) and Raychoudhury et al. (2014). A compromise must always be found between the required degrees of freedom of the model, which is related also to the number of segments tracked in 2D or 3D, and the number, quality, and location of the available cameras. These are usually arranged to collect motion data for other anatomical districts and motor tasks in the same laboratory, and therefore compromise layouts must be found, as explicitly discussed for one widely used MFM (Leardini et al. 2007). As mentioned above, at least three markers need to be fixated to each segment for a complete 3D representation of its motion. This setup is technically suitable for establishing a local reference frame on each segment and for the calculation of triplanar joint rotations using the Euler or the Joint-Coordinate-System convention (Grood and Suntay 1983) (see typical results in Fig. 2). Anatomical landmarks are necessary to establish anatomically based reference frames. However, the paucity of bony landmarks and the small size of several foot bones limit the application of three-marker tracking for foot segment kinematics. While most techniques for the kinematic analysis of foot segments use the 3D approach, i.e., three independent rotations about three different axes, 2D projection angles can also be used to measure the relative rotations of a joint with respect to anatomical planes. In the latter, line segments determined by the position of two markers are projected at each time sample onto anatomical or other relevant planes, for the planar rotation to be calculated during motion (Simon et al. 2006; Leardini et al. 2007; Portinaro et al. 2014). 2D planar angles have been largely used to track the motion of the metatarsal bones, as well as for motion representations of the arches of the foot, particularly the medial
Fig. 2 Typical mean (± one standard deviation) temporal profiles of foot joint rotations over the full gait cycle from a control population of normal subjects. In the left and right columns, respectively: motion of the calcaneus in the shank reference frame and of the metatarsals in the calcaneus reference frame. From top to bottom rows: rotations in the sagittal, frontal, and transverse anatomical planes
longitudinal arch, and the varus/valgus inclination of the calcaneus. With this approach, however, very erroneous and misleading values can be obtained in extreme conditions, particularly in the case of large ranges of joint motion and of large deviations between the line segment and the projection plane. Another important question is whether to use a reference neutral position for the foot and ankle joints. Most frequently, a double-leg standing posture is recorded to provide reference orientations of the foot and lower limb segments. The neutral orientation can then be used as an offset and subtracted from the corresponding temporal profile of joint rotation.
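A compact illustration of the 2D projection angles and the neutral-posture offset described above – a signed angle between two marker-defined line segments after projection onto a plane, and the subtraction of the standing-trial value – is sketched below. The choice of plane, markers, and numerical values is hypothetical and differs between published protocols.

```python
import numpy as np

def projection_angle(p1, p2, q1, q2, normal):
    """Signed angle (degrees) between the line segments p1->p2 and q1->q2
    after projection onto the plane with unit normal `normal`."""
    def project(v):
        return v - (v @ normal) * normal
    a, b = project(p2 - p1), project(q2 - q1)
    return np.degrees(np.arctan2(np.cross(a, b) @ normal, a @ b))

sagittal_normal = np.array([1.0, 0.0, 0.0])   # hypothetical medio-lateral lab axis
horizontal = (np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))  # reference line

# Hypothetical calcaneus and first metatarsal head markers during the standing
# (neutral) trial and at one sample of the gait trial.
standing = projection_angle(np.array([0.00, 0.03, 0.00]), np.array([0.00, 0.05, 0.15]),
                            *horizontal, sagittal_normal)
dynamic = projection_angle(np.array([0.00, 0.06, 0.02]), np.array([0.00, 0.04, 0.17]),
                           *horizontal, sagittal_normal)

print(f"standing (neutral) angle: {standing:6.1f} deg")
print(f"dynamic angle:            {dynamic:6.1f} deg")
print(f"offset-corrected angle:   {dynamic - standing:6.1f} deg")
```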
The so-called “subtalar neutral” is also sought (Rao et al. 2009) to establish the correct initial alignment of the foot and ankle. Plaster molds have also been exploited to control the foot resting position (Saraswat et al. 2012, 2013), ensuring foot placement reproducibility and segment neutral orientation. This procedure is intended to compensate for differing anatomical frame definitions and for static foot deformities, in order to establish a common “zero reference level” for inter-subject comparisons. The use of a neutral posture has the advantage of removing the bias associated with the anatomical frame definitions, thus allowing the analysis and all relevant measurements to focus on the “dynamic” pattern of the joint rotations. Unfortunately, it also removes any joint misalignments due to bone and/or joint deformity, which are frequently part of the clinical picture of each patient and therefore should not be removed from the analysis. The choice of offsetting joint rotations by using a neutral posture is thus related to the specific study and its hypotheses and should take into consideration, for example, whether there is any ongoing treatment to correct a foot deformity. Regardless of its application to offset the kinematic data, the inter-segmental orientations with the subject in the neutral posture represent extremely valuable information that should always be analyzed and assessed in relation to the corresponding temporal profiles of joint rotations. In order to help final users identify which MFM is more reliable, repeatable, and/or best fits the aims of their investigation, a few studies have been published that compare the performance of the most popular MFMs. Mahaffey et al. (2013) used intra-class correlation coefficients to analyze the OFM, the Rizzoli Foot Model (RFM), and the Kinfoot (MacWilliams et al. 2003) in 17 children over two testing sessions. Although some variability was found between segments, multi-segment foot kinematics were shown to be quite repeatable even in pediatric feet. A standard error of measurement greater than 5° was found in 26%, 15%, and 44% of the kinematic parameters for the OFM, the RFM, and the Kinfoot model, respectively. The latter showed the lowest repeatability and the highest errors. The OFM demonstrated moderate repeatability and reasonable errors in all segments except for the hindfoot in the transverse plane. The RFM showed moderate repeatability and reasonable test-retest errors similar to those of the OFM, but with original additional data also on midfoot kinematics. In another paper, by Powell et al. (2013), the OFM and RFM were assessed in the context of foot function and alignment as possible predisposing factors for overuse and traumatic injury in athletes. Both models helped detect significant differences in frontal plane motion between high- and low-arched athletes. However, the RFM was suggested to be the more appropriate because it also allows midfoot motion to be tracked. While it was not the main scope of the study, a comparison between the Shriners Hospital for Children Greenville Foot Model (Davis et al. 2006) and the OFM can also be found in Maurer et al. (2013); the former model was shown to be more effective in quantifying the presence and severity of midfoot break deformity in the sagittal plane and in monitoring its progression over time. Di Marco et al. (2015a, b) performed the most comprehensive comparative analysis to date, of the OFM, the RFM, and the Sawacha et al. (2009) and Saraswat et al. (2012) models.
The best between-session coefficient of multiple correlation of the kinematic parameters during overground and treadmill walking was observed for the RFM (range 0.83–0.95).
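Repeatability indices such as the coefficient of multiple correlation (CMC) quoted above can be computed directly from the session-by-session waveforms. One common formulation, sketched below under the simplifying assumption of a single time-normalized curve per session, is shown here; it is not necessarily the exact variant used in the studies cited.

```python
import numpy as np

def cmc(waveforms):
    """Coefficient of multiple correlation for repeated waveforms.

    `waveforms` is a (G, F) array: G sessions (or trials) by F time-normalized
    samples. One common formulation:
    CMC = sqrt(1 - [sum (Y_gf - Ybar_f)^2 / (F (G - 1))]
                 / [sum (Y_gf - Ybar)^2   / (G F - 1)])
    """
    Y = np.asarray(waveforms, dtype=float)
    G, F = Y.shape
    frame_mean = Y.mean(axis=0)        # mean curve across sessions
    grand_mean = Y.mean()
    num = np.sum((Y - frame_mean) ** 2) / (F * (G - 1))
    den = np.sum((Y - grand_mean) ** 2) / (G * F - 1)
    return np.sqrt(1.0 - num / den)

# Hypothetical example: two sessions of a sagittal-plane midfoot angle (degrees).
t = np.linspace(0.0, 1.0, 101)
session1 = 10 * np.sin(2 * np.pi * t) + np.random.default_rng(0).normal(0, 1, t.size)
session2 = 10 * np.sin(2 * np.pi * t) + np.random.default_rng(1).normal(0, 1, t.size)
print(f"between-session CMC: {cmc(np.vstack([session1, session2])):.3f}")
```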
Perhaps an overabundance of multi-segment foot techniques and models has been proposed to date. Some of these have also been made available to the motion analysis community via easy-to-use software. New users are free to choose the most appropriate model/technique for their needs according to the experimental conditions. In particular, the visibility and traceability of the relevant markers must be considered, in relation to their size and location, together with the applicability of the technique to the clinical population under investigation and to the motor activities to be analyzed. Moreover, foot and leg deformities should be carefully assessed before starting the data collection campaign. The advantages and disadvantages of the existing techniques should be considered and analyzed before developing and validating a novel MFM suited to the aims of the investigation.
Validation and Application of Foot Models

Validation Studies

New motion analysis procedures always require proper validation, but this is particularly challenging for the kinematic analysis of foot segments via skin markers. Usually, MFMs are assessed only for repeatability of the measurements (see Table 1) (Mahaffey et al. 2013; Di Marco et al. 2015a, b). Videofluoroscopy has been employed to estimate the measurement error due to skin motion artifacts (Leardini et al. 2005; Wrbaskić and Dowling 2007; Shultz et al. 2011a). Skin motion artifacts were shown to be as large as 16 mm in very strenuous foot conditions. The largest errors were measured in the hindfoot and midfoot clusters at toe-off, likely because of the large deformations experienced by the foot bones and skin in this phase of stance. Still, the skin-to-bone relative motion at the foot was found to be smaller than that of typical markers on the shank and thigh (Leardini et al. 2005), and thus it has been deemed sufficiently reliable for foot bone tracking. However, the most convincing evidence of skeletal motion comes from in vitro and in vivo bone pin measurements. In vitro, robotic gait simulators are used to replicate the biomechanical conditions of the stance phase of walking on cadaver foot specimens (Whittaker et al. 2011; Peeters et al. 2013), and the kinematics of foot bones can be accurately tracked via bone pins instrumented with markers. These data helped verify a promising consistency in foot joint kinematic patterns, for most of the foot joints, between skin-marker and bone pin measurements. Moreover, it has been possible to detect motion in a number of joints that are difficult to analyze in vivo. In vitro kinematic data should always be critically evaluated in relation to the fidelity with which the real in vivo conditions are replicated. Validation of MFMs has also been performed by tracking real bone motion in vivo (Nester et al. 2007a, b, 2010; Arndt et al. 2007; Lundgren et al. 2008; Wolf et al. 2008a; Okita et al. 2009). This required bone pins to be instrumented with marker clusters and fixated to a number of foot segments in volunteers under a small dose of local anesthesia. In this condition, the motion patterns of the main foot joints during walking and running can be established very accurately. It has been shown that the
motion patterns with and without the inserted pins compare well, indicating that the subjects had little motion restriction due to such an invasive intervention. Motion of the major joints was revealed to be very complex, and that of small joints, such as the talo-navicular, to be larger than expected – about 10° in the three anatomical planes – and also larger than that of the talo-calcaneal joint. Motion larger than 3°, and therefore non-negligible, was also measured between the tibia and fibula. These studies also showed the kinematic differences between multi-bone segments, as measured by external skin clusters, and single bone pins. These experiments are limited by the small number of subjects and are hardly replicable for technical and ethical reasons. The relevant data published so far must serve as a reference for other investigations on normal and pathological feet.
Musculoskeletal Multi-Segment Foot Modeling

MFMs can also be used to develop and validate complex musculoskeletal computer models for forward and inverse dynamic analysis. Typically, medical imaging is used to define geometrical models of the anatomical structures, while in vivo recorded kinematics and ground reaction forces provide the data to perform inverse dynamics. This allows measurement of bone segment kinematics and estimation of the loading conditions at the joints, muscle-tendon units, and ligaments. These models are particularly valuable for gaining insight into pathological conditions, understanding disease mechanisms, and simulating the effects of possible treatments, whether surgical, pharmacological, or physical. Saraswat et al. (2010) proposed a generic musculoskeletal model of an adult foot, including the intrinsic muscles and ligaments of the foot and ankle, configured and scaled by skin marker trajectories and an optimization routine. The predicted muscle activation patterns were assessed against corresponding EMG measurements from the litera