Perceiving in Depth, Volume 3: Other Mechanisms of Depth Perception (ISBN 0199764166, 9780199764167)

The three-volume work Perceiving in Depth is a sequel to Binocular Vision and Stereopsis and to Seeing in Depth, both by Ian P. Howard and Brian J. Rogers.


PERCEIVING IN DEPTH

OXFORD PSYCHOLOGY SERIES

1. The Neuropsychology of Anxiety J. A. Gray
2. Elements of Episodic Memory E. Tulving
3. Conditioning and Associative Learning N. J. Mackintosh
4. Visual Masking B. G. Breitmeyer
5. The Musical Mind J. A. Sloboda
6. Elements of Psychophysical Theory J.-C. Falmagne
7. Animal Intelligence L. Weiskrantz
8. Response Times R. D. Luce
9. Mental Representations A. Paivio
10. Memory, Imprinting, and the Brain G. Horn
11. Working Memory A. Baddeley
12. Blindsight L. Weiskrantz
13. Profile Analysis D. M. Green
14. Spatial Vision R. L. DeValois and K. K. DeValois
15. The Neural and Behavioural Organization of Goal-Directed Movements M. Jeannerod
16. Visual Pattern Analyzers N. V. S. Graham
17. Cognitive Foundations of Musical Pitch C. L. Krumhansl
18. Perceptual and Associative Learning G. Hall
19. Implicit Learning and Tacit Knowledge A. S. Reber
20. Neuromotor Mechanisms in Human Communication D. Kimura
21. The Frontal Lobes and Voluntary Action R. Passingham
22. Classification and Cognition W. K. Estes
23. Vowel Perception and Production B. S. Rosner and J. B. Pickering
24. Visual Stress A. Wilkins
25. Electrophysiology of Mind Edited by M. D. Rugg and M. G. H. Coles
26. Attention and Memory N. Cowan
27. The Visual Brain in Action A. D. Milner and M. A. Goodale
28. Perceptual Consequences of Cochlear Damage B. C. J. Moore
29. Perceiving in Depth, Vols. 1, 2, and 3 I. P. Howard with B. J. Rogers
30. The Measurement of Sensation D. Laming
31. Conditioned Taste Aversion J. Bures, F. Bermúdez-Rattoni, and T. Yamamoto
32. The Developing Visual Brain J. Atkinson
33. The Neuropsychology of Anxiety, 2e J. A. Gray and N. McNaughton
34. Looking Down on Human Intelligence I. J. Deary
35. From Conditioning to Conscious Recollection H. Eichenbaum and N. J. Cohen
36. Understanding Figurative Language S. Glucksberg
37. Active Vision J. M. Findlay and I. D. Gilchrist
38. The Science of False Memory C. J. Brainerd and V. F. Reyna
39. The Case for Mental Imagery S. M. Kosslyn, W. L. Thompson, and G. Ganis
40. Seeing Black and White A. Gilchrist
41. Visual Masking, 2e B. Breitmeyer and H. Öğmen
42. Motor Cognition M. Jeannerod
43. The Visual Brain in Action, 2e A. D. Milner and M. A. Goodale
44. The Continuity of Mind M. Spivey
45. Working Memory, Thought, and Action A. Baddeley
46. What is Special about the Human Brain? R. Passingham
47. Visual Reflections M. McCloskey
48. Principles of Visual Attention C. Bundesen and T. Habekost
49. Major Issues in Cognitive Aging T. A. Salthouse

PERCEIVING IN DEPTH VOLUME 3 OTHER MECHANISMS OF DEPTH PERCEPTION

Ian P. Howard CENTRE FOR VISION RESEARCH YORK UNIVERSITY TORONTO


Oxford University Press, Inc., publishes works that further Oxford University’s objective of excellence in research, scholarship, and education. Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam

Copyright © 2012 by Oxford University Press, Inc. Published by Oxford University Press, Inc. 198 Madison Avenue, New York, New York 10016 www.oup.com Oxford is a registered trademark of Oxford University Press All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of Oxford University Press. ____________________________________________ A copy of this book’s Cataloging-in-Publication Data is on file with the Library of Congress.

ISBN: 978-0-19-976416-7 ____________________________________________

Printed in the United States of America on acid-free paper

CONTENTS OF VOLUME 3

25. Depth from accommodation and vergence 1
26. Depth from perspective 15
27. Depth from interposition and shading 63
28. Depth from motion parallax 84
29. Constancies in visual depth perception 122
30. Interactions between visual depth cues 147
31. Seeing motion-in-depth 179
32. Pathology of visual depth perception 216
33. Visual depth perception in the animal kingdom 233
34. Reaching and moving in 3-D space 260
35. Auditory distance perception 277
36. Electrolocation and thermal senses 309
37. Animal navigation 318
38. Final word 334
References 336
Index of cited journals 382
Portrait index 385
Subject index 386

25 DEPTH FROM ACCOMMODATION AND VERGENCE

25.1 Accommodation as a distance cue 1
25.1.1 Accommodation and absolute distance 1
25.1.2 Object blur as a cue to relative depth 2
25.1.3 Defocus blur as a cue to relative depth 3
25.1.4 Adapting to unusual depth blur 4
25.2 Vergence as a distance cue 5
25.2.1 Introduction 5
25.2.2 Verbal estimation of vergence distance 6
25.2.3 Use of a comparison object 6
25.2.4 Vergence distance indicated by pointing 7
25.2.5 Illusory motion parallax 8
25.2.6 Vergence and apparent size and distance 8
25.2.7 Perceptual effects of maintained vergence 10
25.2.8 Vergence and judgment of relative depth 13

25.1 ACCOMMODATION AS A DISTANCE CUE

25.1.1 Accommodation and Absolute Distance

Although Descartes (1664) had no clear idea about the mechanism of accommodation, he proposed that the act of accommodation aids in the perception of depth. Berkeley (1709) made the same suggestion. Between 1858 and 1862 Wundt performed a series of experiments on the role of accommodation in depth judgments. Subjects judged whether a black silk thread seen monocularly through a tube was at the same distance in two successive exposures. Subjects could not judge the absolute distance of the thread but could detect a change in depth of about 8 cm at a distance of 100 cm and of 12 cm at a distance of 250 cm.

Hillebrand (1894) used the edge of a black card seen monocularly against an illuminated background so as to remove the depth cue of changes in image size. When the stimulus was moved slowly in depth, subjects were not able to detect the motion. However, when the stimulus moved abruptly, subjects could detect a change of between 1 and 2 diopters. Dixon (1895) and Baird (1903) produced similar results.

This evidence suggests that people cannot judge the distance of an object on the basis of the static state of vergence-accommodation but can use changes in accommodation to judge differences in depth. However, more recent experiments have revealed that people have some capacity to judge absolute distance using accommodation.

Swenson (1932) asked subjects to move an unseen marker to the perceived distance of a single binocularly viewed luminous disk at distances of 25, 30, and 40 cm with angular size held constant. Errors were less than 1 cm in the range 25 to 40 cm. When accommodation was optically adjusted to one distance, and vergence to another distance, judgments of distance were a compromise between the two distances but with more weight given to vergence. These results indicate only that accommodation contributes to perceived absolute distance. They do not provide a quantitative measure of that contribution.

In the above experiments it was assumed that subjects accommodated on the required target, which most people fail to do accurately. Fisher and Ciuffreda (1988) used both a good accommodative stimulus, consisting of high-contrast patterns, and a poor accommodative stimulus in the form of a fuzzy disk. They measured accommodation with an optometer that provided no intruding stimuli. Subjects pointed with a hidden hand to monocular targets. With high-contrast targets, apparent distance decreased linearly with increasing accommodation, but there were large individual differences. Subjects tended to overestimate distances that were less than about 3.2 diopters (31 cm) and underestimate larger distances. Each diopter change in accommodation induced about a 0.25-diopter change in apparent distance. With the poor accommodation stimulus, perceived distance did not vary with accommodation.

Mon-Williams and Tresilian (1999a) asked subjects to point with unseen hand to single monocularly viewed targets at distances between 10 and 50 cm. The target was placed along the visual axis of one eye so that vergence cues were eliminated. A vergence motion of the closed eye may have been evoked by changes in perceived distance, but this would not provide independent information about distance. Target size and distance were varied independently to remove the distance cue of size. Four of six subjects showed a correlation between pointing distance and target distance, but responses were very variable.

The gradient of optical blur over an inclined or slanted surface increases with decreasing distance of the surface. Vishwanath and Blaser (2010) produced evidence that a frontal surface with a steep gradient of artificial blur appears nearer than a surface with a less steep gradient.
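Since accommodative demand is expressed in diopters throughout this section, the following minimal sketch (ours, not from the text) makes the diopter-distance conversion behind the Fisher and Ciuffreda figures explicit:

```python
# A diopter is the reciprocal of distance in meters, so 3.2 D corresponds to
# about 31 cm, as quoted above. Helper names are ours, for illustration.

def diopters_to_cm(d: float) -> float:
    """Distance (cm) corresponding to an accommodative demand of d diopters."""
    return 100.0 / d

def cm_to_diopters(cm: float) -> float:
    """Accommodative demand (diopters) for a target at the given distance."""
    return 100.0 / cm

print(diopters_to_cm(3.2))    # ~31.2 cm, the crossover distance reported above
print(cm_to_diopters(25.0))   # 4.0 D
print(cm_to_diopters(40.0))   # 2.5 D
```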

25.1.2 Object Blur as a Cue to Relative Depth

A visual object may be physically blurred by a filter that removes high spatial frequencies. Unlike accommodative blur of the retinal image, physical blur is not affected by accommodation; it is open-loop blur. Artists create an impression of depth by simulating the out-of-focus appearance of objects not in a specified plane. Photographers create an impression of depth by using a large aperture to reduce the depth of focus so as to have only the object of interest in focus, leaving objects in other depth planes with various degrees of blur. Pentland (1987) discussed the use of gradients of focus in computer vision systems.

Simple physical blur in a photograph provides ambiguous information about depth because the same blur can indicate an object nearer than the plane of focus or one beyond the plane of focus. Furthermore, physical blur can indicate relative depth only if the true sharpness of the edges is known. These sources of ambiguity can be reduced by using a series of static pictures taken with various levels of camera defocus. In this way, it is possible to compute the relative depths of objects by physically scanning each of a set of photographs (Rajagopalan et al. 2000). The visual system removes these ambiguities in other ways (see Section 9.6.5).

A sharply textured region within a blurred surrounding region can induce impressions of relative depth (Mather 1996). Marshall et al. (1996) used the stimuli shown in Figure 25.1. All subjects reported that the sharp inner square appeared in front of the blurred surround in (A) but beyond the surround in (B). The in-focus edge of the inner square in (A) is seen as belonging to the in-focus texture within the square. The square is therefore seen as occluding the blurred surround region. The out-of-focus edge in (B) is seen as belonging to the surround, and the sharp inner square is therefore seen as if through a hole in the surround. In Figures 25.1C and 25.1D the effect is ambiguous. The sharp boundary should be seen as belonging to the in-focus surround and therefore as an occluding edge of a nearer surrounding region. But some people see the inner square as nearer. This could be due to a general tendency to see a surrounded region as a foreground figure. The blurred texture of the inner square is then interpreted as intrinsically blurred rather than out of focus. In Figure 25.1D the inner square appears near and out of focus because both its texture and its edges are blurred. In a second experiment, Marshall et al. used a side-by-side bipartite display, which avoided the factor of a figure surrounded by a background. However, the effects were not as clear-cut.

Blurring a display reduces its contrast, and contrast has its own effect on perceived relative distance. O’Shea et al. (1997) varied relative blur and relative contrast independently in the two halves of textured bipartite displays. A more blurred region appeared more distant than a less blurred region when contrast was the same. A region of higher contrast appeared nearer than a region of lower contrast when blur was the same. The effects of the two cues were additive over a moderate range of contrast.

Mather and Smith (2002) used the bipartite display shown in Figure 25.3 (Portrait Figure 25.2). When the boundary was sharp, the blurred region appeared more distant than the nonblurred region. Only when the border was very blurred did the blurred region appear near. They concluded from subsidiary experiments that moderate degrees of border blur are difficult to detect.

Figure 25.1. Effects of texture and blur on apparent depth. (A) The inner square appears near because its sharp edge appears to belong to its sharp contents. (B) The inner square appears far because its blurred edge belongs to the surround. (C) The inner square with sharp edges can appear as an out-of-focus surface beyond the surround or as a nearer blurred square. (D) The inner square with blurred edges appears near and out-of-focus because both its contents and edges are blurred. (Redrawn from Marshall et al. 1996)

25.1.3 DEFOCUS BLUR AS A CUE TO RELATIVE DEPTH

25.1.3a Static Blur as a Cue to Depth

The blur of an in-focus image of an object with a given spatial-frequency content depends on the optics of the eye. The ability of the visual system to detect image blur depends on the sensitivity and spatial sampling of the retina and visual cortex. For given values of these optical and neural factors, the blur of the image of an object at a given distance varies with the state of accommodation of the eye. The question addressed in this section is whether people are able to judge the relative depth of two objects on the basis of the relative blur of their retinal images.

Grant (1942) asked subjects to set a luminous disk to the same distance as another disk, when cues to distance other than image blur were removed. The standard error of settings was about 0.94 cm at a distance of 50 cm, and 0.8 cm at a distance of 25 cm.

Subjects could distinguish between the image of a point of light nearer than the plane of focus and the image of a point beyond the plane of focus (Wilson et al. 2002). The stimulus was presented for 100 ms after a 2-minute training period in which subjects were given knowledge of results. Performance improved with increasing image blur and as pupil diameter was increased from 1 mm to 5 mm.

Nguyen et al. (2005) asked subjects to report the relative depth of two monocularly viewed vertical test edges on either side of a gap illuminated by tungsten light, as shown in Figure 25.4. The left edge was fixed at a distance of 37 cm. The right edge was presented at various random distances nearer than or beyond the left edge. This edge moved along the eye’s visual axis so as to keep the image of the gap constant in width. In addition, the width of the gap was randomly varied slightly from trial to trial. In one condition, subjects remained fixated and focused on the left edge. In this case, the only information about the relative depth of the two edges was that provided by the relative signed blur of the images. In a second condition, subjects changed accommodation from one edge to the other several times. Figure 25.5A shows that, in both conditions, the relative depths of the two edges could be discriminated 75% of the time when they were about 0.2 D apart in depth. Detection of the depth order of the edges was severely degraded when the stimulus was illuminated by monochromatic sodium light, as shown in Figure 25.5B. Monochromatic light does not produce the chromatic aberration that provides a cue to the sign of accommodation (Section 9.8).

Figure 25.4. The apparatus used by Nguyen et al. (2005).

Figure 25.5. The detection of relative depth from blur. Subjects judged the relative depth of two edges seen against tungsten light, as in (A), or sodium light, as in (B). They looked from one edge to the other or fixated one edge. Results for three subjects. (Adapted from Nguyen et al. 2005)
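As a rough check on these numbers, the following minimal sketch (ours, using the values from the Nguyen et al. setup described above) converts the ~0.2 D threshold into physical distances around the fixed edge:

```python
# Convert the ~0.2 D blur-discrimination threshold reported above into
# physical distances around the fixed edge at 37 cm. Illustrative only.

FIXED_EDGE_M = 0.37                      # left edge distance (m)
THRESHOLD_D = 0.2                        # 75%-correct threshold (diopters)

fixed_d = 1.0 / FIXED_EDGE_M             # ~2.70 D
near_m = 1.0 / (fixed_d + THRESHOLD_D)   # edge nearer than fixation
far_m = 1.0 / (fixed_d - THRESHOLD_D)    # edge beyond fixation

print(f"fixed edge: {fixed_d:.2f} D")
print(f"0.2 D nearer  -> {100 * near_m:.1f} cm")   # ~34.5 cm
print(f"0.2 D farther -> {100 * far_m:.1f} cm")    # ~40.0 cm
```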

25.1.3b Dynamic Accommodation and Relative Depth

The act of changing accommodation between two objects at different distances may provide information about their relative depth. Also, the changing blur associated with changing accommodation could be a depth cue. Helmholtz (1909, Vol. 3, p. 294) found that an illuminated slit with a red filter appeared nearer than a slit with a blue filter. He explained the effect in terms of the difference in accommodation required to focus the two slits, arising from chromatic aberration. Although Mon-Williams and Tresilian (2000) found that subjects could not judge the absolute distance of a monocular target, there was some indication that they could judge whether a target was nearer or more distant than a previous one. The above experiments indicate that,

at near distances, accommodation has an effect on perceived distance. Dynamic accommodation and dynamic image blur may be more effective when many objects in different depth planes are presented at the same time. For this purpose one needs an instrument that presents an array of objects at different accommodative distances but at the same vergence distance. Another approach to the role of accommodation in distance judgments is to test whether accommodation affects the perceived size of an object at different distances through the mediation of size constancy. This issue is discussed in Section 30.7.

Figure 25.2. George Mather. Born in Liverpool in 1955. He obtained a B.A. in psychology from Sheffield University in 1976 and a Ph.D. with B. Moulden from Reading University in 1979. After postdoctoral work at York University in Toronto, he joined the Department of Experimental Psychology at the University of Sussex, England, where he is now a professor.

Figure 25.3. Effect of blur on perceived depth. The display on the left contains a sharp border between the sharply textured region and the blurred region. Observers reported that the blurred region appeared far. The display on the right contains a blurred border. Observers reported that the blurred region appeared near. (From Mather and Smith 2002, Pion Limited, London)

25.1.4 Adapting to Unusual Depth Blur

Sharp edges are perceived as sharp in spite of the fact that diffraction and optical aberrations spread the image over several receptors. This question was discussed in Section 9.6.5. When we focus on an object we are not aware of blur in the images of objects in other depth planes. One reason is that we do not normally attend to objects out of the plane of focus. But there is some evidence that we adapt selectively to the normal levels of blur associated with different distances from the plane of focus. It is as if we discount defocus blur so that we can better detect physical blur.

Battaglia et al. (2004) asked whether adaptation to image blur is related to perceived depth. Subjects fixated a central target at a distance of 33 cm for 3 minutes, while two flanking textured surfaces moved back and forth between depths of 23 and 32.2 cm at 0.1 Hz. The surfaces became physically more blurred as they approached and less blurred as they receded, or vice versa. Blurring was produced by filtering the texture. Before and after adaptation, subjects adjusted the physical blur of a surface at 24.6 cm until it appeared the same as the fixed blur of an adjacent surface at 30.6 cm, or vice versa. After adapting to surfaces that became blurred as they approached, less blur was required in the far surface to make it appear the same as the blur in the near surface. The effect was reversed after adaptation to surfaces that became blurred as they receded. Thus, the unusual blur-depth relationships experienced during adaptation to surfaces moving in depth changed the relative perceived blur of stationary surfaces in a distance-specific way. This suggests that the visual system modulates perceived blur by signals related to relative depth.

25.2 VERGENCE AS A DISTANCE CUE

25.2.1 Introduction

Descartes, in his La dioptrique (1637), described the eyes as “feeling out” a distance by a convergence of the visual axes, just as a blind man might feel out a distance with two staves, one in each hand. The haptic judgment of distance is discussed in Section 34.2. In his Essay Towards a New Theory of Vision (1709), Berkeley argued that the perceived distance of an isolated object from the viewer depends on muscular sensations of convergence and, at near distances, on visual blur and eye strain arising from accommodation (Boring 1942). Brücke (1841) proposed that the three-dimensional structure of a scene is perceived on the basis of vergence eye movements that occur as different parts of the scene are fixated. But Dove (1841) showed that stereopsis can occur with exposures too brief to allow vergence to occur. Thus, vergence movements are not necessary for depth perception.

Early experiments on vergence as a cue to distance were conducted by Hillebrand (1894), Bourdon (1902), Baird (1903), and Bappert (1923). The distance, D, of a fixated object in the median plane, expressed as a function of the vergence angle, θ, and the interocular distance, a, is given by:

D = a / (2 tan(θ/2))   (1)

The distance, D′, of a second object in the median plane, which has a disparity d with respect to the fixated object, is given by:

D′ = a / (2 tan((θ − d)/2))   (2)

When the effects of linked changes in vergence and accommodation are being investigated, accommodation distance is made equal to vergence distance. We use the term accommodation/vergence distance to refer to the optical distance of the target determined by both accommodation and vergence. On the other hand, accommodation distance may be varied while vergence is held constant, or vergence distance may be varied while accommodation is held constant.

There are two ways to vary vergence while holding other cues to distance constant or ineffective. The first is to present targets in a stereoscope with variable offset between the images, and the second is to view the target through base-in or base-out wedge prisms. However, constant accommodation and constant size signify that distance is not changing and may therefore interfere with judgments of distance based on vergence. This problem can be solved by randomly varying accommodation and target size so that they are dissociated from the distance of the target, as specified by convergence. A better solution is to eliminate accommodation as a cue by viewing the stimulus through pinholes, which increase the depth of focus. Size as a cue to distance can be eliminated by using a point source of light. The luminance of the target should also be kept constant or varied at random.

The range of distances over which testing is conducted is a crucial variable because vergence changes very little beyond 2 m. It is unlikely to serve as a cue beyond that distance. Finally, one must select a psychophysical procedure for measuring perceived distance. The following procedures have been used.
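To make the 2-m limit concrete, here is a minimal sketch based on Equation (1), rearranged to give the vergence angle for a target at distance D. The 6.5-cm interocular distance is an assumed typical value, not from the text:

```python
import math

A = 0.065  # interocular distance (m); assumed typical value

def vergence_deg(distance_m: float) -> float:
    """Vergence angle (deg) required to fixate a target at the given distance,
    from Equation (1): D = a / (2 tan(theta/2)), solved for theta."""
    return math.degrees(2 * math.atan(A / (2 * distance_m)))

for d in (0.25, 0.5, 1.0, 2.0, 4.0, 10.0):
    print(f"{d:5.2f} m -> {vergence_deg(d):5.2f} deg")
# The angle falls from ~14.8 deg at 25 cm to ~1.9 deg at 2 m; the total change
# from 2 m to infinity is under 2 deg, which is why vergence is a weak cue there.
```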

25.2.2 Verbal Estimation of Vergence Distance

Oyama (1974) asked subjects to judge the size and distance of the projected stereoscopic image of a playing card or a blank card. The size of the stimulus was varied. Vergence was changed by changing the lateral separation of the dichoptic images, with accommodation held constant. Estimates of distance decreased linearly with increases in image size or convergence angle. The effect of vergence angle was much less than the effect of image size, both for familiar and unfamiliar objects. The stimulus was viewed through an aperture, which would have introduced a relative disparity cue.

Trained subjects made reasonably accurate verbal estimates of the distance of a point source of light viewed binocularly in dark surroundings at accommodation/vergence distances of between 0.5 and 9 m (Morrison and Whiteside 1984). There was some overestimation of near distances and underestimation of far distances. Accuracy was still good when the target was exposed for only 200 ms, so that the eyes did not have time to converge on the stimulus. This suggests that the disparity of the flashed target was combined with information about the resting state of vergence. Accuracy was not as high when only accommodation distance was varied as when only vergence distance was varied.

Viguier et al. (2001) asked subjects to verbally estimate the distance of a light seen in dark surroundings. Figure 25.6 shows that distance estimates were good up to a distance of 40 cm, beyond which distance was underestimated. Vergence varies steeply with distance up to 40 cm, which coincides with the limit of reaching distance. Judgments of distance based on vergence were no worse than judgments of direction based on version, when both were expressed in angular terms (Brenner and Smeets 2000).

Figure 25.6. Estimated distance as a function of distance. Subjects (n = 12) judged the distance of a light spot seen in dark surroundings. (Redrawn from Viguier et al. 2001)

25.2.3 Use of a Comparison Object

The joint effects of vergence and accommodation on perceived distance can be studied by comparing the apparent distance of an object seen through prisms and lenses, so that it is at one accommodation/vergence distance, with the apparent distance of a comparison object seen at another accommodation/vergence distance. Subjects alternate between viewing the test object and the comparison object presented simultaneously, or the stimuli are presented successively. The images of the two objects are made the same size, to neutralize size cues to distance. In a related procedure, subjects judge the relative sizes of two objects rather than their relative distances. The idea is that perceived relative size is proportional to perceived relative distance, according to the size-distance invariance principle (Section 29.3.2). These two procedures will be referred to as the visual-distance procedure and the visual-size procedure, respectively. The procedures do not indicate the accuracy of judgments of absolute distance but only whether perceived relative distance or relative size is proportional to relative vergence.

Alternatively, a test object can be adjusted until it appears the same size or distance as a comparison object. This procedure indicates only the minimum perceived separation in depth between two objects, that is, the JND for relative depth. One object could be presented with full depth cues and the other with only vergence as a cue, with the two objects not visible at the same time. In this case, results indicate the accuracy and precision of depth judgments based on vergence with respect to the accuracy and precision of judgments based on some other depth cue.

In a related procedure, subjects match the perceived size of a target presented at various accommodation/vergence distances with the length of a subsequently presented frontal rod seen with full depth cues. Wallach and Floor (1971) used this procedure and found that viewing distance was perceived with 95% accuracy for distances up to 120 cm.

Frank (1930) measured the change in apparent size of an object as subjects changed fixation from the object to a mark some distance in front of it. In this procedure, effects of changing convergence are contaminated by blur-induced changes in the size of the image and by changes in pupil diameter. Heinemann et al. (1959) overcame this problem. Subjects compared the size and distance of a luminous disk in dark surroundings at 4 m with the size and distance of a second disk at each of several nearer distances. While apparent size changed with distance, as predicted, judgments of distance were very inaccurate. When the disks were viewed monocularly through an artificial pupil, the relative distances of the disks had no effect on their perceived relative sizes. The nearer disk appeared smaller than the far disk when the disks were viewed with artificial pupils but with a binocular fixation point that provided a stimulus for vergence.

Crannell and Peters (1970) conducted a similar experiment using a point of light in dark surroundings at distances between 2 and 50 feet. They used binocular viewing, taking care to eliminate cues of relative size and brightness. Judgments of distance were so variable that it was not possible to discern any significant correlation between actual and judged distances. It is not surprising that judgments were variable beyond 6 feet, because vergence changes relatively little beyond this distance.

In other studies, prisms and lenses were used to vary vergence and accommodation conjointly so that their normal relationship was maintained. A test object seen through the prisms and lenses was matched in size with a comparison object seen directly at a fixed distance, but not at the same time. The apparent size of the test object corresponded to its relative accommodation/vergence distance, but only up to 1 m (Leibowitz and Moore 1966; Leibowitz et al. 1972). Individuals varied widely in their ability to use vergence/accommodation modified by prisms and lenses as a cue to absolute distance (Richards and Miller 1969). Wallach and Floor (1971) used a similar procedure but took extra precautions to ensure that subjects matched the linear rather than the angular sizes of the objects. Accommodation and convergence provided reasonably accurate judgments of relative distance up to 2 m.

Thus, the results of experiments involving the visual-distance and visual-size procedures support the idea that accommodation/vergence is used to at least partially scale the apparent size of an object, but it does so only at near distances.

Komoda and Ono (1974) presented disks subtending a constant visual angle of 10° in an amblyoscope at convergence distances between 20 and 120 cm. Subjects viewed the fused disk through artificial pupils and estimated its angular size, linear size, and distance by nonvisual matching procedures. For example, distance was estimated by marking out a distance on a piece of rope. Apparent angular size, linear size, and distance all decreased with increasing convergence. The inverse relationship between convergence and perceived distance became less evident when subjects viewed the disk while vergence was changed. Evidence reviewed in Section 31.3.2 indicates that the reduced sensation of motion in depth under these circumstances is due to the absence of looming that normally accompanies changing distance.

Bourdy et al. (1991) asked subjects to judge when a luminous point was midway in depth between two other luminous points lying in the median plane. Subjects whose dark convergence was near (Section 10.2.1) underestimated the near interval relative to the more distant interval. Subjects whose dark convergence was far overestimated the near interval.
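The visual-size procedure above rests on the size-distance invariance principle (Section 29.3.2). A minimal sketch of that relation, with the Komoda and Ono disk as the example; the function name and printed values are ours:

```python
import math

def linear_size_cm(distance_cm: float, theta_deg: float) -> float:
    """Linear size implied by a fixed angular size theta at a given distance:
    S = 2 * D * tan(theta / 2)."""
    return 2 * distance_cm * math.tan(math.radians(theta_deg) / 2)

# A 10-deg disk, as in Komoda and Ono (1974), at their nearest and farthest
# convergence distances:
print(linear_size_cm(20.0, 10.0))    # ~3.5 cm if perceived at 20 cm
print(linear_size_cm(120.0, 10.0))   # ~21.0 cm if perceived at 120 cm
# If vergence scales perceived distance, perceived linear size should scale
# in the same proportion.
```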

25.2.4 Vergence Distance Indicated by Pointing

Within reaching distance, an unseen marker can be set by hand to the same apparent distance as a visual target. Distance estimates based on this procedure have been found to be half as variable as verbal estimates, and more accurate (Foley 1977; Bingham and Pagano 1998; Pagano and Bingham 1998).

Swenson (1932) asked subjects to move an unseen marker to the perceived distance of a single binocularly viewed luminous disk for which the distance cues of relative size and luminance were removed. Errors were less than 1 cm in the range 25 to 40 cm. When accommodation was optically adjusted to one distance by lenses, and vergence to another distance by prisms, judgments of distance were a compromise between the two distances, but with more weight given to vergence.

Foley and Held (1972) used the same method but eliminated accommodation by using a dichoptic pair of lights with different horizontal disparities. Judged distance increased as the vergence distance of the target increased from 10 to 40 cm, but subjects consistently overreached, with a median error of 25 cm, which was independent of distance. At a viewing distance of 50 cm, when vergence was the only cue to depth, the standard deviation of pointing in the depth dimension was ±25 arcmin of disparity. In the lateral dimension it was ±2°. Standard deviations were approximately half these values when other cues to depth were present.

Mon-Williams and Tresilian (1999a) varied the vergence-defined distance of a point of light between 20 and 50 cm, keeping other depth cues constant. Subjects pointed to the light with hidden hand. Mean variable error was less than 2 cm. Shorter distances were overestimated, and longer distances underestimated. Ebenholtz and Ebenholtz (2003) obtained overreaching to an isolated point of light viewed at a distance of 35 cm at headcentric eccentricities of up to 45°. With binocular viewing, but not monocular viewing, pointing changed to underreaching at distances greater than 45 cm.


Gogel coined the phrase specific distance tendency to describe regression to the mean distance of a set of test distances. This topic is discussed in Section 29.2.2b.

Distance errors in pointing are subject to a memory effect. Thus, Magne and Coello (2002) found that subjects underestimated distance when pointing to an isolated target at 27 cm but that pointing immediately became accurate when a textured background was added to the display. After the textured surface was removed, accuracy remained high for several trials. Subjects presumably retained the impression of distance that they had gained when the background was visible.

Tresilian et al. (1999) found that errors of pointing with an unseen pointer to a vertical rod increased when prisms introduced a conflict between vergence and the depth cues of perspective and binocular disparity. Errors also increased when the distance of the fixated rod increased from 25 to 105 cm. Thus, more weight was assigned to the vergence cue to distance at nearer distances and when vergence did not conflict with other cues.

Mon-Williams and Dijkerman (1999) asked subjects to grasp an object seen in surroundings with many cues to distance while wearing a 9-diopter base-in or base-out prism before one eye. Subjects modified their peak arm velocity and acceleration according to the change in perceived distance induced by the prisms. The visual control of prehension is discussed in Section 34.3. Distance scaling and distance judged by walking are discussed in Section 34.4.2.

25.2.5 Illusory Motion Parallax

The motion of a stationary object relative to the moving self should be correctly interpreted as due to motion of the self when the distance of the object, the rotations of the eyes, and the movement of the head are correctly registered. This motion-distance invariance principle is a special case of the size-distance invariance principle (Section 29.3.2). Misregistration of any of these stimulus features should result in illusory motion of the stationary object. However, misregistration of one feature may cancel an opposed misregistration of another feature.

Consider the case where the observer’s head moves sideways through a correctly registered distance while the eyes remain fixated on a stationary object. When the distance of the object is underestimated, the eyes will rotate more slowly than they should for that perceived distance. Given that eye rotations are correctly registered, the stationary target would therefore appear to move in the same direction as the head. When the distance of the target is overestimated, the eyes will rotate faster than they should for that perceived distance, and the target would appear to move in the opposite direction to the head (Hay and Sawyer 1969; Wallach et al. 1972b). These misperceptions of motion of an object relative to the head will be referred to as illusory motion parallax.
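A small-angle sketch of this reasoning (our toy model, not a formula from the text): the eyes rotate through roughly h/D radians when the head translates h while fixating at distance D, so a registered distance that differs from D leaves a rotation discrepancy that is attributed to target motion.

```python
def illusory_motion(h: float, d_actual: float, d_perceived: float) -> float:
    """Apparent target displacement (m) for head translation h (m), under the
    small-angle approximation: positive = with the head, negative = against.
    A stationary target at the registered distance would demand a rotation of
    h/d_perceived; the actual rotation is h/d_actual; the shortfall, scaled by
    the registered distance, is seen as motion: h * (1 - d_perceived/d_actual)."""
    return h * (1.0 - d_perceived / d_actual)

# Head moves 10 cm; target at 4 m but registered at Gogel's ~2-m specific distance:
print(illusory_motion(0.10, 4.0, 2.0))   # +0.05 m: appears to move with the head
# Target at 1 m but registered at 2 m (distance overestimated):
print(illusory_motion(0.10, 1.0, 2.0))   # -0.10 m: appears to move against the head
```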



Gogel and Tietz (1973, 1977) instructed subjects to fixate a point of light moving vertically in dark surroundings while they moved the head from side to side. Any apparent sideways motion of the light causes the path of the vertically moving light to appear tilted. This may be called the head-parallax test. There should have been no apparent sideways motion of the light when the distance of the light was correctly perceived, assuming that the movements of the eyes and head were correctly registered. Gogel and Tietz found that an object further away than a specific distance appeared to move in the same direction as the head, indicating that its distance was underestimated. An object nearer than the specific distance appeared to move against the head, indicating that its distance was overestimated. Thus an isolated object appeared to be displaced in the direction of a specific distance of about 2 m. This is Gogel’s specific-distance tendency.

In another experiment Gogel and Tietz (1979) used the head-parallax test with sideways motion of the test light relative to the observer (parallax) signifying one distance and vergence modified by prisms signifying another distance. The results indicated that binocular convergence is a more effective cue to absolute distance than is motion parallax.

Owens and Leibowitz (1976) found that, for a monocularly viewed point of light, the distance that was correctly perceived, as indicated by the head-parallax test, was related to dark vergence rather than to dark accommodation. Gogel (1982) argued that an explanation in terms of the resting state of vergence does not work when both eyes are fixated on the test object. The whole question of motion parallax as a cue to absolute distance is discussed in more detail in Section 28.2.

25.2.6 VERGENCE AND APPARENT SIZE AND DISTANCE

25.2.6a Micropsia and Macropsia

When the ciliary muscles that control accommodation are paralyzed by atropine, the lens remains focused at a far distance, and objects appear unusually small and near. This effect is known as atropine micropsia. It was first described by Aubert (1865). Misaccommodation has negligible effects on the actual size of the image (Smith G et al. 1992). Koster (1896) proposed that micropsia is due to induction of accommodative convergence by the effort to accommodate in spite of atropine paralysis. However, this cannot be the only cause, because von Holst (1957) observed micropsia when vergence was held constant.

Hollins (1976) cast doubt on the existence of accommodative micropsia. He used prisms and lenses to dissociate accommodation and vergence and found that only one of three subjects showed any evidence of a decrease in perceived size with increasing accommodation, with convergence held constant.

Eserine contracts the ciliary muscles, which causes the lens to accommodate to a near distance. This induces macropsia, in which objects appear unusually large. In this case, there is an unsuccessful attempt to relax accommodation.

Micropsia and macropsia may also be induced by a change in convergence. Base-out prisms increase the convergence required to fixate an object and decrease the perceived size of the object (vergence micropsia). This effect was first described by Wheatstone (1852). The same effect is produced when the change in convergence is not accompanied by changes in accommodation (Heinemann et al. 1959; Hollins and Bunn 1977). The telestereoscope, which increases the effective separation of the eyes, has the same effect (see Helmholtz 1909, Vol. 3, p. 352). Base-in prisms have the opposite effect (vergence macropsia). A given degree of convergence produced more micropsia in a small visual target surrounded by a landscape than in an object seen against a blank surround (Enright 1989).

One theory of micropsia is that induced accommodation or convergence signals that the object is nearer than it really is, which leads to underestimation of its size by the principle of size-distance invariance. However, several investigators have reported that an object viewed by a person experiencing micropsia appears further away than it really is (see McCready 1965). Possibly, the reduced apparent size induced by underestimation of distance induces a conscious judgment that the object is further away. If this is so, the distance underestimation that induces the micropsia is dissociated from the distance estimation that is subsequently based on the apparent size of the object.

25.2.6b The Wallpaper Illusion

The wallpaper illusion occurs when a regularly repeating pattern, such as that shown in Figure 14.3, is viewed with the eyes overconverged or underconverged by a multiple of the period of the pattern (see Section 14.2.2). The images of the pattern fuse, but the angle of vergence corresponds to a distance nearer than the true distance. As a result, the pattern appears nearer, and therefore smaller, than it is. When the eyes are underconverged on the pattern it appears further and larger than it is. When the eyes overconverge by one period on two objects a distance x apart, the distance of the point of convergence, C, is given by:

C = aA / (a + x)   (3)

where A is the actual distance of the objects and a is the interocular distance (with underconvergence, the sign of x is reversed). Several investigators have shown that the wallpaper illusion conforms to this equation

(Lie 1965; Ono et al. 1971; Logvinenko and Belopolskii 1994). If this were the only cause of the illusion, it would be a vivid illustration of the role of vergence in the perception of depth. However, we will now see that this simple explanation of the illusion in terms of vergence is not the whole story.

Logvinenko et al. (2001) asked subjects to set a depth probe to the perceived distance of a set of vertical rods while converging the eyes so as to mismatch the rods by a separation of one rod. The rods were about 4 cm apart at a distance of about 30 cm in a well-lit room. Subjects set the probe close to the theoretical illusory distance of the rods and could change vergence by at least 1° while maintaining the same illusion and avoiding diplopia. Misconvergence was required initially to mismatch the images of the rods. But the images remained mismatched when the vergence shifted from the initial state. They concluded that the wallpaper illusion is not due to the vergence state of the eyes but to disparity between the mismatched rods and other objects in view. Note that the wallpaper illusion should conform to the above equation whether it is due to misconvergence or to relative disparity with respect to other objects.

In an autostereogram, misconvergence on a repetitive pattern produces multiple apparent depth planes. These arise from relative disparities produced by irregularities in the spacing of the repetitive pattern, as explained in Section 24.1.6. Thus relative disparity between the mismatched images of a repetitive pattern and other objects in view contributes to the wallpaper illusion. But vergence could contribute to the illusion when there are no other objects in view. Logvinenko et al. did not measure the illusion with rods in an otherwise empty field, and they used only one vergence mismatch between the rods (see Kohly and Ono 2002). Also, it is not clear how the depth probe would be affected by the change in vergence. A nonvisual test probe, such as reaching with unseen hand, would have been better.

There is another complicating factor, even with a repetitive pattern viewed in isolation. A pattern in a frontal plane generates gradients of horizontal and vertical disparities. This is because, with increasing eccentricity, the elements of the pattern lie further from the horopter. Furthermore, disparity gradients vary as a function of viewing distance (Section 20.6). Ittelson (1960) believed that vergence micropsia and macropsia result from these changes in disparity and claimed that there were no vergence-induced changes in perceived size when the repeating pattern was confined to the horizontal horopter.

Ono et al. (1971) checked Ittelson’s claim by asking subjects to set an unseen pointer to the apparent distance of a wire-mesh surface for different amounts of misconvergence. In one condition, the surface was a vertical cylinder


containing the horizontal horopter defined by the Vieth-Müller circle. In another condition, the surface was in a frontal plane. Distance estimates in the frontal-plane condition conformed more closely to those predicted from vergence-distance scaling than did estimates in the horopter-plane condition. However, some vergence-distance scaling was evident in the horopter-plane condition, especially when the surface was seen for the first time. The horopter-plane stimulus still contained a gradient of vertical disparity in the quadrants of the display. Vertical disparities could be removed by using a repeating pattern consisting only of vertical lines or of a row of dots confined to the horizontal horopter.

Misconvergence on a regularly striped pattern can increase postural sway because a misperception of the distance of the pattern leads to a misperception of the motion created by body sway. This may be one reason why people fall when viewing the regular striped pattern on escalators (Cohn and Lasley 1990; Lasley et al. 1991).
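A worked example of Equation (3), as reconstructed above, using the Logvinenko et al. rod display (rods about 4 cm apart at about 30 cm); the 6.5-cm interocular distance is an assumed typical value:

```python
def convergence_point_cm(a_cm: float, A_cm: float, x_cm: float) -> float:
    """Distance of the point of convergence when the eyes overconverge by one
    pattern period: C = a * A / (a + x)."""
    return a_cm * A_cm / (a_cm + x_cm)

print(convergence_point_cm(6.5, 30.0, 4.0))   # ~18.6 cm
# The fused rods should appear roughly 11 cm nearer than their true 30-cm
# distance, and correspondingly smaller, if vergence alone set the illusion.
```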

25.2.6c Effect of Vergence on Apparent Size of Afterimages

Taylor (1941) observed that an afterimage observed in complete darkness appeared to grow and shrink as subjects moved their heads backward and forward. This effect was confirmed by Gregory et al. (1959). Taylor also observed that the afterimage of a white card held in the hand appeared to grow and shrink when the unseen hand was moved backward and forward in the dark (see also Carey and Allan 1996). The afterimage appeared constant in size when subjects fixated a stationary point of light (see also Mon-Williams et al. 1997). In the dark, the only cue to the “distance” of an afterimage is the vergence-accommodation state of the eyes, so that when these are held constant the afterimage does not appear to change in size. The apparent size of afterimages is discussed in more detail in Section 29.3.4.

25.2.7 PERCEPTUAL EFFECTS OF MAINTAINED VERGENCE

We saw in Section 10.2.5 that viewing the world through base-in or base-out prisms, even for a few minutes, leads to a shift of tonic vergence lasting minutes or hours, as revealed by changes in dark vergence, phoria, or fixation disparity. This section is concerned with the perceptual effects of vergence adaptation.

Figure 25.7. Sheldon Ebenholtz. Born in New York in 1932. He obtained a B.Sc. from City College, New York, in 1958 and a Ph.D. in psychology from the New School for Social Research, New York, in 1961. He joined the faculty of Connecticut College in 1961 and moved to the University of Wisconsin in Madison in 1966. Between 1988 and 1996 he was director of the Schnurmacher Institute for Vision Research in the College of Optometry of the State University of New York.

25.2.7a Postural Aftereffects

Tonus aftereffects, known as postcontraction or postural persistence, occur in all muscular systems. An illustration of postcontraction is the involuntary elevation of the arm after it has been pressed with some force against a wall.

Distortions of headcentric space are produced after the eyes have been held in an off-center position for a minute or two. For instance, when an observer attempts to return the gaze to the straight-ahead position the eyes remain displaced in the direction of previous deviation. Furthermore, pointing movements to a visual target with unseen hand miss in the direction of previous eye deviation (MacDougall 1903; Park 1969; Craske et al. 1975). The change in apparent straight-ahead increases with both the eccentricity and duration of the previous eye position. It has a maximum value of about 8° (Paap and Ebenholtz 1976). Similar effects occur after externally imposed deviation of a passive eye in the dark (Gauthier et al. 1994). These effects are due to the asymmetrical posture of the eyes rather than to the asymmetrical position of the visual target (Hill 1972; Morgan 1978). An object seen in dark surroundings appears to drift in the opposite direction to that of a previous deviation of the eyes (Gregory and Zangwill 1963; Levy 1973). Holding the head in an asymmetrical posture produces similar effects on the apparent straight-ahead (Howard and Anstis 1974; Ebenholtz 1976) (Portrait Figure 25.7).

The aftereffects of asymmetric eye posture are also revealed in physiological studies. The firing rate of single cells in the region of the oculomotor nuclei of the monkey was related to the position of the eyes in the orbits

(Eckmiller 1974). However, the firing rate of a given cell was higher after the eyes had approached a target position from a less eccentric angle of gaze than when they approached the target from a more eccentric angle. Thus, the cells manifested directional hysteresis. The difference between ingoing and outgoing impulse rates was between 5 and 22 impulses/s, which corresponds to a difference in eye position of several degrees. These data are compatible with psychophysical findings and could be due to any of the following causes.

1. Posttetanic potentiation. Muscles become more responsive to a given level of innervation after a period of active contraction. This is known as posttetanic potentiation (Hughes 1958; Olson and Swett 1971). It is counterbalanced by the fact that prestretching increases the elastic tension in the antagonistic muscle (Bosco and Komi 1979).

2. Adaptation of muscle-spindle receptors. After the contraction of a muscle, muscle-spindle receptors show a persistent sensory discharge, which is probably due to an alteration in the contractile state of extrafusal (ordinary) muscle fibers and intrafusal muscle fibers (muscle fibers within sensory muscle spindles) (Morgan et al. 1984; Wilson et al. 1995).

3. Adaptation of tendon organs. Golgi tendon organs and ligament receptors adapt when subjected to steady tension (Houck 1967; Ferrell 1980).

Human extraocular eye muscles contain muscle spindles, Golgi tendon organs, and palisade endings (Cooper et al. 1955; Richmond et al. 1984). Stimulation of proprioceptors in the extraocular muscles by vibration causes a point of light to appear to move (see Velay et al. 1994).

25.2.7b Aftereffects of Maintained Vergence

Viewing the world through base-out prisms or a telestereoscope increases vergence demand. After a few minutes, distances are underestimated. Distances are overestimated for a while after the device has been removed. Base-in prisms, which decrease vergence demand, produce the opposite effects. These effects are what one would expect if vergence serves as a cue to absolute distance. A dispute has arisen about the cause of these aftereffects. Three causes have been proposed:

1. Changes in the tonic state of vergence.

2. Recalibration of the vergence/apparent-distance system arising from the disturbed relation between vergence and disparity, on the one hand, and other cues to distance, such as familiar size, motion parallax, and perspective, on the other.

3. Recalibration of the vergence/apparent-distance system arising from interaction between the observer and the visual environment.

Wallach et al. (1963) conducted the first experiment of this kind. Subjects viewed rotating wire forms for 10 minutes through a telestereoscope that increased the effective interocular distance to 14.1 cm. Subsequently, with normal viewing, the perceived depth between the front and back of a wire form was reduced by 19%, although its perceived size was unchanged. Perceived depth was increased by about 15% when the viewing device reduced interocular distance to 3.8 cm. Wallach et al. interpreted these results in terms of the conflict between the change in binocular disparity produced by the telestereoscope and other cues to depth that were not changed by the telestereoscope, in particular, motion parallax and perspective.

Viewing the world through a telestereoscope or prisms may have produced a tonic change in the eye muscles. This may have contributed to the effect reported by Wallach et al. Wallach and Halperin (1977) produced evidence that muscular aftereffects do not account for the whole of the effects of adaptation to prisms (see also von Hofsten 1979). On the other hand, Fisher and Ebenholtz (1986) used a procedure similar to that used by Wallach et al. (1963) and obtained similar aftereffects when there was no conflict between disparity and monocular cues to depth during the induction period. They concluded that aftereffects of viewing through a telestereoscope are due to a change in the tonic state of the extraocular muscles, which causes a change in the apparent distances of objects (Section 10.2.5). Even a small change in apparent absolute distance would have a large effect on the perceived relative depth in an object, because the disparity produced by a given depth interval is inversely proportional to the square of viewing distance. On the other hand, changes in the apparent size of objects would be small, because the angular size of an object is inversely related to distance, not to the square of distance.

Fisher and Ciuffreda (1990) obtained direct measures of changes in tonic vergence, perceived distance, and perceived depth after subjects moved about in a building and performed simple tasks for 30 minutes while wearing a telestereoscopic device. The distance and depth aftereffects were opposite to those predicted from a conflict between disparity cues and monocular cues but were consistent with a change in the tonic state of vergence.

Other lines of evidence support the idea that changes in perceived distance arise from changes in the tonic state of vergence. In one experiment, 6 minutes of fixation of a

visual target in a stereoscope at a near distance produced a subsequent overestimation in the perceived distance of a normally viewed test object. Fixation at a far distance produced an underestimation in perceived distance. Maintained fixation at an intermediate distance of about 32 cm produced no aftereffects, presumably because this was the distance corresponding to dark vergence (Ebenholtz and Wolfson 1975).

In a related experiment, subjects fixated an isolated visual target at a distance of 41 cm for 6 minutes through prisms ranging from 20-diopter base-out, requiring 32° of convergence, to 8-diopter base-in, requiring 0.1° of divergence. The size of the aftereffect was approximately proportional to the depth interval between the position of maintained vergence during the induction period and the position of the test object, which happened to be near the position of vergence that the eyes assume in the dark (Paap and Ebenholtz 1977). In both these experiments other depth cues were either held constant or reduced to a minimum, and the experimenters concluded that changes in apparent distance resulting from maintained vergence are due simply to changes in muscle tone rather than to conflicting depth cues.

Maintained near vergence increases tonus in the medial rectus muscles, so that less innervation is required to hold the gaze on an object. This creates the impression that the object is further away than it normally appears. Maintained far vergence has the opposite effect. Shebilske et al. (1983) found that 10 minutes of fixation on an isolated target 11 cm away induced 4.6 diopters of esophoria and a 6.3-cm overestimation of distance. Other evidence cited at the beginning of this section and in Section 10.2.5 establishes that changes in muscle tone in extraocular muscles do occur.

Judge and Bradford (1988) found that subjects closed their hand too soon when trying to catch an approaching ball seen through a telestereoscope, which increased vergence demand. With feedback, subjects soon compensated, but they showed an opposite effect after viewing was returned to normal. Since the room was visible, it is not clear to what extent disparity between the ball and the stationary surroundings was involved rather than altered vergence demand.
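A minimal sketch of the scaling argument used in Section 25.2.7b above (our numbers, for illustration): the disparity produced by a depth interval at distance D is approximately a·Δd/D² for small angles, so the depth recovered from a fixed disparity grows with the square of the registered distance, whereas angular size scales only linearly.

```python
A = 0.065  # interocular distance (m); assumed typical value

def disparity_rad(depth_m: float, distance_m: float) -> float:
    """Approximate disparity produced by a depth interval at a given distance."""
    return A * depth_m / distance_m ** 2

def depth_from_disparity(disp_rad: float, distance_m: float) -> float:
    """Depth interval implied by a disparity at a registered distance."""
    return disp_rad * distance_m ** 2 / A

disp = disparity_rad(0.02, 0.50)          # 2 cm of depth viewed at 50 cm
print(depth_from_disparity(disp, 0.50))   # 0.020 m when distance is registered correctly
print(depth_from_disparity(disp, 0.40))   # ~0.013 m if distance is registered as 40 cm
# Perceived depth shrinks by ~36%, while angular size would change only by the
# ratio 50/40, which is the asymmetry noted in the text.
```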

25.2.7c Vergence and Other Depth Cues

Even though the primary effect of maintained vergence is a change in the tonus of the extraocular muscles, the way this affects judgments of depth depends on the presence of other cues to distance. Wallach et al. (1972a) asked subjects to walk about for 20 minutes wearing 1.5-diopter base-in prisms, which decreased both the convergence and accommodation required to fixate an object. Or they wore base-out prisms, which had the opposite effect. The prisms altered the relationship between convergence/accommodation and other cues to depth such as perspective, disparity, and familiar size. After adaptation, subjects matched the length of a rod they could feel but not see to the depth between the back and front of a wire pyramid. The absolute distance of the pyramid could be detected only on the basis of accommodation and vergence. Estimates of depth within the pyramid changed after subjects had adapted to the prisms, although not by as much as predicted if the visual system had fully adapted to the altered state of vergence. Subjects also estimated the apparent distance of a test object before and after adaptation, by pointing to the position of the object with the unseen hand. The change in apparent distance was the same percentage of full adaptation as that evident in the rod-matching test. Wallach et al. concluded that pairing an unusual state of vergence/accommodation with veridical depth cues leads to a recalibration of vergence/accommodation and to a corresponding change in depth constancy.

O'Leary and Wallach (1980) tested whether perceptual scaling of depth can be induced by an apparent change in distance induced by a false familiar-size cue. A normal dollar bill and one 72% of normal size were presented one at a time at the same distance in dark surroundings. A small white disk was suspended 1 cm in front of each bill. Subjects set a pair of calipers by touch to match the depth interval between each disk and its dollar bill. If the perceived distances of the bills had been determined by their angular sizes, the smaller bill should have appeared 1.39 times as far away as the normal bill. Also, the depth between the disk and the smaller bill should have appeared larger than that for the normal bill by a factor of (1.39)², or 1.93. In fact, it appeared larger by a factor of 1.7. This result indicates that an unusual relationship between convergence and familiar size can affect the scaling of perceived depth. However, it is not clear what the scaling factor was, since the perceived distances of the test objects were not determined (see also Predebon 1993).
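The predicted numbers in this experiment follow directly from the size-distance and disparity-distance relations described above. A worked version (ours, assuming only those two relations):

```python
# O'Leary and Wallach (1980): a dollar bill shrunk to 72% of normal
# linear size, viewed at the same physical distance as a normal bill.
size_ratio = 0.72

# If familiar size drives perceived distance, the small bill should
# appear farther away in inverse proportion to its size:
distance_ratio = 1 / size_ratio              # ~1.39

# A fixed disparity interval is perceptually scaled by the square of
# perceived distance, so the disk-to-bill depth should appear larger by:
depth_ratio = distance_ratio ** 2            # ~1.93

print(round(distance_ratio, 2), round(depth_ratio, 2))
# The observed factor was about 1.7, between no rescaling (1.0) and
# the full prediction (1.93).
```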

A change in the perceived distance of points of light, as revealed by pointing with an unseen hand, was produced in subjects after they had looked at their own feet for 3 minutes through base-out prisms (Craske and Crawshaw 1974). Heuer and Lüschow (1983) exposed subjects to a row of LEDs for 6 minutes at distances of 23, 32, or 50 cm. The aftereffect was indicated by errors of pointing with a hidden finger to a test stimulus. When cues to distance other than vergence were absent during the induction period, the aftereffect was reduced by the addition of other depth cues to the test stimulus. This demonstrates that adaptation of the oculomotor system is sufficient to generate the perceptual aftereffect. However, the aftereffect was stronger when conflicting cues to distance were present during the induction period, demonstrating that cue conflict also contributed to the aftereffect. The direction of the aftereffect depended on the absolute distance of the induction stimulus rather than on the relative distance of induction and test stimuli. This result is what one would expect if the aftereffect arises from tonic changes in the eye muscles. Shebilske et al. (1983) also found that shifts in phoria and pointing induced by maintained fixation were greater for an isolated test object than for one seen in a structured visual environment. Turvey and Solomon (1984) argued that the aftereffects might have been due to a change in the felt position of the arm and other uncontrolled factors.

25.2.7d Effects of Visual-Motor Experience

At least part of the effect of wearing prisms may be due to recalibration of central processes arising from active visual-motor experience. For example, the effects of wearing 4-diopter base-out prisms on the perceived distance of a point of light were greater for subjects who moved actively or passively through a building during the 20-minute induction period than for subjects who read a magazine (Owens and Leibowitz 1980). It is not clear whether the smaller effect in the second set of subjects occurred because subjects did not move or because they maintained a fairly constant angle of vergence. Ebenholtz (1981) obtained a greater effect on perceived distance in subjects who wore 5-diopter base-out prisms for 15 minutes while moving about in a normal visual environment than in subjects who maintained an equivalent convergence on an object for the same length of time. But this difference may have been due to the fact that subjects in the second group were exposed to an impoverished visual stimulus with fixed vergence while those in the first group observed a natural scene containing many objects and monocular cues to distance.

25.2.8 VERGENCE AND JUDGMENT OF RELATIVE DEPTH

Although eye movements are not required for stereopsis, depth between two laterally separated targets can be detected more easily when the gaze is allowed to move between them (Section 18.10.2). The present section deals with whether depth between two objects in the same visual direction but well separated in depth is perceived more accurately when the gaze moves from one to the other rather than remains on one of them.

Foley and Richards (1972) placed a test object at various distances in front of a screen. The screen was at an optical distance of either 250 or 24 cm. In one condition, subjects looked back and forth between object and screen and set the object to a specified distance relative to the screen. In another condition, subjects fixated a point on the screen and estimated the relative distance of a disparate test object, flashed on for 80 ms. Incidental differences between conditions, such as the difference in exposure time, were allowed for in control conditions. Results for one subject are shown in Figure 25.8.

Figure 25.8. Perceived distance and relative distance. Perceived distance of an object from a screen as a proportion of the perceived distance of the screen, as a function of the ratio of the two physical distances. The screen was at a fixed distance of 250 or 24 cm. The top curves show results when the subject could look between object and screen. The middle curves are for when the subject fixated a point on the screen and the object was flashed on for 80 ms. The bottom curves are for when the subject set a depth probe to match the perceived depth of the flashed object (N = 1). (Redrawn from Foley and Richards 1972)

When vergence movements were allowed, subjects gave reasonably accurate estimates of the distance of the test target from the screen, relative to the distance of the screen from the eye, for all distances of the target. Over the middle range of distances of the test object the perceived relative distance was overestimated by about 10%. When eye movements were not allowed, perceived relative depth was accurate only when the target was near the screen, so that the disparity of the images of the test object was small. As disparity increased, the test object appeared much closer to the screen than it actually was. Thus, small disparities were accurately registered without vergence eye movements but large disparities were not. The large disparities may have been beyond the range of disparity detectors. In any case, the highly disparate images would be diplopic. Diplopic images may simply have defaulted to the same perceived distance as the screen. Otherwise, the improved performance with eye movements may have been due to information provided by vergence in the form of either motor efference or kinesthetic inputs from the extraocular muscles. It could also have been due to changes in disparity produced by eye movements.

26 DEPTH FROM PERSPECTIVE

26.1 Introduction 15
26.1.1 Geometry of perspective 15
26.1.2 Types of perspective 23
26.2 Size perspective 24
26.2.1 Experiments on size as a cue to distance 24
26.2.2 Familiar size as a cue to distance 25
26.2.3 Size discrimination and matching 27
26.3 Linear perspective 27
26.3.1 Information contained in linear perspective 27
26.3.2 Judging surface inclination from linear perspective 28
26.3.3 Judgments of projective invariants 30
26.3.4 Distortions in viewing pictures 32
26.3.5 Drawing in perspective 36
26.4 Position and perceived distance 39
26.4.1 Effect of height in the field of view 39
26.4.2 Optical adjacency 39
26.4.3 The visual horizon and absolute distance 40
26.4.4 Height in the field and size judgments 43
26.5 Texture perspective 44
26.5.1 Types of texture gradient 44
26.5.2 Texture gradients and perceived inclination 46
26.6 Texture gradients on curved surfaces 51
26.6.1 Defining and measuring 3-D shape 51
26.6.2 Texture gradients on cylindrical surfaces 53
26.6.3 Texture gradients on complex surfaces 54
26.7 Reversible perspective 58
26.7.1 Reversible perspective in 2-D displays 58
26.7.2 Reversible perspective in 3-D objects 61

26.1 INTRODUCTION

26.1.1 GEOMETRY OF PERSPECTIVE

26.1.1a Basic Geometry

The geometry of perspective forms part of projective geometry, which was defined in Section 3.7.2a. In general we start with a set of object points in 3-D space, each of which is projected onto an image surface. A projection line is a line connecting an object point and its image point. In polar projection, all projection lines pass through a common center of projection, P, as shown in Figure 26.1. The image surface may be any shape, but it is usually flat, as in photography, or spherical, as in the eye. In the eye, the center of projection is the nodal point, which we will assume is at the center of the eye. In parallel projection, or orthographic projection, the projection lines are parallel. The resulting geometry is known as affine geometry. When P is infinitely distant from the image plane, polar projection lines are effectively parallel.

In any system, any object point outside the center of projection, P, lies on only one projection line. All points on the same projection line project to the same image point, which is unique to that projection line. In polar projection, the projection line orthogonal to a flat image surface is the central projection line, and its image is the central image point, or centric point. For a spherical image surface centered on P, the centric point can be any specified point. In the eye, the central projection line is the visual axis, and the centric point is at the center of the fovea. The distance between the centric point and the center of projection is the projection distance, d. For a spherical eye with a central nodal point, all image points are at distance d from P. For a flat image surface, the distance between P and an image point at eccentricity θ is d/cos θ.

Figure 26.1. Basic projection with rectangular coordinates. A point in position (x, y, z) with respect to rectangular coordinates (X, Y, Z) centered on a center of projection, P, forms an image at point (–x′, –y′, –z′) on a plane at distance –d from the center of projection, P. From similar triangles, x′ = x/(z/d), y′ = y/(z/d), and z′ = d.

A perspectivity is the mapping of a specified set of object points onto a defined image plane by a specified rule of projection. Consider points in 3-D space projected by polar projection through a nodal point onto a spherical retina. Let the location of each object point be specified in polar coordinates. The direction of an object point is defined by the meridian and eccentricity of the line of sight on which the object lies. The distance of the point is its distance from the nodal point. The directions of image points are defined in the same way, but all retinal points are the same distance from the nodal point.

In studying perspective in vision we are interested only in the perspectivities in which the shape of the image is not the same as that of the rigid object that produced it. Consider any shape lying on a spherical surface outside the eye and centered on the nodal point. The image has the same shape as the object no matter where the object lies on the surface. If the surface on which the object lies is at the same distance from the nodal point as the retina, the image is the same size as the object. The image of any object moving in any way within a surface concentric with the nodal point of the eye does not undergo a perspective transformation. Any such surface is therefore a locus of zero perspective transformation, or isoperspective locus. In other words, if each point in an object remains at the same distance from the nodal point of the eye, any movement of the object does not produce a change in perspective. Also, moving each of a set of object points along a line of sight does not produce changes in perspective. Changes in perspective arise only from any motion of an object other than a motion in an isoperspective locus or a motion that carries each object point along a projection line (line of sight).

A rigid object moving from one isoperspective locus to another produces a simple change in size perspective. Rotation of a rigid surface about an axis tangential to an isoperspective locus produces an image with a first-order gradient of perspective. Deforming an evenly textured planar surface into a surface that is curved in depth, such as a cylinder or sphere, produces an image with a second-order perspective gradient. In each case, the change in perspective is specified with respect to a reference object lying in the zero-perspective locus.

In polar projection, a rigid object cannot move purely in distance because only one point of the object can move along a line of sight. All other points in the object change their direction—they move across lines of sight. This produces a perspective change in the image. Perspective consists of those changes in the direction of image points that are associated with changes in depth. A pure change in depth does not produce changes in direction and therefore does not produce changes in perspective. Rather, changes in perspective are due only to changes in the directions of points on an object caused by a change in the distance of the whole or part of an object from the eye.
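As a concrete illustration of the two projections (a minimal sketch of ours, not the book's notation; the 17-mm projection distance is a round figure for the human eye), the flat-plane image scales with the magnification d/z, while the spherical image preserves only the direction of each object point:

```python
import math

def project_flat(x, y, z, d=0.017):
    """Polar projection of object point (x, y, z) through the center of
    projection onto a flat image plane at projection distance d, using
    the similar-triangles relations of Figure 26.1. The image is
    inverted, as on the retina."""
    m = d / z                          # magnification d/z
    return (-x * m, -y * m)

def project_spherical(x, y, z):
    """Projection onto a spherical image surface centered on the nodal
    point: only the direction of the object point survives, expressed
    as a meridional angle and an eccentricity."""
    r = math.sqrt(x * x + y * y + z * z)   # radial distance from P
    eccentricity = math.acos(z / r)        # angle from the central projection line
    meridian = math.atan2(y, x)            # meridional angle
    return (meridian, eccentricity)

# Doubling the distance of a point along the Z axis halves its flat image:
print(project_flat(1.0, 0.5, 2.0))   # (-0.0085, -0.00425)
print(project_flat(1.0, 0.5, 4.0))   # (-0.00425, -0.002125)
print(project_spherical(1.0, 0.5, 2.0))
```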



For parallel projection, direction is specified in Cartesian coordinates and distance by the distance from the plane of projection. In this system, movement of a rigid object in distance produces no change in perspective, because each object point moves along one of the parallel lines of sight. But rotations in depth of a rigid object (other than a sphere or a cylinder rotating about its main axis) do produce changes in perspective, because they cause object points to move between lines of sight.

The Cartesian coordinates of a point in space relative to those of its image in a flat image plane are shown in Figure 26.1. The depth dimension, z, of a point in space is the orthogonal distance between the frontal plane containing the point and the projection point P. For a spherical image surface, the depth dimension of a point is its radial distance from P. The polar coordinates of a point in space relative to those of its image on the retina are shown in Figure 26.2. In both cases, the ratio d/z defines the magnification of the image of any object at distance z.

Figure 26.2. Projection onto the retina in polar coordinates. Projection of point (θ, φ, z) through a nodal point P onto a retina at distance d from P. The retinal point is expressed in polar coordinates of meridional angle, θ, and eccentricity, φ. The size of the retinal image is scaled by d/z with respect to the size of the object.

For polar projection onto a flat or spherical surface, the image of any infinitely long line parallel to a projection line ends in a vanishing point. This is the image of the most distant points on all lines parallel to a given projection line. For a flat image surface, parallel straight lines in any plane parallel to the image surface produce parallel images. For the eye, concentric circles on the surface of any sphere centered on the nodal point produce images that fall on concentric lines of latitude. For any image surface, the images of any set of receding parallel lines converge on a common vanishing point. The vanishing point of lines parallel to the principal projection line is the central image point, or principal vanishing point. In the eye, the vanishing point of all lines parallel to the visual axis falls on the fovea.

The images of any set of receding parallel flat planes converge on a horizon line. The horizon line for any set of planes parallel to the visual axis is a specific retinal meridian, or great circle through the fovea. The normally horizontal retinal meridian is the visual horizon. The horizon line for any set of parallel planes not parallel to the visual axis is a retinal great circle not through the fovea. Each horizon line can be thought of as the image of one of the set of great circles round the infinite spherical universe.

The images of planes parallel to the Earth horizontal converge on the geographical horizon. This horizon falls on whichever retinal great circle happens to be horizontal. This horizon cuts the fovea only when the gaze is horizontal, and is the visual horizon only when the normally horizontal retinal meridian is horizontal. It can be regarded as the image of a horizontal great circle of the universe at eye level. For example, the image of the horizon at sea is everywhere horizontal but encircles the observer. Because vanishing points and horizons are images of objects at infinite distance, they maintain fixed relative positions when the observer moves.

Linear perspective is traditionally classified into one-point, two-point, and three-point perspective. In computer graphics, object space is represented by X, Y, Z Cartesian axes, and the image plane is parallel to the X, Y plane, as in Figure 26.1. Let the object be a rectangular 3-D object with three axes A, B, C. When each object axis is parallel to a space axis, we have one-point perspective.

When only one object axis is parallel to a space axis we have two-point perspective. When no object axis is parallel to a space axis, we have three-point perspective, as shown in Figure 26.3. A more general definition for multiple objects is as follows. In one-point perspective, all receding lines in the scene have the same vanishing point because they are all parallel to each other. In two-point perspective, all receding lines have vanishing points on the same horizon line because they are all parallel to a common surface, but not all parallel to each other. In three-point perspective, not all vanishing points are on the same horizon line because lines in the scene are parallel to either of two nonparallel surfaces.

The vanishing point for any set of parallel lines at 45˚ to the principal projection line lies at a distance d in the picture plane from the central vanishing point, where d is the distance between the projection point and the picture plane. Thus, the locus of all such vanishing points is a circle in the image plane with radius d. The vanishing point for a set of 45˚ parallel lines in any plane parallel to the horizon plane lies on one or the other side of the central point. These two points are known as distance points. The two distance points subtend 90˚ at the center of projection, which is the angle between the 45˚ lines that project to the distance points.

Figure 26.3. One-point, two-point, and three-point perspective.
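The distance-point construction can be checked numerically. In the sketch below (ours), the vanishing point of a family of parallel lines with direction (dx, dy, dz) is the image of their point at infinity, (d·dx/dz, d·dy/dz); lines at 45˚ to the principal projection line land exactly a distance d from the central vanishing point:

```python
def vanishing_point(direction, d=1.0):
    """Image, on a flat plane at projection distance d, of the point at
    infinity on all lines with the given 3-D direction (eye looks along +Z)."""
    dx, dy, dz = direction
    if abs(dz) < 1e-12:
        return None   # lines parallel to the image plane have no finite VP
    return (d * dx / dz, d * dy / dz)

print(vanishing_point((0, 0, 1)))    # (0.0, 0.0): principal vanishing point
print(vanishing_point((1, 0, 1)))    # (1.0, 0.0): a distance point, at distance d
print(vanishing_point((-1, 0, 1)))   # (-1.0, 0.0): the other distance point
# The two distance points subtend 90 degrees at the center of projection.
```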

For small central visual displays, we can consider the retina to be a plane surface normal to the visual axis. However, for large displays, we cannot ignore the spherical shape of the retina and the non-Euclidean geometry on its surface. A large image on a spherical retina cannot be portrayed on a flat surface. This has led some theorists to conclude that the geometrical principles of perspective used in painting on a flat surface are either inaccurate or, worse, only an arbitrary convention (Panofsky 1940).

But viewing the 2-D projection of a 3-D display creates the same retinal image as the 3-D display if the nodal point of the eye and the projection center of the picture coincide. There will be a difference in image blur due to accommodation, unless the picture is projected onto the flat plane through a lens like that in the eye. The principle of projective equivalence states that any two displays, whether real scenes or pictures, create the same image if all points on any projection line in one display are on a corresponding projection line in the other display. Thus, although a flat photograph or drawing in perspective is not congruent with the image on the curved retina, it nevertheless creates the same image as the original scene when viewed from the correct vantage point. The 3-D display, the drawing, and the retinal image are therefore projectively equivalent (Pirenne 1952).

A single flat drawing of a 3-D scene cannot create the correct image in both eyes at the same time, because only one eye can be at the correct vantage point. A stereoscope overcomes this difficulty because it presents to each eye an image appropriate to that eye. But a stereogram lacks the depth cues of differential accommodation and motion parallax. These limitations may be overcome by stereoscopically combining the images of two 3-D objects, as described in Section 24.1.8.

26.1.1b Projective Invariants

In polar projection on a flat image plane, all points on a projection line produce the same point image, which is unique to that projection line. The image of any flat surface in a plane containing a projection line is a line. Otherwise, lines or 2-D shapes and their images produced by polar projection on a flat image plane possess the following invariant properties:

1. The image of a straight line is a straight line, or a point if it lies on a line of sight. However, disconnected lines may produce a continuous line image. Curved lines produce curved images unless they lie in a plane containing lines of sight.

2. All flat objects that lie in a plane parallel to lines of sight produce line images.

3. A polygon remains a polygon with the same number of sides. Conic sections remain conic sections.

4. Coincident object points produce coincident image points, although coincident image points do not necessarily arise from coincident object points.

5. The order and connectivity of points is preserved except for folded or overlapping objects.

6. Ratios of distances along a line are not preserved, but cross ratios of distances among any four collinear points are preserved, as shown in Figure 26.4.

Figure 26.4. The cross ratio. Line AD in one plane is projected from point O to form an image A′D′ in a second plane at an angle to the first plane. The four points in the object and the image are related by the cross ratio: (AB/BC)/(AD/CD) = (A′B′/B′C′)/(A′D′/C′D′).

In general, the angle subtended at the projection center by the vanishing points of any two coplanar lines equals the angle in 3-D space between those lines. Think of two lines intersecting at an angle of, say, 10˚. The plane in which the lines lie, when extended to infinity in all directions, defines a great circle around the universe with the lines intersecting at the center. The 10˚ angle between the extensions of the lines therefore marks out 1/36 of the total circle. Since this interval is at infinity, it subtends the same angle at any point infinitely distant from it. In particular, it subtends 10˚ at the nodal point of an eye near the intersection of the lines. It is thus possible to derive the angle between any two straight lines in space if the horizon line of the surface within which the lines lie is known.

It follows that the angles of any triangle can also be derived, as illustrated in Figure 26.5. Once the angles of a triangle have been determined, the relative lengths of the sides can be determined from the fact that the ratio of any two sides of a triangle equals the ratio of the sines of the opposite angles. The shape of any polygon can also be determined, because a polygon can be decomposed into triangles.
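Invariance of the cross ratio is easy to verify numerically. The sketch below (ours) represents positions along a line by scalars and applies an arbitrary nondegenerate projective map of the line, x′ = (px + q)/(rx + s):

```python
def cross_ratio(a, b, c, d):
    """Cross ratio of four collinear points given as scalar positions
    along the line, in the form (AB/BC) / (AD/CD)."""
    ab, bc, ad, cd = b - a, c - b, d - a, d - c
    return (ab / bc) / (ad / cd)

def project(x, f):
    """Central projection from one line to another: any such map has
    the projective form x' = (p*x + q) / (r*x + s) with p*s != q*r."""
    p, q, r, s = f
    return (p * x + q) / (r * x + s)

pts = [0.0, 1.0, 3.0, 7.0]
f = (2.0, 1.0, 0.5, 3.0)     # an arbitrary nondegenerate projective map
before = cross_ratio(*pts)
after = cross_ratio(*[project(x, f) for x in pts])
print(before, after)          # both ~0.285714: the cross ratio is preserved
```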

Figure 26.5. Angles in space and visual subtense of vanishing points. The vanishing points of all sides of a triangle lying on an inclined plane fall on a common horizon line. The distance between the vanishing points of any two sides is determined by the angle of their intersection. The angle subtended at the eye by any distance along the horizon at infinity is the same for all vantage points. Therefore, the angle subtended at an eye by any distance along the horizon equals the angle between the corresponding sides of the triangle. Thus, the angles in any triangle can be determined from the vanishing points of its sides. The relative lengths of any pair of sides of the triangle are given by a/b = sin A/sin B.
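A small numerical companion to this construction (ours, with assumed example angles): given two interior angles recovered from the vanishing-point subtenses, the third angle follows from the angle sum, and the relative side lengths follow from the law of sines.

```python
import math

# Two interior angles recovered from vanishing-point subtenses
# (assumed example values); the third follows from the angle sum.
A, B = math.radians(40), math.radians(60)
C = math.pi - A - B

# Sides opposite each angle, up to a common scale (law of sines):
a, b, c = math.sin(A), math.sin(B), math.sin(C)
print(round(math.degrees(C), 1), round(a / b, 3), round(c / b, 3))
# 80.0, a/b = sin40/sin60 ~ 0.742, c/b = sin80/sin60 ~ 1.137
```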

Figure 26.6. Projective equivalence of shapes. All shapes that fill out the same visual pyramid form the same image in the eye. Thus, the folded rectangle at A forms the same image as a slanted trapezoid at B, an inclined trapezoid at C, and a curved shape at D.

Figure 26.7. The Ames chair demonstration. (A) A set of luminous rods in a dark box viewed from the top. (B) The same rods viewed through a hole in the end of the box.

A given image can arise from any of an infinite number of objects that fill the cone of light rays that define the image. For example, an image like that in Figure 26.6 could arise from a slanted square or from any number of appropriately tapered trapezoids. These are projectively equivalent objects. For example, the circle and the ellipse are projectively equivalent. These are the conic sections described in Section 3.7.2c.

In the Ames chair demonstration a set of luminous rods suspended in a box in an apparently random fashion appears as a chair when viewed through a hole in the end of the box, as illustrated in Figure 26.7. It has been argued that the illusion occurs because the chair is a familiar object (Gombrich 1961). But the effect would occur with an unfamiliar object. The point is that one can see that the lines are disconnected when they are viewed from the top. When viewed through the hole, the ends of the lines that lie on the same visual line appear connected because their images are connected. The effect works because the visual system adopts the rule that lines that are connected in the image are connected in space. The assumption is equivalent to the generic viewpoint assumption, namely that a perceived object does not change dramatically when the viewpoint shifts slightly.

Conversely, a rigid 2-D object can produce an infinite number of images, depending on its distance and orientation to a line of sight. These are projectively equivalent images. Therefore, these rules and invariants do not allow one to determine which stationary 2-D object produces a given image. The rules allow one to decide whether a given object could produce a given image, and they specify the set of objects that could produce the image.

There are no general geometric invariants for projection from an opaque 3-D object to a 2-D image. From a single view of such an object one cannot determine the set of projectively equivalent objects, because many object points are occluded and therefore unspecified in the image. However, given some assumptions about the 3-D object, such invariants can be found (Weiss and Ray 1999). A 3-D object can be more fully specified from two or more views with defined vantage points, as in binocular stereoscopic vision (Lasenby and Bayro-Corrochano 1999).

26.1.1c Projective Transformations

Any change in the size or shape of an image produced by a motion of a rigid object is a projective transformation of the image. This definition excludes simple image translations and rotations. A rigid object and all its transformed images are projectively equivalent. Incidence, collinearity, and order of points are preserved, but not parallels, lengths, or angles (Section 3.7.2c). Any moving object can produce an image of constant shape if it undergoes an appropriate plastic transformation. A set of objects that produce the same image defines a projective transformation of the object, since all members of the set are projectively equivalent with respect to a defined image and center of projection.

For any image surface there is a surface in object space within which any object and its image have identical size and shape. These are conjugate surfaces. Conjugate surfaces are reflections of each other through the center of projection. The conjugate of a flat image surface is a surface parallel to the image surface at an equal distance on the other side of the projection point. The conjugate surface of a hemispherical retina with central nodal point is a hemisphere of opposite sign. Any line or surface in a conjugate object surface is isometric (same shape and size) with its image. Therefore, for any object on a conjugate surface, magnification is 1.

The image of a shape parallel to a conjugate surface is similar to the object—parallels, angles, and shape are preserved, but not size, as shown in Figure 26.8. For any location on such a surface, magnification is constant. Object/image similarity is also preserved for translation of any long line along its own length and for rotation of a sphere. Any object moving along a projection line at a constant orientation to the line projects an image of constant shape, which is not necessarily the shape of the object.

Figure 26.8. Loci of constant image size and shape. For a flat image plane, the image of a flat object remains constant in size and shape for all positions of the object within any one plane parallel to the image plane. Also, object and image have the same shape for all orientations of an object in any such plane. For a spherical image plane, the image of a flat object remains the same size and shape if the object is tangential to any one sphere centered on the center of projection. Image and object are the same shape if the object is tangential to any such sphere. In addition, a 3-D object projects the same image if one of its sides remains tangential to one of these spheres. However, object and image are not the same shape. For both image planes, as a small object moves with constant orientation along any projection line, image shape remains constant, and image size varies inversely with the distance of the object from the projection center.

It follows from the above principles that the image of a fixed object on a flat image surface changes shape when the surface rotates about any axis in the surface but not when the surface moves within its own plane. The image of an object on a spherical projection surface, such as the retina, does not change when the surface rotates about the projection point but does so when the surface translates. Retinal images do change a little when the eye rotates, because the center of rotation and the nodal point are not coincident.

26.1.1d Projective Transformations of a Line

Let line AB be placed in a frontal plane at distance D from projection center P, as shown in Figure 26.9. Line AB subtends angle θ and projects image ab. For a flat projection surface, triangle ABP is similar to triangle abP. Therefore, ab is inversely proportional to D. For a spherical projection surface, ab is proportional to θ. For small θ, ab is inversely proportional to D. Thus, the retinal image of a line shrinks approximately in proportion to the distance of the line from the nodal point along a line of sight. However, frontal line AB at distance D projects the same image when it is placed at nearer position A′B′ if the orientation of the line is changed appropriately. Thus, a change in the image of a line with decreasing distance can be compensated for by a change in line orientation.

Figure 26.9. Projection of a line as a function of distance. On a flat surface, the length of the image of AB is inversely proportional to D, since triangle ABP is similar to triangle abP. On a spherical surface, the length of ab is proportional to θ, or arctan AB/D. For a short line at a large distance, image size is approximately inversely proportional to D. Line A′B′ is the same length as line AB but projects the same image.

A line of fixed length projects the same image on a spherical retina if it remains tangential to a sphere centered on the nodal point. Any such line subtends a constant angle at the nodal point. Therefore constancy of visual subtense is necessary and sufficient for constant image size. On a flat image surface, a line of fixed length projects the same image if it remains within a plane parallel to the image plane, even though its angular subtense changes. Therefore, for a flat image plane, constant visual subtense does not guarantee constant image size.

The length of the image of a line depends on two factors. One factor is magnification. Consider object point X on a line, and its image x produced by projection through point P. Magnification is the ratio of distance Px to distance PX. The second factor is the orientation of the line to the projection line through its center. Figure 26.10 illustrates how image size for differently oriented line elements varies as a function of location along a frontal plane for flat and spherical image surfaces.

For example, consider a short horizontal line AB on a frontal plane, projected through P onto image ab on a spherical image surface, as in Figure 26.10b. The length of image ab decreases as AB moves to eccentricity ε. First, there is a change in magnification. The distance of AB from P increases in inverse proportion to cos ε, so the angle of subtense, θ, decreases in proportion to cos ε. Since the distance from P to the retina is constant, the magnification of ab depends only on θ, and is therefore proportional to cos ε. Second, as AB moves to eccentricity ε, the orthogonal to the line of sight intersects AB at angle ε. This effect is also proportional to cos ε. These two factors cause ab to decrease in proportion to cos²ε. The magnification factor operates alone when AB maintains a constant angle to the line of sight, as in Figure 26.10a. On a flat image surface, magnification is constant, because the distance from AB to P increases in proportion to the distance from P to ab. The image therefore changes in proportion to cos ε. These transformations provide a basis for deriving binocular disparities of lines (Section 19.2.2).

Figure 26.10. Perspective transformations of lines moving horizontally along a frontal plane.
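The cos²ε result for a spherical image surface can be checked numerically. This sketch (ours) computes the visual angle subtended by a short line element centered at eccentricity ε on a frontal plane; on a spherical retina, image length is proportional to this angle:

```python
import math

def image_angle(e, L=0.001, D=1.0):
    """Visual angle subtended at the nodal point by a short horizontal
    line of length L, centered at eccentricity e on a frontal plane at
    distance D."""
    x = D * math.tan(e)                 # position of the line's center
    a1 = math.atan2(x - L / 2, D)       # direction of one endpoint
    a2 = math.atan2(x + L / 2, D)       # direction of the other endpoint
    return a2 - a1

for e_deg in (0, 20, 40, 60):
    e = math.radians(e_deg)
    ratio = image_angle(e) / image_angle(0.0)
    print(e_deg, round(ratio, 4), round(math.cos(e) ** 2, 4))
# The measured ratio tracks cos^2(e): one factor of cos(e) from the
# increased distance, and one from the oblique intersection with the
# line of sight.
```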

26.1.1e Projective Transformations of a 2-D Object

From the projective transformations of a line one can work out the perspective transformations of the image of a 2-D object in different orientations moving in different directions. Consider a square with its edges orthogonal to or parallel with a line of sight. Such squares are in one-point perspective with a vanishing point on the horizon. Figure 26.11 shows the transformations of the image of a square projected onto a flat plane as the square moves over a frontal plane or over a horizontal plane. The same transformations occur for squares moving over a vertical plane.

First, consider a square moving vertically down from the horizon. The image of a square on a frontal plane (Figure 26.11C) does not change on a flat image surface. On a spherical surface, the image becomes tapered, its width shrinks in proportion to cos ε, and its height shrinks in proportion to cos²ε. For both image surfaces, the image of a vertical square parallel to the principal axis shears (A) as it moves away from the horizon. On a spherical surface, the image of a horizontal square (B) decreases in height in proportion to cos ε and in width in proportion to sin ε cos ε. On a flat surface, image height increases in proportion to sin ε and image width is constant.

Now consider a square orthogonal to a projection line receding in depth along the line (D). On a flat surface, the image shrinks in proportion to distance. On a spherical surface, the image shrinks in proportion to the square's angle of subtense. When the angle is small, image size is approximately inversely proportional to distance. When a square recedes along a horizontal surface below eye level, the horizontal dimension of the image shrinks in proportion to distance. The vertical dimension shrinks in proportion to distance squared, because the effect of increasing distance is compounded by the increasing viewing angle (Gillam 1981). Linear perspective is discussed in more detail in Section 26.3.

Figure 26.11. Transformations of the image of a square on a frontal image plane. The image of a square moving down in a frontal plane from the horizon to a vertical eccentricity of ε: (A) Shears through angle ε when the square is in a sagittal plane. (B) Grows in height in proportion to cos ε when the square is horizontal. (C) Remains constant when the square is frontal. The image of a square receding in depth along a horizontal plane: (D) Shrinks in proportion to distance when the square is frontal. (E) Shrinks in width in proportion to distance, z, and in height in proportion to z² when the square is horizontal. This is because of the combined effects of increasing distance and movement toward eye level (or toward the median plane in the case of a vertical plane). (F) Shrinks in width and height in proportion to distance when the square is sagittal.
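A sketch (ours) of case E, a square receding along a horizontal ground plane below eye level; the 1.6-m eye height is an assumed illustrative value:

```python
def image_of_ground_square(z, side=1.0, h=1.6, d=1.0):
    """Flat-plane image dimensions of a square of given side lying on
    the ground, nearer edge at distance z, for an eye at height h and
    projection distance d."""
    width = side * d / z                  # horizontal extent shrinks as 1/z
    near_y = h * d / z                    # image of near edge, below the horizon
    far_y = h * d / (z + side)            # image of far edge
    height = near_y - far_y               # vertical extent shrinks as ~1/z^2
    return round(width, 4), round(height, 4)

for z in (2.0, 4.0, 8.0):
    print(z, image_of_ground_square(z))
# Each doubling of distance halves image width but cuts image height
# by roughly a factor of four, as described above (Gillam 1981).
```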

26.1.1f Projective Transformations of 3-D Objects

When an eye translates, the image of a rigid object remains constant if the object is very far away or if it moves with the eye. Thus, when the shape and size of the image of an object remain constant as we translate the eye, we can infer that the object is either very far away or moving with us. For example, if the image of a nonspherical object remains the same when we walk round it, we infer that the object is rotating at the same speed.

A 2-D or 3-D object moving over a sphere centered on the eye while keeping a constant orientation to the line of sight retains a constant image. The object subtends the same solid visual angle, as shown in Figure 26.8. The image of an object moving in a flat plane, including a frontal plane, changes in size and shape. However, the change is small for small distances from the fixation point. On a flat image surface, there are no loci in space for which the image of a 3-D object remains constant when more than one side of the object is visible. The image changes in shape and/or size for every change in orientation or position of the object, except for rotation about the principal visual axis. These transformations can be derived by combining the transformations for each of the orthogonal dimensions of the object.

The perceptual registration of projective transformations allows us to make inferences about object properties. For example, if the image of an approaching object remains constant, the object appears to shrink as it approaches. At the other extreme, if the image grows in inverse proportion to distance, we perceive an approaching object of constant size. Given that we have a correct appreciation of the projective transformation of image dilation for an approaching object, we can assign any residual change in image size to a change in object size. Similarly, an appreciation of the shear transformation of a 3-D object as we move the head from side to side allows us to distinguish between motion parallax due to self-motion and plastic deformations of the object.

The set of projective transformations produced by all translations and rotations of an object of a given shape is unique to that shape. Perspective in a stationary image is compatible with a large class of object shapes. However, when a rigid object moves or rotates, the projective transformation of the image allows one to narrow the choice of possible objects to those that produce that image transformation for that motion. For example, Mayhew and Longuet-Higgins (1982) showed that just four noncoplanar points seen from two defined vantage points provide sufficient information to recover the complete 3-D structure of the points. The specification of 3-D structure through motion is discussed in more detail in Chapter 28.

26.1.2 TYPES OF PERSPECTIVE

Perspective refers to the size or shape of an image produced by a specified projection of an object onto a surface. Perspective takes many forms, depending on the type of projection, the type of object, and what aspect of the object is being considered. There are five basic types of projective transformation. For the sake of simplicity, assume that the image is formed by polar projection on a flat projection surface.

1. Size perspective. As a rigid object recedes into the distance, while maintaining the same orientation to the projection plane, its image decreases in size.

2. Linear perspective. Linear perspective arises when a surface is inclined with respect to the projection plane. For example, the images of the sides of an inclined rectangle converge to form a trapezoid.

3. Height in the field. As the distance of an object on a horizontal surface increases, its image moves toward the visual horizon.

4. Texture gradient. The images of equal-sized texture elements on an inclined surface decrease in size and increase in density with increasing distance along the surface.

5. Aspect ratio. Consider a circle inclined with respect to a line of sight. Its image is compressed in a direction orthogonal to the axis of inclination so as to form an ellipse. This is also called foreshortening. The aspect ratio of the image is the ratio of the minor axis of the ellipse to the major axis. The aspect ratio decreases from 1 as the circle becomes increasingly inclined to the line of sight. The aspect ratio does not vary with increasing distance of the circle along a line of sight. Consider a circle on a horizontal plane below eye level. The aspect ratio of the image decreases as the circle moves away along the surface. This effect occurs because, with increasing distance, the circle rises toward the horizon (eye height), which decreases the angle of inclination to the line of sight. Simply lifting the circle in a frontal plane has the same effect. An array of circles on an inclined plane forms an aspect-ratio gradient.

We will see later that the geometry of perspective is different for a spherical projection surface such as the retina. All types of perspective are represented in Figure 26.12. They all occur in images produced by polar projection onto flat or spherical image planes. Parallel projection produces only aspect-ratio perspective. In natural scenes, the various types of perspective covary. However, they can, at least to some extent, be independently manipulated in the laboratory.

Figure 26.12. General perspective. All types of perspective are represented. These are: simple size perspective, linear perspective, height in the field, texture size gradient, texture density gradient, and aspect-ratio perspective.

26.2 SIZE PERSPECTIVE

26.2.1 EXPERIMENTS ON SIZE AS A CUE TO DISTANCE

Pure size perspective arises from the motion-in-depth of a rigid object that maintains a constant orientation to a line of sight. In polar projection the size of the image of an object at distance D is specified by the angle, θ, subtended by the object, where tan θ is proportional to 1/D. For a small object, image size is approximately inversely proportional to distance. This is the simple size-distance law discussed in Section 29.3.2. For an image to remain the same size in polar projection, each point on an object must move along its own line of sight. This means that the object must expand as its distance increases. Increasing the distance of a rigid object has no effect on image size with parallel projection because, in parallel projection, each object point moves along a line of sight.

To establish that image size can serve as a cue to distance, all other cues to distance must be eliminated. This can be done by presenting the stimuli monocularly in dark surroundings at a fixed accommodation and vergence distance. Under these circumstances, image size can serve as a cue to distance in three ways.

1. An image changing in size can give rise to an impression of an approaching or receding object. This topic is discussed in Section 31.2.2.

2. The relative sizes of the images of simultaneously or successively presented objects indicate their relative distances if the observer assumes that the objects are the same size.

3. The size of the image of an isolated familiar object of known size indicates the distance of the object from the viewer. This topic is discussed in the next section.

Almost all experiments on size perspective have involved the use of small flat objects in a frontal plane placed at different distances along a line of sight at eye level. Under these conditions, the retinal image obeys the simple size-distance law, namely, both its dimensions shrink approximately in proportion to the distance of the object from the eye.

Ames demonstrated that a spherical balloon at a fixed distance in dark surroundings appears to approach while it is being inflated and recede while it is being deflated (Ittelson 1952). The same effect is produced by looming textured displays on a computer monitor (see Section 31.3). These effects imply that people interpret looming images, at least in part, as due to motion-in-depth of an object of fixed size rather than as due to an object at a fixed distance changing in size.

Several investigators have reported that when two stationary disks are presented at the same distance in dark surroundings, the disk subtending the smaller visual angle appears more distant (Ittelson 1951; Hochberg and McAlister 1955; Over 1960a). It is as if observers assume that the images arise from objects of the same size (Gogel 1969). The targets need not be the same shape (Epstein and Franklin 1971). The perceived distance of isolated objects is discussed in Section 29.2.2.

In apparent contradiction to the principle that the larger of two images signifies a nearer object, Brown and Weisstein (1988) reported that a grating with a high spatial frequency appeared closer than an abutting low spatial-frequency grating, even when disparity indicated the contrary.
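Returning to the size-distance law stated at the start of this section, a minimal sketch (ours) of the exact and approximate forms:

```python
import math

def visual_angle_deg(S, D):
    """Visual angle (deg) subtended by an object of linear size S (m)
    at distance D (m): theta = 2 * atan(S / (2D)), which is
    approximately proportional to 1/D for small angles."""
    return math.degrees(2 * math.atan(S / (2 * D)))

for D in (1, 2, 4, 8):
    print(D, round(visual_angle_deg(0.2, D), 3))   # a 20-cm object
# Each doubling of distance approximately halves the visual angle.
```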


However, this effect could have been due to the fact that their fine grating appeared to have a horizontal border that occluded the coarser grating. If so, it would invalidate the conclusion that finely textured regions appear closer than coarsely textured regions. Siderov and Harwerth (1993) found no effect of a two-octave difference of spatial frequency of DOG vertical bars on the disparity depthdiscrimination threshold. A display modulated in spatial frequency creates an impression of depth modulation with the apparently near peaks corresponding to the regions of low spatial frequency. In a natural scene, objects on receding surfaces, such as blades of grass on a ground surface, produce a continuous gradient of diminishing image size. People seem to be able to use this scale of distance to judge accurately the distance to an object placed on a ground plane in a natural scene at distances up to about 15 m (Rieser et al. 1990; Loomis et al. 1992; Fukusima et al. 1997) (Portrait Figure 26.13). Also, people are able to walk accurately to an object seen on a grassy field after they have closed their eyes (see Section 34.4.2). 26.2.2 FA M I L I A R S I Z E A S A C U E TO D I S TA N C E

of the object. The usual procedure is to ask subjects to estimate the distance of familiar objects presented one at a time in reduced cue conditions. Ittelson (1951) presented playing cards of normal size, double size, and half size one at a time at the same distance in dark surroundings. Subjects set a comparison stimulus presented separately to the same apparent distance. He concluded from his results that the perceived distance of a familiar object is, “that distance at which an object of physical size equal to the assumed-size would have to be placed in order to produce the given retinal image.” Hochberg and Hochberg (1952) argued that Ittelson had not controlled for the effects of the relative sizes of the three playing cards. Three blank cards may have produced the same result. Readers can judge for themselves by looking at the displays in Figure 26.14 through a hole in a card. Epstein (1961) conducted a similar experiment but failed to find any effect of familiar size on judged distance. They did find an effect of relative size on perceived relative distance. Hochberg and Hochberg separated the effects of familiar size and relative size by presenting drawings of a man and a boy, which subtended the same visual angle. The drawing of the boy appeared no nearer than that of the

Theoretically, the size of the image of a familiar object of fixed size provides information about the absolute distance

Figure 26.13. Jack M. Loomis. Born in Bowling Green, Ohio, in 1945. He obtained a B.A. in psychology from Johns Hopkins University in 1967 and a Ph.D. with D. H. Krantz from the University of Michigan in 1971. After postdoctoral work at the Smith-Kettlewell Institute of Visual Sciences in San Francisco he moved to the University of California at Santa Barbara, where he is now professor of psychology.

Perceived distance and familiar size. By looking through a hole in a card, readers can decide whether the impression of depth is greater for familiar objects of different size than for blank cards of different size.

Figure 26.14.

DEPTH FROM PER SPECTIVE



25

man. However, the Hochbergs used only line drawings of the boy and the man. Ono (1969) used photographs of a real golf ball and a real baseball. When they were presented at the same angular size, the baseball appeared more distant than the golf ball. As with unfamiliar objects, the apparent sizes and distances of familiar objects are influenced by whether subjects are trying to judge linear size or angular size (Predebon 1992). When all cues to distance are present, including familiar objects in the scene, distance judgments seem to be just as accurate for unfamiliar as for familiar test objects (Predebon 1991). Perhaps familiar test objects would be more effective than unfamiliar ones in a scene lacking other familiar objects. Familiar size can influence depth judgments based on disparity (Section 29.3.3). One problem with estimating the distance of single familiar objects is that subjects may make judgments in accordance with cognitive expectations rather than perceptual mechanisms (Higashiyama 1984). Gogel (1976) used a procedure designed to tap a purely perceptual mechanism (Portrait Figure 26.15). The stimuli were monocularly viewed transparencies of a key, a pair of glasses, or a guitar in dark surroundings at a fixed distance of 133 cm. As the subject moved the head from side to side through a specified amplitude, the amplitude and phase of side-to-side motion of the stimulus was adjusted until the subject reported that the stimulus appeared stationary. If perceived distance

Walter C. Gogel. Born in 1918 in Elizabeth, New Jersey. He served as a radar technician in the U.S. Army from 1943 to 1945. He obtained a B.A. from Marietta College in 1948 and a Ph.D. in psychology with Eckhard Hess from the University of Chicago in 1951. From 1951 to 1961 he worked in the Army Medical Research Laboratory in Fort Knox, and from 1961 to 1965 in the Civil Aeromedical Institute in Oklahoma. In 1965 he became a professor of psychology at the University of California, Santa Barbara. He retired in 1989 and died in 2006.

Figure 26.15.

26



equaled actual distance, there would be no perceived absolute motion of the stimulus when the stimulus was stationary. Thus, the magnitude and phase of stimulus motion when the subject reported no stimulus motion indicated the extent to which perceived distance differed from true distance. The results indicated that perceived distance was partially influenced by familiar size. Subjects reported that the objects sometimes appeared smaller or larger than they normally appear. Gogel proposed that the partial effect of familiar size under reduced conditions is due to a tendency to perceive objects at a default distance, a tendency he called the specific-distance tendency. This default distance is affected by eye convergence (Gogel and Tietz 1977) and, under reduced conditions, is related to the resting state of vergence (Section 10.2.1). There have been several claims that merely informing subjects that an unfamiliar object has a certain size or a certain distance can influence subsequent distance estimates under reduced viewing conditions (Coltheart 1970; Park and Michaelson 1974; Tyer et al. 1983). Hastorf (1950) found that distance estimates of a disk were influenced by whether subjects were told that it was a ping-pong ball or a billiard ball. However, Gogel (1981) found no effect of informed size when apparent distance was measured with his motion-parallax procedure. He suggested that effects reported by others were due to cognitive factors. Even if familiar size is not an effective cue for judging absolute distance, it could indicate the relative distance between distinct objects. Consider two familiar objects that differ in size. The relative distance of two objects is given by the ratio of the sizes of their images, divided by the ratio of the linear sizes of the objects. The ratio of linear sizes equals the ratio of image sizes at equidistance. This may be called distance scaling by familiar size. If the sizes of the images are in the same proportion as the sizes of the objects, the objects should appear at the same distance. On the other hand, if the images of the two objects are the same size, the larger object should appear more distant than the smaller object in proportion to the relative linear sizes of the objects (Adelson 1963). Epstein and Baratz (1964) presented subjects with pairs of coins (dime, quarter, and half dollar) under reduced visual conditions. The perceived distances between members of each pair of stimuli was determined by their familiar sizes, rather than by their relative angular sizes. For example, a quarter subtending the same sized image as a dime appeared more distant than the dime. In a second experiment, Gogel and Mertens (1968) used a wide variety of familiar objects and produced further evidence of distance scaling by familiar size. Epstein and Baratz (1964) first exposed subjects to an association between the colors of disks and their physical size. They then presented pairs of equidistant disks in different colors under reduced depth-cue conditions.


In a second experiment, Epstein and Baratz (1964) first exposed subjects to an association between the colors of disks and their physical sizes. They then presented pairs of equidistant disks in different colors under reduced depth-cue conditions. The sizes expected from the colors of the disks did not influence perceived relative distances. Thus, only long-term familiarity is effective as a relative depth cue.

The length of a swinging pendulum is proportional to the square of its period. A similar relationship holds between the stride length of an animal and the temporal period of walking. Jokisch and Troje (2003) reported that a simulated animal with long strides appeared larger than one with short strides.
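The pendulum relation can be made explicit. In the sketch below (Python; the sample periods are arbitrary), the period T of a pendulum of length L satisfies T = 2π√(L/g), so length grows with the square of the period; a walker whose limbs swing with twice the period should therefore be seen as four times as large:

```python
import math

def pendulum_length(period_s, g=9.81):
    # Invert T = 2 * pi * sqrt(L / g): L = g * (T / (2 * pi))**2,
    # so length is proportional to the square of the period.
    return g * (period_s / (2 * math.pi)) ** 2

print(pendulum_length(1.0))  # about 0.25 m
print(pendulum_length(2.0))  # about 0.99 m: four times as long
```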

26.2.3 SIZE DISCRIMINATION AND MATCHING

The performance of certain tasks involving the discrimination of size differences is influenced by the relative sizes of the compared stimuli. But does performance depend on relative retinal size or on relative perceived size? We will now see that perceived size has a strong influence, even when compared objects have the same retinal size.

Perceived size affects reaction time in discrimination tasks. Uhlarik et al. (1980) found that subjects took longer to make size estimates of objects at various distances along a textured surface when the perceived sizes of the objects differed. Bennett and Warren (2002) asked subjects to make a same-different judgment about two abstract black shapes at different distances along a hallway. Reaction times were influenced both by differences in retinal size and by differences in object size. This probably reflects the fact that perceived size in cue-reduced situations is a compromise between retinal size and object size. As more depth cues were added, reaction time became more influenced by object size.

A background texture gradient affects the perceived sizes of objects placed on the gradient. This can affect the speed with which observers pick out one object that differs in size from surrounding objects (Aks and Enns 1996). Finally, perceived size also affects memory for form. Biederman and Cooper (1992) showed that subjects detected more rapidly that a shape was the same as one presented before when the two shapes were perceived to be the same size than when the shapes had the same retinal size.

All this evidence suggests that the perceived sizes of objects are computed before discriminations are made. This is what one would expect because, in the real world, object size rather than retinal size is important for behavior.

26.3 LINEAR PERSPECTIVE

26.3.1 INFORMATION CONTAINED IN LINEAR PERSPECTIVE

The simplest case of linear perspective is the convergence of the images of receding parallel straight lines. Parallel receding lines on a flat horizontal surface produce images that meet in the principal vanishing point on the principal horizon (the horizon at eye level). Assume that the visual axis of the eye intersects the principal vanishing point and that the retina is flat and orthogonal to the visual axis. Let the object be a rectangular surface, and let the perspective angle be the angle between the images of a pair of receding opposite sides. The perspective angle indicates the following features of a rectangular surface.

Figure 26.16. Information conveyed by simple linear perspective. For correct perspective, view the figure with one eye from about 3 inches in front of the principal vanishing point.

1. Height below or above eye level. For a horizontal surface, the perspective angle increases as the surface moves vertically up toward eye level, as shown in objects 1 and 2 in Figure 26.16. Thus, perspective provides information about the height of the eye above the surface. For a receding wall surface, the angle of convergence increases as the surface approaches the median plane.

2. Magnitude of inclination or slant. The perspective angle increases with increasing inclination of the surface about a horizontal axis with respect to the frontal plane, as indicated by object 3 in Figure 26.16. The perspective angle also increases with increasing slant about a vertical axis.

3. Sign of inclination. The horizon of an upwardly inclined surface is above the principal horizon (eye level), as shown in object 3 of Figure 26.16. The horizon of a downwardly inclined surface is below the principal horizon. The horizon of a wall surface slanted to the left is to the left of the median plane, and that of a surface slanted to the right is to the right of that plane. All parallel surfaces have the same horizon line.

4. Up-down, left-right location. The direction of perspective indicates whether a horizontal surface is below or above eye level. If image lines converge upward, the surface is below eye level, as shown in object 1. If lines converge downward, the surface is above eye level, as shown in object 6. If image lines on a vertical surface parallel to the median plane converge to the left, the surface is to the right of the median plane, and vice versa. In general, the direction of the angle of perspective indicates the direction in which the surface lies with respect to the plane containing the nodal point of the eye and the horizon line of that surface.

5. Surface tilt. The tilt of a surface with respect to the horizontal meridian of the eye is indicated by the location of its horizon relative to the principal horizon, as indicated by object 4 in Figure 26.16.

6. Surface spin. The image of a horizontal rectangle that is not parallel to either the frontal plane or the median plane has two vanishing points, as shown by object 5 in Figure 26.16. The positions of the vanishing points


on either side of the principal vanishing point indicate the extent to which the surface is rotated within its own plane (spin) relative to the median plane. Also, the images of the horizontal and vertical axes of a centrally viewed inclined rectangle intersect in a right angle when its sides are parallel to the median plane. The image becomes skewed when the rectangle rotates in its own plane, as in Figure 26.23. The amount of skew indicates the orientation of the rectangle when the observer knows that it is a rectangle.

7. Height in the field. As an object moves away on a horizontal surface, its image approaches the visual horizon. If eye height is specified, height in the field indicates distance (Section 26.4).

26.3.2 JUDGING SURFACE INCLINATION FROM LINEAR PERSPECTIVE

Linear perspective has been found to be a more effective cue to inclination than a texture gradient (Clark et al. 1956a; Smith 1964), especially when supplemented with binocular disparity (Clark et al. 1956b). But the strength of linear perspective as a cue to surface inclination depends on the extent to which the texture contains elements aligned with the direction of inclination (Saunders and Backus 2006a).

Sedgwick and Levy (1985) presented subjects with projected images of two inclined textured surfaces, with the center of one 22.5˚ above the center of the other. With monocular viewing, subjects set the inclination of the test surfaces to match that of the standard surface, either with respect to the horizontal, as in Figure 26.17A, or with respect to the visual axis, as in Figure 26.17B. They could perform both tasks with reasonable accuracy, but the variability of settings was greater for the visual-axis task than for the visual-horizon task. Sedgwick and Levy concluded that observers have direct access to perspective information that specifies inclination to the visual horizon.

Figure 26.17. Comparison of surface inclinations. Subjects set the inclination of the projected image of a comparison textured surface to match that of the image of a standard surface, either with respect to the horizontal (A) or with respect to the visual axis (B). (Adapted from Sedgwick and Levy 1985)



26.3.2a Effects of Object Size on Linear Perspective

Consider a rectangle inclined about a horizontal axis, as in Figure 26.18A. Its image on a flat projection surface has



sides that converge to a vanishing point on the horizon. The angle of convergence, θ, the angular width of the image, w, and the angle of inclination of the rectangular object, i, are related by:

tan θ = tan(w/2) × tan i
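The relation can be inverted to recover inclination from the image. The sketch below (Python; the sample angles are illustrative) shows that the same convergence angle implies greater inclination for a smaller image, which is the size-scaling problem illustrated in Figure 26.18:

```python
import math

def inclination_from_image(conv_deg, width_deg):
    # Invert tan(theta) = tan(w / 2) * tan(i) to recover the inclination i
    # from the convergence angle theta and the angular image width w.
    theta = math.radians(conv_deg)
    w = math.radians(width_deg)
    return math.degrees(math.atan(math.tan(theta) / math.tan(w / 2)))

print(inclination_from_image(conv_deg=5, width_deg=30))  # ~18 deg for a large image
print(inclination_from_image(conv_deg=5, width_deg=10))  # 45 deg for a small image
```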

In Figure 26.18B it can be seen that the vanishing point of a small tapered image is lower than that of a large image with the same perspective taper. Given that both images are produced by rectangles, the lower the vanishing point the greater the inclination. The rectangle that produces the small image is therefore more inclined than the rectangle that produces the large image. Furthermore, Figure 26.18B shows that the height-to-width ratio of the object that produces the larger image is smaller than that of the object that produces the smaller image. Two rectangles inclined at the same angle have the same vanishing point, as in Figure 26.18C. However, the small rectangle produces an image with a smaller angle of convergence. Therefore, the convergence angle indicates the inclination of a rectangle only if the size of the rectangle is taken into account. Stavrianos (1945) reported that, for a given inclination, a large rectangle appeared more inclined than a small rectangle.

Freeman (1966a) obtained functions relating the angle of inclination, θ, to two measures of linear perspective. Consider any rectangle orthogonal to the visual axis with its base on the visual horizon. Let the rectangle be inclined about the visual horizon through angle θ, as in Figure 26.19. The image on a vertical plane of one side of the rectangle forms angle φ with respect to the horizon.

Figure 26.18. Size scaling of linear perspective. (A) The angle of convergence, θ, of the image of a rectangle, the angle of inclination, i, and the visual subtense of the rectangle, w, are related by tan θ = tan(w/2) × tan i. (B) For images with the same convergence, that of a small rectangle indicates greater inclination than that of a large rectangle. (C) Small and large rectangles have the same inclination when they have the same vanishing point, but the image of the small rectangle has a smaller convergence angle. (Adapted from Saunders and Backus 2006b)

Figure 26.19. Perspective from an inclined rectangle. The image of an inclined rectangle depends on its inclination (θ), its angular size (φ), and the viewing distance (c), as explained in the text.





Angle φ decreases as the distance, d, of the side of the rectangle from the visual axis increases. If the base of the rectangle is at distance c from the eye, then:

tan φ = (c cot θ) / d

The difference, δ, between the angular subtense of the near horizontal edge and that of the far edge of the rectangle increases according to the function:

tan δ = (2ad sin θ) / (c² + d² − a² sin²θ)

where a is half the height of the rectangle. Freeman (1966b) produced evidence that the change in δ with increasing height of an inclined rectangle is responsible for the effect reported by Stavrianos, namely that perceived inclination increases with increasing size of a rectangle. One could test this conclusion by having subjects judge the inclinations of rectangles that differ only in height. Increasing the size of a rectangle also increases the separation of opposite edges, and this should make it more difficult to estimate both the convergence angle and the length difference.

Saunders and Backus (2006b) asked whether people allow for size when judging the shape and inclination of rectangular surfaces from monocularly viewed tapered images. The images varied in width and aspect ratio and were either untextured or contained a checkerboard. Subjects reported whether the inclined surface that produced each tapered image was longer or wider than a square. Subjects made only partial allowance for image size. Smith (1967) had obtained similar results. The largest deviations from veridical judgments were for highly inclined small surfaces. Judgments were not much affected by the addition of the checkerboard texture. The biases in judgments of shape were consistent with underestimation of the perceived inclination of the rectangles. For example, if a subject underestimates the inclination of a rectangle, the actual surface would appear compressed in length relative to a square. Other investigators have reported underestimation of inclination. The possible reasons for underestimation are listed in Section 26.5.2a.

In a second experiment, Saunders and Backus found that, for images of constant size and constant linear perspective, changes in aspect ratio had little effect on judgments of inclination. Thus, perceived depth was determined principally by perspective convergence. Other evidence on this point is described in Section 26.5.2b.
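The two measures can be computed together. The sketch below (Python) implements the relations as reconstructed above; the numerical values are arbitrary, and the point of the loop is that the edge-subtense difference δ grows with the height of the rectangle, as Freeman's account of the Stavrianos effect requires:

```python
import math

def perspective_measures(theta_deg, c, d, a):
    # theta: inclination of the rectangle from the frontal plane
    # c: distance of the rectangle's base from the eye
    # d: distance of a receding side from the visual axis
    # a: half the height of the rectangle
    t = math.radians(theta_deg)
    phi = math.atan((c / math.tan(t)) / d)  # angle of a side's image to the horizon
    delta = math.atan(2 * a * d * math.sin(t) /
                      (c**2 + d**2 - a**2 * math.sin(t)**2))
    return math.degrees(phi), math.degrees(delta)

for a in (0.5, 1.0, 2.0):  # taller rectangles give a larger delta
    print(a, perspective_measures(theta_deg=40, c=10, d=2, a=a))
```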

26.3.2b Effects of Knowledge of Object Shape

The impression of inclination produced by a small trapezoidal image depends on the observer detecting or assuming that the taper of the image represents an inclined rectangle rather than a frontal trapezoid. Assumptions about the true



shape of an object are not required when the object rotates in depth through 360˚ because the shape of the image indicates the shape of the object when the ratio of orthogonal dimensions is at a maximum. This occurs when the shape is normal to the line of sight. The shape is also normal to the line of sight when, for a constant rate of object rotation, the rate of change of image shape is at a minimum. The object must be correctly perceived as being rigid. Neither shape nor rigidity assumptions are required when the object is viewed with both eyes (Reinhardt-Rutland 1990). Also, the shape and inclination of a flat object are indicated when the observer produces motion parallax by moving the head laterally through a sufficient distance (Reinhardt-Rutland 1993).

26.3.2c Depth from Parallel Perspective

Long before the principles of linear perspective were discovered, artists represented depth by setting receding edges of objects parallel but at an angle to the frontal surface, as shown in Figure 2.23. Parallel perspective is correct for a distant rectilinear object, since image convergence becomes vanishingly small at a far distance. But parallel perspective is evident only for an object above or below eye level. For an off-center rectilinear object, the images of receding edges form an angle, θ, with the front edge of the object, as in Figure 26.20A. The angle increases with increasing distance of the object above or below eye level. Given that the object is assumed to be rectilinear, this angle provides information about the depth of the object and its position with respect to eye level. However, the sign of depth is ambiguous. An impression of depth is created when two sets of parallel lines abut in an angle, as in Figure 26.20B. But there is little or no impression of depth when there is a vertical or horizontal gap between them (Shapley and Maertens 2008).

Figure 26.20. Depth from parallel perspective. (A) Parallel perspective creates an impression of depth. Angle θ increases with object distance from eye level and inclination. (B) Depth is created by sets of parallel lines meeting at an angle. (C) Lines out of phase or that do not meet create little or no depth.

Figure 26.21. James J. Gibson. Born in McConnelsville, Ohio, in 1904. He obtained a B.A. in psychology in 1925 and a Ph.D. in 1928 from Princeton University. In 1928 he joined the faculty of Smith College, where he met Kurt Koffka. During World War II he directed a research unit in the Army Air Forces Aviation Psychology Program. After the war he moved to the psychology department of Cornell University, where he remained until his death in 1979.

26.3.3 JUDGMENTS OF PROJECTIVE INVARIANTS

26.3.3a General Invariants

Helmholtz (1909) proposed that the projective transformations that images of solid objects undergo become embedded in the perceptual system. Many other visual scientists, including Rock (1983), have expressed the same view. Perspective transformations are characterized by invariance of collinearity, order of points, continuity, incidence, and cross ratios. These invariants are common to all rigid objects. These invariant properties of objects are preserved in their images, but the same properties in images are not necessarily preserved in the object. For example, two incident points in an object produce incident images, but incident points in an image may arise from two points at



different distances along two lines of sight, which are therefore nonincident points. According to Gibson (1966), perspective invariants convey a great deal of information about the world, although later he emphasized that these invariants are more evident in the image of a moving object than of a stationary object (Gibson 1979) (Portrait Figure 26.21). Perspective invariants are so much taken for granted that many of them have not been investigated. Consider the image of a solid rotating object or of an object viewed from different positions. When collinearity or order of points is not preserved we are informed that the object is not rigid. When continuity is not preserved we are informed that the object has a hole in it or is not a single object. When coincidence of lines or surfaces is not preserved we are informed that one part of an object is sliding over another part. Particular motions of objects preserve particular features of the image. For example, the image of an object receding along a line of sight preserves its shape, and the

image of an object moving along a surface concentric with the nodal point of the eye preserves both its shape and size. There is plenty of evidence that we use these relationships. The image of any 2-D figure rotating in depth at a constant velocity compresses at a rate proportional to the cosine of the angle of slant to a line of sight. The true shape is therefore indicated when the rate of change of the image is at a minimum. The image of any planar figure or that of any plane through a solid object is widest when the plane is orthogonal to a line of sight. Thus, the true shape of any rotating 2-D object is given when the area of its image is at a maximum. The extent to which people use these two relationships does not seem to have been investigated.
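These two relationships are easy to demonstrate numerically. The sketch below (Python with NumPy; the object width and sampling step are arbitrary) rotates a flat object in depth and finds both the frame of maximum projected width and the frame of minimum frame-to-frame shape change; both pick out the frontal view:

```python
import numpy as np

angles = np.radians(np.arange(0, 180, 5))     # rotation in depth, 5-deg steps
true_width = 2.0
widths = true_width * np.abs(np.cos(angles))  # image width ~ cos(slant)

widest = np.argmax(widths)                    # frame of maximum projected width
change = np.abs(np.diff(widths))              # frame-to-frame change in width
steadiest = np.argmin(change)                 # frame where shape changes slowest

print(np.degrees(angles[widest]))             # 0 deg: the frontal view
print(np.degrees(angles[steadiest]))          # also near 0 (or 180) deg
```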

26.3.3b Invariance of Cross Ratios

Experiments have been conducted to discover whether people use the invariance of the cross ratios of four collinear points, or the related ratios of areas on a surface. However, some of the results are invalid because other types of invariance, such as those listed above, were present. Any quadrilateral has a projective invariant with respect to a defined point, as shown in Figure 26.22. This is a cross ratio for areas.

Figure 26.22. A projective invariant ratio for a quadrilateral with vertices 1, 2, 3, 4 and a reference point A. Invariant ratio = (area 12A × area 34A) / (area 14A × area 32A). For a given position of A, the ratio of areas is constant for all 3-D orientations of the image or of the image plane.

Niall and Macnamara (1990) displayed six different irregular quadrilaterals in dark surroundings and






asked subjects to indicate which of these standards was projectively equivalent to each of a series of test quadrilaterals. The test stimuli were the projections of the standard stimuli at various angles of slant about axes at various angles of tilt. A regular pentagonal star with appropriate perspective was placed on each quadrilateral to help subjects detect the angle of slant of each test stimulus with respect to the standards. Subjects could not match the shapes correctly above chance level. They therefore showed no evidence of using the cross ratio.

In a second experiment, subjects were shown pictures in perspective representing textured quadrilaterals lying on a textured surface inclined at various angles. They were asked to draw both the image and the frontal view of each quadrilateral. The dependent measure was the difference between the cross ratio of areas in their drawing and the cross ratio of the stimulus figure. Errors for both tasks were severe when the display was inclined more than 15˚. This suggests that people do not use the cross ratio for static irregular quadrilaterals. But failure may have been due to difficulty in allowing for differences in the orientation of the shapes rather than to an inability to use cross ratios.

In a subsequent experiment, Niall (1999) asked whether subjects are better able to perceive perspective invariants for shapes rotating in depth. Each of 18 distinct 2-D test displays simulated two intersecting, rigidly linked ellipses rotating randomly in different directions through various angles in front of an inclined checkerboard surface. Subjects adjusted the size and ellipticality of two stationary ellipses displayed on a gray background on the other half of the screen until they matched the perceived frontal shapes of each of the rotating ellipses. They could perform this task with great accuracy, and Niall concluded that subjects were detecting the projective invariants inherent in a pair of linked ellipses. But two simpler strategies were available. The frontal shape of each ellipse would be revealed as it rotated through the point where its area was at a maximum. Also, for a given velocity of rotation, the rate of change in shape of any rotating flat object is minimum when its shape is frontal and maximum when it is parallel to a line of sight.
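The area cross ratio of Figure 26.22 is straightforward to compute, and its invariance is easy to verify. The sketch below (Python with NumPy; the quadrilateral, reference point, and transform are arbitrary illustrative values) evaluates the ratio before and after an arbitrary projective transformation of the five points:

```python
import numpy as np

def tri_area(p, q, r):
    # Signed area of triangle pqr (shoelace formula).
    return 0.5 * ((q[0]-p[0]) * (r[1]-p[1]) - (r[0]-p[0]) * (q[1]-p[1]))

def cross_ratio_of_areas(q1, q2, q3, q4, a):
    # Invariant ratio = (area 12A * area 34A) / (area 14A * area 32A).
    return (tri_area(q1, q2, a) * tri_area(q3, q4, a)) / \
           (tri_area(q1, q4, a) * tri_area(q3, q2, a))

def homography(h, p):
    # Map a 2-D point through a 3x3 projective transform.
    x, y, w = h @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])

pts = [np.array(p, float) for p in [(0, 0), (4, 1), (5, 4), (1, 3)]]
a = np.array([2.0, 2.0])
h = np.array([[1.0,   0.2,   3.0],
              [0.1,   0.9,  -1.0],
              [0.001, 0.002, 1.0]])

print(cross_ratio_of_areas(*pts, a))
print(cross_ratio_of_areas(*[homography(h, p) for p in pts], homography(h, a)))
# The two printed ratios agree: the ratio survives projective transformation.
```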



26.3.3c Skew Symmetry

Particular symmetrical objects preserve their symmetry in projection. For example, a sphere, whatever its position, produces a circular image on a spherical retina (but not on a plane surface). An inclined ellipse always projects as an ellipse. However, when asked to draw an oblique section of a cone, most people draw an egg shape (see Section 3.7.2c for a proof that an oblique section of a cone is an ellipse). The positions of the images of the axes of an ellipse change with respect to the ellipse when it is inclined about an axis other than one of its principal axes.

For any other 2-D or 3-D object, symmetry is preserved only about an axis orthogonal to a slant axis. For any other slant axis, a symmetrical shape acquires skew symmetry. Saunders and Knill (2001) asked whether skew symmetry provides information about surface slant. They defined slant as rotation of a flat shape about any axis, tilt as the orientation of the axis of slant relative to horizontal, and spin as the orientation of the axis of symmetry of the shape in its own plane. They used a shape with two orthogonal axes of symmetry. The relative orientation (skew) of the images of the axes of symmetry varies with the angle of spin, as shown in Figure 26.23. However, for any angle of spin, there is a set of combinations of slant and tilt that could arise from a shape with two axes of symmetry. Therefore, if the object is seen as symmetrical, it should be seen as a shape with one of these combinations of slant and tilt. Within these bounds, skew symmetry provides information about surface slant and tilt. An image with a spin angle of 0˚ could be produced by a symmetrical shape at any slant about an axis orthogonal to an axis of symmetry, as in Figure 26.23A. An image with a spin of 45˚ could be produced by a symmetrical shape with certain combinations of slant and tilt, as shown in Figure 26.23B. An image with a spin of 20˚ is compatible with a symmetrical shape at other slants and tilts, as in Figure 26.23C.

Saunders and Knill presented stereoscopic shapes with twofold symmetry at various angles of slant, tilt, and spin. For shapes with double symmetry, perspective cues are approximately invariant to spin. Subjects adjusted a stereoscopic line to appear perpendicular to the plane of the shape. Judgments were biased in a manner that depended on the spin angle. Saunders and Knill concluded that skew symmetry contributes to judgments of slant and tilt. However, spin-dependent biases were affected by a conflict between the skew and binocular disparity. They explained their results in terms of a Bayesian model of optimal integration of skew and disparity. Saunders and Backus (2007) also produced evidence that skew symmetry is used in judging slant.

26.3.4 DISTORTIONS IN VIEWING PICTURES

26.3.4a Image Taper for Frontal Objects

Parallel lines in a frontal surface produce parallel images on a flat image plane. In a spherical eye, any straight line,



whatever its position in space, forms an image where the plane containing the line and the nodal point intersects the retina. But any plane through a central nodal point intersects the retina in an equator (a great circle, or line of longitude) that bisects the retina. Since no two equators are parallel, no two noncollinear straight lines produce parallel images, not even two parallel lines in a frontal plane. Retinal images of parallel frontal lines, if extended, converge in both directions onto points on opposite ends of a diameter of the eye. Thus, the retinal image of a very tall building viewed with horizontal gaze tapers in both directions. However, the sides of a tall building remain parallel in a picture or photograph for which the image plane is parallel to the building.

Some artists and theorists have mistakenly supposed that the drawing of a tall, frontally viewed building should taper because the retinal image is tapered. It seems that Leonardo da Vinci was concerned with this problem (Klein 1961). Panofsky (1940, p. 106) wrote, "Perspective construction as practiced in the Renaissance is, in fact, not 'correct' from a purely naturalistic, that is, a physiological or psychological point of view." Pirenne (1952) and White (1967, p. 125) pointed out the fallacy of Panofsky's argument. One might as well say that the building itself should taper to appear parallel, since the building is projectively similar to its image on a flat surface. When viewed from the correct vantage point, parallel lines in a picture produce the correct tapered retinal images. The flat image of a tall building converges only if the drawing surface or camera is inclined back with respect to the building.

Some artists have been confused about the difference between projection onto flat and spherical surfaces (see Figure 26.24). The artist Klee (1968) argued that the picture of a tall building should taper. In fact, images of frontal parallel lines drawn on a spherical picture surface concentric with the eye should converge to appear parallel, but not those drawn on a flat picture.

A line that produces a retinal image that does not lie wholly on a great circle is necessarily curved or bent. But an image that falls on a great circle of the retina is not necessarily straight. For example, the image of a circle lies on a great circle of the retina when the circle lies in a plane containing a line of sight. Curved lines on parallel lines of latitude on a sphere concentric with the eye produce images on parallel lines of latitude on the retina.

Figure 26.23. Skew symmetry. The figures on each graph indicate possible combinations of slant and tilt that are compatible with the shape on the left of the graph being symmetrical about two orthogonal axes. The spin is 0˚ in (A), 45˚ in (B), and 20˚ in (C). For each spin the full set of interpretations is indicated by the line through the figure. (Adapted from Saunders and Knill 2001)

Figure 26.24. Perspective in a frontal plane. Paul Klee argued that this is a correct drawing of the front of a house because the lower windows are closer to the eye than the upper windows. But this is true only if the canvas is inclined backward. The retinal image is tapered for all angles of gaze. (Redrawn from Pedagogical Sketchbook by Paul Klee 1968)





26.3.4b Perspective Illusion in Pictures

Although a photograph or picture in perspective produces the same image as the 3-D scene when viewed from the correct point, it may appear distorted because it lacks certain cues to depth present in the 3-D scene. For example, the feet in Figure 26.25 appear too large for the body even when the picture is viewed from the correct point. The illusion is due to the presence of information that indicates that the picture is flat. This includes the texture of the surface of the picture, highlights reflected by the picture surface, the absence of motion parallax due to head movement, truncation of the foreground, and zero binocular disparity (Hagen et al. 1978). These factors produce apparent contraction of distances in the picture relative to the 3-D scene, and corresponding distortions of relative size. The supine body in Figure 26.25 is perceptually contracted so that the feet appear nearer to the head than they would in a real scene. This causes the feet to appear large relative to the head. Artists reduce such distortions by not drawing objects too close to the eye or by modifying the strict rules of perspective.

The two tabletops depicted in Figure 26.26A appear different although they are identical in shape and size (Shepard 1990). In 3-D space, the tabletops that would produce these two images are very different. The perspective information in the 2-D figures prompts the viewer to perceive the figures in 3-D and thereby perceive the tabletops as elongated in the depth dimension. Figure 26.26B is the same illusion

designed by E. H. Adelson. The four pink areas are all the same shape. For each shape, a given dimension appears longer when it is seen as receding in depth than when it is seen as lying in a frontal plane.

In Figure 26.26C the line AC, which appears to lie in depth, appears longer than line AB, which does not appear to lie wholly in depth, and longer still than line BD, which is in a frontal plane.


Figure 26.25. Distortion of relative sizes in pictures. In correct perspective, the feet appear too large in relation to the rest of the body. The absence of depth cues, such as disparity, causes the length of the body to be underestimated.




Figure 26.26. Perspective illusions. A drawing of a 3-D object appears elongated in the depth dimension relative to the frontal dimension. (A) The two tabletops are the same shape and size. (After Shepard 1990) (B) The pink areas are the same shape and size. (Designed by Edward Adelson) (C) Lines AB, AC, and AD are the same length in the picture. But lines AC and AB are longer than line AD in the 3-D scene.


In the 3-D world depicted in the picture, line AC is indeed longer than lines AB and BD. This effect may be a factor in the vertical-horizontal illusion, in which a vertical line appears longer than an equal horizontal line (Schiffman and Thompson 1975).

The effects shown in Figure 26.26 can all be regarded as due to what Thouless called "regression to the real object." This means that when people try to match a frontal shape with the image of an inclined shape, they are pulled toward selecting a frontal stimulus that matches the inclined shape viewed orthogonal to the line of sight. This issue is discussed in Sections 26.3.5 and 29.4.1.

An accurate drawing in perspective can be produced by tracing the image of a scene on the surface of a mirror or window (see Section 2.9.4). However, people typically make large errors when asked to judge the size of an image on the surface of a mirror. Gombrich (1960) observed that people perceive the reflection of their own face on the surface of a mirror as the same size as their face. In fact, the image on the surface of the mirror is half the size of the face. This must be so because the mirror is exactly halfway between the viewer and the virtual reflected image of the viewer.

People typically report that their image on the surface of a mirror becomes smaller as they move away from the mirror. In fact, the size of the image on the mirror surface is independent of the distance of the viewer from the mirror, because the mirror is always halfway between the viewer and the virtual image. It is the virtual image that decreases in size as a person moves away from a mirror. People also report that the size of the image of a fixed object on the surface of a mirror stays the same when they move away from the mirror. In fact, the projection of the image of the object on the mirror becomes larger, even though the virtual image of the object becomes smaller (Bertamini and Parks 2005). Even when people are instructed to judge the size of the image on the surface of the mirror, they base their judgments on the size of the virtual image (Bertamini et al. 2008).

These effects provide further evidence that people perceive objects in 3-D space and have difficulty perceiving the 2-D projections of objects.
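The mirror geometry is simple enough to state as a computation. The sketch below (Python; the function name and sample values are illustrative) shows why the outline of one's own face traced on a mirror is half the face's height regardless of viewing distance: the virtual image lies as far behind the glass as the face is in front, so the mirror is always halfway along the projection:

```python
def face_outline_on_mirror(face_height, eye_to_mirror):
    # The virtual image is eye_to_mirror behind the glass, i.e., at
    # 2 * eye_to_mirror from the eye; similar triangles give the size
    # of its projection on the mirror surface.
    virtual_distance = 2 * eye_to_mirror
    return face_height * eye_to_mirror / virtual_distance  # = face_height / 2

print(face_outline_on_mirror(0.24, 0.5))  # 0.12 m
print(face_outline_on_mirror(0.24, 2.0))  # still 0.12 m
```

For a fixed object other than the viewer, the object-to-mirror distance stays constant while the eye-to-mirror distance grows, so the projection on the glass grows as the viewer retreats, which is the situation Bertamini and Parks (2005) describe.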


26.3.4c Images from Obliquely Viewed Objects

The image of a flat object moving in a frontal plane remains constant in size and shape on a flat image plane but shrinks and changes shape on a spherical retina (Figure 26.10b and d). For both types of image plane, the image of a flat object containing a line of sight is a line. The retinal image of a sphere in any location is a circle. But the image of a sphere on a flat surface becomes increasingly elliptical as the sphere moves from the straight ahead. Figure 26.27A shows photographs taken with a pinhole camera of spheres in different locations in a frontal plane. The images become increasingly elliptical as they depart further from straight ahead. The major axis of each elliptical image is aligned with the center of projection.

Figure 26.27. Perspective from spheres. (A) Images formed on a flat surface by identical spheres at different locations in the visual field. (Derived from Pirenne 1970) (B) The image of a sphere on a flat surface increases in width and remains constant in height with increasing eccentricity. It therefore becomes increasingly elliptical. On a spherical surface, the image decreases in both width and height and therefore remains circular. (C) A photograph of a spherical object taken with a pinhole camera at an oblique angle on the roof of the Church of Saint Ignazio in Rome. (From Pirenne 1970)





Figure 26.27B illustrates how the image of a sphere remains circular on a spherical retina as the sphere moves into an eccentric location, whereas the image on a flat surface becomes more elliptical as the sphere becomes more eccentric. Figure 26.27C shows a photograph of a spherical ornament on a building taken at an oblique angle with a pinhole camera. The ornament in the photograph produces a circular image on the retina when the photograph is held in the upper right quadrant of the visual field. All these figures are derived from Pirenne (1970).

The elliptical photographic image of an off-center sphere appears circular when viewed from where the photograph was taken. When the picture is not viewed from the correct point, or when visual cues indicate that the picture is flat, the elliptical shape of the image becomes apparent. Artists usually draw off-center spheres as circles, probably because they believe that this is the correct way to draw them. But even if they realized that the images should be elliptical, they would probably draw them as circles to allow for the fact that the picture might not be viewed from the correct point. The fact that artists break the rules of perspective does not mean that the rules are incorrect.

The image of a monocularly viewed object changes in perspective as it moves along a frontal plane. However, the object does not appear to change its orientation with respect to the frontal plane. James et al. (2001) asked subjects to orient the front surface of a luminous cube viewed monocularly in dark surroundings so that it appeared in a frontal plane. Perspective was the major cue to the cube's orientation. On average, the slant of the cube relative to the frontal plane was underestimated by about 3˚ at left or right eccentricities of 30˚. James et al. then modified the sense of eye position by having subjects move blocks while wearing a prism that displaced the image 15˚ to the left. Slant settings were shifted about 5˚, as though the cube had been shifted leftward. These results show that people take eye position into account when using perspective to judge the orientation of eccentric objects. In a control condition, subjects set an eccentric cube parallel to a central cube. This eliminated any contribution of errors in registration of the orientation of the head relative to the body. The pattern of errors showed that this source of errors was not important.
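The flat-versus-spherical projection of spheres described above can be checked numerically. The sketch below (Python with NumPy; the sphere size and eccentricities are arbitrary) computes the exact radial and tangential extents of a sphere's image on a flat plane at unit distance from the pinhole; their ratio grows roughly as 1/cos(eccentricity), whereas the retinal image of a sphere stays circular:

```python
import numpy as np

def flat_image_extents(ecc_deg, ang_radius_deg):
    # Pinhole projection of a sphere onto a flat plane at unit distance.
    # ecc: angular eccentricity of the sphere's center;
    # ang_radius: angular radius of the sphere.
    b = np.radians(ecc_deg)
    a = np.radians(ang_radius_deg)
    radial = np.tan(b + a) - np.tan(b - a)   # extent toward/away from center
    tangential = 2 * np.tan(a) / np.cos(b)   # extent at right angles to that
    return radial, tangential

for ecc in (0, 20, 40, 60):
    r, t = flat_image_extents(ecc, 5)
    print(f"ecc {ecc:2d}: radial {r:.3f}, tangential {t:.3f}, ratio {r/t:.2f}")
```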

26.3.4d Absence of Parallax in Pictures

A picture of a 3-D scene does not produce the expected motion parallax when the vantage point changes. A 3-D object produces no parallax only when it moves with the observer. Thus, when we walk past a portrait, the person in the picture seems to follow us. This is the only percept that can account for the fact that we continue to see the frontal view of the face. This phenomenon was described by Ptolemy in the 2nd century (Section 2.1.3d) and by William Wollaston (1824).



The concave side of a facemask appears convex (Section 30.9). When we move the head from side to side, the face appears to follow us at twice the angular velocity of the head motion.

The images of pictured objects become distorted when a picture is viewed at an oblique angle. This topic is discussed in Section 29.4.4.

26.3.5 DRAWING IN PERSPECTIVE

Humans, and presumably other animals, have used perspective for the perception of the 3-D structure of the world for millions of years. But, as far as we know, no human could draw in correct perspective before the 15th century, and the full theory of perspective was not developed until the 17th century (Section 2.9.3).

Figure 26.28 shows some typical examples of how artists before the 15th century drew rectilinear objects. Typically, the receding edges of the tops of rectilinear objects were drawn parallel or nearly parallel rather than converging to a vanishing point. But the sides of the objects were typically drawn diverging. The divergence angles for the tops and sides of the objects, as indicated by the added lines, are shown in the caption.

One can understand why an inclined rectilinear surface is drawn with parallel sides. The sides of the inclined surface are indeed parallel, and they are perceived as parallel in 3-D space. Our visual system evolved to allow us to see in 3-D, not to make 2-D drawings or to perceive the 2-D layout of the retinal image. But why did many artists draw with divergent perspective, especially when drawing the vertical sides of an object? One possibility is that they drew in parallel perspective, which produced a picture that appeared to diverge, as in Figure 26.26A. If other artists copied such a drawing, they may have exaggerated the effect and drawn with divergent perspective. But this cannot be the main reason why artists drew in divergent perspective because, as we will now see, many people today draw with divergent perspective even when drawing an actual rectilinear object.

Howard and Allison (2011) asked 80 university students to draw a cube viewed binocularly from a fixed vantage point with one surface in the frontal plane. The vanishing point was in the midline of the head at eye level. The correct drawing in one-point perspective is shown in Figure 26.29, along with the worst drawing, the best drawing, and the mean of the drawings of the 80 subjects. On average, the top of the cube was drawn with approximately parallel edges, while the side was drawn with edges that diverged 18.6˚ from parallel. In a correct drawing of a cube in this position, both the top and the side should converge 13˚. None of the students came close to drawing in correct perspective. The mean drawing closely resembles the paintings produced before the 15th century, shown in Figure 26.28.

Did early artists realize that their drawings were not correct? In 1420 Brunelleschi drew the baptistery of the


Figure 26.28. Examples of early divergent perspective. Lines have been added to each figure to indicate the convergence or divergence of the edges of the receding surfaces with respect to parallel. (A) A fresco from the grotto of Touen Houang, China, from the Tang dynasty (618–906). The top of the table diverges 6˚, the side diverges 12˚. (From Fourcade 1962) (B) From Life in the Middle Ages by R. Delort. Universe Books, New York, 1972, p. 46. The top of the table converges 1˚, the side diverges 11˚. (C) From the Velislav Bible, c. 1340. In Prague University Library, MSXXIIIC.124 lob.412. The top of the well diverges 15˚, the side diverges 23˚. (D) Miracle of St. Guido by Jacopo da Bologna, in Pomposa Abbey, Ferrara, ca. 1350. Fototeca Berenson, Villa I Tatti, Florence. The added lines indicate that the edges of the top of the table diverge 2˚, while the edges of the side diverge 12˚.

cathedral in Florence. This was the first painting in correct perspective (Section 2.9.3). People who viewed the painting through a hole in front of the baptistery thought they were looking at the actual building. From that time on, paintings in correct perspective were accepted as more realistic than paintings with incorrect perspective. When Howard and Allison asked their subjects whether the drawings they had made were correct, they stated that the drawings were not correct but that they did not know how to improve them.

Howard and Allison presented subjects with a series of boxes that varied in the convergence or divergence of the receding surfaces and asked the students to select the box that most resembled a cube. On average, subjects selected the





cubic box and rarely selected a box with more than 0.5˚ of taper. Therefore, their inability to draw a cube was not due to an inability to perceive the true shape of a cube.

Subjects were also asked to select the best drawing of a cube from a set of drawings that varied in the convergence or divergence of the two outer receding edges. On average, subjects selected a drawing in which the outer edges (edges A and C in Figure 26.29B) diverged 8.5˚ relative to the accurate convergence of 26˚. A similar error, known as "regression to the real object," is produced when people select a shape in the frontal plane to match an inclined square (see Section 29.4.1a). However, none of the subjects selected a drawing that diverged relative to parallel. Also, while the mean error in selecting the correct drawing was 8.5˚, the mean error of divergence of the outer edges in the drawings of the cube was 34.9˚. Therefore, the error in drawing a cube is far greater than the error in selecting the correct drawing of a cube.

In a further experiment, Howard and Allison showed subjects only the top surface or only the side surface of the cube. Figure 26.30A shows the mean drawings of the two surfaces. The top surface was drawn as if it had been rotated about a horizontal axis toward the frontal plane. The side surface was drawn as if it had been rotated about a vertical axis toward the frontal plane. Thus, isolated receding surfaces tended to be drawn as if rotated toward the frontal plane, with the receding edges more or less parallel.

If the top and side surfaces of a fully visible cube were drawn in this way they would not be connected, because edge B would have to be drawn twice. Figure 26.29D shows how children solve this problem when drawing a cube. Some children draw the receding surfaces of a cube in the frontal plane with the edges not connected. Other children draw the surfaces in the frontal plane and connect the edges by extending one surface or by distorting the surfaces. It looks as though adults retain the tendency to rotate receding surfaces into the frontal plane and solve the problem of connecting the edges by distorting the drawing.

Figure 26.29. Drawings of a cube. (A) The best drawing produced by 80 students. The front surface was pre-drawn. Dashed lines indicate the accurate drawing relative to a vanishing point opposite the eye. (B) The mean of the 80 drawings. The error for each edge is with respect to the accurate drawing. (C) The worst of the 80 drawings. Edges A and C diverge 96˚ relative to the true value. Note how this subject drew edges D and E curved in order to connect them to edge B while approximating a right angle between them. (Derived from Howard and Allison 2011) (D) Examples of drawings of a cube by children aged 5 to 8 years. (From Chen 1985)

In the above experiments subjects started by drawing edge A and rotated the top surface toward the frontal plane. They then attempted to draw the side surface rotated about a vertical axis toward the frontal plane. However, edge B could not be drawn parallel to edge C because edge B was already drawn parallel to edge A. This introduced divergent perspective into the side surface. From this result, Howard and Allison predicted that when people start by drawing edge C they draw edge B parallel to edge C. Edge A is then drawn rotated about a horizontal axis toward the frontal plane, as before, but it cannot be parallel to edge B because edge B is already drawn parallel to edge C. This results in the top face of the cube, rather than the side face, being drawn with divergent perspective.

To test this prediction, 24 university students drew the cube with the edges drawn in the order A, B, C, D, E (A-first






condition), and in the order C, B, A, D, E (C-first condition). Figure 26.30 shows the mean of the drawings of the cubes in the two conditions. As predicted, edges A and B were drawn more parallel in the A-first condition than in the C-first condition, and edges C and B were drawn more parallel in the C-first condition than in the A-first condition. Furthermore, edges B and C were drawn divergently in the A-first condition, while edges A and B were drawn divergently in the C-first condition. These results support the hypothesis that the way people draw a cube depends on the order in which the surfaces are drawn. When the top surface is drawn first, it is rotated toward the frontal plane about a horizontal axis with its receding edges more or less parallel. The side face then tends to be drawn with divergent perspective. When the side surface is drawn first, it is rotated about a vertical axis toward the frontal plane with its edges more or less parallel. The top surface then tends to be drawn with divergent perspective.

Figure 26.30. (A) The mean drawing of each receding face of a cube seen in isolation. (B) The mean drawing of a cube when subjects started with edge A. (C) The mean drawing when subjects started with edge C. The thick lines were pre-drawn. The dashed lines indicate the correct drawings. The angles of convergence and divergence are with respect to parallel. (Derived from Howard and Allison 2011)

26.4 POSITION AND PERCEIVED DISTANCE

26.4.1 EFFECT OF HEIGHT IN THE FIELD OF VIEW

A single point of light moving along a monocular line of sight does not appear to move laterally or in depth. In general, the image of a point moving along a line other than a line of sight moves toward the vanishing point of that line, which is where the line of sight parallel to the line of motion intersects the retina. Thus, the image of a point moving along a line parallel to the visual axis moves toward the fovea, as is evident in Figure 26.1.

Consider two objects at different distances on a ground surface. The more distant object is nearer to the horizon, and its image is therefore higher in the field of view. On a ceiling, the image of the more distant object is lower in the field of view.
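The geometry of the vanishing point of a moving point is easy to illustrate. In the sketch below (Python with NumPy; the starting position and direction are arbitrary), points along a 3-D line are projected onto a plane at unit distance; the images converge on the image of the line's direction vector, its vanishing point:

```python
import numpy as np

def project(p):
    # Perspective projection onto the image plane z = 1.
    return p[:2] / p[2]

p0 = np.array([1.0, -0.5, 2.0])    # starting position (x, y, z)
v = np.array([0.2, 0.1, 1.0])      # direction of motion (receding in depth)

for t in (0.0, 5.0, 50.0, 500.0):
    print(t, project(p0 + t * v))  # approaches the vanishing point
print("vanishing point:", project(v))  # (0.2, 0.1)
```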







When two lights are shown in dark surroundings, the one higher in the visual field appears more distant (Kilpatrick 1961). Similarly, the lower of two equal circles on a card appears nearer and smaller than the higher circle (Roelofs and Zeeman 1957). In these cases, observers adopt the default assumption that the stimuli are on a horizontal surface below eye level, rather than on a ceiling surface. These are pure height-in-the-field effects, assuming that no other depth cues were present.

A related effect comes into play when objects are seen on textured surfaces. For any eye position, as an object moves in depth on a flat surface, its image moves toward the horizon of that surface. Consequently, an object appears to move in depth when it moves toward the horizon of a slanted or inclined textured surface.

For a horizontal surface, an object's height in the field and its vertical distance to the horizon are essentially the same measure. Gardner et al. (2010) devised pictorial displays in which height in the picture was dissociated from distance to the horizon. Distance from the horizon was the dominant factor governing judgments of relative depth. Height in the picture was a factor only when distance to the horizon was eliminated.


26.4.2 OPTICAL ADJACENCY

Proposition 10 in Euclid's Optics (300 BC) reads, "For a horizontal surface located above eye level, the parts further away appear lower." We could judge the relative distances of objects by the location of their points of contact on a horizontal surface. Alhazen expressed this idea in the 11th century (Section 2.2.4d). James Gibson (1950a) stressed the same idea.

There is a strong effect of position on the perceived relative distance of two objects on a slanted or inclined surface. The object nearer to the horizon of the surface appears more distant and larger than the object more distant from the horizon. The effect is also evident for identical objects placed on a frontal surface containing lines converging to a vanishing point (Dunn et al. 1965). A frontal surface with a texture gradient produces a stronger effect on the perceived relative distances of objects than a simple tapered blank frame (Epstein 1966).

A monocularly viewed object suspended above a horizontal textured surface appears as if it is sitting on the point on the surface that is optically adjacent to the base of the object. This is the point of optical adjacency. Let two equally distant objects be suspended above a textured ground surface, with one object higher than the other. The point of optical adjacency of the higher object is nearer the horizon than that of the lower object. Thus the higher object appears more distant than the lower object, as shown in Figure 26.31A (Rock 1975, p. 147). A rod appears inclined when its points of contact on a floor and ceiling are at different heights, as in Figure 26.31B (Martino and Burigana 2009).



Figure 26.31. Perceived depth and optical adjacency. (A) The square on the rod is above the triangle. However, it appears on the same level and beyond the triangle when the display is viewed from the front. This is because the square appears to touch the plane where its base is optically adjacent. (Adapted from Rock 1975, p. 147) (B) The two red lines appear inclined in opposite directions because of differences in their points of contact on floor and ceiling. (Adapted from Martino and Burigana 2009)

The effects of optical adjacency on judgments of distance occur on ground, ceiling, and wall surfaces. Since ground surfaces are dominant in natural scenes, one might expect the effects of adjacency to be strongest for ground surfaces. Bian et al. (2005) used displays depicting vertical bars between a ground surface and a ceiling surface, and horizontal bars between two wall surfaces, as in Figure 26.32. In each case, the surface contact cues were contradictory. In judging the relative distances of the bars, observers showed a preference for using contact with the ground surface rather than contact with the ceiling. Bian et al. called this the ground dominance effect. The effect decreased as the surfaces were tilted away from the ground or ceiling orientation.


Figure 26.32. The ground dominance effect. Subjects reported which of the two striped bars appeared nearest when they extended between ground and ceiling surfaces, as in (A), or between wall surfaces, as in (B). (From Bian et al. 2005)

26.4.3 THE VISUAL HORIZON AND ABSOLUTE DISTANCE

26.4.3a Basic Requirements

Judgments of the absolute distance of an object require more information than judgments of the relative distance of two objects. Consider an eye at an orthogonal distance H from a flat surface, as in Figure 26.33A. The surface can be at any angle relative to gravity or the observer. Let point O be the point on the surface nearest to the eye. The line of sight to this point is orthogonal to the surface, so it may be called the orthogonal point. Let P be any other point on the surface. Let the angle between the line of sight of O and the line of sight of P be φ, and the angle between the line of sight of P and the horizon be θ. The distance, d, of point P from the eye is:

d = H / cos φ

The distance, D, of P from the orthogonal point is:

D = H tan φ

Figure 26.33. Distances of points on a flat surface. (A) The distance, d, of point P on a surface at orthogonal distance H from the eye is given by d = H / cos φ, where φ is the angle subtended by P and the nearest point on the surface. It is also given by H / sin θ, where θ is the angle subtended by P and the horizon. (B) An ideal base-down prism displaces O to O′ and P to P′. This decreases the apparent inclination of the surface and displaces the horizon of the surface. The optical distance of P from the eye is reduced but not its distance from O′.





Otherwise, the distance of point P is given by H / sin θ, as shown in Figure 26.33. Judgments based on this angle require the observer to register the location of the surface horizon. This could be derived from perspective. For a horizontal surface, the horizon is indicated by the location that is foveated when the gaze is orthogonal to the frontal plane.

Judgments of absolute distance require the following information:

1. Registration of the orthogonal distance of the eye from the surface.

2. Registration of the angle subtended at the eye by the object and the nearest point on the surface, or registration of the angle subtended by the object and the horizon of the surface.

3. Proper registration of the point of contact of the object on the surface.

The orientation of the surface to gravity or to the head is not relevant unless one assumes that the horizon of the surface is at eye level. Some animals seem to use this type of information (Section 33.2.2).

From the above one can predict that, if eye height is underestimated, the distance of a point on a surface should be underestimated, as shown in Figure 26.34A. Similarly, overestimation of eye height should produce overestimation of distance. Also, estimates of distance should be a function of errors in registration of the angle of declination of the object below the horizon, as shown in Figure 26.34B. The following evidence supports these predictions.
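These predictions follow directly from the geometry. The sketch below (Python; the function name, eye height, and angles are illustrative values) computes the ground distance of an object from registered eye height and the registered angle of declination of its base below the horizon; underregistering eye height, or overregistering the declination, shrinks the computed distance:

```python
import math

def ground_distance(eye_height, declination_deg):
    # Distance along the ground: D = H / tan(theta).
    # The eye-to-object distance is H / sin(theta).
    return eye_height / math.tan(math.radians(declination_deg))

print(ground_distance(1.6, 10.0))  # veridical registration: ~9.07 m
print(ground_distance(1.2, 10.0))  # eye height underestimated: ~6.81 m
print(ground_distance(1.6, 13.0))  # declination overestimated: ~6.93 m
```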

contained two textured regions (grass and concrete). However, subjects made accurate judgments as long as there was continuity of texture around objects on the surface (He et al. 2004). Does the effect of texture discontinuity on judged distance occur in pictures as well as in real ground surfaces? Feria et al. (2003) used pictures of ground-planes or frontal surfaces with homogeneous texture, or with one texture over the near half and a different texture over the far half. Subjects matched the frontal distance between two rods to the perceived depth between rods on either side of the texture discontinuity. As in previous studies, distance across a texture discontinuity on a ground surface was underestimated. But they found similar underestimations for frontal surfaces. Thus, effects of a texture discontinuity on judged distance is not confined to surfaces lying in depth. In a natural scene, one object may be supported on the surface of another object, which in turn is supported on the ground. For example, a cup may sit on a box placed on a table that stands on the floor. Eye height is different for each surface. Sedgwick refers to these as nested contact relations. Meng and Sedgwick (2001) investigated contact relations using a pictorial representation of a cube sitting on a rectangular slab lying on a textured horizontal plane, as Eye level


26.4.3b Effects of Surface Discontinuities

Sinai et al. (1998) found that subjects were reasonably accurate in setting an object on the ground to a distance from their feet equal to their eye height. However, they greatly overestimated eye height with respect to a horizontal surface at the bottom of a 2-meter drop immediately in front of them, as in Figure 26.35. They also overestimated the distance of an object on the horizontal surface at the bottom of the drop. Thus, overestimation of eye height was associated with overestimation of distance on the horizontal surface, as predicted (see Figure 26.34A). Standing subjects made accurate judgments of the distance of an object on a horizontal textured surface at a distance of 4 m. However, they overestimated distance by about 25% when the ground surface between them and the object contained a 3.7-meter-wide hole. Thus, ground-surface continuity is required for accurate judgments of distance.

A related question is whether depth judgments are affected by discontinuities of texture between far and near parts of a ground surface. Sinai et al. (1998) found that subjects underestimated distances when the ground surface contained two textured regions (grass and concrete). However, subjects made accurate judgments as long as there was continuity of texture around objects on the surface (He et al. 2004).

Does the effect of texture discontinuity on judged distance occur in pictures as well as on real ground surfaces? Feria et al. (2003) used pictures of ground planes or frontal surfaces with homogeneous texture, or with one texture over the near half and a different texture over the far half. Subjects matched the frontal distance between two rods to the perceived depth between rods on either side of the texture discontinuity. As in previous studies, distance across a texture discontinuity on a ground surface was underestimated. But they found similar underestimation for frontal surfaces. Thus, the effect of a texture discontinuity on judged distance is not confined to surfaces lying in depth.

In a natural scene, one object may be supported on the surface of another object, which in turn is supported on the ground. For example, a cup may sit on a box placed on a table that stands on the floor. Eye height is different for each surface. Sedgwick refers to these as nested contact relations. Meng and Sedgwick (2001) investigated contact relations using a pictorial representation of a cube sitting on a rectangular slab lying on a textured horizontal plane, as shown in Figure 26.36. Viewing monocularly, subjects moved a marker along a track on the simulated ground surface until it appeared at the same distance as the front edge of the cube. The perceived distance of the cube varied predictably when the perceived distance of the slab was varied by changing the point of optical contact between slab and ground. Distance estimates became more variable as the thickness of the slab was increased, thus increasing the distance between the top of the slab and the ground surface. Distance estimates were improved by the addition of features, such as shadows and legs, which increased the perception of the slab as being in contact with the ground rather than floating above it.

Meng and Sedgwick (2002) asked subjects to set a marker on one slab to match the perceived distance of a cube sitting on a second slab. When the tops of the slabs were coplanar, the distances of marker and cube were compared directly and accurately, bypassing relations between slabs and ground. Making one slab higher than the other introduced errors, indicating that subjects only partially integrated information across the surfaces.




Figure 26.34. Errors in distance judgments. (A) Effects of underestimation and overestimation of eye height on the apparent distance of an object. (B) Effects of errors in judging the angle of declination of an object below eye level on the apparent distance of an object.


26.4.3c Effects of Surface Extent

Figure 26.35. Stimulus setup used by Sinai et al. (1998).


The judged distance of an object on a textured ground plane should be most accurate when the whole surface from under the feet to the horizon is in view. Wu et al. (2004) had subjects monocularly view an object on a field and then walk blindfolded to it. They were reasonably accurate with a full view of the field and when an aperture restricted the view to about 40˚. The feet would not be visible with a field of view of 40˚; thus, view of the feet is not required for accurate judgments of distance (see also Creem-Regehr et al. 2005). But distances were underestimated when vision was restricted to an area around the object of about 21˚, and more so when vision was restricted to about 14˚. The patch of ground with restricted view appeared displaced toward the frontal plane. This could account for the distance underestimation. Accuracy was restored when subjects with the 14˚ field of view were allowed to scan the ground surface from near to far. Scanning from far to near was not effective. Wu et al. concluded that people can integrate sequential information from a ground surface but that the near part of the surface is most important.

Even without restricted vision, distance estimates are affected by the spatial extent of the surface on which targets are placed. Lappin et al. (2006) asked observers to place a person halfway between themselves and another person at a distance of either 15 or 30 m. In a hallway, subjects placed the point about 8% further away than the true midpoint. The error was 13% in a large lobby of a building and only 3.2% in an open field. It is not clear why the nearer distance was underestimated relative to the further distance or why the effect varied between the three visual environments.

Witt et al. (2007) asked observers to set an object on the floor of a hallway 11.5 m long to the same distance as an object in a hallway 21.5 m long. The objects appeared at the same distance when the object in the short hallway was placed nearer than that in the long hallway. Similar results were obtained with objects on a short meadow and a long meadow. The results of these two experiments could perhaps be due to an effect of the spatial extent of a ground surface on the registration of the visual horizon.

26.4.3d Effects of Prisms

Figure 26.36. Display used by Meng and Sedgwick (2001).

Figure 26.33B shows the effect of an ideal prism that displaces the whole visual scene upward by the same angle. Ideally, this should cause an inclined surface to appear less inclined to the vertical. It should also reduce the apparent distance of points on the surface from the eye, but not their apparent distance from the orthogonal point. A prism changes the location of the visual horizon but does not affect the horizon based on the felt position of gaze. Thus, when a person views a ground surface through base-down prisms, an object appears at a smaller angle below the

horizon defined by the angle of gaze than when the object is viewed normally. Therefore, an object on the ground should appear more distant when it is viewed without prisms than when it is viewed with prisms, as shown in Figure 26.33B. Ooi et al. (2001) confirmed this prediction by having subjects view an object on the ground through 10˚ base-up prisms and then walk to the object in the dark. They then asked subjects to walk about while wearing the prisms for 20 minutes. The prisms were then removed, and the walking-in-the-dark test was applied. In this case, subjects overestimated distances because their perceived eye level had become adapted in a downward direction during exposure to the prisms. Gardner and Mon-Williams (2001) obtained similar results using a pointing task.

Thompson et al. (2007) conducted similar experiments with targets placed at distances between 5 and 20 feet on the 9-foot-high ceiling of a large room. The walking-in-the-dark test showed that distances were judged accurately. As predicted, 10˚ base-up prisms caused distances on the ceiling to be overestimated. It is not clear what perspective cues were provided on the ceiling. Also, both ceiling and floor were visible, so that subjects may have used floor distances to calibrate ceiling distances.
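The declination relation of Section 26.4.3a gives a way to sketch the angular side of these prism predictions; the 10˚ deviation and eye height below are assumed illustration values, not data from these studies:

```python
import math

# Sketch: how displacing the registered declination theta by a prism-like
# angular deviation changes the registered distance d = H / sin(theta).
H, theta, delta = 1.6, 12.0, 10.0  # eye height (m), true declination, deviation (deg)

for shift in (-delta, 0.0, +delta):
    d = H / math.sin(math.radians(theta + shift))
    print(f"declination {theta + shift:5.1f} deg -> registered distance {d:5.1f} m")
# Shrinking the registered declination expands the registered distance, and
# enlarging it contracts it; the direction of the perceptual effect then
# depends on which way a given prism shifts the scene relative to felt eye level.
```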

26.4.4 HEIGHT IN THE FIELD AND SIZE JUDGMENTS

Consider the task of judging the height of an object standing on point P on a horizontal surface below eye level, as shown in Figure 26.37. The vertical distance between the observer's eye and the surface is the eye height. Let the distance between the image of point P and the horizon be the horizon height at that distance. Let the height of the image of the object be its image height. For any object, sitting at any distance on a horizontal surface, the ratio of object height to eye height (H) equals the ratio of image height to horizon height. Thus, an object that projects an image extending from the ground surface to the visual horizon (image height equals horizon height) has the same height as the eye of the observer. If the top of the object's image extends halfway to the horizon, the object is half as high as the eye height (Sedgwick 1983) (Portrait Figure 26.38). The linear height of any image is proportional to the tangent of its angle of subtense. If horizon height subtends angle θ and object height angle φ, then

H = (tan θ − tan(θ − φ)) / tan θ
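A minimal sketch of this horizon-ratio relation (the scene values are assumed for illustration):

```python
import math

def height_ratio(theta_deg, phi_deg):
    # Horizon-ratio relation on a flat image plane:
    # H = (tan(theta) - tan(theta - phi)) / tan(theta), where theta is the
    # angle of the object's base below the horizon and phi is the angle
    # subtended by the object's height.
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.tan(t) - math.tan(t - p)) / math.tan(t)

# Check against a known scene: eye height 1.6 m, object 0.8 m tall, 10 m away.
E, h, D = 1.6, 0.8, 10.0
theta = math.degrees(math.atan2(E, D))            # base of object below horizon
phi = theta - math.degrees(math.atan2(E - h, D))  # angular height of the object

print(height_ratio(theta, phi) * E)  # ~0.8: the object's height is recovered
```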


Figure 26.37. Eye height and object size. For any object, sitting at any distance on a horizontal surface, the ratio of object height to eye height equals the ratio of image height to horizon height, as explained in the text.



According to this analysis, judgments of height should be most accurate and precise for objects the same height as the observer's eyes. Bertamini et al. (1998) found that discrimination of differences in the lengths of two poles standing on the floor of a hallway improved as their tops were brought nearer to eye level, whatever the lengths of the poles. This supports the idea that the visual horizon is used for judging the relative heights of objects.

The analysis also predicts that the height of an object will be overestimated if eye height is overestimated. Wraga (1999) displayed vertical rectangular cards on a horizontal surface at ground level. Subjects viewed the display monocularly through an aperture. The subject's view is depicted in Figure 26.39.



Figure 26.38. Harold A. Sedgwick. Born in Oakland, California, in 1946. He obtained a B.A. in psychology in 1967 and a Ph.D. with J. J. Gibson in 1973, both from Cornell University. He conducted postdoctoral work with Leon Festinger at the New School for Social Research in New York. Since 1974 he has held an academic appointment in the department of vision sciences at the State University of New York, College of Optometry, in New York City.


Judgments of the height of the cards were similar for standing and seated subjects. Subjects overestimated the size of cards that were less than 0.5 of eye height when the floor seen through the aperture was raised 17 cm above the actual floor. Subjects saw the false floor as the one beneath their feet. Also, the apparent height of small objects was changed in a predictable way when the perspective gradient of lines on the floor was altered (Wraga and Proffitt 2000). This suggests that, for small objects, perspective is used to judge eye height. However, manipulations of perspective did not affect the apparent sizes of cards that were more than half of eye height. This suggests that, for large objects, subjects derived eye-height information from the perceived direction of gaze rather than from perspective.

People were able to use horizon-ratio relations to judge the relative sizes of objects in a picture as long as image sizes were not too different and the horizon line was not too far from the center of the picture (Rogers 1996). People use eye height in judging whether an aperture is large enough to allow them to walk through it (Warren and Whang 1987).

26.5 TEXTURE PERSPECTIVE

26.5.1 TYPES OF TEXTURE GRADIENT

A surface with homogeneous texture contains elements that do not vary in shape, size, or density, or that vary in a statistically uniform manner over the surface. A surface with isotropic texture contains elements that are similar in shape, size, and density at all orientations within the surface.

Consider the image of a surface with homogeneous and isotropic texture produced by polar projection. Any curved surface concentric with the nodal point of a spherical retina produces an image with no perspective. Also, surfaces parallel to a flat projection plane produce no perspective. The slope and direction of the local texture gradient around any point indicate the slope and direction of the surface relative to the visual line through that point. A change in a texture gradient indicates a change in the slope of the surface, as in Figure 26.40A (Gibson 1950a, 1961). A gap in an otherwise constant texture gradient indicates a step, as in Figure 26.40B. Texture gradients are usually defined with respect to images projected on a flat surface rather than on a curved retina. The three types of texture perspective are summarized in Table 26.1. They normally occur together but may be separated experimentally.

26.5.1a Texture Size

The image of a small object produced by polar projection decreases in size in inverse proportion to its distance. This is size scaling. It is most reliable as a cue to relative depth when the objects are the same physical size. Vertical objects on an inclined surface, such as blades of grass on a field, produce a gradient of image size. Gradient steepness increases with increasing inclination of the surface with respect to the frontal plane. In parallel projection, image size does not change with distance.
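A short sketch of size scaling under polar projection (the object size and distances are assumed values):

```python
import math

# Angular size of a fixed object under polar projection is roughly inversely
# proportional to its distance: 2 * atan(size / (2 * distance)).
def angular_size_deg(size_m, distance_m):
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

for d in (1, 2, 4, 8):
    print(f"{d} m: {angular_size_deg(0.2, d):.2f} deg")
# The angular size roughly halves with each doubling of distance; under
# parallel projection the projected size would not change at all.
```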

26.5.1b Texture Density

Texture elements on a frontal surface produce images that become more densely spaced with increasing distance of the surface. This is so in polar projection but not in parallel projection. An inclined textured surface produces a gradient of texture density in polar projection.

Figure 26.39. Size judgments and eye height. White cards of various sizes were displayed at a distance of 2.3 m on a horizontal ground surface and viewed through a rectangular aperture. (Adapted from Wraga and Proffitt 2000)


Figure 26.40. Texture gradients. (A) A change in texture gradient produces a change in slope. (B) A gap discontinuity in a texture gradient produces a step. (Redrawn from Gibson 1950a)


Table 26.1. Perspective changes in a rigid textured surface below eye level, projected onto a frontal projection plane.

POLAR PROJECTION

  Type                  Increased distance   Increased slant             Depth curvature
  Linear perspective    No change            Increases                   Unspecified
  Texture size          Decreases            Overall size gradient       Higher-order gradients
  Texture density       Increases            Overall density gradient    Higher-order gradients
  Aspect ratio          No change            Changes                     Changes

PARALLEL PROJECTION

  Type                  Increased distance   Increased slant                Depth curvature
  Linear perspective    No change            No change                      No change
  Texture size          No change            No change in image width       No change in image width
  Texture density       No change            No change in lateral density   No change in lateral density
  Aspect ratio          No change            Changes                        Changes

In using a density gradient as a cue to surface inclination, the observer must correctly assume that the texture elements are distributed homogeneously on the surface.
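A sketch of how a homogeneous ground texture yields a density gradient in the image (the eye height and element spacing are assumed values):

```python
import math

# Equally spaced elements on a horizontal ground plane project to declination
# angles atan(H / x) below the horizon, so their angular density grows with
# distance, producing the density gradient described above.
H, spacing = 1.6, 0.5  # eye height (m), element spacing (m)

for x in (2, 4, 8, 16):
    sep = math.degrees(math.atan(H / x) - math.atan(H / (x + spacing)))
    print(f"{x:5.1f} m: {1 / sep:6.2f} elements per degree")
# Angular separation shrinks with distance, so image density increases.
```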

26.5.1c Aspect Ratio

The image of a vertical line becomes shorter approximately in proportion to the cosine of its angle of inclination to the normal to the line of sight. This is foreshortening. The image of a tilted line becomes more parallel to the horizon as the line is increasingly inclined, as shown in Figure 26.41A. The image of a 2-D shape is compressed in the direction of inclination but not in the orthogonal direction. The aspect ratio of the image is its in-depth dimension (y) divided by its dimension parallel to the axis of inclination (x). If the angle of inclination to the vertical is i, then cos i = y/x.

The aspect ratio of the image of a single shape, such as a circle, provides information about inclination only if the true shape is known. Thus, an ellipse is seen as an inclined circle only if it is assumed that the shape is a circle. Also, the aspect ratio of a single image does not indicate the sign of inclination. Aspect-ratio perspective of a single shape occurs in both polar and parallel projection.

In polar projection, the aspect ratio of the images of texture elements on an inclined flat surface decreases with increasing distance along the surface. This is an aspect-ratio gradient. It occurs because the texture elements become increasingly inclined to the line of sight. An inclined surface also produces an overall gradient of texture density. These two texture gradients are reliable cues for both the magnitude and sign of inclination when the texture of the surface is homogeneous and isotropic and is correctly assumed by the observer to be so. Aspect-ratio gradients and texture-density gradients do not occur with parallel (orthographic) projection of flat surfaces.

The cue of aspect ratio is not available for a sphere, because a sphere always projects as a circle on a spherical retina. On a flat projection surface, an off-center sphere projects as an ellipse with its major axis directed toward the principal point (Figure 26.29). For a surface covered with spheres, depth is indicated on the retina only by the gradient of texture size and density (Saunders 2003).

An array of vertical rectangles standing on a ground plane, as in Figure 26.41B, produces an aspect-ratio gradient on the retina. This is because, with increasing distance, the rectangles become less inclined to the line of sight. Rectangles that remain orthogonal to a line of sight, as in Figure 26.41C, do not produce an aspect-ratio gradient. Vertical rectangles also do not produce an aspect-ratio gradient on a flat image plane, because the rectangles remain parallel to the image surface, as in Figure 26.41D. The images produced on spherical and flat image planes by vertical rectangles standing on a ground plane are depicted in Figure 26.42.

Knill (1998a) derived an ideal observer to analyze the depth information content of each type of texture perspective (Portrait Figure 26.43). He quantified the reliability of each cue and derived a measure of how reliability varies with angle of inclination, field of view, and prior assumptions of texture homogeneity and isotropy.
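A sketch of the relation cos i = y/x for a shape of known form (the measured aspect ratio is an assumed value):

```python
import math

# Recovering inclination from the aspect ratio of a known circle: if a circle
# is inclined by i to the vertical, its image is an ellipse with cos(i) = y/x.
def inclination_from_aspect(y, x):
    return math.degrees(math.acos(y / x))

print(inclination_from_aspect(0.5, 1.0))  # 60 deg
# As noted in the text, the sign of inclination (top-away vs. top-toward)
# is not recoverable from a single aspect ratio.
```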


26.5.2 TEXTURE GRADIENTS AND PERCEIVED INCLINATION

26.5.2a Underestimation of Surface Inclination

Gibson (1950b) rear-projected slides of textured surfaces inclined about a horizontal axis onto a circular area in a frontal surface, as shown in Figure 26.44. Subjects set a tactile board to match the perceived inclination of each display. Perceived inclination with respect to the vertical increased with increasing texture gradient, but judgments were more frontal than predicted from the gradient. Perceived inclination was greater for regular textures than for irregular textures, and for coarse textures than for fine textures. Regular textures produce linear perspective in the form of convergence of aligned texture elements. Linear perspective makes the texture gradient more evident. Gruber and Clark (1956) obtained similar results with real inclined textured surfaces viewed through a square aperture. They also found that slant was underestimated.

Estimates of the inclination of surfaces depicted on a computer screen with random-dot texture were accurate to within 3˚ (Zimmerman et al. 1995). Subjects adjusted the length of a line at the same inclination as the surface until it appeared equal in length to an orthogonal line, as shown in Figure 26.45. The method relies on subjects' perceiving the lines as lying on the surface and on their matching the in-depth length of the lines rather than their visual subtenses. Performance was not affected when the square surface was tilted with respect to the horizontal axis of inclination, which shows that subjects were not merely matching the aspect ratio of the test lines to that of the edges of the test square. But they could have been matching the aspect ratio of the test lines to that of the texture elements. If this is what they were doing, the results tell us nothing about perceived inclination, just as matching the lengths, orientations, or colors of two objects tells us nothing about perceived length, orientation, or color. Zimmerman et al. found no evidence of normalization to the vertical. But the effect of normalization may have been canceled because it affected the probe and test surface equally. Inclination was severely underestimated when the surface was viewed through a frontal rectangular aperture, as in Gibson's experiment. This could be because subjects perceived the rectangular aperture as the boundary of the textured surface, rather than because an isolated texture gradient is a poor stimulus for perceived inclination.

Li and Durgin (2010) depicted a 33˚-diameter textured surface in a helmet-mounted display. A fixation point at eye level in the center of the surface was at simulated distances between 1 and 16 m, as specified by binocular disparity with respect to a simulated circular aperture. The surface was inclined between 6 and 36˚ with respect to horizontal. Subjects estimated inclination verbally and by judging when a lateral distance between two points was equal to an in-depth distance between two points on the surface. The two methods gave similar results. For each inclination, verbal estimates of inclination with respect to horizontal increased rapidly as viewing distance increased from 1 to 6 m and less rapidly as distance increased from 6 to 16 m. For each distance, estimates of inclination increased in an approximately linear fashion as inclination increased. The overestimation of inclination was proportionately greater for smaller inclinations from horizontal.

The following reasons have been proposed for underestimation of the inclination of surfaces with respect to the vertical or, equivalently, overestimation with respect to horizontal.



Figure 26.41. Aspect ratios. (A) The image of an inclined object becomes increasingly compressed along an axis orthogonal to the axis of inclination. The image of a tilted line becomes less tilted to the axis of rotation. (B) The image of a vertical object on a ground plane becomes longer relative to its width as the object moves into the distance. The object becomes less inclined to the line of sight. (C) The aspect ratio of the image of an object on a ground plane remains constant when the object is orthogonal to the line of sight. The angle to the line of sight remains constant. (D) The aspect ratio of the image of an object moving over a ground plane remains constant on a flat projection surface. The angle to the line of sight changes, but so too does the angle of the image on the surface.


Figure 26.42. Images of vertical rectangles on a ground plane. (A) An approximate representation of the retinal images of vertical rectangles standing on a ground surface below eye level. With increasing distance, image width decreases, but the decrease is reduced because of increasing height in the field. Image height decreases with increasing distance, but the decrease is reduced by decreasing inclination to the line of sight and by increasing height in the field. (B) Images on a flat image plane produced by vertical rectangles on a ground surface. With increasing distance, the image shrinks in size with no change in aspect ratio or convergence.


Figure 26.45. Estimating inclination by setting aspect ratio. Two orthogonal test lines are placed on the surface, and the subject adjusts the length of one of them until they appear equal in length as seen on the surface. (After Zimmerman et al. 1995)

Figure 26.43. David C. Knill. Born in New Orleans in 1960. He obtained a B.S. in computer science from the University of Virginia in 1982 and a Ph.D. in experimental psychology from Brown University in 1990 with Daniel Kersten. After postdoctoral work at the University of Minnesota, he joined the Department of Psychology at the University of Pennsylvania in 1994. In 1999, he moved to the University of Rochester, where he is now professor in the Department of Brain and Cognitive Sciences and the Center for Visual Science.


Figure 26.44. Texture gradients and perception of inclination. (From Gibson 1950b. Used with permission of the University of Illinois Press)




1. Normalization to the frontal plane. Gibson and others suggested that an isolated inclined surface tends to normalize to the frontal plane. Gruber and Clark (1956) noticed that a frontal square aperture appeared to taper and incline in the opposite direction to the texture gradient seen through it. The inclined surface could have normalized to the frontal plane and induced a corresponding change in the perceived inclination of the frame so as to preserve the perceived relative inclination of the two stimuli. This is a type of perspective contrast.

2. Incorrect viewing. The nodal point of the eye may not be located at the center of projection that was used to generate the simulated inclination of the textured display. This would not account for underestimation of the inclination of real surfaces.

3. Effect of conflicting depth cues. Inclination indicated by a texture gradient in a frontal-plane display may be underestimated because other sources of information, such as accommodation, convergence, or disparity, indicate that the display is frontal. However, real inclined textured surfaces with full depth cues also appear displaced toward the frontal plane (Section 29.4.3).

4. Effect of the viewing aperture. An inclined surface may be seen as coplanar with the frontal aperture surrounding it. The absence of perspective taper in the aperture would cause the surface to appear frontal (Eby and Braunstein 1995).

5. Perceived location of the orthogonal point. Most experiments on the perception of inclination involved small displays inclined about a horizontal axis at eye level and viewed through a frontal aperture. Under these conditions, the point on the surface at eye level is constant and can be seen clearly. However, the point on the surface where a line of sight is perpendicular to the surface varies and is out of sight. In natural conditions we usually judge the inclination of a surface below eye level, and it is often one on which we are standing. In this case, the point where a line of sight is perpendicular to the surface is constant and can be seen. Perrone (1980) argued that the perceived location of this point can affect the perceived inclination of a surface. He suggested that, in viewing an inclined surface through an aperture, people tend to perceive the nearest point on the surface as the point where the line of sight is perpendicular to the surface. This causes inclination to be overestimated. The nearer the surface extends toward the observer, the closer the nearest point is to the correct point. Therefore, the perceived inclination of larger displays is less than that of smaller displays. In a later paper, Perrone modified this theory. He suggested that errors in judging the inclination of a surface seen through an aperture arise because people assume that eye level (straight ahead) corresponds to the shortest distance from the eye to the surface (Perrone 1982).

6. Extent of the field of view. A texture gradient, and hence the inclination of a surface, is easier to detect when more of the surface is visible in a fixed aperture. For example, inclination is more evident in Figure 26.46A, which was taken with a camera with a 60˚ field of view, than in Figure 26.46B, in which the camera had a 5˚ field of view (Todd et al. 2005). The number of texture elements was made the same in the two views.

7. Effect of texture regularity. Turner et al. (1991) concluded from a computer simulation that the inclination of a textured surface can be recovered from the outputs of simple cells in the visual cortex. However, outputs of the modeled filters for an inclined surface with irregular texture resembled those from a surface with regular texture inclined at a smaller angle. This would explain why inclination is underestimated more for irregular than for regular textures.

8. Orientation of the axis of slant. Cohen and Zaidi (2007) found that two abutting surfaces with opposite texture gradients appeared more steeply slanted when they met in a vertical line than when they met in an oblique line, as shown in Figure 26.47. Other investigators have reported the same effect. Cohen and Zaidi found that the magnitude of the 2-D anisotropy of perceived angle size was sufficient to account for the 3-D anisotropy of perceived slant.

9. The number of cues to distance. Judgments of inclination are more accurate when more information about depth is added (see Section 29.4.3).



26.5.2b Comparison of Texture Cues

The relative effectiveness of different types of texture gradient can be assessed by isolating each type or by pitting one type against another. Phillips (1970) used computer-generated displays of ellipses viewed monocularly through a circular aperture. The displays had a texture gradient defined by (a) element size and aspect ratio alone, (b) texture density alone, (c) the two gradients in agreement, and (d) the two gradients in opposition. Subjects made paired comparisons between the perceived inclinations of different displays. Only the texture gradient defined by element size and aspect ratio contributed significantly to perceived inclination.

Knill (1998b, 1998c) generated different types of texture perspective on a frontal screen seen through a rectangular aperture. In a forced-choice procedure, subjects discriminated between a test surface with a simulated inclination of 65˚ to the vertical and a comparison surface with variable inclination. An ideal observer analysis was used to derive the weights of the three perspective cues used by subjects. Subjects relied on foreshortening (aspect ratio) of texture elements, based on a prior assumption of texture isotropy, more than they relied on gradients of texture size or texture density.



Figure 26.46. Effect of field of view on apparent inclination. (A) Taken with a camera with a 60˚ field of view. (B) Taken with a camera with a 5˚ field of view. (Reprinted from Todd et al. 2005 with permission from Elsevier)

Figure 26.47. Stimuli used by Cohen and Zaidi (2007).


Discrimination of inclination improved with increasing inclination to the vertical, a result confirmed by Rosas et al. (2004). This is what one would expect from the fact that the aspect ratio of texture elements changes as a function of the cosine of the angle of inclination. The cosine function becomes steeper away from the vertical. Discrimination was better when only the upper part of a top-away inclined surface was visible than when only the lower part was visible. This is because inclination to the line of sight is greater in the upper half of a top-away inclined surface. Performance improved when the width of the inclined surface was increased, showing that subjects integrated information over a considerable area.

The relative effectiveness of different texture cues varies with the angle of inclination of a surface. For example, in a computer-generated display, texture density of horizontal line elements was a more effective cue to surface inclination than convergence of vertical line elements (Andersen et al. 1998). This superiority was greater for simulated ground and ceiling surfaces than for surfaces near the frontal plane. Inclination discrimination was highest for a texture with a prominent aspect-ratio gradient and least for a surface covered with noisy texture (Rosas et al. 2004). However, discrimination was affected by the type of texture when inclination was small (26˚) but not when it was large (66˚).

The importance of the aspect-ratio cue also depends on the type of display. For example, texture size and density scaling are more evident in regular arrays of rectangles than in random arrays (Frisby et al. 1996). For surfaces covered with regular arrays of small dots, aspect ratio can be defined in terms of intervals between the rows of dots relative to intervals between columns of dots. Braunstein and Payne (1969) presented computer-generated surfaces containing dots in regular columns and rows with aspect ratios defined in this way. The texture gradient was produced by polar projection and corresponded to inclinations from vertical of 25˚, 50˚, and 75˚. For each aspect ratio, the perspective convergence of the columns of dots corresponded to various inclinations between 0 and 65˚. The surfaces were presented in pairs, and subjects indicated which member of each pair appeared more inclined. When the two cues were in conflict, relative depth judgments were based on perspective convergence of the dot columns rather than on aspect ratio. However, when the dots were distributed randomly, perspective convergence became the less effective cue. Perhaps aspect ratios of spaces between texture elements are less effective than aspect ratios of texture elements larger than those used in the above experiment.

Since texture density is a more global parameter than the aspect ratio of texture elements, one might expect texture density to be less reliable than aspect ratio with small displays but more reliable with larger displays. However, Buckley et al. (1996) failed to find a significant effect of changing texture density on subjects' judgments of surface inclination for any size of display. Only changes in aspect ratio had any effect.

Livingstone and Hubel (1987) claimed that a texture gradient defined by equiluminant colors does not produce an impression of depth. However, Cavanagh (1987) obtained depth percepts from equiluminant line drawings with perspective (Portrait Figure 26.48). Also, it has been reported that the perception of inclination is retained in an equiluminant display containing a texture gradient (Troscianko et al. 1991).

Methods for computing surface slope and shape from texture gradients are discussed in Bajcsy and Lieberman (1976) and Stone (1993).
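The cosine argument above, that aspect ratio changes faster at greater inclinations, can be illustrated with a short sketch (the sample inclinations are arbitrary):

```python
import math

# The change in aspect ratio (cos i) produced by a fixed 1 deg change in
# inclination grows with inclination from the vertical, since
# |d(cos i)/di| = sin i. Larger signal change implies easier discrimination.
for i in (10, 30, 60, 80):
    delta = math.cos(math.radians(i)) - math.cos(math.radians(i + 1))
    print(f"{i:2d} deg: aspect-ratio change for +1 deg = {delta:.4f}")
```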

26.5.2c Effects of Texture Inhomogeneity

The aspect ratio of the image of a single flat shape indicates the direction and magnitude of inclination of the shape only if the observer knows the true shape. Knowledge of the actual shapes of objects may not be required when there is more than one object. Thus, a difference of aspect ratio between the images of two identical shapes indicates the relative inclination of the shapes, even when the true shapes are not known. Also, an aspect-ratio gradient indicates the direction and magnitude of inclination of a textured surface if it is correctly assumed that there is no actual gradient of aspect ratio over the surface.

Figure 26.48. Patrick Cavanagh. Born in Oakville, Ontario, in 1947. He received a degree in electrical engineering from McGill University in 1968 and a Ph.D. in cognitive psychology from Carnegie-Mellon in 1972. He held an academic position in psychology in the Université de Montréal until 1989, when he became professor of psychology at Harvard. He moved to the Sorbonne in Paris in 2007.


If the texture of the actual surface has a size gradient, density gradient, or aspect-ratio gradient, observers will confound these surface gradients with perspective gradients and will be unable to make accurate estimates of surface slant. Thus, in using texture gradients to judge absolute surface slant, an observer must make allowance for surface texture gradients. There are three ways in which this can be done.

1. Assumption of surface homogeneity. The best observers can do with a stationary surface lacking other depth cues is to assume that the texture is homogeneous and isotropic. This strategy usually works, because most surfaces do not possess texture gradients. Inclination over a given region of a surface is indicated by the inverse transform required to reduce the mean aspect-ratio gradient in that region to zero. The transformed image will be isotropic with respect to aspect-ratio gradients, but the texture elements may be any shape, such as horizontal ellipses or stretched polygons. Anisotropy in the actual texture elements may bias judgments of slant but does not necessarily do so (Rosenholtz and Malik 1997).


2. Use of other depth cues. Other depth cues, such as binocular disparity, could enable observers to distinguish between surface and perspective texture gradients, and to make independent judgments of slant and texture homogeneity (Banks et al. 2005).

3. Rotation of the surface. Changing the inclination of a rigid surface changes the perspective texture gradient but not the surface texture gradient. As a surface rotates into the frontal plane, the perspective texture gradient becomes zero, leaving only the surface gradient. By registering this minimum gradient, an observer will be able to allow for the surface texture gradient when judging the inclination of that surface.

26.6 TEXTURE GRADIENTS ON CURVED SURFACES

26.6.1 DEFINING AND MEASURING 3-D SHAPE

The orientation of a surface at any point can be specified by the orientation of the tangent to the surface at that point (or a unit normal vector to the tangent plane) with respect to two orthogonal axes. The magnitude of local curvature of a surface in a given direction is the rate of change of surface orientation in that direction. The 3-D shape at a local patch on a surface can be specified by Koenderink's shape index, which indicates the directions of the two orthogonal principal curvatures, as described in Section 20.5.1. The shape index provides a scale-independent measure of local 3-D shape.

A local patch on any continuous surface is flat, parabolic, elliptic, or hyperbolic. A flat patch has zero curvature in all directions. A parabolic patch (cylinder) has zero curvature in one direction, as shown in Figure 20.37. An elliptic patch (dome) lies wholly on one side of any plane tangent to the patch. It can be convex (have positive curvature in all directions) or concave (have negative curvature), and it can be closed and enclose a volume. A hyperbolic patch is shaped like a saddle and cuts the tangent plane. It has positive curvature in one direction and negative curvature in another direction, and cannot be closed or enclose a volume (Koenderink and van Doorn 1982) (Portrait Figures 26.49 and 26.50).
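A sketch of the shape-index computation; the formula s = (2/π)·arctan((k1 + k2)/(k1 − k2)) follows Koenderink and van Doorn's definition, though the exact form used in Section 20.5.1 should be checked against the original:

```python
import math

# Shape index from the two principal curvatures (convention: k1 >= k2).
# It classifies the local patch independently of scale.
def shape_index(k1, k2):
    if k1 == k2:
        return 1.0 if k1 > 0 else -1.0  # umbilic (spherical) point
    return (2 / math.pi) * math.atan2(k1 + k2, k1 - k2)

print(shape_index(1.0, 0.0))    #  0.5: convex parabolic patch (cylinder)
print(shape_index(1.0, -1.0))   #  0.0: symmetric saddle (hyperbolic)
print(shape_index(-1.0, -1.0))  # -1.0: concave elliptic patch (cup)
```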

Figure 26.49. Jan J. Koenderink. Born in Stramproy, Holland, in 1943. He obtained a B.Sc. in physics, astronomy, and mathematics in 1964, and a Ph.D. in visual psychophysics with M.A. Bouman in 1972, both from Utrecht University. He was a research fellow of the Dutch Foundation for Advancement of Pure Science from 1967 to 1972 and held an academic appointment at Groningen University from 1972 to 1974. In 1974 he moved to the Department of Physics at Utrecht University, where he became professor of The Physics of Man. He founded and was the first director of the Helmholtz Institute at Utrecht University. He was the first Douglas Holder Fellow at University College, Oxford. In 1990 he was appointed fellow of the Royal Netherlands Academy of Arts and Sciences.


Figure 26.50. Andrea J. van Doorn. Born in Tiel, Holland, in 1948. She obtained a B.Sc. in physics, chemistry, and mathematics in 1967 and a Ph.D. in visual science in 1984, both from the University of Utrecht. Between 1971 and 1997 she held research fellowships at the Dutch Foundation for the Advancement of Science, Groningen University, and Utrecht University. Since 1997 she has held academic appointments in the Department of Physics at Utrecht University and in the Department of Industrial Design Engineering at Delft University.

The following methods have been used to measure perceived 3-D shape:

Absolute depth judgments. Subjects estimate the depth, orientation, and curvedness of a surface at each of several points.

Cross-section reproduction. Estimates of the curvature of 3-D objects, such as cylinders and ellipsoids, are derived by asking subjects to select a 2-D curve that matches the curvature of the object along lines on its surface that project as straight lines. Phillips and Todd (1996) found that subjects performed poorly using this index.

Ordinal depth differences. Subjects judge the depth difference between two dots on the surface in different orientations to each other and at each of several locations. An example is shown in Figure 26.51A. The method is unreliable if subjects do not perceive the dots as lying on the surface. The constant shape of the dots may make them appear off the surface. Reichel et al. (1995) found that Weber fractions for discriminating local curvatures of undulating surfaces were 10 to 100 times larger with this method than for other types of discrimination. They concluded that perception of surface structure from monocular texture gradients is rather poor. However, poor performance may reflect the crudeness of the procedure rather than poor perception of surface structure. Norman et al. (2006) asked subjects to judge the relative slant, orientation, and curvedness of neighboring points on smooth 3-D shapes with shading, texture, and disparity. Thresholds for discriminating differences in slant and orientation were between 4˚ and 10˚. Those for curvedness were much worse and varied with changes in the shape index parameter.



Figure 26.51. Probing the perceived structure of 3-D surfaces. (A) Two small dots are placed on neighboring points on the surface. The subject estimates their depth order. The procedure is repeated for different orientations and positions of the dots. This is one of the surfaces used by Todd and Reichel (1989). (B) Samples of elliptical and cross-shaped depth probes. The probe is placed on a local region of a 3-D surface and its shape is adjusted until it appears tangential to the surface with the line appearing orthogonal to the surface. The procedure is repeated over the surface.

Discrimination thresholds. The sensitivity of observers to differences in 3-D surface orientation can be assessed by asking them to discriminate between two simultaneously or successively presented surfaces. In a related procedure, observers are shown one stimulus at a time and asked to decide whether it is more or less inclined or curved in depth than a standard or than the mean of the series of stimuli. Todd and Norman (1995) used this method to determine thresholds for detecting changes in the inclination of surfaces, with all cues to depth present. They obtained a Weber fraction of 8% for discrimination of a change in the relative orientation of two abutting surfaces. The fraction was 11% for the same task when the surfaces were spatially separated, and it was 26% for discrimination of changes in the orientation of a local patch on a smoothly curved surface.

Use of depth probes. A depth probe is a small object that is applied to each of several points on a test surface. The method depends on how accurately and precisely subjects perceive the depth disposition of the probe. Thus, the probe should be viewed stereoscopically. Gregory (1970) constructed an apparatus, which he called Pandora's box, in which a stereoscopic probe could be superimposed on a monocularly presented display (Portrait Figure 26.52). Stevens (1983) and Koenderink and van Doorn (1992) developed a gauge figure, consisting of a small cross or ellipse with a short line protruding from its center, as shown in Figure 26.51B. The gauge figure is applied to the surface and the subject adjusts its shape until it appears like an orthogonal cross or circle tangential to the surface, with the line appearing normal to the surface. The gauge figure is then applied to other locations on the surface, and a best-fitting smooth surface is derived from the judgments. Koenderink et al. (1996a) obtained satisfactory results with a gauge figure. Gauge-figure measurements have been found to be more precise than depth-difference measurements (Koenderink et al. 1996b). Todd et al. (1996) used a gauge figure to investigate judgments of local curvature on stereoscopic photographs of human torsos viewed monocularly or binocularly with different directions of illumination.

Koenderink et al. (2001) applied the ordinal depth difference method, the cross-section reproduction method, and the depth-probe method to photographs of several smoothly curved objects.

The gauge figure was the most reliable method. But, when effects of essential ambiguities in the pictures were factored out, all methods yielded similar coherent perceived 3-D structures.

For a homogeneous flat surface, each of the three types of texture gradient unambiguously specifies the direction of inclination. The more distant region of an inclined surface projects smaller, denser, and more compressed images of texture elements. However, for a surface curved in depth, only the size-texture gradient unambiguously specifies the direction of curvature. Texture-density and aspect-ratio gradients do not specify the direction of 3-D curvature. For example, the images of a convex cylinder and a concave cylinder contain similar texture-density and aspect-ratio gradients.
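A sketch of this sign ambiguity for a textured cylinder under parallel projection (the geometry is a simplified assumption, not the stimulus of any particular study):

```python
import math

# For a cylinder of unit radius viewed head-on under parallel projection,
# local texture foreshortening depends only on the magnitude of surface
# slant, so convex and concave cylinders give the same aspect-ratio profile.
def aspect_profile(sign, samples=5):
    profile = []
    for k in range(samples):
        x = -0.8 + 1.6 * k / (samples - 1)   # lateral position on the cylinder
        slant = sign * math.asin(x)          # surface slant to the line of sight
        profile.append(round(math.cos(slant), 3))  # local aspect ratio
    return profile

print(aspect_profile(+1) == aspect_profile(-1))  # True: the cue is sign-blind
```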

26.6.2 TEXTURE GRADIENTS ON CYLINDRICAL SURFACES

Cumming et al. (1993) investigated effects of different types of texture perspective on the perceived depth of a horizontal half-cylinder viewed in a stereoscope. Two of their stereograms are shown in Figure 26.53. In the first experiment, texture elements were circles, and element size, density, and aspect ratio were varied independently. The contribution of each texture component to perceived depth was assessed by the disparity required to make the curve of the cylinder appear circular. The aspect-ratio gradient of the texture elements was the only perspective component that contributed significantly to perceived depth, confirming Cutting and Millard's result (1984).


Figure 26.52. Richard L. Gregory. Born in 1923 in London. He graduated from Cambridge University in 1950 and was appointed lecturer in psychology in 1953. In 1967 he founded the Department of Machine Intelligence in Edinburgh with D. Michie and C. Longuet-Higgins. In 1970 he was appointed professor of neuropsychology and director of the Brain and Perception Laboratory at the University of Bristol. He is the founder editor of the journal Perception. He became professor emeritus in 1988. He is a fellow of the Royal Society of London.

Figure 26.53. Compatible and incompatible texture gradients. The upper stereogram shows a cylinder with disparity and a texture gradient; the lower shows a cylinder with disparity but no texture gradient. Divergent fusion creates convex cylinders. Crossed fusion creates concave cylinders. The cylinder appears less curved in depth when there is no texture gradient, as in the lower figure. (Reprinted from Cumming et al. 1993, with permission from Elsevier)


Todd and Akerstrom (1987) found an effect of element size only in combination with changing aspect ratio.

In a second experiment, Cumming et al. used randomly mixed horizontal ellipses as texture elements in 2-D displays. Subjects could not estimate local curvature in depth from the aspect ratio of single elements but could derive an estimate of curvature from the spatial gradient of mean aspect ratio. Perceived depth was little affected by a moderate degree of random variation in the aspect ratio of the elements, as in Figure 26.54A, but fell off severely when the variation was large, as in Figure 26.54B. This could be because of the large variation in the aspect ratio of the ellipses or the anisotropy introduced by elongation of the elements in only one direction. Use of the stimulus shown in Figure 26.54C revealed that texture anisotropy was the main factor. The authors concluded that people accurately perceive the 3-D shapes of isotropic surfaces on the basis of the overall gradient of aspect ratio. Perhaps they neglect changes in texture density because density is not necessarily constant in a frontal surface.

Blake et al. (1993) developed an ideal observer model that predicted the variance of judgments of the depth of textured cylinders based on the density and aspect ratio of texture elements. The model showed that there is more reliable information in the aspect-ratio cue than in the texture-density cue. Judgments by human observers based on aspect ratios alone were often better than predicted by the ideal observer. Also, in a cue-conflict condition, Blake et al. found that the aspect-ratio cue dominated the texture-density cue, in agreement with other investigators.

In the Fourier domain, texture compression corresponds to an expansion of the spatial-frequency spectrum. Sakai and Finkel (1995) suggested that, rather than computing the full frequency spectrum, the visual system uses simpler representations, such as changes in mean frequency or in peak frequency. To examine this question, they investigated the perceived 3-D structure of a set of textured surfaces with specific frequency spectra. In Figure 26.55, the surface texture of the upper figure contains a single spatial frequency along each axis, while that of the lower figure is constructed from white noise and thus has a wide range of spatial frequencies. Both figures generate an impression of a 3-D cylinder, showing that the shape of the spatial-frequency spectrum is not important. In stimuli with strong spatial-frequency peaks, perceived depth curvature was related to changes in peak spatial frequency, as illustrated in Figure 26.56. In stimuli with weak frequency peaks, perceived depth was related to changes in mean spatial frequency, as illustrated in Figure 26.57. Thus, perceived curvature can depend on changes in peak spatial frequency or in mean frequency, according to the type of texture gradient. Sakai and Finkel developed a neural network model of these processes.
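The correspondence between texture compression and spectrum expansion can be illustrated with a short sketch (the texture is an arbitrary sinusoid, not a stimulus from these studies):

```python
import numpy as np

# Compressing a texture by a factor k scales its spatial-frequency spectrum
# up by the same factor: the frequency-domain counterpart of foreshortening.
n = 1024
x = np.arange(n)
texture = np.sin(2 * np.pi * 8 * x / n)           # 8 cycles per image
compressed = np.sin(2 * np.pi * 8 * (2 * x) / n)  # same texture, compressed 2x

def peak_frequency(signal):
    spectrum = np.abs(np.fft.rfft(signal))
    return np.argmax(spectrum[1:]) + 1            # skip the DC component

print(peak_frequency(texture), peak_frequency(compressed))  # 8 -> 16
```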



Figure 26.54. Types of texture gradient. (A) Elliptical texture elements. (B) Elongated texture elements. (C) Randomly oriented texture elements. (Reprinted from Cumming et al. 1993, with permission from Elsevier)

26.6.3 TEXTURE GRADIENTS ON COMPLEX SURFACES

The perception of 3-D surfaces is affected by the nature of the texture elements on the surface (Lappin and Craft 2000). Consider a curved surface formed by bending or curving a flat surface without cutting or stretching it. Such a surface is known as a developable surface. Let the surface have smooth parallel depth undulations. At any place on the surface, the line of maximum curvature is the line with maximum modulation in depth. Any straight line joining two points on a flat surface is the shortest distance between the points. Since a developable surface does not stretch, any straight line remains the shortest distance between the points over the surface after the surface is curved.


Figure 26.55. Gradients with distinct spatial-frequency components. The texture in the upper figure has the minimum number of spatial frequencies; spatial frequency shifts from center to edge of the cylinder. The texture in the lower figure is made from white noise; the spatial-frequency spectrum expands from center to edge of the cylinder. Both cylinders appear in 3-D in spite of their different spatial-frequency components. (Adapted from Sakai and Finkel 1995)

Any such line on a curved surface is a geodesic. A geodesic is a line whose curvature is entirely attributable to the curvature of the surface on which it lies. At a ridge, depth has a local minimum. At a trough, depth has a local maximum. Surface curvature changes sign at each ridge or trough.

Let a developable surface undulating in depth be covered with equally spaced lines parallel to the undulations, so that each line falls on a contour of equal depth. The lines in the image produced by parallel or polar projection will be modulated in density, as in Figure 26.58A. Li and Zaidi (2000) observed that this type of texture gradient does not produce an impression of depth corrugation.

Now let the surface be covered with lines running across ridges and troughs. In polar projection, but not in parallel projection, any line across a ridge or trough projects onto a flat screen as a curved line. The only exceptions are lines that lie in a plane of regard (are coplanar with the point of projection). On the spherical retina, all lines coplanar with the nodal point project onto a retinal meridian.

Figure 26.56. Gradients with large peak frequencies. In the upper figure, the peak spatial frequency (bold lines) changes over the horizontal dimension, but the mean frequency remains constant. In the lower figure, the peak frequency remains constant, but the mean frequency changes. The upper figure creates a stronger impression of depth than does the lower figure. (Adapted from Sakai and Finkel 1995)

This includes all straight lines and all nonstraight lines coplanar with the nodal point. A straight-line image is one that falls on a retinal meridian, and a curved-line or bent-line image is one that does not lie on a meridian. All smooth lines on an undulating surface, except those coplanar with the nodal point, form curved images. Thus, the curvature of geodesic lines in an image produced by polar projection contains information about the 3-D structure of the surface, as in Figure 26.58B. If the texture on a developable surface is homogeneous, any departure from homogeneity in an image formed by polar projection arises only from depth. A slanted surface produces an image with a first-order texture gradient. Curvature in depth produces an image with second-order modulations of texture. These modulations follow parallel geodesics on the surface (see Knill 2001).

Li and Zaidi (2000) obtained impressions of depth corrugations from a 2-D surface only when the surface was viewed in polar projection and contained changes in orientation of line elements cutting across lines of maximum curvature, as in Figure 26.58B. However, the information is ambiguous. For example, when the figure is inspected for some time, the ridges and troughs suddenly reverse. The ambiguity arises because the curvature of the image of a line on a curved surface depends on the following four independent factors:

1. The local curvature in depth of the line of principal curvature that the line crosses.

2. The angle at which the line intersects the line of maximum curvature.

3. The curvature of the line on the surface. All geodesics have zero curvature on the surface. The curvature of a line on a surface corresponds to the curvature of the image produced by projection normal to the line of principal curvature of the surface in the region where the line lies. Image curvature due to perspective is eliminated in this type of projection.

4. The extent to which the local line of maximum curvature departs from being coplanar with the nodal point. Any curved line lying in a plane passing through the nodal point of the eye produces the same image as a straight line. In other words, the curvature of the image of a line on a curved surface can vary with a change in vantage point.

The first three factors define image curvature arising from polar projection. This is perspective curvature. Perspective curvature indicates the 3-D structure of the surface. The fourth factor defines the inclination of ridges and troughs to a line of sight. This is vantage-point curvature.

Li and Zaidi concluded that the perception of 3-D shape from texture gradients is possible only when (1) displays are viewed in polar projection and (2) contour lines cross the lines of principal curvature. However, Todd and Oomes (2002) argued that, although these conclusions apply to the displays used by Li and Zaidi, they do not apply to 3-D shapes in general.





Figure 26.57. Gradients with weak peak frequencies. In the upper figure, the peak spatial frequency (bold lines) changes over the horizontal dimension, but the mean frequency (dotted lines) remains constant. In the lower figure, the peak frequency remains constant, but the mean frequency changes. The upper figure creates a weaker impression of depth than does the lower figure. (Adapted from Sakai and Finkel 1995)





Figure 26.58. Texture gradients and surface modulations. (A) Modulations of line density produce weak depth. (B) Modulations of line curvature produce strong depth. (Reprinted from Li and Zaidi 2000, with permission from Elsevier)


For example, the upper display in Figure 26.59 is the image produced by closely spaced corrugations with sharp peaks and straight sides inclined at almost 90˚. Thus, the corrugations have almost no curvature except at peaks and troughs. A good impression of depth is created by either polar or parallel projection of a cylinder covered with dots that contain no oriented energy across the axis of principal curvature, as shown in Figure 26.60.

Todd et al. (2005) showed that the perceived depth formed by two textured surfaces decreases when the field of view is restricted, as shown in Figure 26.59. They replicated Li and Zaidi's results with displays with a limited field of view but not with displays with a wide field of view. Thus, we must qualify Li and Zaidi's conclusion that the perception of 3-D depth from texture requires polar projection and contours across lines of curvature.

The mere perception of depth does not indicate that the sign of depth has been detected. The perceived depth of images of curved surfaces produced by parallel projection is ambiguous. The surface may be seen as convex or concave. Parallel projection preserves the ambiguous depth cues of aspect ratio and texture density but not the unambiguous cue of perspective (gradient of image size and convergence of parallel lines). Only polar projection preserves linear perspective. The sign and degree of convergence of aligned elements indicate the sign and magnitude of depth. A size gradient in randomly placed texture elements did not allow the detection of depth sign (Zaidi and Li 2002).

Figure 26.59. The effect of field of view on perceived depth. Each figure is an image of two surfaces slanted 65˚ to create a ridge or trough. The upper images have a depicted field of view of 60˚. The lower images have a field of view of 10˚. The textures are scaled so that each image contains the same number of texture elements. Monocular viewing enhances impressions of depth. (Reprinted from Todd et al. 2005 with permission from Elsevier)

Li and Zaidi (2004) extended their investigations to nondevelopable surfaces. There are two basic types of nondevelopable curved surface—surfaces formed by carving a 3-D block and those formed by elastic deformation. There are also two basic types of textures—those formed by continuous contours and those formed by discrete elements. Continuous contours that are both lines of maximum curvature and geodesics provide information about surface curvature.

Stevens and Brookes (1987) found that subjects could accurately match the orientation of a stereoscopic elliptical depth probe to the orientation of the monocularly viewed surface shown in Figure 26.61. They concluded that observers assume that the texture lines are parallel geodesics that lie on lines of principal curvature. These assumptions remove ambiguities arising from factors 2 and 3. Factor 1 (perspective curvature) then indicates that the surface undulates in depth, and factor 4 (vantage point curvature) indicates that the surface is inclined.

Surface curvature may also be indicated by lines that are neither lines of maximum curvature nor geodesics. Todd and Reichel (1990) constructed the surfaces shown in Figure 26.62. These surfaces are not developable, and the lines are not geodesics but rather lines formed by planar sections of the surface. However, as with geodesics, the inflection points of the lines occur where surface curvature changes sign. The texture lines are parallel in Figure 26.62(A) and (B), positioned at random over a 45˚ range in (C), and over a range of 180˚ in (D). In (B), (C), and (D) the spacing and density of the lines was adjusted to remove the cue of contour density present in (A). It is evident that the visual system tolerates at least a 45˚ deviation from parallelism of texture lines but not complete orientation randomness. Surface structure is evident when texture lines are shortened, as in (E), but not when they are as short as in (F).

Figure 26.63 provides examples of surfaces formed by carving a solid object. Todd et al. (2004) asked whether the perceived depth of such surfaces is affected by anisotropy and inhomogeneity of the texture elements (Portrait Figure 26.64). Figure 26.63A is a 2-D image derived from a solid object with homogeneous and isotropic spherical elements, while Figure 26.63B is formed from an object that was stretched horizontally. For both shapes, subjects could indicate the maximum and minimum points of surface curvature very accurately. They also made reasonably accurate estimations of the relative depth of selected points on the surfaces. If subjects had based their estimates on an assumption of texture homogeneity, the points of maximum curvature in (B) would have been perceived where the aspect ratio of texture elements was at a minimum. Todd et al. concluded that people are able to perceive 3-D shape from texture that is both anisotropic and globally inhomogeneous.

Zaidi and Li (2006) produced perceptible 3-D surfaces defined by flow of chromatic textures. This implies the





Figure 26.61. A sinusoidal textured surface. The surface is defined by parallel contours in perspective projection. The ellipses are gauge figures used for judging surface orientation. (Redrawn from Stevens and Brookes 1987)




Figure 26.60. Texture gradients lacking modulated contours. (A) Parallel projection produces only aspect-ratio gradients. (B) Polar projection produces aspect-ratio and size perspective. (Reprinted from Todd and Oomes 2002 with permission from Elsevier)

existence of cells tuned to orientation that respond to chromatic contours.

26.7 REVERSIBLE PERSPECTIVE

26.7.1 REVERSIBLE PERSPECTIVE IN 2-D DISPLAYS

In reversible perspective, a 3-D object or a drawing spontaneously reverses in apparent depth. It is one example of multistable perception, in which a given stimulus may be interpreted in two or more ways. The general principles of multistable perception were discussed in Section 4.5.9. Many Roman and early Islamic mosaics contain simple angular shapes that spontaneously change from appearing concave to appearing convex (Wade 2004).

Scientific investigation of reversible perspective began in 1832 when L. A. Necker, professor of mineralogy in Geneva, described how a drawing of a rhomboid crystal appeared to periodically reverse in depth. The drawing of a cube that reverses in depth, like that in Figure 26.65, is known as the Necker cube. Figure 26.65 also shows Mach's book, the Schröder staircase, and the Thiéry blocks (Schröder 1858).



Figure 26.62. Surfaces defined by contours. (A) The lines are projections of parallel surface contours. (B) Same as (A) with lines of varying width to eliminate contour crowding near occluding edges. (C) Same as (B) with contours within a 45˚ range of orientations. (D) Contours within a 180˚ range of orientations. (E) Same as (B) with shorter contour lines. (F) Surface with very short contour lines. (From Todd and Reichel 1990. Copyright by the American Psychological Association).



Figure 26.64. James T. Todd. Born in Baltimore in 1949. He obtained a B.A. in 1974 and a Ph.D. in 1977, both from the University of Connecticut. He held an academic appointment in psychology at Brandeis University between 1981 and 1992. In 1992 he moved to Ohio State University, where he is now professor of psychology.

Figure 26.63. Depictions of 3-D shapes carved from a solid. (A) A shape from a solid composed of evenly spaced spheres. (B) A shape from a horizontally stretched solid. (From Todd et al. 2004)

They reverse in depth because the two interpretations are projectively equivalent.

Köhler (1940) had proposed that ambiguous figure-ground stimuli, such as Rubin's cross, reverse when the neural processes underlying one interpretation become satiated and the processes underlying the other interpretation recover. For Köhler, the term “satiation” referred to hypothetical electrical events spreading out in the brain. The term “adaptation” is now used to refer to a decline in the activity of specific neural processes when a given visual feature is attended to for some time.

Prolonged exposure to one alternative of a reversible figure increases the probability that the other alternative will be seen in an ambiguous display. For example, prior inspection of a stereogram of the Schröder staircase, in which disparity causes it to be seen as if from above, increases the probability that a binocularly viewed zero-disparity staircase, or a monocularly viewed staircase, is seen as if from below (Virsu 1975; Harris 1980). The effect is illustrated in Figure 26.66.

The Necker cube typically reverses at intervals of between 3 and 5 s. If reversal of perspective depends on

reciprocal interactions between neural processes responsible for two interpretations, one would expect the rate of reversal to increase with continued exposure to a given ambiguous figure. Several investigators have reported an increase of the rate of depth reversal with continued viewing of an ambiguous display (Adams 1954; Brown 1955; Price 1969). The reversal rate is a negatively accelerating function of time.

The increased rate of alternation that occurs when a cube is viewed in one retinal location for some time does not transfer to a cube seen subsequently in a different retinal location (Long and Toppino 1981; von Grünau et al. 1984). However, it does transfer from a cube seen by one eye to a cube seen in the corresponding location in the other eye (Brown 1962; Spitz and Lipman 1962). Thus, local postretinal visual channels serving each interpretation must become simultaneously adapted.

The location specificity of the adaptation process is revealed by the fact that the rate of alternation of a cube is higher when the eyes remain fixated on one position than it is when the gaze moves over the cube or when the cube moves or rotates (Orbach and Zucker 1964; Spitz 1970). Location specificity is also revealed by the fact that neighboring Necker cubes or other reversible-perspective figures can reverse independently (Flügel 1930). However, structural relationships between neighboring cubes may increase the likelihood that they reverse together.
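The reciprocal-adaptation idea can be made concrete with a toy simulation. The sketch below is our own illustration, not a model proposed in the text; the time constants, starting gains, and hysteresis margin are arbitrary assumptions. The gain of the dominant interpretation decays while the suppressed one recovers, and dominance switches when the suppressed gain exceeds the dominant gain by a margin standing in for the competitive advantage of the current percept.

import numpy as np

def simulate_reversals(t_max=120.0, dt=0.001,
                       tau_adapt=10.0, tau_recover=30.0, margin=0.15):
    """Reciprocal adaptation between two interpretations of an ambiguous figure.

    The dominant interpretation's gain decays (adaptation) while the
    suppressed gain recovers toward 1. `margin` is an assumed hysteresis:
    dominance switches only when the suppressed gain exceeds the dominant
    gain by that amount."""
    g = np.array([1.0, 0.5])   # gains of the two interpretations
    dominant = 0
    reversal_times = []
    for step in range(int(t_max / dt)):
        g[dominant] -= g[dominant] / tau_adapt * dt                    # adapt
        g[1 - dominant] += (1.0 - g[1 - dominant]) / tau_recover * dt  # recover
        if g[1 - dominant] - g[dominant] > margin:                     # gains cross
            dominant = 1 - dominant
            reversal_times.append(step * dt)
    return reversal_times

times = simulate_reversals()
print(f"{len(times)} reversals; mean interval {np.mean(np.diff(times)):.1f} s")

With these arbitrary constants the simulated reversals settle at intervals of a few seconds, in the range reported for the Necker cube. Reproducing the gradual speed-up with prolonged viewing would require an additional, slower adaptation component, and the random fluctuations mentioned below would make the intervals stochastic.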





Figure 26.65. Figures showing reversible perspective: the Necker cube, Mach's book, the Thiéry blocks, and the Schröder staircase.

Figure 26.66. Disparity and reversible perspective. Prior inspection of the upper stereogram (A) or the lower stereogram (C) induces the opposite depth in the zero-disparity stimulus in (B). (Redrawn from Virsu 1975)

For example, two nested cubes with the same orientation tend to reverse together and in the same way, while nested cubes with opposite orientations are more likely to reverse independently. Also, cubes linked along one side reverse together more often than do unlinked cubes (Adams and Haire 1958, 1959). These effects are illustrated in Figure 26.67. Coupling of apparent changes has been reported in several ambiguous stimuli, including ambiguous apparent motion (Ramachandran and Anstis 1983) and the kinetic depth effect (Section 28.5.3c). Large cubes tend to reverse at a slower rate than small cubes (Borsellino et al. 1982). It must take longer to adapt a larger stimulus.

The adaptation process is also specific to the particular ambiguous stimulus. Thus the increased rate of alternation with continued viewing does not transfer from a cube to a different ambiguous figure in the same location (Howard 1961). However, it does transfer from a black-on-white cube to a white-on-black cube or vice versa (Cohen 1959; Toppino and Long 1987).

All this evidence suggests that local neural processes responsible for each specific interpretation of an ambiguous figure become progressively adapted, or fatigued, during the time in which that interpretation is dominant. At the same time, the processes responsible for the nondominant competing interpretation gain in strength. This reciprocal activity results in a periodic alternation of interpretations. The strength of the neural processes associated with each




interpretation is also subject to random fluctuations arising from eye movements, shifts of attention, and other factors (Taylor and Aldridge 1974).

There has been some debate about whether eye movements cause reversals of perspective. Necker believed that reversals are due to involuntary changes in eye position. Pheiffer et al. (1956) reported that characteristic eye movements followed rather than preceded perspective reversals of the Schröder staircase. Einhäuser et al. (2004) found a more complex relationship between eye movements and reversals of the Necker cube. The issue is complicated because saccadic eye movements have two effects. They momentarily interrupt clear vision because of image smear, and they change the location of fixation.

To some extent the rate of reversal of the Necker cube is under voluntary control. Voluntary control may be mediated by the direction of gaze. When one of the central corners of a cube is fixated, that corner tends to come forward (Flügel 1930; Ellis and Stark 1978; Kawabata et al. 1978). But subjects have some control over the rate of reversal of the Necker cube even when they fixate a central spot (Toppino 2003). Also, the direction of gaze cannot account for the fact that two cubes presented at the same time may reverse independently.

Adaptation to a given interpretation occurs only when that interpretation is dominant. But does a dominant


interpretation adapt when attention is not directed to it? If so, one would expect a reduction in the rate of alternation of an ambiguous figure when the observer is attending to a distracting task. Reisberg and O'Shaughnessy (1984) found that the rate of perspective reversal of several ambiguous figures was slowed when subjects counted backward by 6.

Reversible perspective may also occur in the projected images of rotating objects. Thus, the image of an outline or transparent rotating 3-D object projected on a frontal screen by parallel projection appears to reverse in perspective and, when it does so, it appears to change its direction of rotation. The change in direction of rotation follows from the simple geometrical relationship between the apparent relative depths and relative directions of motion of the far and near sides of the rotating object. The image of a rotating 3-D object produced by polar projection contains several sources of information about the direction of rotation (see Section 28.5.3a). For example, the images of the nearer parts of the object move more rapidly than the images of the more distant parts (Guastella 1966; Hershberger 1967). Therefore the projected image of a rotating 3-D object produced by polar projection appears to reverse less frequently than does the image of the same

Figure 26.67. Structural factors in reversible perspective. (A) Nested Necker cubes in the same orientations tend to reverse together more frequently than do cubes in different orientations. (Adapted from Adams and Haire 1959) (B) Abutting cubes tend to reverse together more frequently than do well-separated cubes.

object produced by parallel projection. These effects come under the heading of the kinetic depth effect, which is discussed in Section 28.5.

26.7.2 REVERSIBLE PERSPECTIVE OF 3-D OBJECTS

Reports of depth inversion of directly viewed natural objects such as windmills, weather vanes, and flags began to appear after 1750 (see Burmester 1908). For example, Sinsteden (1860) reported that a windmill silhouetted against the evening sky at an angle of about 30˚ from the frontal plane periodically appeared to change its direction of rotation and, simultaneously, its apparent orientation.

Directly viewed 3-D wire frames readily reverse in depth when viewed with one eye. But they may appear to reverse in depth even when viewed with both eyes. Wheatstone (1838) described how a wire cube rotating slowly about a diagonal axis spontaneously reverses in apparent depth, and that each reversal of depth is accompanied by an apparent reversal in the direction of rotation. Flügel (1930) and Hornbostel (1922) reported the same effect. Rotation direction must reverse when perspective reverses because, if the far side of the cube is moving to the right when it is seen in correct perspective, it will appear as the near side moving to the right in reversed perspective. The apparent reversal in rotation provides a useful indication that a reversal of perspective has occurred. The rotating cube appears tapered when seen in reverse perspective. This apparent change in shape is a simple consequence of size-distance scaling (Power and Day 1973).

A rotating 3-D skeletal cube reverses in apparent depth and direction of rotation every few seconds when viewed monocularly. With binocular viewing, the first reversal takes about 2 minutes. After the first reversal, the cube reverses every few seconds and sometimes appears to oscillate rapidly (Howard 1961). As with the 2-D Necker cube, the effects of prolonged viewing do not transfer to a similar rotating cube in a different retinal location (Toppino and Long 1987). There is also a long-term learning effect, because reversal rates increase with repeated viewing over a long period (Long et al. 1983).

If, at the instant of the first reversal of a binocularly viewed skeletal cube, the rotation is physically reversed, it takes about 4 minutes before the cube reverses again (Howard 1961). This shows that the adaptation must be specific to the direction of rotation in depth. The adapted state of the detectors for motion-in-depth has to “unwind” before becoming adapted in the other direction. The effect cannot be due to adaptation of simple motion detectors because at all positions there is as much motion in one direction as in the opposite direction. The integral effect over time is therefore zero.

Another indicator of adaptation of a specific motion-in-depth is that, after a few minutes of binocular inspection





of a skeletal cube in a given direction, a monocularly viewed cube appeared to rotate in the opposite direction. The adaptation effect did not occur when induction and test cubes differed in size. When the induction cube was shown for only a few seconds, the test cube appeared to rotate in the same direction. This is a perseveration or priming effect (Long and Moran 2007). See Sections 12.3.5g, 30.2.1, and 31.6.3 for other examples of long-term adaptation and priming in motion-in-depth.

The Ames window is constructed on the principle of projective equivalence (Ames 1951). A trapezoidal window rotating about its midvertical axis projects the same image as a rectangular window when it is at the correct angle to the line of sight. When the short side of the window moves toward the viewer from the frontal plane, the window approaches this angle. Its apparent shape changes to rectangular, and its apparent orientation about its vertical axis changes accordingly through the operation of size-distance invariance. As the short side recedes from the frontal plane, the window reverts to its trapezoidal shape and true orientation. The net result is that the window changes its apparent shape and direction of rotation twice in every complete rotation (Graham and Gillam 1970).

A rod placed through the window and rotating with it is not subject to the illusory change in direction of rotation. Consequently, when the direction of rotation of the window appears to reverse, the rod keeps going in the same direction and appears to pass through the window or bend round it.

Rotating rectangles, ellipses, or irregular shapes also appear to reverse periodically (Day and Power 1963). However, they do not reverse regularly, twice in each revolution, as does the Ames window. The effects of familiarity on reversible perspective are discussed in Section 30.9, and other aftereffects of observing objects rotating in depth are discussed in Section 31.6.1.
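The projective equivalence that the Ames window exploits can be checked numerically. The sketch below is our own illustration, not taken from the text; the viewing distance, window size, and slant are arbitrary assumptions. It slants a rectangle about its vertical axis and computes the frontal trapezoid that casts the identical polar image.

import numpy as np

def polar_image(x, y, z, f):
    """Polar (perspective) projection with the eye at the origin:
    a point at distance z projects to (f*x/z, f*y/z)."""
    return f * x / z, f * y / z

d = 100.0                 # distance from eye to the window's rotation axis
w, h = 30.0, 20.0         # half-width and half-height of the rectangular window
slant = np.radians(25.0)  # rotation of the rectangle about its vertical axis

# Top corners of the slanted rectangle: the right edge is nearer the eye.
edges = {
    "near (right)": ( w * np.cos(slant), h, d - w * np.sin(slant)),
    "far (left)":   (-w * np.cos(slant), h, d + w * np.sin(slant)),
}

for name, (x, y, z) in edges.items():
    ix, iy = polar_image(x, y, z, f=d)
    # A frontal window at distance d projects at unit scale, so a frontal
    # trapezoid with an edge at x = ix of half-height iy is projectively
    # equivalent to this edge of the slanted rectangle.
    print(f"{name} edge: equivalent frontal position {ix:6.2f}, half-height {iy:5.2f}")

The nearer edge maps to a taller frontal edge and the farther edge to a shorter one; the resulting unequal edge heights are the trapezoid of the Ames window. Because the same image is consistent with both objects, the visual system can settle on either interpretation.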






Figure 26.68. Perspective illusion on an apartment building. The balconies on this apartment building at 200 East Street, New York City, are actually horizontal but trapezoidal in shape. From one side, shown on the left of the figure, the balconies appear to tilt up. Viewed from the other side they appear to tilt down. (From Griffiths and Zaidi 2000. Pion Limited, London)

Figure 26.68 is a photograph of an apartment building in New York. The balconies are actually level and trapezoidal. But when the actual building is viewed from one direction, the balconies appear rectangular and inclined upward. Viewed from the opposite direction, the balconies appear inclined downward. In both cases, the perceived shape and orientation of the balconies are projectively equivalent to their actual orientation and shape. Griffiths and Zaidi (2000) replicated this illusion in the laboratory. From their investigations they concluded that the assumption that balconies are rectangular leads observers to relinquish the otherwise plausible assumption that balconies are level.


27 DEPTH FROM INTERPOSITION AND SHADING

27.1 Rules of object structure
27.2 Interposition
27.2.1 Occlusion within an object
27.2.2 Occlusion of one object by another
27.2.3 Modal completion of subjective contours
27.2.4 Dynamic accretion and deletion
27.3 Shading and shadows
27.3.1 Surfaces and light sources
27.3.2 Shape from shading
27.3.3 Detached shadows
27.3.4 Aerial perspective, brightness, and contrast

27.1 RULES OF OBJECT STRUCTURE


Most natural scenes consist of distinct objects. One of the basic tasks of visual perception is to segregate distinct objects and register the structural relationships between them. The Gestalt psychologists proposed several principles of figural organization, or visual grouping, that determine how sets of stimulus elements are perceived as coherent patterns (see Koffka 1935). These principles are contiguity, continuity, similarity, good figure, and common motion, as described in Section 4.5.10a.

For any 3-D object, the arrangement of edges, corners, and surfaces obeys certain rules, which can be called structural rules of objects. These rules impose constraints on the possible interpretations of retinal images. When adopted by a computer program or by the visual system they help to segregate distinct objects represented in 2-D images.

Roberts (1965) devised an algorithm for interpreting photographs of shapes such as a cube, wedge, and hexagonal prism displayed separately or abutting, overlapping, or in support relationships. The algorithm converts the photograph into a line drawing. Rules of topology are used to find four points in the image that probably correspond to four points in the object. Theoretically, any such set of four noncollinear image points specifies the projective transformation for all points in the image of a surface. If this transformation fits a part of an object, that part is deleted and the process is repeated for the remaining part. If the transformation does not fit, the process is repeated with a second set of candidate points. This algorithm relies on rules of topology, projective geometry, and model forms.

An algorithm called SEE devised by Guzmán (1968) starts with well-formed line drawings of 3-D trihedral

planar polygonal objects. The algorithm first finds lines, edges, and vertices. Local guesses are made about which pairs of regions meet in particular edges and vertices and belong to the same object. A link between regions is called strong or weak according to how much evidence supports it. The links are then integrated by an iterative procedure to produce coherent descriptions of objects. The algorithm relies on heuristic rules of composition of edges, corners, surfaces, and backgrounds rather than on projective geometry or model forms.

Another approach is to start with an exhaustive classification of edges and vertices (Clowes 1971; Huffman 1971). Types of edges that occur in simple line drawings of trihedral polyhedra are shown in Figure 27.1a. Edges with arrows are bounding contours where two surfaces meet but only one can be seen. An edge with a plus sign joins two visible surfaces into a convex shape, and an edge with a minus sign joins two visible surfaces into a concave shape. There are 18 different corners that can be formed by combining the edges, as shown in Figure 27.1b.

When applied to a polyhedron, the labeling must be consistent from one corner to the next. A line segment has only one label. The arrows on bounding contours must be in a consistent direction round an object. If two corners are linked by an edge they must have two surfaces in common, with either both visible or only one visible. A corner with all concave (negative) edges cannot be joined to a corner with all convex (positive) edges. A corner bounding only one surface cannot be joined to one bounding three surfaces, because there is only one surface that two joined corners may not have in common. A consistent labeling is called “legal.” Figure 27.1c shows an example of a legal labeling. Figure 27.1d cannot be labeled legally, and the object is said to be impossible. For example, line AB is concave at one end and convex

Figure 27.1. Edges and corners of images of plane trihedral objects. (a) Types of edge: a convex edge linking two surfaces with only the upper surface visible (arrow), a convex edge linking two visible surfaces (plus sign), and a concave edge linking two visible surfaces (minus sign). (b) Types of corner: L junctions, Y junctions, T junctions, and arrows. (c) A legally labeled image. (d) An image with no legal labeling. (e) The Penrose triangle. (f) Impossible ziggurat (from a figure by Harry Turner in Draper 1978).

at the other. Also, line CD separates two visible surfaces at C but a visible and an occluded surface at D. Other examples of impossible objects are shown in Figures 27.1e and f (Draper 1978). Our ability to recognize these objects as impossible suggests that our perceptual system embodies structural rules for edges, corners, and surfaces.

Mackworth (1973) extended the classification of corners to polyhedra with more than three surfaces meeting at a corner. This increases the number of legal labelings. Thus, a drawing of a tetrahedron allows 33 legal labelings (see Malik 1987). Waltz (1975) extended the analysis to include shadows.
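These labeling schemes translate directly into a constraint-satisfaction program. The sketch below is our own illustration: it uses a deliberately truncated catalog of legal junction labelings (the full Huffman-Clowes catalog has 18 corner types) and ignores the bookkeeping of arrow direction along each edge, so it shows the shape of the computation rather than a complete implementation.

from itertools import product

# Edge labels: '+' convex, '-' concave, '>' occluding (arrow).
LABELS = "+->"

# Illustrative subset of legal junction labelings, keyed by junction type.
# A full implementation would enumerate the complete trihedral catalog.
LEGAL = {
    "L":     {(">", ">"), ("+", ">"), (">", "-")},
    "ARROW": {(">", "+", ">"), ("+", "+", "+"), ("-", "+", "-")},
    "Y":     {("+", "+", "+"), ("-", "-", "-")},
    "T":     {(">", ">", "+"), (">", ">", "-"), (">", ">", ">")},
}

def legal_labelings(junctions, edges):
    """junctions: list of (type, tuple_of_edge_ids); edges: list of edge ids.
    Brute-force search over all assignments; Waltz's algorithm prunes the
    same space by propagating constraints between neighboring junctions."""
    found = []
    for combo in product(LABELS, repeat=len(edges)):
        labeling = dict(zip(edges, combo))
        if all(tuple(labeling[e] for e in js) in LEGAL[jt] for jt, js in junctions):
            found.append(labeling)
    return found  # an empty list marks the drawing as 'impossible'

# Example: a single ARROW junction on edges e1, e2, e3.
print(legal_labelings([("ARROW", ("e1", "e2", "e3"))], ["e1", "e2", "e3"]))

With the full catalog, a drawing such as the Penrose triangle of Figure 27.1e admits no consistent assignment, while the drawing of a real cube typically admits several, which is the residual ambiguity discussed in the text.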



Malik (1987) produced a procedure for labeling both straight and curved edges in line drawings of solid objects, and an algorithm for determining all legal labelings of such drawings. A curved surface projects an edge where lines of sight are tangential to the surface. This is not an edge in the object but an occluding edge defined by the boundary between visible and invisible parts of the surface. When the eye translates, the image of the occluding edge moves over the surface and provides information about the 3-D structure of the surface (Cipolla 1998; Giblin 1998). In all the above procedures there are generally several legal ways to label the drawing of a real polyhedron.


Even when a particular labeling has been applied, there is still ambiguity about the 3-D object. These ambiguities reflect the fact that a 2-D shape can be produced by a set of projectively equivalent 3-D objects. Thus, there is no general set of structural rules that allow one to specify particular 3-D shapes from line drawings (Carlsson 1998). Structural rules allow one to decide whether an object depicted in a drawing could exist. However, a legally labeled drawing may not represent a realizable object unless one is allowed to twist surfaces. The rules also allow one to determine the set of projectively equivalent objects that could be represented by the drawing.

The number of possible interpretations of the image of an object can be reduced if the perceiver adopts the following general assumptions.

1. Generic viewpoint. An accidental viewpoint is one for which a small change in position produces a large change in the image. For example, a thin rod aligned with the visual axis of one eye produces a point image that changes abruptly into a line when the observer moves. In a generic viewpoint, edges and surfaces do not lie along lines of sight. Thus, in a generic view, visible edges are represented by distinct lines, curved edges by curved lines, and visible surfaces by areas. Also, lines that are not connected do not form one line in the image. For any object, there are many more generic viewpoints than accidental viewpoints. Therefore, the most likely object is one that produces a gradual change in the proximal stimulus over a change in viewpoint (Freeman 1994).

2. Structural rules. Lines meeting in a corner in the image meet in a corner in the object. Lines forming T-junctions arise from one object occluding another, with the top of the T arising from the nearer object and the stem arising from an occluded edge. Parallel lines in an image can be assumed to arise from parallel lines in the scene. If an image contains many lines that converge on a common point, they can be assumed to arise from surfaces receding into the distance (Lowe and Binford 1985).

3. Assumption of symmetry. The assumption that the object producing a given image is symmetrical may help, because most natural objects are either radially symmetrical or bilaterally symmetrical.

4. Assumption of convexity. The assumption that the ambiguous image of a surface is convex rather than concave may help, because there are more convex surfaces than concave surfaces.

The literature on algorithms for detecting 3-D objects from projected images has been reviewed by Binford (1981) and Mundy (1998).

27.2 INTERPOSITION

Interposition occurs when all or part of one object occludes part of the same object or another object. Interposition provides information about depth order but not about the magnitude of a depth interval. There are three main effects to consider.

1. Amodal completion. This occurs when a partially occluded object appears complete beyond the object that occludes it.

2. Modal completion. This occurs when an object appears complete when parts of it cannot be seen because they have the identical color and texture as the background.

3. Accretion-deletion. This occurs when a moving object progressively occludes parts of a more distant object on its leading edge and progressively reveals parts on its trailing edge.

These effects will now be discussed.

27.2.1 OCCLUSION WITHIN AN OBJECT

Consider a smooth opaque object lacking surface marks or shadows and viewed from a fixed location against a featureless background, such as the object in Figure 27.2. The locus of points on the surface of the object where visual lines are tangential to the surface but do not enter the surface is the bounding rim of the object. Holes in the object also produce bounding rims. A bounding rim forms a closed curve round a smooth convex isolated object. For more complex objects, the bounding rim is a series of disconnected lines, as in Figure 27.2. A convex contour signifies a convex patch in the surface of the object, and a concave contour indicates a saddle-shaped patch in the surface (Koenderink 1984).

For all objects, the bounding rim projects as a closed bounding contour on the retina, except where one part of the object occludes another part. A bounding contour has polarity, meaning that one side is inside the image of the object and the other side is outside. Bounding contours in the image of a single object with no overlapping parts obey Jordan's theorem, which states that a simple closed line divides a plane into an inside and an outside. Any two points with the same polarity are connected. They are said to be simply connected if there are no holes in the surface. The transit line in Figure 27.2 alternately enters and leaves the image of the surface as it crosses successive bounding contours. Jordan's theorem is implicitly embedded in the figure-ground mechanism of the perceptual system, which allows us to recognize the difference between the inside and outside of a figure.
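Jordan's theorem is also what makes the even-odd rule of computer graphics work: a transit ray from a point crosses a simple closed bounding contour an odd number of times if and only if the point is inside. The following minimal sketch is the standard ray-crossing algorithm, added here as an illustration rather than code from the text.

def inside(point, polygon):
    """Even-odd rule: count boundary crossings of a horizontal ray from `point`.

    polygon: list of (x, y) vertices of a simple closed contour.
    Each crossing of the transit ray flips us between outside and inside,
    which is Jordan's theorem in algorithmic form."""
    x, y = point
    crossings = 0
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:       # crossing lies on the ray, to the right
                crossings += 1
    return crossings % 2 == 1     # odd number of crossings = inside

print(inside((0.5, 0.5), [(0, 0), (1, 0), (1, 1), (0, 1)]))  # True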





Figure 27.2. Image contours and occlusions. A concave contour indicates a parabolic (saddle-point) patch of surface, and a convex contour indicates an elliptical patch. The figure also labels bounding contours, a transit line, a pit, and a hole. Inner contours indicate folds, protrusions, or pits; aligned T-junctions indicate a continuous occluded surface.

We recognize when Jordan's theorem has been violated, as in Figure 27.3A, even though we may not be able to say why (Schuster 1964). For multiple overlapping objects, Jordan's theorem applies only after it has been decided which contours belong to which object.

A closed edge of a three-dimensional object, like the edge of the side of a cube, usually defines a surface. Crossing such an edge takes one from one surface to another. But this is not necessarily true, as is illustrated by the Möbius strip shown in Figure 27.3B. The surprise generated when this strip is cut around the dotted line and yet remains connected indicates that our perceptual system assumes, incorrectly, that Jordan's theorem for closed edges and surfaces holds in three dimensions.

Visual lines that are tangential to the surface of an object but which also cut the surface define inner rims. Inner rims indicate parts of the object that occlude other parts. Inner rims project as inner contours that lack polarity and do not obey Jordan's theorem, because both sides lie within the image of the object's surface. A terminating inner contour indicates a protuberance or fold viewed from one side. A bounding contour that is continuous with an inner contour and forms a T-junction with the adjacent part of the bounding contour indicates that the object is folded over on itself, as shown in Figure 27.2. A T-junction between two inner contours also indicates occlusion of one part of the object by another. Two aligned T-junctions generally indicate a continuous occluded edge of a partially occluded surface. This assumption is so strong that it can lead us to perceive an object that breaks other rules about depth order, as shown in Figure 27.4. We use inner contours and T-junctions to perceptually segregate different parts of a complex object, such as the limbs of a human body (see Beusmans et al. 1987). Figure 27.5 illustrates the use of interposition in art.

Image contours may also arise from surface marks and texture, shadows, or sharp ridges. A discontinuity in a texture gradient indicates a step between surfaces or a steep fold (Section 26.5.1).



27.2.2 OCCLUSION OF ONE OBJECT BY ANOTHER

27.2.2a Occlusion Defined by T-Junctions

The images of overlapping objects usually contain edges meeting in T-junctions, as in Figure 27.6a. The edge that terminates at a T-junction is the stem edge and the edge that continues across the T-junction is the continuous edge. The two stem edges in Figure 27.6a are seen as belonging to the more distant of the two shapes (Helmholtz 1909, Vol. 3, p. 284; Ratoosh 1949). This interpretation allows the far object to be seen as having a continuous edge extending behind the nearer object. This is amodal completion (see Guzmán 1968; Huffman 1971; Grossberg 1997). The rule for amodal completion is that it occurs when the object to which the edge belongs is perceived in the background. There is evidence that cells in the primary visual cortex respond to amodally completed contours (Section 5.6.7b). Amodal completion has been demonstrated in pigeons (Nagasaka and Wasserman 2008) and mice (Kanizsa et al. 1993).

When an object occludes a straight-edged object, pairs of T-junctions on either side of the occluder have collinear stems that are perceived as amodally connected by a straight edge. When the stems are curved they appear connected by a


Figure 27.3. Anomalous objects. (A) An impossible object. The bounding contours break the rule that a surface has a connected edge. (After Schuster 1964) (B) A Möbius strip is a real object with only one surface. It contradicts the assumption that a complete edge defines two surfaces. When cut along the dotted line the strip remains one loop.



Figure 27.4. Perception of aligned T-junctions. (A) The tendency to see aligned T-junctions as belonging to an occluded object can create an impression of an anomalous object. (B) The anomaly is resolved when the object is viewed from another vantage point. (Redrawn from Cochran 1966)

monotonically curved edge. When the corner of an object is occluded, amodal completion is achieved by visually extending the occluded edges until they meet to form a corner. Sometimes amodal completion does not follow either of these rules, but serves to complete the symmetry of an occluded object (van Lier et al. 1995). This must involve higher-order detectors sensitive to the global property of symmetry.

There are at least three reasons why shape B in Figure 27.6a is seen behind shape A.

1. If shape B were seen nearer than shape A, it would be L-shaped with five convex corners and one concave corner rather than a simple rectangle with four convex corners.

2. If shape B were seen nearer it would be asymmetrical rather than symmetrical.

3. If shape B were nearer than shape A it would fit against shape A only if viewed from that particular vantage point, whereas both shapes would remain overlapping rectangles if shape A were seen in front. The generic interpretation is preferred to one that arises from an accidental alignment of contours.

When the edges of two overlapping objects are aligned, as in Figure 27.6b, the stem edges extend into a ground region and therefore are not perceived as occluded and linked up. A stem edge is perceived as extending behind an adjacent surface only if that surface is perceived as part of a

figure. There is still a tendency to see rectangle A as nearer because both objects are then seen as simple rectangles. The T-junctions in Figure 27.6b are also uninformative, but there is a tendency to see the smaller rectangle in front because it is smaller and also because this allows the other shape to be seen as a rectangle.

Helmholtz pointed out that a far object can be made to appear in front of a near object by aligning the edges of a cut-out portion of the far object with a corner of the near object. Ames (1955) created the illusion with playing cards. From a given viewpoint, a cut-out corner of a near card just fits the corner of a more distant card. The smaller image of the far but apparently nearer card makes this card appear smaller than the nearer but apparently more distant card.

In Figure 27.6c the T-junctions provide contradictory information about depth order. In Figure 27.6d the two T-junctions indicate that the rectangle is in front. However, the shapes tend to appear in the same plane so as to preserve the symmetry of the pyramid. In Figure 27.6e there are no well-defined T-junctions, but the star appears in front because both shapes are then symmetrical (Chapanis and McCleary 1953). Figure 27.6f demonstrates a tendency to see an object with a shorter hidden contour as in front of an object with a longer hidden contour (Shipley and Kellman 1992). Thus, information provided by T-junctions may be supplemented or overridden by the tendency to see shapes that are smaller, simpler, more symmetrical, or more meaningful. The tendency to perceptually segregate figure and ground in terms of these features is the Gestalt principle of good figure, or Prägnanz (Bahnsen 1928; Ehrenstein 1930). See Kellman and Shipley (1991) for a review of factors that determine the linkage between distinct pattern elements.

In the absence of T-junctions, the figure-ground organization of a display may be ambiguous. For example, in Rubin's cross, there is a spontaneous alternation in which of two interlaced crosses is seen as figure and which is seen as ground. At any instant, the cross that is seen as figure appears nearer than the cross that appears as ground.

Mather and Smith (2004) presented subjects with a set of textured squares on a computer monitor and asked them to rapidly click on each square in turn from nearest to farthest. Relative depth was indicated by overlap, image blur, image contrast, or combinations of these cues. Subjects made more errors for the overlap cue than for the other cues. They were most accurate and made the initial response most rapidly when all cues were available.

27.2.2b Occlusion from Penetration and Enclosure

T-junctions indicate that one object is in front of another object. They are not formed when an object penetrates another object, as in Figure 27.7A. Nevertheless, the rod





Figure 27.5. The use of overlap in art. A man threshing beside a wagon, by Peter Paul Rubens (1577–1640). Chalk and pen on paper, 25 × 41 cm. J. Paul Getty Museum, Los Angeles.

appears as a complete object with continuous edges (Tse and Albert 1998). In Figure 27.7B the rods appear to be attached to the cube but not to penetrate it. In Figure 27.8A, a black blob appears to be wrapped round a white cylinder. The curvature of the black blob defines the curvature of the cylinder. The impression persists when the ends of the cylinder are omitted, as in Figure 27.8B, although the percept may take some time to develop. Thus, T-junctions are not required for amodal completion.

27.2.2c Occluding Edges and 3-D Shape

Occluding edges of smooth textured surfaces provide various types of depth information (Howard 1983; Todd and Reichel 1989).

1. An occluding edge provides a contour line that indicates the convexity or concavity of the surface at each location along the line.

2. Discontinuity in a texture gradient at an occluding edge indicates that the surface is folded and provides information about the depth of the fold.

3. Once an occluding edge has been identified, the orientation of the surface along the edge is determined by the fact that the surface is tangential to lines of sight at the edge.

4. When the object or viewer moves, depth is indicated by the way the occluding edge sweeps over the surface of the object and by motion parallax between regions separated by the occluding edge.



5. The image of texture on a surface curved in depth becomes compressed near the edges of curvature.

27.2.3 MODAL COMPLETION OF SUBJECTIVE CONTOURS

A stationary object cannot be seen when it has exactly the same texture, luminance, and color as the background in the same depth plane. The object is camouflaged. Sometimes we see an object as complete when only parts of its boundary are camouflaged. For example, in Ehrenstein's figure, shown in Figure 27.9A, white disks appear at the line intersections, and the lines seem to extend behind the disks (see Ehrenstein and Gillam 1999). Figure 27.9B shows one of the figures devised by Kanizsa (1979). We “see” the camouflaged edges of a white triangle even though there is no physical contrast in those regions. This is modal completion. The apparently completed edges of shapes that appear in the foreground are known as subjective contours, cognitive contours, or illusory contours. Modal completion occurs when the object to which the edge belongs is perceived in the foreground. Amodal completion occurs when the object to which the cognitive contour belongs is seen as occluded by a nearer shape.

Schumann (1900) published the first account of subjective contours, although A. von Szily had reported them at a scientific conference in Vienna in 1894 (Szily 1921; Ehrenstein and Gillam 1999). The breaks in the vertical lines of Figure 27.10 create an illusory horizontal bar with illusory T-junctions. When this figure is rotated in depth about a vertical axis at a low speed, the illusory bar appears to rotate in depth with the



In humans, fMRI revealed that neural activity generated by illusory contours occurs first and most strongly in the lateral occipital area, an area containing V3A, V4, V7, and V8 (Mendola et al. 1999; Murray et al. 2002). This suggests that activity in V1 and V2 is due to feedback from these higher centers. Subjective contours have been used in studies of stereoscopic vision (Sections 12.3.3e and 22.2.4). See Kanizsa (1979), Parks (1984), and Spillmann and Dresp (1995) for reviews of this topic.

27.2.4 DYNAMIC ACCRETION AND DELETION


When a near textured surface moves over a far textured surface, as in Figure 27.11A, texture elements in the far surface are deleted along the leading edge of the near surface and emerge along the lagging edge. The texture of the near surface remains unchanged. This is the accretion-deletion cue to depth order (Gibson et al. 1969; Kaplan 1969). An impression of depth order is still created when there are no texture elements in the neighborhood of the edge of the near surface, as in Figure 27.11B (Yonas et al. 1987; Craton and Yonas 1990). Under this condition, an impression of depth order is created when the edge of a blank region moves with respect to an adjacent textured region or when the edge and texture of a region move coherently with respect to a blank region with static contours, as illustrated in Figure 27.11B. This is still a form of deletion. Even though texture elements are not deleted, the near surface, defined by its moving edge, increases or decreases the blank region of the far surface defined by its stationary edges. The pattern of accretion-deletion produced by motion of a random-dot display with respect to a nearer blank

Figure 27.6. Perception of depth order from occlusion. (a) A appears nearer than B because of T-junctions and because A and B then appear as simple rectangles. (b) The T-junctions are not informative. The figure appears in one plane, or A appears in front of B because it is smaller. (c) The T-junctions produce conflicting information. (d) The T-junctions indicate that A is nearer, unless both figures appear symmetrical. (e) There are no T-junctions, but A appears nearer because the figures then appear symmetrical. (f) A region with short hidden contours appears nearer than a region with long hidden contours. (Figures (d) and (e) adapted from Chapanis and McCleary 1953; Figure (f) from Shipley and Kellman 1992)

vertical lines. At a high velocity, the bar remains in a frontal plane unless it has a bar at each end (Kojo and Rovamo 1999). Subjective contours, like real contours, show illusory distortions, masking, perspective reversal, apparent movement, and figural aftereffects (see Bradley 1982). Subjective contours and contours defined by texture boundaries evoke responses in cells in V2 of alert monkeys, similar to those evoked by real contours (von der Heydt et al. 1984; Bakin et al. 2000). Some cells in V1 of the monkey also respond to subjective contours (Grosof et al. 1993).



Figure 27.7. Amodal completion through object penetration. The rods on the left appear to penetrate the cube, and their edges appear continuous through amodal completion. There are no T-junctions, only Y and L junctions for the square rod and a change in curvature for the round rod. The edges at the point of penetration are the edges of holes in the cube. The rods on the right appear to protrude from the cube. The edges at the points of contact specify the ends of the rods. (Adapted from Tse and Albert 1998)







Figure 27.8. Contour completion for object wrap around. A black blob appears to wrap round a vertical white cylinder even when the ends of the cylinder are not visible, as on the right. (Redrawn from Tse and Albert 1998)

shape, as shown in Figure 27.12, reveals the shape of the blank region (Andersen and Cortese 1989). As one would expect, shape identification improved with increasing dot density and increasing speed of background motion. Thompson et al. (1985) developed an algorithm for detecting boundaries defined by accretion-deletion. The rate of deletion of evenly spaced texture elements is proportional to the density of texture elements, the relative linear speed of the two surfaces, and the distance between them (a minimal version of this proportionality is sketched below).

Even without motion, the relative distances of surfaces can be derived from perspective if the densities of their surface textures are known. Also, the relative distances in depth of two stationary surfaces with the same unknown texture density can be derived if the viewer perceives the textures as the same. In both cases, accretion and deletion provide quantitative information about relative depth only if the relative speed of motion of the surfaces is known. Motion speed will be known if motion is self-produced, as when the observer holds the moving surface or when parallax is caused by movement of the observer's head. Even with known speed of motion, motion between two surfaces with dissimilar and unknown surface textures indicates only which surface is in front.

Accretion-deletion indicates which surface is in front when there are no other depth cues. Hegdé et al. (2004) found that the depth order of two surfaces is readily perceived when the motion between them is defined by second-order stimuli, rather than by luminance. Their moving displays may be viewed on the Journal of Vision Web site.
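The proportionality stated above can be written directly as a function. The following one-liner is our paraphrase of it, not code from Thompson et al.; the constant k, which absorbs viewing geometry, and the example values are assumptions.

def deletion_rate(density, relative_speed, depth_gap, k=1.0):
    """Texture elements deleted per unit time at the occluding edge.

    Per the proportionality stated in the text, the rate grows with the
    density of texture elements on the far surface, the relative linear
    speed of the two surfaces, and the depth separation between them."""
    return k * density * relative_speed * depth_gap

# Example: doubling the depth gap doubles the rate at which elements vanish.
print(deletion_rate(density=8.0, relative_speed=2.0, depth_gap=1.0))
print(deletion_rate(density=8.0, relative_speed=2.0, depth_gap=2.0))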



The cues of accretion-deletion and relative motion can be made to conflict. Ono et al. (1988) used two random-dot displays with a vertical boundary. Relative depth was simulated by relative motion of the two displays and by accretion or deletion of dots along the vertical boundary. Both changes were synchronized with sideways motion of the observer's head. Depth order from relative motion depends on the direction of relative motion of the two displays in relation to the direction of head motion, while depth order from accretion-deletion depends on which surface loses or gains dots. Accretion-deletion was made to indicate one depth order and relative motion to indicate the opposite depth order. For small simulated depth differences, motion parallax was the dominant cue, but for large simulated depth differences, accretion-deletion was the dominant cue. Dot density was not varied, and this may be an important factor because accretion-deletion would become more apparent with higher dot density. When the same displays were presented without head movements, relative motion was ambiguous, but accretion-deletion remained unambiguous. Under this condition, perceived depth order was significantly related to accretion-deletion.

Kitazaki and Shimojo (1998) created a CRT display that gave the impression of two 8˚-wide vertical bands of white random dots on a black background moving horizontally over a surface of white dots moving in the opposite direction, as shown in Figure 27.13. The depth of the bands relative to the surface was determined by accretion-deletion of surface dots. The motion of the bands was not yoked to the lateral motion of the subject's head. However, when the subject swayed in one direction the relative motion of the bands indicated a depth order that agreed with that specified by accretion-deletion of dots. In this case, the relative motion of the bands was seen as arising from self-motion, rather than from real opposite motion of the bands. Sway in the other direction reversed the depth order specified by relative motion of the bands but not that specified by accretion-deletion. Most subjects continued to see the same depth order for both directions of sway. The anomalous relative motion was interpreted as a real relative motion of the bands.

Depth order was then rendered ambiguous by removing accretion-deletion in superimposed transparent random-dot displays moving in opposite directions. Perceived depth order was still preserved as the subject swayed laterally. As before, the displays appeared stationary in one direction of sway and appeared to move in opposite directions in the other direction of sway. One might have expected that reversing body sway would reverse the depth order and preserve the percept of stationary displays. This is what happened when, instead of simulating depth between two distinct surfaces, subjects saw continuous motion parallax simulating a single surface with horizontal sinusoidal ridges. Now depth order reversed as body sway reversed and the rigidity of the corrugated surface was preserved. In the real world, surfaces in different depth planes are more likely to start moving than suddenly reverse their depth order. On the other hand, a single surface is more likely to change its disposition in depth than suddenly become nonrigid.

Accretion-deletion can dominate conflicting information from disparity. Thus, accretion-deletion caused a vertical strip of laterally moving dots to appear beyond a


Figure 27.10. Illusory T-junctions. The aligned vertical lines create an illusory horizontal bar with illusory T-junctions. When the line display was rotated slowly about a vertical axis, the illusory bar appeared to rotate with the lines. When the display rotated rapidly, the bar remained in a frontal plane. The presence of short vertical lines at the ends of the illusory bar, as in this figure, caused the illusory bar to always appear to rotate with the grating. (After Kojo and Rovamo 1999)


Figure 27.9. Cognitive contours. (A) Ehrenstein's figure. The gaps appear as white disks by modal completion. The lines appear continuous by amodal completion. (B) Kanizsa triangle. A triangle is seen even though most of its sides are camouflaged. (After Kanizsa 1979)

surrounding display of stationary dots, even when disparity signified that the strip was nearer than the surround (Royden et al. 1988).

Dynamic occlusion can resolve the ambiguity of depth order inherent in the projected images of 3-D objects. For example, the surface of an opaque textured sphere projected by parallel projection onto a flat screen can be seen as convex or concave. If the sphere rotates, it can be seen as a convex object rotating in one direction or a concave object rotating in the opposite direction. The images of large texture elements on a stationary sphere are partially occluded at the edge of the sphere, as illustrated in Figure 27.14A. As the sphere rotates, large texture elements progressively disappear along one edge and reappear along the opposite edge. This is compatible only with a convex object. Thus, as the size of texture elements was increased, subjects made more consistent judgments about the direction of rotation of the sphere. Occlusion of far by near large texture elements, as in Figure 27.14B, also helped to resolve the ambiguity in the image of texture elements on a rotating transparent sphere (Braunstein et al. 1982). The image of a rotating transparent sphere covered with clusters of small dots, as in Figure 27.14C, remained ambiguous, presumably because dot clusters do not produce T-junctions that indicate occlusion (Andersen and Braunstein 1983). Dynamic occlusion can also reduce depth ambiguity in moving patterns of light such as those produced by attaching light sources to a person walking in the dark (Proffitt et al. 1984).

Dynamic occlusion enhances a subjective contour (Bradley and Lee 1982). For example, a subjective triangle appears when a black occluding shape is rotated with respect to four white disks, as indicated in Figure 27.15. A subjective triangle is not evident in any static frame, but it becomes evident in the dynamic sequence (Kellman and Cohen 1984). Two superimposed random-dot displays moving in opposite directions produce an impression of two surfaces separated in depth, one seen through the other (Farber and McConkie 1979). The perceived depth order of the displays alternates spontaneously. The same percept is produced by relative motion of two superimposed regular dot lattices with the dots in alternating rows so that the dots do not intersect (Mace and Shaw 1974). The percept is weakened when the rows of dots are aligned. In this case, all the dots intersect at the same time and the relative motion signal is periodically interrupted. However, the perception of relative depth is improved when oppositely moving distinct shapes, rather than dots, intersect. Wong and Weisstein (1985) reported that a steady random-dot display appears closer than a superimposed random-dot display flickered at about 7 Hz. They suggested that the effect is due to a general tendency to see steady regions as figures and flickering regions as ground.

27.3 SHADING AND SHADOWS

27.3.1 SURFACES AND LIGHT SOURCES

27.3.1a Factors Affecting Shading and Shadows

The light falling on each point of an image formed by a specified optical system depends on the nature and disposition of objects in the visual scene, the nature and disposition of sources of light, and the transparency of the medium.





The albedo, or whiteness of a surface, is the ratio of reflected light to incident light. The albedo of a perfectly white surface is 1 and that of a black surface is 0. The simplest type of reflectance is Lambertian. A matte, or Lambertian, surface reflects light equally in all directions, like a sheet of white paper. A beam of light becomes spread over a larger area of a surface as the surface becomes inclined from being normal to the beam (Figure 27.16A). Thus, for an incident beam of intensity I_i, the light reflected from a small point on a Lambertian surface, I_r, is proportional to the albedo, s, of the surface and to the cosine of the angle of incidence, θ, between the surface normal and the incident light:

I_r = sI_i cos(θ), or sI_i(N·L)

The term N·L is the vector inner product of the unit vector in the direction of the light source and the unit vector normal to the surface at that point. This product is equivalent to the cosine of the angle of incidence. Another term can be added to the equation to represent the intensity of ambient light falling on the surface from all directions, I_a. A third term can represent the contribution of specular reflection. A shiny, or specular, surface reflects light more in one direction than in other directions, as illustrated in Figure 27.16B. The preferred direction is that for which the angle of reflection equals the angle of incidence.

Figure 27.11. Stimuli used by Yonas et al. (1987a). (A) The two random-dot arrays moved back and forth in counterphase to create alternating accretion and deletion. (B) The dots were removed from the region of overlap so that there was no accretion-deletion of dots. However, the edge of the nearer surface moved over the blank region of the other surface, exposing more of it for one direction of motion and less of it for the other direction.

The albedo, or whiteness of a surface, is the ratio of reflected light to incident light. The albedo of a perfectly white surface is 1 and that of a black surface is 0. The simplest type of reflectance is Lambertian. A matte, or Lambertian surface reflects light equally in all directions, like a sheet of white paper. A beam of light becomes spread over a larger area of a surface as the surface becomes inclined from being normal to the beam (Figure 27.16A). Thus, for an incident beam of intensity Ii, the light reflected from a small point on a Lambertian surface, Ir, is proportional to the albedo, s, of the surface and to the cosine of the angle, q , between the surface and the incident light. I r sI i (

) or sI i (N.L )

The term N.L is the vector inner product of the unit vector in the direction of the light source and the unit vector normal to the surface at that point. This product is equivalent to the cosine of the angle of incidence. Another term can be added to the equation to represent the intensity of ambient light falling on the surface from all directions, Ia. A third term can represent the contribution of specular reflection. A shiny, or specular surface reflects light more in one direction than in other directions, as illustrated in Figure 27.16B. The preferred direction is that for which the angle of reflection equals the angle of incidence. 72
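To make the reflectance model concrete, here is a minimal numerical sketch of the Phong equation above. The function follows the symbols used in the text; the particular parameter values (albedo, specular fraction, exponent) are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

def phong_intensity(N, L, V, s=0.8, g=0.3, n=20, I_a=0.2, I_i=1.0):
    """Reflected intensity at a surface point under the Phong model.
    N: unit surface normal; L: unit vector toward the light;
    V: unit vector toward the viewer; s: albedo; g: specular fraction;
    n: highlight-sharpness exponent; I_a, I_i: ambient and incident
    intensities."""
    H = (L + V) / np.linalg.norm(L + V)      # bisects light and view directions
    lambertian = max(np.dot(N, L), 0.0)      # cosine of the angle of incidence
    specular = max(np.dot(H, N), 0.0) ** n   # peaks where the reflection meets the eye
    return s * I_a + s * I_i * lambertian + g * I_i * specular

# A surface tilted 45 degrees away from a light placed along the viewing axis:
N = np.array([0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)])
L = V = np.array([0.0, 0.0, 1.0])
print(phong_intensity(N, L, V))   # ambient + Lambertian + a weak highlight
```

Setting g to 0 reduces the model to the matte case of ambient plus Lambertian reflection described above.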



Surface orientation  The angle of incidence of light depends on the orientation of the surface and the location of the light source. A smoothly curved matte surface produces gradual shading. A sharp discontinuity in the orientation of part of the surface creates shading with a sharp edge. Thus, variations in shading provide information about the 3-D shape of an object.

Attached shadows  Attached shadows are formed on parts of an object facing away from the light source.

Cast shadows  A cast shadow is due to obstruction of light by an opaque or semiopaque object. It is an extrinsic shadow when the obstruction is another object and an intrinsic shadow when the obstruction is another part of the same object.

Specularity  The positions of highlights on a shiny surface vary with the orientation of the surface and the direction of illumination. The shape of a highlight provides information about surface curvature, because a highlight extends further along the axis of least curvature than along the axis of maximum curvature, as illustrated in Figure 27.17A. Also, the shapes of specular reflections remain surprisingly stable when the world reflected in the surface changes (Fleming et al. 2004). When a 3-D surface moves, highlights cling to regions of high curvature. Movement displaces highlights most in regions of low curvature, and the direction of displacement is determined by the sign of curvature (Koenderink and van Doorn 1980; Blake and Bülthoff 1990). A given highlight appears in different locations to the two eyes, and the resulting binocular disparity creates an impression of depth.

The light source  Light falling on each point of a surface has a single direction if it comes from a single point, or if the light rays from an extended source are parallel (collimated). For such light sources an opaque object creates a black shadow with sharp edges. When the light source is extended and not collimated, light arrives at each point of a surface from more than one direction. The shadow of a sharp edge produced by a small light source consists of a dark umbra with a sharp edge and a graded penumbra. Light from a large diffuse source, scattered light, or light reflected from other objects casts smoothly graded shadows without sharp borders. Light coming equally from all directions forms a spherical Ganzfeld and produces the most diffuse lighting and the most smoothly graded shadows. Multiple distinct light sources produce multiple shadows of single objects. A contour produced by a sudden change in the lightness of a surface may be indistinguishable from the edge of a shadow. More complications arise when the scene contains semitransparent surfaces.

The shape of the shadow caster and surface  A cast shadow provides information about the dimensions of the object that casts it if the observer knows the direction of illumination and the nature and orientation of the surface on which the shadow is cast. Conversely, if the shape of the object casting the shadow is known, a shadow can provide information about the 3-D structure of the surface on which it is cast. The distance between a cast shadow and the object that casts it provides information about the distance between object and surface (see Section 27.3.3).

Figure 27.12. Detection of shape by accretion/deletion. Subjects could identify the shapes of blank areas (indicated here by dotted lines) by the pattern of accretion and deletion of dots as the dotted background moved laterally with respect to stationary blank areas. (Redrawn from Anderson and Cortese 1989)

27.3.1b Images from Scenes and Scenes from Images

Artists gained a practical and operational understanding of shading and shadows long before scientists conducted experiments or developed theories. They became expert at using shading to construct 2-D displays corresponding to 3-D scenes, taking all the above factors into account.

Figure 27.13. Depth from accretion/deletion. The impression of depth was created by accretion/deletion of dots by the two vertical bands. The arrows indicate the direction of motion of the bands and of the dots in the background. The edges of the bands were not edged in white. A fixation dot was placed at the center of the display. (Redrawn from Kitazaki and Shimojo 1998)

Designers of computer-generated displays are adept at calculating the images produced by specified scenes (Blinn 1977; Moon and Spencer 1981; Foley et al. 1990). An example is shown in Figure 27.18. For any effective algorithm for calculating the image formed by a given scene, there is only one solution, because a defined optical system in a fixed position produces only one image from a specified scene. The simpler the light sources and surfaces, the easier it is to calculate the light distribution in an image. For both simple and complex scenes, the image can be calculated point-by-point, or locally. When computing images produced by specified scenes there is no need to consider global factors, defined by relationships within the image.

Any system that works in reverse has to decide which scene produced a given image. There are many possible solutions, because the reverse mapping is one-to-many. Formal algorithms have been developed to achieve the inverse mapping from point-by-point analysis. These algorithms are based on the idea of inverse optics, in which the structure of the scene is inferred from local features of the image. The mathematics is simplified by considering only simple lighting with simple objects. However, this increases ambiguity. With complex scenes, ambiguity with respect to one feature of the scene can be resolved by referring to some other feature. To take advantage of the increased information in more complex scenes, one must consider relationships within the image—one must operate globally, not locally. One approach is to resolve ambiguities by adopting simplifying assumptions, such as the assumption that all surfaces are Lambertian and homogeneous, and that there are no secondary reflections or multiple light sources.


In the pioneering algorithm developed by Horn (1975), surfaces were assumed to be smooth and illuminated by one light source, and the reflecting properties of surfaces were specified.

Pentland (1982) developed an algorithm for finding the direction of illumination in photographs of natural scenes. This is a first step for any derivation of shape from local shading. The algorithm used a maximum-likelihood procedure based on the mean change in image intensity along each of several directions within each image region. The use of the first spatial derivative of image intensity makes the algorithm compatible with biological vision. Surfaces were assumed to be relatively homogeneous and Lambertian. At any point on the occluding boundary of a solid object the line of sight is tangential to the surface and therefore orthogonal to the surface normal. By integrating the irradiance along an occluding boundary, Pentland obtained a good estimate of the direction of illumination. The performance of the algorithm correlated with that of human observers of the same stimulus. However, the algorithm did not allow for the effects of highlights, indirect illumination, transparencies, or cast shadows.

Algorithms for deriving shape from shading that take some account of edge information are described in Ikeuchi and Horn (1981) and Hayakawa et al. (1994). Langer and Zucker (1994) developed a method for computing a depth map from the image of a Lambertian surface illuminated by diffuse light. Lehky and Sejnowski (1990) developed a learning algorithm to construct a neural network that computes the principal curvatures and orientation of paraboloids. The receptive fields of units in the model resemble those of cells in the visual cortex.

The role of shading gradients and specularities in binocular stereopsis is discussed in Section 17.1.6.
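The logic of this first step can be illustrated with a simplified sketch. This is not Pentland's published algorithm, but it captures its core: for a roughly Lambertian image, the mean directional derivative of intensity along direction α varies as k(Lx cos α + Ly sin α), so a least-squares fit over several directions recovers the tilt of the light source. The synthetic sphere image and the interior mask (which stands in for Pentland's separate treatment of occluding boundaries) are assumptions made for the example.

```python
import numpy as np

# Synthetic image of a Lambertian sphere lit from the upper left.
n = 128
x, y = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))
z = np.sqrt(np.clip(1.0 - x**2 - y**2, 0.0, None))
L = np.array([-0.5, 0.5, np.sqrt(0.5)])   # unit light vector; true tilt = 135 deg
image = np.clip(x * L[0] + y * L[1] + z * L[2], 0.0, None)

# Mean directional derivative along direction a varies as k(Lx cos a + Ly sin a),
# so a least-squares fit over several directions recovers the light's tilt.
# The mask excludes the occluding boundary, where gradients diverge.
gy, gx = np.gradient(image)
mask = x**2 + y**2 < 0.8
angles = np.linspace(0.0, 2 * np.pi, 12, endpoint=False)
d = np.array([np.mean((gx * np.cos(a) + gy * np.sin(a))[mask]) for a in angles])
A = np.column_stack([np.cos(angles), np.sin(angles)])
kLx, kLy = np.linalg.lstsq(A, d, rcond=None)[0]
print(f"estimated tilt: {np.degrees(np.arctan2(kLy, kLx)):.0f} deg (true: 135 deg)")
```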

27.3.2 SHAPE FROM SHADING

27.3.2a Shading and 3-D Structure

The correct identification of a 3-D curved surface from shading requires assumptions about the direction and uniformity of illumination and about the reflectance of the surface. Shading alone is ambiguous as a depth cue. See Belhumeur et al. (1999) for a theoretical discussion of the ambiguity of the shading-defined structure of 3-D objects. Nevertheless, shading can be a rich source of information about the structure of 3-D objects. In the technique of chiaroscuro, artists use continuous gradients of light and shade to convey volume and surface structure, as illustrated in Figure 27.19. Knill et al. (1997) have provided an analysis of the geometry of shading on 3-D surfaces.

Figure 27.14. Occlusion and resolution of depth ambiguity. (A) Edge occlusion of large texture elements resolves depth ambiguity in the parallel projection of a rotating opaque sphere. (B) Dynamic occlusion of far by near texture elements resolves depth ambiguity in the image of a rotating transparent sphere. (C) Occlusion of far by near dot clusters does not resolve depth ambiguity because the dot clusters lose their separate identity. (From Andersen and Braunstein 1983)

In natural scenes, shadowed and nonshadowed regions arising from depth modulations of a surface differ mainly in luminance. Changes in color, or in color and luminance, usually arise from changes in surface color or reflectance rather than depth. The perceived inclination of a surface produced by a texture gradient increased when a light-near to dark-far luminance gradient was added to the texture. Addition of a light-far to dark-near luminance gradient reduced perceived inclination. Also, adding a luminance gradient to a homogeneously textured pattern produced an impression of inclination. Addition of an isoluminant red-to-green gradient had no effect on perceived depth (Troscianko et al. 1991). Cavanagh (1987) had also reported that depth is not created by chromatic modulations of isoluminant stimuli.

Although pure modulations in color do not produce depth, they can promote depth impressions arising from modulations in luminance. Kingdom (2003) used the stimuli shown in Figure 27.20 to illustrate this point. The simple luminance modulation in (a) creates only a weak impression of depth, and the modulation of color in (b) does not create depth. But when the two orthogonal gratings are superimposed, as in (c), the left-oblique luminance grating creates a strong depth modulation. Superimposed aligned luminance and chromatic gratings, as in (d), do not create depth. Nor do orthogonal luminance gratings or orthogonal chromatic gratings, as in (e) and (f). Kingdom concluded that color variations orthogonal to luminance modulations promote impressions of depth from luminance modulations. Changes in color help the visual system to differentiate between luminance variations due to shadows and those due to changes in surface reflectance.


Figure 27.15. Dynamic Kanizsa triangle. A Kanizsa triangle is not evident in each static frame but becomes visible in the dynamic sequence. (Adapted from Kellman and Cohen 1984)

There have been several attempts to measure the accuracy and precision of depth judgments based on shading. Todd and Mingolla (1983) created computer-generated displays representing untextured vertical cylinders with various degrees of curvature illuminated by a narrow vertical light source. For each cylinder, subjects judged surface curvature by selecting one of five curved lines of variable curvature. Subjects also judged the direction of illumination. The depth of the cylinders was underestimated when the shading was Lambertian (matte). Perceived curvature was closer to theoretical values when specularities were added to the surfaces. However, specularities did not affect judgments of the direction of illumination. Performance was not improved by the addition of a texture gradient to the cylinders.

Bülthoff and Mallot (1990) asked subjects to vary the shape of an ellipsoid, defined by shading and/or texture, to match the shape of an ellipsoid defined by disparity. Curvature defined by shading or texture alone was underestimated relative to that defined by disparity. The underestimation was less for a surface defined by both shading and texture and when the shaded surface was shiny.

Mamassian and Kersten (1996) used a computer-generated image representing a croissant-shaped object that contained elliptical, parabolic, and hyperbolic surface components. The surface was Lambertian and illuminated by a single light source. Subjects set a gauge figure on a sphere to match the orientation of points on the test figure. The slant of the test object was underestimated by as much as 30˚. The error was similar when shading was removed from the surface. It was concluded that subjects used the contour where a near part of the surface occluded a far part. This result may have arisen because the object was always viewed in the same orientation.

Shading produced by collimated light has sharper gradients than shading produced by diffuse light. Langer and Bülthoff (2000) asked subjects to judge the depth of single points and the relative depths of neighboring points on diffusely illuminated surfaces like that shown in Figure 27.21. In general, the darker the region on which the dot was placed, the deeper the region appeared. Subjects seemed to be using a simple darker-deeper strategy. However, they discriminated differences in depth better than they discriminated differences in brightness. The darker-deeper strategy breaks down for points near the floor of a valley because the floor, being normal to the light, reflects more light than the sides of the valley (Figure 27.21). Valley floors reflect less light than hilltops because fewer light rays reach the floor. This could allow subjects to resolve the inherent ambiguity of depth from shading. The darker-deeper strategy is applicable only to a valley of fixed and finite width. Narrow valleys, like creases in a surface, produce dark shading even though they may not be deep.

In the above experiments, curvature discrimination thresholds based on shading were higher than those based on disparity. Johnston and Passmore (1994) suggested that this was because the test procedures were not optimal. In their own experiments, subjects compared the curvature and inclination of a test patch superimposed on a computer-generated display representing a shading-defined sphere of variable curvature. The mean Weber fraction was 0.11, which is comparable to that obtained by Johnston (1991) for cylinders defined by disparity. Discrimination was best when the light source was oblique, although this effect was less evident when the surface was shiny or when texture cues to curvature were present (Curran and Johnston 1996a).

The perception of surface curvature from shading can be influenced by global factors. For example, a shading-defined spherical surface appeared more curved when adjacent to a less curved spherical surface than to a more curved surface, as shown in Figure 21.26 (Curran and Johnston 1996b). The same curvature-contrast effect occurred with real adjacent curved surfaces (Johnston and Curran 1996). Koenderink and van Doorn (1980) provided another example of a global factor. They discussed how 3-D shape modulations constrain the distribution of light over a surface. Contours of equal luminance (isophotes) obey certain rules of topology and indicate the positions of hills, ridges, and valleys.

The effects of perceived depth on the perception of surface whiteness were discussed in Section 22.4.1. Interactions between shading and stereo cues in depth perception are discussed in Section 30.6.

Figure 27.16. Properties of reflecting surfaces. (A) Collimated light falling on unit area of a normal surface spreads over an area of 1/cos θ on a surface inclined at an angle of θ to the direction of the light. For a Lambertian surface, the irradiance of each point on the surface decreases in proportion to cos θ as the surface is inclined. (B) Lambertian, specular, and mirror reflections of an incident beam of light.

Figure 27.17. 3-D shape from highlights. Subjects were better able to discriminate changes in the 3-D shape of a surface in the presence of highlights, as in (A), than with diffuse lighting, as in (B), or with texture gradients, as in (C). (From Norman et al. 2004)

27.3.2b Detecting Concavity and Convexity: Stimulus Factors

On a matte surface, a smooth protuberance illuminated from one side produces the same retinal image as an indentation illuminated from the other side, as in Figure 27.22. In the absence of information about the direction of illumination, shading does not allow one to distinguish between convexities and concavities. The following stimulus features can help to resolve the ambiguity of shading:



Other cues to depth  These are disparity, including disparity in the images of shading and shadows (Puerta et al. 1989), motion parallax, and texture gradients (Georgeson 1979; van den Enden and Spekreijse 1989; Christou and Koenderink 1997).

Presence of other objects  Ambiguity of shading may be reduced or resolved by adding objects with unambiguous depth (Berbaum et al. 1984).

Occluding edges  When a hollow in a surface is viewed obliquely, its near edge is an occluding edge with a perspective-gradient discontinuity, and the far edge is the boundary between two perspective gradients with different slopes (see Figure 27.23). When a hill on a surface is viewed obliquely, the far edge is the occluding edge and the near edge is a gradient boundary (Koenderink and van Doorn 1982; Koenderink 1984). Several textbooks have used a picture like that in Figure 27.23 to illustrate the effect of assumed direction of illumination on convexity and concavity. The hollow becomes a mound when the picture is inverted. But the direction of light cannot be the main factor, since it comes mostly from the side. The main factor in this picture is that one edge of the structure is an occluding edge, which is seen as such because of the sudden change in texture density (Howard 1983).

Direction of illumination  Erens et al. (1993) inquired whether people can distinguish between computer-generated elliptic, parabolic, and hyperbolic patches of a Lambertian surface in which the 3-D shapes were defined only by shading. Subjects performed poorly and were especially prone to confuse hyperbolic patches with positive and negative elliptic patches. Addition of a cast shadow that indicated the direction of the light source allowed subjects to distinguish between convex and concave elliptic patches, but they still confused elliptic and hyperbolic patches. Performance improved, but was still poor, after subjects had been shown animated sequences of patches. Christou and Koenderink (1997) asked subjects to set a gauge figure tangential to various points on the surface of an ellipsoid with depth defined by shading. As the direction of illumination was changed, gauge settings revealed a small bias to perceive brighter parts of the shape as nearer. This caused the ellipsoid to appear to bulge toward the more brightly illuminated part of the object. There is also some evidence that people tend to misperceive surface relief when misinformed about the direction of illumination (Oppel 1856; Yonas et al. 1979; Berbaum et al. 1983).

Highlights  It was explained in Section 27.3.1a that highlights on a shiny surface provide information about the magnitude and sign of surface curvature (Todd and Mingolla 1983). Norman et al. (2004) measured the threshold for detection of differences in the 3-D shapes of objects like those shown in Figure 27.17. Smaller changes were detected when highlights were present, as in (A), than when the shapes had diffuse shading or texture gradients, as in (B) and (C).

Stimulus motion  Movement of a 3-D surface displaces highlights most in regions of low curvature. Also, the direction of displacement of a highlight indicates the sign of curvature. Norman et al. (2004) found that rotation of a shape with highlights, like that in Figure 27.17A, did not lower shape discrimination thresholds. They suggested that this was due to a ceiling effect. However, motion improved performance for diffusely shaded objects.

Figure 27.18. Shading in a computer graphics image.

27.3.2c Detecting Concavity and Convexity: Subjective Factors

The following are subjective biasing factors that can influence the way people interpret convexities and concavities.

Preference for ground over ceiling surfaces  A surface viewed obliquely is more likely to be seen as a ground surface than as a ceiling surface (Reichel and Todd 1990; Mamassian and Landy 1998; Langer and Bülthoff 2001).

Preference for convex over concave objects  People tend to see an ambiguous 3-D object as convex rather than concave (Mamassian and Landy 1998; Liu and Todd 2004). Also, a concave patch among convex patches is more easily detected than a convex patch among concave patches (Kleffner and Ramachandran 1992). The greater oddity of the concave patch could make it more detectable. However, differences in apparent contrast between concave and convex patches seem to be involved (Chacon 2004).

Familiarity  Familiarity with particular objects can bias the way people resolve ambiguity of surface relief. For example, a facemask is seen as convex even when it is concave (Schroeder 1858; Gregory 1970). This reversal occurs with abstract concave objects, but less readily than with a familiar object.


Figure 27.19. The use of shading in art. Cattle by Anthony van Dyck (1599–1641). (Devonshire Collection, Chatsworth, UK. Reproduced by permission of the Chatsworth Settlement Trustees)

Figure 27.20. Stimuli used by Kingdom (2003). The luminance-modulated grating (a) and the chromatic-modulated grating (b) create an impression of a depth modulation when they are superimposed in (c). Superimposition of aligned luminance and chromatic gratings (d) does not create depth, nor does superimposition of orthogonal luminance gratings (e) or orthogonal chromatic gratings (f). (Reprinted by permission of Macmillan Publishers Ltd.)

Figure 27.21. Diffuse shading and perceived depth. By the left arrow, the point on the left is on a darker and deeper part of the surface than the point on the right. By the right arrow, the point on the left is on a brighter and deeper part of the surface than the point on the right. (From Langer and Bülthoff 2000)

Assumed direction of illumination  In 1744 Philip Gmelin of Württemberg reported to the Royal Society of London that when a seal is viewed through a microscope, concave regions appear convex and vice versa. Brewster (1826) reproduced this report (see Wade 1983). The American astronomer David Rittenhouse (1786) pointed out that a microscope inverts objects relative to the direction of illumination. When a flat representation of a surface, such as that in Figure 27.22, is viewed in a frontal plane, a region in which the light part is above the dark part tends to appear as a convexity, and a region in which the dark part is above the light part tends to appear as a concavity (see Papathomas and Gorea 1990 for an interesting variant).

Brewster (1826, 1847) concluded that people resolve shading ambiguity by implicitly assuming that illumination comes from above. This will be referred to as the "light-from-above assumption." This principle is generally accepted because it is seemingly so easy to demonstrate. The light-from-above assumption seems to be genetically determined in chickens, and it develops very early in human infants (Section 7.4.2d).

Most investigators and writers of textbooks have uncritically assumed that the light-from-above assumption refers to "above" defined in a gravitational frame of reference. But the relevant factor could be "above" with respect to the head or with respect to the retina. Ramachandran (1988) commented that, "it is the orientation on the retina that matters—not the phenomenal or 'gravitational' vertical." This can be verified by viewing stimuli with the head upside-down. But this procedure does not allow one to distinguish between a headcentric and a retinocentric frame. Also, we may use more than one frame of reference. One cannot conclude that only one frame of reference is used unless frames are tested independently, rather than just being pitted one against the other.

Howard et al. (1990) recorded the way observers perceived shaded disks, like those shown in Figure 27.22, under the following four conditions. (1) With the head upright to gravity (congruent head and gravity frames). (2) With the head inverted (opposed head and gravity frames). (3) With the head tilted 90˚ to one side (orthogonal head and gravity frames); in this condition, the gravitational frame of reference was irrelevant when the axis of the shading gradient was aligned with the head, and the headcentric frame of reference was irrelevant when the shading axis was aligned with gravity. (4) With the head upright but with the figure at a steep angle beneath the chin, so that the part of the picture that was "top" with respect to gravity and to the head was upside-down on the retina.

The headcentric frame of reference for direction of illumination was dominant when head and gravity frames were opposed. However, when the head frame was irrelevant, subjects showed a weak bias toward the assumption that illumination was from above in a gravity frame of reference. The results of condition (4) showed that the retinal frame of reference, not the head frame of reference, is the dominant factor: the convexities and concavities were interpreted in terms of the orientation of the figure with respect to retinal coordinates. So the "assumption" about the direction of illumination is embedded in retinal coordinates.

Shaded objects were detected more rapidly when the illumination came from between 30 and 60˚ to the left (Sun and Perona 1998). Most of 225 randomly selected paintings were illuminated by a light source between 30 and 60˚ to the left of vertical. Mamassian and Goutcher (2001) asked subjects to report the relative depths of surface ridges defined by shading, with the ridges placed in various orientations in the frontal plane. Responses indicated that subjects assumed that the light source was above but, on average, 30˚ to the left. Left- and right-handed subjects showed the same bias. The ecological reason for this bias is not clear. Mamassian and Goutcher suggested that it arises from a tendency of people to attend more to the right side of objects than to the left side. Whatever the bias, the assumed direction of light from above is very broadly tuned (Symons et al. 2000).

There is some evidence that the light-from-above assumption and the bias in favor of convexity can be modified by allowing subjects to feel the actual relief of shaded patches (Adams et al. 2004; Champion and Adams 2007).

Figure 27.22. Convexity-concavity from shading. Convexities and concavities perceptually reverse when (A) is inverted, in conformity with the assumption that light comes from above. Viewing the display with an inverted head reveals that the important factor is the direction of light with respect to the head rather than with respect to gravity. The odd disk pops out, especially when the shading axis is up-down, as in (A), rather than side-to-side, as in (B). This shows that shape from shading is a preattentive feature.


Figure 27.23. Reversing crater. The crater becomes a hill when turned upside down. It has been claimed that this is due to the up-down reversal of the assumed direction of illumination. But illumination is predominantly from one side. The effect is most likely due to the texture-gradient discontinuity along the near edge of the crater or the top of the hill. (From Levine and Shefner 2001)

27.3.2d Shape-from-Shading as a Preattentive Feature

The following evidence suggests that simple depth from shading is processed at a preattentive level, probably in the primary visual cortex.

1. A simple figure with depth defined by the direction of shading pops out from an array of figures with the opposite depth, as in Figure 27.22. The time taken to find an odd element in such a display is independent of the number of elements (Kleffner and Ramachandran 1992; Braun 1993).

2. Convexities and concavities defined by shading are interpreted in terms of the orientation of the shading gradient with respect to retinal coordinates (Howard et al. 1990). This suggests that depth from shading is processed at an early stage, before the orientation of the stimulus with respect to the head is registered.

3. Humphrey et al. (1996) described a woman with visual form agnosia who had difficulty recognizing simple line drawings of shapes. She did better with real objects that contained surface textures and shading. She could readily distinguish between convex and concave patches, like those in Figure 27.22, but was deficient in distinguishing between patches in which shading gradients were replaced by sharp discontinuities of luminance.



4. Humphrey et al. (1997) found greater fMRI activation in the human primary visual cortex in response to disks with a vertical shading gradient than to disks with a horizontal shading gradient.

27.3.3 DETACHED SHADOWS

A shadow cast by an object is known as a detached shadow. A detached shadow typically has an edge with a shape that depends on the height and shape of the shadow caster. If the direction of illumination is known, a cast shadow provides information about the size and shape of the shadow caster. However, the shape of a shadow also depends on the surface on which it is cast. For example, the undulations of a shadow cast by an object such as a pole provide information about the 3-D structure of the surface on which it falls. An object in contact with a surface casts a detached shadow that touches the object. An object that is separated from a surface casts a detached shadow with a complete independent edge. The position of a detached shadow relative to the object provides information about the spatial location of the object relative to the surface (Yonas 1979). For example, the three white squares in Figure 27.24 are similarly placed on the checkerboard, but their heights above the checkerboard appear to increase from left to right because of the shift of the shadow. Figure 27.25 depicts frames of a video display used by Kersten et al. (1997). The black disk moved diagonally across the checkerboard.


When the shadow moved with the disk, the disk appeared to move in depth in contact with the surface. The disk also appeared to increase in size because of the relationship between perceived size and perceived distance. When the shadow moved horizontally while the disk moved diagonally, the disk appeared to move in the frontal plane and rise up above the surface.

Figure 27.24. Relative depth from cast shadows. The light gray squares appear to increase in depth from the background because of the positions of the shadows. (Redrawn from Kersten et al. 1997)

Figure 27.25. Motion-in-depth and cast shadows. When the black disk and its shadow move together diagonally from left to right, the disk appears to move in depth along the surface and grow in size (top diagram). When the disk moves diagonally but the shadow moves horizontally, the disk appears to move in a frontal plane, rise above the surface, and remain the same size (lower diagram). (Adapted from Kersten et al. 1997)

Correct perception depends on the ability to distinguish between the edges of shadows and other edges. Shadows have the following features that can serve to distinguish them from changes in surface brightness:

1. Edges of shadows tend to be blurred and involve a single transition from light to dark. Hering (1905) showed that a shadow with a dark line drawn round it appears as a dark painted patch rather than a shadow on a surface. In Figure 27.26, the impression of a face is destroyed when a line is drawn round the shadow. Addition of blur to a dark region can convert that region into an object with an attached shadow or into an object that casts a shadow, as in Figure 27.27 (Elder et al. 2004).

2. A shadow is darker than the region around it. A region appears as a shadow only when its boundary is defined by a change in luminance. An equiluminant region that differs only in color is not perceived as a shadow, even if it has all the other attributes of a shadow. Also, regions defined only by motion or binocular disparity are not perceived as shadows (Cavanagh and Leclerc 1989).

3. A change in luminance along the edge of a shadow is usually in a consistent direction. An edge with inconsistent luminance polarity does not create an impression of a shadow. Although changes in the texture or contrast of a surface do not usually coincide with the edge of a shadow, the impression of a shadow is not necessarily destroyed when such coincidences occur (Cavanagh and Leclerc 1989).

Figure 27.26. Destruction of shadow by a line contour. The front view of a woman's face can be seen on the left but not when the shaded areas are surrounded by a line.

27.3.4 AERIAL PERSPECTIVE, BRIGHTNESS, AND CONTRAST

The atmosphere has two main effects on the visibility of a distant object. First, rising convection currents of warm air refract light and introduce shimmer. This is optical haze. Second, light is absorbed and scattered by atmospheric dust and mist, as illustrated in Figure 27.28. The effects of both these factors on contrast can be measured with a photoelectric telephotometer (Coleman and Rosenberger 1949, 1950). The atmosphere, especially water vapor, absorbs red light more than blue light.

In the 4th century BC, Aristotle wrote in his Meteorologica that things, such as a promontory at sea, seem larger in a mist. It seems that Leonardo da Vinci coined the term aerial perspective to describe the blueness and faintness of distant objects (Ross and Plug 1998). Helmholtz argued that haziness increases the perceived distance of objects and that this causes them to appear enlarged. He wrote:

If distant objects such as the moon or a range of mountains, owing to their being seen through haze or for some other reason, are regarded as being farther away, they will invariably appear also to be magnified in size to the same degree. (Helmholtz 1909, Vol. 3, p. 283)

This assumes that the primary effect of aerial perspective is to increase perceived distance. Factors other than a direct increase in perceived distance could be involved in aerial perspective. A mist could obscure the texture gradient arising from the ground surface, and this loss of depth information could cause things to appear nearer and therefore smaller. Also, the image of an object seen through a mist or haze has blurred borders, which could make the image appear larger and the object nearer. Ross (1967) found that the diameters of disks seen in a fog were overestimated by about 20%, and their distances by about 80%. Perceived size and perceived distance were only poorly correlated.

Figure 27.27. Blur creates attached and cast shadows. Adding blur to the crescent creates an object with attached shadow or a disk casting a shadow. (Redrawn from Elder et al. 2004)




Figure 27.28. Aerial perspective. St Martin-in-the-Fields, painted by William Logsdail. Exhibited at the Royal Academy in 1888. (Reproduced by courtesy of the Trustees of the Tate Gallery, London)

Other factors arising from aerial perspective are loss of brightness and contrast. Ames (1946) reported that a disk of light appeared to approach when its luminance was increased, and to recede when its luminance was decreased (see Ittelson 1952). A luminous test disk in dark surroundings appeared progressively nearer than a less luminous comparison disk of equal angular subtense as the luminance of the test disk was increased (Coules 1955). In a natural setting, stimuli with high contrast relative to the background were judged to be nearer than stimuli with lower contrast (Mount et al. 1956).

In the above studies, background luminance was held constant, so the effects of changing luminance were confounded with changes in contrast. Farnè (1977a) found that a rectangle with greater contrast appeared nearer than one with lower contrast, whether it was black or white, as illustrated in Figure 27.29A. In Figure 27.29B a high-contrast patch appears nearer than a low-contrast patch, independently of the luminances of the patches (O'Shea et al. 1994) (Section 18.7.3b). In other words, perceived depth is a function of contrast rather than of luminance. A disk of constant luminance and constant size appeared to brighten and approach when the background was darkened, and to darken and recede when the background became brighter (Farnè 1977b).

O'Shea et al. (1997) varied relative blur and relative contrast independently in the two halves of textured bipartite displays. A more blurred region appeared more distant than a less blurred region when contrast was the same. A region of higher contrast appeared nearer than a region of lower contrast when blur was the same. The effects of the two cues were additive over a moderate range of contrast. Dresp et al. (2002) confirmed the effect of contrast on the perceived depth order of two shapes and investigated how relative contrast interacts with overlap and occlusion cues.

The effects of brightness on the perceived depth of a disk could be due to changes in the perceived angular size of the disk. Holway and Boring (1940c) found that a 10-fold dimming of a 1˚ disk of light in dark surroundings induced, approximately, a 10% reduction in apparent size relative to a standard disk. A 100-fold reduction in illumination reduced apparent size by 27%. Robinson (1954) found that a 6.25-fold decrease in illumination reduced the apparent size of a disk by only about 3%. There was no mention of apparent distance in either of these studies.

It is a well-known feature of figure-ground perception that, other things being equal, a region that is perceived as a figure is perceived to be nearer than a region perceived as ground. Thus, in Rubin's cross, the cross that appears nearer at any instant is the one that is perceived as the figure. Egusa (1983) found that the magnitude of depth between the halves of a square increased with increasing difference in brightness between the halves. However, the perceived depth order of the halves depended on which half was seen as the figure.

Figure 27.29. Perceived depth and relative contrast. (A) The rectangles with greater contrast appear nearer than the background. (Adapted from Farnè 1977a) (B) In each display the high-contrast square appears nearer than the low-contrast square, even though the squares on the left have the same luminance as those on the right. (Adapted from O'Shea et al. 1997)


28 DEPTH FROM MOTION PARALLAX

28.1 Properties of motion parallax
28.1.1 Historical background
28.1.2 Motion parallax and binocular disparity
28.1.3 Types of motion parallax
28.2 Parallax and absolute distance
28.2.1 Judging the distance of a single frontal object
28.2.2 Parallax and the distance of 2-D displays
28.2.3 Parallax and the distance of 3-D displays
28.3 Parallax and relative depth
28.3.1 Relative depth from motion parallax
28.3.2 Thresholds for depth from motion parallax
28.3.3 Stimulus features in motion parallax
28.3.4 Depth scaling for motion parallax
28.4 Parallax and 3-D shape
28.4.1 Basic studies of 3-D shape perception
28.4.2 Shape from parallax and from disparity
28.4.3 Contrast in motion parallax
28.4.4 Observer differences in depth from parallax
28.5 The kinetic depth effect
28.5.1 Basic features of the kinetic depth effect
28.5.2 Stimuli for the kinetic depth effect
28.5.3 Ambiguities in the kinetic depth effect
28.6 The stereokinetic effect

28.1 PROPERTIES OF MOTION PARALLAX

Motion parallax refers to relative motion of the images of object points at different distances that is caused by motion of the observer or by coherent motion of the object points. Consider an observer viewing a 3-D object with one eye, with the eye first in one position and then displaced to one side through the interocular distance. The two successive images are identical to the two simultaneous images in the stationary eyes. Thus, motion parallax, like binocular disparity, arises from differences in perspective due to viewing stimuli from different vantage points. The information derived from motion parallax can be called sequential perspective, and that derived from binocular viewing can be called binocular perspective.

This chapter examines the contribution of motion parallax to depth perception. It also examines the similarities and differences between depth perception based on binocular disparity and that based on motion parallax. Interactions between these two depth cues are reviewed in Section 30.2.

On the first page of his book An Essay Towards a New Theory of Vision, the empiricist philosopher George Berkeley (1709) asserted,

It is, I think, agreed by all that distance, of itself and immediately, cannot be seen because two points lying along the same visual line would project to the same point on the retina whether the distance be longer or shorter.

28.1.1 HISTORICAL BACKGROUND

If we ignore changes in accommodation, this is a correct statement about isolated points observed by a single eye, as depicted in Figure 28.1A. Each stationary point of an image in one eye contains directional information but not depth information. However, globally, an array of points may contain perspective information, and most natural scenes contain several other types of information about the 3-D shapes of objects and the 3-D layout of the scene, even when viewed with one stationary eye. Berkeley's "lost" depth information can, in theory, be recovered from the images of a point viewed by the two stationary eyes, or successively by a moving single eye, as in Figures 28.1B and C.

As long ago as the 3rd century BC, Euclid described how self-motion creates differential motion of objects at different distances (Section 2.1.3b). Wheatstone (1838, p. 380) was the first person to describe the similar appearance of the 3-D world from disparity and motion parallax. He stated that people blind in one eye derive depth information, equivalent to that provided by disparity, from motion of the head. He wrote:

When different projections of the same object are successively presented, the form to which they belong is completely characterized. While the object remains fixed, at every movement of the head it is viewed from a different point of sight, and the picture on the retina consequently continually changes.

Helmholtz (1909) wrote:

Suppose, for instance, that a person is standing still in a thick wood, where it is impossible for him to distinguish, except vaguely and roughly, in the mass of foliage and branches all around him what belongs to one tree and what to another. But the moment he begins to move forward, everything disentangles itself, and immediately he gets an apperception of the material contents of the woods and their relations to each other in space, just as if he were looking at a good stereoscopic view of it. (Vol. 3, pp. 295–296)

28.1.2 MOTION PARALLAX AND BINOCULAR DISPARITY

Monocular motion parallax, like binocular disparity, can serve as a cue to relative depth (Rogers 1993). For an object at a given distance and a given motion of the observer, the extent of motion parallax between that object and a second object is proportional to the depth between them. The sign of the parallax depends on which object is nearer to the viewer. For two distant objects separated by a small depth, the parallax produced by a given motion of the observer is inversely proportional to the square of the mean distance of the two objects from the observer. These geometrical relationships are analogous to those for binocular disparity (Section 14.2.3).

The similarities between depth perception based on disparity and that based on motion parallax suggest that the neural mechanisms for the two types of depth perception have much in common. However, in spite of their basic similarity, binocular disparity and motion parallax differ in the following ways.

1. Simultaneous versus sequential processing  Disparity-based stereopsis relies on discrete differences between the optic arrays in the two eyes created by simultaneous viewing from two different vantage points. This is binocular perspective. Stereoscopic vision based on motion parallax relies on continuous changes in a single optic array occurring over time. This is sequential perspective.

Consider the simplest case of a single object. Detection of the absolute distance of a small object from the absolute binocular disparity of its images requires information about (1) the distance between the eyes, (2) the angle of convergence, and (3) the retinal disparity in the images of the object, which is zero when the object is binocularly fixated. Detection of the distance of a stationary object from the absolute motion parallax of its image produced by sideways motion of the head requires information about (1) the translatory motion of the head, (2) the change in the angle of gaze, and (3) the change in the retinal location of the image (Section 28.2.1). Also, any rotation of the head on the body, or of the body, as the head translates must be registered. It is not surprising that we are not very good at judging the absolute distance of a single point from absolute motion parallax (Section 28.2.1). We will see in Sections 28.2.2 and 28.2.3 that judgments of absolute distance from parallax are better when there are several points in view.

2. Image correspondence  In stereopsis, the parts of one eye's image must be linked with corresponding parts in the other eye's image. For motion parallax it is only necessary to register the continuity over time of image features in the same eye.

3. Ambiguities  Motion parallax is subject to three ambiguities. First, binocular disparity in stationary binocular images must be due to differences in relative depth when the eyes are properly aligned. Motion parallax may be due either to differences in depth or to relative movement or deformation of objects. We will see in Section 28.3.3d that this ambiguity is removed only if the observer has knowledge about the object's rigidity. The second ambiguity is that, while disparity provides information about the sign of relative depth, motion parallax may not do so. Both cues indicate the magnitude of relative depth only if information about viewing distance is provided. Ambiguities are largely eliminated when both cues are present (Richards 1985).

Figure 28.1. Information about the distance of isolated points. (A) All information about the distances of two isolated small points P1 and P2 is lost when they are seen from a single vantage point. (B) Knowing the visual directions of the points, as well as the separation (x) between two simultaneous vantage points, provides complete information about their positions in space. (C) The same information is available from one moving vantage point. (Redrawn from Rogers 1993)


The third ambiguity is that motion parallax produced by translation of an observer is essentially the same as that produced by equivalent motion of the stimulus with respect to the stationary observer. Even for a rigid object, there is an ambiguity as to whether a pattern of motion is due to motion of the observer or of the object. Several sources of information could resolve this ambiguity, including (a) kinesthetic and vestibular signals and (b) the relative optic flow between the object and other parts of the visible scene.

4. Supplementary information  In judging 3-D structure from motion parallax, the direction and extent of changes in viewing direction must be correctly registered. For example, to a first approximation, the same pattern of relative motion is created by a deeply corrugated surface moving through a small angle as by a surface with shallow corrugations moving through a larger angle (Rogers and Collett 1989; Durgin et al. 1995). There are two sources of parallax information about changes of viewing direction. The first is the absolute motion parallax of a point. To use this information the observer must monitor the motion of the image of the point, the rotation of the eye in the head, of the head on the shoulders, and of the body with respect to the ground. These issues are discussed in Section 28.3.1. The second is perspective changes in the image of an object or surface close to the observer. This is analogous to the use of patterns of disparity for judging the direction of a surface. In judging 3-D structure from binocular disparities, the state of vergence and the interocular separation must be correctly registered. Information about vergence could be derived from (a) proprioceptive or efference-copy signals indicating the vergence state of the eyes and (b) patterns of vertical disparity (Section 19.6.4).

5. Eye separation versus extent of head/eye motion  Perception of depth from binocular disparity requires registration of the interocular distance. This distance is a constant that is presumably learned. When mirrors are used to effectively reduce or extend the interocular distance, perceived depth intervals are changed accordingly. Perhaps, with long exposure to an altered interocular distance, we could recalibrate our registration of interocular distance. But that type of experiment has not been performed. Perception of depth from head parallax requires registration of the magnitude of head motion or of the eye movements associated with head motion. Unlike interocular distance, this is not constant. The crucial factor is the ratio of relative image motion to the magnitude of head motion or, equivalently, the ratio of the velocity of image motion to the velocity of head motion. If a ratio is used, a perceived depth interval does not vary with the magnitude of head motion, given that parallax is detected. But we will see that increasing the magnitude of head motion extends the range over which a given depth interval can be detected.

6. Adjustment of depth range  For binocular disparity, the range of absolute distance over which a given depth interval can be detected is determined by stereoacuity and the interocular distance. For a given observer, both these factors are fixed. The only way to extend the depth range is to artificially increase the interocular distance, as with a range finder. For head parallax, the distance range over which a depth interval can be detected is determined by motion sensitivity and the magnitude of head motion. But a person can control the magnitude of head motion and thus the depth range.

7. Axis of image displacement  The major binocular disparities lie along the interocular axis, while the major image changes due to head motion lie along the direction of head displacement, which can be any direction. For example, many animals gain depth information by bobbing their heads up and down to create vertical motion parallax. This issue is discussed in Section 28.3.3f.

28.1.3 TYPES OF MOTION PARALLAX

A typical visual scene is composed of opaque textured surfaces rather than isolated points. Depth modulations of such surfaces are usually gradual. Therefore, the optic flow generated by movements of the observer is spatially differentiable over local areas. Any differentiable flow field can be decomposed into five local spatial gradients of velocity, as shown in Figure 19.18. These differential velocity components are:

1. Translation within a defined axis system,
2. Expansion or contraction (div),
3. Rotation (curl),
4. Shear deformation (def 1), and
5. Compression- or expansion-deformation (def 2).

The last four components are known as differential invariants because they do not depend on the choice of axis system. Koenderink (1986) suggested that the visual system contains detectors specialized for coding these differential components of a flow field. Each detector pools information about the direction and magnitude of motion over each local area. Mechanisms of this type have been revealed in cortical areas MT and MST (Section 5.8.4).
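As an illustration of this decomposition, the following is a minimal sketch for a locally linear flow field v(x) = Ax, where A is the 2 × 2 velocity-gradient tensor. The example matrix is arbitrary, and the assignment of the two deformation components to particular axes depends on the chosen coordinate frame.

```python
import numpy as np

# Velocity-gradient tensor of a locally linear flow v(x, y) = A @ (x, y):
# A = [[du/dx, du/dy], [dv/dx, dv/dy]]; the values are arbitrary.
A = np.array([[0.3, -0.1],
              [0.4,  0.1]])

div  = A[0, 0] + A[1, 1]   # expansion or contraction
curl = A[1, 0] - A[0, 1]   # rotation
def1 = A[0, 1] + A[1, 0]   # shear deformation
def2 = A[0, 0] - A[1, 1]   # compression-expansion deformation

# The tensor is recovered as the sum of the four components; a uniform
# translation does not appear in the gradient at all.
reconstructed = 0.5 * (div * np.eye(2)
                       + curl * np.array([[0, -1], [1, 0]])
                       + def1 * np.array([[0, 1], [1, 0]])
                       + def2 * np.array([[1, 0], [0, -1]]))
assert np.allclose(reconstructed, A)
print(f"div={div:.2f} curl={curl:.2f} def1={def1:.2f} def2={def2:.2f}")
```

Because the last four quantities do not change when the axes are rotated together, detectors tuned to them would signal the same surface property regardless of the direction of observer motion.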


There are five corresponding types of motion parallax. Unless stated otherwise, the images are produced by polar projection.

1. Linear parallax  The image of a single object moves over the retina if the object or the eye moves in a frontal plane (orthogonal to the visual axis). This simple image motion defines absolute motion parallax. For a given magnitude of motion of object or eye, the change in visual direction of an object is, approximately, inversely proportional to its distance. With monocular fixation on a stationary object, the change in the angle of gaze as the head moves through the interocular distance is the same as the angle of convergence when both eyes of a stationary observer fixate the object.

Consider two objects at different distances moving sideways at the same velocity and viewed with a stationary eye. The image of the more distant object moves more slowly than that of the nearer object. This is linear motion parallax. For a given distance of the near object, the linear parallax is proportional to the distance of the far object from the near object. For two objects with a given separation in depth, the motion parallax is approximately inversely proportional to the square of their distance.

axis of slope its image expands or contracts in the direction of motion. In both types of deformation parallax the images of all the texture elements on a surface move with a common motion, which is superimposed on the deformation pattern. To a first approximation, the sum of the translation and deformation components defines a velocity gradient, as shown in Figure 28.8A (Koenderink 1986). Images of moving 3-D objects produced by polar projection on a flat screen also contain common motion and shear motion. Such images create a strong impression of depth. This is a kinetic depth effect, which is discussed in Section 28.5. Images of objects formed by parallel projection contain only the common motion. The two components of deformation parallax provide information about the local slant and inclination of a surface up to a scaling factor of viewing distance (Koenderink and van Doorn 1975, 1976; Longuet-Higgins and Prazdny 1980). They also provide information about the overall shapes of 3-D surfaces (Section 28.4). The corresponding components of spatial gradients of disparity provide information about the same features of stationary surfaces, as explained in Chapter 19.

2. Looming parallax The image of an approaching object

increases in size. The rate of increase is inversely proportional to the distance of the object. This is simple looming. An isolated disk or textured display at a fixed distance but growing in size creates an impression of motion-in-depth, as discussed in Section 31.2. The images of two objects at different distances loom at different rates at they move in depth. This is looming parallax Similarly, the front and rear parts of an approaching 3-D object exhibit looming parallax. Looming parallax does not occur with parallel projection. Looming parallax provides information about the motion of objects in depth (Section 31.2). 3. Rotation parallax A 3-D object rotating eccentrically

about the visual axis of an eye produces an image that undergoes cyclic changes in parallax. The projection of such an object on a screen produces a strong impression of depth. This is the stereokinetic effect, which is described in Section 28.6. Since each point in an object that produces a stereokinetic effect moves mainly in a frontal plane, the effect is essentially the same for polar as for parallel projection. 4. Shear-deformation parallax When a textured surface

sloping in depth moves in a direction parallel to its axis of slope, its image exhibits shear-deformation parallax. 5. Compression- or expansion-deformation parallax When

a sloping surface moves in a direction orthogonal to its
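The inverse-square relation mentioned under linear parallax can be checked numerically. This is a minimal sketch with illustrative values; the angular-velocity formula is exact only at the instant a point lies on the normal to the motion path.

```python
def angular_velocity(distance, eye_speed):
    """Angular speed (rad/s) of a stationary point's image for an eye
    translating at eye_speed (m/s); exact at the instant the point lies
    on the normal to the motion path."""
    return eye_speed / distance

v, near, depth = 0.1, 1.0, 0.1               # m/s, m, m
far = near + depth
parallax = angular_velocity(near, v) - angular_velocity(far, v)
approx = v * depth / near ** 2               # inverse-square approximation
print(parallax, approx)                      # ~0.0091 vs 0.0100 rad/s
```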

28.2 PARALLAX AND ABSOLUTE DISTANCE

28.2.1 JUDGING THE DISTANCE OF A SINGLE FRONTAL OBJECT

When an eye translates through distance d along a path orthogonal to the visual axis, the image of a stationary object at distance D from the eye undergoes an angular displacement, θ, where D = d/tan θ. If the angular displacement of the image and the amplitude of eye movement are registered, the distance of the point can be detected. When the head moves through the interocular distance, the change in the visual direction of the object seen by one eye equals the vergence angle between two stationary eyes fixating the object. If a translating eye remains fixated on a stationary point, θ is the angular displacement of the eye. If the head turns so as to maintain both fixation and a constant angle of the eye in the head, then θ is the angle of rotation of the head. If the whole body turns, θ is specified relative to the ground. Consequently, detection of the distance of a single object by motion relative to the observer requires information about the direction and magnitude of:

1. Change in position of the retinal image.
2. Eye and/or head translation.
3. Rotation of the eye in the head.
4. Rotation of the head on the body.
5. Rotation of the whole body.
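For the simplest case, in which a translating eye remains fixated on a stationary point, the geometry above reduces to D = d/tan θ. A minimal numerical sketch, with illustrative values:

```python
import math

def distance_from_parallax(head_travel, gaze_rotation_deg):
    """Distance (m) to a fixated stationary point, from the gaze rotation
    produced by a sideways head translation of head_travel metres."""
    return head_travel / math.tan(math.radians(gaze_rotation_deg))

# A 6.5-cm head translation (about one interocular distance) that turns
# the gaze through 3.7 deg places the target at about 1 m.
print(distance_from_parallax(0.065, 3.7))  # ~1.0
```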

Consider a person moving the head from side to side along a linear track while fixating a straight-ahead stationary point of light in dark surroundings. For a given velocity of the head, the velocity of eye rotation required to keep fixation on the point is inversely related to the distance of the light, and becomes zero at infinity. Therefore, in theory, the rotation of the eyes, as indicated by efference or by proprioceptive feedback, provides accurate information about the distance of a fixated stationary target. However, to make use of this information the viewer must know that the object is stationary and must correctly register three attributes of head motion: its amplitude, its direction in headcentric coordinates, and the fact that it is a purely translatory motion. Any misregistration of the movements of eyes or head will be reflected in misperception either of the motion of the object relative to the observer or of its distance from the observer.

If, during a sideways motion of the head, a person underestimates the distance of an accurately fixated stationary target, the eyes will move more slowly than they should for that perceived distance. The stationary target will therefore appear to move in the same direction as the head. When target distance is overestimated, the eyes will move faster than they should for that perceived distance, and the target will appear to move in the direction opposite to the head (Hay and Sawyer 1969; Wallach et al. 1972b). These misperceptions of motion may be called illusory motion parallax. The motion of a stationary object relative to the moving self is correctly interpreted as due to self-motion only when the distance of the object and the motions of eyes and head are correctly registered. This motion-distance invariance principle is a special case of the size-distance invariance principle described in Section 26.2.

The motion-distance invariance principle is illustrated by the common observation that a face on a poster seems to follow one as one walks past it. This effect is particularly evident when a stereogram is viewed with the head moving from side to side (Tyler 1974b; Shimono et al. 2002). When one walks past a real object the motion parallax between the near and far parts is ascribed to motion of the self rather than of the object. In a picture of a face or in a stereogram the expected motion parallax is not present, even though the object appears to have depth. A real object with depth can be seen without motion parallax only if it moves with the viewer. It is as if the visual system has a natural appreciation of the degree of motion parallax of parts of an object relative to each other, or of motion of an object relative to the self, generated by motion of the head. As long as this natural scaling of relative motion and distance holds, the object appears stationary, but a departure from this natural scaling is interpreted as a rotation or translation of the object.
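The sign of this illusory motion can be formalized. The sketch below is a hypothetical formalization consistent with the account of Hay and Sawyer, not code from their study: the residual between the eye rotation actually required for the true distance and the rotation expected for the perceived distance is attributed to the target.

```python
def apparent_target_velocity(head_speed, true_dist, perceived_dist):
    """Angular velocity (rad/s) attributed to a stationary fixated target
    when its distance is misperceived during sideways head motion.
    Positive: target appears to move with the head (distance underestimated).
    Negative: target appears to move against the head (distance overestimated)."""
    required = head_speed / true_dist        # pursuit rotation actually needed
    expected = head_speed / perceived_dist   # rotation implied by perceived distance
    return expected - required

v = 0.2  # head speed in m/s
print(apparent_target_velocity(v, true_dist=4.0, perceived_dist=2.0))  # +0.05: with the head
print(apparent_target_velocity(v, true_dist=1.0, perceived_dist=2.0))  # -0.10: against the head
```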



The motion-distance invariance principle can also be illustrated by placing two identical coins side by side on a uniform frontal surface. If the head moves while the two coins are fused by convergence, they appear to move with the head. If they are fused by divergence, they appear to move against the head (Hay and Sawyer 1969). Scaling between motion and distance need not depend on a high-level inferential process, and it is not under conscious control. However, the perceived distance of an isolated object can be modified, at least temporarily, by exposure to an unusual association between distance and vergence, such as is produced by wearing base-out or base-in prisms (Section 25.2.6).

Gogel and Tietz (1973, 1977) had subjects fixate a point of light moving vertically in dark surroundings while they moved the head from side to side. Any apparent sideways motion of the light appeared as a tilt of the path of motion of the vertically moving light. According to the motion-distance invariance principle there should be no apparent sideways motion of the light when the distance of the light is correctly perceived, assuming that the movements of the eyes and head are correctly registered. Gogel and Tietz found that, with monocular and binocular viewing, the motion of the test light indicated that distance was judged accurately only when the light was at a specific distance. A point of light beyond the specific distance appeared to move with the head, indicating that its distance was underestimated. A point nearer than the specific distance appeared to move against the head, indicating that its distance was overestimated. In other words, in the absence of depth cues other than parallax, or with parallax plus convergence, an isolated object appeared to be displaced in depth in the direction of a specific distance of about 2 m. This is Gogel’s specific-distance tendency, which was discussed in Section 26.2.

In another experiment, Gogel and Tietz (1979) used the same procedure with motion of the test light relative to the observer (parallax) signifying one distance and prism-modified vergence of the eyes signifying another distance. They concluded from their results that binocular convergence is a more effective cue to absolute distance than is motion parallax.

In the above analysis it was assumed that movements of the eyes and head were correctly registered but that the distance of the point of light was not. Suppose it were the other way around. Assume that a person underestimates the movements of the eyes when fixating a distant point while moving the head from side to side. By the motion-distance invariance principle, this should cause the point to appear to move with the observer. Assume also that a person overestimates the movements of the eyes when fixating a near point of light. This should cause the point to appear to move against the motion of the observer. In other words, the same illusory movements of an isolated point can be explained in terms of incorrect judgments of either the movements of the eyes or the distance of the point.


The illusory parallax effect could be due to a specific eye-amplitude tendency—a tendency to underestimate slow and overestimate fast eye movements. Post and Leibowitz (1982) proposed such a theory of illusory parallax motion. When the head moves from side to side, stimuli arising in the utricles of the vestibular system evoke a compensatory movement of the eyes, known as the linear vestibulo-ocular reflex (LVOR). For objects at infinity, the required gain of the LVOR is zero, and the response is inhibited by visual fixation. They proposed that these inhibitory signals are interpreted as a movement of the eyes in the direction opposite to the LVOR signal. This causes the point of light to appear to move with the head. For objects close to the observer the natural gain of the LVOR is supplemented by extra innervation generated by visual fixation of the stationary target, and this is interpreted as extra movement of the eyes in the direction of the LVOR signal. This causes the light to appear to move against the head. At a certain distance the natural gain of the LVOR is appropriate to hold fixation on the target without the help of visually evoked eye movements. This is the distance at which the point of light appears to be stationary as the head moves. There is evidence that the gain of the LVOR in the dark varies with vergence angle (Section 10.9.2). However, there is no direct evidence that eye movements are misregistered in the manner required to explain illusory parallax motion. Evidence cited previously, based on static judgments of distance, independently supports the specific-distance tendency. Perhaps illusory parallax motion is due to misregistration of both the distance of the target and the magnitude of the eye movements.

The other possibility, which nobody seems to have considered, is that illusory motion parallax is due to misregistration of motion of the head. Little is known about how accurately or precisely we register the amplitude of motion of the head when it is moved actively from side to side in the dark. When the whole body is translated or rotated passively at a constant velocity for more than about 30 s, signals from the vestibular organs cease. There are therefore no nonvisual sensory inputs, and the relative motion between the self and the visual surroundings is totally ambiguous, as illustrated by the phenomenon of illusory self-motion, or linear vection (see Howard 1982).

Dees (1966) investigated the accuracy and precision of estimates of the absolute distance of an object from binocular disparity and motion parallax. Observers were taught to use a scale of 1 to 20 to report the perceived distance of a ping-pong ball subtending a visual angle of 2˚ against the background of a star field at optical infinity. The 2-D display was viewed with binocular disparities, or with the motion parallax that would be created by a 2-foot side-to-side head movement, or with both cues combined. Note that the head did not actually move. Not surprisingly, the precision of judgments decreased with increasing viewing distance. Motion parallax gave less variable results than disparity alone, but the two cues together gave the best results. However, superior performance with motion parallax may have arisen because parallax was created by an object displacement of 2 feet, whereas, in the binocular case, the separation of the eyes is only 2½ inches. Since the head did not move, it is not clear how relative motion between the ping-pong ball and the star background provided depth information. The target was presented at several distances, and subjects initially received error feedback. They could therefore have learned to base distance estimates on the relative angular velocity between the ball and the background.

Wickelgren et al. (2000) asked subjects to point with an unseen hand to a frontal array of disks viewed monocularly as they moved their heads from side to side or backward and forward. Accuracy and precision of reaching were about the same for the two types of head motion.

During translatory motion of the head, the images of near objects move more rapidly than those of far objects, relative to the head. Medendorp et al. (2003) asked whether the direction of a flash of light is correctly retained during active sideways motion of the head. Subjects fixated an LED at 2 m in otherwise dark surroundings while they actively moved the head from side to side. During the motion, a target light at a distance of 20 to 100 cm was flashed onto a peripheral retinal location. Two seconds later the subject stopped moving, the fixation light was extinguished, and the subject directed the gaze to the remembered location of the flashed target. As predicted, their responses almost perfectly corrected for the headcentric parallax motion of the flashed target during the interval in which it was not visible. Subjects turned their eyes further for a near target than for a far target. Thus, during active self-motion the remembered location of an object is updated by taking its distance into account. The distance of the flashed target must have been registered by its binocular disparity relative to the fixation target.

When an object is released, it falls a given distance in a given time. Theoretically, the angular motion of the image of the object could be used for judging the size and distance of the object (Watson et al. 1992). If the distance fallen, l, subtends a small visual angle, θ, the distance of the object from the eye is approximately:

Distance = l / tan θ   (1)

When θ = 1˚, the distance is approximately 57 l; for example, an object that has fallen 1 cm lies at about 57 cm.
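Combining Equation 1 with the elementary law of falling bodies, l = gt²/2, gives a distance estimate from fall duration and subtended angle. This is a hedged sketch; the fall-time step is an added inference for illustration, not part of Equation 1 itself.

```python
import math

def distance_from_fall(duration, subtended_angle_deg, g=9.81):
    """Distance (m) implied by a free fall of given duration (s) whose
    extent subtends subtended_angle_deg at the eye (Equation 1)."""
    drop = 0.5 * g * duration ** 2       # distance fallen, l
    return drop / math.tan(math.radians(subtended_angle_deg))

print(0.01 / math.tan(math.radians(1.0)))  # 1-cm fall subtending 1 deg: ~0.57 m
print(distance_from_fall(0.5, 10.0))       # 0.5-s fall subtending 10 deg: ~7 m
```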

For reviews of the empirical literature on this issue, see Tresilian (1994a) and Hecht et al. (1996).

People with only one eye could learn to use head parallax for judging distances. However, 12-year-old children with one eye patched, and children who had been monocularly enucleated before 2 years of age, did not move their heads from side to side more than did binocularly sighted children when making depth judgments with a Howard-Dolman apparatus (Gonzalez et al. 1989). Nevertheless, the depth discrimination of monocular children improved when they were shown how to use head parallax. Monocularly enucleated adults did produce larger and faster lateral and vertical head movements than did adults with normal vision (Marotta et al. 1995). They had presumably learned to use head parallax without explicit instruction. Some animals use head parallax (head bobbing) to judge the distances of objects (Section 33.7).


28.2.2 PARALLAX AND THE DISTANCE OF 2-D DISPLAYS

The main binocular disparities are parallel to the interocular axis. The direction of parallax due to head motion is parallel to the direction of head displacement, which can be any direction. However, for both disparity and motion parallax, there are secondary effects in the orthogonal directions when the stimulus subtends more than a few degrees.

A single monocular view of a stationary unfamiliar scene does not provide information about the absolute distance to any part of the scene, even if it contains textured surfaces. There are always “equivalent configurations” that generate the same optic array (Ames 1955). The same is not true of views from two or more simultaneous vantage points. Mayhew and Longuet-Higgins (1982) showed that just four noncoplanar points seen from two vantage points provide sufficient information to recover the complete 3-D structure of the points.

The information about absolute distance provided by two vantage points can be appreciated by considering a rectangular surface lying in a frontal plane straight ahead of the observer. If the surface is close to the observer, the right-hand edge subtends a larger angle to the right eye than to the left eye because it is closer to the right eye. The opposite is true of the left-hand edge (Figure 28.2A). However, if the surface is far away but subtends the same visual angle, the angular subtenses of the left-hand and right-hand edges approach the same value, since the distances of the edges from the two eyes become similar (Figure 28.2B).

Figure 28.2. Perspective images from two vantage points. Perspective images created by viewing a frontal checkerboard of the same angular extent from either (A) a close distance or (B) far away. When the surface is close, there is an equal and opposite horizontal gradient of vertical size in the two images because each vertical edge of the surface lies at different distances and therefore subtends different angles at each eye. When the surface is far away, the two images are almost identical. (Adapted from Rogers and Bradshaw 1993)

Johansson (1973) had observers view a display of four lights at the corners of a square inclined 60˚ to the vertical and subtending a constant visual angle of 10˚. The set of lights was presented in dark surroundings to one eye at distances from 30 to 240 cm. Subjects viewed the test square as they moved the head from side to side. They then switched to seeing a vertical rod binocularly in illuminated surroundings, with head stationary. Subjects set the rod to the perceived distance of the test square. On average, the probe was set close to the actual distance of the test stimulus. Thus distance judgments based on head parallax were the same as those based on all other depth cues combined. Johansson argued that the absolute distances of the test square were judged accurately. But this assumes that the distance of the rod was accurately registered. A depth probe indicates only the relative accuracy of judgments under two cue conditions, not absolute accuracy. In any case, these results do not prove that subjects had correctly registered the head motion. An error in registered head motion may have canceled an opposite error in registered image motion.

The angle subtended by a vertical line to one eye divided by its subtense to the other eye is the binocular vertical-size ratio. The binocular image of a frontal textured surface contains a horizontal gradient of vertical-size ratios extending out from the midline. For a given interocular distance, this gradient fully specifies the slant of a surface with respect to the cyclopean line of sight. Also, as a vertical line moves away along any cyclopean line of sight, the vertical-size ratio tends to unity. Thus, if the interocular distance is known, the absolute distance to a frontal surface can be derived from the pattern of vertical disparity. Rogers and Bradshaw (1993) showed that the visual system exploits this type of binocular perspective for judging absolute distance and size, and for scaling horizontal disparities as a function of absolute distance.

The situation is similar in the motion parallax domain. Let an observer move from side to side while fixating the center of a frontal textured surface with one eye. The angular subtenses of the right and left edges change in opposite directions, as the head moves closer to one edge or the other. Thus, a spatial gradient of vertical size moves over the retina as the head moves from side to side.
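The distance dependence of the vertical-size ratio is easy to verify numerically. In this minimal sketch (parameter values illustrative), a vertical edge at a fixed angular eccentricity is viewed from two distances; the ratio of its subtenses at the two eyes falls toward unity as distance increases.

```python
import math

def vertical_subtense(height, lateral_offset, viewing_dist):
    """Angle (rad) a vertical edge of given height subtends at an eye
    displaced laterally by lateral_offset from the edge, for a frontal
    plane at viewing_dist."""
    r = math.hypot(viewing_dist, lateral_offset)
    return 2 * math.atan(height / (2 * r))

iod = 0.064                          # interocular distance (m)
for D in (0.5, 5.0):
    x = 0.4 * D                      # right-hand edge at a fixed eccentricity
    right = vertical_subtense(0.1, x - iod / 2, D)
    left = vertical_subtense(0.1, x + iod / 2, D)
    print(D, right / left)           # vertical-size ratio: ~1.045 near, ~1.004 far
```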


As the head moves through the interocular distance, the sequential shift of vertical-size disparity equals the simultaneous shift of one image relative to the other with binocular viewing. In both cases, the shift in the gradient of vertical size is due to the change in the slant of the surface with respect to the visual axis. When the surface is very far away, sideways motion of head or surface produces no visible change. If the magnitude of head displacement is known, the absolute distance to the surface could be derived from the changing gradient of vertical size.

The pattern of shear parallax produced by moving a surface from side to side in a frontal plane contains two components—a translation and a rotation. Rotating a surface about its midvertical axis produces rotation but no translation. Translating a surface from side to side while rotating it so that it remains orthogonal to the visual axis of one eye produces translation but no rotation. When the distance or magnitude of translation of a surface moving in a frontal plane is incorrectly registered, the surface appears to rotate as it translates. In other words, part of the parallax due to translation is interpreted as being due to rotation.

Rogers and Collett’s (1989) model of interactions between parallax and disparity (Section 30.2.5) predicts that a binocularly viewed surface moving in a frontal plane will appear to rotate around a vertical axis by different amounts as the apparent distance of the surface is changed by manipulating the spatial gradient of vertical size. In the same way, the perceived rotation of a monocularly viewed textured surface will change when perspective cues to distance are manipulated. Also, manipulating perspective cues to distance affects the perceived depth in a simulated corrugated surface (Rogers and Bradshaw 1991, 1992).

28.2.3 PARALLAX AND THE DISTANCE OF 3-D DISPLAYS

Forward-backward motion of a 3-D display generates an alternating pattern of looming motion that contains parallax information about the mean distance of the display. Peh et al. (2002) generated a looming 2-D random-dot display of radius 6˚ or 15˚. Viewed monocularly, it created the impression of an opaque 3-D sphere moving to-and-fro in depth through 18 cm. The sphere was presented at nine simulated distances from the subject. When it was observed with a stationary head (object motion), verbal estimates of the distance of the center of oscillation of the sphere were reasonably accurate. However, judgments became less accurate when the amplitude of motion-in-depth was irregular. When looming was coupled to forward-backward movements of the head (self-motion), distance judgments were reasonably accurate even when the amplitude of motion was irregular. This indicates that subjects used perceived head velocity to judge distance when motion amplitude was irregular.
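The looming information in such displays follows from small-angle geometry: an object of width w approaching at speed v from distance D changes its angular size at a rate of about wv/D². A minimal sketch, with illustrative values and a hypothetical function name:

```python
def looming_rate(width, dist, approach_speed):
    """Rate of change of angular size (rad/s) for an approaching object,
    using the small-angle approximation theta = width / dist."""
    return width * approach_speed / dist ** 2

# Parts of an object at 1.0 m and 1.2 m loom at different rates,
# which is looming parallax.
print(looming_rate(0.1, 1.0, 0.5))   # 0.050 rad/s
print(looming_rate(0.1, 1.2, 0.5))   # ~0.035 rad/s
```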

Sideways motion of a 3-D display generates a shearing pattern of parallax that contains information about the distance of the display. Panerai et al. (2002) created a projected image that simulated a sphere of random dots. In one condition, the subject’s head moved from side to side. In a second condition the sphere moved from side to side. In each case, the simulated sphere executed the appropriate pattern of parallax. The angular subtense of the stimulus was held constant so that motion parallax was the only cue to distance. The accuracy and precision of monocular judgments of the absolute distance of the sphere were obtained for simulated distances between 30 and 360 cm. With regular movements of head or stimulus, distance judgments were similar under the two viewing conditions. However, when the amplitude of head or stimulus motion varied from trial to trial, subjects could make accurate distance judgments only when they moved the head. It was concluded that, in this condition, subjects used signals associated with pursuit movements of the eyes.

28.3 PARALLAX AND RELATIVE DEPTH

28.3.1 RELATIVE DEPTH FROM MOTION PARALLAX

Since Wheatstone’s time, many experimenters have demonstrated the precision and accuracy of judgments of relative depth based on disparity (Chapter 18). However, results from studies of motion parallax as a cue to relative depth have been more difficult to interpret. We will see that the precision and accuracy of depth judgments based on motion parallax depend on many factors.

The simplest stimulus for linear motion parallax as a cue to relative depth consists of two objects at different distances. When the head moves sideways with the eyes maintaining a constant orientation in the head, the image of the near object moves faster than that of the far object. The same relative motion of the images is produced when the two objects translate in the same direction and at the same velocity with respect to a stationary head and stationary eyes. The relative velocity of the two images is proportional to the depth between the two objects and decreases as both objects are moved further away. Motion parallax is canceled when both the head and the objects translate in the same direction at the same velocity. One might therefore expect that the faster of two objects moving horizontally on a flat screen would appear nearer.

Bourdon (1902) exposed two spots of light to one eye in dark surroundings. The spots subtended the same visual angle but were at different distances. They appeared at the same distance when the head was fixed. Their relative depth, but not their absolute distances, became apparent when the head moved. Eriksson (1974) used a three-dimensional array of self-luminous simple shapes in a dark room. When stationary observers viewed the display monocularly, the perceived relative depths of the stimuli were determined by their relative angular sizes, although this was not their true depth order. However, the lights appeared in their true relative depths when the observers moved along an oblique path. Degelman and Rosinski (1979) used two objects at different distances that subtended the same visual angle. Subjects perceived the correct relative depth of the objects when parallax was generated by head motion but performed poorly when it was generated by sideways motion of the objects. Even with head motion, the magnitude of depth was underestimated.

Gibson et al. (1959) superimposed two arrays of dots on a flat screen, with the dots in each array moving together but the two arrays moving in the same direction at different velocities. For stationary observers the two arrays of dots appeared to separate in depth. However, the faster display did not necessarily appear nearer and, for many observers, the depth order periodically reversed. This issue is discussed further in Section 28.3.3e.

Gibson stressed the importance of patterns of optic flow generated by motion of an inclined textured surface, especially the spatial gradient of velocity (Gibson 1950a; Gibson et al. 1955). The first attempts to generate an impression of a surface in depth from a spatial gradient of velocity in a 2-D display of dots were not successful (Gibson and Carel 1952). Most subjects saw a collection of moving dots rather than a continuous surface, which suggests that the dots were too far apart. Gibson et al. (1959) produced a spatial gradient of velocity in a 2-D display with a high density of texture elements. The velocity gradient corresponded to an inclination of 45˚. Most subjects saw a rigid surface inclined top away. The mean estimated inclination was 40˚.

Rock (1984) reported that observers were not able to make reliable judgments about the relative depth of a small number of disks at different absolute distances while making side-to-side head movements with only one eye open. Like Epstein and Park (1964) and Gogel (1977), Rock concluded that “motion parallax does not by itself seem to be a cue to distance or depth.” Hagen and Teghtsoonian (1981) found that head-motion parallax did not improve estimates of depth between objects on a textured surface for either monocular or binocular viewing. But even without motion parallax the stimulus contained the strong monocular depth cues of the texture gradient on the surface plus height in the field.

Rogers and Graham (1979) produced unequivocal evidence that motion parallax alone can produce strong impressions of relative depth. They displayed a random-dot pattern on an oscilloscope at a distance of 57 cm. While viewing the pattern, subjects moved the head from side to side through 13 cm. The motion of the head was coupled to the random-dot display in such a way that the pattern of dots was transformed to mimic the shearing motion parallax produced by a real 3-D surface with horizontal corrugations, as shown in Figure 28.3. The depth of the corrugations was determined by the magnitude of relative motion of the peaks and troughs of the corrugations per unit amplitude of head motion. The spatial frequency of depth corrugations varied between 0.05 and 1.6 cpd. Observers had no problem differentiating between the different depth profiles and could, with reasonable accuracy, match the perceived depth to that produced by an equivalent disparity in a static surface. When the observer’s head was stationary, the random-dot display appeared flat. The results were similar when observers were stationary and parallax motion was coupled to sideways movements of the oscilloscope, thereby simulating parallax created by a translating corrugated surface (Figure 28.4). In contrast to several earlier studies, Rogers and Graham found no ambiguities in the direction of depth effects—rows of dots that moved in the opposite direction to the observer’s movement (or in the same direction as the moving oscilloscope) appeared in front, as the geometry predicts. They concluded that simulated parallax produces a compelling impression of three-dimensionality not unlike that found with binocular stereopsis.
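The head-coupled transformation used in such experiments can be sketched as follows. This is a minimal reconstruction of the idea, not the original apparatus code; the corrugation amplitude, spatial frequency, and small-angle parallax approximation are illustrative assumptions.

```python
import numpy as np

def dot_shift(y, head_x, dist=0.57, depth_amp=0.005, cpd=0.3):
    """Horizontal screen shift (m) of a dot at height y (m) that mimics
    the shear parallax of a sinusoidal depth corrugation of spatial
    frequency cpd (cycles/deg) viewed at dist, for head position head_x (m)."""
    elevation_deg = np.degrees(np.arctan2(y, dist))
    depth = depth_amp * np.sin(2 * np.pi * cpd * elevation_deg)
    return head_x * depth / dist          # small-angle parallax approximation

y = np.linspace(-0.1, 0.1, 7)
print(dot_shift(y, head_x=0.065))         # shear profile for one head position
```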

28.3.2 THRESHOLDS FOR DEPTH FROM MOTION PARALLAX

28.3.2a Sensitivity to Relative Motion

Motion parallax can indicate relative depth only if the relative motion is detected. Tschermak-Seysenegg (1939) was the first person to determine the threshold for detection of relative motion. He used an instrument that he called a “parallactoscope.” With one eye closed, the subject moved the head from side to side while setting two vertical rods to equidistance. The mean error was small, but larger than that obtained with binocular viewing.

Graham et al. (1948) placed two vertical needles, one above the other. Both needles moved from side to side in the same direction at the same constant speed across an illuminated background. Subjects viewed the needles monocularly and moved one of them forward or backward until they appeared at the same distance. When the needles appeared equidistant, motion parallax was no longer discernible. Graham et al. derived the following equation relating the relative motion threshold, ω_t, to the depth threshold, d_t:

ω_t = (d_t / D²) v   (in radians)   (2)

where D is the viewing distance of the needles and v is the velocity of their side-to-side motion. The results indicated a mean threshold of relative velocity of 30 arcsec/s. The threshold decreased with decreasing intensity of illumination and increased with increasing velocity of the needles. The threshold was about twice as high for motion of horizontal needles along a vertical axis as for motion of vertical needles along a horizontal axis. At suprathreshold velocities, some subjects saw relative motion and others saw the rods separate in depth (see also Zegers 1948).

Golomb et al. (1985) measured the smallest detectable amplitude of shearing motion. They used a standing wave of horizontal shearing motion in an array of random dots. The amplitude threshold was lowest for a temporal frequency of shearing motion of 2 Hz and a spatial frequency of the standing wave of between 0.1 and 0.6 cpd. Monkeys and humans produced similar results.

Figure 28.3. Motion parallax transformations. When the square-wave surface depicted in (A) is viewed by a moving observer, relative motion is created between the surface features at different distances. This is motion parallax. (B) shows how the random-dot pattern on the computer monitor was transformed with side-to-side head movements to simulate motion parallax. In the experiment, the transformation was continuous rather than discrete, and the edges of the transforming pattern, which provide information about the surface shape, were not visible. (Redrawn from Rogers and Graham 1979)

Figure 28.4. Simulating object-produced motion parallax. The apparatus used by Rogers and Graham (1979) to simulate object-produced motion parallax. The monitors were translated together on a platform and the random-dot patterns were transformed in step with the platform movement to simulate the motion parallax that would be produced by a 3-D surface translating across the observer’s line of sight. The screens could be independently blanked to provide monocular parallax, or both patterns could be made visible to study interactions between motion parallax and binocular disparity.
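Assuming the reconstructed form of Equation 2, the depth threshold implied by a given relative-motion threshold can be computed directly. A minimal sketch with illustrative values:

```python
import math

def depth_threshold(motion_threshold_arcsec_per_s, viewing_dist, speed):
    """Depth threshold d_t (m) from Equation 2, omega_t = (d_t / D^2) v,
    with the relative-motion threshold given in arcsec/s."""
    omega = math.radians(motion_threshold_arcsec_per_s / 3600.0)
    return omega * viewing_dist ** 2 / speed

# A 30 arcsec/s motion threshold, with needles at 0.5 m moving at
# 0.05 m/s, implies a depth threshold of about 0.7 mm.
print(depth_threshold(30.0, 0.5, 0.05))
```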

28.3.2b Sensitivity to Depth Produced by Motion Parallax

Rogers and Graham (1982, 1985) used depth corrugations with a difference-of-Gaussians depth profile. They measured disparity thresholds for detecting depth corrugations in a stationary random-dot stereogram as a function of the spatial frequency of the corrugations. Their results replicated those of Tyler (1975a) discussed in Section 18.6.3. The disparity threshold reached its lowest value of about 20 arcsec at a spatial frequency of depth corrugations of 0.3 cpd (see Figure 28.5). They then measured the threshold amplitude of head-generated motion parallax for detection of depth corrugations in a monocularly viewed random-dot display. The peak sensitivity also occurred at a spatial frequency of about 0.3 cpd.

The depth created by motion parallax can be expressed in terms of the disparity required to produce the same depth. In this way the depth threshold for a disparity-defined display can be compared with that for a motion-parallax display. The fall-off in the depth threshold for motion parallax at low and high spatial frequencies was similar to that for disparity-defined corrugations (Figure 28.5). Rogers and Graham showed that the fall-off in sensitivity at low spatial frequencies was not an artifact of the smaller number of cycles visible on the screen. However, depth thresholds for disparity-defined corrugations were typically half those for parallax-defined corrugations. Some of this difference might have been due to the fact that parallax thresholds were measured under the more difficult conditions of monocular viewing and while the observer moved the head from side to side. Bradshaw and Rogers (1996) and Ichikawa and Saida (1998) reported a similar advantage for disparity-defined depth.

In their original study, Rogers and Graham (1979) used the method of limits to obtain thresholds for detecting 3-D structure. Bradshaw and Rogers (1999) used a forced-choice procedure in which observers judged whether the central corrugation was concave or convex. Thresholds were from 2 to 4 arcsec of peak-to-trough depth for disparity-defined corrugations of 0.3 cpd (Figure 18.31), and around 8 to 10 arcsec of equivalent depth for motion-parallax corrugations. Thresholds were much lower than in the original study and were lower for horizontal than for vertical corrugations. The psychophysical procedure did not affect the overall similarity of the sensitivity functions or the lower thresholds for disparity-defined corrugations.

Cornilleau-Pérès and Droulez (1993) compared thresholds for detection of curvature in surfaces defined either by binocular disparities or by motion parallax (Portrait Figure 28.6).

Figure 28.5. Comparison of disparity and parallax threshold functions. Thresholds for detecting corrugations defined by disparity and motion parallax as a function of corrugation frequency using a method of limits, for three observers (BJR, GQ, and MEG). The shapes of the disparity and parallax functions are similar for the three observers, with lowest thresholds at 0.3 cpd. Averaged over the three observers, disparity thresholds were half the parallax thresholds. (Redrawn from Rogers and Graham 1982)

Parallax simulated a ±12.45˚ rotation of a textured sphere of variable radius around either a vertical or a horizontal axis. Observers made a forced-choice discrimination between planar and curved surfaces. The threshold for motion around a horizontal axis was similar to that for motion around a vertical axis. For five of six observers, thresholds were lower for motion-defined surfaces than for disparity-defined surfaces. The lowest thresholds occurred when both disparity and motion cues were present. The 75% points on the psychometric functions indicated that observers could discriminate a planar from a curved surface when its radius of curvature was about 1 m.

The higher sensitivity for motion parallax than for binocular disparity conflicts with the results reported by Rogers and Graham (1982). Cornilleau-Pérès and Droulez calculated the equivalent disparities of their surfaces at threshold. While the thresholds for parallax were similar to those of Rogers and Graham, thresholds for disparity were at least four times higher. Figure 20.38 shows that the lowest thresholds for discriminating the direction of curvature in parabolic corrugations were equivalent to a radius of curvature of 5 m (Cagenello and Rogers 1989). Cornilleau-Pérès and Droulez suggested that their higher disparity thresholds were due to the lower dot density of their images (400 dots, compared with the 32,000 dots used by Rogers and Graham over a similar area). The pixel size in their displays was 2 arcmin, but disparities smaller than this were simulated by temporal interpolation. Limitations of the temporal interpolation procedure compared with the (theoretically) infinite resolution of Rogers and Graham’s (1982) analog technique may also have contributed to the higher disparity thresholds obtained by Cornilleau-Pérès and Droulez.

The rhesus monkey is sensitive to motion parallax as a cue to depth. Cao and Schiller (2002) trained monkeys to detect the depth of a square region in a random-dot display and to discriminate which of four squares was in a different depth plane from the other three. Depth specified by parallax was detected about as well as that specified by disparity alone. Depths produced by 4.0 and 10.1 arcmin of disparity were detected in almost 100% of trials. Depth discriminations based on disparity were more rapid and somewhat more accurate than those based on parallax. However, the depth range was one over which disparity would be expected to be superior to parallax. When the two cues were presented together but in conflict, the monkeys tended to use disparity.

De Vries and Werkhoven (1995) compared perceived slant and inclination produced by parallax with slant and inclination produced by binocular disparity. Slant about a vertical axis of a flat surface in a random-dot display induced by two-frame motion parallax appeared larger than slant defined by an equivalent binocular disparity. The same was true for inclination about a horizontal axis. However, for curved surfaces, motion-induced depth was underestimated relative to disparity-defined curvature. Perhaps there is no general answer to the question of whether parallax or disparity is the more effective depth cue, because there are so many variables that can affect the two cues in different ways.

28.3.2c Effects of Head Motion

Cornilleau-Pérès and Droulez (1994) asked whether self-motion increases sensitivity to parallax-induced depth. Thresholds for discriminating between a frontal textured surface and a surface curved in depth were lower when parallax was produced by lateral head motion than when it was produced by object translation. They concluded that this was because the sideways motion of the retinal image is more effectively stabilized when the subject moved than when the object moved. With self-motion, optokinetic nystagmus is supplemented by vestibular nystagmus and by better prediction of the motion. Parallax shear due to depth can be detected more easily when the overall sideways motion of the image is stabilized. This conclusion is supported by the fact that the depth threshold was lowest when the simulated 3-D surface sheared about a fixed point. In these experiments the displays subtended only 8˚. In a later study from the same laboratory, with displays subtending 90˚, depth sensitivity with self-motion was the same as that with object-motion (Dijkstra et al. 1995). Optokinetic pursuit has a higher gain with a large display, so that it does not need to be supplemented by vestibular nystagmus.

Figure 28.6. Valerie Cornilleau-Pérès. Born in Paris in 1961. She obtained a degree in engineering in 1984 and a Ph.D. with J. Droulez in visual neuroscience in 1990. Between 1998 and 2001 she worked in the Singapore Eye Research Institute. In 2002 she moved to the Laboratoire de Neurosciences Fonctionelles et Pathologies in Lille.

Van Damme and van de Grind (1996) also found that motion parallax coupled to head movements improved the ability of subjects to discriminate the depth curvature of surfaces in displays of random dots with limited dot lifetime. However, the detection threshold for motion, as indicated by the effects of added noise, was lower without head movement than with it. They concluded that the advantage of head movement is not due to better detection of relative motion but rather to the contribution of head motion to the detection of structure from motion. There may have been some confounding factors in these experiments, such as the presence of a fixation point in the motion-detection experiment and the short dot lifetime in the parallax experiment. Jobling et al. (1997) obtained lower depth thresholds when parallax between three rods was induced by head motion rather than by rod motion. This was true at all contrasts, both for subjects with normal vision and for subjects with reduced vision.

The above experiments were concerned with the effects of head motion on depth-detection thresholds in curved surfaces. Van Boxtel et al. (2003) inquired whether head motion improves the precision and accuracy of judgments of the direction and magnitude of slant produced by optic flow of a random-dot surface. In one condition the optic flow was associated with to-and-fro head movements. In the other condition the same optic flow occurred without head movements. When the head was mobile, there were fewer perceptual reversals of the direction of slant, and perceived slant was more closely related to simulated slant. Since the optic flow was the same in the two conditions, these results demonstrate that head motion contributes to the accuracy of slant perception from optic flow.

28.3.3 STIMULUS FEATURES IN MOTION PARALLAX

28.3.3a Depth Segregation by Simple Motion Parallax

Two random-dot displays superimposed in the same depth plane and translating with different speeds or in different directions are usually seen as two surfaces in different depth planes. However, the perceived depth order of the planes is unstable and periodically reverses (Gibson et al. 1959). For superimposed gratings moving in opposite directions, the one with the higher spatial frequency tends to appear in front (Moreno-Bote et al. 2008). The magnitude of perceived depth between two random-dot surfaces moving in the same direction increased almost linearly as the velocity ratio increased from 1 to 3 (Andersen 1989) (Portrait Figure 28.7).

Figure 28.7. George J. Andersen. Born in Oklahoma City in 1958. He obtained a B.A. in 1980 and a Ph.D. with Myron Braunstein in 1985. Both degrees were in psychology from the University of California at Irvine. He held appointments in the Department of Psychology and the Institute of Aviation at the University of Illinois, Champaign-Urbana, from 1985 to 1990. In 1990 he moved to the Department of Psychology, University of California at Riverside, where he is now a professor.


However, depth separations were underestimated relative to theoretical values, probably because subjects underestimated absolute distance.

Consider two objects at different distances moving at the same linear speed in frontal planes with respect to a stationary eye. The motion of the images can be decomposed into a common motion (the motion of the slower image) and a relative motion (the difference between them). In the kinetic depth effect and the stereokinetic effect, described in Section 28.5, there is no common motion. The present section is concerned with displays containing both common and relative motion.

Now consider two stationary objects at distances a and b from an observer moving sideways through distance d at velocity v. For a small movement, the angular displacement, θ, and velocity, ω, of the image of one object relative to that of the other are:

θ = d (1/a − 1/b)   and   ω = v (1/a − 1/b)   (3)
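A numerical check of Equation 3, with illustrative values:

```python
def relative_parallax(d, v, a, b):
    """Equation 3: relative angular displacement (rad) and angular
    velocity (rad/s) for objects at distances a and b (m), given head
    displacement d (m) and head velocity v (m/s)."""
    factor = 1.0 / a - 1.0 / b
    return d * factor, v * factor

theta, omega = relative_parallax(d=0.065, v=0.2, a=0.5, b=1.0)
print(theta, omega)  # 0.065 rad of relative displacement, 0.2 rad/s of relative velocity
```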

For two objects at fixed distances, the relative velocity of the images is proportional to head velocity, and their common motion is proportional to head displacement. The common motion is with respect to the retina if the eye does not rotate. It is with respect to the head if the eye fixates one of the objects. The perceived relative depth between two objects (as indicated by their perceived relative size) increased with increasing velocity or increasing amplitude of head translation until the stimulus changes reached 20 times threshold (Hell 1978). Larger increases in head velocity or displacement had no effect on perceived depth. Any factors that affect the detection of relative motion should also affect perceived depth from motion parallax. For example, increasing the lateral separation between two moving objects increases the threshold for detection of relative motion (Harvey and Michon 1974) and also the threshold for detection of parallax-induced relative depth (Hell and Freeman 1977).

Section 18.9 was concerned with how many superimposed stationary textured surfaces can be detected when relative depth is indicated by binocular disparity. Andersen (1989) asked how many superimposed random-dot surfaces can be detected when depth is indicated by motion parallax. He used between one and five superimposed random-dot displays with dot density adjusted so that the total number of dots remained constant, but dot density in each plane decreased as the number of planes increased. In one set of trials, the different displays translated horizontally at different velocities in the same direction, simulating displays in different depth planes. In other trials, the dots in each display moved radially at different velocities, simulating motion-in-depth. For both types of motion, subjects could detect up to three surfaces with reasonable accuracy. Most subjects reported seeing the distinct surfaces rather than merely counting the number of distinct velocities in the total display. However, the difference in velocity between the fastest and slowest displays was constant, so that detection of more than three planes may have been difficult because of the decrease in the relative velocity between neighboring displays. Since the displays were presented for as long as subjects wished and eye movements were not controlled, it is not possible to say whether the different displays were detected at the same time or sequentially as the eyes pursued now one and now another surface.

De Bruyn and Orban (1993) superimposed two random-dot patterns with differing patterns of optic flow. The patterns were expansion, contraction, and clockwise and counterclockwise rotation. The displays were presented for only three sequential frames occupying a total duration of 85 ms, followed by a display of randomly moving dots. Subjects could identify the patterns of optic flow presented one at a time but not the component patterns in superimposed pairs of patterns. However, when primed to detect one or the other pattern, they could do so. They could identify both superimposed patterns in the same display only with longer viewing time. De Bruyn and Orban concluded that superimposed moving patterns are processed sequentially. This process may be helped by tracking first one pattern and then the other with pursuit eye movements.

Ullman (1979) superimposed the projected images of two concentric vertical random-dot cylinders. When stationary, the display appeared as a random array of dots. Two cylinders were seen when the cylinders rotated about a vertical axis at different velocities. Liter et al. (1994) superimposed two random-dot displays, each simulating a transparent sphere rotating about its center at 1.5˚ per frame. The two spheres differed only in the angle, in 3-D space, separating their axes of rotation. Subjects were more accurate at distinguishing between a single sphere and two spheres as the angle between the axes increased to 15˚, the largest angle tested. However, subjects did not report seeing two spheres but rather detected that not all the dots were moving in the same way. In this display, there was no difference in velocity to help subjects segregate the displays in depth. Perhaps they would have seen two spheres with a larger angle of separation of the axes.

We will see in Section 28.4.1 that people use spatial derivatives of velocity in judging the shapes of 3-D objects. Caudek and Rubin (2001) developed an algorithm for the segregation of superimposed moving patterns based on spatial derivatives of velocity. Each local set of three texture elements in the projected image of a rigid moving object undergoes a specific deformation (def), divergence (div), and rotation (curl), which can be computed from two frames. These derivatives of velocity are constant over the image of a rotating plane and vary in a systematic way over the image of a rotating curved surface.


The image of two superimposed patterns moving at different velocities contains spurious sets of elements derived from both patterns. The deformation of these sets will generally vary randomly over the whole image. The two moving patterns can therefore be segregated by registering the peaks in the distribution of local deformations in the velocity field. Caudek and Rubin measured the ability of subjects to segregate superimposed surfaces defined by dot patterns moving at different velocities. Performance improved as more dots were added to the surfaces but declined with increasing surface curvature. With corrugated surfaces, performance decreased as corrugation amplitude or spatial frequency was increased. The algorithm also performed well only for plane surfaces or for surfaces that were not highly curved. Since most complex surfaces are opaque, we do not often encounter the conditions in which this algorithm fails.
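The peak-registration idea can be illustrated with a toy sketch. This is not Caudek and Rubin's algorithm, only a minimal illustration of the principle that deformation values from rigid patterns cluster while those from spurious mixed triplets scatter; all distributions below are invented.

```python
import numpy as np

def deformation_peaks(def_values, n_bins=20, peak_frac=0.5):
    """Return histogram-bin centres where local deformation (def) values
    cluster; each peak is taken as evidence for one rigid moving pattern."""
    hist, edges = np.histogram(def_values, bins=n_bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres[hist >= peak_frac * hist.max()]

rng = np.random.default_rng(1)
pattern_a = rng.normal(0.10, 0.005, 200)  # triplets from the first pattern
pattern_b = rng.normal(0.30, 0.005, 200)  # triplets from the second pattern
spurious = rng.uniform(0.0, 0.4, 100)     # mixed triplets scatter randomly
print(deformation_peaks(np.concatenate([pattern_a, pattern_b, spurious])))
# two clusters, near 0.10 and 0.30 -> two patterns
```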

28.3.3b Effects of Velocity Gradients of Motion Parallax

Figure 28.8A shows that an inclined textured surface translating horizontally produces two components of motion parallax—a common horizontal translation and changing shear. The result is a vertical velocity gradient. A slanted surface translating horizontally produces common motion and image compression. Figure 28.8B shows that two surfaces with different inclinations about a horizontal axis can produce the same shear if their relative translational velocities are suitably adjusted. Similarly, two surfaces with different slants about a vertical axis produce the same image compression if translation velocities are suitably adjusted.

The orientation of the axis of rotation of a surface is unambiguously defined by motion parallax. The direction of the parallax velocity gradient specifies the sign of inclination (top away or top near). However, the sign of inclination is ambiguous when the relative magnitudes of retinal shear and translation are not known. The relative magnitudes of these components vary with the velocity of pursuit eye or head movements. When the moving surface is accurately pursued, the retinal translation component is zero. The magnitude of the velocity gradient then specifies the magnitude of inclination. A misregistration of either motion component leads to an error in the estimation of inclination (see Domini and Caudek 1999; Freeman and Fowler 2000). This and related ambiguities are discussed in Section 28.5.3b.

Figure 28.8. Velocity gradients from an inclined surface. (A) A surface inclined top away and translating horizontally along a frontal-plane axis produces an image with a vertical velocity gradient. The gradient is, approximately, the sum of a shear component due to rotation of the surface about the gaze normal and a translatory component. A surface inclined top near produces gradients with signs reversed. (B) Two surfaces with different inclinations moving at different speeds can produce the same shear. Therefore, correct registration of surface inclination requires accurate registration of both shear and translatory image motion.

Braunstein and Andersen (1981) used a gradient of horizontal motion parallax in a random-dot display to simulate a convex or concave V-shaped horizontal ridge. Accuracy in judging the sign of depth increased as peak velocity increased to 10˚/s, which was the maximum value tested. Accuracy also increased as duration increased to 10 s, the maximum tested. The crucial factor was probably the steepness of the velocity gradient rather than peak velocity.

Subjects were more accurate at discriminating between a motion-generated corrugated surface and a random display when corrugation amplitude was increased (Andersen 1996). An increase in amplitude entailed an increase in the ratio of maximum to minimum velocity from 1.75 (1.75˚/s, 1˚/s) to 3.3 (2˚/s, 0.6˚/s). Thus, for a corrugation with a given spatial period, sensitivity to motion-induced depth increased with an increase in velocity gradient, at least up to a certain value. For a fixed corrugation amplitude, the velocity gradient increases with increasing spatial frequency of the corrugation, and we know that depth sensitivity reaches a peak at about 0.3 cpd.

Braunstein and Tittle (1988) produced the opposite result. With a kinetic depth display (KDE) (Section 28.5) representing a single V-shaped ridge, accuracy of depth judgments declined as the velocity ratio increased from 1.12 to 3. They argued that the display with the higher velocity ratio appeared to deform rather than to be in depth. Ono et al. (1986) found that, with parallax created by sideways motion of the head, perceived depth gave way to shearing motion as parallax amplitude became large.

The simulated depth of a ridge of a given angular size increases as the velocity gradient is increased. But the perceived velocity gradient, and hence the perceived depth, may be influenced by the velocity of a frontal textured surface surrounding the ridge. Zhong and Braunstein (2004) investigated this question. They used a stimulus in which motion parallax simulated the polar projection of a concave V-shaped ridge. As the simulated depth of the ridge increased, the dihedral angle became smaller. An estimate of perceived depth could therefore be obtained by asking subjects to set the angle between two lines to match the dihedral angle of the ridge. Perceived depth of the ridge was underestimated in all conditions. However, perceived depth increased as the speed of the textured background was increased, until the background was moving at the same speed as the outer edges of the ridge. Perceived depth decreased with further increases in the speed of the background.

We will see in Section 28.4 that spatial gradients of velocity and their derivatives are important for the perception of the shapes of 3-D surfaces.
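The ambiguity illustrated in Figure 28.8B can be expressed in one line: the image velocity at elevation y is a common translation plus a shear gradient, so displays differing only in the translation term contain identical shear. A minimal sketch, with illustrative values:

```python
import numpy as np

def image_velocity(y, translation, shear):
    """Horizontal image velocity at elevation y: a common translation plus
    a vertical shear gradient, as in Figure 28.8A."""
    return translation + shear * y

y = np.linspace(-0.1, 0.1, 5)
print(image_velocity(y, translation=0.05, shear=0.3))
print(image_velocity(y, translation=0.10, shear=0.3))
# Same gradient, different common motion: without accurate registration of
# the translation component, the two inclinations cannot be distinguished.
```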

28.3.3c Spatial-Frequency Bandwidths for Motion Parallax

The bandwidth of a visual channel tuned to a particular spatial frequency of luminance modulation can be determined by measuring how a stimulus of that spatial frequency is masked by the addition of stimuli with neighboring spatial frequencies. Hogervorst et al. (2000) used this method to determine the spatial-frequency bandwidth of visual channels tuned to parallax-induced modulations in depth. They measured the elevation in threshold for detection of a parallax-induced depth corrugation of one spatial frequency as a function of the spatial frequencies of two superimposed corrugations with flanking spatial frequencies. The results indicated that the visual channels for detection of parallax-induced corrugations have a spatial-frequency bandwidth of about 1.4 octaves. This is similar to the bandwidth of channels sensitive to disparity-induced depth corrugations (Section 18.6.3e).
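As a worked example of the octave arithmetic (my own illustration; the 0.4-cpd center frequency is arbitrary), a channel with a 1.4-octave bandwidth spans a range whose upper frequency limit is 2^1.4, or about 2.6, times its lower limit:

def octave_band(center, octaves=1.4):
    """Lower and upper frequency limits of a channel spanning the given
    number of octaves, centered geometrically on `center` (cycles/deg)."""
    half = 2.0 ** (octaves / 2.0)
    return center / half, center * half

print(octave_band(0.4))   # roughly (0.25, 0.65) cycles/deg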

28.3.3d Surface Rigidity

The laws of geometrical optics fully determine the pattern of motion parallax produced by a given visual scene. Thus, a visual scene can produce only one pattern of motion parallax in a given visual system. However, different scenes may



produce the same image. In inverse optics, one maps an image onto the visual stimuli that produced it. But many different visual scenes can produce a given pattern of motion parallax. The mapping is one to many. Therefore, parallax cannot be used to fully recover the 3-D structure of a visual scene unless the observer has prior knowledge about certain features of the scene. One important feature is the rigidity of objects and the rigidity of the spatial layout of the scene. Motion parallax between a set of objects may be a consequence of the 3-D structure of the scene, but may also result from actual relative motion between the objects. Any monocular pattern of optic flow can be due to an infinite number of motion configurations, including one in which all object points move at different velocities across a flat surface.

Depth specified by binocular disparity is a necessary consequence of the instantaneous 3-D layout of the scene and the viewing geometry. It is not affected by the rigidity of the scene. Theoretically, motion parallax can arise from object deformation or depth. Therefore, there is a trade-off between perceived rigidity and perceived depth. The greater the perceived depth in a parallax display, the less it appears to move or shear. Head motion enhances perceived depth because it provides extra, nonvisual, information about the magnitude of relative motion between observer and display. Accordingly, a parallax display simulating a corrugated surface appeared deep and rigid when it was produced by head motion, and shallow and nonrigid when it was produced without head motion (Ono and Steinbach 1990). We will see in Section 28.3.3e that head motion can also remove the ambiguity about the sign of depth in a motion parallax display.

The number of possible interpretations of a parallax display is reduced substantially, and often to one or two, if it is assumed that objects depicted in the array are rigid (Johansson 1977; Ullman 1979). It has been proposed that the visual system uses a rigidity constraint in the interpretation of optic flow to limit the number of possible interpretations. Our visual world contains nonrigid objects such as animals or waves. It would be inappropriate, and often impossible, to apply the rigidity constraint to the entire optic-flow field. Todd (1982) and Koenderink (1986) have therefore proposed solutions to the structure-from-motion problem based on the weaker assumption of local rigidity or the assumption of nonrigidity (Koenderink and van Doorn 1986).

In an extended textured surface, several additional properties of disparity and motion parallax become evident. For a surface slanted around a vertical axis, the size, spatial frequency, and shape of images differ in the two eyes (Section 20.2). The same features of the monocular image change as the head moves horizontally. For a surface inclined about a horizontal axis, the orientation and shape of binocular images differ, and the same features of the monocular image also change as the head moves horizontally (Section 20.3). For a surface curved around a horizontal axis, there may


also be curvature differences between corresponding features in binocular images, and curvature changes of the same features with observer movement. Evidence that the human visual system makes use of these binocular differences was discussed in Chapter 20. For rotating planar surfaces, the deformation component of optic flow is the crucial stimulus factor underlying the ability to discriminate rigid from nonrigid 3-D objects (Domini et al. 1997).

28.3.3e Depth Sign in Motion Parallax

As the head moves, the relative motion of the images of stationary objects supplies information about their relative distances. With eyes and head fixed, the image of a nearer object moves faster than that of a far object when the objects move at the same speed. However, Gibson et al. (1959) found that the faster of two superimposed arrays of dots did not always appear to be the nearer of the two (see also Farber and McConkie 1979). People do not make full use of relative velocity to detect depth order for the following reasons:

1. They must correctly assume that the objects are stationary or, if the objects are moving, they must correctly register that motion.

2. They must allow for the effects of eye movements. Eye movements alter the motions of the images. For example, if the eyes track one of the objects, its image is stationary. Hayashibe (1991) found that the faster parts of a parallax display reversed from far to near when the eyes pursued the faster parts. Thus, retinal velocities do not necessarily indicate relative distances. Theoretically, the effects of eye movements can be allowed for if they are correctly registered or if there is a stationary object in view.

3. They must allow for the effects of head movements. If one of the objects is tracked with a rotation of the head, the rotation must be correctly registered. Even when the head is stationary, this must be correctly registered.

Theoretically, the ambiguity of depth sign is removed if the direction of the common motion is known. The nearer object is the one that moves relative to the other object in the same direction as the common motion. There are three sources of information about common motion.

1. Extraretinal signals The direction of common motion could be derived from the direction of tracking movements of the eyes or of the head. If the head is moved passively, signals from the otolith organs could indicate the direction of common motion. If the motion of the head is self-produced, then efference signals sent to the muscles add extra information. Signals arising from receptors in muscles, tendons, and the vestibular organs are proprioceptive signals. Efference copy signals associated with head or eye movements are motor signals. All these signals are referred to as extraretinal signals to distinguish them from signals generated by motion of retinal images.

Consider two abutting horizontal textured bands in a frontal plane, as in Figure 28.9. Sideways movement of the bands in opposite directions creates the impression of a square-wave depth corrugation. However, depth order is ambiguous and spontaneously reverses. Now let the head move sideways as one of the bands moves with the head while the other band remains stationary in space. The band that moves with the head appears to lie beyond the stationary band. When the other band moves with the head, the perceived depth order of the bands reverses (Hayashibe 1993). This follows from the geometry of motion parallax produced by observing two stationary objects at different distances while moving the head. For a given motion of the bands, perceived relative depth decreases as head velocity increases (Ono and Ujike 1994). This is because, as the depth between two bands increases, a smaller head velocity is required to produce a given parallax velocity. Thus, head movements can disambiguate depth order produced by motion parallax.

Figure 28.9. Stimulus motion, head motion, and depth order. (Adapted from Ono and Ujike 1994)

Ono and Ujike obtained the same results when an opposed motion aftereffect was produced in stationary bands after inspection of moving bands. In this case, the visual motion signal was open-loop because eye movements did not affect it. Nawrot (2003a) also obtained these effects with opposite motion aftereffects. He also had a condition in which the head was stationary and subjects followed the fixation cross as it moved to the left or to the right across the stationary display exhibiting the motion aftereffect. The bands that moved in the same direction as the eyes appeared more distant than the bands that moved in the opposite direction. He argued that, since the visual motion signal in the aftereffect was open-loop, any perceived change in depth order must have been due solely to the pursuit motion of either the head or the eyes.

In a final condition, the display containing the motion aftereffect moved sideways with the head. In this situation, the vestibuloocular reflex (VOR) is inhibited by visual fixation, that is, by signals arising in the smooth-pursuit system of eye-movement control. Nawrot concluded from the perceived depth order of parallax motion that pursuit eye movements evoked by visual signals, rather than those evoked by the VOR, are responsible for disambiguating the depth order of parallax motion. At greater viewing distances, the contribution of VOR to pursuit eye movements increases while the contribution of voluntary pursuit decreases. The decrease in the gain of the voluntary pursuit component with increasing distance was found to be associated with a decrease in perceived depth from motion parallax (Nawrot 2003b). Nawrot and Joyce (2006) produced further evidence in support of the crucial role of pursuit eye movements in the perception of depth order in a motion parallax display.

Naji and Freeman (2004) produced further evidence that signals arising from eye movements can disambiguate depth order in motion parallax. Depth order was correctly perceived when motion parallax in the projected image of a sinusoidally corrugated surface was yoked to eye movements. But depth order was ambiguous when a stimulus that contained the image motion resulting from eye movements was viewed with stationary eyes.

Nadler et al. (2009) recorded from cells in MT as alert monkeys viewed with one eye the motion parallax of a random-dot patch relative to a background. They used the following conditions. (1) The stationary monkey fixated a stationary spot. Depth sign was ambiguous in this condition. (2) The monkey was moved passively from side to side while fixating the stationary spot. This generated pursuit eye movements. (3) The whole display moved from side to side with the head so that the monkey did not have to move its eyes. (4) The whole display moved, with the monkey stationary. Cells in MT that responded selectively to depth sign combined visual motion signals and signals from eye movements in conditions (2) and (4), in which eye movements occurred. Cells did not respond selectively to depth sign in condition (3), in which only the head moved. Nadler et al. concluded that "a smooth eye movement signal is both necessary and sufficient to produce selectivity for depth-sign from motion parallax in area MT" (p. 530). But this conclusion is premature. The head movements were passive and therefore generated signals only from the otolith organs. Active head movements also produce motor signals, which may be just as effective as eye-movement signals.

2. Perspective changes The direction of common motion could also be derived from the direction of changes in perspective as the display moves away from the midline. For example, the image of a display moving to the right becomes compressed and tapered toward the right edge of the display. These changes are more pronounced with a large display and a large amplitude of common motion.

3. Motion relative to the background In illuminated surroundings, the common motion in a 2-D display is the motion of the display relative to stationary objects.

Rogers and Rogers (1992) isolated each of these sources of information using a 2-D display that simulated the parallax produced by a moving 3-D sinusoidal grating. The signals from self-motion were isolated by having the subject view the display on a screen that moved with the head. A viewing tunnel removed sight of surrounding objects. Changes in horizontal and vertical perspective were isolated by rotating the screen about a vertical axis with the subject stationary. Motion between the display and the background was isolated by coupling sideways rotation of a horizontal textured surface to the parallax motion in the display. When all sources of information were removed, subjects reported one depth order just as frequently as the other. Each source of information significantly increased the percentage of time that the parallax display appeared in its theoretically correct depth order.
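The common-motion rule described under item 1 can be summarized in a few lines of code. This is a hedged sketch of the geometry only (the function and its labels are mine, not from any of the studies above): with fixation maintained during lateral head motion, an image that moves in the same direction as the head belongs to a point beyond the fixation distance, and an image that moves against the head belongs to a nearer point:

def depth_sign(image_velocity, head_velocity):
    """Takes the signed retinal velocity of a point and the signed lateral
    head velocity, and returns a coarse depth-order label relative to the
    fixated point."""
    if image_velocity == 0.0 or head_velocity == 0.0:
        return "ambiguous"            # no parallax, or no extraretinal signal
    if (image_velocity > 0.0) == (head_velocity > 0.0):
        return "beyond fixation"      # moves with the head, like the band in Figure 28.9
    return "nearer than fixation"     # moves against the head

print(depth_sign(+1.5, +10.0))  # beyond fixation
print(depth_sign(-1.5, +10.0))  # nearer than fixation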

28.3.3f Anisotropies of Motion Parallax

The first question about the isotropy of motion parallax is whether depth is produced by motion along all headcentric meridians. For small objects, an impression of depth is produced by binocular disparity only when there is a component of disparity parallel to the interocular axis. On the other hand, motion parallax creates depth for all directions of head motion. For example, many animals obtain depth information by bobbing their heads up and down to create vertical motion parallax.

Although physically isotropic, motion parallax signals obtained by motion in different directions may not be perceived isotropically. Thus, a stereokinetic depth effect produced by rotating two overlapping black disks became larger as the pair of disks became vertically oriented (Masini et al. 1994). Lissajous figures on an oscilloscope (phase-modulated sine waves), designed to be ambiguous with regard to the axis of rotation, are seen as 3-D figures rotating about a vertical axis more often than as figures rotating about a horizontal axis (Mulligan 1992).

The second question about isotropy is whether, for a given lateral motion, a surface slanted about a vertical axis is detected as well as a surface inclined about a horizontal axis. A slanted surface produces horizontal-size disparity in stereoscopic displays, or a gradient of compression motion in motion-parallax displays. An inclined surface produces horizontal-shear disparity in stereoscopic displays, or a gradient of shear motion in a motion-parallax display. There are marked asymmetries in perceived depth defined by disparity. For example, we saw in Section 20.4 that thresholds for discriminating the direction of slant of a random-dot surface were ±2.1˚, while thresholds for inclination were only ±1.3˚ (Cagenello and Rogers 1993). Thus, we are more sensitive to shear disparity than to size disparity. A similar anisotropy occurs between shear-motion parallax and compressive-motion parallax. A compressive gradient of motion produced by lateral motion of a vertical depth corrugation produced less depth than a shearing gradient of motion produced by a horizontal grating (Rogers and Graham 1983). However, Harris et al. (1992) found that perceived inclination of a random-dot display produced by a gradient of shear motion was the same as the perceived slant produced by compression motion parallax.

Examples of anisotropies of motion parallax are provided in Sections 28.4.3 and 31.5.1.

28.3.3g Motion Parallax at Equiluminance

Depth is perceived in equiluminant displays in which disparity is defined by chromatic boundaries (Section 17.1.4). When the lower equivalent contrast of an equiluminant display is allowed for, sensitivity to disparity defined by color is about the same as sensitivity to disparity defined by luminance. Detection of depth created by head parallax is also at least as good for an equiluminant display as for a luminance-defined display when differences in equivalent contrast are taken into account (Cavanagh et al. 1995). The detection of motion and of motion direction are also as good for chromatic stimuli as for luminance-defined stimuli, at least in the foveal region, where color-opponent mechanisms are well represented (Metha et al. 1994).

28.3.3h Motion Parallax with Second-Order Motion

The kinetic depth effect is evoked strongly by first-order motion of luminance-defined elements but only weakly by second-order motion created by contrast reversing the texture elements on successive frames (Dosher et al. 1989b; Landy et al. 1991a). Also, the kinetic depth effect is not created by motion of second-order stimuli defined by contrast modulation (Hess and Ziegler 2000).

Ichikawa et al. (2004) asked whether head-linked parallax between second-order motion signals creates depth. They linked opposite shearing motions of horizontal bars to side-to-side motion of the head. Bars containing vertical stripes defined by luminance segregated into distinct depth planes. As one would expect, the sign of depth depended on the direction of head motion relative to stimulus motion, and the magnitude of depth increased with increasing amplitude of head motion. When the bars contained second-order stripes defined by flicker, size modulations, or contrast modulation, the sign of depth was still detected but not its magnitude. When the second-order stimulus contained no features that could be visually tracked, subjects could not detect the sign of depth. Ichikawa et al. concluded that depth from second-order motion is not supported by a motion-energy signal but rather by relative position shifts of salient stimulus features.

28.3.3i Induced Effects in Motion Parallax

A horizontal gradient of horizontal disparity creates a surface slanted about a vertical axis, while a vertical gradient of vertical disparity creates opposite slant. This is Ogle's induced effect, discussed in Section 20.2.3. Gradients of disparity in both directions (overall magnification) create little or no slant. Horizontal-shear disparity creates inclination in one direction, while vertical-shear disparity creates inclination in the opposite direction (Section 20.3.2a). Shear in both directions (cyclodisparity) has no effect. Thus, the visual system uses the difference between vertical and horizontal gradients of disparity to code slant or inclination.

Rogers and Koenderink (1986) found an analogous induced effect for surfaces defined by motion parallax. Observers viewed a random-dot pattern monocularly while making side-to-side head movements. The vertical size of the random-dot pattern increased as the head moved in one direction and decreased as it moved in the opposite direction. The two momentary images created at the outer limits of the head movement were identical to the stereoscopic images in Ogle's induced effect. In the induced effect based on disparities, a random-dot pattern vertically magnified to the right eye appears slanted with the right edge appearing closer than the left edge. In the motion-parallax case, the right-hand edge of the surface appeared closer when vertical magnification increased with a rightward head movement, and further away when magnification increased with a leftward head movement. Thus, the direction of the effect was the same in both domains. For both the disparity-induced aftereffect and the parallax-induced aftereffect, the perceived slant increased up to a limiting vertical-size difference of between 5 and 10%.

One interpretation of these results is that the disparity-induced effect uses binocular vertical-size differences, and the parallax-induced effect uses vertical-size changes over time, to compute the angle of eccentric gaze (Mayhew and


Longuet-Higgins 1982). Rogers and Koenderink pointed out that the two induced effects are also consistent with use of deformation in the disparity and motion flow fields for judging surface slant. If the visual system decomposes the disparity field into differential invariants, as suggested by Koenderink and van Doorn (1976), the vertical magnification of one image in Ogle's induced effect would yield both a deformation component and an expansion component (Figure 28.10). Perceived slant in the disparity-induced effect is consistent with the use of the deformation component. The expansion component could signal the direction and extent of asymmetric gaze. The motion-parallax induced effect is also consistent with the use of the deformation component, and the expansion component could signal the eccentricity of the surface. Alternatively, and formally equivalently, the expansion component could signal the change in distance of the surface with each side-to-side head movement. Rogers and Koenderink reported that the plane of dots in the induced effect produced by motion parallax appeared both to slant with respect to the frontal plane and to approach and recede with each sideways head movement. Whether these two interpretations are significantly different remains to be determined, but the existence of similar induced effects in the two domains provides further evidence of similarities between disparity and motion parallax as cues for 3-D shape and layout.

Meese et al. (1995) produced further evidence that disparity gradients and gradients of optic flow are processed in the same way. They superimposed flow patterns on the horizontal flow of a random-dot display. The flow patterns were horizontal shear, vertical shear, rotation, and deformation. As expected, a vertical gradient of horizontal shear superimposed on a field of horizontally translating random dots created a surface inclined about a horizontal axis. Vertical shear created a surface inclined by the same amount in the

opposite direction. Rotation (horizontal shear plus vertical shear) created no slant, but deformation (both components with opposite sign) created more inclination than either component alone. Thus, the visual system codes slant in terms of the difference between horizontal shearing motion and vertical shearing motion, rather than in terms of only horizontal shear. Use of the difference signal ensures that the component of horizontal shear in a rotating flow pattern, such as that produced by tilting the head, does not produce an impression of slant. Analogous effects on perceived slant about a vertical axis were obtained by superimposing horizontal gradients of horizontal flow (horizontal-compression gradient) and vertical gradients of vertical flow (vertical compression) on horizontally translating dots. Use of the difference between horizontal and vertical flow ensures that the component of horizontal dilation in an expanding flow pattern produced by an approaching surface does not induce slant.

Meese and Harris (1997) found that horizontal and vertical compression gradients of flow contributed to perceived slant but that a vertical-shear gradient was given less weight than a horizontal-shear gradient. These results for optic flow are analogous to those obtained by Howard and Kaneko (1994) in the domain of binocular disparity (Section 20.3.2). Allison et al. (2003) also produced slant and inclination induced effects from motion parallax.

Distinct horizontal disparities can occur in local regions, but vertical-size and vertical-shear disparities do not occur locally. Therefore, induced effects produced by vertical disparity are not local—they require large stimuli (Section 20.3.2). Vertical-size disparities of opposite sign in the two halves of the visual field produce only weak induced effects (Kaneko and Howard 1996). Vertical-shear disparities of opposite sign cancel and leave no induced effect (Kaneko and Howard 1997). Vertical-size parallax can arise from local looming produced by an approaching object. Vertical-shear parallax can arise from a rotating 3-D object. One might therefore expect that induced effects produced by motion parallax could be produced locally. Allison et al. (2003) obtained slant and inclination induced effects produced by motion parallax, even though their stimuli were only 9˚ in diameter. Also, they obtained opposite motion-parallax induced effects in two adjacent displays for both vertical-size parallax and vertical-shear parallax.

In Ogle’s induced effect, one image is magnified vertically, as shown in the upper figure. Vertical magnification can be decomposed into a deformation component and an expansion component. (Adapted from Rogers and Koenderink 1986)

Figure 28.10. Image transformation in the induced effect.

102



Horizontal disparities indicate the relative depths of objects, not their absolute distances from the observer. The horizontal disparity produced by two objects a fixed distance apart in depth is approximately inversely proportional to the square of viewing distance (Section 18.3.1). A disparity


gradient, or first spatial derivative of disparity, is inversely proportional to viewing distance. For example, doubling the absolute distance of a slanted surface halves the disparity gradient. Thus, disparities produced by a slanted or inclined surface require scaling by viewing distance. The local second spatial derivative of disparity, or disparity curvature, remains approximately constant with changes in viewing distance (Rogers and Cagenello 1989). In theory, the local 3-D curvature of a surface could be recovered without scaling for distance (Section 20.6.5).

In a similar way, motion parallax indicates only the relative depths of objects. Also, the same distance-scaling properties hold for motion parallax. For a given head movement, the motion parallax between the images of two objects a fixed distance apart in depth varies inversely with the square of viewing distance. The spatial gradient of parallax produced by a planar surface varies inversely with viewing distance, while the second spatial derivative of parallax created by a curved surface patch—parallax curvature—remains approximately constant with changing viewing distance.

Tittle et al. (1995) created a polar projection of a textured horizontal hemicylinder oscillating about a vertical axis. The perceived width-to-depth ratio remained invariant over changes in viewing distance. However, shape constancy broke down when the cylinder oscillated about a nonfrontal axis. Ono et al. (1986) and Rivest et al. (1989) measured depth constancy in corrugated surfaces at different distances when motion parallax was linked to active movements of the observer's head. They found poor constancy for judgments of depth in corrugated surfaces more than 1 m away. Ohtsuka et al. (2002) measured the apparent depth of corrugations created by head-linked motion parallax in a random-dot display. Apparent depth was larger when dynamic occlusion caused the corrugations to appear beyond a surrounding random-dot region than when the corrugations were made to appear nearer than the surround.

Absolute distance required for depth scaling of disparity could be indicated by perspective, vergence, or vertical disparities, as indicated in Section 29.2.2. The same sources of information could be used to scale motion parallax.
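These scaling relations are easy to verify numerically. The sketch below is my own illustration (the baseline and depth values are arbitrary): under the small-angle approximation, the relative signal falls with the square of distance, so doubling the viewing distance quarters the parallax or disparity produced by a fixed depth interval:

def relative_parallax(baseline, distance, depth_interval):
    """Approximate angular parallax or disparity (radians) between two points
    separated in depth by depth_interval, the nearer at the given distance.
    `baseline` is the interocular separation or the lateral head translation.
    Small-angle approximation: baseline * depth / distance^2."""
    return baseline * depth_interval / distance ** 2

for d in (0.5, 1.0, 2.0):   # doubling distance quarters the signal
    print(d, relative_parallax(0.065, d, 0.05))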

28.4 PARALLAX AND 3-D SHAPE

28.4.1 BASIC STUDIES OF 3-D SHAPE PERCEPTION

Koenderink (1990) classified curvatures of local smooth surface patches according to a shape index and a curvedness index (Section 20.5.1). The shape index is a function of the two principal curvatures of the surface. The basic shapes are the flat plane, the ellipsoid (cone), the paraboloid (cylinder), and the hyperboloid (saddle shape). A shape has an index of +1 when it is convex and -1 when it is concave. A radially symmetrical ellipsoid has a shape index of +1 or -1 because it has the same curvature in all directions. A paraboloid has an index of ±0.5 because it curves in only one direction. A hyperboloid has an index of 0 because it is convex in one direction and concave in the orthogonal direction. The shape index is a scale-invariant measure that can be derived from the spatial pattern of optic flow produced by a rigid curved surface under polar or parallel projection (Dijkstra et al. 1994). Thus, each shape produces an image with a characteristic pattern of optic flow when it translates or rotates about a given axis. Curvedness is the root mean square of the two principal curvatures. It depends on scale (size of the surface) and requires information about higher-order spatial relations from at least three views. Observers can categorize and label surfaces with different shape indices independently of the curvedness of the surface (De Vries et al. 1993).

The first spatial derivative of velocity in a given local area describes changes in length or orientation of image segments. First-order velocity gradients are produced in the image of a flat surface as it translates or rotates about any axis. The second spatial derivative of velocity describes variation in curvature (bending) of line elements. This is known as the spin variation (SV). It is zero when there is no change in curvature and increases as the change of curvature increases. The spin variation can be calculated in each direction through any point on a surface to yield a polar plot of variation in curvature along each direction, as illustrated in Figure 28.11. For any surface translating or rotating about a fixed axis, there is at least one direction in which SV is zero. Flat surfaces do not produce spin variation in any direction because they produce only first-order velocity gradients. Theoretically, a first-order velocity gradient, and hence the motion of a flat plane, can be derived from two successive views. Detection of the degree of surface curvature requires at least three views because acceleration is not specified by a single change in position.

The pattern of spin variation depends on the type of surface and the way in which it is moving. For example, an ellipsoid or a cylinder translating along its axis produces a flow field in which SV is zero in only one direction. In both cases, the direction of motion can be determined. A saddle shape or a cylinder translating in any direction other than along its axis produces zero SV in two directions. Only a translating hyperboloid produces zero SV in three directions, but the direction of motion is not specified (Droulez and Cornilleau-Pérès 1990).

Figure 28.11. Polar plots of spin variation. The polar plots represent the magnitudes of changing curvature of the images of lines passing in all directions through point P on the surface of a cylinder translating vertically or horizontally. (Adapted from Droulez and Cornilleau-Pérès 1990)

Van Damme et al. (1994) investigated shape identification and discrimination of smooth surfaces specified by motion parallax during active movements of the observer's head. In their first experiment, observers discriminated between the shape of a test surface and that of a reference surface with one of eleven shape-index values between -1.0 and +1.0. Curvedness was kept constant. Shape-discrimination thresholds, expressed as JNDs, were lowest for cylindrical surfaces with shape indices of ±0.5. Thresholds for saddle shapes and elliptical shapes were similar, and both were higher than those for cylinders. This is similar to the order of difficulty with stationary shapes (Section 20.5.2). In their second experiment, Van Damme et al. found that curvedness-discrimination thresholds increased with increasing curvedness but, when expressed as Weber fractions, the lowest fractions (~15%) were obtained for the least curved surfaces. By comparison, Rogers and Cagenello (1989) reported curvature-discrimination thresholds for parabolic cylinders as low as 5% for disparity-defined surfaces (Figure 20.38).
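The shape-index and curvedness measures used in these studies can be written out explicitly. The following sketch uses the standard Koenderink and van Doorn formulation in terms of the two principal curvatures (the function names and test values are my own):

import math

def shape_index(k1, k2):
    """Koenderink shape index from principal curvatures k1 >= k2:
    +1 or -1 at umbilic points, +/-0.5 for cylinders, 0 for symmetric saddles."""
    if k1 == k2:
        return math.copysign(1.0, k1)
    return (2.0 / math.pi) * math.atan((k1 + k2) / (k1 - k2))

def curvedness(k1, k2):
    """Root mean square of the principal curvatures; scales with surface size."""
    return math.sqrt((k1 * k1 + k2 * k2) / 2.0)

for k1, k2 in [(1.0, 1.0), (1.0, 0.0), (1.0, -1.0)]:   # dome, cylinder, saddle
    print(shape_index(k1, k2), curvedness(k1, k2))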

28.4.2 SHAPE FROM PARALLAX AND FROM DISPARITY

28.4.2a Comparison of Thresholds

Judgments of the shape of both real surfaces and of disparity-defined 3-D surfaces can be close to perfect for distances up to 3 m (Section 20.5.2). De Vries et al. (1994) compared shape discrimination for surfaces defined by disparity with that for surfaces defined by motion parallax. Discrimination of disparity-defined cylindrical surfaces was best with shape indices close to -0.5 or +0.5, and was poorer for saddle shapes and symmetrical ellipsoids (shape index 0). The patterns of discrimination were very similar in the two domains, with overall discrimination abilities slightly better in the disparity domain. For example, the lowest JNDs for motion parallax-defined cylinders were around 0.025, on a shape-index scale from +1 to -1, compared with the lowest JNDs for disparity-defined cylinders of about 0.015. These results mirror the differences in thresholds for detecting depth in corrugations defined by disparity compared with those defined by motion parallax.

Durgin et al. (1995) compared the contributions of motion parallax and binocular disparity to the perception of the shape of a 3-D cone. The 2-D stimuli simulated protruding cones with a base 6 cm in diameter and base-to-tip depths of between 2.4 and 18 cm. In the monocular condition, parallax simulated a cone moving from side to side at constant speed. In the binocular condition, disparity simulated the depth of a stationary cone. The amplitude of parallax (4.92˚) was the same as the amplitude of stimulus disparity. Subjects adjusted the height of a triangle on the computer screen to indicate the depth of the cone relative to the base. Judgments of the disparity-defined depth closely matched the simulated depth. However, depth from motion parallax was overestimated for shallow cones and underestimated for tall cones, as shown in Figure 28.12. Judgments from motion parallax were much less accurate than those reported by Rogers and Graham (1979) using sinusoidal depth corrugations.

Durgin et al. obtained different results with real cones viewed monocularly in a brightly lit structured environment with the head moving from side to side through 25 cm. Depth judgments were close to actual depths, at least for cones with depth no greater than 150% of the diameter of the base. This improvement could have been due to the use of real cones, active head movement, the larger amplitude of movements (7.2˚ compared with 4.92˚), or to information from surroundings and accommodation.

Tittle and Perotti (1997) compared the contributions of motion parallax and binocular disparity to the perception of the 3-D shape and curvedness of small elliptical, cylindrical, and saddle-shaped patches.

Figure 28.12. Perceived shape of disparity and parallax cones. The judged depth of protruding cones of different heights, plotted as judged cone depth against actual cone depth (both as percentages of base diameter). For simulated depth specified by binocular disparity and for binocularly viewed real cones, judged depth was close to actual. Depth from motion parallax was overestimated for short cones and underestimated for tall cones. The underestimation was very large for parallax in a simulated cone. (Redrawn from Durgin et al. 1995)

Parallax contributed more to the perception of shape index than to the perception of curvedness, while disparity-based depth contributed more to curvedness than to shape index.

Perotti et al. (1998) created monocularly viewed patterns of optic flow by parallel projection of rotating cones, cylinders, and saddle shapes. Subjects matched the shapes to subsequently binocularly viewed shapes created by disparity within stationary dot patterns. Shape matches were almost perfect, but curvedness created by optic flow was overestimated relative to disparity-defined curvedness. Performance was not much affected when first-order flow patterns of dilatation, rotation, or shear were superimposed on the patterns of optic flow. This suggests that shape judgments were based on second-order spatial derivatives of displacement. Thus, subjects were detecting velocity changes (spatial accelerations) over space. By comparison, several studies have shown that observers are poor at detecting velocity changes over time (temporal accelerations).

28.4.2b Motion Disparity

A translating textured surface produces small differences between the patterns of optic flow reaching the two eyes. This may be called motion disparity. Theoretically, motion disparity could be a cue to surface shape (Cornilleau-Pérès and Droulez 1990). Cornilleau-Pérès and Droulez (1993) looked for evidence in support of this idea. Whereas depth created by motion parallax does not depend on the direction of the axis of motion, motion disparities are created only by that component of motion parallel to the interocular axis. Therefore, if the visual system uses motion disparities, curvature-detection thresholds should be lower for surfaces moving horizontally than for surfaces moving vertically. However, no systematic differences were found for the different directions of motion. In an additional experiment, the dichoptic images were decorrelated, but the patterns of optic flow were made slightly different to create motion disparities. Observers were unable to discriminate between planar and curved surfaces, even with the maximal degree of surface curvature. It thus seems that motion disparity, as opposed to monocular motion parallax, does not contribute to the detection of 3-D curvature.

28.4.3 CONTRAST IN MOTION PARALLAX

The similarity between disparity and motion parallax is illustrated by the finding that similar successive and simultaneous contrast effects occur in the two domains. For example, after prolonged viewing of a sinusoidally corrugated surface defined by disparity or parallax, a flat test surface appears corrugated in the opposite phase (Section 21.4.1). When measured with a nulling procedure, depth aftereffects from a surface defined by disparities were the

same as those from surfaces defined by motion parallax (Graham and Rogers 1982a). The communality of the two aftereffects is further illustrated by the fact that a depth aftereffect created by prolonged inspection of a disparity-defined surface can be nulled with monocular motion parallax, and vice versa (see Section 30.2).

Simultaneous contrast effects occur when a frontal surface is flanked by regions inclined or slanted in depth (Section 21.4.2c). A frontal surface appears inclined or slanted in the opposite direction to the inclination or slant of the flanks. Graham and Rogers measured the strength of these simultaneous contrast effects using a nulling procedure, and again found that the magnitude and other characteristics of the effects were similar whether the surface was defined by disparities or by motion parallax.

Anstis et al. (1978) reported a simultaneous contrast effect in the binocular-disparity domain. Two coplanar surfaces were separated by a Craik-O'Brien-Cornsweet scallop-shaped profile in depth defined by disparity. The two surfaces appeared to lie in different depth planes (see Figure 21.23A in Section 21.4.2d). Rogers and Graham (1983) reported an effect of similar magnitude for a similar depth profile specified by motion parallax. Thus, both the motion-parallax system and the disparity system are less sensitive to gradual changes in depth than to a steep discontinuity in depth.

Rogers and Graham also found that the illusion occurred when the depth ridge was vertical, with flanking regions to left and right. However, when the ridge was horizontal, with flanking regions above and below, the flanking regions appeared coplanar (Figure 28.13). The same anisotropy was found when the 3-D surface was defined by motion parallax produced by lateral head motion. This anisotropy is consistent with the fact that we are less sensitive to compressive disparity or compressive motion parallax than to shear disparity or shear parallax, as described in Section 28.3.3f. Other examples of anisotropies in motion parallax were described in Sections 28.3.3f and 28.3.3i. Anisotropies of disparity-defined slant and inclination were discussed in Section 20.3.

28.4.4 OBSERVER DIFFERENCES IN DEPTH FROM PARALLAX

The evidence presented so far suggests that disparity and motion parallax are processed in a similar way. However, there are differences. According to Richards (1970), 4% of a random sample of MIT undergraduates with apparently normal vision were unable to use disparity to code depth, and 10% had great difficulty in seeing depth in Julesz random-dot stereograms. Julesz (1971) found that 2% of observers were stereoblind and a further 15% had particular difficulties in seeing depth in complex random-dot stereograms. There are no reports of people who cannot see the

3-D structure of surfaces in motion-parallax displays. Of the several hundred subjects tested, Rogers and Graham did not find one who did not see depth in a self-produced parallax stimulus. Even strabismics, who lack stereopsis, saw depth from motion parallax. This is not surprising because, unlike disparity-based stereopsis, motion parallax does not depend on the precise alignment of the two eyes or on the development of binocular cells.

Although there are no reports of "parallax-blindness," Richards and Lieberman (1985) reported that observers with normal vision could see the kinetic depth effect when the 2-D display was stereoscopically beyond the fixation point but not when it was stereoscopically in front of the fixation point. However, Bradshaw et al. (1987) failed to replicate this finding after taking greater care to control for the effects of changes in vergence. Richards and Lieberman also reported that the ability to see the kinetic depth effect was positively correlated with the ability to distinguish between crossed and uncrossed disparities. Correlation would arise if the two depth cues fed into a common depth-registration mechanism, which varied in proficiency from person to person. Or correlation could arise simply because some people are, in general, better observers. On the other hand, one might expect the effectiveness of the two cues to be negatively correlated. A person with weak stereoscopic vision might well compensate by being better at using motion parallax. A stereoblind person would be expected to show the best motion-parallax performance. But this issue has not been investigated.

Rogers (1984) used a task in which observers discriminated between the depths of sinusoidally corrugated surfaces of different amplitudes. The surfaces were presented either binocularly with disparity cues or in an observer-produced parallax situation. Results for 13 observers are shown in Figure 28.14. There was only a weak relationship between the ability to use disparity and the ability to use motion parallax.

Figure 28.13. Craik-O'Brien-Cornsweet illusion in depth. The size of the Craik-O'Brien-Cornsweet illusion in depth for disparity-defined and motion-parallax-defined surfaces, expressed as a percentage of the 8-arcmin depth discontinuity. In both cases the effect was larger for a vertical than for a horizontal depth discontinuity. (Redrawn from Rogers and Graham 1983)

Figure 28.14. Disparity and parallax discrimination thresholds. Thresholds for discriminating differences in peak-to-trough amplitude of sinusoidal corrugations for twelve naïve and one practiced subject (axes: amplitude-discrimination thresholds for parallax-defined and disparity-defined depth, as a percentage of the reference surface). There is a weak correlation between disparity and parallax performance—observers who had lower thresholds for discriminating the amplitude of disparity corrugations tended to have lower thresholds for discriminating parallax corrugations. The dashed line shows where parallax thresholds are 2.5 times higher than disparity thresholds.

28.5 THE KINETIC DEPTH EFFECT

28.5.1 BASIC FEATURES OF THE KINETIC DEPTH EFFECT

The image of a 3-D object, such as a twisted piece of wire, rear-projected onto a translucent screen appears flat. However, the object's 3-D form becomes apparent in the projected image when the object is rotated about an axis parallel to the screen. Miles (1931) had noted the effect in the silhouette of a revolving fan. Metzger (1934) noted it in the silhouette of an array of rotating vertical rods. Gibson and Gibson (1957) noted it in the projected images of surfaces rotating about a vertical axis. Also, Fisichelli (1946) described the three-dimensional appearance of Lissajous figures on an oscilloscope, formed by phase-modulated sine waves. Braunstein (1962) reviewed these earlier studies.

The ability to recognize the shape of a 3-D object from its silhouette is also enhanced when the object rotates. Rotation greatly facilitates recognition when the silhouette contains prominent convexities that signify bumps on the object, or concavities that signify saddle-shaped regions, as described in Section 28.4.1 (Norman et al. 2000).


Wallach and O’Connell (1953) conducted the first systematic investigation of the kinetic depth effect and gave it its name. It is now referred to as the KDE. The perceived 3-D shape produced by a KDE resembles that of the rigid 3-D object producing the image. If, for any reason, the impression of depth weakens, the object appears to deform as it rotates in depth. When there is no perceived depth, the moving image appears as a 2-D deforming object. Wallach and O’Connell reported that the projected image of a single rod rotating in depth about an axis parallel to the screen appeared to lengthen and shorten rather than rotate in depth. Also, the image of a figure made from three pieces of wire forming a 3-D corner did not appear to rotate in depth when an aperture masked the changes in length of the images. They concluded that the image must change in both length and angular position to generate a KDE. According to this conclusion, the simplest stimulus to produce a KDE is a single line changing in both length and orientation. For example, consider a line with one end fixed and the other end moving back-and-forth along a horizontal path above the fixed point, lengthening and shortening as it moves, as shown in Figure 28.15 This is equivalent to the projected image of a line of fixed length moving in a conical orbit, with the top of the line at eye level. That is the impression created (Beghi et al. 1997). Wallach and O’Connell also found that the projected image of an object with only curved contours appeared to deform rather than rotate in depth. They concluded that the KDE requires contours with identifiable points. However, we will see later images with smooth contours may generate a KDE under certain circumstances. In most studies of the KDE, the display consisted of moving dots or lines generated on an oscilloscope or computer screen. This provides greater control over stimulus variables than does the projected image of a real object, and allows one to construct displays that could not be produced by real objects. A KDE depends on the following features of the 3-D object that produces it: Apparent motion in depth Real motion Eye level

1. Distance of the 3-D object and screen from the observer.

2. Depth scale This is the absolute depth between each pair of defined points in the object.

3. Magnitude of curvature This is specified at each location on the object's surface.

4. Depth sign A 3-D object has a front and a back. However, the perceived depth order of its projected image may be unstable and reverse spontaneously. Any reversal of perceived front and back of a rotating skeletal or transparent object causes an apparent reversal of its direction of rotation.

5. Shape index The basic types of local shape are elliptical, parabolic, and hyperbolic (saddle shape).

6. Smoothness An object can be smooth or rough and contain ridges, edges, or breaks.

7. Type of lines and texture Textures vary in randomness, density, gradient, and in the contrast, number, and size of elements. The shape and texture of a KDE display depend on the smoothness, shape, and texture of the 3-D object, on the angles of the surfaces of the object to the plane of projection, and on whether projection is parallel or polar.

8. Type of motion Basic types of motion are translation, rotation, expansion, and two types of shear. Velocity may be constant or there may be acceleration or higher spatiotemporal derivatives. Translation is not involved in the KDE. A transparent 3-D object with a textured surface produces two or more superimposed patterns of optic flow. The pattern of optic flow also depends on the duration of a motion sequence. Theoretically, it is possible to determine the motion and slope of any region on a rigid textured 3-D surface from the pattern of optic flow in the polar projected image (Longuet-Higgins and Prazdny 1980). The crucial components of optic flow are the first and second spatial derivatives of motion, that is, the spatial gradients of velocity and acceleration. For parallel projection, the first and second spatial derivatives of motion specify the angular velocity and the normal at each point on the surface, but leave the sign of depth ambiguous (Hoffman 1982).

9. Orientation and phase of motion The orientation of the axis of object rotation is specified with respect to a vertical line in the image plane. The phase of rotation is the angle between a defined rotating radius of the object and the image plane.

10. Rigidity In a rigid object, the distance between any pair of points remains constant. A nonrigid object undergoes plastic deformation. An articulated object has hinged segments.

11. Coherence Motion is coherent when all parts of an object move or rotate at the same linear or angular velocity about the same axis.

12. Familiarity The 3-D shape of a KDE display is more evident when it is a familiar rather than an unfamiliar object.

13. The type of projection The basic types of projection are polar and parallel. Parallel projection is also known as affine or orthographic projection. Polar projection is formed by a point source of light. Parallel projection is formed by a parallel beam of light or by a point source at an infinite distance. In both cases, image points move only when object points traverse lines of projection. With parallel projection, as in Figure 28.16A, the size of the image does not vary with the distance of the object behind the screen. With polar rear projection, the size of the image depends on the ratio of the distance of the object from the light source, D1, to the distance of the screen from the light source, D2, as shown in Figure 28.16B. As the object approaches the screen, the ratio approaches 1 and the image on the screen approaches the size of the object. For a given image on the screen, the size of the retinal image depends on the distance of the eye from the screen, D3. Also, as D1 increases, the linear perspective in the image of a 3-D object produced by polar rear projection decreases. As D1 becomes infinite, polar projection becomes parallel and linear perspective falls to zero, leaving only aspect-ratio perspective. Accordingly, the accuracy of observers in detecting the true direction of rotation in a KDE decreases as D1 is increased (Braunstein 1972).

Consider dots on the surface of a transparent sphere rotating about a vertical axis and rear projected from a point source onto a screen. A person viewing the screen from the other side sees two superimposed opposed patterns of horizontal motion. Each pattern has a horizontal sinusoidal variation in velocity, and a vertical gradient of velocity (Braunstein and Andersen 1984a). The image of a point on the side of the object nearer the projection point (farther from the viewer) moves more rapidly than a point on the side nearer the viewer. The image in the eye of the viewer is therefore equivalent to that of a sphere with reversed depth moving in the opposite direction.

The above remarks apply to a rear-projected image. If the point of projection is in the same location as the eye of the viewer, the image on the screen creates a retinal image that is identical to that of the actual object, whatever the distance of the object or the screen. This arrangement can be achieved computationally or by the use of a semisilvered mirror, as shown in Figure 28.16C.

Figure 28.16. Types of projection for kinetic depth displays. (A) With parallel projection, the size of the kinetic depth display does not vary with the distance of the object, and there is no perspective in the image. (B) With point projection, the size of the kinetic depth display depends on D1/D2. The size of the retinal image depends on D3. (C) If the eye is at the same distance from the mirror as the light, the retinal image of the kinetic depth display will be the same as that of the object whatever the distance of the object from the screen or of the screen from the eye.
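The projection geometry of Figure 28.16B can be sketched numerically. This is my own illustration (the dimensions are arbitrary, though the variable names follow D1 and D2 in the figure); it shows that the image of a dot near the projection point sweeps across the screen faster than the image of a dot on the opposite side of the sphere:

import math

def screen_position(theta, R=0.1, D1=0.5, D2=0.7):
    """Screen position of a dot at rotation angle theta on a transparent
    sphere of radius R whose center lies D1 from the point source, with
    the screen D2 from the source. Magnification is D2 / (distance from
    the source)."""
    x = R * math.sin(theta)            # lateral offset of the dot
    d = D1 + R * math.cos(theta)       # theta = pi puts the dot nearest the source
    return x * D2 / d

w, dt = 1.0, 1e-4                      # rotation rate (rad/s) and time step
for theta in (0.0, math.pi):           # dot farthest from, then nearest to, the source
    v = (screen_position(theta + w * dt) - screen_position(theta)) / dt
    print(round(abs(v), 4))            # the near-source dot's image is faster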



Subjects may be asked to judge any of the features of the 3-D object that they perceive in a KDE display. For each feature there are six basic tasks that a subject may be asked to perform.

1. Detection Subjects report whether a feature differs from some internalized standard. For example, they may report whether each of a set of surfaces is curved relative to a plane surface or more or less curved than a spherical surface. This yields a threshold with respect to a canonical standard, or norm.


2. Discrimination Subjects report whether two stimuli

differ with respect to some feature. This yields a difference threshold with respect to any value of the feature dimension. 3. Magnitude estimation Subjects may estimate the depth

of a display on a rating scale. Dosher et al. (1989a) asked subjects to rate the depth, rigidity, and coherence of the perceived 3-D object and found that these criteria were affected in different ways by different changes in the stimulus. One must therefore be aware that conclusions drawn from experiments may be limited to the feature being judged and the psychophysical task. 4. Categorizing a feature For example, subjects may report

the depth sign or the shape of local surface patches or they may match the perceived object to one of a sample set of real comparison objects. 5. Use of a depth probe Subjects may be presented with two

displays in which depth is produced by two different depth cues and asked to adjust the value of the depth cue of one display (the probe) until the depth in the two displays is judged to be the same. For example, the test stimulus could be a display in which depth is defined by a KDE and the depth probe could be a stationary display in which depth is defined by disparity (Todd and Perotti 1999). A depth probe indicates only the perceived depth produced by one cue relative to that produced by the other cue. It does not indicate the absolute accuracy of either cue.

6. Cue trading Subjects may be presented with a single object in which depth is produced by two distinct cues and asked to vary the value of one cue until depth is not evident. For example, depth created by motion parallax may be nulled by an opposing depth signal generated by disparity.

28.5.2 STIMULI FOR THE KINETIC DEPTH EFFECT

28.5.2a Effects of Dot Density, Velocity, and Lifetime

Consider a set of evenly spaced points in 3-D space rotating coherently about a common axis parallel to a rear-projection screen. The velocity of each point on the screen is inversely proportional to the distance of the screen from the point light source. Object points moving parallel to the screen form a less dense image than do points moving at an oblique angle. For example, the images of dots on a rotating vertical 3-D cylinder become compressed as they move toward the flank of the cylinder. Sperling et al. (1989) generated 2-D random-dot displays that simulated each of 53 depth-modulated surfaces in rotary oscillation (Portrait Figure 28.17). The cue of changing dot density could be eliminated by reducing dot lifetimes. Subjects immediately recognized the shapes when only the changing-velocity cue was present, but not when only the cue of changing dot density was present (see also Petersik 1996). They concluded that the mechanism for detection of 3-D from 2-D motion first finds local minima and maxima of relative motion and then assigns depth in proportion to velocity.

Figure 28.17. George Sperling. He obtained a B.S. in mathematics and biophysics from the University of Michigan in 1955 and a Ph.D. in psychology from Harvard in 1959. He worked in the Bell laboratories in New Jersey from 1959 to 1970. Between 1970 and 1992 he was professor of psychology and neural sciences in New York University. In 1992 he moved to the University of California at Irvine, where he is a professor in the Department of Cognitive Sciences. He won the Howard Cosby Warren Medal of the Society of Experimental Psychologists in 1996 and the Tillyer award of the Optical Society of America in 2002.

Eby (1992) used a computer-generated image of a hollow half-ellipsoid pointing away from the screen and executing curvilinear motion round a circular orbit. The perceived depth of the ellipsoid increased with increasing velocity and extent of motion but was not affected by the number of dots on the ellipsoid.

A KDE display generated on a computer screen necessarily consists of a sequence of images that are displaced in discrete steps at defined intervals. In other words, the motion is apparent motion rather than smooth motion. Apparent motion with short interframe steps is known as short-range apparent motion and that with long steps is known as long-range apparent motion. Several investigators have shown that the KDE is specifically generated by short-range apparent motion with short interframe temporal intervals. Todd et al. (1988) used





a computer-generated image of a rotating 3-D array of lines produced by parallel projection. The KDE was most evident with a multiple-frame sequence with small interframe displacements and small temporal intervals. Mather (1989) used a computer-generated image of a rotating 3-D transparent sphere of random dots. Depth was detected only with interframe displacements of less than 15˚ and interframe intervals shorter than 80 ms. The optimal interval was 40 to 60 ms. Dick et al. (1991) presented a 4-frame sequence of a parallel projection of a 3-D rotating cylinder of dots set in a background of dots moving horizontally in a direction and at a velocity that varied from trial to trial. Subjects' ability to detect the cylinder declined rapidly when the cylinder rotated more than 6˚ between frames. Thus, the KDE is evoked more effectively by short-range apparent motion than by long-range apparent motion.

Another question concerns the effect of the lifetimes of dots used in KDE displays. Dot lifetime is determined by how frequently the dot pattern on the simulated object is changed as the object rotates. Treue et al. (1991) determined the dot lifetime required for the perception of a rotating cylinder of dots in a KDE display produced by parallel projection on a cathode ray tube. The threshold lifetime remained fairly constant at between 50 and 85 ms over a wide range of velocities and numbers of dots. Depth perception did not improve as dot lifetime was lengthened beyond 125 ms. Treue et al. concluded that the dot-lifetime threshold reflected the minimum required dot duration rather than the minimum path length of dot motion. Motion was detected in the stimulus at dot lifetimes shorter than those required for detection of 3-D shape. They concluded that the longer lifetime required for detection of shape is due to the need to integrate global information over time. Monkeys, like humans, could detect a change from a moving 2-D display with no 3-D structure to a display with 3-D structure when dot lifetime was 75 to 100 ms (Siegel and Andersen 1988).
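The sketch below shows, under stated assumptions, how a display of this general kind can be generated: dots on a simulated transparent cylinder are rotated in small interframe steps (short-range apparent motion) and given a limited lifetime, which removes the changing dot-density cue while leaving relative velocity intact. All parameter values are arbitrary illustrations, not those used in the studies cited.

```python
import numpy as np

def kde_cylinder_frames(n_dots=200, n_frames=100, radius=1.0, height=2.0,
                        deg_per_frame=2.0, lifetime_frames=5, seed=0):
    # Orthographic (parallel) projection of dots on a rotating
    # transparent cylinder. Small interframe rotations give
    # short-range apparent motion; regenerating each dot after
    # `lifetime_frames` frames removes the changing dot-density cue
    # while leaving the velocity field intact.
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, n_dots)         # azimuth on cylinder
    y = rng.uniform(-height / 2, height / 2, n_dots)  # height on cylinder
    age = rng.integers(0, lifetime_frames, n_dots)
    step = np.radians(deg_per_frame)
    frames = []
    for _ in range(n_frames):
        # Parallel projection along z: image x is the 3-D x; depth is lost.
        frames.append(np.column_stack([radius * np.sin(theta), y]))
        theta = (theta + step) % (2 * np.pi)          # rotate about vertical axis
        age += 1
        dead = age >= lifetime_frames                 # re-seed expired dots
        theta[dead] = rng.uniform(0, 2 * np.pi, dead.sum())
        y[dead] = rng.uniform(-height / 2, height / 2, dead.sum())
        age[dead] = 0
    return frames  # list of (n_dots, 2) arrays of image coordinates

frames = kde_cylinder_frames()
```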

28.5.2b Effects of Number of Motion Frames

For an image of a 3-D object produced by parallel projection, the perceived spatial disposition of the 3-D shape remains ambiguous with respect to its reflection about the frontal plane. Apart from this residual ambiguity, the spatial information that can be derived from a KDE display depends on the number of points and the number of frames in the motion sequence. Theoretically, the instantaneous velocity of an optic flow pattern can be derived from two successive views. Higher spatiotemporal derivatives, such as gradients of acceleration, require more than two views.

Two successive projections of four points specify coplanarity and distance ratios along the same direction (Koenderink and van Doorn 1991). However, specification of the full metric structure of a 3-D object requires three successive views of the orthographic (parallel) projection of four noncoplanar points (Ullman 1979). Three-dimensional structure can also be derived from four views of two points if all the points in the object move at constant angular velocity around a common center and there is no translatory component to the motion (Hoffman and Bennett 1986). If the object is rigid and rotating at constant velocity about an axis parallel to the image plane, three views of two points are sufficient to give a unique interpretation. The 3-D structure of an object can also be derived from four views of three points. The points in 3-D space can move at different angular velocities about a common axis. Thus, the points need not lie on a rigid object (Bennett and Hoffman 1985).

Two views of the parallel projection of four or more noncoplanar points can be produced by a large set of 3-D objects. The set forms a group (Section 3.7.1). For example, a slanted flat vertical surface rotating about a vertical axis through a small angle produces the same gradient of image displacements as a less steep and less wide surface rotating through a larger angle, as shown in Figure 28.18. In other words, different combinations of slant and angle of rotation are equivalent in two-view parallel projection (Hay 1966). Surfaces at different angles produce different gradients of acceleration (second-order motion) in both parallel and polar projection, but second-order motion can be detected only when there are more than two views.

Figure 28.18. Projective equivalence of rotating planes. In parallel projection, surface A rotating through angle θ produces the same pattern of optic flow as surface B rotating through angle φ. The patterns of optic flow differ when the surfaces rotate through a frontal plane, because the surfaces project different maximum widths when in a frontal plane. The patterns also differ in polar projection.
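The Hay (1966) equivalence is easy to verify numerically. The sketch below assumes a plane containing a vertical rotation axis, viewed in orthographic projection along the depth axis, with points parameterized by their distance along the surface; a 60˚ slant rotated through 10˚ then yields exactly the same two views as a narrower 30˚ slant rotated through about 23.7˚.

```python
import numpy as np

def two_views(surf_coords, slant_deg, rot_deg):
    # Orthographic image x-positions of points on a plane slanted about
    # a vertical axis, before and after rotating it through rot_deg.
    a = np.radians(slant_deg)
    b = np.radians(slant_deg + rot_deg)
    return surf_coords * np.cos(a), surf_coords * np.cos(b)

sA = np.linspace(-1.0, 1.0, 5)          # points along surface A (slant 60 deg)
x0A, x1A = two_views(sA, 60, 10)        # A rotates through 10 deg

# Equivalent surface B: less slanted (30 deg), narrower, rotating through
# the angle r that satisfies cos(30 + r)/cos(30) = cos(70)/cos(60).
k = np.cos(np.radians(70)) / np.cos(np.radians(60))
r = np.degrees(np.arccos(k * np.cos(np.radians(30)))) - 30    # ~23.7 deg
sB = sA * np.cos(np.radians(60)) / np.cos(np.radians(30))     # narrower
x0B, x1B = two_views(sB, 30, r)

# Identical pairs of views: two-view parallel projection cannot separate
# slant from rotation amplitude.
print(np.allclose(x0A, x0B) and np.allclose(x1A, x1B))        # True
```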


Bennett et al. (1989) developed an algorithm for generating equivalence groups under parallel projection. See also Huang and Lee (1989). In polar projection there is no ambiguity for any surface translating along the normal to the surface, including the special case of a frontal surface moving toward or away from an observer (Ullman 1986).

Theoretically, the 3-D structure of an object can be derived from two views of eight noncoplanar points (Longuet-Higgins 1981) (Portrait Figure 28.19) if the rigidity and smoothness of the 3-D object are correctly assumed (Aloimonos and Brown 1989). Koenderink and van Doorn (1986) showed that two views of seven points specify a polyhedral vertex undergoing bending deformation, as long as the component surfaces do not stretch.

Observers could perceive a transparent sphere of dots from a multidot KDE display presented for two frames produced by the image of a 5.6˚ rotation of a 3-D sphere (Lappin et al. 1980). But the 3-D percept was disrupted when a small percentage of dots moved incoherently. Also, a 3-D object is perceived in a multiframe KDE display in which no dot survives for more than two frames (Dosher et al. 1989b; Mather 1989; Landy et al. 1991a).

For nonsmooth objects, two views in parallel projection are sufficient for the recovery of properties, such as coplanarity and rigidity, that are invariant under affine transformation. But two views do not allow recovery of properties such as relative lengths or angles that are not invariant under affine transformation (Section 3.7.2). Theoretically, these tasks require more than two views.
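The two-view result of Longuet-Higgins (1981) is constructive: eight point correspondences give eight linear constraints on the essential matrix that relates the two views. A minimal numerical sketch follows; it omits the coordinate normalization used in practice and the final decomposition of the essential matrix into the rotation and translation between views.

```python
import numpy as np

def essential_matrix(x1, x2):
    # Eight-point algorithm: x1, x2 are (n >= 8, 2) corresponding image
    # points in two calibrated views (camera coordinates). Each
    # correspondence gives one linear constraint x2^T E x1 = 0 on the
    # nine entries of the essential matrix E.
    n = x1.shape[0]
    h1 = np.column_stack([x1, np.ones(n)])   # homogeneous coordinates
    h2 = np.column_stack([x2, np.ones(n)])
    A = np.stack([np.kron(h2[i], h1[i]) for i in range(n)])
    _, _, vt = np.linalg.svd(A)              # null vector of A = vec(E)
    E = vt[-1].reshape(3, 3)
    u, s, vt = np.linalg.svd(E)              # enforce the rank-2 constraint
    return u @ np.diag([s[0], s[1], 0.0]) @ vt
```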

Figure 28.19. Christopher Longuet-Higgins. Born in Lenham, England, in 1923. In 1941 he went to Oxford University to study chemistry. By the age of 29 he was professor of theoretical physics at King’s College, London, and in 1954 he became professor of chemistry at Cambridge University. In 1967 he moved to Edinburgh University to found the department of machine intelligence and perception. He later moved to the department of psychology at the University of Sussex. He received five honorary degrees. He was a fellow of the Royal Society and a foreign associate of the National Academy of Sciences. He died in 2004.

In a rotating flat surface, all pairs of texture elements symmetrically placed about the rotation axis move at the same rate. Only a flat surface produces this pattern of motion. Todd and Bressan (1990) found that, with two frames, observers could detect when three connected lines in a parallel KDE display deviated from coplanarity by a mean value of 2.6˚. But, with two frames, observers were poor at detecting relative lengths of lines or whether an angle between lines was greater or less than 90˚. Performance should have improved on these latter tasks when the number of frames was increased, because three or more views provide acceleration information. However, no such improvement occurred. Increasing the number of frames had only a small effect on the perceived depth of sinusoidal corrugations in a linear parallax display (Todd and Norman 1991). Also, varying higher-order temporal derivatives of optic flow that produced two surfaces meeting in a dihedral angle had little effect on the perceived depth of a display in parallel projection (Todd and Perotti 1999). Todd and Bressan concluded that people are unable to make use of information stemming from visual acceleration in parallel projection.

Hildreth et al. (1990), in apparent contradiction to Todd and Bressan, found that increasing the number of motion frames beyond two did improve an observer's ability to detect 3-D structure. However, they presented each frame sequence only once, so that two-frame displays were necessarily shorter in duration than displays with more frames. Todd and Bressan controlled for the effects of stimulus duration by allowing subjects to view each frame sequence as often as they liked. Pollick (1997), also, found that discrimination of rigid and nonrigid motion improved with the number of views, although performance was high even with two views.

Eagle and Blake (1995) found that, with two frames, subjects could detect a departure from coplanarity between two hinged random-dot surfaces of about 10˚. The threshold did not change when the number of frames was increased to 10. This result is consistent with the findings of Todd and Bressan. Thus, two-frame motion provides all the motion information required for detection of coplanarity. However, Eagle and Blake's subjects could not detect a change in the dihedral angle between two surfaces undergoing two-frame rotation. They could do so with three-frame motion, albeit with a very high threshold. This suggests that, unlike Todd and Bressan's subjects, they made some use of second-order motion. Eagle and Blake analyzed their data in terms of detection thresholds for frontal-plane motion (Snowden and Braddick 1991; Werkhoven et al. 1992). They concluded that the differential motion signals in Todd and Bressan's metric-structure stimuli were too small to allow subjects to make use of the additional information in multiframe stimuli.

All these experiments involved parallel projection with depth information provided only by motion. In natural





viewing, retinal images are formed by polar projection and contain both motion and perspective. Under these conditions, two views are sufficient for the detection of both the relief structure and metric structure of nearby objects. Thus, we do not normally have to rely on higher-order motion.

28.5.2c Effects of Number of Dots

The KDE improved in perceived depth, rigidity, and coherence as the number of texture elements was increased up to about 30 (Green 1961; Dosher et al. 1989a). Turner et al. (1995) used a 2-D display of dots that represented each of several surfaces oscillating in depth about a vertical axis. Subjects had to decide as quickly as possible whether the dots lay on a surface or were randomly distributed in depth. They performed above chance when there were only 5 dots, but accuracy and speed improved as the number of dots was increased to 16, the largest number tested. A saddle-shaped surface required more dots than a cone-shaped or cylindrical surface. Uttal (1988) obtained similar results. Todd (1985) found that a display of 10,000 dots produced a more compelling KDE than one containing only 100 dots.

The KDE is stronger for a display of connected lines than for a display of dots (Green 1961; Todd et al. 1988). However, a display of moving dots can still create an impression of a rotating 3-D object, such as a cylinder, when large areas of the display are blank. The visual system interpolates a surface over the blank areas (Saidpour et al. 1992; Treue et al. 1995). Also, a KDE occurred when a 20-cm-wide projection of a cube rotating about its vertical axis was viewed through a 2-cm-wide vertical slit (Day 1989). The shape of an oscillating ellipsoid could be detected at above chance when 90% of the display was occluded by a series of bars (Eby and Loomis 1993).

28.5.2d Effects of Motion Noise

The KDE was still evident in an array of moving points simulating a rotating sphere when a large proportion of the points moved in an incoherent direction (Petersik 1979). The KDE became more compelling as the number of coherently moving dots was increased. However, the KDE became more resistant to noise when the total number of dots was reduced (Todd 1985). Increasing dot density must make it more difficult to match corresponding dots in successive frames. Noise is also less disruptive when the number of motion frames in a computer-generated display is increased (Doner et al. 1984). Andersen and Wuestefeld (1993) found that subjects could detect a motion-defined corrugated surface when there were twice as many noise elements in a random volume as there were elements on the surface. Noise elements had less effect when their velocity did not overlap that of surface elements.



28.5.3 AMBIGUITIES IN THE KINETIC DEPTH EFFECT

28.5.3a Ambiguity of Depth, Rotation Direction, and Shape

In the image of a rotating object produced by parallel projection, relative motion conveys the 3-D shape. However, images produced by parallel projection lack changes in perspective and, without perspective, there is no information about depth order. It is impossible to tell which is the front and which is the back of the object. The 3-D object is ambiguous with respect to its reflection in the plane of projection (sign of depth). As a result, the object periodically appears to reverse in perspective. For example, the perceived depth of the orthographic image of a transparent textured cylinder rotating about its vertical axis spontaneously reverses. Each reversal of depth is accompanied by an apparent reversal of the direction of rotation. Rather than appearing as a full cylinder, it may sometimes appear as two concave or two convex half cylinders rotating in opposite directions (Hol et al. 2003).

The image of a rotating object formed by polar projection, with the eye at the center of projection, creates the same retinal image as that created by the object. For example, the image of a 3-D array of points rotating coherently about a vertical axis contains enough information to specify all spatial attributes of the object, including direction of motion. There are two main components of changing perspective in the polar image of an object rotating in depth about an axis in a frontal plane.

1. A change in texture density in the direction orthogonal to the axis of rotation.

2. A change in texture density in a direction parallel to the axis of rotation.

Figure 28.20 illustrates these components in the polar image of a rectangle rotating about a vertical axis.

Figure 28.20. Components of changing perspective: a frontal-plane rectangle; the horizontal gradient and vertical convergence produced by slant; the horizontal gradient component only; the vertical convergence component only. (Adapted from Braunstein 1972)

Guastella (1966) described the monocular cues to the direction of rotation of an object about an axis in a frontal plane. Image changes orthogonal to the axis of rotation can be isolated by taking a horizontal line of dots at eye level rotating about a vertical axis, as depicted in Figure 28.21. The direction of rotation is specified by each of the following pieces of information.

1. Differential motion parallax For both directions of rotation, the images of dots moving over the sector nearer the viewer move between positions of maximum displacement more rapidly than do the images of dots moving over the more distant sector. This is the traditional cue of motion parallax.

2. Lateral image displacement The image of a dot at distance r from the axis of rotation is maximally displaced from the axis of rotation when its line of sight is tangential to the path of motion. For a center of rotation at viewing distance d, this occurs when the line of dots makes angle θ with respect to the frontal plane, where θ = sin⁻¹(r/d). Figure 28.21 shows that, as an array of dots rotates clockwise from position 1 to position 2, the whole image displaces to the left. With counterclockwise rotation, the image displaces to the right.

3. Velocity gradient With clockwise rotation, the images of outer dots on the left become maximally displaced before dots nearer the axis of rotation. This creates a gradient of image velocity along the line of dots. The left-right order of the gradient reverses with counterclockwise rotation.

4. Occlusion With large opaque texture elements, those moving in front periodically occlude those moving behind.

Hershberger and Urban (1970) found that subjects could use differential lateral displacement (cue 2) to make reliable judgments of the direction of rotation of a horizontal array of dots seen in polar projection, and that performance was improved by the addition of cue 1 or cue 3. Hershberger et al. (1974) found that subjects could use the velocity gradient (cue 3) alone to judge rotation direction.

Figure 28.21. Three cues to direction of rotation in depth. Images of dots moving over the near sector move between positions of maximum displacement more rapidly than do the images of dots moving over the far sector. As the row of dots rotates clockwise from position 1 to position 2, the whole image displaces to the left. With counterclockwise rotation, the image displaces to the right. With clockwise rotation, the images of dots further from the axis of rotation are maximally displaced before those of dots nearer the axis of rotation, as indicated by the tangent on the smaller circle. The lag effect reverses when rotation is reversed. (Adapted from Hershberger and Urban 1970)

Image changes parallel to the axis of rotation can be isolated by taking a single vertical line rotating around a vertical axis some distance from the line. Its image grows in size on the approach half of the rotation and shrinks on the receding half. The amount of change in image size depends on the length of the line, its offset from the axis of rotation, and the viewing distance of the axis of rotation. These changes indicate the direction of rotation and are preserved in polar projection but not in parallel projection.
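The first two cues are easy to verify numerically. The sketch below assumes a nodal point at the origin viewing a dot at radius r on a circle whose center lies straight ahead at distance d, with image position taken as the tangent of visual eccentricity (x/z); the phase of maximum eccentricity agrees with the tangency relation θ = sin⁻¹(r/d) given above. The values of r and d are arbitrary.

```python
import numpy as np

def image_pos(r, alpha, d):
    # Tangent of the visual eccentricity of a dot at radius r and
    # rotation phase alpha (alpha = 0 puts the dot nearest the eye)
    # on a circle about a vertical axis at distance d.
    x = r * np.sin(alpha)
    z = d - r * np.cos(alpha)
    return x / z

d, r, dt = 10.0, 2.0, 1e-4

# Cue 1: differential motion parallax. The image of a dot crossing the
# near sector moves faster than one crossing the far sector.
for alpha0, label in [(0.0, "near sector"), (np.pi, "far sector")]:
    v = (image_pos(r, alpha0 + dt, d) - image_pos(r, alpha0, d)) / dt
    print(label, abs(v))      # r/(d - r) = 0.25 vs r/(d + r) ~ 0.167

# Cue 2: maximum lateral displacement occurs where the line of sight is
# tangent to the dot's path, i.e., when the row of dots lies
# asin(r/d) from the frontal plane through the axis.
alphas = np.linspace(0, np.pi, 200001)
peak = alphas[np.argmax(image_pos(r, alphas, d))]
print(np.degrees(peak), 90 - np.degrees(np.arcsin(r / d)))  # both ~78.46
```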





Both components of changing perspective are contained in the image of a surface rotating about a vertical axis. Changes orthogonal to the direction of rotation increase as the horizontal size of the object increases, and changes parallel to the axis of rotation increase as its vertical size increases. Braunstein (1972) created polar projections of rectangles or trapezoids rotating about a vertical axis. Rotation direction was detected just as well when only vertical changes in perspective were present as when both vertical and horizontal components were present. However, it is not clear whether the overall horizontal dimension changed. Accuracy was low for all projection distances when only the horizontal texture-gradient component was present. Braunstein also found that accuracy was higher for a rotating rectangle than for a rotating trapezoid, presumably because angular changes in perspective are easier to detect relative to a baseline of 90˚ than to a baseline of some other angle.

Subjects could detect the direction of rotation of a simulated sphere rotating about a vertical axis when perspective changes were confined to those parallel to the axis of rotation (Braunstein 1977). These changes consist of different curvatures of the flow field in the images of the near and far sides of the sphere. Subjects performed as well as when all perspective changes were present. Performance was at chance when the only perspective changes were orthogonal to the axis of rotation. These changes consist of a difference in velocity between images of the near and far sides of the sphere.

Hershberger et al. (1976a) used a computer-generated rectangular array of vertical lines rotating about the central line. Increasing either the horizontal or the vertical size of the display improved the ability of subjects to judge the direction of rotation. They set the horizontal perspective changes to indicate one direction of motion and the vertical changes in perspective to indicate the opposite direction of rotation. When the display appeared to rotate in the direction of the horizontal changes, the lines appeared straight but changed in length. When the display appeared to rotate in the direction of the vertical changes, the lines appeared constant in length but appeared to undulate in shape. They appeared straight when the display was in its sagittal position but S-shaped when the array was frontal.

For any 3-D object rotating about an axis in a frontal plane, a perceived change of depth sign is necessarily accompanied by an apparent change in direction of rotation. There is also a necessary linkage between perceived depth sign and perceived object rigidity. In polar projection, the image of the far side of a rectilinear object is smaller than that of the near side. A rotating rectilinear object is seen as rigid only when its small-image side is perceived to be more distant than its large-image side. Only then are perspective changes consistent with a rotating rigid object. When the object appears in reversed depth it appears to rotate in the opposite direction and deform as it rotates. Now perspective changes are inconsistent with a rigid object. For example, a rotating skeletal cube seen in reverse depth appears like a deforming trapezoid. The assumption of object rigidity may delay apparent reversal of rotation of a skeletal



cube but does not prevent it (Schwartz and Sperling 1983). Even a binocularly viewed 3-D object can appear to reverse its perspective and its apparent direction of rotation, and lose its perceived rigidity, as we saw in Section 26.7.

Schwartz and Sperling (1983) found that the perceived direction of rotation of the projected image of a skeletal cube was strongly influenced by a brightness cue to depth. The cube was seen to rotate in a direction compatible with the perceived near side being brighter than the perceived far side.

With parallel projection of an array of points on a rotating cylinder, the direction of rotation is unspecified and is therefore perceptually unstable. Fang and He (2004) found that when each end of a rotating cylinder contained disparity that specified rotation direction, as shown in Figure 28.22B, the perceived direction of the whole cylinder became stable. Perceived direction was also stabilized by the addition of a stationary textured patch that occluded part of the far surface of the cylinder, as shown in Figure 28.22C. Inspection of either of the stabilized displays for 1 minute caused a subsequently seen ambiguous test cylinder to appear to rotate in the opposite direction. The aftereffect was specific to the location of the induction stimulus and disappeared when the adapting and test stimuli were presented in different disparity-defined depth planes. See Section 31.6.1 for more discussion of aftereffects of rotation in depth.

Figure 28.22. Disambiguation of rotation direction in a KDE. The direction of rotation in a cylinder in parallel projection is ambiguous, as indicated in (A). Rotation direction is stabilized by the addition of disparity to the ends of the cylinder, as in (B), or by occluding one set of elements, as in (C). (Adapted from Fang and He 2004)


In Section 28.3.3e it was shown that proprioceptive and vestibular information produced by self-motion disambiguates depth structure in linear motion parallax. Van Boxtel et al. (2003) asked subjects to report reversals of depth order and rotation direction in the polar projection of a rectangular surface rotating about a vertical axis. Reversals were less frequent when the image motion was coupled to active side-to-side motion of the head than when the same motion was seen with a stationary head. Responses of cells in the medial temporal cortex (MT) of the monkey change when the animal reports a change in direction of motion of the image of a rotating textured cylinder (Section 5.8.4d).

28.5.3b Independence of Perspective Reversals

It was mentioned in Section 26.7.1 that spatially separated ambiguous stimuli tend to reverse together when they appear to be grouped. In general, different kinetic depth displays seen at the same time in different regions of the visual field reverse their apparent direction of motion independently. However, figural grouping between two displays can cause them to reverse in synchrony. Coupling of apparent reversals in two KDE displays can be used to investigate the perception of figural grouping of elements.

Gillam (1972, 1976) created a KDE display that simulated two rods attached across a rotating vertical shaft. The rods appeared to reverse their direction of rotation independently when they were parallel and some distance apart. But they tended to reverse at the same time when they were not far apart, especially when they converged on a common point and had collinear end points. Thus, synchrony of reversal occurred when the configuration of the lines prompted the impression of a connected surface.

Grossmann and Dobbins (2003) created random-dot KDE displays depicting a 3-D cube, ellipsoid, or hemisphere rotating in depth. One display depicted a transparent 3-D object in which the direction of rotation was ambiguous. A second display placed below the first was either transparent like the first display or opaque so that its direction of rotation was unambiguous. The two displays rotated about a common vertical axis. The similar displays tended to reverse together, but the apparent direction of the ambiguous display was not affected by the direction of rotation of the unambiguous display. Grossmann and Dobbins concluded that coupling of reversals occurs only between two ambiguous KDE displays. But the crucial factor may not have been stimulus ambiguity but the fact that two transparent displays are more similar than a transparent and an opaque display.

Freeman and Drever (2006) used two side-by-side KDE displays depicting transparent cylinders rotating about their central axes. They were separated by a 4˚ gap and rotated about a common horizontal axis or about parallel vertical axes. When the direction of rotation of one of the displays was objectively reversed, the other display tended to reverse. This effect was most evident when the objectively reversed display contained disparity-defined depth and when the two displays rotated about a common axis. Thus an unambiguous display can bias the perceived direction of an ambiguous display as long as the two displays are sufficiently similar.

The projected images of two random-dot spheres rotating at the same speed appeared to rotate in opposite directions when they were touching to simulate frictional contact (Gilroy and Blake 2004). The coupling of apparent rotation directions broke down when the spheres were not touching or when they were touching but not rotating at the same velocity.

28.5.3c Ambiguity of Slant Angle

Pure motion-in-depth of a rigid object is not registered in parallel projection. Any velocity gradient in an image produced by parallel projection is due only to differential motion of object points along planes parallel to the projection plane. An example is motion parallax produced by rotation of a plane surface about an axis in the frontal plane. A slanted surface rotated in depth through a small angle can produce the same gradient of image displacements as a less slanted surface rotated through a larger angle, as shown in Figure 28.18. The ambiguity cannot be resolved with only two views. However, it can be resolved with three or more successive views, because the images of surfaces at different angles have different gradients of velocity for rotation at constant velocity. We saw in Section 28.5.2b that there is conflicting evidence about whether people use second-order motion parallax.

In polar projection, a rigid object cannot move purely in depth (not all points can move along visual lines). Therefore, the looming image of a receding or approaching rigid object produced by polar projection contains horizontal and vertical gradients of velocity. This removes the ambiguity inherent in two-view parallel projection.

Van Veen and Werkhoven (1996) produced orthographic images of two horizontally separated random-dot surfaces viewed through a circular aperture. The standard surface rotated back and forth about a central vertical axis through various amplitudes between 28 and 89˚. The inclination of the standard surface with respect to vertical varied between 15 and 60˚. Subjects adjusted the inclination and rotation amplitude of the other surface until the two surfaces appeared the same. Differences in the matching of inclination between the test and standard surfaces were inversely related to differences in the matching of rotation amplitude, in accordance with the above formulation. Van Veen and Werkhoven referred to this as metamerism. It is not metamerism but projective equivalence (Section 4.2.7). In metamerism, stimulus equivalence arises in the visual system, not in the stimulus. Projective equivalence





arises in the optic array. For large amplitudes of rotation and small angles of inclination, the ways in which the two features were matched to the standard were not correlated. This suggests that second-order motion parallax was used to provide separate estimates of inclination and rotation magnitude.

Domini and Caudek (1999) found that observers handled the ambiguity of a slant angle in parallel projection by estimating the object most likely to produce the given stimulus. Observers also tended to base estimates of whether a surface was rotating at a constant or variable velocity on whether shear velocity was constant, even when changing velocity was not accompanied by a change in shear velocity (Domini et al. 1998).

Eagle and Hogervorst (1999) used a random-dot display that simulated two surfaces forming a vertical convex ridge in the subject's midline. A standard pair formed a dihedral angle of 60˚ and rocked to-and-fro about the ridge axis through 4˚ at 1.5 Hz. Subjects discriminated between the standard pair and a test pair at another dihedral angle, with mean image speed held constant. For both polar and parallel projection with a small display (4.6˚ wide by 6˚ high), a difference of dihedral angle of about 90˚ was required before the difference could be discriminated. As display size was increased to 32˚ by 41˚, performance fell to chance for parallel projection. For parallel projection, the velocity gradient across the display was constant for the different dihedral angles, but there was a difference in the change of velocity over time. This signal should have become more evident with larger displays. The poorer performance with larger displays produced by parallel projection was presumably due to the fact that the absence of convergent perspective made it increasingly evident that the display was not slanted. Under polar projection, which contains convergent perspective, smaller differences in dihedral angle were detected as stimulus size increased. Eagle and Hogervorst concluded that second-order motion signals contribute to depth perception (see also Hogervorst and Eagle 2000) but that the improvement in discrimination with increasing display size is due to the additional first-order motion-parallax information provided by polar projection.

Cortese and Andersen (1991) obtained an impression of depth from the orthographic projection of a single 3-D ellipsoid oscillating in depth about its vertical minor axis and seen against a background of random dots. The effect worked best when the major axis of the ellipsoid was small. At any instant, the same image is produced by a wide ellipsoid at a steep angle to the frontal plane as by a narrow ellipsoid at a less steep angle. Because of this, Cortese and Andersen found that any error in the perceived extent of rocking of the ellipsoid was accompanied by a corresponding error in the perceived shape of the ellipsoid. This ambiguity was removed when an ellipsoid rotated through 360˚ and was perceived to do so. With complete rotation, the



maximum image width indicates the length of the major axis, the minimum width indicates the length of one of the minor axes, and the length of the unchanging image dimension indicates the other minor axis.

28.5.3d Rotation Axis Relative to Axis of Curvature

Thresholds for detecting depth in the parallel projection of a textured cylinder rocking about an axis parallel to the cylinder axis were about twice those for a cylinder rocking about an orthogonal axis (Cornilleau-Pérès and Droulez 1989). Rotation in depth about an orthogonal axis introduces a variation of curvature (spin variation) into the images of lines running round the cylinder. Rotation about an axis parallel to the cylinder axis does not. Thus, changing image curvature is a cue to depth. Norman and Lappin (1992) obtained similar results using a wider variety of stimuli (Portrait Figure 28.23).

Theoretically, first-order spatial derivatives of optic flow produced by parallel projection of a surface rotating in depth determine the tilt of the axis of rotation and the slant of the surface (Hoffmann 1982). However, both angles are ambiguous with regard to reflection in depth. Caudek and Domini (1998) discussed how well observers judge the tilt of the axis of rotation of planar surfaces seen in two-view and multiview parallel projections.

Figure 28.23. J. Farley Norman. Born in Wichita Falls, Texas, in 1961. He obtained a B.A. from the University of Texas in 1983 and a Ph.D. from Vanderbilt University with J. Lappin in 1990. He conducted postdoctoral work at Brandeis University and the Ohio State University. In 1996 he was appointed professor of psychology at Western Kentucky University.


28.5.3e Discrimination of Rigid from Nonrigid Motion

A fully rigid object has the following properties:

1. Unique matching of object and image points Optically, each object point produces only one image point, but object points lying on the same visual line fall on the same image point. Image-point to object-point correspondence is relatively easy to perceive for most opaque objects. However, the correspondence of image elements to object elements may be lost when object points lie on or near the same visual line. This happens in the image of an extremely slanted textured surface and in transparent textured objects. The image of a densely textured transparent object or of superimposed transparent objects may degenerate into a mishmash of discontinuous elements.

2. Constant distances between object points Strict rigidity requires that all distances between object points in 3-D space remain constant. However, an object will appear rigid if only the distances between object points that define the 3-D structure of the object remain constant. Texture elements may move or deform over the surface of the object without destroying the percept of a rigid object. For example, a soap bubble remains a rigid sphere even though streaming is visible over its surface.

3. Continuity of object points We can usually see the continuous motion of each point of a moving object. Object points are accreted or deleted only at the edges of a rotating object. However, an object may appear rigid when all texture elements are frequently changed as long as the edges of the object remain well defined and texture gradients are not changed. A rapid change of texture creates an impression of an object with a scintillating surface.

In theory, relative motion produces an impression of a rotating 3-D object only if it could arise from differences in the distances of points from the center of rotation. This is the theoretical rigidity condition for rotation. Any other relative motion in the image of a rotating object is due to deformation. A typical KDE display satisfies the rigidity condition, but that does not guarantee that it will appear rigid. A viewer can adopt any of many interpretations. If a KDE display is seen as flat, relative motions are interpreted as deformations. If all relative motions are seen as arising from motion parallax, the display is seen as a rigid 3-D object. There is only one rigid interpretation of a rigid 3-D object. However, there are many interpretations when relative motions are ascribed partly to deformation and partly to motion parallax.

A preference for seeing objects as rigid is known as the rigidity assumption. However, we readily see nonrigid objects when the motion signals do not satisfy the rigidity condition. We see the articulations of people's limbs, changing facial expressions, and the changing shapes of clouds. Even with ambiguous stimuli, the rigidity assumption may be overruled by other assumptions. For example, the Ames window appears to deform as it rotates because of the influence of perspective.

Two successive views of four points distinguish rigid from nonrigid motion. Braunstein et al. (1990) found that observers could also perform this task with two views of a parallel projection of four points simulating 3-D rotation. Addition of points moving incoherently degraded performance. However, a rigid 3-D object could be discriminated from a nonrigid object from two successive views.

Although the KDE is seen most readily when generated by a rigid 3-D object, an object that is not completely rigid may generate a KDE that is perceived as a nonrigid object (Jansson and Johansson 1973). Subjects were able to partition the motion signals into those arising from parallax and those arising from object deformation (Todd 1984). Ullman (1984) developed a model of how an observer can deal with nonrigid objects. The internal representation is progressively modified to account for the observed transformation of the image (see also Grzywacz and Hildreth 1987; Landy 1987).

We will now see that, for certain KDE displays, it is difficult to decide whether the object is rigid or nonrigid. A single moving line presents problems because motion along the line is not visible. Also, the direction of motion orthogonal to the line is ambiguous (Section 22.3.1). The same pattern of two-frame motion can be produced by a large range of 3-D rigid or nonrigid objects containing only straight contours with no attached end points (Rubin et al. 1995a). With such stimuli, observers cannot detect the axis or direction of rotation or discriminate between rigid and nonrigid motion (Rubin et al. 1995b).

Projected images of rotating smoothly contoured opaque objects also present problems. They lack well-defined points, and one part of the object may occlude another part. Points that project to the rim of the image at one instant may project into the blank center of the image at another instant. Todd (1985) illustrated this problem with computer simulations of a horizontal 3-D opaque ellipsoid rotating about a vertical axis and of a vertical ellipsoid rotating about a horizontal axis. They were presented singly or together in overlapping or distinct orbits. Neither the single ellipsoids nor the pair of spatially separated ellipsoids produced a compelling KDE. They appeared as deforming ellipses in a frontal plane. The pair of intersecting ellipsoids produced a strong KDE with nondeforming ellipsoids. Todd concluded that this strong KDE was due to the yoked motion of the intersecting ellipsoids plus the changing pattern of contour intersections. The single ellipsoids produced a compelling KDE when texture or smoothly





graded shading was added. The added contours helped to solve the motion correspondence problem.

A KDE in polar projection produced greater depth but appeared less rigid than a KDE in parallel projection. A display in reversed polar projection, in which perspective transformations were distorted, appeared grossly nonrigid (Dosher et al. 1989a).

The rigidity assumption manifests itself when stimuli that do not arise from rigid objects are nevertheless perceived as rigid. However, a display may appear as a rigid object when the 2-D image does not conform to the rigidity condition. For example, Ramachandran et al. (1988) superimposed two dot patterns moving in opposite directions at constant velocity with a periodic reversal of direction. Instead of two flat planes, observers saw a rigid cylinder.

Observers were unable to detect when a simulated wire frame stretched in its depth dimension as it rotated through 360˚, even though the stretching was indicated by second-order velocity information (Norman and Todd 1993). The deforming figure appeared rigid. Observers could detect nonrigidity arising from differences in the relative phases of a 2-D display simulating two dots rotating in depth about a vertical axis. Norman and Todd concluded that people perform this latter task by detecting only the sign of acceleration and that people are insensitive to higher-order spatiotemporal relations, such as relative accelerations, that extend over more than two frames. However, Hogervorst et al. (1996) showed that, for this type of stimulus, rigid and nonrigid objects produce almost the same projected image. Even an ideal observer would have difficulty detecting the difference. Subsequently, Hogervorst et al. (1997) showed that there is a large class of nonrigid transformations that cannot be detected by the visual system. Performance should be measured with stronger second-order stimuli before it is claimed that the visual system is insensitive to them.
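The rigidity condition stated in the previous section lends itself to a direct computational formulation: a configuration of points moves rigidly if and only if all pairwise 3-D distances stay constant over time. The sketch below illustrates that definition only; it is not a model from any of the studies cited, and the stretch applied to the depth dimension is an arbitrary illustration of the kind of deformation Norman and Todd used.

```python
import numpy as np

def pairwise(p):
    # Matrix of distances between all pairs of points in one frame.
    return np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)

def is_rigid(frames, tol=1e-9):
    # frames: (n_frames, n_points, 3). Rigid iff every interpoint
    # distance is the same in every frame.
    d0 = pairwise(frames[0])
    return all(np.allclose(pairwise(f), d0, atol=tol) for f in frames[1:])

def rot_y(a):
    # Rotation about a vertical (y) axis.
    return np.array([[np.cos(a), 0, np.sin(a)],
                     [0, 1, 0],
                     [-np.sin(a), 0, np.cos(a)]])

rng = np.random.default_rng(1)
pts = rng.standard_normal((20, 3))
angles = np.linspace(0, np.pi / 4, 10)
rigid = np.stack([pts @ rot_y(a).T for a in angles])
# Stretch the depth (z) dimension a little more on every frame:
stretched = np.stack([f * [1, 1, 1 + 0.05 * i] for i, f in enumerate(rigid)])
print(is_rigid(rigid), is_rigid(stretched))   # True False
```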

28.5.3f Object Familiarity and the KDE

Experience can influence whether a KDE display is perceived as rigid or nonrigid. Sinha and Poggio (1996) exposed subjects to the training KDE display shown in Figure 28.24A. The display rocked through an angle of 40˚ about a horizontal axis in the frontal plane for 60 s. Subjects were then shown the test KDE display in Figure 28.24B. Although the training and test objects were not the same, they produced the same 2-D shape at the midpoint of their motion. For subjects not exposed to the training display, the test KDE display appeared rigid on almost all trials. For subjects exposed to the training display, the test display appeared nonrigid on 40% of trials. The effect of training was diminished when the test object rocked about a vertical axis orthogonal to that of the training object. Sinha and Poggio concluded that the visual system learns the particular motion sequence generated by rotation of a



Figure 28.24. KDE displays used by Sinha and Poggio (1996). (a) Training KDE display: 3-D objects rocking through 40˚. (b) Test KDE display. (c) KDE display that includes the image of a cube. (Reprinted by permission from Macmillan Publishers Ltd)

familiar 2-D display. When a familiar 2-D display generates an unfamiliar motion sequence it is perceived as arising from a nonrigid object. They illustrated this point using the KDE display shown in Figure 28.24C. This is the image sequence produced by a noncubic object, which at one point in its rotation produces an image that would be produced by a cube. This display appears highly nonrigid. A similar KDE display that did not include the cubic image appeared rigid.

28.6 THE STEREOKINETIC EFFECT

Figures rotating in a frontal plane about the visual axis at about 1 rev/s can create the impression of a 3-D object. According to Musatti (1924), these effects were first described by Benussi, who called them stereokinetic effects (SKEs). Readers may observe KDEs by copying the displays in Figures 28.25 to 28.28 and rotating them on a gramophone turntable. The impression of a 3-D object is most evident

OT H E R M E C H A N I S M S O F D E P T H P E R C E P T I O N

Figure 28.25. Nested cones KDE display. When rotated, it produces two nested cones that alternate in their depth relationships. (Redrawn from Wilson et al. 1983)

with monocular viewing. With binocular viewing, the absence of disparity indicates that the display is flat. Stereokinetic effects take time to emerge. With practice, this time can be reduced (Bressan and Vallortigara 1987). Stereokinetic effects, like other forms of motion parallax, produce impressions of depth that are just as convincing as those produced by binocular disparity. However, the depth order of a SKE remains ambiguous in the absence of other depth cues.

Nested circles offset in different directions, as in Figure 28.25, create the percept of a double cone. Since the sign of depth of each cone is ambiguous, the display can be seen in four ways. It can be seen as a hollow or protruding small cone set inside a hollow or protruding large cone (Wilson et al. 1983).

Musatti described a display consisting of an ellipse, as in Figure 28.26, rotating eccentrically about the center of a disk. This display contains both element rotation and slew rotation. With continued observation, it undergoes three perceptual transformations, because the visual system can parse the display in three ways. At first, it appears as it is—a

Musatti’s rotating ellipse. When the disk is rotated, the ellipse eventually appears as a circular disk rotating on a hemisphere.

Figure 28.26.

rigid ellipse changing its orientation (rotating about its own center) as it rotates eccentrically about the point in the plane. It then appears as a nonrigid ellipse that maintains a constant orientation in space by transforming its shape as it rotates, like an amoeba. This removes element rotation at the expense of loss of rigidity. After some time, the ellipse appears as a rigid circular patch on the side of a rotating hemisphere. This percept removes element rotation while preserving rigidity. After prolonged inspection, the ellipse may appear like a rotating egg elongated in depth (Bressan and Vallortigara 1986) or as a slanted ellipse that varies in aspect ratio as it rotates (Mefferd and Weiland 1967). Todorovic (1993) provided a vector analysis of these transformations. Each interpretation is consistent with the proximal stimulus. Different interpretations come into operation sequentially. Rokers et al. (2006) concluded that the final rigid 3-D interpretation is one that produces the slowest and smoothest motion. Thus the most persistent interpretation is not the initial one but the final one.

Wallach et al. (1956) devised the display shown in Figure 28.27. When rotated, it first appears as two linked circles, each rotating about its own center and both rotating eccentrically about the display's center of rotation. After a while, each circle appears to stop rotating about its own center because there is no physical signal for that motion. When this happens, the observer perceives the two circles sliding over each other as they rotate about the center of the display. Finally, after some time, the display looks like a rigid 3-D dumbbell wobbling about a fixed point.

Rotation of the pattern in Figure 28.28 creates a nonrigid 3-D figure. The arcs appear to slide over each other in different relative depth arrangements (Braunstein and Andersen 1984b; Wilson and Robinson 2000). Like the amoeboid movement of a rotating ellipse, this violates the assumption of rigidity.

A white bar lying along a radius of a black disk rotating in a frontal plane eventually appears to rotate about its own center and to be inclined in depth. At the same time, the

Wallach’s KDE display. When rotated, the linked circles appear to slide over each other. Then they appear as a 3-D wobbling dumbbell.

Figure 28.27.





Figure 28.28. Rotating arcs. When rotated, the arcs appear to slide over each other in various depth relationships.

Figure 28.30. Dennis R. Proffitt. Born in Altoona, Pennsylvania, in 1948. He obtained a B.S. in 1970 and a Ph.D. in 1976, both from Pennsylvania State University. He joined the psychology department at Wesleyan University in 1976. In 1979 he moved to the University of Virginia, where he is now professor in the department of psychology.

Figure 28.29. Components of eccentric rotation of a line element: eccentric rotation, slew rotation, and element rotation.

apparent length of the line increases, because an inclined line has to be longer to project the same image as a line in the frontal plane (Zanforlin and Vallortigara 1988). This depth percept minimizes relative velocity between all points on the rotating bar (Beghi et al. 1991a, 1991b).

Consider a line rotating in a frontal plane about a point outside the line, as in Figure 28.29. Its rotation can be decomposed into two components. The first is element rotation of the line about its own center. Note that element rotation of a circle is not visible. The second component is that about the center of the display, with

the line maintaining a constant orientation. This is slew rotation (Proffitt et al. 1992) (Portrait Figure 28.30). There is no image deformation or translation in this stereokinetic effect.

Two or more circles rotating eccentrically about a common point are known as Benussi circles (Figure 28.31A). At first, they appear as circles in a plane. They then appear as a cone wobbling on a base fixed in a frontal plane. The same image is produced by wobble motion of a cone that deforms as it wobbles so that its circular contours project as circles. This is nonrigid wobble. This impression

Figure 28.31. Benussi circles create a cone with shear wobble.

Figure 28.32. Rotating ellipses create a rocking rigid cone.

is preserved even when the display has some faint texture that is seen to rotate. The rotating texture perceptually segregates from the cone and appears to float round it (Wilson et al. 1983).

The magnitude of wobble is specified by angle θ in Figure 28.31A. The depth, D, of the cone is given by e/tan θ, where e is the linear eccentricity of the image of the tip of the cone. This means that a cone with little depth but large wobble can produce the same image as a cone with large depth but small wobble, as shown in Figure 28.31B.

The circular contours of a rigid cone wobbling in a circular motion about the visual axis project as ellipses (Figure 28.32). Ellipses contain visible element rotation and create rigid wobble. If wobble is specified by angle φ, then the minor axis of the elliptical image of the base of the cone is proportional to cos φ. If the ellipses are seen as slanted circles, the foreshortening of the base of the cone specifies the angle of wobble. With the angle of wobble determined, the eccentricity of the tip of the cone specifies its depth for a given perceived distance of the display. The display specifies the structure of the cone, except that depth order remains ambiguous.

Perceived depth in Benussi circles increases linearly with increasing eccentricity of the tip of the cone (Fischer 1956; Wilson and Robinson 1986). But, because depth and wobble amplitude are reciprocally related, increasing eccentricity could be seen as increasing wobble. So why is larger asymmetry interpreted as greater depth rather than greater wobble? A preference for perceiving rigid motion may provide an explanation. Perhaps we perceive that degree of depth that reduces the perceived nonrigid wobble to a threshold value so that the cone can be interpreted as rigid. A related point is that a rigid cone of small depth and large wobble produces a large foreshortening effect. Benussi circles have no foreshortening and therefore can be interpreted as arising from a rigid cone only if perceived wobble is small enough to allow the lack of foreshortening to be ignored. Indeed, Benussi circles create the impression of a rigid wobbling cone rather than of a shearing cone.
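Taking the reciprocal relation as reconstructed above (tip eccentricity e = D tan θ, so D = e/tan θ), a few values show the trade-off. This is a worked illustration of that reading of the formula, not a computation from the original text, and the numbers are arbitrary.

```python
import math

# Reconstructed relation: a cone of depth D wobbling through angle theta
# projects its tip at eccentricity e = D * tan(theta), so D = e / tan(theta).
# The same eccentricity is consistent with a shallow cone wobbling a lot
# or a deep cone wobbling a little.
e = 1.0                                   # tip eccentricity (cm)
for theta_deg in (5, 15, 45):
    D = e / math.tan(math.radians(theta_deg))
    print(f"wobble {theta_deg:2d} deg -> depth {D:.2f} cm")
# wobble  5 deg -> depth 11.43 cm
# wobble 15 deg -> depth 3.73 cm
# wobble 45 deg -> depth 1.00 cm
```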

B Figure 28.33 Subjective contours and stereokinesis. (A) When the display rotates about its center, a black disc appears to lie beyond a circular aperture (After Metelli 1940) (B) When each figure on the left is rotated about its center, the 3-D object on the right emerges. (Redrawn from Sumi 1989)

eccentricity of the whole of a Benussi-ring display does not affect perceived depth (Robinson et al. 1985). A stereokinetic effect can create an impression of figural completion. For example, rotation of Figure 28.33A about the center of the circle creates a circular black disk rotating eccentrically behind a circular aperture (Metelli 1940). The hidden part of the disk is perceptually completed. Rotation of the display about the center of the partly hidden black disk creates a stationary black disk seen through an eccentric oscillating window. Various stereokinetic displays that produce subjective contours, and their effects are shown in Figure 28.33B (Sumi 1989). The artists Marcel Duchamp (see Schwartz 1970) and Frederick S. Duncan (1975) have created several kinetic displays.





29 CONSTANCIES IN VISUAL DEPTH PERCEPTION

29.1 Introduction 122
29.2 Judging absolute depth 122
29.2.1 Visual cues to absolute depth 122
29.2.2 Constancy of depth judgments 123
29.3 Size constancy 125
29.3.1 Procedures and basic findings 125
29.3.2 Size-distance invariance 126
29.3.3 Influence of familiar objects 130
29.3.4 Emmert's law 131
29.3.5 The moon illusion 132
29.3.6 Size-distance scaling in geometrical illusions 133
29.4 Shape constancy 137
29.4.1 Procedures for measuring simple shape constancy 137
29.4.2 Shape-slant invariance hypothesis 138
29.4.3 Shape constancy of 3-D objects 140
29.4.4 Underestimation of in-depth distance 141
29.4.5 Perceiving shapes in slanted pictures 142
29.4.6 Neurological deficits of shape constancy 143
29.5 Speed constancy 144

29.1 I N T R O D U C T I O N

In general, a perceptual constancy is the ability to judge a feature of a distal stimulus as constant in spite of variations in the proximal stimulus. For example, we can recognize the lightness and color of an object in different lighting conditions. The perceived constancy of depth intervals defined by disparity was discussed in Section 20.6.3a. The present chapter is about perceptual constancies associated with the perception of depth defined by monocular cues. These constancies may be classified as follows:

Size constancy is the ability to judge the linear size of an object at different absolute distances despite changes in its angular subtense. Casual observation reveals that we have at least some size constancy. For instance, a familiar object, such as a person or automobile, does not seem to shrink when it moves into the distance.

Constancy of 2-D shape is the ability to judge the shape of a flat object from different viewpoints. One can also inquire whether the function relating perceived shape to object inclination remains the same for different distances of the object.

Constancy of relative depth is the ability to judge the depth interval between two objects when the objects are at different absolute distances. In this type of constancy, people must appreciate how cues to relative depth scale with viewing distance.

Constancy of 3-D shape is the ability to judge the perceived 3-D shape of an object as it is rotated in depth or moved to different distances. For example, observers may judge whether a cube remains cubic as it is rotated or translated in depth.

Speed constancy is the ability to judge the linear speed of an object at different viewing distances.

Many investigators have commented on the dual nature of perceptual judgments. For example, when observing a receding object in a normal scene we say that it remains the same size. However, at the same time, we appreciate that the angular size of the object is shrinking. Gibson (1950a) used the term “visual world” to refer to the world seen with depth constancies present. He used the term “visual field” to refer to the world seen as a 2-D display. He stressed that, in normal visual scenes, the visual field can be experienced only by deliberately adopting a “pictorial attitude.” Even then, visual-field judgments are usually erroneous. We greatly underestimate the extent to which the image of an object shrinks with increasing distance. Artists must learn to see the world as a 2-D display. They often rely on artificial aids, such as viewing through an aperture, using vanishing points, or a camera obscura (Section 2.9.4).

2 9 . 2 J U D G I N G A B S O LU T E D E P T H

29.2.1 V I S UA L C U E S TO A B S O LU T E D E P T H

Most constancies associated with visual perception of depth require information about the absolute distance between objects and the observer. Sources of information about distance are listed below.

Image size. For a small object, image size varies inversely with viewing distance. This is not a cue to absolute distance unless the object has a fixed known size. Relative image size is an effective cue to relative depth over a large range of distances, but only for objects that have the same size and are perceived as having the same size.

Height in the field. In the simplest case, the distance of an object on a horizontal ground surface is indicated by the vertical angle of gaze when the base of the object is fixated or by the position of the object on the texture gradient extending from the feet to the horizon. This issue was discussed in Section 26.4.2a.

Gradients of optic flow. Sideways motion of an eye with respect to a 3-D array of points produces a pattern of optic flow. The spatial gradient of velocity of the flow field can be detected from three or more successive views. This “acceleration” component provides information about the absolute locations of points in 3-D space if the motion of the eye is correctly registered. Equivalent information can be derived from three stationary cameras. The extra bundle of light rays to the third camera removes the ambiguity about absolute distance in the information provided by two cameras (Ullman 1979).

Vergence. For an interocular distance, a, the vergence angle, θ, varies with distance, D, according to:

θ = arctan(a / 2D)

The vergence state of the eyes could be registered by motor efference and perhaps also by sensory signals generated in the extraocular muscles. However, vergence is an unreliable cue to distance beyond about 2 m (Section 25.2).

Head parallax. The change in the angle of gaze of one eye as it fixates an object during a sideways motion of the head through distance a is a function of the distance of the fixated object. In theory, this cue can operate over longer distances than changes in vergence because the head can move more than the interocular distance.

Horizontal disparity. The horizontal disparity of the images of an object increases in proportion to the distance in depth of the object from the point of binocular fixation. Thus, horizontal disparity indicates only relative distances between objects. Absolute distance would require information about the vergence state of the eyes. This could be provided by motor efference or proprioception. For animals, like frogs, with a fixed angle of vergence, horizontal disparity indicates distance from the unchanging point of convergence. The distance of convergence could be registered as a constant, so neither motor efference nor proprioception is needed (Section 33.4.1c). The binocular disparity in radians, μ, produced by an object at distance d from a fixation point at distance D is, approximately:

μ ≈ ad / D²

where a is the interocular distance. Thus, judgments of relative distance based on disparity must make allowance for this inverse square law. Local disparity gradients over an inclined surface scale linearly with depth. Local disparity curvature is approximately invariant over changes in depth (see Section 20.5).

Vertical disparity. Vertical disparity produced by a point increases as the point moves away from the median plane and the midhorizontal plane of the head. Vertical disparity decreases as the point moves away along a cyclopean line of sight. Thus, the pattern of vertical disparities produced by an extended surface provides information about viewing distance (Section 20.6.3c). This cue is not effective for small central displays, because vertical disparities are very small with displays less than 10°. The same patterns of vertical differences between images can be detected in the successive views of one eye as the head moves from one location to another (Section 28.1.2).

For each of the above sources of information, some feature of the proximal stimulus varies in a systematic way with the distance of the object. However, the function that maps the feature of the stimulus onto distance varies from one source of information to another; the sketch below illustrates two of these mappings. The visual system must reconcile these different mappings if it is to produce a consistent estimate of the distance of an object. Literature on the perception of absolute distance from monocular cues was reviewed by Foley (1980) and Sedgwick (1986). Literature on depth constancy was reviewed by Ono and Comerford (1977).
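To make the vergence and disparity mappings concrete, here is a minimal numerical sketch in Python. It is an illustration only, not from the text; the interocular distance and the function names are assumptions:

```python
import math

IOD = 0.065  # assumed interocular distance, meters

def vergence_angle(D, a=IOD):
    # Vergence angle (radians) for fixation distance D: theta = arctan(a / 2D).
    return math.atan(a / (2.0 * D))

def disparity(d, D, a=IOD):
    # Approximate disparity (radians) of an object at depth d from
    # a fixation point at distance D: mu ~ a*d / D^2.
    return a * d / D**2

# Vergence changes rapidly at near distances but very little beyond
# about 2 m, which is why it is an unreliable far-distance cue.
for D in (0.3, 0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"D = {D:3.1f} m: vergence = {math.degrees(vergence_angle(D)):5.2f} deg")

# The inverse square law: the same 10-cm depth interval yields
# one-quarter the disparity when the viewing distance doubles.
print(disparity(0.1, 1.0) / disparity(0.1, 2.0))  # -> 4.0
```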

29.2.2 C O N S TA N C Y O F D E P T H J U D G M E N TS

This section is concerned with the effect of distance on the perception of absolute depth and of depth intervals. There is a problem because the information indicating a depth interval between two objects varies with their absolute distance. This means that relative depth judgments must be scaled by absolute distance. There is a further problem. We saw in Section 29.2.1 that cues to depth vary in the way they code relative or absolute distance. Therefore, the constancy of judgments of relative depth should depend on which depth cues are available and how well the observer applies the relevant scaling factors. The present section is concerned with constancy based on monocular depth cues, particularly perspective. The constancy of depth judgments based on binocular disparity was discussed in Section 20.6.

29.2.2a Depth Judgments with Impoverished Stimuli

Several investigators have reported that shorter distances are overestimated and longer distances underestimated (Owens and Leibowitz 1976; Morrison and Whiteside 1984). Gogel coined the phrase specific distance tendency to describe regression to the mean distance of a set of test distances. The effect occurs when judgments are based on any cue to distance in isolation, and is reduced as more depth information is provided (Gogel and Tietz 1973; Foley 1980; Philbeck and Loomis 1997). Contraction of judgments toward the mean is not unique to distance judgments. It occurs in any task when there is uncertainty (Mon-Williams et al. 2000). One would expect that judgments of distance in the absence of depth information would tend to the most probable distance in natural scenes. Yang and Purves (2003) found that, in natural scenes, the probability distribution of physical distances of objects has a maximum at about 3 m. This is similar to the distance at which objects appear in the absence of distance information (Gogel 1965) and to the distance of dark vergence (Owens and Leibowitz 1976).

29.2.2b Judging Depth in Natural Scenes

The relation between the judged distance of an object, D′, and its actual distance, D, has been expressed as a power function with exponent n. In its simplest form the function is D′ = KDⁿ, where K is a constant. When n = 1, perceived distance is a linear function of actual distance. When n > 1, larger distances are overestimated, and when n < 1, larger distances are underestimated relative to nearer distances. Empirical exponents have varied between about 0.5 and 1.4 (Teghtsoonian and Teghtsoonian 1970; Da Silva 1985). Various scenes have been used, including a tabletop, a large room, and a grassy field. The range of distances has varied from less than one meter to several kilometers. Viewing has been either monocular or binocular. In some experiments the head was fixed, while in others it was allowed to move and produce motion parallax. Results also depend on whether subjects are attempting to judge linear distances or angular extents (Rogers and Gogel 1975).
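The power-law mapping is easy to explore numerically. A minimal sketch in Python follows; the exponent values are illustrative, not taken from any particular study:

```python
def judged_distance(D, n, K=1.0):
    # Power-law mapping of actual to judged distance: D' = K * D**n.
    return K * D**n

# n > 1 progressively expands far distances relative to near ones;
# n < 1 progressively compresses them.
for D in (1, 2, 5, 10, 50):
    print(D, round(judged_distance(D, 1.4), 1), round(judged_distance(D, 0.7), 1))
```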



In some experiments, exponents obtained from an enclosed environment were close to 1 (Gogel and Tietz 1973; Cook 1978). Several, but not all, investigators found that the exponent was smaller when the range of distances was increased (see Da Silva 1985). Levin and Haber (1993) had subjects estimate distances between 13 stakes at various distances and directions on a grassy area 20 m in diameter surrounded by buildings and trees. With an eye height of 1.5 m and all depth cues present, estimated distance was a linear function of true distance with an exponent close to 1. Distances in a frontal plane were overestimated with respect to actual frontal distances and with respect to in-depth distances, especially for more widely separated objects.

Exponents obtained from an open environment were usually greater than 1 (Galanter and Galanter 1973). Gibson and Bergman (1954) measured constant and variable errors of distance estimates of targets on stakes at distances between 48 and 361 m on a level field. Distances were generally underestimated but far distances were overestimated. Performance improved with error feedback, even though each distance was judged only once (see also Gibson et al. 1955). Recently, Daum and Hecht (2009) found that distances greater than about 100 m on a grassy plane were overestimated. They cited other studies in which far distances were overestimated.

Gogel (1972) reported that when two objects at different distances were seen in dark surroundings, the apparent distance of the more distant object remained the same as when it was alone. However, the distance of the nearer object was underestimated compared with when it was presented alone. Mershon et al. (1993) found the same far-anchor effect for motion-in-depth. They presented two small illuminated rectangles stereoscopically separated in depth, with one of them stereoscopically moving to-and-fro in depth. Subjects more often judged correctly which of the two targets was moving in depth when the nearer target moved compared with when the more distant target moved. In other words, subjects were predisposed to see the more distant target as stationary. A similar predisposition to perceive more distant objects as stationary is reported in Section 22.7.3.

Judgments of distance can be based on verbal estimates in physical units, or on motor responses such as pointing or walking. The standard deviation of pointing has been found to be about half that of verbal estimates of distance within the range of reaching (Foley 1977). Also, walking to a previously seen target at distances between 5 and 90 feet was more accurate than verbal estimation of distance (Andre and Rogers 2006).

When a person verbally estimates the distance of an object, sensory information is mapped onto a learned scale based on a physical unit. When a distance is verbally underestimated we cannot say that visual information is poorly registered or that the internalized scale is too large. We can only say that the mapping of visual information onto the internalized scale is defective. This mapping involves centers in the brain concerned with verbal learning, most likely in the temporal lobe. A learned scale is not required when visual distance is used to direct movements of body parts. When we reach for an object, we map visual information onto motor commands. This involves the parietal lobe and motor cortex.


Accuracy depends only on the adequacy of the mapping. There is no point in asking about the separate accuracy of the sensory information or the motor output. We are constantly reaching for things, but we estimate distances verbally only occasionally. It is therefore not surprising that reaching is more accurate and precise than verbal judgments of distance. This issue is discussed further in Section 34.3.5. Verbal judgments at far distances in natural scenes can be grossly inaccurate (Fine and Kobrick 1983).

When we judge the depth of one object relative to that of another object we can use one object as the unit of measurement, rather than using physical units. Relative judgments can be based on comparison between in-depth intervals and frontal-plane intervals, or on fractionation of a distance into perceived equal intervals or ratios. There are large individual differences, although individuals tend to be consistent (Cook 1978). Purdy and Gibson (1955) asked subjects to bisect or trisect distances to a stake placed between 23 and 229 m away on a grassy quadrangle. Perceived distances corresponded closely to real distances and showed no evidence of being based on relative visual angles. Harway (1963) varied eye height but found that it did not affect the accuracy of marking out 1-foot intervals along the ground. The different methods do not produce the same exponents. For example, Da Silva (1985) found that magnitude estimation produced smaller exponents than ratio estimation or fractionation.

29.2.2c Depth Scaling in Pictorial Displays

One must be cautious when generalizing results from pictorial or computer-generated displays to real scenes. In Section 26.1.1e, it was explained that, as a horizontal square recedes along a ground plane below eye level, the image of its lateral dimension shrinks in proportion to distance, but the image of its in-depth dimension shrinks in proportion to distance squared. This is because, for the in-depth dimension, the effect of increasing distance is compounded by the decreasing angle to the line of sight, as shown in Figure 26.11. However, in a 2-D representation of a textured plane, with other depth cues eliminated, distance estimates conform more closely to their projections in the picture plane. Thus, in pictorial displays, depth intervals become perceptually compressed relative to horizontal distances. Hagen et al. (1978) found that distances to vertical triangles viewed monocularly on a textured tabletop were underestimated by about 25%. However, distances were underestimated by about 50% when the display was viewed through a small hole or presented as a 2-D picture.

Compression of perceived depth is particularly evident in 2-D displays. Andersen et al. (1998) asked subjects to judge the distance between two poles set at various distances on computer-generated textured ground-plane and ceiling surfaces. The distance between the poles became progressively more underestimated with increasing distance. Depth intervals were judged more accurately when the display was optically at infinity than when it was near. The absence of a gradient of accommodative image blur indicates that the display is two-dimensional when the display is near but not when it is far.

Bruno and Cutting (1988) asked subjects to estimate depths between squares on a computer screen, with distance indicated by relative size, height in the field, and occlusion, singly or in various combinations. Subjects based their estimates of relative depth on the sum of the contributions of the cues. Bruno and Cutting proposed that the cues are processed in distinct channels and combine additively. Each cue alone gave only a weak impression of relative depth. One would not expect simple summation of strong cues, such as linear perspective and disparity, because this could produce gross overestimation of depth. The more general rule should be one of weighted averaging (Section 30.1.2).

One must be alert to the possibility that performance changes over repeated trials. This could occur because of fatigue, because comparison and test stimuli interact, or because of learning. For example, exposure to lenses that magnify the image in one eye produces aftereffects in judgments of relative depth after the lenses are removed (Section 9.9.3).

In view of the many conflicting results and the variety of testing procedures, some of which are suspect, it is difficult to draw firm conclusions about relative depth constancy over a large range of distances. Nevertheless, a severe lack of depth constancy is not evident when one observes a natural scene. The perception of depth intervals defined by binocular disparity was discussed in Section 20.6.

2 9 . 3 S I Z E C O N S TA N C Y

29.3.1 P RO C E D U R E S A N D BA S I C FI N D I N G S

Size constancy refers to the ability to judge the size of a rigid object as remaining constant when it is moved in depth while retaining a constant orientation to the line of sight. Judgments need not be accurate, only constant. This section reviews only the main findings from the vast literature on size constancy. For detailed reviews see Ono (1970), Foley (1980), Sedgwick (1986), and Ross and Plug (1998).

There are three procedures for measuring size constancy. The object is typically a vertical rod, disk, or rectangle.

1. Subjects estimate the actual sizes of objects placed at different distances, using a conventional unit of length, an arbitrary scale, or a ratio scale.


2. Subjects adjust the size of a comparison object at a fixed distance until it appears equal to the size of a test object at each of several distances.

3. In a method devised by Anstis et al. (1961), a small light near the subject’s eye casts the shadow of a small disk on a screen. As the screen is moved toward or away from the subject, the ratio of changing image size to changing screen distance depends on the positions of the light and the disk relative to eye and screen. The setting for which the subject perceives no change in the size of the projected shadow indicates the relationship between perceived size and perceived distance required for size constancy. In a variant of the method, the subject moves toward and away from a stationary test display on an oscilloscope (Gregory and Ross 1964a). The rate of change in size of the display is adjusted until the subject reports that its size remains constant. This nulling method works only in a dark room because, if other things are in view, subjects will perceive the change in size of the display relative to the other objects.

In the classic study by Holway and Boring (1941) the size of a comparison disk of light was varied until it was judged to be the same linear size as a test disk. The test disk was shown in various sizes at various distances, but it always subtended 1° of visual angle (see the sketch below). With binocular viewing, the judged size of the test disk conformed closely to its actual size. However, judged size increased slightly with increasing distance, a tendency known as overconstancy. With monocular viewing, the size of the test stimulus was increasingly underestimated as distance increased. Thus, estimates deviated in the direction of matches based on visual subtense (relative image size). When stimuli were viewed through an artificial pupil that eliminated effects of accommodation, and through a tube that eliminated the surroundings, judgments came close to visual subtense matches. The fact that they did not conform exactly to image matches must have been due to residual depth information. Hastorf and Way (1952) obtained size matches conforming to image size when they took greater care to reduce distance cues.

Other investigators have also reported overconstancy at far distances. For example, with good depth cues in an outdoor setting, objects of constant size appeared to increase in size with increasing distance (Gilinsky 1955a). This was especially evident with small objects (Joynson et al. 1965) and unfamiliar objects (Leibowitz and Harvey 1969). The topic of overconstancy was reviewed by Carlson and Tassone (1967, 1971) and Teghtsoonian (1974).

When depth cues are reduced, size constancy falls off as the distance of the stimulus increases beyond 2 m (Harvey and Leibowitz 1967). Leibowitz and Moore (1966) demonstrated that this is because vergence changes little beyond a distance of 2 m.
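As a check on the geometry of such a display, the linear size that a disk must have to subtend a fixed 1° at each distance follows from similar triangles. This is a minimal Python sketch, not from the text; the distances are illustrative:

```python
import math

def linear_size(angle_deg, distance):
    # Linear size of a disk subtending a fixed visual angle at a
    # given distance: S = 2 * D * tan(angle / 2).
    return 2 * distance * math.tan(math.radians(angle_deg) / 2)

# A test disk that always subtends 1 deg must grow in linear size
# in proportion to its distance.
for D in (3, 6, 12, 24):  # meters, illustrative
    print(D, round(linear_size(1.0, D), 3))
```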



Using the nulling method, Gregory and Ross (1964b) found that size constancy for a circle in dark surroundings improved with binocular viewing and when subjects had tactile information that they were moving toward or away from the stimulus. Under good conditions, size constancy operated with objects that subtend a fraction of one degree (Day et al. 1980; Georgeson and Harris 1981). Also, with all depth cues present, size constancy was not affected when the stimuli were exposed for only a fraction of a second (Harvey and Leibowitz 1967).

29.3.2 S I Z E -D I S TA N C E I N VA R I A N C E

29.3.2a The Size-Distance Invariance Hypothesis

Consider a simple spherical object of fixed size. There are three size-distance laws that relate object size, image size (angular size), and distance along a line of sight from the nodal point of the eye. Since these laws refer to objectively measurable quantities, they will be referred to as O1, O2, and O3.

O1. For a small object of constant size, image size varies inversely with distance.

O2. For an image of constant size, object size is proportional to distance.

O3. For objects at a fixed distance, image size is proportional to object size.

Euclid described these laws of geometrical optics over 2,000 years ago (Section 2.1.3b). A perceptual system that correctly registers any two of the variables, object size, image size, and distance, should have no difficulty performing according to these laws. It would be an ideal perceiver of size and distance. If errors can occur in the perception of the variables, we can construct three perceptual hypotheses, which will be referred to as P1, P2, and P3.

P1. For an object with a given perceived linear size (correct or not), perceived image size varies inversely with perceived distance. Misregistration of image size produces a reciprocal error in perceived distance and vice versa. The product of perceived image size and perceived distance is constant.

P2. For an image of a given perceived size (correct or not), judged object size is proportional to judged distance. Misregistration of object size results in a proportional error in judged distance and vice versa. This is the size-distance invariance hypothesis.

P3. For an object at a given perceived distance (correct or not), perceived image size is proportional to perceived object size. Misregistration of image size produces a proportional error in judged object size and vice versa. The ratio of perceived object size to perceived image size is constant.

These hypotheses are about perception and can be verified only by psychophysical experiments in which apparent object size, apparent image size, and apparent distance are measured. But nobody has succeeded in measuring all three variables in one experiment, and this has produced some confusion.
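The objective laws O1-O3 can be checked with a few lines of code. The following is a minimal sketch in Python under a small-angle approximation; the function and the sample values are mine, not from the text:

```python
def image_size(object_size, distance):
    # Angular image size (radians, small-angle approximation): s = S / D.
    return object_size / distance

S = 1.7  # assumed object height, meters
# O1: doubling the distance halves the image size.
print(image_size(S, 2.0), image_size(S, 4.0))        # 0.85, 0.425
# O2: to hold image size constant while distance doubles,
# object size must also double.
print(image_size(S, 2.0) == image_size(2 * S, 4.0))  # True
```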

29.3.2b Testing the Size-Distance Invariance Hypothesis

The following is only a selective review of the large, complex, and controversial literature concerned with the extent to which people’s judgments conform quantitatively to the size-distance hypothesis. For reviews of the earlier literature see Epstein et al. (1961), Hochberg (1971), and Sedgwick (1986).

The Ames distorted room is generally regarded as providing support for the size-distance invariance hypothesis (Portrait Figure 29.1). The far wall of the room is a trapezoid and is slanted by just the amount that makes the wall projectively equivalent to a rectangular wall in a frontal plane of an eye looking through a peephole in the near wall (Ittelson 1952; Kilpatrick 1961).

Figure 29.1. Adelbert Ames. He was born in Lowell, Massachusetts, in 1880. He spent a few years as a lawyer and then as a painter before going to Clark University to study physiological optics in 1914. After serving in World War I, he went to Dartmouth College to work with Charles Proctor. The Dartmouth Eye Institute was founded in 1935 under the directorship of Alfred Bielschowsky. Ames became director of research. In 1955 he won the Tillyer Medal of the Optical Society of America. Ames died later that same year.

The whole room, including its slanted far wall and sloping ceiling and floor, is projectively equivalent to a rectangular room, as shown in Figure 29.2. Distorted rooms built in 17th-century Holland are described in Section 2.9.5. When viewed from a specified point, the Ames room creates the same image as a rectangular room and therefore appears rectangular. The impression persists when the eye is not exactly at the correct viewing point or when the room is viewed with both eyes (Gehringer and Engel 1986). Because the room appears rectangular and the far wall frontal, a person standing in the more distant corner of the room appears to be at the same distance as a person standing in the near corner of the room. Accordingly, the person in the far corner appears smaller than the person standing in the near corner (Gregory 1997).

Figure 29.2. The Ames room. When the room depicted in the upper figure is viewed with one eye from the hollow on the crossbar it appears as shown in the lower figure. The two women are about the same height, but one is standing in the near corner of the room and the other is in the far corner. (Redrawn from Kilpatrick 1961 and from Gregory 1978)


The effect of the Ames room on perceived size is due to the angular size of the person relative to the angular height of the room. The person in the far corner extends only part of the way to the ceiling while the near person extends as far as the ceiling, as shown in Figure 29.2. The same proximal stimulus would be produced by placing a giant in one corner of a rectangular room and a midget in the other corner. If the Ames room is perceived as rectangular, this tells us nothing about size constancy. The effect of relative size could be investigated by placing two people in a room in which the far wall is frontal but tapered. Another way to control for this factor is to compare the sizes of afterimages projected onto different parts of the far wall, as we will see in Section 29.3.4.

The Ames window described in Section 26.7.2 is also constructed on the principle of projective equivalence (Ames 1951). As it rotates about its vertical axis, its apparent shape changes between trapezoidal and rectangular and its apparent orientation about its vertical axis changes accordingly.

Attempts to provide quantitative support for the size-distance invariance hypothesis have produced conflicting results. Some experiments confirmed the expected proportional relationship between perceived object size and perceived distance. In other experiments, perceived size was inversely related to perceived distance (Gruber 1954, but see Gilinsky 1955b). This is called the size-distance paradox. It has been claimed that this effect disproves the size-distance invariance hypothesis (P2). But it does no such thing. One need only assume that subjects were judging image size rather than object size. Then, by hypothesis P1, the inverse relationship is to be expected. The problem is that experimenters have concentrated on the size-distance hypothesis (P2) and ignored the other two hypotheses. The paradoxical effect is only paradoxical when it is assumed that registered image size is invariant (Joynson 1949; McCready 1985). Some of the factors that may account for contradictory and paradoxical results are as follows.

1. Effects of instructions and knowledge. Results obtained in any experiment on size constancy depend on whether subjects are asked to judge linear size or angular size, and on how well these concepts are explained, understood, and acted on (Leibowitz and Harvey 1967). With reduced cues to distance, linear size estimates were proportional to distance estimates but angular size estimates varied inversely with perceived distance (Gilinsky 1955a; Baird 1963). By hypothesis P2, if the linear size of an object is overestimated it should appear more distant than when its size is correctly estimated, because it would have to be more distant to produce the same image. By hypothesis P1, if image size is overestimated, the object should appear closer, because it would have to be closer to produce a larger image.



This would be paradoxical from the point of view of judging linear size, but it is not paradoxical when considered in the context of what the subject is trying to judge. Ono (1966) trained subjects to make one or the other type of response by providing error feedback. Subjects in one group were told that their size matches were correct when they conformed to the same image size. Subjects in a second group were told their matches were correct when they conformed to equal linear size. Under conditions of reduced depth cues, subjects readily learned to match image size. Under full viewing conditions, subjects readily learned to match linear size. Mon-Williams and Tresilian (1999b) produced evidence that the size-distance paradox occurs when subjects make verbal estimates of size and distance but not when they estimate distance by pointing.

Also, size constancy depends on knowledge of the effect of distance on image size. Children aged 5 to 10 years who understood the effect of distance on image size were more accurate in judging the sizes of distant objects than were children with poor understanding (Granrud 2009). Also, only children with good understanding improved their accuracy when instructed to base their judgments on objective size rather than on image size.

2. Size-contrast effects. These effects have contaminated the results of several studies of size-distance invariance, as we will see in Section 29.3.6.

3. Number of cues to distance. Judgments of the size and distance of an object become indeterminate and subject to moment-to-moment fluctuations when cues to distance are impoverished (Over 1963). In the absence of depth information, subjects tend to perceive a larger object as nearer than a smaller one. When subjects match the sizes of the images of two objects at different distances, performance improves as the number of cues to depth is reduced (Over 1960a, 1960b). The Weber fraction for detection of a difference in image size (about 9%) is similar to that for detection of a difference in linear size (McKee and Welch 1992). Even with all external depth cues removed, matches of object size deviate slightly from visual-angle matches because of the effects of accommodation and convergence (Biersdorf 1966).

In a 2-D picture containing many cues to distance, people overestimate the angular subtense of a distant object relative to that of the same object at a nearer distance (Carlson 1962; Epstein 1963; Leibowitz et al. 1969). This effect is illustrated in Figure 29.3. The habit of comparing the linear sizes of objects intrudes into the task of comparing image (angular) sizes.


Figure 29.3. The difficulty in judging visual subtense. The visual subtense of the far person is the same as that of the small image in the foreground. Judgments of perceived size are influenced by perceived distance.

Thus, people are very poor at judging the relative sizes of the images of objects at different distances in the presence of good depth cues. Some investigators have concluded from this that people cannot be using image size in conjunction with distance to achieve size constancy. But a more reasonable view is that, although people cannot perceptually isolate image size in cue-rich conditions, they effectively use image size in conjunction with distance when estimating the linear sizes of objects.

The inaccuracy of judgments of image size is an example of a general difficulty people have in estimating the magnitude of a sensory feature that is normally compounded with a related feature. For example, people make large errors when comparing the weights of two objects that differ in density. The habit of comparing densities (weight per unit volume) intrudes when one attempts to compare weights while ignoring volume differences. Similarly, people make large errors when comparing the brightness of two surfaces when the difference in brightness is caused by a shadow falling on one of the surfaces (Section 22.4). In this case, the habit of judging the ratio of reflected light to incident light (albedo) intrudes when one attempts to perceptually isolate reflected light.

Artists drawing in perspective must represent objects in terms of visual angles and overcome the tendency to represent their linear extents. Even trained artists use aids such as scaling objects relative to marks on a rod held at arm’s length. The difficulty in judging visual angles is one reason why perspective was not understood by anybody until the 15th century (Section 2.9.3) and why most people still do not understand it (Section 26.3.5).

4. Lack of linearity and transitivity. Judgments of isolated stimulus attributes, such as image size or distance, may not reflect how the visual system uses the same stimulus information when it is combined with other information in a constancy judgment (Gruber 1954; Foley 1972). In other words, perceptual judgments are not necessarily associative, commutative, or transitive. Consider two equal objects at different distances moving at the same linear speed in frontal planes and viewed with a stationary eye. The image of the nearer object moves faster than that of the distant object. Therefore, the nearer object should appear closer. It should also appear smaller because of size-distance invariance. However, Kaneko and Uchikawa (1993) found that the object producing the faster image did not always appear smaller, even when it appeared nearer. They concluded that relative velocity affects perceived size directly, in addition to any effect on perceived size through the mediation of parallax-induced depth.


A change in the apparent value of one stimulus feature may trigger changes in distinct perceptual mechanisms. For example, a stimulus factor that makes an object appear more distant will, by the size-distance mechanism, cause it to increase in apparent size. However, the increase in apparent size may feed into the image-looming mechanism, which makes an object that grows in size appear to be a constant-sized object coming closer. The result is that the initial increase in apparent distance causes the object to appear nearer (Rump 1961). This recursive perceptual process has been used to explain the moon illusion (Section 29.3.5).

Judgments of size and distance are strongly influenced by linear perspective. Vogel and Teghtsoonian (1972) asked subjects to estimate the size and distance of disks of various sizes presented one at a time at various distances in a 9-ft-long box. When the sides of the box physically converged into the distance, the apparent distance of the disk grew as the 1.4 power of distance, and its apparent size increased with increasing apparent distance. When the walls diverged, apparent distance grew as the 0.95 power of distance, and apparent size decreased with apparent distance. Apparent size was not a simple function of apparent distance. But apparent size would be influenced by at least two factors: the apparent distance of the disk and its size relative to the far end of the box. When the walls diverged, the end of the box was larger in relation to the disks than when the walls converged. In a similar experiment, Blessing et al. (1967) found a linear relationship between apparent size and apparent distance, in conformity with the size-distance invariance hypothesis (Section 29.3.2). The same size-distance relationship is responsible for anamorphic art, as described in Section 2.9.5.

Sauer et al. (2002) placed two vertical rods at different heights on each of two pictures of surfaces. One picture consisted of random dots undergoing motion parallax that simulated a ground plane. The other was a photograph of a grassy field. The perceived depth separation of the rods increased when they were surrounded by a quadrilateral with sides converging toward the top. The greater the perspective taper, the greater the perceived depth separation of the rods. The perceived depth of rods placed outside the quadrilateral was influenced in a similar way but to a lesser extent. Thus the effects of linear perspective on perceived depth can propagate to some extent to regions outside an object displaying the perspective.

29.3.3 I N F LU E N C E O F FA M I L I A R O B J EC T S

People can make size judgments of familiar objects without taking account of distance (Bolles and Bailey 1956). Under reduced viewing conditions, an abnormally large familiar object appears in its natural size at a reduced distance (Franklin and Erickson 1969).



Ittelson (1951) presented adults with half-size, normal-size, and double-size playing cards one at a time under conditions in which the only cue to distance was the size of the retinal image. The larger card appeared nearer and the smaller card appeared more distant than the normal card. Ittelson concluded that familiar size is a cue to distance. There has been considerable dispute about whether this was a true perceptual effect or simply the result of a cognitive decision.

Slack (1956) asked subjects to judge the height of a normal chair, an abnormally small chair, and an abnormally large chair presented at 20, 30, and 40 yards on an open field. In comparison with a nearby stake, subjects overestimated the size of the small chair and underestimated the size of the large chair. Subjects were not asked to judge distances. Slack concluded that familiar size modifies apparent size. Predebon et al. (1974) obtained similar results in natural viewing conditions at a distance of 28 yards but not at a distance of only 14 yards. At the nearer distance, depth cues would be more effective and allow subjects to make accurate size judgments. Schiffman (1967) obtained similar effects of distance on the effects of familiar size on perceived size. He concluded that the estimation of the distance of a familiar object is influenced by its unusual size only when other cues to distance are not fully available.

One would expect familiar size to strongly affect judged size when all cues to distance are eliminated. However, this is not necessarily so. Gogel and Newton (1969) had subjects view transparencies of familiar objects in unusual sizes monocularly through a lens that placed them at optical infinity. The objects were in dark surroundings. Subjects made reasonably accurate estimates of how much larger or smaller the objects were compared with objects of normal size. They were primed to expect objects of unusual sizes. Gogel and Newton explained the results in terms of a tendency to perceive the stimuli as being at the same distance. This is Gogel’s specific distance tendency, described in Section 29.2.2b.

When we walk through a normal environment we do not expect objects to change in size. For example, any increase in the size of the image of a wall of a room is seen as due to our movement toward it rather than to expansion of the wall. Glennerster et al. (2006) immersed subjects in a stereoscopic virtual room in a helmet-mounted display. As they walked from one side of the room to the other side, the expansion of the room created retinal flow similar to that created by walking through the room. However, the relationship between distance walked and the change in the retinal image was altered. None of the subjects noticed that there had been a change in the size of the room or of objects in the room. They ignored stereoscopic and optic-flow information about changes in size. Some subjects felt that their strides were getting smaller or longer as they walked to-and-fro.


Subjects made fewer errors in judging changes in the size of an object in the virtual room when given feedback based on a valid cue of texture-perspective. However, judgments did not improve after feedback based on valid disparity or motion parallax cues (Rauschecker et al. 2006).

29.3.4 E M M E RT ’S L AW

An afterimage appears to change in size according to the distance of the surface on which it is projected. The earliest known account of this effect is by Benedetto Castelli (1557–1644) in his Discorso sopra la vista, which was printed in Bologna in 1669 (see Section 2.5.2). Without reference to Castelli, Robert Darwin (1794), father of Charles Darwin, wrote:

Thus when you view a spectrum [afterimage] on a sheet of white paper, if you approach the paper to the eye, you may diminish it to a point; and if the paper is made to recede from the eye, the spectrum will be magnified in proportion to the distance.

Several people rediscovered the effect during the 19th century, including Goethe (1810), Fechner (1838), Séguin (1858), Lubinoff (1858), and Aubert (1865). There was some debate about priority (see Campbell and Tauscher 1966 for these references). In spite of these earlier reports, the discovery of the effect is attributed to E. Emmert, lecturer in ophthalmology in Bern. In 1881 Emmert reported that an afterimage projected onto a surface in a normal visual environment covers less of the surface as the surface is brought nearer. This is the geometrical form of what has become known as Emmert’s law. It is a fact of geometrical optics, not a statement about perception. Young (1952a, 1952b) attempted to verify this interpretation of Emmert’s law by marking the boundaries of an afterimage on surfaces at different distances. However, it is difficult to produce precise measurements because afterimages have blurred edges that are on the retinal periphery. They are also subject to fading and move with respect to external markers when the eye moves.

There arose a debate about whether Emmert intended his law to refer to the geometry of images or to the relationship between the perceived size and the perceived distance of an afterimage (Boring 1940; Edwards and Boring 1951; Young 1951; Crookes 1959). If Emmert’s law states that the perceived size of an afterimage is proportional to its perceived distance, it refers to a perceptual mechanism and is equivalent to the size-distance invariance hypothesis. It can be verified only by measuring apparent size and apparent distance. It can easily be confirmed that an afterimage appears larger when it is projected on a more distant surface than when it is projected on a near surface. For this to be true, one must perceive the afterimage as lying on the surface one is looking at and perceive the distances of the surfaces.
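In its geometrical form, Emmert’s law is just similar triangles. The following minimal sketch in Python illustrates it; the function name and the sample values are mine, not from the text:

```python
import math

def extent_covered(angular_size_deg, surface_distance):
    # Linear extent covered on a surface by a patch of fixed
    # angular size alpha: L = 2 * D * tan(alpha / 2).
    alpha = math.radians(angular_size_deg)
    return 2.0 * surface_distance * math.tan(alpha / 2.0)

# An afterimage of fixed angular size covers twice the linear extent
# on a surface at twice the distance.
print(extent_covered(3.0, 1.0))  # about 0.052 m
print(extent_covered(3.0, 2.0))  # about 0.105 m
```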

Afterimages appear constant in size in reduced conditions, where size constancy does not hold (Edwards 1953). However, experiments to determine whether Emmert’s law for perceived size and distance holds quantitatively have produced different results depending on instructions, methods of measurement, and the nature of the surface on which the afterimage was projected (Hastorf and Kennedy 1957; Teghtsoonian 1971). For example, Furedy and Stanley (1970) found that the apparent size of the afterimage of a 3° disk increased as the distance of the surface on which it was projected increased from 54 to 126 inches. However, the rate of increase did not keep pace with judgments of the apparent size of a real disk placed on the surface. Unlike the real object, the afterimage moved over the projection surface when the eyes moved. Therefore, the afterimage may not have been perceived as firmly attached to the surface.

Taylor (1941) observed that an afterimage in complete darkness appeared to grow and shrink as subjects moved their heads backward and forward. The effect was confirmed by Gregory et al. (1959). Also, the afterimage of a white card held in the hand appeared to grow and shrink when the hand and card were moved backward and forward in the dark (see also Carey and Allan 1996). The afterimage appeared constant in size when observers fixated a stationary point of light (see also Mon-Williams et al. 1997; Bross 2000). In the dark, the only cue to distance is the vergence-accommodation state of the eyes. When these are constant, the size of the afterimage does not appear to change.

In the Ames room the far wall is a slanted trapezoid that is projectively equivalent to a rectangular wall in a frontal plane (Figure 29.2). The apparent size of an afterimage projected onto different locations on the rear wall of the Ames room did not change (Dwyer et al. 1991). In other words, the apparent size of the afterimage depended on the apparent distance rather than the actual distance of the part of the surface on which it was projected. Note that an afterimage covers the same proportion of the height of the image of the room whichever part of the far wall it is projected on. Thus, size-contrast effects are eliminated.

To the extent that an afterimage appears like a patch of constant angular subtense painted on a surface, the further away the surface appears the larger the patch will appear. Whatever rule governs the relationship between perceived size and perceived distance of a painted patch will also apply to an afterimage. If an afterimage does not behave like a real patch it can only be because of some incidental difference between them, such as the spatial and temporal instability of the afterimage. This possibility could be tested by comparing the apparent size of an afterimage with that of a projected disk of light that wanders and fades like an afterimage. Otherwise, there is nothing special about afterimages.


An afterimage is simply a convenient way to keep the angular subtense of a patch constant.

29.3.5 T H E M O O N I L LUS I O N

The horizon moon typically appears about 1.5 times larger than the zenith moon even though the retinal images are the same size. The image of the moon is like an afterimage in that its size is fixed. King and Gruber (1962) asked subjects to project the afterimage of a disk of light onto the night sky at the horizon and at elevations of 45 and 90°. Most subjects reported that the afterimage, like the moon, appeared about 1.6 times larger when on the horizon than when at 90°.

Several theories have been proposed to account for the moon illusion. Helmholtz (1909, Vol. 3, p. 361) described early theories. Ross and Plug (2002) thoroughly reviewed the subject. It is best to think of the various theories as factors that may contribute to the illusion rather than being mutually exclusive.

29.3.5a Atmospheric Effects

During the 4th century BC, Aristotle proposed that the sun appears larger on the horizon because of the surrounding halo caused by atmospheric vapors. Ptolemy mentioned the moon illusion in the Almagest, written in about AD 142. He explained it in terms of refraction of light through the atmosphere, and likened it to the apparent enlargement of objects in water. However, objective measurements prove that the moon projects an image of the same size whatever its position in the sky. Ptolemy may have realized this, because in the Optics he explained the illusion vaguely in terms of a perceived difference in the sizes of the horizon and zenith moons.

Bishop Berkeley (1709, sections 67–78) suggested that the moon illusion occurs because the horizon moon is dim and hazy. Other things being equal, a brighter object appears nearer than a dim object. There has been conflicting evidence on the role of this factor. Kaufman and Rock (1962) found that reducing the luminance of a disk projected onto the horizon had no noticeable effect on its apparent size. Hamilton (1966) found that the moon illusion measured with projected disks occurred whatever the relative luminances of the disks, as long as the terrain on the horizon was visible. The moon illusion has been reported to be evident when the moon is equally bright in the two positions (Goldstein 1962).

29.3.5b Effects of Head Posture

The horizon moon is seen with the head vertical, but the zenith moon is seen with the head inclined backward. Holway and Boring (1940a) suggested that the illusion is caused by these postural changes. Using a mirror, they reflected the image of the moon to various elevations, with the subject either erect or supine.



Subjects adjusted the size of a nearby comparison disk to match the size of the moon. For supine subjects, the zenith moon appeared larger than the horizon moon. In a subsequent paper they concluded that the moon illusion is due mainly to reflex divergence of the eyes induced by upward gaze (Holway and Boring 1940b). One problem here is that the comparison disk should also have been affected by the shift of gaze.

Kaufman and Rock (1962) showed that gaze shift is not the main cause of the illusion. The moon still appeared larger when on the horizon when the scene was viewed through 90° prisms that required subjects to look upward. Also, the moon still appeared smaller when seen in an empty sky with eyes and head level.

29.3.5c The Vault of Heaven

Alhazen, in 11th-century Cairo, offered a perceptual explanation of the moon illusion (see Ross and Ross 1976; Sabra 1987; Plug and Ross 1994). He stated that the vault of heaven appears flattened, which causes the zenith moon to appear nearer and therefore smaller than the horizon moon. In China, Shu Hsi had proposed a similar perceptual explanation of the moon illusion in the 3rd or 4th century AD (see Needham 1962, Vol. 3, p. 226). Desaguliers (1736) and Smith (1738) adopted this theory.

From their experiments conducted in 1962, Rock and Kaufman concluded that the moon illusion arises, at least in part, because the horizon moon appears more distant than the zenith moon on account of the night sky’s being perceived as a flattened dome. They suggested that this percept could arise from experience with cloud formations, which appear essentially flat. In support of this suggestion, Gruber et al. (1963) found that 2 minutes’ prior exposure to a luminous ceiling affected the perceived relative sizes of moons presented at different elevations.

Lloyd Kaufman and his son James returned to the moon illusion (Kaufman and Kaufman 2000). They projected stereoscopic images of two adjacent moons of equal size onto the horizon, or onto an empty sky at an elevation of 45°. The comparison moon had zero binocular disparity, and subjects adjusted the disparity of the variable moon until it appeared at half the distance of the comparison moon. On average, disparity settings of the variable horizon moon corresponded to a distance of 36 m, while those of the elevated moon corresponded to a distance of 8.6 m. By this criterion, the horizon moon was perceived to be more distant than the elevated moon. This conformed to the fact that both horizon moons appeared larger than the elevated moons. Kaufman and Kaufman then asked subjects to adjust the disparity of one moon until it appeared half the size of the comparison moon. By Emmert’s law, a moon perceived at half the distance of another moon of equal image size should appear half the size of the other moon.


Although subjects saw the moon getting smaller as it came nearer, it did not appear half as large as the comparison moon until it was at a disparity-defined distance of 5.5 m for the horizon moon and 3.6 m for the elevated moon. Thus, the perceived size of the stereoscopic image of the moon decreased with decreasing distance much less than predicted by Emmert’s law. This result is in agreement with results reported by Enright (1989).

When asked to judge the distance of a synthetic moon, most observers reported that, although appearing larger, the horizon moon appeared nearer than the zenith moon (Bilderback et al. 1964). This seems to contradict the theory that the moon illusion is due to the horizon moon’s appearing more distant than the zenith moon. We saw in Section 29.3.2a that this type of paradox arises in other size-distance judgments when there is a lack of good information about distance and uncertainty about which of several poor depth cues to use. The height of the zenith moon is visually unspecified. For a person of average height, objects on the horizon of a horizontal plane are about 3 miles away, but observers may not perceive the horizon moon at the same distance as adjacent objects. The horizon moon is partially occluded by horizon objects and is clearly not attached to them. Rock and Kaufman (1962) suggested that observers use primary distance cues and assumptions about the shape of the dome of the sky to scale the size of the moon but then recursively use the scaled size of the moon when asked to make a conscious decision about its distance.

29.3.5d Effects of the Background

According to another theory, the moon illusion occurs because the horizon moon appears large when adjacent to familiar objects, such as buildings and trees. The zenith moon is seen in isolation or among objects of unspecified size, such as clouds and stars. Rock and Kaufman (1962) obtained a larger moon illusion when the terrain effect was present, but they obtained some illusion when both horizon and zenith moons were seen against a clear sky. Restle (1970) argued that any object, even an unfamiliar one, seen among other smaller objects appears larger than the same object viewed in relatively empty surroundings. According to these explanations, the moon illusion does not depend on perceived distance but simply on size comparisons between neighboring objects.

29.3.5e Empty Field Myopia

Viewing the zenith moon in an empty sky induces empty field myopia (Section 9.3), which is not induced when the moon is seen against objects on the horizon. Roscoe (1989) suggested that the horizon moon looks larger because the eye is accommodated to a greater distance when viewing the horizon moon than when viewing the zenith moon. Iavecchia et al. (1983) found that size estimates of an artificial moon viewed in different surroundings correlated highly with differences in accommodation measured with a laser optometer.

29.3.6 S I Z E -D I S TA N C E S C A L I N G I N G EO M ET R I C A L I L LUS I O NS

Illusions such as those shown in Figure 29.4 are used to demonstrate the size-distance hypothesis. The usual explanation of such illusions is that objects depicted in a picture appear more distant and therefore larger as they approach the vanishing point of the converging lines. This type of figure was described and explained in this way by Benedetto Castelli in 1639 (Section 2.5.2). It is a variant of the Ponzo illusion shown in Figure 29.9. In a real-world scene viewed in the normal way, a distant object that subtends the same visual angle as a near object would indeed be much larger than the near object.

Figure 29.4. Perspective-size illusions. (A) The men are the same height but appear to grow with increasing apparent distance. It is a variant of the Ponzo illusion. But the apparent increase in size could also be due to size contrast between the figures and the spacing of the converging lines. (B) The two bold lines are the same length. The illusion embodies the Müller-Lyer illusion, as shown by the red lines, and the Ponzo illusion.


In those circumstances, to say that the far object is larger than the near object would not be an illusion. But in a flat picture, we say there is an illusion because, in 2-D, the objects have both the same angular and the same linear size. Figure 29.4B combines the Ponzo illusion and the Müller-Lyer illusion. The difference in the perceived size of two objects of the same size, like those shown in Figure 29.4A, resulted in a corresponding difference in the area of fMRI activation in V1 of humans (Murray et al. 2006). The effect was reduced when subjects attended closely to a fixation point at the center of each object rather than to the whole object (Fang et al. 2008). This suggests that the effect in V1 depends on feedback from higher centers.

It has been proposed that many geometrical illusions are due to interpreting the figures in terms of linear perspective. This is known as the perspective theory of illusions. Consider the Müller-Lyer illusion in Figure 29.5A. Von Holst (1957) argued that the converging lines cause one of the vertical lines to recede in depth and the diverging lines cause the other vertical line to come forward. By size-distance scaling, the apparently more distant line appears larger than the apparently nearer line. However, it has been reported that the Müller-Lyer illusion occurs even though subjects are not aware of any illusory impressions of depth (see Robinson 1972).

Gregory (1963) distinguished between primary and secondary size scaling. He defined primary scaling as an automatic process in which perspective in a flat drawing directly affects apparent size, even though other information indicates that the drawing is flat. He defined secondary scaling as a higher-level process in which differences in apparent size are due to differences in apparent distance arising from perspective. It is difficult to see how the notion of primary scaling can be verified because, by definition, primary scaling produces no impression of depth. See Hotopf (1966) and Robinson (1972, p. 152) for a discussion of this issue.

According to the perspective theory, the Müller-Lyer illusion should be enhanced when disparity reinforces the perspective cue to depth, as in Figure 29.5C, and should be reduced when the two cues are in opposite directions, as in Figure 29.5D. However, Georgeson and Blakemore (1973) failed to find a significant difference in the magnitude of the illusion between these two conditions. Another difficulty with the perspective theory is that the Müller-Lyer illusion is evident when the fins are replaced by circles, as in Figure 29.5B. Circles do not signify perspective (Day 1965) (Portrait Figure 29.6).

Thiéry (1896) and Tausch (1954) proposed that the Poggendorff illusion, shown in Figure 29.7, is due to illusory depth created by perspective. In perspective, people tend to perceive A as continuous with C, and B as continuous with D. In conformity with this idea, Green and Hoyle (1964) claimed that the illusion is increased when more perspective lines are added, as in the lower part of Figure 29.7.



Figure 29.5. Müller-Lyer illusion with congruent and noncongruent perspective and disparity. (A) The line with diverging fins appears longer than the line with converging fins. (B) A version of the illusion in which the fins are replaced by circles that do not convey perspective. (C) With crossed fusion, the disparity of the fins is consistent with their perspective. (D) With crossed fusion, disparity and perspective are opposed. With uncrossed fusion, the effects in (C) and (D) are reversed.

Gillam (1971) presented further evidence in favor of the perspective theory of the Poggendorff illusion. Perspective is not the only cause of geometrical illusions. Other factors include perceptual interactions between lines in different orientations, the effects of global factors on the processing of local features, and size contrast.


Figure 29.6. Ross Day. He graduated in psychology from the University of Western Australia in 1949. He was appointed lecturer in the University of Bristol and then research fellow by the Flying Personnel Research Committee of the Air Ministry. He obtained a Ph.D. from the University of Bristol in 1953. In 1955 he was appointed to a lectureship in psychology in the University of Sydney, where he became reader in psychology. In 1965 he was appointed to the foundation chair of psychology at Monash University in Melbourne. After his retirement in 1993 he was invited to La Trobe University in Melbourne, as adjunct professor of psychology. He was elected to the Australian Academy of Science in 1990 and received the Australian Federation Medal in 2003 and the Australian Psychological Society Research Award in 2005.

In size contrast, an object appears larger when adjacent to a small object than when it is adjacent to a large object. Neighboring objects serve as a frame of reference for size (Rock and Ebenholtz 1959). For example, in the Titchener illusion, shown in Figure 29.8A, the circle surrounded by large circles appears smaller than the circle surrounded by small circles. In the normal form of the illusion the circles lie in the same depth plane, as in Figure 29.8A. The illusion is enhanced by size-distance scaling when the apparently smaller circle is made to appear nearer than the apparently larger circle, as in one of the stereograms in Figure 29.8. The illusion is reduced or eliminated when the apparently smaller circle is made to appear more distant, as in the other stereogram. Perceiving the central circle as more distant than the large surrounding circles but nearer than the surrounding small circles produces an illusion in the reverse direction to that usually observed. We saw in Section 29.3.5 that Restle (1970) suggested that the moon illusion is due, at least in part, to size contrast. See also Restle and Merryman (1968).

Geometrical illusions like those in Figure 29.4, and the related Ponzo illusion shown in Figure 29.9A, may be due to size contrast rather than to size-distance scaling arising from perspective.


Figure 29.7. Perspective and the Poggendorff illusion. Green and Hoyle (1964) found that the Poggendorff illusion, evident in the upper figure, is increased when additional perspective lines are added, as in the lower figure.

In each case, one object may appear smaller because it extends across fewer converging lines than the other objects. Humphrey and Morgan (1965) put this idea to the test by using vertical test lines aligned with the converging lines, as in Figure 29.9B. This removes size contrast, but an illusion due to size-distance scaling should still be evident. They reported that the illusion was not evident. Gillam (1973) superimposed vertical lines on a texture gradient of horizontal lines, as in Figure 29.9C. This produced an illusion. The effect could be due either to size-distance scaling or to size contrast. But there was no measurable illusion when horizontal test lines were superimposed on the gradient of horizontal lines, as in Figure 29.9D. Here again, the size-contrast effect had been removed.

Ames presented two luminous disks differing in size at the same distance in dark surroundings. The larger disk appears nearer than the smaller disk. In Figure 29.10B three squares at the same depth appear to increase in distance with decreasing size (Tolin 1969).



Figure 29.9. The Ponzo illusion. (A) Standard illusion. (B) Standard illusion with vertical test lines. (C) Illusion with horizontal induction lines and vertical test lines. (D) Illusion with horizontal induction and test lines.

It can be concluded that size illusions may be created by size contrast in the absence of differences in perceived distance, and also by differences in perceived distance in the absence of size differences.

Figure 29.8. Titchener circles. (A) The circle surrounded by large circles appears smaller than the circle surrounded by small circles. (B) When the images are cross-fused, the illusion is enhanced because it is supplemented by a size-distance effect. (C) In the fused image, the illusion is decreased or eliminated by a counteracting size-distance effect.

When a single luminous disk grows in size it appears to approach (see Section 31.3.2). The effect of size contrast can be eliminated by creating depth by disparity (Coren 1971). The stereogram in Figure 29.10A creates three squares at different stereoscopic depths. The squares subtend the same visual angle but appear to increase in size with increasing distance. Thus, perceived linear size is scaled by disparity-induced differences in depth. In this case, size contrast is not a factor. Nor is height in the field, because the effect works when this factor is reversed. This can be seen by comparing the two fused images in Figure 29.10A.



Figure 29.10. Apparent size and stereoscopic depth. (A) In the fused images, the equal-size squares appear to increase in size as they increase in apparent distance. The effect is not due to height in the field, since it occurs in both fused images. (B) The squares are at the same depth but appear to increase in distance with decreasing size.


The voluminous literature on geometrical illusions has been reviewed by Luckiesh (1965), Robinson (1972), Wade (1990), and Ninio (2001).

29.4 SHAPE CONSTANCY

The term “shape constancy” has been used to denote the following.

1. The ability to recognize that the shape of a simple 2-D object, such as a circle or square, remains the same whatever its orientation to the frontal plane. Shape constancy does not require that the shape be perceived correctly. It requires only that perceived shape remain constant with changing orientation. Consider a flat object inclined about a horizontal axis at angle θ to the normal to a line of sight. The vertical subtense of the object declines in proportion to cos θ, as shown in Figure 29.11A. This is referred to as foreshortening (a numerical sketch follows this list). The horizontal subtense of the object along the axis of inclination remains constant. Also, the vertical sides of the object converge toward the vanishing point, as shown in Figure 29.11B. The literature on simple shape constancy was reviewed by Epstein and Park (1963) and Sedgwick (1986). This topic is discussed in the following section.

2. The ability to recognize a complex 2-D object, such as a polygon, irrespective of its orientation.


For example, this topic was investigated by Shepard and Cooper (1982). The topic is beyond the scope of this book.

3. The ability to recognize the shape of a simple 3-D object, such as a cube, ridge, or cylindrical surface, in different orientations or sizes, or at different distances. This topic is discussed in Sections 20.6 and 29.4.3.

4. The ability to recognize a complex object, such as a polyhedron or face, from different vantage points. There is a vast literature on this topic, but it has more to do with shape perception than with depth perception (see, for example, Bennett and Vuong 2006; Li et al. 2011; and Rentschler et al. 2008).
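The foreshortening rule in definition 1 above is easy to check numerically. The following is a minimal sketch, not from the text; the line length and viewing distance are assumed values, and the relation is the one given in the caption of Figure 29.11A.

```python
import math

def foreshortened_subtense(length, inclination_deg, viewing_distance):
    """Angle (deg) subtended by a line of the given length inclined
    inclination_deg from the frontal plane, per Figure 29.11A:
    phi = arctan(a / D), where a = d * cos(theta)."""
    a = length * math.cos(math.radians(inclination_deg))
    return math.degrees(math.atan2(a, viewing_distance))

# A 10-cm line viewed from 57 cm (assumed values).
for theta in (0, 30, 60, 80):
    phi = foreshortened_subtense(0.10, theta, 0.57)
    print(f"inclination {theta:2d} deg -> image subtense {phi:.2f} deg")
# The subtense shrinks roughly as cos(theta): 9.95, 8.64, 5.01, 1.74 deg.
```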

29.4.1 PROCEDURES FOR MEASURING SIMPLE SHAPE CONSTANCY

The following four procedures have been used to measure shape constancy.

29.4.1a Shape Matching

Thouless (1930) approached the question of shape constancy by asking subjects to select a shape displayed in the frontal plane to match an adjacent inclined shape at the same distance. For example, subjects selected an ellipse in the frontal plane to match the shape of an inclined circle. But the instructions given by Thouless predisposed subjects to compare the shapes as projected into the frontal plane. Let us call this image matching. Image matching is not a measure of shape constancy. Shape constancy is measured by asking subjects to match two shapes inclined at different angles as if each shape were viewed orthogonal to the line of sight. Let us call this orthogonal-view matching. Thouless reported that subjects typically selected an ellipse that was intermediate between a circle and the image of the inclined circle. Thus, in spite of trying to match images, they were pulled toward orthogonal-view matching. Thouless referred to this as “regression to the real object.”

Figure 29.11. Foreshortening. (A) The angle φ subtended at the eye by a line of length d inclined at angle θ at viewing distance D is φ = arctan(a/D), where a = d cos θ. (B) The image of the in-depth dimension of an inclined flat object decreases in proportion to the cosine of the angle of slant, θ, with respect to the gaze normal. The image of the dimension along the axis of inclination remains constant.

The regression is expressed as a fraction of the total possible amount of regression, as indicated by the Brunswick ratio, namely (a − s)/(r − s), where a is the height of the selected ellipse, r is the diameter of the circle, and s is the height of the image of the inclined circle projected into the frontal plane through the axis of inclination (Brunswick 1929). The Thouless ratio is the same index computed on the logarithms of these quantities. These and other indices are discussed in Myers (1980). A Brunswick ratio of 1 indicates perfect shape constancy (orthogonal-view matching), and a ratio of 0 indicates perfect image matching.
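A minimal numerical sketch of the two indices (the measurements are hypothetical, chosen only to illustrate the computation):

```python
import math

def brunswick_ratio(a, r, s):
    """(a - s) / (r - s): 1 = perfect constancy, 0 = perfect image matching."""
    return (a - s) / (r - s)

def thouless_ratio(a, r, s):
    """The same index computed on the logarithms of the measurements."""
    return (math.log(a) - math.log(s)) / (math.log(r) - math.log(s))

# A 10-cm circle inclined 60 deg projects to a 5-cm image (cos 60 = 0.5).
# A subject matching it with a 7-cm-high frontal ellipse shows the
# compromise Thouless called regression to the real object.
r = 10.0                               # diameter of the circle
s = r * math.cos(math.radians(60))     # height of the projected image
a = 7.0                                # height of the selected ellipse
print(f"Brunswick ratio: {brunswick_ratio(a, r, s):.2f}")   # 0.40
print(f"Thouless ratio:  {thouless_ratio(a, r, s):.2f}")    # 0.49
```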


The procedure of matching a frontal shape with an inclined shape is subject to the following problems.

1. The procedure is confounded by the ambiguity of instructions. With vague instructions, some subjects achieved high shape constancy, whereas other subjects attempted to match the projected shapes of the stimuli and achieved compromise judgments (Joynson 1958a, 1958b; Joynson and Newson 1962). Even when subjects are instructed to use a specific criterion they may fail to respond according to the instructions (Carlson 1962; Epstein 1963; Kaess 1978; Kaneko and Uchikawa 1997).

2. People may not be consistent. For example, as the angle of inclination increases, subjects have an increasing tendency to judge shape in terms of the projected image (Eissler 1933; Lichte 1952). Also, shape constancy, as measured by the setting of a frontal shape to match the real shape of an inclined polygon, improved with the number of times the subject had judged the orientation of the polygon in a set of training trials without error feedback (Borresen and Lichte 1962).

3. Investigators using the shape-matching procedure have generally assumed that the shape of the frontal stimulus is accurately perceived. But this may not be true. For example, a frontal ellipse may be perceived as an inclined circle, which produces the impression that the minor axis of the ellipse is elongated with respect to its height in the frontal plane. This effect can occur even though the observer is aware that the ellipse is in a frontal plane. Perspective illusions arising from this effect were discussed in Section 26.3.4b.

29.4.1b Measuring Shape Constancy by Drawing

In this procedure subjects view an inclined shape and draw it. The accuracy of the drawing is supposed to provide a measure of shape constancy. This method was used by Clark et al. (1956a) and by Nelson and Bartley (1956). It is unsatisfactory because, when people draw something, they try to draw in perspective rather than draw the actual shape of the inclined stimulus. Most people cannot draw in accurate perspective and, in any case, their attempts to draw in perspective tell us nothing about shape constancy (see Section 26.3.5). Perhaps drawing would indicate something about shape constancy if subjects were carefully instructed to draw the actual object as if they were looking at it orthogonally. But this experiment has not been done.

29.4.1c Measuring Shape Constancy by Shape Selection

This procedure can be used only with stimuli, such as a rectangle, circle, cross, or equilateral triangle, that have a uniquely definable shape (accidental, or nongeneric, shapes, as explained in Section 4.5.9d). For example, the subject is shown a series of ellipses lying at a defined angle to the line of sight and is asked to select the one that most resembles a circle. This provides a direct measure of shape constancy and avoids the intrusion of ambiguous instructions. It seems that Stavrianos (1945) is the only person to have used this procedure. But the outline stimuli were viewed in dark surroundings at a distance of 17.6 ft. These conditions are not conducive to shape constancy.

29.4.1d Measuring Shape Constancy by Estimating Angles

This procedure can be applied to simple or complex polygons. Subjects estimate specified angles between the sides of a polygon shown at various inclinations and/or distances. For example, Lappin and Prebble (1975) showed subjects a photograph of an irregular gray polygon lying on a table and surrounded by familiar objects. Subjects underestimated specified angles of the polygon by about 8° when instructed to regard the polygon as lying on the table in the photograph. They overestimated the angles by about 16° when they were instructed to regard the polygon as lying in the frontal plane. Better constancy would presumably be obtained if the stimulus were a real 3-D display rather than a photograph.

29.4.2 SHAPE-SLANT INVARIANCE HYPOTHESIS

According to the shape-slant invariance hypothesis, first formulated by Koffka (1935), a given retinal image of a flat rigid object determines a unique inverse relationship between errors of perceived shape and errors of perceived inclination of the object. It is not usually stated that this hypothesis depends on a correct registration of the retinal image and of the rigidity of the object. If either of these features is incorrectly perceived, there is no basis for predicting perceived shape from perceived inclination. A given image can arise from any of an infinite number of objects that fill the cone of light rays that define the image. For example, Figure 29.12 shows that a slanted square or any number of appropriately tapered trapezoids produce the same retinal image. These are projectively equivalent objects. In the absence of depth information, the inclination of an object producing a given image can be known only if the shape of the object is known. But if its shape is known, perceived inclination is irrelevant (Wallach and Moore 1962) (Portrait Figure 29.13). Under these circumstances, the shape-slant invariance hypothesis states that, for a given image, the assumed shape of an object depends on its assumed inclination, and vice versa. In the presence of depth information, the hypothesis states that any error in perceived inclination causes a corresponding error in the perceived shape of the object (assuming that the shape of the image is correctly registered).

OT H E R M E C H A N I S M S O F D E P T H P E R C E P T I O N

Figure 29.12. Joint ambiguity of shape and perspective. The image produced by a frontal trapezoid (B) could arise from an inclined square (C) or a more tapered and inclined trapezoid (A). Thus, shape constancy cannot be achieved from estimates of inclination based on linear perspective.

An object can produce an infinite number of images depending on its distance and orientation to a line of sight. These are projectively equivalent images. The transformation that links the set of images is a projective transformation. One can hypothesize that a set of projectively transformed images will be perceived as arising from a rigid object rather than from a nonrigid object to the extent that the perspective transformation is interpreted as a change in inclination rather than a change in the shape of the object.

Figure 29.13. Hans Wallach. Born in Berlin in 1905. He obtained his Ph.D. from the University of Berlin in 1934. In 1936 he left Nazi Germany for the United States to work with Wolfgang Köhler at Swarthmore College. There he became Centennial Professor of Psychology in 1971. Recipient of the Warren Medal of the Society of Experimental Psychologists and the Distinguished Scientific Contribution Award of the American Psychological Association. Elected to the National Academy of Sciences in 1986. He died in 1998.

This hypothesis has meaning only if the transformation is a projective transformation. If the transformation is not projective, there is no basis for predicting perceived rigidity from perceived inclination. The different depth cues used for explicit judgments of inclination may be used in different ways for scaling the perceived size of inclined objects. Thus, Gillam (1967) proposed that the uses to which particular depth information is put might be affected by contextual variables, especially by conflicts between one cue and another (Morrison and Fox 1981) (Section 30.3.1). Massaro (1973) produced some evidence in support of this idea. He found that the time required for judging the shape of a slanted object increased with increasing slant, but that the time required for judging only the slant or only the projected shape of the object was invariant over changes in slant. A related problem is that depth cues that are used to judge shape may be dissociated from other depth cues that are used for judging slant or inclination. Only the former depth cues can be expected to conform to the shape-slant invariance hypothesis.

The shape-slant hypothesis is supported by the following findings. Under good viewing conditions and with the correct instructions, subjects make nearly perfect shape-constancy judgments (Lichte and Borresen 1967). Some experimenters found that errors in perceived inclination produced the predicted errors in perceived shape. For example, Beck and Gibson (1955) induced errors into the perceived inclination of a monocularly viewed triangle by inclining it with reference to a binocularly viewed textured background. The perceived shape of the triangle relative to a fixed comparison object corresponded to its perceived inclination. When subjects viewed the triangle binocularly they could see its true inclination and made correct estimates of its actual shape. Winnick and Rogoff (1965) found that the inclination of a rectangle was underestimated for inclinations from the vertical of less than 40° and overestimated at larger angles. The apparent shape of the inclined rectangle was determined by the shape-matching procedure. In conformity with shape-slant invariance, perceived shape conformed to the actual shape at an inclination of 40°. However, the instructions in the shape-matching procedure were not clearly specified. Kaiser (1967) found a close correspondence between errors of perceived inclination and errors in perceived shape. However, the relationship was not exact under some viewing conditions (Epstein et al. 1962). Conflicting results can arise from the use of inadequate psychophysical methods and ambiguous instructions (Epstein and Park 1963; Kraft and Winnick 1967).

The shape-slant hypothesis predicts what happens when cues to inclination are misleading, but not what happens when they are reduced or removed. There are three possibilities. Shape judgments could (1) become more variable, (2) depend on the angle of inclination that the subject assumes, or (3) conform to the assumption that the shape is in the frontal plane.


Most investigators have found that, whatever the instructions, perceived shape conforms more closely to projected shape (option 3) as depth cues are reduced. Cue reduction can be achieved by closing one eye, by increasing viewing distance, or by viewing the stimuli in dark surroundings or through a hole in a reduction screen (Langdon 1953; Meneghini and Leibowitz 1967). Thus, it seems that people generally assume that an object lies in a frontal plane when there is no information to indicate otherwise. But this is not always the case. For example, a trapezoid seen in dark surroundings appears like a rectangle inclined in the direction of the taper (Clark et al. 1955). A point of light tracing out an ellipse in dark surroundings appears to move in a frontal elliptical path. However, two or more lights tracing out an ellipse appear to move in a circle inclined to the line of sight (Johansson 1978).

In surroundings devoid of static depth cues, rotation of a shape in depth can restore shape constancy (Langdon 1955). The projected image of a 3-D shape, such as a bent piece of wire, appears flat. But when the object is rotating, the image appears three-dimensional, and the shape of the object is perceived accurately. This is the kinetic depth effect discussed in Section 28.5. Thus, the pattern of optic flow is an effective depth cue for shape constancy.

A symmetrical shape slanted about an axis at an angle to the axis of symmetry produces an asymmetrical image. The image is said to have skew symmetry. Only the images of circles retain true symmetry when slanted. An image with skew symmetry should help people to recognize the true shape of the symmetrical object that produced it. Wagemans (1993) presented an irregular or symmetrical dot pattern or polygon next to a vertically compressed and rotated version, as in Figure 29.14. Subjects recognized the original and transformed versions as being the same shape more rapidly and accurately when the original shape was symmetrical than when it was irregular. Only some of the transformed shapes, such as those with connected contours or surrounding frames, were seen as slanted. The role of skew symmetry in the detection of slant was discussed in Section 26.3.3c.

29.4.3 SHAPE CONSTANCY OF 3-D OBJECTS

Shape constancy for a 3-D object may be measured by asking subjects to adjust the lateral size of a test object relative to its depth dimension until the object appears as a defined shape, such as a cylinder or two surfaces meeting in a right angle. Subjects have shape constancy when the perceived ratio is constant over changes in the distance, orientation, or size of the object. In a related procedure, an object is moved to-and-fro in depth and subjects judge whether it remains the same shape. In both cases, the procedure is repeated at several distances.



Figure 29.14. Examples of stimuli used by Wagemans (1993). Shapes on the right are images produced by parallel projection of shapes on the left slanted about an oblique axis.

These procedures determine whether the proportions of an object remain perceptually constant at different distances. Shape constancy for a simple 3-D object, such as a cube, can also be measured by asking subjects to select a cube from among a set of tapered boxes, as described in Section 26.3.5. The procedure can be repeated with the boxes at each of a set of orientations and distances. Hecht et al. (1999) asked subjects to estimate the dihedral angle between surfaces, such as the walls at the corner of a building, with each wall seen at an angle of 45°. Any underestimation of the depth dimension should cause the angle between the walls to appear more than 90°. By this criterion, estimates of depth were reasonably accurate for viewing distances up to about 15 m, but depth was underestimated at larger distances. The same trend was evident for the same objects viewed in photographs, which indicates that the underestimation was related more to monocular distance cues than to binocular cues.

Shape constancy over changes in distance of a 3-D object defined by perspective does not require an estimate of absolute distance when the object is small and moves along a line of sight. In this case, the angular subtenses of all dimensions of the object scale in the same way with distance. However, the angular subtenses of the different dimensions of an object scale with distance in different ways when the object does not move along a line of sight. For example, consider a cube moving away along a horizontal surface below eye level.


The image of the in-depth dimension of the cube shrinks in inverse proportion to the distance of the cube from the eye. But this dimension also shrinks because the object gets closer to the visual horizon. Therefore, the depth dimension of the image of a cube shrinks approximately in inverse proportion to the square of distance, and the image of the top of the cube becomes compressed. The size of the image of the front face of the cube is approximately inversely proportional to the distance of the cube from the eye. However, the image of the front face changes from a truncated trapezoid to a square and increases in size as the cube rises in the visual field. Therefore, the aspect ratios of the images of all sides of a cube moving in any direction other than along a line of sight change with viewing distance.

In judging the shape of a monocularly viewed object one must use linear or texture-gradient perspective to register its distance, the angular sizes of its sides, and the inclination or curvature of the sides relative to the line of sight. In a picture, cues such as disparity, accommodation, and vergence are absent or misleading. These factors cause the depth dimension of objects in a picture to be underestimated. Objects in a picture appear flattened, which causes nearer parts of an object to appear enlarged relative to more distant parts, as in Figure 26.25.

Shape constancy of inclined surfaces and 3-D surfaces defined by binocular disparity was discussed in Section 20.6.
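The approximate scaling laws in the cube example above can be verified with a little projective arithmetic. This is a minimal sketch, not from the text; eye height and cube size are assumed values, and the cube's top face is treated as a patch on the ground plane directly ahead.

```python
import math

EYE_HEIGHT = 1.6   # meters above the ground plane (assumed)
CUBE_SIDE = 0.5    # meters (assumed)

def depression(horizontal_dist):
    """Angle of depression of a ground point below the horizontal line of sight."""
    return math.atan2(EYE_HEIGHT, horizontal_dist)

def image_extents(d):
    """Angular extents (rad) of the cube's top face at horizontal distance d."""
    # In-depth extent: difference in depression angle of near and far edges.
    depth_extent = depression(d) - depression(d + CUBE_SIDE)
    # Frontal (width) extent, measured at the near edge.
    width_extent = 2 * math.atan2(CUBE_SIDE / 2, math.hypot(d, EYE_HEIGHT))
    return depth_extent, width_extent

for d in (2, 4, 8, 16):
    depth, width = image_extents(d)
    print(f"d={d:2d} m  in-depth extent={math.degrees(depth):6.3f} deg  "
          f"width extent={math.degrees(width):6.3f} deg")
# Doubling d roughly quarters the in-depth extent (~1/d**2) but only
# halves the width extent (~1/d), so the image of the top face compresses.
```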

29.4.4 UNDERESTIMATION OF IN-DEPTH DISTANCE

29.4.4a Underestimation of Slant and Inclination from the Vertical

Underestimation of the depth dimension is related to the fact that people perceive inclined or slanted surfaces as being more frontal than they are (Gillam and Ryan 1992) (Section 26.5.2a). Thus hills appear steeper than they are as one approaches them. For example, a hill inclined 5° to the horizontal was judged to be inclined about 20°, and a 10° hill was judged to be inclined about 30° (Proffitt et al. 1995). These effects are usually described as overestimation of inclination with respect to the horizontal. But they could equally be described as underestimation of inclination with respect to the vertical.

The extent to which the inclination of a surface is underestimated depends on the distance cues present. Feresin and Agostini (2007) found that subjects were able to set a tactile paddle to match the inclination of real roads quite accurately from a distance of 4 m, but they became less accurate at a distance of 6 m. Subjects retained reasonable accuracy when tested with stereoscopic photographs of the roads, but made large underestimations with photographs lacking binocular disparity.

Durgin et al. (2010) found that the inclination of a real textured surface seen within arm’s reach was underestimated with respect to the vertical by up to 15°.

29.4.4b Underestimation of In-Depth Distances Relative to Frontal Distances

It has been reported in several studies that, even under full viewing conditions, the length of an object orthogonal to the frontal plane is underestimated relative to the length of an object in the frontal plane. Baird and Biersdorf (1967) presented 8-inch-long white cards on a black featureless tabletop 18 inches below eye level, viewed binocularly. The lengths of vertical cards were slightly overestimated, but the lengths of cards lying flat on the table extending away from the subject were underestimated. Wagner (1985) asked subjects to estimate distances between vertical white cards displayed on stakes 1 m high placed in a field at distances up to 40 m, with all cues to distance present. Distances between cards arranged in depth were judged to be half as large as the same distances between cards in the same frontal plane (see also Loomis et al. 1992).

Loomis and Philbeck (1999) joined cylindrical rods to form an L shape and placed them on a grassy field at different distances. One rod in each shape specified a frontal distance, and the other specified an in-depth distance. The height of the subject’s eye above the ground was varied in proportion to distance. In this way, the height of the rods in the visual field and the inclination of the rods with respect to the line of sight remained constant. Subjects judged the ratio of the lengths of the two rods. In-depth distances were considerably underestimated relative to frontal distances, especially with monocular viewing. With monocular viewing, the degree of depth underestimation was invariant with distance between 3.9 and 19.5 m and with the size of the stimuli. With binocular viewing, underestimation of the depth dimension was less with smaller stimuli. However, it increased slightly with distance, presumably because binocular disparity becomes less informative at far distances.

In all these experiments a frontal stimulus was matched with an in-depth stimulus. This procedure is subject to the effects of ambiguous instructions, as described in Section 29.4.1a. People have a tendency to match image sizes rather than actual sizes even when instructed to match actual sizes. This indicates confusion about what is to be judged rather than a failure of depth constancy. When all cues to distance are eliminated, as when one views an object through a hole in a reduction screen, judgments of size are wholly in terms of visual angles, and the perceived depth dimension is determined by the perceived size of the retinal image (Epstein and Landauer 1969). This is not a failure of depth constancy but merely an expression of the fact that depth constancy requires the presence of depth information.


The perception of depth in disparity-defined 3-D surfaces was discussed in Section 20.6.

29.4.5 PERCEIVING SHAPES IN SLANTED PICTURES

Perceptual distortions produced by viewing pictures in the frontal plane were discussed in Section 26.3.4. The present section deals with effects produced by viewing a picture at an oblique angle. A photograph creates the same image in an eye as the original scene as long as the photograph is viewed from the correct vantage point (Section 24.1.1). When the photograph is viewed from a greater distance the image is too small, and when the picture is slanted its image is distorted. People are accustomed to viewing pictures from an incorrect vantage point. As long as there is information that the picture plane is slanted in depth, its angle of slant is perceived (Rosinski et al. 1980) and the picture does not appear unduly distorted (Wallach and Marshall 1986). But when a slanted picture is made to appear in a frontal plane, by viewing it monocularly through a frame orthogonal to the line of sight, the picture appears grossly distorted. See Halloran (1989) for further discussion of this issue.

A slanted picture can be projected onto a flat surface, as in Figure 29.15. This removes accommodation and disparity information about the slant of the picture. Busey et al. (1990) presented computer images of pictures of faces viewed at angles of slant or inclination of 22° and 44°. Subjects rated the face distortions on a 7-point scale. Distortions were more evident in images arising from an inclined picture than in images arising from a slanted picture. However, the distortion produced by a slant of 22° was hardly noticed. Distortions were clearly noticed for a slant or inclination of 44°. But this could be explained by the fact that faces vary in their aspect ratio: some faces are wider than others, and some are longer than others. The distortion of a face viewed at an angle of 22° falls within the normal range of variation of face shape. Perhaps the distortion would be evident in a very familiar face. Placing the computer images in a rectangular or tapered frame did not affect estimates of face distortion. But the frame consisted of a thin line. The unchanging frame of the computer monitor would have provided a fixed and more prominent frontal frame.

Perhaps, to some extent, people compensate for viewing pictures obliquely through knowledge of the true shapes of the objects in the picture. But this effect would be weak at best, because people compensate for oblique viewing even when the objects are unfamiliar. Also, familiar objects appear distorted when the orientation of the picture is not apparent.

Vishwanath et al. (2005) distinguished between two methods for compensating for oblique picture viewing. In the picture-compensation method the viewer uses perspective information in the picture to derive the location of the correct vantage point.



Figure 29.15. Perception of oblique pictures. Objects in a picture viewed at an oblique angle appear undistorted, even though the images are compressed. If the frame of the slanted picture is rectangular and in a frontal plane, the contents of the picture appear distorted because the frame produces the impression that the picture is in a frontal plane. (World Wide Photos)

In the simplest case, the vantage point is indicated by the locations of the distance points formed by diagonals in the picture. In the surface-compensation method the viewer estimates the slant of the picture surface from the linear perspective of the sides or frame of the picture and from binocular disparity. This information is then used to distinguish between perspective in objects in the picture that is produced by this slant and the residual perspective that is due to the slant of 3-D objects depicted in the picture.

To determine which method people use, Vishwanath et al. showed pictures of a sphere or of a slanted cube with surroundings containing perspective, as shown in Figure 29.16. The image of a sphere is not affected by the 3-D orientation of the sphere. However, the image of the picture of a sphere is an ellipsoid when the picture is slanted. Thus, for a sphere, any distortion is due only to the slant of the picture. The image of a slanted cube is affected both by its 3-D orientation and by the slant of the picture. Any perspective due to the slant of the picture must be allowed for so that residual perspective may be used to detect the shapes of surfaces in the picture. The pictures were viewed at various angles of slant about a vertical axis up to 45°. Viewing was (1) monocular through an aperture that occluded the edges of the picture, (2) monocular with the picture edges in view, and (3) binocular with the picture edges in view.


The aspect ratio (height-to-width) of the image of the sphere and square was varied from trial to trial. Subjects were asked to select the picture that contained a sphere or a cube. With restricted monocular viewing, subjects based their judgments on the retinal images and made no allowance for the slant of the picture. With full binocular viewing, subjects selected the pictures that represented a sphere or cube. Monocular viewing with the picture edges in view yielded intermediate judgments. Like earlier work, this result shows that people can allow for the slant of a picture. Vishwanath et al. found that subjects allowed for the slant of the picture with full binocular viewing even when the surroundings of the square or sphere within the picture were removed. This would have reduced the accuracy of judgments if subjects had been using the picture-compensation method.

In a second experiment, the scenes were projected onto a slanted picture plane to produce anamorphic pictures (see Section 2.9.5). When the resulting pictures were placed in a frontal plane, the images appeared distorted, like the images in Figure 29.17. However, when the picture was slanted to the same angle as the surface on which the picture was projected, the image was the same as that produced by the original scene. In the same way, an anamorphic picture appears normal when viewed at an angle. When subjects viewed the slanted pictures monocularly through an aperture, they based their shape judgments on the retinal image, which gave the correct answers. With full binocular viewing, the slanted pictures appeared distorted. Thus, with binocular viewing, subjects inappropriately allowed for the obliqueness of the picture. They behaved as if they were viewing the slanted picture frontally. Had they used the perspective information contained in the images, they would not have made this inappropriate allowance.

All this evidence indicates that people allow for the slant of a picture by registering the slant through depth cues, such as the perspective of the picture edges and binocular disparity, rather than by using perspective information within the picture.

Figure 29.16. Stimuli used by Vishwanath et al. (2005). (Reprinted by permission of Macmillan Publishers Ltd.)

29.4.6 NEUROLOGICAL DEFICITS OF SHAPE CONSTANCY

Brain lesions, especially in the parietal lobe, can impair a person’s ability to recognize objects. Lesions in the left hemisphere have been associated with agnosia, or an inability to recognize objects at the semantic level. Lesions in the right hemisphere have been associated with defects in basic perceptual abilities such as figure-ground organization in pictures, recognition of degraded pictures, and recognition of objects seen in different orientations or sizes, especially unfamiliar objects (Warrington and Taylor 1978; Layman and Greene 1988). But the distinction between left and right hemispheres is not clear-cut.

Figure 29.17. Vantage point and images of an off-axis picture. The three panels show the picture plane at an angle to the optic axis of the camera, the image when the picture is viewed frontally, and the image when the picture is viewed from the camera vantage point.


For example, Davidoff and De Bleser (1994) described a patient with left cerebral damage who could recognize objects seen in their usual orientation but not when seen in an unusual orientation or in a photograph or drawing. A patient with general cerebral atrophy could not identify objects but could match an object in one orientation with the same object in another orientation (Taylor and Warrington 1971). A patient with visual form agnosia could not use image size to judge the distance of an object. For near objects, the patient relied on vergence and the height of the stimulus in the visual field to a greater extent than did normal subjects (Mon-Williams et al. 2001a, 2001b). The literature on neurological disorders and shape constancy has been reviewed by Lawson and Humphreys (1998).

29.5 SPEED CONSTANCY

For an object moving at constant velocity along a path orthogonal to a line of sight, angular velocity decreases in approximately inverse proportion to viewing distance. An object moving at a given angular velocity appears to have a higher linear velocity when it is seen as being in the distance rather than near. This is speed scaling. In the presence of adequate cues to depth, objects traveling at the same linear speed in frontal planes at different distances appear to have the same speed, although their angular velocities are inversely proportional to distance. This is speed constancy, and it is analogous to the constancy of perceived size of an object at different distances.

Brown (1927, 1931) was investigating speed constancy when he discovered what he called the transposition principle. Subjects adjusted the speed of an array of dots passing behind an aperture to match the speed of the same array at the same distance with all linear dimensions doubled, including the size and spacing of the dots and the diameter of the aperture. The displays were seen successively. When the dots appeared equal in speed, the objective speed of the larger display was almost twice that of the smaller display. For size ratios of 2, 4, and 10, speed ratios were 1.9, 3.7, and 8.2. In a second experiment, Brown changed the size of the image of the comparison display by changing its distance from the subject. For ratios of distances (and image sizes) of the compared displays of 1:3.3, 1:6.6, and 1:10, the ratios of objective speeds were 1.12, 1.15, and 1.21. For perfect speed constancy the speed ratio should be 1. Thus, the perceived speed of a moving display was proportional to the size of the retinal image. In other words, two displays appeared to move at the same speed when size-scaled elements traversed equal proportions of the diameters of the displays in the same time. The effect is speed transposition when image size is controlled by varying the size of the display. It is speed constancy when image size is controlled by varying the distance of the display.
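The relation between angular and linear speed that underlies speed scaling can be stated in a few lines. This is a minimal sketch with assumed values, not a model of any experiment above:

```python
def angular_speed(linear_speed, distance):
    """Angular speed (rad/s) of frontal-plane motion, small-angle approximation."""
    return linear_speed / distance

def scaled_speed(angular, perceived_distance):
    """Speed scaling: perceived linear speed = angular speed * perceived distance."""
    return angular * perceived_distance

v = 0.5                     # m/s, true linear speed (assumed)
for d in (1.0, 2.0, 4.0):   # viewing distances in meters
    w = angular_speed(v, d)
    # With veridical distance registration, scaling recovers the same
    # linear speed at every distance -- perfect speed constancy.
    print(f"d={d} m: angular {w:.3f} rad/s -> scaled {scaled_speed(w, d):.2f} m/s")
    # If distance is registered at half its true value, the scaled speed
    # is halved and constancy fails.
    print(f"         distance misregistered as {d/2} m -> "
          f"{scaled_speed(w, d/2):.2f} m/s")
```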



Figure 29.18. Irwin Rock. Born in New York in 1922. He received his B.A. in psychology from City College, New York, in 1947 and his Ph.D., with Hans Wallach, from the New School for Social Research in 1952. He taught at the New School and Yeshiva University until 1967, when he moved to Rutgers University. After retiring from Rutgers in 1987 he continued his research at Berkeley until he died in 1995.

Brown found that speed transposition is weakened when stationary surroundings of constant size are in view, because they provide a constant frame of reference for the two size-scaled displays. On the other hand, visible surroundings improve speed constancy because their images are also scaled by distance and provide additional information about the relative distances of the displays (Wallach 1939; Epstein 1978). Depth information provided by binocular disparity combined with motion-induced patterns of optic flow may also improve speed constancy (Kellman and Kaiser 1995).

Rock et al. (1968) found that, with binocular viewing, subjects could, with reasonable accuracy, adjust the linear speed of a luminous disk at a distance of 18 cm to match that of a second disk at 72 cm (Portrait Figure 29.18). The angular sizes of the disks and the angular distances traveled were the same, so that speed judgments could not be based on duration of travel. When all cues to distance were eliminated, speed matches conformed to equal angular speeds. Subjects were also asked to match the linear sizes of the disks. Speed matches and size matches were similar. Rock et al. concluded that information about distance is sufficient for speed constancy and that perceived linear speed is a function of perceived linear extent traversed in unit time.

When image size provides the only cue to the relative distances of two displays, the speed transposition effect explains speed constancy.


But size scaling may induce a change in perceived speed directly rather than through the mediation of a change in perceived distance. The next question is whether speed constancy is evident when disparity provides the only information about a change in distance. McKee and Welch (1989) asked subjects to estimate the velocity of a bar moving at 10 or 26 cm/s, presented at various disparities relative to a fixation target for a mean duration of 150 ms. Estimates of velocity conformed to angular velocity, and there was no evidence of speed constancy. They also found that the Weber fraction for discrimination of linear velocity was higher than that for discrimination of angular velocity. The Weber fractions for discrimination of linear size and angular size for a stationary object at different distances were similar. According to this evidence, speed constancy is not evoked by disparity cues to depth and is not controlled by the same process as size constancy.

There are several possible reasons for the lack of speed constancy in the display used by McKee and Welch. Disparity indicated the distance of the moving display relative to the fixation target but not the absolute distance of the display or of the fixation target. The latter distance is crucial for speed constancy. The bar remained the same angular size, which would detract from speed constancy. The brief exposure time may have interfered with the judgment of depth, since stereoacuity for moving targets is particularly degraded for short exposures (Section 18.12.3). Finally, there were neither objects nor texture in the plane of the moving bar to provide a relative scaling of speed.

The speed transposition effect and the following evidence suggest that the perceived speed of a textured display is scaled by texture density. Fernandez and Farell (2006a) presented two abutting random-dot displays, one above and the other below a fixation point. For each of several relative disparities between the displays, they measured the relative angular velocities that produced equal perceived linear speeds. They found that a near surface must have a faster retinal angular velocity than a farther surface for the two surfaces to appear to move at the same speed. This is what one would expect from speed constancy, although Fernandez and Farell concluded that it represented the opposite of speed constancy. In agreement with McKee and Welch, they found no speed constancy when the two displays were presented successively.

Mowafy (1990) moved a monocular target horizontally in a frontal plane over a textured surface that was stereoscopically slanted about a vertical axis. The target appeared to roll along the slanted surface rather than remain in the frontal plane. Its perceived speed increased as its perceived distance along the surface increased, even though the target actually moved at constant velocity in a frontal plane.

Wist et al. (1976) used the Pulfrich stereophenomenon (Section 23.1) to vary the apparent distance of a 22°-wide by 40°-high display of vertical stripes. This was done by placing a neutral filter before one eye as the display moved horizontally at constant velocity.

The apparent distance of the display depended on the density of the filter. This is a useful procedure because it changes apparent distance without changing image size or accommodation. For a given spatial frequency of the display, perceived linear velocity was proportional to perceived distance, but the constant of proportionality increased with increasing spatial frequency.

Zohary and Sittig (1993) asked subjects to match the velocity of a random-dot kinematogram at a distance of 1 m with that of a second kinematogram at 2 m. Subjects matched the linear velocities of the stimuli when the sizes of the dots in the two displays were also scaled to distance. Settings of a probe to the distance of the displays were related to distance, but not by the same function that related speed estimates to distance. When the dots in the two stimuli had the same angular size, subjects tended to match angular velocities, and speed constancy broke down, as in the experiment by McKee and Welch. Zohary and Sittig concluded that estimates of relative speed are based on relative size judgments, as in the transposition effect, rather than on judgments of distance. Since only one set of distances was used, these results tell us nothing about the relation between relative speed estimates and disparity scaling as a function of distance. Rock et al. (1968) used two single disks and found that speed was scaled by distance when size was not scaled for depth (the two disks subtended the same angle). Perhaps, with random-dot displays, the absence of size scaling becomes more evident and outweighs the binocular cues to depth.

Ramachandran et al. (1988) used a monocularly viewed random-dot display simulating two superimposed coaxial cylinders with the same diameter rotating about their common longitudinal axis at different speeds (5°/s and 10°/s). Observers usually saw a small cylinder inside a large cylinder, each rotating at the same angular velocity. Thus, differences in velocity in the flow field were perceived as differences in distance and size. A second display simulated two superimposed cylinders of unequal diameter moving at different angular velocities so as to produce equal linear velocities in the flow field. Observers saw two cylinders of equal diameter. Thus, elements moving at the same linear velocity were assigned to the same depth plane, even though they arose from distinct simulated depth planes.

An object approaching along a line of sight creates an expanding image. In the absence of cues to distance, the image may be interpreted as an expanding object rather than as an approaching object of constant size. Self-produced motion toward a stationary object gives rise to the perception of an object of constant size (Wallach and Flaherty 1975). The image of the near face of an outline rectangular box approaching along the line of sight expands more rapidly than that of the far face, but the box appears rigid. Perrone (1986) produced a 2-D display of a box approaching or receding along the line of sight and asked subjects to adjust the relative rates of expansion of the near and far faces until the box appeared rigid.


For an approaching box, the rates of expansion were those that would be produced by a rigid 3-D box, but for a receding box, the nearer face appeared to move faster than the far face when the two faces moved at the same velocity in 3-D space. In other words, motion constancy held for an approaching object but not for a receding object. Perrone argued that, because we normally walk forward, we experience approaching motion more frequently than receding motion. These experiments were done with only perspective information for depth. They should be repeated with disparity information added.

The topic of optic flow and depth perception is discussed further in Sections 28.3 and 31.2.3. Reviews of the topic of optic flow were provided by Lee (1980), Cutting (1986), and Simpson (1993). Epstein (1977) edited a book that reviewed the earlier literature on constancy in general.
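The rigidity constraint in Perrone's box task can be made explicit. For a face of width w at distance D approaching at speed v, image size is approximately w/D, so its rate of expansion is wv/D². A minimal sketch (box dimensions and speed are assumed values, not those of the experiment):

```python
def expansion_rate(face_width, distance, approach_speed):
    """d(theta)/dt = w * v / D**2 for a face of width w at distance D
    approaching at speed v (small-angle approximation)."""
    return face_width * approach_speed / distance**2

w, box_depth, v = 0.4, 0.4, 1.0   # face width, box depth (m), speed (m/s); assumed
for d_near in (4.0, 2.0, 1.0):
    d_far = d_near + box_depth
    near = expansion_rate(w, d_near, v)
    far = expansion_rate(w, d_far, v)
    print(f"near face at {d_near} m: {near:.4f} rad/s; "
          f"far face at {d_far} m: {far:.4f} rad/s (ratio {near / far:.2f})")
# A display is consistent with a rigid box only when the two faces'
# expansion rates stand in the ratio (d_far / d_near)**2.
```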




Summary

Under favorable viewing conditions people show reasonable speed constancy: they perceive the linear speed of an object more or less correctly at different viewing distances. However, speed constancy is not as complete as size constancy. The perceived speed of a display seen at a fixed distance in blank surroundings is approximately proportional to its linear dimensions. This is the speed transposition effect. Since the size of the image of a display is scaled by viewing distance, the speed transposition effect could account for speed constancy when image size is the only cue to distance. Speed constancy is not evident in successive judgments of displays in which depth relative to a fixation point is defined by disparity. However, speed constancy is evident in simultaneously viewed adjacent textured surfaces at different disparity-defined depths. The speed of looming in the projected image of a 3-D object is perceptually scaled for distance for an approaching object but not for a receding object.


30 INTERACTIONS BETWEEN VISUAL DEPTH CUES

30.1 Types of cue interaction
30.1.1 Features of visual cues to depth
30.1.2 Ways in which depth cues may interact
30.1.3 Models of cue interactions
30.2 Disparity and motion parallax
30.2.1 Cross-cue biasing
30.2.2 Between-cue cancellation of aftereffects
30.2.3 Subthreshold cue summation
30.2.4 Between-cue threshold elevation
30.2.5 Resolution of motion-parallax ambiguities
30.2.6 Surface shape from disparity and parallax
30.2.7 Depth-to-size invariance
30.2.8 Temporal factors
30.3 Disparity and perspective
30.3.1 Cue interactions on plane surfaces
30.3.2 Cue interactions on curved surfaces
30.3.3 Disparity/perspective interactions: dynamics
30.4 Disparity and interposition
30.4.1 Averaging disparity and interposition
30.4.2 Cue dominance and cue dissociation
30.4.3 Reinterpretation of interposition
30.5 Disparity and transparency
30.5.1 Disparity-transparency interactions
30.5.2 Effects of color, brightness, and texture
30.6 Disparity and shading
30.7 Accommodation and other depth cues
30.7.1 Accommodation and perspective
30.7.2 Accommodation and disparity
30.8 Motion parallax and perspective
30.9 Cognition and depth-cue interactions

30.1 TYPES OF CUE INTERACTION

30.1.1 FEATURES OF VISUAL CUES TO DEPTH

The broad issues of how signals from different sense organs are combined were discussed in Section 4.5.7. The present chapter is concerned with interactions between different sources of visual information used for the perception of depth. Depth information, like other visual information, may be defined at any of the following five levels:

1. The distal stimulus At this level depth information consists of the actual depth structure of a visual scene. It includes the 3-D structure of each object visible from a given vantage point, plus the relative distances, orientations, and motions of objects and their spatial relations to the vantage point. It also includes color, shading, and shadows. Measurements of distal stimuli obtained with instruments provide the bedrock against which we assess percepts quantitatively.

2. The optic array This is the cone of light rays defined by the visual field. It contains all the visual depth information available from a given vantage point. The optic arrays of the two eyes contain binocular disparity information. However, information in the optic arrays does not distinguish between distal stimuli that produce the same optic array. Drawings with ambiguous perspective and stereograms come under this heading.

3. The proximal stimulus This is the retinal image. Depth information is extended to include processes occurring in the retinal image that produce true or false impressions of depth. These include effects of changes in accommodation and chromostereopsis, in which chromatic aberration of the eye creates an impression of depth (Section 17.8).

4. Retinal neural processing At this level, contrast and color are processed locally in the retina.

5. Central neural processing In the central nervous system depth information is processed first locally and then globally. It is combined with information arising from other sensory modalities and that stored in memory.

Visual depth cues are shown in Table 30.1.

Table 30.1. INFORMATION FOR VISUAL PERCEPTION OF DISTANCE

Monocular
  Static
    Perspective: image size, linear perspective, texture gradient, height in field
    Overlap: occlusion, transparency
    Lighting: shading, shadow
    Aerial effects: optical haze, mist
    Focusing: image blur, accommodation
  Dynamic
    Motion parallax
    Accretion-deletion
Binocular
  Vergence
  Binocular disparity
    Static: position disparity, phase disparity, disparity gradients
    Changing
  Monocular occlusion

Depth cues have the following objective attributes:

1. Value and sign The value of a cue is the objective magnitude of the proximal stimulus that defines the cue. For example, the value of disparity is the magnitude of the crossed or uncrossed disparity of the images without regard to sign. The sign of a depth cue is the objective depth order of two stimuli that the cue specifies. Some cues, such as disparity, have both value and sign. Some cues to depth, such as shading, have value but no sign. The depth cue of overlap has sign but no value.

2. Validity A fully valid depth cue is one that varies only as a function of a change in the distance or relative depth of the distal stimuli. For example, binocular disparity is a valid cue to depth in natural scenes. In stereograms, it is a valid cue to simulated depth. Motion parallax is not a very valid cue, since it can arise from depth or from relative motion of objects. Chromostereopsis arises from the totally invalid proximal cue of chromatic aberration (Section 17.8).

3. Absolute distance versus relative depth Vertical disparity, image size, height in the field, gradients of optic flow, and vergence vary with absolute distance (see Section 29.2.1). Cues such as horizontal disparity and overlap vary only with relative depth between objects.

4. Continuity versus discreteness A continuous cue is one that increases in value continuously with increasing depth or relative depth. Overlap is discrete because it has sign but no magnitude.

5. Monocular versus binocular Some cues are available to monocular viewing while others require binocular viewing.

6. Optic array versus extraretinal stimuli All visual depth information may be defined in terms of the optic array. Vergence and accommodation provide extraretinal depth information.

Depth cues also have the following attributes that determine how they are detected:

1. Detectability This is the threshold value of the cue required to produce an impression of depth.

2. Reliability This refers to the variability of depth impressions arising from a given value of a depth cue. It is determined by three factors:
  a. Stability This indicates the extent to which a cue is subject to adaptation over a period of constant observation.
  b. Repeatability A cue has high repeatability when it produces the same depth impression over separate observations, as indicated by the variance of depth estimates.
  c. Robustness This indicates the immunity of a cue to the presence of distracters or noise.

3. Accuracy Accuracy of a depth cue is the signed difference between the perceived depth created by the cue and the actual depth in the stimulus.

4. Gain The gain of a depth cue is the ratio of perceived depth derived from the cue to the actual depth. It is another measure of accuracy.

5. Range The range of stimulus values over which a given cue produces an impression of depth.

6. Spatial resolution This is indicated by the highest spatial modulation of the value of a cue that generates an impression of depth modulation. For example, a modulation of disparity at a spatial frequency higher than about 4 cpd does not create an impression of modulated depth.

7. Latency and temporal resolution Latency refers to how long the visual system takes to register depth. Temporal resolution determines the highest frequency of temporal modulation of the value of a depth cue that generates an impression of changing depth.
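These attributes have straightforward operational definitions. The sketch below computes repeatability, accuracy, and gain from a set of repeated depth estimates; the numbers are hypothetical and serve only to illustrate the definitions above.

```python
from statistics import mean, variance

actual_depth = 10.0                      # cm; hypothetical stimulus depth
estimates = [7.8, 8.4, 8.1, 7.6, 8.3]    # cm; hypothetical repeated judgments

perceived = mean(estimates)
repeatability = variance(estimates)      # low variance = high repeatability
accuracy = perceived - actual_depth      # signed error (attribute 3)
gain = perceived / actual_depth          # perceived/actual depth (attribute 4)

print(f"mean perceived depth: {perceived:.2f} cm")
print(f"variance of estimates: {repeatability:.3f}")
print(f"accuracy (signed error): {accuracy:+.2f} cm")
print(f"gain: {gain:.2f}")
```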


30.1.2 WAYS IN WHICH DEPTH CUES MAY INTERACT

Theoretically, multiple cues for depth could interact in any of the following ways:

1. Summation of cue detectability Signals in the threshold region from distinct cue systems may sum to provide a stronger signal without affecting the depth that the signals indicate. This will improve detection or discrimination of depth without affecting perceived depth magnitude (Section 13.1). For example, the sensitivity of observers to a difference in depth between two stimuli in the presence of the cues of disparity, image size, accommodation, and motion parallax was the sum of their sensitivities to each cue measured separately (Jameson and Hurvich 1959). Subjects identified the relative depths of areas in a random-dot stereogram more rapidly and with fewer errors when several depth cues were present rather than only one cue (Mather and Smith 2004).

2. Summation of depth magnitude With full cue summation, the magnitude of perceived depth between two objects would be the sum of the depths indicated by each of the component depth cues. For efficient cues this would lead to overestimation of distance. Cue summation is assumed to have occurred whenever a depth estimate based on multiple cues is larger than the depth estimate based on the cue with the highest gain.

3. Cue averaging and trading When depth cues differ in gain, the distances indicated by the cues do not agree. The most efficient way to combine such signals is to take the weighted mean of the distances indicated by the different cues (see the computational sketch at the end of this section). Weights have been determined from discrimination thresholds (Knill and Saunders 2003). But this measure takes no account of the accuracy of judgments for each cue or of variations in accuracy with changing cue magnitude (see Todd et al. 2010). The depth information contained in the weighted signals could be summed and then divided by a normalizing factor. The normalization factor could be the total of the unweighted signals, or normalization could be achieved by inhibitory linkages between the coding processes (Groh 2001). Theoretically, the variance of a judgment based on two cues is lower than that based on either cue alone. When different signals from different cues converge on a common neural process, an observer may offset a change in one signal by an opposite change in the other.

For example, the slant of a surface may be indicated by disparity or by perspective. An observer may null slant produced by disparity by adjusting the perspective of the stimulus. The null points for different values of each cue define a cue trading function. Examples of trading functions in stereoscopic vision are provided by the titration of binocular disparity against monocular parallax (Section 30.2), of disparity cues to motion-in-depth against monocular cues to motion-in-depth (Section 31.4.1), and of disparity against perspective (Section 30.3). Cue trading allows one to investigate the equivalence and relative gains of cues. Cue averaging and trading occur only for cues that have a continuous range of overlapping values that converge early onto a common neural process and that are not widely discordant. Otherwise cues interact in one or more of the following ways.

4. Enhancement of percept reliability Some cue systems simply confirm one another rather than summate or average. This is most evident with bistable or multistable percepts, such as reversible perspective (Section 26.7) or the stereokinetic effect (Section 28.6). The alternative interpretations are exclusive and discrete. The strength of available cues may determine the reliability of a given percept, rather than its magnitude. A reliable percept is stable over a period of observation, repeatable over different occasions, and robust to the effects of noise. For example, Sperling and Dosher (1995) found that the stability of a particular depth interpretation of a bistable 3-D cube depends on the additive contributions of disparity and the relative contrasts of far and near sides. Perceived depth was not affected, only the stability of its perceived sign.

5. Range extension Two cues working together may allow depth intervals to be detected over a larger range of contrasts, distances, or spatial or temporal frequencies than is possible with either cue alone. For example, perspective is a better cue to depth than disparity at far distances, but disparity is the better cue at near distances.

6. Provision of an error signal Responses evoked by some sensory signals lack an error signal, so that there is no way to determine whether the response is adequate. For example, pointing to a visual target with unseen hand does not generate a position error signal, even when the positions of target and hand are well registered. However, pointing with the seen hand does provide an effective error signal, even when the positions of target and hand are not well registered.

7. Cue specialization Different cues may provide information about different components of a stimulus.


For example, Tittle et al. (1997a) found that binocular disparity contributes strongly to detection of the degree of curvature of a surface patch, while shading and texture contribute more strongly to detection of the shape of the patch (whether it is convex, concave, or saddle shaped).

8. Cue dominance When two cues provide conflicting information, judgments may be based on only one, with the other cue being suppressed. For instance, when the depth order of two objects indicated by interposition contradicts that indicated by binocular disparity, the percept is generally determined by interposition (Section 30.4).

9. Cue dissociation When the conflict between two cues is large, the cues may be interpreted as arising from distinct objects. For instance, when prisms cause the seen position of an object to be separated from its felt position, two objects are perceived. Diplopia produced by strabismus is a form of cue dissociation. Anomalous correspondence may also be regarded in this way (Section 14.4.1).

10. Disambiguation of depth sign One cue may resolve an ambiguity of the sign of depth order inherent in another cue. For example, the impression of a 3-D object produced by motion parallax can be ambiguous with respect to the sign of depth. The ambiguity can be resolved by disparity, interposition, or a cast shadow.

11. Reinterpretation of equivalent stimuli A change in a proximal stimulus can arise from more than one cause. For example, a change in the size of the image of an object can arise from motion of the object in depth but also from a change in the size of the object. The ambiguity is resolved in favor of motion-in-depth when information, such as changing disparity, indicates motion-in-depth. Conversely, it is resolved in favor of changing object size when information, such as constant disparity, indicates that the object is stationary. In both cases the ambiguous cue of changing image size can be said to be reinterpreted by a change in disparity information.

12. Cue recalibration Exposure to conflicting depth cues for a period of time may lead to an adaptive shift in the calibration of one or the other cue. This adaptive shift shows itself as a change in perceived depth produced by the adapted cue when it alone is present. Experiments of this type are discussed in Section 21.3. Epstein (1975) reviewed the early literature on this topic.

Bülthoff and Mallot (1988) suggested a different classification of cue interactions based on five categories: (1) accumulation, (2) veto, (3) cooperation, (4) disambiguation, and (5) hierarchy (Portrait Figure 30.1).



Figure 30.1. Heinrich Bülthoff. Born in Zetel, North Germany, in 1950. He obtained a Ph.D. in the natural sciences from the Eberhard-Karls-Universität in Tübingen in 1980 and conducted postdoctoral work at the Max-Planck-Institute for Biological Cybernetics in Tübingen. In 1985 he moved to the Massachusetts Institute of Technology and became professor at Brown University, Rhode Island. In 1993 he returned to the Max-Planck-Institute for Biological Cybernetics in Tübingen, where he is now the director of a group of biologists, computer scientists, mathematicians, physicists, and psychologists working on psychophysical and computational aspects of high-level visual processes. He is an honorary professor at the Eberhard-Karls-Universität.

Disambiguation is common to both classifications, while veto is an alternative name for cue dominance. In hierarchy, information from one cue is used as raw data for a second. Accumulation and cooperation in Bülthoff and Mallot's scheme are different forms of cue averaging or summation. The difference between the two is a function of the extent and level at which the interaction is assumed to take place. Accumulation refers to interactions between the outputs of essentially independent mechanisms, whereas in cooperation the interaction is more substantial and occurs at an earlier stage. Bülthoff and Mallot use the example of the cooperative synergistic interaction between modules dealing with detection of poor or noisy cues. Most studies on depth cue integration have focused on cue averaging and summation. Schiller and Carvey (2006) produced an interesting account of interactions between cues to depth, with a set of illustrations.
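The weighted mean described in item 3 above, and the claim that a two-cue judgment has lower variance than either single-cue judgment, are commonly formalized by weighting each cue by its inverse variance and normalizing the weights to sum to one. The following is a minimal sketch of that standard computation; the numbers are illustrative and are not taken from any experiment discussed here.

```python
def combine_cues(estimates, variances):
    """Reliability-weighted average of single-cue depth estimates.
    Weights are inverse variances, normalized to sum to 1."""
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))
    # Variance of the combined estimate, assuming independent Gaussian cues:
    combined_var = 1.0 / total
    return combined, combined_var, weights

# Disparity signals 10 cm (variance 1.0); motion parallax signals 14 cm
# (variance 4.0). The less reliable cue receives the smaller weight.
depth, var, w = combine_cues([10.0, 14.0], [1.0, 4.0])
print(f"weights: {w}")                       # [0.8, 0.2]: disparity dominates
print(f"combined depth: {depth:.1f} cm, variance: {var:.2f}")
# The combined variance (0.8) is below either single-cue variance (1.0, 4.0).
```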

30.1.3 MODELS OF CUE INTERACTIONS

Dosher et al. (1986) carried out one of the first empirical studies on interactions between disparity, motion, and luminance cues to depth. Their results could be modeled in terms of a weighted linear combination of the individual depth cues. Maloney and Landy (1989) developed this idea and proposed a simple statistical framework for modeling the combination of depth estimates from different cue systems.


They assumed that the different sources of information are processed independently (modularity) and that the final representation is in the form of a depth map of points in different visual directions. In the model, the outputs of the different modules "fuse" into a single depth estimate at each point in the scene. Their model also assumes a linear weighted combination of depth estimates and was not designed to account for severe cue conflict.

Different cues provide different sorts of information. For example, in principle, motion parallax and binocular disparities provide complete information to create a depth map of points in different visual directions. By comparison, texture gradients and linear perspective provide information only up to a scaling factor. In the third class of cues identified by Maloney and Landy (which includes the kinetic depth effect) there is an additional uncertainty about the sign of depth and the direction of rotation of a moving object. Maloney and Landy suggested that cues that provide the missing parameters "promote" the information signaled by other cues present at the same time. Thus, simultaneous depth cues provide comparable absolute depth estimates. The ideal-observer model then produces a weighted linear combination of the outputs of different modules.

Clark and Yuille (1990) distinguished between "weak fusion" and "strong fusion." Weak fusion occurs between the outputs of independent mechanisms and is similar to Bülthoff and Mallot's "accumulation." Strong fusion is related to Bülthoff and Mallot's "cooperation." Information from different cues is processed cooperatively to derive a depth estimate. Strong fusion models predict nonlinear interactions between the cues, whereas weak fusion models preclude such interactions (Young et al. 1993). Scaling or "promoting" one cue by a parameter obtained from another cue implies more than accumulation of evidence from independent modules, and therefore implies strong fusion. Landy et al. (1991c, 1995) suggested the idea of "modified weak fusion," which limits nonlinear cue interactions, characteristic of strong fusion, to cue promotion. In some situations, in which a parameter is missing, the unscaled estimates may be averaged prior to the scaling process (Portrait Figure 30.2). This would be regarded as weak fusion.

Landy et al. (1991c) and Young et al. (1993) used perturbation analysis to minimize inconsistencies between cues. This allowed the accuracy of depth judgments based on motion and texture to be modeled by a simple weighted linear combination of estimates based on distinct cues. Linearity of cue combination is more consistent with the weak fusion model than the strong fusion model. The ideal-observer model of Landy et al. also predicts that the weights given to different depth cues change with changes in the reliability of the cues. Experimental results support this prediction.

Depth is represented by a depth map in the weak fusion and modified weak fusion models. Cues that do not provide complete information have to be "promoted" by information supplied by richer cues.

Figure 30.2. Michael Landy. Born in Bronxville, New York, in 1957. He obtained a B.S. in electrical engineering from Columbia University in 1974 and a Ph.D. in computer and communication sciences from the University of Michigan in 1981. He did postdoctoral work with George Sperling at New York University and then, in 1984, he obtained a faculty appointment at New York University, where he is now professor of psychology and neural science. He spent the years 1999 to 2002 in the School of Optometry, at the University of California at Berkeley.

In contrast, Bülthoff (1991) emphasized the differences between the information provided by different cues and suggested that procedures for investigating cue interactions depend on the assumed level of representation. He proposed a Bayesian framework to model the integration of different depth cues (see also Bülthoff and Yuille 1991). Cues are weighted according to their robustness, and smoothing is applied to improve the reliability of depth estimates. For example, little smoothing is needed for depth cues such as edge disparities, since they provide reliable depth estimates. On the other hand, binocular differences in the shape of a luminance profile—intensity-based disparities—are less reliable and require more smoothing. Bülthoff argued that the predictions of the Bayesian model are consistent with the results of his own experiments, in which observers adjusted a depth probe to appear on the surface of ellipsoid shapes defined by rotation. In general, models of cue combination based on energy minimization or regularization predict that, in the absence of reliable information, surfaces regress toward the frontal plane. Poggio et al. (1988) proposed an alternative to the Bayesian framework, based on Markov random fields.
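For Gaussian likelihoods, the regression toward the frontal plane predicted by such regularization models follows from the same arithmetic as reliability weighting: a prior centered on zero slant acts like an additional cue whose influence grows as the image cues lose reliability. A minimal illustration with hypothetical variances (a sketch of the general idea, not a reconstruction of Bülthoff's model):

```python
def bayes_slant(cue_slant, cue_var, prior_slant=0.0, prior_var=25.0):
    """Posterior mean slant for a Gaussian cue likelihood combined with
    a Gaussian prior centered on the frontal plane (zero slant)."""
    w_cue = (1.0 / cue_var) / (1.0 / cue_var + 1.0 / prior_var)
    return w_cue * cue_slant + (1.0 - w_cue) * prior_slant

# A 30 deg slant signal: a reliable cue versus a noisy one.
print(bayes_slant(30.0, cue_var=1.0))    # ~28.8 deg: close to the signal
print(bayes_slant(30.0, cue_var=100.0))  # 6.0 deg: regresses toward frontal
```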


The differences between the modified weak fusion model of Landy et al. (1991c) and Bülthoff's (1991) Bayesian model may arise from the different sorts of interactions that the models were designed to explain. Nonlinear interactions such as dominance or vetoing have been observed experimentally, particularly when the cues differ in the kind of information they provide. On the other hand, both models predict the observed linear averaging when the cues provide similar information and the quantitative discrepancies between cues are small. Bülthoff and Yuille (1991) argued that the Bayesian approach, which incorporates general assumptions and constraints, provides a more general framework for thinking about both low- and high-level vision (see also Yuille et al. 1991). This strategy may be contrasted with the view that there is no real theory of vision but only a "bag of tricks" that provides particular solutions in particular situations (Ramachandran and Anstis 1986). The problem with the Bayesian approach is that it can predict the patterns of cue interaction in novel situations only in rather general terms. We may also question the wisdom of designing models to account for interactions between discrepant or contradictory sources of depth information that do not normally occur. Fine and Jacobs (1999) have also modeled cue interactions.

30.2 DISPARITY AND MOTION PARALLAX

Strong impressions of depth can be created by differential motion of superimposed or adjacent displays (Chapter 28). All forms of optic flow that produce depth can be regarded as dynamic perspective. Interactions between changing image size (looming) and other cues to depth are discussed in Section 31.4. The present section deals with interactions between motion parallax and other cues to depth.

30.2.1 CROSS-CUE BIASING

30.2.1a Opposite Biasing

One can ask whether exposure to a cue conflict situation leads to a recalibration of cues. Wallach and Karsh (1963) asked whether exposure to conflict between motion parallax and disparity leads to a recalibration of the stereoscopic system. For 10 minutes, subjects looked through a telestereoscope at a wire form rotating in depth. This increased disparity-defined depth relative to motion-defined depth. Subsequently, subjects underestimated the depth dimension of a 3-D stationary wire pyramid.

A related question is whether prior inspection of a depth cue of one sign biases the sign of perceived depth of an ambiguous stimulus created by another cue. Graham and Rogers (1982b) created an ambiguous disparity-defined corrugation by repeating patterns of random dots, as in an autostereogram. The dots could be dichoptically matched either in front of or behind the fixation point by an amount that depended on the period of repetition (Rogers and Graham 1984). They also created an ambiguous motion-parallax corrugation by a pattern of horizontal shearing motion that mimicked that formed by a real corrugation.



When viewed with stationary head, the sign of depth of the corrugation was ambiguous. After 15 s of adaptation to an unambiguous motion-parallax corrugation, the ambiguous stereo corrugations were overwhelmingly biased toward the opposite phase in depth. Similarly, after 15 s of adaptation to an unambiguous disparity-defined corrugation, ambiguous motion-parallax corrugations were strongly biased toward the opposite sign (see also Rogers and Graham 1984).

Monocular adaptation to a surface with slant produced by motion parallax caused a binocular stationary test surface to appear slanted in the opposite direction. Similarly, adaptation to slant defined by the unambiguous cue of binocular disparity produced an impression of opposite slant in a surface in which slant was produced only by motion parallax. Also, adaptation to slant specified by perspective induced aftereffects in surfaces in which slant was defined by disparity or by motion parallax (Poom and Börjesson 1999; Domini et al. 2003).

In the kinetic depth effect (KDE), the depth sign of the orthographic projection of a rotating 3-D object is ambiguous (Section 28.5). Each perceived reversal of depth is accompanied by an apparent change in the direction of rotation. For example, the depth sign and perceived direction of rotation of the projected image of a rotating sphere defined by random dots periodically reverse. Following 90 s of binocular inspection of an actual rotating sphere covered with random dots, in which disparity indicated the true direction of rotation, the projected image of the sphere appeared to rotate in the opposite direction. The aftereffect lasted up to 30 s and occurred only when adapting and test stimuli were in the same location (Nawrot and Blake 1989, 1991a). Nawrot and Blake also noted that an ambiguous kinetic depth display could be indistinguishable from a display with unambiguous disparity. All these results suggest that the processing of depth from motion parallax and from disparity engage a common neural mechanism.

Nawrot and Blake reported that several nonstereoscopic depth cues, such as perspective (Braunstein 1966), occlusion (Braunstein et al. 1982; Andersen and Braunstein 1983), and luminance proximity (Dosher et al. 1986) disambiguated the direction of rotation in a KDE display when presented at the same time. However, prior adaptation to an unambiguous KDE display lacking disparity was not sufficient to bias an ambiguous KDE display (Portrait Figure 30.3). For example, prolonged inspection of a rotating KDE sphere under polar projection, in which one depth interpretation was dominant for some time, failed to bias the perceived direction of rotation of the image of a rotating sphere seen under parallel projection. Although consistent with the idea of a special link between disparity and structure-from-motion processes, these results may be a consequence of the fact that disparity is a more powerful source of disambiguating information than is motion parallax.


Nawrot and Blake (1991b) proposed a neural-network model of interactions between disparity and structure from motion, which is consistent with many of these results. Nawrot and Blake (1993b) produced further evidence of links between motion parallax and disparity. They used a dynamic visual noise stimulus in which the dots survived for only one frame before being replaced. With a small interocular delay between the presentations of the same dynamic noise pattern, observers perceived the dots swirling around the fixation point, clockwise or counterclockwise depending on which eye received the delayed presentation (Ross 1974; Tyler 1974c). Fifteen seconds of adaptation to stereoscopic superimposed frontal random-dot displays moving in opposite directions at different depths produced an opposite bias in the apparent direction of rotation in depth of dynamic visual noise. Moreover, the biasing effect was still evident when an interocular delay introduced into the dynamic visual noise dots provided the noise with an unambiguous signal about direction of rotation.

These results show that depth aftereffects following adaptation to moving disparate surfaces interact quantitatively with depth produced by interocular delay. Cross-cue adaptation effects must arise in a mechanism that codes depth irrespective of the cue that produces it. Physiological evidence for cue-invariant detectors was reviewed in Section 5.8.3b.

Other aspects of aftereffects of motion-in-depth are discussed in Section 31.6.

Figure 30.3. Myron L. (Mike) Braunstein. Born in 1936 in New York City. He obtained a B.Sc. in psychology from Brooklyn College in 1956 and a Ph.D. in psychology from the University of Michigan in 1961. He was then a research psychologist at the Cornell Aeronautical Laboratory and at the Flight Safety Foundation. In 1965 he joined the faculty at the University of California at Irvine, where he is now professor emeritus in the Department of Cognitive Sciences. He was editor of Perception and Psychophysics from 1994 to 1998.

30.2.1b Cross-Cue Priming

We have seen that prolonged exposure to a stimulus of one sign biases the visual system to perceive the opposite sign in an ambiguous stimulus. However, brief exposure to a stimulus with a given sign can bias the system to continue perceiving the same sign. This is known as priming. For example, a 1-s exposure to two superimposed random-dot displays, with +1 and −1 arcmin of disparity, moving horizontally in opposite directions biased the impression of rotation in an ambiguous KDE sphere in the same direction (Nawrot and Blake 1993a). Also, viewing a KDE display for 1 s caused a display containing unambiguous disparity cues to appear to rotate in the same direction as the KDE display. In these priming effects, an ambiguous stimulus or weakly unambiguous stimulus takes on the same sign as the priming stimulus, not the opposite sign, as in aftereffects due to adaptation. Thus, over the short term, the visual system exhibits priming, or percept inertia. It is as if the visual system is reluctant to believe that the stimulus has suddenly changed. Adaptation or fatigue takes time to develop. See Sections 30.2.1 and 31.6.3 for other examples of long-term adaptation and short-term priming in motion-in-depth.

30.2.2 BETWEEN-CUE CANCELLATION OF AFTEREFFECTS

In a cue-trading procedure, the effect of one cue is canceled by the introduction of another cue of opposite sign. A cue-trading function defines the magnitude of one cue required to null another as a function of cue magnitude.

Prolonged viewing of a random-dot surface containing sinusoidal corrugations defined by either disparity or motion parallax causes a subsequently seen flat surface to appear corrugated in the opposite direction (Section 21.7.2). The aftereffect has been measured with a nulling procedure in which either disparity or parallax depth was introduced into the test surface until it appeared flat. In some cases, up to 80% of the depth in the inspection surface had to be introduced into the test surface for it to appear flat (Graham and Rogers 1982a).

The aftereffect created by adaptation to a disparity-defined corrugation could be nulled with motion-parallax depth in a monocular random-dot pattern viewed with side-to-side head movements. Likewise, the aftereffect created by monocular adaptation to corrugations specified by motion parallax linked to head movements could be nulled by adjusting the disparity in a binocular corrugated surface defined by disparity and viewed with stationary head.


The depth needed to null aftereffects in the between-cue situation was always much less than that needed in the within-cue situation. Also, more motion parallax was needed to null effects of adaptation to disparity than the other way round (Graham and Rogers 1982b). These results provide evidence of quantitative cue averaging between the depth aftereffects created by one cue and depth created by a different cue. In addition, they show that an aftereffect from a binocular cyclopean stimulus can be seen in a monocular test pattern.

30.2.3 SUBTHRESHOLD CUE SUMMATION

If neural signals conveying binocular disparity and those conveying motion parallax converge, thresholds for detecting depth specified by both cues should be lower than thresholds based on either cue alone. In other words, there should be subthreshold summation of information provided by the two cues (Graham 1989). The contribution of probability summation must be allowed for (Section 13.1.1). There is also another factor. It was shown in Section 15.3.9 that random-dot binocular images are more easily matched when dots in the different depth planes move in different directions. In other words, common motion aids the detection of corresponding images. An effect due to this factor would be distinct from effects due to cue summation and could account for nonlinear effects in cue combination.

Bradshaw and Rogers (1996) looked for subthreshold summation in the detection of the phase (0° or 180°) of sinusoidal depth corrugations with spatial frequencies of 0.1, 0.2, and 0.4 cpd. Thresholds were measured for disparity-defined corrugations, motion parallax-defined corrugations, and for corrugations specified by both cues together. The relative amplitudes of the two cues were scaled (normalized) according to their thresholds. Thresholds for detecting the combined-cue corrugations were about half those for the separate-cue surfaces. This summation effect is greater than that predicted by most models of probability summation. Thus, mechanisms that process disparity and motion parallax must interact before threshold judgments are made.

Ichikawa et al. (2003) measured thresholds for detecting sinusoidal depth modulations specified by motion parallax and/or disparity. The two cues lowered the depth-detection threshold more than expected by probability summation only when they specified an undulation of the same frequency and phase. The threshold was elevated when disparity-defined corrugations were 180° out of phase with superimposed motion-defined corrugations (Ichikawa and Saida 1998).
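Probability summation for independent detectors provides the baseline against which such combined-cue improvements are judged: the combined stimulus goes undetected only if every cue is missed, so detection probability is 1 - (1 - p1)(1 - p2). This predicts a much weaker improvement than the near-halving of two-cue thresholds reported by Bradshaw and Rogers. A minimal sketch with illustrative probabilities:

```python
def prob_summation(p_cues):
    """Detection probability for independent cues: the combined stimulus
    is missed only if every single cue is missed."""
    p_miss = 1.0
    for p in p_cues:
        p_miss *= (1.0 - p)
    return 1.0 - p_miss

# Two cues each detected 50% of the time: independence predicts 75%,
# not the near-certain detection implied by a halved two-cue threshold.
print(prob_summation([0.5, 0.5]))   # 0.75
print(prob_summation([0.7, 0.6]))   # 0.88
```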



One can ask whether motion-parallax signals and disparity signals summate in the same way for different tasks. Tittle et al. (1997b) created random-dot triangular and sine-wave depth corrugations defined by disparity alone, by motion parallax alone, or by both disparity and parallax. Each surface was embedded in random noise. They measured the ratio of signal dots to noise dots required for detection of the surface. Subjects were better able to discriminate between sine-wave and triangular corrugations with both cues than with one cue. But the improvement was no more than predicted from the combination of two independent sources of information. However, when subjects discriminated differences in spatial frequency of depth corrugations, the presence of two cues improved performance more than predicted by probability summation. Thus, sources of information may sum in one way for one task but in another way for another task.

Cornilleau-Pérès and Droulez (1993) reported that thresholds for discrimination between planar and curved surfaces were generally lower for surfaces specified by motion parallax than for surfaces specified by disparity, and lowest for surfaces specified by both cues. However, cue facilitation was no greater than predicted by probability summation. Possible effects of differential image motion on detection of corresponding images were not considered in any of these studies.

30.2.4 BETWEEN-CUE THRESHOLD ELEVATION

Depth aftereffects are created after inspection of a disparity-defined depth corrugation with steady gaze or with the images stabilized on the retina (Section 21.7.2). A depth aftereffect is also seen after the eyes have scanned along the depth corrugations. But scanning across the corrugations produces no aftereffect, because this type of eye movement creates a sequence of depth corrugations of opposite phase. However, across-corrugation scanning produces a short-term elevation of the threshold for detecting corrugations of the same or similar spatial frequencies (Schumer and Ganz 1979).

Bradshaw and Rogers (1996) asked whether the threshold-elevation effect transfers from corrugations defined by disparity to those defined by motion parallax. Before adaptation, the mean depth-detection threshold for two observers was 4.9 arcsec for a 0.2-cpd corrugation defined by disparity and 6.5 arcsec for one defined by parallax. Subjects then inspected for 3 minutes a depth corrugation defined by 4.3 arcmin of disparity (60 times the threshold). The corrugations reversed in phase every 2 s. Following adaptation, thresholds for detecting a disparity corrugation of the same spatial frequency increased by an average of 112%. Subjects then adapted to phase-reversing corrugations defined by motion parallax. Depth-detection thresholds for motion-parallax corrugations increased by 76%. These two effects were within-cue threshold elevations.


Thresholds for discriminating the phase of disparity corrugations after adaptation to motion-parallax corrugations increased by 50%, while thresholds for motion-parallax corrugations after adaptation to disparity corrugations rose by 45%. These between-cue threshold elevations were therefore smaller than the within-cue threshold elevations. These results support the idea that disparity and motion-parallax mechanisms interact quantitatively at an early stage in the processing hierarchy.

30.2.5 RESOLUTION OF MOTION-PARALLAX AMBIGUITIES

Consider a real 3-D corrugated surface translating in front of an observer through a small distance compared with the distance to the surface. A given pattern of relative motion between the peaks and troughs of the corrugated surface may be created by any of the following stimuli: (1) a corrugated surface with a particular depth modulation translating in the frontal plane, (2) a surface with a larger peak-to-trough depth, which rotates toward the observer as it moves toward each end of its translation, and (3) a surface with a smaller peak-to-trough depth, which rotates away from the observer as it moves toward each end of its translation, as shown in Figure 30.4.

Rogers and Collett (1989) inquired how the visual system deals with this ambiguity. They used a 2-D random-dot display depicting a 3-D surface with horizontal sinusoidal ridges (0.2 cpd) translating to-and-fro in a frontal plane (object-produced parallax), as depicted in Figure 30.4. Observers adjusted the depth of corrugations specified by disparity and motion parallax to match the depth of monocularly viewed corrugations specified only by motion parallax. The perceived depth of the monocular surface closely matched that of the two-cue stimulus, as shown by the dotted line in Figure 30.5. The monocular surface did not appear to rotate as it translated to-and-fro. This suggests that the visual system detects the change in slant of a surface relative to the line of sight as it translates along a frontal path. Or perhaps it adopts the default assumption that the surface is moving in a frontal plane rather than rocking.

In a second experiment, Rogers and Collett used corrugations that contained different combinations of motion parallax and peak-to-trough disparity. As in the first experiment, subjects adjusted the depth of the two-cue display to match that of each of the monocular displays. Figure 30.5 shows the perceived depth of the monocular corrugations relative to that of the two-cue corrugations. Results for each disparity are plotted as a function of the amplitude of motion parallax. With small disparities, perceived depth increased with increasing parallax amplitude. When disparity was zero (signaling a flat surface), perceived depth with binocular viewing increased linearly with parallax amplitude and was approximately 50% of that found with monocular viewing. Monocular viewing removed the conflicting information provided by zero disparity. As in the first experiment, the perceived depth of the test surface closely matched that of the comparison stimulus.

Figure 30.4. The ambiguity of motion parallax. When a corrugated surface translates to-and-fro in a frontal plane, as in (A), there is motion parallax between points on the surface with different depths. Approximately the same amount of motion parallax is created by a surface with more depth that rotates in a concave direction as it translates, as in (B), or by a surface with less depth that rotates in a convex direction as it translates, as in (C). (Adapted from Rogers and Collett 1989)
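The trade-off illustrated in Figure 30.4 can be roughed out with small-angle geometry. For a surface at distance D that translates laterally through T, peaks and troughs separated in depth by 2δ produce a relative image displacement of roughly 2δT/D², and a rotation φ during the translation contributes roughly 2δφ/D. Under these approximations (ours, for illustration only, with hypothetical values), a deeper surface with an opposing rotation produces the same parallax as a shallower surface that does not rotate:

```python
def relative_parallax(half_depth, distance, translation, rotation=0.0):
    """Approximate relative image displacement (radians) between peaks and
    troughs of a corrugated surface (half_depth = half the peak-to-trough
    depth, meters) that translates laterally and possibly rotates.
    Small-angle approximation: translation and rotation terms simply add."""
    return 2 * half_depth * (translation / distance + rotation) / distance

D, T = 1.0, 0.05  # 1 m viewing distance, 5 cm translation (hypothetical)
a = relative_parallax(0.02, D, T)                   # (A) 4 cm depth, no rotation
b = relative_parallax(0.04, D, T, rotation=-0.025)  # (B) more depth, concave rotation
c = relative_parallax(0.01, D, T, rotation=0.05)    # (C) less depth, convex rotation
print(a, b, c)  # all 0.002 rad: the three configurations are indistinguishable
```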

Thus, motion parallax and disparity interacted quantitatively to determine the perceived depth of simulated translating surfaces. The two cues engaged in cue averaging, or weak fusion. Ichikawa and Saida (1996) confirmed that perceived depth magnitude is a compromise between that specified by motion parallax and that specified by disparity. However, they found that some people relied more on parallax in judging depth order while others relied more on disparity.

Figure 30.5 shows that, for disparities larger than 8 arcmin, varying parallax amplitude had little effect on perceived depth. Instead, the surface appeared to rotate as it translated in the frontal plane, a "concave" rotation when parallax was smaller than disparity, and a "convex" rotation when it was larger than disparity. Thus, for large disparities, parallax did not enhance depth but was taken as a sign that the surface was rotating. One interpretation of these data is that the weight assigned to disparity increases as disparity increases.

I N T E R A C T I O N S B ET WE E N V I S UA L D E P T H C U E S



155

Figure 30.5. Perceived depth in disparity plus parallax surfaces. Matched amplitude (arcmin) is plotted against parallax amplitude (arcmin); one line marks the correct depth defined by the comparison display. Observers matched the depth in corrugations specified by different amounts of motion parallax (abscissa) with disparity set at different values (different curves) to depth in corrugations specified by congruent disparity and parallax. For small disparities, perceived depth increased with increasing parallax. With zero disparity, perceived depth was about 50% of that with monocular viewing. With monocular viewing, perceived depth was close to that specified by the congruent depth cues. The amount of parallax depth had little effect on perceived depth when disparity was large (8 arcmin). (Adapted from Rogers and Collett 1989)

Rogers and Collett (1989) suggested an alternative interpretation. They proposed that disparity reduces the "ambiguity" of motion parallax because changes in disparity produced by a translating 3-D surface indicate the amount of surface rotation. However, motion parallax is ambiguous only for small surfaces or surfaces that rotate through a small angle. For large surfaces, changes in perspective produced by translation provide unambiguous information about the orientation of the surface to the line of sight. Also, the acceleration component of the flow field indicates the amount of rotation. Thus, depth is seen veridically in large monocularly viewed surfaces translating in a frontal plane.

When both motion parallax and disparities are present, there are two constraints on the interpretation of parallax. Perspective and flow-field acceleration constrain perceived rotation of the surface with respect to the line of sight, and disparities constrain perceived depth. Consider what happens when the cues are in conflict. If the interpretation of motion parallax involved setting the depth of the stimulus to that specified by disparities, the amount of rotation would be incompatible with that specified by perspective changes and flow-field acceleration. If the interpretation of motion parallax were based on the amount of rotation specified by perspective changes and flow-field acceleration, perceived depth would be incompatible with disparity depth. In the surfaces with large disparity-defined depth used by Rogers and Collett, the predicted rotation when motion parallax was inconsistent with disparity was necessarily small and not incompatible with perspective and flow-field information.



With small disparity-defined depth, the predicted rotation was large and was therefore incompatible with perspective and flow-field information. Under these circumstances, one would expect perceived depth to be influenced by the amount of parallax as well as by the disparity. This was precisely the pattern of results. Williams (1994) measured perceived rotation and perceived depth in translating surfaces. His results were consistent with the idea that the interpretation of parallax is constrained by both the depth specified by disparities and the rotation specified by either the change of perspective or the acceleration component of the flow field. He also reported that manipulations of the perspective component of the flow field affected both the perceived depth and the perceived rotation of the translating parallax surfaces (see also Rogers and Bradshaw 1991). This shows that the visual system uses the change of perspective provided by large displays as a source of information about change of slant. For a particular amount of rotation, the change in perspective of a surface that subtends only a small visual angle becomes vanishingly small, and therefore the pattern of results observed by Rogers and Collett (1989) should be significantly affected by the size of the display.

30.2.6 SURFACE SHAPE FROM DISPARITY AND PARALLAX

30.2.6a Disparity and Parallax Interactions

The addition of binocular disparities provides unambiguous information about the sign of depth in otherwise ambiguous motion-parallax displays. Thus, disparity disambiguates the sign of depth in the orthographic projection of a rotating textured sphere or wire-frame Necker cube (Braunstein et al. 1986; Dosher et al. 1986). This is an example of a cooperative interaction. The addition of disparity disambiguated the sign of depth in a moving projected image even for observers who could not detect depth in a stationary stereogram (Rouse et al. 1989).

Interaction between disparity and motion parallax can also affect the magnitude of perceived depth of a 3-D shape. Tittle and Braunstein (1993) presented stereograms of horizontal transparent cylinders. In one condition they were stationary. In other conditions they rotated about the long axis, or translated along a horizontal path. The judged depth-to-height ratio of the moving cylinders was greater than that of stationary cylinders. Thus, the addition of motion parallax enhanced the perceived depth of the cylinders. However, perceived depth increased as disparity increased, with parallax-defined depth remaining constant. In fact, depth-to-height judgments were reasonably consistent with the magnitude of disparity. This suggests that subjects weighted disparity more heavily than motion parallax in this situation. Rogers and Collett (1989) reported similar results.


The trading function between disparity and parallax is characteristic of an accumulation or weak fusion interaction. Tittle and Braunstein hypothesized that motion parallax increases the perceived depth in a display with disparity because corresponding points are easier to match when they are moving. They found that transparent cylinders composed of high-density patterns, which are difficult to see in depth when stationary, became easier to see when the dots were moving. This is consistent with their hypothesis that motion facilitates disparity processing by helping to resolve the correspondence problem (see Section 15.3.9).

Norman et al. (1995) used a stereoscopic stimulus like that shown in Figure 30.6, in which the surface was textured, smoothly shaded, or with highlights. Subjects set a gauge figure to be tangential to the surface at several points. The mean error of probe settings was 24.5° when neither disparity nor motion was present but fell to 14.5° when either disparity or motion was added. This was the case even for shaded surfaces that lacked sharp surface features. Thus, adding rotation enhanced depth by about the same extent as adding disparity. Norman and Raines (2002) used similar stimuli.


Subjects discriminated the depth order of two points on the surface of simulated 3-D objects. When only the outlines of test objects were visible, performance was much better just after the silhouette shape had rotated 360° than when it remained stationary. Adding disparity to the silhouette images improved performance for the static shapes but not for the rotated shapes. Note that disparity for 3-D objects in silhouette consists largely of differential occlusion of surface regions (Section 17.3) rather than positional disparities. Adding texture density gradients to the surface of the stationary objects improved performance considerably.

Lappin and Love (1992) used a display that simulated an ellipse lying on an inclined plane. Subjects selected a frontal shape that matched the actual shape of the ellipse. Accuracy was high when the ellipse rotated in the inclined plane through 80°. Performance declined with decreasing amplitude of rotation and was poor when the ellipse was stationary. Performance was also poor when the simulated motion deviated from that of a planar figure. Accuracy with the rotating shapes was no better when disparity was added. Any misperception of the shape of an ellipse rotating on an inclined surface would result in an impression of a nonrigid shape. Thus, the assumption of rigidity would help viewers to detect the true shape of a moving ellipse but not that of a stationary ellipse.

Figure 30.6. One of the stimuli used by Norman et al. (1995). The top stereogram is for convergent fusion, and the lower one is for divergent fusion. (Copyright by Psychonomic Society Inc.)


Perception of disparity-defined curvature is anisotropic. Surfaces curved vertically are perceived more readily than surfaces curved horizontally (Rogers and Graham 1983). The perception of 3-D curvature defined by motion (the kinetic depth effect) is anisotropic in a different way. Motion around the axis of a cylinder (the axis of curvature) produces less vivid depth than motion around an axis orthogonal to the axis of curvature. This is so whatever the orientation of the cylinder relative to the observer (Cornilleau-Pérès and Droulez 1989; Norman and Lappin 1992). Rotation about an axis parallel to the cylinder axis introduces no image curvature, whereas rotation about an orthogonal axis introduces changing curvature (spin variation) into the images of lines running round the cylinder, as described in Section 28.4.

Norman and Todd (1995) created a display with disparity-defined sinusoidal ridges in one direction and motion-defined ridges in the orthogonal direction. Subjects saw the disparity-defined ridges when they were in the most effective orientation for disparity. They saw motion-defined ridges when the axis of rotation was best suited to the perception of depth from motion.

Tittle and Braunstein (1993) drew four conclusions about the relationship between disparities and motion parallax, to which a fifth can be added.

1. Interactions between disparity and motion parallax can be cooperative and facilitative, as evidenced by the fact that disparity can disambiguate the sign of depth in a parallax display. Also, depth created by both cues can be more than the sum of the depths created by the two cues in isolation.

2. Depth from displays containing both disparity and motion parallax does not depend solely on the depth specified by parallax, whereas Richards's (1985) model predicts that it does.

3. Facilitation provided by parallax increases as the linkage of binocular images is made more difficult by increasing the density of texture elements.

4. The facilitation provided by motion parallax is due not only to the presence of opposite directions of motion but also to the presence of structure from motion in the displays.

5. Disparity and parallax are anisotropic in different ways, so that their relative weighting depends on the spatial orientation of the stimulus.

Seen together, these results suggest that processing of 3-D structure from motion and binocular disparity is not strictly modular with outputs combined on the basis of a weighted average, but that cooperative and facilitative interactions are also operating. Tassinari et al. (2008) have developed a model for how the visual system combines disparity and motion parallax in the detection of surface curvature.



30.2.6b Depth Scaling by Disparity and Parallax

Disparity and motion parallax specify the true depth in a 3-D object only when viewing distance is correctly registered. Both disparity and parallax contain information about viewing distance. Brenner and Landy (1999) asked whether the two cues combine in the registration of absolute distance. They presented two textured 3-D ellipsoids side by side in a stereoscope. Subjects set the width and depth of each shape to create the impression of a spherical tennis ball and then set the distance of the ball on the left to be half of that on the right. When the ball on the left was rotated about its midhorizontal axis it was set much more closely to spherical. However, this did not affect the perceived distance of either ball or the shape setting of the stationary ball. Brenner and Landy concluded that subjects did not combine disparity and motion parallax to obtain a more accurate estimate of viewing distance.

In a subsequent study, subjects made similar settings of a rotating 3-D textured ellipsoid to match a tennis ball (Champion et al. 2004). Shape settings of the rotating shape did not vary with viewing distance and did not affect the shape settings of a stationary ellipsoid exposed subsequently, either in the same or in a different depth plane. They concluded that this provides further evidence against the idea that a combination of disparity and parallax provides information about viewing distance. Perhaps the results would be different with large displays.
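The distance dependence at issue here is steep. For interocular separation a, a depth interval Δd centered at distance D produces a relative disparity of approximately aΔd/D², so inverting disparity to obtain metric depth requires D, and an error in registered distance scales the recovered depth by the square of that error. A hedged sketch of this standard approximation, with illustrative numbers:

```python
IOD = 0.065  # interocular separation in meters (typical value)

def disparity_from_depth(depth_interval, distance):
    """Approximate relative disparity (radians) produced by a depth
    interval (m) centered at the given viewing distance (m)."""
    return IOD * depth_interval / distance**2

def depth_from_disparity(disparity, registered_distance):
    """Invert the approximation: recovered depth grows with the square
    of the distance the visual system registers."""
    return disparity * registered_distance**2 / IOD

eta = disparity_from_depth(0.03, 2.0)    # 3 cm of depth at 2 m
print(depth_from_disparity(eta, 2.0))    # 0.03: correct distance, correct depth
print(depth_from_disparity(eta, 1.4))    # ~0.015: distance underestimated
# Underestimating distance flattens recovered depth, the pattern reported
# for disparity-defined cylinders in the next section.
```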

30.2.7 DEPTH-TO-SIZE INVARIANCE

Consider a flat surface rotating about a vertical axis and viewed with one eye. Two sources of information indicate the change in surface slant. First, for a surface subtending more than about 5°, the image contains detectable perspective changes. These changes are a function only of the change of slant to the line of sight. They are independent of distance, for surfaces of a given angular size. The second source of information to specify the change in slant of a rotating surface is the spatial gradient of velocity of the flow field (spatial acceleration).

If the surface is not flat there will be a complex pattern of relative motion between parts of the rotating surface. The magnitude of relative retinal motion at each location depends on the gradient of the depth modulation at that point and on the velocity of rotation. Two similar surfaces of the same angular size, but one twice as far away as the other, produce approximately the same pattern of motion parallax if the depth modulations of the far surface are twice those of the near surface. It follows that all surfaces that have the same ratio of depth modulation to physical size (three-dimensionally similar surfaces) create the same motion parallax for a given change of slant. Hence, if change of slant with respect to the line of sight is indicated by either perspective changes or acceleration of the flow field, the ratio of relative motion to angular size specifies the depth-to-size ratio of the surface whatever its distance from the observer (Richards 1985). This invariant property is characteristic of motion parallax produced by either object motion or observer motion.
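The invariance is easy to verify numerically: doubling the distance while doubling both the physical size (keeping angular size constant) and the depth modulation leaves the ratio of relative image motion to angular size unchanged. A sketch under the same small-angle assumptions used earlier, with hypothetical values:

```python
def motion_to_size_ratio(depth_mod, half_width, distance, slant_change):
    """Ratio of relative image motion (from a change of slant about a
    vertical axis) to angular size; small-angle approximation."""
    relative_motion = 2 * depth_mod * slant_change / distance  # radians
    angular_size = 2 * half_width / distance                   # radians
    return relative_motion / angular_size

phi = 0.1  # change of slant, radians
near = motion_to_size_ratio(0.01, 0.05, 1.0, phi)  # 1 cm depth, 5 cm half-width, 1 m
far = motion_to_size_ratio(0.02, 0.10, 2.0, phi)   # same 3-D shape, twice as far
print(near, far)  # identical: the ratio fixes depth-to-size, not distance
```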

OT H E R M E C H A N I S M S O F D E P T H P E R C E P T I O N
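The invariance can be expressed in one line. The following sketch uses small-angle approximations, with symbols introduced here for illustration: a surface of lateral extent S carrying a depth modulation D at distance d undergoes a change of slant Δσ. A point at depth offset D then moves laterally through approximately D·Δσ, producing image motion of approximately D·Δσ/d, while the angular size of the surface is approximately S/d. Hence

\[
\frac{\text{relative motion}}{\text{angular size}} \approx \frac{D\,\Delta\sigma / d}{S / d} = \Delta\sigma \,\frac{D}{S},
\]

in which the distance d cancels: for a registered change of slant Δσ, the ratio depends only on the depth-to-size ratio D/S.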

Johnston et al. (1994) proposed that this invariance allows the visual system to determine the 3-D shape of a moving surface irrespective of its distance, where shape is defined in terms of a depth-to-height or depth-to-width ratio. For a given change of slant of the surface with respect to the line of sight, the ratio of parallax motion to angular size is the same for all surfaces that have the same depth-to-size ratio.

The same is true of surfaces with slant defined by disparity. For a surface seen from two vantage points, the ratio of disparity to angular size is not affected by the distance to the surface. But this is true only if vergence is held constant. If vergence is changed, the invariant relationship no longer holds, and the shape of surfaces can be correctly determined only if there is information about the vergence angle of the eyes. In situations where vergence is not constant, Johnston et al. (1994) suggested that the invariant property of motion parallax facilitates the interpretation of binocular disparities.

To test this idea, they presented observers with horizontal textured cylinders that were (1) stationary and specified by disparities, (2) oscillating about a vertical axis and specified by relative motion (viewed monocularly), or (3) oscillating about a vertical axis and specified by both relative motion and disparities. Half-cylinders defined by disparities alone appeared flattened by a factor of two when displayed at a 200-cm viewing distance. They attributed the flattening to an underestimation of viewing distance from vergence signals (see also Johnston 1991 and Section 20.6.2). In contrast, the same horizontal semicircular cylinders defined by relative motion (rocking to-and-fro about a vertical axis) were perceived veridically. When both cues were available, the shape of the cylinders was also perceived veridically.

These results are consistent with a veto model in which perceived shape is determined by the ratio of relative motion to angular size, which unambiguously specifies the shape (depth-to-size ratio) of the surface. Disparity is simply ignored. However, the simple veto model is inconsistent with the results of a second experiment by Johnston et al. In this experiment, motion parallax and disparity specified different depth-to-size ratios for the cylinder. If the shape of the cylinder were determined only by motion parallax, variations in disparity should have no effect. However, when disparity specified a flattened cylinder, the cylinder was judged as curved in depth only when the depth-to-height ratio specified by motion parallax was greater than one. In other words, there was a trading relation between the two cues under the cue-conflict conditions.

Johnston et al. argued that disparity and motion parallax do not combine linearly, as specified in the weak fusion model. Instead, the results are more compatible with modified weak fusion, involving an initial stage of cooperative interaction to determine the missing distance parameter, followed by weighted linear summation. The weights applied to different cues vary with viewing distance and the number of frames present in the motion sequence.

Johnston et al. used a horizontal cylinder rotating about a vertical axis. This created changing perspective of the sides and ends of the cylinder. The invariance of the ratio of relative motion to angular size over changes in the absolute distance of moving surfaces indicates a high degree of constancy for depth-to-width judgments for such a stimulus. The invariant relation between motion and angular size in motion-parallax displays can explain the constancy of depth-to-size judgments at different viewing distances reported by Johnston et al. However, this invariant relation does not explain the accuracy of the depth-to-size judgments. Veridical judgments require accurate registration of the change of slant of the surface from either changing perspective or the acceleration components of the flow field. The surface used by Johnston et al. subtended only 1.72°. The perspective changes created by rotation of such a small surface through ±15° are negligible. Johnston et al. concluded that the acceleration component of the flow field was responsible for the veridical shape judgments. The fact that shape judgments were not veridical when only two frames of motion parallax were presented is consistent with this conclusion. Rogers and Collett (1989) used motion-parallax displays that subtended at least 20° so that, for these stimuli, changes in perspective were detectable and probably contributed to the accuracy of depth judgments.

Hibbard and Bradshaw (2002) used computer-generated horizontal and vertical textured cylinders, as shown in Figure 30.7. At a distance of 30 cm the cylinder subtended 18.9° in length and 11.4° in width. The angular size decreased in proportion to simulated viewing distance. The depth of the cylinder was specified by disparity alone, by motion parallax alone, or by both cues. Motion parallax was created by two-frame motion through 1.5°, which ensured that there was no motion acceleration to provide unambiguous information about the shape of the cylinder. When the cylinder rotated about its own axis there was no changing perspective. When it rotated about an orthogonal axis the image contained a perspective transformation. In general, subjects were more accurate in setting the cylinder to appear circular in cross section when both disparity and motion parallax were present than when only one cue was present. Neither the orientation of the cylinder nor the orientation of the rotation axis had much effect. The depth of the cylinders was generally underestimated, and the underestimation increased as viewing distance increased from 30 to 150 cm. Hibbard and Bradshaw concluded that changing perspective is not a






critical factor in this situation and that vertical motion augments disparity in much the same way as does horizontal motion. However, changing perspective would not be well defined by two-frame motion through 1.5°.

Figure 30.7. Stimuli used by Hibbard and Bradshaw (2002). The four panels show a horizontal or a vertical cylinder rotating about a horizontal or a vertical axis.

30.2.8 TEMPORAL FACTORS

Uomori and Nishida (1994) reported temporal changes in the perceived depth of surfaces that contained conflicting disparity and motion-parallax information. They used a display in which motion parallax depicted an opaque vertical cylinder rotating continuously around its vertical axis, while binocular disparities depicted a horizontal cylinder. There were large individual differences among the four subjects, but typically the display appeared as a vertical cylinder at the start of the 60-s inspection period and subsequently as a combination of the two cylinders (a barrel shape) or, in some cases, as a horizontal cylinder.

On some trials, an adaptation stimulus depicting only the disparity-defined horizontal cylinder or only the vertical cylinder defined by motion parallax was presented for 10 s before the display with conflicting cues was presented. Prior adaptation to the disparity-defined horizontal cylinder made the display with conflicting cues more likely to be seen as a vertical cylinder. Prior adaptation to the parallax-defined vertical cylinder made the display with conflicting cues more likely to be seen as a horizontal cylinder. This result is consistent with the results obtained by Rogers and Graham (1984) and Nawrot and Blake (1989) that were described in Section 30.2.1a. Uomori and Nishida attributed the alternation of the percepts in the display with conflicting cues to adaptation of the shape-from-motion mechanism.



30.3 DISPARITY AND PERSPECTIVE

30.3.1 CUE INTERACTIONS ON PLANE SURFACES

Interactions between perspective and binocular disparity have been investigated by pitting one cue against the other. Van der Meer (1979) used a rectangle filled with vertical lines and with various degrees of horizontal taper. A horizontal size disparity caused the rectangle to slant about a vertical axis, either in the same direction as or the opposite direction to the slant specified by perspective. Perceived slant was some weighted combination of the two cues. Some subjects gave more weight to disparity, while others gave more weight to perspective.

A simple frame with disparity-defined slant appears less slanted when the upper and lower sides are parallel than when they have appropriate perspective, as shown in Figure 21.27A. The lack of perspective is not as evident when the upper and lower sides of the frame are omitted, as in Figure 21.27B (McKee 1983). Also, stereoacuity for two vertical lines was reduced when they were part of a square (Zalevski et al. 2007). Stereoacuity improved when the square had perspective consistent with disparity, although not to the level achieved with two lines.

The following factors influence the relative weighting of perspective and disparity.

1. The type of perspective. The relative weighting of perspective and binocular disparity depends on the type of perspective. Gillam (1968) found that the accuracy of slant judgments for a monocularly viewed surface slanted 18° about a vertical axis was high when the surface was covered with horizontal lines or a grid. However, slant was severely underestimated when the surface was covered with vertical lines or random dots. In other words, slant judgments based on linear perspective were more accurate than judgments based on foreshortening. Also, adding foreshortening to linear perspective did not improve accuracy. In a second condition, the surfaces were presented in a stereoscope in the frontal plane. The image in one eye was magnified horizontally to simulate slants of up to 30°. The slant of the surface covered with horizontal lines was severely underestimated for stereoscopic slants greater than 15°, but the slant of the surface with vertical lines was accurately perceived for all angles of slant up to 30°. Thus the strong monocular cue of linear perspective, when set at zero slant, reduced perceived slant produced by disparity, but the weak monocular cue of foreshortening set at zero had no effect on disparity-induced slant.

2. Figural coherence of the cues. Disparity and perspective interact only when the cues are seen as arising from the


same surface. Gillam and Cook (2001) constructed a random-dot stereogram depicting a trapezoid standing in front of a background (see Figure 30.8). The perceived slant of the trapezoid was greater when slant indicated by perspective corresponded to disparity-defined slant compared with when the cues were in conflict. However, when the slanted surface was beyond a trapezoidal aperture in the frontal surround, the taper of the aperture did not affect perceived slant. Thus, the perspective cue had to be seen as belonging to the slanted surface for it to have any effect. Muller et al. (2009b) confirmed that averaging of disparity and perspective cues does not occur between surfaces in different depth planes, even when they are superimposed.

3. The degree of slant. The weighting of perspective and disparity also depends on the degree of slant. Knill and Saunders (2003) found that both texture perspective and disparity provide more precise slant discriminations as slant is increased. However, disparity was a more reliable cue for slant discrimination at small angles of inclination relative to the frontal plane, while texture perspective was more reliable at inclinations greater than 50°. This difference was reflected in the relative weightings assigned to the two cues as a function of slant when the cues were placed in conflict. For the same stimulus, more weight was given to disparity in the task of placing an object on an inclined surface than in the task of making an explicit judgment of inclination (Knill 2005). It is not known what caused this difference.

4. Viewing distance. Theoretically, the reliability of texture perspective increases with increasing slant or inclination but does not change with distance. The reliability of disparity decreases with increasing distance and varies with the angle to the frontal plane in a way that depends on distance. Hillis et al. (2004) measured

JNDs for inclination and slant of a surface at angles up to 70° from the frontal plane and at distances between 19 and 172 cm. Tests were done for each cue and for the cues in various combinations. Overall, reliabilities obtained with the two cues conformed to an optimal weighting of the reliabilities of the separate cues (a minimal sketch of such reliability weighting is given at the end of this section). Sensitivity to a change in inclination of a textured patch based on disparity declined to zero when the patch had a disparity of 1° relative to a fixation point. Sensitivity to inclination based on perspective was not affected by a disparity of 1° (Greenwald and Knill 2009).

5. Degree of cue discordance. Cue combination fails when perspective and disparity are too different. Under these circumstances, the visual system behaves as if it assumes that there is some malfunction in the system, since severe cue conflict cannot arise in natural stimuli. In robust statistics, little weight is assigned to severe outliers. Knill (2007) asked subjects to set a stereoscopic line perpendicular to a textured circular disk inclined 35° from the vertical. The aspect ratio of the disk was set in conflict with binocular disparity. The weight assigned to perspective decreased with increasing cue conflict but did not reach zero. A Bayesian analysis based on a prior assumption that elliptical images arise from circular objects fitted their results. One problem here is that the perceived orientation of the line probe used to estimate inclination may not have been a linear function of its disparity-defined depth. Also, a disparity-defined probe may have biased subjects to use the disparity cue rather than the perspective cue in the inclined disk.

Van Ee et al. (2002) used a rectangle or trapezoid covered with vertical and horizontal lines with various magnitudes of horizontal-size disparity and linear perspective. When disparity and perspective were

Figure 30.8. Interaction between disparity, perspective, and relative depth. One of the fused images creates a slanted trapezoid in front of a background, and the other creates a slanted surface seen through a trapezoidal aperture. In only the former case was the perceived slant influenced by whether the taper of the trapezoid was or was not in agreement with disparity-defined slant. (From Gillam and Cook 2001)





similar, subjects saw intermediate slant. However, when the two cues were highly discordant, 30% of subjects alternated between seeing slant appropriate to disparity and slant appropriate to perspective. The perceived shape of the stimulus changed in accordance with its perceived slant. These subjects could alternate their percepts at will. Other subjects initially saw slant appropriate to perspective before seeing slant appropriate to disparity. These effects are illustrated in Figure 30.9. The dynamics of changes in perceived slant were qualitatively similar to the dynamics of perspective reversals of the Necker cube and to the dynamics of binocular rivalry of orthogonal gratings (van Ee 2005). However, the alternation rate for perceived slant was slower, and subjects had greater conscious control over the rate of alternation than with the other bistable percepts (van Ee et al. 2005). Alternation of perceived slant was not related to saccades or vergence eye movements (van Dam and van Ee 2005).


Van Ee et al. (2003) and Girshick and Banks (2009) developed Bayesian models of bistability arising from cue conflict (the sketch at the end of this section also illustrates how a robust combination rule can yield a bimodal, potentially bistable, solution).

6. Context. The weight assigned to a cue in a particular stimulus could perhaps be influenced by the context in which the stimulus is presented. Muller et al. (2009a) found that when subjects judged the inclination of an ellipse on a computer monitor, the weight assigned to perspective compared to disparity was not influenced by the presence of a surrounding array of ellipses or circles. In this case subjects had no reason to suppose that the stimuli in the array were connected. One can imagine situations in which context would have an effect; for example, surrounding information might indicate that an array of ellipses is part of a flat surface.

7. Cue variability. There is evidence that the relative strength of a cue to depth is influenced by whether that cue is fixed or variable. Harwerth et al. (1998) investigated interactions between disparity, contrast, and spatial frequency in the discrimination of depth between two Gabor patches. In a given set of trials, one cue varied while the others were held constant at some consonant or discordant value. Depth perception was dominated by the cue that was varied. Thus, in trials in which only contrast varied, contrast was the dominant cue, with high-contrast stimuli appearing nearer. However, this dominance decreased with increasing disparity until, with disparities larger than 6 arcmin, depth depended only on disparity. Similarly, when only spatial frequency varied, it was the dominant cue, with high spatial frequencies appearing more distant. At disparities of more than about 4 arcmin, depth became dependent on disparity alone. The important point to



Figure 30.9. Interactions between disparity and perspective. (a) Disparity and perspective for slant away to the right. (b) Same perspective as in (a) but stronger disparity. (c) No perspective; same disparity as in (b). (d) No disparity; perspective as in (a) and (b). (e) Disparity for slant to the left and perspective for slant to the right. The perspective and the disparity for crossed fusion are indicated under each figure. Some viewers may experience an alternating slant in (e), where there is a strong conflict between disparity and perspective.

emerge from this study is that the relative strengths of cues depend on which cue varies from trial to trial.

8. Agreement with motor-kinesthetic information. One would expect that the weights assigned to any two cues would depend on the reliability of judgments based on each cue. One indicator of the reliability of cues to surface slant is the extent to which perceptual judgments about the slant of a surface conform to


feelings generated by moving the hand over the surface. Ernst et al. (2000) investigated this question. Subjects viewed a computer-generated image of a textured surface inclined about a horizontal axis. Inclination to the frontal plane specified by perspective differed from that specified by disparity. Initially, the relative weights assigned to the two cues were measured by having subjects set the surface to the apparent frontal plane. Then subjects looked at the reflected image of the surface inclined at a fixed angle while they moved a block to-and-fro over a real inclined surface that they could feel but not see. In one condition, the inclination of the tactile surface corresponded to that of the visual surface specified by perspective. In a second condition, it corresponded to the inclination specified by disparity. After 30 minutes of training, the relative weights of the two cues were assessed again. The weight assigned to perspective increased when perspective had been the matching tactile cue, and the weight assigned to disparity increased when disparity had been the matching cue.

The contribution of perspective to anisotropy in the perception of slanted and inclined surfaces was discussed in Section 20.3. The effects of perspective on depth contrast were discussed in Section 21.4.2e.
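Two quantitative ideas running through this list can be made concrete. The following Python sketch is purely illustrative: it is not the model of Hillis et al. (2004), Knill (2007), van Ee et al. (2003), or Girshick and Banks (2009), and the sigma values, lapse rate, and slant values are hypothetical. The first function implements the variance-weighted (maximum-likelihood) combination against which Hillis et al. compared their data (item 4). The second shows how likelihoods with a small outlier component, which down-weight severely discordant cues (item 5), produce a posterior with a single mode under small cue conflict but two modes under large conflict.

    import numpy as np

    def mle_combine(estimates, sigmas):
        # Variance-weighted cue combination: weight w_i is proportional
        # to the reliability r_i = 1 / sigma_i**2 of cue i.
        r = 1.0 / np.asarray(sigmas, dtype=float) ** 2
        w = r / r.sum()
        combined = (w * np.asarray(estimates, dtype=float)).sum()
        combined_sigma = np.sqrt(1.0 / r.sum())  # never worse than the best cue
        return combined, w, combined_sigma

    # Hypothetical example: disparity signals 40 deg of slant (sigma 3 deg),
    # texture perspective signals 50 deg (sigma 6 deg).
    print(mle_combine([40.0, 50.0], [3.0, 6.0]))

    def robust_posterior(s, cue_values, sigma=4.0, lapse=0.05, support=90.0):
        # Each cue's likelihood is a Gaussian with a small uniform
        # "outlier" component, so a grossly discrepant cue is
        # down-weighted rather than averaged in.  A flat prior over the
        # slant variable s is assumed.
        post = np.ones_like(s)
        for c in cue_values:
            gauss = np.exp(-0.5 * ((s - c) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
            post *= (1 - lapse) * gauss + lapse / support
        return post / np.trapz(post, s)

    s = np.linspace(-45.0, 45.0, 901)
    for conflict in (5.0, 30.0):
        post = robust_posterior(s, [-conflict / 2, conflict / 2])
        modes = s[(post > np.roll(post, 1)) & (post > np.roll(post, -1))]
        print(f"conflict {conflict:4.1f} deg: posterior modes near {np.round(modes, 1)}")

With the hypothetical values shown, the combined slant estimate is 42° with weights of 0.8 and 0.2, and the posterior moves from a single peak near 0° under small conflict to a pair of peaks near ±15° under large conflict, one candidate account of the bistability described in item 5.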

30.3.2 CUE INTERACTIONS ON CURVED SURFACES

Schriever (1924) presented stereoscopic photographs of 3-D objects, such as a cone and a tunnel, with the sign of disparity reversed. He found that depth was judged in terms of perspective rather than in terms of the reversed disparity.

Johnston et al. (1993) used stereograms of textured half-cylinders, triangular ridges, and spheres. Subjects judged when the shapes appeared as deep as they were wide. At a distance of 200 cm, depth produced by disparity was underestimated by 50% when texture indicated a frontal surface. The underestimation was less when the texture gradient was in accord with disparity. At a distance of 50 cm, depth produced by disparity was only slightly underestimated when the texture gradient was frontal, and was slightly overestimated when the texture gradient was correct. Thus, a given disparity is more effective at near distances than at far distances. Adding the correct texture gradient to binocular disparity had a stronger effect for vertical cylinders than for horizontal cylinders. One possible explanation of this effect is that relative compression disparity produced on surfaces slanted about a vertical axis is a weaker cue to depth than shear disparity produced on surfaces inclined about a horizontal axis. Rogers and Graham (1983) and Gillam et al. (1988) reported a similar anisotropy (see Section 20.3).
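The distance dependence noted above follows from the standard small-angle relation between disparity and depth; the symbols (interocular separation a, depth interval Δd, viewing distance d) are introduced here for illustration:

\[
\delta \approx \frac{a\,\Delta d}{d^{2}}
\quad\Longrightarrow\quad
\Delta d \approx \frac{\delta\, d^{2}}{a}.
\]

The depth implied by a fixed disparity grows with the square of the registered viewing distance, so any under-registration of distance (for example, from vergence signals at 200 cm) flattens disparity-defined depth, which is consistent with the vergence explanation offered by Johnston et al.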

Some of Johnston et al.'s subjects reported that, like the texture elements in Figure 15.10, texture elements did not always appear to lie in the plane of the stereo surface. This suggests that subjects did not register local disparity in the texture elements.

Buckley and Frisby (1993) used stereograms depicting vertical and horizontal cylindrical ridges in which perspective and disparity specified a ridge with a depth of between 3 and 9 cm. Perspective was specified mainly by the gradient of the aspect ratio of elliptical texture elements. Perspective and disparity were consistent in some conditions and inconsistent in other conditions (see Figure 30.10). Subjects matched the perceived depth of the ridge to one of a set of numbered 2-D ridge profiles. The depth of ridges was judged with reasonable accuracy when disparity and perspective were consistent, as shown by the points linked by a dotted line in Figure 30.11. With only the monocular cue present, depth was underestimated by about one-third compared with when both cues were present and consistent.

The vertical ridges showed a large interaction between the two cues. As depth specified by perspective was reduced, variations in disparity had a smaller effect, and disparity had no effect when perspective depth was only 3 cm. Thus, disparity in a vertical cylinder did not supplement aspect-ratio perspective unless perspective specified a large depth. For horizontal cylinders, disparity supplemented perspective for all values of both cues. Similar results were obtained with triangular ridges with linear perspective rather than aspect-ratio perspective.

The anisotropy between horizontal and vertical ridges was absent when real 3-D cylinders were used. The stereograms and the real cylinders differed in that accommodation cues to distance were present only in the real cylinders. Also, if subjects moved their heads, there would have been motion parallax in the real cylinders but not in the stereograms. It seems that, with the stereogram of the vertical cylinder, disparity was not sufficiently strong to add to the depth created by perspective in the absence of the accommodation cue. This issue was discussed by Frisby et al. (1995).

The relative effectiveness of disparity and texture depends on the type of judgment. Tittle et al. (1997a) found that binocular disparity contributed more to judgments about the shape index of a surface region (whether it was parabolic, hyperbolic, or saddle shaped) than to judgments of curvature magnitude. On the other hand, texture and shading contributed more to judgments of curvature magnitude than to judgments of shape index. A related factor is whether a cue that indicates curvature dominates a cue that indicates a smooth surface. Stevens et al. (1991) found that a bump in a surface indicated only by texture perspective was seen when disparity indicated a smooth surface, and that a bump indicated by disparity was seen when texture indicated a smooth surface.

In parallel projection, or at large viewing distances, the sign of depth produced by a texture gradient is ambiguous with monocular viewing. Thus, a monocularly viewed





Figure 30.10. Stereograms used to study cue interactions (panels A to F). Convergent fusion of each stereogram produces a vertical or horizontal parabolic ridge. Disparity is the same in all stereograms. In (B) and (E) disparity and perspective are consistent. In (A) and (D) perspective indicates less depth than disparity. In (C) and (F) perspective indicates more depth than disparity. Perspective had a greater effect on perceived depth for vertical than for horizontal ridges. For some observers the difference wore off with practice. (Adapted from Buckley and Frisby 1993)

vertical textured cylinder can be seen as convex or concave. However, people have a bias to see the convex interpretation. Adams and Mamassian (2004) found that when binocular disparity indicated that a textured cylinder was concave, the cylinder was perceived as concave. Thus, for a cylinder, disparity neutralized the tendency to see the cylinder as convex. However, disparity does not always overcome the tendency to see concave shapes as convex. For example, a concave face mask appears convex even when viewed with both eyes (Section 30.9).

30.3.3 DISPARITY/PERSPECTIVE INTERACTIONS: DYNAMICS

The relative effectiveness of disparity and perspective could be affected by dynamic factors. If two cues are processed with different latencies, the visual system must accommodate these differences. Subjects detected a change in the inclination of an annulus more accurately when the change was indicated by both perspective and binocular disparity than when only one cue changed (Van Mierlo et al. 2007). The two-cue advantage was still evident when the changes of the two cues occurred with an asynchrony of up to 100 ms.

If temporal changes in one cue are detected at higher temporal frequencies than changes in another cue, the high-frequency cue should become more dominant for surfaces that oscillate in depth. Pong et al. (1990) constructed a square of random dots in which linear motion-in-depth defined by disparity was



consistent with or opposed to that defined by changing size. When the two cues were in conflict, perceived motion-in-depth was determined by changing disparity at low speeds of stimulus change and by changing size at higher stimulus speeds. In other words, the changing-size cue became relatively more effective at higher velocities. We shall now see that other evidence supports this conclusion.

Allison and Howard (2000a) investigated the dynamics of interactions between disparity and perspective in the perception of inclination and slant. A circular random-texture display or a square display with regular horizontal and vertical grid lines was presented in a stereoscope. Both displays subtended 32°. The following cue conditions were used:

1. Both cues indicated 20° of inclination or slant in the same direction.

2. Disparity indicated 20° of inclination or slant, while perspective indicated frontal surfaces.

3. Perspective indicated 20° of inclination or slant, while disparity indicated frontal planes.

4. The two cues indicated 20° of inclination or slant in opposite directions.

Each display was presented at a constant slant of 20° for 30 s or was oscillated between +20° and −20° of inclination or of slant at 0.45 Hz for five cycles. After the stimulus was removed, subjects set a real textured surface with all depth


Figure 30.11. Interaction between perspective and disparity. Judged depth amplitude (cm) of (a) vertical and (b) horizontal ridges as a function of the depth specified by disparity, for four depths specified by perspective (3, 5, 7, and 9 cm). The dashed line joins points where disparity and perspective were consistent. Error bar is mean standard error (N = 6). (Adapted from Buckley and Frisby 1993)

cues present to equal the perceived inclination or slant of each static test display or the perceived peak inclination or slant of each dynamic display.

With concordant depth cues, the mean perceived inclination or slant of both static displays was about 16°. When disparity indicated 20° of inclination or slant and perspective indicated a frontal surface, the mean perceived inclination or slant of the randomly textured display was about 10° and that of the grid display was only about 3°. The grid pattern contained strong linear perspective, foreshortening, and aspect-ratio cues, while the random-texture display contained the weaker cue of texture gradient. This result is in accord with those reported by Gillam (1968). Subject EK saw small slants and inclinations in the opposite direction to that specified by the disparity. When perspective indicated 20° of slant and disparity indicated a frontal surface, the mean perceived inclination of the textured pattern was only about 2°, except that the estimate of subject EK was 11°. This subject was clearly biased to use perspective rather than disparity. Perceived slant varied between zero and 12.8°, with the odd subject giving the largest estimate. Inclination and slant estimates were higher for the grid pattern than for the random-texture pattern.

With conflicting cues, all ten subjects reported that the static random-textured display at first appeared to incline or slant in the direction of the perspective cue. But, over the 30-s viewing period, the surface gradually changed to appear inclined or slanted in the direction of the disparity cue. Subject EK continued to be dominated by perspective. All subjects continued to see the grid pattern as inclined or slanted in the direction of the perspective cue.

The perceived inclination or slant of each static display was measured for stimulus durations of 0.1, 1, 10, and 30 s. Figure 30.12 shows that, with disparity alone, or with conflicting cues, perceived inclination and slant increased with increasing stimulus duration. With perspective alone, perceived inclination and slant declined over time.

When the displays oscillated between −20° and +20° at 0.45 Hz, all subjects saw almost veridical peak inclination and slant with concordant cues or with perspective alone. With conflicting cues, perceived inclination and perceived slant were dominated by the perspective cue. With disparity alone, perceived inclination and slant were considerably less than in the static condition. These results demonstrate that perspective becomes dominant for displays that change continuously in depth. The results indicate that changing perspective is a particularly strong cue to motion-in-depth. All animals with good vision detect changes in perspective, while only animals with stereoscopic vision detect changes in disparity. This issue is discussed further in Section 31.3.2.


Figure 30.12. Stimulus duration and perceived inclination and slant. Perceived slant (deg) is plotted against stimulus duration (s) for concordant cues, perspective only, disparity only, and conflicting cues. Randomly textured test surfaces were inclined or slanted 20°, as specified by disparity alone, by perspective alone, by both cues in agreement, or by the two cues in conflict. After inspecting the test surface for each duration, subjects set a real surface to the same perceived angle (N = 5). (Adapted from Allison and Howard 2000a)





There is another reason why changing perspective is dominant over changing disparity. When disparity alone indicates that a stationary textured surface is inclined, the surface acquires an illusory gradient of perspective. The texture elements of the apparently near part of the surface appear smaller than those of the apparently far part. This is an expression of size-distance invariance. A person, such as subject EK, who was strongly influenced by perspective, interpreted the illusory perspective as a real perspective. He therefore perceived the surface to be inclined in a direction opposite to that specified by disparity. Other subjects perceived the illusory texture gradient as an inhomogeneity of surface texture. These individual differences indicate that people vary in the weight they assign to perspective and disparity.

Now consider a surface oscillating in slant because of changing disparity when perspective indicates that the surface is not moving. If the surface were to appear to oscillate in depth, the illusory change in texture gradient would be perceived as a change of surface rigidity. Surface rigidity would be maintained only if the changing disparity was ignored and the perspective indicated oscillation in depth. People may interpret a static illusory gradient of perspective as a surface inhomogeneity, but all subjects perceived dynamically changing perspective as an oscillation in depth. If it were not seen as oscillation in depth it would have to be seen as a deformation of shape, and this would violate the rigidity assumption.

Allison and Howard (2000b) produced evidence in favor of the rigidity-assumption explanation for the dominance of changing perspective. They used computer-generated random-dot surfaces that appeared to change slowly in inclination or slant because of changing disparity. Subjects perceived a larger change of inclination or slant when the dots were renewed on every frame (33.5 Hz) compared with when they persisted over frames. Cumming and Parker (1994) reported a similar effect at threshold levels of inclination. Also, when disparity and perspective indicated opposite signs of changing inclination or slant, subjects gave greater weight to disparity with dynamic random-dot displays than with persistent dots. Allison and Howard explained these results by assuming that the cue of changing perspective is weakened in a dynamic random-dot display because the motion signals that indicate changing perspective are absent in such a display.

Van Mierlo et al. (2009) produced other evidence that changing perspective is a particularly strong cue to inclination. Subjects were asked to rapidly place a cylinder on a flat surface. Just after subjects started to move, the inclination of the surface changed. Subjects responded more quickly to the change when it was indicated by a change in perspective than when it was indicated by a change in disparity. When the surface was blanked out while the inclination changed, a change in inclination defined by disparity was detected more rapidly than a change defined by perspective (see also Greenwald et al. 2005).



Summary

It can be stated that the relative weights assigned to perspective and disparity are affected by the following factors:

1. Linear perspective is given more weight than foreshortening.

2. Perspective is given more weight at far distances and disparity at near distances.

3. Highly discordant cues produce alternating impressions.

4. A cue that is concordant with motor-kinesthetic information is given more weight.

5. Texture perspective is given more weight when added to vertical disparity-modulated cylinders than when added to horizontal cylinders.

6. The cue of accommodation affects the relative weighting of perspective and disparity.

7. Texture and disparity contribute in different ways to judgments of shape index and judgments of curvature magnitude.

8. The relative strengths of cues to depth depend on which cue is varied over a set of trials.

9. Texture indicating curvature dominates disparity indicating a flat surface. Disparity that indicates curvature dominates texture that indicates flatness.

10. Perspective becomes more dominant and disparity less dominant when the slant or inclination of a surface is modulated continuously over time.

11. A change in inclination indicated by a sudden change in disparity is registered more rapidly than a change indicated by a sudden change in perspective.

30.4 DISPARITY AND INTERPOSITION

30.4.1 AVERAGING DISPARITY AND INTERPOSITION

There can be no quantitative averaging between disparity and interposition because interposition is only an ordinal cue. However, conflicting interposition information could affect disparity-defined depth in three ways. It could raise the disparity-detection threshold, reduce the magnitude of disparity-defined depth, or override (veto) disparity. What is the evidence on these points? Hou et al. (2006) used stimuli like those in Figure 30.13. The disparity threshold for discriminating a difference in depth between the two vertical bars was higher when the bars were stereoscopically beyond the occluding horizontal bar, as in Figure 30.13A, than when they were nearer, as in


Figure 30.13. Interaction between disparity and overlap (panels A, B, and C). With crossed fusion the vertical bars appear beyond the horizontal bar in (A) and nearer than the horizontal bar in (B). The disparity threshold for discriminating a difference in depth between the upper and lower vertical bars was higher for condition (A) than for condition (B). The bars were separated by gaps in (B) but not in (A). Condition (A) was not run with gaps, as in (C). (Redrawn from Hou et al. 2006)

Figure 30.14. Disparity and figure-ground organization. (A) Crossed fusion creates a white face in front of a black surround. Disparity is consistent with figure-ground organization. (B) Crossed fusion creates a nonface white region in front of a black surround. Disparity is not consistent with figure-ground organization. (Adapted from Burge et al. 2005)

(B). They concluded that in (A) the two vertical bars appear as a single partially occluded bar, which predisposes subjects to see it lying in one depth plane, even when there is a disparity difference between the two parts of the bar. However, the results may have been affected by the fact that the vertical and horizontal bars were separated in condition (B) but not in condition (A). A control condition like that in (C) is required.

A region that defines a familiar figure typically appears nearer than a region that does not define a familiar figure. Thus, with monocular viewing, the white region in Figure 30.14A appears as a face in front of a black background. Burge et al. (2005) introduced a conflict between this figural effect and binocular disparity. With crossed fusion of the images in (A), the two factors are consistent and a white face appears in front of a dark background. In (B) the factors are inconsistent, and the fused image appears as a white region lacking a face profile in front of a dark background. Burge et al. presented pairs of consistent and

inconsistent stereograms with variable disparity for 1 s and asked subjects to report which member of the pair had greater depth. On average, subjects saw greater depth with consistent cues than with inconsistent cues. Burge et al. concluded that the ordinal figural cue interacted quantitatively with disparity. The effect must be small, because it is not apparent in the simultaneous comparison of the two stereograms in Figure 30.14. Also, measurements of the perceived difference in depth in the two stereograms failed to reveal a difference (Gillam et al. 2009).

A face seen in profile is more convex than its inverse, and this, rather than familiarity, may have been responsible for the above effect. Burge et al. (2010) obtained similar results using simple concave and convex half-circle contours. They showed that, in natural scenes, convex contours are more likely to be nearer than concave contours.

In these experiments, subjects may have used one cue on some forced-choice trials and the other cue on other trials, especially at the point of maximum uncertainty. The mean of a set of trials would then look like a quantitative interaction between cues (a toy simulation of this artifact follows the next paragraph). Burge et al. (2010) obtained the same result using the method of adjustment, which they claimed is immune to this artifact.

Van Ee et al. (2001) asked subjects to detect the depth between two intersecting black rods. The vertical disparity between the occlusion junctions did not improve





performance, even when it was well above threshold and the horizontal disparity was below threshold. They found no effect even when the rods possessed surface features. The vertical disparity could not override the impression of no relative depth created by the near-zero horizontal disparities. Thus, vertical disparity between occlusion zones did not contribute to stereoacuity.
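The forced-choice artifact raised above can be illustrated with a toy simulation. The code below is not the procedure of Burge et al.; the bias, noise, and switching values are hypothetical. An observer who never averages cues, but simply follows the figural cue on a random 30% of trials and disparity on the rest, still produces a smooth psychometric function whose mean is shifted by the figural cue, mimicking a quantitative interaction.

    import numpy as np

    rng = np.random.default_rng(1)
    disparities = np.linspace(-10, 10, 9)       # relative disparity (arcmin)
    figural_bias = 3.0                          # depth favored by the figural cue
    p_figural = 0.3                             # proportion of cue-switch trials
    n_trials = 2000

    for d in disparities:
        use_figural = rng.random(n_trials) < p_figural
        noise = rng.normal(0.0, 2.0, n_trials)  # internal noise (arcmin)
        # On each trial the response follows exactly one cue, never a blend.
        signal = np.where(use_figural, figural_bias, d)
        prop_near = np.mean(signal + noise > 0)
        print(f"disparity {d:6.1f} -> proportion 'nearer' = {prop_near:.2f}")

Averaged over trials, the point of subjective equality shifts toward the figural cue even though no trial involves cue averaging, which is why the method of adjustment was claimed to be the safer measure.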

30.4.2 CUE DOMINANCE AND CUE DISSOCIATION

An example of a conflict between disparity and monocular interposition is provided by the stereogram in Figure 30.15A. An uncrossed disparity in the vertical stick causes it to appear beyond the horizontal stick. In this case, stereoscopic depth and figural continuity of the vertical stick are in agreement. With a crossed disparity, the interposition cue sometimes dominates and brings the horizontal bar forward. At other times, disparity dominates and causes the vertical rod to break in two and come forward (Schriever 1924). This is an example of cue dominance, or veto.

In Figure 30.15B, the overlap cue is removed. Disparity at the ends of the horizontal wings causes them to appear to bend forward with a crossed disparity and to bend backward with an uncrossed disparity (Zanforlin 1982). In Figure 30.15C, one horizontal wing is disconnected from the vertical lines. This removes the restraining influence of figural continuity and allows that wing to appear in depth relative to the vertical wings. The other wing remains attached and is therefore not seen in depth, although it may appear to bend in depth. In Figure 30.15D both wings are disconnected from the verticals, and both therefore appear in depth relative to the vertical wings. This is an example of cue dissociation. The general principle is that disparities of separated objects are processed separately, but depth impressions change when the same objects with the same disparities are seen as parts of one object.

Dynamic occlusion may override conflicting information provided by binocular disparity. Braunstein et al. (1986) presented the image of a rotating sphere covered with large texture elements. The elements exhibited accretion on one edge of the sphere and deletion on the opposite edge, thus providing information about the sphere's 3-D structure and direction of rotation. Thus accretion-deletion information was sufficient to veto conflicting disparity information.

30.4.3 REINTERPRETATION OF INTERPOSITION

The presence of an occluding object can affect the way disparities are interpreted (Häkkinen and Nyman 1997). The horizontal-size disparities in the stereograms in Figure 30.16A create an impression of two sets of rectangles slanted in opposite directions. The fused images of all the rectangles are treated as distinct surfaces. With crossed



Figure 30.15. Effects of figural continuity on stereopsis (panels A to D). (A) With crossed fusion, the horizontal stick in the upper stereogram appears nearer than the vertical stick; disparity and overlap agree. In the lower stereogram, overlap indicates that the horizontal stick is nearer, while disparity indicates that it is more distant. Overlap overrides disparity, or else the vertical stick breaks into two pieces that come forward. (Adapted from Schriever 1924) (B) With crossed fusion, the horizontal wings appear curved toward the viewer in the upper stereogram and away from the viewer in the lower stereogram. (Adapted from Zanforlin 1982) (C) The left wing is disconnected. Its perceived depth is governed by disparity. The right wing remains attached, and disparity is disregarded or interpreted as bending. (D) Both horizontal wings are disconnected from the vertical wings. Depth of both is interpreted according to disparity.

fusion, the stereogram in Figure 30.16B creates the impression of a single rectangle seen beyond a central occluder. Inspection of the images reveals that the size disparities of the separate rectangles in (A) have been converted into a simple horizontal disparity between the images of a long occluded rectangle. Grove et al. (2003) designed Figure 30.16C, which shows that when T-junctions do not create the impression of a central occluder, disparities in the flanking rectangles cause them to appear slanted. See Gillam and Grove (2004) for more evidence of this type of interaction between occlusion and disparity.


Figure 30.16. Occlusion and the interpretation of disparity (panels A, B, and C). (A) Horizontal size disparities produce two slanted rectangles. (B) The flanking rectangles appear as a single frontal rectangle beyond an occluder. (C) The flanking rectangles do not form a single frontal rectangle when the T-junctions indicate they are not occluded. The stereograms are designed for crossed fusion. (Adapted from Häkkinen and Nyman 1997; and Grove et al. 2003)

In the stereograms in Figure 30.15B there is disparity at the ends of the horizontal bar but not along its length. Only the ends of the horizontal bar are seen displaced in depth, because the interposition cue prevents interpolation of depth into the center. This causes the horizontal bar to appear bent. The situation is different when both bars are set at an oblique angle, as in Figure 30.17 (Howard 1993b). There is now disparity information along the whole length of the bars.

In Figure 30.17A, interposition and disparity are compatible when the shaded bar has a crossed disparity (stereoscopically nearer) but are incompatible when the black bar has a crossed disparity. In the incompatible pair the black bar does not appear to bend forward, as it did in Figure 30.15B, but appears as two pieces in front of the shaded bar, with a gap where the two bars intersect. We refer to this as figural segmentation. This is an example of cue reinterpretation, because the interposition cue is reinterpreted as a broken bar. This interpretation occurs because it is the only way to resolve the conflict between disparity and interposition. Figural continuity of the black bar is sacrificed so that the disparity depth cue can be preserved without contradicting the interposition cue.

In Figure 30.17B, interposition and disparity are compatible when the black bar is stereoscopically nearer, but incompatible when the shaded bar is stereoscopically nearer. In the incompatible pair of images the shaded bar is interpreted as continuing across the black bar as a transparent object with subjective edges. Figural continuity is preserved for both bars, and the visual system resolves the conflict between disparity and interposition by inferring the presence of a continuous transparent object. This is another example of cue reinterpretation.

Figure 30.17. Conflicts between overlap and disparity. (A) When the shaded bar is near, overlap and disparity are compatible. When the dark bar is near they are incompatible and the black bar breaks into two pieces. (B) When the dark bar is near, overlap and disparity are compatible. When the shaded bar is near, the cues are incompatible and the shaded bar appears transparent.

There are thus two ways in which an incompatibility between disparity and interposition can be resolved by reinterpreting the occlusion cue:

1. Figural segmentation, in which occlusion is reinterpreted as figural discontinuity.

2. Transparency, in which occlusion is reinterpreted as one surface seen through another.

In these cases, there is no cue averaging or cue trading between disparity and interposition, as there is, for instance, between disparity and monocular parallax (Section 30.2.5). Anderson and Nakayama (1994) reviewed the role of monocular occlusion in biasing the sign of perceived depth in stereograms in which disparity is ambiguous. Malik et al. (1999) developed a model of how subjects use monocular occlusion as a supplementary cue to depth.

30.5 DISPARITY AND TRANSPARENCY

30.5.1 DISPARITY-TRANSPARENCY INTERACTIONS

There are two basic types of transparency. The first is specular transparency, in which an object is reflected in a shiny surface. In this case, the luminance of the overlapping region is the sum of the luminances of the separate images. The second type is film transparency, which occurs when something is seen through a tinted transparent surface. In this case, the luminance of the overlapping region is the product of the luminances of the separate regions (see Langley et al. 1998, 1999).
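These two luminance rules can be written compactly. The notation below is introduced here for illustration, and the film case is given in its simplest form, ignoring reflections from the filter itself:

\[
L_{\text{specular}} = L_{1} + L_{2},
\qquad
L_{\text{film}} = t\,L_{\text{b}}, \quad 0 < t < 1,
\]

where L1 and L2 are the luminances of the two superimposed images, Lb is the luminance of the region seen through the tinted surface, and t is its transmittance. Expressed in normalized units, the luminance of the overlap region in the film case is thus the product of the background luminance and the transmitted luminance of the filter region, as stated above.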

In the absence of depth cues, the perception of transparency between overlapping areas occurs when the luminance relations prompt the perception of a continuous surface across X-junctions, as in Figure 30.18A (Metelli 1974; Beck and Ivry 1988). Occlusion of the contour intersections tends to weaken the percept of transparency, as in Figure 30.18B. However, the percept is restored when disparity signifies that the surfaces are not in the same depth plane, as in the fused images in Figure 30.18D (Kersten 1991).

With a display like that in Figure 30.19A the cue of surface transparency is ambiguous, so that, as far as this cue is concerned, one is free to interpret either surface as being in front. In the fused image, the sign of the disparity determines the perceived depth order of the two surfaces. In Figure 30.19B the gray square appears to occlude the black square, and there is thus a conflict of cues when the black square has a crossed disparity relative to the gray square. Trueswell and Hayhoe (1993) claimed that more disparity is required to evoke relative depth when transparency and disparity are in conflict than when they are in agreement. This would amount to a trading relationship between

these two cues. However, these results can be interpreted in another way. Howard (1993b) noticed that after some time the black square in the cue-conflict configuration in Figure 30.19B appears as a piece of dark transparent glass through which the gray square can be seen. Before the transparency percept emerges, the black and gray squares appear coplanar but, after it emerges, the depth between the black and gray squares appears the same as that between black and gray squares in the nonconflict configuration. In other words, disparity information is suppressed when the overlap cue is dominant, but there is no loss of stereoscopic efficiency once the transparency of the squares has been reinterpreted to be consistent with disparity. There is no evidence of a trading relationship between overlap and transparency.

Trueswell and Hayhoe used the probability of seeing depth as their measure of stereoacuity. Reports of depth when the black square appeared transparent, and of no depth when it appeared opaque, would, when averaged, generate a spurious measure of stereoacuity. Probability of seeing is an invalid measure of acuity under conditions of unstable criteria. One could say that the experimenters did the averaging, not the subjects.

Figure 30.19C provides another illustration of how disparity can force one to reinterpret transparency. Depth is seen readily in the fused image in which the black square has a crossed disparity, since this is compatible with the occlusion of the black square by the gray square. At first,

Figure 30.18. Transparency, stereo depth, and T-junctions. The impression of transparency evident in (A) is removed by occlusion of T-junctions, as in (B). Stereo depth induces an impression of transparency with T-junctions, as in (C), or without T-junctions, as in (D). (Adapted from Kersten 1991)





Figure 30.19. Effects of transparency on stereopsis. (A) Transparency is ambiguous and compatible with both crossed and uncrossed disparity produced by fusing the images. (B) Depth is seen more readily when disparity is consistent with the gray square nearer than the black. As soon as the black square looks like frosted glass, depth in the two cases appears equal. (C) Depth is seen more readily for a gray square in front of a black square. As soon as the black square appears transparent, depth in the two cases appears equal.

depth is not seen on the side in which the black square has an uncrossed disparity. But after a while, the black square appears complete and pops out in front of the gray square. When this happens, the physically missing corner of the black square is subjectively filled in as part of a transparent black square (Howard 1993b). Watanabe and Cavanagh (1993) reported the same effect.

The interplay between disparity and interposition is influenced by the degree of temporal synchrony between the conflicting cues (Kojima and Blake 1998). In Figure 30.20 the addition of zero-disparity white connecting lines biases the perceived depth of the flanking brackets in the direction of appearing in front of the zero-disparity vertical bar. However, this biasing effect is reduced when the white lines and flanking brackets are flashed out of phase at 4 Hz. See Section 4.3.4 for a discussion of synchrony.

Figure 30.20. Temporal synchrony and cue interaction. The upper stereogram creates horizontal rectangles, one in front of and one beyond the vertical bar. Addition of zero-disparity white connecting lines biases the percept in the direction of the horizontal rectangle being in front, but not when the white lines are flashed at 4 Hz out of phase with the flanking brackets. (Redrawn from Kojima and Blake 1998)

30.5.2 EFFECTS OF COLOR, BRIGHTNESS, AND TEXTURE

The stereograms in Figure 30.21 create a red transparent square nearer than the black frame and an opaque red square behind the frame, depending on whether the red square has crossed or uncrossed disparity. Notice how the red color appears to spread over the dark bars when the red square appears near. This effect is known as neon spreading.

Figure 30.21. Transparency/opacity from perceived depth. One stereogram creates a near transparent red rectangle. The other creates a far opaque red rectangle. When the red rectangle appears nearer, its color appears to spread across the black bars. (Redrawn from Nakayama et al. 1992)

The effect is not evident when the red square is seen beyond the black bars (see also Liinasuo et al. 2000). Neon spreading was first described in 1971 by Dario Varin of the University of Milan and rediscovered by Tuijl (1975). The region into which color spreads produces an afterimage in the complementary color (Shimojo et al. 2001). Whereas ordinary afterimages are thought to be retinal, this afterimage must be cortical. The literature


on neon spreading has been reviewed by Bressan et al. (1997).

A surface seen in front of or through a larger surface, half black and half white, can appear evenly bright when its two halves differ in luminance. The difference in luminance is ascribed to the difference between the two halves of the larger surface, as in Figure 30.22A. A bold line or an apparent fold in the smaller surface can destroy the impression that the surface is evenly bright and can weaken the impression of transparency, even when disparity cues are present, as in Figures 30.22B and C.

Perception of transparency is influenced by interaction between disparity and texture continuity. In the stereogram in Figure 30.23A the gray sectors on the white circles create a transparent square when the sectors have crossed disparity. The square appears in front of the striped background, which can be seen through the square. When disparity places the square beyond the white circles, the subjective contours along the sides of the square disappear, and so does the impression of transparency. Instead, one has the impression of four circular windows, each containing one corner of a more distant gray square. When the gray sectors are replaced by the lines of the background, as in Figure 30.23B, the square with crossed disparity is interpreted as opaque. This forces the interpretation that the lines in the square are on its surface and come forward with it (Nakayama et al. 1990). With uncrossed disparity the corners of a square with lines are seen through four portholes. Even a single dot seen in depth can trigger a sensation of transparency, as in Figure 30.24.

30.6 DISPARITY AND SHADING

The perception of shape from shading was discussed in Section 27.3.2. The effects of perceived depth on the perception of surface whiteness were discussed in Section 22.4. The present section is concerned with interactions
between shading and stereo in the perception of solid shape.

The relative effectiveness of disparity and shading depends on the type of judgment being made. Thus, Tittle et al. (1997a) found that disparity contributed more to perceived shape of a surface region (parabolic, hyperbolic, or saddle shaped) than to perceived magnitude of curvature. On the other hand, shading with highlights contributed more to curvature magnitude than to shape. The addition of shading had only a small effect on the disparity threshold for detection of depth in a random-dot corrugation (Wright and Ledgeway 2004).

Doorschot et al. (2001) used photographs of a human torso with several levels of binocular disparity combined with several levels of shading produced by varying the position of a light source. Subjects set a gauge figure to match the orientation in depth of each of 300 locations on the torso. A perceived depth map was derived from the settings. The disparity and shading cues were found to combine in an almost linear fashion.

Vuong et al. (2006) asked subjects to set a stereoscopic dot onto a spherical surface with depth specified by only disparity of sparsely distributed dots, by only smooth shading, or by both cues. Even though subjects could not use monocular shading to perform the task, the addition of shading to disparity improved performance above the level when only disparity was present. This result cannot be explained by simple addition of cues. Vuong et al. suggested that shading allowed subjects to interpolate into regions between the disparate dots.

When shading over a surface patch indicates that illumination is from above, the patch appears convex, as in Figure 30.25a. With the opposite shading, the patch appears concave, as in Figure 30.25b. Zhang et al. (2007) used a stimulus of this kind with depth indicated (1) by only shading; (2) by only disparity, as in Figures 30.25c and d; (3) with both cues in agreement, as in Figures 30.25e and g; (4) with the two cues in conflict, as in Figures 30.25f and h. Human subjects and monkeys could use either cue to correctly identify the sign of depth of a stimulus in a set of stimuli of opposite sign. They were faster and more accurate when both cues were present and in agreement than when only one cue was present. They concluded that disparity and shading cues are integrated at some point in the visual system.

However, this could be a simple matter of probability summation rather than neural integration (Section 13.1). It could also be due to the enhanced attention value of any stimulus when its difference from other objects is increased. Thus, an object that differs in both color and shape from surrounding objects is easier to detect than an object that differs in only color or only shape. When the cues were in conflict, shading was the dominant cue when there was little disparity, but disparity became dominant when it exceeded about 10 arcmin.
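The probability-summation alternative is easy to make concrete. The following Python fragment is an illustrative sketch, not taken from any of the studies above, and the detection probabilities in it are invented. It computes the performance expected if shading and disparity feed two independent detectors and a response occurs whenever either detector succeeds:

```python
def probability_summation(p_shading, p_disparity):
    """Probability of a correct detection if shading and disparity are
    processed by independent detectors and either one suffices.
    No neural integration of the two cues is assumed."""
    return 1.0 - (1.0 - p_shading) * (1.0 - p_disparity)

# Hypothetical single-cue detection rates.
print(probability_summation(0.70, 0.80))  # 0.94: better than either cue alone
```

Because independent detectors alone predict faster and more accurate performance with two concordant cues, such an improvement does not by itself demonstrate neural integration, which is the point made above.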

OT H E R M E C H A N I S M S O F D E P T H P E R C E P T I O N

Figure 30.22. Transparency and surface partition. (A) The rectangle appears in front and transparent in one fused image and behind and opaque in the other. Its surface appears evenly bright. (B) The impression of transparency in the near rectangle is weakened or delayed because the vertical line creates areas differing in brightness. (C) The impression of transparency of the folded surface is weakened or delayed because the spine of the folded rectangle creates the impression that the surface is differentially shaded.

Inspection of Figure 30.25 should allow readers to confirm these results.

30.7 ACCOMMODATION AND OTHER DEPTH CUES

30.7.1 ACCOMMODATION AND PERSPECTIVE

Accommodation as a cue to depth was discussed in Section 25.1. The present section is concerned with interactions between accommodation and other depth cues.

Richards and Miller (1969) asked subjects to set a normally viewed point of light to the same distance as a point of light viewed through prisms and lenses that placed it at a different accommodation/vergence distance. The distance cues of image size and brightness were the same for the two lights. Depth judgments were affected by the state of accommodation/vergence for 15 of the 25 subjects. Judgments of the other subjects were dominated by the distance cues of size and brightness.

In size constancy, the apparent size of an object is not affected by viewing distance. According to the size-distance invariance hypothesis (Section 29.2.2), a change in the
apparent distance of an object is accompanied by a change in its apparent size. In the absence of cues to distance, size constancy breaks down and isolated objects tend to appear in the same depth plane (Gogel 1965).

Von Holst (1955) presented dichoptic images of an object at a given fixation/vergence distance but varied the distance of the images from the eye between 8 and 50 cm. Thus, only the accommodative distance of the stimulus changed. At each distance subjects adjusted the image sizes to keep the perceived size of the fused image constant. The results fell halfway between no size constancy (constant angular size) and full constancy (constant linear size). This suggests that accommodation contributed to the estimation of absolute distance required for size constancy. If subjects had remained accommodated at the fixation/vergence distance, defocus blur would have been the crucial factor. But if they had dissociated vergence and accommodation by focusing the images in spite of converging at a different distance, the accommodative state of the ciliary muscles would have been the crucial factor.

Heinemann et al. (1959) asked subjects to compare the size and distance of a luminous disk at a distance of 4 m with the size and distance of a test disk at each of several nearer distances. The disks were viewed monocularly in dark surroundings and always subtended the same visual angle.



Figure 30.24. Opacity or transparency triggered by a dot. A central dot appears coplanar with an opaque surface in the two top stereograms. It appears through a transparent surface in one of the lower stereograms. (Redrawn from Nakayama et al. 1990)

Figure 30.23. Stereo depth and apparent transparency. (A) With crossed fusion, a transparent square stands out in the upper stereogram. The lines remain with the background. In the lower stereogram, a gray square is seen through four portholes. (B) With crossed fusion, an opaque lined square stands out in the upper stereogram. In the lower stereogram, the square is seen through four portholes. (Redrawn from Nakayama et al. 1990)

The nearer disk appeared slightly smaller than the far disk, as one would expect from size-distance scaling. Changing the distance of the test disk had no effect on its perceived relative size when the disks were viewed through artificial pupils that eliminated accommodation as a depth cue. This suggests that differential accommodative blur provided some of the information about relative depth required for size constancy. However, changes in accommodation would have been accompanied by changes in vergence. The nearer disk appeared smaller than the far disk when the disks were viewed binocularly with artificial pupils but with a fixation point that provided a stimulus for vergence.

Wallach and Norris (1963) also found that size constancy was more evident when monocular stimuli were viewed normally than when the effects of accommodation were eliminated by an artificial pupil. They obtained a larger
effect of accommodation than that reported by Heinemann et al. Heinemann and Nachmias (1965) questioned the procedure used by Wallach and Norris. Gogel and Sturm (1972) found that size-distance scaling was more evident when both vergence and accommodation cues were present than when only accommodation was present.

A related question is whether information provided by accommodation enhances the perceived slant of a textured surface viewed monocularly. Watt et al. (2005) asked subjects to set the angle between two lines in a frontal plane to match the slant of a computer-generated textured surface. In one condition, the surface was frontal. Slant was simulated by a texture gradient, but accommodative blur indicated a frontal surface. In a second condition, the monitor was actually slanted so that perspective was consistent with accommodative blur. For all three subjects, slant was underestimated when accommodation was inconsistent with perspective compared with when the two cues were consistent.

30.7.2 ACCOMMODATION AND DISPARITY

Reduction in luminance dilates the pupils and reduces depth of focus. This should improve the precision of relative depth judgments based on accommodation. However, reduction in luminance reduces visual acuity, which degrades the precision of depth judgments based on disparity. Stereoacuity measured with the Randot and Titmus tests was adversely affected only when pupil diameter was reduced below 2.5 mm (Lovasik and Szymkiw 1985).


Figure 30.25. Interactions between shading and binocular disparity. Figures (a) and (b) are for monocular viewing. Figures (c) to (h) are stereograms designed for crossed fusion. (Redrawn from Zhang et al. 2007)

Mather and Smith (2000) constructed one random-dot stereogram in which the texture of the zero-disparity region was sharp and that of the disparate region was blurred by an amount that corresponded to its relative depth. In a second stereogram the zero-disparity region was blurred and the disparate region was sharp. Each stereogram was shown for 250 ms. Observers were only marginally better at detecting depth when the congruent blur cue was present than when no blur was present. Performance was not affected by the presence of conflicting blur.

This is not a fair comparison of the two cues. The disparity cue was under dynamic control, since the subjects could converge on one or the other depth plane. Image blur was not under dynamic control and may have been perceived as a real difference of contour sharpness between texture elements rather than as due to a difference in depth. Image blur under dynamic control can only be due to depth differences and can be an effective cue to relative depth (Section 25.1.3). Also, only real defocus blur contains chromatic aberration and other aberrations that provide information about relative depth.

Watt et al. (2005) presented a frontal surface with disparity-defined slant for which disparity and accommodative blur were inconsistent. In a second condition, the surface was actually slanted so that disparity and accommodative blur were consistent. Slant judgments were similar for the two surfaces. Thus, inconsistent blur had no effect on slant produced by disparity.

Watt et al. went on to ask whether accommodative blur is used to detect the distance of a disparity-defined stimulus. If so, it could contribute to distance scaling of disparity (Section 29.4). The stimuli were two computer-generated surfaces meeting to form the appearance of a concave “open book.” Texture perspective and disparity were consistent. Subjects fixated on the hinge and reported whether the dihedral angle was greater than or less than 90°. Vergence distance was varied by changing the overall disparity of the display, keeping accommodation constant. Accommodative distance was varied by changing the distance of the monitor, keeping vergence constant. Depth constancy (perceived constancy of the dihedral angle at different distances) was higher when vergence and accommodation were consistent compared with when accommodative distance was held constant. This indicates that even though accommodation did not contribute to perception of disparity-defined slant at a given distance, it contributed to the distance scaling of disparity.

Accommodation and vergence are linked, so it is difficult to control one while varying the other. Swenson (1932) asked subjects to move an unseen marker to the perceived distance of a single binocularly viewed luminous disk with distance cues of image size and luminance held constant. Errors were less than 1 cm in the range 25 to 40 cm. When accommodation was optically adjusted to one distance by lenses, and vergence to another distance by prisms, judgments of distance were a compromise between the two distances but with more weight given to vergence.


30.8 MOTION PARALLAX AND PERSPECTIVE

This section is concerned with whether perspective or motion parallax is given more weight in the perception of depth when the two cues are in conflict. Braunstein (1968) found that, in judging the inclination of a textured surface, people gave greater weight to a velocity gradient than to a texture gradient when the two cues were in conflict.

O’Brien and Johnston (2000) used 2-D displays simulating a flat inclined surface. Each surface was viewed monocularly through a circular window. Perspective was generated by a horizontal or vertical sinusoidal grating or by the two gratings combined to form a plaid. The grating drifted to simulate lateral motion of the surface. The inclination discrimination threshold, with respect to a standard surface inclined 45°, was lowest for the plaid texture but was not affected by the addition of motion parallax. When texture-defined inclination was made to conflict with parallax-defined inclination, judgments were based almost entirely on texture perspective. Subjects were more sensitive to changes in texture gradients formed by sinusoidal gratings than to equivalent changes in velocity gradients, but this difference was not large enough to account for the differential weighting of the cues. Cornilleau-Pérès et al. (2002) found a similar dominance of perspective over conflicting motion parallax in the perception of the tilt (direction of maximum slope) of the projected image of a rotating rectangle.

Landy et al. (1991b) used 2-D simulations of opaque vertical cylinders rotating about a vertical axis. The cylinders were covered with randomly spaced dots that provided a texture gradient. Subjects compared the depth in a cylinder with consistent depth cues with that in a cylinder with conflicting cues. Interactions between the motion-parallax cue and the texture-gradient cue were adequately described by simple averaging, with the texture-gradient cue weighted more heavily (0.62) than the motion cue. In subsequent experiments from the same laboratory, observers judged the depth-to-width ratio of a vertical cylinder rotating about a horizontal axis (Young et al. 1993). Interactions between the two cues were again well modeled by a simple linear combination. Motion parallax was weighted more heavily than texture for one out of three observers. The other two observers weighted the cues equally. As one would expect, the weighting for texture was reduced when the cue was weakened by the introduction of anisotropic texture.

Jacobs (1999) developed a Bayesian model of optimal interactions between motion parallax and texture cues to depth in the projected image of an opaque vertical cylinder. The parameters of the model were derived from judgments of the depth of a cylinder when either only parallax or only texture perspective was available. The model successfully predicted performance when both cues were present.
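The linear weighting scheme described above has a standard statistical reading: if each cue gives an unbiased depth estimate corrupted by independent noise, the statistically optimal combined estimate weights each cue by its reliability (inverse variance). The following Python sketch illustrates that general rule; it is not the specific model fitted by Landy et al. or by Jacobs, and the numbers in it are invented:

```python
def combine_cues(estimates, variances):
    """Reliability-weighted average of single-cue depth estimates.
    Each cue's weight is proportional to 1/variance, so a degraded
    cue (e.g., an anisotropic texture) counts for less."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, estimates)) / total

# Hypothetical example: texture signals 10 cm of depth, parallax 14 cm.
# With equal reliability the combined estimate is the mean; degrading
# the texture cue shifts the estimate toward motion parallax.
print(combine_cues([10.0, 14.0], [1.0, 1.0]))  # 12.0
print(combine_cues([10.0, 14.0], [4.0, 1.0]))  # 13.2
```

This kind of scheme also accommodates the finding, described below, that the weights themselves can be adjusted by experience with the informativeness of each cue.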



O’Brien and Johnston obtained dominance of the perspective cues of linear perspective and/or texture density, while Young et al. and Jacobs obtained cue summation for curved surfaces containing the perspective cues of texture density and aspect ratio of texture elements.

The relative weighting of motion parallax and texture perspective can be modified by experience. Jacobs and Fine (1999) used a video display that simulated textured transparent vertical cylinders rotating about the central vertical axis. In discriminating differences in depth between two cylinders, subjects adjusted their weighting of texture and motion parallax according to which cue was most informative in a given set of displays. The weighting that subjects applied could be cued by a difference in cylinder length.

The absolute distance of an object on a textured ground surface is indicated by the height of the point of contact between the object and the surface with respect to the horizon (Section 26.4.1). Motion of the observer or of the object introduces the additional cue of motion parallax. Ni et al. (2005) projected a computer-generated cylinder onto a ground surface in a motion picture of a 3-D scene. The scene was either stationary or moved to generate motion parallax. Judgments of the distance of the cylinder were a compromise between distance specified by surface contact and distance specified by motion parallax. With several objects, motion parallax became the dominant cue to distance.

There is no general answer to the question of whether perspective or motion parallax is the stronger cue to depth. There are large individual differences, and cue weightings can be modified by experience.

30.9 COGNITION AND DEPTH-CUE INTERACTIONS

Section 26.7 dealt with how 3-D objects spontaneously appear to reverse in depth. The present section deals with whether familiarity can cause an object to be seen in a way that contradicts information from depth cues.

In 1838, Wheatstone reversed the drawings in his mirror stereoscope, thus reversing the sign of disparity. Simple line drawings reversed in depth, but more complicated figures did not reverse. He concluded that, “the mind, unaccustomed to perceive its converse, because it never occurs in nature, can find no meaning in it” (Wheatstone 1838, p. 74).

In 1852 Wheatstone viewed a variety of real objects through a pseudoscope, which reversed the sign of disparity. Objects such as the inside of a cup, an embossed medal, a globe, and a skeletal cube appeared in reversed depth, but most familiar objects appeared in their normal relief. He also noted that some 3-D objects, such as an intaglio or salt crystal, might appear in reverse relief even when viewed normally with two eyes. On this occasion Wheatstone wrote,


The reason for this is, that the relief and distance of objects is not suggested to the mind solely by the binocular pictures and the convergence of the optic axes, but also by other signs, which are perceived by means of each eye singly; among which the most effective are the distribution of light and shade and the perspective forms which we have been accustomed to see accompany these appearances. (Wheatstone 1852, p. 164)

However, he did not believe that there was an inversion of disparity coding in such cases. There is no good evidence that the normal appearance of objects seen in a pseudoscope results from an inversion of the sign of disparity coding. Evidence on this issue was presented in Section 21.6.2g.

The Ames rotating trapezoidal window (Ittelson 1952) reverses in apparent depth and direction every half-turn to create the impression of a window with the familiar rectangular shape. The trapezoidal shape of the window biases the interpretation of depth at certain orientations, but the window appears rectangular when it is near the angle of slant that produces a rectangular image. Ames argued that familiarity with rectangular windows biases the interpretation. However, the same effect can be obtained with a trapezoid that does not resemble a window. The effect is most likely due to a tendency to see a frontal tapered shape as a slanted rectangular shape. Familiarity with a rectangular object may increase this tendency, but conclusive experiments have not been performed.

In one of the Ames demonstrations (Ittelson 1952) straight pieces of string were suspended in an apparently random three-dimensional disconnected structure. When the structure was viewed from a particular vantage point, the pieces of string appeared connected in the form of a chair. The real three-dimensional arrangement of the pieces was ignored in favor of the chair percept. But this effect may have nothing to do with familiarity. It would occur if the pieces of string formed any connected polyhedron, familiar or not. The effect depends on the tendency to see connected lines in the image as arising from a connected object. This is an example of the generic viewpoint assumption (Sections 4.5.9e and 26.1.1b).

Brewster (1826) reported that a hollow mask of the human face appears as a normal convex face. This has been taken as evidence that familiarity overrides conflicting information from binocular disparity (Gregory 1970). Klopfer (1991) asked subjects to view a face mask binocularly as it rotated about its midvertical axis. As the concave side of the mask turned toward the subject it appeared convex and the mask appeared to reverse its direction of motion. A rotating upside-down face mask reversed in apparent depth and direction of rotation less readily than an upright face, presumably because an inverted face is more difficult to recognize than an upright face (Carey 1981). An unfamiliar concave object reversed less frequently still.

Bülthoff et al. (1998) arranged a set of moving lights in 3-D space in a stereoscope so that their 2-D projection created the same moving pattern as that created by lights attached to the limbs of a walking human figure. Stereoscopically, the lights were scrambled in depth. Subjects rated this stimulus to be as good a human figure as a nonscrambled stimulus. The inappropriate disparity was ignored in favor of the familiar percept. When the lights did not project as a familiar pattern, their 3-D structure indicated by disparity was more likely to be perceived. Thus, top-down influences arising from familiarity with the human figure can override conflicting information from binocular disparity.

Lu et al. (2006) used a stereoscopic image of a stationary human figure defined by light spots. In one condition, the image was inverted so that it was not seen as a human figure. In a second condition the figure was erect, and subjects were shown a movie that revealed that the stimulus was a human figure. Subjects were more sensitive to the disparity-defined relative depths of points on the two arms when the stimulus was not seen as a human figure. Lu et al. concluded that when the stimulus was seen as a human figure the expectation that the two arms are equal in length impaired the discrimination of disparity-defined differences in arm length.

These observations suggest that observers ignore disparity information when it conflicts with perception of a familiar figure. However, in any instance, the following factors must be allowed for before one can conclude that familiarity is the crucial factor.

1. The presence of perspective Even with unfamiliar stereograms, texture perspective can be a more powerful cue to depth than disparity (Cumming et al. 1993; Johnston et al. 1993). Georgeson (1979) found that a random-dot stereogram of a concave face containing no texture perspective appeared concave. This suggests that familiarity cannot overcome conflicting disparity in the absence of other depth cues. However, a face created by a random-dot stereogram may be too poorly defined to evoke a strong sense of familiarity. Van den Enden and Spekreijse (1989) replaced the natural texture in a pseudoscopically viewed photograph of a face with uniform texture. After a while, the face appeared concave. However, Deutsch et al. (1990) pointed out that the uniform texture enhanced disparity detection and that this, rather than the removal of texture perspective, may have accounted for the pseudoscopic face appearing concave. This evidence suggests that familiarity is not the only factor causing faces to resist the effects of pseudoscopic viewing, but it does not prove that familiarity is not involved. Other evidence suggests that familiarity does affect apparent depth reversal when depth cues are in conflict.


2. Bias to see convex objects There is a general tendency to see concave objects as convex but no reverse tendency to see convex objects as concave (Johnston et al. 1992). The rotating unfamiliar objects in the Klopfer experiment described above tended to be seen as convex more frequently than concave. This preference for seeing convex objects may arise because there are more convex than concave objects. This can be regarded as one aspect of familiarity, but not familiarity with particular objects. The concave-to-convex tendency can be demonstrated by viewing along the axis of symmetry of a foam plastic cup with the rim of the cup nearer or with the base of the cup nearer. The concave cup more readily appears convex than the convex cup appears concave.

Hill and Bruce (1994) found that an upside-down concave mask was less readily seen as convex than an upright mask but that a hollow surface with random undulations reversed equally well in both orientations. If reversal were due only to a general tendency to see concave surfaces as convex, inversion should affect a face and an abstract shape in the same way.

3. Assumed direction of illumination The convexity or concavity of surfaces is affected by the assumed direction of illumination. A shaded disk is seen as a hill or a hollow according to which interpretation conforms to the assumption that illumination comes from above in a headcentric frame of reference (Howard et al. 1990). This issue was discussed in Section 27.3.2. Faces are more difficult to recognize when illuminated from below, probably because we must overcome the tendency to see protuberances illuminated from below as hollows illuminated from above (Johnston et al. 1992).

A concave mask illuminated from above appears to be illuminated from below when seen as convex. This factor should therefore prevent one seeing a concave mask as convex. Thus, reversal of a concave face mask overrides conflicting information from shading. Hill and Bruce (1993) found that an opaque concave mask more readily appeared convex when it was illuminated from below, thus appearing to be illuminated from above when seen as convex.




A translucent concave mask reverses more readily than an opaque mask when both are illuminated from above. A translucent concave mask transilluminated from above appears to be illuminated from above when seen as convex. In this case, shading does not conflict with depth reversal.

4. Presence of bounding edges There is a tendency to see surface regions bounded by an occluding edge as being in front of a neighboring occluded region (Howard 1982).

5. Bias to see ground planes There is a preference for seeing a surface inclined top away rather than top forward (Reichel and Todd 1990).

6. Incomplete suppression of disparity Hartung et al. (2005) asked subjects to judge the distance between the nose and cheek of a face mask created in a stereoscope. Binocular disparity was consistent with either a convex or a concave mask. They argued that if disparity is completely ignored in a concave mask, the apparent depth of the face should be the same as that in a convex mask. On average, the concave mask that appeared convex was judged to be only half as deep as the convex mask. Hartung et al. concluded that the conflicting disparity information in the concave mask was not totally ignored but was responsible for the apparent flattening of the concave face relative to the convex face.

This conclusion is suspect. A control is needed in which the mask has zero disparity, to show that flattening is not simply due to a default depth impression that arises when disparity is absent or totally ignored. In other words, the familiarity cue alone may not be capable of generating the full impression of depth in a face.

A monocularly viewed concave mask appears to follow a moving viewer. The motion parallax produced by motion of the viewer is the reverse of that produced by viewing a convex mask. However, it does not provide conflicting information about depth because it is reinterpreted as due to a rotation of the mask. The apparent depth of a concave mask may well be enhanced by head parallax. In any case, motion parallax is too strong a cue to be totally ignored.


31 SEEING MOTION-IN-DEPTH

31.1 Judging time-to-contact
31.1.1 Monocular cues to time-to-contact
31.1.2 Estimating time-to-contact
31.1.3 Relative judgments of time-to-contact
31.1.4 Comparison of cues to time-to-contact
31.1.5 Detection of time-to-contact by animals
31.2 Direction of motion-in-depth
31.2.1 Motion direction of a point of light
31.2.2 Motion direction of a single object
31.2.3 Motion direction from optic flow
31.2.4 Binocular cues to motion direction
31.2.5 Sensitivity to direction of binocular motion
31.2.6 Accuracy of binocular judgments of motion direction
31.3 Detecting motion-in-depth
31.3.1 Information for motion-in-depth
31.3.2 Motion-in-depth with no stationary reference
31.3.3 Detecting relative motion-in-depth
31.3.4 Isolating the changing-disparity cue
31.3.5 Isolating the difference-of-motion cue
31.3.6 Detecting the speed of motion-in-depth
31.3.7 The flash-lag effect in depth
31.4 Cue interactions
31.4.1 Interaction of monocular and binocular cues
31.4.2 Interaction of object motion and self-motion
31.5 Spatial aspects of motion-in-depth
31.5.1 Directional asymmetry
31.5.2 Position in the visual field
31.6 Aftereffects of motion-in-depth
31.6.1 Aftereffects of rotation in depth
31.6.2 Aftereffects of monocular looming
31.6.3 Aftereffects of disparity-defined motion-in-depth
31.7 Induced motion-in-depth
31.7.1 Monocular induced motion-in-depth
31.7.2 Binocular induced motion-in-depth
31.7.3 Induced motion and perceived distance
31.8 Detectors for motion-in-depth
31.8.1 Looming detectors
31.8.2 Detectors for binocular motion-in-depth

31.1 JUDGING TIME-TO-CONTACT

The ability to judge when an approaching object will arrive at a particular location is important in a variety of everyday tasks. We avoid obstacles when we run or drive (see Lee 1976; Cavallo and Laurent 1988), we judge when and where we will land when striding or jumping (Lee et al. 1982; Warren et al. 1986), and we dodge potentially dangerous objects moving toward us. We also judge time-to-contact when we catch a ball or hit it with a bat or racket. In all these tasks we make absolute estimates of time-to-contact.

Three tasks are used in laboratory studies of the ability of people to estimate time-to-contact. In each task the approaching object, or a computer simulation of an approaching object, is shown for a specified period and then removed from view at a defined distance from the observer.

1. Judging time-to-contact intervals Observers estimate the time interval between when the approaching object is removed from view and when it will reach a specified position, usually the body of the observer.

2. Judging contact-time Observers respond at the estimated instant of arrival of the approaching object at a defined location. They may press a key or indicate whether a test flash occurs before or after the anticipated contact-time.

3. Relative time-to-contact judgments Observers judge whether one object will reach a destination before another object.

31.1.1 MONOCULAR CUES TO TIME-TO-CONTACT

For a spherical object of known size, approaching at a known velocity, time-to-contact could be indicated at any instant by the angular diameter of the image. A more useful indicator of time-to-contact would be the rate of expansion of the image (dθ/dt). An avoidance response could be initiated when the rate of expansion reaches a given value. The threshold value is reached at a greater distance for a large object than for a small object. The size of the image produced by a small object is approximately inversely proportional to the distance of the object from the eye.

For a small object approaching at constant velocity, distance from the eye is proportional to time-to-contact. Let the angular size of the image of a small spherical object at time t1 be θ1 and at time t2 be θ2. Let T be the time interval between time t2 and when the object reaches the eye. For a small object, the proportional change in image size in time interval t2 − t1 is equal to the ratio of this time interval to time interval T, or

$$\frac{\theta_2-\theta_1}{\theta_1}=\frac{t_2-t_1}{T}
\qquad\text{or}\qquad
T=\frac{(t_2-t_1)\,\theta_1}{\theta_2-\theta_1}\tag{1}$$

Thus, the time-to-contact from the end of a time interval is the time interval multiplied by the original image size, θ1, divided by the change in image size, θ2 − θ1, during that interval. For unit observation time, time-to-contact is the original image size divided by the change in image size. For example, if the size of the image of an approaching object has doubled in 1 s, the object must have traveled half the distance from its starting point to the observer in 1 s. It will therefore take another second to hit the eye.

The rate of change of image size is (θ2 − θ1)/(t2 − t1). The instantaneous rate of change is dθ/dt. Substituting in equation 1, we obtain:

$$\text{Time-to-contact}=\frac{\theta}{d\theta/dt}\tag{2}$$

Thus, the time-to-contact of an object moving at constant velocity is equal to the angular size of its image, θ, at a given time divided by the rate of change of image size at that time, dθ/dt. Lee (1976) called this ratio tau (τ) (Portrait Figure 31.1). This indicator is independent of the size of the object. This idea was first proposed by Fred Hoyle (1957) in a science fiction novel about an asteroid approaching the earth. Note that information about the linear size of the object or the distance of the object is not required for judgments about time-to-contact. An approaching distant bullet or a nearby insect will hit in 1 s if the image doubles in size in that period. A rapidly looming image of an object at any distance is alarming because it signifies that impact is imminent.

Equation 2 is satisfied by any stimulus variable, in addition to image size, that varies approximately inversely with distance, such as accommodation and vergence. Time-to-contact of an approaching object may also be estimated by dividing its perceived distance by its approach speed. Whether observers use this distance/speed cue or tau probably depends on the stimulus conditions. Both indicators are invalid when the approaching object is accelerating or decelerating, or is close to the eye.
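Equations 1 and 2 are simple enough to check numerically. The following Python fragment is an illustrative sketch (the function names are invented for this example, and constant approach velocity is assumed):

```python
def time_to_contact(theta1, theta2, dt):
    """Equation 1: time-to-contact at the end of a sampling interval,
    from two angular image sizes taken dt seconds apart."""
    return dt * theta1 / (theta2 - theta1)

def tau(theta, dtheta_dt):
    """Equation 2: instantaneous tau = image size / rate of expansion."""
    return theta / dtheta_dt

# Worked example from the text: an image that doubles in 1 s
# signals arrival in one more second.
print(time_to_contact(1.0, 2.0, 1.0))  # 1.0 s
# The estimate is independent of absolute image size (bullet or insect):
print(time_to_contact(0.1, 0.2, 1.0))  # also 1.0 s
```

The estimate depends only on the ratio of image size to its change, which is why neither the linear size of the object nor its distance needs to be known.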



Figure 31.1. David N. Lee. Born in Liverpool, England, in 1939. He obtained a B.A. in mathematics from Cambridge University in 1961 and a Ph.D. in computer methods and psychology from the University of London in 1965. After postdoctoral work at Harvard University, he moved to Edinburgh University, where he is now professor emeritus of perception, action, and development. He is a fellow of the Royal Society of Edinburgh.

Hatsopoulos et al. (1995) proposed another indicator of time-to-contact, which they called eta (η). They used it to model the responses of looming detectors in locusts. Eta varies as a function of time according to the following equation:

$$\eta = C\,\frac{d\theta}{dt}\,e^{-\alpha\theta(t)}\tag{3}$$

where C is a constant that indicates the overall magnitude of the response of the detector, and α is a parameter that prevents response saturation at near distances. At far distances, the rate of increase in image size is the dominant influence. At near distances, the exponential function is the dominant influence. The value of η peaks before collision. The signal generated by a large object peaks earlier than that generated by a small object. The peak value provides a strong signal for an avoidance response. However, detection of the peak in any time-varying stimulus requires integration of signals before and after the peak has occurred. Also, a large object at a greater distance produces the same signal as a nearer small object (Gabbiani et al. 1999).

Figure 31.2 indicates how θ, dθ/dt, τ, 1/τ, and η vary with time for an object approaching at constant velocity. Signals based on τ or 1/τ do not depend on the size or velocity of the approaching object. The disadvantages associated with each of these indicators could be overcome by using all of them. We will see in Section 31.8.2 that there may be distinct detectors for each of these indicators of time-to-contact.

Whether a looming object appears to approach depends on whether the observer interprets the changing image size as motion-in-depth or as a change in the size of the object.
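Equation 3 is easy to explore numerically. The following Python sketch (all parameter values are arbitrary choices for illustration, not values from the locust studies) simulates an object approaching at constant velocity and locates the pre-collision peak of η, showing that the peak occurs earlier for a larger object:

```python
import math

def eta(theta, dtheta_dt, C=1.0, alpha=5.0):
    """Equation 3: eta = C * (dtheta/dt) * exp(-alpha * theta)."""
    return C * dtheta_dt * math.exp(-alpha * theta)

def angular_size(size, distance):
    """Angular subtense (radians) of an object of linear size `size`."""
    return 2.0 * math.atan(size / (2.0 * distance))

# Objects approach from 10 m at 1 m/s; sample theta every 10 ms.
for size in (0.2, 1.0):  # small and large object (meters)
    t, dt, prev_theta = 0.0, 0.01, None
    peak_t, peak_eta = None, -1.0
    while 10.0 - t > 0.05:  # stop just before contact
        theta = angular_size(size, 10.0 - t)
        if prev_theta is not None:
            e = eta(theta, (theta - prev_theta) / dt)
            if e > peak_eta:
                peak_eta, peak_t = e, t
        prev_theta, t = theta, t + dt
    print(size, round(peak_t, 2))  # the larger object's eta peaks earlier
```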


Figure 31.2. Time course of four time-to-contact indicators for large and small objects approaching at constant velocity: (a) image size (θ); (b) rate of image expansion (dθ/dt); (c) image size/rate of expansion (τ, with its reciprocal 1/τ); (d) η = C(dθ/dt)e^(−αθ(t)). Time is plotted in reverse relative to time-to-contact at t = 0. (Adapted from Sun and Frost 1998)

For discussion of this topic, readers are referred to Johansson (1964), Marmolin (1973), and Swanston and Gogel (1986). For a discussion of other indicators of time-to-contact see Tresilian (1994b) and Bootsma and Craig (2002).

31.1.2 ESTIMATING TIME-TO-CONTACT

31.1.2a Accuracy in Judging Time-to-Contact

The most direct way to assess a person’s ability to judge time-to-contact is to show an approaching object for a specified time and obtain an estimate of when it will hit the body. In one study, subjects saw a movie of a black square subtending 0.75˚ or 3˚ moving toward them at constant velocity on a white background or over a ground surface covered with a regular grid pattern (Schiff and Detwiler 1979). Subjects were also shown a movie of an approaching
automobile. After a variable period of time the screen went blank and subjects pressed a key when they thought the object would have reached them. Times-to-contact of up to 10 s were underestimated by about 35%. The percentage error was much larger for a time-to-contact of 16 s. One problem with using a movie sequence is that the image on the screen reproduces the image created by viewing the original scene only when the eye is placed at the optic distance of the camera. It is not clear that this condition was achieved in these experiments.

McLeod and Ross (1983) conducted a similar experiment using film clips of 2 to 6 s duration taken from a car moving at between 40 and 100 km/hr toward a stationary car. The projected film subtended the same visual angle as the original scene. Subjects called “Now” when they thought they would hit the stationary car. They also obtained underestimations of over 30%. Accuracy was independent of stimulus duration but improved with increasing velocity (decreasing time-to-contact). McLeod and Ross argued that this supports the role of tau.

Bootsma (1989) found that subjects judged the contact-time of a ball more accurately when using a natural action, such as striking the ball with a bat, than when they pressed a key. In the same laboratory, Savelsbergh et al. (1991) used a real ball that moved on the end of a pendulum toward the subject’s hand. After 2 s, electrically operated shutters occluded the ball and the subject moved the hand as if to catch it. It was found that looming information in the last 200 ms of a ball’s flight tuned the natural catching action of the hand both with regard to time and to the size of the ball. A ball that deflated slightly as it approached had a smaller value of tau, because the rate of expansion of its image was less than that of the image of a ball of fixed size. As predicted from the tau hypothesis, the catching motion occurred later for a deflating ball than for one of fixed size. The results were essentially the same with binocular viewing as with monocular viewing, which suggests that binocular information is not used in this type of task.

An object approaching a point close to the nose was judged to pass the frontal plane of the head sooner than an object approaching a point 12˚ below the line of sight (Gray and Regan 2006).

31.1.2b Effects of Object Size and Distance

If time to arrival were based solely on tau, the size of the approaching object should have no effect on judgments of time-to-contact. However, several investigators have found an effect of size. Baker and Steedman (1962) asked subjects to judge when an approaching luminous disk, viewed monocularly in dark surroundings, had moved halfway from its starting position. Its speed was 10 or 20 in/s, its starting position varied between 10 and 45 ft, and its initial angular size varied between 4 and 36 arcmin. Subjects responded
too soon for the largest stimulus but too late for the smallest stimulus. This may be related to the finding that the threshold for detecting monocular motion-in-depth was lower for objects subtending larger visual angles (Steedman and Baker 1962). Subjects reduced these systematic errors to about 6% after several hours of training with knowledge of results.

DeLucia (1991) also reported that large objects appear to have an earlier time of arrival than small objects. This was also true for judging time to impact for movement of the observer toward a stationary object (DeLucia and Warren 1994). Smith et al. (2001) also found that responses in a simulated ball-catching task were earlier when the balls were made larger or were moved more slowly. Size would affect performance if image size alone or change of image size alone were used to judge distance. The absolute change in image area of a large approaching object is greater than that of a small object. Perhaps subjects were using tau but with an estimate of change of image size affected by image size.

Gray and Regan (1999a) argued that time-to-contact judgments based on the ratio of judged distance to speed should improve with the addition of distance information, but that judgments based on tau should be independent of distance information. They presented a looming circular disk on a monocularly viewed computer screen at a distance of 100 or 500 cm with the visual subtense of the disk held constant. All monocular cues to distance were available. After a variable viewing time the disk was removed from view, and subjects estimated contact time with reference to a sound click. Estimates were the same at the two viewing distances. Gray and Regan concluded that subjects were using tau rather than the ratio of distance to speed. This result does not mean that the ratio of distance to speed, or simply speed alone, would not be used under other circumstances (see Kerzel et al. 2000).

31.1.2c Time-to-Contact of Accelerating and Partially Occluded Objects

Estimates of time-to-contact based on tau require the object to be moving at constant velocity. When the object is accelerating, accurate estimates of time of arrival would require registration of acceleration. Suppose a person uses tau based on the assumption of constant velocity to estimate the arrival time of an approaching object after it has disappeared from view. An accelerating object would arrive earlier than expected, and a decelerating object would arrive later than expected.

Although large differences in acceleration of a visual object can be detected, it has been suggested that detection is based on successive estimates of velocity rather than on detectors specifically tuned to acceleration (Werkhoven et al. 1992). For a review of the literature on the detection of visual acceleration see Brouwer et al. (2002).
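The size of the expected error can be illustrated with a short numerical sketch in Python (the distances, speeds, and accelerations are invented for the example). An estimate based only on the velocity at the moment of disappearance ignores any acceleration:

```python
import math

def predicted_arrival(distance, speed):
    """Arrival time assuming constant velocity (what a tau-style
    estimate, made at the moment of disappearance, signals)."""
    return distance / speed

def true_arrival(distance, speed, accel):
    """Arrival time under constant acceleration:
    distance = speed*t + 0.5*accel*t**2, solved for t."""
    return (-speed + math.sqrt(speed**2 + 2.0 * accel * distance)) / accel

# Object 10 m away, moving at 2 m/s when it disappears from view.
print(predicted_arrival(10.0, 2.0))     # 5.0 s: constant-velocity estimate
print(true_arrival(10.0, 2.0, 1.0))     # ~2.9 s: accelerating object arrives early
print(true_arrival(10.0, 2.0, -0.18))   # ~7.6 s: decelerating object arrives late
```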



Kaiser and Hecht (1995) found that, as expected, an approaching object that was decelerating when it disappeared from view arrived later than expected. But an accelerating object did not show the expected error. However, acceleration may have been too low and occlusion duration too short to reveal any effect of acceleration. Benguigui et al. (2003) used a larger range of accelerations and occlusion times. Also, the object moved laterally toward a visual target rather than in depth. The pattern of errors indicated that subjects based their estimates of the time of arrival of the object at the target on the velocity of the object at the time it disappeared. Subjects did not use acceleration even when it was high enough to be detected. Detection of acceleration is one thing, but extrapolation of the future location of an object from acceleration is a much more difficult task.

In the natural world, approaching objects are often partially occluded by a stationary object or by another moving object. The visible part of an approaching object may shrink or enlarge, depending on its direction of motion relative to a stationary object. DeLucia (2004) found that time-to-contact judgments of a simulated object in a computer-generated display were affected in predictable ways by partial occlusion of the approaching object by another, stationary or moving, object.

31.1.2d Effects of Intermittent Illumination

An object moving in a frontal plane appears to move more slowly when viewed intermittently than when viewed continuously. This is presumably because the motion signal generated by a moving object illuminated intermittently is weaker than that generated in continuous illumination. A related effect is that the moment-to-moment position of a continuously illuminated object is more accurately registered than that of an intermittently illuminated object. This is the flash-lag effect, which is discussed in Section 31.3.7.

Hecht et al. (2002) asked whether intermittently illuminating an approaching object makes it appear to move more slowly and hence causes time-to-contact to be overestimated. They used monocularly viewed targets on a computer screen and real approaching objects. The targets were viewed for 5 s, continuously or intermittently at rates between 1 and 18 Hz. When the target was illuminated intermittently, time-to-contact, as indicated by a button press or a grasping response, was overestimated. The lower the frequency, the larger the effect. Presumably, motion signals are attenuated to a greater extent when the object is sampled less frequently. A constantly illuminated object should appear to approach more quickly than an adjacent intermittently illuminated object. A frontal array of objects with a gradient of flash rates along the array should appear to rotate out of the frontal plane as it approaches.


31.1.3 RELATIVE JUDGMENTS OF TIME-TO-CONTACT

31.1.3a Comparing Time-to-Contact of Two Objects

Rather than indicating time-to-contact of a single object, subjects can be asked to discriminate a difference in time-to-contact of two objects. For example, Todd (1981) asked subjects to judge which of two looming squares on an oscilloscope screen would reach them first. They varied relative starting size and distance and provided error feedback. The responses were 90% accurate for a time-to-contact difference of 150 ms and did not reach chance level until the time difference was reduced to 10 ms. Estimates were more accurate for trajectories in the midline than for oblique trajectories (Schiff and Oldak 1990).

Suppose that the starting size of the images of two test objects is held constant. An observer could use either tau (angular size over rate of angular expansion) or the relative rates of angular expansion of the two images to discriminate between the times-to-contact of the two objects. Interleaving two or three starting sizes does not demonstrate which cue a subject is using. Regan and Hamstra (1993) used an 8 × 8 matrix of stimuli in which time-to-contact varied along the rows and rate of expansion (dθ/dt) varied down the columns (Portrait Figure 31.3). The logic of this design is sketched below. Thresholds for discriminating differences in both time-to-contact and rate of expansion could be obtained from a single data set. When subjects were instructed to indicate whether the time-to-contact of an expanding bright solid square was greater or less than the mean of the stimulus set, the threshold for time-to-contact was between 7 and 13% over a range of times-to-contact of 1 to 4 s. This was about 100 times lower than the discrimination threshold for rate of expansion. Regan and Hamstra concluded that the human visual system contains a mechanism specifically sensitive to time-to-contact that is independent of changes in either the angular size or the rate of change of angular size of the image of the approaching object.

Regan and Vincent (1995) also found that discrimination thresholds for judgments of time-to-contact, rate of expansion, and starting size were independent for a display in which the values of the three features were varied simultaneously. However, this independence of discrimination thresholds decreased as the eccentricity of the stimulus increased from 0˚ to 35˚. For example, in peripheral vision, variations in rate of expansion produced changes in judgments of time-to-contact.

Discrimination thresholds for judgments of time-to-contact of a square that loomed in size from 1.2˚ to 4.1˚ were not affected by whether the texture within the square loomed at a lesser rate, the same rate, or a greater rate than the square. However, estimates of time-to-contact became shorter as the rate of looming of the texture relative to the square was increased.
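The decoupling in the Regan and Hamstra design can be sketched in a few lines of Python (an illustrative reconstruction, not their stimulus code; the 8 × 8 values are invented). Because starting image size equals time-to-contact multiplied by expansion rate (equation 2 rearranged), size varies across the matrix, so neither image size nor expansion rate alone predicts time-to-contact:

```python
# Rows: time-to-contact (s). Columns: rate of expansion (deg/s).
taus = [1.0 + 3.0 * i / 7 for i in range(8)]    # 1-4 s
rates = [0.5 + 1.5 * j / 7 for j in range(8)]   # deg/s

# Each cell gives the starting image size consistent with its row's
# time-to-contact and its column's expansion rate: theta = tau * rate.
sizes = [[tau * rate for rate in rates] for tau in taus]

# Within any row, tau is constant while size and rate both vary,
# so correct discrimination of tau cannot rest on either cue alone.
print(sizes[0])  # one value of tau, eight different sizes and rates
```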

Figure 31.3. David Regan. Born in Scarborough, England, in 1935. He obtained a B.Sc. in 1957 and a Ph.D. in 1964, both in physics from Imperial College, London. In 1965 he became senior research fellow in communication and neuroscience in the University of Keele. Between 1976 and 1987 he held professorships in psychology, physiology, and medicine at Dalhousie University, Canada. Between 1976 and 1987 he was professor of psychology at York University, Toronto, and professor of ophthalmology at the University of Toronto. From 1987 until he retired in 2008, he was professor of psychology and biology at York University. He received the J. W. Dawson Medal of the Royal Society of Canada in 1997, the C. F. Prentice Medal of the American Academy of Optometry in 1990, the Max Forman Prize for Medical Research, and the Proctor Medal of the Association for Research in Vision and Ophthalmology in 2001. He is a fellow of the Royal Society of Canada, and recipient of the Order of Canada.

Estimates were most accurate when the texture loomed at the same rate as the square (Vincent and Regan 1997).

López-Moliner and Bonnet (2002) used a psychophysical procedure like that used by Regan and Hamstra (1993). Subjects indicated which of two monocularly viewed squares seen in succession on a dark computer screen would reach them first. The JND ranged from 6% to 10%, which is similar to the values reported by Regan and Hamstra. From an analysis of reaction times and response accuracy they concluded that subjects initiated a response when the signal η, described in Section 31.1.1, reached a threshold value. The differences in η between the two stimuli predicted the correct responses more accurately than did differences in image size, rates of expansion, or a weighted sum of these variables.

31.1.3b Intercepting a Moving Object

The act of moving an effector, such as the hand, the mouth, or an implement, to intercept a moving object requires the effector to reach the object at the same place at the same time.



Figure 31.4. Interception task used by Lee et al. (2001). Subjects moved the cursor so that it reached the goal at the same time as the target.

Lee et al. (2001) set up the task depicted in Figure 31.4. Subjects moved a cursor so as to bring point H on a computer monitor to rest at point G just as the moving target T reached G. The target moved through a constant distance at constant velocity or constant acceleration over a time period of between 0.5 and 2 s. The following four strategies that a subject could adopt were examined.

1. The subject predetermines when T will reach the goal and then moves H without feedback.

2. The subject keeps the distance of H to the goal, G, constantly related to the distance of the target, T, to the goal.

3. The subject keeps the time-to-contact (τ) of H with the goal constantly related to the time-to-contact of the target with the goal. That is, τHG is kept equal to kτTG, where k is a constant.

4. The subject keeps the time-to-contact of H with the goal constantly related to the time-to-contact of H with the target. That is, τHG is kept equal to kτHT.

The first strategy was discounted, because subjects showed clear evidence of being influenced by the velocity profile of the target. The second strategy was discounted because it would not allow the subject to bring H to rest when it reached the goal. The function relating τHG to kτHT was linear for a greater proportion of the movement than was the function relating τHG to kτTG. Lee et al. concluded that subjects used strategy 4. This strategy of coupling the times-to-contact (tau gaps) of H to the goal and of H to the target allows the subject to bring H to rest just when T reaches G. Also, the H-to-G and the H-to-T contact times are both specified in the optic array.

The above task favored the strategy that subjects used. Other tasks may well favor other strategies. For example, many tasks, such as striking the keys of a piano or swatting a fly, are performed ballistically. In the above task, the effector (H) and the target (T) were clearly visible but the goal area was not. Consider the task of moving the unseen hand to intercept a ball just as it reaches a clearly visible goal. Subjects would use strategy 3 under these circumstances.

31.1.4 COMPARISON OF CUES TO TIME-TO-CONTACT

31.1.4a Monocular Versus Binocular Cues

Subjects were better at catching a tennis ball with binocular vision than with monocular vision, but only when the surroundings were visible (Von Hofsten et al. 1992). Servos and Goodale (1998) found no difference between monocular and binocular conditions in the accuracy of catching a 2.5 cm approaching ball with surroundings visible. But in this experiment, subjects might have gained a spurious accuracy because of the relatively stereotyped motion of the ball.

Judgments of time-to-contact of a small computer-generated circle were more accurate when the circle both loomed and changed disparity compared with when only one cue was available (Heuer 1993a). When the cues were in conflict, the changing-size cue was dominant for objects larger than about 1˚, but changing disparity (and/or vergence) was the dominant cue for objects smaller than 0.5˚. Looming is not an effective cue for small objects.

Gray and Regan (1998) asked subjects to estimate time-of-contact of a 0.7˚ or 0.03˚ disk seen against a fixed array of random dots with reference to a click sound. They ensured that subjects used the required cue by varying task-irrelevant cues, such as stimulus duration. Like Heuer, they found greater accuracy when both looming and changing disparity were present than when only one cue was present. With only changing disparity, time-to-contact was overestimated 2.5 to 10% for the large target and 2.6 to 3% for the small target. With only looming, time-to-contact of the large target was underestimated by between 2 and 12%. Subjects could not make estimates with the looming small target, thus confirming that looming is not an effective cue for small objects. With both cues available for the large target, errors ranged between 1.3 and 2.7%.

The accuracies reported by Gray and Regan are much higher than those reported by previous experimenters. Gray and Regan’s subjects estimated time-of-contact with reference to a click, while other investigators asked subjects to press a key, a method contaminated with reaction times. Gray and Regan commented that an accuracy of 1.3% could explain the ±2.0- to 2.5-ms accuracy with which experts can estimate the time-to-contact between a ball and a bat.


Gray and Regan (1999b) asked subjects to judge contact time of a patch of dots. When looming (of dot size, dot spacing, and patch size) and changing disparity were concordant, errors were independent of dot size. When dot size was held constant, errors remained independent of dot size as long as the dots were smaller than about 4 arcmin. When dot size was 10 arcmin, time-to-contact was overestimated by up to 21%.

In another experiment, Gray and Regan (2000a) found that monocular and binocular sources of information about time-to-contact of the simulated approach of a sphere viewed for 1.25 s were weighted about equally. However, when the object was a nonspherical object that rotated 90˚ as it approached, subjects could perform only with both eyes open. Presumably, with only one eye open, rotation of the object interfered with the ability to detect the rate of looming of the image.

Monocular inspection of a repeatedly expanding spot for 10 minutes increased the estimated time-to-contact of a subsequently seen test spot in which approach was simulated by only monocular looming or by only changing disparity (Gray and Regan 1999c). This result supports the idea that changing-size and changing-disparity information converge on a common motion-in-depth mechanism.

Rushton and Wann (1999) designed a model system sensitive to a weighted combination of looming and changing disparity, with weights set according to which cue specifies the earlier arrival time or is stronger. The model conformed to the way subjects performed in a virtual-reality ball-catching task.

31.1.4b Comparison of Monocular Cues

DeLucia et al. (2003) compared the separate and combined effectiveness of four monocular cues to time-to-contact. In a computer-generated, monocularly viewed scene drawn in perspective, two objects loomed at the same velocity for 3.5 s. Theoretically, the objects had the same time-to-contact. After the objects had disappeared, subjects pressed a key when they judged that one of the objects had reached the observation plane. The procedure was then repeated for the other object. For the relative-size cue, one object was half the area of the other. For the height-in-field cue, one object was nearer the horizon than the other. For the motion-parallax cue, relative horizontal motion was added. For the texture-density cue, one object had random dots of twice the density of those on the other object. Each additional cue affected time-to-contact judgments. When more than one cue was present, performance was determined by a weighted combination rather than by reliance on the most effective cue.

The general conclusion from work on time-to-contact is that people may use, or learn to use, any of a number of sources of information, singly or in combination. The choice

will depend on the information available, the reliability of the information, the sensory capacities and experience of the person, and the requirements of the task.

31.1.5 DETECTION OF TIME-TO-CONTACT BY ANIMALS

Several animals have been shown to judge time-to-contact. For instance, diving birds, such as gannets, accurately judge the time when they should fold their wings as they dive into water at high speed (Lee and Reddish 1981). Neurons have been found in the nucleus rotundus of the pigeon that respond at a constant time before an approaching object makes contact with the head, even when the size or velocity of the object varies widely. These cells could therefore control avoidance behavior (Wang and Frost 1992).

Locusts avoid collisions with other locusts when flying in a swarm (Robertson and Johnson 1993). In this case, all the objects are the same size, and avoidance could be based on the size of the looming image rather than on tau. However, the firing rate of two wide-angle cells in the locust brain, known as lobula giant movement detectors, depends on the product of image size and the instantaneous angular velocity of changing size (Hatsopoulos et al. 1995; Rind 1996; Rind and Simmons 1997) (Section 33.1.5). Hatsopoulos et al. modeled the signal by the function η described in Section 31.1.1.
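The form of that signal can be sketched as follows, a toy illustration rather than the published implementation (the constants C and α are assumed, and the model's response delay is omitted). It multiplies the angular velocity of image expansion by an exponentially decaying function of image size, so it rises during approach and peaks before contact.

```python
import numpy as np

# Toy eta-type looming signal: eta(t) = C * theta_dot(t) * exp(-alpha * theta(t)).
C, alpha = 1.0, 5.0                   # assumed constants
S, v, T = 0.1, 2.0, 1.0               # object size (m), speed (m/s), contact at t = T

for t in np.arange(0.0, 0.96, 0.05):
    D = v * (T - t)                   # current distance from the eye
    theta = 2 * np.arctan(S / (2 * D))
    theta_dot = S * v / (D**2 + S**2 / 4)
    print(f"t = {t:.2f} s  eta = {C * theta_dot * np.exp(-alpha * theta):.3f}")
# eta grows as the image expands, then the exponential term wins and the
# signal peaks a short time before contact, which could trigger escape.
```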
The physiological processes underlying motion-in-depth are discussed in Section 31.8.

The topic of time-to-contact has been reviewed by Hecht and Savelsbergh (2002). Theoretical issues in research on time-to-contact have been critically reviewed by Tresilian (1991, 1993, 1995) and Wann (1996). Both these authors cast doubt on the idea that the tau ratio is used in judging time-to-contact under natural viewing conditions. The more general issue of judging the time-to-contact between two objects, rather than between an object and the self, has been discussed by Bootsma and Oudejans (1993), Tresilian (1994b), Smeets et al. (1996), and Laurent et al. (1996). Applications of studies of motion-in-depth to sport have been reviewed by Regan (1997) and Land and McLeod (2000).

31.2 DIRECTION OF MOTION-IN-DEPTH

31.2.1 MOTION DIRECTION OF A POINT OF LIGHT
At a given instant, the direction of motion of a point of light approaching an eye along a straight trajectory within a horizontal plane can be decomposed into two components, as shown in Figure 31.5.





Figure 31.5. Motion-in-depth of a monocular object. The azimuth of an object P at a given instant, as it approaches along trajectory AB, is the angle φ between the visual line, PO, on which P lies, and a sagittal plane of the head. The impact direction of P at a given instant is the angle θ between the trajectory and the visual line that P is traversing. Whatever the value of φ, for a specified distance of P, θ specifies the point of impact of P on a plane orthogonal to the visual line. For a given azimuth and distance of P, θ specifies the point of impact on the frontal plane of the head. The angle α, between P's trajectory and a sagittal plane, equals θ + φ.

The first component is the headcentric direction of the motion, which is best specified in polar coordinates (meridional angle and eccentricity). For motion in a horizontal plane, direction is specified by the azimuth of the trajectory relative to a sagittal plane of the head.

The second component is the impact direction of the trajectory. This is the angle that the trajectory makes with the visual line that the object is traversing at that instant. When the trajectory lies along a visual line, the impact direction is 0˚ and the object will hit the eye—the object is on a direct-impact trajectory. When the trajectory is not along a visual line, both the azimuth and the impact direction change from moment to moment.

If the trajectory of an object at distance D traverses a visual line at an angle θ, it will cut the plane containing the nodal point of the eye and orthogonal to the trajectory at distance D sin θ from the nodal point. Thus, the nearest distance by which an approaching object will miss the eye depends only on its impact direction and distance. An object at azimuth φ and with an impact direction θ will impact the frontal plane of the head at a distance from the nodal point given by:

Distance of impact from eye = D sin θ / cos α    (4)

where α = θ + φ, as shown in Figure 31.5. Thus, for an object approaching in a horizontal plane, the point of impact in the frontal plane of the head is at a distance from
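A numerical sketch of this geometry, using the reconstruction of equation (4) above (the distance and angle values are arbitrary examples):

```python
import numpy as np

def impact_geometry(D, theta_deg, phi_deg):
    """Miss distances for an object at distance D with impact direction theta
    and azimuth phi (degrees). Returns (a) the nearest miss distance, D sin(theta),
    in the plane through the nodal point orthogonal to the trajectory, and
    (b) the impact distance from the eye in the frontal plane of the head,
    D sin(theta) / cos(alpha), where alpha = theta + phi (equation 4)."""
    theta, phi = np.radians(theta_deg), np.radians(phi_deg)
    return D * np.sin(theta), D * np.sin(theta) / np.cos(theta + phi)

print(impact_geometry(D=2.0, theta_deg=0.0, phi_deg=10.0))  # (0, 0): direct impact
print(impact_geometry(D=2.0, theta_deg=5.0, phi_deg=10.0))  # small miss distances
```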



the eye that depends on the object's azimuth, impact direction, and present distance.

Consider a monocularly viewed point of light in dark surroundings, the headcentric direction of which is judged correctly. As the point moves, the observer can detect only the component of its motion projected onto the frontal plane. For an object moving at constant velocity, a change in its impact direction causes a change in the direction or velocity of the motion of the retinal image. If the eye pursues the object, the direction or velocity of the eye movement changes. In the absence of any other information, these cues to impact direction are ambiguous. The retinal image motion produced by an object moving at a certain velocity within a frontal plane may be the same as that produced by an object traveling at an angle toward or away from the observer at some other velocity.

If the point of light moves directly toward or away from an eye, its motion is not visible when viewed monocularly. It is therefore not possible on this basis alone to judge whether a point of light is approaching or receding. If the observer knows that a point source is moving, a judgment can be made about whether it is moving along a line of sight or along some other trajectory. Similarly, an observer moving in a straight line can tell whether or not the heading is in the direction of an isolated point of light viewed monocularly. Suppose, for instance, that a pilot is flying a plane on a dark windless night toward a lighthouse or in pursuit of a light on another plane. Any perceived movement of the light relative to the nose of the plane indicates that the pilot is not flying directly toward the light. Corrections made for a crosswind will cause the plane to fly to the target along a curved path. The pilot will not know whether the plane is getting nearer to a light on another plane, because the other plane could be approaching or receding. The task of detecting motion of the light relative to a part of the cockpit fixed with respect to the head reduces to that of vernier acuity and can be performed with great precision, as in aiming a gun with a gun-sight. Llewellyn (1971) found that the precision and accuracy of judgments about heading based solely on the apparent sideways drift of a small target were far better than the precision and accuracy of judgments based on the focus of expansion of an array of objects, which is described next.

31.2.2 MOTION DIRECTION OF A SINGLE OBJECT

The motion of the image of an object moving with respect to an eye can be decomposed into a translation component and a changing-size component. When the object keeps a constant distance from the eye, its image translates but does not change size. When the object approaches the eye along a visual line, its image expands but does not translate. This signifies that the object will ultimately hit the eye. The impact direction may be specified by the number of


object radii, n, by which an object will miss the center of the eye within a plane orthogonal to the visual line the object is traversing. Regan and Kaushal (1993) showed that n is approximately equal to the ratio of the translational speed of the object's retinal image, dφ/dt, to the rate of image expansion, dθ/dt:

n = (dφ/dt) / (dθ/dt)    (5)

When the center of an object approaches along a line of sight, the image of the object expands symmetrically, the translational motion is zero, and n = 0. When one edge of the object moves along a line of sight, the object is destined to graze the nodal point of the eye, and n = 1. Also, the image of the object expands asymmetrically. If the asymmetry of image expansion is large enough, the object is destined to miss the eye.

The degree of asymmetry of image expansion can be specified by the ratio of the velocities of its opposite edges, V1 and V2. When an object approaches along an eye's line of sight, its opposite edges move in opposite directions at the same speed, and the velocity ratio is –1. For n > 1 the two edges move in the same direction and the velocity ratio is positive. Regan and Kaushal derived the following equation relating n to the asymmetry of image expansion:

n = (1 + V1/V2) / (1 − V1/V2)    (6)

The information supplied by the ratio of V1 to V2 is mathematically equivalent to that supplied by the ratio of translational speed to looming speed. These cues to impact direction depend only on the direction of the path of the approaching object relative to a visual line—they are not substantially affected by the direction of fixed gaze. These cues do not indicate the azimuth direction or location of the approaching object, only its impact direction. In using these cues to estimate impact direction, a person does not need to know the linear size or distance of the object.

If the eye remains fixated on the center of an approaching object, the image expands symmetrically whatever the impact direction. In this case, impact direction is indicated by the ratio of eye velocity to the rate of image expansion.

Up to now it has been assumed that the object is spherical. The image of a nonspherical object changes shape when it moves along any path except a line of sight. For example, the image of a frontal square becomes trapezoidal when the square moves in a frontal plane away from the straight-ahead position. The ratio of this change to the rate of image expansion could also specify impact direction.
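Equations (5) and (6) reduce to a few lines of arithmetic. The sketch below (with arbitrary velocity values) recovers n either from the translation-to-looming ratio or from the edge-velocity ratio:

```python
def n_from_translation_and_looming(phi_dot, theta_dot):
    """Equation (5): n ~ translational image speed / rate of image expansion."""
    return phi_dot / theta_dot

def n_from_edge_velocities(V1, V2):
    """Equation (6): n from the asymmetry of image expansion,
    n = (1 + V1/V2) / (1 - V1/V2)."""
    ratio = V1 / V2
    return (1 + ratio) / (1 - ratio)

print(n_from_edge_velocities(-1.0, 1.0))   # 0.0: symmetrical expansion, direct hit on the eye
print(n_from_edge_velocities(0.0, 1.0))    # 1.0: one edge stationary, grazing trajectory
print(n_from_edge_velocities(0.5, 1.0))    # 3.0: edges move the same way, misses by 3 radii
```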

Avoidance behavior in response to looming of shadows cast on a screen has been demonstrated in a variety of species, including crabs, chicks, monkeys, and humans (Schiff et al. 1962; Schiff 1965). There is some indication that animals can detect impact direction from the degree of asymmetry in the looming shadow pattern. There has been some debate about whether avoidance responses to looming are innate or develop only after the experiential coupling of looming and impact (Section 7.4.1b).

Regan and Kaushal (1993) measured the ability of human subjects to discriminate differences in impact direction of a simulated approaching rectangular object viewed monocularly. They randomly varied the duration and speed of the target so that subjects were forced to use one of the ratios defined above. Subjects could discriminate differences in impact direction to better than 0.1˚. Regan and Kaushal concluded that the visual system contains a monocular mechanism sensitive to one of the two ratios, independent of the location of the object or of the lateral component of its motion.

The discrimination threshold for impact direction of an approaching object was similar for trajectories lying in horizontal, vertical, and oblique planes, which project onto horizontal, vertical, and oblique retinal meridians, respectively. The threshold did not vary significantly as a function of impact direction within a given plane. Therefore, the discrimination function provided no evidence of multiple channels tuned to impact direction. We will see in what follows that multiple channels have been revealed in the binocular detection of impact direction.

31.2.3 MOTION DIRECTION FROM OPTIC FLOW

Linear motion of an observer in a 3-D scene creates the pattern of optic flow shown in Figure 31.6 (Gibson 1966). Optic flow is defined with respect to lines of sight at the nodal point of the eye—it is the motion of the optic array, not the motion of the image. For an observer moving at

Figure 31.6. Optic flow produced by linear motion. (Adapted from Gibson 1966)

linear velocity v, the line of sight of an object at distance r rotates round the nodal point with an angular velocity ω given by:

ω = v sin φ / r    (7)

where φ is the angle between the radius containing the object and the axis of motion. (A numerical sketch of this relation follows the list below.)

Consider the special case of an observer moving toward a frontal textured surface with one eye fixated on the central point. The velocity of image points on the retina increases linearly along each radius from zero at the fovea. If the observer knows that the surface is rigid, the rate of image expansion provides unambiguous information about the rate of approach. The center of expansion of the optic flow indicates the location on the surface where impact will occur—the point of impact (Gibson 1950a, 1958). James Gibson stressed the importance of optic flow for a pilot landing an aircraft (Gibson et al. 1955). The point of impact is often referred to as the heading direction. However, if heading direction is defined as the direction along which the observer is moving relative to bodycentric coordinates, it is not fully specified by the point of impact. Heading direction in this sense requires information about the position of the eyes in the head and of the head on the body (see Telford et al. 1995).

Image looming has five important features.

1. Rate of looming The rate of image looming is inversely proportional to the distance of the surface that one is approaching. When the rate of looming is greater than that specified by the perceived distance of the surface, the surface is perceived as growing in size.

2. Focus of looming The focus of optic flow produced by forward motion within a display of fixed objects indicates the point of impact. It is independent of the rotation of the eye. If the eye pursues one of the radially moving points in the optic array, the image of that point remains stationary on the retina and therefore forms the focus of the looming image. In this case, the focus of expansion of the optic array moves over the retina. Assume that the head has not turned and that the center of eye rotation and the nodal point coincide. A translatory component due to eye rotation has been added to the looming component due to forward motion. This shifts the focus of image expansion away from the focus of optic flow (the point of impact). It may even remove the focus of optic flow from the field of view (Koenderink and van Doorn 1981; Regan and Beverley 1982). Longuet-Higgins and Prazdny (1980) proposed that heading direction could be recovered from the retinal image if the visual system decomposed the translatory and expansion components of the retinal flow pattern.



Models of such a mechanism have been developed by Rieger and Lawton (1985), Hildreth (1992), and Perrone (1992). Royden (2002) developed a model that is compatible with the properties of motion-sensitive cells in MT. Detection of the direction of an asymmetrical expansion gradient is masked when superimposed orthogonal or opposite translation reaches a certain value (Te Pas et al. 1997). Heading judgments based on the pattern of optic flow in a background display of random dots are biased in the direction of movement of a superimposed surface, but only when it crosses the direction of heading (Royden and Hildreth 1996). The decomposition of the looming and translatory components of optic flow could also be achieved if motion of the eyes, or head and eyes, were taken into account. For data on this ability, see Warren et al. (1988) and van den Berg (1992).

3. Pattern of looming This is defined by the gradient of motion velocity along each radius. A departure from the appropriate pattern of looming produced by motion toward a frontal surface indicates that the surface is not rigid. For example, points moving at constant speed from the center of a frontal array at a fixed distance create the impression of an approaching nonrigid surface (De Bruyn and Orban 1990). Asymmetry of looming with a stationary eye indicates that an approached surface is not frontal. Prazdny (1980) analyzed the velocity vector field produced by self-motion in a world of stationary planes. Under constraints of object rigidity, patterns of smooth motion can specify the spatial layout of the surfaces.

4. Flow and spatial-scale components Looming has an optic flow component and a changing spatial-scale component. The optic flow component may be isolated by moving forward with fixed gaze through an infinite array of points distributed randomly in 3-D space. The images of points at each distance loom at a rate inversely proportional to their distance. However, the overall image remains self-similar—there is optic flow but no change in spatial scale. Motion toward an ideal fractal surface also produces flow with no change in scale. By definition, a fractal surface is one that is self-similar at all spatial scales. The spatial-scale component of looming may be isolated by moving toward a computer-generated frontal surface in which the texture elements change from frame to frame. The size of the display increases, but optic flow is mostly eliminated. Schrater et al. (2001) found that this type of display produced a strong percept of expansion that, on average, was similar to that produced by displays containing optic flow but minimal size change. Also, exposure to the


changing-size display produced an appropriate motion aftereffect. The authors did not mention whether the changing-size display evoked sensations of self-motion-in-depth.

5. Locus of zero parallax When an observer moves through a fixed array of objects at different distances, the path of self-motion is specified by the locus of those objects that do not show parallax. This so-called locus of zero parallax is not affected by eye movements and, under the best conditions, reduces to a vernier acuity task (Regan and Beverley 1984; Cutting 1986). If an observer knows the direction of gaze, the locus of zero parallax indicates headcentric heading. Judgments of heading are much less affected by stimulus noise when stereoscopic depth is added to a display of random dots that simulates the effects of forward self-motion (van den Berg and Brenner 1994).

Exposure to optic flow can induce aftereffects. This topic is discussed in Section 31.6. Optic flow can also induce distortions in the perceived motion, size, or distance of a stationary stimulus. This topic is discussed in Section 31.7. The use of optic flow for judging the distance of travel of self-motion is discussed in Section 34.4.1b.
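As a numerical illustration of equation (7) (the observer speed and object distance are arbitrary assumed values), optic flow vanishes along the heading direction, the same point picked out by the locus of zero parallax in item 5:

```python
import numpy as np

def flow_angular_velocity(v, r, phi_deg):
    """Equation (7): angular velocity of an object's line of sight,
    w = v sin(phi) / r, for an observer translating at speed v past an
    object at distance r and eccentricity phi from the heading axis."""
    return v * np.sin(np.radians(phi_deg)) / r

for phi in (0, 10, 45, 90):
    w = flow_angular_velocity(v=1.4, r=5.0, phi_deg=phi)
    print(f"phi = {phi:2d} deg  flow = {np.degrees(w):.2f} deg/s")
# Flow vanishes at phi = 0 (the heading point, the focus of expansion) and,
# for objects at a common distance, is greatest 90 deg from the heading.
```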
31.2.4 BINOCULAR CUES TO MOTION DIRECTION
The lines and angles that specify the direction of approach of an object P with respect to the two eyes are shown in Figure 31.7. The object lies on a cyclopean line of sight passing through a point midway between the eyes. Its cyclopean azimuth, φ, is the angle between the cyclopean line and the median plane of the head. The impact direction, β, of the point is the angle between the cyclopean line and the trajectory of the point (line AB).

Consider a pointlike object devoid of monocular cues to its distance, approaching the head at constant velocity within the horizontal plane of regard. When the eyes remain perfectly converged on the object, the angle of vergence indicates its distance, and the cyclopean azimuth of gaze indicates its azimuth. Cyclopean azimuth is the mean of the version angles of the two eyes. The impact direction of the object is indicated by the ratio of the vergence and version components of the tracking eye movement. The movement of the eyes is a pure vergence when the tracked object moves along a hyperbola of Hillebrand (locus of isoversion). Beyond a viewing distance of about 20 cm, and for trajectories not too far from the median plane of the head, hyperbolas of Hillebrand may be regarded as straight lines converging on a point midway between the eyes. The movement of the eyes is a pure version when the fixated object moves along an isovergence locus, which is roughly equivalent to an isodisparity locus (see Figure 10.10). The role of tracking eye movements in judging the trajectories

Figure 31.7. Motion-in-depth of a binocularly viewed object. The trajectory azimuth of point P as it approaches along trajectory AB is the angle φ between the cyclopean line on which P lies and the median plane of the head. The cyclopean impact direction of P at a given instant is the angle β between the trajectory and the cyclopean line that P is traversing. For any φ and distance of P from the cyclopean point, β specifies the point of impact on a plane orthogonal to the cyclopean line and on the frontal plane of the head. The angle α, which P's trajectory makes with the median plane, equals β + φ.

of approaching objects does not seem to have been investigated.

When the gaze is fixed on a stationary object while a second object moves toward the head, the relative motion of the images of the moving object in the two eyes varies with the impact direction of the object. Consider an object moving within the horizontal plane of regard. When it moves along a hyperbola of Hillebrand, the images in the two eyes move symmetrically outward when the object approaches and symmetrically inward when it recedes. The ratio of image velocities is 1:1 in antiphase. For an object approaching in the median plane, image velocities are the same and signify that the object will impact a point midway between the eyes. When the object moves along any visual line of one eye, the image in that eye does not move and the ratio of velocities is 1:0. When the object moves along a path that misses the head, the images move in the same direction, or in phase. The limiting in-phase motion is along the horizontal horopter, when the ratio of movements is 1:1 in phase. These relationships are depicted in Figure 31.8. They hold wherever the eyes are converged. Thus, at any instant, the relative motion of the images of an approaching object provides unambiguous information about the object's impact direction relative to both the hyperbola of Hillebrand and the visual lines it is traversing at that instant.





Figure 31.8. Impact direction and relative image velocity. Speed and direction of image motion in the left eye relative to that in the right eye is related to the impact direction of an object approaching in the plane of regard. The ratio of image velocities in the left and right eyes is shown under each diagram (ratios of +1:1, +1:2, +2:1, 1:0, 0:1, –1:1, –2:1, and –1:2 are illustrated). Positive signs indicate movements in the same direction in the two eyes. Negative signs indicate movements in opposite directions, and that the object will hit between the eyes. (Redrawn from Beverley and Regan 1975)

The direction, β, of an object's trajectory relative to the cyclopean line of sight on which the object lies is given by:

tan β = { I [ (dφL/dt)/(dφR/dt) + 1 ] } / { 2D [ (dφL/dt)/(dφR/dt) − 1 ] }    (8)

where dφL/dt and dφR/dt are the angular velocities of the left and right images, D is the distance of the object, and I is the interocular distance (Regan 1993). This formula is accurate only for objects on or near the median plane of the head.

Relative image motion does not indicate the headcentric azimuth of an object, since the same relative motion may be produced whichever cyclopean visual line the object traverses. In other words, relative image motion does not indicate the headcentric trajectory of the approaching object; it indicates only the angle of approach relative to a cyclopean line of sight, which is the impact direction. Azimuth is indicated by the mean retinal eccentricity of the images when the eyes are symmetrically converged on a stationary point in the midline, or by the mean angle of gaze when the eyes are converged on the approaching object. Over all positions of vergence, headcentric azimuth is the algebraic sum of retinal eccentricity and angle of gaze. The direction of an object's approach in terms of headcentric elevation is indicated by the vertical eccentricity of the image or of gaze.

The impact direction of an approaching object is also indicated by the ratio between the translational image velocity and the rate of change of binocular disparity between the images of the object. When an object moves along the horopter there is no change in disparity, and when it approaches along a hyperbola of Hillebrand there is maximum change of disparity. The angle of approach, β, relative to the cyclopean line of sight of the object is given by:

tan β = I (dφ/dt) / [ D (dγ/dt) ]    (9)

where dφ/dt is the translation speed of the image and dγ/dt is the rate of change of disparity (Regan 1993). Equations (8) and (9) are mathematically equivalent. Furthermore, in natural conditions, changing disparity covaries with relative image motion, so it is not possible to say which cue is being used. However, the two cues have distinct physiological implications. Equation (8) implies that impact direction is derived from relative motion, while equation (9) implies that it is derived from changing disparity. The differential motion of binocular images may be called the difference-of-motion signal, as opposed to the change-of-disparity signal. It is a question of which attribute is processed first. We will see in what follows that there is evidence that both signals are used.
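The equivalence of equations (8) and (9) is easy to verify numerically, assuming that the cyclopean translation speed is the mean of the two monocular image velocities and that the rate of change of disparity is their difference (all values below are illustrative):

```python
I, D = 0.065, 2.0                  # interocular distance and object distance (m)
phiL_dot, phiR_dot = 0.03, 0.01    # angular velocities of the two images (rad/s)

ratio = phiL_dot / phiR_dot
tan_beta_eq8 = I * (ratio + 1) / (2 * D * (ratio - 1))

phi_dot = (phiL_dot + phiR_dot) / 2    # cyclopean translation speed
gamma_dot = phiL_dot - phiR_dot        # rate of change of disparity
tan_beta_eq9 = I * phi_dot / (D * gamma_dot)

print(tan_beta_eq8, tan_beta_eq9)      # identical values: the two cues are equivalent
```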
31.2.5 SENSITIVITY TO DIRECTION OF BINOCULAR MOTION
Beverley and Regan (1975) measured subjects’ ability to use changing binocular disparity to discriminate changes in impact direction of a target moving in depth. Subjects converged in the plane of a textured surface containing nonius lines. A pair of dichoptically superimposed vertical bars was placed slightly to one side of the fixation point. Oscillation of the bars in antiphase from side to side created the impression of a single bar oscillating in depth. When the amplitude of oscillation was the same in the two eyes, the bar appeared to move in the median plane of the head. The apparent impact direction could be changed by altering the relative amplitudes of side-to-side motion. A motion-in-depth sensitivity curve was obtained by measuring the threshold amplitude of stimulus oscillation required for the detection of motion-in-depth (Figure 31.9A). Beverley and Regan obtained a direction-discrimination curve by measuring the threshold change in relative amplitude of oscillation required for detection of a change in the direction of motion-in-depth. Figure 31.9B shows three peaks in the direction-discrimination curve—for trajectories directed to the left eye, the right eye, and at a point midway between the eyes. This suggests that tuning functions of distinct motion-in-depth channels overlap at these points. The point of overlap is where the relative activity in neighboring channels changes most rapidly as motion direction changes (Section 4.2.7). Figure 31.9C shows a hypothetical set of tuning functions derived from experiments on adaptation of motion-in-depth.


Figure 31.9. Sensitivity for motion-in-depth. (A) Smallest detectable amplitude of motion-in-depth of a bar (in arcmin) as a function of direction of motion. (B) Sensitivity to changes in impact direction (threshold in deg) as a function of motion direction relative to the median plane. Results for one subject. (C) Sensitivity functions of four hypothetical mechanisms, each tuned to a different range of impact directions. The abscissa shows direction of motion (deg, from the left eye through the nose to the right eye) and the corresponding motion ratio. Bars are SDs. (Redrawn from Beverley and Regan 1975)

Heuer (1993b) created motion-in-depth by changing the disparity of the images of an isolated object. He confirmed that impact directions are discriminated best for motions aimed at a point midway between the eyes, but did not find peaks in the discrimination function for trajectories directed at each eye. However, the variability of his data was too great to reveal these secondary peaks.

Portfors-Yeomans and Regan (1996) compared discrimination thresholds for impact direction of a square in a dynamic random-dot stereogram (DRDS) with those for a noncyclopean square. The noncyclopean square was created by making the surrounding dots static. The squares subtended 0.75˚ at the center of a circular disk of diameter 8.5˚. A reference square started at a point straight ahead with a crossed disparity of 5 arcmin and moved toward the subject at a constant velocity for one second. Subjects indicated whether a subsequently presented test square was moving wider of the head than the reference stimulus. The discrimination threshold rose as the reference target moved more obliquely with respect to a point midway between the eyes. Although direction discrimination for the noncyclopean motion was better than that for the cyclopean motion, the difference was not great. By randomly combining different stimulus parameters, Portfors-Yeomans and Regan established that subjects could base their judgments of impact direction on the ratio between the translational image velocity and the rate of change of disparity between the images of the cyclopean shape, as specified in Section 31.2.4.

Portfors-Yeomans and Regan (1997b) approached the same issue in a different way. A small square, visible to each eye, was seen against a random-dot background. The disparity of the images of the square was varied to simulate motion-in-depth. When the impact direction of motion changed within the horizontal meridian, there was a change in both the relative motion of binocular images and the ratio of lateral motion to change of disparity. When the impact direction changed within the vertical meridian, the relative motion of binocular images was the same for all impact directions. The only cue to impact direction was the ratio of motion within the frontal plane to change of disparity. The direction-discrimination threshold was the same for changes in the two meridians, and it was concluded that the ratio of lateral motion to change of disparity is a good cue when it alone is available.

31.2.6 ACCURACY OF BINOCULAR JUDGMENTS OF MOTION DIRECTION

Harris and Dean (2003) asked subjects to estimate the horizontal angle of approach of a small, binocularly viewed light with respect to a stationary fixation light in otherwise dark surroundings (Portrait Figure 31.10). The fixation light was straight ahead, and the moving light was just below it. Subjects either drew the plan view of the approach angle on paper or indicated the angle by setting an arrow. The moving target was seen for 1 s, and its angle of approach was varied between 6.9˚ to the left and 6.9˚ to the right by changing the lateral extent of its motion. Motion toward the nose was accurately detected, but other angles of approach were overestimated. These errors could be caused by (a) underestimation of the distance of the stimuli, (b) underestimation of the in-depth component of the motion, as indicated by changing disparity, or (c) overestimation of the lateral motion of the target. Constant errors also occurred in the task of estimating where in the plane of the face the target would have impacted after the object had disappeared. Performance of this task does not require information about the distance of the target. The constant error did not vary when the approach angle was varied by changing the depth component (changing disparity) of the motion rather than the lateral component. Harris and Dean concluded that subjects ignored the depth information based on binocular disparity and based their judgments on the lateral component.





Figure 31.10. Julie Harris. Born in Wolverhampton, England, in 1967. She obtained a B.Sc. in physics from Imperial College, London, in 1988 and a Ph.D. in physiology with Andrew Parker from Oxford University in 1992. Between 1995 and 2004 she held academic appointments in psychology at the Universities of Edinburgh and Newcastle. In 2004 she was appointed professor of psychology at the University of Saint Andrews, Scotland.

Gray et al. (2006) used a computer-generated display that simulated a small sphere approaching along each of several straight trajectories, and two fixed reference lights above and below the sphere. The approach of the sphere was simulated by increasing its size, by changing its disparity, or by both cues. Subjects indicated the point of impact of the sphere with reference to a set of LEDs just in front of the face. They also moved a hand to catch the approaching sphere. Both tasks revealed that angles of approach were overestimated. However, the mean error was less than that reported by Harris and Dean. In the perceptual task subjects performed equally well in all cue conditions, but in the catching task they performed better with both cues present. Subjects varied in their relative sensitivity to the looming cue and the changing-disparity cue. The important point is that these subjects, unlike Harris and Dean's subjects, used disparity information. Overreaching was reduced to zero after subjects had performed the reaching task with auditory feedback, and to near zero for the perceptual task. This accords with the fact that well-trained players of ball games have no difficulty catching or hitting balls.

The accuracy of judgments of impact direction is affected by the azimuth direction of approach of an object. Neppi-Mòdona et al. (2004) moved an LED in otherwise dark surroundings toward the face. The light was switched off halfway, and subjects reported which side of the face the LED would hit. Judgments were most accurate when the light originated from a straight-ahead location. With head and eyes directed straight ahead, lights approaching from the left side produced a leftward bias and those approaching from the right produced a rightward bias. The same biases occurred when the head was straight but the eyes looked in the direction of the approaching light.

The above experiments were performed in severely reduced cue conditions, with nothing in view but the moving object and one or two stationary lights. Further experiments are required to determine whether similar constant errors arise in richer, more natural visual surroundings.

31.3 DETECTING MOTION-IN-DEPTH



31.3.1 INFORMATION FOR MOTION-IN-DEPTH

The following sources of information are available with monocular viewing.

1. Image looming The angular size, θ, of the image of an object of size x at distance D is θ = arctan(x/D). For a small object, image size is inversely proportional to distance (see the sketch after this list). In looming, distances between the images of objects also change. Looming also brings finer detail into view. Detection of looming is discussed in Section 31.8.1.

2. Looming parallax This is differential looming between near and far parts of an object. It occurs because the rate of looming is greater for near parts of an object than for more distant parts.

3. Changing perspective Perspective over a slanted or inclined surface changes as the surface moves in depth along a path not parallel to a line of sight, as shown in Figure 26.11.

4. Changing occlusion A far object becomes progressively occluded by a nearer approaching object. Continuous addition of new disks on top of a stack of disks, as shown in Figure 31.11, created an impression of forward movement (Engel et al. 2006). As the rate of disk addition increased, the apparent speed of motion increased. Viewing this display for 3 minutes caused a subsequently viewed stimulus with ambiguous direction of motion to appear to move in the opposite direction.

5. Lens accommodation This was discussed in Section 25.1.
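A short numerical sketch of cue 1 (the object size and distances are arbitrary values):

```python
import numpy as np

# Image size theta = arctan(x / D) for an object of size x at distance D.
x = 0.5                                    # object size (m)
for D in (2.0, 4.0, 8.0):
    theta = np.degrees(np.arctan(x / D))
    print(f"D = {D:.0f} m  image size = {theta:.2f} deg")
# Image size roughly doubles each time distance halves, so a steady approach
# produces accelerating image expansion (looming).
```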



Figure 31.11. Motion-in-depth from successive occlusions. Addition of disks to the top of the stack creates an impression of forward motion. (Adapted from Engel et al. 2006)

The following sources of information require binocular viewing.

6. Absolute stimulus disparity This partitions into changes in vergence induced by changes in stimulus disparity and residual changes in image disparity due to incomplete vergence.

7. The horizontal gradient of disparity This increases as a surface approaches.

8. Vertical disparities Vertical disparities in the oblique quadrants of the field of view increase as a surface approaches. Vertical disparities provide an effective cue to distance for a surface that subtends more than about 20˚ (Section 20.6.3c).

9. Relative internal disparity This is the disparity between different parts of a 3-D object. Relative disparity from a given depth interval varies inversely with the square of viewing distance. This indicates motion-in-depth if the object is rigid.

10. Relative external disparity This is the relative disparity between the moving object and stationary objects. It is a tangent function of distance.

11. Monocular zones Monocular zones in each eye produced by two objects at different distances change as one object moves or both objects move in depth. This provides unambiguous information about motion-in-depth only if the distance between the objects is known.

31.3.2 MOTION-IN-DEPTH WITH NO STATIONARY REFERENCE

The motion of an isolated object in a frontal plane is not detected at velocities of less than about 5˚/s. Sensitivity to frontal motion is greatly improved by the addition of a stationary object (Howard and Howard 1994). In a similar way, sensitivity to motion-in-depth of an isolated stimulus is greatly improved by the addition of a stationary stimulus. Thus, Westheimer (1990) found that a single vertical line exposed for 250 ms had to change in disparity at a velocity of between 25 and 45 arcmin/s, producing an excursion of between 6 and 10 arcmin of disparity, before its direction of motion-in-depth could be detected. Sensitivity improved considerably when the line was flanked by two stationary lines. With such a brief exposure, there would be no time for vergence changes.

When the eyes remain converged on an approaching object, the disparity of its images remains zero. However, when the eyes track an object moving in depth, the images of stationary objects have a changing disparity, which could be used to detect motion-in-depth and its direction. If the eyes converge in response to an approaching object but at too slow a rate, the changing disparity is partitioned between the moving object and the stationary background according to the gain of convergence. However, the total relative change of disparity between object and background is constant over changes in vergence. It therefore provides accurate information about the impact direction and speed of approach.

When the eyes remain converged on a single object and the size of the image is kept constant, convergence movements of the eyes provide the only information about motion-in-depth. Regan et al. (1986a) found that some motion-in-depth was perceived when the disparity between the images of an isolated object was changed. However, thresholds for detection of motion-in-depth were two to seven times higher than those obtained with a stationary comparison object in view. There were no sensations of motion-in-depth when changes in disparity were eliminated by making vergence open-loop. Presumably, therefore, the perceived motion in the closed-loop condition was due to relative motion or changing disparity of the retinal images. Harris (2006), also, found that thresholds for detection of motion-in-depth of an isolated stereoscopic stimulus around a mean viewing distance of 1 m were very high. Any perceived motion-in-depth of an isolated object of constant size in a stereoscope must arise from changing convergence and/or from changing disparity resulting from a failure of vergence to keep the images in binocular register.

With a large array of dots in a frontal plane, an overall change in horizontal disparity does not give rise to a sensation of motion-in-depth, whether or not the eyes remain





converged on the stimulus (Gogel 1965; Nelson 1977; Erkelens and Collewijn 1985; Regan et al. 1986a). Erkelens and Collewijn's subjects observed a 30˚ by 30˚ random-dot stereogram while both images moved as a whole from side to side in antiphase, at frequencies between 0.125 and 1.2 Hz and amplitudes up to 3˚. The rest of the visual field was black. The motion approximately simulated the changes in horizontal disparity in a surface moving toward and away from the observer. The fused image did not appear to move in depth, but the display appeared to change slightly in size. The relative depth of a region within the stereogram remained visible. When an in-phase component of motion was added to the antiphase motion, subjects could see the side-to-side motion but not motion-in-depth. The random-dot display appeared to move in depth when a stationary vertical grating was superimposed on it, thus providing a changing relative-disparity signal.

Vergence movements had a gain of less than 100%. There was therefore some residual change in image disparity in the random-dot display. Even though the display did not appear to move in depth, changes in disparity must have been detected to drive the changes in vergence required to keep the images fused. Also, the changing vergence and/or disparity must have been detected to create the apparent change in size of the display. Using the same apparatus, Regan et al. (1986a) saw no movement in depth of a random-dot display but obtained a weak impression of motion-in-depth by changing the disparity of a single point. The conclusion from both these studies was that neither changes in overall disparity nor changes in vergence contribute to sensations of motion-in-depth.

We saw in Section 25.2 that the distance of a stationary object can be judged on the basis of vergence alone. So why was motion-in-depth not produced by changing vergence? It will now be shown that this is because other cues indicated that the display was not moving in depth. Motion-in-depth of a large textured surface is accompanied by changes in:

1. Image looming
2. Lens accommodation
3. Absolute stimulus disparity
4. The horizontal gradient of disparity
5. Vertical disparities

None of these changes occurs in a disparity-modulated random-dot display. Their absence indicates that the display is not moving in depth. The changing absolute disparity is simply overwhelmed by the conflicting information that these other cues provide. For a small isolated dot, there is only a lack of looming and accommodation. But we have already seen that looming is a weak cue for a small dot.
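The invariance noted above, that the object-minus-background change of disparity is independent of vergence gain, can be illustrated with a sketch (the disparity step and gain values are arbitrary):

```python
# Partition of a disparity step between a tracked object and a stationary
# background for different vergence gains.
disparity_step = 1.0                      # change in the object's absolute disparity (deg)
for gain in (0.0, 0.6, 1.0):
    vergence_change = gain * disparity_step
    object_residual = disparity_step - vergence_change   # left on the object's images
    background_change = -vergence_change                 # imposed on the background
    relative_change = object_residual - background_change
    print(f"gain {gain:.1f}: object {object_residual:+.1f}, "
          f"background {background_change:+.1f}, relative {relative_change:+.1f}")
# The object-minus-background change always equals the full step, whatever the
# vergence gain, so relative disparity signals motion-in-depth reliably.
```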



That would explain why a disparity-modulated dot appears to move in depth.

It is often falsely assumed that the effects of a cue are eliminated when it has a value of zero. A cue to motion-in-depth that does not change has a value of zero and indicates that there is no motion-in-depth. A cue is eliminated not by setting it to zero but by rendering it ineffective. For example, a radial pattern moving in depth beyond a fixed aperture, like that in Figure 31.12, renders looming ineffective. This is because all the motion is along radial lines, so that the image remains the same at all distances.

Howard (2008) investigated whether modulations of absolute disparity produce motion-in-depth of a radial pattern that lacks looming and of a single dot that provides only a weak looming signal. Each of the four displays shown in Figure 31.12 was presented in a mirror stereoscope at a distance of 57 cm. Each display subtended 65˚, and no other objects were visible. The horizontal disparity of each display was modulated sinusoidally through ±2.14˚ at 0.2 Hz. Theoretically, this would move the stimulus 37 cm to-and-fro in depth. In one condition a stationary textured display (the reference display) was superimposed on the disparity-modulated display. Subjects tracked the perceived motion-in-depth of each display with an unseen hand. The mean results for 12 subjects are shown in Figure 31.13.

In agreement with earlier studies, the random-dot display presented alone did not appear to move in depth. Only two of the 12 subjects saw a slight movement. The single dot and the radial display appeared to move some distance for all subjects. Addition of the stationary reference display increased the motion-in-depth of all displays, especially the random-dot display. The stationary display introduced changing relative disparity, which is a strong signal for motion-in-depth.

Welchman et al. (2009) confirmed that changing the absolute disparity of an isolated spot allows subjects to judge the direction of simulated motion-in-depth. They did not measure the magnitude of perceived motion-in-depth. Subjects correctly judged direction both when they tracked the changing disparity with vergence and when they did not track it. Thus modulations of absolute disparity and/or vergence produce motion-in-depth when conflicting information from image looming is weakened or removed. However, the motion is still much less than the theoretical value, probably because one or more of the other signals listed above still indicate that the stimuli are not moving in depth. The results also confirm that movement in depth is enhanced in the presence of a stationary stimulus, which adds the cue of changing relative disparity.

Even with the reference display, the mean perceived movement in depth was only about 10 cm out of a theoretical value of 37 cm. This suggests that strong looming signals are required to produce strong sensations of motion-in-depth. This serves a useful function. Changes in overall disparity are produced by voluntary changes in


Figure 31.12. Stimuli used by Howard (2008): a single spot, random spots, sectors, and the reference display. The dot subtended 2.1˚. The other stimuli subtended 65˚.

vergence or vergence instability, as well as by motion-in-depth (Steinman and Collewijn 1980). It would be confusing if changes in absolute disparity generated motion-in-depth. Confusion is avoided if changes in absolute disparity or vergence generate motion-in-depth only when accompanied by other information, such as looming or changing relative disparity. Looming alone is a strong cue to motion-in-depth (Regan and Kaushal 1993; Regan and Beverley 1984). This is because stationary objects rarely change size continuously. All animals with eyes react strongly and rapidly to looming. Also, change of relative disparity is a strong cue to motion-in-depth because it occurs only when there is motion-in-depth.
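Howard's theoretical 37-cm excursion can be recovered from vergence geometry, assuming symmetric fixation; the exact value depends on the interocular distance assumed, which the description above does not specify:

```python
import numpy as np

def distance_from_vergence(I, vergence):
    """Viewing distance implied by a vergence angle (rad), symmetric fixation."""
    return I / (2 * np.tan(vergence / 2))

D0, amp = 0.57, np.radians(2.14)          # viewing distance (m), disparity amplitude
for I in (0.065, 0.072):                  # assumed interocular distances (m)
    base = 2 * np.arctan(I / (2 * D0))    # vergence demand at 57 cm
    near = distance_from_vergence(I, base + amp)
    far = distance_from_vergence(I, base - amp)
    print(f"I = {100 * I:.1f} cm  excursion = {100 * (far - near):.0f} cm")
# ~42 cm for a 6.5-cm interocular distance and ~37 cm for 7.2 cm, of the
# order of the 37 cm quoted in the text.
```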

31.3.3 DETECTING RELATIVE MOTION-IN-DEPTH

31.3.3a Detection of Motion-in-Depth as a Function of Frequency of Disparity Modulation

Richards (1951) was the first to determine the disparity threshold for detecting relative motion-in-depth as a

function of the temporal frequency of disparity modulation. The threshold for detecting stepwise depth oscillations of a vertical line with respect to two flanking lines rose rapidly as the frequency of disparity oscillation increased above 1 Hz. Depth could not be seen at frequencies of disparity oscillation above about 4 Hz. He did not explore frequencies below 1 Hz.

Tyler (1971) measured sensitivity to changes in depth created by temporal modulation of the disparity of a single line with respect to a stationary line for frequencies from 0.1 to 5 Hz. Figure 31.14 shows that sensitivity to motion-in-depth improved up to a modulation frequency of about 0.5 Hz and declined above about 2 Hz. By comparison, the detection of modulations of luminance contrast extends to 60 Hz.

The sensitivity of observers to sinusoidal modulations of disparity in a dynamic random-dot stereogram peaked at about 2 Hz and showed a high-frequency cutoff at 10.5 Hz (Nienborg et al. 2005). The response of binocular cells in V1 of the monkey to the same stimulus showed a similar peak and cutoff. The cells responded to a much higher frequency of luminance modulation than of





Figure 31.13. Motion-in-depth from changing disparity. Judged amplitude of motion-in-depth of the stimuli shown in Figure 31.12, with the stimuli seen in isolation and in the presence of a stationary textured display. The bars are standard errors. (Redrawn from Howard 2008)

Figure 31.14. Sensitivity to depth modulation. Sensitivity to disparity-induced motion-in-depth as a function of temporal frequency of sinusoidal oscillation of disparity (lower curve). The upper curve shows sensitivity to oscillatory motion of the image in one eye (N = 1). (Redrawn from Tyler 1971)

disparity modulation. Nienborg et al. explained the difference in terms of the time taken to code the cross-correlation of the images in the two eyes.

Regan and Beverley (1973a) measured sensitivity to motion-in-depth produced by temporal modulation of the disparity of a vertical line superimposed on a fixed display of random dots. It can be seen in Figure 31.15 that the attenuation of motion sensitivity at low frequencies of



disparity modulation centered on a 5-arcmin depth pedestal was more severe than the attenuation with modulation centered on zero disparity, especially for uncrossed-disparity pedestals. It can also be seen that, for square-wave disparity modulations, low-frequency attenuation was virtually absent for oscillations centered on zero disparity and was less for oscillations about a 5-arcmin pedestal than for sinusoidal modulations of disparity about a pedestal. The loss of stereoacuity for temporal disparity modulations at frequencies below 1 Hz was more severe for oscillations about a crossed-disparity pedestal than for oscillations about an uncrossed pedestal. For both types of pedestal, low-frequency attenuation was more severe for stepped movements away from and then back toward the fixation plane than for stepped movements toward and then away from the fixation plane. Sensitivity did not depend on the direction of motion with respect to the eyes (Beverley and Regan 1974).

Perception of depth magnitude, also, is affected by temporal modulations of disparity. Richards (1972) measured the magnitude of perceived depth by asking subjects to match the depth of a probe to the maximum apparent depth of a test bar as it moved either stepwise or sinusoidally through the fixation plane. The results for sinusoidal depth modulation are shown in Figure 31.16. For sinusoidal disparity modulations of between 0.5˚ and 2˚, perceived depth peaked for temporal modulations of between about 0.5 and 1 Hz and decreased monotonically above 1 Hz. The low-frequency attenuation of stereo gain, like the attenuation of stereoacuity, was absent for disparity amplitudes of 0.25˚ and 0.125˚ and for square-wave depth modulations. This is consistent with Regan and Beverley's finding that low-frequency attenuation of stereoacuity is virtually absent for square-wave modulations around zero disparity (Figure 31.15). Richards used only crossed disparities in his display. For all amplitudes of depth oscillation, perception of motion-in-depth failed completely at a frequency of about 6 Hz.

Regan and Beverley (1973a) confirmed that suprathreshold stereo gain declines monotonically for depth modulations of between 2.5 and 20 arcmin. They also found that the loss in gain began at lower temporal frequencies for crossed than for uncrossed disparities, providing more evidence that crossed and uncrossed disparities are processed in distinct channels.

Evidence reviewed in Section 18.7 suggests that high spatial-frequency channels process small disparities and low spatial-frequency channels process large disparities. On this basis, one would expect the differential effects of temporal modulation, which Richards found for large and small disparities, to show for stereo displays with low and high spatial-frequency luminance patterns. In conformity with this expectation, the stereo threshold for spatial frequencies below about 2.5 cpd was lower for a 1-Hz depth modulation than for the same static disparity.


Figure 31.15. Sensitivity to temporal modulation of disparity. Threshold for detection of motion-in-depth produced by temporal sinusoidal (left panel) and square-wave (right panel) modulations of disparity, as a function of frequency of disparity modulation. Curves show thresholds for modulations about zero pedestal disparity, a 5-arcmin crossed pedestal, and a 5-arcmin uncrossed pedestal. (Adapted from Regan and Beverley 1973a)

responsive to modulations of 1 Hz and steady-state disparities. Thus, over this range of sinusoidal temporal modulation, the low-frequency system has a transient characteristic and the high-frequency system a sustained characteristic. Perhaps the low spatial-frequency, transient system is the magnocellular system and the high spatialfrequency, sustained system is the parvocellular system (Section 5.8).
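The stimulus regimes compared in these experiments are easy to state explicitly. The following sketch (in Python; all parameter values are illustrative only, not those of any particular study) generates the sinusoidal and square-wave disparity waveforms, modulated about a pedestal, to which Figure 31.15 refers:

    import numpy as np

    def disparity_waveform(t, pedestal_arcmin=5.0, amplitude_arcmin=1.0,
                           freq_hz=0.8, square=False):
        # Instantaneous disparity (arcmin) of a target whose disparity is
        # modulated about a pedestal. The sign convention (positive =
        # crossed) is an assumption made for this sketch.
        carrier = np.sin(2 * np.pi * freq_hz * t)
        if square:
            carrier = np.sign(carrier)  # square-wave modulation
        return pedestal_arcmin + amplitude_arcmin * carrier

    t = np.linspace(0.0, 5.0, 1000)  # 5 s of stimulus time
    sinusoidal = disparity_waveform(t, square=False)
    square_wave = disparity_waveform(t, square=True)

Setting pedestal_arcmin to zero gives modulation about the fixation plane; with a square wave about zero disparity, every transition is between crossed and uncrossed disparities, which is the condition in which low-frequency attenuation was virtually absent.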

Figure 31.16. Apparent depth and depth modulation. Apparent depth (equivalent disparity) of a test bar, as it moved sinusoidally through the fixation plane, as a function of frequency of disparity modulation. The amplitude of disparity modulation (0.125˚, 0.25˚, 0.5˚, or 1˚) is indicated on each curve. (Adapted from Richards 1972)

31.3.3b Impressions Created by Rapid Disparity Alternation


In the preceding studies, it was claimed that the sensitivity to motion-in-depth for both the fine and coarse stereo systems declines to zero at a frequency of depth oscillation of between 4 and 6 Hz. However, this conclusion must be qualified. Regan and Beverley (1973a) found that depth impressions ceased for a bar oscillating sinusoidally in depth at about 6 Hz. However, when the bar oscillated discretely in sudden jumps, perceived depth diminished as the temporal frequency of disparity modulation increased up to about 3.2 Hz. At higher frequencies, the whole appearance of the display changed from a bar oscillating in depth to that of two bars at different distances. The apparent depth between the two bars was not affected by further increases in the frequency of disparity oscillation. Similarly, Norcia and Tyler (1984) found that a dynamic random-dot display filling a monitor screen appeared to move back and forth in depth relative to the monitor frame as the disparity of the dots alternated between –1.25 and +1.25 arcmin.


Above an oscillation frequency of 6 Hz, the impression of one plane moving back and forth gave way to an impression of two pulsating depth planes, one seen through the other. Above 14 Hz, the pulsation ceased and two steady depth planes were seen. Thus, 6 Hz is the upper limit for the perception of alternating motion-in-depth but not for the perception of differences in depth. There is no upper limit for temporal modulations of disparity for the perception of distinct depth planes. When there is rapid alternation between two depth planes, information can be integrated over several brief exposures. To see motion-in-depth, the visual system must keep track of the sequence of rapidly alternating disparities. It is this ability that breaks down at 6 Hz.

31.3.3c Motion-in-Depth Compared with Lateral Motion

Several investigators have compared the detection of lateral motion with the detection of motion-in-depth as a function of temporal frequency. Tyler (1971) found that the threshold for detection of back-and-forth motion of a vertical line in stereoscopic depth relative to a fixation line was three or more times higher than the threshold for to-and-fro lateral displacement of a monocular line (Figure 31.14). He called this effect stereomotion suppression. The effect was about the same for oscillations between 0.1 and 5 Hz. In a later study, the possible contaminating effects of vergence tracking were controlled by having two lines, 20 arcmin apart, move in depth in antiphase (Tyler 1975). Stereoscopic movement suppression occurred at oscillation rates above 0.3 Hz. Below 0.3 Hz, the threshold for movement in depth was similar to that for lateral movement. A similar suppression of antiphase motion occurred between vertically disparate images, which indicates that the inhibition is characteristic of how disparity is used to process binocular fusion rather than depth. The threshold for discriminating between two speeds of motion-in-depth has also been found to be higher than the threshold for discriminating between monocular lateral motions of equivalent stimuli (Brooks and Stone 2006a). Stereomotion suppression was not present for equiluminant red-green gratings (Tyler and Cavanagh 1991). This suggests that stereomotion is processed separately in luminance and chromatic channels.

Regan and Beverley (1973b) dissociated the frequency of the movement of the image in each eye from the frequency of the motion-in-depth produced by these movements. They did this by oscillating the monocular images at slightly different frequencies so that the motion-in-depth changed as the monocular motions came in and out of phase. Thus, motion-in-depth ran at the difference, or beat, frequency of the two monocular oscillations.
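In this arrangement, if the left and right images oscillate at frequencies f_L and f_R, the interocular phase drifts continuously, and the motion-in-depth runs at the beat frequency

    f_{\text{beat}} = \lvert f_L - f_R \rvert.

For example, monocular oscillations at 6.0 and 6.1 Hz would yield motion-in-depth at 0.1 Hz (these particular values are illustrative, not taken from the study).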



Although motion-in-depth was not detected at beat frequencies above about 5 Hz, the monocular motions remained visible up to a frequency of about 25 Hz.

Tyler et al. (1992) measured the threshold for detection of changing depth of a vertical line with sinusoidal depth modulations along its length. They also measured the threshold for detecting lateral oscillation of a monocular wavy line. The stereo threshold and the lateral-motion threshold were determined as a function of temporal frequency up to 3 Hz, and of spatial-modulation frequencies between 0.05 and 1.5 cpd. There were large individual differences but, in general, the threshold for detection of monocular oscillation was lowest at high spatial and high temporal frequencies (Figure 31.17). The disparity threshold for detection of stereo motion was lowest at low temporal frequencies, as the stereomotion suppression effect had revealed, and at medium spatial frequencies, as revealed in experiments reported in Section 18.6.3 (see also White and Odom 1985). The difference functions for stereo and monocular motion thresholds are shown in Figure 31.17C.

Sumnall and Harris (2002) measured displacement thresholds for detection of motion and for detection of the direction of motion of a random-dot display over a wide range of directions in 3-D space. Their results suggest that there are two independent motion-detection mechanisms, one for frontal motion and one for motion-in-depth. Thresholds for motion in the frontal plane were 2 to 4 times lower than those for motion-in-depth within the median plane of the head.

In apparent contradiction to these findings, Regan and Beverley (1973a) found that sensitivity to a 0.1-Hz sinusoidal to-and-fro motion-in-depth of a vertical bar relative to a random-dot background was up to twice that for binocular lateral motion, when the disparity modulations were less than ±10 arcmin from the fixation plane. When depth modulations of the bar were centered on disparity pedestals of more than ±10 arcmin, subjects were more sensitive to sideways motion than to motion-in-depth (see Figure 31.18). The test and comparison stimuli used by Tyler were separated laterally by 20 arcmin, whereas Regan and Beverley's test stimulus was superimposed on a field of closely packed dots.

Tyler (1971) concluded that stereomotion suppression is due to inhibition between neurons sensitive to increasing disparity and those sensitive to decreasing disparity. But what if motion-in-depth were coded by the interocular difference-of-motion signal? In that case, stereomotion suppression would be due to mutual inhibition between detectors for opposite directions of motion. Let us suppose that the difference-of-motion signal becomes more dominant when either the lateral separation or the depth separation of neighboring stimuli is increased. As stimuli come closer together, the change-in-disparity signal becomes dominant.


Figure 31.17. Spatiotemporal features of stereoacuity. (A) Threshold for detection of sinusoidal oscillation of a monocularly viewed wavy line as a function of the spatial and temporal frequency of oscillation. (B) Disparity threshold for detection of peak-to-peak depth modulation of an oscillating wavy line as a function of the spatial and temporal frequency of oscillation. (C) Ratio of stereo threshold to the monocular motion threshold. Points below the base indicate stereo thresholds lower than monocular thresholds; those above indicate higher stereo thresholds. Data (dots) were fitted by cubic spline interpolation. Results for one subject. (Redrawn from Tyler et al. 1992)

If stereomotion suppression were due to inhibition between motion detectors, then it should be more evident with stimuli well spaced laterally or in depth. This would explain why Tyler obtained suppression with well-spaced targets and why Regan found no suppression with closely spaced stimuli unless they were separated in depth. Experiments are needed to test the hypothesis that the relative strengths of the two cues to motion-in-depth depend on the lateral and depth separation of stimuli.


Figure 31.18. Stereoacuity and depth modulation. The red lines indicate sensitivity (reciprocal of threshold) to oscillatory motion-in-depth of a vertical bar at 0.1 Hz, as a function of the pedestal disparity. The blue lines indicate sensitivity to sideways oscillation of the bar as a function of the same variable (N = 1). (Redrawn from Regan and Beverley 1973a)

Harris and Rushton (2003) produced evidence for inhibition between motion detectors by showing that opposed dichoptic motion was more difficult to detect than same-direction motion in random-dot displays that were uncorrelated between the two eyes. These displays lacked the changing-disparity signal. The lateral spacing of their stimuli was about the same as the spacing of Tyler's stimuli.

Evidence already cited suggests that attenuation of stereoacuity at certain temporal frequencies of depth modulation is greater for movements about a crossed-disparity pedestal than for movements about an uncrossed pedestal. Also, stereoacuity attenuation is greater for movements stepped away from and back to the fixation plane than for movements toward and then away from the fixation plane. In all these cases, a motion-in-depth one way is followed by a motion-in-depth the other way.


There are more metameric interactions between disparity detectors for sequential motions in depth in the same direction, since the spatiotemporal tuning functions of the detectors of such motions are more likely to overlap. Low temporal-frequency attenuation is more severe for sinusoidal than for square-wave depth modulations (see Figure 31.15). In sinusoidal modulations, inhibitory interactions occur between sequentially stimulated disparity detectors of the same sign, whereas in square-wave-modulated depth about zero disparity, all the contiguous interactions are between detectors of opposite sign. The characteristics of the system can be derived from sinusoidal inputs only if the system is linear. This question needs further exploration. An experiment is needed in which the threshold and efficiency of motion detection are measured for different temporal sequences of disparity change down or up linear or curved disparity ramps and compared with those for interlaced disparity changes of opposite sign.

31.3.4 ISOLATING THE CHANGE-OF-DISPARITY CUE

The change-of-disparity signal and the difference-of-motion signal to motion-in-depth were described in Section 31.2.4. The change-of-disparity signal can be isolated from the difference-of-motion signal by using spatially correlated but temporally uncorrelated dichoptic displays. This is achieved by a dynamic random-dot stereogram (DRDS) in which the dot patterns in both eyes are changed about 60 times per second. This removes coherent image motion. However, at any instant, the random-dot patterns in the two eyes are the same (apart from disparity), so that observers can detect the changing disparity. Julesz (1971) and Pong et al. (1990) observed movement in depth of a central square in a DRDS.
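The construction of a DRDS can be made concrete with a short sketch (Python/NumPy; image size, dot density, and disparity values are arbitrary illustrations, not those of any particular study). A fresh random pattern is drawn on every frame, so no coherent monocular motion survives, while a central region is shifted in opposite directions in the two eyes to carry the disparity:

    import numpy as np

    def drds_frame(height=200, width=200, disparity_px=2, density=0.5, rng=None):
        # One frame of a dynamic random-dot stereogram. The base pattern
        # is regenerated on every call, so successive frames share no dots.
        rng = rng or np.random.default_rng()
        base = (rng.random((height, width)) < density).astype(np.uint8)
        left, right = base.copy(), base.copy()
        r0, r1 = height // 4, 3 * height // 4   # central square
        c0, c1 = width // 4, 3 * width // 4
        patch = base[r0:r1, c0:c1]
        left[r0:r1, c0 - disparity_px:c1 - disparity_px] = patch   # shift one way
        right[r0:r1, c0 + disparity_px:c1 + disparity_px] = patch  # shift the other
        return left, right

    # Ramping disparity_px from frame to frame conveys motion-in-depth
    # even though no dot is ever repeated between frames.
    frames = [drds_frame(disparity_px=d) for d in (1, 2, 3, 2, 1)]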



Regan (1993) created a DRDS in which the disparity of a central square was alternated continuously through 8 arcmin at 0.5 Hz. The square appeared to oscillate in depth. The square was also oscillated at 0.5 Hz from side to side, in the same direction in both eyes. When the ratio of disparity oscillation to lateral oscillation was varied, the perceived direction of the square's motion-in-depth varied accordingly. Since the random-dot pattern was uncorrelated over time, subjects must have been using the ratio of the square's lateral motion to its changing disparity.

Gray and Regan (1996) measured sensitivity to motion of a central square in a DRDS. The square subtended 2˚ and was presented in the following conditions: (a) the horizontal disparity of the square was modulated sinusoidally at a mean velocity of 0.5˚/s, (b) the square was oscillated from side to side, and (c) the size of the square was modulated sinusoidally about its mean value of 2˚. In another condition, the square was made visible to each eye by keeping the dots in the surround constant. The threshold amplitude of disparity modulation for detection of motion-in-depth of the DRDS square was similar to that for the square seen by both eyes over the 0.25 to 8 Hz range of modulation frequency. In both cases, the threshold was lowest at about 2 Hz. Gray and Regan concluded that changing disparity in the absence of relative motion of monocular images is sufficient to generate motion-in-depth.

Gray and Regan found that oscillating the size of a square in a DRDS produced a weak impression of motion-in-depth. The amplitude threshold was considerably higher than that for motion-in-depth produced by oscillating the size of a monocularly visible square. They suggested that changing the size of a square in a DRDS produces a weaker effect because its boundaries are not as well defined as those of a monocularly visible square. Gray and Regan also found that, for the DRDS square, a large size oscillation of both images was required to cancel the impression of motion-in-depth created by a small oscillation of disparity. For a square that was visible to both eyes, a smaller size oscillation of the square was required to cancel a given disparity oscillation. The fact that the monocular cue of changing size traded against the cue of changing disparity suggests that the two cues converge on a common mechanism at a fairly early level of processing.

For a square in a DRDS, the amplitude threshold for detection of motion-in-depth, produced by antiphase motion of the square areas in the two eyes, was lower than that for detection of side-to-side motion, produced by in-phase motion. The relative insensitivity to frontal-plane motion of an area in a DRDS was also noted by Julesz and Payne (1968) and by Patterson et al. (1992). Frontal-plane motion of an area in a DRDS is signaled only by motion of the boundary, whereas motion-in-depth is signaled by the changing disparity of all the elements in the shape. This issue was discussed in Section 16.5.

Cumming and Parker (1994) devised a stimulus with visible monocular motion opposite in each eye, but in which disparity changes were beyond the 8 Hz temporal resolution of stereopsis. This stimulus did not produce motion-in-depth. In a second random-dot display, the spatial frequency of depth modulation was beyond the spatial resolution of the stereoscopic system but not of the motion system. Motion-in-depth was not seen in this display either. Cumming (1995) found that stereomotion thresholds correlated well with stereoacuity but poorly with detection of motion in a frontal plane. Cumming and Parker concluded that the only effective binocular cue to motion-in-depth is that of changing disparity based on differentiation of the same disparity signal used for static disparity. However, impressions of motion-in-depth based on opposed motion may have been suppressed when the disparity changes were beyond the temporal resolution threshold. Furthermore, although monocular motion was visible in both displays, the motion thresholds were considerably elevated, perhaps to above the threshold for detection of a difference of motion in the dichoptic images. Weak monocular motions may not produce a detectable interocular difference signal.




The same change in disparity may be produced by a distant object moving in depth through a large distance as by a near object moving through a smaller distance. Therefore, detection of the magnitude of motion-in-depth produced by changing disparity requires knowledge of the distance of the object. However, if motion-in-depth is signaled by the rate of change in disparity, the threshold change in disparity for detection of motion-in-depth should be independent of viewing distance. Harris and Sumnall (2000) found that, indeed, the threshold for a single point moving in depth in a 3-D display of stationary dots was independent of viewing distance, with image size held constant across changes in viewing distance. However, this does not prove that motion-in-depth is signaled by changing disparity, because the threshold for detection of motion-in-depth signaled by relative motion should also be independent of viewing distance. Lages et al. (2003) used dichoptic vertical sine-wave gratings of various spatial frequencies moving sinusoidally from side to side at temporal frequencies between 0.1 and 6 Hz. An interocular phase difference caused the fused grating to move sinusoidally in depth within a fixed Gaussian window. The threshold phase difference required for detection of the sign of motion-in-depth was measured as a function of spatial and temporal frequency. They argued that, for a system based on interocular phase differences, the threshold should not vary with spatial or temporal frequency. For a system based on interocular delay, the threshold should vary with temporal but not spatial frequency. For a system based on horizontal disparity, the threshold should vary with spatial but not temporal frequency. Finally, for a system based on velocity differences, the threshold should vary with both spatial and temporal frequency. They found that the phase-difference threshold was a U-shaped function of temporal frequency, with a minimum near 2 Hz. In some, but not all, subjects the threshold increased linearly with spatial frequency. None of the simple predictions accounted for these results. Lages et al. concluded that the system monitors both changing disparity and relative motion of binocular images with limited temporal resolution.
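The geometry behind this point is worth setting out. For small angles, an object at distance D, lying a depth interval \Delta d from the fixation distance and viewed with interocular separation I, produces a relative disparity of approximately

    \delta \approx \frac{I \, \Delta d}{D^{2}},

so an object approaching at speed v = dD/dt generates a rate of disparity change of approximately

    \frac{d\delta}{dt} \approx \frac{I \, v}{D^{2}}.

On this approximation, a given rate of disparity change corresponds to a speed in depth that grows with the square of viewing distance. This is why the magnitude, though not necessarily the detectability, of motion-in-depth specified by changing disparity requires registration of absolute distance.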


31.3.5 ISOLATING THE DIFFERENCE-OF-MOTION CUE

In all the above experiments, the same random-dot display was presented to the two eyes—the displays were spatially correlated. Motion-in-depth can be created by changing the disparity between motion-defined shapes in a random-dot stereogram in which different random-dot patterns are presented to the two eyes. In this case, the dichoptic images are spatially uncorrelated. This effect demonstrates that spatial correlation of fine texture is not required for disparity-defined motion-in-depth. However, the motion-defined forms in these displays are visible in each eye's image. In other words, the signal generating motion-in-depth could be the changing instantaneous disparity between these motion boundaries rather than a pure difference-of-motion signal.

Motion-in-depth is also produced by spatially uncorrelated but temporally correlated displays in which there are no motion-defined boundaries (Howard et al. 1998; Shioiri et al. 2000; Rokers et al. 2008; Allison and Howard 2011). Howard et al. used a 17˚-wide by 3.5˚-high array of white dots randomly distributed at a density of 1.5% on a black background. Distinct displays in the two eyes moved to-and-fro horizontally in opposite directions at various velocities and frequencies. This left the difference-of-motion signal intact but removed any coherent disparity. The boundaries of each image were stationary, so there were no motion-defined boundaries. Chance matches between dots produced an impression of lacy depth, but the mean disparity of randomly matched dots remained centered on zero when the displays moved. However, as the images moved, all randomly paired images underwent a change in disparity of the same sign, even though the mean disparity remained constant. In a comparison display, the dot patterns in the two eyes were identical (spatially correlated). At any instant, disparity between all pairs of correlated images equaled the relative image displacement, while the mean disparity between the uncorrelated images was always zero. Motion-in-depth was not seen in either display presented alone when the rest of the visual field was blank (see Section 31.3.2).
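A sketch of such a display follows (Python/NumPy; the field dimensions follow the description of Howard et al., but the frame rate and dot count are illustrative). The two eyes receive independently generated dot fields, so there is no coherent disparity, yet the fields translate in opposite directions to supply the difference-of-motion signal:

    import numpy as np

    def uncorrelated_motion_frames(n_dots=400, n_frames=120, frame_rate=60.0,
                                   speed_arcmin_s=30.0, width_deg=17.0,
                                   height_deg=3.5, rng=None):
        # Dichoptic dot fields that are spatially uncorrelated (independent
        # positions in the two eyes) but move coherently in opposite
        # horizontal directions. Wrapping at the edges keeps the display
        # boundaries stationary, so no motion-defined boundary is created.
        rng = rng or np.random.default_rng()
        left = rng.random((n_dots, 2)) * (width_deg, height_deg)
        right = rng.random((n_dots, 2)) * (width_deg, height_deg)
        step_deg = (speed_arcmin_s / 60.0) / frame_rate
        for _ in range(n_frames):
            left[:, 0] = (left[:, 0] + step_deg) % width_deg
            right[:, 0] = (right[:, 0] - step_deg) % width_deg
            yield left.copy(), right.copy()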

Figure 31.19. Motion-in-depth from uncorrelated images. White dots were randomly distributed on a black background (density 1.5%). In the figure, the dots in the two eyes are made distinct. In the display, the dots in the two eyes were the same. The patterns in the two eyes were spatially uncorrelated in the test display and correlated in the comparison display. The subject fixated on the central dot. (From Howard et al. 1998)

The two displays were presented one above the other with a fixation point between them (Figure 31.19). The spatially uncorrelated images moved from side to side in counterphase, with all dots moving at the same velocity of between 0.75 and 120 arcmin/s, at a frequency of 0.2, 0.5, or 1 Hz, and a relative displacement amplitude of between 3.75 arcmin and 1˚. The two spatially correlated images moved at the same frequency. Subjects adjusted the velocity of the correlated images until the fused display appeared to move in depth at the same velocity as the uncorrelated display. The motion-in-depth of the two displays was either in phase or in counterphase. This ensured that motion-in-depth of the test display was not due to motion contrast or vergence tracking. In a second condition, the comparison stimulus was a dynamic random-dot stereogram in which the dots were spatially correlated but changed on every frame. In the first condition, subjects could conceivably have matched the velocity of monocular motion of the images rather than the velocity of perceived motion-in-depth. They could not adopt this strategy in the second condition because the comparison stimulus did not contain any coherent monocular motion.

Figure 31.20 shows that the apparent velocity of motion-in-depth of the spatially uncorrelated display with respect to the correlated display was a linear function of the velocity of relative motion of the images. This was true for each of the three frequencies of motion. The velocity of the uncorrelated images was about 10% higher than that of the correlated images when the two displays appeared to move in depth at the same velocity. In other words, the absence of the change-of-disparity signal from the spatially uncorrelated display did not have much effect on the efficiency of the motion-in-depth signal. Changing the dot density of the spatially uncorrelated display from 0.1% to 50% did not change the impression of motion-in-depth significantly. A dot lifetime of about 60 ms was sufficient to create motion-in-depth with spatially uncorrelated images.

Figure 31.20. Motion-in-depth from uncorrelated images. Each graph shows the opposed horizontal motion of a superimposed pair of uncorrelated random-dot images required to produce the same apparent velocity of motion-in-depth as that produced by motion of correlated images. Each superimposed pair of images moved sinusoidally in counterphase at the frequency indicated on each graph (0.2, 0.5, or 1.0 Hz).

These results support the idea that motion-in-depth produced by differential motion of spatially uncorrelated images is due to registration of the relative motion rather than to the changing disparity of randomly paired dots.

Discrimination of a difference in speed of lateral motion deteriorates with increasing retinal eccentricity. If motion-in-depth is coded by the difference-of-motion signal, speed discrimination of motion-in-depth should also deteriorate with increasing eccentricity. Brooks and Mather (2000) found that speed discrimination for lateral motion and that for motion-in-depth were affected in the same way when the stimulus was moved 4˚ from the fovea. Discrimination of stationary depth intervals was not affected. Brooks and Mather concluded that difference-of-motion signals are involved in the detection of motion-in-depth. However, they measured the effect of eccentricity on static disparity but not on changing disparity. Perhaps the perception of changing disparity is degraded in the periphery.

A display moving at constant velocity appears to move more slowly when contrast is reduced (Blakemore and Snowden 1999). Brooks (2001) found the same for motion-in-depth created by opposed motion of stereoscopic random-dot displays. This suggests that motion-in-depth in such a display is at least partially dependent on relative motion of monocular images.

Continued exposure to lateral motion leads to an apparent reduction in velocity. Brooks (2002a, 2002b) showed that simultaneous or sequential adaptation of each eye to a moving random-dot display reduced the perceived velocity of motion-in-depth produced by spatially uncorrelated images.

The ability to discriminate differences in speed and direction-of-motion of two moving displays improved after adaptation to an intermediate velocity (Clifford and Wenderoth 1999). In a similar way, adaptation to frontal motion improved subjects' ability to discriminate differences in speed and direction-of-motion of random-dot displays moving in depth (Fernandez and Farell 2005). No improvement would be expected if motion-in-depth were indicated only by changing disparity. Motion-in-depth was induced by pure interocular differences in velocity produced by motion aftereffects (Fernandez and Farell 2006b).

The speed-discrimination threshold for motion-in-depth produced in a DRDS, which contained only the changing-disparity signal, was 1.7 times higher than that produced in a regular random-dot stereogram, which contained both changing-disparity and difference-of-motion signals (Brooks and Stone 2004). Thus, the difference-of-motion signal supplemented the changing-disparity signal.

Rokers et al. (2011) produced an impression of motion-in-depth from stimuli moving horizontally in opposite directions. They used a novel dichoptic pseudoplaid pattern that precluded any contribution from changing binocular disparity.


All this evidence supports the idea that motion-in-depth may be produced by the relative motion of the images in the two eyes as well as by changing disparity. For this to be possible, the motion signal from each eye must carry an eye-of-origin signature.

31.3.6 DETECTING SPEED OF STEREOMOTION-IN-DEPTH

Harris and Watamaniuk (1995) reported that speed discrimination for motion-in-depth of a central area in a random-dot stereogram was as good as speed discrimination for sideways motion. However, speed discrimination for motion-in-depth in a dynamic random-dot stereogram (DRDS) was very poor. A DRDS contains only changing disparity relative to the stationary surround. The changing relative-motion signal is absent. They concluded that there is no special mechanism for the detection of motion-in-depth based only on changing disparity.

Portfors-Yeomans and Regan (1997a) pointed out that the central area of the DRDS used by Harris and Watamaniuk would be invisible as it passed through the zero-disparity plane. The relative motion signal in the random-dot stimulus would render the square visible through its whole trajectory. They obtained the same results as Harris and Watamaniuk when they repeated the experiment this way. But when the moving area did not pass through zero disparity, speed discrimination in the DRDS was not significantly different from that for motion-in-depth in a regular random-dot stereogram (Portfors-Yeomans and Regan 1996). In all cases, Weber fractions were under 0.2. Furthermore, their subjects could base speed discriminations on variations in the rate of change of disparity while ignoring changes in displacement amplitude, or they could judge displacement amplitude while ignoring changes in speed. They concluded that there is a special mechanism for the detection of motion-in-depth based only on changing disparity.

Evidence already cited indicates that there are two mechanisms for detecting motion-in-depth of cyclopean forms. One relies on the rate of change of disparity, and the other relies on differential image motion. Brooks and Stone (2006b) asked whether the thresholds for discriminating differences in the speed of motion-in-depth are similar for the two mechanisms. They found that, as the height of the stimulus was reduced, the threshold rose much more rapidly for changing-disparity stimuli than for differential-velocity stimuli. They argued that the poor performance of the changing-disparity mechanism arises because it receives its inputs from disparity detectors, which have large receptive fields and therefore coarse spatial resolution. The differential-velocity mechanism receives its inputs from motion detectors, which have higher spatial resolution.

The visual fields of some subjects with otherwise normal vision contained areas specifically blind to motion-in-depth defined by changing disparity (Richards and Regan 1973). These have been called stereomotion scotomata.

Sensitivity to static disparity and to sideways motion is normal in such areas. This topic is reviewed in Section 32.2.3.

An object moving in a frontal plane appears to move more slowly when it is visually pursued than when the eyes remain stationary. This is known as the Aubert-Fleischl illusion. Nefs and Harris (2007) reported that an object moving in depth appears to move about 4% more slowly when it is tracked by vergence eye movements than when the eyes remain fixated on a stationary object. This could be because the velocity of vergence eye movements is underestimated or because the images of the tracked object lose their retinal motion.

Wright and Gurney (1992) found that opposed motion of dichoptic gratings in a fixed circular aperture produced an apparently faster motion-in-depth when the gratings were oblique rather than vertical. They explained this in terms of the way in which the stereoscopic system disambiguates the directions of motion of the monocular images.

In the experiments discussed so far, subjects discriminated differences in speed of motion-in-depth of objects at the same absolute distance. In this case, subjects do not have to register the absolute distance of the stimuli. To compare the speed of motion-in-depth of an object at one distance with that of an object at a different distance, allowance must be made for the difference in distance. The ability to take distance into account in judging relative speed is known as speed constancy (Section 29.5). Rushton and Duke (2009) investigated this issue. Subjects binocularly viewed a laser spot approaching in the midline along an unseen surface just below eye level. A fixed LED provided information about changing relative disparity. Subjects reported which of two sequentially presented laser spots was moving faster. One spot moved from a distance of 1.6 m, the other from a distance of 2.4 m. Subjects were very poor at judging relative speed. The nearer spot had to move more slowly than the far spot to be perceived as moving at the same speed. Judgments of relative speed were reasonably accurate when the surface over which the spot moved and the room were visible. Thus, when there was adequate information about absolute depth, subjects had reasonable speed constancy.

31.3.7 THE FLASH-LAG EFFECT IN DEPTH

It was mentioned in Section 31.1.2d that an object moving in a frontal plane appears to move more slowly when viewed intermittently than when viewed continuously. This is presumably because the motion signal generated by intermittent illumination is weaker than that generated by continuous illumination. In the flash-lag effect, an object flashed on next to an object moving laterally appears displaced in the direction opposite to the motion.


The effect may be observed by rotating a line about its center with the center of the line illuminated continuously and the ends illuminated intermittently. The ends appear to lag behind the center (MacKay 1958; Nijhawan 1994).

There are two basic theories of this effect—the prediction theory and the differential latency theory. According to the prediction theory, it takes time to process a visual stimulus. Therefore, by the time a moving object is registered it will have moved some distance. For example, with a processing time of 100 ms, an object moving at 30˚/s will have moved 3˚ by the time it is registered. For an object moving at constant velocity, the visual system could compensate for processing delay by extrapolating the future position of the object from its velocity. This process would not apply to a flashed object or to a stroboscopic moving object that lacks velocity signals (Nijhawan 1994). According to the differential latency theory, it takes longer to register a flashed stimulus than a moving stimulus. See Ogmen et al. (2004) for a review of theories of the flash-lag effect.
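The arithmetic behind the prediction theory is simply

    \Delta\theta = v \times t_{\text{processing}} = 30^{\circ}/\mathrm{s} \times 0.1\ \mathrm{s} = 3^{\circ}.

On this account, extrapolation cancels the offset \Delta\theta for an object in smooth motion, but cannot do so for a flashed object, which therefore appears to lag by roughly v \times t_{\text{processing}}.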
The flash-lag effect also occurs for objects moving in depth. It causes an approaching object to appear to move more slowly when it is illuminated intermittently rather than continuously (see Section 31.1.2d). Ishii et al. (2004) caused two vertical rods seen in a stereoscope to move back and forth in depth by modulating disparity. A flashed vertical rod placed between the moving rods appeared to lag behind the instantaneous location of the moving rods by 70 to 150 ms. Ishii et al. concluded that the effect occurs after binocular fusion because its magnitude did not vary with viewing distance.

Harris et al. (2006) extended the study of the 3-D lag effect. They produced motion-in-depth in three stereoscopic stimuli: (1) a disk changing in disparity and size, (2) a random-dot display changing in disparity but not size, and (3) a dynamic random-dot display changing in disparity but with no motion signals and no changing size. Motion-in-depth was at an angular speed of 0.47 or 0.93˚/s with respect to a stationary display. Subjects judged the relative depth of the three displays at the instant when the luminance of the stationary display was briefly increased. The lag effect was up to 400 ms for the random-dot display in which disparity conflicted with the absence of changing size. It was about 200 ms for the dynamic random-dot display in which the absence of changing size would be masked by the lack of motion signals. It was less than 200 ms for the disk in which disparity and changing size agreed. Harris et al. concluded that the lag effect is reduced as information about motion-in-depth is increased. This is the opposite of what one would expect from the theory that the lag effect is due to extrapolation of the position of the moving stimulus.

A flash-lag effect in depth should cause a flashed object to lag behind a steadily illuminated object that appears to move in depth.


Through the operation of size-distance invariance, one might expect the flashed object to appear larger when seen against an approaching background than when seen against a receding background. However, Lee et al. (2008) found the opposite. A flashed object was seen against a random-dot background that was modulated in disparity so that it appeared to move back and forth in depth. The flashed object appeared smaller when the background appeared to approach and larger when it appeared to recede. This occurred even though the flashed object appeared to lag behind the background. Lee et al. concluded that the flashed object appears small against the approaching background because it behaves like an afterimage, which appears smaller when seen against a nearer surface than when seen against a far surface.

31.4 CUE INTERACTIONS

31.4.1 INTERACTION OF MONOCULAR AND BINOCULAR CUES

Since monocular and binocular cues evoke the same sensation of an approaching object, they presumably converge onto the same neural mechanism. Regan and Beverley (1979b) proposed that the two cues combine according to a simple weighted-sum model. They reported that a motion-in-depth sensation produced by changing size could be canceled by an opposed change in relative disparity. Furthermore, a motion-in-depth aftereffect induced by prolonged inspection of a stimulus changing in size according to a ramp function could be nulled by a change in disparity. Changing disparity became more effective relative to changing size as the velocity of motion-in-depth increased or as exposure time increased. The relative effectiveness of the two cues varied widely between subjects. Heuer (1987) confirmed that when the two cues have the same sign they combine by simple summation. However, when one cue signaled an approaching object and the other a receding object, they rivaled rather than combined, with now one and then the other cue dominating.

Brenner et al. (1996) presented for 1 s a computer-simulated polyhedral target moving at 21.6 cm/s toward the subject from 60 cm to 38 cm with reference to a stationary textured background. Subjects adjusted the velocity of a laterally moving comparison object to match the approach velocity of the test target. The change in vergence required to maintain fixation on the target, image size, and disparity relative to the stationary background were all changed appropriately, or only one or two cues were changed. The results are shown in Figure 31.21. Keeping any one cue constant, especially image size, reduced perceived velocity of approach below the value when all cues changed. In the single-cue conditions, changing size produced the highest perceived velocity, changing relative disparity produced less perceived velocity, and changing vergence produced no motion-in-depth. Subjects differed in the weights they assigned to the different cues.
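The weighted-sum proposal can be written as a one-line model. A minimal sketch follows (Python; the weights shown are hypothetical placeholders, since the text notes that they varied widely between subjects):

    def perceived_approach_speed(v_size, v_disparity, v_vergence,
                                 w_size=0.6, w_disparity=0.3, w_vergence=0.1):
        # Weighted sum of the approach speeds signaled by each cue alone.
        # The weights are illustrative and assumed here to sum to 1.
        return (w_size * v_size + w_disparity * v_disparity
                + w_vergence * v_vergence)

    # All cues consistent with approach at 21.6 cm/s:
    print(perceived_approach_speed(21.6, 21.6, 21.6))  # ~21.6
    # Image size held constant, so that cue signals no motion:
    print(perceived_approach_speed(0.0, 21.6, 21.6))   # ~8.64

With these placeholder weights, freezing the size cue produces the largest drop in matched velocity, in qualitative agreement with Figure 31.21.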


Figure 31.21. Perceived velocity of motion-in-depth. Medians and interquartile ranges of the velocity of lateral motion of a comparison object as a ratio of the velocity of simulated motion-in-depth of a test object. The changing motion-in-depth cues are indicated on the abscissa. A nonchanging cue indicates that the stimulus is not moving (N = 7). (Adapted from Brenner et al. 1996)

31.4.2 INTERACTION OF OBJECT MOTION AND SELF-MOTION

In natural conditions a person who is moving forward might judge the motion of an approaching object. In this case, the motion-in-depth signal produced by motion of the object relative to the observer is seen in the context of looming optic flow produced by the stationary visual surroundings. Ideally, the time-to-contact and the speed of approach of an object are fully specified by its angular size divided by its rate of looming, whatever the pattern of optic flow. However, judgments may be disturbed by motion contrast between the object and the surroundings.

Gray and Regan (2000b) superimposed a simulated approaching object on a 39˚-wide by 27˚-high display of looming texture that simulated the effects of forward or backward self-motion. After the display had been viewed monocularly for 0.7 s, subjects indicated whether a click sound occurred before or after the object would have contacted the face. During simulated forward self-motion, time-to-contact was underestimated by about 11% relative to when the background was stationary. During simulated backward self-motion, time-to-contact was overestimated by about the same amount.

Gray et al. (2004) used a similar display, 56˚ wide by 88˚ high, to simulate forward or backward self-motion. With monocular viewing, a superimposed object with a given speed of approach relative to the head was judged to have a higher closing speed when the object was seen against a contracting optic flow (simulating backward self-motion) than when it was seen against an expanding optic flow. For example, when the speed of simulated forward self-motion equaled object speed, the perceived speed of the object relative to the subject was decreased 5–12% relative to that when the surroundings were stationary. For high speeds of simulated self-motion relative to object motion, the perceived speed of the object increased for both directions of self-motion. The perceived direction of the object's approach shifted in the direction of the focus of expansion of the optic flow. These effects were reduced when binocular information about motion-in-depth was added. It is not clear to what extent these effects were due to perceived self-motion (linear vection), because this was not measured.

A similar experiment could be done with subjects walking forward or backward in the dark as they judge the time-to-contact or speed of approach of an object relative to the head. Or subjects could be passively accelerated in the dark. The results would indicate how well people combine motion signals generated by proprioception and/or the otoliths with those generated visually.
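The ideal quantity referred to above is the optic variable tau: for an object of angular size \theta approaching at constant speed, time-to-contact is approximated (for small angles) by

    TTC \approx \tau = \frac{\theta}{d\theta/dt}.

Because \tau depends only on the object's own image, it is unaffected by the background flow. On this view, the over- and underestimations described above reflect motion contrast between object and surroundings rather than any change in the tau signal itself.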


31.5 SPATIAL ASPECTS OF MOTION-IN-DEPTH


31.5.1 DIRECTIONAL ASYMMETRY


There is abundant evidence that the visual system is more sensitive to centrifugal motion than to centripetal motion. This could be due to the fact that animals move forward rather than backward. Response latency for detection of horizontal motion of an 8˚ disk of random dots was shorter for motion away from the fovea (centrifugal motion) than for centripetal motion, and the difference increased with increasing eccentricity of the stimulus (Ball and Sekuler 1980). The motion aftereffect induced by centrifugal motion lasted longer than that produced by centripetal motion (Wohlgemuth 1911; Scott et al. 1966). Also, there are more cells tuned to centrifugal than to centripetal motion in cortical area MT (Albright 1989) and in the superior colliculus (Sterling and Wickelgren 1969).

On the other hand, sensitivity to motion of a single dot, as opposed to a display of dots, is higher for centripetal than for centrifugal motion (Mateeff et al. 1991). Furthermore, the gain of optokinetic nystagmus is higher for stimuli moving centripetally in each hemifield (Ohmi et al. 1986). Of the neurons in the parietal lobe of primates that are tuned to radial motion, about 75% are tuned to centripetal motion (Motter et al. 1987). It looks as though there are at least two motion-detection channels, with different patterns of directional preference.

What about asymmetries in the perception of motion-in-depth? Edwards and Badcock (1993) found greater sensitivity to a centripetally looming array of dots that appeared to move away from the observer than to centrifugally moving dots that appeared to move toward the observer.


This asymmetry weakened as the stimuli were moved into the visual periphery. Portfors-Yeomans and Regan (1997a) found no difference in sensitivity between forward and away motion of a form in a random-dot stereogram in which motion-in-depth was indicated only by changing disparity.

Subjects made fewer errors in detecting a looming disk that simulated approach set among contracting disks than in detecting a contracting disk set among looming disks (Takeuchi 1997; Shirai and Yamaguchi 2004). However, an approaching disk with shading that indicated that it was concave was harder to detect than a convex approaching disk. The former type of disk appeared as an expanding hole rather than an approaching object.

31.5.2 POSITION IN THE VISUAL FIELD

31.5.2a Central and Peripheral Visual Fields

A stimulus moving at constant speed appears to move more slowly as it moves into the peripheral visual field (Tynan and Sekuler 1982). Brooks and Mather (2000) created two stereoscopic test patches that appeared to move in depth relative to a random-dot background. The only cues to motion-in-depth were changing absolute and relative disparity and the interocular velocity difference. Subjects could accurately match the velocities of the two patches when they were presented side-by-side just above a central fixation point. When one patch was placed at an eccentricity of 4˚, the velocity of its motion-in-depth was underestimated by 24%. When not moving, the central and peripheral test patches appeared to lie at the same depth. It seems that the reduction in apparent speed of motion-in-depth of the peripheral test patch was due to the effect of peripheral viewing on the perceived speed of each image rather than to any effect of peripheral viewing on disparity. Brooks and Mather concluded from these results that an interocular velocity difference makes a significant contribution to the perception of motion-in-depth.

31.5.2b Upper and Lower Visual Fields

In humans, the density of ganglion cells is greater in the upper than in the lower half of the retina (van Buren 1963). Reaction times are shorter for stimuli in the lower half of the visual field (upper retina) (Payne 1967). Also, the upper retina has better spatial and temporal resolution (Millodot and Lamont 1974; Tyler 1987) and higher stereoacuity (Manning et al. 1992). The upper retina is more heavily represented in the middle temporal cortex (MT). This area contains many direction-selective cells that respond to the motion of large and small visual stimuli (Van Essen et al. 1981). Also, the gain of horizontal optokinetic nystagmus is higher in response to stimuli in the lower visual field (upper retina) than for stimuli in the upper visual field (Murasugi and Howard 1989).



Edwards and Badcock (1993) found that sensitivity to motion-in-depth produced by a semi-annular looming random-dot display was higher in the lower than in the upper visual field.

31.6 AFTEREFFECTS OF MOTION-IN-DEPTH

31.6.1 AFTEREFFECTS OF ROTATION IN DEPTH

It was explained in Section 26.7.2 that a 3-D skeletal cube reverses in apparent depth every few seconds when viewed monocularly. A 3-D cube also reverses in depth when viewed binocularly for several minutes. When the cube rotates slowly around a grand diagonal, an apparent reversal in the direction of motion accompanies each reversal of perspective. For example, if the far side of the cube is moving to the right when it is seen in correct perspective, it will appear as the near side moving to the right in reversed perspective.

With monocular viewing, it takes a few seconds before a rotating cube reverses. When viewed with both eyes, it takes about 2 minutes until it reverses. After that, it reverses every few seconds and sometimes appears to oscillate rapidly. If, after the first reversal of a binocularly viewed cube, the cube is rotated objectively in the other direction, it takes about 4 minutes before it reverses again (Howard 1961). In the induction period, the detectors sensitive to the specific direction of rotation in depth must become adapted to the point that the alternative interpretation of the stimulus takes precedence. The adapted state of the detectors for motion-in-depth has to "unwind" before becoming adapted in the other direction. The effect cannot be due to adaptation of simple detectors of 2-D motion, because at all positions of the cube there is as much motion in one direction as in the opposite direction. The integral effect over time is therefore zero. The adaptation process must be specific to the direction of rotation in depth.

The rate of reversal of stationary 2-D figures, like the Necker cube, increases as inspection is continued for some time (Section 26.7.1). This effect is specific to the location and shape of the inspection figure. The same specificity applies to aftereffects of rotation in depth. Erke and Gräser (1972) presented subjects with the projected image of a cylinder of vertical rods rotating about a central vertical axis. The rate of change in the apparent direction of rotation of the display increased over a period of 3 minutes. The increase did not transfer to a display rotating at a different velocity or to a display that was tilted from the vertical by 45˚. The adaptation to motion-in-depth is therefore specific to both the speed of motion and the orientation of the axis of rotation.


Any aftereffect from a real object rotating in depth could be due to adaptation of monocular or binocular cues to the direction of rotation. Monocular cues to changing depth are sufficient to induce aftereffects. Regan and Beverley (1978a) obtained an aftereffect from inspection of a looming square (see Section 31.6.2). A square in polar projection contains perspective information about the direction of rotation, whereas a square in parallel projection contains no such information. Several minutes of inspection of a 2-D polar projection of a square rotating in depth in a given direction caused a square in parallel projection to appear to be rotating in the opposite direction (Petersik et al. 1984). The aftereffect developed with a time constant of about 26 s and decayed with a time constant of about 9 s (Petersik 2002). The 2-D motion aftereffect has similar time constants.

Aftereffects of rotation in depth are subject to the effects of attention. When subjects inspected two figures moving in opposite directions simultaneously, the direction of the aftereffect depended on which figure subjects had attended to (Shulman 1991).

31.6.2 AFTEREFFECTS OF MONOCULAR LOOMING

31.6.2a Adaptation of Looming Detectors

Looming detectors might be expected to manifest aftereffects due to adaptation, like those shown by detectors for other visual attributes, such as lateral motion and orientation. Regan and Beverley (1978a) reported evidence of this kind. Subjects fixated a point between two flanking 0.5˚ black squares on a 15 by 10˚ white background. In one condition, the small squares pulsated sinusoidally in overall size at 2 Hz, which created an impression of to-and-fro motion-in-depth. In a second condition, the vertical edges oscillated in antiphase, which created an impression of changing width. Inspection of a square pulsating in overall size for 25 minutes specifically elevated the threshold for detection of pulsation of overall size in a test square but had very little effect on the threshold for detection of changing width. This suggests that looming is processed by specialized motion detectors that are distinct from those processing local unidirectional motion. When the pulsations and oscillations were ramps interspersed with quick returns, threshold elevation was evident only when the ramp motions of the inspection and test squares were in the same direction.

In another experiment, subjects fixated a point at the center of a 1˚-wide bright square that loomed symmetrically at 24 arcmin/s for repeated periods of 0.1 s over a total period of 20 min. The square appeared to move continuously in depth and produced a strong aftereffect of receding motion in a subsequently viewed stationary test stimulus. The aftereffect could be nulled by a real movement of the test in the opposite direction.

This aftereffect was restricted to the region of the retina stimulated by the induction square, but showed at about 40% of its normal strength when the induction square was presented to one eye and the test square to the other (Regan and Beverley 1978b).

Regan and Beverley concluded that there are visual channels specifically tuned to looming images, which are built up from detectors for opposite motion along a particular motion axis. Each opposed-motion detector linearly and accurately subtracts the outputs of local detectors that encode motion at two retinal locations within a given meridian (Regan and Beverley 1980; Regan 1986). The two locations must be no more than 1.5˚ of visual angle apart, since adaptation to oscillating squares larger than this did not produce the aftereffect just described (Beverley and Regan 1979a). They further proposed that a looming detector combines signals from orthogonal motion detectors in a nonlinear fashion. The nonlinearity is such that looming signals are accepted only if the image is characteristic of a rigid nonrotating object (Beverley and Regan 1979b; 1980a). For example, an image whose rate of expansion was not isotropic did not effectively drive the motion-in-depth system.

Regan and Beverley (1979a) measured the reduction in sensitivity to size oscillations of a test square after adaptation to a radially expanding and contracting pattern of short lines. Threshold elevation was large only when the test square was exactly where the center of the flow pattern had been. This was so even when the focus of the flow pattern was not on the fixation point (Regan et al. 1979). It was concluded that local looming detectors sample the retinal flow pattern caused by self-motion through a 3-D world, and "light up" in the immediate vicinity of the focus of the retinal flow pattern. For the flow pattern used, the divergence of velocity (div V) was large only at the focus of expansion, so that the looming detectors were acting as div V detectors (Regan and Beverley 1979a; Regan 1986). The significance of this point is that, by definition, eye movements do not affect div V, since div V is independent of the translatory motion due to eye movements.

The aftereffect of adapting to a square pulsating in size was also specific to a range of frequencies of pulsation of the test square centered on the frequency of the adapting stimulus. Within a frequency band of 0.25 to 16 Hz there was evidence of three broadly tuned channels, each with a different preferred frequency of size modulation (Beverley and Regan 1980b). In other words, in addition to being tuned to impact direction, looming detectors are also tuned to the temporal characteristics of an approaching object.
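For a retinal velocity field V = (v_x, v_y), the divergence is

    \operatorname{div} V = \frac{\partial v_x}{\partial x} + \frac{\partial v_y}{\partial y}.

A translation of the whole field, such as that produced by an eye movement, adds a constant vector to V at every point and therefore leaves div V unchanged. This is the sense in which looming detectors acting as div V detectors are immune to eye movements.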

31.6.2b Aftereffects from Relative Motion-in-Depth

When a person moves toward a stationary textured surface, the looming of the image of the surface is not interpreted as a change in size of the surface.


Wallach and Flaherty (1975) found that the motion aftereffect produced by an expanding pattern was less when the expansion was coupled to forward motion of the subject than when the subject was stationary. They argued that the strength of the aftereffect is influenced by the perceived change in size of the induction stimulus rather than by the actual change in size of the image.

For a single object moving in depth, the images of both the texture elements and the overall size of the object change in size. Also, there is no accretion or deletion of texture elements. Things are more complicated when an object is seen through an aperture. When a textured surface moving in depth is seen through a fixed aperture, only the image of the texture changes in size. When a stationary textured surface is seen through an aperture that moves in depth, only the image of the aperture changes in size. In both cases there is progressive accretion or deletion of texture elements by the aperture.

Ten minutes of exposure to a contracting blank square or a contracting textured square produced a motion-in-depth aftereffect in which a test square appeared to move forward (Beverley and Regan 1983). The aftereffect was weakened when the texture elements changed in size while the size of the square was constant, or when the square changed in size while the texture remained constant. The aftereffect was canceled in a test square containing a particular combination of opposed changes in texture size and square size.

Beverley and Regan argued that the motion-in-depth aftereffect is a measure of the strength of the stimulus for motion-in-depth. This is a good measure for motion of an isolated single object. But it needs qualifying for relative motion-in-depth. The balance point between two opposing motion signals where no aftereffect is produced is probably a good measure of the relative strengths of the two signals.

Image position

Phase 0°

90°

for relative motion. For example a retreating textured surface seen through an approaching aperture would be a strong stimulus for relative motion-in-depth even though it may not produce a motion-in-depth aftereffect. In a similar way, two opposed lateral motions do not produce a motion aftereffect even though they produce a strong sensation of relative motion. This whole issue requires further investigation. 31.6.3 A F T E R E FFEC TS O F D I S PA R IT Y-D E FI N E D MOT I O N-I N-D E P T H
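These aperture configurations can be summarized with a small-angle sketch of the stimulus geometry. For an object of linear size \(S\) carrying texture elements of size \(s\) at distance \(d\), the images subtend approximately

\[ \theta_{\mathrm{object}} \approx \frac{S}{d}, \qquad \theta_{\mathrm{texture}} \approx \frac{s}{d}. \]

For a rigid object moving in depth, both angles change at the same relative rate, \(\dot{\theta}/\theta = -\dot{d}/d\), and no texture is accreted or deleted. When a surface moves in depth behind a fixed aperture, only \(\theta_{\mathrm{texture}}\) changes, and texture is accreted or deleted at the boundary; when an aperture moves in depth in front of a stationary surface, only the aperture's image changes. Equality of the two relative rates of expansion is thus the optical signature of a single rigid object moving in depth.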

31.6.3 AFTEREFFECTS OF DISPARITY-DEFINED MOTION-IN-DEPTH

Regan and Beverley (1973c) reported an aftereffect of rotation in depth defined only by disparity. Each eye saw a spot moving sinusoidally from side to side at 0.8 Hz. When the images of the spots in the two eyes moved in phase, the fused image appeared to move in a frontal plane. When they moved 180˚ out of phase, the image appeared to move along a straight path in and out of the plane of the screen. A 90˚ phase difference created an impression of rotation around a circular orbit in depth in one direction, and a 270˚ phase shift created an impression of circular rotation in depth in the opposite direction. Other phase angles produced apparent motion around elliptical orbits in one direction or the other, as shown in Figure 31.22. The depth threshold for each of these stimuli was first established by adjusting the amplitude of target oscillation until depth was detected. When subjects had viewed a stimulus that appeared to rotate in depth in one direction for 10 minutes, the display no longer appeared to rotate in depth, and the depth threshold for other stimuli rotating in the same direction was elevated. The depth threshold for stimuli rotating in the nonadapted direction was either not affected or reduced. Beverley and Regan (1973a) provided further evidence for aftereffects produced specifically by changing disparity.

Figure 31.22. Motion-in-depth and phases of binocular image motion. Phase relationships of the motion of dichoptic images of a spot (top row, image position against time in each eye for phases of 0˚ to 360˚), and the orbits of apparent motion-in-depth that each creates (bottom row). When images move in phase, the spot appears to move in a frontal plane. When they move 180˚ out of phase, the spot moves to-and-fro in depth along a straight path. Intermediate phases create circular or elliptical orbits of motion within the plane of regard. (Adapted from Regan and Beverley 1973c)
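The orbits in Figure 31.22 follow from elementary trigonometry; as a minimal sketch, if the two images move as

\[ x_L(t) = A\sin\omega t, \qquad x_R(t) = A\sin(\omega t + \phi), \]

the mean (cyclopean) position and the disparity are

\[ \tfrac{1}{2}(x_L+x_R) = A\cos\tfrac{\phi}{2}\,\sin\!\left(\omega t+\tfrac{\phi}{2}\right), \qquad x_L-x_R = -2A\sin\tfrac{\phi}{2}\,\cos\!\left(\omega t+\tfrac{\phi}{2}\right). \]

The frontal-motion and depth components are thus always 90˚ apart in temporal phase, with amplitudes \(A\cos(\phi/2)\) and \(2A\sin(\phi/2)\). At \(\phi=0\) the depth component vanishes (motion in a frontal plane), at \(\phi=180˚\) the frontal component vanishes (straight to-and-fro motion in depth), and intermediate phases trace elliptical orbits, which appear circular when the two amplitudes are perceptually matched.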


The central 2˚ area of a random-dot stereogram was sinusoidally modulated in disparity so that it appeared to oscillate back and forth in depth along a linear path. They varied the direction of the path by changing the phase and relative amplitude (symmetry) of disparity modulation. After subjects viewed motion along a particular trajectory for 10 minutes, the amplitude of relative image motion required to evoke movement in depth in a test stimulus increased. Figure 31.23 shows that the change in sensitivity was greatest when the test movement was along a trajectory similar to that of the inspection movement, and fell to zero when the directions of the two movements differed by more than a certain amount (Beverley and Regan 1973b).

Figure 31.23. Adaptation to motion-in-depth. The red line shows the preadaptation threshold for motion-in-depth within the plane of regard as a function of impact direction relative to the nose. The blue line shows the threshold after the subject adapted to a stimulus moving toward a point between nose and left eye. The green line shows the threshold after adaptation to a stimulus moving toward a point between nose and right eye. The abscissa is the ratio of left-image to right-image motion; the ordinate is the motion-in-depth threshold (arcmin). (Adapted from Beverley and Regan 1973b)

Beverley and Regan concluded that there are four visual channels for detecting impact direction. One is tuned to directions aimed between the nose and the left eye, one to those aimed between the nose and the right eye, and one each to those aimed on either side of the head. Another four channels were postulated for motion away from the head. Psychophysical and physiological evidence for changing-disparity detectors is reviewed in Section 31.8.2b.

Webster et al. (1998) constructed random-dot stereograms depicting either a cylinder rotating about a vertical axis or two frontal surfaces moving laterally in opposite directions with 8 arcmin of disparity between the near and far surfaces. Subjects fixated a cross midway between the front and back surfaces. One minute of adaptation to the moving stimuli caused the same stationary display to appear to move in the opposite direction. Adaptation to single-plane motion produced a large aftereffect in the stationary stereoscopic cylinder, and adaptation to stereoscopic motion showed some transfer to a single-plane test stimulus.

A dynamic Lissajous figure produced on an oscilloscope appears as a sine-wave pattern rotating in depth. Because its direction of rotation is ambiguous, the perceived direction changes periodically. When the Lissajous figure is viewed with a dark filter in front of one eye, a pattern of binocular disparities is produced and the direction of rotation is no longer ambiguous. The dark filter produces the Pulfrich effect described in Chapter 23. Inspecting the stereo image for some minutes caused a subsequently seen ambiguous figure to appear to rotate in the opposite direction (Smith 1976). The aftereffect required the induction and test stimuli to have the same spatial frequency, suggesting the presence of visual channels tuned jointly to direction of motion, disparity, and spatial frequency (Chase and Smith 1981). Similarly, inspection of the stereoscopic image of a rotating sphere caused the image of a rotating sphere lacking disparity to appear to rotate in the opposite direction (Nawrot and Blake 1989, 1991a). Prior inspection of an array of dots moving in one direction with crossed disparity, superimposed on uncrossed-disparity dots moving in the opposite direction, caused an incoherently moving array of dots to appear as a structure rotating in depth in the opposite direction (Nawrot and Blake 1993b).

There are two ways to account for aftereffects of rotation in depth specified only by disparity. The first possibility, proposed by Regan and Beverley (1973c), is that there are distinct sets of detectors, each jointly tuned to a specific direction of sideways motion and a specific sign of disparity. These disparity-specific motion detectors can be understood by referring to Figure 31.24. There must be at least four members of the set: left-motion crossed-disparity and left-motion uncrossed-disparity detectors, and right-motion crossed-disparity and right-motion uncrossed-disparity detectors. Physiological evidence for the existence of these detectors is reviewed in Section 31.8.2b. The second possibility is that distinct sets of detectors are each tuned to a specific direction of changing disparity. These are the changing-disparity detectors discussed in the previous sections. There must be at least two sets of detectors to account for motion-in-depth aftereffects. One set would be sensitive to the changing disparity produced by an approaching object, and the other would be sensitive to the changing disparity produced by a receding object. The experiments described in the following sections were designed to investigate aftereffects produced by each of these types of detector.
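The two accounts can be contrasted in a short simulation. The sketch below is illustrative only: the tuning rules and parameters are invented, not taken from Regan and Beverley. Scheme 1 pools four units jointly tuned to lateral motion direction and disparity sign; scheme 2 differentiates disparity over time.

    import numpy as np

    # Schematic contrast of the two detector schemes described above.
    # The tuning rules and parameters are illustrative assumptions.

    def detector_outputs(x_left, x_right, dt):
        """Image positions of one element in each eye (deg) over time."""
        v_l = np.gradient(x_left, dt)        # monocular image velocities
        v_r = np.gradient(x_right, dt)
        disp = x_left - x_right              # crossed (near) disparity > 0
        v_c = 0.5 * (v_l + v_r)              # cyclopean lateral velocity

        # Scheme 1: four units jointly tuned to lateral motion direction
        # and disparity sign, combined in opponent fashion; this signals
        # rotation in depth (near and far elements moving oppositely).
        left_near = np.maximum(-v_c, 0) * (disp > 0)
        right_near = np.maximum(v_c, 0) * (disp > 0)
        left_far = np.maximum(-v_c, 0) * (disp < 0)
        right_far = np.maximum(v_c, 0) * (disp < 0)
        rotation = (right_near + left_far) - (left_near + right_far)

        # Scheme 2: changing-disparity units, one signed for approach
        # (disparity becoming more crossed) and one for recession.
        d_disp = np.gradient(disp, dt)
        in_depth = np.maximum(d_disp, 0) - np.maximum(-d_disp, 0)
        return rotation.mean(), in_depth.mean()

    dt = 0.01
    t = np.arange(0.0, 2.0, dt)

    # Element on a surface rotating in depth: lateral velocity and
    # disparity modulate in phase, so scheme 1 responds strongly while
    # scheme 2 averages to zero.
    x = np.sin(2 * np.pi * t)                # lateral excursion (deg)
    d = 0.1 * np.cos(2 * np.pi * t)          # disparity excursion (deg)
    print(detector_outputs(x + d / 2, x - d / 2, dt))

    # Element approaching head-on: images move in opposite directions,
    # so scheme 2 responds but scheme 1 is silent.
    print(detector_outputs(0.05 * t, -0.05 * t, dt))

Half-wave rectification and opponent pairing are standard modeling ingredients here; real cells would have graded tuning rather than sharp sign rules.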

Figure 31.24. Coupling motion direction and disparity. It is assumed that some cells sensitive to leftward motion are also tuned to crossed disparity while others are tuned to uncrossed disparity. Cells sensitive to rightward motion are also assumed to be selectively tuned to disparity. (Adapted from Regan 1986)

Psychophysical evidence reviewed in Section 22.5.4 shows that distinct motion aftereffects can be induced in distinct disparity-defined depth planes.

In all the experiments mentioned so far, subjects inspected the unambiguous induction stimulus for one or more minutes. The ambiguous test stimulus appeared to rotate in the opposite direction to that of the induction stimulus. It is as if motion-in-depth in one direction becomes fatigued. When a sphere of dots rotating clearly in one direction was inspected for a second or two, a subsequently seen ambiguous sphere of dots appeared to rotate in the same direction (Jiang et al. 1978). In other words, before fatigue has set in, a perceived direction of motion primes the direction of motion in an ambiguous display. The visual system exhibits percept inertia over the short term and adaptation, or fatigue, over the long term. Nawrot and Blake (1993a) reported a similar priming effect, as described in Section 30.2.1b.

31.7 INDUCED MOTION-IN-DEPTH

A stationary object seen against a moving background appears to move in the opposite direction, an effect known as induced visual motion. There are three components of induced motion.

1. The retinotopic component is due to contrast between motion detectors, in a manner analogous to color contrast between color detectors and tilt contrast within the orientation-detection system.

2. The oculomotor component is due to misregistration of movement of the eyes.

3. The vection component is due to misregistration of the motion of the self. For example, rotation of the whole visual field produces a sensation of self-rotation in a stationary observer—an effect known as vection. A stationary object superimposed on the flow field also appears to rotate with the observer, as do the visible parts of the observer's body.

Procedures for measuring the separate contributions of these components were described in Section 22.7. Induced visual motion may be measured by:

1. Verbal estimates of velocity or distance moved.

2. The tilt test. The subject estimates the tilt of the path of motion of a spot moving at right angles to the direction of the inducer. This test was described in Section 22.7.

3. Tracking the perceived motion of the target stimulus with the unseen hand.

4. Nulling the apparent motion of the target. This is an unsatisfactory procedure because it alters the relative motion between inducer and target that is responsible for the effect. Nulling is valid for measuring the motion aftereffect because the inducer is no longer present when the nulling is performed.

We will now see that illusory visual motion-in-depth may be induced in a stationary stimulus seen against a background moving in depth.

31.7.1 MONOCULAR INDUCED MOTION-IN-DEPTH

Induced motion-in-depth is produced by changing perspective in a monocularly viewed display. Farnè (1972) presented to one eye a textured surface rotating back and forth in depth about a midline vertical axis through ±75˚ with respect to the frontal plane. Two stationary vertical lines were displayed in front of the oscillating surface, one on the left and one on the right of center. Initially, in each half cycle of rotation, the line seen against the receding half of the surface appeared to lengthen, and that seen against the approaching half appeared to shorten. Then the impression changed to one in which the lines remained the same length but appeared to oscillate in depth in counterphase to the motion of the surface.

Hershberger et al. (1976b) showed a film in parallel projection of five side-by-side vertical rods rotating about a midvertical axis. The perceived direction of rotation was ambiguous. The rotating display was superimposed on stationary lines that diverged from left-to-right or from right-to-left, as shown in Figure 31.25. The side of the rotating display moving toward the apex of the radiating lines appeared to rotate away from the observer. The other side appeared to rotate toward the observer. This whole display was a dynamic Ponzo illusion in which linear perspective in a stationary background biased the perceived direction of rotation in an otherwise ambiguous display.

Figure 31.25. A dynamic Ponzo illusion. The five vertical lines rotated as a set around the midvertical axis from the frontal plane to the sagittal plane or in the reverse direction. Since the lines were produced by parallel projection, they did not vary in length. The radiating lines remained in the frontal plane. Subjects reported the direction of rotation in depth of the lines. (Redrawn from Hershberger et al. 1976)

The above effects were most likely due to contrast between motion detectors in a retinotopic frame of reference, because rotation in depth does not evoke pursuit eye movements. Also, rotation in depth does not evoke vection (see Section 22.7.3). The oculomotor system could be involved in induced motion-in-depth produced by a looming display, because a looming display produces vergence eye movements, as explained in Section 10.3.2c.

Changing perspective in a large monocularly viewed scene can induce strong vection. For example, looming motion in a flight simulator induces compelling sensations of forward self-motion. Any stationary objects, such as the contents of the cockpit, also seem to move forward with the person's body. This is vection-induced motion-in-depth (see Section 22.7.3).

31.7.2 BINOCULAR INDUCED MOTION-IN-DEPTH

Gogel and Griffin (1982) reported that a test dot undergoing vertical motion appeared to move along a path slanted in depth when seen in the context of binocularly viewed dots moving back and forth in stereoscopic depth. The effect was the same when subjects fixated the vertically moving test spot as when they fixated the inducing dots. This suggests that the effect was not due to vergence eye movements. This effect could have been a purely stereoscopic effect depending on adaptation of disparity detectors, or it could have been due to opposite induced motions in each of the monocular images.

Likova and Tyler (2003) presented a long row of test dots flanked by two similar rows above and two rows below the test dots. The test dots had a fixed disparity that varied from trial to trial. The disparity of the flanking rows of dots alternated by 13 arcmin every 600 ms. This was equivalent to to-and-fro motion-in-depth of 7 cm (see the worked example at the end of this section). On average, the stationary row of dots was estimated to move about 3 cm in depth in the opposite direction to that of the flanking dots. The induced motion was the same whether subjects fixated the test dots or the flanking dots. This seems to rule out any contribution of vergence to the induced motion. However, induced motion occurred only when the test dots were either shifted slightly from side to side or periodically interrupted. Presumably, these changes weakened the signal that the test dots were not moving. Induced motion of the test dots was accompanied by a corresponding reduction in the perceived amplitude of motion of the flanking dots. In other words, the perceived relative motion of the two sets of dots was preserved.

Consider a large random-dot frontal surface viewed in a stereoscope. When the two images are moved from side to side in antiphase through about 1˚ they produce the same changing-disparity signals as those produced by to-and-fro motion-in-depth. However, because of the absence of looming, the surface does not appear to move in depth (see Section 31.3.4). But when a stationary stimulus is superimposed on the disparity-modulated surface, one or the other of the stimuli appears to move in depth. Tanahashi et al. (2008) investigated which of the following factors influence the perceived motion-in-depth of two superimposed stimuli.

1. Relative size. In 2-D induced motion, a small stimulus appears to move in a direction opposite to that of the motion of a larger superimposed stimulus. Similarly, Tanahashi et al. found that a small stimulus superimposed on a large stimulus appeared to move in depth, whichever stimulus was modulated in disparity. Clearly, a larger surface is perceived as a stationary frame of reference for both lateral motion and motion-in-depth.

2. Which display is disparity modulated. There was no significant effect of this factor.

3. Relative luminance. In 2-D induced motion, the dimmer of two otherwise equal stimuli appears to move, whichever stimulus is actually moving (Oppenheimer 1935). Similarly, Tanahashi et al. found that the dimmer of two stimuli appeared to move in depth, whichever stimulus was disparity modulated.

4. Relative depth. There was no significant effect of the relative depth of the two stimuli.

5. Which display is fixated. Nefs and Harris (2008) found that induced motion-in-depth of a stationary stimulus was greater when observers fixated the disparity-modulated stimulus than when they fixated the stationary stimulus.

In vection, a stationary person experiences illusory motion-in-depth of the self and of all things attached to the self when the whole of the visual surroundings move in depth. For example, a pilot in a flight simulator feels that the plane is moving with respect to stationary surroundings. In fact, the plane is not moving and all the motion signals arise in the surroundings. In a stationary flight simulator there are no signals from the vestibular system to indicate self-motion, but vestibular signals are in any case absent after a brief period of real self-motion at constant velocity. The same effect is produced by 3-D cinema displays. There have been no quantitative studies of the vection component of induced motion-in-depth.
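The equivalence noted above between a 13-arcmin disparity alternation and 7 cm of depth can be checked with the standard small-angle relation. In this worked example the interocular distance is assumed to be a typical 6.3 cm; the viewing distance is not restated in this summary and is inferred:

\[ \Delta d \approx \frac{D^{2}\,\delta}{a}, \]

where \(\delta\) is the disparity in radians, \(a\) the interocular distance, and \(D\) the viewing distance. With \(\delta = 13\ \text{arcmin} \approx 3.8\times10^{-3}\ \text{rad}\) and \(a = 6.3\ \text{cm}\), a depth interval \(\Delta d = 7\ \text{cm}\) implies \(D = \sqrt{\Delta d\,a/\delta} \approx 1.1\ \text{m}\), a plausible viewing distance for such a display.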

31.7.3 INDUCED MOTION AND PERCEIVED DISTANCE

A stationary 2-D object containing moving texture elements appears to shift as a whole in the direction of the motion of the elements. For example, the stationary envelope of a Gabor patch appears to shift in the direction of lateral motion of the enclosed carrier sine wave (DeValois and DeValois 1991). The same effect is produced by the motion aftereffect (Snowden 1998). A similar effect is produced by looming motion. The stationary envelope of a circular Gabor patch appeared to increase in size when the concentric sine wave within the envelope expanded. The Gabor patch appeared to contract when the sine wave contracted (Whitaker et al. 1999).

Edwards and Badcock (2003) displayed an expanding array of dots in a fixed circular aperture and a contracting array of dots in a second aperture. To make the two displays appear at the same distance, the expanding display had to be stereoscopically displaced away from the observer relative to the contracting display. This means that a stimulus of fixed size containing outward optic flow appeared nearer than a similar stimulus containing inward flow. Assume that subjects saw the fixed boundary of the flow pattern as a window beyond which there was a surface generating the optic flow. Subjects may have judged the size and distance of the surface rather than the size and distance of the whole display, including the window. Perhaps the results would be different if the stationary boundary were stereoscopically beyond rather than nearer than the pattern of optic flow.

Baumberger and Flückiger (2004) asked subjects to estimate the distance of a fixed light placed on a large approaching or receding display of dots projected onto a ground surface. The flow patterns induced corresponding sensations of self-motion (vection). The distance of the light was underestimated on an approaching flow pattern relative to its distance on a stationary or receding pattern. This effect must have arisen from a complex combination of effects of changes in vergence and of vection.

Tsui et al. (2007) produced a 3-D version of this motion-induction effect. Subjects looked along the axis of a stereoscopic 3-D cylinder containing random dots. Depth was defined only by disparity. All the dots moved either toward or away from the subject within a lifetime of 10 frames. Dots that left one end of the fixed boundary of the cylinder wrapped around and reappeared at the other end. Thus the dots moved in depth within the stationary boundary of the cylinder. The ends of the cylinder appeared closer than a fixed reference frame when the dots moved toward the subject and more distant than the frame when the dots moved away from the subject. The effect increased as the speed of the dots increased.

31.8 DETECTORS FOR MOTION-IN-DEPTH

31.8.1 LOOMING DETECTORS

31.8.1a Looming as a Preattentive Feature

A stimulus moving in the frontal plane in one direction immediately pops out when set among stimuli moving in the opposite direction, and the time taken to detect the odd stimulus is independent of the number of stimuli. This suggests that there are cortical units at an early stage of processing tuned to the direction of lateral motion (Section 5.6.4). If cortical units dedicated to looming exist, one would expect an expanding stimulus to pop out when set among contracting stimuli.

However, Braddick and Holliday (1991) found that an expanding stimulus patch that periodically returned rapidly to its initial size, when set among stimuli that changed size in the same way but in counterphase, took longer to detect when the number of distracting stimuli was increased. The starting times of the distracting stimuli were staggered so that the cue of relative size was not available. This suggests that detectors tuned to direction of motion-in-depth do not occur at the preattentive level. However, the sudden return motion of each patch may have stimulated looming detectors as well as the slower expansion. Perhaps a spot undergoing sawtooth motion in a frontal plane would not pop out when set among an array of spots undergoing counterphase sawtooth motion.


The abrupt appearance of a looming stimulus attracts our attention whatever we are doing. Image contraction, which indicates receding motion, is a less powerful attention stimulus (Franconeri and Simons 2003). Also, the reaction time to a looming stimulus is shorter than that to a contracting stimulus (Skarratt et al. 2009).

Sekuler (1992) argued against specific detectors for motion-in-depth produced by looming on the grounds that speed discrimination for looming, lateral motion, and rotation were similar, and that performance with a variety of looming stimuli could be accounted for by linear summation of lateral motion signals. Sekuler's argument conflicts with psychophysical evidence reviewed in the next section. Although relative motion detectors are constructed by linear summation of detectors for oppositely directed retinal motion (Regan 1986), 2-D changing-size detectors are based on strongly nonlinear summation of orthogonal relative motion detectors (Beverley and Regan 1980a). Beverley and Regan suggested that nonlinear summation blocks access to the motion-in-depth system when the rate of image magnification is not characteristic of a rigid object that is not rotating (Beverley and Regan 1979b).

In animals lacking stereoscopic vision, looming is the principal cue for motion-in-depth. Many animals, including humans, react rapidly when they see something loom. Human neonates responded to looming optic flow with a backward motion of the head. The magnitude of head motion increased with increasing velocity of optic flow (Jouen et al. 2000). These responses are also probably under subcortical control (Banton and Bertenthal 1996).

31.8.1b Physiological Evidence for Looming Detectors

Many animals possess a visual mechanism specifically tuned to the direction of approaching objects, as defined by the symmetry of looming of monocular images. Cells tuned to the impact direction and velocity of an approaching object have been found in the visual system of the locust (Section 33.1.5). These cells responded to a monocular object, showing that their effective stimulus is looming rather than changing binocular disparity (Rind 1996).

Three types of neurons sensitive to looming have been found in the nucleus rotundus of the pigeon (Sun and Frost 1998). Type-1 cells were tuned to the relative rate of image expansion (tau). Type-2 cells were tuned to the absolute rate of image expansion. Type-3 cells responded in accordance with the function eta (η) described in Section 31.1.1. The value of η peaks before collision. The response of a cell tuned to this function peaks earlier for a large object than for a small object. The peak value therefore provides a strong signal for an avoidance response to a potentially dangerous object. The animals apparently use whichever cue or cue combination is best suited to a particular task or situation.
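For reference, these optical variables can be written explicitly; the following definitions are a sketch following the commonly cited formulation, and the exact forms should be checked against Section 31.1.1. For an approaching object whose image subtends \(\theta(t)\), the relative rate of expansion is \(\dot{\theta}/\theta\); strictly, tau is its reciprocal,

\[ \tau(t) = \frac{\theta(t)}{\dot{\theta}(t)}, \]

which approximates the time remaining before contact when approach speed is constant. The η function combines absolute expansion rate with image size, in the usual formulation

\[ \eta(t) = C\,\dot{\theta}(t)\,e^{-\alpha\,\theta(t)}, \]

with C and α constants. The response grows with expansion rate but is progressively damped as the image becomes large, so η peaks before collision, and earlier for a large object than for a small one, as described above.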

In mammals, responses to looming occur in the superior colliculus, pulvinar, and other midbrain areas that bypass the visual cortex. Cells responding to looming have been found in the superior colliculus of rats (Dean et al. 1989). A few cells in the deeper layers of the cat's superior colliculus responded to objects that approached the head (Gordon 1973). Similar cells were found in the superior colliculus of the Cebus monkey (Updyke 1974). Their response selectivity did not depend on binocular vision, which suggests that looming was the relevant stimulus. The cells were indifferent to the physical characteristics of the objects and could be part of a mechanism for controlling vergence eye movements. Positron emission tomography (PET) has revealed responses to looming in the pulvinar and midbrain tegmentum of humans (Beer et al. 2002).

Destriate monkeys can learn to avoid colliding with objects (Humphrey 1974). Humans are able to detect looming motion within cortically blind regions (Mestre et al. 1992). Destriate humans can be startled by an approaching object (Weiskrantz 1986). Human adults with normal vision show a defensive reaction to a looming stimulus when their attention is distracted by a tracking game (King et al. 1992).

Looming detectors have also been found in several cortical areas. Regan and Cynader (1979) found 56 cells that responded to changing size in cortical area 18 of the cat. Out of these cells, 19 responded to changing size independently of changes in light level. However, the response of almost all these 19 cells to changing size varied as a function of the location of the stimulus in the cell's receptive field. Some cells preferred an expanding stimulus in one part of the receptive field and a contracting stimulus in another part. It was pointed out that any cell that preferred the same direction of motion for a leading as for a trailing edge would respond more strongly to expansion in one part of its receptive field and to contraction in another part. Only one cell qualified as a detector of expansion. However, the population response of the other cells could code expansion or contraction independently of position.

Zeki (1974) found cells in the superior temporal sulcus of the monkey, an area including the middle temporal area (MT), that responded specifically to two parallel dark-light edges moving in a frontal plane toward or away from each other within the cell's receptive field in one eye. Some cells responded only to motion of edges toward each other, and other cells responded only to motion of edges away from each other. One problem is that these cells may respond to increasing or decreasing amounts of light rather than specifically to motion. Bruce et al. (1981) controlled for changing luminance and found cells in this region that responded specifically to changing size.

Area MT of the superior temporal sulcus projects to the ventral intraparietal sulcus (VIP). Like cells in MT, cells in


VIP of the monkey are highly selective for direction and speed of retinal motion, and some respond best to a stimulus moving toward a particular point on the face of the animal. In other words, they are selective for impact direction (Colby et al. 1993).

De Jong et al. (1994) used PET and MRI techniques to chart areas of the human cortex responsive to a display of random dots simulating the visual effects of forward motion. Three areas were identified—V3, the superior parietal lobe in the right hemisphere, and the occipitotemporal ventral surface in both hemispheres. Cells in the related center, MST, responded selectively to the center of expansion of a looming pattern, with an amplitude that changed when pursuit eye movements changed the retinal location of the center of expansion (Page and Duffy 1999).

Beer et al. (2002) used PET to record responses of the human brain to wide-field roll, looming, and yaw motion of a textured surface. Areas V1, V2, and V3 responded to both coherent and incoherent motion. Several other areas showed selective responses only to coherent motion. These included the posterior-inferior temporal cortex and the left inferior occipital gyrus. Several subcortical areas also responded, including the midbrain tegmentum and pulvinar, and the cingulate gyrus and amygdala in the limbic system. Beer et al. concluded that the areas specifically sensitive to coherent motion are involved in the detection of self-motion about each of the principal body axes. These areas are presumably involved in visually induced self-motion (vection). Responses in the cingulate gyrus are probably responsible for avoidance behavior.

31.8.2 DETECTORS FOR BINOCULAR MOTION-IN-DEPTH

31.8.2a Stereomotion as a Preattentive Feature

Harris et al. (1998) found that subjects could immediately detect a dot moving either sideways or in depth when it was exposed for 500 ms among a random 2-D array of stationary dots. However, subjects may have detected only the end-point disparities of the dot moving in depth rather than its actual motion. To eliminate this possibility, they asked subjects to detect a dot moving within a 3-D cylindrical array of stationary dots. A dot moving sideways was readily detected even with a large number of stationary dots, but a dot moving in depth became progressively harder to detect as the number of stationary dots increased.

Harris et al. concluded that the visual system averages opposite left and right velocity signals in the two eyes and discards changing disparity as a signal for motion-in-depth. However, the dot moved in depth at only 4 cm/s (lateral motion of 4.1 arcmin/s) and passed through the fixation plane (zero disparity). In any case, the fact that a feature does not pop out from distractors does not prove that the visual system lacks specific detectors for that feature.
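The averaging account can be stated compactly, as a minimal sketch. If the target's images move with horizontal velocities \(v_L\) and \(v_R\), and \(\delta = x_L - x_R\) is the horizontal disparity, then the lateral motion of the fused (cyclopean) target and the rate of change of disparity are

\[ \frac{v_L + v_R}{2} \quad\text{and}\quad \frac{d\delta}{dt} = v_L - v_R, \]

respectively. A system that pools only the average discards the difference signal, and with it motion-in-depth. A dot moving directly toward the head produces \(v_L = -v_R\), so its average velocity is zero and it resembles the stationary distractors in such a search task.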



31.8.2b Evidence for Binocular Motion-in-Depth Detectors

There is physiological evidence for disparity-specific motion detectors in several cortical areas, although there is no direct evidence of adaptation effects in these detectors. For instance, some cells in V1 and V2 of the monkey respond selectively to stimuli moving in a given direction in a frontal plane. Some respond only to crossed-disparity stimuli and others only to uncrossed-disparity stimuli (Poggio and Fischer 1977; Poggio and Talbot 1981). Similar cells have been found in the medial temporal visual area (MT) (Maunsell and Van Essen 1983) and in the medial superior temporal area (MST) of the monkey (Komatsu et al. 1988).

Maunsell and Van Essen injected a note of caution. They pointed out that, to qualify as a detector jointly tuned to motion and disparity, the tuning of the cell to motion should be related to its tuning to disparity. The fact that a cell responds to both motion and disparity does not prove that it is tuned to a particular combination of motion and disparity. A cell may appear to be selectively tuned to motion-in-depth when tested with stimuli moving along various trajectories for which the mean disparity is not the cell's preferred disparity.

There is physiological evidence for binocular mechanisms specifically tuned to impact direction, as defined by the opposite movement of images in the two eyes. Zeki (1974) found cells in the superior temporal sulcus of the monkey that responded to motion in one direction in one eye and to motion in the opposite direction in the other eye. However, Zeki did not stimulate both eyes simultaneously. Pettigrew (1973) did stimulate both eyes simultaneously and found a few cells that responded best to oppositely directed movements in the two eyes.

Regan and Spekreijse (1970) recorded evoked potentials from the scalp over the visual cortex in human subjects as they watched a stereoscopic display of simulated motion-in-depth. They found responses specifically related to the appearance of motion-in-depth. Responses for approaching motion differed from those for receding motion (Regan and Beverley 1973d).

Cynader and Regan (1978) recorded from single cells in area 18 of the cat's visual cortex as the animal viewed a stereoscopically simulated display of a bar oscillating in depth along various impact directions within a horizontal plane. They found the following three types of cell.

1. Cells tuned to impact direction, with a directional range of 2 to 3˚, which included trajectories that would hit or narrowly miss the cat's head. These "hitting-the-head" cells responded to opposite motion in the two eyes and were inhibited when both images moved in the same direction. Some of them did not respond to monocular motion, while others responded to motion in the same direction in the two eyes when tested monocularly (Regan and Cynader 1982).

2. Cells tuned to trajectories that missed the head. These "missing-the-head" cells responded to stimuli moving in the same direction in the two eyes, but at different speeds. Most hitting-the-head cells showed the same directional tuning and response amplitude when stimulus velocity was increased from 10 to 40˚/s, but some cells responded less vigorously to higher velocities, as shown in Figure 31.26.

3. Cells tuned to movements within a frontal plane, that is, to movements of the images in the same direction and at the same speed.

Figure 31.26. Cortical cell tuned to motion-in-depth. Polar plot of response frequency of a cell in the cat's visual cortex as a function of impact direction of a stimulus approaching in the plane of regard. Response frequencies are indicated as radial distances. Impact direction is with respect to the two eyes. The angle between the eyes is enlarged. The curves are for three velocities of stimulus motion (10, 20, and 40˚/s). (Adapted from Cynader and Regan 1982)

The tuning functions of cells responding to an approaching object were largely not affected by changes in the horizontal disparity of the moving images (Cynader and Regan 1982). Spileers et al. (1990) found cells in area 18 of the cat's visual cortex that were selectively responsive to motion-in-depth independently of their response to either disparity or velocity. But they found a greater number of cells that

were also selectively responsive to stimulus speed. They suggested that, as a set, these cells produce a three-dimensional map of the optic-flow field.

Cells responsive to opposed motion in the two eyes have also been reported in the lateral suprasylvian cortex (Clare-Bishop area) of the cat (Toyama et al. 1985, 1986b). Other cells in this area responded to motion in a frontal plane or to changing image size associated with motion-in-depth (Toyama et al. 1986a). Cells selectively responsive to opposed motion in the two eyes have also been reported in areas V1 and V2 of the monkey (Poggio and Talbot 1981). Unlike those described by Cynader and Regan in the visual cortex of the cat, the cells in the monkey received excitatory and inhibitory inputs from both eyes and had opposite directional sensitivity to the inputs from the two eyes. However, the stimuli traversed the horopter in the study with monkeys, but not in the study with the cat.

Cells responsive to motion-in-depth defined by looming and/or by changing disparity have been found in MST of the monkey (see Sakata et al. 1997). Cells in the posterior parietal cortex of the monkey responded specifically to stimuli rotating in depth—some to rotation about a horizontal axis and others to rotation about a vertical axis (Sakata et al. 1986). Rokers et al. (2009) obtained distinct fMRI responses from human MT (V5) to motion-in-depth produced by modulations of disparity or of interocular motion. Responses to these two forms of motion-in-depth were distinct from those produced by stationary disparity or by monocular motion.
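The hitting-the-head and missing-the-head classes described above correspond to different ratios of the two image velocities. A standard small-angle sketch makes this explicit. For an object near the median plane at distance \(D\), with lateral velocity \(\dot{x}\), approach velocity \(-\dot{D}\), and interocular separation \(a\), the image velocities satisfy

\[ v_L + v_R \approx \frac{2\dot{x}}{D}, \qquad v_L - v_R \approx -\frac{a\dot{D}}{D^{2}}, \]

so that

\[ \frac{v_L+v_R}{v_L-v_R} \approx \frac{2D}{a}\cdot\frac{\dot{x}}{-\dot{D}}. \]

The trajectory passes between the eyes when \(|v_L+v_R| < |v_L-v_R|\), that is, when the two images move in opposite directions (the hitting-the-head cells). Same-direction motion at unequal speeds corresponds to trajectories that miss the head, and equal velocities correspond to motion in a frontal plane.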

Summary

It may be concluded that the mammalian visual system contains cells specifically tuned to the symmetry of looming. Other cells are specifically tuned to binocular cues to motion-in-depth, but it is not known definitely whether the crucial cue is changing disparity or interocular differences in image motion. The binocular motion-in-depth mechanism is distinct from mechanisms coding static disparity or frontal-plane motion, but it is not known whether these three mechanisms are organized in parallel or in series. Monocular and binocular mechanisms for detecting motion-in-depth presumably pool their outputs into a perceptual system that directs the animal to make appropriate responses. The topic of motion-in-depth was reviewed by Cumming (1994).

32 PATHOLOGY OF VISUAL DEPTH PERCEPTION

32.1 Disorders of spatial behavior
32.1.1 Visual neglect
32.1.2 Extinction and gaze apraxia
32.2 Stereoanomalies
32.2.1 Stereoanomalies with brief stimuli
32.2.2 Stereoanomalies with long exposures
32.2.3 Stimulus-specific stereoanomalies
32.3 Brain damage and stereopsis
32.3.1 General effects of brain damage
32.3.2 Asymmetrical effects of brain damage
32.4 Abnormal interocular transfer
32.4.1 Binocularity and binocular summation
32.4.2 Binocularity and dichoptic masking
32.4.3 Binocularity and the tilt aftereffect
32.4.4 Binocularity and the motion aftereffect
32.5 Binocularity and proprioception
32.6 Albinism
32.6.1 Basic characteristics of albinism
32.6.2 Abnormal routing of visual pathways
32.6.3 Congenital nystagmus

32.1 DISORDERS OF SPATIAL BEHAVIOR

This review of spatial disorders is confined to disorders that affect depth perception or are related to the distance of the stimuli.

32.1.1 VISUAL NEGLECT

32.1.1a General Features of Neglect

Lesions in the inferior parietal lobe render a person unable to fixate, attend to, manipulate, or recall objects in the contralateral half of space (Bisiach and Vallar 1988). The patient may eat from only one half of the plate, shave only half the face, or dress only half the body. Some patients are not aware of their defect—a condition known as anosognosia. This collection of symptoms is known as visual spatial agnosia, or visual neglect (Paterson and Zangwill 1944; McFie et al. 1950; Robertson and Marshall 1993). Visual neglect is not normally associated with loss of visual sensitivity or acuity or with paralysis. It is distinct from hemianopia, in which the patient is blind in one half of space. A hemianopic patient directs the gaze so as to bring objects in the blind region into view. A patient with neglect makes no attempt to search for things in the neglected half of space. It is as if, for them, the neglected half of space does not exist. Neglect is best described as a localized defect in the spatial attention mechanism that selects stimuli for further processing.

Visual spatial agnosia is sometimes associated with disorientation and an inability to recognize familiar places. These symptoms more commonly involve the dominant hemisphere, while neglect usually involves the nondominant hemisphere (Critchley 1955). Neglect can also affect awareness of one side of the body, usually the left side. The patient is unwilling to perform tasks with the left hand. In extreme cases patients complain that they cannot find their left arm, or they may disown the left side of the body and declare that it is not theirs.

The neglect may affect one half of space defined with respect to the line of sight (oculocentric space), with respect to the midline of the head (headcentric space), or with respect to the midline of the torso (torsocentric space) (Colby 1998). When a person with neglect faces forward and looks forward, the retinocentric, headcentric, and torsocentric frames of reference coincide. It was initially assumed that visual neglect operates in a retinocentric frame. At the level of the visual cortex, space is partitioned between the hemispheres into left and right retinotopic hemifields. But in the parietal cortex and other cortical areas the retinocentric locations of stimuli are transformed into headcentric coordinates for purposes such as directing eye movements, and into torsocentric coordinates for control of reaching. So one might expect neglect to occur in different coordinate systems according to the type of task and the location of the lesion. Neglect in retinocentric space can be dissociated from neglect in headcentric space by turning the head while keeping the retinal location of stimuli constant or by turning the eyes while keeping the

headcentric location of stimuli constant. Such manipulations have revealed that symptoms of neglect are basically in a retinocentric frame of reference but are modulated by the locations of objects with respect to the head or with respect to the torso. Visual neglect, as reflected in the way patients make exploratory eye movements, occurs in a headcentric frame of reference. It remained in a headcentric frame of reference when patients were tilted 30˚ with respect to gravity (Karnath et al. 1998). Thus, neglect is not modulated by changes in the orientation of the body to gravity.

Even in the nonneglected half of space, patients tend to center their gaze on the side of an object away from the neglected side. For example, a patient with neglect of the right half of space neglected the right end of words even when the whole word was in the intact half of space (Caramazza and Hillis 1990). This suggests that neglect can operate in an object-centered frame of reference as well as in egocentric frames (Tipper and Behrmann 1996). However, what looks like object-centered neglect could arise from an egocentric gradient of neglect spanning the whole visual field (Driver and Pouget 2000).

The effects of neglect become more evident as a stimulus object is taken further into the affected half of space. Thus, for patients with lesions in the right parietal cortex, the accuracy of detecting the number of flashed lights decreased as the stimuli were moved further into the contralateral hemifield. Even in the ipsilateral hemifield, accuracy decreased as the stimuli moved to the left (Smania et al. 1998).

Neglect may be confined to a vertical half of space or to one quadrant. For example, Shelton et al. (1990) described a patient with bilateral temporal-lobe lesions who neglected visual stimuli in extrapersonal space, but only when the stimuli were in the upper visual field.

Pouget and Sejnowski (1997) developed a neural network model of the transformations of retinocentric stimuli into headcentric and torsocentric coordinates in the parietal lobes. The model performed like neglect patients in a line-bisection task and showed effects in several frames of reference. See Karnath (1997), Landis (2000), Umiltà (2001), McCarthy (2002), and Mozer (2002) for reviews of the topic of spatial frames of reference and neglect.
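The gradient account lends itself to a toy simulation. In the sketch below the sigmoid salience function and its parameters are invented for illustration; this is not Pouget and Sejnowski's network model.

    import numpy as np

    # Toy illustration of the egocentric-gradient account of neglect
    # (Driver and Pouget 2000): if attentional salience falls off toward
    # the left, the salience-weighted center of a line shifts rightward.

    def perceived_center(length_cm, gradient=0.1, n=1001):
        x = np.linspace(-length_cm / 2, length_cm / 2, n)  # line extent
        salience = 1.0 / (1.0 + np.exp(-gradient * x))     # left-attenuated
        return np.sum(x * salience) / np.sum(salience)     # weighted centroid

    for length in (5, 10, 20, 40):
        print(f"{length:2d} cm line: bias {perceived_center(length):+.3f} cm")

    # The bias is rightward and grows with line length, consistent with
    # the line-bisection results described in the next section. The
    # leftward errors found with very short lines would need extra
    # assumptions, such as a gaze shift to one end of the line.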

32.1.1b Diagnosis of Visual Neglect

Several tests have been used to assess visual neglect. Patients may be asked to detect, identify, touch, or look at visual stimuli presented briefly in different locations. Patients with neglect tend to be reluctant to direct their gaze or move either arm toward an object in the neglected region. This has been called hypokinesia. Patients have difficulty in disengaging their attention from an object in the normal half of space so as to attend to an object in the neglected half of space (Posner et al. 1984). They may also take longer to name objects or properties of objects presented in the neglected half of space.

In the target-cancellation test, the patient uses a pen to strike out each of an array of targets, such as lines or letters. Patients with unilateral neglect ignore the targets on one side of the body midline, especially when they require a serial search (Aglioti et al. 1997).

In the line-bisection test, patients use a pointer to bisect a visual line presented in various orientations and locations. Patients with neglect of the left half of space set the center of a horizontal line toward the right. However, with a line shorter than about 5 cm, the center is set in the other direction, probably because patients shift their gaze to one end of the line (Marshall and Halligan 1989). In the tactile bisection test, patients manually locate the center of an unseen rod. One would expect the displacement of the center of a line to be correlated with a displacement of the apparent straight ahead when patients are asked to simply point to the straight ahead. There has been conflicting evidence on this question, but the expected correlation was found when care was taken to make the two tasks as similar as possible (Richard et al. 2004).

In drawing tests, patients draw a familiar object, such as a human figure. Patients with neglect tend to omit the half on the side of the neglect.

In the Wundt-Jastrow illusion, shown in Figure 32.1, the lower figure appears longer than the upper figure. But the upper figure appears longer when the left half of the two figures is covered, or to a person with left-side neglect viewing the whole figure.

Figure 32.1. The Wundt-Jastrow illusion. The two shapes are identical.

32.1.1c Neurology of Neglect

Visual neglect is more likely to arise from lesions in the nondominant hemisphere. For right-handed people, this is the right hemisphere. There is no agreed explanation for this asymmetry. Heilman et al. (1993) produced evidence that the right hemisphere directs attention to both sides

of space while the left hemisphere directs attention to only the right hemifield. Thus, after left-hemisphere damage, attention can still be directed to both halves of space. In normal subjects, attention to stimuli in either hemifield produced PET activation in the right parietal cortex. The left parietal cortex was activated only when subjects attended to stimuli in the right hemifield (Corbetta et al. 1993).

Traditionally, neglect is associated with lesions in the inferior parietal cortex. The functions of the parietal lobe were discussed in Section 5.8.4e. Neglect has also been associated with lesions in the frontal lobes, cingulate gyrus, basal ganglia, or thalamus. These regions are said to constitute a cortico-subcortical network that underlies spatial awareness (Mesulam 1999) (see Section 5.9). But Karnath et al. (2001) found that neglect in patients with no other visual-field defects is mainly associated with lesions in the rostral superior temporal cortex rather than the parietal lobe. The rostral superior temporal cortex receives inputs from both the ventral and dorsal streams (Section 5.8.3). Karnath et al. suggested that this area, the basal ganglia, and the pulvinar form a network that underlies spatial awareness and the ability to direct attention to specific locations.

Unilateral lesions of the parietal lobe may also cause defects in reaching movements of one or both arms to visual targets in the contralateral visual field (Holmes 1918; Holmes and Horrax 1919). This symptom is known as optic ataxia. Patients may be able to reach for points on their own body but not for external visual objects (Ratcliff and Davies-Jones 1972). Optic ataxia arising from parietal lesions can occur in the absence of neglect. For example, Perenin and Vighetto (1988) studied 10 cases of "pure" ataxia arising from unilateral lesions centered in the intraparietal sulcus of the posterior parietal lobe. The patients were slow to use the contralateral arm (hypokinesia). Also, reaching movements were inaccurate, and the hand did not form to the shape of the object.

The frontal lobe has been implicated in motor defects. However, Mattingley et al. (1998) compared patients with lesions in the right inferior parietal lobe with patients with lesions in the right inferior frontal lobe. Only the parietal-lobe patients showed slowed reaction times in reaching to objects on the left.

Parietal lesions may also produce astereognosia, or inability to recognize objects by touch, and asomatognosia, or denial that a part of the body is one's own. See Critchley (1955) and Newcombe and Ratcliff (1989) for reviews of these symptoms.

Patients with bilateral parietal lesions may have difficulty in directing their gaze to a peripheral target, even though they can fixate and follow their own finger. This symptom is known as gaze apraxia (Godwin-Austen 1965; Girotti et al. 1982). This defect is associated with a difficulty in shifting attention from one visual object to another rather than with a basic inability to move the eyes.



Patients with bilateral parietal lesions may also have difficulty in judging the relative positions or distances of objects. This combination of symptoms is known as Bálint's syndrome (Bálint 1909), but the terminology and definition of symptoms are confused. It is not clear whether the difficulty in judging distances arises from an inability to attend to more than one object (simultanagnosia) or whether the inability to attend to more than one object arises from a difficulty in appreciating the spatial layout of the scene.

Stimulation of the monkey premotor cortex evokes coordinated postures of hand and head. Postures involving the hand are systematically mapped with respect to positions round the body (Graziano et al. 2002). Unilateral ablation of premotor area 6 in the macaque produced a failure to bite food presented contralaterally to the lesion, and inattention to objects in the contralateral visual field.

The frontal lobes, rather than the parietal lobes, are probably involved when patients cannot voluntarily initiate eye movements (Monaco et al. 1980). The frontal eye fields (area 8) of the frontal lobes are concerned with controlling the direction of gaze. Ablation of area 8 in the macaque produced a decrease of eye movements into the contralateral hemifield and neglect of that hemifield.

32.1.1d Neglect in Near and Far Space

The region within arm's reach is near visual space, or peripersonal space. The region beyond arm's reach is far visual space, or extrapersonal space. Brain (1941) suggested that near space and far space are processed in distinct regions of the brain. He based this conclusion on the fact that spatial defects can vary between near and far visual space. For example, one of his patients with damage to the parietal lobe misjudged the distances of far objects but not of objects within arm's reach. A second patient, with damage to the temporal lobe, mislocalized objects only when they were within arm's reach.

When visual neglect shows only in near space (peripersonal space) or only in far space (extrapersonal space) it is said to be distance dissociable. For example, a patient with a unilateral right-hemisphere stroke arising from infarct of the right lateral posterior parietal cortex showed severe left visual spatial neglect. In the line-bisection task, she misplaced the center of a horizontal line to the left when the line was within reaching distance. However, the effect was absent or attenuated when the line was in extrapersonal space and bisected with a light-pointer (Halligan and Marshall 1991).

Cowey et al. (1994) pointed out that the bisection task used by Halligan and Marshall differed between far and near. In their own study, they asked five patients with left visual neglect to use a light-pointer to bisect lines of equal angular subtense in near and far space. All the patients misplaced the center to the right, but the displacement was greater for lines beyond reach than for lines within reach. This is the reverse of the effect shown by Halligan and Marshall's patient. Three of the patients studied by Cowey et al. had damage to the frontal eye fields, while they all had damage to the contralateral superior right parietal lobe. Halligan and Marshall's patient had damage to the right lateral posterior parietal cortex.

Heilman et al. (1990) found that a patient with bilateral infarction of the inferior temporal lobe showed smaller errors in bisecting a line extending out from the body when the line was close to the body than when it was at a distance of 60 cm.

A woman with a right temporo-occipital hematoma showed left homonymous hemianopia. She neglected visual objects in far space but not in near space. For example, she had problems finding a door to her left but could read a book and ate from both sides of her plate (Vuilleumier et al. 1998). Cowey et al. (1999) reported a similar case arising from damage to the frontal lobe and superior parietal lobe. Performance on the line-bisection task deteriorated in going from near to far, but not abruptly. In using the bisection test at different distances, Wilkinson and Halligan (2003) recommended using several lines of different lengths at each distance rather than only one line at each distance with linear length or visual angle held constant.

Other evidence suggests that the dissociation of unilateral neglect between near and far space depends on the way the line-bisection task is performed. Berti and Frassinetti (2000) described a patient with right-hemisphere stroke who showed unilateral neglect on the line-bisection task that was limited to near space. However, the difference between near and far space showed only when the task was performed with a light-pointer. When the task was performed with a stick held in the hand, neglect was evident in both near and far space. They concluded that peripersonal space is extended out from the body when a tool artificially extends the arm.

Pegna et al. (2001) found a somewhat different pattern of results in a patient with right parietal infarct. There was little evidence of unilateral neglect at any distance when the task was performed with a light-pen. Use of a stick produced unilateral neglect for both near and far distances. However, for both tasks, signs of unilateral neglect were slightly larger in near than in far space. They concluded that unilateral neglect is dissociable partly in terms of distance and partly in terms of type of response.

Ackroyd et al. (2002) reported the case of a man with neglect of visual objects in far regions on the left. Detection of objects in this region improved when he reached to them with a tool. This supports the idea that near space can be extended by use of a tool.

The tests used in all the above studies involved motor responses. The crucial difference between neglect in near and far space may be whether motor responses are directed to near or to far space. Differential processing of visual stimuli in near and far space may not be involved. In support of this idea, Pizzamiglio et al. (1989) found that 28 neglect patients showed no near-far dissociation when tested with the Wundt-Jastrow illusion (Figure 32.1). The test did not involve motor responses. An experiment is needed in which the same patients are tested on a purely visual test and on tests involving arm movements.

32.1.1e Physiological Evidence for Near-Far Dissociation

In primates, the dorsal stream involves MT, MST, the ventral intraparietal area (VIP), and the lateral intraparietal area (LIP). These areas project to area 7b in the posterior parietal lobe, which projects to the premotor cortex, especially to area 6 in the ventral region. Tactile, visual, and proprioceptive sources of information converge in the premotor cortex, which is concerned with the control of movements of the mouth, head, and arms. The ventral stream involves V4, the inferior temporal cortex, the superior temporal polysensory area, and frontal cortex (see Section 5.8.3).

There is evidence that spatial defects in near space arise from damage to the part of the dorsal cortical stream involved in controlling movements of the arms (area 7b and the premotor area). On the other hand, defects in far space arise from damage to areas controlling movements of the head and eyes (area 7a and the frontal eye fields) and perhaps also the ventral processing stream. Rizzolatti et al. (1983) reported that ablation of the postarcuate cortex (area 6) of the macaque produced contralateral neglect limited to near space. The animals failed to bite food presented contralaterally to the lesion and were reluctant to use the contralateral hand. Ablation of the frontal eye fields (area 8) reduced contralateral eye movements and produced neglect of contralateral visual objects, especially in extrapersonal space. However, there is conflicting evidence on this point (Latto 1982).

Weiss et al. (2000) used positron emission tomography (PET) to determine the regions of the human brain activated when subjects used a light-pointer to bisect a horizontal line. For a line within arm's reach (0.7 m), activity occurred in the left dorsal occipital cortex, left intraparietal cortex, and left ventral premotor cortex. For a line at 1.7 m (far space), activity occurred in the bilateral ventral occipital cortex and right medial temporal cortex. Thus, the regions activated differed even though the motor components of the tasks were similar.

Cells in several regions of the parietal cortex are sensitive to both tactile and visual stimuli in extrapersonal space (see Section 5.8.4e). There is evidence that tool use extends the visual receptive fields of cells in these areas. Iriki et al. (1996) trained macaque monkeys to use a rake to retrieve objects beyond arm's reach. After training,

PAT H O L O G Y O F V I S UA L D E P T H P E R C E P T I O N



219

the visual receptive fields of cells in the postcentral gyrus, near the intraparietal sulcus, had become extended to include the space occupied by the rake. 32.1.2 E X T I N C T I O N A N D G A Z E A P R AX I A

The symptom of extinction is closely related to that of neglect. The patient has no difficulty detecting a stimulus presented alone in the affected region of space but fails to detect it when a second stimulus is presented to some other location. Typically, the patient is not aware of a stimulus presented to a side of the body affected by damage to the contralateral cortex when a stimulus is applied to the same area on the other side of the body. The two stimuli must be presented at the same time, or with an interval of no more than about half a second (di Pellegrino et al. 1997).

It is generally believed that extinction is due to the competitive advantage of stimuli presented to the intact cerebral hemisphere in gaining the patient’s attention (see Marzi et al. 2001). Stimuli on the normal side do not completely suppress stimuli on the affected side. For example, patients responded more rapidly to flashes presented to both sides of the visual field than to a flash presented only to the normal side (Marzi et al. 1996). The fMRI showed visually evoked activity in the right visual cortex and early extrastriate areas of a patient with a lesion in the right inferior parietal area, even when the patient was unaware of the stimuli (Rees et al. 2000). It seems that simple attributes of extinguished stimuli are processed even though the patient is unaware of the stimuli.

Extinction may be more evident with some stimuli than with others. For example, Ward and Goodrich (1994) reported that two patients with right cerebral infarct failed to detect a stimulus on the left when it was presented at the same time as a stimulus on the right. However, left extinction was less likely to occur when the two stimuli formed a coherent group because they were similar, symmetrical, or formed a familiar configuration. Competition for attention is weaker for similar stimuli than for dissimilar stimuli (see Driver 1995).

In unimodal extinction the two stimuli are presented to the same sensory modality, which may be vision, audition, or touch. In cross-modal extinction the stimuli are presented to distinct sensory modalities (Mattingley et al. 1997). For example, a man with damage to the right frontotemporal cortex was aware of a tactile stimulus applied to the hidden left hand but not when an object was simultaneously seen near the right hand (di Pellegrino et al. 1997). Visual objects in other locations had no effect. In patients with right hemisphere lesions, visual stimuli presented to the ipsilateral side of the face extinguished tactile stimuli on the contralateral side of the face. However, visual stimuli presented to the contralateral side of the face improved detection of tactile stimuli on that side. Visual stimuli presented far from the face had little or no effect on detection of tactile stimuli (Làdavas et al. 1998). The interaction between stimuli in different modalities probably occurs in cells that code locations in near space bimodally. For example, cells in the temporal, parietal, and premotor cortices have tactile receptive fields on the hand and visual receptive fields for objects seen near the hand (Section 5.8.6).

We saw in the last section that the extent of peripersonal space is extended out from the body when a task is performed with a stick held in the hand. Maravita et al. (2001) described a patient with a lesion in the right hemisphere. A visual stimulus close to the right hand extinguished awareness of a touch on the left hand. The extinction was reduced when the visual stimulus was more distant. But extinction returned when the more distant visual stimulus was placed at the end of a stick held in the patient’s right hand. Maravita et al. (2003) reviewed the question of the effects of tool use on multisensory integration.

32.2 STEREOANOMALIES

32.2.1 STEREOANOMALIES WITH BRIEF STIMULI

Richards (1971) asked observers to report whether lines on each side of a fixation cross were in front of, behind, or coplanar with the cross (Portrait Figure 32.2). The lines were presented for 80 ms, both with the same disparity, which ranged from zero to 4˚. Subjects also matched the perceived depth of

Figure 32.2. Whitman Richards. He joined the faculty of MIT in 1965 after receiving his PhD in Experimental Psychology from the same institution. Currently he holds appointments at MIT in the department of brain and cognitive sciences, the media arts and sciences, and the computer science and artificial intelligence lab.


a flashed target with a continuously visible stereoscopic probe seen subsequently with free eye movements. For subjects with normal stereoscopic vision, matched depth at first increased with increasing crossed or uncrossed disparity and then decreased as disparity increased toward 4˚ (Figure 32.3). Stereoanomalous subjects perceived either all crossed or all uncrossed disparities at the same depth as the zero-disparity stimulus.

Richards concluded that perceived depth depends on the pooling of inputs from three classes of broadly tuned disparity detectors, one class tuned to crossed disparities, one to uncrossed disparities, and one to zero disparities. Physiological evidence supports this idea (Section 11.4.1). He also concluded that some people lack either crossed or uncrossed disparity detectors and consequently fail to detect depth defined by either crossed or uncrossed disparities with respect to a fixated stimulus. Richards estimated that up to about 30% of the population have a stereoanomaly of this type (Richards 1970). We will see in what follows that the idea that stereoanomalous subjects lack one or other class of disparity detector must be modified, since it has been found that anomalies depend on the test used to diagnose them.
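The pooling account lends itself to a minimal numerical sketch. The sketch below is illustrative only: the Gaussian tuning curves, the pool centers and width, and the winner-take-all readout are assumptions chosen for demonstration, not parameters taken from Richards (1971). Removing the crossed pool makes every crossed disparity classified as zero depth, which is the anomalous pattern described above.

```python
import numpy as np

POOL_CENTERS = {"crossed": -1.0, "zero": 0.0, "uncrossed": +1.0}  # degrees

def pool_response(disparity_deg, center, width=1.5):
    """Broadly tuned disparity pool, modeled here as an illustrative Gaussian."""
    return np.exp(-0.5 * ((disparity_deg - center) / width) ** 2)

def classify_depth(disparity_deg, missing_pool=None):
    """Report 'near', 'zero', or 'far' from whichever pool responds most.

    Setting `missing_pool` simulates a stereoanomalous observer who
    lacks one class of disparity detector.
    """
    responses = {name: pool_response(disparity_deg, center)
                 for name, center in POOL_CENTERS.items()
                 if name != missing_pool}
    winner = max(responses, key=responses.get)
    return {"crossed": "near", "zero": "zero", "uncrossed": "far"}[winner]

for d in (-2.0, -0.7, 0.0, 0.7, 2.0):
    print(f"{d:+.1f} deg: normal -> {classify_depth(d):<4s} "
          f"lacking crossed pool -> {classify_depth(d, missing_pool='crossed')}")
```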

32.2.2 STEREOANOMALIES WITH LONG EXPOSURES

Effects of stimulus duration on the stereo threshold were discussed in Section 18.12.1a. People classified as stereoanomalous when tested with briefly exposed stimuli may perform normally when tested with longer exposure. For instance, about 30% of subjects could not detect depth created by 1˚ of crossed or uncrossed disparity in a dynamic random-dot stereogram exposed for 167 ms. However, all but one of the subjects performed perfectly when allowed to look at the stereograms for as long as they wished (Patterson and Fox 1984). Similar results were obtained by Newhouse and Uttal (1982) and by Tam and Stelmach (1998). The stereoanomalous observers may have succeeded with long exposure by simply converging or diverging the

Figure 32.3. Depth judgments and disparity magnitude. Percent correct identification of the depth of two lines relative to a fixation cross as a function of disparity in the lines. The lines were flashed on for 80 ms. (Redrawn from Richards 1971)

eyes and thus converting a disparity that they could not detect into one that they could detect. However, they also performed perfectly when stereo images were impressed on the eyes as afterimages. Foley and Richards (1974) trained a stereoanomalous person to discriminate between zero disparity and either crossed disparities or uncrossed disparities. Stimuli were presented for as long as the subject wished. After this training, the stereoanomaly that was revealed with a flashed target was considerably reduced. This evidence suggests that stereoanomalies revealed with flashed stereograms arise because subjects require time to process disparity information, and not because they lack a basic stereo mechanism. The role of learning in perception of depth in random-dot stereograms was discussed in Section 18.14.

32.2.3 STIMULUS-SPECIFIC STEREOANOMALIES

Stereoanomalies can be specific to the sign of luminance contrast in the visual target. Thus, observers who confused either crossed or uncrossed disparity stimuli with zero-disparity stimuli reversed the sign of their confusion when the stimulus was changed from dark bars on a light background to light bars on a dark background (Richards 1973). This does not fit with the idea of a simple loss of either crossed or uncrossed disparity detectors.

Stereoanomalies can also be specific to particular stimuli and to particular locations in the visual field. Richards and Regan (1973) developed stereo perimeter tests to investigate this question. In one test, a luminous vertical bar oscillating in depth at 2 Hz between 0 and 0.4˚ of crossed or uncrossed disparity was placed in different positions in the visual field while the subject fixated a stationary point. The subject was deemed to have stereoscopic vision in a given region if the target was seen to move in depth rather than from side to side. In a second test, a target with ±0.4˚ of disparity was flashed on for 100 ms at different positions. An observer with apparently normal vision and normal stereopsis with stationary stimuli had large areas in the visual field within which motion in depth created with uncrossed disparities could not be detected and other regions in which motion in depth with crossed disparities could not be detected. Other subjects were found to behave in a similar fashion. Thus the visual fields of some subjects with otherwise normal vision contain local areas blind to motion in depth defined by changing disparity (Richards and Regan 1973). These have been called stereomotion scotomata. Sensitivity to static disparity and to sideways motion is normal in such areas. The opposite may also be true. About half of a group of people classified as stereoanomalous with static displays could judge depth in moving displays, such as a rotating
cylinder, in which depth was defined by disparity (Rouse et al. 1989). Stereomotion scotomata may be a few degrees in diameter or extend over a quadrant or most of the visual field. The defect in a given area may be specific for either crossed or uncrossed disparities, that is, for an object moving along a trajectory nearer than or beyond the plane of convergence. Furthermore, for some subjects the defect was specific to the direction of motion toward or away from the person (Hong and Regan 1989). The defect took several forms. In one form, motion in depth was seen initially but rapidly faded. In a second form, the moving object appeared diplopic. In a third form, the object appeared stationary or appeared to move from side to side rather than in depth. Stimuli within a stereomotion scotoma evoked only weak vergence eye movements when they moved in depth but normal conjugate eye movements when they moved from side to side (Regan et al. 1986b). The occurrence of stereomotion scotomata in areas of the visual field with normal static stereopsis does not prove that disparity detectors serving stereomotion are distinct from those serving static stereopsis. Cumming (1995) pointed out that the two systems could depend on the same detectors but that an additional mechanism that differentiates the disparity signal could feed into the stereomotion system. Thus, any general loss of disparity detectors would disrupt both static stereopsis and stereomotion but a defect in the differentiator would affect only stereomotion. Patients lacking static stereopsis due to esotropia may perceive motion in depth created by opposite motion of dichoptic stimuli (Maeda et al. 1999). They presumably use difference-of-motion signals rather than change-of-disparity signals (see Section 31.3).

32.3 BRAIN DAMAGE AND STEREOPSIS

32.3.1 GENERAL EFFECTS OF BRAIN DAMAGE

Brain damage, especially in the parietal lobes, produces a wide variety of disorders of spatial vision that go under the heading of metamorphopsia. Objects may appear unusually large (macropsia) or small (micropsia) and far away (teleopsia). Or they may appear compressed in one dimension. Objects may appear tilted, inverted, fragmented, or displaced (Critchley 1955).

Head trauma commonly produces a temporary loss of stereoscopic vision or reduced stereoacuity. In a group of 93 head trauma cases, 41% scored more than two standard deviations below the mean of a control group on a stereoacuity test, and 24% totally failed the test (Miller et al. 1999).



Many patients with focal cerebral lesions have some impairment of stereopsis, although they do not notice the defect in their ordinary lives. Impairment of stereopsis due to brain damage is often accompanied by other visual defects (Danta et al. 1978). Complete loss of depth perception is not a common symptom. However, there have been several reports of soldiers suffering from head injuries for whom the world appeared to lie in a single frontal plane, like a picture. There was probably some visual agnosia in these cases, because the patients also had difficulty recognizing things. Lesions of the right parietal lobe were implicated in this disorder (Riddoch 1917; Holmes and Horrax 1919; Critchley 1955). Turnbull et al. (2004) described a patient with general head injury who could copy and recognize drawings that represented 2-D objects but failed to recognize the depth structure of drawings of 3-D objects. The patient could not distinguish drawings that represented real 3-D objects from drawings that represented impossible objects.

Cerebral anoxia or coma can induce a period of cerebral blindness from which the patient gradually recovers. Bodis-Wollner and Mylin (1987) monitored this recovery in two patients with cerebral blindness by recording the brain potentials evoked by monocular stimuli and by binocular presentation of a dynamic random-dot stereogram. Binocular responses recovered more slowly than monocular responses. One of the patients recovered stereopsis at about the same time that she showed evidence of binocular function in the evoked potentials.

After removal of areas 17 and 18, cats lost their ability to discriminate depth based on disparity, even though other abilities such as offset acuity and brightness discrimination survived (Ptito et al. 1992). Cowey (1985) trained monkeys to detect a depth interval between two black rods. A 10-fold increase in the stereoscopic threshold occurred after the part of V1 corresponding to the central 5˚ of the visual field was removed. A variable but smaller increase occurred after removal of a similar region in V2. This effect is to be expected, since the test was one of fine stereopsis. A 50% increase in the stereo threshold after removal of the inferotemporal cortex was unexpected. Detection of a depth region with a large disparity in a random-dot stereogram was unaffected by removal of the central 5˚ of V1. Foveal vision was clearly not essential for this coarse task. Removal of the central area of V2 or of most of the inferotemporal cortex slightly impaired performance on this test. Monkeys were unable to perform the task at all after extensive damage to the rostral superior colliculus and pretectum, which are subcortical regions associated with the control of eye movements. Subsequent tests revealed that these monkeys suffered from diplopia, which suggests that they were unable to control vergence.


32.3.2 ASYMMETRICAL EFFECTS OF BRAIN DAMAGE

It is generally believed that higher centers in the right cerebral hemisphere are specialized for visuospatial tasks, such as visual localization, judgments of orientation, and depth perception, whereas higher centers in the left hemisphere are specialized for language (Kimura and Durnford 1974; B. Milner 1974; Gazzaniga and LeDoux 1978). However, Birkmayer (1951) reported that 76% of 70 brain-injured patients with impaired depth perception had left-sided damage. Furthermore, Rothstein and Sacks (1972) reported that patients with left parietal lobe lesions showed a greater stereoscopic deficit on a standard Titmus test than those with lesions in the right parietal lobe. However, only two of their ten patients had left-side damage. Lehmann and Wälchli (1975) also used the Titmus test but failed to find differential effects of left and right hemisphere damage in neurological patients. The Titmus test does not test for disparities of less than about 40 arcsec.

Danta et al. (1978) found that stereoscopic defects in a sample of 54 patients were more likely to be associated with damage to the temporal, parietal, and occipital lobes in the right hemisphere than in the left hemisphere (see also Ross 1983). Lesions in the left hemisphere associated with stereoscopic deficits were found to lie preferentially in the frontal and temporal lobes. The question of the lateralization of stereoscopic defects as assessed by standard tests of stereopsis is far from settled.

Now consider the lateralization of defects assessed by random-dot stereograms. Several investigators have reported that patients with disease of the right cerebral hemisphere (serving the left visual field) more often fail to perceive depth in random-dot stereograms than do normal subjects or patients with disease of the left hemisphere (Carmon and Bechtoldt 1969; Benton and Hécaen 1970; Hamsher 1978; Ptito et al. 1991). In the last three of these studies, patients with right or left hemisphere disease performed at the same level on standard stereoscopic tests in which the forms were visible monocularly. However, Ross (1983) found that performance with both types of stereogram was equally affected by right hemisphere damage. Rizzo and Damasio (1985) found no laterality effects for either type of stereotest but found impairment to be greater with damage to the parietal lobe than to the temporal lobe. Lehmann and Julesz (1978) found no difference between the visual evoked potentials recorded from the left and right hemispheres when subjects were presented with random-dot stereograms (Section 11.7).

It has been reported that people with normal vision are better able to identify a cyclopean form in a random-dot stereogram when it is presented in the left visual field (right hemisphere) rather than the right visual field (Durnford and Kimura 1971). However, since several cyclopean shapes

were presented, shape identification rather than stereoscopic vision may have been the factor responsible for the field asymmetry. Julesz et al. (1976a) used only one cyclopean shape in a dynamic random-dot stereogram and found no hemifield differences in the stimulus-duration threshold for detection of depth or in the maximum eccentricity at which a stereo target could be detected. Pitblado (1979) obtained a left-field (right hemisphere) superiority in the recognition of cyclopean shapes when the dots comprising the stereogram were small but, with large dots, performance was better in the right visual field projecting to the left hemisphere.

It has been proposed that stereopsis based on cyclopean forms in random-dot stereograms is more localized in the right hemisphere, and that stereopsis based on regular stereograms, in which the forms are visible to each eye, is more localized in the left hemisphere. Even if this is true, the crucial factor may be the relative spatial frequencies of the stimuli rather than whether or not they are cyclopean. Another possibility is that hemisphere-specific deficits reported with random-dot stereograms are due to aspects of the task other than stereopsis, such as form perception, reaction time, or control of convergence. One way to test this would be to see whether such patients can see cyclopean shapes that are not defined by horizontal disparities (they could be defined by texture rivalry or vertical disparities) and whether they can see depth in random-dot stereograms in which the outlines of the forms are provided in the monocular images.

The clinical category of hemisphere damage does not allow one to draw conclusions about the specific site of a deficit, and there is the problem of being sure that clinical samples are matched for factors such as age, intelligence, and motivation.

32.4 ABNORMAL INTEROCULAR TRANSFER

An induction stimulus may affect a test stimulus in three ways. The test stimulus can have its threshold reduced (threshold summation) or increased (threshold elevation, or masking), or some feature of the test stimulus, such as its orientation, motion, or spatial frequency, may appear to change. Each effect can be induced in a test stimulus presented at the same time as or just after the induction stimulus. Most of these effects occur, although to a lesser degree, when the induction stimulus is presented to one eye and the test stimulus to the other. In other words, they show interocular transfer. Binocular summation and masking were discussed in Chapter 13 and interocular transfer of figural effects in Section 13.3.

The present section is concerned with the extent to which people with defective binocular vision show binocular summation, binocular masking, and interocular
transfer of aftereffects. The degree of interocular transfer has been taken as a measure of binocular interactions in the visual cortex. Given that an induction effect is cortical, any interocular transfer is assumed to reflect the extent to which the induction and test stimuli excite the same binocular cells in the visual cortex. In people with normal binocular vision, the reduced size of a transferred effect relative to the same-eye effect is assumed to be due to dilution of the effect by unadapted monocular cells fed from the unadapted eye or by binocular AND cells that require simultaneous inputs from both eyes. If this logic is correct, a person lacking binocular cells should show no interocular transfer or binocular recruitment of cortically mediated induction effects. In practice, there are many pitfalls in applying this logic, and the literature has become complex and rather contentious.

32.4.1 BINOCULARITY AND BINOCULAR SUMMATION

A near-threshold stimulus is more likely to be detected when it is presented to two eyes rather than one. Binocular summation could be due to neural summation, which is the process whereby subthreshold excitatory signals from the two eyes are summed when they impinge on cortical binocular cells. But the contribution of neural summation can be determined only after allowance has been made for the fact that detection based on information from two independent detectors shows a √2 (1.4:1) advantage over that based on the output of a single detector. This is probability summation, which is best determined by measuring interocular effects under conditions in which neural summation is unlikely to occur, for instance, when the stimuli in the two eyes are separated spatially or presented at slightly different times (Section 13.1.1).

In people with normal binocular vision, luminance-increment and contrast-detection thresholds for superimposed identical dichoptic stimuli are lower than monocular thresholds to a greater extent than predicted from probability summation (Section 13.1.2). Therefore, true neural summation occurs. Neural summation, like the response of cortical cells, is greatest for stimuli with the same orientation and spatial frequency (Blake and Levinson 1977).

Cats reared with daily alternating monocular occlusion did not show behavioral evidence of binocular summation (von Grünau 1979). In monocularly deprived cats the VEP in response to a temporally modulated grating was smaller with both eyes than with only the normal eye open (Sclar et al. 1986). In people with severe loss of binocularity, binocular thresholds are what one would predict from probability summation, even with well-matched stimuli (Lema and Blake 1977; Westendorf et al. 1978; Blake et al. 1980; Levi et al. 1980).
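The size of the statistical advantage can be made explicit with a standard signal-detection sketch, assuming independent, equal-variance noise in the two eyes; this is a textbook argument, not a derivation given in the studies cited above. Combining two equally sensitive monocular signals gives

```latex
% Ideal combination of two equally sensitive, independently noisy eyes:
d'_{\mathrm{bin}} \;=\; \sqrt{{d'_{L}}^{2} + {d'_{R}}^{2}} \;=\; \sqrt{2}\, d'_{\mathrm{mono}},
% i.e., a 1.4:1 sensitivity advantage with no neural summation. On a
% high-threshold account, probability summation instead predicts
P_{\mathrm{bin}} \;=\; 1 - (1 - P_{L})(1 - P_{R}).
```

Binocular thresholds lower than this √2 baseline are what license the conclusion, stated above, that true neural summation occurs.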



The temporal contrast sensitivity function is the luminance modulation of a light required for detection of flicker as a function of temporal frequency. For subjects with normal binocular vision, sensitivity for in-phase binocular flicker in the region of 10 Hz is about 40% higher than that for antiphase flicker (see Section 13.1.5). At a flicker rate of 0.1 Hz, sensitivity for in-phase flicker is up to four times higher than for antiphase flicker (van der Tweel and Estévez 1974; Cavonius 1979). Stereoblind subjects showed no difference between in-phase and antiphase flicker sensitivity (Levi et al. 1982).

Uniform patches flickering at slightly different frequencies in the two eyes create the appearance of a rhythmic modulation of luminance at a frequency equal to the difference in frequency between the two patches. This dichoptic beat phenomenon is a simple consequence of nonlinear binocular luminance summation as the two flickering patches come into and out of phase. Three subjects with alternating strabismus and normal visual acuity and three stereoblind strabismic amblyopes failed to see dichoptic visual beats, thus providing more evidence that binocular neural summation is reduced or absent in stereoblind people (Baitch and Levi 1989). Visually evoked potentials, also, reveal that binocular summation is reduced in stereoanomalous observers (Section 8.5.2).
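Why a nonlinearity is required can be shown with a short simulation; the flicker rates and the squaring nonlinearity below are arbitrary illustrative choices, not stimulus values from Baitch and Levi (1989). A purely linear combination of the two monocular signals contains no energy at the difference frequency, whereas any nonlinear combination, here simple squaring, does.

```python
import numpy as np

fs = 1000.0                      # sample rate (Hz)
t = np.arange(0, 10, 1 / fs)     # 10 s of signal
f_left, f_right = 10.0, 11.0     # flicker rates in the two eyes (Hz)

left = np.sin(2 * np.pi * f_left * t)
right = np.sin(2 * np.pi * f_right * t)

linear = left + right            # linear binocular summation
nonlinear = (left + right) ** 2  # any nonlinearity will do; squaring is simplest

def amplitude_at(signal, freq):
    """Fourier amplitude of `signal` at `freq` Hz."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean())) / len(signal)
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return spectrum[np.argmin(np.abs(freqs - freq))]

beat = abs(f_left - f_right)     # the 1-Hz beat frequency
print(f"Energy at {beat} Hz, linear sum:     {amplitude_at(linear, beat):.4f}")
print(f"Energy at {beat} Hz, after squaring: {amplitude_at(nonlinear, beat):.4f}")
```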

32.4.2 BINOCULARITY AND DICHOPTIC MASKING
A briefly exposed suprathreshold test stimulus, such as a black and white grating, becomes difficult to detect when it is superimposed in the same eye on a similar suprathreshold stimulus. This is simultaneous masking. A grating also becomes difficult to detect when it is presented just before or just after a similar grating. This is successive masking, or the threshold-elevation effect (Campbell and Kulikowski 1966). In both cases, the test stimulus is said to be masked by the adapting stimulus, or mask. In dichoptic masking, the mask is presented to one eye and the test stimulus to the other. This is called dichoptic masking rather than rivalry because, unlike rivalry, it occurs optimally between dichoptic stimuli of similar shape. In normal subjects, and at high contrasts, simultaneous dichoptic masking has been reported to be stronger than monocular masking (Legge 1979). Successive dichoptic masking was about 65% as strong as monocular masking (Blakemore and Campbell 1969; Hess 1978). Dichoptic masking increases as the gratings in the two eyes are made more dissimilar in contrast (Legge 1979). Since contrast in an amblyopic eye is attenuated relative to that in the good eye, one could argue that suppression of an amblyopic eye is an expression of the same mechanism that causes interocular masking in normal eyes.


Harrad and Hess (1992) tested this idea by measuring dichoptic masking with sinusoidal gratings in subjects with various kinds of amblyopia. Only anisometropic amblyopes showed normal masking functions after allowance was made for interocular differences in contrast sensitivity. They concluded that other types of amblyopia cannot be understood in terms of the normal mechanism of dichoptic masking, and must involve more than a simple loss of contrast sensitivity in the affected eye. See Baker et al. (2008) for more information on this topic. There has been some dispute about whether dichoptic masking is present in stereoblind people with an early history of strabismus. Ware and Mitchell (1974) found no dichoptic masking in two stereoblind subjects, whereas Lema and Blake (1977) found some in three of their four stereoblind subjects. Anderson et al. (1980), also, found interocular transfer of the threshold-elevation effect in seven stereoblind subjects, especially from a nonamblyopic eye to a normal eye, although to a lesser extent than in normal subjects. Hess (1978) tested one strabismic amblyope, with some residual stereopsis, who showed no interocular transfer of the threshold-elevation effect, and another, with no stereopsis, who showed full transfer. In a later study he found that amblyopes show no threshold-elevation in the amblyopic eye after binocular adaptation, and concluded that threshold-elevation and amblyopic suppression occur at the same cortical level (Hess 1991). Subjects with abnormal binocular vision and amblyopia showed a normal level of simultaneous interocular masking at suprathreshold levels of contrast, as shown in Figure 32.4 (Levi et al. 1979). The same subjects failed to show interocular subthreshold summation. This suggests that, in people with defective binocular vision, inhibitory interactions responsible for masking still occur between the left- and right-eye inputs to binocular cells, but that excitatory interactions responsible for subthreshold summation are absent. This agrees with the evidence reviewed in Section 8.5.2. Binocular summation for low-contrast stimuli was still present in subjects with normal vision who had worn an occluder over one eye for 8 days (Smith and Harwerth 1979). The spatial frequency of the stimulus is an important factor that may help resolve some of the conflicting evidence about interocular transfer of the threshold-elevation effect, and of other aftereffects that are mentioned later. Amblyopes tend to show a selective loss of contrast sensitivity for high spatial frequencies. Selby and Woodhouse (1981) found that amblyopes showed almost normal interocular transfer of the threshold-elevation effect with low spatial-frequency stimuli to which the normal and amblyopic eyes were equally sensitive. However, they showed little or no transfer with high spatial-frequency stimuli for which there was a difference in sensitivity in the two eyes. Some of these amblyopes had stereovision, as tested on the Titmus test, and some did not, but their


Figure 32.4. Dichoptic masking in an amblyopic eye. Contrast sensitivity of a normal right eye (upper figure) and an amblyopic right eye (lower figure). The left eye viewed a homogeneous surface or a 2-cpd grating. (Adapted from Levi et al. 1979b)

stereoscopic performance was not related to their degree of interocular transfer. It was concluded that stereopsis and interocular transfer of the threshold-elevation effect are not mediated by the same mechanism. However, the Titmus test involves high spatial-frequency stimuli, and it is not high spatial-frequency stereoacuity that one would expect to be related to interocular transfer of an aftereffect tested at a low spatial frequency. Perhaps a relation between the two functions would be found if subjects were tested for stereoscopic vision with low spatial-frequency stimuli. A second but related factor in interocular transfer is the position of the stimulus. Even though stereoscopic vision may be lost in the central retina, where both fine and coarse disparities are processed, it may be retained in the peripheral retina, where only coarse disparities are processed (Section 8.5.1). Binocular subthreshold summation and interocular transfer of the threshold-elevation effect were reduced or absent in strabismic and anisometropic amblyopes for stimuli confined to the central visual field.


These effects were also absent in the visual periphery for anisometropes, but strabismics showed considerable interocular summation and transfer of the threshold-elevation effect in the peripheral field (Sireteanu et al. 1981). One would expect binocular vision in anisometropes to be disturbed more in the periphery than in the fovea, because a differential magnification of the two images produces disparities that increase with eccentricity. In strabismics, image displacement is the same over the whole visual field but affects peripheral vision less than foveal vision because the periphery has larger receptive fields.

32.4.3 BINOCULARITY AND THE TILT AFTEREFFECT

Inspection of an off-vertical line or grating induces an apparent tilt of a vertical line in the opposite direction. When the induction and test lines are presented at the same time, the effect is known as tilt contrast. When the test line is presented after the induction stimulus, it is known as the tilt aftereffect. In people with normal binocular vision the tilt aftereffect shows interocular transfer when the induction line is presented to one eye and the test line to the other (Section 13.3). Estimates of the extent of interocular transfer have varied between 40 and 100% (Gibson 1937; Campbell and Maffei 1971). Since, in primates, orientation is first coded in the visual cortex, the interocular transfer of the tilt aftereffect has been used as a test of normal binocular functioning.

Some investigators found the extent of interocular transfer to be positively correlated with stereoacuity (Mitchell and Ware 1974), while others found no such correlation (Mohn and Van Hof-van Duin 1983). However, subjects with strabismus acquired before the age of 3 years or with loss of stereopsis for other reasons showed little or no interocular transfer of the tilt aftereffect (Movshon et al. 1972; Ware and Mitchell 1974; Banks et al. 1975; Hohmann and Creutzfeldt 1975; Mann 1978). In these experiments the induction stimuli were gratings with a spatial frequency of about 7 cpd, which were tilted about 10˚ to the vertical and subtended 3˚ or less.

Maraini and Porta (1978) used a 20˚-wide grating with a spatial frequency of only 0.5 cpd, and obtained a high level of interocular transfer of the tilt aftereffect in alternating strabismics. Moreover, although consistent strabismics (esotropes) showed no transfer from the dominant to the nondominant eye, they showed good transfer in the opposite direction, albeit less than that in subjects with normal vision. Buzzelli (1981) obtained a normal level of interocular transfer of the tilt aftereffect in both directions in a mixed group of 23 strabismics. He used a grating 14˚ wide with a spatial frequency of 2 cpd. This evidence suggests that the tilt aftereffect shows interocular transfer when the spatial frequency of the stimulus is low enough to be detected by the amblyopic eye.



Inspection of a textured surface rotating in a frontal plane about the visual axis causes a superimposed vertical line to appear tilted in a direction opposite to that of the surface. When the rotating surface was presented to one eye and the vertical line to the other, this effect remained at full strength in subjects with normal binocular vision but was reduced in stereoblind subjects (Marzi et al. 1986).

32.4.4 BINOCULARITY AND THE MOTION AFTEREFFECT

In the motion aftereffect a stationary display appears to move in the opposite direction to a previously inspected moving display. In people with normal binocular vision the motion aftereffect transfers at least 50% to an eye that was not exposed to the induction stimulus (Section 13.3.3). Wade (1976) found no interocular transfer of the motion aftereffect produced by a rotating sectored disk in six stereoblind adults who had strabismus from early childhood. There was some transfer in 11 stereoblind subjects whose strabismus had been surgically corrected (see also Hess et al. 1997). Also, six subjects with mild strabismus and some stereoscopic vision showed some interocular transfer of the motion aftereffect. Using a 10˚-diameter rotating sectored disk, Mitchell et al. (1975), also, found no interocular transfer of the motion aftereffect for subjects who were stereoblind because of childhood strabismus or anisometropic amblyopia. For subjects with some stereoscopic vision there was a positive correlation of 0.75 between the amount of transfer and stereoacuity. Mohn and Van Hof-van Duin (1983) used a similar 10˚-diameter sectored disk and found no interocular transfer of the aftereffect in stereoblind subjects. In subjects with some stereovision, they found no correlation between the amount of transfer and stereoacuity. Keck and Price (1982) used an 8˚-wide moving grating to test three groups of subjects: (1) those who had central scotomata but some peripheral stereoscopic vision, (2) those with alternating strabismus but no stereoscopic vision, and (3) those with a consistent strabismus, anomalous correspondence, and no stereoscopic vision. Subjects in the first two groups showed less transfer of the motion aftereffect than subjects with normal vision, while transfer was absent in the third group. Interocular transfer of the motion aftereffect was significantly reduced in 10 early-onset strabismics with a stimulus confined to the central 2.8˚ of the visual field but not with a stimulus confined to an annular region between 20 and 40˚ of eccentricity (O’Shea et al. 1994). The motion-coherence threshold is the least proportion of coherently moving dots in an array of randomly moving dots that can be seen as moving in a given direction. After inspection of random dots moving in one direction, the motion-coherence threshold is elevated for dots moving in the same direction as the induction stimulus.
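The stimulus itself is simple to specify. The sketch below is a hypothetical illustration (the dot count, field size, and speed are arbitrary choices, and `rdk_step` is not taken from any cited study): on each frame, a random subset of dots whose expected proportion equals the coherence steps in the signal direction, while the remaining dots step in random directions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rdk_step(positions, coherence, signal_angle=0.0, speed=1.0):
    """Advance a random-dot kinematogram by one frame.

    A random subset of dots (expected proportion `coherence`) moves in the
    signal direction; the remainder move in independently random directions.
    """
    n = len(positions)
    is_signal = rng.random(n) < coherence
    angles = np.where(is_signal, signal_angle, rng.uniform(0, 2 * np.pi, n))
    steps = speed * np.column_stack([np.cos(angles), np.sin(angles)])
    return positions + steps

dots = rng.uniform(0, 100, size=(200, 2))   # 200 dots in a 100 x 100 field
for coherence in (0.05, 0.25, 1.0):
    moved = rdk_step(dots, coherence)
    mean_dx = (moved - dots)[:, 0].mean()   # net rightward drift per frame
    print(f"coherence {coherence:.2f}: mean rightward step = {mean_dx:+.3f}")
```

The net drift grows in proportion to the coherence, and the coherence threshold is the smallest proportion at which the observer reliably reports the signal direction.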


Raymond (1993) obtained 96% transfer of the motion-coherence threshold. She favored the view that cortical area MT is the site of this interocular effect, which would account for its high level of interocular transfer. McColl and Mitchell (1998) confirmed the high interocular transfer of the motion-coherence threshold and found that, while stereodeficient subjects showed very little transfer of the conventional motion aftereffect, they showed about 90% transfer of the motion-coherence threshold aftereffect. This supports the idea that the elevation of the motion-coherence threshold is processed at a higher level in the visual system than the simple motion aftereffect.

Summary

It is generally agreed that little or no interocular transfer occurs in stereoblind subjects. However, some transfer has been found in certain types of stereoblind subjects, and there is some controversy about the correlation between the degree of transfer and stereoacuity. The conflicting findings from different laboratories could be due to different clinical samples, different diagnostic tests, or different stimuli used to measure interocular transfer. One potentially important factor is the position of the stimuli. There is more interocular transfer of the threshold-elevation effect for peripheral than for central stimuli, and some stereovision is retained in the peripheral retina when it is lost in the central retina. A related factor is spatial frequency. Amblyopes show more interocular transfer of the threshold-elevation effect for low spatial-frequency stimuli, to which both eyes are equally sensitive, than they show for high spatial-frequency stimuli, to which the amblyopic eye is relatively insensitive. In future studies of interocular transfer of motion and tilt aftereffects in people with visual defects, special attention should be paid to the spatial frequency and size of stimuli used to test stereoscopic vision and interocular transfer.

32.5 BINOCULARITY AND PROPRIOCEPTION

In many animals, including humans, sensory receptors exist in the extraocular muscles and in the tendons of these muscles (Bach-y-Rita 1975; Richmond et al. 1984). Signals from these receptors enter the brain along the ophthalmic branch of the trigeminal nerve (fifth cranial nerve). Responses to stretching of extraocular muscles have been recorded in cells of the superior colliculus (Donaldson and Long 1980), the vermis of the cerebellum (Tomlinson et al. 1978), the visual cortex (Buisseret and Maffei 1977), and the frontal eye fields (Dubrovsky and Barbas 1977). The response of about 40% of relay cells in the LGN of the cat to drifting gratings was modified when the eye was passively rotated to different positions (Lal and Friedlander 1990). This suggests that afferent eye-position signals gate the transmission of visual signals in the LGN.

The proprioceptive-visual gating hypothesis is supported by the fact that prolonged monocular paralysis in adult cats causes a bilateral reduction of responses of X-cells in the LGN (Garraghty et al. 1982). This effect is immediately reversed by removal of proprioceptive inputs from the nonparalyzed eye (Guido et al. 1988). This suggests that the effects of monocular paralysis are due to inhibition of X-cells by asymmetrical proprioceptive inputs from the two eyes.

Kittens with unilateral or bilateral section of proprioceptive afferents suffer permanent deficits in visual-motor coordination (Hein and Diamond 1983). For example, depth discrimination in a jumping-stand test was affected by unilateral section of the ophthalmic nerve in adult cats (Fiorentini et al. 1985) (Portrait Figure 32.5). However, Graves et al. (1987) found this to be true in only some cats. Significant changes in depth discrimination in cats occurred only when a unilateral section of proprioceptive afferents was performed between the ages of 3 and 13 weeks, or bilateral section between the ages of 3 and 10 weeks (Trotter et al. 1993). Inputs from proprioceptors in the extraocular muscles have also been implicated in the long-term maintenance of binocular alignment of the eyes (Lewis et al. 1994).

Trotter et al. (1987) severed the proprioceptive afferents of kittens either unilaterally or bilaterally at various times in the first few months after birth. This did not cause strabismus or interfere with the movements of the eyes. After the operation some of the kittens were reared with normal binocular experience and some in darkness. One month after unilateral section of proprioceptive afferents

Figure 32.5. Adriana Fiorentini. Born in Milan in 1926. She received a B.Sc. in physics from the University of Florence in 1948, and the libera docenza in physiological optics in 1956. She worked in various capacities at the Istituto Nazionale di Ottica in Florence until 1968, when she became a professor at the Istituto di Neuroscienze CNR in Pisa, to commence her long collaboration with professor Maffei. She retired in 1992 but continues her research.


during the critical period, both the seeing cats and the dark-reared cats showed a severe reduction in the number of binocular cortical cells, which was still present 2½ years later. Unilateral section of the nerve had no effect when performed during the first month after birth or in the adult cat. Bilateral section of the nerve had no effect on binocular cells no matter when it was performed.

More recently, Trotter et al. (1993) recorded from cells in the visual cortex of adult cats in which proprioceptive afferents had been severed unilaterally when the animals were between 5 and 12 weeks of age. The stimuli were moving sine-wave gratings with dichoptic phase (horizontal disparity) set at various values. In the cells of operated cats, the range and stability of disparity tuning and the degree of binocular suppression were reduced below the level of cells in normal cats.

Maffei and Bisti (1976) surgically deviated one eye of kittens soon after birth and occluded both eyes at the same time (Portrait Figure 32.6). The reduction in the number of cortical binocular cells was about the same as that produced by induced strabismus when both eyes were allowed to see. They concluded that asymmetrical movement of the two eyes, even in the absence of vision, is sufficient to disrupt binocularity in cortical cells. Others failed to replicate this effect (Van Sluyters and Levitt 1980). Maffei and Bisti’s conclusion was also challenged on the ground that monocular paralysis in kittens leads to a reduction in X cells in the LGN and that the apparent loss of

Figure 32.6. Lamberto Maffei. Born in Grosseto, Italy, in 1936. He obtained a degree in medicine in 1961 and studied under G. Moruzzi. He has conducted research in Tübingen, Cambridge, MIT, and Oxford. He is chairman of the Institute of Neurophysiology of the Italian National Research Council and professor of neurobiology at the Scuola Normale Superiore in Pisa, Italy.




binocular cells was secondary to this (Berman et al. 1979). Monocular paralysis produces loss of X cells in the LGN even in the adult cat (Brown and Salinger 1975). But this claim has also been challenged by those who found no loss of X cells in the LGN after monocular paralysis (Winterkorn et al. 1981). This issue remains unresolved. A report that surgical deviation of one eye in adult cats leads to a loss of binocular cells (Fiorentini and Maffei 1974; Maffei and Fiorentini 1976) was not replicated by Yinon (1978). However, it was supported by further evidence (Fiorentini et al. 1979). Others have concluded that eye motility plays a crucial role in cortical plasticity but only in combination with abnormal visual inputs. Thus, many cortical cells recovered their response to stimulation of a deprived eye only after the normal eye had been pressure blinded and its extraocular muscles had been paralyzed (Crewther et al. 1978). When anesthetized kittens with their eye muscles paralyzed were exposed to a patterned display for 12 hours there was no reduction in the number of binocularly activated cortical cells (Freeman and Bonds 1979). However, monocular exposure did reduce the number of responding binocular cells when the eyes were not paralyzed or were moved mechanically by the experimenter while the eyes were paralyzed. Moving the eyes mechanically in darkness for 12 hours had no effect. Freeman and Bonds concluded that cortical plasticity depends on a combination of nonmatching visual inputs and eye-movement information, presumably arising in proprioceptors in extraocular muscles. Buisseret and Singer (1983) came to the same conclusion after finding that neither monocular occlusion nor induced strabismus led to a change in the binocularity of cortical cells in the kitten when proprioceptive afferents were abolished by bilateral section of the ophthalmic nerve. Cortical cells that had lost their capacity to respond to an eye that had been occluded responded to stimuli from that eye within minutes after the good eye was blinded by application of pressure and an anesthetic block was applied to the extraocular muscle afferents of the good eye. Neither procedure was effective alone (Crewther et al. 1978). Thus, both proprioceptive and visual afference seem to play a role in maintaining a normal eye’s suppression of a deprived eye. Recovery of visual functions after a period of binocular deprivation is aided by ocular motility. Thus, 6-week-old dark-reared kittens showed some recovery of orientation selectivity of cortical cells when allowed to see, but not when allowed to see with their eye muscles paralyzed (Buisseret et al. 1978; Gary-Bobo et al. 1986). Buisseret et al. (1988) selectively severed the extraocular muscles of 6-week-old dark-reared kittens so that the eyes could move only horizontally or only vertically. After the kittens were given a period of visual experience, orientation-selectivity of cortical cells became predominantly tuned to the direction orthogonal to that of the allowed eye movements.


Eye muscle paralysis could have prevented the animals from bringing the images from the two eyes onto corresponding locations. This, rather than loss of proprioception, may have prevented visual recovery. The influence of eye-muscle proprioception on visual functions was reviewed by Buisseret (1995).

On balance, this evidence suggests that eye proprioception plays a key role in cortical plasticity and development of depth perception. However, how this is accomplished remains a mystery.

32.6 ALBINISM

32.6.1 BASIC CHARACTERISTICS OF ALBINISM

Albinism is a group of genetic disorders affecting the synthesis of melanin in the retinal pigment epithelium. It occurs in all mammalian species. There are two main types of the disorder: oculocutaneous albinism, characterized by absence of pigment throughout the body, and ocular albinism, in which hypopigmentation is restricted to the eye. There are many subtypes, and the severity of the deficit depends on at least eight genes (Kinnear et al. 1985; Abadi and Pascal 1989). About 1 in 17,000 people has oculocutaneous albinism and about 1 in 50,000 has ocular albinism. One form of ocular albinism is linked to the X chromosome and occurs only in males (Jay et al. 1982).

For one type of albinism the affected gene encodes tyrosinase, the enzyme involved in synthesis of melanin. Introduction of this gene into albino mice allowed the visual pathways to develop normally (Jeffery et al. 1994). However, similar symptoms arise in human oculocutaneous albinos who do not lack the gene for tyrosinase. In these tyrosinase-positive cases, lack of melanin must be due to other factors—probably lack of a substrate (Witkop et al. 1970).

In all forms of albinism there is an absence of ocular pigmentation and a reduction of the number of rods, especially in the central retina. The absence of melanin in the pigment epithelium behind the retina causes the ocular fundus, or concave interior of the eye, to appear orange-red and renders the choroidal blood vessels visible through the ophthalmoscope. Many albinos also have loss of pigmentation in the iris, giving the eyes a characteristic pink appearance. Lack of pigment allows light to enter the eye through the sclera and iris and to reflect from the eye’s internal surfaces, causing excessive illumination and glare. Albinos typically avoid the resulting visual discomfort and exposure to excessive doses of ultraviolet light by keeping away from bright lights, a response known as photophobia. The yellow macular pigment, which reduces the effects of chromatic aberration, is also absent in albinos.

Albinos also tend to have astigmatism and high refractive errors, especially myopia. They show malformation of

the fovea. The ganglion cell layer is present over the fovea, and central cones resemble those found normally in the parafoveal region (Fulton et al. 1978). The visual pathways of albinos are also deficient. These defects are accompanied by strabismus, congenital nystagmus, and a variety of visual defects including reduced acuity and impaired or absent binocular fusion. Stereoscopic vision is either absent or deficient in albinos (Guo et al. 1989; Apkarian 1996). In a mixed group of 18 human albinos, 9 showed some evidence of stereopsis when tested with a variety of stereo tests, including a random-dot stereogram, although only a simple pass-fail criterion was used (Apkarian and Reits 1989).

32.6.2 ABNORMAL ROUTING OF VISUAL PATHWAYS

32.6.2a Abnormal Routing in Albinos

In the present context, the most significant visual defect in albinism is the unusual structure of the visual pathways. In the optic tract of the albino rat, there is an unusually small number of uncrossed axons from the ipsilateral temporal retina and an abnormally large number of crossed axons from the contralateral temporal retina (Lund 1965). From the earliest developmental stage in albino rats, cats, and ferrets, uncrossed axons at the chiasm are reduced in number compared with those in normal animals, and tend to originate in the peripheral retina (Guillery 1986). Many axons from about the first 20˚ of the temporal hemiretinas erroneously decussate. The part of the nasal retina closest to the midline also gives rise to some misrouted ganglion cells, but axons from the most nasal parts of the retina have been found to be routed normally in several species of albino animals (Sanderson et al. 1974). Also, cortical cells may receive inputs from the ipsilateral eye through the corpus callosum (Diao et al. 1983). Evoked potentials in human albinos revealed that they, also, have a reduced number of uncrossed inputs to the visual cortex (Creel et al. 1974).

In albino mammals of a variety of species, including monkeys, cells in the LGN receiving crossed inputs form into enlarged layers according to the eye of origin but with abnormal fusions between the layers. Cells receiving uncrossed inputs tend to form into segregated islands rather than distinct layers (Sanderson et al. 1974; Gross and Hickey 1980). A postmortem study of the LGN of a human albino revealed abnormal fusions of the four parvocellular layers and of the two magnocellular layers in the region of the LGN that is normally six-layered (Section 5.2.1). In the normal LGN there is a small two-layered region devoted to crossed inputs from the monocular crescent in the far periphery of the visual field. In the albinotic LGN this two-layered region is greatly extended because of the unusual number of crossed inputs (Guillery et al. 1975; Guillery 1986) (Portrait Figure 32.7).


Figure 32.7. Ray W. Guillery. Born in Greifswald, Germany, in 1929. He obtained a B.Sc. in 1951 and a Ph.D. in 1954, both in anatomy from University College, London. He held academic appointments in the Department of Anatomy at University College from 1953 to 1964, in the Department of Anatomy at the University of Wisconsin from 1964 to 1977, and in the Department of Physiological Sciences at the University of Chicago from 1977 to 1984. He then moved to Oxford University, where he was Dr. Lee’s professor and head of the Department of Human Anatomy until he retired in 1996. He is now emeritus professor in the Department of Anatomy at the University of Wisconsin. He is a fellow of the Royal Society of London.

The terminals of the abnormally routed visual inputs are arranged in a normal retinotopic order in the visual cortex but on the wrong side of the brain. The topological representation of the visual field normally shows a sudden reversal at the midline, which is represented at the border between areas 17 and 18 (or areas V1 and V2). In the albino ferret this reversal occurs in area 17 up to 30˚ away from the visual midline on the side of the abnormally routed inputs (Thompson and Graham 1995; Ackerman et al. 2003). In human albinos, fMRI has revealed that the point of reversal occurs between 6˚ and 14˚ away from the midline (Hoffmann et al. 2003). The misrouted inputs thus map a part of the ipsilateral visual field rather than the contralateral visual field, and in mirror-reversed order (Kaas and Guillery 1973). This produces an unusual location of visually evoked potentials recorded from the scalp (Creel et al. 1978, 1981; Boylan and Harding 1983; Apkarian et al. 1984). The VEPs show delayed ipsilateral latency, reduced ipsilateral amplitude, or both (Guo et al. 1989). Retinal projections to the superior colliculus also appear to be reversed (Collewijn et al. 1978).



There have been reports that, in albino cats, the response of cortical cells receiving misrouted inputs is suppressed (Kaas and Guillery 1973). However, in some albino cats and in albino monkeys and humans, cortical cells receiving misrouted inputs were found to respond vigorously (Guillery et al. 1984; Hoffmann et al. 2003).

One basic cause of the abnormal routing of axons is a lack of tyrosinase, the enzyme involved in melanin synthesis. It has been suggested that abnormal routing is due to the absence of melanin in the developing eyestalk. Normal visual pathways developed in animals in which melanin was confined to the retina (Colello and Jeffery 1991). The lack of melanin reduces the number of retinal cells specified to remain uncrossed (Marcus et al. 1996). Chemical markers determine whether or not an axon crosses the chiasm (Section 6.3.4). Also, cells that cross tend to be produced later than those that remain uncrossed. Lack of melanin seems to interfere with the spatiotemporal pattern of cell development in the retina, leading to a differential delay in the time of arrival of different types of axon at the chiasm (Jeffery 1997). The severity of visual defects in albino rats and mink was correlated with the degree of pigment deficit (Sanderson et al. 1974; Balkema and Dräger 1990). Functional MRI revealed that the extent of misrouting of visual inputs in human albinos was also correlated with the degree of pigment deficit (Von dem Hagen et al. 2007).

Cell proliferation in the retina is regulated by dopa, which is derived from tyrosine and is a precursor of melanin. Levels of dopa, and of tyrosine, are abnormally low in albino retinas. At the peak of retinal neurogenesis, the retinas of albinos show an abnormally high level of cell proliferation. This produces abnormally thick retinas. At a later stage, an abnormal level of cell death restores retinal thickness to normal. Addition of dopa to albino eyes in vitro prevents the abnormal cell proliferation (Ilia and Jeffery 1999). The spatiotemporal pattern of cell production in the retina is probably disturbed during the period of excessive cell proliferation, and this could affect the routing of axons at the chiasm.

Basic retinal defects in albinos disrupt the segregation of crossed and uncrossed axons in the chiasm, leading to an abnormally low number of uncrossed axons. Monocular enucleation performed on prenatal rats, mice, and ferrets before axons have reached the chiasm reduces the number of uncrossed and increases the number of crossed axons from the surviving eye, as in albinos (Guillery 1989). Enucleation performed postnatally, after ganglion-cell axons have reached the LGN, increases the number of uncrossed axons from the surviving eye (Chan and Guillery 1993). This suggests that axons crossing the chiasm trigger the production of a substance that prevents other axons from crossing (Godement et al. 1990). Perhaps the disturbance in cell production in the retina disrupts this chiasm-forming mechanism in albinos.


32.6.2b Abnormal Routing in Siamese Cats

The visual pathways of Siamese cats are misrouted in a similar way to those of albino cats (Guillery 1969; Guillery and Kaas 1971; Kalil et al. 1971; Shatz and Kliot 1982). Thus, in the area centralis of Siamese cats only a few ganglion cells project ipsilaterally, compared with nearly 50% in normal cats (Stone et al. 1978). There are two types of visual projection in Siamese cats—the Boston pattern and the Wisconsin pattern. Evidence of a similar distinction has been produced in the visually evoked responses of human albinos (Carroll et al. 1980). In the Boston pattern, the mirror-reversed projection corrects itself to produce an essentially continuous representation of the visual field in each cerebral hemisphere (Hubel and Wiesel 1971; Shatz and LeVay 1979). Optic nerve fibers from up to 20˚ into the temporal retina of each eye cross aberrantly in the chiasm and terminate in the wrong LGN. In the visual cortex, the aberrant representation of the ipsilateral visual field is inserted between the normal representations of the contralateral visual fields in areas 17 and 18. This causes the cortical representation of the vertical meridian to be displaced up to 20˚ away from its usual location along the border between areas 17 and 18. This is the same pattern found in albino animals. The transition from the cortical region containing only contralateral projections to that containing only ipsilateral projections is diffuse in Siamese cats rather than fairly sharp as in normal animals (Cooper and Pettigrew 1979). Also, transcallosal fibers, which usually originate and terminate along the border between areas 17 and 18, acquire a wider distribution with a peak occurring in the region representing the vertical meridian rather than along the border between areas 17 and 18 (Shatz 1977a). This suggests that the transcallosal fibers grow to connect corresponding regions of the visual field rather than similar architectonic regions of the cortex. In the Wisconsin pattern of visual projection the reversed projection is not corrected, but there is intracortical suppression of the anomalous inputs along with all other inputs from the same LGN lamina (Kaas and Guillery 1973). The vertical retinal meridian projects to its usual location along the border between areas 17 and 18 (Kaas and Guillery 1973), and transcallosal fibers originate and terminate in substantially the same way as in normal cats (Shatz 1977b). The superior colliculus of Siamese cats receives an abnormally large representation of the ipsilateral visual field and the part devoted to the fovea is shifted about 7˚ contralateral to its normal location (Berman and Cynader 1972). The superior colliculus of Boston-type Siamese cats has many binocular cells that obtain most of their ipsilateral input through the corpus callosum (Antonini et al. 1981). Some of these binocular cells have disparity tuning functions resembling those of cortical cells tuned to

coarse disparity in normal cats (Bacon et al. 1999). Strabismic cats, also, retain binocular cells in the superior colliculus (Section 8.2.2f). Siamese cats have few binocular cells in cortical areas 17, 18, and 19 (Guillery et al. 1974; Di Stefano et al. 1984). Nevertheless, they show good interocular transfer of learning to discriminate visual forms (Marzi et al. 1976). This could be because they retain a considerable number of binocular cells in the lateral suprasylvian area (Clare-Bishop area), an area involved in interhemispheric transfer of form discrimination (Berlucchi et al. 1979). Siamese cats of the Boston type have numerous binocular cells in the Clare-Bishop area, which show selectivity for motion in depth similar to that of cells in the normal animal (Toyama et al. 1991). The suprasylvian area is rich in transcallosal connections. Callosectomy abolishes the binocularity of cells in this area in Siamese cats but not in normal cats (Marzi et al. 1980). This evidence indicates that the ipsilateral inputs to the suprasylvian area of Siamese cats are routed through the corpus callosum. The abnormally routed visual projections in albinos and Siamese cats are associated with a disturbance of inputs to the oculomotor system controlling vergence. This may explain why albinos and Siamese cats usually have strabismus and spontaneous nystagmus. There is also a complete disruption of mechanisms for detecting disparity, so that albinos and Siamese cats have little or no stereoscopic vision (Packwood and Gordon 1975). Although the Siamese condition is genetic in origin, there has been some dispute about which defect is primary: the misrouting of visual inputs, the strabismus, or the lack of binocular cells in the visual cortex. Cool and Crawford (1972) argued that strabismus is the primary cause, although they reported that, while all Siamese cats lack binocular cells, some do not have strabismus. Misrouting of visual inputs is the most probable cause of the Siamese condition. In Siamese cats, lack of pigment in the embryonic eye stalk causes abnormal positioning of axons in the developing optic nerve, which in turn disrupts routing at the chiasm (Webster et al. 1988).

32.6.2c Achiasmatic Animals

In albinism, temporal retinal axons decussate instead of projecting to the ipsilateral cortex. Williams et al. (1991, 1994) identified an autosomal recessive mutation in some sheep dogs that involves the opposite defect. All retinal axons project to the ipsilateral LGN. Thus the optic chiasm is eliminated—the animals are achiasmatic. Axons from each nasal hemiretina terminate in ipsilateral layer A of the LGN with the same topographic arrangement as that in the contralateral LGN in the normal dog. The temporal fibers project normally to the superimposed ipsilateral layer A1. Since the nasal axons have not crossed the midline, the nasal projection is mirror-image reversed with respect to





the temporal projection. The two maps are congruent only along the vertical midline. Williams et al. argued that this reversed mapping could be explained if it is assumed that there is a fixed position-dependent chemoaffinity between retinal axons and LGN cells and layers. Thus, the selection of target cells in the LGN would be controlled by the retinal position from which the axons originate rather than by their eye of origin. These dogs manifest spontaneous nystagmus associated with head oscillations. They exhibit the rare condition of seesaw nystagmus. This is a disjunctive vertical nystagmus in which each eye intorts as it looks up and extorts as it looks down (Dell’Osso and Williams 1995). Apkarian et al. (1994) described two achiasmatic children. The absence of visual evoked potentials from the contralateral cortex indicated that each optic nerve projected fully to the ipsilateral visual cortex. The VEP results were confirmed by magnetic resonance imaging (MRI). The two children, like the achiasmatic dogs, exhibited congenital nystagmus with components of seesaw nystagmus (Dell’Osso 1996). The children lacked stereoscopic vision (Apkarian 1996). McCarty et al. (1992) reported a similar case in a 35-year-old man.

32.6.3 CONGENITAL NYSTAGMUS

All animals with mobile eyes show involuntary pursuit movements in the same direction as a large moving visual display, interspersed with saccadic return movements. This reflex response is optokinetic nystagmus or OKN (see Howard 1993a for a review). In mammals, the subcortical nuclei controlling OKN receive direct inputs from only the nasal hemiretina of the contralateral eye, conveyed by axons that decussate in the optic chiasma. In higher mammals, outputs from the visual cortex descend to the subcortical nuclei and counterbalance the inherent directional asymmetry of the subcortical mechanism, as explained in Section 22.6.1c. These cortical outputs derive from binocular cortical cells that normally receive both decussated and undecussated axons. The OKN system is held in symmetrical balance by the interplay of these two systems. In the albino, the excessive number of decussating axons upsets both the subcortical and cortical components of OKN and the balance between them, and this results in spontaneous nystagmus. All types of spontaneous nystagmus of genetic origin are known as congenital nystagmus. Congenital nystagmus consisting of a conjugate, involuntary oscillation of the eyes, usually in the horizontal direction, is a universal feature of albinism. Nystagmus associated with albinism is due to a misrouting of the visual pathways, which reveals itself in an asymmetry in the visual evoked potentials from the two sides of the brain. Visual evoked potentials do not show this asymmetry with congenital nystagmus not due to albinism (Apkarian and Shallo-Hoffmann 1991).



An interesting feature of all forms of congenital nystagmus is that, unlike acquired nystagmus, it is not accompanied by oscillopsia, or perceived oscillation of the visual world. People with congenital nystagmus learn to suppress or ignore the retinal motion signals that arise during nystagmus. They assess the stability of the visual world on the basis of information taken in when the eyes momentarily come to rest between nystagmic sweeps. They experience oscillopsia if the retinal image is artificially stabilized (Leigh et al. 1988). In congenital nystagmus there is usually a position of gaze for which nystagmus is minimal or absent. This so-called null position typically shifts in a direction opposite to the motion of a moving display. Also, congenital nystagmus is usually reduced when the patient converges on a near object (Dickinson 1986). Thus, patients may reduce nystagmus either by looking at an object with the head to one side, so as to bring the gaze into the null position, or by voluntary convergence. Prisms or surgical rotation of the eyes may help by bringing the null angle of gaze into the primary, or straight-ahead, position. Patients with congenital nystagmus may show no OKN in response to visual motion, a response with unusually low gain, or so-called reversed OKN. In reversed OKN the slow phases, which normally compensate for the motion of the stimulus, occur in the opposite direction to that of the stimulus. Also, the slow phases often have an accelerating velocity profile instead of the constant velocity profile typical of normal slow phases (Dichgans and Jung 1975; Halmagyi et al. 1980; Yee et al. 1980). Reversed OKN occurs only in response to stimuli moving along the meridian in which the congenital nystagmus occurs (Abadi and Dickinson 1985). The reversal of OKN is presumably caused by the mirror-reversed projection of the abnormally routed cortical and pretectal inputs. The vestibuloocular response (VOR) in albinos with congenital nystagmus has an unusually low gain when the head is oscillated at a low frequency, although gain may be normal at high frequencies. Albinos show weak or no VOR in response to caloric stimulation of the vestibular organs, since this is equivalent to a low-frequency rotation of the head. Furthermore, optokinetic afternystagmus (OKAN) is absent and postrotary nystagmus is unusually brief in people with congenital nystagmus (Demer and Zee 1984). This suggests that congenital nystagmus involves a defect in the velocity-storage mechanism common to OKN and VOR. In summary, albinos suffer from optic glare, poorly developed retinas, and poorly developed visual pathways. They also have instability of gaze, refractive error, and astigmatism. These factors contribute to loss of visual acuity and stereopsis, which is so severe in some albinos that they are classified as partially or completely blind. Siamese cats and achiasmatic animals also have misrouted visual pathways and have many of the symptoms shown by albinos.


33 VISUAL DEPTH PERCEPTION IN THE ANIMAL KINGDOM

33.1 Insects and spiders
33.1.1 The praying mantis
33.1.2 Ants
33.1.3 Bees
33.1.4 Flies, moths, and dragonflies
33.1.5 Grasshoppers and locusts
33.1.6 Jumping spiders
33.1.7 Scanning eyes and motion parallax
33.2 Crustaceans
33.2.1 Trilobites, shrimps, and crayfish
33.2.2 Crabs
33.3 Fish
33.3.1 The visual system of fish
33.3.2 Adaptations in deep-sea fish
33.3.3 Archer fish
33.4 Amphibians
33.4.1 Frogs and toads
33.4.2 Salamanders
33.5 Reptiles
33.6 Birds
33.6.1 Pigeons
33.6.2 Hawks, falcons, and eagles
33.6.3 Owls
33.7 Mammals
33.7.1 Meerkats and rodents
33.7.2 Ungulates
33.8 Evolution of visual depth perception
33.8.1 Basic conditions for evolution of sensory mechanisms
33.8.2 Evolution of monocular mechanisms
33.8.3 Evolution of frontal vision
33.8.4 Evolution of stereoscopic vision

This chapter is concerned with visual depth detection throughout the animal kingdom, starting with insects. The eyes of protozoa, worms, jellyfish, and molluscs were described in Section 6.1. Most animals respond to image looming, and many use perspective and motion parallax to detect depth. Stereoscopic vision based on binocular disparity has evolved in some insects, frogs, and mammals. It seems that mantis shrimps have stereoscopic vision based on vertical disparities detected in the same eye. Visual detection of depth in animals has been reviewed by Walls (1963), Hughes (1977), Collett and Harkness (1982), and Pettigrew (1991). Land and Nilsson (2002) reviewed animal eyes. Nocturnal animals and animals that live underground or in opaque water detect objects by sound, heat, smell, or electrical fields. These nonvisual sensory systems for detecting depth are described in later chapters.

33.1 INSECTS AND SPIDERS

33.1.1 THE PRAYING MANTIS

In many insects the visual fields of the two compound eyes overlap, providing the possibility of binocular stereopsis (Horridge 1978). It has been reported that predatory insects, such as dragonfly larvae, tiger beetles, praying mantis, and water scorpions, rarely catch prey when one eye has been removed (Maldonado and Rodriguez 1972; Cloarec 1978). This suggests that they use binocular cues to relate their prey-catching activity to the distance of the target. This question has been investigated in the praying mantis (genera Tenodera and Sphodromantis). The praying mantis has compound eyes between 4 mm and 8 mm apart with a central forward-looking region of high-density ommatidia and a 70°-wide binocular field (Figure 33.1).

Figure 33.1. The praying mantis (Mantis religiosa).

When a praying mantis with one eye occluded was shown a fly, it centered it in its visual field by a saccadic head movement. When both eyes were open, the head moved to a compromise position so that the images were symmetrical about the center of each eye (Rossel 1986). The mantis cannot center a fly in both eyes at the same time because the eyes do not move. When prisms introduced a vertical disparity between the images of an object, the head took up an intermediate vertical position (Rossel et al. 1992). The mantis is also able to pursue moving prey with visually guided smooth movements of the head (Rossel 1980) (Portrait Figure 33.2). When prey is within a critical range, the mantis strikes it with its forelegs, with an accompanying lunging motion of the middle and hind legs. The large African mantis (Sphodromantis viridis) has an interocular separation of 8 mm, and its optimal striking range extends 2 to 6 cm from the head. Movements of the forelegs adjust to the distance of the prey, and the attack succeeds when the prey is within 15° of the head midline (Corrette 1989). When base-out prisms were placed before the eyes the distance of strike initiation changed accordingly. A strike occurred at a distance where the binocular disparity of the target was the same as without the prisms (Rossel 1983).

Figure 33.2. Samuel Rossel. Born in Schaffhausen, Switzerland, in 1948. He obtained a diploma in biology from the University of Zürich in 1976 and a Ph.D. with G. A. Horridge from the School of Biological Sciences at the Australian National University at Canberra. His postdoctoral work was with R. Wehner at the University of Zürich. In 1987 he joined the Department of Neurobiology and Animal Physiology at the University of Freiburg, where he is now professor of behavioral physiology.




This suggests that the mantis uses binocular disparity rather than monocular cues, since prisms do not affect monocular cues. Other evidence suggests that the mantis does not use image size to judge distance (Rossel 1991). Accommodation is not used, since compound eyes do not accommodate. The immobility of the eyes means that the binocular disparity produced by a given object indicates the distance of the object from the point where the eyes converge. Thus, if the distance of the convergence point is registered, a given disparity indicates the absolute distance of a given object. A typical prey object in the middle of the striking range subtends about 20°. If only one prey object is within range, the mantis need only register its mean direction from the two eyes, and its distance by extracting the difference between the horizontal positions of the two images. Rossel (1996) presented African mantids with two 4°-wide by 20°-high bar targets separated by 30°. The animals moved the head to center one of the targets and tracked it as it moved vertically. They were thus able to link the images in the two eyes from the same object and avoid linking noncorresponding “ghost” images from different objects. They linked the appropriate images using a simple nearest-neighbor rule. When one eye could see only one bar and the other eye could see only the other bar, the animals centered the head between those images. Thus, they linked noncorresponding images when there were no other images available. When the two bars were separated by less than 9° the animals failed to respond selectively to one bar. Thus, mantids possess a very coarse mechanism for selectively matching binocular images. A coarse system is all that is required, because prey objects usually subtend 20°. The visual acuity of the eyes is much finer than this limit. Mantids still responded accurately to the distance of a target when the images were prismatically separated vertically by up to 15° (Rossel et al. 1992). This suggests that horizontal disparity is registered more or less independently of vertical separation. When one eye was occluded during development, animals performed accurately when the occluder was removed (Mathis et al. 1992). Thus, growth of the mantid stereoscopic system does not require binocular experience. Little is known about the anatomy or physiology of the mantis brain. Mantids use motion parallax to maneuver between stationary objects, such as the stalks of plants. Before moving they create a motion signal related to the distance of the object by executing side-to-side head movements (peering). If the target is moved during the head movement the animal is deceived about the distance of the target (Poteser and Kral 1995). Motion parallax seems also to help the animals to distinguish between an object and its background. Motion parallax cannot be used for coding the distance of a moving prey object. Mantids use binocular disparity for this purpose.
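The geometry that makes a given disparity signify a fixed absolute distance for an animal with immobile eyes can be made concrete. The following sketch (Python) is only an illustration: the 8-mm interocular separation comes from the text above, but the function, the assumption of parallel forward-pointing optic axes, and the example eccentricities are invented for the purpose.

    import math

    def target_distance(interocular_sep, left_ecc_deg, right_ecc_deg):
        # Triangulate absolute distance from the horizontal positions of the
        # target's images in two fixed eyes with parallel forward axes.
        # Eccentricities are measured nasalward (toward the midline).
        t = math.tan(math.radians(left_ecc_deg)) + math.tan(math.radians(right_ecc_deg))
        return float('inf') if t <= 0 else interocular_sep / t

    # A fly on the head midline imaged 4.6 degrees nasalward in each eye of a
    # large African mantis (8-mm interocular separation):
    print(target_distance(0.008, 4.6, 4.6))  # ~0.05 m, inside the 2-6 cm strike range

Because the eyes neither move nor converge, a given pair of image positions always corresponds to the same physical distance, which is why base-out prisms, by shifting the image positions, shift the strike distance.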


The praying mantis strikes at a small approaching object and shows an avoidance response to an approaching large object, such as a bird. These responses are evoked by image looming. Cells in the ventral nerve cord responded specifically to a looming object (Yamawaki and Toh 2009).

33.1.2 ANTS

Desert ants of the genus Cataglyphis forage for food in shrubs scattered among sand dunes. On outward and homeward journeys they run between neighboring shrubs or other landmarks, presumably to avoid lurking predators. Heusser and Wehner (2002) constructed a walled channel between the nest and a feeding area. When the ants entered the channel their motion paths became straighter than when they moved in the open. They ran down the middle of the channel when the walls were black and also when they were marked with a vertical grating, even when the grating on one side was coarser than that on the other side or was moved one way or the other. They were clearly not maintaining a distance from the two walls by balancing the patterns of optic flow. When one wall was made higher than the other, they moved along a path nearer to the lower wall so as to balance the angles subtended by the two walls (see the sketch at the end of this section). This method is well suited to an animal that walks at a fixed height above the ground. The isolated head of a bulldog ant (Myrmecia gulosa) snaps its mandibles at an approaching object (Via 1977). It seems that the response is triggered when an image of appropriate size falls on a small number of ommatidia. The ants do not distinguish between a large far object and a near small object. Only a few animals responded when one eye was occluded, and responses were more variable. The visual fields of the bulldog ant overlap by about 60°.
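The wall-height strategy reduces to simple geometry. In this hypothetical sketch (the function and its values are not from the source), an eye at ground level sees a wall of height h at lateral distance d under an elevation angle atan(h/d); equating the two angles fixes the path:

    def balanced_path(channel_width, left_wall_height, right_wall_height):
        # Distance from the left wall at which the two wall tops subtend
        # equal elevation angles: atan(hL/d) = atan(hR/(w - d))
        # => d = w * hL / (hL + hR).
        return channel_width * left_wall_height / (left_wall_height + right_wall_height)

    print(balanced_path(10.0, 5.0, 5.0))  # 5.0: equal walls, run down the middle
    print(balanced_path(10.0, 3.0, 6.0))  # ~3.3: path shifts toward the lower left wall

As the text notes, the rule works because a walking ant's eye height is fixed; no calibrated flow-speed signal is needed.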

33.1.3 BEES

Bees use visual motion generated by flight to judge the distances of objects they are flying past (Lehrer et al. 1988; Egelhaaf and Borst 1992). They can maintain a flight path down the central axis of a narrow tunnel. They do this by balancing the speeds of the motion signals arising from textures on the two sides of the tunnel, independently of the contrasts or relative spatial periodicities of the patterns (Srinivasan et al. 1991; Srinivasan and Zhang 1993). Bees regulate their speed of flight to keep the average angular velocity of the images constant. Thus, they slow down when flying in a narrow tunnel and when approaching a surface on which to land (Srinivasan et al. 1996) (Portrait Figure 33.3). Bees flew faster when a texture pattern was moved in the direction of flight and flew more slowly when the pattern was moved against the flight direction (Barron and Srinivasan 2006). Bees flying over a textured ground surface decreased flight height when the surface was moved in the direction of flight (Portelli and Ruffier 2010).

Figure 33.3. Mandyam V. Srinivasan. He obtained a degree in electrical engineering from Bangalore University in 1968, a Ph.D. in engineering from Yale University in 1976, and a D.Sc. in neuroethology from the Australian National University. He is professor of visual sciences at the Australian National University’s School of Biological Sciences in Canberra and director of the University’s Centre for Visual Science. He is a fellow of the Australian Academy of Science and a fellow of the Royal Society of London.

In this case, bees kept the ventral angular speed constant by changing flight height rather than flight speed. We saw in the previous section that ants use a similar mechanism. The motion detectors of bees have small receptive fields and are insensitive to the direction of image motion. As in the parallax-detecting system of the locust, detection of motion direction is unnecessary because the images of objects always move counter to the body during flight. Bees flying in a textured tunnel maintained a constant rate of optic flow when facing a headwind of up to 50% of their forward speed. They compensated for the wind even when optic flow was sparse. Bees have other motion detectors with large receptive fields that control optokinetic responses. These responses stabilize yaw, pitch, and roll of the body during flight. The detectors are designed only to bring the rotary visual motion signal to zero, which occurs when the body is stable in flight. The detectors are therefore not required to produce a well-calibrated speed signal (Srinivasan et al. 1999). Thus bees have two distinct motion-detecting systems. One is designed to control linear motions of the body. It produces a directionally unsigned speed signal unconfounded by spatiotemporal frequency. The other is designed to null rotations of the body. It produces a good directional signal but confounds speed and spatiotemporal frequency.
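A minimal control-loop sketch of the centering and speed-holding behaviors described above follows. All gains, the target flow value, and the function itself are hypothetical; unsigned flow speeds stand in for the bee's directionally insensitive small-field motion detectors.

    def tunnel_control_step(flow_left, flow_right, target_flow,
                            k_turn=0.5, k_speed=0.2):
        # flow_left, flow_right: unsigned angular image speeds (rad/s) on the
        # two eyes. Turning away from the faster-flowing side re-centers the
        # flight path; adjusting forward speed holds the mean flow at target.
        turn_command = k_turn * (flow_right - flow_left)   # positive = steer left
        speed_command = k_speed * (target_flow - 0.5 * (flow_left + flow_right))
        return turn_command, speed_command

    # Nearer the right wall (faster right-eye flow) with overall flow too high:
    # steer left and decelerate.
    print(tunnel_control_step(flow_left=2.0, flow_right=4.0, target_flow=2.5))

Because the loop needs only an unsigned speed estimate on each side, it is consistent with motion detectors that ignore the direction of image motion.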





Models of these motion detectors have been proposed by Srinivasan et al. (1999). It would be worthwhile to investigate the generality of this distinction throughout the animal kingdom. Bees approach a step in depth by flying at right angles to the edge of the step. They landed on the edge of a raised surface more frequently when they approached over the lower surface than when they approached over the raised surface (Lehrer and Srinivasan 1993). When bees are given the task of discriminating between patterns on the same flat surface they tend to fly along contours rather than across them. This reduces image blur. However, when bees were trained to land on a raised surface containing stripes orthogonal to the edge, they suppressed their tendency to fly along contours. Instead, they flew in an oblique direction with respect to the stripes so as to generate motion parallax for estimating the height of the surface (Lehrer 1996). Bees thus optimize whichever source of visual information is most appropriate for a given task. A bee could derive the distance of an object from the change in image size as the bee moves a given distance toward the object. It could also use the motion parallax generated by flying past the object at constant speed (Cartwright and Collett 1979). Lehrer and Collett (1994) placed a black cylinder near a feeding site and a second cylinder of different size near an empty dish. Bees showed evidence of learning the distances of the cylinders from the feeding site. They evidently used motion parallax generated by their to-and-fro flight since their choice of cylinders was not affected by changes in the relative sizes of the cylinders. However, after more visits to the site, a change in the size of the cylinder began to have an effect. Thus, the bees initially used parallax to gain knowledge of the 3-D structure of landmarks near a feeding site. They then switched to the simpler process of matching the sizes of the visual images to those produced by the landmarks. Bees can use motion signals to discriminate between the heights of objects over which they are flying. Srinivasan et al. (1989) trained bees to collect sugar water at one of several artificial flowers that differed in size, position, and height above the ground. They could discriminate a difference in height of 2 cm, whatever the size or position of the flower. With a featureless ground, the bees must have used the motion of the image of the flower. Their improved accuracy when the ground was textured suggests that they also used motion parallax between flower and ground. Horridge et al. (1992) rewarded bees with sugar water when they reached a black disk of a given size in one arm of a Y-shaped choice box rather than a disk of a different size in the other arm. The bees learned to select a disk of a given absolute size whatever its distance (and hence the angular size of its image). However, they reached a success rate of only about 70% on this task. In their natural environment, bees probably do not need to recognize the absolute sizes of objects.



33.1.4 FLIES, MOTHS, AND DRAGONFLIES

Compound eyes of the housefly (Musca domestica) contain approximately 3,000 ommatidia arranged on a spherical surface. Each ommatidium contains eight photoreceptors (rhabdomeres) and a lens. Two central rhabdomeres are surrounded by six peripheral rhabdomeres. The central rhabdomeres are chromatic with high sensitivity, while the peripheral receptors are achromatic with lower sensitivity. In the fruit fly, Drosophila, receptors project to the medulla and then by three parallel pathways to the surface and deep layers of the lobula and to the lobula plate. Cells in the deep layers of the lobula have large interconnected receptive fields, which suggests that they are specialized for detection of form. Cells in the lobula plate have small receptive fields (Bausenwein et al. 1992). Recordings in the housefly indicate that these cells are specialized for detection of relative motion, which could provide a basis for figure-ground discrimination based on motion parallax (Egelhaaf 1985). Ordinary stereoscopic vision is not possible in flies because there is little overlap between the visual fields of the compound eyes on the two sides of the head. However, Pick (1977) reported that the visual axes of rhabdomeres in one ommatidium converge to a point between 3 and 6 mm in front of the surface of the lens. He suggested that this is a mechanism for increasing light sensitivity and for detection of near distances. Bitsakos and Fermüller (2006) developed a theoretical model of depth detection by such a system. The mechanism involves comparing the image intensity obtained by rhabdomeres in neighboring ommatidia with parallel optic axes with the image intensity obtained by rhabdomeres with converging axes. Bitsakos and Fermüller showed that the distance of a nearby surface is inversely proportional to the ratio of these two intensity derivatives. When flying in a textured cylindrical tunnel, fruit flies (Drosophila), like bees, regulate their speed of flight by holding the image angular velocity constant, independently of the temporal frequency at which texture elements pass. When the diameter of the tunnel was changed suddenly, the flies adjusted their airspeed to keep the angular velocity of the image constant (David 1982). They did not detect gradual changes in the distance of the visual display, which suggests that they use motion parallax to detect step changes in distance. Optic flow due to body rotation is in the same direction in the two eyes, while that due to forward body motion is in opposite directions in the two eyes. Krapp et al. (2001) found that the response properties of movement-sensitive cells in the optic lobe of the fly are well suited to distinguish between relative binocular motions produced by different body motions. Flies decelerate as they approach a landing site. They must have information about the distance of the landing


site in order to initiate deceleration. Wagner (1982) conducted a frame-by-frame analysis of flight velocity, target distance, image size, and image rate of expansion as flies landed on a small sphere. He concluded that deceleration is initiated when the ratio of image expansion velocity to image size reaches a critical value. This ratio is similar to the tau ratio discussed in Section 31.1. European hawk moths (Macroglossum stellatarum) collect nectar while in hovering flight. They maintain a constant distance from a flower swaying in the wind. Farina and Zhou (1994) concluded that the moths were not using stereopsis, because the flower was too near. However, we have just seen that short-range stereoscopic vision may be possible with compound eyes. When presented with a simulated patch of flowers in front of a structured background, hawk moths preferred flowers that were closest to them. This suggests that they use motion parallax between the flower and the background. However, they were able to maintain a constant distance from a “flower” formed of concentric rings moving back and forth without a background. Under this condition they were probably using the speed of motion of the image of the rings. The gain of the response increased with increasing density of the rings up to a density beyond which there was a loss of resolution. Single cells in the optic lobes of the hawk moth (Manduca sexta) responded selectively to an approaching or receding blank disk but not to outward or inward motion of a revolving spiral (Wicklein and Strausfeld 2000). The cells were identified as being sensitive to changing size. Other cells responded to both stimuli and were sensitive to optic flow. Similar types of cell have been found in the pigeon (Section 33.6.1). Predatory dragonflies (Libellulidae) sit and wait until an insect flies by. They then take off and pursue the prey. Frame-by-frame analysis of video film revealed that dragonflies fly directly to the prey by steering so as to reduce the movement of the prey’s image on the eye (Olberg et al. 2000). The analysis suggested that distance to the prey is coded by the angular velocity of the prey shortly before launch. Success did not require parallax induced by head movements. Olberg et al. (2005) videotaped perching dragonflies as they flew from a perch to artificial prey (glass beads) suspended from wire. The insects did not respond to beads larger than their heads, even when the beads subtended the same visual angle as smaller, more distant beads, to which they did respond. These results indicated that the animals judged the distance of prey items within a range of about 1 m.
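Wagner's landing trigger can be stated directly: the relative rate of image expansion equals approach speed divided by distance, so thresholding it initiates deceleration a fixed time before contact regardless of target size. A sketch, with a hypothetical threshold and sample values:

    def relative_expansion_rate(image_size, prev_image_size, dt):
        # (d(theta)/dt) / theta: for constant approach speed v at distance d,
        # theta ~ size/d, so this ratio equals v/d, the reciprocal of
        # time-to-contact (tau).
        return ((image_size - prev_image_size) / dt) / image_size

    TRIGGER = 2.0  # hypothetical threshold (1/s), corresponding to tau = 0.5 s
    rate = relative_expansion_rate(0.22, 0.20, dt=0.04)  # ~2.3/s
    if rate > TRIGGER:
        print("initiate deceleration")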

33.1.5 GRASSHOPPERS AND LOCUSTS

Grasshoppers (Phaulacridium vittatum), when jumping horizontally from a platform to a surface, adjust the jump to the distance of the surface. They performed almost as well

when one eye was occluded, which suggests that they use motion parallax for judging distance (Eriksson 1980). Before jumping across a gap, locusts move the anterior part of the body from side to side over several cycles—an activity known as peering. Given that the animal registers either the speed or the amplitude of head motion, the motion of the image of an object indicates the distance of the object (Wallace 1959; Collett 1978). When the target was oscillated in phase with the body, locusts overestimated the distance of the target. When the target was oscillated out of phase with the body, they underestimated the distance (Sobel 1990). Thus, they adjusted the length of the jump as a function of the magnitude of image motion resulting from body motion, suggesting the use of absolute parallax (Section 28.2). However, the adjustment was the same for image motion against the head as for image motion of the same relative speed in the same direction as the head, showing that the sign of the motion is ignored. The sign can be ignored because, for an animal that does not move its eyes, images move only against the head under natural circumstances. Taking the sign of the motion into account would complicate the control process unnecessarily. Even without registering the speed or amplitude of head motion, relative image motion produced by objects at different distances provides information about their relative distance. Relative motion could also help an animal to define the boundaries of objects. Locusts are drawn to edges defined by relative motion within a textured display and tend to jump to the side of a motion-defined edge where image velocity is higher, which normally indicates that it is the nearer side (Collett and Paterson 1991). Locusts possess neurons known as lobular giant motion detectors (LGMD), one on the left and one on the right of the head. Both neurons feed into a single descending contralateral motion detector (DCMD). One type of DCMD has a fast-conducting axon, while a second type has a slow-conducting axon (Gray et al. 2010). The axons project to thoracic centers controlling jumping and flight. These neurons respond to looming produced by an object on a collision course, and encode the velocity of an object’s approach (Rind 1996; Gabbiani et al. 1999; Gray et al. 2001; Guest and Gray 2006) (see Section 31.1.1).
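The peering computation is the same triangulation with the head's own translation as the baseline. A sketch (the function and its values are invented); note that only the unsigned magnitude of image motion enters, mirroring the locust's indifference to the sign of the motion:

    import math

    def peering_distance(head_sweep, image_shift_deg):
        # Target distance from a lateral head translation (the peering sweep)
        # and the unsigned angular shift of the target's image it produces:
        # distance = translation / tan(shift).
        return head_sweep / math.tan(math.radians(abs(image_shift_deg)))

    # A 1-cm sweep that shifts the image by 5.7 degrees places the target
    # about 10 cm away:
    print(peering_distance(0.01, 5.7))  # ~0.10 m

Oscillating the target in phase with the head shrinks the image shift and so inflates the estimate, consistent with Sobel's overestimation result; oscillation out of phase does the opposite.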

33.1.6 JUMPING SPIDERS

Most spiders have four pairs of eyes: one principal anterior-median pair, a small median pair, an anterior-lateral pair, and a posterior-lateral pair, as shown in Figure 33.5. The principal eyes are mobile, have rhabdomeres directed toward the light, and have good resolution. The secondary eyes are stationary, have rhabdomeres directed toward a reflecting tapetum, and have poor resolution. The principal eyes move in pursuit of a moving object. However, they do not move when the secondary, stationary, eyes are blocked





(Neuhofer et al. 2009). This shows that only the secondary eyes are used to detect motion. Jumping spiders (Salticidae) are a large family of spiders. They are usually about 2 cm long and brightly colored. A jumping spider is shown in Figure 33.4. The fields of view of the eyes of a jumping spider are depicted in Figure 33.5. The principal eyes are tubular with a lens and a retina consisting of a narrow strip of receptors arranged in four layers. Land (1969a) suggested that objects at different distances are brought into focus on different layers. He also suggested that, because of the chromatic aberration of the lens, light of different wavelengths would be brought into focus on different layers. Blest et al. (1981) showed that the peak wavelength sensitivity of the receptors in each layer is matched to the wavelength of light that is brought to focus in that layer by chromatic aberration. The receptors of the principal eyes lie in a conical pit at the base of the eye. Williams and McIntyre (1980) showed that the pit has a refracting interface which magnifies the image. The pit and the lens form a telephoto system that increases the focal length of the eye and improves its resolving power. Jumping spiders are able to distinguish between flies and other spiders at a distance of about 20 cm (Jackson and Blest 1982). The retinas of the principal median pair of eyes move sideways behind the stationary lenses. This enables the animals to move the small nonoverlapping visual fields of their median eyes across most of the larger visual fields of the stationary anterior-lateral pair of eyes, which partially overlap in the midline. These movements are of three types. (1) Saccadic movements toward an object in the larger visual fields of the lateral eyes. (2) Slow eye movements in pursuit of a moving object. (3) Side-to-side scanning movements at a temporal frequency of between 0.5 and 1 Hz across an object,

Figure 33.4. The jumping spider Aelurillus v-insignitus. (Photograph by Pavel Krasensky, Naturephoto-CZ.com)

Figure 33.5. Visual fields of a jumping spider (Salticidae). The jumping spider has four pairs of simple eyes: one anterior-median pair (AM), a small median pair of unknown function (not shown), an anterior-lateral pair (AL), and a posterior-lateral pair (PL). The retinas of the AM eyes move conjugately, sweeping their visual fields from side to side within the fields of the AL eyes. Fields of the AL eyes overlap in front. (Redrawn from Forster 1979)

combined with a slower 50° rotation of the retinas about the visual axes (Land 1969b) (Portrait Figure 33.6). All these movements are conjugate, so the visual axes remain approximately parallel. This system could provide a mobile region of high resolution, which helps the animals to distinguish between prey, other jumping spiders, and inanimate objects. When a prey object enters the visual field, the spider first orients its body to center the prey. Then the spider chases the prey and catches it on the run. Spiders of some species stalk the prey and jump on it from a distance of 3 cm or more. Just before jumping, the spider attaches a web filament to the substrate. During stalking, the spider reduces its speed of approach as it gets near the prey. In this way it does not exceed the prey’s threshold for detection of an approaching object (Dill 1975). Removal of the median eyes of the jumping spider does not affect orienting behavior. This demonstrates that orientation is controlled by the lateral eyes. However, a spider with only one median eye ignores stationary prey and jumps short of the target. Animals with both anterior-lateral eyes removed still orient the body but no longer chase moving prey. It seems that the principal median eyes with scanning retinas are mainly responsible for judging the distance of the prey, especially short distances. The eyes of these animals do not accommodate. Also, the median eyes do not use disparity, since their visual fields do not overlap.


Figure 33.6. Michael Francis Land FRS. Born in Dartmouth, England, in 1942. He obtained a science tripos at Cambridge in 1963 and a Ph.D. in zoology and neurophysiology at University College, London, in 1967. He held academic appointments at University College, London, Berkeley, the University of Sussex, and the Australian National University between 1966 and 1984. He has been professor of neurobiology at the University of Sussex since 1984. He was elected fellow of the Royal Society of London in 1982 and is a recipient of the Frink Medal of the Zoological Society of London.

The median eyes could signal distance by the temporal delay between detection of the prey by one median eye and its detection by the other eye as the eyes scan from side to side. For a fixed velocity of scanning with parallel visual axes, this delay is approximately inversely proportional to the distance of the scanned object. It seems that the fixed anterior-lateral eyes of the jumping spider (Trite planiceps) contribute to judgments of distances greater than about 3 cm. Their removal degrades jumping accuracy at these distances (Forster 1979). Since the visual fields of these eyes overlap, it is possible that binocular disparity is used. The eyes do not move, so disparities could code absolute distance.
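The proposed delay mechanism can be sketched as follows (the baseline, scan speed, and delay are invented values; the geometry assumes the two retinas sweep at the same angular speed with parallel visual axes):

    import math

    def distance_from_scan_delay(baseline, scan_speed_deg_s, delay_s):
        # As the parallel axes sweep together, the second eye acquires the
        # target after sweeping through the angle that the interocular
        # baseline subtends at the target:
        # distance = baseline / tan(scan_speed * delay).
        return baseline / math.tan(math.radians(scan_speed_deg_s * delay_s))

    # With a 1-mm baseline and a 50 deg/s scan, a 38-ms delay places the
    # target about 3 cm away; doubling the distance roughly halves the delay.
    print(distance_from_scan_delay(0.001, 50.0, 0.038))  # ~0.03 m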

33.1.7 SCANNING EYES AND MOTION PARALLAX

The only known animals with scanning retinas, other than some spiders, are certain small crustaceans (copepods) such as Copilia and Labidocera (Land 1988). The female Copilia has lateral eyes, each containing only three or four photoreceptors and two lenses. A long focal length lens forms an image near the second short focal length lens, which in turn forms an image on the receptors. The second lens and the receptors of the two eyes oscillate laterally in counterphase, each within the focal plane of the first lens (Exner 1891;

Gregory et al. 1964). Copilia lives at a depth of about 200 m, although it may move to the surface at night. It is therefore not normally exposed to stationary objects, and its eyes are presumably designed to be sensitive to moving objects. Scanning eye movements occur when the animal is swimming but can be elicited in a tethered animal by moving visual stimuli (Downing 1972). Certain crustaceans, such as the mantis shrimp, and some molluscs, such as the carnivorous sea snail (Oxygyrus), execute continuous scanning movements of the whole eye. Scanning movements enable the animal to economize on the number of receptors, and it is characteristic of animals with scanning retinas or eyes that they have very few light detectors. Scanning eye movements must be slow enough to allow an image to dwell on each receptor long enough to be detected. The larger the angle of acceptance of the receptors (the coarser the resolution), the faster the eye can scan over a scene before the dwell time for each detector falls below an acceptable limit of between 15 and 25 ms (Land et al. 1990). When the center of rotation of an eye is some distance from the optical nodal point, scanning movements entail a translation of the vantage point and differential motion parallax, which could code absolute and relative distances. Flies of the family Diopsidae have long eyestalks. The increased interocular separation could improve binocular stereopsis. The eyestalks also increase scanning amplitude, which could improve detection of motion parallax. Long eyestalks may also serve as a sexual attractant in males (Collett 1987). Wehner (1981) reviewed arthropod spatial vision.
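The dwell-time constraint on scanning speed is a one-line computation: the fastest permissible scan is the receptor's acceptance angle divided by the minimum detection time. In this sketch the acceptance angles are invented examples; the 15-25 ms limit is the one cited from Land et al. (1990):

    def max_scan_speed(acceptance_angle_deg, min_dwell_s=0.02):
        # Fastest scan (deg/s) that still lets the image dwell on each
        # receptor for at least min_dwell_s.
        return acceptance_angle_deg / min_dwell_s

    print(max_scan_speed(2.0))  # 100 deg/s for coarse 2-degree receptors
    print(max_scan_speed(0.5))  # 25 deg/s for finer 0.5-degree receptors

This is why coarse resolution and scanning go together: halving the acceptance angle halves the permissible scan speed.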

33.2 CRUSTACEANS

33.2.1 TRILOBITES, SHRIMPS, AND CRAYFISH

Trilobites are extinct arthropods that appeared in the Cambrian period and disappeared in the mass extinction at the end of the Permian, 250 million years ago. There were at least nine orders and over 15,000 species of trilobites. Most were small marine creatures that walked along the seafloor. Trilobites had antennae and compound eyes. Most of them had holochroal eyes with about 15,000 ommatidia packed into a hexagonal lattice covered by a single corneal membrane. The eyes were similar to the compound eyes of most living arthropods. Trilobites of the suborder Phacopina had schizochroal eyes that had about 700 larger lenses separated by cuticle. Each lens had a cornea and sclera. Each calcite lens had two components (Clarkson 1966). Ray tracing through a model of the lens showed that such lenses corrected for spherical aberration (Clarkson and Levi-Setti 1975). Chromatic aberration is not a problem for animals living more than a few meters below the water surface.





It seems that each lens had a retina rather than the rhabdomeres of a typical ommatidium. The lenses were distributed over a surface with greater horizontal curvature than vertical curvature and were arranged in vertical bands, as shown in Figure 33.7. The total visual field would have extended 360° horizontally and about 40° vertically. Cowen and Kelley (1976) suggested that trilobites with schizochroal eyes may have had stereoscopic vision based on the overlap of the visual fields of neighboring lenses in each vertical band of each eye. Stockton and Cowen (1976) estimated that each lens had a visual field of 15–20° so that the fields of neighboring lenses in each band would have overlapped and hence provided a basis for stereoscopic vision. If so, these animals would have had stereopsis based on vertical disparities detected in each eye rather than on horizontal binocular disparities. We will now see that mantis shrimps may have a similar mechanism. Mantis shrimps are predatory marine crustaceans (order Stomatopoda) inhabiting shallow tropical waters (Figure 33.8). They live in burrows and emerge to stalk small animals, which they strike or spear with two large limbs. The large mobile compound eyes of mantis shrimps possess many remarkable features. Each eye has a central horizontal band of six rows of ommatidia flanked by dorsal and ventral hemispherical regions containing several thousand ommatidia (Cronin and Marshall 2004). The visual field of the central band of ommatidia is a few degrees high and about 180° wide. The band serves color vision and detection of light polarization. It contains at least 10 types of ommatidia. Each type possesses either a distinct filter pigment or a photosensitive pigment with distinct spectral absorption. The rest of the eye is monochromatic (Marshall et al. 1991; Goldsmith and Cronin 1993). A color system with this many channels could resolve wavelength components in a given location—an ability denied animals with trichromatic vision, in which each receptor serves both color vision and spatial resolution (Section 4.2.7). The visual pigments in different species of

Figure 33.7. The fossil eye of the trilobite Acaste. The lenses are arranged in vertical columns about 2 mm high. (From Clarkson 1966 with kind permission from Springer Science+Business Media)




Figure 33.8. The mantis shrimp (Squilla mantis). The shrimp can be 18 cm long.

mantis shrimp do not vary with the depth at which the shrimps live. However, the density and spectral absorption of the filter pigments vary with depth in a way that compensates for depth-related changes in the spectral composition of light (Cronin and Caldwell 2002). The eyes of the mantis shrimp execute slow vertical and torsional movements through at least 60°. These movements are slow enough to allow the image to dwell on each ommatidium for at least 25 ms—long enough for detection (Land et al. 1990). The eyes also execute tracking movements in response to movements of a stimulus. The eyes move independently (Figure 33.9). Conjugate movements are not required if stereopsis is achieved in each eye. Within each of the three horizontal bands of ommatidia, there is a group of ommatidia directed to the same location in space. These “pseudopupils” show as three dark areas in Figure 33.9. Exner (1891, p. 89) suggested that this arrangement provides range-finding stereoscopic vision in each eye. The upper and lower converging regions of each eye are separated vertically, so that stereopsis would be based on vertical disparities between images in the two regions. If so, these animals would be unique among living animals in having monocular disparity-based stereopsis based on vertical rather than horizontal disparities. The latency of the defense reflex of the crayfish (Procambarus clarki) decreases with increasing velocity of


Figure 33.9. Compound eyes of the mantis shrimp. Bands of ommatidia are flanked by upper and lower hemispheres. Six frames of a videotape show how the eyes move independently through large angles, horizontally, vertically, and torsionally. Dark regions in each eye contain ommatidia directed to the same point in space. They could therefore form a basis for monocular stereopsis. (From Land et al. 1990, with kind permission from Springer Science+Business Media)

an approaching object. The critical variable determining latency seems to be the time required for the image of the object to expand a specific amount. The mean discharge rate of optic nerve interneurons was a linear function of the velocity of an approaching object (Glantz 1974).

33.2.2 CRABS

The visual system of crabs, like that of other crustaceans, consists of receptors, optic lobes within each eyestalk, a lateral protocerebrum, and a supraesophageal ganglion. Each optic lobe consists of three neuropils: a lamina, a medulla, and a lobula. The lobula contains movement-detector neurons (MDNs), which also respond to mechanical stimuli applied to the body. The visual system of semiterrestrial crabs (Brachyura) has two basic designs. Most families, including the Grapsidae and Xanthidae, have widely separated short eyestalks. These crabs live among vegetation and rocks, high in the intertidal zone.

Three families of mostly nonpredatory Brachyura (Ocypodidae, Goneplacidae, and Mictyridae) have closely spaced long eyestalks, as in Figure 33.10. These crabs live in burrows on mud flats in the lower intertidal zone or below the water line. Their long eyestalks act as mobile periscopes. As they rotate from side to side they allow the animals to see long distances in all directions. The crabs rotate their eyes when they are in their burrows, feeding on mudflats, or partially submerged. Their eyes are vertically elongated and have greater vertical than horizontal acuity. They also have a zone of high acuity round the eye’s horizontal meridian. Crabs with widely spaced eyes lack these zones of high acuity. Zeil et al. (1986) suggested that eyes with long eyestalks and high vertical resolution evolved in crabs to facilitate detection of distance. If the eyestalks maintain a constant angle of elevation with respect to the horizontal, the distance of an object on flat terrain is indicated by the height of the image in the compound eye. The visual axis of a given ommatidium intersects a flat terrain at distance D = H/tan ϑ, where H is eye height and ϑ is the angle between the





Figure 33.10. Eye stalks of a fiddler crab. Three families of Brachyura have narrowly spaced, long eye stalks. This specimen is a female fiddler crab (Uca vomeris). (From Zeil and Hemmi 2006, with kind permission from Springer Science+Business Media)

visual axis of the ommatidium and the horizon of the flat plane. The horizontal resolution is nearly constant over the eye but vertical resolution peaks along the central horizontal meridian where ommatidia are most closely packed (Land and Layne 1995). Resolution of distances should therefore be best for objects on the horizon. Any object, at any distance on flat terrain, which projects an image above the horizon plane of the crab’s eye is larger than a crab. Such an object could be a predator. Fiddler crabs (Uca) have closely spaced long eyestalks. They live in burrows in intertidal sand (Zeil and Hemmi 2006). They emerge from their burrows at low tide to feed within a radius of about 1 m. They move sideways from the burrow along radial paths so as to keep their lateral body axis aligned with the burrow. When they move along a circumferential path, they reorient the body to keep it aligned with the burrow. They also compensate for passively imposed rotations of a disk on which they are standing (Layne et al. 2003). If they detect a predator, they are able to return rapidly to the burrow by simply reversing their sideways direction of motion. Also, by keeping a constant orientation to the burrow, they are able to detect the distance of other male crabs that may be trying to occupy their burrow. When this distance reaches a critical value, the crab rushes to secure the burrow (Hemmi and Zeil 2003). When sandpaper on which crabs were moving was displaced, the crabs returned to a site that was also displaced in the same way (Zeil 1998). Since fiddler crabs live on flat terrain, the distance of the burrow and of an intruder is indicated by the height of the image relative to the horizon. They also use path integration (Section 37.2) to record the distance to the burrow when the burrow is out of sight (Layne et al. 2003). When using path integration, they are able to make allowance for a hill introduced into the return path (Walls and Layne 2009). When a male fiddler crab detects a female crab it faces in that direction and waves its large claw in a circular motion.



As the female approaches from about 70 to 20 cm, the male crab modulates the temporal and structural components of the waving motion and finally coaxes the female into the burrow, where mating takes place (How et al. 2008). Crabs with short eyestalks have widely separated eyes with visual fields that overlap almost completely. This would provide a good basis for stereoscopic vision. Disparity detection would require convergence of inputs from the two eyes onto binocular cells somewhere in the visual system. Sztarker and Tomsic (2004) recorded from single neurons in the optic lobes of the crab Chasmagnathus in response to pulses of light and a moving black screen. The eyestalks were clamped. Neurons in the medulla of the optic lobe responded only to stimuli applied to the ipsilateral eye. However, movement-detector neurons in the lobula responded in a similar way to both ipsilateral and contralateral moving stimuli. Sztarker and Tomsic concluded that contralateral inputs reach each lobula via the supraesophageal ganglion. Crabs retreat when presented with a looming stimulus. Oliva et al. (2007) recorded from movement-detectors in the lobula of the crab Chasmagnathus. Their rate of firing closely matched the dynamics of an expanding image. Also, responses to approaching, receding, and laterally moving stimuli reflected the different ways in which the crab reacted to these stimuli. Optokinetic eye rotations (OKN) in crabs are evoked strongly by body rotation but only weakly by body translation (Blanke et al. 1997). During body rotation, images of stimuli at all distances move at a similar angular velocity. Therefore, OKN stabilizes the whole retinal image. During body translation, images of distant objects move more slowly than those of near objects. Therefore, OKN does not stabilize the whole image. Primates solve this problem by coupling the optokinetic responses to the stereoscopic system (Section 22.6.1). However, there is no behavioral evidence of stereoscopic vision in crabs.
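The eye-height rule introduced above (D = H/tan ϑ) is easily made concrete. In this sketch (the function and example values are hypothetical), distance on flat terrain follows from eye height and the depression of the viewing direction below the horizon; anything imaged at or above the horizon must be at least as tall as the observer:

    import math

    def distance_on_flat_ground(eye_height, depression_deg):
        # D = H / tan(theta), where theta is the angle of the ommatidial
        # axis below the visual horizon.
        if depression_deg <= 0:
            return float('inf')  # at or above the horizon: object taller than the crab
        return eye_height / math.tan(math.radians(depression_deg))

    # An eye 2 cm above the mud viewing a burrow 5 degrees below the horizon:
    print(distance_on_flat_ground(0.02, 5.0))  # ~0.23 m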

OT H E R M E C H A N I S M S O F D E P T H P E R C E P T I O N

33.3 FISH

33.3.1 THE VISUAL SYSTEM OF FISH

In water, the cornea has no refractive power. Therefore, the lens of bony fish (teleosts) is the main refractive element. For this reason fish have spherical lenses with a fixed short focal length. The spherical lenses form an equally good image in all directions and have little spherical aberration. Accommodation on distant objects is achieved by contraction of the retractor lentis muscle, which moves the lens toward the retina (see Locket 1977). In teleosts, the majority of optic nerve fibers project to the optic tectum—the homologue of the superior colliculus of mammals (Guthrie 1990). From the tectum, pathways project to the tegmentum, to three centers in the diencephalon (thalamus, preoptic area, and pretectum), and to visual centers in the telencephalon. There are also direct visual inputs to mesencephalic nuclei, the pretectum, and the thalamus (Schellart 1990). Bony fish have a small binocular field. Most, if not all, visual inputs project contralaterally to each tectum. A few uncrossed inputs occur in the tecta of goldfish (Springer and Gaffney 1981). Ipsilateral inputs may also reach the tectum by way of the tectal commissure. Recordings from neurons in the tectal commissure revealed only large poorly defined monocular receptive fields (Mark and Davidson 1966). Some investigators found no binocular cells in the tecta of fish (Schwassmann 1968; Sutterlin and Prosser 1970; Fernald 1985). However, a few binocular cells with large receptive fields and low sensitivity have been found in the tecta of goldfish (see Guthrie 1990) and trout (Galand and Liege 1974). The grating resolution of fish varies between 4 and 20 arcmin, according to species, compared with about 1 arcmin for humans (Nicol 1989; Douglas and Hawryshyn 1990). Fish have the optokinetic eye movements (OKN) found in all animals with mobile eyes. Some teleost fish have foveas (Walls 1963), and some, such as the African cichlid fish (Haplochromis burtoni), move their eyes independently to foveate particular objects. Some voluntary saccades involve a coordinated change in vergence when the object is in the binocular field (Schwassmann 1968; Fernald 1985). Insofar as vergence movements are evoked by image disparity, binocular cells must be involved. It is not clear what function these movements serve. Many visual cues to distance, such as perspective, the ground plane, and overlap, are weak or absent in the oceans. There is no evidence of binocular stereopsis in fish. However, fish are able to discriminate between objects at different distances. Douglas et al. (1988) trained goldfish (Carassius auratus) to distinguish a 5-cm disk from a 10-cm disk at a fixed viewing distance. The animals could still distinguish disks that were separated in depth but subtended the same visual angle. The fish thus exhibited size constancy, which implies

Hammerhead sharks (Sphyrnidae) have a very wide, flattened head known as a cephalofoil. The eyes are therefore very far apart. McComb et al. (2009) showed that the width of the cephalofoil in different species of shark is correlated with the degree of binocular overlap of the visual fields. But there is no evidence that sharks have stereoscopic vision. There is evidence that the cephalofoil of the hammerhead shark (Sphyrna lewini) provides a large surface over which electrosensory and olfactory receptors are distributed. This increases the area over which the fish can search for prey buried in the sea floor (Kajiura and Holland 2002; Kajiura et al. 2005).

33.3.2 ADAPTATIONS IN DEEP-SEA FISH

Water rapidly attenuates sunlight, so that there is very little sunlight below a depth of 1000 m. However, there is some light from bioluminescence. The larvae of some deep-sea fish, such as the ribbon sawtail fish (Idiacanthus fasciola), have eyes on the ends of long cartilaginous rods. The rods hinge about their base and are probably used for scanning rather than for binocular vision. As the fish mature, the eyestalks shorten and the rods coil up behind the eyes (Walls 1963).

Many deep-sea fish have large tubular eyes with large-aperture lenses and retinas that contain only rods. The visual axes are generally parallel. This gives the fish frontal vision but a reduced field of view. Overlapping fields of view of eyes with a fixed angle of convergence would supply depth information. The frontal vision does not seem to have evolved to serve predation, since many of these fish are not predators. Weale (1955) suggested that the visual fields of the eyes overlap to improve visual sensitivity in the dark deep-sea environment. Binocular vision improves visual sensitivity only marginally at high luminance levels but significantly at low levels (Section 13.1.2); a simple illustration of this point follows the list below.

Many deep-sea fish direct their tubular eyes upward so that they can see prey silhouetted against the sunlight above them. But this presents a problem because the space in front of the mouth is outside the field of view of these eyes. Also, upwardly directed eyes do not allow the fish to see sources of bioluminescence around and beneath them. These problems have been solved in several ways.

1. Stylephorus chordatus has forward-directed tubular eyes but hangs its body vertically (Locket 1977). It aligns its head with the prey and sucks it into its large buccal cavity.

2. Some fish with upward-directed tubular eyes have a pad of light guides that direct light from several directions onto a specialized region of the retina (Locket 1977).


3. The spookfish (Dolichopteryx longipes) has evolved a unique mechanism for detecting objects above and below it (Wagner et al. 2008). Each eye has two parts separated by a septum. A tubular eye is directed upward, and an ovoid outgrowth (the diverticulum) is directed downward. However, light cannot reach the diverticulum through the ventral cornea. Instead, light is reflected onto the diverticulum by a mirror on its dorsomedial wall. The mirror consists of reflective plates derived from the tapetum. Each plate is set at an angle that changes progressively around the mirror to form a concave mirror that focuses the light. This is the only known case of an eye in which the image is formed by a mirror.

4. The fish Macropinna microstoma has another mechanism (Robison and Reisenbichler 2008). Each of its tubular eyes is in a transparent dome-like canopy. The insertions of the extraocular muscles have evolved so that the muscles can rotate the eyes through 90° in the sagittal plane. The optic nerve enters each eye at the axis of rotation so that it does not twist when the eye rotates. When the fish has detected a prey object above, it pivots its body upward while its eyes remain locked on the prey so that, when the body is vertical, the eyes are aligned with the mouth. The transparent dome over the eyes probably protects them against the stinging tentacles of large siphonophores, such as the Portuguese Man o’War, from which the fish steal prey with their small mouths.
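Weale’s sensitivity argument can be given a concrete, if simplified, form. The sketch below assumes pure probability summation, in which each eye independently detects a dim target with probability p; real binocular summation (for example, neural pooling of photon catches) differs in detail, so the numbers only illustrate why the binocular gain is largest when detection is hardest.

# If each eye alone detects a target with probability p, two eyes
# detect it with probability 1 - (1 - p)^2. The relative gain
# approaches 2 as p becomes small (dim, deep-sea conditions) but is
# negligible when p is already near 1 (bright conditions).
for p in (0.95, 0.50, 0.10, 0.01):
    p_bin = 1 - (1 - p) ** 2
    print(f"p(one eye) = {p:4.2f}   p(two eyes) = {p_bin:6.4f}   "
          f"gain = {p_bin / p:4.2f}x")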

33.3.3 ARCHER FISH

Archer fish (Toxotes jaculatrix) expel a precisely aimed jet of water from the mouth at insects or other small animals, such as lizards, sitting on plants overhanging the water. The fish shoot the jet at an angle of between 60 and 88° (Schuster et al. 2004). A jet of water rising obliquely follows a curved trajectory because of the pull of gravity. The fish must allow for this curvature when aiming at prey.

Archer fish prefer prey of a given size. Four fish were rewarded over a period of several weeks for shooting at a 6-mm black disk at various heights and horizontal distances (Schuster et al. 2004). After training they could selectively respond to a disk of the correct size but variable height and distance. The trained fish therefore had size constancy.

The adhesive force with which an animal clings to a branch is proportional to its weight. Therefore, the jet must be more powerful to dislodge a large animal than a small one. Schlegel et al. (2006) found that the volume of the jet, rather than its velocity, is proportional to the size of the prey. This is a rational strategy because the kinetic energy of a jet increases with the square of its speed but only linearly with its volume. But the strategy works only if distance is taken into account when the fish judge the size of a prey object. Schlegel et al. concluded that the relation between prey size and jet volume is innate, because fish reared with only small prey increased the jet volume when first presented with larger prey. They also concluded from this that the relation is not subject to modification through experience. But their evidence on this point is not conclusive. A stronger test would be to rear fish with flies attached by larger than normal adhesive force. The fish would have to learn to adjust jet volume, or starve.
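The energetic argument can be checked with the standard expression for the kinetic energy of a moving mass of water, E = (1/2) rho V v^2, where rho is the density of water, V the jet volume, and v its speed. The numbers below are illustrative, not measurements from Schlegel et al.:

rho = 1000.0   # kg/m^3, density of water
V = 1e-7       # m^3 (0.1 ml), assumed jet volume
v = 2.0        # m/s, assumed jet speed

E = 0.5 * rho * V * v ** 2
E_twice_volume = 0.5 * rho * (2 * V) * v ** 2   # doubles the energy
E_twice_speed = 0.5 * rho * V * (2 * v) ** 2    # quadruples the energy
print(E_twice_volume / E, E_twice_speed / E)    # -> 2.0 4.0

Scaling volume thus gives the fish proportional, linear control over the energy delivered to the prey, whereas changing jet speed would alter it quadratically.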



To hit a moving insect, an archer fish would have to predict where the insect will be when the water jet reaches it. Schuster et al. (2006) first tested the ability of 10 fish to aim at stationary prey at heights between 15 and 60 cm. Their accuracy was severely degraded when they were tested with a moving insect. The fish were then trained over a period of several weeks with targets moving at various horizontal speeds and at various heights. Their hit rate improved considerably. There was no change in the speed or size of the water jet during training. Schuster et al. concluded that the fish adjusted the timing of the shot to take account of the predicted location of the moving prey.

A dislodged insect falls onto the water, where it is picked up either by the fish that shot it down or by a nearby fish. Within 100 ms of an insect being dislodged, archer fish turned to where the insect was going to hit the water (Rossel et al. 2002). The fish then adjusted their swimming speed according to the distance of the impact point (Wöhl and Schuster 2006). The fish slightly underestimated the distance, because an undershoot is easier to correct than an overshoot. When the insect was attached to a thread so that its fall was deflected before it hit the water, the fish still went to where the insect would have landed had it not been deflected. The swimming motion was therefore based on the initial 100 ms of the fall path of the insect rather than on continuous visual feedback. Rossel et al. produced evidence that archer fish determine the impact point of an insect on the water surface from its height and velocity during the initial 100 ms. It seems that the rapid motion of fish toward falling prey is an adaptation of the rapid escape reaction shown by all teleost fish when they are threatened (Wöhl and Schuster 2007).

Archer fish are not the only animals that shoot at targets. Some cobras spit venom into the eyes of predators (Section 33.5). Also, some spiders spit a gummy secretion onto their prey (Nentwig 1985).

33.4 AMPHIBIANS

The class Amphibia consists of three orders: the anurans (frogs and toads), the urodeles (newts and salamanders), and the Gymnophiona (caecilians, or legless, sightless, burrowing amphibians).

33.4.1 FROGS AND TOADS

33.4.1a Visual System of Frogs and Toads

Frogs of the family Ranidae, such as Rana pipiens, live both in water and on land and have laterally placed eyes with a


visual field of almost 360° and a frontal binocular field up to 100° wide. Fite (1973) claimed that the binocular field widens to about 160° above the head, but other investigators have revised this value down to 60° (Grobstein et al. 1980).

Frogs of the family Pipidae, such as Xenopus laevis, are adapted to a completely aquatic life and feed on prey swimming above them. Because of this, their eyes migrate to the top of the head during metamorphosis. Each eye migrates 55° nasally and 50° dorsally with respect to the major body axes. As a result, the lateral extent of the binocular field increases from 30° to 162°. Toads (Bufo bufo) have panoramic vision, with a large binocular field extending above the head.

Most of the optic fibers from each eye of frogs project retinotopically to the contralateral optic tectum (mesencephalon) through the lateral and medial divisions of the optic tract. In Ranid frogs, only about 2.3% of visual inputs project directly to the ipsilateral tectum (Singman and Scalia 1990). It seems that Pipid frogs, such as Xenopus, have a greater proportion of direct ipsilateral projections. In Xenopus, contralateral projections arise from the whole retina and are well formed in the tadpole. Ipsilateral projections, mainly from the temporoventral retina, are not fully formed until some months after metamorphosis (Hoskins and Grobstein 1985a, 1985b). In other vertebrates, including primates, ipsilateral projections to the LGN develop after contralateral projections (Section 6.3.5). Metamorphosis in Xenopus is initiated by thyroxin secreted by the thyroid gland. When thyroxin production is blocked, ipsilateral projections do not develop (Hoskins and Grobstein 1985c).

The deeper layers of the rostromedial region of each tectum of the frog contain binocularly driven cells with direct inputs from the contralateral eye. Inputs to each tectum from the ipsilateral eye are routed through the contralateral tectum by way of the nucleus isthmi and postoptic commissures, as shown in Figure 33.11 (Fite 1969; Keating and Gaze 1970a; Raybourn 1975; Gruberg and Lettvin 1980).

It had been claimed that inputs from only the central region of the binocular field of frogs project to binocular cells in both optic tecti (Gaze and Jacobson 1962). However, subsequent investigations revealed that the whole binocular field of frogs projects to binocular cells in each tectum (Keating and Gaze 1970a). Thus, each tectum contains a complete representation of the binocular field. In each tectum, ipsilateral inputs from the whole of one eye are in spatial correspondence with contralateral inputs from the whole of the other eye. During development, the contralateral map develops first and does not require visual experience. A coarse ipsilateral map develops at the end of metamorphosis, but its refinement requires visual experience (Brickley et al. 1998). In contrast, in mammals, the two halves of the binocular field project to different hemispheres.

Figure 33.11. The visual pathways of a frog. Inputs from each eye (thick lines) innervate binocular cells of the contralateral tectum directly. Binocular cells in the tectum receive their inputs from the ipsilateral eye from the contralateral tectum via the nucleus isthmi (thin lines).

Inputs from both eyes of the frog can be forced to grow directly into the same tectum (see Section 6.7.3a).

The caudolateral region of each tectum receives direct inputs from the peripheral monocular visual field of the contralateral eye. It also receives inputs from the monocular field of the ipsilateral eye by way of the nucleus isthmi (Glasser and Ingle 1978). In each caudolateral region the two monocular visual fields are mapped in mirror-symmetric fashion (Winkowski and Gruberg 2005). Motion signals conveyed to these superimposed monocular maps could allow frogs to detect different patterns of optic flow. For example, forward motion of the frog produces peripheral optic flow in the two eyes in opposite directions, while head rotation produces optic flow in the same direction.

Inputs from each eye also project to three areas of the thalamus (diencephalon), which also receive inputs from the optic tectum. The dorsal anterior thalamus contains binocular cells that receive direct inputs from both ipsilateral and contralateral corresponding regions of the binocular field (Székely 1971; Keating and Kennard 1976). Visual inputs also go from the dorsal thalamus to the forebrain (telencephalon), an area that also receives somatosensory inputs (Kicliter and Northcutt 1975). Some cells in the telencephalon are binocular, but little is known about their function (Liege and Galand 1972).


Some binocular cells in the optic tectum of Rana pipiens have small receptive fields and respond best to spots that are dark below and light above (Finch and Collett 1983). Other cells have very large receptive fields and respond to horizontal boundaries, dark below and light above. The cells with small receptive fields are tuned to zero vertical disparity and, on average, to 1.7° of horizontal disparity. The eyes of the frog do not change their vergence and are about 1.5 cm apart. Cells registering 1.7° of disparity were therefore stimulated maximally by an object at a distance of 50 cm. The largest disparity tuning was 3.4°, which corresponds to a distance of 25 cm. This suggests that these binocular cells of the tectum are not responsible for the estimation of distances up to 25 cm over which the animals snap at prey (the snapping zone).

Gaillard (1985) conducted a similar study in Rana esculenta and agreed that tectal cells with small receptive fields are not tuned to disparities in the snapping zone. However, this study also revealed a group of cells with receptive fields about 5° in diameter tuned to disparities between 12° and 0.25°, corresponding to distances between 5 and 300 cm. These neurons would be able to detect distances in the snapping zone.

While the eyes of Xenopus migrate to the top of the head, the binocular field increases from 30° to 162°. At the same time, the area of each tectum devoted to the binocular visual field increases from 11% to 77%. There is a corresponding increase in the intertectal commissures in the nucleus isthmi responsible for the ipsilateral retinotectal projection (Beazley et al. 1972; Grant and Keating 1989a). Also, there is an increase in the dendritic branching of axon arbors in the tectal laminae receiving ipsilateral projections. These branches are at first widely distributed but are gradually transformed into the compact arbors typical of the adult tectum (Udin 1989).

Although visual inputs are not required for the development of isthmotectal projections and their increased branching in Xenopus, they are required for the development of compact arbors, which correspond topographically to the contralateral inputs. Thus, in Pipid frogs reared in the dark, the contralateral retinotectal projections develop normally, but the mapping of the intertectal system relative to the contralateral system shows signs of disorder (Grant and Keating 1989b). However, normal intertectal mapping is restored when sight is restored, even in frogs deprived of vision for 2 years after metamorphosis (Keating et al. 1992). Thus, in the development of normal binocularity, Xenopus has no critical period for visual experience like that found in mammals (Section 8.3). Furthermore, a severed optic nerve of the adult frog regenerates, and the axons reestablish their proper topographical distribution on the tectum and their proper distribution of distinct cell types in different layers within the tectum (Keating and Gaze 1970b).

The visual system of anurans was reviewed by Grüsser and Grüsser-Cornehls (1976).
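The disparity-to-distance conversions quoted above follow from simple geometry. For eyes locked at a fixed vergence, approximated here as parallel visual axes, an object at distance d produces an absolute disparity of about a/d radians, where a is the interocular separation. A minimal sketch using the 1.5-cm separation given in the text (the small-angle approximation becomes rough at the largest disparities):

import math

a = 0.015  # m, interocular separation of the frog (from the text)

def distance_from_disparity(disparity_deg):
    # distance ~= separation / disparity (disparity in radians)
    return a / math.radians(disparity_deg)

print(distance_from_disparity(1.7))   # ~0.51 m, cf. the 50 cm above
print(distance_from_disparity(3.4))   # ~0.25 m, cf. the 25 cm above
print(distance_from_disparity(12.0))  # ~0.07 m, near the end of Gaillard's range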



33.4.1b Adaptations to Eye Rotation

When one eye of the Pipid frog Xenopus at the tadpole stage was rotated through 90° or 180°, a new set of intertectal connections developed that brought the retinas back into spatial correspondence, but only when the animal was allowed to see (Keating and Feldman 1975; Grant and Keating 1992). Intertectal development was disrupted, and adaptation to eye rotation did not occur, in Xenopus larvae continuously exposed to 1-Hz stroboscopic illumination (Brickley et al. 1998). Thus, in Xenopus, interactions between signals from the two eyes guide the development of intertectal connections so that tectal neurons receive inputs from corresponding visual directions.

After eye rotation in the tadpole of Xenopus, a normal coarse mapping of the ipsilateral inputs develops initially under the control of genetic factors. At 2 weeks after metamorphosis, in a normal environment, axons from the rotated eye retain their old connections but begin to sprout new connections according to activity-dependent cues. The old connections gradually withdraw (Guo and Udin 2000).

When the Ranid frog Rana pipiens was reared from the tadpole stage with one eye rotated 180°, there was no evidence of remapping of central connections, even after both eyes were opened (Jacobson and Hirsch 1973; Skarf 1973). Thus, it seems that Pipid frogs compensate for eye rotation while Ranids do not. Presumably, visually guided plasticity occurs in Pipids because they must adapt visual functions to the migration of their eyes to the top of the head.

When one eye of a Xenopus tadpole was replaced by an opposite eye from another tadpole, about half the animals developed intertectal connections with proper spatial correspondence, but the other animals developed abnormal intertectal connections (Beazley 1975).

Visual experience is required for the development of normal binocular vision in mammals (Section 8.2). However, like Ranid frogs, mammals do not compensate for rotation of one eye through a large angle (Yinon 1975).

33.4.1c Depth Discrimination in Frogs and Toads

When prey is seen within 1 or 2 meters, a frog moves forward in a series of jumps, making detours around objects if necessary. When the prey is within about 5 cm, it is captured by a flick of the tongue. The distance at which jumping gives way to snapping depends on the size of the frog (Grobstein et al. 1985). Frogs usually orient themselves to bring the prey close to the body midline but can catch prey without first orienting the eyes or head, even when the prey is 45° or more from the midline. Within a distance of about 15 cm, frogs selected from several prey items on the basis of their linear sizes rather than their angular sizes. This suggests that they have size constancy within this range (Ingle and Cook 1977).


Toads orient themselves to center prey in the midline and then capture it with a flick of the tongue (Fite 1973). Like frogs, toads do not converge their eyes, so snapping is not guided by vergence movements (Grüsser and Grüsser-Cornehls 1976; Grobstein et al. 1980). As in frogs, inputs from corresponding areas of the two retinas converge in the thalamic and midbrain visual areas.

Frogs and toads have good depth discrimination, as revealed by their ability to avoid objects and jump through apertures, over gaps, down steps, and onto objects (Ingle 1976). Toads make a detour around a fence when prey cannot be caught through the fence. They make for gaps formed by overlapping barriers at different distances only if the gap is wide enough, a task that requires detection of the depth between the barriers (Lock and Collett 1979).

A frog’s ability to catch prey was not much affected by severing the optic tract of one eye, so monocular cues to distance are sufficient for this purpose (Ingle 1972). Like toads, frogs probably use accommodation (Collett TS 1977). Anurans, like fish, accommodate by moving the lens forward or backward rather than by changing its shape. Two protractor lentis muscles attached between the lens and cornea move the lens about 150 μm to achieve a 10-diopter change in refraction. Section of the nerves serving these muscles did not affect the accuracy of prey-catching when both eyes were open. But, with one eye occluded, frogs and toads consistently underestimated the distance of prey (Douglas et al. 1986).

Stereoscopic vision based on disparity has not been found in frogs, but there is evidence of disparity-based stereopsis in toads, as we will now see. When a negative lens was placed in front of one eye, with the other eye closed, toads undershot the prey by an amount predictable from the assumption that they were using accommodation to judge distance. When both eyes were open, negative lenses had no effect. This suggests that toads use binocular disparity when both eyes are open. This conclusion is strengthened by the fact that base-out prisms in front of the eyes caused toads to undershoot the target (Collett 1977; Collett et al. 1987) (Portrait Figure 33.12). Thus, toads use accommodation when one eye is closed and binocular disparity when both eyes are open. This conclusion was confirmed by Jordan et al. (1980), who showed that drug-induced contraction or relaxation of the accommodation muscles had little effect on the distance judgments of toads with both eyes open but produced severe undershooting in monocular toads.

Since the vergence position of the eyes of toads is fixed, the size and sign of the binocular disparity between the images of an object signify the absolute distance of the object from a fixed point. The distance of the fixation point would be easy to register.

Figure 33.12. Thomas Stephen Collett. Born in London, England, in 1939. He obtained a B.A. in psychology in 1960 and a Ph.D. in zoology in 1964, both from University College, London. In 1965 he joined the faculty of the University of Sussex, England, where he is now professor in the School of Life Sciences.

Collett and Udin (1988) revealed another monocular cue used by the frog Rana pipiens. They placed frogs on a transparent sheet of plastic and a prey object on a second surface at various distances below the plastic sheet, so that its image was unusually elevated on the frogs’ retinas. The animals snapped increasingly short of the prey as the distance of the prey below the plastic sheet was increased. Since the eyes of a frog are normally a fixed distance above the ground, the angle of elevation of the image of a prey object is related to its distance. This is the same mechanism as that used by some crabs, as described in Section 33.2.2. House (1982) developed a model of depth perception in frogs and toads.
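The elevation cue lends itself to a back-of-envelope treatment. For an eye at a fixed height h above the ground, a prey image declined by angle e below the horizontal lies at ground distance d = h / tan(e). The eye height and angles below are assumed for illustration, not taken from Collett and Udin:

import math

h = 0.03  # m, assumed eye height above the ground

for decl_deg in (10, 20, 40):
    d = h / math.tan(math.radians(decl_deg))
    print(f"declination {decl_deg:2d} deg -> ground distance {d:.3f} m")

# Lowering the prey below the support surface increases the declination
# of its image, so this rule signals a shorter distance, which is
# consistent with the observed undershooting.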

33.4.2 SALAMANDERS

About two-thirds of living salamander species belong to the family Plethodontidae (Figure 33.13). Most Plethodontids have projectile tongues, which constitute the fastest feeding mechanism among vertebrates. The animals wait in ambush until a prey object appears and then project the tongue with great accuracy in both direction and distance. The tongue extends up to two-thirds of their body length within 10 ms. They can catch flies on the wing, implying that they register the temporal as well as the spatial characteristics of the stimulus. Some Plethodontids have frontal eyes with a 90° binocular field.

Figure 33.13. European salamander (Salamandra salamandra).

Salamanders have three optic tracts that emerge from the chiasm. The basal tract extends to the peduncle, the marginal tract to the optic tectum, and the medial tract to the anterior thalamus. In most amphibians, each optic tectum receives a direct input from the contralateral eye and an indirect input from the ipsilateral eye through the nucleus isthmi. This relayed ipsilateral input involves a delay of at least 30 ms. Tongue-projecting salamanders are unique among amphibians in having a large direct ipsilateral projection to the tectum in addition to the indirect input relayed through the nucleus isthmi (Roth 1987). Since a direct tectal pathway is more rapid than an indirect pathway, tongue-projecting salamanders are able to process binocular inputs more rapidly than other amphibians.

In 14 species of Plethodontids, the degree of ipsilateral innervation of the tectum was found to be related to the degree of frontal vision (Rettig and Roth 1986). Members of the Bolitoglossini subgroup of Plethodontids have the most frontally placed eyes. The ipsilateral projection along the marginal optic tract innervates the entire tectum and is equal to the contralateral projection. Thus, each tectal hemisphere receives a complete projection from both eyes within the binocular visual field. The ipsilateral and contralateral projections are in topological correspondence and project onto binocular cells. The eyes have a fixed convergence, and the monocular receptive fields are in closest correspondence for stimuli at about the maximum distance over which the tongue projects (Wiggers et al. 1995). All this evidence suggests that tongue-projecting salamanders judge absolute distance by a range-finding disparity mechanism, but the details of disparity coding in the tectum remain to be discovered.

Salamanders change their accommodation as they turn to fixate a prey object (Werner and Himstedt 1984). Salamanders (Salamandra salamandra) with one eye removed caught prey as well as binocular animals did, but those with one eye occluded often missed the prey (Luthardt-Laimer 1983). It seems that salamanders with only one eye rely on accommodation but that, for some unknown reason, this option is not available to monocularly occluded animals.

Aquatic salamanders possess electric sense organs, which they probably use to detect electric discharges emitted by prey animals, especially at night (Fritzsch and Münz 1986).

33.5 REPTILES

Lizards have a third eye, known as the parietal eye, located in the midline on top of the head. The third eye has a simple lens and receptors that feed directly into ganglion cells, which project to an area in the midbrain. The eye has no bipolar cells or other interneurons. Nevertheless, recordings from ganglion cells have revealed an opponent blue-green color mechanism (Solessio and Engbretson 1993). Each receptor contains a green-sensitive opsin that causes the receptor to depolarize in green light and a blue-sensitive opsin that causes the cell to hyperpolarize in blue light. This bipolar response system is unique to these receptors (Su et al. 2006). It seems that the parietal eye serves to detect changes in light at dawn and dusk.

Chameleons are arboreal lizards that live in shrubs (Figure 33.14). A chameleon spends most of its time sitting on a branch waiting for a small insect to come within range of its sticky tongue. Chameleons have an all-cone retina and a well-developed fovea. The laterally placed eyes are mounted in turret-like enclosures and can move 180° horizontally and 80° vertically (Walls 1963; Frens et al. 1998). Like the eyes of primates, the lateral eyes of the chameleon obey Listing’s law (Section 10.1.2e). Optokinetic nystagmus and the vestibuloocular response together achieve image stability with a gain of about 0.8 as the body moves (Gioanni et al. 1993).

When a chameleon is waiting for prey, large spontaneous saccadic movements occur at random intervals independently in the two eyes (Mates 1978). Accommodation is also independently controlled in the two eyes. When a prey object has been detected, both eyes fixate it and accommodate to the same distance. However, the vergence angle is not clearly related to the distance of the object (Ott et al. 1998). A moving prey object is pursued partly by movements of the head and partly by synchronous saccadic eye movements (Ott 2001). When the prey is in the midline of the head, the chameleon shoots its tongue out at great speed, up to 1.5 times the length of its body.

Harkness (1977) filmed the motion of the tongue of a chameleon and found that the distance moved was related to the distance of the prey. Convergent prisms placed before the eyes did not cause the animals to undershoot the target. Also, the precision of convergence is very poor. It seems that chameleons do not use the vergence of the eyes as a cue to distance. However, when a negative lens was placed in front of one eye with the other eye closed, the animals undershot the prey. When a positive lens was used, they overshot the prey.




Figure 33.14. The African chameleon (Chamaeleo chamaeleon). (From the Larousse Encyclopedia of Animal Life, Hamlyn 1967)

It therefore seems that chameleons use accommodation for judging distance. The accommodation response of chameleons is unusually rapid and precise. The following features of the visual system improve the precision of accommodation in chameleons:

1. The unaccommodated lens has negative power, a feature not known in any other animal. A negative lens combined with a positively powered cornea forms a telephoto lens that creates an unusually large image for the size of the eye. This improves spatial resolution and the precision of accommodation (Ott and Schaeffel 1995). A telephoto lens also increases the range of accommodation, which is unusually large (45 diopters) in chameleons.

2. Chameleons dilate their pupils when aiming at prey. Pupil dilation reduces the depth of field and therefore makes it easier to detect when an object is not in the focal plane.

3. During saccadic eye movements, only one eye is accommodated appropriately, but attention switches from one eye to the other at intervals of about 1 s. Just before tongue release, both eyes converge and focus on the prey, which improves the precision of accommodation (Ott et al. 1998).

4. Like other lizards and many birds and fish, chameleons have convexiclivate foveas, in which the sides of the foveal pit are convex. Animals with well-developed binocular vision, such as primates, have concaviclivate foveas. Harkness and Bennet-Clark (1978) showed that the pincushion distortion produced by refraction at the surface of a convexiclivate fovea is a sensitive indicator of the direction and magnitude of misaccommodation.

A telephoto lens brings the nodal point of the chameleon eye well forward of the eye’s center of rotation. This creates motion parallax when the eye is rotated, which could provide a monocular depth cue in addition to accommodation (Land 1995).

Kirmse et al. (1994) proposed a novel mechanism for locating prey. The chameleon fixates a prey object with one or the other eye so that its image falls on the fovea. It then turns to bring the object into the median plane of the head and places the eyes in a fixed position of divergence of between 17 and 19°. Thus, just before the tongue strikes, the distance of each image from the fovea is a function of the distance of the prey from the chameleon. Because the eyes are diverged, the image of a distant target falls nearer the fovea than does the image of a nearer target. This is not a disparity mechanism, because it does not involve comparison of the images in the two retinas. This explains why chameleons can aim the tongue accurately with one eye occluded. Perhaps chameleons base distance estimates on accommodation, as Harkness proposed, as well as on the lateral position of images. There is no evidence that chameleons use binocular disparity.
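The proposal by Kirmse et al. can be expressed as a simple monocular range-finding relation. With the prey in the median plane and each visual axis diverged outward by a fixed angle, the image’s distance from the fovea decreases monotonically as prey distance increases. The eye separation and divergence values below are assumptions chosen for illustration, not parameters from the study:

import math

a = 0.02     # m, assumed interocular separation
beta = 9.0   # deg, assumed outward rotation of each visual axis

for d in (0.05, 0.10, 0.20, 0.50):
    # Direction of the prey relative to straight ahead of one eye.
    alpha = math.degrees(math.atan(a / (2 * d)))
    eccentricity = beta + alpha  # image angle from the fovea
    print(f"prey at {d:4.2f} m -> image eccentricity {eccentricity:5.2f} deg")

Because each eye’s reading is taken separately, the mechanism remains monocular, which is consistent with accurate tongue strikes under one-eyed viewing.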

Texas horned lizards (Phrynosoma cornutum) capture prey with their prehensile tongues. With one eye covered, a negative lens applied to the open eye induced underestimation of the strike distance. Thus, like chameleons, these lizards use accommodation to judge distance. However, performance was not affected when lenses of equal power were applied to both open eyes (Ott et al. 2004). It therefore seems that these lizards also use information derived from binocular vision.

The turtle Chinemys reevesi has extensive ipsilateral projections to the thalamus, and its visual fields overlap by about 48°. On the other hand, the predatory turtle Trionyx cartilagineus has no ipsilateral visual projections, even though its visual fields overlap by about 67° (Hergueta et al. 1992). Stereoscopic vision has not been demonstrated in these species.

Several species of cobras, known as “spitting cobras,” spit venom from their fangs into the eyes of predators, even when the target is 1 m or more away and moving. They can spit several times in quick succession, so they can use visual error feedback to improve accuracy (Westhoff et al. 2010).

33.6 BIRDS

33.6.1 PIGEONS

The domestic pigeon (Columba livia) has laterally placed eyes with an angle of about 120° between their optic axes. The total visual field is about 340°. A frontal binocular region has a maximum width of about 27° and extends from 90° above the beak to about 40° below it (Martin and Young 1983). In the sample of birds tested by Martin and Young, the visual field was widest 20° above the plane containing the eyes and the tip of the beak. In the sample tested by Martinoya et al. (1981), it was widest 45° below this plane (see also McFadden and Reymond 1985; Holden and Low 1989). These differences may reflect differences in eye convergence. In any case, the binocular field is well placed to help the bird peck seeds on the ground and locate a landing site.

The binocular region of the pigeon’s visual field is served by a specialized area in the temporal retina that contains a high density of receptors, known as the area dorsalis. The area projects 10 to 15° below the eye-beak plane and is distinct from the more laterally placed fovea used for lateral monocular viewing, as shown in Figure 33.15. Cones in the region of the area dorsalis contain red oil droplets, whereas those in the region of the lateral fovea contain yellow pigment. Resolution of a high-contrast grating presented to the lateral fovea is between 1.16 and 4.0 arcmin, compared with a resolution of about 0.8 arcmin for humans (Blough 1971).


Peak contrast sensitivity of the lateral fovea occurs at a spatial frequency of about 4 cpd (Nye 1968). The acuity of the frontal fovea is slightly higher (Hahmann and Güntürkün 1993).

When the head of a pigeon is restrained, each eye is able to execute saccades of up to 17° amplitude to stimuli presented in the lateral field, with each eye moving independently of the other. The motion of one eye brings the image of a laterally placed object onto the lateral fovea of that eye. Both eyes execute coordinated vergence movements to stimuli presented at near distances in the frontal visual field. These movements bring the images onto the area dorsalis in each eye. Vergence is most effectively evoked by stimuli 25° below the beak, the position of objects on the ground at which the pigeon pecks. Before the head makes a pecking motion, it moves forward in two successive movements, with the eyes converging on the ground (Goodale 1983; Bloch et al. 1984). The final ballistic movement occurs from a distance of 55 mm, with the target centered on the area dorsalis about 10° below the eye-beak axis. The eyes close during the final phase of the peck.

Like all birds, the pigeon has two main visual pathways. One pathway leads to the contralateral pretectum and optic tectum (the analogue of the superior colliculus in mammals), then to the ipsilateral nucleus rotundus and nucleus triangularis in the thalamus, and on to the ectostriatal complex and surrounding extrinsic recipient areas in the telencephalon. This is probably the phylogenetically older pathway, since it is fully decussated and is fully developed by the time of hatching.

The pigeon’s optic tectum contains motion-sensitive cells (Frost et al. 1981). The nucleus rotundus is specialized for different functions. Cells in one region respond to changes in illumination over the whole visual field. Those in a second region are color coded, and those in a third region respond to differential motion.

Cells in a fourth region respond to motion-in-depth signaled by image looming. These cells are particularly sensitive to objects moving toward the head (Wang et al. 1993).

The second avian visual pathway starts with fully decussated inputs to the optic thalamus (homologue of the mammalian LGN). From there, a hemidecussated pathway passes to a neural center on the dorsum of the diencephalon known as the wulst (German for “bulge”). This system is not fully developed in the pigeon until about 17 days after hatching, which suggests that visual inputs are required for its maturation (Karten et al. 1973). This system contains cells with small receptive fields that could code the shapes and positions of stimuli (Emmerton 1983).

Frost et al. (1983) found very few binocular cells in either the pigeon’s optic tectum or the wulst (Portrait Figure 33.16). However, binocular cells in the wulst may have been missed because of misalignment of the eyes.

There is conflicting evidence about stereopsis in the pigeon. Martinoya et al. (1988) found that pigeons could discriminate between points of light presented one at a time at different distances. However, they failed to discriminate between a pair of stimuli in distinct depth planes and a subsequently presented pair in one depth plane. These findings suggest that pigeons use a range-finding mechanism based on vergence rather than binocular disparity.

Figure 33.15. Diagram of the head of a bifoveate bird. In each eye the lateral visual axis projects to the central fovea, and the frontal visual axis projects to a fovea in the temporal retina. (From Pettigrew 1986. Reprinted with permission of Cambridge University Press)

Figure 33.16. Barry J. Frost. Born in Nelson, New Zealand. He obtained a B.A. at the University of Canterbury, New Zealand, and a Ph.D. from Dalhousie University, Canada, in 1967. He conducted postdoctoral work at Berkeley. Since 1969 he has been at Queens University, Canada, where he is now professor of psychology and biology. He is a fellow of the Royal Society of Canada and of the Canadian Institute for Advanced Research. He received the James McKeen Cattell award and the Alexander von Humboldt Research Prize.





Pigeons can be trained to peck at one of two simultaneously presented panels at different depths. This procedure has revealed that pigeons have a depth acuity of between 0.8 and 1.8 arcmin (McFadden 1987). The animals could not perform the task when one eye was occluded, which suggests that they use binocular cues, such as disparity or vergence.

Martin (2009) reviewed the binocular fields of a large number of bird species. The binocular field of most birds extends laterally between 15° and 30°. But two facts argue against the use of the binocular field for stereoscopic vision. First, most birds can move their eyes independently, so images of the same object would not necessarily fall on corresponding cells in the retinas or in the wulst. Second, visual acuity is low in the binocular field compared with that in the lateral fields. Martin concluded that the binocular field is used by many birds for the accurate placement of the feet in landing and for the accurate placement of the bill in pecking, nest building, and feeding the young. Also, some diving birds, such as cormorants, use their binocular visual field to inspect an object that they are about to capture after diving in the water or that they hold in the bill (Martin et al. 2008).

The large receptive fields of cells serving the binocular field are well suited to detect the optic flow produced as a bird comes in to land. The binocular field will facilitate the detection of the symmetry of the optic flow across the midline. The symmetry of flow indicates the direction of approach, and flow magnitude indicates time to contact (Section 31.1).

In several species of duck and shore birds, the binocular visual field extends laterally only about 5°. Such birds use touch receptors around the bill, rather than their eyes, to detect food in mud. Also, they do not feed their young or build nests.

We will now see that binocular cells have been found in the wulst of some birds.

33.6.2 HAWKS, FALCONS, AND EAGLES

Birds of prey, such as hawks, falcons, and eagles, like most birds, have two foveas in each eye. The central foveas are used for sideways looking at distant objects, and the temporal foveas serve for binocular viewing of nearer objects. The American kestrel, or sparrow hawk, is actually a small falcon (Falco sparverius). It has two foveas in each eye that are 36° apart and its visual fields overlap about 58° when the eyes are converged (Frost et al. 1990). The eyes move independently when the central foveas are used and conjugately when the temporal foveas are used. Pettigrew (1978) suggested that each tectum controls the contralateral eye and that conjugate movements are controlled by projections from binocular cells in the wulst back to the tectum. The wulst of the kestrel has a much greater representation of the temporal foveas than of the central foveas.

Birds of prey have an impressively large number of receptors per degree of visual angle. For instance, the central fovea of the American kestrel has about 8,000 receptors per degree, and the red-tailed hawk (Buteo jamaicensis) has about 15,000 per degree, compared with about 7,000 in the rhesus monkey (Fite and Rosenfield-Wessels 1975). Shlaer (1972) found that image quality in the African eagle (Dryotriorchis spectabilis) is up to 2.4 times greater than in the human eye. The American kestrel can detect a grating with a spatial frequency of 160 cpd; the upper limit for humans is about 60 cpd (Fox et al. 1976). However, the cone density and focal length of the kestrel’s eye are not sufficiently greater than those of the human eye to account for its greater acuity. Snyder and Miller (1978) suggested that falconiform birds achieve high acuity because the highly concave foveal pit acts as a negative lens, which, together with the positive lens of the eye, constitutes a telephoto lens system. This effectively increases the focal length of the eye. Another possibility is that the effective density of receptors is increased when they lie on the oblique walls of a concave fovea as opposed to a surface normal to the incident light.

Kestrels have a well-developed wulst containing binocular cells (Pettigrew 1978). It is larger than the wulst of pigeons but not as large as that of owls.

The fovea used for sideways looking at distant objects has higher acuity than that used for frontal looking. Since birds of prey have large eyes relative to the size of the head, they have only a limited capacity to rotate the eyes. Therefore, to see an object straight ahead they must turn the head to bring the image onto the area of highest acuity. This presents a problem for a diving bird of prey, because a sideways posture of the head would lower the speed of the dive. Tucker (2000) showed that falcons solve this problem by diving along a logarithmic spiral with the head straight on the body while keeping one eye looking sideways at the prey below them.

Fox et al. (1977) trained an American kestrel to fly from a perch to a panel containing a dynamic random-dot stereogram in preference to one containing a similar 2-D random-dot pattern. The stereogram was created by the color-anaglyph method, with the falcon wearing red and green filters. The percentage of correct responses of the trained animal was recorded as the disparity in the stereogram was varied. It can be seen from Figure 33.17 that performance peaked at a disparity of about 12 arcmin. This suggests that kestrels have full-fledged stereoscopic vision enabling them to detect the 3-D shapes of objects, and not merely a vergence or disparity range-finding mechanism serving the detection of absolute distance.
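A back-of-envelope sampling argument shows why a telephoto fovea helps. By the Nyquist limit, resolving a grating of f cycles per degree requires at least 2f receptor samples per degree along the direction of the grating, and magnifying the retinal image spreads each degree of visual angle over more receptors. This is a simplified calculation, not one made by the cited authors:

# Minimum linear sampling rate needed to resolve a grating of f cpd.
def min_samples_per_deg(f_cpd):
    return 2 * f_cpd

print(min_samples_per_deg(160))  # kestrel acuity: >= 320 samples/deg
print(min_samples_per_deg(60))   # human acuity:   >= 120 samples/deg

# A telephoto system that magnifies the image m times multiplies the
# receptors available per degree by m, raising the resolvable frequency
# without any increase in anatomical receptor density.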

33.6.3 OWLS

Owls are nocturnal predators with acute auditory localization. They are remarkably silent in flight. The eyes of barn owls (Tyto alba) have high optical quality (Schaeffel and Wagner 1996).


Figure 33.17. Stereoacuity in a sparrow hawk. A hawk’s percentage correct detection of depth in a stereogram, as a function of disparity. Each point is based on between 50 and 350 trials. Bars are standard errors. (Redrawn from Fox et al. 1977)

However, the density of ganglion cells predicts that their grating acuity is only about 8 cpd, which is much lower than that of humans (Wathey and Pettigrew 1989). Psychophysical procedures revealed that the contrast sensitivity function of barn owls peaked at a spatial frequency of 1 cpd and fell to zero between 3 and 4 cpd (Harmening et al. 2009).

Unlike other birds, owls have frontal vision, which provides them with a large binocular field. For example, the tawny owl (Strix aluco) has a total visual field of 201° and a binocular visual field with a maximum width of 48° (Martin 1984). The owl’s retina has a single, temporally placed fovea. The fovea of the great horned owl (Bubo virginianus) has about 13,000 receptors per degree of visual angle (Fite and Rosenfield-Wessels 1975).

The visual pathways of owls decussate totally at the chiasm and project retinotopically onto the optic tectum. Thus, the tectum does not code binocular disparity. However, the projection from the optic nucleus in the thalamus to the wulst is hemidecussated, as shown in Figure 33.18. In owls, the wulst is particularly well developed. Most cells (86%) were found to be binocular, with tuning functions similar to those of binocular cells in the mammalian visual cortex (Pettigrew 1986; Bagnoli et al. 1990; Porciatti et al. 1990). Many cells in the wulst have been found to be tuned to binocular disparity, and their receptive fields in the two eyes have similar tuning for orientation and motion. The cells are also arranged in ocular dominance columns (Pettigrew and Konishi 1976a; Pettigrew 1979). Efferents from the wulst project to the optic tectum in a manner analogous to the projection of cortical efferents to the superior colliculus in mammals (Casini et al. 1992).

Like cats and monkeys, owls reared with one eyelid sutured failed to develop binocular cells, and their eyes became misaligned (Pettigrew and Konishi 1976b).



Figure 33.18. The visual pathways of the owl. The optic nerves decussate fully as they project to the optic nucleus in the thalamus and to the optic tectum. The pathways from the thalamus to the wulst hemidecussate so that corresponding inputs from the binocular field are brought to the same destination in the wulst. (Redrawn from Pettigrew 1986)

Wagner and Frost (1993, 1994) measured the disparity tuning functions of cells in the wulst of the anesthetized barn owl with sine-wave gratings and one-dimensional visual noise. Although lesser peaks in the disparity tuning function varied with the spatial frequency of the gratings, the main peak occurred at a characteristic disparity for a given cell, which was independent of spatial frequency. The characteristic disparity of different cells ranged between 0 and 2.5°. One-dimensional visual noise also produced a peak response at a disparity of 2°.

Wagner and Frost argued that a disparity-detection mechanism based on spatial-phase differences (Section 11.4.3) produces tuning functions that vary with the spatial frequency of the stimulus, whereas the tuning functions of a mechanism based on position offset are independent of spatial frequency. In the owl, the peak response of most cells tuned to disparities of less than about 3 arcmin did vary with spatial frequency. Wagner and Frost concluded that an offset mechanism codes large disparities and a phase-difference mechanism codes small disparities.

Zhu and Qian (1996) questioned these conclusions. They pointed out that the disparity tuning of simple binocular cells is not frequency independent for either type of disparity coding; only complex cells are distinguishable in this way.
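The distinction at issue can be stated compactly. In a schematic energy-model view (a sketch, not a fit to owl data), a binocular cell whose receptive fields are offset in position by dx prefers disparity dx at every spatial frequency, whereas a cell whose fields differ by interocular phase dphi prefers disparity dphi / (2*pi*f), which shrinks as the frequency f grows:

import math

dx = 2.0            # deg, assumed interocular position offset
dphi = math.pi / 2  # rad, assumed interocular phase offset

for f in (0.25, 0.5, 1.0, 2.0):  # grating spatial frequency, cycles/deg
    pref_position = dx                     # frequency independent
    pref_phase = dphi / (2 * math.pi * f)  # frequency dependent
    print(f"f = {f:4.2f} cpd: position model {pref_position:.2f} deg, "
          f"phase model {pref_phase:.2f} deg")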


Nieder and Wagner (2000) recorded from binocular cells in the wulst of the alert barn owl as it observed a random-dot stereogram with variable disparity. About 75% of cells were tuned to disparity. The tuning functions did not form discrete categories, and cells with similar tuning functions did not form columns.

Nieder and Wagner (2001b) found that most cells in the wulst of the barn owl responded to vertical disparity as well as to horizontal disparity in a random-dot stereogram. Disparity tuning functions of two neurons are shown in Figure 33.19. The tuning function of the first neuron is modulated by changing vertical disparity according to a Gabor profile but shows a Gaussian profile for horizontal disparities. This pattern is consistent with a binocular cell having a horizontal preferred orientation. The second neuron has a modulated disparity profile for both types of disparity, which is consistent with a cell having an oblique preferred orientation.

The close similarity between the disparity-detecting mechanism of the wulst and that of the mammalian visual cortex seems to be the product of a remarkable process of parallel evolution. Further work is needed to reveal how general this mechanism is among birds.

The eyes of an owl are large in proportion to its head, and they move, at most, about 1° (Steinbach and Money 1973). It is not known whether these eye movements include vergence. On the other hand, the owl can rotate its head at high velocity through at least 180°. Since the eyes are locked in a more or less fixed position of vergence, binocular disparity could code absolute depth as well as relative depth. In the young bird, the eyes are at first diverged.

Figure 33.19. Disparity tuning functions of owl wulst cells. (Adapted from Nieder and Wagner 2001b)

They take up their adult positions in the second month after hatching. When binocular vision is disrupted during the first 2 months, the eyes fail to become properly aligned. Microelectrode recordings from binocular cells in the optic tectum of strabismic owls revealed that the monocular receptive fields were also misaligned (Knudsen 1989). Thus, the neural pattern of binocular correspondence is innate and presumably guides the alignment of the eyes in young birds.

Behavioral tests revealed that barn owls can discriminate depth in random-dot stereograms. Owls could discriminate between a homogeneous random-dot stereogram and one containing an inner region in depth. They could also discriminate between two stereograms containing different relative depths. They responded selectively to relative depth when absolute depth was varied. The upper disparity limit for depth discrimination was 41 arcmin (Van der Willigen et al. 1998).

Van der Willigen et al. (2010) compared the disparity sensitivity functions of owls and humans. The stimulus was a disparity-modulated horizontal or vertical random-dot grating. Sensitivity to disparity magnitude was plotted as a function of the spatial frequency of disparity modulation. Results for one human and one owl are shown in Figure 33.20. The functions for the human were similar to those reported previously, including the higher sensitivity to the horizontal grating (see Section 18.6.3b). The owl had higher overall sensitivity than the human and showed higher sensitivity to the vertical grating than to the horizontal grating.

Owls trained to select a given sign of relative depth in a random-dot stereogram transferred the learning to relative depth based solely on motion parallax (Van der Willigen et al. 2002). Thus, owls process the two depth cues in an equivalent manner.

Owls did not respond to depth in a random-dot stereogram when contrast polarity was reversed between the two eyes (Nieder and Wagner 2001a). Short-latency binocular cells in the wulst showed side peaks in their response to a particular disparity and responded to disparity in contrast-reversed stereograms. Long-latency binocular cells showed a single peak in their response and responded only to stimuli with matching contrast. Nieder and Wagner concluded that local disparities are processed by an initial stage and global disparities by a second stage. Lippert and Wagner (2001) produced a neural network model of this process. These issues were discussed in more detail in Section 11.5.

The barn owl apparently uses accommodative effort for judging near distances. It has an accommodative range of over 10 diopters, which is larger than that of any other species of owl (Murphy and Howland 1983). Accommodation, but not pupil constriction, is always identical in the two eyes (Schaeffel and Wagner 1992). With an occluder over one eye and a plus or minus lens in front of the other eye, barn owls made corresponding errors in pecking at a nearby object (Wagner and Schaeffel 1991).
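The logic of the lens manipulation can be sketched with thin-lens arithmetic. An added lens of power P diopters changes the accommodation the eye must supply for a target at distance d from 1/d to 1/d - P; if distance is read from accommodative effort, the signaled distance is the reciprocal of that effort. The numbers below are illustrative, not those of Wagner and Schaeffel:

def signaled_distance(d_m, lens_diopters):
    # Accommodation supplied by the eye: 1/d - P (diopters).
    # An owl reading distance from effort would report its reciprocal.
    return 1.0 / (1.0 / d_m - lens_diopters)

d = 0.25  # m, true target distance (4 D of accommodation)
print(signaled_distance(d, -1.0))  # 0.20 m: minus lens -> undershoot
print(signaled_distance(d, +1.0))  # 0.33 m: plus lens -> overshoot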


Figure 33.20. Disparity sensitivity functions of the owl. Stereo sensitivity for detection of depth in a disparity-modulated grating, plotted as a function of the frequency of disparity modulation. Results for horizontal and vertical gratings for a human subject and for an owl. (Redrawn from Van der Willigen et al. 2010)

Figure 33.21. The meerkat (Suricata suricatta). (From the Larousse Encyclopedia of Animal Life, Hamlyn 1967)

Birds of the order Caprimulgiformes are closely related to owls. Some of these birds, such as the owlet-nightjars and frogmouths, have a brain morphology almost identical to that of owls (Iwaniuk and Wylie 2006). It is possible that they too have stereoscopic vision.

33.7 MAMMALS

Volume 2 was devoted to stereoscopic vision in cats and primates. Procedures for assessing stereoscopic vision in cats were described by Fox and Blake (1971) and Mitchell et al. (1979). Procedures used with monkeys were described by Sarmiento (1975), Julesz et al. (1976), Harwerth and Boltz (1979), and Crawford et al. (1996). This section is devoted to visual depth detection in mammals other than cats and primates.

33.7.1 MEERKATS AND RODENTS

The slender-tailed meerkat (Suricata suricatta) is a frontal-eyed social carnivore related to the mongoose (Figure 33.21).



It lives in burrows in the dry South African veldt. Nothing is known about its visual pathways or visual cortex. However, it seems that meerkats possess stereoscopic vision. An animal trained to jump to the closer of two visual displays with both eyes open could discriminate relative depths corresponding to disparities of about 10 arcmin (Moran et al. 1983). Thus, among frontal-eyed mammals, binocular stereoscopic vision has been demonstrated in primates, cats, and the meerkat. As we will now see, there is also evidence of binocular stereopsis in some lateral-eyed mammals.

Rats avoid the deep side of a visual cliff as soon as the visual system becomes functional. Depth avoidance in monocular infant rats was similar to that in binocular infant rats, showing that they do not, or need not, use binocular disparity (Lore and Sawatski 1969).

After preliminary training, adult rats can jump from one platform to another across a gap and adjust the force of their jump to the width of the gap (Russell 1932). Rats discriminated a difference of 2 cm over a gap between 20 and 40 cm wide. They tended to swing the head before jumping, which suggests that they used motion parallax. Removal of one eye reduced accuracy. Rats raised in darkness for the first 100 days and then trained to jump one distance were able to adjust to another distance on the first jump (Lashley and Russell 1934). On the other hand, Greenhut and Young (1953) found that normal rats required practice to achieve reasonable accuracy in jumping across a gap.
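Motion parallax from such head movements supports a simple distance estimate: an eye translating at speed v sees a stationary object abeam of its path move at an angular velocity of approximately v/d, so distance can be recovered as d = v/w. The speeds below are assumptions for illustration, not measurements from these studies:

import math

v = 0.10      # m/s, assumed lateral head-swing speed
w_deg = 12.0  # deg/s, assumed angular velocity of the target's image

d = v / math.radians(w_deg)
print(f"estimated target distance: {d:.2f} m")  # ~0.48 m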


Mongolian gerbils (Meriones unguiculatus) adjust their jump to the width of a gap. Before jumping they bob the head up and down. These movements generate motion parallax between objects at different distances and also motion of the image of the target platform or motion of the eyes as they fixate the platform. They are more likely to use head bobbing when other depth cues are not available. The frequency of head bobs increased when the width of the gap was increased, and jump accuracy increased when animals used more head bobs (Ellard et al. 1984). Jump accuracy was greater when gerbils were able to move forward before jumping compared with when they were placed on a narrow platform that did not allow forward motion. This suggests that gerbils use looming cues generated by forward motion. Gerbils jumped too far when the target platform was smaller than the one they were used to and jumped too near when the platform was larger. This suggests that they use image size as a cue to distance. However, they could compensate for changes in the size of the jumping platform by increasing the frequency of head bobbing. Varying the distance between the target platform and the background behind the platform did not affect jumping accuracy (Goodale et al. 1990). This suggests that relative motion produced by head bobbing is not used, but rather the ratio of head motion to the motion of the image of the platform. It is not clear whether gerbils use binocular disparity.

The rabbit has laterally placed eyes. When crouched in the "freeze" position, a rabbit has 360° panoramic vision with a frontal binocular area of about 24°. At least 90% of optic-nerve fibers decussate in the chiasm (Giolli and Guthrie 1969). The retina does not have a fovea, but there is a relatively high concentration of receptors along an extended horizontal region known as the visual streak, which receives the image of the visual horizon. There is also a high concentration of receptors in the posterior part of the retina, corresponding to the binocular region of the visual field. This region is about 30° wide and extends from about 35° below eye level to about 160° above eye level when the eyes are in their most converged position. The rabbit uses this part of the retina when performing visual discrimination tasks (Hughes 1971).

When the animal is in the freeze position, the eyes diverge and corresponding parts of the two retinas have a horizontal disparity of about 18°. Rabbits can match corresponding images in the two eyes by converging the eyes, but, when they do so, the visual fields no longer meet behind the head (Hughes and Vaney 1982). Convergence of up to about 18° occurs when a rabbit approaches a visual display (Collewijn 1970; Zuidam and Collewijn 1979). Thus, rabbits have two modes of viewing: one that preserves panoramic vision but loses binocular correspondence, and one that achieves binocular correspondence within the binocular field at the expense of some loss in panoramic coverage.

The region of the rabbit's visual cortex corresponding to the binocular visual field contains binocular neurons tuned to binocular disparity. At first it was reported that the receptive fields in the two eyes were not similarly tuned for stimulus size and orientation (Van Sluyters and Stewart 1974a; Choudhury 1980). However, when allowance was made for the fact that the eyes must converge 18° to bring corresponding images into register, binocular cells were found to have matching tuning functions in the two eyes (Hughes and Vaney 1982). Binocular cells of the rabbit show little evidence of the developmental plasticity found in the binocular cells of cats and primates (Van Sluyters and Stewart 1974b).

Rabbits can perform discrimination tasks involving the integration of binocular information (Van Hof and Steele Russell 1977). They can also execute voluntary eye movements (Hughes 1971). There do not seem to have been any direct tests of stereopsis in rabbits.

33.7.2 UNGULATES

Ungulates are the hoofed mammals. The even-hoofed order, Artiodactyla, includes bovines, goats, and sheep. The odd-hoofed order, Perissodactyla, includes horses, tapirs, and rhinoceroses.

Ungulates possess cells in the visual cortex that are sensitive to binocular disparity. Cells in V1 of sheep are tuned to small or zero disparities, and those in V2 respond to disparities up to 6° (Clarke et al. 1976). Binocular cells extend up to 20° on both sides of the midline (Clarke and Whitteridge 1976). Disparity-selective cells in the neonate lamb have broader tuning than those in the adult (Clarke et al. 1979). An autoradiographic study of the visual cortex failed to reveal ocular dominance columns in sheep and goats (Pettigrew et al. 1984). Sheep have widely spaced eyes, which, since the disparity produced by a given depth interval is roughly proportional to the interocular separation, produce about four times the disparity that cats' eyes produce at a given distance. Sheep have good depth discrimination, but the extent to which this depends on binocular stereopsis is not known.

Neonatal lambs and goats are more advanced than neonatal cats and monkeys. A few hours after birth they follow the mother and avoid obstacles and cliffs. However, brief periods of monocular deprivation in lambs cause marked shifts in ocular dominance of cortical cells (Martin et al. 1979).

There is no behavioral evidence of stereopsis in sheep or goats, but there is for the horse. Timney and Keil (1994) trained two horses to select a display containing a square in real depth. After learning to perform that task the horses were shown random-dot stereograms viewed with red-green anaglyphs. On most trials they selected the stereogram that depicted a square in depth. The mean disparity threshold was 14 arcmin. Timney and Keil also showed that horses are sensitive to depth specified by perspective (Timney and Keil 1996).





33.8 EVOLUTION OF VISUAL DEPTH PERCEPTION

33.8.1 BASIC CONDITIONS FOR EVOLUTION OF SENSORY MECHANISMS

The basic requirements for the evolution of any new sensory mechanism are as follows:

1. A new sensory mechanism must arise from a mutation in the existing sensory system that provides the animal with novel sensory signals.

2. The new signals must correlate with distal stimuli or with events in the environment that have significance for the animal.

3. The animal must respond to the new signals or use them to improve the discriminability of signals that it is already using. Responses to the new signals must be reinforcing and provide the animal with some advantage over animals lacking the new signals.

4. The animal must possess a Hebbian mechanism of neural plasticity that allows it to learn to relate the new signals to other sensory signals and motor responses with which they are correlated.

5. The animal must possess epigenetic mechanisms that link neural responses to sensory signals to the local expression of genes. This will consolidate learning based on the new signals. It will also ensure that the mutation is manifested only in sensory cells that respond to the new signals, and that all cells in a particular cell line are the same.

6. The animal must have a genetic mechanism that is subject to variation so that the new sensory system will improve over generations.

These principles are nicely illustrated by the evolution of color vision. Most nonprimate mammals are dichromatic. Short wavelengths (blue) are detected by a photopigment produced by an autosomal gene (one not on the X chromosome). Long and middle wavelengths are detected by a photopigment produced by a gene on the X chromosome. In Old World primates, trichromatic color vision arose from a simple replication of the gene on the X chromosome. Two initially identical genes eventually changed slightly so that they produced photopigments with different spectral absorption functions. An already existing epigenetic mechanism randomly assigned one gene to be expressed in one set of cones (L cones) and the other gene to be expressed in another set of cones (M cones).

In New World primates, the L pigment gene occurs on one X chromosome and the M pigment gene occurs on the other X chromosome (Mollon et al. 1984). Since males have only one X chromosome, only female New World monkeys have trichromatic color vision. Mancuso et al. (2009) injected a virus containing the human L-opsin gene into the retinas of male New World squirrel monkeys. About 20 weeks later the monkeys could discriminate between red and green regions in a random-dot display: they had acquired trichromatic color vision!

Mice do not have color vision. Jacobs et al. (2007) replaced the photopigment gene in one X chromosome of female mice with a human L pigment gene. Some of the resulting female mice, like female New World monkeys, had distinct photopigment genes on their X chromosomes. Now came the interesting part of the experiment. After extensive training, the heterozygote female mice developed the capacity to discriminate between 500 nm and 600 nm lights.

Now consider how these principles apply to the evolution of visual depth-detection mechanisms.

33.8.2 EVOLUTION OF MONOCULAR MECHANISMS

A single detector can register only modulations of light intensity, which could be used, for example, to detect the day-night cycle. A detector in a cavity lined with opaque pigment can detect changes in light intensity due to movement of the detector. This could form the basis of phototaxis. An array of detectors with a lens can detect spatial-temporal distributions of light intensity. A simple asymmetrical coupling between neighboring pairs of detectors, as in the Reichardt detector, would allow animals to detect motion and the direction of motion (a sketch of such a correlator is given at the end of this section).

But an image-forming eye would not be much use if only the 2-D distribution of images were registered. Registration of the 3-D structure of a scene is required to resolve the following ambiguities:

1. The luminance of a given surface is affected by its whiteness and by shadows and shading. The two effects cannot be distinguished unless the 3-D structure of an object is registered (Section 22.4).

2. The size of the image of an object varies with the distance of the object (Section 29.3).

3. The shape of the image of an object varies with the spatial disposition of the object (Section 29.4).

4. The motion of the image of a moving object varies with distance (Section 29.5).

Fortunately, the proximal stimulus of an image-forming eye contains adequate information for the detection of depth. But animals must learn to use that information. The first task is to externalize visual stimuli. Two pieces of evidence suggest how this is done. It is explained in Section 35.1.1 that a sound heard through headphones seems to be inside the head because it does not change when the head moves. A sound is externalized as soon as it changes in accordance with movements of the head. In a similar way, an afterimage viewed with closed eyes is not externalized, because it moves with the eyes. Blind people externalize signals delivered from a video camera to the skin only when they control movements of the camera (Section 34.5).

Once visual stimuli are externalized their relative disposition in space is indicated by motion parallax and stimulus accretion-deletion. Because these motion signals are closely related to an animal's motion, animals would soon learn to use the visual stimuli to guide their movements. This learning would be rewarded because it would allow animals to avoid obstacles and find food or catch prey. Animals would also learn to use static stimulus overlap and perspective to detect 3-D structure without having to move the head.

Special genetic mechanisms would not be required for the initial evolution of monocular depth perception. All that is required is the Hebbian mechanism for learning to associate related sensory signals and related sensory and motor signals. Already existing epigenetic mechanisms that link neural activity to the expression of genes would ensure that proteins required for consolidation of learned relationships are produced (see Section 6.6.1).

The evolution of stereoscopic vision depends on relating stimuli in the two eyes. This cannot be done unless inputs from the two eyes converge in the central nervous system, and this can happen only after frontal vision has evolved.
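The Reichardt correlator mentioned above can be stated in a few lines. This is a minimal sketch of the principle, with an arbitrary unit delay and two idealized detector signals; it is not a model of any particular animal's motion system.

```python
import numpy as np

def reichardt_response(signal_a, signal_b, delay=1):
    """Minimal Reichardt correlator for two neighboring detectors.
    Each half-detector multiplies one input by a delayed copy of the
    other; the difference of the two halves gives a signed direction
    response. Sketch of the principle only."""
    a_delayed = np.roll(signal_a, delay)
    b_delayed = np.roll(signal_b, delay)
    # Positive output: stimulation reached A before B (motion from A to B).
    return np.mean(a_delayed * signal_b - b_delayed * signal_a)

# A bright spot passing detector A one sample before detector B
# yields a positive (A-to-B) response.
t = np.arange(20)
a = (t == 5).astype(float)
b = (t == 6).astype(float)
print(reichardt_response(a, b))  # > 0
```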

33.8.3 EVOLUTION OF FRONTAL VISION

Fully frontal eyes have one pair of visual axes that intersect. In lateral eyes, the principal visual axes are directed in opposite directions and never intersect. Some animals, such as the chameleon, are able to direct their visual axes either laterally, where they do not intersect, or frontally, where they do intersect. They thus have some frontal vision even though they do not have true frontal eyes.

Some animals, such as the pigeon, are bifoveate and have two visual axes in each eye: a principal axis that lies close to the optic axis and projects to a central fovea, and a second visual axis that projects to a fovea lying in the temporal retina (see Figure 33.15). The angle between the two visual axes can be as large as 50°. Such animals are classified as lateral eyed, because their principal visual axes point laterally and never intersect. However, they have frontal vision with respect to their secondary visual axes, because these axes do intersect.

Many mammals do not have foveas. For instance, the rabbit has a horizontal region of high acuity in each retina, known as the visual streak. In such animals, frontality is indicated by the degree of overlap of the two visual fields.

When threatened, many lateral-eyed animals, such as the pigeon, rabbit, and horse, turn their eyes outward to gain panoramic vision at the expense of losing binocular overlap. When they wish to get a better view of something ahead of them they converge their eyes to increase binocular overlap at the expense of some loss of panoramic vision.

Hemidecussation of the optic nerves occurs in mammals, but not in all mammals. When it does occur, the ratio of uncrossed fibers to crossed fibers is proportional to the size of the binocular visual field, which in turn depends on the extent to which the eyes are frontal. This relationship is known as the Newton-Müller-Gudden law. The proportion of uncrossed fibers is almost zero in the rabbit, about one-eighth in the horse, one-quarter in the dog, one-third in the cat, and one-half in primates, including humans (Walls 1963, p. 321). The ferret is unusual. It has large irregular regions of monocular projection from the center of the ipsilateral eye in V1 and from the contralateral eye in V2 with no matching inputs from the opposite eye (White et al. 1999).

The visual fields of all mammals, except the Cetaceae (whales and dolphins), overlap to some degree. Over a representative sample of mammalian species, the degree to which the eyes face in the same direction (orbit convergence) is correlated with the degree of overlap of the visual fields (Heesy 2004). Also, across 76 primate species, orbital convergence is correlated with the size of the visual brain, and especially with the size of the part of the brain associated with binocular vision (Barton 2004).

In most submammalian vertebrates, the visual pathways fully decussate. However, some ipsilateral projections have been found in the visual pathways of all nonmammalian vertebrates that have been studied, except two minor groups of reptiles including the crocodiles (Ward et al. 1995). The ipsilateral projections innervate the hypothalamus, tectum, pretectum, and thalamus. They are far fewer and show more interspecific variation than do contralateral projections. In general, the number of ipsilateral projections in nonmammalian species is not related to the degree of overlap of the visual fields. For example, the visual fields of the lizard Podacaris overlap by about 18°, and those of the snapping turtle Chelydra overlap about 38°, even though ipsilateral visual projections are more extensive in Podacaris than in Chelydra (Hergueta et al. 1992). Nor is the extent of ipsilateral projections in nonmammalian vertebrates obviously related to whether the animal is a predator or nocturnal, and it seems to be unrelated to taxonomic order (see Ward et al. 1995).

In many submammalian species, such as the salamander, ipsilateral axons develop early in life and remain fairly stable. In frogs and toads, contralateral axons develop first and ipsilateral fibers emerge at metamorphosis. In birds, ipsilateral fibers exhibit an initial exuberant growth followed by pruning, as in some mammals. We have seen that many submammalian species with sparse ipsilateral projections to the thalamus have binocular cells fed by intertectal connections or by hemidecussation at a level beyond the thalamus.

33.8.4 EVOLUTION OF STEREOSCOPIC VISION

33.8.4a Advantages of Stereoscopic Vision

Stereopsis provides the following advantages:

1. Collins (1922) and Le Gros Clark (1970) suggested that stereopsis was an adaptation to arboreal life in which accurate judgments of distance are required to enable animals to leap from branch to branch. However, many arboreal mammals, such as opossums, tree shrews, and squirrels, do not have frontal eyes. Squirrels, for instance, have only about 20% binocular overlap. In small mammals, binocular overlap correlates with length of skull rather than with arboreality (Cartmill 1974).

2. Stereopsis enables predators to judge the distance and motion of prey (Julesz 1971; Cartmill 1974). Visually controlled predation is present in many living prosimians and small marsupials that show primate-like specialization of the visual system.

3. Stereopsis is very effective at revealing camouflaged objects. Camouflage is also broken by motion parallax, but that requires the predator to move. Also, detection of depth by motion parallax involves integration of signals over time, so it takes longer than detection of depth by stereopsis.

4. Hughes (1977) proposed that the degree of binocular overlap is correlated with the evolution of visual control of fine movements of the forelimbs. Also, convergence of the visual axes on an object helps an animal to attend to that object and disregard objects at other distances (Section 22.8).

5. Frontal vision in primates may have evolved in prosimians as an adaptation to nocturnal vision. In dim light, sensitivity is improved by binocular vision (Section 13.1.2). Also, reactions are quicker with binocular viewing than with monocular viewing (Section 13.1.7).

Other advantages of stereoscopic vision were described in Section 20.1.1. All these advantages provide selective pressure for the improvement of stereopsis. But for natural selection to operate, the animal must have a mechanism, however crude, for detecting disparity. How can we explain the emergence of the initial disparity-detection mechanism? Some binocular overlap of the visual fields is the first prerequisite. This requires the development of frontal vision.



33.8.4b Evolution of Disparity Detection

Frontal vision may have evolved to improve the visibility of objects ahead of the animal. The prosimian ancestors of primates were nocturnal, and frontal binocular vision would have improved visual sensitivity. But frontal vision would serve this purpose only if visual inputs from the overlap region were to hemidecussate so that inputs from corresponding regions converged in the visual cortex.

The molecular mechanisms that control the routing of axons through the chiasm were described in Section 6.3.4. Since most animals have some ipsilaterally projecting axons, it would have taken only small changes in the mechanisms in the chiasm to increase the number of ipsilaterally projecting axons. Binocular inputs from corresponding regions in the area of binocular overlap would be synchronous, and this would have provided a signal for axons from corresponding regions to follow the same route through the chiasm. This would explain why, as the area of binocular overlap increased, the proportion of ipsilateral axons increased.

The contralateral and ipsilateral inputs to the LGN would first intermingle and then segregate into distinct layers, just as they do in the embryos of cats and monkeys. Mechanisms for segregation of inputs from parvo- and magnocellular ganglion cells and from ON-cells and OFF-cells were already present in the LGN of lateral-eyed mammals. One mechanism consists of complementary gradients of EphA and ephrin-A molecules. The other involves waves of synchronous activity arising from the retina (see Section 6.3.5a). A simple extension of the same mechanisms would have produced segregation of contralateral and ipsilateral inputs to the LGN. The synchrony between inputs from the same eye plus the lack of synchrony between the eyes would keep the inputs from the two eyes separate in the LGN.

It was explained in Section 6.7.1 that, in the monkey, projections from the two eyes overlap extensively in layer 4C during the first 3 postnatal weeks. The transition to a state of monocular dominance involves a phase of exuberant proliferation of synaptic terminals followed by selective withdrawal of inappropriately connected dendrites, accompanied by maturation of appropriately connected dendrites. The process is partly controlled by synchrony of inputs from the two eyes and partly by eye-specific chemical markers. By the 6th postnatal day, the monkey visual cortex contains binocular cells, which are organized into ocular dominance columns.

We saw in Section 6.7.3b that when one tectum in the goldfish is removed, the axons that normally innervate it grow into the remaining tectum in spatial register with the contralateral inputs from the other eye. Initially, the invading ipsilateral projection spread homogeneously over the tectal surface but became segregated into eye-specific columns. The ocular dominance columns did not form when neural activity was blocked in both eyes. Thus, ocular dominance columns consisting of binocular cells receiving inputs from corresponding regions in the two eyes developed in an animal that had not evolved binocular vision. Furthermore, we saw in Section 6.7.3b that the monocular receptive fields of these cells matched in direction sensitivity, as do the monocular receptive fields of binocular cells in animals with stereoscopic vision. Similarly, we saw in Section 6.7.3a that when one tectum of a frog embryo is ablated, the severed optic nerve grows into the ipsilateral tectum along with inputs from the intact contralateral eye. In each location in the tectum, the ipsilateral axons displace established synapses from the contralateral eye to form ocular dominance columns. A similar rerouting of visual inputs occurred in a human with congenital absence of one cerebral hemisphere.

The same processes may have occurred when frontal vision first evolved. By the time in postnatal development when the new ipsilateral inputs reached the visual cortex, the eyes would be open and transmitting visual signals. Such signals from each pair of corresponding points would be highly correlated, while inputs from different regions in the same eye would not be correlated. The Hebbian mechanism would ensure that axons from one eye would link up with corresponding axons from the other eye (a sketch of this correlation-based matching is given at the end of this section). The inputs from the two eyes would then modify their orientation tuning to produce matching receptive fields. The resulting binocular cells would then automatically segregate to form ocular dominance columns by the process of competitive innervation described in Section 6.7.2.

Now suppose that there was some random scatter in the positions of the two monocular receptive fields feeding into the binocular cells. Different binocular cells would therefore respond selectively to different binocular disparities. A binocular-disparity mechanism would have developed through the mediation of processes that were already present in the visual system.

Those early primates would be receiving novel visual signals. Since these signals were correlated with relative depth, it would not take long for the animals to learn to use them through the mediation of a Hebbian mechanism of neural plasticity. This would set the stage for selective pressure to favor animals with a more refined stereoscopic system.

Any pattern of binocular disparity is the difference between the perspective produced by a scene in one eye and that produced in the other eye. Binocular disparity is binocular perspective. Before stereopsis evolved, mechanisms already existed for detection of linear perspective, texture gradients, and higher derivatives of perspective. These mechanisms could be easily adapted to detect simple disparity, disparity gradients, and higher-order derivatives of disparity. Mechanisms also existed for detection of gradients of optic flow. These same mechanisms could be adapted to detect spatiotemporal patterns of disparity.

The evidence reviewed in Section 6.6 shows that the development of feature-detection mechanisms in the visual system is at least partly controlled by visual experience. The epigenetic mechanisms described in Section 6.6.1 link local gene expression to neural activity. The mechanisms of neural plasticity described in Section 6.5.1 guide the development of detectors for complex stimulus features. Once simple disparity detectors had emerged, these epigenetic and learning mechanisms would lead to the development of detectors for disparity gradients in higher levels of the visual system. These same mechanisms would also lead to the development of neurons that detect correlations between disparity and other stimulus features such as motion, and other depth cues.

We have seen that stereoscopic vision evolved independently in several animals including insects, frogs, birds, and mammals. In each case, the animals must have been presented with a novel stimulus that they soon learned to use. The new mechanism would build on epigenetic and neural processes that were already there and then become refined through natural selection.
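The claimed role of Hebbian learning in linking corresponding inputs can be illustrated with a toy simulation. In the sketch below, a binocular unit receives noisy copies of the same scene through two weight vectors; a normalized Hebbian rule drives the two vectors toward matching patterns because only corresponding inputs are correlated. All parameters are arbitrary illustrative choices, not a model of any specific cortical circuit.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rate = 20, 0.01
w_left = rng.random(n) * 0.1     # weights from left-eye afferents
w_right = rng.random(n) * 0.1    # weights from right-eye afferents

for _ in range(2000):
    scene = rng.random(n)                          # structure common to both eyes
    left = scene + 0.05 * rng.standard_normal(n)   # corresponding inputs are
    right = scene + 0.05 * rng.standard_normal(n)  # correlated across the eyes
    post = w_left @ left + w_right @ right         # binocular cell activity
    w_left += rate * post * left                   # Hebbian strengthening of
    w_right += rate * post * right                 # coactive synapses
    w_left /= np.linalg.norm(w_left)               # normalization stands in for
    w_right /= np.linalg.norm(w_right)             # synaptic competition

# The two monocular weight patterns end up closely matched.
print(np.corrcoef(w_left, w_right)[0, 1])
```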





34 REACHING AND MOVING IN 3-D SPACE

34.1 Kinesthesia 260
34.1.1 Kinesthetic receptors 260
34.1.2 Kinesthetic sensitivity 260
34.2 Haptic judgments of distance 261
34.3 Reaching for and grasping objects 261
34.3.1 The nature of the task 261
34.3.2 Coding direction and distance of reaching 262
34.3.3 Reaching and visual cues to distance 263
34.3.4 Reaching with seen or unseen hand 265
34.3.5 Visual processing for perception and action 267
34.4 Judgment of self-motion 270
34.4.1 Path integration in humans 270
34.4.2 Integration of motor activity 272
34.4.3 Visually induced motion-in-depth 274
34.5 Visuo-tactile substitution 275

34.1 KINESTHESIA

Cells tuned to the direction of gaze relative to the head have been found in the monkey somatosensory cortex. The signals arise in proprioceptors in the extraocular muscles (Wang et al. 2007).

34.1.1 KINESTHETIC RECEPTORS

The position of a limb is indicated by inputs from kinesthetic receptors in muscle spindles, tendons, and ligaments. Some information may also be gained from stretching of the skin over a joint. Movements of a limb are indicated by kinesthetic receptors and by the motor commands (efference) responsible for the movements.

Muscle spindles are interspersed among muscle fibers in voluntary muscles. They contain sensory elements that register either the tension in the muscle or the extent of contraction. Sherrington (1894) showed that muscle spindles contain sensory end organs, and Matthews (1931) was the first to record discharges from single muscle spindles. Muscle spindles also contain contractile elements that help to maintain the sensory elements in their operational range of tension. The structure of muscle spindles was reviewed by Barker (1962). Inputs from muscle spindles feed into the spinal cord and cerebellum but do not reach the cerebral cortex by a direct route (Mountcastle 1957).

Skeletal members at a joint are linked by ligaments, which contain Ruffini and Golgi sense organs. Discharges from these sense organs are related to the movement and position of the joint and are not affected by muscle tension (Andrews 1954). Inputs ascend to the thalamus and up to the somatosensory cortex. The sensory nerves show a high initial discharge followed by a slowly adapting discharge when the limb is held in a fixed position. Some neurons in the somatosensory cortex are maximally active at a particular joint angle, and some are tuned to the direction of movement at the joint (Mountcastle et al. 1950).

34.1.2 KINESTHETIC SENSITIVITY

The threshold of kinesthetic receptors may be assessed by the smallest amplitude of passive motion of a limb that can be detected. In one study, subjects reported the direction of a passive movement at the elbow on 80% of trials when the joint was flexed 1.8° at a speed of 0.2°/s (Cleghorn and Darcus 1952).

Two procedures have been used to determine the active position sensitivity of a limb. In the first procedure, human subjects place an unseen finger on a reference point, lower the arm, and then regain the previous position. Cohen (1958a) obtained a mean error of 3.3 cm (about 2°) at the fingertip of a fully extended arm of a standing person. Merton (1961) used a similar procedure but allowed subjects to leave the arm near the target. Subjects replaced a pin in a board with a mean signed error of about 0.1° after intervals of 1 s or 5 s. The variable error doubled after 5 s. Performance with this procedure reflects the accuracy and precision of the registration of both the initial and regained positions of the limb. It also requires the subject to memorize a limb position.

A second procedure avoids the problem of memory. The subject, with eyes closed, points a finger of one hand to a finger of the other hand, with the fingers separated by a plate of glass. Errors have been found to increase with increasing distance from the body and with increasing distance from the midline (Slinger and Horsley 1906).


The perceived position of a limb is influenced by a previous movement or posture. For example, with eyes closed, hold one extended arm well above the horizontal and the other arm well below the horizontal for 30 s. Then place the two arms at the same level. Upon opening the eyes, you will see that the previously elevated arm is well above the other arm (Nachmias 1953). Exposures of only 5 s to a given limb posture can influence the perception of subsequent limb postures (Jackson 1954). These effects could be due to persistence of muscle tension, alterations in muscle tonus, or sensory adaptation.

If joint receptors are mainly responsible for the sense of joint position then anesthetizing a joint should render a person insensitive to position at that joint. Anesthetizing the joints of the index finger severely reduced sensitivity to passive motion (Goldscheider 1889). Tensing the muscles controlling an anesthetized index finger did not restore the position sense of the finger (Provins 1958).

Receptors in muscle spindles and tendon organs are sensitive to muscle tension. Insofar as position sense depends on these receptors or on motor efference it should be affected by the load carried by the limb. Cohen (1958b) asked subjects to restore their extended arm to a previous vertical position with no load and with a 1 kg weight added to the hand. The weight increased the mean error only slightly. Cohen concluded that joint sensitivity depends mainly on receptors in ligaments, which are immune to the effects of load.

Motor efference provides the only information about the position of an unseen limb when sensory pathways from the limb are anesthetized or severed. Lashley (1917) investigated a patient with anesthesia of most of the leg afferents due to a bullet wound in the spinal cord. The patient could move his leg but could not detect passively imposed movements at the knee. Nor could he return his leg to a position into which it had been passively placed. He could not maintain his unseen leg in one position for long, but was not aware that he had not done so. He could actively move his leg in specified directions and through specified amplitudes and he accurately repeated a given active motion. However, he could not compensate for different loads. It seems that motor efference provides information about active limb movements made under normal load conditions.

Efference is of prime importance in well-practiced ballistic movements that are involved in skills such as playing the piano. Such movements are so rapid that kinesthetic feedback has no time to act during the execution of the movement. Kinesthesia may be involved in preparatory adjustments or in long-term parametric adjustments of skilled movements.

34.2 HAPTIC JUDGMENTS OF DISTANCE

Descartes, in his La dioptrique (1637), described the eyes as "feeling out" a distance by a convergence of the visual axes, just as a blind man might feel out a distance with two staves, one in each hand. But what evidence do we have that people can judge distances using two staves? Cabe et al. (2003) produced the only systematic study of this question. Each of two 48-inch-long rods was pivoted about a vertical axis close to one end. Subjects grasped the pivot points, one in each hand, and extended the index finger of each hand along the rod. The hands and rods were not visible. The distance between the pivot points was varied. The rods intersected at various distances up to 1 meter, either in the median plane or at horizontal eccentricities of ±6 or ±12 inches. Twenty-two sighted subjects estimated the distance of the point of intersection of the rods with respect to marks on a frontal surface. Accuracy was not significantly affected by changes in the distance between the pivot points. The graphs in Figure 34.1 show that estimated distance varied systematically with actual distance but that distances were overestimated when the rods intersected in the median plane.

In the above experiment, the only information on which distance judgments could be based was the angle between the rods as detected by the angles of the wrists (the triangulation geometry is sketched below, after Figure 34.1). The lengths of the rods were not relevant, and wrist movements were not involved because the rods did not move while subjects held them. We need to know whether blind people perform this task more accurately and precisely than sighted people.

Figure 34.1. Haptically judged distance. Mean judged distance of the intersection point of two rods held in the two hands as a function of the separation of the hands and the eccentricity of the point of intersection, N = 22. (Adapted from Cabe et al. 2003)
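The geometry available to Cabe et al.'s subjects can be made explicit. The sketch below recovers the intersection point of the two rods from the hand separation and the two pivot (wrist) angles; the coordinate conventions and angle definitions are mine, chosen for illustration, not those of the original study.

```python
import numpy as np

def rod_intersection(separation, angle_left, angle_right):
    """Intersection of two rods pivoted at (-s/2, 0) and (+s/2, 0),
    each rotated about its vertical pivot by an angle measured from
    straight ahead (radians, positive toward the midline). Returns
    the (x, y) intersection point in the same units as separation."""
    xl, xr = -separation / 2, separation / 2
    # Direction vectors of the two rods.
    dl = np.array([np.sin(angle_left), np.cos(angle_left)])
    dr = np.array([-np.sin(angle_right), np.cos(angle_right)])
    # Solve xl + t*dl = xr + u*dr for the parameters t and u.
    A = np.column_stack([dl, -dr])
    t, _ = np.linalg.solve(A, np.array([xr - xl, 0.0]))
    return np.array([xl, 0.0]) + t * dl

# Hands 12 in. apart, each rod turned about 9.46 deg inward:
# the rods meet roughly 36 in. straight ahead.
print(rod_intersection(12.0, np.radians(9.46), np.radians(9.46)))
```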

34.3 REACHING FOR AND GRASPING OBJECTS

34.3.1 THE NATURE OF THE TASK

The development of reaching in infants was discussed in Section 7.4.1a. The present section is concerned with reaching in adults.

The task of reaching for an object has two distinct phases. In the planning phase the person registers the location of the target and decides on a plan of action. In the execution phase the hand moves to the perceived location of the target. The motion in 3-D space of an LED attached to the wrist or finger may be recorded by video cameras. The movement is specified by its direction, amplitude, duration, and dynamics (velocity profile).

Planning a movement of the arm to a visual object requires the registration of the location of the object in 3-D space relative to the torso. The direction of the object is first registered with respect to the fovea. This is a simple task when the object is fixated. The direction of the eyes in the head and of the head on the body must be taken into account to derive the direction of the object relative to the torso (this chain of transformations is sketched below). The distance of the object is derived from whatever distance cues are available.

Planning an arm movement also requires the registration of the initial posture of each arm segment and of the orientation of the shoulder joint relative to the torso. This depends on kinesthesia when the arm is hidden and fully relaxed. When the arm is held in a given posture without support, motor efference and feedback from sensory muscle spindles could be involved. Sight of the hand provides extra information about the initial location of the hand. When the arm is not seen, visual information about the target and kinesthetic/motor information about the arm's initial position must be in a common egocentric frame of reference centered on the torso. But the motion of the hand must be programmed in terms of motor commands to the limb segments.

The execution phase of reaching for and grasping an object consists of two overlapping components. The movement component is the movement of the hand toward the target. The movement is specified by its direction, amplitude, duration, and dynamics. The grasp component consists of orienting the hand with respect to the target and forming the thumb and forefinger into a grasp with a specified aperture (Jeannerod 1988). These movements can be recorded by attaching LEDs to the fingers. The grasp movements of the hand tend to occur during the deceleration phase of the movement (Jeannerod 1984).

Planning the orientation of a grasping hand requires the visual registration of the orientation of the object relative to the torso or relative to the seen hand. The amplitude of the grasp component is determined by the perceived size of the dimension of the object that is to be grasped. For a circular object lying in the frontal plane, perceived size depends on the size of the image and the perceived distance of the object from the subject. When an object is grasped along its in-depth dimension, grasp amplitude depends on the depth dimension of the object. Estimates of the depth dimension could be based on perspective, motion parallax, or binocular disparity. However, motion parallax and disparity must be scaled by viewing distance.
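The planning-phase bookkeeping described above amounts to a chain of coordinate transformations. The following sketch reduces it to two dimensions (azimuth and distance, with a single cyclopean eye); the frame definitions and the single-eye simplification are assumptions for illustration, since the text says only that these signals must be combined.

```python
import numpy as np

def target_in_torso_frame(distance, gaze_azimuth_in_head,
                          head_azimuth_on_torso, eye_offset=0.0):
    """Locate a fixated target in torso-centered coordinates from its
    distance and from the eye-in-head and head-on-torso azimuths
    (radians). Returns (x, y): lateral offset and forward distance."""
    azimuth = gaze_azimuth_in_head + head_azimuth_on_torso  # eye-to-torso rotation
    return np.array([distance * np.sin(azimuth) + eye_offset,
                     distance * np.cos(azimuth)])

# A target fixated 0.4 m away, 10 deg right of the head's midline while
# the head is turned 20 deg left, lies about 7 cm left of straight ahead.
print(target_in_torso_frame(0.4, np.radians(10), np.radians(-20)))
```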



The region within arm's reach is known as near visual space or peripersonal space. The region beyond arm's reach is far visual space or extrapersonal space. Evidence reviewed in Section 32.1.1d suggests that near and far space are processed in distinct parts of the brain. For example, neglect of one half of space due to unilateral brain damage may be confined to near space or to far space.

The boundary of near space may be extended when a person reaches for an object with a stick. Witt et al. (2005) asked subjects to reach to where they had just seen an object on a tabletop. In one condition, they reached with the finger or pointed when the location was beyond reach. In another condition, they reached with a stick. Object locations were verbally judged to be closer when a stick was used than when the finger was used. In particular, objects beyond the reach of the finger were judged to be more distant when subjects pointed at them than when they had reached with the stick. In all the other experiments reviewed in this section the objects were within reaching distance.

34.3.2 CODING DIRECTION AND DISTANCE OF REACHING

Arm movements are programmed in the parietal cortex and in the premotor and motor areas, as described in Sections 5.8.4f and 5.8.4g. The question addressed in this section is whether the direction and distance components of reaching movements are processed in distinct channels. Clearly, the direction and distance of a visual object are processed by distinct visual channels. Additional visual channels are required for object recognition.

It has been argued that the parameters of the motor commands for moving a multijoint limb can be identified by an analysis of the errors of reaching to a visual target. One problem with this approach is that it ignores the contribution of errors in the registration of the location of the visual target. An example of this approach is provided by Gordon et al. (1994). Subjects moved a cursor with unseen hand from a starting position to a target on a horizontal computer monitor. The spatial distributions of end points, plotted over many trials, were ellipses with the major axes lying along the distance axes of movement. Variability of distance errors increased nonlinearly with distance, whereas variability of direction errors was proportional to distance. Gordon et al. concluded that hand movements are planned in hand-centered coordinates with extent and direction being processed by distinct neural systems. However, they did not assess the contribution of errors in target registration. Messier and Kalaska (1997) obtained similar results.

McIntyre et al. (1997, 1998) argued that we can gain knowledge of how arm movements are coded by measuring the effects of varying the delay between presentation of the target and the start of the reaching movement.

Subjects pointed to a briefly exposed LED shown in various locations in 3-D space for 0.5, 5.0, or 8.0 s before the movement started. The room and hand were dimly illuminated or the target was seen in dark surroundings. As delay increased, distance errors increased more rapidly than direction errors. The authors concluded from this and other features of the pattern of errors that the distance and direction of a visual target are stored separately in a reference frame tied to the eyes and the arm.

Suppose that target direction and distance are registered initially in bodycentric coordinates. Direction and distance components of an arm movement could then be coded in parallel channels before signals are sent to the muscles. This would be revealed by the presence of neurons that respond only in relation to movement direction and other neurons that respond only in relation to movement distance. Alternatively, direction and distance could be coded by the same neurons. In the simplest scheme, direction would be coded in terms of the set of muscles to which commands are sent, and distance would be coded in terms of signals controlling the velocity, acceleration, and duration of the movement. This account presupposes that a change in the amplitude of an arm movement does not involve a change in the set of active muscles. In both schemes, one would expect direction to be programmed before distance.

The physiological evidence is somewhat ambiguous. Fu et al. (1993) recorded from cells in the premotor and motor areas of monkeys before and during reaching movements to visual targets in different directions and at different distances. Responses of most cells were modulated by movement direction both just before and during the movement. Also, responses of most cells were modulated by distance along at least one direction. During the premovement period, responses of cells were mostly related to direction. Their relation to distance was greatest during the movement. Messier and Kalaska (2000), also, found that responses of most cells in the premotor area of monkeys were modulated by both the direction and the distance of the visual target just before the movement started. Only a few cells were tuned to direction only or distance only. After the movement started, distance became a more dominant factor governing a cell's response.

Another approach to this issue is to provide only direction or only distance information before an arm movement starts. Kurata (1993) had monkeys make large or small hand movements in either of two directions. Responses of cells in the premotor area were recorded during the premovement period when the animals had been cued only to direction of movement or only to distance. At this stage, 32% of cells responded to both types of cue. After both direction and distance information were provided, responses of most cells were modulated by both types of information. After the movement started, responses of 51% of cells were modulated by either direction or distance. Responses of 42% of cells were modulated by direction only or by magnitude only.

On the whole, these results suggest that information about direction and distance is carried by the same neurons in the premotor and motor areas (a toy model of such joint coding is sketched below). Responses governing the direction of a movement are dominant at first and those governing distance become dominant as the movement progresses. This evidence does not preclude direction and distance of an arm movement being processed in distinct channels at an earlier stage in the parietal cortex.
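A toy firing-rate model shows how a single cell could carry both parameters, in the spirit of the single-channel reading of these results. Cosine direction tuning scaled multiplicatively by movement distance is an assumption chosen for illustration, not a fit to the recorded data.

```python
import numpy as np

def premotor_rate(direction, distance, pref_direction,
                  baseline=10.0, gain_per_cm=0.5):
    """Illustrative rate model: cosine tuning for movement direction,
    with the distance signal riding on the same cell as a
    multiplicative gain. All parameter values are arbitrary."""
    tuning = np.cos(direction - pref_direction)   # direction tuning
    return max(0.0, baseline + gain_per_cm * distance * tuning)

# Same direction, two distances: one cell signals both parameters
# rather than distance being carried by a separate set of cells.
for d in (5, 15):
    print(premotor_rate(np.radians(30), d, np.radians(0)))
```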

34.3.3 REACHING AND VISUAL CUES TO DISTANCE

34.3.3a Visual Cues to Pointing to a Visual Object

The first basic question is how accurately an unseen arm can reach to a visual target, as a function of the available cues to distance. The main points of the evidence reviewed in Section 25.2.4 are as follows.

Fisher and Ciuffreda (1988) asked subjects to point with hidden hand to high-contrast monocular targets. There were large individual differences in accuracy, but the finger pointed too far to targets nearer than 31 cm and too near to targets at larger distances. Each diopter change in accommodation induced about a 0.25-diopter change in reaching distance. Mon-Williams and Tresilian (1999a) used a similar task and found a correlation between pointing errors and target distance, but responses were very variable.

With binocular viewing, convergence can serve as a distance cue (the basic geometry is sketched below). Swenson (1932) asked subjects to move an unseen marker to binocularly viewed luminous disks at different distances but with constant visual subtense. Errors were less than 1 cm in the range 25 to 40 cm. When accommodation was optically adjusted to one distance, and vergence to another distance by prisms, distance settings were a compromise between the two but with more weight given to vergence. Foley and Held (1972) held accommodation constant by using a dichoptic pair of lights with various horizontal disparities. Judged distance increased as the vergence-distance of the target increased from 10 to 40 cm, but subjects consistently overreached, with a median error of 25 cm. Mon-Williams and Tresilian (1999a) varied the vergence distance of a point of light between 20 and 50 cm, keeping other depth cues constant. The mean variable error of pointing was less than 2 cm. Shorter distances were overestimated and longer distances underestimated. Note that keeping other depth cues constant does not guarantee they have no effect.
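The vergence cue manipulated in these studies follows from simple viewing geometry. The sketch below converts a vergence angle into fixation distance for a midline target; the 6.5 cm interocular separation is a typical assumed value, not one reported by these studies.

```python
import numpy as np

def distance_from_vergence(vergence_deg, interocular=0.065):
    """Distance (m) of a fixated midline point from the vergence
    angle: d = (a/2) / tan(v/2), with interocular separation a.
    Standard viewing geometry, for illustration."""
    return (interocular / 2) / np.tan(np.radians(vergence_deg) / 2)

# The 20-50 cm range used by Mon-Williams and Tresilian (1999a)
# corresponds to vergence angles of roughly 18.5 deg down to 7.4 deg.
for v in (18.5, 7.4):
    print(round(distance_from_vergence(v), 3))  # ~0.2 m, ~0.5 m
```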





Mon-Williams and Dijkerman (1999) asked subjects to grasp an object seen in surroundings with many cues to distance while wearing a 9-diopter base-in or base-out prism before one eye. Subjects modified their peak wrist velocity and acceleration according to the change in perceived distance induced by the prism.

Subjects underestimated distance when pointing with unseen hand to an isolated visual target at a distance of 27 cm, especially with monocular viewing (Magne and Coello 2002). However, pointing immediately became accurate when a textured background was added to the display with either monocular or binocular viewing. Accuracy remained high for several trials after the textured surface was removed.

Consider a person pointing with unseen hand to an object when not correctly converged on it. This situation arises in people with fixation disparity (Section 10.2.4). In the absence of monocular cues to distance, target distance can be detected only by registering both convergence and the disparity of the images of the target. Henriques et al. (2003) found that errors in binocular fixation (left, right, near, or far) did not correlate with errors of pointing to an isolated target with unseen hand. But the variability of pointing may have been too large to reveal any effect of misconvergence. A target at a great distance from the point of convergence becomes diplopic, and a person would have to decide which image to point to. This is one of the criteria used to determine eye dominance.

Now consider a person pointing to an isolated test target while converging on a second object at another distance. The distance of the test target could be derived from vergence and disparity, but only if the disparity were not beyond the range of disparity detectors. Henriques et al. (2003) asked subjects to point to an LED at a distance of 30, 38, or 50 cm while converging on an LED at a distance of 2 m. In each case, the near and far LEDs were aligned with the nondominant eye. The surroundings were dark, and the hand could not be seen. The disparity of the target at 50 cm was about 6.5° and that of the 30-cm target was 11°, which is beyond the range of disparity detection (Section 18.4.1). The images of the target were therefore strongly diplopic. Subjects pointed in the direction of the image aligned with the fixation target rather than in the direction of the image of the target in the dominant eye. Pointing distance was only weakly related to target distance. Subjects were clearly not using these large disparities effectively. When subjects converged on the target LEDs, reaching distance was closely related to target distance. This experiment should be repeated with the disparity of the target within the range of disparity detectors.

In summary, it can be stated that the vergence angle of the eyes and, to a lesser extent, accommodation of the lens supply some information for controlling the distance of the movement of the hand toward a visual target.



34.3.3b Grasping with Monocular and Binocular Viewing

Theoretically, there are two ways in which binocular viewing could improve the accuracy with which the hand grasps a 3-D object. First, binocular viewing improves the precision with which the eyes converge on an object, especially in the presence of phoria. Within arm's reach, this could improve the accuracy of judgments of absolute distance based on convergence (Section 25.2). Secondly, binocular vision provides binocular disparity. Disparity does not indicate the absolute distance of small objects but it does indicate the relative distances of objects or of parts of an object. However, the disparity produced by a given depth interval decreases by the square of distance (for interocular distance a, a depth interval Δd at distance d produces a disparity of approximately aΔd/d²), so that disparities must be scaled by absolute distance if they are to provide precise information about the 3-D structure of an object.

Servos et al. (1992) asked subjects to rapidly grasp the in-depth dimension of a rectangular block placed on a table at distances of between 20 and 40 cm. The room and the hand were in view so that arm movement and grasp were under feedback control. Movement time was longer and the peak velocity of the wrist was less with monocular viewing than with binocular viewing. Also, the final grasp aperture was too small with monocular viewing. For both modes of viewing, peak velocity increased with object distance, but the increase was steeper for binocular than for monocular viewing. These results suggest that the distance and in-depth dimension of the object were underestimated in monocular viewing. But it is not clear whether binocular vision improved planning the movement or its execution.

Earlier studies used a small number of objects and distances that were presented repeatedly with visual feedback. Perhaps subjects learned to grasp the objects over a set of trials. Watt and Bradshaw (2000) varied object thickness, height, distance, and angular size so that subjects could not learn the identity of particular objects. Subjects reached for and grasped the objects viewed monocularly or binocularly. In a well-illuminated room, closing one eye did not affect the dynamics of the arm movement but did produce an overestimation of grasp aperture. When the surroundings were dark except for lights on the objects and on the hand, closing one eye both slowed the arm movements and produced an overestimation of grasp aperture. However, grasp aperture was scaled by the distance of the object with both monocular and binocular viewing. These results suggest that binocular viewing is particularly advantageous for specifying grasp aperture. Presumably, the reduction of arm velocity in the dimly illuminated room was due to the absence of cues to distance that the surroundings provided.

In a further study, reaching for real objects on a tabletop with all depth cues present was compared with reaching for virtual objects that provided only binocular cues of convergence and binocular disparity (Hibbard and Bradshaw 2003).


In both cases, peak wrist velocity and grasp aperture were scaled by object width and depth. However, for the virtual objects, grasp amplitude was too large for near objects. Also, subjects made more corrections to their prehensile movements for virtual objects than for real objects. Hibbard and Bradshaw concluded that binocular information alone is not as effective as binocular information supplemented by monocular information. In this case, the monocular perspective information provided by the tabletop provided information about the absolute distances of the objects. For the virtual objects, distance information was provided only by vergence.

Servos and Goodale (1994) found that the hand moved more slowly when binocular vision was allowed only during the execution phase, compared with when it was allowed during the whole trial. But, at the end of the movement, the time taken to grasp the object was the same in the two conditions. Binocular vision allowed only during the execution phase increased speed of movement and reduced the time taken to complete the final grasp compared with when the object was seen with one eye throughout the trial. Thus, binocular vision during the execution phase improved the speed of movement. Binocular vision during either the planning phase or the execution phase improved the speed of the final grasp. Jackson et al. (1997) obtained similar results except that they found that movements were slowed when viewing became monocular after initially being binocular. Binocular vision during both phases did not improve reaching accuracy but reduced response variability and decreased response duration (Loftus et al. 2004).

Binocular vision during the movement phase could provide error feedback for the control of arm velocity and grasp aperture. Bradshaw and Elliott (2003) asked subjects to reach for luminous rectangular objects at distances of 25.5 and 42 cm, with luminous patches attached to the wrist and hand. Viewing was initially monocular, but both eyes were open when movement started or after 25%, 50%, or 75% of the movement had elapsed. The addition of binocular vision did not affect peak velocity or time to peak velocity. However, grasp aperture was more accurate when binocular vision was added at the 50% point, and the final phase of the movement was faster when binocular vision was added at the 25% point. Binocular vision added at the 75% point had no effect. On average, this final period occupied only 100 ms, too brief a period for visual feedback to have an effect. Melmoth and Grant (2006) confirmed that binocular vision increases the speed of the terminal phase of reaching and enhances grasp accuracy with respect to the dimension of the object.

If arm movements are preprogrammed more effectively with binocular viewing than with monocular viewing of the target, one might expect more corrective adjustments with monocular viewing. Marotta et al. (1998) found this to be true when subjects reached for a luminous sphere in dark surroundings.

It is not clear what error signal subjects used for the corrections, since the hand was not in view. Marotta et al. also found that monocular parallax produced by sideways motion of the head was as effective as binocular viewing in reducing the need for corrective hand movements. Thus, motion parallax can substitute for binocular viewing. With object distance indicated only by disparity, peak velocity and grasp width were both scaled for the distance of the object to a greater degree than with static monocular viewing. However, with distance indicated by motion parallax, only hand velocity showed greater distance scaling (Watt and Bradshaw 2003).

Binocular vision improved prehension performance only when it provided binocular disparity. Otherwise, performance with binocular vision was no better than with monocular vision (Bradshaw et al. 2004). Binocular vision conferred some advantage during the planning stage of stepping over an obstacle (Mamassian 1997; Patla et al. 2002).

Practice in catching a ball while wearing a telestereoscopic device that reduced apparent distances produced a change in response timing. Controls revealed that the change in timing was due to a recalibration of multiple sources of monocular and binocular information with respect to hand movements (Bennett et al. 1999; van der Kamp et al. 1999).

Binocular vision does not add to the quality of performance of all depth-related tasks. For instance, both experienced and newly trained pilots, with one eye covered, land an aircraft just as accurately as when they use both eyes (Lewis et al. 1973; Grosslight et al. 1978). But motion parallax provides ample depth information for these types of task.

34.3.4 REACHING WITH SEEN OR UNSEEN HAND

34.3.4a Reaching with the Seen Hand

When both the target and the hand are visible while the arm is moving, the movement is controlled by continuous visual feedback. Theoretically, the task could be accomplished by simply using error feedback to reduce the distance between the seen hand and the object. This could be done in an exocentric frame of reference. For example, people smoothly adjust their pointing when the visual target is suddenly moved even though they are not aware of the target motion (Prablanc and Martin 1992). Saunders and Knill (2005) had subjects point with a virtual finger to targets in a 3-D virtual display system. When the virtual finger was suddenly displaced laterally, subjects smoothly adjusted the direction of pointing within 117 ms. Adjustments to sudden displacements of the virtual finger in the distance dimension were less complete and occurred with a latency of up to 200 ms.
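The error-feedback scheme just described can be made concrete. The sketch below is illustrative only: the gain and time step are assumptions, and the 117-ms delay is borrowed from the lateral-perturbation latency reported by Saunders and Knill (2005). It drives a simulated hand with delayed proportional feedback on the seen hand-target error.

import numpy as np

def closed_loop_reach(targets, start, gain=4.0, delay_s=0.117, dt=0.005):
    """Move a simulated hand toward a (possibly jumping) target using
    delayed visual error feedback. 'targets' lists the target position
    at each time step; the hand position used to compute the error is
    the one 'seen' delay_s ago, mimicking visuomotor latency."""
    lag = int(delay_s / dt)
    path = [np.asarray(start, float)]
    for target in targets:
        seen_hand = path[max(0, len(path) - 1 - lag)]
        error = np.asarray(target, float) - seen_hand  # seen hand-target error
        path.append(path[-1] + gain * error * dt)      # proportional command
    return np.array(path)

# The target jumps 2 cm sideways halfway through a 1-s reach; the same
# feedback rule corrects the movement without any explicit replanning.
steps = 200
targets = [[0.30, 0.0]] * (steps // 2) + [[0.30, 0.02]] * (steps // 2)
path = closed_loop_reach(targets, start=[0.0, 0.0])

Because the command depends only on the current (delayed) visual error, corrections to sudden displacements fall out of the same rule that guides the unperturbed reach.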


Sight of the hand just before presentation of a visual target reduced the variability of pointing compared with reaching with the hand totally hidden (Rossetti et al. 1994). This improvement was probably due to vision supplementing kinesthetic information about the initial position of the hand. Sight of both hand and target before the start of a reaching movement further reduced pointing variability. This added improvement was probably due to the provision of information about the relative positions of hand and target. Prior sight of one hand improved the precision and accuracy of pointing with that hand to the index finger of the other hand, when both hands were unseen (Desmurget et al. 1997). In this case, improvement could have been due only to an improved sense of the initial position of the pointing hand.

34.3.4b Reaching with Unseen Hand

One way to prevent sight of the hand is to turn the lights off as soon as the arm starts to move. But, in this case, the subject must remember the position of the target. Pointing to a previously seen visual target with unseen hand requires (1) registration of a visual distance, (2) transformation of the visual distance into motor commands, and (3) registration of the location of the unseen arm as it approaches the remembered location of the target.

Soechting and Flanders (1989) investigated each of these potential sources of pointing error. When subjects pointed in the dark to a previously seen visual target within arm's reach, the arm undershot the target by an amount that increased with the distance of the target. A target at 60 cm was undershot by about 15 cm. The errors were very small when subjects pointed with a 1-m-long pointer. This suggests that the original errors were not due to underestimation of the distance of the visual target. Errors were also small when subjects, in the dark, returned an arm to a position into which it had been passively placed. This suggests that the original errors were not due to underestimation of the position of the arm. Soechting and Flanders concluded that errors in pointing to a previously seen visual target arise in the transformation of visual information into a motor-kinesthetic representation of arm position.

Turning the lights off as soon as the arm started to move resulted in increased grasp aperture and a shortening of the time taken to produce the grasp, although grasp aperture continued to be scaled by object size (Jakobson and Goodale 1991).

A better procedure for investigating visual feedback is to leave the target in view but have the hand visible in one condition and not in another. One way to prevent sight of the hand is to have subjects move to a virtual target seen in a mirror at 45° to the line of sight. Eliminating sight of the hand in this way increased movement time and grasp aperture (Gentilucci et al. 1994). Another way to prevent sight of the hand is to use an LED presented in dark surroundings as the target. Lack of visual feedback produced in this way resulted in slower hand speeds and larger grasp apertures compared with when the hand was visible (Berthier et al. 1996). However, the two conditions were not comparable. When the hand could be seen, the target was an object viewed in its surroundings rather than an isolated luminous object in dark surroundings. Connolly and Goodale (1999) avoided this problem by using an opaque barrier to hide the hand, leaving the target unchanged. Compared with when the hand could be seen, movement time was longer but grasp aperture remained the same. They concluded that the grasp posture was programmed in the planning phase and relied on proprioception for its final execution.

When the arm is hidden, an arm movement is executed in open-loop mode with respect to visual error feedback. However, there could be error feedback from kinesthetic receptors in joints, tendons, and sensory muscle spindles. Inputs from these receptors could indicate whether the moment-to-moment position of the arm corresponds to the intended position indicated by motor commands (efference copy). This is known as kinesthetic reafference (von Holst and Mittelstaedt 1950). Its use requires a learned association between motor commands and kinesthetic inputs. It has been suggested that the cerebellum is the repository of this learning (see Howard 1982).

The importance of kinesthetic reafference can be assessed by measuring how well a person lacking kinesthetic inputs points to visual targets with unseen hand. Lashley (1917) reported that a patient with loss of afference in his leg could move the leg in specified directions and through specified amplitudes but could not compensate for different loads (Section 34.1.2). Gentilucci et al. (1994) investigated reaching behavior in a patient with complete loss of proprioceptive sensations in the fingers and wrists of both arms, and impaired sensitivity in the elbows and shoulders. The patient and five normal subjects reached for and grasped spheres of different sizes at different distances. For the patient, the initial phase of reaching was slower, but movement speed and hand opening were scaled for object size and distance. The patient's final phase of reaching was unusually long and jerky. Normal subjects adjusted their grasp to compensate for a spring attached between finger and thumb. The patient could not make this adjustment.

Bard et al. (1999) tested a woman with loss of the large myelinated sensory nerves in all four limbs. She performed as well as normal subjects in adjusting a pointing movement of the unseen hand to a suddenly displaced visual target. Her direction errors and amplitude errors were within the normal range. She must have used motor efference and its associated sensory feedback from muscle spindles to calibrate her arm movements.

The unseen hand may be allowed to make contact with visible target objects at the end of the reaching movement.


Haptic feedback provides information about the accuracy of reaching. When the tactile object was progressively displaced in depth from the seen target in 1-cm steps, subjects recalibrated their reaching movements accordingly (Mon-Williams and Bingham 2007). Subjects were not aware of the displacement of the tactile target even when it was moved 8 cm from the visible target. The recalibration of reaching to a target at one distance transferred to some extent to other distances.

34.3.4c Moving the Unseen Hand Between Visual Targets

A second basic question is how accurately a person can move an unseen arm to match the distance between two seen objects. In an early study, Brown et al. (1948) asked subjects to move an unseen slider through the same distance as that between two previously seen visual markers. The distances ranged from 0.6 to 40 cm and were in various directions relative to the body. Shorter distances were overestimated (the finger moved too far) and longer distances were underestimated, for movements both away from and toward the body.

One can compare how well a person matches a visual distance by pointing with how well the person matches the same distance with a visual stimulus. Gentilucci and Negrotti (1996) displayed two LEDs in the dark between 10 and 35 cm apart at a viewing distance of 44 cm. After the LEDs had been extinguished, subjects reproduced the distance by pointing in the dark or by moving a laser pointer from an LED through the same distance. The pointing finger was not moved far enough, and the error increased to about 4 cm with increasing separation of the LEDs. The laser beam was moved about 2 cm too far for all separations. They concluded that visual distances are registered differently for the two types of task. The difference may have arisen not in the registration of the visual distance but in the execution of the tasks. For example, pointing errors may have reflected errors in judging the hand movement.

34.3.5 VISUAL PROCESSING FOR PERCEPTION AND ACTION

34.3.5a Clinical Evidence

This section is concerned with whether visual information is processed in different ways for perception and motor control. Goodale and Milner championed the idea that perceptual judgments rely mainly on the ventral processing stream, involving the inferotemporal cortex, while the visual control of action relies mainly on the dorsal stream, involving the parietal lobes and motor cortex (Goodale and Milner 1992; Milner and Goodale 1995) (Portrait Figure 34.2). They based their conclusion on the behavior of two patients.

Figure 34.2. Mel A. Goodale. Born in 1943, in Leigh-on-Sea near London, England. He obtained a B.A. in psychology from the University of Alberta in 1963 and a Ph.D. in psychology at the University of Western Ontario in 1975. After postdoctoral work with L. Weiskrantz at Oxford University he held an academic appointment in the Department of Psychology at the University of St Andrews, Scotland. In 1977 he moved to the University of Western Ontario, where he now holds a Canadian Research Professorship in visual neuroscience. In 1999 he was given the D.O. Hebb Award by the Canadian Society for Brain, Behaviour, and Cognitive Science. He was elected a fellow of the Royal Society of Canada in 2001.

A patient with damage to the ventral stream had deficient shape perception but was able to control grasp movements of the hand in relation to the shape and size of an object. Another patient with parietal damage showed the reverse pattern of deficiencies. In another study, a patient with visual form agnosia was unable to match the orientations of two objects but could orient the hand correctly when reaching for an object and grasping its in-depth dimension (Dijkerman et al. 1996).

This evidence does not indicate where shape is processed. It could be processed initially in V1 and the information then sent to the two processing streams. Or shape could be processed independently in the two streams. Either way, a disorder confined to one stream would disrupt shape processing in that stream but not in the other stream.

But this type of argument is based on an overly simple view of visual processing. The processing of shape is not a unitary process. Shape recognition and motor control depend on different attributes of a stimulus. Shape recognition depends on the registration of the internal relationships between parts of an object in relation to those of other objects stored in memory. The control of grasping could depend on detection of the locations of distinct parts of the object without regard to internal relationships and without reference to memory.


It is not that shape is independently processed in the two streams but that different tasks require processing of different stimulus attributes. For perception, visual information is processed so that it may be compared, recognized, or described. Perceptual tasks use the same basic information in different ways and rely to different extents on memory. Neural disorders have different effects according to whether the patient is required to match two shapes, name them, or describe them. For action, visual information is transformed into a frame of reference appropriate for motor responses. The type of transformation depends on whether the task is directing the gaze, pointing, grasping, or manipulating. Again, different disorders have differential effects.

34.3.5b Comparing Perception and Action

Another approach to this issue is to compare how normal subjects perform perceptual judgments and motor responses to the same stimuli. For example, one can ask whether an illusion that causes one stimulus to appear larger than another also affects the way the two stimuli are grasped by the hand. In the Titchener/Ebbinghaus circles illusion, shown in Figure 34.3, the two identical inner circles appear to differ in size. Aglioti, DeSouza, and Goodale (1995) used disks instead of circles. They reported that, in spite of the perceived difference in size of the inner disks, the grasp amplitude of the reaching hand was not much affected by the perceived difference in size. In a subsequent experiment from the same laboratory it was found that decreasing the distance between the inner and outer circles increased the illusion but had no effect on grasp amplitude (Haffenden and Goodale 2000). They concluded from control experiments that, as one would expect, grasp amplitude is reduced by the presence of neighboring objects because the objects are seen as obstacles.

This last piece of evidence suggests that it is not that perception and action process the same information independently and in basically different ways but that the two tasks emphasize different attributes of the stimulus (see Smeets et al. 2002). A related point is that the Titchener illusion involves the simultaneous comparison of the two inner disks, while grasping involves attending to only one disk. Franz et al. (2000) made the perceptual and motor tasks more comparable by converting the visual task into successive comparisons. Subjects compared the size of the inner disk of one display with that of an isolated disk and then repeated the comparison with the other display, as illustrated in Figure 34.3. The standard illusion in (A) was greater than the sum of the illusions derived from the separate successive comparisons in (B). Also, with successive comparisons, grasping and perceptual judgments showed similar errors. Pavani et al. (1999) conducted a similar experiment and obtained similar results.

In the Müller-Lyer illusion, the task is to compare the lengths of two lines. The illusion did not show when subjects were asked to point to the ends of the lines (Post and Welch 1996). For pointing, perception of the lengths of the lines is not required, only the registration of the location of each end of the line. In the Judd illusion, shown in Figure 34.4, the vertical line appears displaced toward the inward-pointing arrow even though it is midway between the ends of the horizontal line. Post and Welch found that subjects pointed accurately to the vertical line. This does not show that pointing is immune to the illusion but merely that the visual illusion affects the perceived location of the vertical line relative to the line ends but not its absolute location. When the vertical line was removed and subjects pointed to the imagined location of the midpoint of the horizontal line, the mean pointing position was shifted according to the visual illusion. This task forces subjects to rely on the location of the imaginary point relative to the whole figure. Thus, when both tasks required subjects to take the whole display into account, perception and motor control were influenced by the illusion.

Franz et al. (2001) asked subjects to extend the thumb and forefinger to grasp each shaft of the Müller-Lyer illusion. The stimulus and hand were seen only before the hand started to move.

Figure 34.3. The Titchener illusion. (A) The inner equal-size disks appear to differ in size. (B) With successive comparisons, each inner disk is compared in size with an isolated disk.




Figure 34.4. The Judd illusion. The vertical line bisects the horizontal line, but appears shifted toward the left end.


The differences in grasp amplitude resembled the perceptual differences between the shafts. Here too, the motor task, like the perceptual task, required subjects to attend to the whole figure and not just to the ends of the shafts.

Another approach to this issue is to compare the way the hand tracks an object moving round a path with the perceived shape of the path of motion. López-Moliner et al. (2003a) projected the image of a point moving around an elliptical path onto a horizontal surface seen through a mirror. Subjects made perceptual judgments of the path of motion and tracked the moving point with a pen held in the unseen hand. The perceived depth and width of the path of motion varied with the perspective information provided in the background. The path of motion of the pen was affected by the background in a similar way. Thus, perceptual judgments and motor responses were influenced in the same way by changes in the apparent shape of the path of motion. In a subsequent experiment the moving stimulus alternately loomed and shrank in size as it moved round the same elliptical path (López-Moliner et al. 2003b). This, also, affected perceptual judgments and motor responses in the same way.

Some illusions depend on processes at an early stage in the visual system and therefore operate in a retinal frame of reference. For example, the apparent tilt induced in a vertical grating by a surrounding tilted grating probably arises from contrast between orientation detectors early in the visual system. The effect is only about 2°, and two opposite effects may be induced in different locations at the same time. One would expect this illusion to be reflected in both perception and action, because the mechanisms serving the two tasks are fed by the same early processes. Indeed, it has been reported that tilt contrast as measured by visual comparison is the same as that measured by setting an unseen hand-held card to the orientation of the grating (Dyde and Milner 2002; Hibbard and Bradshaw 2006).

Other illusions occur at a later stage, where visual stimuli are assessed with respect to an extrinsic frame of reference. For example, a vertical line in a tilted room appears tilted up to about 20° in the opposite direction. The tilted room appears vertical because it is accepted as the frame of reference for vertical. This effect is therefore accompanied by illusory tilt of the observer's body. There is little or no change in the perceived relation between the test line and the body axis. As one would expect, this illusion produced a perceptual effect but had little effect on the orientation of the unseen hand reaching for the rod (Dyde and Milner 2002).

In 3-D tilt contrast, a frontal surface appears slanted when surrounded by a slanting surface. This illusion was evident in the task of nulling the apparent slant of the test surface but not in a motor task of setting an unseen card to match the slant of the test surface (Hibbard and Bradshaw 2006).

Perhaps the induction surface induced an apparent slant of the observer, but this was not measured.

People tend to perceive inclined or slanted surfaces as more frontal than they are (Gillam and Ryan 1992). This causes hills to appear steeper than they are. For example, a hill inclined 5° to the horizontal was judged to be inclined about 20°, and a 10° hill was judged to be inclined about 30° (Proffitt et al. 1995). The illusion showed when subjects set a line on a frontal disk to match a hill's inclination but not when they set an unseen board to match the hill's inclination. Witt and Proffitt (2007) concluded that the two tasks involved distinct perceptual processes. However, the visual test line was in a frontal plane and would therefore be immune to the 3-D illusion. If the inclination of the board were overestimated like visual inclination, tactile settings would not measure the illusion. It is a general principle that an illusion cannot be revealed by a measuring process that involves the same illusion.

A concave facemask appears convex. Hartung et al. (2005) found that subjects perceived the nose of a computer-generated concave face to be nearer than the cheek. Subjects also pointed further when reaching for the nose than when reaching for the cheek just after the face was removed. Króliczak et al. (2006), also, found that slow reaching movements of a finger to marks on a real concave facemask were directed in the direction of the illusory distances of points on the face relative to a surrounding reference surface. However, rapid movements of the finger were directed to the actual locations of the marks. One problem here is that apparent inversion of a facemask involves only relative distances, but a rapid hand movement requires detection of the absolute distance of the target mark. Absolute distance may be indicated by the vergence angle of the eyes. There is evidence that vergence eye movements are directed to the actual distances of marks on a concave facemask (Hoffmann and Sebald 2007).

Another factor is the extent to which visually guided actions have been practiced. Goodale et al. (2008) reported that only well-practiced grasping and reaching movements to visual targets are not affected by visual illusions.

There is a large literature on this general issue but no general consensus about the answers. Part of the problem is that the questions are not clearly formulated. Reviews were provided by Bruno (2001) and by Carey (2001).

34.3.5c Observing and Executing Actions: Mirror Neurons

Neurons have been found in the premotor cortex of the monkey that respond in a similar way when the monkey performs a goal-directed act and when it observes the same act performed by another monkey (Rizzolatti et al. 1996). These neurons are called mirror neurons.


Mirror neurons have also been found in various areas of the parietal lobe, which receive visual information from the superior temporal sulcus and project to the premotor cortex. Evidence from fMRI recordings indicates that mirror neurons also exist in these regions of the human brain. In their recent review, Rizzolatti and Sinigaglia (2010) conclude that the mirror-neuron mechanism allows an observer to appreciate the actions and intentions of other individuals.

Mirror neurons in the prefrontal cortex responded in the same way when the monkey grasped an object with normal pliers or with pliers that required a reverse movement of the fingers (Umilta et al. 2008). This indicates that the neurons coded the goal of the action rather than the particular movements used to achieve the goal. The response of most mirror neurons in the premotor cortex to a movie of a grasping action varied according to the direction from which the action was photographed. But the response of some neurons was independent of the viewing direction (Caggiano et al. 2011).

Caggiano et al. (2009) identified neurons in the monkey premotor cortex that responded when the monkey grasped an object in the midline of the body. About a quarter of the neurons responded only when the experimenter grasped the same object within the monkey's reaching distance (peripersonal space). About a quarter of the neurons responded only when the experimenter reached for the object when it was outside the monkey's reaching distance (extrapersonal space). The response of the remaining neurons was independent of the distance from the monkey of the object that the experimenter grasped.

34.4 JUDGMENT OF SELF-MOTION

34.4.1 PATH INTEGRATION IN HUMANS

Insects, such as ants and bees, register the distance they have moved by integrating information provided by walking movements or optic flow. This is known as path integration. This topic is discussed in Section 37.2. The present section deals with how people judge the distance they have moved or the angle through which they have been rotated. The direction and distance of self-motion may be derived from any of the following sources of information:

1. Inertial signals. These signals are provided by the otolith organs or semicircular canals of the vestibular system or by proprioceptors. They may be removed by having subjects move or rotate for some time at constant velocity or by having them walk or rotate on a treadmill.

2. Optic flow. A pure optic flow signal is produced when the visual scene has texture but no identifiable objects. These signals are absent in the dark.



3. Motion relative to a landmark. This is provided by the change in visual direction of an object that is visible during the whole of the motion.

4. Motor activity. This could involve counting steps, integrating step lengths, or integrating signals that indicate the velocity of self-motion. These signals are absent when subjects are moved passively.

See Mittelstaedt and Mittelstaedt (2001) for a discussion of these sources of information.

34.4.1a Integration of Inertial Signals

With passive linear movement in the dark, distance traveled might be derived by integrating signals from the otolith organs of the vestibular system (Berthoz et al. 1995). However, these signals are not generated by motion at constant velocity. The threshold for detection of linear acceleration is about 0.1 m/s². A single integration of the acceleration signal codes head velocity, which evokes compensatory eye movements and postural responses (Howard 1986). A double integration codes distance traveled. Accelerated linear motion over a short distance may be produced by placing subjects on a parallel swing oscillating along a linear path. Subjects can estimate distance traveled (Parker et al. 1997), or the amplitude of compensatory eye movements may be recorded (Israël and Berthoz 1989). During active motion, signals from the kinesthetic sense organs and motor efference are also available. This section is concerned with the integration of otolith signals over relatively long distances.

Female desert mice (Meriones) retrieve offspring that have wandered from the nest. After finding an offspring in the dark they carry it back to the nest on a straight course from a distance of over one meter. Mice away from the nest compensate for imposed rotations as long as these are above the detection threshold (Mittelstaedt and Mittelstaedt 1980). This suggests that mice maintain a sense of the direction of the nest by integrating inputs from the semicircular canals. However, the mice did not compensate for imposed linear displacements. This suggests that they do not integrate information from the otolith organs. However, in a later study, it was found that gerbils do compensate for imposed linear displacements (Mittelstaedt and Glasauer 1991). Perhaps they maintain a sense of the distance of the nest by integrating inputs from walking movements.

Israël et al. (1997) studied the ability of human subjects to reproduce a distance traveled when the only information was that provided by the otolith organs. Subjects were passively moved in a wheelchair in the dark through distances of between 2 and 10 m. The chair accelerated at a constant rate and then decelerated at the same rate. Thus, the otolith organs were constantly stimulated.


In one condition, the velocity profiles were adjusted to keep duration constant. This prevented subjects from using time duration. Subjects then used a joystick to move themselves in the chair through what they perceived to be the same distance. They were reasonably accurate. On average, the 2-m distance was overshot by 0.31 m and the 10-m distance was undershot by 0.79 m. It was concluded that subjects relied on a double integration of the otolith signal.

Subjects also undershot a remembered distance when reproducing a passive motion of the body in a vertical direction. The error was larger for downward motion than for upward motion (Young and Markmiller 1996). Vertical motion of an erect subject involves the saccules rather than the utricles.

Even if an observer accurately reproduces a distance moved, this does not indicate anything about the observer's ability to judge the distance. High accuracy and precision of distance reproductions indicate only that a distance moved on one occasion closely matches that moved on a subsequent occasion. Distance reproduction indicates the stability of distance coding when other features of the motion, such as duration and velocity, have changed. It also indicates the extent of memory loss if the reproduction is performed after various intervals. To get at judged distance one must either ask subjects to move a specified distance or ask them to estimate a distance that they have moved. Israël et al. (1997) found that, on average, subjects undershot by 22.5% when asked to move 2 m on a motor-driven chair. Loomis et al. (1993) asked blind subjects to estimate how far they had walked when guided over distances ranging from 2 to 10 m. Constant errors were about 10% of the distance. All this assumes that the internal scale of distance is accurate and stable.

Glasauer et al. (1994) asked normal and labyrinthine-defective subjects to walk blindfolded to a previously seen target at a distance of 4 m. There was no difference in distance errors, but the labyrinthine-defective subjects walked more slowly and showed larger lateral errors. Both normal and defective subjects must have estimated distance in terms of walking movements. Integration of otolith signals does not provide a good distance estimate when subjects walk at a more-or-less constant velocity.

Humans and monkeys can direct their gaze to a previously flashed target after the eyes or the head have rotated from their position when the target was exposed (see Blohm et al. 2005; Klier et al. 2005). Li and Angelaki (2005) asked whether monkeys can direct their gaze to a previously flashed target after they have been passively moved toward or away from the target. Monkeys fixated a midline LED in dark surroundings at a distance of 17 or 27 cm. A second LED was flashed in a nearer eccentric position. The monkeys were then moved passively 5 cm toward or away from the fixation LED while maintaining fixation on it.

The fixation LED was then turned off, and the monkeys moved their eyes to the remembered location of the flashed target. The results showed that the monkeys took the motion of the body into account in directing their gaze to the remembered target. They could have derived an estimate of their movement from the change in vergence that occurred while fixating the fixation LED. Otherwise, they could have used signals from the otolith organs. Two monkeys with both vestibular organs removed failed to take the motion of their body into account, which suggests that distance moved was derived from otolith signals.
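The single- and double-integration claims can be stated concretely. The following is a minimal sketch (the profile and sample rate are illustrative assumptions, not a model from Israël et al.): velocity is the running integral of acceleration, distance the running integral of velocity, and a constant-velocity segment contributes nothing because its acceleration is zero.

def integrate_distance(accel, dt):
    """Dead-reckon distance from an otolith-like acceleration signal."""
    velocity = 0.0
    distance = 0.0
    for a in accel:
        velocity += a * dt         # first integration: acceleration -> velocity
        distance += velocity * dt  # second integration: velocity -> distance
    return distance

# Triangular profile like that of the wheelchair: accelerate at 0.5 m/s^2
# for 2 s, then decelerate at 0.5 m/s^2 for 2 s; true distance is 2 m.
dt = 0.01
profile = [0.5] * 200 + [-0.5] * 200
print(integrate_distance(profile, dt))  # approximately 2.0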

34.4.1b Optic Flow and Distance Traveled

Humans can use the pattern of optic flow generated by self-motion to judge heading direction or control the speed of walking (Prokop et al. 1997). In theory, integration of optic flow could indicate absolute distance traveled. This would require integration of velocity over the duration of travel. But the velocity of headcentric optic flow must be scaled by the distance of the objects that generate it. Distance scaling is required because, for a given body velocity, a near object creates a higher headcentric velocity of flow than does a far object. When moving over a horizontal ground plane, the height of the eye above the ground must be taken into account.

Redlick et al. (2001) presented a virtual corridor in a helmet-mounted display system. The corridor did not contain stereoscopic cues but resembled a real 2-m-wide corridor familiar to the subjects. The display was slaved to head rotation so that head-parallax cues were available. Subjects were shown a target object at a defined distance. The target was then removed, and looming of the corridor simulated forward motion at one of various constant velocities or constant accelerations. Since subjects did not move, there were no vestibular cues. However, subjects experienced illusory forward motion (linear vection). They pressed a button when they estimated that they had reached the position of the previously seen target. The colors of vertical stripes on the walls were changed periodically to prevent subjects' tracking particular stripes. Performance was reasonably accurate for optic flow accelerations above 0.1 m/s². Distances traveled were overestimated with lower accelerations, markedly so for constant velocity. That is, subjects indicated too soon that they had reached the target. Redlick et al. suggested that distance overestimation in impoverished cue conditions protects one from bumping into objects. However, it is not clear why subjects were unable to estimate distance from constant-velocity optic flow. Bees fly at constant velocity and estimate distance traveled from optic flow (Section 37.2.2).
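The eye-height scaling mentioned above can be made explicit; the following is a standard piece of flow geometry, stated here for illustration rather than taken from the studies under discussion. For an eye at height $h$ translating forward at speed $v$ over a ground plane, a texture element seen at declination $\theta$ below the horizon moves with angular velocity

\[
\dot{\theta} = \frac{v}{h}\,\sin^{2}\theta .
\]

The whole flow field is thus scaled by the ratio $v/h$, so optic flow alone specifies travel speed, and hence distance traveled, only in units of eye height.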


Frenz and Lappe (2005) used 90° by 90° rear-projected displays that simulated textured or random-dot ground planes. The optic flow simulated constant-velocity forward motion. Subjects judged the distance they had moved in terms of distances marked out on a stationary ground plane. Although estimates increased linearly with the simulated distance traveled, distances were underestimated by at least 20%. This seems to conflict with the overestimation reported by Redlick et al. However, Redlick et al. used distances of between 4 and 32 m, whereas Frenz and Lappe used distances of between 1 and 7 m. Also, Redlick et al. asked subjects to reproduce a distance that was indicated before they moved, whereas Frenz and Lappe asked subjects to indicate distance moved after they had moved. When Frenz and Lappe used Redlick's procedure they obtained overestimation of distance moved for short distances and underestimation for long distances.

Later, the two groups of investigators conducted a joint experiment (Lappe et al. 2007). Computer-generated stimuli projected onto the six walls of a 2.4-m cubic room simulated a corridor with randomly colored walls. The display was stereoscopic and head-slaved. Optic flow simulated forward movement through distances of between 2 and 64 m, at 0.5 m/s for short distances and at 4 m/s for longer distances. In one condition, subjects indicated distance moved by setting a line on the floor of a stationary display (adjust-target condition). In the second condition, they pressed a button when they felt they had reached the target shown before movement started (move-to-target condition). In the adjust-target condition, distances moved were slightly overestimated for short distances but became progressively more underestimated as the distance increased. In the move-to-target condition, distances became progressively overestimated. It is known that distances of stationary targets are underestimated and that underestimation increases with increasing distance (Section 29.2.2). But this does not explain the opposite effects of the two conditions. Lappe et al. explained their results in terms of a leaky integration of visual-motion signals. In the adjust-target condition, subjects judged the distance moved from the start position. In the move-to-target condition, they judged the residual distance to the target.
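Lappe et al.'s account can be sketched as follows. This is a schematic reconstruction under assumed parameters (leak rate, speed, duration), not the fitted model from their paper: the same leaky integrator underestimates accumulated distance in the adjust-target condition and empties too soon in the move-to-target condition.

def leaky_path_integrator(v, alpha, dt, duration, mode, start=0.0):
    """Integrate speed v (m/s) with leak rate alpha (1/s), both assumed.
    'travelled': state is judged distance from the start (adjust-target).
    'to_go':     state is judged distance remaining (move-to-target)."""
    x = start
    for _ in range(int(duration / dt)):
        if mode == "travelled":
            x += (v - alpha * x) * dt  # saturates at v/alpha: long distances compressed
        else:
            x -= (v + alpha * x) * dt  # empties faster than the true residual distance
    return x

# Moving 40 m at 4 m/s with an assumed leak of 0.05/s:
print(leaky_path_integrator(4.0, 0.05, 0.01, 10.0, "travelled"))           # ~31.5 m: underestimate
print(leaky_path_integrator(4.0, 0.05, 0.01, 10.0, "to_go", start=40.0))   # < 0: 'arrival' signaled early

The same leak thus produces underestimation when reporting distance covered, and premature arrival, that is, overestimation of distance traveled, when counting down the distance that remains.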



Even without distance scaling, observers should be able to distinguish between two travel distances (Loomis et al. 1993). With constant-velocity optic flow, distances could be discriminated on the basis of flow duration. Bremmer and Lappe (1999) reported that observers could integrate optic flow velocity to discriminate between travel distances in two sequential displays depicting forward motion over a textured ground plane. One display had fixed duration, speed, and distance. In the other display, speed and duration were varied. The average error was under 3%. Subjects must have assumed that the two scenes had the same depth structure.

Frenz et al. (2003) had stationary subjects view a computer-generated image that simulated the motion of a ground plane produced by forward motion of the viewer. Subjects reported which of two successively presented stimuli produced the larger impression of self-motion. They could detect differences in distance moved when the two displays differed in velocity, eye height, or viewing angle. They could compensate for differences in eye height when they could see the change in eye height but not when it occurred in a blank interval. Subjects could also detect differences in travel distance when a "fog" reduced the visibility of more distant (more slowly moving) elements in one of the displays.

Harris et al. (2000) asked whether distance moved is more reliably coded by optic flow or by the otolith organs. Subjects were exposed to a forward motion produced by optic flow in the simulated corridor described above. They accurately reproduced the distance when tested with a similar display. However, when constantly accelerated in a wheelchair in the dark, they stopped the chair at about half the distance. They accurately reproduced a real passive motion in the dark by a real motion in the dark, but visually reproduced distances were about double the passively reproduced distances. After being shown a visual target, subjects reproduced the distance in the presence of both optic flow and real body motion. When the two cues were in conflict, perceived distance was closer to the physical distance than to the visually simulated distance. Thus, information from the otolith organs was more highly weighted than that from optic flow.

34.4.2 INTEGRATION OF MOTOR ACTIVITY

The present section asks how accurately humans can walk in the dark to a previously seen object. Only idiothetic information produced by active movement is available for this task. A distance error is the sum of the error in the initial estimate of object distance and the error in walking the distance.

34.4.2a Effects of Delays and Resistance

To walk to a previously seen target a person must remember the distance. Over time, the memory could become less precise or show a systematic distortion. Thomson (1980, 1983) was the first person to study this question. He showed human subjects a small object at distances up to 21 m and then asked them to walk to it with eyes closed. They were given practice trials with feedback. In test trials, for distances up to 9 m, subjects were as accurate and precise as with eyes open. For longer distances, the standard deviation of signed errors increased greatly. When subjects waited between seeing the target and walking they preserved high accuracy to distances beyond 9 m. When they ran, they preserved accuracy over longer distances. Thomson concluded that the critical factor is the duration for which the distance is held in memory, rather than distance itself. Subjects became inaccurate when the interval between seeing the target and reaching it exceeded 8 s. This suggests that memory of distance begins to fade after 8 s.


Other investigators failed to replicate these findings. Corlett et al. (1985) found that only for children was the accuracy of walking in the dark to a previously seen target degraded by a 5-s delay between seeing the target and the start of walking. Elliott (1986) found that subjects could walk blindfolded to a previously seen target at distances from 3 to 15 m with a variable error of about 5% and an undershoot of about 5%. Precision and accuracy were much higher with eyes open, but it is not clear what visual information subjects received. Neither practice with feedback nor an initial delay affected variable errors (Elliott 1987).

Other investigators agreed that walking to a target in the dark can be accurate, but they did not replicate the effects of time delay. Steenhuis and Goodale (1988) found that at least 30 s had to elapse before performance began to deteriorate. Rieser et al. (1990) asked subjects to walk in the dark to previously seen targets at distances between 4 and 24 m in an open field. Subjects walking at a normal pace or a brisk pace showed no consistent constant errors. Variable errors increased with distance and averaged about 8% of the distance walked. There was no significant effect of an 8-s time delay. Eby and Loomis (1987) found no effect of delays of up to 45 s on the accuracy of throwing a ball with eyes closed to a previously seen location. Undershooting increased as distance increased from 5 m to 25 m.

The effort required to walk to a distant object depends on the slope of the ground. Corlett and Patla (1987) showed subjects targets on a downhill, an uphill, or a level surface and asked them to walk blindfolded to each target along a downhill, uphill, or level surface. The results showed that subjects used an estimate of the effort required to walk to the target when translating visual information into a plan for walking in the dark. For example, subjects who viewed a target on a level surface walked too far when walking downhill but not far enough when walking uphill. Corlett et al. (1990) found that subjects did not walk far enough when walking blindfolded to a previously seen target while a variable elastic force strongly added to the effort of walking.

34.4.2b Walking Distance and Verbal Estimates

Another question is whether distances reproduced by walking in the dark are as accurate and precise as verbal estimates of distance. Philbeck and Loomis (1997) showed subjects a visual target at distances between 0.79 and 5 m in an illuminated corridor and then asked them to close their eyes and verbally estimate its distance or walk to where it had been. For both tasks, distances under about 2.5 m were overestimated and larger distances were underestimated. Walked distances were uniformly larger than verbally estimated distances.

34.4.2c Reproduction of a Walked Distance

Loomis et al. (1993) guided subjects as they walked over a path in the dark and then asked them to verbally estimate the distance moved or reproduce the path by active walking in the dark. For both types of judgment, the 2-m distance was overestimated and longer distances were underestimated. Subjects were slightly less precise in walking in the dark to a location to which they had been guided in the dark than they were in walking to a target that they had seen (Bigel and Ellard 2000). Subjects walked blindfolded to a previously seen target and to a target to which they had been guided while blindfolded with similar high accuracy (Ellard and Shaughnessy 2003). Subjects also showed similar accuracy when walking blindfolded to a target to which they had been guided while blindfolded or to which they had been guided with eyes open (Sun et al. 2004). However, the walked distance was slightly underestimated in the eyes-open condition relative to the blindfold condition.

Theoretically, subjects could use several sources of idiothetic information to reproduce a distance walked in the dark. If they walk some distance at a constant speed they cannot use vestibular cues. If they are prevented from counting steps by having to perform a mental task, they must use some form of path integration derived from their walking movements. Durgin et al. (2007) found that most people increase step length and decrease step time by the same proportion when they increase walking speed. This means that stride frequency is correlated with walking speed. Integration of this speed signal with respect to time could be used to judge distance walked. However, Durgin et al. (2009) found that judgments of time intervals were much more variable than reproductions of distance walked in the dark. They used distances between 4.5 and 18 m. Also, the variability of reproduced distances was proportional to distance walked (it obeyed Weber's law). Durgin et al. concluded that subjects integrated step distances rather than walking velocity.

Subjects were just as accurate in walking to targets that had been viewed monocularly as to targets that had been viewed binocularly. The targets were at distances of between 2 and 12 m in a hallway, which provided adequate perspective information (Creem-Regehr et al. 2005). Vergence does not change much over this range of distances.
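Durgin et al.'s conclusion can be illustrated with a toy simulation; the noise structure and magnitudes below are assumptions of mine, not estimates from their data. If the registered step lengths within a trial share a common calibration error, the standard deviation of the summed estimate grows in proportion to distance, as Weber's law requires; independent per-step noise alone would grow only as the square root of the number of steps.

import random

def walked_distance_estimate(n_steps, step=0.75, gain_sd=0.08, step_sd=0.02):
    """One trial: sum of registered step lengths. 'gain' is a trial-wide
    calibration error shared by all steps; 'step_sd' is independent
    per-step noise (both hypothetical)."""
    gain = random.gauss(1.0, gain_sd)
    return sum(gain * random.gauss(step, step_sd) for _ in range(n_steps))

for n in (6, 12, 24):  # about 4.5, 9, and 18 m at 0.75 m per step
    trials = [walked_distance_estimate(n) for _ in range(2000)]
    mean = sum(trials) / len(trials)
    sd = (sum((x - mean) ** 2 for x in trials) / len(trials)) ** 0.5
    print(f"{n * 0.75:4.1f} m: sd = {sd:.2f} m, sd/mean = {sd / mean:.3f}")

The ratio sd/mean stays roughly constant near the assumed gain variability, which is the Weber-law pattern that Durgin et al. observed.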

34.4.2d Recalibration of Walking Distance

Continued activity of a group of muscles can produce aftereffects. For example, when the arm is pushed against a wall for a minute or so, the arm spontaneously rises when it is relaxed. Also, after the closed eyes have been held in eccentric gaze in one direction for about a minute, they feel to be looking straight ahead when they are displaced several degrees in the opposite direction (see Section 25.2.7). It is therefore possible that walking for some time would affect estimates of walked distances.


Anstis (1995) had blindfolded subjects run on a treadmill for 60 s. Then, still blindfolded, they were asked to run on fixed ground while attempting to remain in one place. In fact, they drifted forward through a mean distance of 152 cm in 15 s without being aware of the movement. The aftereffect lasted for about 1 minute. A period of running over fixed ground with eyes open did not produce the aftereffect. Also, hopping on the treadmill on one leg produced an aftereffect only for hopping on that leg. Anstis concluded that there was something special about running on a treadmill. However, running on fixed ground was performed with eyes open, and this may have prevented the buildup of an aftereffect. Durgin and Pelah (1999) found that aftereffects produced by running blindfolded on a treadmill were similar to those produced by running while holding a handle on a cart moving over fixed ground. Thus, running on a treadmill or on the ground while blindfolded recalibrated the relation between motor output and perceived distance.

One can ask whether perceived walking distance can be recalibrated by altering the relation between walking movements and visual information. Ellard and Shaughnessy (2003) exposed subjects to conflicting distance information. In one condition, subjects were shown the target on a grassy meadow. In a second condition they were guided blindfolded to what seemed to them to be the same target but which was in fact at another distance. When they subsequently walked to the target blindfolded, distance estimates were a compromise between the two distances (see also Ellard and Wager 2008).

Rieser et al. (1995) asked subjects to walk for 8 minutes at 8 km/h on a treadmill that was towed through the visual environment at 5 km/h. In a second condition, they walked at 7 km/h but were moved through the environment at 17 km/h. Before and after this conditioning, subjects walked blindfolded to a previously seen target. After conditioning, subjects who had walked more slowly than indicated by optic flow undershot the target, while subjects who had walked faster than the optic flow overshot the target. Control conditions revealed that the effects of conditioning were not due to simple motor or visual aftereffects but represented a recalibration of walking distance relative to visual distance. Thus, walking distance in the dark to a previously seen target is modulated by prior exposure to a given ratio of walking speed to optic flow speed (see Massing 1997).

Proffitt et al. (2003) performed a similar experiment. Subjects walked for 3 minutes on a treadmill while viewing a scene in a helmet-mounted display system. The scene remained stationary or moved according to the rate at which the subjects were walking. Subjects were then taken off the treadmill, blindfolded, and asked to make stepping movements but remain in the same place.



After exposure to a stationary scene, subjects drifted forward while feeling that they had remained in the same place. The aftereffect was much smaller after exposure to a moving scene. In other words, the absence of the expected optic flow recalibrated the effort associated with walking. The smaller aftereffect produced by the appropriate optic flow was presumably due to motor-proprioceptive aftereffects of walking derived from optic flow.

In an experiment from the same laboratory it was found that treadmill walking with zero optic flow increased the apparent distance of visual targets toward which the subjects were about to walk while blindfolded (Witt et al. 2004). However, the apparent distance of the targets was not increased after treadmill walking when subjects were about to throw balls at the targets. Thus, apparent distance was affected by exposure to an unusual combination of walking and optic flow, but only if subjects were about to perform a walking task. This suggests that judgments of visual distance are related to specific acts that are about to be performed.

34.4.3 VISUALLY INDUCED MOTION-IN-DEPTH

A stationary person observing a large moving display has a compelling sensation of self-motion in the opposite direction. This is known as vection. Rotation of a visual scene about the yaw, pitch, or roll axis of the body induces illusory self-rotation, or circularvection. Translation of a visual scene in a frontal plane induces illusory self-translation. A radially expanding scene induces illusory forward motion. Any illusory self-translation is known as linear vection. Vection is experienced in the wide-screen cinema or when one looks at an adjacent moving train while sitting in a stationary train. The literature on vection was reviewed by Dichgans and Brandt (1978) and Howard (1982).

Natural visual scenes rarely move as a whole, especially their more distant parts. Therefore, motion of the whole scene indicates self-motion. Furthermore, the semicircular canals and utricles respond only to acceleration of the head. They produce no signals after the head has been rotating or translating at constant velocity for about 30 s. In the initial 30 s or so after the onset of visual motion, the absence of the normally occurring signals from the vestibular system indicates that the body is not moving. After this initial period, sensations of self-motion are controlled by visual motion, because visual motion provides the only information that the body is moving. This explains why vection has a latency of up to 30 s.

The following evidence supports the idea that vection is weakened when vestibular signals indicate that the body is not moving.


1. The latency of circularvection was reduced when, initially, the subject's body was given a brief rotary acceleration in the direction of vection (Melcher and Henn 1981; Wong and Frost 1981). Physical acceleration provides the expected vestibular signals. On the other hand, circularvection was inhibited by a sudden physical acceleration of the subject in the direction of the moving display (Teixera and Lackner 1979).

2. The latency of roll vection was shorter in subjects with bilateral loss of labyrinthine function than in normal subjects (Cheung et al. 1989). Also, the latency of roll vection was unusually short in normal subjects in the weightless conditions produced by parabolic flight (Cheung et al. 1990).

3. Monkeys and humans are more sensitive to self-motion that stimulates the utricles (sideways, backward, and forward motion) than to motion that stimulates the saccules (up-down motion) (Malcolm and Melvill Jones 1974; Fernandez and Goldberg 1976). Accordingly, the latency for linear vection produced by a visual display moving in depth was shorter than the latency for vection in an up-down direction (Giannopulu and Lepecq 1998).

According to this analysis, a constantly accelerating scene should produce little or no vection, because visual acceleration arising from actual body motion is accompanied by vestibular signals. The absence of the expected vestibular signals should inhibit vection. This type of experiment has not been done.

One might expect that vection latency would be prolonged if some unsmooth motion were added to the visual motion. Adding incoherent motion to a smooth radially expanding display increased the latency and duration of forward vection (Palmisano et al. 2003). The incoherent motion presumably made it more difficult to detect the radial motion.

Palmisano et al. (2008) induced forward linear vection with a 56° circular pattern expanding radially. It was viewed monocularly through a cylindrical tube. The radial motion was either constant in velocity or contained a component of acceleration along the x-, y-, or z-axis of the body. The acceleration component was either a coherent random jittery motion or a motion that reversed in direction every 0.9 or 1.75 s. Both types of acceleration along the x-axis or y-axis reduced the latency of forward vection and increased its apparent speed. Acceleration along the z-axis had little or no effect. Thus, forward vection was increased by a component of visual acceleration orthogonal to the direction of vection. There is no good reason to expect that adding lateral or vertical visual acceleration would reduce forward vection. The addition of acceleration would make the visual stimulus more impressive and generate a more convincing sensation of self-motion, since self-motion is usually accompanied by motion perturbations.

The addition of stereoscopic depth to a looming display that simulated motion through a 3-D cloud of randomly positioned objects reduced the latency of forward vection and increased its duration (Palmisano 1996). Vection was improved only by the addition of changing disparity that was consistent with the motion in depth produced by the optic flow (Palmisano 2002). Fixed disparities that increased the perceived 3-D appearance of the display had no effect. The improvement was therefore due to the extra information that changing disparity provided about self-motion in depth.

34.5 VISUO-TACTILE SUBSTITUTION

In sensory substitution, stimuli that normally occur in one sensory modality are converted into stimuli in another modality. In visuo-auditory substitution, visual signals are converted into auditory signals (see Auvray et al. 2007). In visuo-tactile substitution, visual signals are converted into tactile stimuli. Both systems were developed as aids for the blind. This section deals only with visuo-tactile substitution.

In the 1960s Bach-y-Rita and colleagues transformed signals from a video camera into signals in an array of 400 vibrators on the back (Bach-y-Rita et al. 1969). Subjects controlled the motion of a bulky camera mounted on a tripod. With practice, blind subjects could recognize simple objects on a tabletop and describe their relative positions. Subjects spontaneously reported that the stimuli seemed to come from in front of the camera rather than from the vibrators on the back. This is known as externalization, or distal attribution. Miniature video cameras can now be held in the hand, and new versions of the tactile matrix have higher resolution and can be mounted on any part of the body, including the tongue.

Visuo-tactile substitution provides a method for investigating novel forms of distal attribution. White (1970) showed that stimuli from a video camera conveyed to a tactile matrix on the back were recognized and externalized only when the blindfolded subjects controlled the camera. Also, blind people externalize objects that they feel with a cane (Bach-y-Rita and Kercel 2003). In a recent study, blindfolded subjects used joysticks to control the motion of a robot through a maze (Segond et al. 2005). A video camera on the robot detected oriented triangles and horizontal lines placed on the walls of the maze. The information was conveyed to a tactile matrix on the subject's abdomen. Subjects learned to navigate the maze. As learning progressed, they reported that they felt as if they were driving the robot with respect to signals that seemed to be in front of them.

The crucial factor in externalization in any sensory modality is the interplay between motion of stimuli and activity of the perceiver.


For example, an important factor in the externalization of sound is the interaction between auditory stimuli and motion of the head. Auditory stimuli seem to be inside the head when they are delivered through headphones so that they move with the head (Section 35.1.1). Visual stimuli are normally externalized.




move their eyes, head, and body. However, there are reports of patients with head injuries for whom the world appeared two-dimensional, like a picture (Section 32.3.1). An afterimage is not clearly externalized when viewed with moving eyes. Also, the pressure phosphene created by pressing on the side of the eye appears inside the head.


35 AUDITORY DISTANCE PERCEPTION

35.1 The basis of sound localization
35.1.1 Externalization of sound
35.1.2 The audio egocenter
35.2 Monaural cues to distance
35.2.1 Distance and loudness
35.2.2 Changes in the frequency spectrum
35.2.3 Sound reverberation
35.2.4 The role of the pinna
35.2.5 Long-distance communication by sound
35.3 Binaural cues to distance
35.3.1 The nature of binaural differences
35.3.2 Binaural differences and perceived distance
35.4 Dynamic distance cues
35.4.1 Acoustic tau
35.4.2 Doppler shift
35.4.3 Acoustic motion parallax
35.5 Auditory aftereffects
35.6 Interactions between vision and audition
35.6.1 Ventriloquism
35.6.2 Synchrony of visual and auditory signals
35.7 Auditory localization in owls
35.8 Echolocation
35.8.1 Echolocation in subterranean mammals
35.8.2 Echolocation in bats
35.8.3 Echolocation in marine mammals
35.8.4 Echolocation in humans
35.9 The lateral-line system

35.1 THE BASIS OF SOUND LOCALIZATION

35.1.1 EXTERNALIZATION OF SOUND

A click or pure tone generated by an oscillator and delivered to headphones seems to originate from a phantom source in the center of the head. As the sound is made louder in one ear than in the other, the phantom moves along the interaural axis toward the ear with the louder sound. As the interval between the arrival times of a click at the two ears is increased, the phantom moves toward the ear receiving the earlier sound. The phantom also moves toward the ear receiving a phase-advanced pure tone. A judgment of the location of a phantom sound in the head is known as lateralization, as opposed to localization of sound sources outside the head. It is as if, in lateralization, azimuth information is present but distance information is lacking. Three reasons for this have been suggested.

1. Pure sounds delivered by headphones lack the interaural differences in frequency composition that natural sounds acquire when they pass through the air or reflect from external surfaces, including the pinnae. Sakamoto et al. (1976) showed that the addition of sound reflections (reverberation) helps to externalize a sound heard through headphones. We will see in Section 35.2.3 that reverberating sound seems more distant than sound produced in a nonreflecting anechoic chamber.

2. Natural sounds reflect from the complex surfaces of each pinna in a manner that varies with the location of the sound source. Subjects localize sounds delivered to tubes that bypass the pinnae as being in or near the head (Young 1931). Also, synthetic sounds delivered to headphones are localized in the head. However, sounds recorded by microphones in the ear canals of one person and heard through headphones by another person are judged to be outside the head (Levy and Butler 1978). Also, sounds recorded by microphones in the ear canals of a dummy head with pinnae are localized externally when heard through headphones (see Plenge 1974).

3. The acoustic attributes of natural sounds vary with movements of the head. People naturally move their heads when attempting to localize sound sources in an anechoic chamber (Thurlow et al. 1967). The only sounds that are not affected by movements of the head are those that originate on the axis of head rotation or in the head. Sounds detected by microphones in the ear canals of a dummy head may be delivered to headphones. The sounds become externalized when the dummy head moves in sympathy with movements of the head of the person wearing the headphones (see Toole 1970).

Under certain circumstances, sounds produced by external loudspeakers appear in the head or close to the head. Hanson and Kock (1957) placed subjects in front of two loudspeakers at the same distance in a frontal plane. The speakers emitted pure tones in counterphase. At certain lateral positions of the head between the speakers, the sounds in one ear canceled while those in the other ear did not. The listener then had the impression that the sound source was close to one ear. A small sideways movement of the head created the impression that the sound source had moved from near one ear to near the other ear. Toole (1970) found that subjects often experienced sounds in the head when listening to identical sounds from two or four external loudspeakers arranged in a frontal plane at equal distances from the head in an anechoic room.

In a normal acoustic environment, identical sounds delivered by two loudspeakers, one on the right and one on the left of the listener, produce the impression of a single external sound source. This is the basis of stereophonic sound systems. The source seems to be localized somewhere between the speakers in a direction that varies with interaural intensity and timing. Pure tones delivered in this way are typically externalized but otherwise lack distance information.

35.1.2 THE AUDIO EGOCENTER

The visual egocenter is the position in the head from which the azimuth directions of objects are judged. It has usually been found to lie midway between the eyes (see Section 16.7.2). Neelon et al. (2004) measured the audio egocenter. Blindfolded subjects moved a hand-held source of sound until it was judged to be in the same direction as a fixed sound source at eye level at a distance of 1.5 m. Measurements were taken with the hand-held sound a few inches from the head, at arm's length, and at an intermediate position. The whole procedure was repeated for each of several fixed sound sources placed in an arc extending from 30° right to 45° left of the midline. For each fixed sound source, a best-fitting line was drawn through the settings of the three hand-held sources. The point where the lines for the different locations of the fixed source intersected was defined as the audio egocenter. It was found to lie close to the visual egocenter at the midpoint of the interocular axis. However, when the azimuth range of the fixed sound sources was extended beyond the binocular visual field, the audio egocenter became more variable and tended to move closer to the midpoint of the interaural axis.
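Neelon et al.'s procedure amounts to finding the common intersection of several noisy lines. The Python sketch below is only an illustration of that geometry, not their analysis: the settings are hypothetical numbers in head-centered coordinates, and the least-squares intersection is one reasonable way to define the crossing point of lines that do not meet exactly.

```python
import numpy as np

def fit_line(points):
    """Fit a least-squares line through a set of 2-D points.
    Returns a point on the line (the centroid) and a unit direction
    (the first principal component of the centered points)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[0]

def intersect_lines(lines):
    """Least-squares intersection of lines given as (point, unit direction).
    Minimizes the summed squared perpendicular distance to all lines by
    solving (sum of projectors) x = (sum of projected points)."""
    dim = len(lines[0][0])
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for a, d in lines:
        p = np.eye(dim) - np.outer(d, d)  # projector orthogonal to d
        A += p
        b += p @ a
    return np.linalg.solve(A, b)

# Hypothetical settings (cm): for each fixed source, the subject aligned
# a hand-held source at three distances from the head.
settings = {
    "source_right":   [(10, 6), (40, 23), (70, 40)],
    "source_midline": [(12, 0), (42, 0), (72, 0)],
    "source_left":    [(8, -8), (35, -35), (60, -60)],
}

lines = [fit_line(pts) for pts in settings.values()]
print("Estimated audio egocenter (cm):", intersect_lines(lines).round(2))
```

With real settings the three lines never intersect in a single point, which is why some least-squares criterion of this kind is needed to define the egocenter.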



35.2 MONAURAL CUES TO DISTANCE

Most work on auditory localization has been concerned with the perception of the direction of sounds, particularly azimuth. The detection of the distance of a sound source is known as ranging. The earlier literature on auditory distance perception in humans was reviewed by Coleman (1963) and Blauert (1997).

Several procedures have been used to measure the apparent distance of a sound source. The subject may be asked to report the distance verbally in feet or meters. Another procedure is to display an array of numbered loudspeakers and ask the subject to report from which loudspeaker the sound is coming. This procedure scales auditory distance in terms of visual distance. Finally, subjects may make a motor response, such as walking to the perceived location of a previously heard sound.

35.2.1 DISTANCE AND LOUDNESS

35.2.1a Loudness and Absolute Distance

The intensity of a sound is a measure of energy per second per unit area, or joules/s/cm². Since power in watts is energy per second, sound intensity can be expressed in watts/cm². However, the sensation level of a sound depends on sound pressure, or amplitude, rather than intensity. Sound pressure, p, is proportional to the square root of power, P; that is, P = kp². The term "sound intensity" refers to power per unit area, but it is often used to denote sound pressure. Sound pressure is force per unit area (dynes/cm²). The sensation level of a sound of power P can be measured in decibels relative to a threshold value of sound power, P_T:

    Decibels (dB) = 10 log10(P / P_T)    (1)

It may also be expressed in terms of sound pressure, p, relative to the threshold value of sound pressure, p_T. Since pressure is proportional to the square root of power, the same level is:

    Decibels (dB) = 20 log10(p / p_T)    (2)

The threshold value, p_T, most often used is based on the mean threshold of young adults, namely 0.0002 dynes/cm². Sound intensity relative to this standard is known as the sound pressure level, or SPL. For example, a sound of 60 dB SPL is 60 dB above 0.0002 dynes/cm².

The intensity of sound radiating on a spherical wavefront decreases in inverse proportion to the square of the distance, r, from the source. Sound pressure, being proportional to the square root of intensity, is inversely proportional to the distance of the source. The reduction in sound pressure at an ear with increasing distance of the source is known as the 1/r loss. The loss in decibels at distance r, relative to a sound at distance r_0, is given by:

    Loss in dB = 20 log10(r / r_0)    (3)
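As a quick check on equations (2) and (3), the following minimal Python sketch reproduces two numbers used in this section: the 60 dB SPL example above, and the 6-dB loss per doubling of distance stated in the next paragraph. The function names are ours, chosen for illustration.

```python
import math

def db_from_pressure_ratio(p, p_ref):
    """Sound level in dB for pressure p relative to reference p_ref
    (equation 2): dB = 20 log10(p / p_ref)."""
    return 20 * math.log10(p / p_ref)

def spreading_loss_db(r, r0):
    """The 1/r loss (equation 3): dB loss at distance r relative to r0
    for a point source radiating spherical waves."""
    return 20 * math.log10(r / r0)

# A pressure of 0.2 dynes/cm^2 relative to the 0.0002 dynes/cm^2
# threshold is 60 dB SPL.
print(db_from_pressure_ratio(0.2, 0.0002))  # 60.0
# Each doubling of distance costs about 6 dB.
print(spreading_loss_db(2.0, 1.0))          # ~6.02
```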

In air, in a free space situation (one with no absorbing or reflecting surfaces) there is a 6-dB loss for each doubling of distance. This relationship holds only for a point source of


sound that propagates by a spherical wave. Such a source is known as an acoustic monopole. A small sphere pulsating in volume is an acoustic monopole. A more common type of sound source is an acoustic dipole, such as a small sphere of constant volume vibrating to-and-fro. It can be thought of as two adjacent monopoles 180° out of phase. For a dipole, the intensity of sound varies with both direction and distance. For example, at 90° to the dipole axis the sounds produced by the two poles cancel.

The area of the spherical wavefront produced by a sound source is 4πr². Therefore, the intensity, I, of sound at distance r from an acoustic monopole of power P is:

    I(r) = P / (4πr²)    (4)

But sound is also absorbed by the acoustic medium, according to an exponential function of distance. If the coefficient of absorption is α, the intensity at distance r becomes:

    I(r) = (P / 4πr²) e^(−αr)    (5)
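To see how the two loss terms in equation (5) trade off, here is a small Python sketch. The absorption coefficients are invented placeholders, chosen only to increase with frequency in the way the next paragraph describes; they are not measured values.

```python
import math

def received_intensity(power, r, alpha):
    """Equation 5: spherical spreading times exponential absorption.
    power is source power, r distance in meters, alpha absorption (1/m)."""
    return power * math.exp(-alpha * r) / (4 * math.pi * r ** 2)

def loss_db(r, r0, alpha):
    """Loss in dB between distances r0 and r for the same source."""
    return 10 * math.log10(received_intensity(1.0, r0, alpha)
                           / received_intensity(1.0, r, alpha))

# Hypothetical absorption coefficients, increasing with frequency.
bands = [("low freq", 0.0005), ("mid freq", 0.003), ("high freq", 0.012)]
for label, alpha in bands:
    print(label,
          f"1 to 10 m: {loss_db(10, 1, alpha):.1f} dB,",
          f"1 to 100 m: {loss_db(100, 1, alpha):.1f} dB")
```

Over the first 10 m nearly all the loss is due to spreading (about 20 dB in every band); by 100 m the high-frequency band has lost roughly 5 dB more than the low-frequency band, so the extra absorption loss grows with both distance and frequency.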

At short distances the loss of intensity is dominated by the spherical spreading of the wavefront. At longer distances the effect of absorption becomes dominant. We shall see in the next section that high-frequency sounds are absorbed more than low-frequency sounds. For all sound sources, obstacles and reflecting surfaces may also distort the wavefront.

Sound intensity at the ear does not uniquely indicate the absolute distance of a source, because it depends on the power of the source as well as on its distance. The sound level at the ear indicates absolute distance only if the observer knows something about the intensity of the source. The loudness of a familiar sound of reasonably fixed intensity, such as a speaking voice or a car, gives some idea of its distance.

The speaking voice provides a useful stimulus for investigating the effect of sound level on perceived distance because it is a very familiar sound. However, speech can vary in voice level, or vocal effort, between whispering and shouting. Whispering and shouting differ in loudness and also in their frequency composition. For example, a whisper lacks low frequencies because there is no vocal-cord voicing. Normally, an increase in voice level increases the volume of the fundamental frequency and that of the first harmonic. It also increases high frequencies relative to low frequencies (Lienard and Benedetto 1999; Traunmüller and Eriksson 2000). However, singers and actors can increase the voice level with little change in the fundamental frequency.

Changes in sound intensity due to changes in voice level may be measured as the power of the sound signal just in front of the speaker's mouth. Brungart and Scott (2001) called the voice level the production level of speech, to distinguish it from the presentation level, which is the sound level at which a recorded voice of fixed voice level is

presented at the ear of a listener. Note that a change in voice level changes spectral content as well as loudness. A whispered voice contains more high-frequency sounds and fewer low-frequency sounds than a loud voice. This is why a listener can distinguish the production level of recorded speech from its presentation level. Normally, a whisper (low production level) comes from a nearby speaker, while a shout (high production level) comes from a distant speaker. On the other hand, any type of speech seems to come from a more distant speaker as the presentation level is decreased, either because the loudspeaker volume is reduced or because the source is moved further away. Thus, a decrease in voice level should make a source seem nearer, while a decrease in presentation level should make it seem more distant.

Gardner (1969) presented evidence for the opposed effects of voice level and presentation level of speech on judgments of distance. He presented recorded speech for several seconds at various intensity levels from loudspeakers at a distance of 10 ft or 30 ft in an anechoic room. Subjects indicated which of five visible loudspeakers at distances between 10 and 30 ft was the source. Whatever the actual distance of the source, subjects reported that the lowest-level sound came from near the loudspeaker 30 ft away and that the loudest sound came from the loudspeaker 10 ft away. Thus, subjects scaled relative distance by loudness and absolute distance by the visual locations of the seen loudspeakers. This is an example of visual capture of sound (Section 35.6.1). Blindfolded subjects could accurately judge the distance, on a four-point scale, of a real person speaking at a normal level. However, subjects underestimated the distance of a person whispering and overestimated the distance of a person shouting. Presumably, this is because a person whispering is usually near and a person shouting is usually at some distance.

In an anechoic room, Brungart and Scott (2001) recorded speech samples spoken at various voice levels, from a whisper to a loud shout. The samples were presented through headphones at various intensities to subjects sitting outside in a field. Subjects indicated which of nine numbered visual markers was the location of the sound source. For loud voice levels, distance judgments doubled for each 8-dB increase in voice level. For soft voice levels, judgments doubled for each 15-dB increase. For high voice levels, distance judgments doubled for each 12-dB decrease in the intensity of the sound emitted by the headphones. Emitted intensity had little effect on the judged distance of whispered speech. Thus, shouted speech is judged to come from a greater distance than whispered speech, but it is judged to be more distant when its intensity at the ear is reduced.

In the above experiments, sounds were initially recorded in an anechoic room, in which the only information about the distance of a sound source is loudness and voice quality. Philbeck and Mershon (2002) recorded a man and a woman whispering, speaking at a normal level, and shouting.





Blindfolded subjects judged the distances of the recordings played from a loudspeaker in a carpeted room. When heard for the first time, the whispered voice was judged to be at 0.76 m, the midlevel voice at 1.52 m, and the shouted voice at 3.0 m. This shows that long-term familiarity with the three levels of the human voice is used for judging absolute distance. On subsequent judgments the reported distance of the shouted voice, but not of the whispered or midlevel voices, increased. This shows that short-term experience also influences how people base distance judgments on voice level. For mid- to high-voiced speech the distance judgments doubled with each 8-dB increase in voice level or with each 12-dB decrease in presentation intensity.
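These doubling rules can be combined into a toy predictive model. The sketch below is our illustration, not a model the authors fitted: the baseline distance and the assumption that the two effects multiply are ours, while the 8-dB and 12-dB constants are the values reported above.

```python
def judged_distance(base_m, voice_db_change, presentation_db_change,
                    voice_doubling_db=8.0, presentation_halving_db=12.0):
    """Toy model of the reported doubling rules: judged distance doubles
    for each voice_doubling_db increase in voice (production) level and
    for each presentation_halving_db decrease in presentation level.
    base_m is an assumed baseline judgment in meters."""
    return (base_m
            * 2 ** (voice_db_change / voice_doubling_db)
            * 2 ** (-presentation_db_change / presentation_halving_db))

# A shout 16 dB above a baseline voice level, at the same presentation
# level, should seem about four times as far away.
print(judged_distance(1.5, 16, 0))   # 6.0 m
# Reducing presentation level by 12 dB doubles judged distance.
print(judged_distance(1.5, 0, -12))  # 3.0 m
```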

35.2.1b Loudness and Relative Distance

Warren et al. (1958) proposed that changes in the intensity of a sound associated with changes in its distance provide a scale for judging relative sound intensities. Subjects judged when a reduction in the intensity of a test sound emitted by a loudspeaker made it half as loud as a standard sound. They then judged when a reduction in the intensity of the test sound made it appear twice as far away as the standard. For both pure tones and a speaking voice, the required reduction in intensity was the same for the two types of judgment. However, the reduction required to make the sound half as loud, or twice as far away, was greater for the voice than for the tone. A complex sound like a voice contains cues to distance not present in a pure tone. These unchanging cues would indicate that the voice had not changed in distance when its intensity was changed.

This evidence suggests that people scale the loudness of equidistant sounds in the same way that they scale distance in terms of loudness. The following evidence supports this conclusion. Stevens and Guirao (1962) asked subjects to estimate the loudness of each of a series of 1000-Hz tones presented through headphones. Subjects were then asked to estimate the distance of each tone using the same 100-point scale. The two sets of judgments were highly correlated, such that louder tones were judged nearer than softer tones. Petersen (1990) found a similar relationship between intensity and distance judgments for tones presented from a loudspeaker in an anechoic chamber.

When intensity is the only cue to the distance of a sound source, one would expect the ability to discriminate a difference in distance between two sounds to be limited by the threshold for discriminating a difference in loudness. This is the pressure-discrimination hypothesis. Miller (1947) reported that subjects could detect, 50% of the time, a change in sound pressure of 0.41 dB at a level of 30 dB SPL. This corresponds to a Weber fraction of 0.048. At levels below 20 dB, the threshold rose steeply, presumably because of the masking effects of noise in the auditory system. The threshold for pure tones was similar to that for



white noise (Scharf and Buus 1986). A 0.4-dB change in loudness at the ear is produced by a 5% change in the distance of the source. A sound source of constant power increases in loudness at the ear as it approaches. The rising level signifies that the source is approaching, provided that the loudness of the source itself is perceived as constant.

Békésy (1938, 1949) reported that changing the sound pressure from a source at a fixed distance changed the apparent distance of the sound. Simpson and Stanton (1973) moved a sound source in a room toward or away from the subject at an irregular rate of between 1 and 3 cm/s until the direction of motion was detected. A source at 61 cm had to be moved 20 cm before the motion was detected, a difference of 33%.

In the above procedure, the stimulus moved gradually and there was no feedback. In measuring discrimination thresholds for loudness, stimuli are usually presented in quick succession with error feedback. Strybel and Perrott (1984) measured depth-discrimination thresholds by the method of limits. A 0.5-s noise was presented at a fixed distance, followed by a second 0.5-s noise at stepped distances nearer than or beyond the first noise. Subjects indicated when they detected a change in the relative distances of the noises. For distances between 0.61 and 4.9 m, changes of between 3% and 7% were detected. But at distances below 0.3 m the discrimination threshold increased steeply.

The method of limits used by Strybel and Perrott may have elevated the threshold at short distances, because any reluctance to respond to the changes in relative distance would produce a greater proportional error at near distances than at far distances. Ashmead et al. (1990) avoided this problem by presenting the different distances in random order and using a forced-choice procedure in which subjects reported which of two sequentially presented sounds was nearer. Broadband sounds were presented for 0.5 s in an anechoic chamber. Subjects made depth judgments with respect to each of two reference sounds, at 1 and 2 m, with feedback after each trial. On average, the subjects detected a 5.8% difference in depth. The threshold rose to 16% when the intensity of the sound sources was adjusted to keep sound pressure at the ears constant. Some unspecified distance information other than intensity differences must have been present. Ashmead et al. concluded that, even at short distances, depth discrimination was as good as intensity discrimination.

However, with a forced-choice procedure with error feedback, subjects rely on whatever information is available, even if they do not perceive the stimulus feature that they have been asked to detect. They may have discriminated intensity differences rather than depth differences. If so, a 5.8% threshold is not surprising. It is still possible that, at near distances, the threshold for detecting a difference in distance exceeds the threshold for detecting a difference in loudness. Perhaps, at near distances,


people are reluctant to interpret small differences in loudness as differences in distance.
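The pressure-discrimination hypothesis is essentially a unit conversion between level JNDs and distance JNDs under the 1/r law of equation (3). A minimal Python sketch (the function names are ours) makes the arithmetic explicit:

```python
import math

def fractional_change_from_db(delta_db):
    """Fractional change in sound pressure, or equivalently in source
    distance under the 1/r pressure law of equation (3), that corresponds
    to a level change of delta_db decibels."""
    return 10 ** (delta_db / 20) - 1

def db_from_fractional_change(fraction):
    """Inverse mapping: level change in dB for a given fractional change."""
    return 20 * math.log10(1 + fraction)

# Miller's 0.41-dB pressure JND corresponds to a Weber fraction of ~0.048,
# which the 1/r law maps onto roughly a 5% change in source distance.
print(round(fractional_change_from_db(0.41), 3))   # 0.048
# Ashmead et al.'s 5.8% depth threshold corresponds to about 0.49 dB.
print(round(db_from_fractional_change(0.058), 2))  # 0.49
```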

35.2.1c Loudness Constancy

In a perceptual constancy, an attribute of a distal stimulus is judged as constant when the corresponding proximal stimulus changes because of a change in some other attribute of the stimulus. In loudness constancy, the loudness of a sound source is judged as constant when loudness at the ear changes because of a change in the distance of the source.

One approach to loudness constancy is to ask whether a source of fixed intensity is judged as constant when placed at different distances. As with visual size constancy (Section 29.2), one would expect an improvement in loudness constancy as more information about the distance of the source is provided. Mohrmann (1939) had subjects equate the loudness of sounds from two loudspeakers placed at different distances. The sources were a pure tone, a metronome, noise, music, and speech. The loudspeakers were either in view or in darkness. Nearly full loudness constancy was obtained for speech, with or without sight of the loudspeakers. Constancy was less for the other sources, especially when they could not be seen. Constancy was higher for louder sounds. When subjects judged the loudness of the sounds at the ear, their estimates were pulled in the direction of constancy when they could see the sources. For judgments of the loudness at the ear of the pure tone in the dark, the degree of constancy was close to zero.

A second approach to loudness constancy is to ask whether a change in the perceived distance of a fixed source affects its perceived loudness. The phenomenon of visual capture of sound described in Section 35.6.1 provides a way to vary the perceived distance of a sound source without changing the sound in any way. Mershon et al. (1981) placed a hidden loudspeaker at a distance of 420 cm in either an anechoic room or a semireverberant room. A fully visible dummy loudspeaker was placed at a distance of 75, 225, or 375 cm in the subject's midline. Subjects used a rating scale to judge the loudness of 5-second samples of broadband noise emitted by the real, fixed loudspeaker at 40, 55, and 70 dB. Most subjects reported that the sound seemed to originate in the dummy speaker. In the semireverberant room, the mean judged loudness of the sounds on a 1,000-point scale increased from about 250 to 350 as the distance of the dummy speaker increased from 75 to 375 cm. Thus, an increase in the perceived distance of the fixed sound source increased its judged loudness, even though the actual loudness of the source did not change.

Zahorik and Wightman (2001) recorded 50-ms bursts of white noise by microphones in the ear canals of humans facing a loudspeaker at distances between 0.3 and 13.8 m. These recordings were used to synthesize sounds that were presented to the same subjects through headphones. Subjects judged the loudness of the sound sources as

constant in spite of the variation in intensity level at the ears. In other words, good loudness constancy was achieved for an unfamiliar sound containing the qualities arising from a reverberant room and reflections from the pinnae. Subjects were also asked to judge the distances of the sounds. Although distance judgments increased with real distance, the mean exponent was only 0.45. In other words, subjects underestimated the distances of the sounds in spite of achieving good loudness constancy. Zahorik and Wightman found that, although the overall energy of sound at the ears decreased with increasing distance, the energy of the reverberant component remained approximately constant. They suggested that subjects based their judgments of constant loudness on the approximately constant reverberant energy (see Section 35.2.3).
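A mean exponent of 0.45 means that judged distance grew roughly as the square root of physical distance. A short power-law sketch (the scale factor is an assumed free parameter, not a fitted value) shows the resulting compression over the range of distances used:

```python
def judged_distance_power_law(physical_m, exponent=0.45, scale=1.0):
    """Stevens-style power law for judged distance:
    judged = scale * physical ** exponent.
    The 0.45 exponent is the mean value reported above; scale is assumed."""
    return scale * physical_m ** exponent

for d in (0.3, 1.0, 3.0, 13.8):
    print(f"physical {d:>5} m, judged {judged_distance_power_law(d):.2f}")
# 0.3 m is judged near 0.58, while 13.8 m is judged near 3.26:
# far distances are compressed far more than near ones.
```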

35.2.2 CHANGES IN THE FREQUENCY SPECTRUM

As a sound is transmitted through air, high frequencies are attenuated more than low frequencies. This provides information about absolute distance only for a sound with a known spectral composition. At 50% humidity and 20°C, the absorption per 100 ft is approximately 0.2 dB at 1,000 Hz, 1 dB at 4,000 Hz, and 4.7 dB at 10,000 Hz (Coleman 1968). The effects of humidity, rain, and fog on sound propagation are so small that they can be ignored.

During the day, the temperature of the air decreases with height. The temperature gradient refracts sound waves upward, so that the sound may not be heard at some distance from the source. For example, a vertical temperature gradient of 1°C per 100 m creates a quiet region at a horizontal distance of 1,200 m. Sounds become more audible at night because the temperature gradient is usually inverted. Wind gradients have similar effects, which can add to or subtract from the effects of temperature gradients. Wind turbulence due to gustiness can cause sound intensity to fluctuate by up to 20 dB. The fluctuations increase with the frequency of the sound. Absorption of sound by the ground can cause considerable loss of intensity for a listener near the ground. This attenuation is greatest at a certain frequency, which varies with the height of the source above the ground (Ingard 1953).

The loudness of a sound increases as the source comes nearer. But, as loudness increases, the auditory system gives increasingly more weight to low-frequency components relative to high-frequency components. Thus, as a low-frequency tone approaches, it increases in absolute strength and in strength relative to high-frequency tones.

Coleman (1962) presented bursts of white noise from loudspeakers placed in the midline in an open space at 2-ft intervals between 5 ft and 33 ft. Initially, subjects reported that all sounds came from the near loudspeakers. However, as they became familiar with the task, they began to make reasonably accurate judgments, although with a persistent





tendency toward underestimation of distance. Even for an unfamiliar sound at a fixed distance, a reduction of the high-frequency components caused an increase in the apparent distance of the sound.

Coleman (1968) placed numbered loudspeakers at 2-ft intervals at distances between 8 and 28 ft in a large room lined with acoustic tiles. Subjects reported from which speaker a 0.1-ms pulse of sound came. A sound with a high-frequency cutoff at 7.68 kHz was judged to come from a more distant speaker than a sound with a cutoff at 10.56 kHz. With this procedure, sound intensity would also serve as a cue to depth.

Butler et al. (1980) recorded bursts of sound through microphones inserted into the ear canals of a model head. The sound source was off to one side at a distance of 5 ft from the model head, in either an anechoic chamber or a chamber with reflecting walls. The sound source was either broadband, high-pass, or low-pass. Subjects heard the recorded sounds through one earphone (monaural) or two (binaural) and judged their distances in feet. The results are shown in Figure 35.1. High-pass sounds were judged to be nearer than low-pass sounds, with both monaural and binaural listening.

Little et al. (1992) presented a wide-band noise for 2 s from a loudspeaker at a distance of 3 m in a room with sound-absorbing panels. The sound had low-pass cutoff frequencies of 5.0, 6.0, or 6.7 kHz. Subjects could not judge the distance of a sound on its initial presentation. On average, after several presentations, subjects judged the low-cutoff sound to be at 2.5 m and the high-cutoff sound to be

[Figure 35.1. Judged distance (ft) as a function of the spectral content of the sound (>60, >40, >20, and broadband) for four listening conditions: echoic monaural, echoic binaural, anechoic binaural, and anechoic monaural. Results of Butler et al. (1980).]