APA Handbook of Behavior Analysis, Volume 1: Methods and Principles
Table of contents :
I. Overview

Single-Case Research Methods: An Overview
Iver H. Iversen
The Five Pillars of the Experimental Analysis of Behavior
Kennon A. Lattal
Translational Research in Behavior Analysis
William V. Dube
Applied Behavior Analysis
Dorothea C. Lerman, Brian A. Iwata, and Gregory P. Hanley


II. Single-Case Research Designs

Single-Case Experimental Designs
Michael Perone and Daniel E. Hursh
Observation and Measurement in Behavior Analysis
Raymond G. Miltenberger and Timothy M. Weil
Generality and Generalization of Research Findings
Marc N. Branch and Henry S. Pennypacker
Single-Case Research Designs and the Scientist-Practitioner Ideal in Applied Psychology
Neville M. Blampied
Visual Analysis in Single-Case Research
Jason C. Bourret and Cynthia J. Pietras
Quantitative Description of Environment–Behavior Relations
Jesse Dallery and Paul L. Soto
Time-Series Statistical Analysis of Single-Case Data
Jeffrey J. Borckardt, Michael R. Nash, Wendy Balliet, Sarah Galloway, and Alok Madan
New Methods for Sequential Behavior Analysis
Peter C. M. Molenaar and Tamara Goode


III. The Experimental Analysis of Behavior

Pavlovian Conditioning
K. Matthew Lattal
The Allocation of Operant Behavior
Randolph C. Grace and Andrew D. Hucks
Behavioral Neuroscience
David W. Schaal
Stimulus Control and Stimulus Class Formation
Peter J. Urcuioli
Attention and Conditioned Reinforcement
Timothy A. Shahan
Remembering and Forgetting
K. Geoffrey White
The Logic and Illogic of Human Reasoning
Edmund Fantino and Stephanie Stolarz-Fantino
Self-Control and Altruism
Matthew L. Locey, Bryan A. Jones, and Howard Rachlin
Behavior in Relation to Aversive Events: Punishment and Negative Reinforcement
Philip N. Hineline and Jesús Rosales-Ruiz
Operant Variability
Allen Neuringer and Greg Jensen
Behavioral Pharmacology
Gail Winger and James H. Woods


Chapter 1

Single-Case Research Methods: An Overview

Iver H. Iversen

I thank Dominik Guess and Wendon Henton for valuable comments on earlier versions of this chapter.

DOI: 10.1037/13937-001
APA Handbook of Behavior Analysis: Vol. 1. Methods and Principles, G. J. Madden (Editor-in-Chief)
Copyright © 2013 by the American Psychological Association. All rights reserved.

My experiments had indeed gone well. I was getting data from a single rat that were more orderly and reproducible than the averages of large groups in mazes and discrimination boxes, and a few principles seemed to be covering a lot of ground. (Skinner, 1979, p. 114)

Replication is the essence of believability. (Baer, Wolf, & Risley, 1968, p. 95)

Single-case research methods refer to a vast collection of procedures for conducting behavioral research with individual subjects. Such methods are used in basic research and for addressing behavioral problems through educational and therapeutic interventions. Analyses and interpretations of data collected with research methods for individual subjects have developed into procedures that are considerably different from those used in research with groups of subjects. In this chapter, I provide an overview of designs, analyses, and interpretations of research results and treatment outcomes using single-case research methods.

Background and History

The case in single-case research methods essentially refers to a unit of analysis for an individual, a few people in a group, or a large group with varying membership. Single-case research methods should be contrasted with single-case studies, which ordinarily consist of anecdotal narrations of what happened to a given person. No treatments or manipulations of experimental variables take place in case studies.

Research Methods Using Single Subjects

The single-case research method involves repeated measures of one individual's behavior before, during, and often after an experimental, educational, or therapeutic intervention. Data are collected repeatedly over several observation periods or sessions in what is customarily called a time series. The objective is to change a behavior to determine the variables that control that behavior. When environmental variables have been found that reliably change behavior, the method can be used to control behavior. The investigator, educator, or therapist can make the behavior start and stop and change frequency or duration at a specific time and place. Therefore, single-case methods provide a tool for a science of behavior at the level of the individual subject.

When an educator or therapist can effectively control the client's behavior, then methods exist for helping that client acquire new behavior or overcome a problem with existing behavior. This ability to help a client by using methods to control the client's behavior is exactly what often creates controversy around the use of such methods. The methods for controlling behavior and for helping the client are the same, but the verbs control and help are not synonymous. To control behavior means to be able to change behavior reliably, and in this context control has a technical meaning originating from the laboratory. The word control, however, also has political and societal meanings related to authoritative restrictions of behavior for the individual. The intended helping function of an applied behavior
science is the opposite—to establish enrichment and expansion of the individual's behavioral repertoire, not to restrict it. This complex issue has led to several considerations regarding the ethics involved in helping a client. For example, Skinner (1978) argued that a client who needs help to obtain essential goods for survival should have a right to acquire a behavior that can provide the goods rather than merely be provided the goods regardless of any behavior. Similarly, Van Houten et al. (1988) stated that

individuals who are recipients . . . of treatment designed to change their behavior have the right to a therapeutic environment, services whose overriding goal is personal welfare, treatment by a competent behavior analyst, programs that teach functional skills, behavioral assessment and ongoing evaluation, and the most effective treatment procedures available. (p. 381)

Because of the overall success of single-case research methods, a plethora of articles, book chapters, and entire books devoted to the topic have appeared over the past 40 years. Recent publications in this vast literature illustrate the wide use of single-case research methods: Specific topics are basic methodology (Barlow, Nock, & Hersen, 2009; J. M. Johnston & Pennypacker, 2009), educational research (Kennedy, 2005; Sulzer-Azaroff & Mayer, 1991), health sciences (Morgan & Morgan, 2009), clinical and applied settings (Kazdin, 2011), community settings (O'Neill, McDonnell, Billingsly, & Jenson, 2010), and medicine (Janosky, Leininger, Hoerger, & Libkuman, 2009).

The term single-case research methods is synonymous with a variety of related terms, the most common of which are single-subject designs and N = 1 designs. The last term should be avoided because it is misleading. N = 1 obviously means the use of only one subject. However, N has an entirely different meaning in statistics, where it stands for the number of data points collected, not for the number of subjects. In customary group research, each subject contributes exactly one data point. In contrast, single-case research methods generate a high
number of data points for each individual because of the repeated observations of that individual’s behavior. Therefore, it is incorrect and misleading to refer to the single-case research method as an N = 1 design, falsely implying that only one data point has been collected and that the researcher is trying to promote a finding based on a single data point.

Glimpses Into the History of Single-Case Research Methods

Figure 1.1. Illustration of an early single-case research method. Detailed data for a single dog from Pavlov's experiments on classical conditioning. Positive tone is followed by food; negative tone is not. From Lectures on Conditioned Reflexes: Twenty-Five Years of Objective Study of the Higher Nervous Activity (Behaviour) of Animals (Vol. 1, p. 173), by I. P. Pavlov, 1928, New York, NY: International Publishers. In the public domain.

I. P. Pavlov's (1927) Conditioned Reflexes had a major influence on B. F. Skinner's decision to study psychology and on his choice of research methodology (Catania & Laties, 1999; Iversen, 1992; Skinner, 1966). Skinner was impressed with Pavlov's precise, quantitative measures of behavior in one organism at a time. Thus, Skinner (1956) once wrote, "I had the clue from Pavlov: Control your conditions and you will see order" (p. 223). Pavlov described in great detail control conditions and recording of individual drops of saliva at a given time of day for a single animal. Figure 1.1 shows an early version of a single-case research method used by Pavlov (1928). The dog, Krasavets, had previously been conditioned to a positive tone (positive stimulus, or S+) with food powder. Over nine trials, Pavlov alternated the S+ with a negative tone (negative stimulus, or S−) that had not been conditioned to food. The S+ elicited several drops of saliva from two glands, and the S− elicited no saliva. The
a­ lternation was irregular by design so that Pavlov could examine what happened to the conditioned reflex to S+ after several presentations of S−; indeed, the elicitation was reduced, as can be seen for S+ at 2:32. Thus, at the level of the individual dog, Pavlov first demonstrated a baseline of reliable elicitation of saliva to S+ and then demonstrated that repeated, successive presentations of S− inhibited the flow of saliva on the next presentation of S+. Pavlov’s work in Russia and contemporary work in Europe by Wundt on perception and by Weber and Fechner on psychophysics grew from physiology, in which the customary method was to investigate the effects of independent variables on individual organisms (Boring, 1929). In Europe, Ebbinghaus (1885/1913) became famous for his memory studies using a single subject (himself). In the United States, Thorndike’s (1911) research on the law of effect with individual cats quickly became well-known (Boakes, 1984). Apparently, Thorndike was ahead of Pavlov by a few years. Thus, Pavlov (1928) wrote, Some years after the beginning of the work with our new method I learned that somewhat similar experiments on animals had been performed in America, and indeed not by physiologists but by psychologists. Thereupon I studied in more detail the American publications, and now I must acknowledge that the honour of having made the first steps along this path belongs to E. L. Thorndike. By two or three years his experiments preceded ours. (pp. 39–40) Pavlov’s work apparently did not appear in English until a paper in Science (Pavlov, 1906) and the translation of Conditioned Reflexes (Pavlov, 1927), long after Thorndike had finished his early animal research. In addition, Jacque Loeb, educated in the German tradition of physiology of individual animals (e.g., Loeb, 1900), had a major influence on Pavlov, Thorndike, Watson, and Skinner (Greenspan & Baars, 2005). At the risk of digressing too far into the history of psychology, I cannot help mentioning the work of the French physiologist Claude Bernard
(1865/1957). Nearly 150 years ago, and long before Loeb, Pavlov, Thorndike, Watson, and Skinner, he articulated in clear terms the need for what he called comparative experiments with individual, intact animals. When using several animals for comparison, he realized that the errors in his data stemmed from variations from animal to animal, and he wrote that “to remove this source of error, I was forced to make the whole experiment on the same animal . . . because in this respect two frogs are not always comparable” (p. 183). Bernard’s work appeared in Russian, and his work had a major influence on ­Russian physiologists while Pavlov was a young researcher. Of particular historical significance, Pavlov’s main professor, Botkin, had been a student of Bernard’s (see Paré, 1990), and Pavlov expressed the greatest admiration for Bernard’s experimental approaches (Todes, 2002). Indeed, according to Wood (2004), Pavlov was an apostle of Bernard’s. As for single-case methods with humans, Watson and Rayner (1920) and Jones (1924) demonstrated conditioning and extinction of fear reactions in infants (i.e., “little Albert” and “little Peter”). The methods were crude, and the data were qualitative descriptions with scant operational definitions. However, multiple training conditions, repetition of test conditions, and tests for transfer made these studies influential in psychology (Harris, 1979). Thorndike (1927) reported an impressive laboratory study with adults that serves as a very early model of the A-B-A design (see below). Subjects received instructions such as “Draw an x-inch line,” where x was 3, 4, 5, or 6. Such instructions were presented in random order, and, being blindfolded, the subjects never saw what they drew. The consequence, or effect, of drawing was the experimenter saying either “right” or “wrong.” Each subject had one early session with no effect delivered, then seven training sessions with an effect, and then one more late session without an effect. Figure 1.2 shows average data for all subjects and individual data for two subjects. All subjects improved accuracy of line drawing during training but dropped in accuracy when the effect was removed in the late test; 16 of 24 subjects had a gain compared with their scores in the early test (e. g., Subject 29), whereas eight subjects had no gain and instead a drop (e.g., 5
S­ ubject 42) compared with the early test. Thorndike concluded that the effect was responsible for the gain in line-drawing accuracy. In the area of motor learning and control, this experiment is considered a classic method for demonstration of the effects of feedback on performance and learning (e.g., Schmidt & Lee, 2005). With the advent of Skinner’s (1938) rigorous experimental methods featuring operationally defined measures of behavior and highly controlled conditions for individual animal subjects, singlecase research methods developed rapidly and laid the foundation for successful application to humans, which began around the 1950s. For examples of early behavior research and therapy using humans (i.e., 1950–1965), see the collections of articles in Eysenck (1960) and Ullman and Krasner (1965). In addition, Wolpe (1958) developed single-case methods for treatment of phobias in humans. The methods of behavior analysis for individual subjects were laid out clearly in influential texts by Keller and Schoenfeld (1950), Skinner (1953), and Bijou and Baer (1961). The Journal of the Experimental Analysis of Behavior was founded in 1958 and published experiments based on single-case methodology using both human and nonhuman subjects. When the Journal of Applied Behavior Analysis appeared in 1968, single-case research methods were further established as important research tools that were not for animals only. For a more thorough history of behavior analysis and single-case research methods, see Barlow et al. (2009), Kazdin (2011), and Blampied (1999).

Figure 1.2. Example of a historically early A-B-A design for all subjects (top) and for two subjects (middle and bottom). Twenty-four blindfolded human subjects drew lines. Percentage correct expresses how many lines were drawn within criterion length (3, 4, 5, or 6 inches). In early and late tests (Thorndike's terms), no consequence was presented to the subjects, whereas during training subjects were presented with the experimenter's saying "right" or "wrong" depending on their performance. Data from Thorndike (1927).

Scientific Method

One important method, namely that of comparing collected data from at least two different conditions, is common across different research disciplines. Reaching this point has not come easy. The history of science is full of anecdotes that illustrate the struggle a particular scientist had in convincing contemporary scholars of new findings. Boorstin (1983) related how Galileo had to go through painstaking steps of comparing different conditions to demonstrate that through his telescope one could in fact see how things at a great distance looked. In about 1650, Galileo first aimed the telescope at buildings
and other earthly objects and then made a drawing of what he saw. Then he walked to the location seen in the telescope and made a drawing of what he saw there. He compared such drawings over and over to demonstrate that the telescope worked. Using this method, Galileo could prepare his audience for an important inductive step. When the telescope was aimed at objects in the sky, the drawing made of what one saw through the telescope would reflect the structure of what existed at the faraway location. As is well-known from the history of science, many of Galileo’s contemporaries said that what they saw in the telescope was inside the telescope and not far away. Yet, his method of repeated comparisons of different viewing conditions was an important step in convincing at least some scientists of his time that a telescope was an instrument that could be used to view distant, unreachable objects. In a review of the history of experimental control, Boring (1954) gave a related example of how Pascal in 1648 tested whether a new instrument could measure barometric pressure. When a glass tube closed at one end was filled with mercury and then inverted with the open end immersed in a cup of mercury, a vacuum would form at the closed end of the tube. Pascal had the idea that the weight of air pressing on the mercury in the open cup influenced the height of the column of mercury. Hence, the length of the vacuum at the top of the tube should change at a higher altitude, where the weight of air was supposedly less than on the ground. Pascal sent family members with two such instruments made of glass tubes, cups, and mercury to take the measurements. Readings were first made for both instruments at the foot of a mountain. Then one person stayed with one instrument and took readings throughout the day. Another person carried the second instrument up the mountain and took measurements at different locations; when back at the foot of the mountain, measurements were taken again with both instruments. Clearly, the essence of the method is the comparison between one condition at the foot of the mountain, the control condition, and the elevated conditions. Because the control readings stayed the same throughout the day, whereas the elevated condition readings changed, proof existed that the instrument worked as intended. This experiment was a real-life
demonstration by Pascal of a scientific principle as well as of a method of testing.

Although these anecdotes are amusing several hundred years later for their extreme and cumbersome methods of testing, the practicing scientist of today still on occasion faces opposition to conclusions from experiments. Thus, behavior analysts sometimes find themselves in a position in which it is difficult to convince psychologists with different training that changes in contingencies of reinforcement can bring about large and robust behavior changes. For example, I once presented to colleagues in different areas of psychology some data that demonstrated very reliable and precise stimulus control of behavior in rats. A colleague objected that such mechanistic, on–off control of behavior is not psychology. Others in the audience were suspicious of the extremely low variability in the data, which was believed to have come from an equipment error and not from the method of controlling behavior.

The method of comparing two or more series of readings with the same measurement instrument under two or more different conditions is the hallmark of the scientific method (Boring, 1954). Whether the research design is a between-groups or a within-subject comparison of the effects of a manipulation of a variable on behavior, the shared element is always the comparisons of different conditions using the same measurement (see also Baer, 1993).

Designs for Single-Case Research Methods

A variety of designs have been developed for research with animals and humans. Because this chapter is an overview, I can only describe the most common designs. Chapter 5 of this volume covers designs in considerably more detail. Experimental designs are specific to the problem the researcher seeks to investigate. However, an experimental design is also like an interface between the researcher and the subject because information passes in both directions. An essential aspect of single-case research methods is the interaction between subject and researcher, educator, or therapist. Pavlov, Thorndike, and Skinner, as originators
of the single-case method, modified their apparatus and experimental procedures for individual animals depending on how the animal reacted to the experimental procedures (Skinner, 1956). The scientist modifies the subject’s behavior by experimental manipulations that, when reliable and replicable, constitute the foodstuff of science; equally as important for the process of discovery is the subject’s influence on the researcher’s behavior. Sidman (1960) described how the researcher needs to “listen” to how the data change in accordance with experimental manipulations. To discover laws of behavior, the experimenter needs to adjust methods depending on how they affect the subject’s behavior. In basic research, the experimental situation is often a dynamic exchange between researcher and subject that may lead to novel discoveries and designs. In applied behavior analysis, as successful experiments and treatments are replicated over time, designs tend to become relatively fixed and standardized. The strength of standardization is a high degree of internal validity and ease in communicating the design. However, the weakness of standardization is that researchers or therapists may primarily investigate problems that fit the standard designs (Sidman, 1981). Nonetheless, the standard designs covered here serve as basic building blocks in both basic research and application. My main focus is the underlying logic of each of the covered designs.

A-B Design

Figure 1.3. A-B design using hypothetical data.

The essence of single-case research methods is the comparison of repeated measures of the same behavior for the same individual for at least two different experimental, educational, or treatment conditions. Figure 1.3 shows a schematic of an A-B design, using hypothetical data. Phase A is repeated baseline recording of behavior across several observation periods (sessions) before any intervention takes place. The intervention is a change in the individual's environment, also across several sessions, usually in the form of a change in how reinforcement is provided contingent on behavior. The data in Phase A serve as a comparison with behavior changes in Phase B. However, the comparison is logically from an extension of data from Phase A to Phase B. Data in Phase A are used to forecast or predict what would have happened during the next sessions had the intervention in Phase B not been introduced. If the data in Phase A are without trends and of low variability and are taken across several sessions, then the experienced researcher or therapist predicts and assumes that this behavior will continue at that level if no interruptions in the environment occur. Thus, the dotted line in Phase B in Figure 1.3 indicates the projected trend of the behavior from Phase A to Phase B. The effect on behavior of introducing the change in Phase B is thus evaluated against the backdrop of the projected data from Phase A. Because forecast behavior cannot be measured, the
difference between Phase A and Phase B data constitutes the experimental effect, as indicated by the bracket. The validity of the statement that the intervention in Phase B caused the change in behavior from A to B thus depends not only on the change of behavior in Phase B but also on how good the prediction is that the behavior would have remained the same had the intervention not been introduced in Phase B. Ideally, the behavior should not change in Phase B unless the intervention took place. The change should not occur by chance. Because other changes in the environment might take place at the same time as the intervention is introduced, the researcher or therapist can never be sure whether the intervention or some other factor produces the behavior change. Such unintended factors are customarily called confounding variables. For example, a child may have an undesirable low rate of smiling in the baseline phase (A). Then when social reinforcement is explicitly made contingent on smiling in the intervention phase (B), the rate of smiling may increase above that in the baseline (A). If the sessions stretch out over several weeks, with one session each day, then a host of other changes could happen at the same time as the intervention. For example, a parent could return from a trip away from home, the child could recover from the flu, a bully in class may be away for a few weeks, and so on. Such other factors could have made the rate of smiling increase by themselves or in addition to the effect of providing social reinforcement for smiling. For additional considerations regarding the role of confounding variables, see Kazdin (1973). Because of the potential difficulty in controlling such confounding variables,
the A-B design by itself is uncommon in clinical and educational situations. However, the A-B design is very common in laboratory demonstrations, especially with animal subjects, where confounding variables can be controlled experimentally.
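
The comparison logic just described can be made concrete with a small calculation. The sketch below is not part of the chapter: it uses invented session data and assumes a least-squares trend as the way of projecting Phase A forward, then reports how far the observed Phase B data depart from that projection.

```python
from statistics import mean

def ab_effect(baseline, intervention):
    """Estimate an A-B effect: observed Phase B data minus the trend projected from Phase A."""
    n = len(baseline)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(baseline)
    # Least-squares slope and intercept of the Phase A (baseline) time series.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, baseline)) / sum((x - x_bar) ** 2 for x in xs)
    intercept = y_bar - slope * x_bar
    # Project the baseline trend forward into the Phase B sessions.
    projected = [intercept + slope * (n + i) for i in range(len(intervention))]
    diffs = [obs - proj for obs, proj in zip(intervention, projected)]
    return mean(diffs), projected

# Hypothetical session-by-session rates of a target behavior.
phase_a = [4, 5, 4, 6, 5, 5, 4, 5]          # baseline sessions
phase_b = [9, 11, 12, 12, 13, 12, 13, 14]   # intervention sessions
effect, projection = ab_effect(phase_a, phase_b)
print(f"Mean difference from projected baseline: {effect:.1f} responses per session")
```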

A-B-A Design

Figure 1.4. A-B-A design: a cumulative record of lever pressing by a single rat. A = baseline with response-independent reinforcement. B = continuous reinforcement; each lever press produced reinforcement. Second A = extinction. Data from Iversen (2010).

To demonstrate clear experimental control while reducing the potential influence of confounding variables, the experimenter can supplement the A-B design with a withdrawal of the change in Phase B and a return to the conditions of Phase A, as a back-to-baseline control. This design is therefore often called a withdrawal design. When the baseline measured in the first Phase A can be recovered in the second Phase A, after the intervention has changed the behavior in Phase B, then the researcher has shown control of behavior and demonstrated high internal validity. The behavior is changed by the intervention and changed back when the intervention is removed. The researcher therefore has full control over the behavior of the individual subject or client, which means that the behavior can be changed at the researcher's or therapist's discretion.

Figure 1.4 shows a cumulative record of lever pressing for a single rat during baseline, acquisition, and extinction of lever pressing within one session. The rapid increase in lever pressing when it produces food contrasts with the very low rate of lever pressing during the baseline phase when lever pressing did not produce food. The rapid decline in lever pressing when reinforcement is withdrawn demonstrates the control reinforcement had over the behavior. The experienced investigator comes to learn that when such clear control is obtained with
one individual, then it will also be found with other, similar individuals. In clinical and educational applications, the A-B-A design is useful to show that an intervention is effective and that a given behavior change is under the therapist's control and not caused by other factors. However, a critical problem arises with this design regarding clinical or educational significance for the participant and for the caregivers. If the therapist follows an A-B-A design and changes behavior from a low level in Phase A to a high level in Phase B and then back again to a low level in the second Phase A to demonstrate control, in the end there is no gain for the client or for the caregivers. The educator or therapist can hardly convince the caregivers that it is a great step forward to know that the behavior can now be brought back to the same level as when the participant came in for treatment. The A-B-A design has shown gain in control of the behavior but has produced no clinical or educational gain. Therefore, the A-B-A design is not useful as a stand-alone treatment in clinical or educational situations.

A concern with the A-B-A design is that behavior that changed in Phase B may not always show a reversal back to the level seen in the first Phase A when the intervention is withdrawn. For example, for motor skill acquisition such as cycling or walking, the acquired behavior may not drop back to the baseline level when reinforcement is removed in the second Phase A. The individual has acquired a new skill in Phase B that may lead to new contingent reinforcers that were not within reach before the skill was acquired. When behavior does not return to baseline level after the intervention is removed, the possibility also exists that the behavior change was produced by or aided by a confounding variable and not by the intervention.

The educator and therapist therefore face a dilemma. They want to help the client acquire a new behavior that is lacking or to suppress an existing unwanted behavior. They also want, however, to be able to communicate to other therapists and educators that they have developed a method to control the behavior of interest. To help the client, they desire the behavior change in Phase B to remain intact when the intervention is removed. To show
control over the behavior, they desire the behavior change in Phase B to not remain intact when the intervention is removed.

A-B-A-B Design

Figure 1.5. Top: A-B-A-B design with follow-up (post checks). From "A Computerized System for Selecting Responsive Teaching Studies, Catalogued Along Twenty-Eight Important Dimensions," by R. G. Fox, R. E. Copeland, J. W. Harris, H. J. Rieth, and R. V. Hall, in E. Ramp and G. Semb (Eds.), Behavior Analysis: Areas of Research and Application (p. 131), 1975, Englewood Cliffs, NJ: Prentice-Hall, Inc. Copyright 1975 by Prentice-Hall, Inc. Reprinted with permission. Bottom: Illustration of a within-session repeated A-B * N design. Event record showing onset of discriminative stimulus, first pen; response, second pen; reinforcement, third pen. Data are for one rat after discrimination training showing perfect discrimination performance. Data from Iversen (2010).

To resolve some of these difficulties, an additional phase can be added to the A-B-A design in which the intervention (B) is repeated, thereby forming an A-B-A-B design, often called a reversal-replication design or a double-replication design. Figure 1.5 (top) shows an example (R. G. Fox, Copeland, Harris, Rieth, & Hall, 1975) in which the number of math problems completed by one eighth-grade underachieving student changed when the teacher paid attention to her when she worked on the assignments. The number of problems completed shows a gradual and very large change in the treatment condition. Then, when the baseline was reinstated, the number dropped to baseline levels only to increase again when the treatment was reintroduced. Because the number of problems completed is high in both treatment (B) phases and low in both baseline (A) phases, the data demonstrate control by the intervention of reinforcing completion of math problems in a single individual child. This design satisfies both the need to demonstrate control of behavior and the need to help the client because the behavior is at the improved level when the treatment ends after the last Phase B. This off–on–off–on logic of the A-B-A-B design mimics daily situations in which people determine whether something works or not by turning it on and off a few times.

The A-B-A-B design shows that control of behavior can be repeated for the same individual—that is, intrasubject replication. This ability of the researcher to replicate the behavioral change in a single individual provides a tremendous source of knowledge about the conditions that control behavior. Such results are of practical significance for teachers and family or caregivers because as Baer et al. (1968) succinctly stated, "Replication is the essence of believability" (p. 95).

The A-B-A-B design is customarily shown as a time series with several sessions in each phase. The underlying off–on–off–on logic also exists in other procedures in which two conditions alternate several times within a session, as in multiple schedules (e.g., two or more schedules of reinforcement
alternate after a few minutes, each under a separate stimulus). Such A-B-A-B-A-B-A-B . . . or A-B * N designs demonstrate powerful, repeated behavior control. The A-B * N design can also be implemented at the moment-to-moment level. Figure 1.5, bottom, shows a sample of an event record in which a rat promptly presses a lever each time a light turns on and almost never presses it when the light is off. The A-B * N design can itself serve as a baseline for other designs. For example, this discrimination procedure can serve as a baseline for assessment of the effects of other factors such as food deprivation, stimulus factors, drugs, and so forth. The method of repeated A-B * N changes within a session can also serve as a baseline for comparison with the outcome
on occasional test trials in which a stimulus is altered, such as, for example, in the determination of stimulus generalization. With various modifications, the repeated A-B changes within a session also form the basis of methods used in research and education, such as the matching-to-sample procedure (see Discrete-Trial Designs section, below).
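
As a minimal illustration of the off–on–off–on logic, the following sketch (invented numbers and illustrative variable names, not data from the studies cited above) groups a hypothetical A-B-A-B series by phase and prints each phase mean; under experimental control, both A phases should sit near the original baseline level and both B phases near the treatment level.

```python
from itertools import groupby
from statistics import mean

# Hypothetical A-B-A-B series: (phase label, responses per session).
sessions = [
    ("A", 3), ("A", 4), ("A", 3), ("A", 4),      # baseline
    ("B", 9), ("B", 11), ("B", 12), ("B", 12),   # intervention
    ("A", 5), ("A", 4), ("A", 3), ("A", 4),      # withdrawal (return to baseline)
    ("B", 10), ("B", 12), ("B", 13), ("B", 13),  # reintroduction of the intervention
]

# Summarize each consecutive phase; control is suggested when behavior rises in
# every B phase and falls back toward the baseline level in every A phase.
for i, (label, group) in enumerate(groupby(sessions, key=lambda s: s[0]), start=1):
    values = [v for _, v in group]
    print(f"Phase {i} ({label}): mean = {mean(values):.1f} over {len(values)} sessions")
```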

Multiple-Baseline Designs

A popular expansion of the A-B design is the multiple-baseline design. This design is used when reversing the behavior back to the baseline level is not desirable from an educational or therapeutic perspective. After an undesirable behavior has been changed to a desirable level, educators, therapists,
and caregivers are reluctant to bring the behavior back to its former undesirable level to prove that it was the intervention that changed the behavior to the acceptable level. Indeed, it may be considered unethical to force the removal of a desirable behavior acquired by a person with disability.

With the multiple-baseline design, the possible influence of confounding variables is not assessed by withdrawing the intervention condition as with the A-B-A and A-B-A-B designs. Instead, data are compared with simultaneously running baselines for other behaviors, situations, or individuals. The multiple-baseline design probably had its formal origin in Baer et al. (1968). Multiple-baseline designs are not ordinarily used with animal subjects because reversals to baseline and replications are not undesirable or unethical.

Figure 1.6 illustrates with hypothetical data the underlying logic of multiple-baseline designs. Two children in the same environment have the same behavioral deficit. In the top display, the target behavior is recorded concurrently for both children for several sessions as a baseline before intervention. For Peter, the intervention begins at Session 16 and continues for 10 sessions, and the target behavior shows a clear increase compared with baseline. For Allen, the behavior is still recorded throughout the intervention phase for Peter, but no intervention is scheduled for Allen. However, Allen also shows a similar large increase in the target behavior. Faced with such data, one would be forced to conclude either that the behavior change for Peter may not have been the result of the intervention but of other factors in the environment or that Allen imitated Peter's new behavior change. Such data never find their way to publication because they do not demonstrate a clear effect of the intervention for him. The bottom display shows, also with hypothetical data, the more customary published data pattern from such a multiple-baseline design across subjects. The baseline for Allen continues undisturbed, whereas Peter's behavior changes during the intervention. When the same intervention is then also introduced for Allen, his behavior shows an increase similar to that of Peter. The data show that the intended behavior change occurs only when the intervention is introduced. The data also show successful replication of the treatment effect. The benefit of this design is that when faced with such data, the therapist or educator can fairly safely conclude that the intervention caused the behavior change. An additional benefit for the client is that no forced withdrawal of treatment (return to baseline) occurs, which would have ruined any educational or therapeutic gain for that client.

Figure 1.6. Confounding variable in a multiple-baseline design. Top two graphs: Allen's baseline behavior changes during the intervention for Peter, suggesting the influence of a confounding variable during intervention. Bottom two graphs: Allen's behavior does not change during the intervention for Peter but changes when his intervention starts, suggesting control by the intervention. Data are hypothetical.

Figure 1.7 shows an empirical example of teaching words to an 11-year-old boy with autism using a
multiple-baseline design across behaviors (Matson, Sevin, Fridley, & Love, 1990). Baselines of three words were recorded concurrently. Saying "please" was taught first while the baseline was continued for the other two target words. Saying "thank you" was then taught while the baseline was continued for the last target word, which was taught last. The data show that teaching one word made saying that word increase in frequency, whereas saying the other words did not increase in frequency. In addition, the data show that saying a given word did not increase in frequency until teaching for that particular word started. Thus, the data generated by this design demonstrate clear control over the target behaviors. In addition, the data show that the control was established by the teaching methods and not by other confounding variables.

Figure 1.7. Multiple-baseline design across behaviors for one child. From "Increasing Spontaneous Language in Three Autistic Children," by J. L. Matson, J. A. Sevin, D. Fridley, and S. R. Love, 1990, Journal of Applied Behavior Analysis, 23, p. 231. Copyright 1990 by the Society for the Experimental Analysis of Behavior, Inc., Lawrence, KS. Reprinted with permission.

The logic of the multiple-baseline designs dictates that the baselines run in parallel (concurrently). For example, with a multiple-baseline design across subjects, one would not run the first subject in January, the next subject in February, and the third subject in March. The point of the design is that the baseline for the second subject guards against possible confounding variables that might occur simultaneously with the introduction of the intervention for the first subject; similarly, the baseline for the third subject guards against possible confounding variables associated with the introduction of the intervention for the first and second subjects.

The multiple-baseline design offers several levels of comparison of effects and no effects of the intervention (Iversen, 2012). Figure 1.8 illustrates, with hypothetical data, the many comparisons that can be made in multiple-baseline designs, in this case a multiple-baseline design across subjects. For example, with four clients, the baselines of the same behavior are recorded concurrently for all clients, and the intervention is introduced successively across clients. Each client's baseline serves as a comparison for the effect of the intervention on that client (as indicated with dotted arrows at A, B, C, and D). The intervention phase for one client is compared with the baseline for the next client (reading from the top of the chart) to determine whether possible confounding variables might produce concomitant changes in the behavior of other clients when the behavior is made to change for the previous client. Thus, the dotted lines (marked a, b, c, d, e, and f) indicate the possible assessments of effects of confounding variables. In this hypothetical example, there are four demonstrations of an intervention effect because the behavior score increases each time the intervention is introduced. In addition, there are six demonstrations of the absence of the possible effect of confounding variables because the behavior score does not change for one client when the intervention takes effect for another client. Ideally, to provide maximal internal validity, the intervention should produce a change in behavior for the client for whom the intervention takes place and should have no effect on the behavior of the other clients. Notice that the data for the client in the bottom display provide the most information, with a change when the intervention takes place for that client and three comparisons during the baseline to introduction of interventions for the other clients but without any change on the baseline for the last client.

Figure 1.8. Logic of the multiple-baseline across-subjects design. A, B, C, and D arrows refer to change in behavior during the intervention compared with the baseline for each individual; a, b, c, d, e, and f arrows refer to the absence of a change for one individual when the intervention has an effect at the same time for another individual. Data are hypothetical.

Baer (1975) noted that multiple-baseline designs offer an opportunity for systematic replication (across subjects, responses, or situations) but do not offer an opportunity for direct replication for the same subject, response, or situation (i.e., there is no return to baseline or subsequent return to intervention). In essence, the technique is repeated, but control over behavior is not. Thus, an essential component of functional behavior analysis is lost with the multiple-baseline design, yet the technique is well suited for applied behavior analysis. Successful, systematic replications of procedures across behaviors, subjects, and situations and across laboratories, classrooms, and clinics over time offer important evidence that the designs are indeed responsible for the behavior changes.

In general, multiple-baseline designs are conducted across behaviors or situations for the same individual or across individuals (with the same behavior measured for all individuals). The designs have also been used for groups of individuals. Multiple-baseline designs are appealing to educators and therapists and have become so popular that they are often presented in textbooks as the golden example of modern behavior analysis tools. However, these designs do not quite live up to the original formulation of single-case research methods' being an interaction between the investigator and the subject. The investigator should be able to change the procedure as needed depending on how the subject's behavior changes as a function of the investigator's experimental manipulations. To be successful, applied behavior analysis should guard against becoming a discipline in which participants are pushed through a rigid protocol with a predetermined number of sessions and fixed set of conditions regardless of how their behavior changes. In a recent interview, Sidman (as cited in Holth, 2010) pointed out that to be effective, both basic and applied research requires
a two-way interaction between experimenter and subject and therapist and client, respectively.
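
The staggered structure of a multiple-baseline design across subjects can also be laid out programmatically. The sketch below is illustrative only: the client labels, start sessions, and session count are hypothetical, and the printout simply shows, at each intervention point, which clients are still in baseline and therefore serve as checks against confounding variables (the a through f comparisons in Figure 1.8).

```python
# Hypothetical intervention start sessions in a multiple-baseline design across subjects.
start_session = {"Client 1": 6, "Client 2": 11, "Client 3": 16, "Client 4": 21}
total_sessions = 26

for session in range(1, total_sessions + 1):
    # Phase of each client at this session: baseline (A) until the intervention starts, then B.
    phases = {
        client: ("B" if session >= start else "A")
        for client, start in start_session.items()
    }
    # Sessions at which one client enters B while later clients remain in A are the
    # across-client comparisons that guard against confounding variables.
    if session in start_session.values():
        in_baseline = [c for c, p in phases.items() if p == "A"]
        print(f"Session {session}: intervention introduced; still in baseline: {in_baseline}")
```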

Gradual or Sudden Changes in A-B, A-B-A, A-B-A-B, and Multiple-Baseline Designs

Figure 1.9. Two different patterns of data in A-B-A-B designs. The top display shows an example of gradual acquisition in both B phases and gradual extinction in the second A phase. The bottom display shows abrupt changes in the B phases and in return to the A phase. Data are hypothetical.

The literature review for this chapter revealed that two distinctly different patterns of behavior change appear to be associated with the A-B, A-B-A, A-B-A-B, and multiple-baseline designs. Figure 1.9 exemplifies this issue for the A-B-A-B design using hypothetical data so as not to highlight or critique specific studies. In the top graph, the behavior change is a gradual increase in the B phase and a gradual decrease in the second A phase with a return to baseline conditions, and the last B phase also shows a gradual increase in the behavior, as in the first B phase. In the bottom graph, the behavior in the first B phase shifts up abruptly as soon as the intervention takes
place, stays at the same level, and then just as abruptly shifts down to the same level as in the first A phase with a similar abrupt shift up in the last B phase. With animals, data customarily look as illustrated in the top graph, because Phase B is usually contingent reinforcement and the change in Phase B can be considered behavior acquisition. Similarly, when reinforcement is removed in the second A phase, behavior will ordinarily extinguish gradually across sessions. In studies with humans, however, both of these data patterns appear in the literature, often without comment or clarification.

When data show a large, abrupt change in the very first session of intervention and an equally large and abrupt change back to baseline in the first session of withdrawal, the participant either made contact with the changed reinforcement contingency immediately or responded immediately to an instruction regarding the change, such as "From now on you have to do x to earn the reinforcer" or "From now on you will no longer earn the reinforcer when you do x." Thus, the two different behavior patterns observed in the A-B-A-B design with human clients, gradual versus abrupt changes, could conceivably reflect whether the behavior was under control by contingencies of reinforcement or by discriminative stimuli (instruction). Skinner (1969) drew a distinction between contingency-shaped behavior and rule-governed behavior. Behavior that is rule governed already exists in the individual's repertoire and is therefore switched on and off by the instructions, and the experiment or treatment becomes an exercise in stimulus control, or rule following. However, contingency-shaped behavior may not exist before an intervention intended to control the behavior, and the experiment or treatment becomes a demonstration or study of acquisition. Thus, with human participants, two fundamentally different behavioral processes may underlie the different patterns of behavior seen with the use of the A-B-A-B design (and also with the A-B, A-B-A, and multiple-baseline designs). Unfortunately, authors do not always explain the procedure carefully enough that one can determine whether the participant was instructed about changes in procedure when a new phase was initiated (see also Kazdin, 1973). Perhaps future systematic examinations of existing literature can
evaluate the frequency and root of the different behavior patterns in these designs.
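
One way to screen archival time series for these two patterns is a simple numerical heuristic. The sketch below is purely illustrative and not a published criterion: it labels a B-phase change as abrupt when the very first intervention session already covers most of the eventual shift from the baseline mean, and as gradual otherwise; the 0.8 threshold is an arbitrary assumption.

```python
from statistics import mean

def change_pattern(phase_a, phase_b, threshold=0.8):
    """Crude label for a B-phase change: 'abrupt' if the first B session already
    covers most of the eventual shift from the baseline mean, else 'gradual'."""
    baseline_mean = mean(phase_a)
    final_level = mean(phase_b[-3:])          # level late in the intervention phase
    first_step = phase_b[0] - baseline_mean   # change seen in the very first B session
    total_shift = final_level - baseline_mean
    if total_shift == 0:
        return "no change"
    return "abrupt" if first_step / total_shift >= threshold else "gradual"

print(change_pattern([3, 4, 3, 4], [12, 12, 13, 12, 13]))  # abrupt, instruction-like shift
print(change_pattern([3, 4, 3, 4], [5, 7, 9, 11, 13]))     # gradual, acquisition-like shift
```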

Alternating-Treatments or Multielement Designs

A variant of the A-B-A-B design is a more random alternation of two or more components of intervention or conditions for research. For example, the effects of three doses of a drug on a behavioral baseline of operant conditioning for an individual may be compared across sessions with the dose selected randomly for each session. Thus, a possible sequence of conditions might be B-A-A-C-B-C-C-A-B-A-C-B, and so on. This design allows for random presentation of each condition and can assess sequential effects in addition to determining the effect of several levels of an independent variable. The basic design has its origin in Sidman (1960) under the label multielement manipulation and has since been labeled multielement design or alternating treatments. This design is somewhat similar to the design used in functional assessment (see the section Functional Assessment later in this chapter). However, an important difference is that in functional assessment, each condition involves assessment of existing behavior under familiar conditions, whereas with the alternating-treatments design, each condition is an intervention seeking to change behavior. Alternating-treatments designs are considerably more complex than what can be covered here (see, e.g., Barlow et al., 2009).
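
A session sequence like the one just described can be generated with simple rejection sampling. In this sketch the condition labels, the number of sessions per condition, and the rule that no condition may appear more than twice in a row are all illustrative assumptions rather than a prescribed procedure.

```python
import random

def alternating_sequence(conditions, sessions_per_condition, max_run=2, seed=1):
    """Randomly order conditions across sessions, rejecting orders with runs longer than max_run."""
    rng = random.Random(seed)
    while True:
        pool = [c for c in conditions for _ in range(sessions_per_condition)]
        rng.shuffle(pool)
        # Accept the shuffle only if no window of max_run + 1 sessions is all one condition.
        runs_ok = all(
            len(set(pool[i:i + max_run + 1])) > 1 for i in range(len(pool) - max_run)
        )
        if runs_ok:
            return pool

# For example, three conditions (e.g., drug doses) scheduled four times each.
print(alternating_sequence(["A", "B", "C"], sessions_per_condition=4))
```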

Changing-Criterion Designs

When a behavior needs to be changed drastically in topography, duration, or frequency, an effective approach is to change the criterion for reinforcement in small steps. This approach is essentially the method of shaping by successive approximation (Cooper, Heron, & Heward, 2007). Concrete, measurable criteria are applied in a stepwise fashion in accordance with behavioral changes. Each step serves as a baseline for comparison to the next step. Withdrawals or reversals are rare because the goal is to establish a drastic behavior change. The method can be characterized as an A-B-C-D-E-F-G . . . design, although it is rarely written that way. For
example, as a laboratory demonstration, the ­duration of lever holding by a rat may be increased in small steps of first 200 milliseconds across sessions, then 500 milliseconds, then 1 second, and so forth. Eventually, the rat may steadily hold the lever down for an extended period of time, say up to 10 seconds or longer if the final criterion is set at 10 seconds (e.g., Brenagan & Iversen, 2012). The changingcriterion design may also be used in the form of stimulus-shaping methods in educational research in which stimuli are faded in or faded out or modified in topography. For example, the spoken word house is taught when a stimulus, say a schematic of a house, is presented to a child. Over time the schematic is modified in many small steps into the printed word HOUSE; at each step, the correct response to the stimulus remains the spoken word house. A variety of other schematics of objects, each with its own separate spoken word, are similarly modified into printed words over time. Eventually, a situation is created in which the child may produce spoken words that correspond to the printed words (e.g., Cooper et al., 2007). McDougall, Hawkins, Brady, and Jenkins (2006) described various implementations of the changing-criterion design in education and suggested that the changing-criterion design can profitably be combined with A-B-A-B or multiple-baseline designs to establish optimal designs for individuals in need of large-scale behavior changes. Actual laboratory experiments and educational or clinical interventions using variations of changingcriterion designs are customarily highly complex mixtures of different methods. For example, to generate visual guidance of motor behavior in chimpanzees, Iversen and Matsuzawa (1996) introduced an automated training protocol to bring line drawing with a finger on a touch screen under stimulus control. Finger movement left “electronic ink” on the screen surface. The top-left diagram in Figure 1.10 shows a sketch of the 10-step training procedure. Each session presented in mixed order four trial types as four orientations of an array of circles (as indicated). The chimpanzees had to touch the circles to produce reinforcement. Stimuli changed across and within sessions. Thus, within Steps 2 and 3, the circles were moved closer across trials, and
the number of circles increased as well. The objective was to enable a topography change from touchlift to a continuous finger motion across circles, as in touch-drag. The lower diagram in Figure 1.10 shows the development of drawing for one subject for one of the four trial types on the monitor. Touch-lift is indicated by a dot and touch-drag (touch and then move the finger while it still has contact with the monitor) is indicated by a thick line that connects successive circles. The figure shows the interplay between procedure and results. The circles come closer together, the number of circles increases, and the chimpanzee’s behavior changes accordingly from touch-lift to touch-drag. Small specks of not lifting the finger between consecutive circles initially appear in Session 13, and the first full sweep across all circles without lifting appears already in Session 14 and is fully developed by Session 16. The time to complete each trial (vertical lines) shortens as the topography changed from touch-lift to touch-drag. Eventually, the chimpanzee swept over the circles in one movement for all four trial types. In the remaining steps, the stimuli were changed by fading techniques across and within sessions from an array of circles to just two dots, one where drawing should start and one where it should end. An additional aspect of the method was that the chimpanzees were also taught to end the trials themselves by pressing a trial termination key, which was introduced in Step 6. Thereby, the end of the drawn trace (lifting of the finger) came under control by the stimuli on the monitor and not by delivery of reinforcement. The final performance was a highly precise drawing behavior under visual guidance by the stimuli on the screen. Such stimulus control of complex motor performance was acquired with a completely automated method over a span of about 100 sessions (3–4 weeks) entirely by continuously rearranging reinforcement contingencies and stimulus-fading procedures in small steps without the use of verbal instruction. In the final performance, the chimpanzees would look at the dots, aim one finger at the start dot, rapidly move the finger across the monitor, lift the finger at the second dot, and then press the trial termination key to produce reinforcement, all in less than 1 second for each trial. The top right graph in Figure 1.10 shows the 17


Figure 1.10.  Top left: Schematic of the experimental procedure. Top right: Frequency plot of angles of drawing, as illustrated in the top images. Bottom: Data are from one trial type, and all trials of that type are shown in six successive sessions. Number of circles and distance between circles changed within and across sessions. A dot indicates touch-lift, and a black line indicates touch-drag. From “Visually Guided Drawing in the Chimpanzee (Pan Troglodytes),” by I. H. Iversen and T. Matsuzawa, 1996, Japanese Psychological Research, 38, pp. 128, 131, 133. Copyright 1996 by John Wiley & Sons, Inc. Reprinted with permission.

The top right graph in Figure 1.10 shows the resulting control of the angle of the drawn trace for each trial type. When behavior control techniques are successful and produce very reliable and smooth performance, spectators of the final performance may not quite believe that the subjects were at one time unable to do this. In fact, a renowned developmental psychologist happened to visit the laboratory while the drawing experiment was ongoing. On seeing one of the chimpanzees draw line after line smoothly and without hesitation (in Step 10), he exclaimed that the investigators had wasted their
time training the chimpanzees and added that they could “obviously” already draw because such smooth and precise motor performance could not at all have been acquired by simple shaping. Such comments may be somewhat amusing to behavior analysts. Yet, they are made by professionals in other areas of psychology and reveal a disturbing lack of understanding of and respect for effective behavior control techniques developed with the use of single-case research methods. Unfortunately, the commentaries also reveal a failure by behavior analysts to promote understanding about behavior
analysis, even to professionals in other areas of psychology.

Discrete-Trial Designs

A discrete trial is the presentation of some stimulus material and implementation of a response–reinforcer contingency that applies only in the presence of that stimulus. The discrete-trial design has long been a standard procedure for use with animals in all sorts of experiments within and outside of behavior analysis. Complex experiments and educational projects, for example, involving conditional discriminations (e.g., matching to sample), are often based on discrete-trial designs (see Volume 2, Chapter 6, this handbook). Historically, the discrete-trial method has become almost the hallmark of applied behavior analysis, especially for its use in education and treatment of children with intellectual disabilities (Baer, 2005). For example, the method may be as simple as presenting a picture of an animal, and the response that produces the reinforcer in this trial is the spoken name of the animal on the picture; in another trial, the picture may be of another animal, and the reinforced response is the name of that animal. Thus, the method is useful for teaching which stimuli (verbal or pictorial) should control which responses and when. Loosely speaking, the discrete-trial method teaches when a response is permitted and when it is not. The method may be used in a very informal way, as in training when normal activities should and should not occur. The method may also be presented very formally in an automated arrangement, as in the above example with chimpanzees. In applied behavior analysis, the discrete-trial method is useful for teaching single units of behavior, such as acquisition of nouns, but is less useful for teaching sequential behaviors, such as brushing teeth (Steege & Mace, 2007). Apparently, the very term discrete-trial teaching has of late, outside the field of applied behavior analysis, come to be regarded as a simple procedure suited only to simple behaviors. Attempts have been made, therefore, to present applied behavior analysis to audiences outside of behavior analysis as a considerably richer set of methods than just discrete-trial methods (Ghezzi, 2007; Steege & Mace, 2007).

Time Scales of Single-Case Research Methods

The time scale can vary considerably for single-case research methods. Laboratory demonstrations with animals using the A-B-A design to show acquisition and extinction of a simple operant can usually be accomplished in a matter of 20 to 30 minutes (e.g., Figure 1.4). Educational interventions using A-B-A-B designs may last for weeks. Therapeutic interventions for children with autism spectrum disorder may last a few years, with multiple changes in designs within this time period (e.g., Lovaas, 1987). Green, Brennan, and Fein (2002), for example, described a behavior analysis treatment project for one toddler with autism, Catherine. Treatment continued for 3 years with gradually increasing complexity, beginning with instruction in home settings, through other settings, and eventually to regular preschool settings with minimal instruction. Figure 1.11 shows the chronological order of skill introduction for the 1st year. The design is a gradual introduction of skill complexity in which previously acquired skills serve as prerequisites for new skills; the logic of the design is similar to that of the changing-criterion design, mentioned earlier, except that the criterion change is across topographically different behaviors and situations. There is no baseline for each skill other than the therapist's knowledge that a given skill was absent or not sufficiently developed before it was explicitly targeted for acquisition treatment. There are no withdrawals because, for such real-life therapeutic interventions, they would force removal of an acquired, desirable skill. Effective withdrawals may even be impossible for many of the acquired skills such as eye contact, imitation, and speech. Instead, continued progress revealed that the overall program of intervention was successful. Besides, the study replicated previous similar studies. Green et al. concluded that "over the course of 3 years of intense, comprehensive treatment, Catherine progressed from exhibiting substantial delays in multiple skill domains to functioning at or above her age level in all domains" (p. 97). Multiple-baseline designs can on occasion last for years.


Figure 1.11.  Timeline for teaching various skills to a single child in an intense behavior analysis program. Arrows indicate that instruction continued beyond the time period indicated here. From “Intensive Behavioral Treatment for a Toddler at High Risk for Autism,” by G. Green, L. C. Brennan, and D. Fein, 2002, Behavior Modification, 26, p. 82. Copyright 2002 by Sage Publications. Reprinted with permission.

D. K. Fox, Hopkins, and Anger (1987), for example, introduced a token reinforcement program for safety behaviors in two open-pit mines. Concurrent baselines of safety behaviors were recorded for both mines. After 2 years in one mine, the contingencies were changed for that mine, and the baseline was continued for the other mine for another 3 years before the contingencies were changed for that mine, too. For this 15-year project, at each mine the contingencies were applied at the level of both the individual worker and teams of workers. This example serves as a reminder that for
single-case research methods, the case may not necessarily be one individual but a group of individuals—and in this case, the group may even change members over time. The length of time a project lasts is not an essential aspect of single-case research designs. The essential aspects are that data consist of repeated observations or recordings and that such data are compared across different experimental or treatment conditions.


Using Single-Case Designs for Assessment

A multitude of methods have been developed to assess behavior. The general methodology is similar to that of single-case designs, which can be used to test which specific stimuli control a given client's behavior.

Functional Assessment

Functional assessment seeks to ascertain the immediate causes of problem behavior by identifying antecedent events that initiate the behavior and consequent events that reinforce and maintain the behavior. The goal of assessment is to determine the most appropriate intervention for the given individual. For example, a child exhibiting self-injurious behavior may be placed in several different situations to determine which situations and possible consequences of the behavior in those situations affect the behavior frequency. The situations may be (a) alone, (b) alone with toys, (c) with one parent, (d) with the other parent, (e) with a sibling, and (f) with a teacher. The child is exposed to one situation each session, with situations alternating across sessions. For determination of reliability, each situation is usually presented more than once. If the problem behavior in this example occurs most frequently when either parent is present, and it is observed that the parent interacts with the child when the problem behavior occurs, then the inference is drawn that the problem behavior may be maintained by parental interaction (i.e., positive reinforcement from attention) and that the parent's entry is an antecedent for the behavior (i.e., parental entrance is an immediate cause of initiation of the behavior). The therapist will then design an intervention based on the information collected during assessment. Causes of the problem behavior are inferred from functional assessment methods, because no experimental analysis is performed in which the causes are manipulated systematically to determine how they influence behavior. Functional assessment of problem behaviors has become very prevalent; special issues and books have been published on how to conduct the assessments (e.g., Dunlap & Kincaid, 2001; Neef & Iwata, 1994). Recently, functional assessment seems to have taken on a life of its own, separate
from the goal of providing impetus for intervention, with publications presenting results from assessment alone without actual follow-up intervention. Thus, the reader is left wondering whether the causes of the problem behavior revealed by the assessment package were also causes that could be manipulated and whether such manipulations would in fact improve the problem behavior. Functional assessment serves as a useful clinical and educational tool to determine possible immediate causes of problem behavior. However, reliability of assessment outcome cannot be fully evaluated without a direct link between the inferred causes of problem behavior in assessment and the outcome of subsequent intervention using manipulations of these inferred causes for the same client.
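As a purely illustrative aside (not from the chapter), the brief Python sketch below summarizes hypothetical assessment data of the kind just described; all condition labels, counts, and names are invented. Sorting conditions by the mean rate of problem behavior makes the inference described above concrete: the condition with the highest rate points to, but does not prove, the maintaining consequence.

```python
# Hypothetical functional assessment summary (illustration only, not from the chapter).
from collections import defaultdict
from statistics import mean

# (condition, problem-behavior count per 10-min session); each condition is repeated.
sessions = [
    ("alone", 2), ("alone with toys", 1), ("with parent", 14),
    ("with sibling", 3), ("with teacher", 4),
    ("alone", 1), ("alone with toys", 2), ("with parent", 17),
    ("with sibling", 2), ("with teacher", 5),
]

by_condition = defaultdict(list)
for condition, count in sessions:
    by_condition[condition].append(count)

# Report conditions from highest to lowest mean rate of problem behavior.
for condition, counts in sorted(by_condition.items(), key=lambda kv: -mean(kv[1])):
    print(f"{condition:16s} mean = {mean(counts):.1f} responses per session")

# "with parent" stands out here, consistent with attention-maintained behavior,
# but an experimental analysis would still be needed to confirm the inference.
```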

Assessment Using Discrete-Trial Procedures

Many children and adults with intellectual disorders have not acquired expressive language, and communication with them can be difficult or impossible. A method called facilitated communication claims to enable communication with such clients by having a specially trained person, the facilitator, hold the client's hand or arm while the client types with one finger on a keyboard. Because some typed messages have expressed advanced language use without the clients ever having shown other evidence of such language, facilitated communication has been questioned as a means of authentic communication. The question is whether the facilitator rather than the client could be the author of the typed messages. To test for such a possibility, several investigators have used single-case research methods to examine stimulus control of typing (e.g., Montee, Miltenberger, & Wittrock, 1995; Wheeler, Jacobson, Paglieri, & Schwartz, 1993). The most common test is to first present the client and the facilitator with a series of pictures in discrete trials and ask the client to type the name of the object shown on the picture. If the client types the correct object names, then the method is changed by adding test probes on some trials. On those test trials, the pictures are arranged such that the facilitator sees one picture, and the client sees another picture; an important added
methodological feature is that the client cannot see the picture the facilitator sees, and vice versa. These studies have shown with overwhelming reliability that the clients type the name of the picture that the facilitators see; if the clients see a different picture, then the clients do not type the name of the picture they see but instead type the name of the picture the facilitator sees. Thus, these studies have demonstrated, client by client, that it is the facilitator who types the messages and that the pictures do not control correct typing by the clients—the clients cannot communicate the names of the pictures they see. For commentaries and reviews of this literature, see Green (2005); Green and Shane (1994); and Jacobson, Mulick, and Schwartz (1995). As Green (2005) pointed out, the continued false beliefs by facilitators, family, caregivers, and news media that clients actually communicate with facilitated communication may in fact deprive them of an opportunity to receive effective, scientifically validated treatment.

Real-Life Assessment Using Multistage Reversal-Replication Designs

Applications of single-case research methods with clients in real-life situations away from clinic or school may not always follow a pure, prearranged formula. Contextual factors, such as varying client health and family situations, are additional considerations. In such cases, the consistent replication of behavior patterns in similar conditions across multiple sessions becomes the indicator of whether a given treatment or assessment has the intended outcome. For example, completely paralyzed patients with amyotrophic lateral sclerosis were trained to communicate using only their brainwaves (via electroencephalogram) to control the movement of a cursor on a computer screen (Birbaumer et al., 1999). Letters were placed on the screen, and the patient could move the cursor toward them and thereby spell words to communicate. In additional experiments, abilities to distinguish verbs from nouns, odd from even numbers, and consonants from vowels and to perform simple computations were assessed in a matching-to-sample–type task (Iversen et al., 2008).

Figure 1.12.  Top: Schematic of the events in a single trial. The first 1.5 seconds is an observation period (in this case, the presession instruction was to always select the noun of a noun–verb choice; new words appeared on each trial, and the correct position varied randomly from trial to trial). A 0.5-second baseline of electroencephalogram (EEG) is then recorded, followed by an active phase in which the patient can control the cursor (ball) on the screen for 3 seconds. Bottom: Data from one patient with amyotrophic lateral sclerosis from one training day with several successive tasks. T = task; T1 = simple target; T5 = noun–verbs; T7 = color matching; T8 = addition or subtraction matching. From "A Brain–Computer Interface Tool to Assess Cognitive Functions in Completely Paralyzed Patients With Amyotrophic Lateral Sclerosis," by I. H. Iversen, N. Ghanayim, A. Kübler, N. Neumann, N. Birbaumer, and J. Kaiser, 2008, Clinical Neurophysiology, 119, pp. 2217, 2220. Copyright 2008 by Elsevier. Reprinted with permission.

The top part of Figure 1.12 shows a schematic of the events during one 5-second trial. The patient gets online visual feedback from the electroencephalogram in the form of cursor movement. If the cursor reaches the correct stimulus, then a smiley face appears on the screen. The
patients lived in their private homes with constant care and were assessed intermittently over several sessions spanning a few hours once or twice each week. The ideal scenario was always to test patients several times on a very simple task, such as steering the cursor to a filled box versus an open box, to make sure that the electrodes were attached correctly, the equipment worked, and the patient could still use the electroencephalogram feedback to control the cursor. Once the patient reached at least 85% correct on a simple task, then tasks with assessment stimuli (e.g., nouns and verbs) were presented for one or several sessions. If the neurodegenerative amyotrophic lateral sclerosis destroys the patient's ability to discriminate words or numbers, then the patient should show a deficit in such tasks compared with when the patient can solve a simple task. To determine whether a patient has a potential deficit in a given skill, such as odd–even discrimination, it is necessary to know that the patient can still do the simple task of moving the cursor to the correct target when the target is just a filled box. Thus, to interpret the data, it was necessary to present the simple task several times at the beginning of, during, and at the end of a given day on which the patient was visited by the testing team. It proved challenging at times to convince family members that it was necessary to repeat the simple tasks several times because, as the family members said, "You already know that he can do that, so why do you keep wasting time repeating it?" The bottom part of Figure 1.12 shows the results for each of 16 consecutive sessions for a single test day for one patient. The task numbers on the x-axis refer to the type of training task, with Task 1 (T1) being the simplest, which was repeated several times. The overall design forms a phase sequence of A-B-C-A-C-D-A, in which the A phases present the simple task and serve as a baseline control, and the B, C, and D phases present more complex test material. The A phases thus serve as a control for equipment integrity and for the patient's basic ability to move the cursor to a correct target. For example, had the training day ended after the session in Phase D, it would have been difficult to conclude that the patient had a deficit in this task because the deteriorating percentage correct could reflect a loose
electrode, equipment failure, or the lack of patient cooperation. That the patient scored high again on the simple task in the last A phase, immediately after the session in the D phase, demonstrates that the patient had some difficulties discriminating the stimuli presented in Task 8 in the D phase (i.e., addition and subtraction, such as 3 + 5 or 7 − 2). Among the many findings is that several warm-up sessions were necessary at the beginning of the day before the patient reached the usual 85% correct on the simple task. Several such training days with this very ill, speechless, and motionless patient demonstrated that the patient had some deficits in basic skills. Thus, the data showed that the patient, a former banker, now had problems with very simple addition and subtraction. This example illustrates the use of single-case research methods in complex living situations with patients with extreme disability. To extract meaningful data from research with such patients in a varying environment, it is necessary to verify repeatedly that the complex recording and control equipment works as intended and that patients' basic skills remain intact, because no verbal communication from the patient is possible (i.e., the patient cannot tell the trainer that he is tired or that something else may be the matter). Indeed, both trainers and family members had to be instructed in some detail, and often, as to why it was necessary to repeat the simple tasks several times on each visit to the patient's home. The multiphase replication design, with repeated presentation of phases of simple tasks in alternation with more complex tasks, is a necessary component of single-case research or testing methods applied to complex living situations in which communication may be compromised or impossible.

Data Analysis Methods

A fundamental aspect of single-case designs is that behavior is recorded repeatedly and under the same set of methods and definitions across different experimental, educational, or treatment conditions. Chapter 6 of this volume covers these issues; see also J. M. Johnston and Pennypacker (2009) and Cooper et al. (2007).


Data Recording Methods

Automated recording techniques (customarily used with animal subjects) require frequent verification that monitoring devices (switches, photocells, touch screens, etc.) record behavior as intended and that these monitoring devices are calibrated correctly. With observational methods, calibration of criteria for response occurrence used by different observers is a crucial issue, and intraobserver as well as interobserver agreement calculations are essential for objective recording of behavior (see Chapter 6, this volume). Equally essential is intra- and interpersonnel consistency in the methods of delivering consequences to clients in educational and clinical settings. However, reports of agreement scores do not ordinarily include information about how consistently personnel follow described methods. Common measures of behavior are frequency of occurrence (often converted to response rate), response duration, and response topography (e.g., Barlow et al., 2009). A given behavior can also be analyzed in terms of its placement in time as a behavior pattern or its placement among other behaviors, as in analyses of the sequential properties of behavior (e.g., Henton & Iversen, 1978; Iversen, 1991; see also Chapter 12, this volume).
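As an illustrative sketch only (not part of the chapter; the data and function names are hypothetical), the following Python fragment shows two of the calculations mentioned above: converting recorded response times into a response rate and computing a simple interval-by-interval interobserver agreement score.

```python
# Illustration only: basic measures of recorded behavior (hypothetical data).

def response_rate(event_times, session_duration_min):
    """Responses per minute, given response timestamps (in minutes) and session length."""
    return len(event_times) / session_duration_min

def interval_agreement(observer_a, observer_b):
    """Percentage of observation intervals on which two observers agree that the
    target behavior did (1) or did not (0) occur."""
    assert len(observer_a) == len(observer_b)
    agreements = sum(1 for a, b in zip(observer_a, observer_b) if a == b)
    return 100.0 * agreements / len(observer_a)

# Hypothetical records: lever presses in a 30-min session and two observers
# scoring the same twelve 10-s intervals.
presses = [0.4, 1.1, 2.9, 5.0, 7.2, 12.5, 18.0, 22.3, 25.1, 29.6]
obs_a = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]
obs_b = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]

print(f"Response rate: {response_rate(presses, 30):.2f} responses/min")
print(f"Interobserver agreement: {interval_agreement(obs_a, obs_b):.1f}%")
```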

Visual Data Analysis Methods

Behavior analysis has a long tradition (beginning with Pavlov, Thorndike, and Skinner) of focusing on nonstatistical, visual analyses of data from single subjects (Iversen, 1991). In Chapter 9 of this volume, Bourret and Pietras describe a variety of methods of visual data analysis. Such analyses are now fairly standard and are covered in most textbooks on single-case research designs (e.g., Kennedy, 2005; Morgan & Morgan, 2009) and in textbooks on behavior analysis in general (e.g., Cooper et al., 2007). Visual analyses of data are not particular to experimental or applied behavior analysis and permeate all sciences and other forms of communication, as exemplified in the books by Tufte (e.g., 1983, 1990; see also Iversen, 1988) on analyzing and presenting visual information. Within behavior analysis, a classic text on visual analysis of
behavioral data is Parsonson and Baer (1978), which covers analysis of data from basic single-case research designs. Fundamental issues in visual analysis are evaluations of baseline stability and baseline trends. Baselines should ideally vary little, and experimenters should analyze any conditions responsible for unexplained variation (Sidman, 1960). Data from an intervention cannot always be interpreted if variability from the baseline carries over into the intervention phase. Trends in baselines can be problematic if they are in the direction of the expected experimental or therapeutic effect. For example, if a baseline rate of behavior gradually increases over several observation periods, and behavior increases further during the intervention, then it can be difficult or impossible to determine whether the intervention was responsible for the increase (e.g., Cooper et al., 2007). The expressions appropriate baseline and inappropriate baseline have appeared in the literature to emphasize this issue. An appropriate baseline is either a flat baseline without a trend or a baseline with a trend in the direction opposite to the expected effect. For example, if the rate of self-injury in a child is steadily increasing in baseline, and the intervention is expected to decrease the behavior, then an increasing baseline is appropriate for intervention because the behavior is expected to decrease in the intervention even against an increasing baseline. There is certainly no rationale for waiting for self-injury to stabilize before intervention starts. However, an inappropriate baseline is one that has a trend in the same direction as the expected outcome of the intervention (see also Cooper et al., 2007). Patterns of data are important in interpretations of behavior changes, in addition to descriptive statistical evaluations. For example, the two trends previously noted here in data for A-B, A-B-A, A-B-A-B, and multiple-baseline designs (i.e., Figure 1.9) would not be apparent had data been presented only as averages for each phase of recorded behavior. Plotting data for each session captures trends, variability, and intervention outcomes. Without such data, the investigator easily misses important information that might lead to subsequent changes in
procedure. In fact, successful interaction between researcher and subject depends on visual data analysis concomitant with project progression (e.g., Sidman, 1960; Skinner, 1938).
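The judgment about baseline trend is made visually, but it can be supplemented with a simple descriptive calculation. The sketch below is an illustration only; the data are hypothetical, and the chapter does not prescribe this computation. It fits a least-squares slope to baseline sessions so that the direction of the trend can be compared with the direction of the expected treatment effect.

```python
# Illustration only: a descriptive least-squares slope over hypothetical baseline sessions.

def baseline_slope(y):
    """Least-squares slope of y against session number 1..n (change per session)."""
    n = len(y)
    xs = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(y) / n
    numerator = sum((x - x_mean) * (yi - y_mean) for x, yi in zip(xs, y))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator

baseline = [12, 14, 13, 16, 17, 19]   # e.g., self-injury counts across six baseline sessions
expected_effect = "decrease"           # the planned intervention should reduce the behavior

slope = baseline_slope(baseline)
print(f"Baseline slope: {slope:+.2f} responses per session")
# A rising baseline is "appropriate" here because the expected effect is a decrease;
# a baseline already drifting in the direction of the expected effect would be
# difficult to interpret and might call for further analysis before intervening.
if slope > 0 and expected_effect == "decrease":
    print("Trend runs opposite to the expected effect: appropriate for intervention.")
else:
    print("Trend runs with the expected effect (or is flat): interpret with caution.")
```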

Statistical Data Analysis Methods

Statistical analyses of behavioral data in both basic research and application have been controversial since Skinner's (1938) The Behavior of Organisms, which was devoid of traditional statistical analyses. Behavior analysis gathers information about the behavior of individual subjects, whereas the traditional statistical approach gathers information about groups of subjects and offers no information about data from individual subjects. However, visual analyses of behavioral data are not always sufficient, and statistical methods can be used to supplement the analysis. Several authors have analyzed the ongoing controversy regarding use of statistics in behavior analysis and psychology (e.g., Barlow et al., 2009; Kratochwill & Brody, 1978; see Chapter 7, this volume). Chapters 11 and 12 of this volume provide information on new statistical techniques, which may prove useful for behavior analysts conducting time-series single-case research designs. A major issue with the use of statistics in behavior analysis is the treatment of variability in data. Visual analyses take variability as informative data that can prompt an experimental analysis of the sources of this variability (e.g., Sidman, 1960). With statistical analysis, however, variability is artificially compressed to a more manageable single number (variance or standard deviation) without analysis of the source of the variability. Thus, visual analysis and statistical analysis can be seen as antithetical approaches to variability in the behavior of the individual subject. Consider, for example, the use of a simple t test for evaluation of data from two successive phases of an A-B design (e.g., Figure 1.3) as an illustration of the problems one may face with the use of a statistical test created for entirely different experimental designs. If the B phase shows acquisition of behavior at a gradually increasing rate compared with the baseline in the A phase, the behavior analyst has no problems visually identifying a large effect of the manipulation (given that confounding variables can
be ruled out). If a t test is applied, however, comparing the baseline data with the treatment data, the standard deviation for the B phase may be very large because the data range from the baseline level to the highest rate when treatment is most effective. The result may be a nonsignificant effect. Besides, the data in the B phase showing a gradual acquisition may not be normally distributed and may not be independent measures (i.e., as a series of increasing values, data on session n influence data on session n + 1); the result is that the t test is not valid. However, the failure of a common statistical test to show an effect certainly does not mean that such data are unimportant or that there is no effect of the experimental manipulation. Single-case research methods are not designed for hypothesis testing and inferential statistics but for analysis of the behavior of the individual subject and for development of methods that can serve to help individuals acquire appropriate behavior. The assumptions of inferential statistics, which is meant for between-group comparisons with independent observations, random selection of subjects, random allocation to treatment, and random treatment onset and offset, are obviously not fulfilled in single-case research methods. Eventually, however, statistical tests appropriate for single-case methods may evolve from further developments in analyses of interrupted time-series data (e.g., Barlow et al., 2009; Crosbie, 1993; see Chapters 11 and 12, this volume). Aggregation from individual data points through averages and standard deviations to statistical tests to p values and to the final binary statement "yes or no" is, of course, common in all sciences, including psychology. Quantitative reduction of complex behavior patterns to yes or no eases communication of theories and ideas through publications and presentations, in which actual data from individual subjects or individual trials may be omitted. In contrast, a focus on data linked more directly to experimental manipulations can lead to demonstrations of stunning control and prediction of behavior at the moment-to-moment level for the individual subject (e.g., Henton & Iversen, 1978; Sidman, 1960), which is much closer to everyday interactions between behavior and environment. In daily life, people respond promptly to single instances of
interpersonal and environmental cues. Such stimulus control of behavior is the essence of interhuman communication and conduct. By demonstrating control of behavior at this level for the individual subject, behavior analysis can be both a science of behavior and a tool for educational and therapeutic interventions.
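To make the argument about misapplied inferential statistics concrete, the following sketch (an illustration only; the phase data are hypothetical and this computation is not part of the chapter) contrasts a stable A phase with a gradually rising B phase. The within-phase standard deviation of the B phase is inflated by the trend, and successive B-phase values are strongly correlated, so the independence and normality assumptions behind a conventional t test are not met even though the behavior change is visually obvious.

```python
# Illustration only: why A-B phase data with gradual acquisition strain t-test assumptions.
from statistics import mean, stdev

def lag1_autocorrelation(y):
    """Lag-1 autocorrelation; values near 1 indicate strong serial dependence."""
    m = mean(y)
    numerator = sum((y[i] - m) * (y[i + 1] - m) for i in range(len(y) - 1))
    denominator = sum((v - m) ** 2 for v in y)
    return numerator / denominator

phase_a = [2, 3, 2, 3, 2, 3]                     # stable baseline responding
phase_b = [3, 5, 8, 12, 17, 22, 26, 29, 30, 30]  # gradual acquisition during treatment

print(f"A phase: mean = {mean(phase_a):.1f}, SD = {stdev(phase_a):.2f}")
print(f"B phase: mean = {mean(phase_b):.1f}, SD = {stdev(phase_b):.2f}")  # SD inflated by the trend
print(f"Lag-1 autocorrelation in B: {lag1_autocorrelation(phase_b):.2f}")
# The B-phase observations are not independent, and the large within-phase SD
# reflects the acquisition curve rather than random error, which is why a t test
# designed for independent group data is a poor fit for these designs.
```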

Quantitative Descriptions of Behavior–Environment Relations

Behavioral data from single-case designs have invited quantitative descriptions of the relationships between behavior and environmental variables. Such descriptions vary considerably from issue to issue and attract general interest in both basic research (Shull, 1991) and application (see Chapter 10, this volume).

Single-Case Designs and Group Studies Compared

"Operant methods make their own use of Grand Numbers; instead of studying a thousand rats for one hour each, or a hundred rats for ten hours each, the investigator is likely to study one rat for a thousand hours" (Skinner, 1969, p. 112).

Research involving comparisons of groups of subjects, in which each group is exposed once to one level of a manipulation, is rare in behavior analysis, in particular with animal subjects. Yet, group research or studies with a large number of participants sometimes have relevance for behavior analysis. For example, in a recent interview (Holth, 2010), Murray Sidman remarked that large-scale implementation of behavior analysis techniques may require prior demonstration of the effectiveness of those techniques in large populations. Thus, studies using randomization and control groups may be necessary for promulgation of effective behavior control techniques. A positive outcome of a group study may not teach behavior analysts much more about the control of an individual client's behavior, yet it may nonetheless make more people aware of behavior analysis. For example, the widely cited group study by Lovaas (1987) generated broad interest in behavior analysis methods for treatment of children with
autism. One group of children (n = 19) with the diagnosis of autism received intensive behavior analysis procedures (40 hours/week) for 2 years, and another group (n = 19) with the same diagnosis did not receive the same treatment. The differences between the groups were vast; for example, 49% of the children in the treatment group showed significant gains in IQ and other measures of behavioral functioning compared with the other group. This study helped generate respect among parents and professionals for behavioral methods in treatment of children with autism. Since the Lovaas (1987) study, several other studies have similarly compared two groups of children (sometimes randomly assigned) with similar levels of autism spectrum disorder, in which one group received intensive behavior analysis treatment for 1 or several years, and the other group received less treatment or no treatment other than what is ordinarily provided by the child's community (so-called "eclectic" treatment). For example, Dawson et al. (2009) recently reported one such study with 48 children (18–30 months old) diagnosed with autism spectrum disorder in a randomized controlled trial. Statistical procedures demonstrated significant gains in a variety of behavioral measures for the group that received the treatment. For this study and the Lovaas study, each child went through complex and intense procedures based on single-case research methods for about 2 years. The group comparisons mainly used standard measures of the children's performances. Enough group studies have, in fact, been conducted that several meta-analyses of the efficacy of behavior analysis treatment of children with autism spectrum disorder have now been performed on the basis of such studies. Group comparison methods have also increased the visibility of behavior analysis techniques in areas of application. For example, Taub et al. (2006) used behavioral techniques in treatment of patients with paralysis of arms or legs resulting from stroke or brain injury. With the less affected or "normal" limb restrained to prevent its use, 21 patients who received massed practice of the affected limb (6 hours/day for 10 consecutive weekdays with shaping of movement and establishment of stimulus control of movement using social reinforcement) were
compared with a matched control group of 20 patients who received customary, standard physical therapy and general support. Treatment patients showed huge and clinically significant gains in motor control of the affected arm, whereas patients in the control group showed no such gains. A similar example is provided in research by Horne et al. (2004). These investigators had first obtained reliable single-case data that reinforcement techniques could be used effectively to increase the consumption of fruits and vegetables among schoolchildren. To promote the findings, Horne et al. conducted a large group study with 749 schoolchildren. The children were split into an experimental group and a control group. A baseline of fruit and vegetable consumption was taken first. Then at lunchtime, fruit and vegetable consumption was encouraged by having the children in the experimental group watch video adventures featuring heroic peers (the Food Dudes) who enjoy eating fruits and vegetables, and the children received reinforcers for eating fruit and vegetables. Children in the control group had free access to fruit and vegetables. Compared with the children in the control group, fruit and vegetable consumption was significantly higher among the children in the experimental group. On the basis of such data, this program has now been implemented on a large scale in all schools in Ireland and in other places in England (Lowe, 2010). Such group-comparison studies, published in journals that do not ordinarily publish studies using single-case research methods, may be helpful in promoting general knowledge about behavior analysis to a much wider audience. In addition, when effective, evidence-based behavior-analytic treatments become broadly recognized from publications with a large number of participants and in renowned journals, then granting agencies, insurance companies, journalists, and maybe even university administrators and politicians start to pay attention to the findings. For behavior analysts, group studies may not seem to add much basic knowledge beyond what is already known from studies using single-case research methods. Publication of group studies may, however, be a tactic for promotion of basic, important, and effective behavioral techniques beyond the readership of behavior analysts. The need for
informing the general public, therapists, scientists in other areas, educators, politicians, and so forth about behavior analysis techniques should, however, be balanced with a concern for ethical treatment of the participants involved. When a behavior analyst, based on experience with previous results, knows full well that a particular treatment using single-case methodology has proven to be effective over and over, then it is indeed an ethical dilemma to knowingly split a population of children in need of that treatment into two groups for comparison of treatment versus no treatment. Children who are in a control group for a few years for a valid statistical comparison with the treatment group may thus be deprived of an opportunity for known, effective treatment. Several so-called randomized controlled group studies and meta-analyses of such studies have been conducted over the past few decades to determine whether applied behavior modification actually works (e.g., Spreckley & Boyd, 2009). For elaborate comments and critiques of some meta-analysis studies of applied behavior analysis methods, see Kimball (2009). These time-consuming studies, with a presumed target audience of policymakers and insurance companies, would clearly not have been undertaken unless countless studies using single-case methods had already demonstrated that the behavior of an individual child can be modified and sustained with appropriate management of reinforcement contingencies and stimulus control techniques. The sheer mass of already existing studies based on single-case methodology with successful outcomes, for literally thousands of individuals across a variety of behavior problems and treatment settings, poses the question, "How many more randomized controlled group studies and subsequent meta-analyses are necessary before single-case methods can be accepted in general as effective in treatment?" The use of inferential statistics as a method of proof leads to the very odd situation that such meta-analyses may explicitly exclude the results from application of single-case methods with individual clients (e.g., Spreckley & Boyd, 2009), even though the overall purpose of the meta-analyses is to decide whether such methods work for individual clients.


Profound misunderstandings of what can be accomplished by single-case research methods in general can on occasion be heard among pedagogues and critics of applied behavior analysis interventions for children with developmental disorders. The argument is that the trained behavior would have developed anyway given enough time (without training). For example, Spreckley and Boyd (2009) stated that "what is too often forgotten is that the overwhelming majority of children with [autism spectrum disorder] change over time as part of their development as opposed to change resulting from an intervention" (p. 343). Commentaries such as these present a negation of the immense accumulation of experimental and applied hard evidence that individual behavior can indeed be effectively and reliably changed with the use of behavior control techniques. Yet such comments persist and slow the application of these techniques. Some behavior analysts appropriately react when professionals make such claims without supporting data (e.g., Kimball, 2009; Morris, 2009). Group studies may have their place in behavior analysis when intervention should not be withdrawn because of the social significance of the target behavior. Baer (1975) suggested combining methods of multiple-baseline designs and group comparisons, in which one group of subjects first serves as a comparison to another group who receives intervention; later, the comparison group also receives the same intervention.

Conclusion

Behavior analysts have developed designs and techniques that can increase or decrease a target behavior on a given occasion and at a given time. These methods serve as tools for an experimental analysis of behavior. The same tools are used in applied behavior analysis to modify and maintain behavior for the purpose of helping a client. Single-case research designs offer a wide range of methods, and this overview has merely scratched the surface. Because of their accumulated successes, single-case designs are now being adopted in areas outside of behavior analysis such as medicine (Janosky et al., 2009), occupational therapy (M. V. Johnston & Smith, 2010), and pain management (Onghena & Edgington, 2005). Indeed, Guyatt et al. (2000), in their review of
evidence-based medicine, placed single-case research designs with randomization of treatment highest in their hierarchy of strength of evidence for treatment decisions. Single-case research designs feature an important component of replication of behavioral phenomena for a single individual (J. M. Johnston & Pennypacker, 2009; Sidman, 1960). With direct replication, the intervention is repeated, and when behavior changes on each replication, the experimenter or therapist has identified one of the variables that control the behavior. With systematic replication, the same subject or different subjects or species may be exposed to a variation of an original procedure (Sidman, 1960), and a successful outcome fosters the accumulation of knowledge (see also Chapter 7, this volume). Sidman's (1971) original demonstration of stimulus equivalence stands out as a golden example of the scientific value of replication as a method of proof because the remarkable results using a single child, in a carefully designed experiment, have been replicated numerous times and thereby spurred development of a whole new field of research and application. Behavior analysts are sometimes weary of designs that compare groups of subjects because no behavioral phenomena are established at the level of the individual subject with such designs. However, to promote behavioral phenomena using single-case research designs, comparisons of experiments are common, and one thereby inevitably compares groups of subjects across experiments, often across species as well. Indeed, such comparisons often prove the generality of the basic principles discovered with one small set of subjects. Besides, establishing the efficiency of a behavioral procedure to be implemented on a large population scale may require certain types of controlled group studies. Group designs and multiple-baseline designs can also be profitably combined to produce powerful and socially relevant effects on a large scale (e.g., D. K. Fox et al., 1987). With wide implementations in psychology, education, medicine, and rehabilitation, single-case methodology is now firmly established as a viable means for discovery as well as for application of basic behavioral mechanisms. Of late, behavior
management interventions based on single-case methodology followed up with efficiency determinations from population studies have successfully demonstrated how the science of the individual can be a science for all.

Individual prediction is of tremendous importance, so long as the organism is to be treated scientifically. (Skinner, 1938, p. 444)

Man has at his disposal yet another powerful resource—natural science with its strictly objective methods. This science, as we all know, is making big headway every day. The facts and considerations I have placed before you are one of the numerous attempts to employ—in studying the mechanism of the highest vital manifestations in the dog, the representative of the animal kingdom which is man's best friend—a consistent, purely scientific method of thinking. (From Pavlov's acceptance speech on receiving the Nobel Prize in 1904; Pavlov, 1955, p. 148)

References

Baer, D. M. (1975). In the beginning there was the response. In E. Ramp & G. Semb (Eds.), Behavior analysis: Areas of research and application (pp. 16–30). Englewood Cliffs, NJ: Prentice-Hall. Baer, D. M. (1993). Advising as if for research productivity. Clinical Psychologist, 46, 106–109. Baer, D. M. (2005). Letters to a lawyer. In W. L. Heward, T. E. Heron, N. A. Neff, S. M. Peterson, D. M. Sainato, G. Cartledge, . . . J. C. Dardig (Eds.), Focus on behavior analysis in education: Achievements, challenges, and opportunities (pp. 3–30). Upper Saddle River, NJ: Pearson. Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97. doi:10.1901/jaba.1968.1-91 Barlow, D. H., Nock, M. K., & Hersen, M. (2009). Single case experimental designs: Strategies for studying behavior change (3rd ed.). New York, NY: Pearson Education. Bernard, C. (1957). An introduction to the study of experimental medicine. New York, NY: Dover. (Original work published 1865)

Bijou, S. W., & Baer, D. M. (1961). Child development I: A systematic and empirical theory. New York, NY: Appleton-Century-Crofts. doi:10.1037/11139-000 Birbaumer, N., Ghanayim, N., Hinterberger, T., Iversen, I., Kotchoubey, B., Kübler, A., . . . Flor, H. (1999). A spelling device for the paralysed. Nature, 398, 297–298. doi:10.1038/18581 Blampied, N. M. (1999). A legacy neglected: Restating the case for single-case research in cognitive behavior therapy. Behaviour Change, 16, 89–104. doi:10.1375/ bech.16.2.89 Boakes, R. (1984). From Darwin to behaviorism: Psychology and the minds of animals. New York, NY: Cambridge University Press. Boorstin, D. J. (1983). The discoverers: A history of man’s search to know his world and himself. New York, NY: Vintage Books. Boring, E. G. (1929). A history of experimental psychology. New York, NY: Appleton-Century-Crofts. Boring, E. G. (1954). The nature and history of experimental control. American Journal of Psychology, 67, 573–589. doi:10.2307/1418483 Brenagan, W., & Iversen, I. H. (2012). Methods to differentially reinforce response duration in rats. Manuscript in preparation. Catania, A. C., & Laties, V. G. (1999). Pavlov and Skinner: Two lives in science. Journal of the Experimental Analysis of Behavior, 72, 455–461. doi:10.1901/jeab.1999.72-455 Cooper, J. O., Heron, T. E., & Heward, W. L. (2007). Applied behavior analysis (2nd ed.). Upper Saddle River, NJ: Pearson Education. Crosbie, J. (1993). Interrupted time-series analysis with brief single-subject data. Journal of Consulting and Clinical Psychology, 61, 966–974. doi:10.1037/0022006X.61.6.966 Dawson, G., Rogers, S., Munson, J., Smith, M., Winter, J., Greenson, J., . . . Varley, J. (2009). Randomized, controlled trial of an intervention for toddlers with autism: The early start Denver model. Pediatrics, 125, e17–e23. doi:10.1542/peds.2009-958 Dunlap, G., & Kincaid, D. (2001). The widening world of functional assessment: Comments on four manuals and beyond. Journal of Applied Behavior Analysis, 34, 365–377. doi:10.1901/jaba.2001.34-365 Ebbinghaus, H. (1913). Memory (H. A. Rueger & C. E. Bussenius, Trans.). New York, NY: Teachers College. (Original work published 1885) Eysenck, H. J. (1960). Behavior therapy and the neuroses. New York, NY: Pergamon Press. Fox, D. K., Hopkins, B. L., & Anger, W. K. (1987). The long-term effects of a token economy on safety performance in open-pit mining. Journal of Applied 29
Behavior Analysis, 20, 215–224. doi:10.1901/ jaba.1987.20-215 Fox, R. G., Copeland, R. E., Harris, J. W., Rieth, H. J., & Hall, R. V. (1975). A computerized system for selecting responsive teaching studies, catalogued along twenty-eight important dimensions. In E. Ramp & G. Semb (Eds.), Behavior analysis: Areas of research and application (pp. 124–158). Englewood Cliffs, NJ: Prentice-Hall.

Iversen, I. H. (1991). Methods of analyzing behavior patterns. In I. H. Iversen & K. A. Lattal (Eds.), Techniques in the behavioral and neural sciences: Experimental analysis of behavior, Part 2 (pp. 193–242). New York, NY: Elsevier. Iversen, I. H. (1992). Skinner’s early research: From reflexology to operant conditioning. American Psychologist, 47, 1318–1328. doi:10.1037/0003066X.47.11.1318

Ghezzi, P. M. (2007). Discrete trials teaching. Psychology in the Schools, 44, 667–679. doi:10.1002/pits.20256

Iversen, I. H. (2010). [Laboratory demonstration of acquisition of operant behavior]. Unpublished raw data.

Green, G. (2005). Division fellow, Gina Green, reacts to CNN program “Autism is a World” which focuses on facilitated communication. Psychology in Mental Retardation and Developmental Disabilities, 31, 7–10.

Iversen, I. H. (2012). Tutorial: Multiple baseline designs. Manuscript in preparation.

Green, G., Brennan, L. C., & Fein, D. (2002). Intensive behavioral treatment for a toddler at high risk for autism. Behavior Modification, 26, 69–102. doi:10.1177/0145445502026001005 Green, G., & Shane, H. C. (1994). Science, reason, and facilitated communication. Journal of the Association for Persons with Severe Handicaps, 19, 151–172. Greenspan, R. J., & Baars, B. J. (2005). Consciousness eclipsed: Jacques Loeb, Ivan P. Pavlov, and the rise of reductionistic biology after 1900. Consciousness and Cognition, 14, 219–230. doi:10.1016/j.concog. 2004.09.004 Guyatt, G. H., Haynes, R. B., Jaeschke, R. Z., Cook, D. J., Green, L., Naylor, C. D., . . . Richardson, W. S. (2000). Users’ guides to the medical literature: XXV. Evidence-based medicine: Principles for applying the users’ guides to patient care. JAMA, 284, 1290–1296. doi:10.1001/jama.284.10.1290 Harris, B. (1979). Whatever happened to little Albert? American Psychologist, 34, 151–160. doi:10.1037/ 0003-066X.34.2.151 Henton, W. W., & Iversen, I. H. (1978). Classical conditioning and operant conditioning: A response pattern analysis. New York, NY: Springer-Verlag. Holth, P. (2010). A research pioneer’s wisdom: An interview with Dr. Murray Sidman. European Journal of Behavior Analysis, 11, 181–198. Horne, P. J., Tapper, K., Lowe, C. F., Hardman, C. A., Jackson, M. C., & Woolner, J. (2004). Increasing children’s fruit and vegetable consumption: A peermodeling and rewards-based intervention. European Journal of Clinical Nutrition, 58, 1649–1660. doi:10.1038/sj.ejcn.1602024 Iversen, I. H. (1988). Tactics of graphic design: A review of Tufte’s The Visual Display of Quantitative Information [Book review]. Journal of the Experimental Analysis of Behavior, 49, 171–189. doi:10.1901/jeab.1988.49-171 30

Iversen, I. H., Ghanayim, N., Kübler, A., Neumann, N., Birbaumer, N., & Kaiser, J. (2008). A braincomputer interface tool to assess cognitive functions in completely paralyzed patients with amyotrophic lateral sclerosis. Clinical Neurophysiology, 119, 2214–2223. doi:10.1016/j.clinph.2008.07.001 Iversen, I. H., & Matsuzawa, T. (1996). Visually guided drawing in the chimpanzee (Pan troglodytes). Japanese Psychological Research, 38, 126–135. doi:10.1111/j.1468-5884.1996.tb00017.x Jacobson, J. W., Mulick, J. W., & Schwartz, A. A. (1995). A history of facilitated communication: Science, pseudoscience, and antiscience. American Psychologist, 50, 750–765. doi:10.1037/0003-066X.50.9.750 Janosky, J. E., Leininger, S. L., Hoerger, M. P., & Libkuman, T. M. (2009). Single subject designs in biomedicine. New York, NY: Springer-Verlag. doi:10.1007/978-90-481-2444-2 Johnston, J. M., & Pennypacker, H. S. (2009). Strategies and tactics in behavioral research (3rd ed.). New York, NY: Routledge. Johnston, M. V., & Smith, R. O. (2010). Single subject design: Current methodologies and future directions. OTJR: Occupation,Participation and Health, 30, 4–10. doi:10.3928/15394492-20091214-02 Jones, M. C. (1924). A laboratory study of fear: The case of Peter. Pedagogical Seminary, 31, 308–315. doi:10. 1080/08856559.1924.9944851 Kazdin, A. E. (1973). Methodological and assessment considerations in evaluating reinforcement programs in applied settings. Journal of Applied Behavior Analysis, 6, 517–531. doi:10.1901/jaba.1973.6-517 Kazdin, A. E. (2011). Single-case research designs: Methods for clinical and applied settings (2nd ed.). New York, NY: Oxford University Press. Keller, F. S., & Schoenfeld, W. N. (1950). Principles of psychology. New York, NY: Appleton-CenturyCrofts. Kennedy, C. H. (2005). Single-case designs for educational research. New York, NY: Pearson Allyn & Bacon.


Kimball, J. W. (2009). Comments on Spreckley and Boyd (2009). Science in Autism Treatment, 6, 3–19. Kratochwill, T. R., & Brody, G. H. (1978). Single subject designs: A perspective on the controversy over employing statistical inference and implications for research and training in behavior modification. Behavior Modification, 2, 291–307. doi:10.1177/014544557823001 Loeb, J. (1900). Comparative physiology of the brain and comparative psychology. New York, NY: Putnam. doi:10.5962/bhl.title.1896 Lovaas, O. I. (1987). Behavioral treatment and normal educational and intellectual functioning in young autistic children. Journal of Consulting and Clinical Psychology, 55, 3–9. doi:10.1037/0022-006X.55.1.3 Lowe, F. (2010, September). Can behavior analysis change the world? Paper presented at the Ninth International Congress on Behavior Studies, Crete, Greece. Matson, J. L., Sevin, J. A., Fridley, D., & Love, S. R. (1990). Increasing spontaneous language in three autistic children. Journal of Applied Behavior Analysis, 23, 227–233. doi:10.1901/jaba.1990.23-227 McDougall, D., Hawkins, J., Brady, M., & Jenkins, A. (2006). Recent innovations in the changing criterion design: Implications for research and practice in special education. Journal of Special Education, 40, 2–15. doi:10.1177/00224669060400010101

Montee, B. B., Miltenberger, R. G., & Wittrock, D. (1995). An experimental analysis of facilitated communication. Journal of Applied Behavior Analysis, 28, 189–200. doi:10.1901/jaba.1995.28-189
Morgan, D. L., & Morgan, R. K. (2009). Single-case research methods for the behavioral and health sciences. Los Angeles, CA: Sage.
Morris, E. K. (2009). A case study in the misrepresentation of applied behavior analysis in autism: The Gernsbacher lectures. Behavior Analyst, 32, 205–240.
Neef, N. A., & Iwata, B. A. (1994). Current research on functional analysis methodologies: An introduction. Journal of Applied Behavior Analysis, 27, 211–214. doi:10.1901/jaba.1994.27-211
O'Neill, R. E., McDonnell, J., Billingsly, F., & Jenson, W. (2010). Single case designs in educational and community settings. New York, NY: Merrill.
Onghena, P., & Edgington, E. S. (2005). Customization of pain treatment: Single-case design and analysis. Clinical Journal of Pain, 21, 56–68. doi:10.1097/00002508-200501000-00007
Paré, W. P. (1990). Pavlov as a psychophysiological scientist. Brain Research Bulletin, 24, 643–649. doi:10.1016/0361-9230(90)90002-H
Parsonson, B. S., & Baer, D. M. (1978). The analysis and presentation of graphic data. In T. R. Kratochwill (Ed.), Single-subject research: Strategies for evaluating change (pp. 101–165). New York, NY: Academic Press.
Pavlov, I. P. (1906). The scientific investigation of the psychical faculties or processes in higher animals. Science, 24, 613–619. doi:10.1126/science.24.620.613
Pavlov, I. P. (1927). Conditioned reflexes (G. V. Anrep, Trans.). London, England: Oxford University Press.
Pavlov, I. P. (1928). Lectures on conditioned reflexes: Twenty-five years of objective study of the higher nervous activity (behaviour) of animals (Vol. 1). New York, NY: International Publishers. doi:10.1037/11081-000
Pavlov, I. P. (1955). Nobel speech delivered in Stockholm on December 12, 1904. In K. S. Koshtoyants (Ed.), I. P. Pavlov: Selected works (pp. 129–148). Honolulu, HI: University Press of the Pacific.
Schmidt, R. A., & Lee, T. D. (2005). Motor control and learning: A behavioral emphasis. Champaign, IL: Human Kinetics.
Shull, R. L. (1991). Mathematical description of operant behavior: An introduction. In I. H. Iversen & K. A. Lattal (Eds.), Experimental analysis of behavior (Vol. 2, pp. 243–282). New York, NY: Elsevier.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. New York, NY: Basic Books.
Sidman, M. (1971). Reading and auditory-visual equivalences. Journal of Speech and Hearing Research, 14, 5–13.
Sidman, M. (1981). Remarks. Behaviorism, 9, 127–129.
Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. New York, NY: Appleton-Century-Crofts.
Skinner, B. F. (1953). Science and human behavior. New York, NY: Macmillan.
Skinner, B. F. (1956). A case history in scientific method. American Psychologist, 11, 221–233. doi:10.1037/h0047662
Skinner, B. F. (1966). Some responses to the stimulus "Pavlov." Conditional Reflex, 1, 74–78.
Skinner, B. F. (1969). Contingencies of reinforcement. New York, NY: Appleton-Century-Crofts.
Skinner, B. F. (1978). The ethics of helping people. In B. F. Skinner (Ed.), Reflections on behaviorism and society (pp. 33–47). Englewood Cliffs, NJ: Prentice Hall.
Skinner, B. F. (1979). The shaping of a behaviorist. New York, NY: Knopf.
Spreckley, M., & Boyd, R. (2009). Efficacy of applied behavioral intervention in preschool children with autism for improving cognitive, language, and adaptive behavior: A systematic review and meta-analysis. Journal of Pediatrics, 154, 338–344. doi:10.1016/j.jpeds.2008.09.012
Steege, M. W., & Mace, F. C. (2007). Applied behavior analysis: Beyond discrete trial teaching. Psychology in the Schools, 44, 91–99. doi:10.1002/pits.20208
Sulzer-Azaroff, B., & Mayer, G. R. (1991). Behavior analysis for lasting change. Fort Worth, TX: Harcourt Brace.
Taub, E., Uswatte, G., King, D. K., Morris, D., Crago, J. E., & Chatterjee, A. (2006). A placebo-controlled trial of constraint-induced movement therapy for upper extremity after stroke. Stroke, 37, 1045–1049. doi:10.1161/01.STR.0000206463.66461.97
Thorndike, E. L. (1911). Animal intelligence. New York, NY: Macmillan.
Thorndike, E. L. (1927). The law of effect. American Journal of Psychology, 39, 212–222. doi:10.2307/1415413
Todes, D. P. (2002). Pavlov's physiological factory: Experiment, interpretation, laboratory enterprise. Baltimore, MD: Johns Hopkins University Press.
Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tufte, E. R. (1990). Envisioning information. Cheshire, CT: Graphics Press.
Ullman, L. P., & Krasner, L. (1965). Case studies in behavior modification. New York, NY: Holt, Rinehart & Winston.
Van Houten, R., Axelrod, S., Bailey, J. S., Favell, J. E., Foxx, R. M., Iwata, B. A., & Lovaas, O. I. (1988). The right to effective behavioral treatment. Journal of Applied Behavior Analysis, 21, 381–384. doi:10.1901/jaba.1988.21-381
Watson, J. B., & Rayner, R. (1920). Conditioned emotional reactions. Journal of Experimental Psychology, 3, 1–14. doi:10.1037/h0069608
Wheeler, D. L., Jacobson, J. W., Paglieri, R. A., & Schwartz, A. A. (1993). An experimental assessment of facilitated communication. Mental Retardation, 31, 49–59.
Wolpe, J. (1958). Psychotherapy by reciprocal inhibition. Stanford, CA: Stanford University Press.
Wood, J. D. (2004). The first Nobel prize for integrated systems physiology: Ivan Petrovich Pavlov, 1904. Physiology, 19, 326–330. doi:10.1152/physiol.00034.2004

Chapter 2

The Five Pillars of the Experimental Analysis of Behavior Kennon A. Lattal

“What is the experimental analysis of behavior?” Skinner (1966) famously asked in an address to Division 25 of the American Psychological Association, now the Division for Behavior Analysis (then the Division for the Experimental Analysis of Behavior). His answer included a set of methods and a subject matter, both of which originated with his research and conceptual analyses that began in the 1930s. Since those early days, what began as operant conditioning has evolved from its humble laboratory and nonhuman animal origins to encompass the breadth of contemporary psychology. This handbook, which is itself testimony to the preceding observation, is the impetus for revisiting Skinner’s question. The developments described in each of the handbook’s chapters are predicated on a few fundamental principles, considered here as the pillars that constitute the foundation of the experimental analysis of behavior (TEAB). These five pillars—research methods, reinforcement, punishment, control by stimuli correlated with reinforcers and punishers, and contextual and stimulus control—are the subject of this chapter. Together, they provide the conceptual and empirical framework for understanding the ways in which environmental events interact with behavior. The pillars, although discussed separately from one another here for didactic purposes, are inextricably linked: The methods of TEAB are pervasive in investigating the other pillars; punishment and stimulus

control are not possible in the absence of the reinforcement that maintains the behavior being punished or under control of other stimuli; stimuli correlated with reinforcers and punishers also have their effects only in the context of reinforcement and are also closely related to other, more direct stimulus control processes; and punishment affects both reinforcement and stimulus control. Pillar 1: Research Methods Research methods in TEAB are more than a set of techniques for collecting and analyzing data. They certainly enable those activities, but, more important, they reflect the basic epistemological stance of behavior analysis: The determinants of behavior are to be found in the interactions between individuals and their environment. This stance led to the adoption and evolution of methods and concepts that emphasize the analysis of functional relations between features of that environment and the behavior of individual organisms. Skinner (1956) put it as follows: We are within reach of a science of the individual. This will be achieved not by resorting to some special theory of knowledge in which intuition or understanding takes the place of observation

This chapter is dedicated to Stephen B. Kendall, who, as my first instructor and, later, mentor in the experimental analysis of behavior, provided an environment that allowed me to learn about the experimental analysis of behavior by experimenting. I thank Karen Anderson, Liz Kyonka, Jack Marr, Mike Perone, and Claire St. Peter for helpful discussions on specific topics reviewed in this chapter, and Rogelio Escobar and Carlos Cançado for their valuable comments on an earlier version of the chapter.



and analysis, but through an increasing grasp of relevant conditions to produce order in the individual case. (p. 95)

Single-Case Procedures and Designs Two distinguishing features of research methods in TEAB are emphases on what Bachrach (1960) called the informal theoretical approach and on examining the effects of independent variables on well-defined responses of individual subjects. Skinner’s (1956) review of his early research defined Bachrach’s label. The inductive tradition allows free rein in isolating the variables of which behavior is a function, unencumbered by many of the shoulds, oughts, and inflexibilities of research designs derived from inferential statistical research methods (Michael, 1974; see Chapters 5 and 7, this volume). The essence of the second feature, single-case experimental designs, is that by investigating the effects of the independent variable on individual subjects, each subject serves as its own control. Thus, effects of independent variables are compared, within individual subjects, with a baseline on which the independent variable is absent (or present at some other value). This methodological approach to analyzing the subject matter of psychology can be contrasted with an approach based on the inferential statistical analysis of the data generated across different groups of subjects exposed to the presence or absence or different values of the independent variable (Michael, 1974; see Chapters 7 and 8, this volume). A single-case analysis precludes a major source of variation inherent in all group designs: that variation resulting from between-subjects comparisons. It also minimizes other so-called threats to internal validity such as those associated with statistical regression toward the mean and subject selection biases (cf. Kazdin, 1982). Three central features of single-case research are selecting an appropriate design, establishing baseline performance, and selecting the number of subjects to study. The most basic design involves first establishing a baseline, A; then introducing the independent variable, B; and finally returning to the baseline. From this basic A-B-A design, many variations spring (Kazdin, 1982; see Chapter 5, this volume). Within-subject designs in the tradition of 34

TEAB involve the repeated observation of the targeted response over multiple sessions until an appropriate level of stability is achieved. Without an appropriate design and an appropriate degree of stability in the baseline, attributing the changes in the dependent variable to the independent variable is not possible. Decisions about the criteria for the baseline begin with a definition of the response. Skinner (1966) noted that “an emphasis on rate of occurrence of repeated instances of an operant distinguishes the experimental analysis of behavior” (p. 213). Unless the response is systematically measured and repeatable, stability will be difficult, if not impossible, to achieve. The dimensions of baseline stability are the amount of variability or bounce in the data and the extent of trends. Baseline stability criteria are typically established relative to the anticipated effects of the independent variable. Thus, if a large effect of the independent variable is anticipated, more variation in the baseline is acceptable because, presumably, the effect will be outside the baseline range. Similarly, if a strong downward trend is expected when the independent variable is introduced, then an upward trend in the baseline data is more acceptable than if the independent variable were expected to increase the rate of responding. Some circumstances may not seem to lend themselves readily to single-case designs. An example is that in which behavioral effects cannot be reversed, and thus baselines cannot be recovered. Sidman (1960), however, observed that even in such cases, “the use of separate groups destroys the continuity of cause and effect that characterizes an irreversible behavioral process” (p. 53). In the case of irreversible effects, creative designs in the single-case tradition have been used to circumvent the problem. Boren and Devine (1968), for example, investigated the repeated acquisition of behavioral chains by arranging a task in which monkeys learned a sequence of 10 responses. Once that pattern was stable, the pattern was changed, and the monkeys had to learn a new sequence. This procedure allowed the study of repeated acquisition of behavioral chains in individual subjects across a relatively long period of time. In other cases either in which baselines are unlikely to be reversed or in which it is ethically


questionable to do so, a multiple-baseline design often can be used (Baer, Wolf, & Risley, 1968; see Chapter 5, this volume). Selecting the number of subjects is based on both practical concerns and experimenter judgment. A few studies in the Journal of the Experimental Analysis of Behavior were truly single subject in that they involved only one subject (e.g., de Lorge, 1971), but most have involved more. Decisions about numbers of subjects interact with decisions about the design, types of independent variables being studied, and range of values of these variables. Between-subjects direct replications and both between- and withinsubject systematic replications involving different values of the independent variable increase the generality of the findings (see Chapter 7, this volume). Individual-subject research has emphasized experimental control over variability, as contrasted with group designs in which important sources of variability are isolated statistically at the conclusion of the experiment. Sidman (1960) noted that “acceptance of variability as unavoidable or, in some sense as representative of ‘the real world’ is a philosophy that leads to the ignoring of relevant factors” (p. 152). He methodically laid out the tactics of minimizing variability in experimental situations and identified several sources of such variability. Between-subjects variability already has been discussed. Another major source of variability discussed by Sidman is that resulting from weak experimental control. Inferential statistical analysis is sometimes used to supplement experimental analysis. Such analysis is not needed if baseline and manipulation of response distributions do not overlap, but sometimes they do, and in these cases some behavior analysts have argued for their inclusion (see, e.g., Baron [1999] and particularly Davison [1999] on the potential utility of nonparametric statistics in within-subject designs in which baseline and intervention distributions overlap). Others (e.g., Michael, 1974), however, have noted that statistical analysis of group data draws the focus away from an experimental analysis of effects demonstrable in individual subjects, removes the experimenter from the data, and substitutes statistical control for experimental control. A final point with respect to single-case designs relates to the previous discussion of the inductive

method. When individual subjects’ behavior is studied, it is not surprising that the same value of a variable may have different effects across subjects. For example, a relatively brief delay of reinforcement may markedly reduce the responding of one subject and not change the responding of another. Rather than averaging the two, the tactic in TEAB is to conduct a parametric analysis of delay duration with both subjects, to search for orderly and qualitatively similar functional relations across subjects even though, on the basis of intersubject comparisons, individuals may respond differently at any particular value. The achievement of these qualitatively similar functional relations contributes to the generality of the effect. The establishment of experimental control through the methods described in this section is a major theme of TEAB, exemplified in each of the other, empirical pillars of TEAB. Before reviewing those empirical pillars, however, examining how critical features of the environment are defined and used in TEAB is important.
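The stability criteria described above lend themselves to a concrete, if simplified, statement. The sketch below is one hypothetical way of operationalizing a bounce-and-trend criterion over the most recent baseline sessions; the window size, the thresholds, and the function name are illustrative assumptions rather than standards prescribed in this chapter.

# Minimal sketch of a baseline stability check for single-case data.
# The window size and thresholds are illustrative assumptions only.

def is_stable(session_rates, last_n=6, max_range_prop=0.10, max_trend_prop=0.05):
    """Judge the stability of the last `last_n` sessions of a baseline.

    session_rates  -- response rates, one value per session
    max_range_prop -- allowable bounce (range of the window) as a proportion of its mean
    max_trend_prop -- allowable difference between the means of the two halves
                      of the window, as a proportion of the overall mean
    """
    window = session_rates[-last_n:]
    if len(window) < last_n:
        return False                       # too few sessions to judge stability
    mean = sum(window) / len(window)
    bounce = (max(window) - min(window)) / mean
    first, second = window[:last_n // 2], window[last_n // 2:]
    trend = abs(sum(second) / len(second) - sum(first) / len(first)) / mean
    return bounce <= max_range_prop and trend <= max_trend_prop

# Example: a baseline with little bounce and no systematic trend.
print(is_stable([52, 55, 54, 53, 55, 54]))   # True under these illustrative criteria

As the preceding discussion makes clear, any such numerical criterion would in practice be set relative to the anticipated size and direction of the effect of the independent variable.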

Defining Environmental Events In the laboratory, responses often are defined operationally, as a matter of convenience, in terms of, for example, a switch closure. In principle, however, they are defined functionally, in terms of their effects. Baer (1981) observed that although every response or class of responses has form or structure, operant behavior has no necessary structure. Rather, its structure is determined by environmental circumstance: The structure of operant behavior is limited primarily by our ability to arrange the environment into contingencies with that behavior; to the extent that we can wield the environment more and more completely, to that extent behavior has less and less necessary structure. This is tantamount to saying that it is mainly our current relatively low level of technological control over the environment that seems to leave behavior with apparent necessary structure and that such a limitation is trivial. (Baer, 1981, p. 220). 35


Functional definitions in TEAB originated with Skinner’s (1935) concept of the operant. With that analysis, he organized fluid, unique individual responses into integrated units or classes— operants—whereby all the unique members have the same effect on the environment. Stimuli were similarly organized into functional classes on the basis of similarity of environmental (behavioral) effect. He also conceptualized reinforcement not in terms of its forms or features, but functionally, in terms of its effects on responses. Thus, any environmental event can function, in principle, as a reinforcer or as a punisher, or as neither, depending on how it affects behavior. On the question of circularity, Skinner (1938) simply noted that a reinforcing stimulus is defined as such by its power to produce the resulting change. There is no circularity about this; some stimuli are found to produce the change, others not, and they are classified as reinforcing and non-reinforcing accordingly. (p. 62) Another aspect of the contextual basis of reinforcers and punishers is what Keller and Schoenfeld (1950; cf. Michael, 1982) called the establishing operation. The establishing operation delineates the conditions necessary for some event or activity to function as a reinforcer or punisher. In most laboratory research, establishing a reinforcer involves restricted access, be it, for example, to food or to periods free of electric shock delivery. That is, reinforcers and punishers have to be established by constructing a specific context or history. Morse and Kelleher (1977) summarized several experiments in which they suggested that electric shock delivery sufficient to maintain avoidance behavior was established as a positive reinforcer. This establishment was accomplished by creating particular kinds of behavioral histories. McKearney (1972), for example, first maintained responding by a free-operant shockavoidance schedule and then concurrently super­ imposed a schedule of similar shocks delivered independently of responding. This schedule was in turn replaced by response-dependent shocks scheduled at the same rate as the previously responseindependent ones, and the avoidance schedule was 36

eliminated. Responding then was maintained when its only consequence was to deliver an electric shock that had previously been avoided. As the research cited by Morse and Kelleher (1977) suggests, the type of event is less important than its behavioral effect (which in turn depends on the organism’s history of interaction with the event). Events that increase or maintain responding when made dependent on the response are categorized as reinforcers, and events that suppress or eliminate responding are categorized as punishers. Associating a valence, positive or negative, with these functionally defined reinforcers and punishers is conventional. The valence describes the operation whereby the behavioral effect of the consequence occurs, that is, whether the event is added to or subtracted from the environment, which yields a 2 × 2 contingency table in which valence is shown as a function of behavioral change (maintain or increase in the case of reinforcement, and decrease or eliminate in the case of punishment). Despite widespread adoption of this categorization system, the use of valences has been criticized on the grounds that they are arbitrary and ambiguous (Baron & Galizio, 2005; Michael, 1975). Baron and Galizio (2005) cited an experiment in which “rats kept in a cold chamber would press a lever that turned on a heat lamp” (p. 87) to make the point that in such cases it indeed is difficult to separate cold removal and heat presentation. Presenting food, it has been argued, may be tantamount to removing (or at least reducing) deprivation and removing electric shock may be tantamount to presenting a shock-free period (cf. Verhave, 1962). Although the Michael (1975) and Baron and Galizio position falls on sympathetic ears (e.g., Marr, 2006), the distinction continues. The continued use of the positive–negative distinction is a commentary on its utility in the general verbal community of behavior analysts as well as in application and ­teaching. Despite potential ambiguities in some circumstances, the operations are clear—events that experimenters present and remove are sufficiently straightforward to allow description. The question of valences, as with any question in a science, should have an empirical answer. Because the jury is still out on this question, the long-enduring practice


of identifying valences on the basis of experimental operations is retained in this chapter. Pillar 2: Reinforcement An organism behaves in the context of an environment in which other events are constantly occurring, some as a result of its responses, and others independent of its responses. One outcome of some of these interactions is that the response becomes more likely than it would be in their absence. Such an outcome is particularly effective when a dependency exists between such environmental events and behavior, a two-term contingency involving responding and what will come to function as a reinforcer. This process of reinforcement is fundamental in understanding behavior and is thus a ­pillar of TEAB. Of the four empirical pillars, reinforcement may be considered the most basic because the other three cannot exist in the absence of reinforcement. Each of the other pillars adds another element to reinforced responding.

Establishing a Response As noted, to establish an operant response, a reinforcer must be established and the target response specified precisely so that it is distinguished from other response forms. Several techniques may then be used to bring about the target response. One is to simply wait until it occurs (e.g., Neuringer, 1970); however, the target response may never occur without more direct intervention. Some responses can be elicited or evoked through a technique known colloquially as “baiting the operandum.” Spreading a ­little peanut butter on a lever, for example, evokes considerable exploration by the rat of the lever, ­typically resulting in its depression, which then can be reinforced conventionally. The difficulty is that such baiting sometimes results in atypical response topographies that later can be problematic. A particularly effective technique related to baiting is to elicit the response through a Pavlovian conditioning procedure known as autoshaping (Brown & Jenkins, 1968). Once elicited, the response then can be reinforced. With humans, instructions are often an efficient means of establishing an operant response (see Rules and Instructions section). As with all

techniques for establishing operant responses, the success of the instructions depends on their precision. An alternative form of instructional control is to physically guide the response (e.g., Gibson, 1966). Such guided practice may be considered a form of imitation, although imitation as a more general technique of establishing operant behavior does not involve direct physical contact with the learner. The gold-standard technique for establishing an operant response is the differential reinforcement of successive approximations, or shaping. Discovered by Skinner in the 1940s, shaping involves immediately reinforcing successively closer approximations to the target response (e.g., Eckerman, Hienz, Stern, & Kowlowitz, 1980; Pear & Legris, 1987). A sophisticated analysis of shaping is that of Platt (1973), who extensively studied the shaping of interresponse times (IRTs; the time between successive responses). Baron (1991) also described a procedure for shaping responding under a shock-avoidance contingency. Shaping is part of any organism's day-to-day interactions with its environment. Whether one is hammering a nail or learning a new computer program, the natural and immediate consequences of an action play a critical role in determining whether a given response will be eliminated, repeated, or modified. Some researchers have suggested that shaping occurs when established reinforcers occur independently of responding. Skinner (1948; see also Neuringer, 1970), for example, provided food-deprived pigeons with brief access to food at 15-s intervals. Each pigeon developed repetitive stereotyped responses. Skinner attributed the outcome to accidental temporal contiguity between the response and food delivery. His interpretation, however, was challenged by Staddon and Simmelhag (1971) and Timberlake and Lucas (1985), who attributed the resulting behavior to biological and ecological processes rather than reinforcement. Nonetheless, the notion of superstitious behavior resulting from accidental pairings of response and reinforcer remains an important methodological and interpretational concept in TEAB (e.g., the changeover delay used ubiquitously in concurrent schedules is predicated on its value in eliminating the adventitious reinforcement of changing between concurrently available operanda).
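The changeover delay mentioned parenthetically above is itself a small contingency that can be stated procedurally. The following sketch is a hypothetical illustration of how a changeover delay might be programmed in a concurrent arrangement; the 2-s value, the class name, and the method signature are assumptions introduced for the example rather than details taken from the studies cited.

# Illustrative sketch of a changeover delay (COD) in a concurrent schedule.
# The 2-s value and the interface are assumptions made for this example.

COD = 2.0   # seconds that must elapse after a changeover before a reinforcer can be delivered

class ConcurrentPerformance:
    def __init__(self):
        self.current_key = None
        self.last_changeover = float("-inf")

    def respond(self, key, now, reinforcer_arranged):
        """Return True if an arranged reinforcer may be delivered for this response."""
        if key != self.current_key:          # the subject switched operanda
            self.current_key = key
            self.last_changeover = now
        # The COD withholds reinforcement for responses emitted too soon after
        # switching, so that changing over is not adventitiously reinforced.
        return reinforcer_arranged and (now - self.last_changeover) >= COD

perf = ConcurrentPerformance()
print(perf.respond("left", now=10.0, reinforcer_arranged=True))   # False: just changed over
print(perf.respond("left", now=13.0, reinforcer_arranged=True))   # True: 3 s since the changeover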


Positive Reinforcement Positive reinforcement is the development or maintenance of a response resulting from the responsedependent, time-limited presentation of a stimulus or event (i.e., a positive reinforcer). Schedules of positive reinforcement.  A schedule is a prescription for arranging reinforcers in relation to time and responses (Zeiler, 1984). The simplest such arrangement is to deliver reinforcers independently of responding. Zeiler (1968), for example, first stabilized key pecking of pigeons on fixed-interval (FI) or variable-interval (VI) schedules. Then, the response–reinforcer dependency was eliminated so that reinforcers were delivered independently of key pecking at the end of fixed or variable time periods. This elimination generally reduced response rates, but the patterns of responding continued to be determined by the temporal distribution of reinforcers: Fixed-time schedules yielded positively accelerated responding across the interfood intervals, and variable-time (VT) schedules yielded more evenly distributed responding across those intervals. Zeiler’s (1968) experiment underlines the importance of response–reinforcer dependency in schedulemaintained responding. This dependency has been implemented in two ways in reinforcement schedules. In ratio schedules, either a fixed or a variable number of responses is the sole requirement for reinforcement. In interval schedules (as distinguished from time schedules, in which the response–reinforcer dependency is absent), a single response after a fixed or variable time period is the requirement for reinforcement. Each of these four schedules—fixed ratio (FR), variable ratio (VR), FI, and VI—control wellknown characteristic response patterns. In addition, the distribution of reinforcers in VI and VR schedules, respectively, affect the latency to the first response after a reinforcer and, with the VI schedule, the distribution of responses across the interreinforcer interval (Blakely & Schlinger, 1988; Catania & Reynolds, 1968; Lund, 1976). Other arrangements derive from these basic schedules. For example, reinforcing a sequence of two responses separated from one another by a relatively long or a relatively short time period results in, respectively, low and high rates of responding. The 38

former arrangement is described as differential-reinforcement-of-low-rate (DRL), or an IRT > t, schedule, and the latter as a differential-reinforcement-of-high-rate, or an IRT < t, schedule. The latter in particular often is arranged such that the first IRT < t after the passage of a variable period of time is reinforced. The various individual schedules can be combined to yield more complex arrangements, suited for the analysis of particular behavioral processes. The taxonomic details of such schedules are beyond the scope of this chapter (see Ferster & Skinner, 1957; Lattal, 1991). Several of them, however, are described in other sections of this chapter in the context of particular behavioral processes. Schedules of reinforcement are important in TEAB because they provide useful baselines for the analysis of other behavioral phenomena. Their importance, however, goes much further than this. The ways in which consequences are scheduled are fundamental in determining behavior. This point resonates with the earlier Baer (1981) quotation in the Defining Environmental Events section about behavioral structure. The very form of behavior is a function of the organism's history, of the ways in which reinforcement has been arranged—scheduled—in the past as well as in the present. Parameters of positive reinforcement.  The schedules described in the previous section have their effects on behavior as a function of the parameters of the reinforcers that they arrange. Four widely studied parameters of reinforcement are dependency, rate, delay, and amount. The importance of the response–reinforcer dependency in response maintenance has been described in the preceding section. Its significance is underscored by subsequent experiments showing that variations in the frequency with which this dependency is imposed or omitted modulate response rates (e.g., Lattal, 1974; Lattal & Bryan, 1976). In addition, adding response-independent reinforcers when responding is maintained under different schedules changes both rates and patterns of responding (e.g., Lattal & Bryan, 1976; Lattal, Freeman, & Critchfield, 1989). Reinforcement rate is varied on interval schedules by changing the interreinforcer interval and on ratio schedules by varying the response requirement.
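To make the procedural differences among these arrangements concrete, the sketch below gives minimal implementations of an FR schedule, a VI schedule, and a DRL (IRT > t) contingency. The class names and parameter values are assumptions made for illustration, as is the use of exponentially distributed intervals for the VI schedule, one common but not universal way of arranging variable intervals.

# Illustrative sketches of three basic reinforcement schedules.
# Names, parameter values, and the exponential VI intervals are assumptions.

import random

class FixedRatio:
    """FR n: every nth response produces the reinforcer."""
    def __init__(self, n):
        self.n = n
        self.count = 0

    def response(self, now=None):             # `now` accepted only for a uniform interface
        self.count += 1
        if self.count >= self.n:
            self.count = 0
            return True
        return False

class VariableInterval:
    """VI t: the first response after a variable period (mean t s) is reinforced."""
    def __init__(self, mean_s):
        self.mean_s = mean_s
        self.available_at = random.expovariate(1.0 / mean_s)

    def response(self, now):
        if now >= self.available_at:
            self.available_at = now + random.expovariate(1.0 / self.mean_s)
            return True
        return False

class DRL:
    """IRT > t: a response is reinforced only if at least t s have elapsed since the last response."""
    def __init__(self, t):
        self.t = t
        self.last_response = 0.0               # IRTs timed from session start, for simplicity

    def response(self, now):
        reinforced = (now - self.last_response) >= self.t
        self.last_response = now
        return reinforced

# Example: responding every 4 s never satisfies a DRL 5-s requirement.
drl = DRL(5.0)
print([drl.response(now) for now in (4.0, 8.0, 12.0)])   # [False, False, False]

With such arrangements in place, the parametric questions taken up next, concerning the rate, delay, and amount of reinforcement, can be asked of any of them.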


The effects of reinforcement rate depend on what is measured (e.g., response rate, latency to the first response after a reinforcer). Generally speaking, positively decelerated hyperbolic functions describe the relation between response rate and reinforcement rate (Blakely & Schlinger, 1988; Catania & Reynolds, 1968; Felton & Lyon, 1966; but see the Behavioral Economics section later in this chapter—if the economy is closed, a different relation may hold). Delaying a reinforcer from the response that produces it generally decreases response rates as a function of the delay duration (whether the delay is accompanied by a stimulus change) and the schedule on which it is imposed (Lattal, 2010). The effects of the delay also may be separated from the inevitable changes in reinforcement rate and distribution that accompany the introduction of a delay of reinforcement (Lattal, 1987). Amount of reinforcement includes both its form and its quantity. In terms of form, some reinforcers are substitutable for one another to differing degrees (e.g., root beer and lemon-lime soda), whereas other qualitatively different reinforcers do not substitute for one another, but may be complementary. Two complementary reinforcers covary with one another (e.g., food and water). Reinforcers that vary in concentration (e.g., a 10% sucrose solution vs. a 50% sucrose solution), magnitude (one vs. six food pellets), or duration (1 s versus 6 s of food access) often have variable effects on behavior (see review by Bonem & Crossman, 1988), with some investigators (e.g., Blakely & Schlinger, 1988) reporting systematic differences as a function of duration, but others not (Bonem & Crossman, 1988). One variable that affects these different outcomes is whether the quantitatively different reinforcers are arranged across successive conditions or within individual sessions (Catania, 1963). DeGrandpre, Bickel, Hughes, Layng, and Badger (1993) suggested that reinforcer amount effects are better predicted by taking into account both other reinforcement parameters and the schedule requirements (see Volume 2, Chapter 8, this handbook).
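One widely used positively decelerated form is the hyperbola associated with Herrnstein's (1970) account of responding maintained by a single schedule, B = kr/(r + re), in which B is response rate, r is reinforcement rate, k is the asymptotic response rate, and re represents reinforcement from unmeasured sources. The short sketch below simply evaluates that function to show its shape; the parameter values are arbitrary choices for illustration, not estimates from the experiments cited above.

# Illustrative evaluation of a positively decelerated hyperbolic relation
# between response rate (B) and reinforcement rate (r): B = k * r / (r + re).
# The parameter values are arbitrary choices made for this example.

def hyperbolic_response_rate(r, k=80.0, re=20.0):
    """Predicted responses per minute at reinforcement rate r (reinforcers per hour)."""
    return k * r / (r + re)

for r in (10, 20, 40, 80, 160, 320):
    print(r, round(hyperbolic_response_rate(r), 1))
# Output climbs steeply at low reinforcement rates and flattens toward k,
# the positively decelerated form described in the text.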

Negative Reinforcement Negative reinforcement is the development or maintenance of a response resulting from the

response-dependent, time-limited removal of some stimulus or event (i.e., a negative reinforcer). Schedules of negative reinforcement.  Schedules of negative reinforcement involve contingencies in which situations are either terminated or postponed as a consequence of the response. The prototypical stimulus used as the negative reinforcer in laboratory investigations of negative reinforcement is electrical stimulation, because of both its reliability and its specifiability in physical terms, although examples of negative reinforcement involving other types of events abound. Escape.  Responding according to some schedule intermittently terminates the delivery of electric shocks for short periods. Azrin, Holz, Hake, and Allyon (1963) delivered to squirrel monkeys response-independent shocks according to a VT schedule. A fixed number of lever presses suspended shock delivery and changed the stimulus conditions (turning on a tone and dimming the chamber lights) for a fixed time period. Lever pressing was a function of both the duration of the time out and shock intensity, but the data were in the form of cumulative records, precluding a quantitative analysis of the functional relations between responding and these variables. This and an earlier experiment on VI escape schedules with rats (Dinsmoor, 1962) are among the few studies reporting the effects of schedules of negative reinforcement based on shock termination, thereby limiting the generality of the findings. One problem with using escape from electric shock is that shock can elicit responses, such as freezing or emotional reactions, that are incompatible with the operant escape response. An alternative method of studying escape that circumvents this problem is a timeout from the avoidance procedure first described by Verhave (1962). Perone and Galizio (1987, Experiment 1) trained rats to lever press when this response postponed the delivery of scheduled shocks. At the same time, a multiple schedule was in effect for a second lever in the chamber. ­During one of the multiple-schedule components, pressing the second lever produced timeouts from avoidance (i.e., escape from the avoidance contingency and the stimuli associated with it) according 39


to a VI schedule. Presses on the second lever had no effect during the other multiple-schedule component (escape extinction). During the VI escape component, responding on the second lever was of moderate rate and constant over time, but it was infrequent in the escape-extinction component. Other parameters of timeout from avoidance largely have been unexplored. Avoidance.  The difference between escape and avoidance is one of degree rather than kind. When the escape procedure is conceptualized conventionally as allowing response-produced termination of a currently present stimulus, avoidance procedures allow responses to preclude, cancel, or postpone stimuli that, in the absence of the response, will occur. The presentation of an electric shock, for example, is preceded by a warning stimulus, during which a response terminates the stimulus and cancels the impending shock. Thus, there is escape from a stimulus associated with a negative reinforcer as well as avoidance of the negative reinforcer itself. Although a good bit of research has been conducted on discriminated avoidance (in which a stimulus change precedes an impending negative reinforcer, e.g., Hoffman, 1966), avoidance unaccompanied by stimulus change is more commonly investigated in TEAB. Free-operant avoidance, sometimes labeled nondiscriminated or unsignaled avoidance, is characterized procedurally by the absence of an exteroceptive stimulus change after the response that postpones or deletes a forthcoming stimulus, such as electric shock. The original free-operant avoidance procedure was described by Sidman (1953) and often bears his name. Each response postponed for a fixed period an otherwise-scheduled electric shock. If a shock was delivered, subsequent shocks were delivered at fixed intervals until the response occurred. These two temporal parameters, labeled, respectively, the response–shock (R-S) and shock–shock (S-S) intervals together determine response rates. Deletion and fixed- and variable-cycle avoidance schedules arrange the cancellation of otherwise unsignaled, scheduled shocks as a function of responding, with effects on responding similar to those of Sidman avoidance (see Baron, 1991).
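The temporal logic of the free-operant avoidance procedure lends itself to a brief procedural sketch. The following is a hypothetical illustration of how shocks might be scheduled from the R-S and S-S intervals and a record of response times; the function name, the parameter values, and the event format are assumptions made for the example, not code from the original studies.

# Illustrative sketch of free-operant (Sidman) avoidance scheduling.
# Parameter values and the event format are assumptions for this example.

def sidman_avoidance(response_times, rs_interval, ss_interval, session_end):
    """Return the times at which shocks would be delivered."""
    shocks = []
    responses = sorted(response_times)
    i = 0
    next_shock = rs_interval               # with no responding, the first shock follows one R-S interval
    while next_shock <= session_end:
        if i < len(responses) and responses[i] <= next_shock:
            # A response emitted before the scheduled shock postpones it by the R-S interval.
            next_shock = responses[i] + rs_interval
            i += 1
        else:
            shocks.append(next_shock)
            next_shock += ss_interval       # after a shock, the S-S interval operates until a response occurs
    return shocks

# Responding every 15 s under an R-S interval of 20 s avoids every scheduled shock.
print(sidman_avoidance([15, 30, 45, 60], rs_interval=20, ss_interval=5, session_end=70))   # []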

Parameters of negative reinforcement.  Two variables that affect the rate of responding under schedules of negative reinforcement are the parameters (e.g., type, frequency [S-S interval in the case of free-operant avoidance], intensity, duration) of the stimulus that is to be escaped or avoided and the duration of the period of stimulus avoidance or elimination yielded by each response (e.g., the R-S interval in Sidman avoidance). Leander (1973) found that response rates on free-operant avoidance schedules were an increasing function of the interaction between electric shock intensity and duration (cf. Das Graças de Souza, de Moraes, & Todorov, 1984). Shock frequency can be manipulated by changing either the S-S or the R-S interval. With other parameters of the avoidance contingency held constant, Sidman (1953) showed that response rates increased with shorter S-S intervals. Response rates also vary inversely with the duration of the R-S interval during free-operant avoidance, such that shorter R-S intervals control higher response rates than longer R-S intervals (Sidman, 1953). Furthermore, Logue and de Villiers (1978) used concurrent variable-cycle avoidance schedules to show that response rates on operanda associated with these schedules were proportional to the frequency of scheduled shocks (rate of negative reinforcement) arranged on the two alternatives. That is, more frequently scheduled shocks controlled higher response rates than did less frequently scheduled shocks.

Extinction Extinction is functionally a reduction or elimination of responding brought about in either of two general operations: by removing the positive or negative reinforcer or by rendering the reinforcer ineffective by eliminating the establishing operation. The former is described hereafter as conventional extinction, because these operations are the ones more commonly used in TEAB when analyzing extinction. With positive reinforcement, the latter is accomplished by either providing continuous access to the reinforcer (satiation) or by removing the response–reinforcer dependency (Rescorla & Skucy, 1969; see Schedules of Positive Reinforcement section earlier in this chapter for the effects of this operation on responding). With


negative reinforcement, extinction is accomplished by making the negative reinforcer inescapable. The rapidity of extinction depends both on the organism's history of reinforcement and probably (although experimental analyses are lacking) on which of the aforementioned procedures are used to arrange extinction (Shnidman, 1968). Herrnstein (1969) suggested that the speed of extinction of avoidance is related to the discriminability of the extinction contingency, an observation that holds as well in the case of positive reinforcement. Extinction effects are rarely permanent once reinforcement is reinstated. Permanent effects of extinction are likely the result of the alternative reinforcement of other responses while extinction of the targeted response is in effect. Extinction also can generate other responses, some of which may be generalized or induced from the extinguished response itself and others of which depend on other stimuli in the environment in which extinction occurs (see Volume 2, Chapter 4, this handbook). Some instances of such behavior are described as schedule induced and are perhaps more accurately labeled extinction induced because they typically occur during those parts of a schedule associated with nonreinforcement (local extinction). For example, such responding is observed during the period after reinforcement under FR or FI schedules, in which the probability of a reinforcer is zero. Azrin, Hutchinson, and Hake (1966; see also Kupfer, Allen, & Malagodi, 2008) found that pigeons attack conspecifics when a previously reinforced key peck response is extinguished. Another example of the generative effects of extinction is resurgence. If a response is reinforced and then extinguished while a second response is concurrently reinforced, extinguishing that second response leads to a resurgence of the first response. The effect occurs whether the first response is or is not extinguished before concurrently reinforcing the second, and the effect depends on parameters of both the first and the second reinforced response (e.g., Bruzek, Thompson, & Peters, 2009; Lieving & Lattal, 2003).
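The resurgence arrangement just described can be summarized as a three-phase procedure, outlined in the sketch below. The phase structure is generic and the schedule values shown are placeholders, not parameters taken from the studies cited.

# Illustrative outline of a typical three-phase resurgence arrangement.
# The schedule values are placeholders, not parameters from the studies cited.

resurgence_phases = [
    {"phase": 1, "first_response": "reinforced (e.g., VI 30 s)", "second_response": "not yet available"},
    {"phase": 2, "first_response": "extinction", "second_response": "reinforced (e.g., VI 30 s)"},
    {"phase": 3, "first_response": "extinction", "second_response": "extinction"},
]

for row in resurgence_phases:
    print(row)
# Resurgence is the reappearance of the first response in Phase 3, when
# reinforcement of the second response is discontinued.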

Frameworks Different frameworks have evolved that summarize and integrate the empirical findings deriving from

analyses of the reinforcement of operant behavior. All begin with description. Many involve quantitative analysis and extrapolation through modeling. Others, although also quantitative in the sense of reducing measurement down to numerical representation, are less abstract, remaining closer to observed functional relations. Each has been successful in accounting for numerous aspects of behavioral phenomena. Each also has limitations. None has achieved universal acceptance. The result is that instead of representing a progression with one framework leading to another, these frameworks together make up a web of interrelated observations, each contributing something to the general understanding of how reinforcement has its effects on behavior. The following sections provide an overview of some of these contributions to this web. Levels of influence.  Thorndike (1911) observed that of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur. (p. 244) The temporal relation between a response and the reinforcer that follows has been a hallmark of the reinforcement process. Skinner’s (1948) “superstition” demonstration underscored the importance of this relation by suggesting that even in the absence of a programmed consequence of responding, the environment will strengthen whatever response occurs contiguously with the reinforcer. Ferster and Skinner (1957) frequently assigned response– reinforcer temporal contiguity a primary role in accounting for the effects of reinforcement schedules. To say that all were not content with response– reinforcer temporal contiguity as the central mechanism for reinforcement is an understatement. In a seminal experiment, Herrnstein and Hineline (1966) exposed rats to a schedule consisting of two response-independent shock distributions, one frequent, the other less so. Each session started in the 41


frequent-shock distribution, and a lever press shifted the distribution of shocks to the leaner one, at which the rat remained until a shock was delivered. At this point, the frequent shock distribution was reinstated and remained in effect until the next response, at which point the above-described cycle repeated. Responses did not eliminate shocks; they only reduced shock frequency. Furthermore, because shocks were distributed randomly in time, temporal discriminations were precluded. Herrnstein and Hineline reasoned that if responding were maintained under this schedule, it would be because of an aggregated effect of reductions in shock frequency over noninstantaneous time periods (cf. Sidman, 1966). Consistent with Herrnstein and Hineline's findings, Baum's (1973) description of the correlation-based law of effect remains a cogent summary of a molar framework for reinforcement (see also Baum, 1989; Williams [1983] critiqued the correlational framework). TEAB makes frequent reference to level of analysis. This phrase refers to both the description of data and the framework for accounting for those data. Molecular descriptions are of individual or groups of responses, often in relation to reinforcement. Molar descriptions are of aggregated responses across time and the allocation of time to differing activities. Molecular accounts of reinforcement effects emphasize the role of events occurring at the time of reinforcement (e.g., Peele, Casey, & Silberberg, 1984), and molar accounts emphasize the role of aggregated effects of reinforcers integrated over noninstantaneous time periods (e.g., Baum, 1989). Proponents of each framework have at various times claimed primacy in accounting for the effects of reinforcement, but the isolation of an irrefutable single mechanism at one level or another seems remote. The issue is not unlike others concerning levels of analysis in other disciplines, for example, punctuated equilibrium versus continuous evolution and wave and particle theories of light. The "resolution" of the molar versus molecular issue may ultimately be pragmatic: The appropriate level is that at which behavior is predicted and controlled for the purposes at hand. Relational reinforcement theory.  Reinforcers necessarily involve activities related to their access,

consumption, or use. Premack (1959) proposed reinforcement to be access to a (relatively) preferred activity, such as eating. For Premack, the first step in assessing reinforcement was to create a preference hierarchy. Next, highly preferred activities were restricted and made accessible contingent on engagement in a nonpreferred activity, with the outcome that the low-probability response increased in frequency. Timberlake and Allison (1974) suggested that the Premack principle was a corollary of a more general response deprivation principle whereby any response constrained below its baseline level can function as a reinforcer for another response that allows the constrained response to rise to its baseline level. Premack's analysis foreshadowed other conceptualizations of reinforcement contingencies in terms of constraints on behavioral output (e.g., Staddon, 1979). Choice and matching.  Perhaps the contribution to modern behavior analysis with the greatest impact is the matching law (Herrnstein, 1970; see Chapter 10, this volume). The matching law is both a summary of a number of empirical reinforcement effects and a framework for integrating those effects. Herrnstein's (1961) original proposal was a simple quantitative statement that relative responding is distributed proportionally among concurrently available alternatives as a function of the relative reinforcement proportions associated with each alternative. He and others thereafter developed it into its more generalized form, expressed by Baum (1974; see also Staddon, 1968; McDowell, 1989) as R1/R2 = b(r1/r2)^a,

where R1 and R2 are response rates to the two alternatives and r1 and r2 are reinforcement rates associated with those alternatives. The parameters a and b are indices of, respectively, sensitivity (also sometimes labeled the discriminability of the alternatives) and bias (e.g., a preexisting preference for one operandum over the other). When restated in logarithmic form, a and b describe the slope and intercept of a straight line fitted to the plot of the two ratios on either axis of a graph. A rather typical, but not universal, finding in many experiments is undermatching, that is, preferences for the richer alternative are less


than is predicted by a strict proportionality between response and reinforcement ratios. This undermatching and overmatching (a greater-than-predicted preference for the richer alternative), and bias were the impetus for developing the generalized matching law. One of Herrnstein's (1970) insights was that all behavior should be considered in the framework of choice. Even when only one response alternative is being measured, there is a choice between that response and engaging in other behavior. This observation led to two further conclusions: The total amount of behavior in a situation is constant or fixed (but see McDowell, 1986); it is simply distributed differently depending on the circumstances, and there are unmeasured sources of reinforcement (originally labeled ro, but later re). Thus, deviations from the strict proportionality rule, such as undermatching, are taken by some to reflect changes in re as well as perhaps bias or sensitivity changes. Davison and Tustin (1978) considered the matching law in the broader context of decision theory, in particular, signal detection theory (D. M. Green & Swets, 1966). Originally developed to distinguish sensitivity and bias effects in psychophysical data, Davison and Tustin used signal detection theory to describe the discriminative function of reinforcement in maintaining operant behavior. Their analysis thus describes the mathematical separation of the biasing and discriminative stimulus effects of the reinforcer. More generally, the analysis of choice has been extended across TEAB, from simple reinforcement schedules to foraging, social behavior, and applied behavior analysis (see Volume 2, Chapter 7, this handbook). Response strength and behavioral momentum.  Reinforcement makes a response more likely, and this increase in response probability or rate has conventionally been taken by many as evidence of the strength of that response. One difficulty with such a view of response strength is that response rate is not determined simply by whether the response has been reinforced but by how the contingencies by which the response is reinforced are arranged. Thus, the absence of the target response may index its strength in the case of a differential-reinforcement-of-other-behavior (DRO) schedule, in which

reinforcement occurs only if the target response does not occur for the specified period. Nevin (1974; see Volume 2, Chapter 5, this handbook) conceptualized response strength as resistance to change of a response when competing reinforcement contingencies impinge on that response. The relation between response strength and resistance to change originated early in the psychology of learning (e.g., Hull, 1943), but Nevin expanded it to schedule-maintained responding. He arranged multiple schedules of reinforcement in which the parameters of the reinforcer—rate, magnitude, and delay, in different experiments—differed in the two components. The reinforcer maintaining the response was made less effective by, in different conditions, removing it (extinction), prefeeding the food-restricted animals (satiation), or providing response-independent reinforcers at different rates during the chamber blackout that separated the two components. Each of these disrupting operations had similar disruptive (response rate–lowering) effects such that the responding maintained by more frequent, larger, or less delayed reinforcers was less reduced than was responding in the other component in which the reinforcers were less frequent, shorter, or more delayed. In considering the resistance of behavior to change, Nevin, Mandell, and Atak (1983) proposed a behavioral analogy to physical momentum, that is, the product of mass and velocity: When responding occurs at the same rate in two different schedule components, but one is less affected by an external variable than is the other, we suggest that the performance exhibiting greater resistance to change be construed as having greater mass. (p. 50) Nevin et al. have shown that reinforcement rate, regardless of the contingency between responses and reinforcers, is a primary determinant of momentum: More frequently reinforced responses are more resistant to change. For example, Nevin, Tota, ­Torquato, and Shull (1990, Experiment 1) found that responding maintained by a combination of response-dependent and response-independent food deliveries (a VI schedule + a VT schedule) was 43


more resistant to disruption than was responding maintained by response-dependent food delivery only. This was the case because the reinforcement rate in the VI + VT component was higher, even though response rate was lower in this component. Nevin et al. (1990) observed that Pavlovian contingencies resulting from the association of discriminative stimuli with different rates of reinforcement may have nonspecific effects that "arouse or motivate operant behavior" (p. 374). Another consideration in behavioral momentum may be response rate. Different reinforcement rates result in different response rates. When reinforcement rates are held constant and response rates are varied, the lower response rates are often more resistant to change (Blackman, 1968; Lattal, 1989). Behavioral economics.  Skinner (1953) observed that "statements about goods, money, prices, wages, and so on, are often made without mentioning human behavior directly, and many important generalizations in economics appear to be relatively independent of the behavior of the individual" (p. 398). The chasm described by Skinner has long since been bridged, both in TEAB (Madden, 2000; see Volume 2, Chapter 8, this handbook) and in the discipline of economics. This bridging has come about through the mutual concern of the two disciplines with the behavior of consumption of goods and services as a function of environmental circumstances. The behavioral–economic framework has been a useful heuristic for generating experimental analyses that have expanded the understanding of reinforcement in several ways. In a token economy on a psychiatric ward (Allyon & Azrin, 1968), for example, consumption of a nominal reinforcer was reduced if the same item was available at a lower cost, or no cost, elsewhere. Thus, for example, if visitors brought desirable food items onto the ward from outside, demand for those food items within the token economy diminished. Hursh (1980) captured the essential features of this scenario by distinguishing open economies, in which items are available from multiple sources, from closed economies, in which those items are only available in a defined context. Hall and Lattal (1990)

directly compared the effects of reinforcement rate on VI schedule performance in open and closed economies. In the open economy, sessions were terminated before the pigeon earned its daily food allotment; postsession feeding was provided so the animal was maintained at a target weight. In the closed economy, the pigeons earned all of their food by key pecking. The functions relating response rate to reinforcement rate differed for the two economic contexts. In the open economy, response rates decreased with decreasing reinforcement rate, whereas in the closed economy, response rates increased with decreasing reinforcement rate. Such findings suggest that the functional relations that obtain between reinforcement parameters and behavior are not universal, but depend on the context in which they occur. Consumption also differs as a function of the reinforcer context. The interaction between reinforcers lies on a continuum. At one extreme, one reinforcer is just as effective as another in maintaining behavior (perfect substitutes). At the other extreme, reinforcers do not substitute for one another at all. Instead, as consumption of one increases (decreases) with price changes, so does the other despite its price being unchanged. Such a relation reveals that the reinforcers function as perfect complements. Between the extremes, reinforcers substitute for one another to varying degrees (for a review, see L. Green & Freed, 1998). Behavioral–economic analyses have shown that qualitatively different reinforcers may vary differentially in the extent to which they sustain behavior in the context of different environmental challenges, often expressed as cost in economic analyses. Some reinforcers continue to sustain behavior even as cost, measured, for example, by the number of responses required for reinforcement, increases, and the behavior sustained by others diminishes with such increased cost. This difference in response sustainability across increasing cost distinguishes inelastic (fixed sustainability regardless of cost) from elastic (sustainability varies with cost) reinforcers. Other ways of increasing the cost of a reinforcer are by increasing the delay between the reinforcer and the response that produced it or by reinforcing a


response with decreasing probability. These two techniques of changing reinforcer cost have been combined with different magnitudes of reinforcement to yield what first was called a self-control paradigm (Rachlin & Green, 1972) but now generally is described as delay (or probability, as appropriate) discounting. A choice is arranged between a small or a large (or a less or more probable) reinforcer, delivered immediately after the choice response. Not surprisingly, the larger (or more probable) reinforcer almost always is selected. Next, a delay is imposed between the response and delivery of the larger reinforcer (or the probability of the larger reinforcer is decreased), and the magnitude of the small, immediate (or small but sure thing) reinforcer is varied systematically until an indifference point is reached at which either choice is equally as likely (e.g., Mazur, 1986; Richards, Mitchell, de Wit, & Seiden, 1997). Using the delay or probability discounting procedure, a function can be created relating indifference points to the changing cost. Steep discounting of delayed reinforcers correlates with addictions such as substance use disorder and pathological gambling (see Madden & Bickel, 2010, for a review). A final, cautionary note that applies to economic concepts and terms as well as more generally to verbal labels commonly used in TEAB, such as contrast or even reinforcement: Terms such as elastic or inelastic or complementary or substitutable are descriptive labels, not explanations. Reinforcers do not have their effects because they are inelastic, substitutable, or discounted. Behavior dynamics.  The emphasis in TEAB on steady-state performance sometimes obscures the centrality of the environment–behavior dynamic that characterizes virtually every contingency of reinforcement. Dynamics implies change, and change is most immediately apparent when behavior is in transition, as in the acquisition of a response previously in the repertoire in only primitive form, transitions from one set of reinforcement conditions to another, or behavioral change from reinforcement to extinction. Steady-state performance, however, also reveals a dynamic system. As Marr (personal communication, November 2010) observed, “[All] contingencies engender and manifest systems

dynamics.” The imposed contingency is not necessarily the effective one because that imposed contingency constantly interacts with responding, and the resulting dynamic is what is ultimately responsible for behavioral modulation and control. This distinction was captured by Zeiler (1977b), who distinguished between direct and indirect variables operating in reinforcement schedules. In a ratio schedule, for example, the response requirement, n, is a direct, specified variable (e.g., as in FR n), but responding takes time, so as one varies the ratio requirement, one is also indirectly varying the time between reinforcer deliveries. Another way of expressing this relation is by the feedback function of a ratio schedule: Reinforcement rate is determined directly by response rate. The feedback function is a quantitative description of the contingency specifying the dynamic interplay between responding and reinforcement. Using methods developed for the quantitative analysis of dynamical systems, a few behavior analysts have explored some properties of reinforcement contingencies (see Marr, 1992). For example, in an elegant analysis, Palya (1992) examined the dynamic structure among successive IRTs in interval schedules, and Hoyert (1992) applied nonlinear dynamical systems (chaos) theory in an attempt to describe the cyclical interval-to-interval changes in response output that characterize steady-state FI schedule performance. Indeed, much of the response variability that is often regarded as a nuisance to be controlled (e.g., Sidman, 1960) may, from a dynamic systems perspective, be the inevitable outcome of the dynamic nature of any reinforcement contingency.

Reinforcement in biological context.  Skinner (1981; see also Donahoe, 2003; Staddon & Simmelhag, 1971) noted the parallels between the selection of traits in evolutionary time and the selection of behavior over the organism’s lifetime (i.e., ontogeny). Phylogeny underlies the selection of behavior by reinforcement in at least two ways. First, certain kinds of events may come to function as reinforcers or punishers in part because of phylogeny. Food, water, and drugs of various sorts, for example, may function as reinforcers at least in part because of their relation to the organism’s evolved physiology. It is both retrograde and false, however, to suggest that reinforcers reduce to physiological needs. Reinforcers have been more productively viewed functionally; however, the organism’s phylogeny certainly cannot be ignored in discussions of reinforcement (see also Breland & Breland, 1961). Second, the mere fact that organisms’ behavior is determined to a considerable extent by consequences is prima facie evidence of evolutionary processes at work. Thus, Skinner (1981) proposed that some responses are selected (reinforced) and therefore tend to recur, and those that are not selected or are selected against tend to disappear from the repertoire. The research of Neuringer (2002; see Chapter 22, this volume) on variability as an operant sheds additional light on the interplay between behavioral variation and behavioral selection.

The analysis of foraging also has been a focal point of research attempting to place reinforcement in biological perspective. Foraging, whether it is for food, mates, or other commodities such as new spring dresses, involves choice. Lea (1979; see also Fantino, 1991) described an operant model for foraging based on concurrent chained schedules (see Response-Dependent Stimuli Correlated With Previously Established Reinforcers section later in this chapter) in which foraging is viewed as consisting of several elements, beginning with search and ending in consumption (including buying the new spring dress). Such an approach holds out the possibility of integrating ecology and TEAB (Fantino, 1991). A related integrative approach is found in parallels between foraging in natural environments and choice as described by the matching law and its variants (see Choice and Matching section earlier in this chapter). Optimal foraging theory, for example, posits that organisms select those alternatives in such a way that costs and benefits of the alternatives are weighed in determining choices (as contrasted, e.g., with maximizing theory, which posits that choices are made such that reinforcement opportunities are maximum). Optimal foraging theory is not, however, without its critics. Zeiler (1992), for example, has suggested that “optimality theory ignores the

fact that natural selection works on what it has to work with, not on ideals” (p. 420) and optimizing means to do the best conceivable. However, natural selection need not maximize returns. . . . What selection must do is follow a satisficing principle. . . . To satisfice means to do well enough to get by, not necessarily to do the best possible. (p. 420) Zeiler thus concluded that optimization in fact may be rare in natural settings. He distinguished evolutionary and immediate function of behavior, noting that the former relates to the fitness enhancement of behavior and the latter to its more immediate effects. In his view, optimal foraging theory errs in using the methods of immediate function to address questions of evolutionary function. Pillar 3: Punishment An outcome of some interactions between an organism’s responses and environmental events is that the responses become less likely than they would be in the absence of those events. As with reinforcement, this outcome is particularly effective when there is a dependency between such environmental events and behavior, a two-term contingency involving responding and what comes to function as a punisher. Such a process of punishment constitutes the third pillar of TEAB.

Positive Punishment Positive punishment is the suppression or elimination of a response resulting from the response-dependent, time-limited presentation of a stimulus or event (i.e., a negative reinforcer). In the laboratory, electric shock is a prototypical positive punisher because, at the parameters used, it produces no injury to the organism, it is precisely initiated and terminated, and it is easily specified in physical terms (e.g., its intensity, frequency, and duration). Furthermore, parameters of shock can be selected that minimize sensitization (overreactivity to a stimulus) and habituation (adaptation or underreactivity to a stimulus). Punishment always is investigated in the context of reinforcement because responding must be

maintained before it can be punished. As a result, the effects of punishers always are relative to the prevailing reinforcement conditions. Perhaps the most important of these is the schedule of reinforcement. Punishment exaggerates postreinforcement pausing on FR and FI schedules, and it decreases response rates on VI schedules (Azrin & Holz, 1966). Because responding on DRL schedules is relatively inefficient (i.e., responses are frequently made before the IRT > t criterion has elapsed), by suppressing responding punishment actually increases the rate of reinforcement. Even so, pigeons will escape from punishment of DRL responding to a situation in which the DRL schedule is in effect without punishment (Azrin, Hake, Holz, & Hutchinson, 1965). Although most investigations of punishment have been conducted using baselines involving positive reinforcement, negative reinforcement schedules also are effective baselines for the study of punishment (e.g., Lattal & Griffin, 1972). Punishment effects vary as a function of parameters of both the reinforcer and the punisher. With respect to reinforcement, punishment is less effective when the organism is more deprived of the reinforcer (Azrin, Holz, & Hake, 1963). The effects of reinforcement rate on the efficacy of punishment are less clear. Church and Raymond (1967) reported that punishment efficacy increased as the rate of reinforcement decreased. When, however, Holz (1968) punished responding on each of two concurrently available VI schedules arranging different rates of reinforcement, the functions relating punishment intensity and the percentage of response reduction from a no-punishment baseline were virtually identical. Holz’s results suggest a similar relative effect of punishment independent of the rate of reinforcement. Perhaps other tests, such as those suggested by behavioral momentum theory (see Response Strength and Behavioral Momentum section earlier in this chapter) could prove useful in resolving these seemingly different results. Parameters of punishment include its immediacy with respect to the target response, intensity, duration, and frequency. Azrin (1956) showed that punishers dependent on a response were more suppressive of responding than were otherwise equivalent punishers delivered independently of

responding at the same rate. Punishers that are more intense (e.g., higher amperage in the case of electric shock) and more frequent have greater suppressive effects, assuming the conditions of reinforcement are held constant (see Azrin & Holz, 1966, for a review). The effects of punisher duration are complicated by the fact that longer duration punishers may adventitiously reinforce responses contiguous with their offset, thereby potentially confounding the effect of the response-dependent presentation of the punisher.
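Holz’s comparison above rests on expressing punished responding as a percentage of its no-punishment baseline. A minimal sketch of that relative measure follows; the response rates are entirely hypothetical and are not data from any study cited here.

```python
# Hypothetical illustration of percentage response reduction from a
# no-punishment baseline, the relative measure that allows suppression to be
# compared across schedules arranging different reinforcement rates.

def percent_suppression(baseline_rate, punished_rate):
    """Percentage reduction in responding relative to the no-punishment baseline."""
    return 100.0 * (baseline_rate - punished_rate) / baseline_rate

# Made-up response rates (responses per minute) at three shock intensities,
# for two VI schedules arranging different reinforcement rates.
baselines = {"VI 1-min": 60.0, "VI 5-min": 25.0}
punished_rates = {
    "VI 1-min": {40: 48.0, 60: 30.0, 80: 12.0},   # shock intensity (V): response rate
    "VI 5-min": {40: 20.0, 60: 12.5, 80: 5.0},
}

for schedule, baseline in baselines.items():
    for intensity, rate in punished_rates[schedule].items():
        print(schedule, intensity, round(percent_suppression(baseline, rate), 1))
```

Plotted against punisher intensity, two such functions that nearly coincide (as Holz’s did) suggest similar relative suppression despite very different absolute response rates.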

Negative Punishment

Negative punishment is the suppression or elimination of a response resulting from the response-dependent, time-limited removal of a stimulus or event. Both negative punishment and conventional extinction involve removing the opportunity for reinforcement. Negative punishment differs from extinction in three critical ways. In conventional extinction, the removal of the opportunity for reinforcement occurs independently of responding, is relatively permanent (or at least indefinite), and is not correlated with a stimulus change. In negative punishment, the removal of the opportunity for reinforcement is response dependent, time limited, and sometimes (but not necessarily) correlated with a distinct stimulus. These latter three characteristics are shared by three procedures: DRO schedules (sometimes also called differential reinforcement of pausing or omission training), timeout from positive reinforcement, and response cost.

Under a DRO schedule, reinforcers depend on the nonoccurrence of the target response for a predetermined interval. Responses during the interreinforcer interval produce no stimulus change, but each response resets the interreinforcer interval. Despite the label of reinforcement, the response-dependent, time-limited removal of the opportunity for reinforcement typically results in substantial, if not total, response suppression, that is, punishment. With DROs, both the amount of time that each response delays the reinforcer and the time between successive reinforcers in the absence of intervening responding can be varied. Neither parameter seems to make much difference once responding is reduced. They may, however, affect the speed with which responding is reduced and the recovery after termination of the contingency (Uhl & Garcia, 1969). As with other punishment procedures, a DRO contingency may be superimposed on a reinforcement schedule maintaining responding (Zeiler, 1976, 1977a), allowing examination of the effects of punishers, positive or negative, on steady-state responding. Lattal and Boyer (1980), for example, reinforced key pecking according to an FI 5-min schedule. At the same time, reinforcers were available according to a VI schedule for pauses in pecking of x s or more. Pecking thus postponed any reinforcers that were made available under the VI schedule. No systematic relation was obtained between required pause duration and response rate. With a constant 5-s pause required for reinforcement of not pecking, however, the rate of key pecking was a negative function of the frequency of DRO reinforcement. That is, the more often a key peck postponed food delivery, the lower the response rates were and thus the greater the punishment effect was.

Timeouts are similar to DROs in that, when used as punishers, they occur as response-dependent, relatively short-term periods of nonreinforcement. They differ from DROs because the periods of nonreinforcement are not necessarily resetting with successive responses, and they are accompanied by a stimulus change. Timeout effects are relative to the prevailing conditions of reinforcement. As was described earlier, periods of timeout from negative reinforcement function as reinforcers, as do periods of timeout from extinction (e.g., Azrin, 1961). Timeouts from situations correlated with reinforcement, however, suppress responding when they are response dependent (Kaufman & Baron, 1968).

With response cost, each response or some portion of responses, depending on the punishment schedule, results in the immediate loss of reinforcers, or some portion of reinforcers. Response cost is most commonly used in laboratory and applied settings in which humans earn points or tokens according to some schedule of reinforcement. Weiner (1962), for example, subtracted one previously earned point for each response made by adult human participants earning points by responding under VI schedules. The effect was considerable suppression of responding. Response cost, similar to

timeout, can entail a concurrent loss of reinforcement as responding is suppressed. Pietras and Hackenberg (2005), however, showed that response cost has a direct suppressive effect on responding, independent of changes in reinforcement rate.
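The resetting contingency that makes a DRO schedule punishing can be stated compactly. The sketch below is illustrative only, with hypothetical event times and arbitrary parameter values; it is not the procedure of any study cited above. A reinforcer is delivered when the target response has been absent for a criterion interval, and every response restarts that interval.

```python
# Schematic of a resetting DRO: reinforcers are delivered only after
# dro_interval seconds without the target response; each response resets
# the clock, thereby postponing the next possible reinforcer.

def dro_reinforcer_times(response_times, session_length, dro_interval):
    """Return the times at which a resetting DRO would deliver reinforcers."""
    reinforcers = []
    pause_start = 0.0                    # moment the current pause began
    now = 0.0
    responses = sorted(response_times)
    i = 0
    while now < session_length:
        due = pause_start + dro_interval
        if i < len(responses) and responses[i] < due:
            pause_start = responses[i]   # a response resets the interval
            now = responses[i]
            i += 1
        else:
            reinforcers.append(due)      # pause criterion met: deliver reinforcer
            pause_start = due            # next interval starts immediately
            now = due
    return [t for t in reinforcers if t <= session_length]

# With responses at 3.0, 4.5, and 21.0 s and a 10-s DRO, reinforcers follow
# each 10-s pause: [14.5, 31.0, 41.0, 51.0].
print(dro_reinforcer_times([3.0, 4.5, 21.0], session_length=60.0, dro_interval=10.0))
```

Varying dro_interval corresponds to varying the amount of time each response postpones the reinforcer, one of the two DRO parameters discussed above.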

Frameworks

Punishment has been conceptualized by different investigators as either a primary or a secondary process.

Punishment as a primary or direct process.  Thorndike (1911) proposed that punishment effects are equivalent and parallel to those of reinforcement but opposite in direction. Thus, the response-strengthening effects defining reinforcement were mirror-image effects of the response-suppressing effects defining punishment. Schuster and Rachlin (1968) suggested three examples: (a) the suppressive and facilitative effects of following a conditioned stimulus (CS) with, respectively, shock or food; (b) mirror-image stimulus generalization gradients around stimuli associated with reinforcement and punishment; and (c) similar indifference when given a choice of response-dependent and response-independent food or between response-dependent shock and response-independent shock (Schuster & Rachlin, 1968). Although (c) has held up to experimental analysis (Brinker & Treadway, 1975; Moore & Fantino, 1975; Schuster & Rachlin, 1968), (a) and (b) have proven more difficult to confirm. For example, precise comparisons of generalization gradients based on punishment and reinforcement are challenging to interpret because of the complexities of equating the food and shock stimuli on which the gradients are based. The research described in the Response-Independent Stimuli Correlated With Reinforcers and Punishers section later in this chapter illustrates the complexities of interpreting (a).

Punishment as a secondary or indirect process.  Considered a secondary process, the response suppression obtained when responding is punished comes about indirectly as the result of negative reinforcement of other responses. Thus, punishment is interpreted as a two-stage (factor) process whereby, first, the stimulus becomes aversive and, second, responses that result in its avoidance are then

negatively reinforced. Hence, as unpunished responses are reinforced because they escape or avoid punishers, punished ones decrease, resulting in what appears as target-response suppression (cf. Arbuckle & Lattal, 1987). A variation of punishment as a secondary process is the competitive suppressive view that responding decreases because punishment degrades or devalues the reinforcer (e.g., Deluty, 1976). Thus, the suppressive effect of punishment is seen as an indirect effect of a less potent reinforcer for punished responses, thereby increasing the potency of reinforcers for nonpunished responses. Contrary to Deluty (1976), Farley’s (1980) results, however, supported a direct suppressive interpretation of punishment. Critchfield, Paletz, MacAleese, and Newland (2003) compared the direct and competitive suppression interpretations of punishment. Using human subjects in a task in which responding was reinforced with points and punished by the loss of a portion of those same points, a quantitative model based on the direct suppression interpretation yielded better fits to the data. Furthermore, Rasmussen and Newland (2008) suggested that the negative law of effect may not be symmetrical. Using a procedure similar to Critchfield et al.’s, they showed that single punishers subtract more value than single reinforcers add. Pillar 4: Control by Stimuli Correlated with Reinforcers and Punishers Reinforcers and punishers often are presented in the context of other stimuli that are initially without discernable effect on behavior. Over time and with continued correlation with established reinforcers or punishers, these other events come to have behavioral effects similar to the events with which they have been correlated. Such behavioral control by these other events is what places them as the fourth pillar of TEAB.

Response-Independent Stimuli Correlated With Previously Established Reinforcers and Punishers

In the typical operant arrangement for studying the effects of conditioned stimuli, responding is maintained according to some schedule of reinforcement, onto which the stimuli and their correlated events are superimposed. In the first such study, Estes and Skinner (1941) trained rats’ lever pressing on an FI food reinforcement schedule and periodically imposed a 3-min tone (a warning stimulus or conditional stimulus [CS]) followed by a brief electric shock. Both the CS and the shock occurred independently of responding, and the FI continued to operate during the CS (otherwise, the response would simply extinguish during the CS). Lever pressing was suppressed during the tone relative to no-tone periods. This conditioned suppression effect occurs under a variety of parameters of the reinforcement schedule, the stimulus at the end of the warning stimulus, and the warning stimulus itself (see Blackman, 1977, for a review).

When warning stimuli that precede an unavoidable shock are superimposed during avoidance-maintained responding (see the earlier Avoidance section), responses during the CS may either increase or decrease relative to those during the no-CS periods. The effect appears to depend on whether the shock at the end of the CS period is discriminable from those used to maintain avoidance. If the same shocks are used, responding is facilitated during the CS; if the shocks are distinct, the outcome is often suppression during the warning stimulus (Blackman, 1977).

A similar arrangement has been studied with positive reinforcement–maintained responding, in which a reinforcer is delivered instead of a shock at the end of the CS. The effects of reinforcers at the end of the CS that are the same as or different from that arranged by the baseline schedule have been investigated. Azrin and Hake (1969) labeled this procedure positive conditioned suppression when they found that VI-maintained lever pressing of rats during the CS was generally suppressed relative to the no-CS periods. Similar suppression occurred whether the event at the end of the CS was the same as or different from the reinforcer used to maintain responding. LoLordo (1971), however, found that pigeons’ responding increased when the CS ended with the same reinforcer arranged by the background schedule, a result he related to autoshaping (Brown & Jenkins, 1968). Facilitation or suppression during a CS followed by a positive reinforcer
seems to depend on both CS duration and the schedule of reinforcement. For example, Kelly (1973) observed both suppression and facilitation during a CS as a function of whether the baseline schedule was DRL or VR.
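Suppression in these experiments is typically summarized relative to responding just before the CS. A conventional index from this literature, shown here as a generic sketch with hypothetical counts rather than the specific measure reported in the studies above, divides CS responding by the sum of CS and pre-CS responding, so that 0 indicates complete suppression, .5 indicates no change, and values above .5 indicate facilitation.

```python
# Conditioned-suppression ratio: responses during the CS relative to an
# equal-length pre-CS period. The counts below are hypothetical.

def suppression_ratio(cs_responses, pre_cs_responses):
    return cs_responses / (cs_responses + pre_cs_responses)

print(suppression_ratio(cs_responses=5, pre_cs_responses=45))    # 0.1 -> strong suppression
print(suppression_ratio(cs_responses=60, pre_cs_responses=40))   # 0.6 -> facilitation
```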

Response-Dependent Stimuli Correlated With Previously Established Punishers

Despite being commonplace in everyday life, conditioned punishment has only rarely been studied in the laboratory. In an investigation of positive conditioned punishment, Hake and Azrin (1965) first established responding on a VI 120-s schedule of positive reinforcement in the presence of a white key light. The key-light color irregularly changed from white to red for 30 s. Reinforcement continued to be arranged according to the VI 120-s schedule during the red key light; however, at the end of the red-key-light period, a 500-ms electric shock was delivered independently of responding. This shock suppressed but did not eliminate responding when the key light was red. To this conditioned suppression procedure, Hake and Azrin then added the following contingency. When the white key light was on, each response produced a 500-ms flash of the red key light. Responding in the presence of the white key light was suppressed. When white-key-light responses produced a 500-ms flash of a green key light, a stimulus that previously had not been paired with shock, responding during the white key light was not suppressed. In addition, the degree of response suppression was a function of the intensity of the shock after the red key light.

A parallel demonstration of negative conditioned punishment based on stimuli correlated with the onset of a timeout was conducted by Gibson (1968). Gibson used Hake and Azrin’s (1965) procedure, except that instead of terminating the red key light with an electric shock, it terminated with a timeout during which the chamber was dark and reinforcement was precluded. The effect was to facilitate key pecking by pigeons during the red key light relative to rates during the white key light. When, however, responses during the white key light produced 500-ms flashes of the red key light, as in Hake and Azrin, responding during the white key light was suppressed. This demonstration of conditioned negative punishment is of particular interest because it occurred despite the fact that stimuli preceding timeouts typically facilitate rather than suppress responding. Thus, the results cannot be interpreted in terms of the red key light serving as a discriminative stimulus for a lower rate of reinforcement (as they could be in Hake and Azrin’s experiment).

Response-Dependent Stimuli Correlated With Previously Established Reinforcers Stimuli correlated with an already-established reinforcer maintain responses that produce them (Williams, 1994). This is the case for stimuli correlated with either the termination or postponement of a negative reinforcer (Siegel & Milby, 1969) or with the onset of a positive reinforcer. In the latter case, early research on conditioned reinforcement used chained schedules (e.g., Kelleher & Gollub, 1962), higher order schedules (Kelleher, 1966), and a two– response-key procedure (Zimmerman, 1969). Interpretational limitations of each of these methods (Williams, 1994) led to the use of two other procedures in the analysis of conditioned reinforcement. The observing–response procedure (Wyckoff, 1952) involves first establishing a discrimination between two stimuli by using a multiple schedule in which one stimulus is correlated with a schedule of reinforcement and the other with extinction (or a schedule arranging a different reinforcement rate). Once discriminative control is established, a third stimulus is correlated with both components to yield a mixed schedule. An observing response on a second operandum converts the mixed schedule to a multiple schedule for a short interval (typically 10–30 s with pigeons). Observing responses are maintained on the second operandum. These responses, of nonhumans at least, are maintained primarily, if not exclusively, by the positive stimulus (S+) and not by the stimulus correlated with extinction (see Fantino & Silberberg, 2010; Perone & Baron, 1980; for an alternative interpretation of the role of the negative stimulus [or S−], see Escobar & Bruner, 2009). The general interpretation has been that the stimulus correlated with reinforcement functions as a conditioned reinforcer maintaining the observing response.

Fantino (1969, 1977) proposed that stimuli function as conditioned reinforcers to the extent that they represent a reduction in the delay (time) to reinforcement relative to the delay operative in their absence. Consider the observing procedure described earlier, assuming that the two components alternate randomly every 2 min. Observing responses convert, for 15-s periods, a mixed VI 2-min extinction schedule to a multiple VI 2-min extinction schedule. Because components are 2 min each, the mixed-schedule stimulus is correlated with a 4-min period (delay) between successive reinforcers. In the presence of the stimulus correlated with the VI schedule, the delay between reinforcers is 2 min, a 50% reduction in delay time relative to that in the mixed-schedule stimulus. Fantino’s delay reduction hypothesis asserts that this signaled reduction in delay to reinforcement maintains observing responses.

In other tests of the delay reduction hypothesis, concurrent chained schedules have been used (in a chained schedule, distinct stimuli accompany the different links). In such tests, two concurrently available identical VI schedules serve as the initial links of the chained schedules. The equivalent VI initial links ensure that either terminal link is accessed approximately equally as often. The terminal links are mutually exclusive: The first initial-link requirement met leads to its terminal link and simultaneously cancels the alternative chained schedule for that cycle. When the terminal-link requirement is met, the response is reinforced, and the concurrent initial links recur. Thus, for example, if the two terminal links are VI 1 min and VI 5 min and the initial links are both VI 1 min, the average time to reinforcement achieved by responding on both alternatives is 3.5 min (0.5 min in the initial link + 3 min in the terminal link). Responding exclusively on the operandum leading to the VI 1-min terminal link produces a reinforcer on average every 2 min, a reinforcement delay reduction of 1.5 min from the average for responding on both. Responding exclusively on the operandum leading to the VI 5-min terminal link produces a reinforcer once every 6 min, yielding a reinforcement delay increase of 2.5 min relative to the average for responding on both. The greater delay reduction for responding on the operandum leading to the VI 1-min terminal link predicts an exclusive preference for this alternative, a prediction confirmed by experimental analysis.

Before leaving the topic of conditioned reinforcement, it should be noted that there is not uniform agreement as to the significance of conditioned reinforcement (and, by extrapolation, conditioned punishment) as a concept. Although it has strong proponents (Dinsmoor, 1983; Fantino, 1977; Williams, 1994), among its critics are Davison and Baum (2006), who suggested that the concept had outlived its usefulness and that conditioned reinforcement is more usefully considered in terms of discriminative stimulus control. This suggestion harkens back to an earlier one that conditioned reinforcers must first be established as discriminative stimuli (e.g., Keller & Schoenfeld, 1950), but Davison and Baum called for the abandonment of the concept (see also Chapter 17, this volume).

Pillar 5: Contextual and Stimulus Control

Interactions between responding and reinforcers and punishers occur in broader environmental contexts, both distal or historical and proximal or contemporary. These contexts define what is sometimes called antecedent control of behavior. The term is something of a misnomer because under such circumstances, the behavior is controlled by the two-term contingency in the context of the third, antecedent event. Such joint control by reinforcement contingencies in context is what constitutes the final pillar of TEAB.
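As a compact restatement of the delay-reduction arithmetic worked through in the preceding section (the numbers are the text’s own; the code merely recomputes them):

```python
# Observing procedure: mixed VI 2-min EXT converted briefly to multiple VI 2-min EXT.
mixed_interfood_delay = 2 + 2              # min; components alternate every 2 min
vi_stimulus_delay = 2                      # min; delay signaled by the VI-correlated stimulus
print(1 - vi_stimulus_delay / mixed_interfood_delay)   # 0.5: a 50% reduction in delay

# Concurrent chains: equal VI 1-min initial links, VI 1-min vs. VI 5-min terminal links.
both_alternatives = 0.5 + (1 + 5) / 2      # 3.5 min on average when responding on both
vi1_exclusive = 1 + 1                      # 2 min when responding only on the VI 1-min chain
vi5_exclusive = 1 + 5                      # 6 min when responding only on the VI 5-min chain
print(both_alternatives - vi1_exclusive)   # 1.5-min delay reduction -> predicted preference
print(both_alternatives - vi5_exclusive)   # -2.5 min: a delay increase
```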

Behavioral History

Perhaps the broadest context for the two-term contingency is historical. Although the research is not extensive, several experiments have examined with some precision functional relations between past experiences and present behavior. These analyses began with the work of Weiner (e.g., 1969), who showed that different individuals responded differently on contemporary reinforcement schedules as a function of their previous experience responding on other schedules. Freeman and Lattal (1992)
investigated the effects of different histories under stimulus control within individual pigeons. The pigeons were first trained on FR and DRL schedules, equated for reinforcement rate and designed to generate disparate response rate, in the presence of distinct stimuli. When subsequently exposed to FI schedules in the presence of both stimuli, they responded for many sessions at higher rates in the presence of the stimuli previously associated with the FR (high response rate) schedule. This effect not only replicated Weiner’s human research, but it showed within individual subjects that an organism’s behavioral history could be controlled by the stimuli with which that history was correlated. Other experiments have elaborated such behavioral history effects. Ono (2004), for example, showed how preferences for forced versus free choices were determined by the organism’s past experiences with the two alternatives.

Discriminative Stimulus Control By correlating distinct stimuli with different conditions of reinforcement or punishment, responding typically comes to be controlled by those stimuli. Such discriminative stimulus control occurs when experimental variations in the stimuli lead to correlated variations in behavior. Discriminative stimulus control can be established with both reinforcement and punishment. The prototypical example of discriminative stimulus control is one in which responding is reinforced in the presence of one stimulus, the S+ or SD, and not in the presence of another, the S− or SΔ. Stimulus control, however, can involve different stimuli correlated with different conditions of reinforcement, in which case there would be two positive stimuli, or it can involve conditions in which punishment is present or absent in the presence of different stimuli. The lack of overriding importance of the form of the stimulus in establishing positive discriminative stimulus control was illustrated by Holz and Azrin (1961). They first punished each of a pigeon’s key responses otherwise maintained by a VI schedule of reinforcement. Then, both punishment and reinforcement were discontinued, allowing responding to drop to low, but nonzero, levels. At that point, 52

each response again was punished, but reinforcement was not reinstated. Because of its prior correlation with reinforcement, punishment functioned as an S+, thereby resulting in considerable responding, at least in the short term. This result underlines the functional definition of discriminative stimuli: They are defined in terms of their effect, not by their form. Two prerequisites for stimulus control are (a) that the stimuli be different from both the absence of stimulation (i.e., above the absolute threshold) and that they be discriminable from one another (i.e., above the difference threshold) and (b) that the stimuli be correlated with different reinforcement or punishment contingencies. Thresholds are in part physiological and phylogenic. Human responding, for example, cannot be brought under control of visual stimuli outside the range of physical detection of the human eye. Signal detection theory (D. M. Green & Swets, 1966) posits that the discriminable dimension of a stimulus can be separated from the reinforcement contingencies that bias choices in assessments of threshold measurements. Researchers in TEAB have used signal detection methods to not only complement other methods of assessing control by conventional sensory modality stimuli (i.e., visual and auditory stimuli; cf. Nevin, 1969) but also to isolate the discriminative and reinforcing properties of a host of other environmental events, including reinforcement contingencies themselves (e.g., Davison & Tustin, 1978; Lattal, 1979). Another issue that receives considerable attention in the analysis of stimulus control is attention (see Chapter 17, this volume). Although from some perspectives, attention is considered a prerequisite to stimulus control, in TEAB attention is stimulus control (e.g., Ray, 1969). The behavioral index of attention is whether the organism is responding to the nominal stimuli being presented. Thus, attending to a stimulus means responding in its presence and not in its absence, and such differential responding also defines stimulus control. According to this analysis, an instructor does not get the class’s attention to start the day’s activities; responding to the day’s activities is what having the class’s attention means. Correlating stimuli with different reinforcement or punishment contingencies establishes discriminative stimulus control. It sometimes is labeled

discrimination, although care is taken in TEAB to ensure that the term describes an environment–behavior relation and not an action initiated by the organism. Discriminations have been established in two ways. The conventional technique is simply to expose the organism to the discrimination task until the behavior comes under the control of the different discriminative stimuli. Terrace (1963) reported differences in the number of responses made to a stimulus correlated with extinction (S−) as a function of how the discrimination was trained, specifically, how the S− and the correlated period of nonreinforcement were introduced. The typical, sudden introduction of the S− after responding had been well established in the presence of another stimulus resulted in many (unreinforced) responses during the S− presentations, responses that Terrace labeled as errors. Introducing the S− in a different way changes the behavior it controls. Terrace introduced the S− simultaneously with the commencement of S+ training, but at low intensity (the S− was a colored light transilluminating the response key) and initially for very brief time periods. Over successive sessions, both the intensity and the duration of the S− were increased gradually as a function of the pigeon’s behavior in the presence of the S−. This procedure yielded few responses to the S− throughout training and during the steady-state S+–S− discriminative performance. Terrace (1966) suggested that the S− functioned differently when established with, as opposed to without, errors; however, it was unclear whether the fading procedure or the absence of responses to the S− was responsible for these differences. Subsequent research qualified some of Terrace’s suggestions (e.g., Rilling, 1977).
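The fading procedure Terrace used can be schematized as a response-contingent progression. The sketch below is only illustrative: the step sizes, the error criterion, and the starting values are arbitrary assumptions, not Terrace’s parameters.

```python
# Schematic fading of the S-: it begins dim and brief and is brightened and
# lengthened across presentations only when responding to it stays near zero.

def next_fading_step(intensity, duration_s, s_minus_responses,
                     max_intensity=1.0, max_duration_s=180.0):
    """Advance the S- toward full intensity and duration after a (near) errorless step."""
    if s_minus_responses <= 1:                      # arbitrary "errorless" criterion
        intensity = min(max_intensity, intensity * 1.25)
        duration_s = min(max_duration_s, duration_s * 1.5)
    return intensity, duration_s

intensity, duration = 0.05, 5.0                     # arbitrary starting values
for errors in [0, 0, 1, 0, 3, 0]:                   # hypothetical S- responses per step
    intensity, duration = next_fading_step(intensity, duration, errors)
    print(round(intensity, 3), round(duration, 1), errors)
```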

Stimulus Generalization Stimulus generalization refers to changes or gradations in responding as a function of changes or gradations in the stimulus with which the reinforcement or punishment contingency originally was correlated. In a typical procedure, a discrimination is established (generalization gradients are more reliable when a discriminative training procedure is used) between S+ (e.g., a horizontal line projected on a response key) and S− (e.g., a vertical line). In a

test of stimulus generalization, typically conducted in the absence of reinforcement, lines differing in degree of tilt are presented in mixed order, and responding to each is recorded. The result is a gradient, with responding relatively high in the presence of stimuli most like the S+ in training and lowest in the presence of stimuli most like the S−. The shape of the gradient indexes stimulus generalization. A flat gradient suggests all stimuli are responded to similarly, that is, significant generalization or minimal discrimination. A steep gradient (that drops sharply between the S+ and the next-most-similar stimuli, e.g.) indicates that the stimuli differentially control responding, that is, significant discrimination or minimal generalization. The peak, that is, the highest point, of the gradient, is often not at the original training stimulus but rather shifted to the next stimulus in the direction opposite the S−. This peak shift, as it is labeled, has been suggested to reflect aversive properties of the S− in that it did not occur when discriminations were trained without errors (Terrace, 1966). Stimulus generalization gradients also can be observed around the S−. Their assessment poses a difficulty if the S+ and S− are on the same continuum: The gradients around both the S+ and the S− are confounded by the fact that movement away from the S− constitutes movement toward the S+ and vice versa. The solution is to use as the S+ and S− orthogonal stimuli, that is, stimuli that are on different stimulus dimensions, for example, color and line tilt. This way, changes away from, for example, the line tilt correlated with S−, are not changes toward the key color correlated with the S+. Inhibitory generalization gradients typically are V shaped, with the lowest responding in the presence of the S− and increasing with increasing disparity between the test stimulus and the S−. These gradients, sometimes labeled inhibitory generalization gradients, have been interpreted to indicate that extinction, or nonreinforcement, involves the learning of other behavior rather than simply eliminating nonreinforced responding (Hearst, Besley, & Farthing, 1970).
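A generalization test of the kind just described reduces to a simple summary: tally responses at each test value and locate the gradient’s peak. The numbers below are hypothetical (they are not data from any study cited here) and merely illustrate how a peak shift away from the S− would show up.

```python
# Hypothetical generalization-test totals after training with a horizontal S+
# (0 degrees of tilt) and a vertical S- (90 degrees).

s_plus, s_minus = 0, 90
test_tilts      = [-30, -15,  0, 15, 30, 45, 60, 75, 90]
response_totals = [ 55,  80, 72, 50, 30, 18,  9,  4,  2]

peak_tilt = test_tilts[response_totals.index(max(response_totals))]
print("gradient peak at", peak_tilt, "degrees")          # -15 degrees, not the S+

# Peak shift: the peak lies beyond the S+ in the direction away from the S-.
if peak_tilt != s_plus and abs(peak_tilt - s_minus) > abs(s_plus - s_minus):
    print("peak shifted away from the S-")
```

The steepness of the drop between adjacent test values indexes the degree of discrimination, as described above; a flat set of totals would instead indicate substantial generalization.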

Conditional Stimulus Control

The three-term contingency discussed in the preceding sections can itself be brought under stimulus
control, giving rise to a four-term contingency. Here, the stimuli defining the three-term contingency are conditional on another, superordinate set of stimuli, defining the fourth term. Conditional stimulus control has been studied widely using a procedure sometimes called matching to sample (see Sidman & Tailby, 1982). The procedure consists of three-element trials separated from one another by an ITI. In a typical arrangement, a pigeon is confronted by a three–response-key array. In the presence of a preparatory stimulus, a response turns on a sample stimulus, say a red or green key light, each with a probability of 0.5 on a given trial. A response to the transilluminated key turns on the two side stimuli (comparison component), one red and the other green. A peck to the key colored the same as the sample stimulus results in food access for 3 s, whereas a peck to the other key terminates the trial. After an intertrial interval, the cycle repeats. Thus, the red and green lights in the comparison component can be either an S+ or an S− conditional on the stimulus in the sample component. The percentage of choices of colors corresponding to the sample increases with exposure, reaching an asymptote near 100% correct. Variations on the procedure include (a) turning off the sample light during the choice component (zero-delay matching), (b) using sample and choice stimuli that differ in dimension (symbolic matching to sample), (c) using topographically different responses to the different sample stimuli (e.g., a key peck to one and a treadle press to the other), (d) using qualitatively different reinforcers for correct responses to either of the stimuli (differential outcomes procedure), and (e) imposing delays between the response to the sample and onset of the choice component (delayed matching to sample; see MacKay, 1991, for a review). There are myriad possibilities for sample stimuli. Everything from simple colors to astonishingly complex visual arrays has been used to establish conditional stimulus control of responding. A particularly fruitful area of research involving conditional stimulus control is that of delayed matching to sample (see Chapter 18, this volume). Generally speaking, choice accuracy declines as delays increase. Indifference between the choices is reached with pigeons at around 30-s delays. The appropriate description 54

of these gradients is a matter of interpretation. Those who are more cognitively oriented consider the gradients to reflect changes in memory, whereas those favoring a behavior-analytic interpretation generally describe them using action terms such as remembering or forgetting. The conditional discrimination procedure involving both delayed presentations of choice components and the use of complex visual arrays as samples has given rise in part to the study of animal cognition, which has in turn led to often unfounded, and sometimes inexplicable, speculations about cognitive mechanisms underlying conditional stimulus control (and other behavioral phenomena) in nonhuman animals. The conceptual issues related to the interpretation of behavioral processes involving stimulus control in terms of memory or other cognitive mechanisms is beyond the scope of this chapter. Branch (1977), Watkins (1990), and many others have offered perspectives on these issues that are consistent with TEAB.
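The trial structure just described can be summarized schematically. The sketch below is a toy illustration, not a model from the delayed-matching literature: the "chooser" and its forgetting function are arbitrary assumptions, chosen only so that accuracy falls from near-perfect at no delay toward indifference at around 30 s, as described above.

```python
import random

def matching_to_sample_trial(choose, delay_s=0.0):
    """One identity matching-to-sample trial: sample, (delayed) comparisons, choice."""
    sample = random.choice(["red", "green"])            # p = .5 on each trial
    comparisons = ["red", "green"]
    random.shuffle(comparisons)                         # side keys counterbalanced
    choice = choose(sample, comparisons, delay_s)
    return choice == sample                             # correct -> 3-s food access

def forgetful_chooser(sample, comparisons, delay_s):
    """Toy chooser whose accuracy decays with the sample-to-comparison delay."""
    p_correct = 0.5 + 0.5 * (0.9 ** delay_s)            # arbitrary forgetting function
    if random.random() < p_correct:
        return sample
    return next(c for c in comparisons if c != sample)

for delay in [0, 5, 15, 30]:
    accuracy = sum(matching_to_sample_trial(forgetful_chooser, delay)
                   for _ in range(2000)) / 2000
    print(delay, round(accuracy, 2))    # declines from ~1.0 toward ~.5 (indifference)
```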

Temporal Discriminative Stimulus Control of Responding and Timing Every schedule of reinforcement, positive or negative, involves time. It is a direct variable in interval schedules and an indirect one in ratio schedules. Early research on FI schedules suggested that the passage of time functioned as an S+ (e.g., Dews, 1970). The discriminative properties of time were also borne out in conditional discrimination experiments in which the reinforced response was conditional on the passage of one time interval versus another (e.g., Stubbs, 1968) and in experiments involving the peak interval procedure, in which occasional reinforcers are deleted from a series of FIs to reveal where responding peaks before waning. Research on temporal control in turn has given rise to different quantitative theories of timing, notably scalar expectancy theory (Gibbon, 1977) and the behavioral theory of timing (Killeen & Fetterman, 1988). Both theories integrate significant amounts of data generated using the aforementioned procedures, and both have had considerable heuristic value. The behavioral theory of timing focuses more directly on environmental and behavioral events in accounting for the discriminative properties of time.
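One way to state the property at the heart of scalar expectancy theory, summarized here rather than derived from Gibbon’s (1977) full model, is that the variability of temporal estimates grows in proportion to the interval being timed, so that response distributions obtained at different target times superimpose when rescaled:

$$\sigma_T \approx \gamma\, T, \qquad \frac{\sigma_T}{T} = \gamma \ \text{(a roughly constant coefficient of variation)},$$

which is Weber’s law applied to time. The behavioral theory of timing retains this scalar property but, as noted above, grounds it in environmental and behavioral events, adjunctive behavior whose pace varies with the prevailing rate of reinforcement, rather than in a purely internal clock.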

Concept Learning One definition of a concept is in terms of stimulus control. A concept may be said to exist when a similar response is controlled by common elements of otherwise dissimilar stimuli. Human concepts often are verbal in nature, for example, abstract configurations of stimuli that evoke words such as love, ­esoteric, liberating, and so forth. Despite their complexity, concepts are considered in TEAB to be on a continuum with other types of stimulus control of behavior. The classic demonstration of stimulus control of responding by an abstract stimulus was that of Herrnstein, Loveland, and Cable (1976). An S+–S− discrimination was established using a multiple schedule in which responses in one component were reinforced according to a VI 30-s schedule and extinguished in the other. The S+ in each of three experiments was, respectively, one of a variety of slides (more than 1,500 different ones—half positive and half negative in terms of containing the concept under study—were used in each of the three experiments) that were pictures of trees, water, or a particular person. In each experiment, the S− was the absence of these features in an otherwise parallel set of slides. Response rates were higher in the presence of the concept under investigation than in its absence. The basic results of Herrnstein et al. have been replicated systematically many times, using a variety of types of visual stimuli (e.g., Wasserman, Young, & Peissig, 2002). The general topic of concepts and concept learning as instances of stimulus control has been approached in a different, but equally fruitful way by Sidman (e.g., 1986; Sidman & Tailby, 1982). Consider three groups of unrelated stimuli, A, B, and C, presented on a computer screen. Different patterns make up A; different shapes, B; and nonsense syllables, C. The question posed by Sidman and Tailby (1982) was how these structurally different groups of stimuli might all come to control similar responses to them, that is, become equivalent to one another—to function as a stimulus controlling the same response; that is, as a concept. Sidman and Tailby (1982) turned to mathematics for a definition of equivalence and to the conditional discrimination procedure (outlined earlier) for its analysis. An equivalence relation in mathematics

requires a demonstration of three properties: reflexivity, symmetry, and transitivity. Reflexivity is established by showing, in the absence of reinforcement, generalized identity matching (i.e., selecting the comparison stimulus that is identical to the sample). In normally developing humans, the tests for symmetry and transitivity often are combined. One such test consists of teaching the relation between A and B and that between A and C, using the conditional discrimination procedure described previously (e.g., given Sample A, select B from among the available comparison stimuli). In subsequent no-feedback test trials, if C is selected after a B sample and B is selected after a C sample, then these emergent (untrained) transitive relations require that the trained A–B and A–C relations be symmetric (B–A and C–A, respectively). Stimulus equivalence suggests a mechanism whereby new stimulus relations can develop in the absence of direct reinforcement. Sidman (1986) suggested that these emergent relations could address criticisms of the inflexibility of a behavior-analytic approach that relies on direct reinforcement to establish new responses. In addition, if different sets of equivalence relations are themselves equated through training a connecting relation (Sidman, Kirk, & Wilson-Morris, 1985), then the number of equivalence relations established without training increases exponentially. As Sidman observed, both of these outcomes of stimulus equivalence are important advancements in accounting for the acquisition of verbal behavior within a behavioranalytic framework. By expanding the analysis of stimulus equivalence to a five-term contingency, Sidman also attempted to account for meaning in context (see Volume 2, Chapters 1, 6, and 18, this handbook).
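Because reflexivity, symmetry, and transitivity are set-theoretic properties, the emergent relations in the example above can be enumerated mechanically. The sketch below is illustrative only: it computes the closure of the trained A–B and A–C relations; it is not a model of how equivalence classes are actually acquired.

```python
from itertools import product

def equivalence_closure(trained, stimuli):
    """Reflexive, symmetric, and transitive closure of a set of trained relations."""
    relations = set(trained) | {(s, s) for s in stimuli}        # reflexivity
    changed = True
    while changed:
        changed = False
        derived = {(b, a) for (a, b) in relations}              # symmetry
        derived |= {(a, d) for (a, b), (c, d) in product(relations, repeat=2)
                    if b == c}                                  # transitivity
        if not derived <= relations:
            relations |= derived
            changed = True
    return relations

trained = {("A", "B"), ("A", "C")}                              # the two trained relations
closure = equivalence_closure(trained, {"A", "B", "C"})
emergent = closure - trained - {(s, s) for s in "ABC"}
print(sorted(emergent))   # [('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
```

Training a single relation that connects this class to another already-established class merges the two closures, which is the sense in which the number of derived relations grows far faster than the number of relations directly trained.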

Rules and Instructions

An important source of discriminative stimulus control of human behavior is verbal behavior (Skinner, 1957). The analysis of this discriminative function of verbal behavior has most frequently taken the form in TEAB of an analysis of how rules and instructions (the two terms are used interchangeably here, but see also Catania [1998] for suggestions concerning the rules for describing such control of
behavior) act in concert with contingencies to control human behavior. The control by instructions or rules is widely, but not universally, considered a type of discriminative control over responding (e.g., Blakely & Schlinger, 1987). In human interactions, instructions may be spoken, written, or both. The effectiveness of instructions in controlling behavior varies in part as a function of their specificity and their congruency with the contingencies to which they refer. Galizio (1979), for example, elegantly showed how congruent instructions complement contingencies and incongruent ones can lead to ignoring the instruction. Incongruence does not universally have this effect, however. In some experiments, inaccurate instructions have been found to exert control over responding at the expense of the actual contingencies in effect. Galizio’s (1979) results, however, may be more a qualification than the rule in explicating the role of instructions in controlling human behavior. In many situations in which explicit losses are not incurred for following rules, humans often behave in stereotyped ways that suggest they are following either an instruction or their interpretation of an instruction. This outcome is not surprising given the long extraexperimental history of reinforced rule following. Even in the absence of explicit rules, some observers have postulated that humans construct their own rules. Such an analysis, however, is a quagmire— once private events such as self-generated rules are postulated to control responding in some situations, it becomes difficult to exclude their possible role in every situation. Nonetheless, in one study of how self-generated rules might control behavior, Catania, Matthews, and Shimoff (1982) had college students respond on two buttons; one reinforced high rate responding, and the other reinforced lower rate responding. The task was interrupted from time to time, and the students were asked to guess (by completing a series of structured sentences) what the schedule requirements were on the buttons. Stating the correct rule was shaped by reinforcing approximations to the correct description with points. Of interest was the relation between the shaped rule and responding in accord with the schedule in effect on either button. In general, the shaped rules 56

functioned as a discriminative stimulus controlling responding under the two schedules. The Five Pillars Redux Methods, reinforcement, punishment, control by stimuli correlated with reinforcers and punishers, and contextual and stimulus control—these are the five pillars of TEAB, the foundation on which the analyses and findings described in other chapters of this handbook are constructed. A review of these pillars balanced several factors with one another. The first was differing views as to what is fundamental. As with any science, inconsistencies in findings and differences in interpretation are commonplace. They are, however, the fodder for further growth in the science. The second was depth versus breadth of the topics. The relative space devoted to the four pillars representing empirical findings in TEAB reflects, more or less, the relative research activity making up each of those pillars. Each pillar is, of course, deeper than can be developed within the space constraints assigned. Important material was truncated to attain the breadth of coverage expected of an overview. The third was classic and contemporary research. Both have a role in defining foundations; the former lay the groundwork for contemporary developments, which in turn herald the future of TEAB. Finally, the metaphorical nature of the five pillars needs to be taken a step further, for these are not pillars of stone. Rather, the material of these pillars is organic, subject to the same contingencies that they seek to describe. Research areas and problems come and go for a host of reasons. They are subject to the vicissitudes of life: Researchers come into their own, move, change, retire, or die (physically or metaphorically; sometimes both, but sometimes at different times); agency funding and university priorities change. Dead ends are reached. Marvelous discoveries captivate entire generations of scientists, or maybe just one scientist. Changes in TEAB will change both the content and, over time, perhaps the very pillars themselves. Indeed, it is highly likely that the research described in this handbook eventually will rewrite this chapter. As TEAB and the pillars that support it continue to

evolve, TEAB will contribute even more to an understanding of the behavior of organisms.

References Allyon, T., & Azrin, N. H. (1968). The token economy. New York, NY: Appleton-Century-Crofts. Arbuckle, J. L., & Lattal, K. A. (1987). A role for negative reinforcement of response omission in punishment? Journal of the Experimental Analysis of Behavior, 48, 407–416. doi:10.1901/jeab.1987.48-407 Azrin, N. H. (1956). Some effects of two intermittent schedules of immediate and non-immediate punishment. Journal of Psychology, 42, 3–21. doi:10.1080/00 223980.1956.9713020 Azrin, N. H. (1961). Time-out from positive reinforcement. Science, 133, 382–383. doi:10.1126/science. 133.3450.382 Azrin, N. H., & Hake, D. F. (1969). Positive conditioned suppression: Conditioned suppression using positive reinforcers as the unconditioned stimuli. Journal of the Experimental Analysis of Behavior, 12, 167–173. doi:10.1901/jeab.1969.12-167 Azrin, N. H., Hake, D. F., Holz, W. C., & Hutchinson, R. R. (1965). Motivational aspects of escape from punishment. Journal of the Experimental Analysis of Behavior, 8, 31–44. doi:10.1901/jeab.1965.8-31 Azrin, N. H., & Holz, W. C. (1966). Punishment. In W. K. Honig (Ed.), Operantbehavior: Areas of research and application (pp. 380–447). New York, NY: Appleton-Century-Crofts. Azrin, N. H., Holz, W. C., & Hake, D. F. (1963). Fixedratio punishment. Journal of the Experimental Analysis of Behavior, 6, 141–148. doi:10.1901/ jeab.1963.6-141 Azrin, N. H., Holz, W. C., Hake, D. F., & Allyon, T. (1963). Fixed-ratio escape reinforcement. Journal of the Experimental Analysis of Behavior, 6, 449–456. doi:10.1901/jeab.1963.6-449 Azrin, N. H., Hutchinson, R. R., & Hake, D. F. (1966). Extinction-induced aggression. Journal of the Experimental Analysis of Behavior, 9, 191–204. doi:10.1901/jeab.1966.9-191 Bachrach, A. (1960). Psychological research: An introduction. New York, NY: Random House. Baer, D. M. (1981). The imposition of structure on behavior and the demolition of behavioral structures. In D. J. Bernstein (Ed.), Response structure and organization (pp. 217–254). Lincoln: University of Nebraska Press. Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97. doi:10.1901/jaba.1968.1-91

Baron, A. (1991). Avoidance and punishment. In I. Iversen & K. A. Lattal (Eds.), Experimental analysis of behavior: Part1 (pp. 173–217). Amsterdam, the Netherlands: Elsevier. Baron, A. (1999). Statistical inference in behavior analysis: Friend or foe? Behavior Analyst, 22, 83–85. Baron, A., & Galizio, M. (2005). Positive and negative reinforcement: Should the distinction be preserved? Behavior Analyst, 28, 85–98. Baum, W. M. (1973). The correlation-based law of effect. Journal of the Experimental Analysis of Behavior, 20, 137–153. doi:10.1901/jeab.1973.20-137 Baum, W. M. (1974). On two types of deviation from the matching law: Bias and undermatching. Journal of the Experimental Analysis of Behavior, 22, 231–242. doi:10.1901/jeab.1974.22-231 Baum, W. M. (1989). Quantitative description and molar description of the environment. Behavior Analyst, 12, 167–176. Blackman, D. (1968). Conditioned suppression or facilitation as a function of the behavioral baseline. Journal of the Experimental Analysis of Behavior, 11, 53–61. doi:10.1901/jeab.1968.11-53 Blackman, D. E. (1977). Conditioned suppression and the effects of classical conditioning on operant behavior. In W. K. Honig & J. E. R. Staddon (Eds.), Handbook of operant behavior (pp. 340–363). New York, NY: Prentice Hall. Blakely, E., & Schlinger, H. (1987). Rules: Functionaltering contingency-specifying stimuli. Behavior Analyst, 10, 183–187. Blakely, E., & Schlinger, H. (1988). Determinants of pausing under variable-ratio schedules: Reinforcer magnitude, ratio size, and schedule configuration. Journal of the Experimental Analysis of Behavior, 50, 65–73. doi:10.1901/jeab.1988.50-65 Bonem, M., & Crossman, E. K. (1988). Elucidating the effects of reinforcer magnitude. Psychological Bulletin, 104, 348–362. doi:10.1037/00332909.104.3.348 Boren, J. J., & Devine, D. D. (1968). The repeated acquisition of behavioral chains. Journal of the Experimental Analysis of Behavior, 11, 651–660. doi:10.1901/ jeab.1968.11-651 Branch, M. N. (1977). On the role of “memory” in behavior analysis. Journal of the Experimental Analysis of Behavior, 28, 171–179. doi:10.1901/jeab.1977.28-171 Breland, K., & Breland, M. (1961). The misbehavior of organisms. American Psychologist, 16, 681–684. doi:10.1037/h0040090 Brinker, R. P., & Treadway, J. T. (1975). Preference and discrimination between response-dependent and response-independent schedules of reinforcement. 57

Journal of the Experimental Analysis of Behavior, 24, 73–77. doi:10.1901/jeab.1975.24-73 Brown, P. L., & Jenkins, H. M. (1968). Autoshaping the pigeon’s key-peck. Journal of the Experimental Analysis of Behavior, 11, 1–8. doi:10.1901/jeab.1968.11-1 Bruzek, J. L., Thompson, R. H., & Peters, L. C. (2009). Resurgence of infant caregiving responses. Journal of the Experimental Analysis of Behavior, 92, 327–343. doi:10.1901/jeab.2009.92-327 Catania, A. C. (1963). Concurrent performances: A baseline for the study of reinforcement magnitude. Journal of the Experimental Analysis of Behavior, 6, 299–300. doi:10.1901/jeab.1963.6-299 Catania, A. C. (1998). The taxonomy of verbal behavior. In K. A. Lattal & M. Perone (Eds.), Handbook of research methods in human operant behavior (pp. 405–433). New York, NY: Plenum Press. Catania, A. C., Matthews, B. A., & Shimoff, E. (1982). Instructed versus shaped human verbal behavior: Interactions with nonverbal responding. Journal of the Experimental Analysis of Behavior, 38, 233–248. doi:10.1901/jeab.1982.38-233 Catania, A. C., & Reynolds, G. S. (1968). A quantitative analysis of responding maintained by interval schedules of reinforcement. Journal of the Experimental Analysis of Behavior, 11, 327–383. doi:10.1901/jeab.1968.11-s327 Church, R. M., & Raymond, G. A. (1967). Influence of the schedule of positive reinforcement on punished behavior. Journal of Comparative and Physiological Psychology, 63, 329–332. doi:10.1037/h0024382 Critchfield, T. S., Paletz, E. M., MacAleese, K. R., & Newland, M. C. (2003). Punishment in human choice: Direct or competitive suppression? Journal of the Experimental Analysis of Behavior, 80, 1–27. doi:10.1901/jeab.2003.80-1 Das Graças de Souza, D. D., de Moraes, A. B. A., & Todorov, J. C. (1984). Shock intensity and signaled avoidance responding. Journal of the Experimental Analysis of Behavior, 42, 67–74. doi:10.1901/jeab.1984.42-67 Davison, M. (1999). Statistical inference in behavior analysis: Having my cake and eating it. Behavior Analyst, 22, 99–103. Davison, M., & Baum, W. M. (2006). Do conditional reinforcers count? Journal of the Experimental Analysis of Behavior, 86, 269–283. doi:10.1901/jeab.2006.56-05 Davison, M. C., & Tustin, R. D. (1978). The relation between the generalized matching law and signal detection theory. Journal of the Experimental Analysis of Behavior, 29, 331–336. doi:10.1901/jeab.1978.29-331 DeGrandpre, R. J., Bickel, W. K., Hughes, J. R., Layng, M. P., & Badger, G. (1993). Unit price as a useful metric in
analyzing effects of reinforcer magnitude. Journal of the Experimental Analysis of Behavior, 60, 641–666. doi:10.1901/jeab.1993.60-641 de Lorge, J. (1971). The effects of brief stimuli presented under a multiple schedule of second-order schedules. Journal of the Experimental Analysis of Behavior, 15, 19–25. doi:10.1901/jeab.1971.15-19 Deluty, M. Z. (1976). Choice and the rate of punishment in concurrent schedules. Journal of the Experimental Analysis of Behavior, 25, 75–80. doi:10.1901/ jeab.1976.25-75 Dews, P. B. (1970). The theory of fixed-interval responding. In W. N. Schoenfeld (Ed.), The theory of reinforcement schedules (pp. 43–61). New York, NY: Appleton-Century-Crofts. Dinsmoor, J. A. (1962). Variable-interval escape from stimuli accompanied by shocks. Journal of the Experimental Analysis of Behavior, 5, 41–47. doi:10.1901/jeab. 1962.5-41 Dinsmoor, J. A. (1983). Observing and conditioned reinforcement. Behavioral and Brain Sciences, 6, 693–728. doi:10.1017/S0140525X00017969 Donahoe, J. W. (2003). Selectionism. In K. A. Lattal & P. N. Chase (Eds.), Behavior theory and philosophy (pp. 103–128). New York, NY: Kluwer Academic. Eckerman, D. A., Hienz, R. D., Stern, S., & Kowlowitz, V. (1980). Shaping the location of a pigeon’s peck: Effect of rate and size of shaping steps. Journal of the Experimental Analysis of Behavior, 33, 299–310. doi:10.1901/jeab.1980.33-299 Escobar, R., & Bruner, C. A. (2009). Observing responses and serial stimuli: Searching for the reinforcing properties of the S−. Journal of the Experimental Analysis of Behavior, 92, 215–231. doi:10.1901/jeab.2009.92-215 Estes, W. K., & Skinner, B. F. (1941). Some quantitative properties of anxiety. Journal of Experimental Psychology, 29, 390–400. doi:10.1037/h0062283 Fantino, E. (1969). Conditioned reinforcement, choice, and the psychological distance to reward. In D. P. Hendry (Ed.), Conditioned reinforcement (pp. 163–191). Homewood, IL: Dorsey Press. Fantino, E. (1977). Conditioned reinforcement: Choice and information. In W. K. Honig & J. E. R. Staddon (Eds.), Handbook of operant behavior (pp. 313–339). New York, NY: Prentice Hall. Fantino, E. (1991). Behavioral ecology. In I. Iversen & K. A. Lattal (Eds.), Experimental analysis of behavior: Part 2 (pp. 117–153). Amsterdam, the Netherlands: Elsevier. Fantino, E., & Silberberg, A. (2010). Revisiting the role of bad news in maintaining human observing behavior. Journal of the Experimental Analysis of Behavior, 93, 157–170. doi:10.1901/jeab.2010.93-157

Farley, J. (1980). Reinforcement and punishment effects in concurrent schedules: A test of two models. Journal of the Experimental Analysis of Behavior, 33, 311–326. doi:10.1901/jeab.1980.33-311

Herrnstein, R. J., & Hineline, P. N. (1966). Negative reinforcement as shock-frequency reduction. Journal of the Experimental Analysis of Behavior, 9, 421–430. doi:10.1901/jeab.1966.9-421

Felton, M., & Lyon, D. O. (1966). The post-reinforcement pause. Journal of the Experimental Analysis of Behavior, 9, 131–134. doi:10.1901/jeab.1966.9-131

Herrnstein, R. J., Loveland, D. H., & Cable, C. (1976). Natural concepts in pigeons. Journal of Experimental Psychology: Animal Behavior Processes, 2, 285–302. doi:10.1037/0097-7403.2.4.285

Ferster, C. B., & Skinner, B. F. (1957). Schedules of reinforcement. New York, NY: Appleton-Century-Crofts. doi:10.1037/10627-000 Freeman, T. J., & Lattal, K. A. (1992). Stimulus control of behavioral history. Journal of the Experimental Analysis of Behavior, 57, 5–15. doi:10.1901/jeab.1992.57-5 Galizio, M. (1979). Contingency-shaped and rule-governed behavior: Instructional control of human loss avoidance. Journal of the Experimental Analysis of Behavior, 31, 53–70. doi:10.1901/jeab.1979.31-53 Gibbon, J. (1977). Scalar expectancy theory and Weber’s law in animal timing. Psychological Review, 84, 279–325. doi:10.1037/0033-295X.84.3.279 Gibson, D. A. (1966). A quick and simple method for magazine training the pigeon. Perceptual and Motor Skills, 23, 1230. doi:10.2466/pms.1966.23.3f.1230 Gibson, D. A. (1968). Conditioned punishment by stimuli signalling time out from positive reinforcement. Unpublished doctoral dissertation, University of Alabama, Tuscaloosa. Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York, NY: Wiley. Green, L., & Freed, D. E. (1998). Behavioral economics. In W. O’Donohue (Ed.), Learning and behavior therapy (pp. 274–300). Needham Heights, MA: Allyn & Bacon. Hake, D. F., & Azrin, N. H. (1965). Conditioned punishment. Journal of the Experimental Analysis of Behavior, 8, 279–293. doi:10.1901/jeab.1965.8-279 Hall, G. A., & Lattal, K. A. (1990). Variable-interval schedule performance under open and closed economies. Journal of the Experimental Analysis of Behavior, 54, 13–22. doi:10.1901/jeab.1990.54-13 Hearst, E., Besley, S., & Farthing, G. W. (1970). Inhibition and the stimulus control of operant behavior. Journal of the Experimental Analysis of Behavior, 14, 373–409. doi:10.1901/jeab.1970.14-s373 Herrnstein, R. J. (1961). Relative and absolute strength of response as a function of frequency of reinforcement. Journal of the Experimental Analysis of Behavior, 4, 267–272. doi:10.1901/jeab.1961.4-267

Hoffman, H. S. (1966). The analysis of discriminated avoidance. In W. K. Honig (Ed.), Operant behavior: Areas of research and application (pp. 499–530). New York, NY: Appleton-Century-Crofts. Holz, W. C. (1968). Punishment and rate of positive reinforcement. Journal of the Experimental Analysis of Behavior, 11, 285–292. doi:10.1901/jeab.1968.11-285 Holz, W. C., & Azrin, N. H. (1961). Discriminative properties of punishment. Journal of the Experimental Analysis of Behavior, 4, 225–232. doi:10.1901/jeab.1961.4-225 Hoyert, M. S. (1992). Order and chaos in fixed-interval schedules of reinforcement. Journal of the Experimental Analysis of Behavior, 57, 339–363. doi:10.1901/jeab.1992.57-339 Hull, C. L. (1943). Principles of behavior. New York, NY: Appleton-Century-Crofts. Hursh, S. R. (1980). Economic concepts for the analysis of behavior. Journal of the Experimental Analysis of Behavior, 34, 219–238. doi:10.1901/jeab.1980.34-219 Kaufman, A., & Baron, A. (1968). Suppression of behavior by timeout punishment when suppression results in loss of positive reinforcement. Journal of the Experimental Analysis of Behavior, 11, 595–607. doi:10.1901/jeab.1968.11-595 Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings. New York, NY: Oxford University Press. Kelleher, R. T. (1966). Chaining and conditioned reinforcement. In W. K. Honig (Ed.), Operant behavior: Areas of research and application (pp. 160–212). New York, NY: Appleton-Century-Crofts. Kelleher, R. T., & Gollub, L. R. (1962). A review of conditioned reinforcement. Journal of the Experimental Analysis of Behavior, 5, 543–597. doi:10.1901/jeab.1962.5-s543 Keller, F. S., & Schoenfeld, W. N. (1950). Principles of psychology. New York, NY: Appleton-Century-Crofts.

Herrnstein, R. J. (1969). Method and theory in the study of avoidance. Psychological Review, 76, 49–69.

Kelly, D. D. (1973). Suppression of random-ratio and acceleration of temporally spaced responding by the same prereward stimulus in monkeys. Journal of the Experimental Analysis of Behavior, 20, 363–373. doi:10.1901/jeab.1973.20-363

Herrnstein, R. J. (1970). On the law of effect. Journal of the Experimental Analysis of Behavior, 13, 243–266. doi:10.1901/jeab.1970.13-243

Killeen, P. R., & Fetterman, J. G. (1988). A behavioral theory of timing. Psychological Review, 95, 274–295. doi:10.1037/0033-295X.95.2.274

Kupfer, A. S., Allen, R., & Malagodi, E. F. (2008). Induced attack during fixed-ratio and matched-time schedules of food presentation. Journal of the Experimental Analysis of Behavior, 89, 31–48. doi:10.1901/jeab.2008.89-31

Leander, J. D. (1973). Shock intensity and duration interactions on free-operant avoidance behavior. Journal of the Experimental Analysis of Behavior, 19, 481–490. doi:10.1901/jeab.1973.19-481

Lattal, K. A. (1974). Combinations of response-reinforcer dependence and independence. Journal of the Experimental Analysis of Behavior, 22, 357–362. doi:10.1901/jeab.1974.22-357

Lieving, G. A., & Lattal, K. A. (2003). Recency, repeatability, and reinforcer retrenchment: An experimental analysis of resurgence. Journal of the Experimental Analysis of Behavior, 80, 217–233. doi:10.1901/ jeab.2003.80-217

Lattal, K. A. (1975). Reinforcement contingencies as discriminative stimuli. Journal of the Experimental Analysis of Behavior, 23, 241–246. doi:10.1901/ jeab.1975.23-241

Logue, A. W., & de Villiers, P. A. (1978). Matching in concurrent variable-interval avoidance schedules. Journal of the Experimental Analysis of Behavior, 29, 61–66. doi:10.1901/jeab.1978.29-61

Lattal, K. A. (1979). Reinforcement contingencies as discriminative stimuli: II. Effects of changes in stimulus probability. Journal of the Experimental Analysis of Behavior, 31, 15–22.

LoLordo, V. M. (1971). Facilitation of food-reinforced responding by a signal for response-independent food. Journal of the Experimental Analysis of Behavior, 15, 49–55. doi:10.1901/jeab.1971.15-49

Lattal, K. A. (1987). Considerations in the experimental analysis of reinforcement delay. In M. L. Commons, J. Mazur, J. A. Nevin, & H. Rachlin (Eds.), Quantitative studies of operant behavior: The effect of delay and of intervening events on reinforcement value (pp. 107–123). New York, NY: Erlbaum.

Lund, C. A. (1976). Effects of variations in the temporal distribution of reinforcements on interval schedule performance. Journal of the Experimental Analysis of Behavior, 26, 155–164. doi:10.1901/jeab.1976.26-155

Lattal, K. A. (1989). Contingencies on response rate and resistance to change. Learning and Motivation, 20, 191–203. doi:10.1016/0023-9690(89)90017-9 Lattal, K. A. (1991). Scheduling positive reinforcers. In I. H. Iversen & K. A. Lattal (Eds.), Experimental analysis of behavior: Part 1 (pp. 87–134). Amsterdam, the Netherlands: Elsevier. Lattal, K. A. (2010). Delayed reinforcement of operant behavior. Journal of the Experimental Analysis of Behavior, 93, 129–139. doi:10.1901/jeab.2010.93-129 Lattal, K. A., & Boyer, S. S. (1980). Alternative reinforcement effects on fixed-interval responding. Journal of the Experimental Analysis of Behavior, 34, 285–296. doi:10.1901/jeab.1980.34-285 Lattal, K. A., & Bryan, A. J. (1976). Effects of concurrent response-independent reinforcement on fixed-interval schedule performance. Journal of the Experimental Analysis of Behavior, 26, 495–504. doi:10.1901/jeab.1976.26-495 Lattal, K. A., Freeman, T. J., & Critchfield, T. (1989). Dependency location in interval schedules of reinforcement. Journal of the Experimental Analysis of Behavior, 51, 101–117. doi:10.1901/jeab.1989.51-101 Lattal, K. A., & Griffin, M. A. (1972). Punishment contrast during free-operant avoidance. Journal of the Experimental Analysis of Behavior, 18, 509–516. doi:10.1901/jeab.1972.18-509 Lea, S. E. G. (1986). Foraging and reinforcement schedules in the pigeon: Optimal and non-optimal aspects of choice. Animal Behaviour, 34, 1759–1768. doi:10.1016/S0003-3472(86)80262-7

Mackay, H. A. (1991). Conditional stimulus control. In I. Iversen & K. A. Lattal (Eds.), Experimental analysis of behavior: Part 1 (pp. 301–350). Amsterdam, the Netherlands: Elsevier. Madden, G. J. (2000). A behavioral economics primer. In W. Bickel & R. K. Vuchinich (Eds.), Reframing health behavior change with behavioral economics (pp. 3–26). Mahwah, NJ: Erlbaum. Madden, G. J., & Bickel, W. K. (Eds.). (2010). Impulsivity: The behavioral and neurological science of discounting. Washington, DC: American Psychological Association. doi:10.1037/12069-000 Marr, M. J. (1992). Behavior dynamics: One perspective. Journal of the Experimental Analysis of Behavior, 57, 249–266. doi:10.1901/jeab.1992.57-249 Marr, M. J. (2006). Through the looking glass: Symmetry in behavioral principles? Behavior Analyst, 29, 125–128. Mazur, J. E. (1986). Choice between single and multiple delayed reinforcers. Journal of the Experimental Analysis of Behavior, 46, 67–77. doi:10.1901/ jeab.1986.46-67 McDowell, J. J. (1986). On the falsifiability of matching theory. Journal of the Experimental Analysis of Behavior, 45, 63–74. doi:10.1901/jeab.1986.45-63 McDowell, J. J. (1989). Two modern developments in matching theory. Behavior Analyst, 12, 153–166. McKearney, J. W. (1972). Maintenance and suppression of responding under schedules of electric shock presentation. Journal of the Experimental Analysis of Behavior, 17, 425–432. doi:10.1901/jeab.1972.17-425 Michael, J. (1974). Statistical inference for individual organism research: Mixed blessing or curse? Journal
of Applied Behavior Analysis, 7, 647–653. doi:10.1901/ jaba.1974.7-647 Michael, J. (1975). Positive and negative reinforcement: A distinction that is no longer necessary; or a better way to talk about bad things. Behaviorism, 3, 33–44. Michael, J. (1982). Distinguishing between discriminative and motivational functions of stimuli. Journal of the Experimental Analysis of Behavior, 37, 149–155. doi:10.1901/jeab.1982.37-149 Moore, J., & Fantino, E. (1975). Choice and response contingencies. Journal of the Experimental Analysis of Behavior, 23, 339–347. doi:10.1901/jeab.1975. 23-339 Morse, W. H., & Kelleher, R. T. (1977). Determinants of reinforcement and punishment. In W. K. Honig & J. E. R. Staddon (Eds.), Handbook of operant behavior (pp. 98–124). New York, NY: Prentice Hall. Neuringer, A. J. (1970). Superstitious key pecking after three peck-produced reinforcements. Journal of the Experimental Analysis of Behavior, 13, 127–134. doi:10.1901/jeab.1970.13-127 Neuringer, A. (2002). Operant variability: Evidence, functions, and theory. Psychonomic Bulletin and Review, 9, 672–705. doi:10.3758/BF03196324 Nevin, J. A. (1969). Signal detection theory and operant behavior: A review of David M. Green and John A. Swets’Signal detection theory and psychophysics. Journal of the Experimental Analysis of Behavior, 12, 475–480. doi:10.1901/jeab.1969.12-475 Nevin, J. A. (1974). Response strength in multiple schedules. Journal of the Experimental Analysis of Behavior, 21, 389–408. doi:10.1901/jeab.1974.21-389 Nevin, J. A., Mandell, C., & Atak, J. R. (1983). The analysis of behavioral momentum. Journal of the Experimental Analysis of Behavior, 39, 49–59. doi:10.1901/jeab.1983.39-49 Nevin, J. A., Tota, M. E., Torquato, R. D., & Shull, R. L. (1990). Alternative reinforcement increases resistance to change: Pavlovian or operant contingencies? Journal of the Experimental Analysis of Behavior, 53, 359–379. doi:10.1901/jeab.1990.53-359 Ono, K. (2004). Effects of experience on preference between forced and free choice. Journal of the Experimental Analysis of Behavior, 81, 27–37. doi:10.1901/jeab.2004.81-27 Palya, W. L. (1992). Dynamics in the fine structure of schedule-controlled behavior. Journal of the Experimental Analysis of Behavior, 57, 267–287. doi:10.1901/jeab.1992.57-267 Pear, J. J., & Legris, J. A. (1987). Shaping by automated tracking of an arbitrary operant response. Journal of the Experimental Analysis of Behavior, 47, 241–247. doi:10.1901/jeab.1987.47-241

Peele, D. B., Casey, J., & Silberberg, A. (1984). Primacy of interresponse time reinforcement in accounting for rate differences under variable-ratio and variable-interval schedules. Journal of Experimental Psychology: Animal Behavior Processes, 10, 149–167. doi:10.1037/0097-7403.10.2.149 Perone, M., & Baron, A. (1980). Reinforcement of human observing behavior by a stimulus correlated with extinction or increased effort. Journal of the Experimental Analysis of Behavior, 34, 239–261. doi:10.1901/jeab.1980.34-239 Perone, M., & Galizio, M. (1987). Variable-interval schedules of time out from avoidance. Journal of the Experimental Analysis of Behavior, 47, 97–113. doi:10.1901/jeab.1987.47-97 Pietras, C. J., & Hackenberg, T. D. (2005). Response-cost punishment via token loss with pigeons. Behavioural Processes, 69, 343–356. doi:10.1016/j.beproc.2005.02.026 Platt, J. R. (1973). Percentile reinforcement: Paradigms for experimental analysis of response shaping. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in theory and research (Vol. 7, pp. 271–296). New York, NY: Academic Press. Premack, D. (1959). Toward empirical behavior laws: 1. Positive reinforcement. Psychological Review, 66, 219–233. doi:10.1037/h0040891 Rachlin, H., & Green, L. (1972). Commitment, choice and self-control. Journal of the Experimental Analysis of Behavior, 17, 15–22. doi:10.1901/jeab.1972.17-15 Rasmussen, E. B., & Newlin, C. (2008). Asymmetry of reinforcement and punishment in human choice. Journal of the Experimental Analysis of Behavior, 89, 157–167. doi:10.1901/jeab.2008.89-157 Ray, B. A. (1969). Selective attention: The effects of combining stimuli which control incompatible behavior. Journal of the Experimental Analysis of Behavior, 12, 539–550. doi:10.1901/jeab.1969.12-539 Rescorla, R. A., & Skucy, J. C. (1969). Effect of response-independent reinforcers during extinction. Journal of Comparative and Physiological Psychology, 67, 381–389. doi:10.1037/h0026793 Richards, J. B., Mitchell, S. H., de Wit, H., & Seiden, L. S. (1997). Determination of discount functions in rats with an adjusting-amount procedure. Journal of the Experimental Analysis of Behavior, 67, 353–366. doi:10.1901/jeab.1997.67-353 Rilling, M. (1977). Stimulus control and inhibitory processes. In W. K. Honig & J. E. R. Staddon (Eds.), Handbook of operant behavior (pp. 432–480). Englewood Cliffs, NJ: Prentice-Hall. Schuster, R., & Rachlin, H. (1968). Indifference between punishment and free shock: Evidence for the negative
law of effect. Journal of the Experimental Analysis of Behavior, 11, 777–786. doi:10.1901/jeab.1968.11-777 Shnidman, S. R. (1968). Extinction of Sidman avoidance behavior. Journal of the Experimental Analysis of Behavior, 11, 153–156. doi:10.1901/jeab.1968.11-153 Sidman, M. (1953). Two temporal parameters of the maintenance of avoidance behavior by the white rat. Journal of Comparative and Physiological Psychology, 46, 253–261. doi:10.1037/h0060730 Sidman, M. (1960). Tactics of scientific research. New York, NY: Basic Books. Sidman, M. (1966). Avoidance behavior. In W. Honig (Ed.), Operant behavior: Areas of research and application (pp. 448–498). New York, NY: AppletonCentury-Crofts. Sidman, M. (1986). Functional analysis of emergent verbal classes. In T. Thompson & M. D. Zeiler (Eds.), Analysis and integration of behavioral units (pp. 213–245). Hillsdale, NJ: Erlbaum. Sidman, M., Kirk, B., & Wilson-Morris, M. (1985). Six-member stimulus classes generated by conditional-discrimination procedures. Journal of the Experimental Analysis of Behavior, 43, 21–42. doi:10.1901/jeab.1985.43-21 Sidman, M., & Tailby, W. (1982). Conditional discrimination vs. matching to sample: An expansion of the testing paradigm. Journal of the Experimental Analysis of Behavior, 37, 5–22. doi:10.1901/jeab.1982.37-5 Siegel, P. S., & Milby, J. B. (1969). Secondary reinforcement in relation to shock termination: Second chapter. Psychological Bulletin, 72, 146–156. doi:10.1037/ h0027781 Skinner, B. F. (1935). The generic nature of the concepts of stimulus and response. Journal of General Psychology, 12, 40–65. doi:10.1080/00221309.1935.9920087 Skinner, B. F. (1938). Behavior of organisms. New York, NY: Appleton-Century-Crofts. Skinner, B. F. (1948). “Superstition” in the pigeon. Journal of Experimental Psychology, 38, 168–172. doi:10.1037/h0055873 Skinner, B. F. (1953). Science and human behavior. New York, NY: Macmillan. Skinner, B. F. (1956). A case history in scientific method. American Psychologist, 11, 221–233. doi:10.1037/ h0047662 Skinner, B. F. (1957). Verbal behavior. New York, NY: Appleton-Century-Crofts. doi:10.1037/11256-000

Staddon, J. E. R. (1968). Spaced responding and choice: A preliminary analysis. Journal of the Experimental Analysis of Behavior, 11, 669–682. doi:10.1901/jeab.1968.11-669 Staddon, J. E. R. (1979). Operant behavior as adaptation to constraint. Journal of Experimental Psychology: General, 108, 48–67. doi:10.1037/0096-3445.108.1.48 Staddon, J. E. R., & Simmelhag, V. (1971). The “superstition” experiment: A re-examination of its implications for the principles of adaptive behavior. Psychological Review, 78, 3–43. doi:10.1037/h0030305 Stubbs, A. (1968). The discrimination of stimulus duration by pigeons. Journal of the Experimental Analysis of Behavior, 11, 223–238. doi:10.1901/jeab.1968.11-223 Terrace, H. S. (1963). Discrimination learning with and without “errors.” Journal of the Experimental Analysis of Behavior, 6, 1–27. doi:10.1901/jeab.1963.6-1 Terrace, H. S. (1966). Stimulus control. In W. Honig (Ed.), Operant behavior: Areas of research and application (pp. 271–344). New York, NY: AppletonCentury-Crofts. Thorndike, E. L. (1911). Animal intelligence. New York, NY: Macmillan. Timberlake, W., & Allison, J. (1974). Response deprivation: An empirical approach to instrumental performance. Psychological Review, 81, 146–164. doi:10.1037/h0036101 Timberlake, W., & Lucas, G. A. (1985). The basis of superstitious behavior: Chance contingency, stimulus substitution, or appetitive behavior? Journal of the Experimental Analysis of Behavior, 44, 279–299. doi:10.1901/jeab.1985.44-279 Uhl, C. N., & Garcia, E. E. (1969). Comparison of omission with extinction in response elimination in rats. Journal of Comparative and Physiological Psychology, 69, 554–562. doi:10.1037/h0028243 Verhave, T. (1962). The functional properties of a time out from an avoidance schedule. Journal of the Experimental Analysis of Behavior, 5, 391–422. doi:10.1901/jeab.1962.5-391 Wasserman, E. A., Young, M. E., & Peissig, J. J. (2002). Brief presentations are sufficient for pigeons to discriminate arrays of same and different stimuli. Journal of the Experimental Analysis of Behavior, 78, 365–373. doi:10.1901/jeab.2002.78-365 Watkins, M. J. (1990). Mediationism and the obfuscation of memory. American Psychologist, 45, 328–335. doi:10.1037/0003-066X.45.3.328

Skinner, B. F. (1966). What is the experimental analysis of behavior? Journal of the Experimental Analysis of Behavior, 9, 213–218. doi:10.1901/jeab.1966.9-213

Weiner, H. (1962). Some effects of response cost upon human operant behavior. Journal of the Experimental Analysis of Behavior, 5, 201–208. doi:10.1901/ jeab.1962.5-201

Skinner, B. F. (1981). Selection by consequences. Science, 213, 501–504. doi:10.1126/science.7244649

Weiner, H. (1969). Conditioning history and the control of human avoidance and escape responding. Journal
of the Experimental Analysis of Behavior, 12, 1039–1043. doi:10.1901/jeab.1969.12-1039 Williams, B. A. (1983). Revising the principle of reinforcement. Behaviorism, 11, 63–88.

Zeiler, M. D. (1977a). Elimination of reinforced behavior: Intermittent schedules of not-responding. Journal of the Experimental Analysis of Behavior, 27, 23–32. doi:10.1901/jeab.1977.27-23

Williams, B. A. (1994). Conditioned reinforcement: Experimental and theoretical issues. Behavior Analyst, 17, 261–285.

Zeiler, M. D. (1977b). Schedules of reinforcement: The controlling variables. In W. K. Honig & J. E. R. Staddon (Eds.), Handbook of operant behavior (pp. 201–232). New York, NY: Prentice Hall.

Wyckoff, L. B. (1952). The role of observing responses in discrimination learning. Psychological Review, 59, 431–442. doi:10.1037/h0053932

Zeiler, M. D. (1984). Reinforcement schedules: The sleeping giant. Journal of the Experimental Analysis of Behavior, 42, 485–493. doi:10.1901/jeab.1984.42-485

Zeiler, M. D. (1968). Fixed and variable schedules of response-independent reinforcement. Journal of the Experimental Analysis of Behavior, 11, 405–414. doi:10.1901/jeab.1968.11-405

Zeiler, M. D. (1992). On immediate function. Journal of the Experimental Analysis of Behavior, 57, 417–427. doi:10.1901/jeab.1992.57-417

Zeiler, M. D. (1976). Positive reinforcement and the elimination of reinforced responses. Journal of the Experimental Analysis of Behavior, 26, 37–44. doi:10.1901/jeab.1976.26-37

Zimmerman, J. (1969). Meanwhile . . . back at the key: Maintenance of behavior by conditioned reinforcement and response-independent primary reinforcement. In D. P. Hendry (Ed.), Conditioned reinforcement (pp. 91–124). Homewood, IL: Dorsey Press.


Chapter 3

Translational Research in Behavior Analysis
William V. Dube

Author note: Preparation of this chapter was supported in part by the Eunice Kennedy Shriver National Institute of Child Health and Human Development, primarily Grants HD004147, HD046666, and HD055456. The contents of this chapter are solely the responsibility of the author and do not necessarily represent the official views of the National Institute of Child Health and Human Development. I acknowledge the truly impressive efforts of the authors of the chapters I discuss. It was a pleasure working with them. DOI: 10.1037/13937-003. APA Handbook of Behavior Analysis: Vol. 1. Methods and Principles, G. J. Madden (Editor-in-Chief). Copyright © 2013 by the American Psychological Association. All rights reserved.

As the 21st century began, the National Institutes of Health delineated a road map for accelerating biomedical research progress with increased attention to more quickly translating basic research into human studies and then into tests and treatments that improve clinical practice with direct benefits to patients (Zerhouni, 2003). A consensus definition of the term translational research adopted by several of the institutes and other organizations is “the process of applying ideas, insights, and discoveries generated through basic scientific inquiry to the treatment or prevention of human disease” (World Health Organization, 2004, p. 141). The National Institutes of Health recognized that a reengineering effort was needed to support the development of translational science. The National Institutes of Health established the Clinical and Translational Science Awards Consortium in October 2006 to assist institutions in creating integrated academic homes for multi- and interdisciplinary research teams to apply new knowledge and techniques to patient care. This consortium began with 12 academic health centers located throughout the nation, has increased to 55 centers as of this writing, and is expected to expand to approximately 60 institutions by 2012. Woolf (2008) noted that two definitions of translational research exist. The term most commonly refers to the bench-to-bedside activity of using the knowledge gained from basic biological sciences to
produce new drugs, devices, and treatments, which has been described as

    from bench to bedside, from Petri dish to people, from animal to human. This is what has been typically meant by translational research. Scientists work at the molecular, then cellular level to test “basic research,” then proceed to applications for animals, and on to humans. (upFRONT, 2006, p. 8)

The end point for this first stage of translational research (often referred to as T1) is the production of a promising new treatment with clinical and commercial potential. The second stage of translational research (T2) addresses the gap between basic science and clinical medicine by improving access to treatment, systems of care, point-of-care decision support tools, and so forth. “The ‘laboratory’ for T2 research is the community and ambulatory care settings, where population-based interventions . . . bring the results of T1 research to the public . . . [an] ‘implementation science’ of fielding and evaluating interventions in real world settings” (Woolf, 2008, p. 211). Westfall, Mold, and Fagnan (2007) noted that the design of implementation systems requires a different skill set than that of the typical practicing physician. For this reason, they proposed that the final stage of translation involves a distinct third step (T3) with a focus on solving the problems encountered by primary care physicians as they attempt to incorporate new discoveries into clinical practice. The distinction between T2 and T3 is that between translation to patients and translation to practice. To this end, Mold and Peterson (2005) described emerging “primary care practice-based research networks [that] are challenging traditional distinctions between research and quality improvement” (p. S12). To summarize the biomedical perspective, T1 is the translation from bench to bedside, or dish to human; T2, from bedside to community; and T3, from dissemination to practice.

To most behavior analysts, this process sounds very familiar. From the beginning, modern behavior analysis has sought not only to treat behavior as the subject matter of a natural science, but also to “apply [the methods of science] to human affairs” (Skinner, 1953, p. 5). The goal of translational research in biomedical science is the treatment or prevention of human disease. From the perspective of behavior analysis, human disease encompasses a range of maladaptive behavior. Examples include educational failures; behavior with high risk for health problems such as substance abuse or unsafe sex; disruptive, aggressive, and self-injurious behavior in individuals with developmental disabilities; and unhappiness and loss of functioning in daily life related to depression, anxiety, or similar conditions. As behavior analysts look at human behavioral diseases, the Petri dish is the pigeon in the operant chamber and similar arrangements; the bedside includes the human operant laboratory, analogue environments of functional analysis (Iwata, Dorsey, Slifer, Bauman, & Richman, 1982/1994), and so forth; and practice settings abound for behavior-analytically inspired treatments in schools, clinics, businesses, and virtually anywhere that significant human activity occurs. The two volumes of this handbook describe much of this basic, translational, and applied research.

Translational Behavior Analysis

Basic and applied behavior-analytic research have proceeded in parallel for many years, with varying degrees of cross-fertilization; for a recent summary of the basic–applied interface, see Mace and
Critchfield (2010). One question that arises when considering the basic–applied distinction is whether the two classifications provide sufficient description of activity in the field. Is a useful distinction able to be made between translational and applied behavior analysis? McIlvane et al. (2011) pointed out several dimensions along which the two areas differ. For example, the participants in applied research are selected because they have ongoing behavioral issues and will derive immediate-term benefit from participation, usually as the resolution of a clinical problem. In contrast, the participants in translational research are selected because they are representatives of clinical or social groups; they may receive some immediate benefit from participation, but the primary goal of the research is a longer term search for processes, principles, and procedures that apply beyond the individual case. Other distinctions include differences in typical research environments, publication outlets, and funding models. At the T1 stage of translational behavior analysis, an important goal is to provide support for applications by validating underlying principles. Every chapter in Volume 2, Part I, of this handbook includes relevant T1 research findings that document continuity in underlying behavioral processes along the continuum from the behavioral bench to bedside. Why is this important? One reason is that behavioral interventions do not always produce the expected results, and a complete understanding of relevant underlying processes may improve the clinician’s or teacher’s ability to make informed procedural adjustments. In terms that Overmier and Burke (1992) used to describe animal models for human diseases, a translational research pathway builds confidence that the relation between a set of basic research findings and a set of applied procedures is one of true material equivalence (i.e., homology of behavioral processes) rather than mere conceptual equivalence (behavioral similarity). I adapt one example from Volume 2, Chapter 7, this handbook, by Jacobs, Borrero, and Vollmer. Suppose that the results of a reinforcer preference test for a student receiving special education services showed that Reinforcer A was highly preferred to several others. Yet when Reinforcer A was provided for completing increasing numbers of arithmetic problems, it became
ineffective at sustaining the behavior. A teacher might not even consider trying other reinforcers that were less preferred, based on the preference assessment results. An understanding of translational research in behavioral economics, however, might lead the teacher to question whether the problem was related to elasticity of demand—the degree of decrease in the value of a reinforcer as its price (the amount of work required to obtain it) increases (for additional information, see the discussion of essential value in Volume 2, Chapter 8, this handbook). The typical reinforcer assessment evaluates relative preference at very low behavioral prices; the required response is usually merely reaching or pointing. Translational research in behavioral economics may suggest a reevaluation of relative reinforcer preferences with response requirements that more closely match those of the arithmetic task. Reinforcers less preferred than Reinforcer A at low prices may become more preferred at higher prices. A second example may be drawn from Volume 2, Chapter 6, this handbook, on discrimination learning and stimulus control. Relatively early research on stimulus equivalence called into question whether language ability was necessary to show positive results on equivalence tests (e.g., Devany, Hayes, & Nelson, 1986; for an introduction to stimulus equivalence, see Chapter 16, this volume and Volume 2, Chapter 1, this handbook). The translational research reviewed in Volume 2, Chapter 6, this handbook, by McIlvane has shown, however, that the typical conditional discrimination procedures used in equivalence research may engender unsuspected forms of stimulus control that meet the reinforcement contingencies (and thus produce high accuracy scores during initial training) but are nevertheless incompatible with equivalence relations because they do not require discrimination of all of the stimuli presented (for details, see Volume 2, Chapter 6, this handbook). The susceptibility to these unsuspected forms of stimulus control may be more prevalent in individuals at lower developmental levels. With procedural enhancements and variations designed to reveal or control for these possibilities, equivalence relations have been shown in humans with minimal language (D. Carr, Wilkinson, Blackman, & McIlvane, 2000; Lionello-DeNolf, McIlvane, Canovas, & Barros, 2008).
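
To make the elasticity-of-demand point above concrete, the sketch below plays out a hypothetical version of the Reinforcer A scenario using one widely used demand model, the exponential demand equation of Hursh and Silberberg (2008). Nothing here is drawn from the chapter under discussion: the parameter values (q0, alpha, k) and the fixed-ratio prices are invented purely for illustration.

    import math

    def demand(price, q0, alpha, k=2.0):
        """Exponential demand equation (Hursh & Silberberg, 2008):
        log10(Q) = log10(Q0) + k * (exp(-alpha * Q0 * price) - 1).
        Returns predicted consumption Q at the given price."""
        return 10 ** (math.log10(q0) + k * (math.exp(-alpha * q0 * price) - 1))

    # Hypothetical parameters: Reinforcer A is consumed more at near-zero price
    # (larger q0) but its demand is more elastic (larger alpha) than Reinforcer B's.
    reinforcer_a = dict(q0=10.0, alpha=0.010)
    reinforcer_b = dict(q0=6.0, alpha=0.002)

    for price in (1, 5, 20, 50):  # price expressed as a fixed-ratio response requirement
        qa = demand(price, **reinforcer_a)
        qb = demand(price, **reinforcer_b)
        winner = "A" if qa > qb else "B"
        print(f"FR {price:>2}: A = {qa:5.2f}, B = {qb:5.2f} -> more consumption of {winner}")

Under these made-up parameters, Reinforcer A supports more consumption at the near-zero price typical of a preference assessment (FR 1) but loses out to Reinforcer B as the response requirement grows, which is exactly the kind of reversal that a reevaluation at task-relevant prices is meant to detect.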

Behavior analysis has also addressed issues analogous to those of the T2 and T3 stages of biomedical translational research, and the chapters in Volume 2 of this handbook include descriptions of these efforts. Particular emphasis on T2 and T3 stages is found in Volume 2, Chapters 1 and 18, this handbook. Volume 2, Chapter 10, this handbook, with its emphasis on the application of known behavioral principles to bring about population-wide changes in behavior and cultural evolution, is a particularly good example of the T3 stage of behavior analysis.

Bidirectional Translation

As McIlvane et al. (2011) noted, translational behavior analysis need not be thought of as a one-way street. Topics, methods, and strategies for basic research may be selected expressly to foster translation, and translational goals may be influenced by the practical realities of application research and intervention practice. To list a few examples, in Volume 2, Chapter 5, this handbook, Nevin and Wacker explicitly call for reverse translation from application to basic research to provide an empirical base to aid in interpretation of factors influencing long-term treatment outcomes for problem behavior. One impetus for the research in stimulus control that McIlvane describes in Volume 2, Chapter 6, this handbook, comes from the problems encountered in special education classrooms in the course of discrete-trials teaching. In their chapter on acceptance and commitment therapy (ACT; Volume 2, Chapter 18, this handbook), Levin et al. move from the clinic to the human operant laboratory for their evaluations of the impact of various ACT components. This approach allows them to evaluate the components with nonclinical populations and using laboratory measures such as task persistence or recovery from mood inductions. The reader will find other examples in Volume 2, Part I, of the handbook.

Overview of Volume 2, Part I: Translational Research in Behavior Analysis

Arranging Reinforcement Contingencies

In Volume 2, Chapter 3, this handbook, DeLeon, Bullock, and Catania describe findings from basic
and translational research that can inform the design of reinforcement systems in applied settings. The chapter is divided into four major sections. In the first section, The Contingencies: Which Schedule Should We Use? DeLeon et al. consider reinforcement schedules and include descriptions of the basic ratio and interval schedules and their characteristics as well as several other types of schedules useful in application. These schedules include differential reinforcement, in which contingencies may specify certain defined response classes or low or high rates of responding; response-independent schedules; and concurrent schedules, which operate simultaneously and usually independently for two or more different response classes.

In the second section, The Response: What Response Should We Reinforce? DeLeon et al. focus on the response and include discussions of research using percentile schedules to shape response forms (e.g., increasing the duration of working on a task), lag schedules to increase response diversity and variability, and embedding reinforcement contingencies within prompted response sequences. They also include an interesting treatment of the punished-by-rewards issue, in which some critics of behavior analysis have claimed that the use of extrinsic reinforcers reduces creativity and destroys intrinsic motivation (e.g., Kohn, 1993), and a summary of research findings that refute this notion.

In the third section, The Reinforcer: Which Reinforcer Should We Choose? DeLeon et al. consider the clinician’s choice of which reinforcer to use. They include thorough discussions of research on preference assessment procedures and the design and implementation of token reinforcement systems. Thus far, the translational research on token exchange schedules has indicated that token reinforcers may maintain their effectiveness under delayed reinforcement better than the directly consumable reinforcers for which they are exchanged; this is identified as a topic for future research in applied settings.

In the fourth section, Reinforcer Effectiveness and the Relativity of Reinforcement, DeLeon et al. consider the role of context in reinforcer effectiveness and include an introduction to behavioral economics, in which reinforcers are treated as commodities
and the behavioral requirements of the reinforcement contingency (e.g., response effort) are treated as the price of that commodity. Changes in price may affect demand for a commodity (typically determined by measuring consumption), and DeLeon et al. describe ways in which the relations between effort and reinforcer effectiveness may have implications for reinforcer use and selection in applied settings. DeLeon et al. conclude with a discussion of two tools for the applied researcher and clinician: a decision tree (Figure 3.1) to guide the selection of reinforcers for applied settings and a list of suggestions for troubleshooting in situations in which arranged reinforcement contingencies do not have the desired effect.
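
The percentile schedules mentioned in the second section above have a simple procedural core that can be sketched in a few lines. The following illustration assumes the common formulation associated with Platt (1973), in which a response is reinforced when its value exceeds at least k of the m most recent responses; the response measure (seconds of on-task behavior), the parameter values, and the simulated data are all hypothetical.

    import random
    from collections import deque

    def percentile_criterion_met(current, window, k):
        """Return True if the current response exceeds at least k of the
        responses in the comparison window (the last m observations)."""
        return sum(current > past for past in window) >= k

    def run_percentile_schedule(responses, m=10, k=7):
        """Sketch of a percentile shaping schedule: once m observations have
        accumulated, each new response is reinforced if it beats at least k of
        them. With a stable response distribution and no ties, the programmed
        reinforcement probability works out to (m - k + 1) / (m + 1)."""
        window = deque(maxlen=m)
        reinforced = 0
        for r in responses:
            if len(window) == m and percentile_criterion_met(r, window, k):
                reinforced += 1          # criterion met: deliver the reinforcer
            window.append(r)             # the window always tracks the last m
        return reinforced

    random.seed(1)
    # Hypothetical on-task durations (seconds) that drift upward across 60 trials.
    durations = [random.gauss(20 + trial * 0.5, 5) for trial in range(60)]
    print(run_percentile_schedule(durations), "of", len(durations), "responses reinforced")

Because the criterion is always set relative to the learner's own recent performance, the schedule keeps the reinforcement probability roughly constant while the absolute requirement rises with improvement, which is what makes this arrangement attractive for shaping response forms such as time on task.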

Operant Extinction: Elimination and Generation of Behavior

Procedurally, extinction refers to the discontinuation of reinforcement for a response. In Volume 2, Chapter 4, this handbook, Lattal, St. Peter, and Escobar review research on extinction and its effects in both the elimination and the generation of behavior. Those readers unfamiliar with behavior analysis may at first be surprised that extinction procedures and response generation are related, and Lattal et al. provide an interesting introduction to this relation. All of the chapters in Volume 2, Part I, of this handbook describe both basic research in nonhuman animal laboratories and applied research in clinical settings. Volume 2, Chapter 4, this handbook features a very tight integration of the two, often in the same paragraph. The chapter begins with a brief review of the history of the study of extinction, from the early 20th century through Skinner’s early work. This review is followed by clear procedural definitions of relevant terms and an introduction to the types of functional outcomes that may result from procedural extinction, including both the reduction in the response that no longer generates reinforcement and its response-generating or response-inducing effects on other classes of behavior. The remainder of the chapter is divided into two sections, one on the eliminative effects of extinction and one on the generative effects. In the former section, Lattal et al. review the interactions of extinction with schedules of reinforcement and other
parameters of reinforcement, the effects of repeated exposures to extinction, and the effects of sudden versus gradual introduction of extinction; the latter is relevant to the gradual reduction in reinforcement frequency (“schedule thinning”) that is an important component of many clinical treatment interventions. In this section, Lattal et al. also review research on several response-elimination procedures with high relevance to applied behavior analysis. These procedures include differential reinforcement of other behavior, response-produced time outs (i.e., a signaled period in which a reinforcer is not available), and procedures that remove the response–reinforcer dependency, often termed noncontingent reinforcement in applied work. In this section in particular, Lattal et al.’s close integration of basic and applied research provides a valuable resource for the clinician.

The section on generative effects of extinction begins by reviewing research on the extinction burst, a period of increased response rate that sometimes—but not always—follows the onset of extinction. Most research on generative effects is reviewed in terms of increased variability in the topography of the target response, increased rate of responses that were previously unreinforced but topographically related to the reinforced response (e.g., behavior that did not meet a defined response criterion), and schedule-induced behavior, which includes behavior topographically unrelated to the target response that occurs during the periods of extinction in intermittent reinforcement schedules. Research with humans has documented schedule-induced behavior, including drinking and stereotypy, and schedule-induced responding has been suggested as a mechanism related to drug taking and smoking. In this section, Lattal et al. also review research on resurgence, which is the recurrence of previously reinforced responding when a more recently reinforced response is extinguished, and the recovery of behavior after periods of extinction by the presentation of discriminative, contextual, or reinforcing stimuli.
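
Two of the response-elimination procedures just listed differ only in how reinforcer deliveries relate to the passage of time and to the target response, and that difference is easy to state procedurally. The sketch below contrasts a resetting DRO with a fixed-time (response-independent) schedule of the kind often labeled noncontingent reinforcement in applied work; the session length, interval value, and response times are hypothetical.

    def dro_deliveries(response_times, session_length, interval):
        """Resetting DRO: a reinforcer is delivered whenever `interval` seconds
        elapse without the target response; each response restarts the clock."""
        deliveries, i = [], 0
        responses = sorted(response_times)
        due = interval
        while due <= session_length:
            if i < len(responses) and responses[i] < due:
                due = responses[i] + interval   # a response occurred: reset the timer
                i += 1
            else:
                deliveries.append(due)          # the interval passed response-free
                due += interval
        return deliveries

    def fixed_time_deliveries(session_length, interval):
        """Fixed-time schedule: deliveries occur every `interval` seconds
        regardless of behavior (no response-reinforcer dependency)."""
        return list(range(interval, session_length + 1, interval))

    problem_behavior = [12, 14, 55, 170]        # hypothetical response times (s)
    print("DRO deliveries:", dro_deliveries(problem_behavior, 300, 60))
    print("FT deliveries: ", fixed_time_deliveries(300, 60))

The contrast matters for the analysis above: both arrangements raise the overall reinforcement rate in the setting, but only the DRO makes each delivery depend on the absence of the target response.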

Simple and Complex Discrimination Learning

In Volume 2, Chapter 6, this handbook, McIlvane systematically lays out the complexities involved in
analyzing stimulus control in discrimination learning. His approach makes the topic accessible in part because it incorporates principles of programmed instruction, including a careful analysis and presentation of prerequisites for the reader and a systematic progression from simple to complex. This chapter will be of interest to students and more experienced readers alike, including many who do not consider themselves to be behavior analysts. In fact, one of McIlvane’s explicit goals is to provide a reference for students of cognitive neuroscience who use discrimination procedures as behavioral preparations to study correlations between behavior and underlying neurobiological processes. His message is that even seemingly straightforward procedures may bring unsuspected analytical complexities, and failure to address these complexities will introduce unmeasured sources of variability in the data.

McIlvane adopts Sidman’s (1986) analytical units of behavior analysis as a consistent framework for the discussion. The initial section describes simple discrimination procedures in terms of three-term analytical units that correspond to the three terms of the operant contingency: antecedent stimulus, behavior, and consequence. When conditional discrimination, which requires relational stimulus control by two (or more) stimuli (e.g., sample and comparison stimuli in a matching-to-sample procedure), is considered, the analytic unit is expanded to four terms that include two antecedent stimuli. At each level of analysis, McIlvane carefully explains and illustrates analyses in terms of select versus reject stimulus control (e.g., as exerted by the correct and incorrect choices in a discrimination task) and configural versus relational stimulus control (e.g., as exerted by two or more stimuli as a unitary compound in the former and by independent stimulus elements in the latter). Notably, McIlvane always makes a careful distinction between the terms of the procedure, as defined by the experimenter, and the actual analytical units of behavior that those procedures engender in the organism. One of the most important points of the chapter is that these two sets of specifications need not always correspond. I use an oversimplified example here for brevity: A special education teacher may assume that accurate performance on a
matching-to-sample task in which printed-word samples BALL and BAT are matched to corresponding pictures indicates that two word–picture relations have been learned. As McIlvane points out, however, the student may have learned to select the ball picture when the sample was BALL and reject the ball picture when the sample was BAT, a performance that does not include any stimulus control at all by the specific features of the bat picture. If performance that depends on control by those features of the bat picture is poor in subsequent testing for more advanced relational learning (e.g., stimulus equivalence), the teacher may erroneously conclude that the student is not capable of the more advanced performance, when in fact the problem was the teacher’s failure to fully analyze the stimulus-control basis for the initial training. McIlvane discusses situations such as this one in terms of a lack of coherence between the teacher’s (or experimenter’s) assumptions about stimulus control and the actual controlling stimuli and relations.

The final sections of McIlvane’s chapter include a brief description of the current state of behavior-analytic theory on the acquisition of stimulus control via differential reinforcement and some ideas about how theory might be advanced. This discussion touches on the issue of improving the coherence between the desired stimulus control (by the experimenter, teacher, etc.) and the stimulus control that is actually captured by the contingencies. McIlvane concludes the chapter with a consideration of two of the current areas of translational research in stimulus control: stimulus control shaping (e.g., by gradual stimulus change procedures) and the analysis of relational learning processes as seen, for example, in stimulus equivalence.
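
McIlvane’s BALL/BAT example can be made concrete with a small simulation. The sketch below defines two hypothetical learners: one controlled by both word–picture relations (the teacher’s assumption) and one controlled only by the ball picture (select it given BALL, reject it given BAT). Both score perfectly on the trained trials; they come apart only on a probe in which the ball picture is absent and the bat picture must be selected on its own features. The stimuli, trial types, and probe are invented for illustration.

    import random

    # Trained trials: (sample word, comparison pictures, experimenter-defined correct choice)
    training = [
        ("BALL", ["ball-pic", "bat-pic"], "ball-pic"),
        ("BAT",  ["ball-pic", "bat-pic"], "bat-pic"),
    ]
    # Probe: the bat picture must be selected on its own features (ball picture absent).
    probe = [("BAT", ["bat-pic", "dog-pic"], "bat-pic")]

    def full_control(sample, comparisons):
        """Controlled by both word-picture relations, as the teacher assumes."""
        target = {"BALL": "ball-pic", "BAT": "bat-pic"}[sample]
        return target if target in comparisons else random.choice(comparisons)

    def ball_only_control(sample, comparisons):
        """Controlled only by the ball picture: select it given BALL, reject it
        given BAT. No feature of the bat picture controls the choice."""
        if sample == "BALL" and "ball-pic" in comparisons:
            return "ball-pic"
        if sample == "BAT" and "ball-pic" in comparisons:
            return random.choice([c for c in comparisons if c != "ball-pic"])
        return random.choice(comparisons)       # nothing to select or reject

    def accuracy(learner, trials, reps=1000):
        hits = sum(learner(sample, comps) == correct
                   for sample, comps, correct in trials for _ in range(reps))
        return hits / (len(trials) * reps)

    random.seed(0)
    for label, trials in (("trained", training), ("probe", probe)):
        print(label, accuracy(full_control, trials), accuracy(ball_only_control, trials))

Both learners are indistinguishable on the contingencies actually trained (accuracy near 1.0), but only the first shows control by the bat picture when the probe requires it; the second drops to chance, which is the lack of coherence McIlvane describes.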

Response Strength and Behavioral Persistence

Behavioral momentum theory makes an analogy between behavior and physical momentum: Response rate is analogous to the velocity of a moving body, and an independent aspect of behavior analogous to inertial mass is inferred from the persistence of response rate to disruption by some challenge analogous to an external force applied to a moving body. In Volume 2, Chapter 5, this handbook,
Nevin and Wacker open with a discussion of the concept of response strength and basic research, showing that response rate and persistence (resistance to change) are independent aspects of behavior. Response rate is determined by response–reinforcer contingencies (schedules of reinforcement), and resistance to change is determined by the stimulus– reinforcer relation—that is, the characteristic reinforcer rate within a given stimulus situation— independent of response contingencies. This latter point, that persistence is determined by the stimulus– reinforcer (Pavlovian) relation and independent of response–reinforcer (operant) contingencies, has important implications for applied behavior analysis, and Nevin and Wacker include a clear explanation of the basic research supporting it (e.g., Nevin, Tota, Torquato, & Shull, 1990). Why is the distinction between rate-governing and persistence-governing environmental relations important? One answer is that applied behavior analysis has developed several successful approaches for the treatment of problem behavior in populations with developmental limitations. In such populations, verbal behavior is often unreliable, and so the interventions depend on direct manipulation of reinforcement contingencies. Procedures such as differential reinforcement of other behavior, differential reinforcement of alternative behavior, and response-independent reinforcer deliveries (often termed noncontingent reinforcement) are all designed to reduce the frequency of a problem behavior, and they all accomplish it by increasing the overall rate of reinforcement in treatment settings. A wealth of applied research has shown that these procedures have been broadly effective in reducing the rates of problem behavior. Evidence, however, including some presented in Nevin and Wacker’s chapter, has shown that they may do so at the cost of increasing the longer term persistence of the problem behavior when the inevitable challenges to treatment (e.g., brief periods of extinction) are encountered over time (e.g., Mace et al., 2010). That longer term persistence may be manifested in a posttreatment resurgence of problem behavior. Long-term maintenance of treatment gains is a relatively understudied area in applied behavior analysis, and the latter portion of Nevin and Wacker’s
chapter outlines a new approach to the issue, inspired by behavioral momentum theory. In current practice, maintenance is often defined in terms of continuing the desired behavior change over time and under the prevailing conditions of treatment. Nevin and Wacker propose that this step toward maintenance is necessary but not sufficient. They redefine maintenance as the persistence of treatment gains when confronted by changes in antecedent stimuli (people, tasks, prompts, etc.) and the consequences of behavior: “Rather than focusing almost exclusively on behavior occurring under stable treatment conditions, researchers should also consider how various treatment conditions produce or inhibit persistence during challenges to the treatment” (p. 124). Nevin and Wacker present a longitudinal analysis of the effects of extinction challenges to treatment over an average of 35 weeks for eight children. For example, after differential reinforcement of alternative behavior had replaced one child’s destructive behavior with appropriate requesting, the requests were ignored during occasional brief extinction periods. The data from these extinction challenges show gradually increasing persistence of adaptive behavior and decreasing destructive behavior over a 7-month intervention period. Nevin and Wacker conclude the chapter with some suggestions for further research, which include some interesting reverse translation possibilities for the basic research laboratory to examine strengthening and weakening of behavioral persistence over long time courses and the effects of variation in the stimulus situation. Another goal for further research is to determine the extent to which the underlying behavioral processes for “high-p” procedures (Mace et al., 1988), used in applied research to increase compliance, can be related to those of behavioral momentum theory.
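
The separation of response rate from resistance to change has a compact quantitative expression. The single-context equation below is one standard formalization from the behavioral momentum literature (e.g., Nevin and colleagues), reproduced here as it is commonly presented rather than taken from Nevin and Wacker’s chapter; the symbols are defined in the comments.

    % One commonly presented single-context form of behavioral momentum theory:
    \[
      \log\!\left(\frac{B_x}{B_0}\right) \;=\; \frac{-x}{r^{\,b}}
    \]
    % B_x = response rate during a disrupter (extinction, distraction, satiation, ...)
    % B_0 = baseline response rate
    % x   = force-like magnitude of the disrupter
    % r   = reinforcement rate obtained in the stimulus context, from all sources,
    %       including alternative or response-independent deliveries
    % b   = sensitivity parameter; r^b plays the role of behavioral mass

Because r counts every reinforcer correlated with the stimulus situation, adding alternative or response-independent reinforcers can lower response rate while enlarging the denominator, which is the formal version of the treatment trade-off Nevin and Wacker describe.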

Translational Applications of Quantitative Choice Models

Jacobs, Borrero, and Vollmer’s goal in Volume 2, Chapter 7, this handbook, is “to foster appreciation of quantitative analyses of behavior for readers whose primary interests are in applied research and practice” (p. 165). Quantitative analysis may at first seem a bit opaque to some whose primary interests
are in applied research and practice. As Commons (2001) has noted, however, quantitative analysis “is not primarily a matter of fitting arbitrary functions to data points. Rather, each parameter and variable in a model represents part of a process that has theoretical, empirical, and applied interpretations” (p. 275). Throughout the chapter, Jacobs et al. discuss many translational research studies that illustrate the relationships between the analyses’ parameters and variables and events in socially relevant human behavior, both in and outside of the laboratory.

The first of two major sections focuses on the matching law, which relates the relative allocation of behavior between and among concurrently available options to the relative value of the obtained reinforcement for those behavioral options, where value is the product of rate, quality, amount (magnitude), and immediacy (delay). One of the major contributions of the matching law is that it puts behavior in context—the behaving organism always has a choice between engaging in some behavior of interest to the behavior analyst or doing something else. Seen from the perspective of the behavior analyst, therefore, there is also a choice: To increase the behavior of interest, one may increase relative reinforcement contingent on that behavior or decrease reinforcement available for alternative behavior, or both. The matching law section opens with a brief account of the development of the mathematical model, accompanied by text relating the mathematical terms to aspects of behavior. Helpful figures illustrate the effects of changing the values of the terms in the corresponding equations. This introduction is followed by a review of research showing the applicability of the matching law to such diverse human activities as choice between conversational partners among college students, between academic tasks among students with academic delays, between problem and appropriate behavior among children with intellectual and developmental disabilities, and even between 2- and 3-point shots among basketball players. Areas identified for further research on choice include the effects of delay to reinforcement for responses that closely follow a shift from one behavioral option to another (changeover delay), delay to reinforcement for responses
during extended periods of behavior, and analytic tools to better account for ratio- versus interval-like aspects of reinforcement schedules in uncontrolled environments. The second major section of the chapter covers temporal discounting, which describes the relationships between the impact of a reinforcer and the amount of time that will pass before that reinforcer is obtained. This research area is relevant to issues involving human impulsivity and self-control and maladaptive choices to engage in behavior that produces a relatively smaller immediate reinforcer (e.g., second helping of chocolate cake) at the cost of foregoing a relatively larger delayed reinforcer (weight control for better health and appearance). The text and accompanying figure explain how temporal discounting is well described by a hyperbolic function, how this function helps to account for impulsive choice of the smaller, more immediate reinforcer, and how the methods can be used to determine a quantitative value describing the degree to which the individual will discount a reinforcer (i.e., how steeply the value of a reinforcer will decrease as the delay to obtain it increases). Jacobs et al. then go on to review research in assessment of impulsivity in adults and in children, both those who are typically developing and those with developmental disabilities. Research on strategies for decreasing impulsivity is also reviewed. The section concludes with a discussion of the potential for future research to develop discounting assessments that could identify developmental markers and risk for problems associated with impulse control. The remainder of the chapter provides brief overviews of several other quantitative approaches and models relevant to translational research in behavior analysis: behavioral economics (addressed in Volume 2, Chapter 8, this handbook), behavioral ecology, behavioral momentum (addressed in Volume 2, Chapter 5, this handbook), and behavioral detection models based on signal detection theory. Each of these sections provides helpful introductory material.
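
For readers who want the equations behind the two sections just summarized, the standard forms are shown below: the generalized matching law in its logarithmic form (Baum, 1974) and the hyperbolic discounting function associated with Mazur’s work. These are the conventional textbook statements rather than reproductions of Jacobs et al.’s own equations.

    % Generalized matching law (logarithmic form; Baum, 1974):
    \[
      \log\!\left(\frac{B_1}{B_2}\right) \;=\; a \,\log\!\left(\frac{R_1}{R_2}\right) + \log b
    \]
    % B_1, B_2 = behavior allocated to the two options (responses or time);
    % R_1, R_2 = reinforcement obtained from each option;
    % a = sensitivity (a < 1 is undermatching); b = bias toward one option.

    % Hyperbolic temporal discounting:
    \[
      V \;=\; \frac{A}{1 + kD}
    \]
    % V = present value of a reinforcer of amount A delivered after delay D;
    % k = discounting rate; larger k means steeper devaluation with delay.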

Behavioral Economics The first half of Volume 2, Chapter 8, this handbook includes an introduction and overview of several important concepts. In the introduction, Hursh et al.

describe how a common interest in value and choice in the fields of economics and behavioral psychology provide a context for (a) the extension of microeconomic theory to the consumption of reinforcers in organisms and (b) the application of operant conditioning principles to the economics of demand for commodities. I highly recommend the overview of behavioral economic concepts, regardless of the reader’s level of familiarity with the topics. There is something here for everyone. The material is presented in a clear and balanced discussion that covers areas of consistency and agreement as well as apparent exceptions and thus possible issues for further research. The discussion is illustrated with examples from both basic and translational research. Among the most important concepts reviewed are demand, value, and discounting. As noted earlier, behavioral economics treats reinforcers as commodities and the behavior required to obtain a commodity as its price. Of primary interest is how changes in price affect demand. Demand is measured in terms of consumption, and Hursh et al. discuss the distinction between total consumption as a fundamental dependent variable and other common behavior-analytic dependent variables such as response rate. As price increases, consumption decreases. Hursh et al. describe the essential value of a commodity (reinforcer) in terms of the rate of change in consumption as price increases, that is, “an organism’s defense of consumption in the face of constraint. Commodities that are most vociferously defended are the most essential” (pp. 196–197). In the Quantifying Value From Demand Curve Data section, Hursh et al. discuss methods for obtaining quantitative estimates of value based on rate of change in consumption or, alternatively, the price point that supports peak responding. Also covered are ways in which the availability and price of alternate commodities may affect essential value. When the delivery of a reinforcer is delayed, its value decreases in relation to the duration of the delay; that is, the value is discounted. Quantitative analyses of discounting have revealed a very interesting and significant difference between the discounting function predicted by normative economic theory and that actually obtained by the experimental


analysis of behavior. The former is exponential, in which a reinforcer is devalued at a constant rate over time, and the latter is hyperbolic, in which the devaluation rate is more extreme in the immediate future than in the more distant future. This exponential versus hyperbolic difference is illustrated in the top portion of Hursh et al.’s Figure 8.9. Because of the acceleration in value as a reinforcer becomes more immediate, choice may suddenly shift from a larger– later reinforcer (e.g., losing weight for better health and appearance) to a smaller–sooner one (e.g., an imminent piece of apple pie). Included in the chapter is a very accessible discussion of how the discovery of hyperbolic temporal discounting provides a scientifically grounded explanation for seemingly irrational choices and preference reversals. The second half of the chapter reviews translational research in behavioral economics with attention to areas with potential for further research and development. Much of the translational activity thus far has been related to analyses and treatment for addictions. One set of questions asks whether analyses of demand characteristics and essential value can be used to predict response to treatment interventions by behavioral therapy, response to treatment by medication, and the transition from inpatient to outpatient treatment. In the Translating Delay Discounting section, Hursh et al. look at some potentially promising intervention approaches that address the person with addiction’s discounting of the long-term benefits that would result from recovery. One approach is to teach tolerance for delay; Hursh et al. judge this approach to be at the proof-of-concept stage, with much work remaining to be done. Another approach has been termed reward bundling (e.g., Ainslie, 2001): Rather than a choice between two isolated reinforcers (one smaller–sooner and one larger– later), the choice is between two series (bundles) of reinforcers distributed in time. The theory is that the bundles will affect choice in relation to the sums of the discounted values of all reinforcers in the series. That is, the difference in value of the bundles will be greater than the difference in value of two single reinforcers, and this greater disparity will more than offset the immediacy of the first reward in the smaller–sooner series (see Hursh et al.’s Figure 8.14).

(Although not discussed in the chapter, this approach seems at least conceptually similar to the bundling of distributed social reinforcers in 12-step programs.) Research on reward bundling has yet to be accomplished in applied settings. A third approach, training in skills related to executive functioning (problem solving, strategic planning, working memory), is also in the earliest stages of translational research. In the final section of the chapter, Translating Behavioral Economic Principles to Other Applied Settings, Hursh et al. describe translational research with individuals who have autism, intellectual and developmental disabilities, or both. Hursh et al. cover the assessment of value in such populations and also describe some approaches to scheduling reinforcers in ways that may help to maintain demand as price rises. The section concludes with some behavioral economic considerations of the marketplace in which problem behavior occurs and some issues to be considered when introducing therapeutic reinforcers into this marketplace.
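The discounting and bundling relations described above can also be illustrated with a small sketch. The hyperbolic and exponential forms below are the standard ones (V = A/(1 + kD) and V = Ae^(-kD)); the amounts, delays, and the discounting rate k are invented for illustration and are not taken from Hursh et al.'s chapter or figures.

# Illustrative only: exponential vs. hyperbolic discounting, preference
# reversal, and reward bundling, with arbitrary amounts, delays, and k.
import math

K = 0.15  # illustrative discounting rate; not an empirical estimate

def hyperbolic_value(amount, delay, k=K):
    # Hyperbolic discounting: V = A / (1 + k * D)
    return amount / (1.0 + k * delay)

def exponential_value(amount, delay, k=K):
    # Normative (exponential) discounting: V = A * exp(-k * D)
    return amount * math.exp(-k * delay)

def bundle_value(reinforcers, k=K):
    # Value of a bundle = sum of the discounted values of its members
    return sum(hyperbolic_value(a, d, k) for a, d in reinforcers)

# A smaller-sooner reinforcer (SS, amount 10) vs. a larger-later one (LL,
# amount 20, arriving 10 time units afterward), evaluated far in advance
# (t = 30) and again when the smaller-sooner option is imminent (t = 1).
for t in (30, 1):
    ss, ll = hyperbolic_value(10, t), hyperbolic_value(20, t + 10)
    ss_e, ll_e = exponential_value(10, t), exponential_value(20, t + 10)
    print(f"t={t}: hyperbolic SS={ss:.2f} LL={ll:.2f} | "
          f"exponential SS={ss_e:.2f} LL={ll_e:.2f}")
# The hyperbolic values show a preference reversal (LL preferred at t=30,
# SS preferred at t=1); the exponential values do not, because the LL/SS
# ratio is constant (2 * exp(-10k)) regardless of how far away both are.

# Bundling: three smaller-sooner vs. three larger-later reinforcers, 10 units apart.
ss_bundle = [(10, 1), (10, 11), (10, 21)]
ll_bundle = [(20, 11), (20, 21), (20, 31)]
print(f"SS bundle={bundle_value(ss_bundle):.2f} "
      f"LL bundle={bundle_value(ll_bundle):.2f}")

With these arbitrary parameters the bundled larger-later series comes out ahead (about 15.9 vs. 14.9) even though the first smaller-sooner reinforcer wins in isolation, which is the disparity described above as offsetting the immediacy of the first reward.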

Applied Behavior Analysis and Neuroscience In Volume 2, Chapter 2, this handbook, Travis Thompson examines the pathway from basic laboratory research to application “at the interface of behavior analysis and neuroscience” (p. 33). The first section of the chapter, Behavior Analysis in Physiological Psychology, focuses on behavioral pharmacology, with an emphasis on drug addiction. The historical material discussed in this section includes eyewitness accounts of the early work with nonhuman primates from one of the pioneers in the field. There is an emerging understanding that addiction involves not only the biochemical and physiological effects of drugs but also their interactions with reinforcement contingencies. The translational path proceeds to the merger of behavioral pharmacology and applied behavior analysis in the treatment of substance abuse, and the chapter includes a review of some recent research in this area. In the Establishing Operations: Food Motivation section, Thompson examines research on the effects of neuropeptide Y, which increases food intake but 73


apparently via an underlying mechanism distinct from that of food deprivation. In this section, Thompson reviews functional magnetic resonance imaging (MRI) research of brain activity associated with food motivation and evidence that typical patterns of neural activity begin in childhood. Also discussed is research comparing pre- and postmeal activation while viewing food pictures in typical control participants and individuals with PraderWilli syndrome, the most common known genetic cause of life-threatening obesity in children. The research points to distinct neural mechanisms associated with food intake in Prader-Willi syndrome. In three of the remaining sections, Thompson reviews translational research in three areas: (a) treatments for some forms of self-injurious behavior in developmental disabilities that are maintained by the release of endogenous opioids; (b) changes in the motor regions of the brain associated with constraint-induced rehabilitation, in which operant contingencies are used to encourage people with stroke to use affected limbs; and (c) functional MRI studies of the effects of exposure to operant contingencies of reinforcement on subsequent viewing of discriminative stimuli and members of stimulus equivalence classes. Thompson also includes an intriguing section on a possible relation between synaptogenesis and success in early intensive behavioral intervention (EIBI) in autism. Although EIBI has been very successful in some cases, the gains are rather modest for approximately half of treated children. Thompson reviews evidence that synapse formation can be activity dependent and raises the question of whether individual differences in synaptogenesis in response to the operant reinforcement contingencies of EIBI— differences that may have a genetic basis—could be related to the ultimate degree of success. This area seems to be a very fertile one for future research.

Environmental Health and Behavior Analysis Newland begins his chapter (Volume 2, Chapter 9, this handbook) with a few sobering facts. For example, of the tens of thousands of chemicals in production, only a small fraction have been subjected to rigorous tests for neurotoxicity. Even among

high-priority chemicals whose structure suggests significant potential for toxicity and for which commercial use exceeds 500 tons per year, fewer than one third have undergone any neurotoxicity testing. Newland goes on to outline behavioral neurotoxicological testing methods and analysis criteria using both operant and Pavlovian contingencies. These methods are illustrated with examples that include high levels of exposure to manganese (e.g., in unsafe mining conditions), solvents, ozone, electrical fields, and others. The value of individual subject data is illustrated with Cory-Slechta’s (1986) results on the effects of developmental exposure to lead on fixedinterval responding (see Newland’s Figure 9.3 and accompanying text). Given the nature of the subject matter, toxicity, testing in nonhuman animals is a necessity. One of the important contributions of this chapter is Newland’s Human Testing section, in which he explores problems with and solutions for comparing the results of studies with nonhuman laboratory animals with those conducted with human participants. One solution to this problem is to develop tests that can be used among both populations. This approach is particularly useful when the test results in humans correlate well with more general measures of functioning; for example, the correlation of performance on an incremental repeated acquisition test with that on a standardized intelligence (IQ) test, described by Newland. Another approach involves the elimination or minimization of verbal instruction with humans, which also facilitates comparisons among a diverse array of human populations and cultural groups (e.g., migrant laborers with occupational exposure to pesticides). In the Mechanisms and Interventions section, Newland considers deficits in motor function and the role of reinforcement contingencies in recovery of function (a topic Thompson also addresses in his chapter). Research on the development of tolerance to neurotoxicants and adjustment to impairment is also described as well as disruption of behavioral allocation in behavioral choice procedures (which are described in more detail in Volume 2, Chapter 7, this handbook) and behavioral flexibility as measured by discrimination reversal procedures after gestational exposure to lead or methylmercury.


Newland goes on to describe how these effects may be understood in terms of distortion in the impact of reinforcers and disrupted dopamine function. In the remainder of the chapter, Newland provides an education in the scientific and organizational problems that must be solved to conduct an evidence-based risk assessment. He describes a process for deriving estimates of tolerable human exposures from controlled laboratory studies of animals and from epidemiological studies of exposed humans, in an open and transparent manner. To help meet the need to advance the pace of testing, a current emphasis in the area of in vitro testing (formation of proteins, activity of cells, etc.) is on "high-throughput" techniques for rapidly identifying and characterizing neurotoxicity in large-scale efforts. An important challenge identified for the next generation of behavioral toxicologists is the development of meaningful high-throughput behavioral tests; as Newland notes, "One cannot reproduce the [fixed-interval] schedule in a dish" (p. 245).

From Behavioral Research to Clinical Therapy Clinical behavior analysis (CBA) refers to that branch of applied behavior analysis that addresses what are commonly known as mental health issues in verbally competent adult outpatients: anxiety disorders, depression, adjustment disorders, and so forth. In Volume 2, Chapter 1, this handbook, Guinther and Dougher provide an overview of the historical development of CBA and describe translational research in relation to specific CBA therapies. The historical overview begins with the behavior therapy movement of the 1950s and 1960s and the rise to prominence of cognitive–behavioral therapy in the 1970s, including a discussion of why the mentalistic aspects of cognitive behavior therapy removed it from behavior analysis. Goldiamond’s (1974/2002) constructional approach is credited as the first fully articulated behavior-analytic outpatient therapy, and the development of the more modern CBA therapies over the next 30 years is briefly noted. The Translational Research Relevant to CBA section is preceded by a thoughtful introduction to the conceptual basis of CBA in the Skinnerian analyses

of private events (such as thoughts and feelings) as behavior itself rather than as causes of behavior and the functional (as opposed to formal) analysis of verbal behavior as operant behavior controlled by its audience-mediated consequences. This part of the chapter is divided into three sections: Rule Governance, Equivalence Relations and the Transfer of Stimulus Functions, and Other Stimulus Relations and the Transformation of Functions. Rule governance is the study of interactions in behavioral control between antecedent stimuli in the form of verbal rules or instructions that describe contingencies (including self-generated) and the actual consequences for behavior; the rules may or may not accurately reflect the contingencies. The implications for CBA of self-generated rules at odds with the actual contingencies of daily life seem evident. Stimulus equivalence refers to stimulus–stimulus relations of mutual substitutability (for a basic research review, see Chapter 16, this volume). For example, the spoken word dog and the printed word DOG may be related by equivalence if both control pointing to a picture of a dog (but see Volume 2, Chapter 6, this handbook, for some of the complexities of a thorough analysis). In the Equivalence Relations and the Transfer of Stimulus Functions section, Guinther and Dougher point out that the most clinically interesting feature of stimulus equivalence is transfer of function: Stimulus functions acquired by direct conditioning with one stimulus will also be manifested by other equivalent stimuli in the absence of direct experience. Equivalence relations and transfer of function in verbal stimuli help to explain why words can provoke emotional responses, and the research reviewed in this section documents how stimuli can come to elicit emotions or evoke avoidance behavior via equivalence relations and in the absence of any direct conditioning. The subsequent section, Other Stimulus Relations and the Transformation of Functions, on relational frame theory (RFT) provides an exceptionally clear description of the relation between stimulus equivalence and RFT. RFT expands the study of stimulus–stimulus relations beyond those of equivalence to include opposition, more than–less than, before–after, and many others. The research reviewed has indicated that the transfer of functions 75


seen in stimulus equivalence may be modified accordingly by other types of stimulus–stimulus relations and is thus termed transformation of function. For example, when initial training established more-than–less-than relations among stimuli such that A < B < C, and B subsequently predicted electric shock in a Pavlovian conditioning procedure, experimental participants showed smaller skin conductance changes to A and larger skin conductance changes to C than to B, even though there was no direct conditioning with the A and C stimuli (Dougher, Hamilton, Fink, & Harrington, 2007). The section that follows, Verbal Behavior and the Clinical Relevance of RFT, considers the wide-ranging implications for clinical relevance of RFT applied to verbal behavior. In the remainder of the chapter, Guinther and Dougher review the major CBA therapies in current practice. In the CBA Therapies section, they begin by discussing Goldiamond's (2002) constructional approach, which is foundational in both historical and conceptual senses. As an editorial note, I think that the importance of Goldiamond's contributions to applied behavior analysis in general cannot be overemphasized. To list a few examples, the diagnostic technique of descriptive analysis (e.g., McComas, Vollmer, & Kennedy, 2009), treatment interventions such as functional communication training that focus on building appropriate repertoires that produce the same class of reinforcers as problem behavior (e.g., E. G. Carr & Durand, 1985), and the focus on repertoire building rather than symptom reduction in ACT (e.g., Hayes, Strosahl, & Wilson, 1999) owe conceptual debts to Goldiamond. The subsections describing the therapies are very clearly labeled, and the reader is referred to Guinther and Dougher's chapter itself for clear introductions to each one and related efficacy research. These therapies include functional analytic psychotherapy, which is helpful for improving the relationship between therapist and patient; integrative behavioral couples therapy for couples involved in romantic relationships; dialectical behavior therapy for the treatment of borderline personality disorder and other severe problems; behavioral activation therapy, with a primary emphasis on treatment of

depression; and a brief introduction to ACT for the treatment of a wide variety of quality-of-life issues (brief because ACT is the subject of Volume 2, Chapter 18, this handbook).

Acceptance and Commitment Therapy In Volume 2, Chapter 18, this handbook, Levin, Hayes, and Vilardaga present an in-depth look at ACT, which is arguably on the leading edge of clinical behavior analysis. ACT developed in concert with RFT, and in the introductory sections of the chapter, Levin et al. present a detailed account of its development. They describe functional contextualism, the philosophical foundation of RFT and ACT, along with a related development strategy termed contextual behavioral science. A key aspect of this approach that seems distinct from mainstream behavior analysis is described as “user-friendly analyses . . . that practitioners can access without fully grasping the details of the basic account, while a smaller number of scientists work out technical accounts” (p. 462). The unifying conceptual system for ACT at an applied level is the psychological flexibility model, which is based on six key processes, each of which is defined and explained: defusion interventions, acceptance interventions, contact with the present moment, self-as-context (perspective taking), values (motivational context), and committed action. A case example is presented that illustrates how these processes are applied to case conceptualization and treatment in ACT, with the goal of significantly improved psychological flexibility. In the next section of the chapter, Expanded Use of Methodologies to Test the Theoretical Model and Treatment Technology, Levin et al. describe a body of evidence supporting the efficacy of ACT. As guided by the contextual behavioral science approach, the treatment technologies and underlying theoretical models have been evaluated by a variety of methodologies, including group designs. Levin et al. describe microcomponent studies that are typically small scale and laboratory based and often use nonclinical populations and focus on “broadly relevant features of behavior such as task persistence and recovery from mood inductions” (p. 470). The results of more than 40 such


microcomponent studies "have suggested that many of these [ACT] components are psychologically active" (p. 470). Also described is research examining the processes of change, and results in this area show that ACT affects theoretically specified processes and that changes in these processes predict changes in treatment outcomes. One of the most salient characteristics of ACT is its broad applicability. This section on evaluative research concludes with a summary of research showing ACT's impact on an impressive array of problem areas including depression, anxiety, psychosis, chronic pain, substance use, coping with chronic illness, weight loss and maintenance, burnout, sports performance, and others. In the final section of the chapter, Effectiveness, Dissemination, and Training Research, Levin et al. describe the active role of contextual behavioral science researchers in studying issues related to the effective implementation of ACT. The results of this research have shown that training in ACT improves patient outcomes, and these results are also shaping the development of methods for training clinicians in the intervention approach.

Prosocial Behavior and Environment in a Public Health Framework In Volume 2, Chapter 10, this handbook, Biglan and Glenn propose a framework to relate known behavioral principles as they affect individuals to the cultural practices of society as a whole, for the purpose of making general improvements in human wellbeing. In the opening section of the chapter, Behavior Analysis and Cultural Change, they define macrobehavior as the similar but independent behavior of many individuals that has a cumulative effect on the environment. A fundamental difference between behavior and macrobehavior is that the former describes a lineage of recurring responses of an individual that is shaped and maintained by its consequences over time. In contrast, macrobehavior is an aggregate of the operant behavior of many individuals and not controlled by consequences at the macrobehavioral level. The behavioral components of a social system are described as interlocking behavioral contingencies that produce a measurable outcome; for example,

within the social system of a school, one product of educational interlocking behavioral contingencies is students’ academic repertoires. The term metacontingencies describes the contingent relations between interlocking behavioral contingencies and their products and the consequent actions of the external environment on the social system. Biglan and Glenn argue that when the cumulative effects of macrobehavior are detrimental to society and human well-being, the macrobehavior should be treated as a behavioral public health problem. Examples of such detrimental macrobehaviors include substance abuse, academic failure, and crime. Because research from several disciplines has related a small set of non-nurturing conditions to problems of public health and behavior, Biglan and Glenn propose an approach that focuses on increasing the prevalence of nurturing environments. They describe ways to promote such environments that include (among others) minimizing toxic and aversive conditions, reinforcing prosocial behavior, and setting limits on opportunities for problem behavior. Biglan and Glenn then present two extended examples of planned interventions that operate at the macrobehavioral level to make changes in social systems by promoting nurturing environments. The first is the schoolwide Positive Behavior Supports (PBS) program, which as of 2010 had been adopted by more than 9,500 U.S. schools. They delineate the foundational constructs of PBS and emphasize the importance of multilevel organizational support at state, district, and school levels. From a macrobehavioral perspective, they point out the need for research on the specific factors that influence the spread of PBS. The second example of an organized effort to change macrobehavior is the tobacco control movement. This section includes an absorbing look at the interlocking behavioral contingencies of the tobacco industry as a social system and how tobacco marketing resulted in smoking by approximately half of men and one third of women by the middle of the 20th century. Biglan and Glenn also describe the activities of the tobacco control movement, resulting in a macrobehavioral change that reduced smoking behavior by approximately half during the second half of the century. 77


Biglan and Glenn conclude with a review of evidence-based programs and practices that may be marshaled to produce system-level changes in public health and related policies. In their words,

The next challenge is to develop an empirical science of intentional cultural evolution. . . . We argue that this effort will strengthen with a focus on (a) how to influence the spread of a behavior by increasing the incidence of reinforcement for that behavior and (b) how to alter the metacontingencies for organizations that select practices contributing to a more nurturing society. (p. 270)

Summary

During his term as editor of the Journal of Applied Behavior Analysis, Wacker (1996) pointed out the "need for studies that bridge basic and applied research" (p. 11). The research reviewed in the chapters I have discussed is certainly responsive to that need. It encompasses the T1, T2, and T3 range of translations described in biomedical science, and the bridge has firm abutments in both the basic and the applied research literatures.

References Ainslie, G. (2001). Breakdown of will. Cambridge, England: Cambridge University Press. Carr, D., Wilkinson, K. M., Blackman, D., & McIlvane, W. J. (2000). Equivalence classes in individuals with minimal verbal repertoires. Journal of the Experimental Analysis of Behavior, 74, 101–114. doi:10.1901/jeab.2000.74-101 Carr, E. G., & Durand, V. M. (1985). Reducing behavior problems through functional communication training. Journal of Applied Behavior Analysis, 18, 111–126. doi:10.1901/jaba.1985.18-111 Commons, M. L. (2001). A short history of the society for quantitative analyses of behavior. Behavior Analyst Today, 2, 275–279. Cory-Slechta, D. A. (1986). Vulnerability to lead at later developmental stages. In N. Krasgenor, D. B. Gray, & T. Thompson (Eds.), Advances in behavioral pharmacology: Developmental behavioral pharmacology (pp. 151–168). Hillsdale, NJ: Erlbaum. Devany, J. M., Hayes, S. C., & Nelson, R. O. (1986). Equivalence class formation in language-able 78

and language-disabled children. Journal of the Experimental Analysis of Behavior, 46, 243–257. doi:10.1901/jeab.1986.46-243 Dougher, M. J., Hamilton, D. A., Fink, B. C., & Harrington, J. (2007). Transformation of the discriminative and eliciting functions of generalized relational stimuli. Journal of the Experimental Analysis of Behavior, 88, 179–197. doi:10.1901/jeab.2007.45-05 Goldiamond, I. (2002). Toward a constructional approach to social problems: Ethical and constitutional issues raised by applied behavior analysis. Behavior and Social Issues, 11, 108–197. (Original work published 1974) Hayes, S. C., Strosahl, K., & Wilson, K. G. (1999). Acceptance and commitment therapy: An experiential approach to behavior change. New York, NY: Guilford Press. Iwata, B. A., Dorsey, M. F., Slifer, K. J., Bauman, K. E., & Richman, G. S. (1994). Toward a functional analysis of self-injury. Journal of Applied Behavior Analysis, 27, 197–209. (Original work published 1982) Kohn, A. (1993). Punished by rewards. Boston, MA: Houghton Mifflin. Lionello-DeNolf, K. M., McIlvane, W. J., Canovas, S. D. G., & Barros, R. S. (2008). Reversal learning set and functional equivalence in children with and without autism. Psychological Record, 58, 15–36. Mace, F. C., & Critchfield, T. S. (2010). Translational research in behavior analysis: Historical traditions and imperative for the future. Journal of the Experimental Analysis of Behavior, 93, 293–312. doi:10.1901/jeab.2010.93-293 Mace, F. C., Hock, M. L., Lalli, J. S., West, P. J., Belfiore, P., Pinter, E., & Brown, D. K. (1988). Behavioral momentum in the treatment of noncompliance. Journal of Applied Behavior Analysis, 21, 123–141. doi:10.1901/jaba.1988.21-123 Mace, F. C., McComas, J. J., Mauro, B. C., Progar, P. R., Taylor, B., Ervin, R., & Zangrillo, A. N. (2010). Differential reinforcement of alternative behavior increases resistance to extinction: Clinical demonstration, animal modeling, and clinical test of one solution. Journal of the Experimental Analysis of Behavior, 93, 349–367. doi:10.1901/jeab.2010.93-349 McComas, J. J., Vollmer, T., & Kennedy, C. (2009). Descriptive analysis: Quantification and examination of behavior–environment interactions. Journal of Applied Behavior Analysis, 42, 411–412. doi:10.1901/ jaba.2009.42-411 McIlvane, W. J., Dube, W. V., Lionello-DeNolf, K. M., Serna, R. W., Barros, R. S., & Galvão, O. F. (2011). Some current dimensions of translational behavior analysis: From laboratory research to intervention for persons with autism spectrum disorders. In


E. A. Mayville & J. A. Mulick (Eds.), Behavioral foundations of effective autism treatment (pp. 155–181). Cornwall-on-Hudson, NY: Sloan. Mold, J. W., & Peterson, K. A. (2005). Primary care practice-based research networks: Working at the interface between research and quality improvement. Annals of Family Medicine, 3(Suppl. 1), S12–S20. doi:10.1370/afm.303 Nevin, J. A., Tota, M. E., Torquato, R. D., & Shull, R. L. (1990). Alternative reinforcement increases resistance to change: Pavlovian or operant contingencies? Journal of the Experimental Analysis of Behavior, 53, 359–379. doi:10.1901/jeab.1990.53-359 Overmier, J. B., & Burke, P. D. (1992). Animal models of human pathology: A quarter century of behavioral research. Washington, DC: American Psychological Association. Sidman, M. (1986). Functional analysis of emergent verbal classes. In T. Thompson & M. D. Zeiler (Eds.), Analysis and integration of behavioral units (pp. 213–245). Hillsdale, NJ: Erlbaum.

Skinner, B. F. (1953). Science and human behavior. New York, NY: Macmillan. upFRONT. (2006). Translational science. Philadelphia: Penn Nursing, University of Pennsylvania. Retrieved from http://www.nursing.upenn.edu/about/ Documents/UpFront_8.30.pdf Wacker, D. P. (1996). Behavior analysis research in JABA: A need for studies that bridge basic and applied research. Experimental Analysis of Human Behavior Bulletin, 14, 11–14. Westfall, J. M., Mold, J., & Fagnan, L. (2007). Practicebased research—“Blue highways” on the NIH roadmap. JAMA, 297, 403–406. doi:10.1001/jama.297.4.403 Woolf, S. H. (2008). The meaning of translational research and why it matters. JAMA, 299, 211–213. doi:10.1001/jama.2007.26 World Health Organization. (2004). World report on knowledge for better health. Geneva, Switzerland: Author. Zerhouni, E. (2003). The NIH roadmap. Science, 302, 63–72. doi:10.1126/science.1091867


Chapter 4

Applied Behavior Analysis Dorothea C. Lerman, Brian A. Iwata, and Gregory P. Hanley

Applied behavior analysis (ABA) differs from other areas of applied psychology in many respects, but two are especially prominent. First, ABA is not an eclectic enterprise, borrowing theory and method from varied persuasions; it is grounded in the theoretical and experimental orientation of behavior analysis. Second, whereas most applied fields are distinguished by their emphasis on a particular clientele, problem, or setting, ABA is constrained only by its principles and methods. ABA focuses on any aspect of human and sometimes nonhuman behavior, regardless of who emits it or where it occurs, crossing professional boundaries typically used to define clinical, educational, and organizational psychology as well as generational cohorts and diagnostic categories. Thus, the subject matter of ABA is not tied to any specific area of application. Other chapters in this handbook present summaries of applied research organized along more traditional lines. In this chapter, we emphasize ABA’s distinctive features and summarize its major themes on the basis of the behavioral processes of interest. Origins of Applied Behavior Analysis The official beginning of the field of ABA is easy to pinpoint because the term was coined in 1968 when the Journal of Applied Behavior Analysis (JABA) was founded. However, that date represents the culmination of events initiated many years prior. Tracing the emergence of ABA before 1968 is an arbitrary

process because the borders separating basic, translational, and applied research are fluid. Nevertheless, Fuller’s (1949) study, the first published report of operant conditioning with a human, serves as a good starting point. The response that Fuller shaped—a simple arm movement in an individual with profound intellectual disability—had little adaptive value, but it was significant in demonstrating the influence of operant contingencies on human behavior. Although many will find the article’s similarities to current ABA work almost nonexistent, it is worth noting that Boyle and Greer (1983) published an extension of Fuller’s work 35 years later in JABA in which they used similar methods to shape similar responses among comatose patients. Soon after Fuller’s (1949) article appeared, other reports of human operant conditioning followed during the 1950s (Azrin & Lindsley, 1956; Bijou, 1955; Gewirtz & Baer, 1958; Lindsley, 1956). At the end of that decade, Ayllon and Michael (1959) published what many consider the first example of ABA because it contained multiple single-case analyses of different interventions (extinction, differential reinforcement, avoidance training, noncontingent reinforcement) with a range of target behaviors (excessive visitation to the nursing station, aggression, refusal to self-feed, and hoarding) exhibited by psychiatric patients. Similar reports were published throughout the 1960s (see Kazdin, 1978, for a more extensive discussion) in various journals, including the Journal of the Experimental Analysis of Behavior. The board of directors of the Journal of the Experimental Analysis of Behavior, recognizing the need for




a journal devoted to applications of operant conditioning, initiated the publication of JABA in 1968. The first issue of JABA contained an article that essentially defined the field. D. M. Baer, Wolf, and Risley (1968) proposed seven defining characteristics—dimensions—of ABA:

1. Applied: The focus is on problems of social importance.
2. Behavioral: Dependent variables reflect direct measurement of the behaviors of interest.
3. Analytic: Demonstrations of behavior change include convincing evidence of experimental control, favoring repeated measures of individual behavior and replication of treatment effects.
4. Technological: Procedural descriptions specify operational features of intervention for all relevant responses.
5. Conceptually systematic: Techniques used to change behavior are related to the basic principles from which they are derived.
6. Effectiveness: The extent of behavior change is sufficient to be of practical value.
7. Generality: Effects of intervention strategies can be programmed across time, behavior, and setting.

These dimensions have served as useful guides in planning and evaluating ABA research since the field's inception. It is a mistake to assume, however, that any particular study or even most of those published in JABA or similar journals illustrates each of the characteristics of ABA described by D. M. Baer et al. (1968). The most obvious example is the dimension of generality: A study with two or three participants cannot establish the external validity of a therapeutic intervention; multiple, systematic replication and extension accomplish that goal (see Chapter 7, this volume). Similarly, although applied may suggest exclusive emphasis on problematic human behavior (deficit or excess), a case can be made for studying an arbitrary response such as eye movement (Schroeder & Holland, 1968) to identify procedures for improving attention or even nonhuman behavior when it is difficult to conduct certain demonstrations with humans, as in the shaping of self-injury (Schaefer, 1970).

Distinctive Features of Applied Behavior Analysis Methodological differences between ABA and other therapeutic endeavors are so numerous that discussion is beyond the scope of this chapter; it would also be partially redundant with a description of the experimental analysis of behavior contained in Chapter 2 of this volume. Instead, we describe here some features of applied research that distinguish it from basic laboratory work.

Nature of the Response (Dependent Variable) Basic research focuses primarily on the fundamental learning process. As a result, structural aspects of a response are usually of secondary concern, and dependent variables often consist of arbitrary, discrete responses such as a rat’s lever presses or a pigeon’s key pecks. The response per se assumes greater importance in applied research because the goal of intervention is usually to change some topographical aspect of behavior. In fact, much of ABA research involves establishing a new response that is not in an individual’s repertoire or modifying a current response topography from one that is deficient, socially unacceptable, or even dangerous to one that is deemed more appropriate by the larger community. For example, eating can be accomplished in any number of ways, but the form that eating takes may be a determinant of social acceptability (O’Brien & Azrin, 1972). Other examples include cadence of speech (R. J. Jones & Azrin, 1969), structural aspects of composition (Brigham, Graubard, & Stans, 1972), and nuances of social interaction (Serna, Schumaker, Sherman, & Sheldon, 1991). Sometimes the goal of applied research is not to establish a specific response but, rather, to increase topographical variability, as in the development of creativity (Glover & Gary, 1976; Goetz & Baer, 1973). Applied researchers also focus more often on the measurement of larger units of behavior because most adaptive performances are made up of response chains, such as self-care sequences, academic work, and vocational behavior. Some chains involve repetition of a similar topography, as in building a construction out of blocks (Bancroft,


Weiss, Libby, & Ahearn, 2011); others, however, can be extremely complex. For example, Allison and Ayllon (1980) evaluated a behavioral approach to athletics coaching and measured varied behaviors such as blocks (football), handsprings (gymnastics), and serves (tennis) as response chains consisting of five to 11 members. A third distinction between basic and applied research is the multidimensional nature of measurement in applied work. Basic researchers typically use response rate or response allocation as a dependent variable because both are convenient, standard, and highly sensitive measures (Skinner, 1966; see Chapter 10, this volume). Although they occasionally study other response dimensions such as intensity or force, they do so to determine how these aspects of behavior are influenced by environmental manipulation (Hunter & Davison, 1982; Sumpter, Temple, & Foster, 1998), not because changing these response dimensions is a goal of the research. Applied researchers, by contrast, will study response allocation because, for example, decreasing responses allocated to drug taking is of societal importance (see Volume 2, Chapter 19, this handbook). Likewise, they will study response intensity because it is either too low, as in inaudible speech (Jackson & Wallace, 1974), or too high, as in noisy behavior on the school bus (Greene, Bailey, & ­Barber, 1981). Furthermore, many human performances are such that clinical improvement requires change in more than one dimension of a response. For example, effective treatment of a child’s sleep disturbance should decrease not only the frequency but also the duration of nighttime wakening (France & Hudson, 1990; see Volume 2, Chapter 17, this handbook). Quantifying changes in academic performance also requires multiple measures, such as the duration of work, the amount of work completed, and accuracy (J. C. McGinnis, Friman, & Carlyon, 1999). In addition to focusing on multiple dimensions of a response (or a response chain), applied researchers often measure several different responses that, in combination, reflect improvement in the problem of interest. The delivery of courteous service (Johnson & Fawcett, 1994), for example, requires the correct performance of a number of responses and response chains that differ depending on the context of the

employee–customer interaction. In a similar way, contemporary research on response suppression attempts not only to eliminate problematic behavior but also to establish more socially appropriate alternatives. Finally, some dependent variables in applied research are not based on direct measurement of any response but rather on some observable outcome or product of a response. At the individual level, response products might consist of carbon monoxide level as an index of smoking (Dallery & Glenn, 2005) or pounds lost as a measure of diet or exercise (Mann, 1972). Aggregate products such as amount of litter (Bacon-Prue, Blount, Pickering, & Drabman, 1980) or community water usage (Agras, Jacob, & Lebedeck, 1980) have also been used occasionally in applied research when measurement of individual behavior is difficult or impossible.
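As a concrete, and entirely hypothetical, illustration of this multidimensional approach to measurement, the sketch below derives three measures (episode count, total duration, and rate per minute) from a single timestamped observation record. The record format, behavior labels, and numbers are invented for illustration and are not drawn from any of the studies cited above.

# A hypothetical observation record: each entry is (behavior, onset_s, offset_s)
# within a 600-s (10-min) session. Format and values are invented.

session_length_s = 600
record = [
    ("night_waking", 30, 210),    # one waking episode lasting 180 s
    ("night_waking", 400, 430),   # a second episode lasting 30 s
    ("call_out", 50, 50),         # discrete responses scored as momentary events
    ("call_out", 55, 55),
    ("call_out", 410, 410),
]

def frequency(record, behavior):
    # Count of episodes of the behavior (one dimension)
    return sum(1 for b, _, _ in record if b == behavior)

def rate_per_min(record, behavior, session_length_s):
    # Responses per minute (the standard basic-research measure)
    return frequency(record, behavior) / (session_length_s / 60)

def total_duration_s(record, behavior):
    # Total time engaged in the behavior (a second dimension)
    return sum(off - on for b, on, off in record if b == behavior)

print(frequency(record, "night_waking"))         # 2 episodes
print(total_duration_s(record, "night_waking"))  # 210 s
print(rate_per_min(record, "call_out", session_length_s))  # 0.3 per minute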

Treatment Characteristics (Independent Variable) Much of ABA research involves the extension of well-established basic principles such as reinforcement, stimulus control, extinction, and punishment, including parametric and comparative analysis. A more distinctive feature of applied work is that operational features of independent variables take on special characteristics determined by the target behavior of interest or practical considerations during implementation. These characteristics include greater emphasis on supplementary cues, procedural variation as a means of enhancing acceptability, and the use of intervention packages. Supplementary cues.  The aim of research on response acquisition is to produce new behavior, and efficiency is often an important consideration. Although many adaptive performances could be shaped eventually (as they are in the nonhuman basic research laboratory) through differential reinforcement of successive approximations, applied researchers typically rely heavily on supplementary methods. Common procedures include simple instructions (Bates, 1980), static cues such as pictures (Wacker & Berg, 1983), in vivo or video modeling (Geiger, LeBlanc, Dillon, & Bates, 2010), and physical prompting (R. H. Thompson, McKerchar, & Dancho, 2004). 83


Procedural variation.  Interventions evaluated in a research context are designed for implementation under more naturalistic conditions. This fact makes procedural aspects of treatment, although perhaps incidental to their effects, potential determinants of adoption by parents, teachers, and other therapeutic agents. As such, the applied researcher will vary different components of the procedure depending on the consumer of the intervention. For example, interventions designed to teach new behaviors will experiment with components of the procedure that are anticipated to produce more efficient response acquisition. Social acceptability is a major consideration for response suppression procedures, and several studies have explored ways to maintain the intervention’s efficacy while improving the procedure’s acceptability. A good example of the latter is time out from positive reinforcement. Aside from variations in duration (White, Nielsen, & Johnson, 1972) and schedule (Clark, Rowbury, Baer, & Baer, 1973), the form of time out has included seclusion in a room (Budd, Green, & Baer, 1976), physical holding (Rolider & Van Houten, 1985), removal of a ribbon worn around the neck combined with loss of access to ongoing activities (Foxx & Shapiro, 1978), and placement on the periphery of activity so that one could watch but not participate (Porterfield, Herbert-Jackson, & Risley, 1976). Intervention packages.  When the goal of intervention is to simply change behavior without identifying the necessary components of treatment, ABA researchers may combine two or more distinct independent variables into a package. Perhaps the simplest example of such a package is a teacher’s use of praise, smiles, physical contact (a pat on the back), and feedback as consequences for appropriate student behavior. Although a component analysis of treatment effects may be informative because it is unclear which teacher responses actually functioned as positive reinforcement, it may not matter much in a practical sense because none was costly or time consuming. Often, an intervention package, if costly, will be subject to a component analysis after it has proven effective (Yeaton & Bailey, 1983). Similarly, interventions requiring a great deal of effort to implement may benefit from attempts to isolate 84

necessary versus incidental treatment components. For example, the continence training program developed by Azrin and Foxx (1971), although highly effective and efficient, is extremely labor intensive and includes procedures to increase the frequency of elimination, reinforce the absence of toileting accidents as well as the occurrence of correct toileting, teach dressing and undressing skills, and punish toileting accidents with a series of corrective behaviors and time out.

Instrumentation and Design Measurement represents a significant challenge to applied researchers for at least three reasons. First, most research is conducted outside of controlled laboratory settings where recording of responses is not easily automated. Second, dependent variables are often multidimensional in nature and may consist of several distinct response topographies. Finally, when interventions consist of a series of procedures, the consistency of implementing independent variables becomes an additional concern (Peterson, Homer, & Wonderlich, 1982). As a result of these factors, human observers collect most data in applied research because of their versatility in detecting a wide range of responses. This practice introduces the potential for measurement variability that is not found in most basic research. When designing observation procedures, the applied researcher must consider several important factors, including how behavior will be categorically coded by the observers (Meany-Daboul, Roscoe, Bourret, & Ahearn, 2007), how the observers may be most efficiently trained (Wildman, Erickson, & Kent, 1975), how to compute interobserver agreement (Mudford, Taylor, & Martin, 2009), and variables that influence reliability and accuracy of measurement (Kazdin, 1977; Lerman et al., 2010). Experimental control in basic research is typically achieved through a demonstration of reversibility (Cumming & Schoenfeld, 1959); that is, within-subject replication of behavior change resulting from systematic application and removal of the independent variable. Applied researchers often use a similar approach with a variety of reversal-type procedures such as A-B-A, B-A-B, A-B-A-B, and multielement designs (see Chapter 5, this volume).


Occasionally, however, the effects of an applied intervention are irreversible. One situation arises when the independent variable consists of instruction or training whose effects cannot be removed. A second arises when behavior, once changed, encounters other sources of control. For example, a teacher may find it necessary to use prompting, praise, and tangible reinforcers to increase a socially withdrawn child’s rate of peer interaction. Once interaction increases, however, the child may obtain a new source of reinforcement derived from playing with peers, which maintains the child’s social behavior independent of the teacher’s actions. In both of these situations, a multiple-baseline design, involving staggered introduction of treatment across two or more baselines (representing behaviors, participants, or situations), can be used to achieve replication without the necessity of a reversal. The traditional multiple-baseline design involves concurrent measurement in which data are collected across all baselines at the same points in calendar time. Because the logistics of clinical work or the inability to simultaneously recruit participants having unusual problems limits the versatility of this design, a notable variation—the nonconcurrent multiple baseline—has emerged in ABA research and may often represent the design of choice. First described by Watson and Workman (1981) and subsequently illustrated by Isaacs, Embry, and Baer (1982), the nonconcurrent multiple baseline retains the staggered feature of intervention (treatment is applied after baselines of different length); however, measurement across baselines takes place at different points in time. As noted by D. M. Baer et al. (1968), although this design does not contain an explicit control for the confounding effects of a historical accident, differences in both the baseline length and the temporal context of measurement make it extremely unlikely that the same historical accident would coincide with the implementation of treatment on more than one baseline.
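One instrumentation issue mentioned above, computing interobserver agreement, can also be made concrete with a short sketch. The two indices below (total count agreement and interval-by-interval agreement) are conventional formulas offered only as an illustration; they are not procedures taken from Mudford, Taylor, and Martin (2009), and the observation data are invented.

# Two conventional interobserver agreement (IOA) indices, computed from
# invented data; formulas and their trade-offs are discussed in detail by
# Mudford, Taylor, and Martin (2009).

def total_count_ioa(count_obs1, count_obs2):
    # Smaller session total divided by larger total, as a percentage
    smaller, larger = sorted((count_obs1, count_obs2))
    return 100.0 * smaller / larger if larger else 100.0

def interval_ioa(intervals_obs1, intervals_obs2):
    # Interval-by-interval agreement: percentage of intervals in which both
    # observers scored the behavior as occurring or both scored it as absent
    assert len(intervals_obs1) == len(intervals_obs2)
    agreements = sum(a == b for a, b in zip(intervals_obs1, intervals_obs2))
    return 100.0 * agreements / len(intervals_obs1)

# Hypothetical session: two observers' response totals and 10 scored intervals.
print(total_count_ioa(18, 20))                        # 90.0
print(interval_ioa([1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
                   [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]))   # 90.0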

Social Validity The final characteristic of ABA that distinguishes it from basic research is a by-product of application. As noted by D. M. Baer et al. (1968), “A society willing to consider a technology of its own behavior

apparently is likely to support that application when it deals with socially important behaviors” (p. 91). Stated another way, consumers are the final judges of applied work, and the term social validity refers to those judgments about three aspects of ABA: (a) the significance of goals, (b) the acceptability of procedures, and (c) the importance of effects (Kazdin, 1977; Wolf, 1978). A variety of methods have been used to examine the social validity of applied research, including evaluations of the relevance of measures by professionals (R. T. Jones, Kazdin, & Haney, 1981), ratings of behavior change by independent judges (Bornstein, Bach, McFall, Friman, & Lyons, 1980), measures of consumer satisfaction (Bourgeois, 1990), treatment choices made by clients (Hanley, Piazza, Fisher, Contrucci, & Maglieri, 1997), and cost–benefit analysis of the effects of intervention (Van Houten, Nau, & Marini, 1980). Although important in concept, the assessment of social validity raises many questions related to subjective measurement (Hawkins, 1991), bias (Fuqua & Schwade, 1986), and the selection of appropriate reference groups as the basis for comparison (Van Houten, 1979) that have yet to be resolved. Perhaps for this reason, social validation measures continue to be rarely used (J. E. Carr, Austin, Britton, Kellum, & Bailey, 1999). Major Themes in Applied Research

Response Acquisition Relying on the basic principles of learning derived from the experimental analysis of behavior (see Chapters 2 and 3, this volume), applied behavior analysts have developed a technology for teaching new behavior and refining aspects of existing behavior. Much of the teaching technology has been discovered with people diagnosed with developmental disabilities, in part because developing practical, age-appropriate skills is a primary treatment goal for these individuals. Nevertheless, this same technology has been applied in early childhood, regular education, and college classrooms and in workplaces, homes, and playing fields. Independent of the context in which it is applied, ABA approaches to teaching usually progress from (a) establishing


simple and observable responses; to (b) shaping more complex, independent, or fluent responses and response chains; and finally to (c) synthesizing complex or chained responses into socially relevant repertoires such as independently completing self-care skills (T. J. Thompson, Braam, & Fuqua, 1982), reading and comprehending text (Clarfield & Stoner, 2005; Volume 2, Chapter 16, this handbook), solving algebraic and trigonometric problems (Ninness et al., 2006), playing football (J. V. Stokes, Luiselli, Reed, & Fleming, 2010), or socializing with friends (Haring & Breen, 1992). Operant contingency.  The most fundamental tools of the applied behavior analyst are operant contingencies. It follows that an essential aspect of ABA teaching procedures is maximizing the effects of reinforcing consequences on target behaviors. Thus, efforts are taken to ensure that the reinforcers are valuable to the learner at the time of teaching; these antecedent events that establish the value of the reinforcing consequences are referred to as establishing operations (Laraway, Snycerski, Michael, & Poling, 2003). Efforts are also made to use salient events that signal the availability of the reinforcers to the learner; when these events occasion the target response, they are referred to as discriminative stimuli. Often, experience with a properly designed contingency is not sufficient to generate target responses, especially those that the learner has never emitted previously, so careful consideration is often given to prompts that will greatly increase the probability of the target response. So that the adaptive behavior will be emitted under more natural conditions, efforts are taken to transfer control from the prompt to more naturally occurring establishing operations and discriminative stimuli. These prompt-fading procedures will yield independent and generalized performances. We discuss each of these universal elements of ABA approaches to response acquisition in greater detail and in the context of published empirical examples. Selecting a target response.  ABA procedures for developing new behavior share several common characteristics independent of the deficit being addressed or the population with whom it is being addressed. Selection of an objective target behavior 86

is the universal first step. Specific target behaviors are usually selected because they will improve a person's quality of life in both the short and the long term by allowing the person access to new reinforcers and additional reinforcing contexts. Responses targeted with ABA procedures often allow people to live more independently or behave more effectively and efficiently or in more socially acceptable ways. Rosales-Ruiz and Baer (1997) emphasized the importance of selecting behavior cusps, which they defined as changes in behavior that have important consequences for the person beyond the change itself. For example, learning to walk or talk provides the learner with unprecedented access to important materials, social interactions, and sensory experiences that then occasion other, more complex behavior, such as dancing or reading, allowing access to even richer events and interactions (Bosch & Fuqua, 2001). Task analysis.  Once selected, target behaviors are then examined to identify their components and any relevant sequences to the components. This process is often referred to as task analysis (e.g., Cronin & Cuvo, 1979; Resnick, Wang, & Kaplan, 1973; Williams & Cuvo, 1986). The process of task analysis was nicely illustrated in a study by Cuvo, Leaf, and Borakove (1978), who taught janitorial skills to 11 young adults with intellectual disabilities. Six general bathroom-cleaning steps were identified (mirror, sink, urinal, toilet, floor, miscellaneous); each was then subdivided into 13 to 56 component steps before training, which consisted of instructions, modeling, prompting, and differential reinforcement. Task analysis is also an important initial step when teaching almost any athletic skill. For instance, J. V. Stokes, Luiselli, Reed, and Fleming (2010) task analyzed offensive line-blocking skills of starting high school varsity football players before using video and auditory feedback to teach these same skills to other linemen on the team. Much early ABA research identified aspects of task analyses that resulted in better learning. A general finding by Williams and Cuvo (1986) is that greater specificity of a task analysis results in better generalized performance. Performance assessment.  A direct assessment of the learner's baseline performance is conducted to


Performance assessment.  A direct assessment of the learner’s baseline performance is conducted to determine which components of a target skill need to be taught (e.g., Lerman, Vorndran, Addison, & Kuhn, 2004; T. J. Thompson et al., 1982). Teaching is then implemented for those components that the learner has not yet mastered. While assessing tackling performance by high school football players, for example, J. V. Stokes, Luiselli, and Reed (2010) noticed that most players did not place their face mask on the opponent’s chest or wrap their arms around the opponent, so these specific skills were targeted for teaching. T. J. Thompson et al. (1982) sought to teach laundry skills (sorting, washing, and drying clothes) to three young adults with intellectual disabilities. The skill set included 74 distinct behaviors. To enhance the efficiency of the teaching and the evaluation of the teaching procedures, they assessed all 74 skills in a baseline probe and then focused their teaching only on those skills that were not evident during the probe.

Selecting reinforcers.  The success of behavior change programs is heavily dependent on consequences that are delivered as reinforcers, which are typically selected on the basis of an assessment of the learner’s preference for a range of events or activities. These assessments can be indirect, involving an interview of teachers or parents who attempt to describe the learner’s preferences (Cautela & Kastenbaum, 1967; Cote, Thompson, Hanley, & McKerchar, 2007; Fisher, Piazza, Bowman, & Amari, 1996), but in most cases, direct assessments are conducted (e.g., DeLeon & Iwata, 1996; Fisher et al., 1992; Pace, Ivancic, Edwards, Iwata, & Page, 1985; Sundby, Dickinson, & Michael, 1996). Direct assessments of preference usually involve the actual presentation of potentially reinforcing items in one of several formats, measurement of a person’s selections, and brief access to an item after its selection. Items that are selected or interacted with most often are then delivered as consequences for some behavior of interest (see Volume 2, Chapter 12, this handbook, for more detail on preference assessment and reinforcer evaluation as they are often applied among children diagnosed with autism). In fact, a common use of arbitrary responses in ABA is to test the efficacy of preferred items before their selection as reinforcers in teaching programs for socially important behavior (J. E. Carr, Nicolson, & Higbee, 2000; Fisher et al., 1992; Pace et al., 1985).
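As a purely illustrative sketch of how a direct, paired-stimulus preference assessment might be scored, the code below tallies hypothetical selections and ranks items by the percentage of presentations on which each was chosen. The item names and selection data are invented and are not taken from the studies cited above.

```python
# A hypothetical sketch of scoring a paired-stimulus preference assessment:
# each trial presents two items, the learner's selection is recorded, and items
# are ranked by the percentage of presentations on which they were chosen.
# Item names and trial data are invented for illustration.

from collections import Counter
from itertools import combinations

items = ["bubbles", "music", "crackers", "picture book"]

# One hypothetical selection per pair (in practice each pair is usually
# presented more than once and with left/right positions counterbalanced).
selections = {
    ("bubbles", "music"): "music",
    ("bubbles", "crackers"): "bubbles",
    ("bubbles", "picture book"): "bubbles",
    ("music", "crackers"): "music",
    ("music", "picture book"): "music",
    ("crackers", "picture book"): "crackers",
}

chosen = Counter(selections.values())
presented = Counter()
for pair in combinations(items, 2):
    for item in pair:
        presented[item] += 1

ranking = sorted(items, key=lambda item: chosen[item] / presented[item], reverse=True)
for item in ranking:
    pct = 100 * chosen[item] / presented[item]
    print(f"{item}: selected on {pct:.0f}% of presentations")
```

The same tally logic could be applied to other direct formats (e.g., repeated presentations of each pair) by simply adding more recorded trials.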

Establishing reinforcers.  Once reinforcers are identified, tactics to establish and maintain their value throughout the teaching period are considered. One common procedure is to reserve the use of items delivered as consequences for correct responding to formal teaching periods. Vollmer and Iwata (1991) showed that scheduling brief periods without access to events such as food, music, or social interaction increased their effectiveness as reinforcers during later teaching sessions. Roane, Call, and Falcomata (2005) found similar positive effects of limiting access to leisure items outside of the teaching context, and Hanley, Iwata, and Roscoe (2006) showed that access to preferred items for periods as long as 24 to 72 hours before their use could decrease their value. Because mere access to preferred items can diminish their motivational efficacy in teaching programs, other tactics—such as varying the amount and type of reinforcers (Egel, 1981), allowing the learner opportunities to choose among different reinforcers (Fisher, Thompson, Piazza, Crosland, & Gotjen, 1997; Tiger, Hanley, & Hernandez, 2006), and using token reinforcers that can be traded in for a variety of back-up reinforcers—are often arranged in ABA-based teaching programs (Kazdin & Bootzin, 1972; Moher, Gould, Hegg, & Mahoney, 2008). For a discussion of these issues in translational research, see Volume 2, Chapter 8, this handbook.

Contingency, contiguity, and timing.  Ensuring that reinforcers are delivered only for particular target responses during the teaching session is another hallmark of ABA programs. When the target response is reinforced and all other responses are extinguished, the arrangement is referred to as differential reinforcement (see Vladescu & Kodak, 2010, for a recent review). Differential reinforcement and the immediacy with which the reinforcing event is delivered after target responses (Gleeson & Lattal, 1987; Vollmer & Hackenberg, 2001) form strong contingencies that result in rapid learning.
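The sketch below illustrates, with invented event records, the two features just described: the programmed reinforcer should follow only target responses, and it should follow them with minimal delay. The event names, times, and the 1-s "immediacy" criterion are hypothetical, offered only as one way of checking a session record.

```python
# A hypothetical sketch of checking two features of the contingency discussed
# above: (a) only target responses are followed by the programmed reinforcer
# (differential reinforcement) and (b) the reinforcer follows the response with
# minimal delay (contiguity). Event names and times are invented.

events = [
    # (seconds into session, event)
    (12.0, "target response"),
    (12.5, "reinforcer delivered"),
    (30.0, "other response"),
    (55.0, "target response"),
    (58.0, "reinforcer delivered"),
]

MAX_DELAY = 1.0  # an arbitrary criterion, in seconds, for "immediate" delivery

last_target_time = None
for time, event in events:
    if event == "target response":
        last_target_time = time
    elif event == "reinforcer delivered":
        if last_target_time is None:
            print(f"{time:5.1f}s: reinforcer delivered with no preceding target response")
        else:
            delay = time - last_target_time
            flag = "" if delay <= MAX_DELAY else "  <-- delayed delivery"
            print(f"{time:5.1f}s: reinforcer followed the target response by {delay:.1f}s{flag}")
            last_target_time = None
```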


Some evidence has suggested that learning via reinforcement contingencies will proceed more efficiently when the rate of learning opportunities is kept high. For instance, Carnine (1976) showed that both correct responding and task participation of typically developing children were higher when instructions were presented with very brief rather than longer intertrial intervals. Similar results have been reported with a wide variety of skills training among participants with intellectual disabilities (e.g., Koegel, Dunlap, & Dyer, 1980). The mechanism of this effect is, at present, unknown but may be the result of the inability of a low reinforcement rate to maintain adequate levels of participant attention.

Gaining stimulus control.  Strong contingencies of reinforcement not only aid in the teaching of new responses but also make it easier to ensure that responding occurs in the presence of a specific stimulus condition; this correlation between antecedent stimuli and reinforced responding is derived from basic research on stimulus control (Pilgrim, Jackson, & Galizio, 2000; Sidman & Stoddard, 1967). Stimulus control is developed by differentially reinforcing a response in the presence of certain stimulus properties but not others (e.g., saying “dog” in the presence of the letters d-o-g and not in the presence of the letters g-o-d). The formation of stimulus control is an important part of the development of effective behavioral repertoires. Successful reading, for example, is entirely dependent on specific responses (usually vocalizations) coming under control of specific letter combinations (Mueller, Olmi, & Saunders, 2000; Sidman & Willson-Morris, 1974). Accurate responses to instructions occur when specific vocal or motor responses come under control of particular combinations of written or spoken words (Cuvo, Davis, O’Reilly, Mooney, & Crowley, 1992). Developing stimulus control is also important for social behaviors like saying “please” (Halle & Holt, 1991), making a request of a person who appears busy (Kuhn, Chirighin, & Zelenka, 2010), or engaging in important independent living skills, such as an efficient exiting response to a fire alarm (Bannerman, Sheldon, & Sherman, 1991).
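The following minimal sketch expresses the discrimination contingency in the d-o-g/g-o-d example as code: the same spoken response is reinforced in the presence of one printed stimulus and not in the presence of another. The reinforcer name is a hypothetical placeholder, not a prescription.

```python
# A sketch of the discrimination-training contingency described above: the
# response "dog" is reinforced only when the printed word "dog" is displayed.
# The reinforcer ("praise") is a hypothetical placeholder.

def consequence(displayed_text: str, spoken_response: str):
    """Reinforce saying 'dog' only in the presence of the printed word 'dog'."""
    if displayed_text == "dog" and spoken_response == "dog":
        return "praise"   # hypothetical reinforcer
    return None           # no programmed consequence otherwise


trials = [("dog", "dog"), ("god", "dog"), ("dog", "ball")]
for displayed, spoken in trials:
    print(f"card: {displayed!r}, response: {spoken!r} -> {consequence(displayed, spoken)}")
```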

Prompting and prompt fading.  Because the arrangement of contingencies per se may not be sufficient to produce new behavior, a great deal of ABA research has been devoted to the use of prompting and prompt-fading procedures (Demchak, 1990; Gast, VanBiervliet, & Spradlin, 1979; Godby, Gast, & Wolery, 1987; Odom, Chandler, Ostrosky, McConnell, & Reaney, 1992; Schreibman, 1975; Wolery & Gast, 1984; Wolery et al., 1992). Two general types of prompts have been the focus of many applied studies. One is a response prompt, which involves the use of a supplementary cue to occasion a correct response, for example, providing vocal instructions, models, or physical guidance to perform the target behavior. These prompts are often eliminated by delaying the prompt across successive trials until the correct response occurs before (i.e., without) the prompt (Schuster, Gast, Wolery, & Guiltinan, 1988). R. H. Thompson, McKerchar, and Dancho (2004) illustrated an example of response prompts by using physical prompts to teach three infants to emit the manual signs “please” and “more” in the presence of food. The prompts were gradually delayed after the visible presentation of the food until the infants were independently signing for the food.

The second type of prompt is a stimulus prompt, in which some aspect of the discriminative stimulus is modified to more effectively occasion a correct response. For instance, Duffy and Wishart (1987) taught children with Down syndrome and typically developing toddlers to point to particular shapes when the corresponding shape was named. Prompting consisted of initially representing the named shape with an object much larger than those representing the incorrect comparison shapes. Fading was accomplished by gradually increasing the sizes of the incorrect shapes until all shapes were the same size.
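As an illustration of one common prompt-fading arrangement, the sketch below lays out a progressive time-delay schedule in which the prompt is delivered immediately on early trial blocks and then follows progressively longer opportunities for an independent response. The specific delay values are hypothetical and are not those used in the studies cited above.

```python
# A sketch of a progressive prompt-delay schedule: the prompt is delivered
# immediately on early trial blocks and is then delayed by progressively longer
# intervals, giving the learner a chance to respond before the prompt.
# The delay values are hypothetical.

PROMPT_DELAYS = [0, 0, 2, 2, 4, 4, 6]  # seconds, one value per trial block


def prompt_delay(block_number: int) -> int:
    """Return the prompt delay (s) scheduled for a given trial block (1-indexed)."""
    index = min(block_number - 1, len(PROMPT_DELAYS) - 1)
    return PROMPT_DELAYS[index]


for block in range(1, 9):
    delay = prompt_delay(block)
    if delay == 0:
        print(f"block {block}: prompt immediately (0-s delay)")
    else:
        print(f"block {block}: wait {delay} s for an independent response, then prompt")
```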


Shaping.  Certain responses such as early speech sounds are difficult to prompt, in which case shaping may be required to initiate behavior. Shaping involves slight changes in a differential reinforcement contingency such that closer approximations to the target behavior are reinforced over time, and previous approximations are extinguished. Bourret, Vollmer, and Rapp (2004) initially used vocal and model prompts to teach vocalizations to two children diagnosed with autism. The experimenters instructed the participants to emit target utterances (e.g., say “tune”) and reinforced successful utterances with access to music. When children did not emit the target utterance, imitation of simpler models was reinforced (e.g., changing say “tune” to say “tuh”). When the children began imitating the shorter phonemes, the criterion for reinforcement was reapplied to the complete spoken word.

Error reduction.  Differential reinforcement of correct responses is a universal aspect of acquisition programs and may account for changes in the frequency of incorrect as well as correct responses. That is, errors may decrease simply as a result of extinction (Charlop & Walsh, 1986). More often, however, instructors deliver explicit consequences after errors, which can include a simple statement such as “no” (Bennett & Ling, 1972), a prompt to engage in the correct response (Clarke, Remington, & Light, 1986), or a remedial trial consisting of repetition of the trial on which an error was made (Nutter & Reid, 1978), an easier trial (Kirby & Holborn, 1986), or a more difficult trial (Repp & Karsh, 1992). Numerous variations on the remedial strategy have been reported in the literature, and Worsdell et al. (2005) conducted a series of comparative studies on quantitative and qualitative characteristics of remediation. Their results showed that (a) multiple repetitions of the correct response were more effective than a single response repetition, (b) correction for every error made was superior to intermittent error correction, and (c) repetition of relevant training words was slightly superior to mere repetition of irrelevant words. An interesting aspect of the data was that error correction involving presentation of irrelevant material also enhanced learning, implicating negative reinforcement (avoidance of remedial trials) as a source of influence during response acquisition.

Further variations.  Aside from the many ways in which a particular aspect of the teaching situation may be arranged, multiple components may be programmed simultaneously to either increase the efficiency of acquisition or enhance generalization of the effects of the teaching. For instance, Hart and Risley (1968, 1974, 1975, 1980) published a series of studies on the use of incidental teaching—a milieu-based approach to teaching language in

which trials were initiated only when a child showed interest in an object or topic. The key features of their procedures were described as follows:

Whenever a child selected a preschool play material, they were prompted and required to ask for it, first by name (noun), then by the name plus a word that described the material (adjective-noun combination), then by use of a color adjective-noun combination, and finally by requesting the material and describing how they were going to use it (compound sentence). As each requirement was made, the children’s general use of that aspect of language markedly increased. (Hart & Risley, 1974, p. 243)

The changing criterion for reinforcement inherent to shaping procedures was evident in incidental teaching; more significant was the fact that instruction (a) occurred intermittently and (b) capitalized on a child-initiated response that identified both the task (name the object) and the reinforcer (access to the named object). These were novel and important features of the instructional program.

Although many ABA-based procedures are applied to individual learners, many situations arise in which individuals perform or could perform as part of a group, which serve as the occasion for implementing contingencies on group behavior. Group contingencies involve several arrangements in which the performance of one, some, or all members of a group determines the delivery of reinforcement. Perhaps the best early examples of research on group contingencies can be found in studies conducted at Achievement Place, a home-style setting for predelinquent boys (Bailey, Wolf, & Phillips, 1970; Fixsen, Phillips, & Wolf, 1973; Phillips, 1968; Phillips, Phillips, Fixsen, & Wolf, 1971; Phillips, Phillips, Wolf, & Fixsen, 1973). For instance, Phillips et al. (1971) arranged group contingencies to increase promptness, room cleaning, money saving, and watching the news. In a follow-up study, Phillips et al. (1973) showed that a contingency in which a democratically elected peer manager had the authority both to give and to take away points for peers’ performances was more effective and


preferred by the adolescents than an individual contingency. One of the unique features of group contingencies is that they can create a context for unprogrammed peer-mediated contingencies ranging from praise and offers of assistance to criticism and sabotage. Although some of these side effects have been reported (Frankosky & Sulzer-Azaroff, 1978; Hughes, 1979; Speltz, Shimamura, & McReynolds, 1982), a thorough analysis of the types of social interaction generated by different group contingencies and their role in changing behavior has not been conducted.
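The sketch below expresses the three broad group-contingency arrangements just mentioned, in which reinforcement depends on the performance of one member, some minimum number of members, or all members of the group. The names, scores, and criterion are invented for illustration.

```python
# A hypothetical sketch of three group-contingency arrangements: reinforcement
# depends on the performance of one member, of some minimum number of members,
# or of all members. Names, scores, and the criterion are invented.

scores = {"Ana": 8, "Ben": 5, "Cleo": 9, "Dre": 7}
CRITERION = 7  # hypothetical performance criterion


def one_member_meets(scores, member):
    """Dependent arrangement: one designated member's performance decides."""
    return scores[member] >= CRITERION


def some_members_meet(scores, how_many):
    """Reinforcement depends on a minimum number of members meeting criterion."""
    return sum(score >= CRITERION for score in scores.values()) >= how_many


def all_members_meet(scores):
    """Interdependent arrangement: every member must meet the criterion."""
    return all(score >= CRITERION for score in scores.values())


print("dependent (Ana only):", one_member_meets(scores, "Ana"))
print("at least 3 members:  ", some_members_meet(scores, 3))
print("interdependent (all):", all_members_meet(scores))
```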

Maintenance and Generalization

A behavioral technology would have limited clinical value if it failed to produce durable changes in responding. Furthermore, behavior analysts would like performance to transfer across relevant (nontraining) environments and perhaps even to other (untrained) ways of behaving. The term maintenance refers to the persistence of behavior change across time, and the term generalization refers to the persistence of behavior change across settings, people, responses, and other nontraining stimuli. Maintenance and generalization are said to occur if performance persists despite the absence of ancillary antecedents (e.g., prompts) or consequences (e.g., delivery of a token after each response) that originally produced learning. For example, suppose an instructor uses model prompts and reinforcement (e.g., praise plus a piece of candy) to teach a child to say “thank you” when the instructor hands the child a snack in the kitchen. Generalization is said to occur if the child (unprompted) says “thank you” when handed (a) other preferred items (e.g., a toy), (b) a snack in locations other than the kitchen area, or (c) a snack by people other than the instructor, and the response is followed by praise only, candy intermittently, or no consequence at all (T. F. Stokes & Baer, 1977).1 The changed conditions under which “thank you” was said are typically considered examples of stimulus generalization. By contrast, response generalization usually refers to changes in responses that were not directly taught, such as the

child saying, “thanks a lot” instead of “thank you” when handed a snack. Continuing with this example, maintenance is said to occur if the child persists in saying “thank you” without the use of prompts and continuous reinforcement. The persistence and transfer of behavior change are also desirable for behaviors that have been targeted for reduction (Shore, Iwata, Lerman, & Shirley, 1994).

Maintenance and generalization are commonly treated as separate areas of concern, but they are necessarily intertwined. A single occurrence of a behavior in a nontraining context might constitute generalized responding. However, behavior change must persist long enough to be detected and to satisfy clinical goals. Koegel and Rincover (1977) clearly illustrated this distinction between maintenance and generalization. They taught children with autism to follow simple instructions in a therapy setting while simultaneously assessing responding in a different setting, in which a novel therapist who did not reinforce correct responses presented the instructions. An analysis of performance across consecutive instructional trials revealed the emergence of generalized responding for two of three participants. However, performance rapidly extinguished in the generalization setting while being maintained in the training setting in which correct responses continued to produce reinforcement. It is possible that stimuli associated with reinforcement in the training setting acquired exclusive control over responding, that stimuli in the generalization setting became discriminative for extinction, or both. The processes of stimulus control and reinforcement are likely determinants of both maintenance and generalization (Kirby & Bickel, 1988).

Maintenance.  Two primary approaches have been used in applied research to evaluate the persistence of behavior change over time. In the most common approach, experimenters terminate the training condition and then briefly measure the response after an intervening period of time. Successful maintenance has often been reported as an outcome of training when this approach has been used to

1 This treatment of generalization deviates from that in the laboratory, where generalization is tested in extinction. The basic conceptualization of generalization requires that the response occur in the absence of reinforcement. The more pragmatic approach to generalization frequently taken in ABA requires only that the response occur in the absence of the same consequence that produced the original learning.




assess maintenance (e.g., Cummings & Carr, 2009; Pierce & Schreibman, 1994). However, this finding is somewhat surprising because few experimenters have explicitly arranged conditions to promote durable behavior change. Furthermore, although the authors typically delineated the conditions in effect during the maintenance check (e.g., the response did not produce reinforcement), they rarely provided information about potential determinants of maintenance during the intervening period (e.g., number of opportunities to engage in the response; contingencies for responding).

In a second, less common approach to assessing maintenance, experimenters repeatedly measure levels of responding after removing all sources of reinforcement for the response or after replacing the programmed reinforcer (e.g., candy) with a more naturalistic reinforcer (e.g., praise). Performance persisted in these studies only when special arrangements were made to promote maintenance. These arrangements typically took the form of reinforcement schedule thinning (e.g., R. A. Baer, Blount, Detrich, & Stokes, 1987; Ducharme & Holborn, 1997; Hopkins, 1968; Kale, Kaye, Whelan, & Hopkins, 1968; Kazdin & Polster, 1973) or teaching participants to recruit reinforcement from others in the natural environment (e.g., Seymour & Stokes, 1976).

Generalization.  In a seminal article, T. F. Stokes and Baer (1977) summarized various strategies to promote generalization that had appeared in the applied literature at that time. The technology of generalization described in their article has changed very little since its publication. The most commonly used ways to program stimulus generalization include (a) varying the stimulus dimension or dimensions of interest during training (i.e., varying the setting, trainers, materials), (b) ensuring that the stimuli present during training are similar to those in the generalization context, (c) thinning the reinforcement schedule during training, and (d) arranging for the behavior to contact reinforcement in the generalization context (Ducharme & Holborn, 1997; Durand & Carr, 1991, 1992; Marzullo-Kerth, Reeve, Reeve, & Townsend, 2011; T. F. Stokes, Baer, & Jackson, 1974). As discussed by Kirby and Bickel (1988), these approaches likely

promote generalization by preventing the development of inappropriate or restricted stimulus control. They do so by varying stimuli that are irrelevant to the response (e.g., specific location of the learner) while maintaining relevant features of the training situation (e.g., delivery of a particular instruction), or by making stimuli specific to the training setting indiscriminable. In the latter case, thinning the schedule of reinforcement, delaying reinforcement, and interspersing training and generalization tests may prevent the presence of the reinforcer from acquiring a discriminative function for further responding.

Unambiguous examples of response generalization are more difficult to find in the applied literature, and few studies have evaluated factors that might promote this type of generalized responding. Changes in topographical variations of the targeted behavior, similar to the previous example of the student saying “thanks a lot” instead of “thank you,” may be more likely to occur when the targeted response is exposed to extinction (e.g., Duker & van Lent, 1991; Goetz & Baer, 1973). Some authors have reported collateral changes in responses that bear no physical resemblance to the behavior exposed to treatment, but the mechanisms responsible for these changes were unclear. Presumably, the targeted and generalized response forms were members of the same functional response class. In a study by Barton and Ascione (1979), for example, children taught to engage in vocal sharing (e.g., requesting to share others’ materials, inviting others to share their own materials) showed increases in physical sharing (e.g., handing other children toys) even though the experimenters did not directly teach those responses. Koegel and Frea (1993) reported corresponding increases in untreated aspects of social communication, such as appropriate topic content and facial expressions, after teaching children with autism to use appropriate eye gaze and gestures during conversations. A similar type of generalized behavior change has also been reported when only some topographies of problem behavior were exposed to treatment (e.g., Lovaas & Simmons, 1969; Singh, Watson, & Winton, 1986).

Other commonly studied forms of generalization contain elements of both stimulus and response


generalization because different variations of the trained response occur under different variations of the training stimuli. For example, children who receive reinforcement for imitating specific motor movements will begin to imitate novel motor movements in the absence of reinforcement, an emergent skill called generalized imitation (Garcia, Baer, & Firestone, 1971; Young, Krantz, McClannahan, & Poulson, 1994). Other examples can be found in the research on generative language, including the generalized use of the plural morpheme (e.g., Guess, Sailor, Rutherford, & Baer, 1968), subject–verb agreement (e.g., Lutzker & Sherman, 1974), and sentence structure (e.g., “I want _________”; Hernandez, Hanley, Ingvarsson, & Tiger, 2007) and generalization from expressive to receptive language modalities (e.g., Guess & Baer, 1973).
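As one hypothetical way of organizing generalization data of the kind discussed above, the sketch below tallies correct responding separately for trained conditions and for untrained exemplars or settings that were never associated with reinforcement. The probe records are invented for illustration.

```python
# A hypothetical sketch of summarizing generalization probes: correct responding
# is tallied separately for trained exemplar-setting combinations and for
# untrained ones. The probe data are invented.

probes = [
    # (exemplar, setting, trained?, correct?)
    ("red ball",  "classroom", True,  True),
    ("red ball",  "home",      False, True),
    ("blue ball", "classroom", False, True),
    ("blue ball", "home",      False, False),
    ("red ball",  "classroom", True,  True),
]


def percent_correct(records):
    correct = sum(1 for *_, ok in records if ok)
    return 100 * correct / len(records) if records else 0.0


trained = [p for p in probes if p[2]]
untrained = [p for p in probes if not p[2]]

print(f"trained conditions:   {percent_correct(trained):.0f}% correct")
print(f"untrained conditions: {percent_correct(untrained):.0f}% correct")
```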

Response Suppression

Treating maladaptive behavior has been a major concern of applied researchers and clinicians since the inception of the field. Behaviors targeted for reduction have included responses that put the person performing the behavior and others at risk of injury (e.g., aggression, self-injury, cigarette smoking) as well as those that interfere with learning or adaptive behavior (e.g., disruption, noncompliance). From the earliest research, experimenters recognized that many of these behaviors are maintained by reinforcement contingencies and that modifying these contingencies might help alleviate the problem (e.g., Ayllon & Michael, 1959; Lovaas, Freitag, Gold, & Kassorla, 1965; Wolf, Risley, & Mees, 1963). However, the field initially lacked a systematic approach to identifying the variables that maintain problem behavior. Although some early research focused on a range of variables that might be functionally related to serious behavior disorders (e.g., E. G. Carr, Newsom, & Binkoff, 1980; Lovaas & Simmons, 1969; Rincover, Cook, Peoples, & Packard, 1979; Thomas, Becker, & Armstrong, 1968), most were outcome-driven extensions of basic research studies in which the effects of differential reinforcement and punishment were superimposed on unknown reinforcement contingencies for responding (Bostow & Bailey, 1969; Burchard & Barrera, 1972; Skiba, Pettigrew, & Alden, 1971). Although frequently

successful, the latter approach was likely responsible for the inconsistent results reported with most forms of treatment (e.g., Favell et al., 1982) and a greater reliance on punishment in both research and application (Kahng, Iwata, & Lewin, 2002; Pelios, Morren, Tesch, & Axelrod, 1999). Publication of a systematic method for identifying the function or functions of problem behavior (Iwata, Dorsey, Slifer, Bauman, & Richman, 1982) shifted the focus of behavior-analytic approaches to response suppression. Functional analysis methodology involves a direct test of multiple potential reinforcers for problem behavior, including positive reinforcers such as attention or toys and negative reinforcers such as escape from demands. Because of the utility of this assessment approach, treatments that involve terminating the reinforcement contingency for problem behavior (i.e., extinction), delivering the maintaining reinforcer as part of differential reinforcement procedures, and manipulating relevant motivating operations have taken precedence in research and practice. Research has also continued to evaluate the generality of the functional analysis methodology across a variety of behavior problems, populations, and idiosyncratic variables (e.g., Bowman, Fisher, Thompson, & Piazza, 1997; Hagopian, Bruzek, Bowman, & Jennett, 2007). Most recently, knowledge about behavioral function has permitted more detailed analyses of the mechanisms underlying common treatment procedures and factors that influence their effectiveness.

Laboratory research on basic processes that reduce responding provided the foundation for behavior analysts’ current technology of treatments for problem behavior, including the commonly used procedural variations of extinction, differential reinforcement, satiation, and punishment. In the following sections, we provide an overview of these response suppression procedures.
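The comparison at the heart of the functional analysis methodology described above can be sketched in code: mean response rates in each test condition are compared with the rate in a control (play) condition, and clearly elevated rates suggest the reinforcer maintaining the behavior. The session data and the "at least double the control rate" decision rule below are hypothetical simplifications, not a published scoring criterion.

```python
# A hypothetical sketch of summarizing functional analysis data: rates of
# problem behavior in each test condition are compared with the control (play)
# condition. Session values and the elevation rule are invented.

# responses per minute across repeated sessions of each condition
sessions = {
    "attention": [2.1, 1.8, 2.4],
    "escape":    [0.2, 0.4, 0.1],
    "tangible":  [0.3, 0.2, 0.1],
    "alone":     [0.1, 0.0, 0.2],
    "play":      [0.2, 0.1, 0.3],   # control condition
}


def mean(values):
    return sum(values) / len(values)


control_rate = mean(sessions["play"])
ELEVATION = 2.0  # arbitrary rule of thumb: at least double the control rate

for condition, rates in sessions.items():
    if condition == "play":
        continue
    rate = mean(rates)
    elevated = rate >= ELEVATION * control_rate and rate > 0
    note = "elevated relative to control" if elevated else "not elevated"
    print(f"{condition:9s} mean = {rate:.2f}/min ({note})")
```

With the invented values above, only the attention condition exceeds the rule of thumb, the pattern that would suggest attention-maintained behavior.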


Extinction.  Terminating the reinforcement contingency that maintains a behavior is the simplest, most direct way to suppress responding. In application, however, extinction requires knowledge of these maintaining contingencies to ensure that the procedural form of the intervention (e.g., withholding attention, preventing escape from instructions) is matched to behavioral function (e.g., maintenance by positive reinforcement in the form of attention or negative reinforcement in the form of escape from instructions). Early demonstrations of extinction were based on hypotheses about the function of problem behavior, which were then confirmed by withholding the putative maintaining reinforcer. For example, Wolf, Birnbrauer, Williams, and Lawler (1965) speculated that the vomiting of a 9-year-old girl with intellectual disabilities was maintained by escape from the classroom. The experimenters instructed the teacher to refrain from sending the student back to her dormitory contingent on vomiting. The frequency of vomiting decreased to zero levels across 30 days, suggesting that the behavior was, in fact, maintained by negative reinforcement.

Lovaas and Simmons (1969) conducted one of the earliest demonstrations of extinction with behavior maintained by positive reinforcement. The participants were two children with intellectual disabilities who engaged in severe self-injury. On the basis of the assumption that both children’s behavior was maintained by attention from others, the children were left on their beds alone while an observer recorded instances of self-injury from an observation room. Despite the success in reducing the self-injury, the experimenters concluded that extinction was an undesirable form of treatment because of the initial high levels of responding before response reduction.

The development of functional analysis methodology, as described in Volume 2, Chapter 14, this handbook, greatly facilitated the study of extinction and its procedural variations, including extinction of behavior maintained by positive reinforcement (e.g., withholding toys; Day, Rea, Schussler, Larsen, & Johnson, 1988), extinction of behavior maintained by negative reinforcement (e.g., physically guiding compliance to prevent escape from academic demands; Iwata, Pace, Kalsher, Cowdery, & Cataldo, 1990), and extinction of behavior maintained by automatic reinforcement (e.g., applying protective equipment to block the putative sensory reinforcer for self-injury; Kuhn, DeLeon, Fisher, & Wilke, 1999). Nonetheless, reports of some undesirable effects of extinction (response bursts, resistance to

extinction, extinction-induced aggression) led to the more common practice of combining extinction with other treatment procedures. Research findings have supported this practice by showing that extinction is more effective or associated with fewer side effects when combined with differential or noncontingent reinforcement (E. G. Carr & Durand, 1985; Fisher, DeLeon, Rodriguez-Catter, & Keeney, 2004; Lerman & Iwata, 1995; Piazza, Patel, Gulotta, Sevin, & Layer, 2003; Steege et al., 1990; Vollmer et al., 1998). Moreover, it appears that extinction may often be crucial to the effectiveness of these other treatment procedures (Hagopian, Fisher, Sullivan, Acquisto, & LeBlanc, 1998; Mazaleski, Iwata, Vollmer, Zarcone, & Smith, 1993; Zarcone, Iwata, Mazaleski, & Smith, 1994). A key issue in the use of extinction is the detrimental impact of poor procedural integrity on treatment outcomes, along with strategies to remedy this impact. Caregivers are sometimes unwilling or unable to completely withhold reinforcement for problem behavior. Thus, the practical constraints of using extinction in applied settings have recently occasioned further research on ways to treat problem behavior despite continued reinforcement of the behavior.

Differential reinforcement.  Interventions that involve delivering a reinforcer for an alternative behavior (differential reinforcement of alternative behavior [DRA]), for the absence of problem behavior (differential reinforcement of other behavior [DRO]), and for reduced levels of problem behavior (differential reinforcement of low rates [DRL]) remain the most common approaches to treatment. In early treatment studies, differential reinforcement procedures were applied without knowledge of the variables maintaining the targeted behavior. Hence, problem behavior was likely to continue to produce its maintaining reinforcer (e.g., escape from demands) while an irrelevant reinforcer (e.g., candy) was delivered whenever the individual met the reinforcement contingency. Although less than ideal, these interventions were shown to be effective in several studies (e.g., Allen & Harris, 1966). The use of functional reinforcers not only increased the likelihood of success with differential


reinforcement but resulted in the development of a frequently used variation of DRA called functional communication training. With functional communication training, the reinforcer that has maintained problem behavior is delivered for a communicative response (e.g., saying, “break please” to receive escape) while problem behavior is extinguished. Other variations of DRA involve reinforcing an alternative or incompatible (noncommunicative) behavior (e.g., compliance to demands, toy play). Under DRO and DRL schedules, the person performing the behavior receives a reinforcer if problem behavior does not occur, or if it has occurred less than a specified number of times, during a particular time interval. DRO and DRL have less clinical appeal than DRA because no new behaviors are taught; hence, DRA is more commonly used in research and practice.

Recent research on differential reinforcement has focused on determinants of maintenance in applied settings to address problems related to caregiver errors in implementation and transitions from intensive to more practical intervention (e.g., Athens & Vollmer, 2010; Fisher, Thompson, Hagopian, Bowman, & Krug, 2000; Hagopian, Contrucci Kuhn, Long, & Rush, 2005; Hanley, Iwata, & Thompson, 2001; Kuhn et al., 2010; Lalli et al., 1999; St. Peter Pipkin, Vollmer, & Sloman, 2010; Vollmer, Roane, Ringdahl, & Marcus, 1999). This research has shown the following factors to be detrimental to successful treatment outcomes: (a) failing to withhold reinforcement for problem behavior, (b) failing to deliver earned reinforcers, and (c) thinning the schedule of reinforcement for appropriate behavior. These findings have led to additional research on ways to increase the success of differential reinforcement despite these challenges to successful outcomes. Effective strategies have included increasing the quality of reinforcement for appropriate behavior when reinforcement continues to follow problem behavior (Athens & Vollmer, 2010), providing access to alternative stimuli or activities during schedule thinning (Fisher et al., 2000; Hagopian, Contrucci Kuhn, Long, & Rush, 2005), and teaching clients to respond differentially to stimuli associated with periods of reinforcement versus extinction (Kuhn et al., 2010).
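The following sketch simulates a resetting DRO arrangement of the kind described above: a reinforcer is scheduled at the end of each interval without problem behavior, and the interval restarts whenever the behavior occurs. The interval length and event times are invented.

```python
# A hypothetical sketch of a resetting DRO schedule: a reinforcer is scheduled
# at the end of each interval in which problem behavior has not occurred, and
# the interval restarts whenever the behavior occurs. Values are invented.

DRO_INTERVAL = 60.0    # seconds without problem behavior required for reinforcement
SESSION_LENGTH = 300.0

problem_behavior_times = [45.0, 130.0]  # seconds into the session

deliveries = []
interval_start = 0.0
events = sorted(problem_behavior_times) + [SESSION_LENGTH]

for event_time in events:
    # schedule every reinforcer that would be earned before this event
    while interval_start + DRO_INTERVAL <= event_time:
        interval_start += DRO_INTERVAL
        deliveries.append(interval_start)
    if event_time < SESSION_LENGTH:
        interval_start = event_time  # problem behavior occurred: reset the interval

print("reinforcers delivered at (s):", deliveries)
```

Running the sketch with the values above yields deliveries at 105, 190, and 250 s, illustrating how each occurrence of problem behavior postpones the scheduled reinforcer.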

Motivating operations.  Procedures intended to abolish the reinforcing effects of consequences that maintain problem behavior have most commonly taken the form of response-independent delivery of a reinforcer, also called noncontingent reinforcement (e.g., Vollmer, Iwata, Zarcone, Smith, & Mazaleski, 1993). In most applications, the reinforcer that had maintained problem behavior was delivered on a fixed-time or variable-time schedule while problem behavior was exposed to extinction. Other variations of noncontingent reinforcement, however, have been shown to be effective, including delivery of an irrelevant reinforcer (e.g., food, toys) and delivery of reinforcement in the absence of extinction for problem behavior (e.g., Fisher, DeLeon, Rodriguez-Catter, & Keeney, 2004; Lalli et al., 1999; Lomas, Fisher, & Kelley, 2010). The suppressive effects of noncontingent reinforcement also appeared to endure when reinforcer delivery was discontinued for short periods of time (e.g., 10–15 minutes; M. A. McGinnis, Houchins-Juárez, McDaniel, & Kennedy, 2010; O’Reilly et al., 2009).

Most other procedures intended to abolish the reinforcing value of the consequence have focused on modifications to aversive stimuli that set the occasion for problem behavior. These modifications have included reducing the frequency or pace of instructions, changing features of tasks, embedding instructions in preferred activities, and alternating difficult instructions with easier ones (e.g., Dunlap, Kern-Dunlap, Clarke, & Robbins, 1991; Horner, Day, Sprague, O’Brien, & Heathfield, 1991; Kemp & Carr, 1995; Zarcone, Iwata, Smith, Mazaleski, & Lerman, 1994). In nearly all cases, these interventions were combined with extinction for problem behavior.
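As a minimal sketch of the fixed-time delivery that defines noncontingent reinforcement, the code below generates delivery times that do not depend on responding in any way. The schedule value and session length are hypothetical.

```python
# A hypothetical sketch of a fixed-time (noncontingent reinforcement) schedule:
# the reinforcer is delivered at fixed intervals regardless of responding,
# which is what distinguishes it from response-dependent schedules.
# The schedule value and session length are invented.

FIXED_TIME = 30.0       # seconds between deliveries
SESSION_LENGTH = 180.0  # seconds

delivery_times = []
t = FIXED_TIME
while t <= SESSION_LENGTH:
    delivery_times.append(t)  # delivered whether or not any response occurred
    t += FIXED_TIME

print("fixed-time deliveries at (s):", delivery_times)
```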


Punishment.  A variety of procedures have been effective in reducing behavior through an apparent punishment process. Research has shown that a variety of stimuli, including reprimands, physical restraint, water mist, tastes, smells, noise, and shock (Dorsey, Iwata, Ong, & McSween, 1980; Lalli, Livezey, & Kates, 1996; Linscheid, Iwata, Ricketts, Williams, & Griffin, 1990; Maglieri, DeLeon, Rodriguez-Catter, & Sevin, 2000; Sajwaj, Libet, & Agras, 1974; Stricker, Miltenberger, Garlinghouse, & Tulloch, 2003), can decrease problem behavior very quickly and safely. Punishment based on the contingent removal of events, including time out from positive reinforcement and response cost, has also been evaluated in the treatment of behavior disorders (Kahng, Tarbox, & Wilke, 2001; Toole, Bowman, Thomason, Hagopian, & Rush, 2003). Although research on punishment has declined in recent years, a substantial body of work has accumulated over the past five decades, revealing much about the application of punishment and factors that influence treatment outcomes. Consistent with basic findings, this research has indicated that punishment will suppress behavior most effectively when the consequence (a) is delivered immediately, (b) follows nearly all instances of the problem behavior, and (c) is combined with extinction of the problem behavior (see Lerman & Vorndran, 2002, for a review). In addition, treatment outcomes can be enhanced by combining punishment with DRA (R. H. Thompson, Iwata, Conners, & Roscoe, 1999) and establishing discriminative control over the response (e.g., Maglieri et al., 2000; Piazza, Hanley, & Fisher, 1996). Nonetheless, an insufficient amount of research has been conducted to develop prescriptions for long-term maintenance and generalization of punishment effects. Furthermore, although the literature contains numerous reports of desirable and undesirable side effects of punishment (e.g., increases in toy play [Koegel, Firestone, Kramme, & Dunlap, 1974], increases in aggression and crying [Hagopian & Adelinis, 2001]), no research has identified the determinants of these outcomes.

Concluding Comments

The ABA technologies derived from the basic principles of behavior have produced socially important outcomes for many different types of people (e.g., people who abuse substances, college students, athletes, older individuals, employees, people with intellectual disabilities), for a variety of target responses (e.g., literacy, smoking, sleep disturbance, aggression, safe driving), and in a diversity of settings (e.g., businesses, schools, hospitals, homes). Research in ABA has gone beyond simple demonstrations of application, generating knowledge about the mechanisms that underlie common social

problems and how behavioral processes operate under more naturalistic conditions. However, despite more than 50 years of research and practice, some essential questions remain. For example, how do behavior analysts ensure that treatment effects endure over the long term? What approaches are needed to establish complex social repertoires? And how do behavior analysts promote adoption of their technologies by those who would most benefit from them? Moreover, behavior analysts have barely scratched the surface in studying some critical social problems, such as overeating, criminal behavior, and schoolyard bullying. The documented success of their behavioral technologies for remediating other sorts of problems (e.g., self-injury in individuals with developmental disabilities, safety skills of factory workers, reading skills of school-age children, drug use of people with addiction) suggests that extending research and practice into relatively unexplored areas will broaden the impact and reach of ABA.

References

Agras, W. S., Jacob, R. G., & Lebedeck, M. (1980). The California drought: A quasi-experimental analysis of social policy. Journal of Applied Behavior Analysis, 13, 561–570. doi:10.1901/jaba.1980.13-561 Allen, K. E., & Harris, F. R. (1966). Elimination of a child’s excessive scratching by training the mother in reinforcement procedures. Behaviour Research and Therapy, 4, 79–84. doi:10.1016/0005-7967(66)90046-5 Allison, M. G., & Ayllon, R. (1980). Behavioral coaching in the development of skills in football, gymnastics, and tennis. Journal of Applied Behavior Analysis, 13, 297–314. doi:10.1901/jaba.1980.13-297 Athens, E. S., & Vollmer, T. R. (2010). An investigation of differential reinforcement of alternative behavior without extinction. Journal of Applied Behavior Analysis, 43, 569–589. doi:10.1901/jaba.2010.43-569 Ayllon, T., & Michael, J. (1959). The psychiatric nurse as a behavioral engineer. Journal of the Experimental Analysis of Behavior, 2, 323–334. doi:10.1901/jeab.1959.2-323 Azrin, N. H., & Foxx, R. M. (1971). A rapid method of toilet training the institutionalized retarded. Journal of Applied Behavior Analysis, 4, 89–99. doi:10.1901/jaba.1971.4-89 Azrin, N. H., & Lindsley, O. R. (1956). The reinforcement of cooperation between children. Journal


of Abnormal and Social Psychology, 52, 100–102. doi:10.1037/h0042490 Bacon-Prue, A., Blount, R., Pickering, D., & Drabman, R. (1980). An evaluation of three litter control procedures: Trash receptacles, paid workers, and the marked item techniques. Journal of Applied Behavior Analysis, 13, 165–170. doi:10.1901/jaba.1980.13-165 Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97. doi:10.1901/jaba.1968.1-91 Baer, R. A., Blount, R. L., Detrich, R., & Stokes, T. F. (1987). Using intermittent reinforcement to program maintenance of verbal/nonverbal correspondence. Journal of Applied Behavior Analysis, 20, 179–184. doi:10.1901/jaba.1987.20-179

Bosch, S., & Fuqua, R. W. (2001). Behavioral cusps: A model for selecting target behaviors. Journal of Applied Behavior Analysis, 34, 123–125. doi:10.1901/ jaba.2001.34-123 Bostow, D. E., & Bailey, J. (1969). Modification of severe disruptive and aggressive behavior using brief timeout and reinforcement procedures. Journal of Applied Behavior Analysis, 2, 31–37. doi:10.1901/jaba.1969.2-31 Bourgeois, M. S. (1990). Enhancing conversation skills in patients with Alzheimer’s disease using a prosthetic memory aid. Journal of Applied Behavior Analysis, 23, 29–42. doi:10.1901/jaba.1990.23-29 Bourret, J., Vollmer, T. R., & Rapp, J. T. (2004). Evaluation of a vocal mand assessment and vocal mand training procedures. Journal of Applied Behavior Analysis, 37, 129–144. doi:10.1901/jaba.2004.37-129

Bailey, J. S., Wolf, M. M., & Phillips, E. L. (1970). Home-based reinforcement and the modification of pre-delinquents’ classroom behavior. Journal of Applied Behavior Analysis, 3, 223–233. doi:10.1901/ jaba.1970.3-223

Bowman, L. G., Fisher, W. W., Thompson, R. H., & Piazza, C. C. (1997). On the relation of mands and the function of destructive behavior. Journal of Applied Behavior Analysis, 30, 251–265. doi:10.1901/ jaba.1997.30-251

Bancroft, S. L., Weiss, J. S., Libby, M. E., & Ahern, W. H. (2011). A comparison of procedural variations in teaching behavior chains: Manual guidance, trainer completion, and no completion of untrained steps. Journal of Applied Behavior Analysis, 44, 559–569.

Boyle, M. E., & Greer, R. D. (1983). Operant procedures and the comatose patient. Journal of Applied Behavior Analysis, 16, 3–12. doi:10.1901/jaba.1983.16-3

Bannerman, D. J., Sheldon, J. B., & Sherman, J. A. (1991). Teaching adults with severe and profound retardation to exit their homes upon hearing the fire alarm. Journal of Applied Behavior Analysis, 24, 571–577. doi:10.1901/jaba.1991.24-571 Barton, E. J., & Ascione, F. R. (1979). Sharing in preschool children: Facilitation, stimulus generalization, response generalization, and maintenance. Journal of Applied Behavior Analysis, 12, 417–430. doi:10.1901/ jaba.1979.12-417

Brigham, T. A., Graubard, P. S., & Stans, A. (1972). Analysis of the effects of sequential reinforcement contingencies on aspects of composition. Journal of Applied Behavior Analysis, 5, 421–429. doi:10.1901/ jaba.1972.5-421 Budd, K. S., Green, D. R., & Baer, D. M. (1976). An analysis of multiple misplaced parental social contingencies. Journal of Applied Behavior Analysis, 9, 459–470. doi:10.1901/jaba.1976.9-459 Burchard, J. D., & Barrera, F. (1972). An analysis of timeout and response cost in a programmed environment. Journal of Applied Behavior Analysis, 5, 271–282. doi:10.1901/jaba.1972.5-271

Bates, P. (1980). The effectiveness of interpersonal skills training on the social skill acquisition of moderately and mildly retarded adults. Journal of Applied Behavior Analysis, 13, 237–248. doi:10.1901/ jaba.1980.13-237

Carnine, D. W. (1976). Effects of two teacher-presentation rates on off-task behavior, answering correctly, and participation. Journal of Applied Behavior Analysis, 9, 199–206. doi:10.1901/jaba.1976.9-199

Bennett, C. W., & Ling, D. (1972). Teaching a complex verbal response to a hearing-impaired girl. Journal of Applied Behavior Analysis, 5, 321–327. doi:10.1901/ jaba.1972.5-321

Carr, E. G., & Durand, V. M. (1985). Reducing behavior problems through functional communication training. Journal of Applied Behavior Analysis, 18, 111–126. doi:10.1901/jaba.1985.18-111

Bijou, S. W. (1955). A systematic approach to an experimental analysis of young children. Child Development, 26, 161–168.

Carr, E. G., Newsom, C. D., & Binkoff, J. A. (1980). Escape as a factor in the aggressive behavior of two retarded children. Journal of Applied Behavior Analysis, 13, 101–117. doi:10.1901/jaba.1980.13-101

Bornstein, P. H., Bach, P. J., McFall, M. E., Friman, P. C., & Lyons, P. D. (1980). Application of a social skills training program in the modification of interpersonal deficits among retarded adults: A clinical replication. Journal of Applied Behavior Analysis, 13, 171–176. doi:10.1901/jaba.1980.13-171

Carr, J. E., Austin, J. E., Britton, L. N., Kellum, K. K., & Bailey, J. S. (1999). An assessment of social validity trends in applied behavior analysis. Behavioral Interventions, 14, 223–231. doi:10.1002/(SICI)1099078X(199910/12)14:43.0.CO;2-Y


Carr, J. E., Nicolson, A. C., & Higbee, T. S. (2000). Evaluation of a brief multiple-stimulus preference assessment in a naturalistic context. Journal of Applied Behavior Analysis, 33, 353–357. doi:10.1901/ jaba.2000.33-353 Cautela, J. R., & Kastenbaum, R. (1967). A reinforcement survey schedule for use in therapy, training, and research. Psychological Reports, 20, 1115–1130. doi:10.2466/pr0.1967.20.3c.1115 Charlop, M. H., & Walsh, M. E. (1986). Increasing autistic children’s spontaneous verbalizations of affection: An assessment of time delay and peer modeling procedures. Journal of Applied Behavior Analysis, 19, 307–314. doi:10.1901/jaba.1986.19-307 Clarfield, J., & Stoner, G. (2005). The effects of computerized reading instruction on the academic performance of students identified with ADHD. School Psychology Review, 34, 246–254. Clark, H. B., Rowbury, T., Baer, A. M., & Baer, D. M. (1973). Timeout as a punishing stimulus in continuous and intermittent schedules. Journal of Applied Behavior Analysis, 6, 443–455. doi:10.1901/ jaba.1973.6-443

Dallery, J., & Glenn, I. M. (2005). Effects of an Internet-based voucher reinforcement program for smoking abstinence: A feasibility study. Journal of Applied Behavior Analysis, 38, 349–357. doi:10.1901/jaba.2005.150-04 Day, R. M., Rea, J. A., Schussler, N. G., Larsen, S. E., & Johnson, W. L. (1988). A functionally based approach to the treatment of self-injurious behavior. Behavior Modification, 12, 565–589. doi:10.1177/01454455880124005 DeLeon, I. G., & Iwata, B. A. (1996). Evaluation of a multiple-stimulus presentation format for assessing reinforcer preferences. Journal of Applied Behavior Analysis, 29, 519–533. doi:10.1901/jaba.1996.29-519 Demchak, M. (1990). Response prompting and fading methods: A review. American Journal on Mental Retardation, 94, 603–615. Dorsey, M. F., Iwata, B. A., Ong, P., & McSween, T. E. (1980). Treatment of self-injurious behavior using a water mist: Initial response suppression and generalization. Journal of Applied Behavior Analysis, 13, 343–353. doi:10.1901/jaba.1980.13-343

Clarke, S., Remington, B., & Light, P. (1986). An evaluation of the relationship between receptive speech skills and expressive signing. Journal of Applied Behavior Analysis, 19, 231–239. doi:10.1901/jaba. 1986.19-231

Ducharme, D. E., & Holborn, S. W. (1997). Programming generalization of social skills in preschool children with hearing impairments. Journal of Applied Behavior Analysis, 30, 639–651. doi:10.1901/jaba.1997.30-639

Cote, C. A., Thompson, R. H., Hanley, G. P., & McKerchar, P. M. (2007). Teacher report versus direct assessment of preferences for identifying reinforcers for young children. Journal of Applied Behavior Analysis, 40, 157–166. doi:10.1901/jaba.2007.177-05

Duffy, L., & Wishart, J. G. (1987). A comparison of two procedures for teaching discrimination skills to Down’s syndrome and non-handicapped children. British Journal of Educational Psychology, 57, 265–278. doi:10.1111/j.2044-8279.1987.tb00856.x

Cronin, K. A., & Cuvo, A. J. (1979). Teaching mending skills to mentally retarded adolescents. Journal of Applied Behavior Analysis, 12, 401–406. doi:10.1901/ jaba.1979.12-401

Duker, P. C., & van Lent, C. (1991). Inducing variability in communicative gestures used by severely retarded individuals. Journal of Applied Behavior Analysis, 24, 379–386. doi:10.1901/jaba.1991.24-379

Cumming, W. W., & Schoenfeld, W. N. (1959). Some data on behavioral reversibility in a steady state experiment. Journal of the Experimental Analysis of Behavior, 2, 87–90. doi:10.1901/jeab.1959.2-87

Dunlap, G., Kern-Dunlap, L., Clarke, S., & Robbins, F. R. (1991). Functional assessment, curricular revision, and severe behavior problems. Journal of Applied Behavior Analysis, 24, 387–397. doi:10.1901/ jaba.1991.24-387

Cummings, A. R., & Carr, J. E. (2009). Evaluating progress in behavioral programs for children with autism spectrum disorders via continuous and discontinuous measurement. Journal of Applied Behavior Analysis, 42, 57–71. doi:10.1901/jaba.2009.42-57 Cuvo, A. J., Davis, P. K., O’Reilly, M. F., Mooney, B. M., & Crowley, R. (1992). Promoting stimulus control with textual prompts and performance feedback for persons with mild disabilities. Journal of Applied Behavior Analysis, 25, 477–489. doi:10.1901/jaba.1992.25-477 Cuvo, A. J., Leaf, R. B., & Borakove, L. S. (1978). Teaching janitorial skills to the mentally retarded: Acquisition, generalization, and maintenance. Journal of Applied Behavior Analysis, 11, 345–355. doi:10.1901/jaba.1978.11-345

Durand, V. M., & Carr, E. G. (1991). Functional communication training to reduce challenging behavior: Maintenance and application in new settings. Journal of Applied Behavior Analysis, 24, 251–264. doi:10.1901/jaba.1991.24-251 Durand, V. M., & Carr, E. G. (1992). An analysis of maintenance following functional communication training. Journal of Applied Behavior Analysis, 25, 777–794. doi:10.1901/jaba.1992.25-777 Egel, A. L. (1981). Reinforcer variation: Implications for motivating developmentally disabled children. Journal of Applied Behavior Analysis, 14, 345–350. doi:10.1901/jaba.1981.14-345


Favell, J. E., Azrin, N. H., Baumeister, A. A., Carr, E. G., Dorsey, M. F., Forehand, R., & Solnick, J. V. (1982). The treatment of self-injurious behavior. Behavior Therapy, 13, 529–554. doi:10.1016/S0005-7894(82)80015-4 Fisher, W. W., DeLeon, I. G., Rodriguez-Catter, V., & Keeney, K. M. (2004). Enhancing the effects of extinction on attention-maintained behavior through noncontingent delivery of attention or stimuli identified via a competing stimulus assessment. Journal of Applied Behavior Analysis, 37, 171–184. doi:10.1901/jaba.2004.37-171 Fisher, W. W., Piazza, C. C., Bowman, L. G., & Amari, A. (1996). Integrating caregiver report with a direct choice assessment to enhance reinforcer identification. American Journal on Mental Retardation, 101, 15–25. Fisher, W. W., Piazza, C. C., Bowman, L. G., Hagopian, L. P., Owens, J. C., & Slevin, I. (1992). A comparison of two approaches for identifying reinforcers for persons with severe and profound disabilities. Journal of Applied Behavior Analysis, 25, 491–498. doi:10.1901/jaba.1992.25-491 Fisher, W. W., Thompson, R. H., Hagopian, L. P., Bowman, L. G., & Krug, A. (2000). Facilitating tolerance of delayed reinforcement during functional communication training. Behavior Modification, 24, 3–29. doi:10.1177/0145445500241001

Garcia, E., Baer, D. M., & Firestone, I. (1971). The development of generalized imitation within topographically determined boundaries. Journal of Applied Behavior Analysis, 4, 101–112. doi:10.1901/jaba.1971.4-101 Gast, D. L., VanBiervliet, A., & Spradlin, J. E. (1979). Teaching number-word equivalences: A study of transfer. American Journal of Mental Deficiency, 83, 524–527. Geiger, K. B., LeBlanc, L. A., Dillon, C. M., & Bates, S. L. (2010). An evaluation of preference for video and in vivo modeling. Journal of Applied Behavior Analysis, 43, 279–283. doi:10.1901/jaba.2010.43-279 Gewirtz, J. L., & Baer, D. M. (1958). The effect of brief social deprivation on behaviors for a social reinforcer. Journal of Abnormal and Social Psychology, 57, 165–172. doi:10.1037/h0042880 Gleeson, S., & Lattal, K. A. (1987). Response-reinforcer relations and the maintenance of behavior. Journal of the Experimental Analysis of Behavior, 48, 383–393. doi:10.1901/jeab.1987.48-383 Glover, J., & Gary, A. L. (1976). Procedures to increase some aspects of creativity. Journal of Applied Behavior Analysis, 9, 79–84. doi:10.1901/jaba.1976.9-79

Fisher, W. W., Thompson, R. H., Piazza, C. C., Crosland, K., & Gotjen, D. (1997). On the relative reinforcing effects of choice and differential consequences. Journal of Applied Behavior Analysis, 30, 423–438. doi:10.1901/jaba.1997.30-423

Godby, S., Gast, D. L., & Wolery, M. (1987). A comparison of time delay and system of least prompts in teaching object discrimination. Research in Developmental Disabilities, 8, 283–305. doi:10.1016/ 0891-4222(87)90009-6

Fixsen, D. L., Phillips, E. L., & Wolf, M. M. (1973). Achievement place: Experiments in self-government with pre-delinquents. Journal of Applied Behavior Analysis, 6, 31–47. doi:10.1901/jaba.1973.6-31

Goetz, E. M., & Baer, D. M. (1973). Social control of form diversity and the emergence of new forms in children’s blockbuilding. Journal of Applied Behavior Analysis, 6, 209–217. doi:10.1901/jaba.1973.6-209

Foxx, R. M., & Shapiro, S. T. (1978). The timeout ribbon: A nonexclusionary timeout procedure. Journal of Applied Behavior Analysis, 11, 125–136. doi:10.1901/ jaba.1978.11-125

Greene, B. F., Bailey, J. S., & Barber, F. (1981). An analysis and reduction of disruptive behavior on school buses. Journal of Applied Behavior Analysis, 14, 177–192. doi:10.1901/jaba.1981.14-177

France, K. G., & Hudson, S. M. (1990). Behavior management of infant sleep disturbance. Journal of Applied Behavior Analysis, 23, 91–98. doi:10.1901/ jaba.1990.23-91

Guess, D., & Baer, D. M. (1973). An analysis of individual differences in generalization between receptive and productive language in retarded children. Journal of Applied Behavior Analysis, 6, 311–329. doi:10.1901/jaba.1973.6-311

Frankosky, R., & Sulzer-Azaroff, B. (1978). Individual and group contingencies and collateral social behaviors. Behavior Therapy, 9, 313–327. doi:10.1016/S0005-7894(78)80075-6 Fuller, P. R. (1949). Operant conditioning of a vegetative human organism. American Journal of Psychology, 62, 587–590. doi:10.2307/1418565 Fuqua, R. W., & Schwade, J. (1986). Social validation and applied behavioral research. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis (pp. 265–292). New York, NY: Plenum Press.

Guess, D., Sailor, W., Rutherford, G., & Baer, D. M. (1968). An experimental analysis of linguistic development: The productive use of the plural morpheme. Journal of Applied Behavior Analysis, 1, 297–306. doi:10.1901/jaba.1968.1-297 Hagopian, L. P., & Adelinis, J. D. (2001). Response blocking with and without redirection for the treatment of pica. Journal of Applied Behavior Analysis, 34, 527–530. doi:10.1901/jaba.2001.34-527


Hagopian, L. P., Bruzek, J. L., Bowman, L. G., & Jennett, H. K. (2007). Assessment and treatment of problem behavior occasioned by interruption of free-operant behavior. Journal of Applied Behavior Analysis, 40, 89–103. doi:10.1901/jaba.2007.63-05 Hagopian, L. P., Contrucci Kuhn, S. A., Long, E. S., & Rush, K. S. (2005). Schedule thinning following communication training: Using competing stimuli to enhance tolerance to decrements in reinforcer density. Journal of Applied Behavior Analysis, 38, 177–193. doi:10.1901/jaba.2005.43-04

Applied Behavior Analysis, 13, 407–432. doi:10.1901/ jaba.1980.13-407 Hawkins, R. P. (1991). Is social validity what we are interested in? Argument for a functional approach. Journal of Applied Behavior Analysis, 24, 205–213. doi:10.1901/jaba.1991.24-205 Hernandez, E., Hanley, G. P., Ingvarsson, E. T., & Tiger, J. H. (2007). A preliminary evaluation of the emergence of novel mand forms. Journal of Applied Behavior Analysis, 40, 137–156. doi:10.1901/ jaba.2007.96-05

Hagopian, L. P., Fisher, W. W., Sullivan, M. T., Acquisto, J., & LeBlanc, L. A. (1998). Effectiveness of functional communication training with and without extinction and punishment: A summary of 21 inpatient cases. Journal of Applied Behavior Analysis, 31, 211–235. doi:10.1901/jaba.1998.31-211

Hopkins, B. L. (1968). Effects of candy and social reinforcement, instructions, and reinforcement schedule leaning on the modification and maintenance of smiling. Journal of Applied Behavior Analysis, 1, 121–129. doi:10.1901/jaba.1968.1-121

Halle, J. W., & Holt, B. (1991). Assessing stimulus control in natural settings: An analysis of stimuli that acquire control during training. Journal of Applied Behavior Analysis, 24, 579–589. doi:10.1901/jaba. 1991.24-579

Horner, R. H., Day, H. M., Sprague, J. R., O’Brien, M., & Heathfield, L. T. (1991). Interspersed requests: A nonaversive procedure for reducing aggression and self-injury during instruction. Journal of Applied Behavior Analysis, 24, 265–278. doi:10.1901/ jaba.1991.24-265

Hanley, G. P., Iwata, B. A., & Roscoe, E. M. (2006). Factors influencing the stability of preferences. Journal of Applied Behavior Analysis, 39, 189–202. doi:10.1901/jaba.2006.163-04 Hanley, G. P., Iwata, B. A., & Thompson, R. H. (2001). Reinforcement schedule thinning following treatment with functional communication training. Journal of Applied Behavior Analysis, 34, 17–38. doi:10.1901/jaba.2001.34-17 Hanley, G. P., Piazza, C. C., Fisher, W. W., Contrucci, S. A., & Maglieri, K. A. (1997). Evaluation of client preference for function-based treatment packages. Journal of Applied Behavior Analysis, 30, 459–473. doi:10.1901/jaba.1997.30-459 Haring, T. G., & Breen, C. G. (1992). A peer-mediated social network intervention to enhance the social integration of persons with moderate and severe disabilities. Journal of Applied Behavior Analysis, 25, 319–333. doi:10.1901/jaba.1992.25-319 Hart, B. M., & Risley, T. R. (1968). Establishing use of descriptive adjectives in the spontaneous speech of disadvantaged preschool children. Journal of Applied Behavior Analysis, 1, 109–120. doi:10.1901/ jaba.1968.1-109 Hart, B., & Risley, T. R. (1974). Using preschool materials to modify the language of disadvantaged children. Journal of Applied Behavior Analysis, 7, 243–256. doi:10.1901/jaba.1974.7-243 Hart, B., & Risley, T. R. (1975). Incidental teaching of language in the preschool. Journal of Applied Behavior Analysis, 8, 411–420. doi:10.1901/jaba.1975.8-411 Hart, B., & Risley, T. R. (1980). In vivo language intervention: Unanticipated general effects. Journal of

Hughes, H. M. (1979). Behavior change in children at a therapeutic summer camp as a function of feedback and individual versus group contingencies. Journal of Abnormal Child Psychology, 7, 211–219. doi:10.1007/ BF00918901 Hunter, I., & Davison, M. (1982). Independence of response force and reinforcement rate on concurrent variable-interval schedule performance. Journal of the Experimental Analysis of Behavior, 37, 183–197. doi:10.1901/jeab.1982.37-183 Isaacs, C. D., Embry, L. H., & Baer, D. M. (1982). Training family therapists: An experimental analysis. Journal of Applied Behavior Analysis, 15, 505–520. doi:10.1901/jaba.1982.15-505 Iwata, B. A., Dorsey, M. F., Slifer, K. J., Bauman, K. E., & Richman, G. S. (1982). Toward a functional analysis of self-injury. Analysis and Intervention in Developmental Disabilities, 2, 3–20. doi:10.1016/ 0270-4684(82)90003-9 Iwata, B. A., Pace, G. M., Kalsher, M. J., Cowdery, G. E., & Cataldo, M. F. (1990). Experimental analysis and extinction of self-injurious escape behavior. Journal of Applied Behavior Analysis, 23, 11–27. doi:10.1901/ jaba.1990.23-11 Jackson, D. A., & Wallace, R. F. (1974). The modification and generalization of voice loudness in a fifteen-yearold retarded girl. Journal of Applied Behavior Analysis, 7, 461–471. doi:10.1901/jaba.1974.7-461 Johnson, M. D., & Fawcett, S. B. (1994). Courteous service: Its assessment and modification in a human service organization. Journal of Applied Behavior Analysis, 27, 145–152. doi:10.1901/jaba.1994.27-145 99

Lerman, Iwata, and Hanley

Jones, R. J., & Azrin, N. H. (1969). Behavioral engineering: Stuttering as a function of stimulus duration during speech synchronization. Journal of Applied Behavior Analysis, 2, 223–229. doi:10.1901/ jaba.1969.2-223

Koegel, R. L., Firestone, P. B., Kramme, K. W., & Dunlap, G. (1974). Increasing spontaneous play by suppressing self-stimulation in autistic children. Journal of Applied Behavior Analysis, 7, 521–528. doi:10.1901/ jaba.1974.7-521

Jones, R. T., Kazdin, A. E., & Haney, J. I. (1981). Social validation and training of emergency fire safety skills for potential injury prevention and life saving. Journal of Applied Behavior Analysis, 14, 249–260. doi:10.1901/jaba.1981.14-249

Koegel, R. L., & Frea, W. D. (1993). Treatment of social behavior in autism through the modification of pivotal social skills. Journal of Applied Behavior Analysis, 26, 369–377. doi:10.1901/jaba.1993.26-369

Kahng, S., Iwata, B. A., & Lewin, A. (2002). Behavioral treatment of self-injury, 1964–2000. American Journal on Mental Retardation, 107, 212–221. doi:10.1352/0895-8017(2002)107 2.0.CO;2 Kahng, S. W., Tarbox, J., & Wilke, A. E. (2001). Use of a multicomponent treatment for food refusal. Journal of Applied Behavior Analysis, 34, 93–96. doi:10.1901/ jaba.2001.34-93 Kale, R. J., Kaye, J. H., Whelan, P. A., & Hopkins, B. L. (1968). The effects of reinforcement on the modification, maintenance, and generalization of social responses of mental patients. Journal of Applied Behavior Analysis, 1, 307–314. doi:10.1901/jaba. 1968.1-307 Kazdin, A. E. (1977). Artifact, bias, and complexity of assessment: The ABCs of reliability. Journal of Applied Behavior Analysis, 10, 141–150. doi:10.1901/ jaba.1977.10-141 Kazdin, A. E. (1978). History of behavior modification. Baltimore, MD: University Park Press. Kazdin, A. E., & Bootzin, R. R. (1972). The token economy: An evaluative review. Journal of Applied Behavior Analysis, 5, 343–372. doi:10.1901/jaba.1972.5-343 Kazdin, A. E., & Polster, R. (1973). Intermittent token reinforcement and response maintenance in extinction. Behavior Therapy, 4, 386–391. doi:10.1016/ S0005-7894(73)80118-2 Kemp, D. C., & Carr, E. G. (1995). Reduction of severe problem behavior in community employment using an hypothesis-driven multicomponent intervention approach. Journal of the Association for Persons With Severe Handicaps, 20, 229–247. Kirby, K. C., & Bickel, W. K. (1988). Toward an explicit analysis of generalization: A stimulus control interpretation. Behavior Analyst, 11, 115–129. Kirby, K. C., & Holborn, S. W. (1986). Trained, generalized, and collateral behavior changes of preschool children receiving gross-motor skills training. Journal of Applied Behavior Analysis, 19, 283–288. doi:10.1901/jaba.1986.19-283 Koegel, R. L., Dunlap, G., & Dyer, K. (1980). Intertrial interval duration and learning in autistic children. Journal of Applied Behavior Analysis, 13, 91–99. doi:10.1901/jaba.1980.13-91 100

Koegel, R. L., & Rincover, A. (1977). Research on the difference between generalization and maintenance in extra-therapy responding. Journal of Applied Behavior Analysis, 10, 1–12. doi:10.1901/jaba.1977.10-1 Kuhn, D. E., Chirighin, A. E., & Zelenka, K. (2010). Discriminated functional communication: A procedural extension of functional communication training. Journal of Applied Behavior Analysis, 43, 249–264. doi:10.1901/jaba.2010.43-249 Kuhn, D. E., DeLeon, I. G., Fisher, W. W., & Wilke, A. E. (1999). Clarifying an ambiguous functional analysis with matched and mismatched extinction procedures. Journal of Applied Behavior Analysis, 32, 99–102. doi:10.1901/jaba.1999.32-99 Lalli, J. S., Livezey, K., & Kates, K. (1996). Functional analysis and treatment of eye poking with response blocking. Journal of Applied Behavior Analysis, 29, 129–132. doi:10.1901/jaba.1996.29-129 Lalli, J. S., Vollmer, T. R., Progar, P. R., Wright, C., Borrero, J., Daniel, D., & May, W. (1999). Competition between positive and negative reinforcement in the treatment of escape behavior. Journal of Applied Behavior Analysis, 32, 285–296. doi:10.1901/jaba.1999.32-285 Laraway, S., Snycerski, S., Michael, J., & Poling, A. (2003). Motivating operations and terms to describe them: Some further refinements. Journal of Applied Behavior Analysis, 36, 407–414. doi:10.1901/jaba. 2003.36-407 Lerman, D. C., & Iwata, B. A. (1995). Prevalence of the extinction burst and its attenuation during treatment. Journal of Applied Behavior Analysis, 28, 93–94. doi:10.1901/jaba.1995.28-93 Lerman, D. C., Tetreault, A., Hovanetz, A., Bellaci, E., Miller, J., Karp, H., & Toupard, A. (2010). Applying signal detection theory to the study of observer accuracy and bias in behavioral assessment. Journal of Applied Behavior Analysis, 43, 195–213. doi:10.1901/ jaba.2010.43-195 Lerman, D. C., & Vorndran, C. (2002). On the status of knowledge for using punishment: Implications for treating behavior disorders. Journal of Applied Behavior Analysis, 35, 431–464. doi:10.1901/ jaba.2002.35-431 Lerman, D. C., Vorndran, C., Addison, L., & Kuhn, S. A. C. (2004). A rapid assessment of skills in young

Applied Behavior Analysis

children with autism. Journal of Applied Behavior Analysis, 37, 11–26. doi:10.1901/jaba.2004.37-11 Lindsley, O. R. (1956). Operant conditioning methods applied to research in chronic schizophrenia. Psychiatric Research Reports, 5, 118–139. Linscheid, T. R., Iwata, B. A., Ricketts, R. W., Williams, D. E., & Griffin, J. C. (1990). Clinical evaluation of the self-injurious behavior inhibiting device (SIBIS). Journal of Applied Behavior Analysis, 23, 53–78. doi:10.1901/jaba.1990.23-53 Lomas, J. E., Fisher, W. W., & Kelley, M. E. (2010). The effects of variable-time delivery of food items and praise on problem on problem behavior reinforced by escape. Journal of Applied Behavior Analysis, 43, 425–435. doi:10.1901/jaba.2010.43-425 Lovaas, O. I., Freitag, G., Gold, V. J., & Kassorla, I. C. (1965). Experimental studies in childhood schizophrenia: Analysis of self-destructive behavior. Journal of Experimental Child Psychology, 2, 67–84. doi:10.1016/0022-0965(65)90016-0 Lovaas, O. I., & Simmons, J. Q. (1969). Manipulation of self-destruction in three retarded children. Journal of Applied Behavior Analysis, 2, 143–157. doi:10.1901/ jaba.1969.2-143 Lutzker, J. R., & Sherman, J. A. (1974). Producing generative sentence usage by imitation and reinforcement procedures. Journal of Applied Behavior Analysis, 7, 447–460. doi:10.1901/jaba.1974.7-447 Maglieri, K. A., DeLeon, I. G., Rodriguez-Catter, V., & Sevin, B. M. (2000). Treatment of covert food stealing in an individual with Prader-Willi syndrome. Journal of Applied Behavior Analysis, 33, 615–618. doi:10.1901/jaba.2000.33-615 Mann, R. A. (1972). The behavior-therapeutic use of contingency contracting to control an adult behavior problem: Weight control. Journal of Applied Behavior Analysis, 5, 99–109. doi:10.1901/jaba.1972.5-99 Marzullo-Kerth, D., Reeve, S. A., Reeve, K. F., & Townsend, D. B. (2011). Using multiple-exemplar training to teach a generalized repertoire of sharing to children with autism. Journal of Applied Behavior Analysis, 44, 279–294. Mazaleski, J. L., Iwata, B. A., Vollmer, T. R., Zarcone, J. R., & Smith, R. G. (1993). Analysis of the reinforcement and extinction components in DRO contingencies with self-injury. Journal of Applied Behavior Analysis, 26, 143–156. doi:10.1901/jaba.1993.26-143

reinforcement for problem behavior. Journal of Applied Behavior Analysis, 43, 119–123. doi:10.1901/ jaba.2010.43-119 Meany-Daboul, M. G., Roscoe, E. R., Bourret, J. C., & Ahearn, W. A. (2007). A comparison of momentary time sampling and partial-interval recording for evaluating functional relations. Journal of Applied Behavior Analysis, 40, 501–514. doi:10.1901/jaba.2007.40-501 Moher, C. A., Gould, D. D., Hegg, E., & Mahoney, A. M. (2008). Non-generalized and generalized conditioned reinforcers: Establishment and validation. Behavioral Interventions, 23, 13–38. doi:10.1002/ bin.253 Mudford, O. C., Taylor, S. A., & Martin, N. T. (2009). Continuous recording and interobserver agreement algorithms reported in the Journal of Applied Behavior Analysis (1995–2005). Journal of Applied Behavior Analysis, 42, 165–169. doi:10.1901/jaba.2009.42-165 Mueller, M. M., Olmi, D. J., & Saunders, K. J. (2000). Recombinative generalization of within-syllable units in prereading children. Journal of Applied Behavior Analysis, 33, 515–531. doi:10.1901/jaba.2000.33-515 Ninness, C., Barnes-Holmes, D., Rumph, R., McCuller, G., Ford, A. M., Payne, R., & Elliott, M. P. (2006). Transformations of mathematical and stimulus functions. Journal of Applied Behavior Analysis, 39, 299–321. doi:10.1901/jaba.2006.139-05 Nutter, D., & Reid, D. H. (1978). Teaching retarded women a clothing selection skill using community norms. Journal of Applied Behavior Analysis, 11, 475–487. doi:10.1901/jaba.1978.11-475 O’Brien, F., & Azrin, N. H. (1972). Developing proper mealtime behaviors of the institutionalized retarded. Journal of Applied Behavior Analysis, 5, 389–399. doi:10.1901/jaba.1972.5-389 Odom, S. L., Chandler, L. K., Ostrosky, M., McConnell, S. R., & Reaney, S. (1992). Fading teacher prompts from peer-initiation interventions for young children with disabilities. Journal of Applied Behavior Analysis, 25, 307–317. doi:10.1901/jaba.1992.25-307 O’Reilly, M., Lang, R., Davis, T., Rispoli, M., Machalicek, W., Sigafoos, J., & Didden, R. (2009). A systematic examination of different parameters of presession exposure to tangible stimuli that maintain problem behavior. Journal of Applied Behavior Analysis, 42, 773–783. doi:10.1901/jaba.2009.42-773

McGinnis, J. C., Friman, P. C., & Carlyon, W. D. (1999). The effect of token rewards on “intrinsic” motivation for doing math. Journal of Applied Behavior Analysis, 32, 375–379. doi:10.1901/jaba.1999.32-375

Pace, G. M., Ivancic, M. T., Edwards, G. L., Iwata, B. A., & Page, T. J. (1985). Assessment of stimulus preference assessment and reinforcer value with profoundly retarded individuals. Journal of Applied Behavior Analysis, 18, 249–255. doi:10.1901/jaba.1985.18-249

McGinnis, M. A., Houchins-Juarez, N., McDaniel, J. L., & Kennedy, C. H. (2010). Abolishing and establishing operation analyses of social attention as positive

Pelios, L., Morren, J., Tesch, D., & Axelrod, S. (1999). The impact of functional analysis methodology on treatment choice for self-injurious and aggressive 101

Lerman, Iwata, and Hanley

behavior. Journal of Applied Behavior Analysis, 32, 185–195. doi:10.1901/jaba.1999.32-185 Peterson, L., Homer, A. L., & Wonderlich, S. A. (1982). The integrity of independent variables in behavior analysis. Journal of Applied Behavior Analysis, 15, 477–492. doi:10.1901/jaba.1982.15-477 Phillips, E. L. (1968). Achievement place: Token reinforcement procedures in a home-style rehabilitation setting for “pre-delinquent” boys. Journal of Applied Behavior Analysis, 1, 213–223. doi:10.1901/ jaba.1968.1-213 Phillips, E. L., Phillips, E. A., Fixsen, D. L., & Wolf, M. M. (1971). Achievement Place: Modification of the behaviors of pre-delinquent boys within a token economy. Journal of Applied Behavior Analysis, 4, 45–59. doi:10.1901/jaba.1971.4-45 Phillips, E. L., Phillips, E. A., Wolf, M. M., & Fixsen, D. L. (1973). Achievement Place: Development of the elected manager system. Journal of Applied Behavior Analysis, 6, 541–561. doi:10.1901/jaba.1973.6-541 Piazza, C. C., Hanley, G. P., & Fisher, W. W. (1996). Functional analysis and treatment of cigarette pica. Journal of Applied Behavior Analysis, 29, 437–450. doi:10.1901/jaba.1996.29-437 Piazza, C. C., Patel, M. R., Gulotta, C. S., Sevin, B. M., & Layer, S. A. (2003). On the relative contributions of positive reinforcement and escape extinction in the treatment of food refusal. Journal of Applied Behavior Analysis, 36, 309–324. doi:10.1901/jaba.2003.36-309 Pierce, K. L., & Schreibman, L. (1994). Teaching daily living skills to children with autism in unsupervised settings through pictorial self-management. Journal of Applied Behavior Analysis, 27, 471–481. doi:10.1901/jaba.1994.27-471 Pilgrim, C., Jackson, J., & Galizio, M. (2000). Acquisition of arbitrary conditional discriminations by young normally developing children. Journal of the Experimental Analysis of Behavior, 73, 177–193. doi:10.1901/jeab.2000.73-177 Porterfield, J. K., Herbert-Jackson, E., & Risley, T. R. (1976). Contingent observation: An effective and acceptable procedure for reducing disruptive behavior of young children in a group setting. Journal of Applied Behavior Analysis, 9, 55–64. doi:10.1901/ jaba.1976.9-55 Repp, A. C., & Karsh, K. G. (1992). An analysis of a group teaching procedure for persons with developmental disabilities. Journal of Applied Behavior Analysis, 25, 701–712. doi:10.1901/jaba.1992.25-701 Resnick, L. B., Wang, M. C., & Kaplan, J. (1973). Task analysis in curriculum design: A hierarchically sequenced introductory mathematics curriculum. Journal of Applied Behavior Analysis, 6, 679–709. doi:10.1901/jaba.1973.6-679 102

Rincover, A., Cook, R., Peoples, A., & Packard, D. (1979). Sensory extinction and sensory reinforcement principles for programming multiple adaptive behavior change. Journal of Applied Behavior Analysis, 12, 221–233. doi:10.1901/jaba.1979.12-221 Roane, H. S., Call, N. A., & Falcomata, T. S. (2005). A preliminary analysis of adaptive responding under open and closed economies. Journal of Applied Behavior Analysis, 38, 335–348. doi:10.1901/jaba.2005.85-04 Rolider, A., & Van Houten, R. (1985). Movement suppression time-out for undesirable behavior in psychotic and severely developmentally delayed children. Journal of Applied Behavior Analysis, 18, 275–288. doi:10.1901/ jaba.1985.18-275 Rosales-Ruiz, J., & Baer, D. M. (1997). Behavioral cusps: A developmental and pragmatic concept for behavior analysis. Journal of Applied Behavior Analysis, 30, 533–544. doi:10.1901/jaba.1997.30-533 Sajwaj, T., Libet, J., & Agras, S. (1974). Lemon-juice therapy: The control of life-threatening rumination in a six-month-old infant. Journal of Applied Behavior Analysis, 7, 557–563. doi:10.1901/jaba.1974.7-557 Schaefer, H. H. (1970). Self-injurious behavior: Shaping “head-banging” in monkeys. Journal of Applied Behavior Analysis, 3, 111–116. doi:10.1901/jaba. 1970.3-111 Schreibman, L. (1975). Effects of within-stimulus and extra-stimulus prompting on discrimination learning in autistic children. Journal of Applied Behavior Analysis, 8, 91–112. doi:10.1901/jaba.1975.8-91 Schroeder, S. R., & Holland, J. G. (1968). Operant control of eye movements. Journal of Applied Behavior Analysis, 1, 161–166. doi:10.1901/jaba.1968.1-161 Schuster, J. W., Gast, D. L., Wolery, M., & Guiltinan, S. (1988). The effectiveness of a constant time-delay procedure to teach chained responses to adolescents with mental retardation. Journal of Applied Behavior Analysis, 21, 169–178. doi:10.1901/jaba.1988.21-169 Serna, L. A., Schumaker, J. B., Sherman, J. A., & Sheldon, J. B. (1991). In-home generalization of social interactions in families of adolescents with behavior problems. Journal of Applied Behavior Analysis, 24, 733–746. doi:10.1901/jaba.1991.24-733 Seymour, F. W., & Stokes, T. F. (1976). Self-recording in training girls to increase work and evoke staff praise in an institution for offenders. Journal of Applied Behavior Analysis, 9, 41–54. doi:10.1901/jaba.1976.9-41 Shore, B. A., Iwata, B. A., Lerman, D. C., & Shirley, M. J. (1994). Assessing and programming generalized behavioral reduction across multiple stimulus parameters. Journal of Applied Behavior Analysis, 27, 371–384. doi:10.1901/jaba.1994.27-371 Sidman, M., & Stoddard, L. T. (1967). The effectiveness of fading in programming a simultaneous

Applied Behavior Analysis

form discrimination for retarded children. Journal of the Experimental Analysis of Behavior, 10, 3–15. doi:10.1901/jeab.1967.10-3 Sidman, M., & Willson-Morris, M. (1974). Testing for reading comprehension: A brief report on stimulus control. Journal of Applied Behavior Analysis, 7, 327–332. doi:10.1901/jaba.1974.7-327 Singh, N. N., Watson, J. E., & Winton, A. S. (1986). Treating self-injury: Water mist spray versus facial screening or forced arm exercise. Journal of Applied Behavior Analysis, 19, 403–410. doi:10.1901/jaba. 1986.19-403 Skiba, E. A., Pettigrew, L. E., & Alden, S. E. (1971). A behavioral approach to the control of thumbsucking in the classroom. Journal of Applied Behavior Analysis, 4, 121–125. doi:10.1901/jaba.1971.4-121 Skinner, B. F. (1966). What is the experimental analysis of behavior? Journal of the Experimental Analysis of Behavior, 9, 213–218. doi:10.1901/jeab.1966.9-213 Speltz, M. L., Shimamura, J. W., & McReynolds, W. T. (1982). Procedural variations in group contingencies: Effects on children’s academic and social behaviors. Journal of Applied Behavior Analysis, 15, 533–544. doi:10.1901/jaba.1982.15-533 Steege, M. W., Wacker, D. P., Cigrand, K. C., Berg, W. K., Novak, C. G., Reimers, T. M., & DeRaad, A. (1990). Use of negative reinforcement in the treatment of self-injurious behavior. Journal of Applied Behavior Analysis, 23, 459–467. doi:10.1901/jaba.1990. 23-459 Stokes, J. V., Luiselli, J. K., & Reed, D. D. (2010). A behavioral intervention for teaching tackling skills to high school football athletes. Journal of Applied Behavior Analysis, 43, 509–512. doi:10.1901/jaba. 2010.43-509 Stokes, J. V., Luiselli, J. K., Reed, D. D., & Fleming, R. K. (2010). Behavioral coaching to improve offensive line pass-blocking skills of high school football athletes. Journal of Applied Behavior Analysis, 43, 463–472. doi:10.1901/jaba.2010.43-463 Stokes, T. F., & Baer, D. M. (1977). An implicit technology of generalization. Journal of Applied Behavior Analysis, 10, 349–367. doi:10.1901/jaba.1977.10-349 Stokes, T. F., Baer, D. M., & Jackson, R. L. (1974). Programming the generalization of a greeting response in four retarded children. Journal of Applied Behavior Analysis, 7, 599–610. doi:10.1901/ jaba.1974.7-599 St. Peter Pipkin, C., Vollmer, T. R., & Sloman, K. N. (2010). Effects of treatment integrity failures during differential reinforcement of alternative behavior: A translational model. Journal of Applied Behavior Analysis, 43, 47–70. doi:10.1901/jaba.2010.43-47 Stricker, J. M., Miltenberger, R. G., Garlinghouse, M., & Tulloch, H. E. (2003). Augmenting stimulus

intensity with an awareness enhancement device in the treatment of finger sucking. Education and Treatment of Children, 26, 22–29. Sumpter, C., Temple, W., & Foster, T. M. (1998). Response form, force, and number: Effects on concurrent-schedule performance. Journal of the Experimental Analysis of Behavior, 70, 45–68. doi:10.1901/jeab.1998.70-45 Sundby, S. M., Dickinson, A., & Michael, J. (1996). Evaluation of a computer simulation to assess subject preference for different types of incentive pay. Journal of Organizational Behavior Management, 16, 45–67. doi:10.1300/J075v16n01_04 Thomas, D. R., Becker, W. C., & Armstrong, M. (1968). Production and elimination of disruptive classroom behavior by systematically varying teacher’s behavior. Journal of Applied Behavior Analysis, 1, 35–45. doi:10.1901/jaba.1968.1-35 Thompson, R. H., Iwata, B. A., Conners, J., & Roscoe, E. M. (1999). Effects of reinforcement for alternative behavior during punishment of self-injury. Journal of Applied Behavior Analysis, 32, 317–328. doi:10.1901/ jaba.1999.32-317 Thompson, R. H., McKerchar, P. M., & Dancho, K. A. (2004). The effects of delayed physical prompts and reinforcement on infant sign language acquisition. Journal of Applied Behavior Analysis, 37, 379–383. doi:10.1901/jaba.2004.37-379 Thompson, T. J., Braam, S. J., & Fuqua, R. W. (1982). Training and generalization of laundry skills: A multiple probe evaluation with handicapped persons. Journal of Applied Behavior Analysis, 15, 177–182. doi:10.1901/jaba.1982.15-177 Tiger, J. H., Hanley, G. P., & Hernandez, E. (2006). A further evaluation of the reinforcing value of choice. Journal of Applied Behavior Analysis, 39, 1–16. doi:10.1901/jaba.2006.158-04 Toole, L. M., Bowman, L. G., Thomason, J. L., Hagopian, L. P., & Rush, K. S. (2003). Observed increases in positive affect during behavioral treatment. Behavioral Interventions, 18, 35–42. doi:10.1002/bin.124 Van Houten, R. (1979). Social validation: The evolution of standards of competency for target behaviors. Journal of Applied Behavior Analysis, 12, 581–591. doi:10.1901/jaba.1979.12-581 Van Houten, R., Nau, P., & Marini, Z. (1980). An analysis of public posting in reducing speeding behavior on an urban highway. Journal of Applied Behavior Analysis, 13, 383–395. doi:10.1901/jaba.1980.13-383 Vladescu, J. C., & Kodak, T. (2010). A review of recent studies on differential reinforcement during skill acquisition in early intervention. Journal of Applied Behavior Analysis, 43, 351–355. doi:10.1901/ jaba.2010.43-351 103

Lerman, Iwata, and Hanley

Vollmer, T. R., & Hackenberg, T. D. (2001). Reinforce­ ment contingencies and social reinforcement: Some reciprocal relations between basic and applied research. Journal of Applied Behavior Analysis, 34, 241–253. doi:10.1901/jaba.2001.34-241 Vollmer, T. R., & Iwata, B. A. (1991). Establishing operations and reinforcement effects. Journal of Applied Behavior Analysis, 24, 279–291. doi:10.1901/ jaba.1991.24-279 Vollmer, T. R., Iwata, B. A., Zarcone, J. R., Smith, R. G., & Mazaleski, J. L. (1993). The role of attention in the treatment of attention-maintained self-injurious behavior: Noncontingent reinforcement and differential reinforcement of other behavior. Journal of Applied Behavior Analysis, 26, 9–21. doi:10.1901/ jaba.1993.26-9 Vollmer, T. R., Progar, P. R., Lalli, J. S., Van Camp, C. M., Sierp, B. J., Wright, C. S., & Eisenschink, K. J. (1998). Fixed-time schedules attenuate extinctioninduced phenomena in the treatment of severe aberrant behavior. Journal of Applied Behavior Analysis, 31, 529–542. doi:10.1901/jaba.1998.31-529 Vollmer, T. R., Roane, H. S., Ringdahl, J. E., & Marcus, B. A. (1999). Evaluating treatment challenges with differential reinforcement of alternative behavior. Journal of Applied Behavior Analysis, 32, 9–23. doi:10.1901/jaba.1999.32-9 Wacker, D. P., & Berg, W. K. (1983). Effects of picture prompts on the acquisition of complex vocational tasks by mentally retarded individuals. Journal of Applied Behavior Analysis, 16, 417–433. doi:10.1901/ jaba.1983.16-417 Watson, P. J., & Workman, E. A. (1981). The nonconcurrent multiple baseline across individuals design: An extension of the traditional multiple baseline design. Journal of Behavior Therapy and Experimental Psychiatry, 12, 257–259. White, G. D., Nielsen, G., & Johnson, S. M. (1972). Timeout duration and the suppression of deviant behavior in children. Journal of Applied Behavior Analysis, 5, 111–120. doi:10.1901/jaba.1972.5-111

Wolery, M., & Gast, D. L. (1984). Effective and efficient procedures for the transfer of stimulus control. Topics in Early Childhood Special Education, 4, 52–77. doi:10.1177/027112148400400305 Wolery, M., Holcombe, A., Cybriwsky, C., Doyle, P. M., Schuster, J. W., Ault, M. J., & Gast, D. L. (1992). Constant time delay with discrete responses: A review of effectiveness and demographic, procedural, and methodological parameters. Research in Developmental Disabilities, 13, 239–266. doi:10.1016/ 0891-4222(92)90028-5 Wolf, M. M. (1978). Social validity: The case for subjective measurement or how applied behavior analysis is finding its heart. Journal of Applied Behavior Analysis, 11, 203–214. doi:10.1901/jaba.1978.11-203 Wolf, M. M., Birnbrauer, J. S., Williams, T., & Lawler, J. (1965). A note on apparent extinction of the vomiting behavior of a retarded child. In L. P. Ullmann & L. Krasner (Eds.), Case studies in behavior modification (pp. 364–366). New York, NY: Holt, Rinehart, & Winston. Wolf, M. M., Risley, T. R., & Mees, H. (1963). Application of operant conditioning procedures to the behavior problems of an autistic child. Behaviour Research and Therapy, 1, 305–312. doi:10.1016/00057967(63)90045-7 Worsdell, A. S., Iwata, B. A., Dozier, C. L., Johnson, A. D., Neidert, P. L., & Thomason, J. L. (2005). Analysis of response repetition as an error-correction strategy during sight-word reading. Journal of Applied Behavior Analysis, 38, 511–527. doi:10.1901/jaba.2005.115-04 Yeaton, W. H., & Bailey, J. S. (1983). Utilization analysis of a pedestrian safety training program. Journal of Applied Behavior Analysis, 16, 203–216. doi:10.1901/ jaba.1983.16-203 Young, J. M., Krantz, P. J., McClannahan, L. E., & Poulson, C. L. (1994). Generalized imitation and response-class formation in children with autism. Journal of Applied Behavior Analysis, 27, 685–697. doi:10.1901/jaba.1994.27-685

Wildman, B. G., Erickson, M. T., & Kent, R. N. (1975). The effect of two training procedures on observer agreement and variability of behavior ratings. Child Development, 46, 520–524.

Zarcone, J. R., Iwata, B. A., Mazaleski, J. L., & Smith, R. G. (1994). Momentum and extinction effects on self-injurious escape behavior and noncompliance. Journal of Applied Behavior Analysis, 27, 649–658. doi:10.1901/jaba.1994.27-649

Williams, G. E., & Cuvo, A. J. (1986). Training apartment upkeep skills to rehabilitation clients: A comparison of task analytic strategies. Journal of Applied Behavior Analysis, 19, 39–51. doi:10.1901/ jaba.1986.19-39

Zarcone, J. R., Iwata, B. A., Smith, R. G., Mazaleski, J. L., & Lerman, D. C. (1994). Reemergence and extinction of self-injurious escape behavior during stimulus (instructional) fading. Journal of Applied Behavior Analysis, 27, 307–316. doi:10.1901/jaba.1994.27-307

104

Chapter 5

Single-Case Experimental Designs
Michael Perone and Daniel E. Hursh

Single-case experimental designs are characterized by repeated measurements of an individual’s behavior, comparisons across experimental conditions imposed on that individual, and assessment of the measurements’ reliability within and across the conditions. Such designs were integral to the development of behavioral science. Early work in the field of psychology depended on the analysis of the experiences of one or a few individuals (Ebbinghaus, 1885/1913; Thorndike, 1911; Wertheimer, 1912). The investigator identified a phenomenon (e.g., learning and memory, the law of effect, the phi phenomenon) and pursued experimental arrangements that assessed its reliability and the functional relations among the pertinent variables (e.g., the relation between the length of a series of nonsense syllables and learning curves, recall, and retention; the relation between the consequences of behavior and the rate of the behavior; the relation between an observer’s distance from blinking lights and appearance of movement). Because the research was conducted on the investigators themselves (e.g., the memory work of Ebbinghaus) or on just a few participants (e.g., Thorndike’s cats and Wertheimer’s human observers), the experimental arrangements often involved intensive study, with numerous measurements of behavior recorded while each individual was studied under a variety of conditions. Only after the development of statistical methods for analyzing aggregate data did the focus shift to comparisons across groups of participants, with each group exposed to a single condition (see also Chapter 8, this volume). In the original case, the

“participants” were plants in fields split into plots. The statistical methods were developed to assess the significance of differences in yields of plots of plants treated differently. R. A. Fisher’s (1925) Statistical Methods for Research Workers set the course for the field. Fisher began development of his methods while employed as the statistician at an agricultural experiment station early in his career. The fact that data on large numbers of participants tend to be normally distributed (regardless of whether the participants are plants, people, or other animals) led to the easy adaptation of group statistical methods to research with humans. The standard practice came to emphasize the importance of group means, differences in these means, and the use of statistical tests to draw inferences about the likelihood that the group differences were representative of differences in the populations of interest (e.g., Kazdin, 1999; Perone, 1999). Despite the rise of group statistical methods, single-case designs continued to be used in some important work because they allowed the investigator to study the details of relations among variables as expressed in the behavior of individuals (e.g., Bijou, 1955; Skinner, 1938; Watson, 1913), which resulted in reasonably clear demonstrations of functional relations among the variables being studied (e.g., conditioned startle responses, reinforcement, and schedules of reinforcement). Articulation of the necessary elements of single-case designs, notably in Sidman’s (1960) seminal Tactics of Scientific Research, helped make the designs practically de rigueur in basic research on free-operant behavior




(Baron & Perone, 1998; Johnston & Pennypacker, 2009; Perone, 1991). Translation of basic laboratory research for application in everyday situations resulted in the further development of how single-case research designs were to serve applied researchers (Baer, Wolf, & Risley, 1968, 1987; Bailey & Burch, 2002; Barlow, Nock, & Hersen, 2009; Morgan & Morgan, 2009; see Chapter 8, this volume).

In this chapter, we describe and provide examples of the various design elements that constitute single-case methods. We begin by considering the fundamental requirement of any experiment—internal validity—and the kinds of obstacles to internal validity that are most likely to be encountered in single-case experiments. Next, we describe a variety of designs, ranging in complexity, that are commonly associated with the single-case approach. Included are designs to study irreversible or reversible changes in behavior, experimental conditions arranged successively or simultaneously, and the effects of one or more independent variables. In each instance, we evaluate the degree to which the design can overcome obstacles to internal validity. Some designs, for practical or ethical reasons, exclude important controls and thus compromise internal validity, but most single-case designs are robust in promoting internal validity. A great strength of the single-case approach is its flexibility, and we describe how single-case designs can be adjusted dynamically, over the course of an experiment, in response to the ongoing pattern of results. We go on to review the commitment of single-case investigators to identifying and taking command of the variables that control behavior. This commitment is expressed in the steady-state strategy that underlies most contemporary single-case research. Finally, we describe how interparticipant replication, a seeming departure from a single-case approach, is needed to assess the degree to which an investigator has succeeded in identifying and controlling relevant variables (see also Chapter 7, this volume).

Internal Validity of Single-Case Experiments

The essential goal of an experiment is to make valid decisions about causal relations between the

variables of interest. When the results of an experiment provide clear evidence that manipulation of the independent variable caused the changes measured in the dependent variable, the experiment is said to have internal validity. Investigators are also concerned with other kinds of validity. Kazdin (1999) and Shadish, Cook, and Campbell (2002) listed construct validity, statistical conclusion validity, and external validity. Of these, external validity, which is concerned with the generality of experimental outcomes across populations, settings, times, or variables, seems to draw the lion’s share of attention from methodologists. This critically important issue is addressed by Branch and Pennypacker in this volume’s Chapter 7. Here we need only say that from the standpoint of experimental design, internal validity takes precedence because it is prerequisite to external validity. Unless an investigator can describe the functional relation between the independent and dependent variables with confidence, worrying about the generality of the relation would be premature. As Campbell and Stanley (1963) put it, “Internal validity is the basic minimum without which an experiment is uninterpretable” (p. 5). (For a thoughtful discussion of the interplay between internal and external validity, see Kazdin, 1999, pp. 35–38, and for a more general discussion considering all four types of validity, see Shadish et al., 2002, pp. 93–102.) Experimental designs are judged largely in terms of how well they promote internal validity. It may be helpful to think of a completed experiment as a kind of argument in which the design and results lead to a conclusion about causality. Internal validity has to do with the persuasiveness of the argument. Consumers of the research—journal reviewers and editors initially—will differ in their susceptibility to the argument, which is why editors weigh the judgments of several reviewers to render a verdict on the validity of an experiment and whether a report of it merits publication. It may also be helpful to remember, as you read this or any other chapter about experimental design, that good design can only foster internal validity; it cannot guarantee it. Internal validity is determined not only by the experimental design but also by the experimental outcomes. Consider, for example, a simple experiment to evaluate a treatment to reduce


smoking. The investigator begins by taking a few weeks to measure the baseline rate of smoking (e.g., in cigarettes per day). Suppose the treatment is applied, and after a few weeks smoking ceases altogether. Finally, the treatment is withdrawn, that is, the investigator reinstates the baseline conditions. What happens next is critical to an evaluation of the experiment's internal validity. If smoking recovers, returning to levels near those observed during the initial baseline, the investigator can make a strong inference about the reductive effect of the treatment on smoking. If smoking fails to recover, however, the causal status of the treatment is ambiguous. It might have been the cause of a permanent reduction in smoking, but the evidence is open to alternative accounts. It is possible that some other variable, operating over the course of time, is responsible for the absence of smoking. Fortunately, there are ways to resolve the ambiguity; they are discussed later in the Designs for Irreversible Effects section. The general point remains: A final decision about internal validity must wait until the data have been collected and analyzed and conclusions about the effect of the experimental treatment have been made.

Internal validity is fostered by designs that eliminate or reduce the influence of extraneous variables that could compete with the independent variable for control of the dependent variable. The investigator's design objective is to eliminate such variables—famously labeled by Campbell and Stanley (1963) as threats to internal validity—or, if that is not possible, to equalize their effects across experimental conditions so that they are not confounded with the independent variable. Because single-case experiments compare conditions imposed on an individual, investigators must guard against threats that operate as a function of time or repeated exposure to experimental treatments: history, maturation, testing, and instrumentation.

History, in this context, generally refers to the influence of factors outside the laboratory. For example, an increase in the tobacco tax during a smoking cessation study could contribute to a smoker's success in giving up the habit and inflate the apparent effect of the experimental treatment. Maturation refers to processes occurring within the research participant. As the name implies, they

may be developmental in character; for example, with age, changes in cognitive and social development could affect the efficacy of cartoons as reinforcers. Maturational variables may also involve shorter term processes such as fatigue, boredom, and hunger, and investigators should be aware of these processes even in highly controlled laboratory experiments. Working in the animal laboratory, McSweeney and her colleagues (e.g., McSweeney & Roll, 1993) showed that even when the procedure is held constant, response rates may change systematically over the course of a session. There has been some disagreement about the responsible process (the primary contenders are satiation and habituation; see McSweeney & Murphy, 2000), but from the standpoint of experimental design this disagreement does not matter. What does matter is that any design that compares treatment conditions arranged early and late in a session may confound the conditions with a maturational process.

Testing is a concern when repeated exposure to a measurement procedure may, in itself, affect behavior. Investigators who rely on verbal measures may be especially concerned. It is obvious that asking a participant the same questions over and over could lead to stereotyped answers, thus blocking the test's sensitivity to changes in experimental treatments. It may be less obvious that purely operant procedures are also susceptible to the testing threat. For example, as rats gain experience with fixed-ratio schedules, they tend to acquire increasingly efficient response topographies. Over a series of sessions with a fixed-ratio schedule, these changes in responding will be confounded with the effects of the experimental conditions.

Instrumentation is a threat when systematic changes or drift in a measuring device may contaminate the data collected over the course of a study. An investigator may neglect to periodically recalibrate the force required to activate an operandum, for example, or the sensitivity of a computer touch screen may be reduced by repeated use. The instrumentation threat is most likely an issue in research that relies on human observers to collect or code data (Chapter 6, this volume). Prudent investigators will carefully consider both the methods used to train their human observers and those aspects of


their experimental protocol that may influence the consistency of the observers' work.

These four time- and experience-related threats to internal validity can be addressed successfully in single-case designs by way of replication. Throughout an experiment, behavior is measured repeatedly so that the effect of the experimental manipulation can be assessed on a nearly continuous basis. Kazdin (1982) emphasized the importance of repeated measurement by calling it the fundamental requirement of single-case designs (p. 104). If the behavioral measures show (a) minimal variation in value across time within each experimental condition, (b) systematic differences across conditions, and (c) comparable values when conditions are replicated, then the experimental manipulation is the most plausible causal factor. With such a pattern of results, the influence of extraneous factors categorized as history, maturation, testing, or instrumentation would appear to be either eliminated or held constant.
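A brief illustrative aside (not part of the original chapter): the three patterns just listed can be summarized from phase-labeled measurements with a few lines of Python. The phase labels and all numbers below are hypothetical.

```python
# Illustrative sketch with hypothetical data: summarize repeated measurements
# by phase to examine (a) within-phase variability, (b) between-phase
# differences, and (c) agreement between replicated phases.
from statistics import mean, pstdev

sessions = [
    ("A1", [12, 13, 12, 14]),  # first baseline phase
    ("B1", [5, 4, 4, 3]),      # first intervention phase
    ("A2", [11, 13, 12, 13]),  # baseline replicated
    ("B2", [4, 3, 4, 4]),      # intervention replicated
]

for label, values in sessions:
    print(f"{label}: mean = {mean(values):.1f}, SD = {pstdev(values):.1f}")

# Small SDs within phases, clear A-versus-B differences, and close agreement
# between A1 and A2 (and between B1 and B2) correspond to patterns (a)-(c).
a_means = [mean(v) for label, v in sessions if label.startswith("A")]
b_means = [mean(v) for label, v in sessions if label.startswith("B")]
print("A-phase means:", a_means, "B-phase means:", b_means)
```

Summaries of this kind are only a convenience; they do not substitute for inspection of the session-by-session data.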

Designs

Next, we turn to some illustrative designs and consider the degree to which they are likely to be successful in addressing threats to internal validity.

Designs Without Replicated Conditions

Two simple designs that omit replication of experimental conditions have appeared in the literature. These designs do, however, involve repeated measurement within a condition, allowing investigators to rely on patterns in the results over time to assess the possible impact of an intervention. The intervention-only design (Moxley, 1998) is most useful in situations in which it is unethical to take the time to collect baseline data (as with dangerous or illegal behavior) or it is not feasible (as in instructional situations in which the yet-to-be-taught behavior is absent from the participant's repertoire). The data collected early in the process of intervening serve as a kind of baseline for changes that occur as the intervention proceeds. Changes that are systematic, such as accelerations, decelerations, or changes in variability, are taken as evidence of the intervention's effectiveness.

Considered in the abstract, the intervention-only design would appear to be unacceptably weak in its defense against threats to internal validity. Consider the idealized pattern of results in Figure 5.1. The increase in behavior could be the result of the intervention, but it is also easy to imagine how it might result from, say, historical or maturational factors. Details about the procedure and the independent and dependent variables might lead to a more positive evaluation of the study's validity. Suppose, for example, the behavior represented in Figure 5.1 is correct operations of a factory machine and the intervention is some procedure for training the correct operation. Suppose also that the machine is unique in both form and operation—nothing similar is available outside the factory training environment. Under these restricted circumstances, attributing the improved performance to the training is plausible. Still, one must admit that the conclusion is limited, and the restricted circumstances needed to support it might be rare.

Figure 5.1.  Idealized results in an intervention-only design. The increase in behavior over the initial values, consistent with the goal or expected effect of the intervention, is taken as evidence that the intervention caused the increase.

The baseline–intervention or A-B design improves on the intervention-only design by adding a true baseline phase. In the idealized results shown in Figure 5.2, a stable behavioral baseline is followed by a conspicuous change that coincides with the intervention. The time course of behavioral change in the intervention phase is similar to that shown in Figure 5.1 for the intervention-only design. The evidence of an intervention effect is strengthened in the A-B design because the intervention results are preceded by a lengthy series of measurements in which change is absent. The causal inference—that the intervention is responsible for the behavioral change—is supported by the fact that the behavior changed only when the intervention was implemented. More generally, an immediate change in level, trend, or variability coincident with the beginning of the intervention is taken as evidence of a possible functional relation between the intervention and the dependent variable.

Figure 5.2.  Idealized results in a baseline–intervention or A-B design. A stable behavioral baseline is followed by a conspicuous change coincident with the intervention, suggesting that the intervention caused the change.

Although the A-B design is an improvement over the intervention-only design, it remains susceptible to history, maturation, and testing effects (and perhaps also to instrumentation effects). The plausibility of these threats to internal validity is exacerbated when the experimental outcomes fall short of the ideal, as is often the case, especially in applied research in which field settings may compromise experimental control of extraneous variables and ethical or clinical concerns may prevent the collection of extended baseline data. The shorter the baseline is, the more the A-B design comes to resemble the intervention-only design. If the baseline measurements are characterized by significant variability, it may be difficult to claim that any change in behavior is clearly coincident with the treatment, which is especially the case if the baseline variability is systematic. For example, if an upward trend is apparent in the baseline, continuation of the trend in the intervention phase cannot with confidence be attributed to the treatment. Even a long and stable baseline is no guarantee of internal validity: To the extent that behavioral change is delayed from the onset of the intervention, alternative explanations may become increasingly plausible. In recognition of these sometimes insurmountable limitations of designs without replications, a considerable range of designs has evolved that includes replications.
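A brief illustrative aside (not part of the original chapter): with hypothetical A-B data, the level and trend of each phase can be summarized as follows. The numbers are invented, and a summary of this kind only describes the data; it cannot by itself rule out the alternative explanations discussed above.

```python
# Illustrative sketch with hypothetical cigarettes-per-day data for an A-B
# comparison: compute the mean (level) and least-squares slope (trend) of
# each phase.
baseline = [10, 11, 10, 12, 11, 10, 11]   # A phase (hypothetical)
intervention = [9, 7, 6, 4, 3, 2, 1]      # B phase (hypothetical)

def slope(values):
    """Least-squares slope of the values against observation number."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den

for name, values in [("Baseline (A)", baseline), ("Intervention (B)", intervention)]:
    level = sum(values) / len(values)
    print(f"{name}: level = {level:.1f}, trend = {slope(values):+.2f} per observation")

# A flat, stable baseline followed by an immediate change in level or trend at
# the phase change is the pattern the A-B design depends on; a baseline that is
# already drifting leaves the same numbers open to other interpretations.
```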

Designs With Successive Conditions

A straightforward extension of the A-B design yields a major improvement in promoting internal validity: Simply reinstate the baseline condition after the intervention—an A-B-A design. A common variation is the B-A-B design, in which the intervention is imposed in the first phase, withdrawn in the second, and reinstated in the third. In either case, the underlying logic is the same, and in both the ideal outcome is for behavior to change in some systematic way from the first phase to the second and then return to initial values when the original condition is reinstated.

In the A-B-A-B design, the replication of the baseline is followed by a replication of the intervention. If a change occurs in the data patterns that replicates or approximates those of the first intervention phase, the plausibility of history, maturation, testing, or instrumentation effects is reduced even further, and a compelling case can be made for the intervention's effectiveness. Put simply, the likelihood of other events being responsible for behavioral changes is greatly reduced if the changes occur when and only when the conditions are changed. The A-B-A-B design contains an initial demonstration of an effect (the first A to B change), shows that the effect is likely the result of the intervention (the B to A change), and convinces one of that by replicating the effect (the second A to B change). Figure 5.3 illustrates a possible outcome. The hypothetical data in this particular example fall short of the ideal: The initial baseline is brief, and behavior is still changing when each of the subsequent three conditions is terminated. With such an outcome, the experiment leaves unanswered the ultimate effect of the experimental treatment. Nevertheless, the systematic changes in trend that coincide repeatedly with the initiation of the intervention (B) and baseline (A) phases leave no doubt about the causal role of the experimental treatment. It would be highly implausible to claim that something other than the treatment was responsible for reducing the behavior.

Figure 5.3.  Hypothetical results in an A-B-A-B design. The experimental treatments in the two B phases consistently reduce behavior, and reinstatement of the baseline procedure in the second A phase increases behavior. The reversibility of the behavior change in this pattern of results supports causal inferences about the experimental treatment.

Many research questions call for a comparison across two or more interventions. Several design options are available. One may use an A-B-A (or A-B-A-B) design in which both the A and B phases involve an experimental treatment. If a conventional baseline is desired, one may use an A-B-A-C-A design or perhaps an A-B-C-B design (in which A designates the conventional baseline and B and C designate distinct interventions). In all of these designs, each condition is imposed for a series of observations so that the effect of each treatment is given sufficient time to become evident (as in Figure 5.3). In basic laboratory research with rats or pigeons, it is not unusual for a condition to be imposed for weeks of daily sessions until behavior stabilizes and the behavioral effect is replicated from one observation to the next (this topic is discussed in the Steady-State Strategy section).
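A brief illustrative aside (not part of the original chapter): the reversal logic of the A-B-A-B design can be expressed as a simple check on hypothetical phase means, asking whether behavior changes in the expected direction at every phase change.

```python
# Illustrative sketch with hypothetical data from an A-B-A-B experiment in
# which the treatment is intended to reduce behavior.
phases = [
    ("A", [20, 21, 20, 22]),   # baseline
    ("B", [14, 10, 8, 7]),     # treatment
    ("A", [12, 16, 18, 19]),   # baseline reinstated
    ("B", [13, 9, 7, 6]),      # treatment reinstated
]

means = [(label, sum(values) / len(values)) for label, values in phases]
for (label1, m1), (label2, m2) in zip(means, means[1:]):
    direction = "decreased" if m2 < m1 else "increased"
    print(f"{label1} -> {label2}: mean {direction} from {m1:.1f} to {m2:.1f}")

# If the mean decreases at each A-to-B change and increases at each B-to-A
# change, accounts based on history, maturation, testing, or instrumentation
# become correspondingly implausible.
```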

Designs With Simultaneous Conditions

Another tactic for comparing interventions involves changing the conditions frequently to assess their relative impacts quickly. For example, a therapist

may want to identify the most effective way to get a client to talk more rationally about his or her fears. One approach may be to debunk any irrational talk; another may be to suggest alternative rational ways to talk about fears. The therapist can simply alternate these approaches within or across sessions and observe which approach produces more rational talk. A teacher who wants to know whether the latest approach to spelling is effective may use that approach on some days and the old approach on other days while assessing the students' spelling performance throughout to decide whether the latest approach is better. This design tactic requires that the outcomes being assessed are likely to be sensitive to such frequent changes and that the experience of one intervention has only minimal impact on the effectiveness of the alternatives. Such designs are called multielement designs (Sidman, 1960; Ulman & Sulzer-Azaroff, 1975). In one variation on this tactic, the alternating-treatments design, two or more treatments are alternated rapidly (Barlow et al., 2009). The operational definition of rapid depends on the experimental context and could involve individual conditions lasting from minutes to days. For example, a therapist may, within a single session, switch back and forth from debunking irrational talk to suggesting alternative rational ways to talk about a client's fears, or a teacher may spend a week on the old approach to spelling before switching to the latest approach.

Figure 5.4 shows a common way to present the results from experiments with an alternating-treatments design. Results from the experimental treatments are represented by different symbols; the separation of the two functions documents the difference in the treatments' effectiveness, and more important, the reproducibility of the difference across time attests to the reliability of the effect. When inspecting a graph such as that in Figure 5.4, it is important to remember that the design involves a special kind of reversal, in that the behavior is rising and falling across successive presentations of the two treatments. The highly reliable character of the effects of the two treatments is obscured by the graphing convention: Results from like conditions are connected, even though the data points do not represent successive observations.
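A brief illustrative aside (not part of the original chapter): the graphing convention just described can be reproduced with a few lines of Python, here using the matplotlib library (assumed to be installed). The treatment names and values are hypothetical.

```python
# Illustrative sketch of the alternating-treatments graphing convention:
# lines connect data points from like conditions, not successive sessions.
import matplotlib.pyplot as plt

# (session, treatment, measure) triples from a hypothetical rapid alternation.
observations = [
    (1, "Debunking", 12), (2, "Alternatives", 20), (3, "Debunking", 13),
    (4, "Alternatives", 24), (5, "Debunking", 11), (6, "Alternatives", 26),
    (7, "Debunking", 12), (8, "Alternatives", 27),
]

for treatment, marker in [("Debunking", "o"), ("Alternatives", "s")]:
    xs = [s for s, t, _ in observations if t == treatment]
    ys = [v for _, t, v in observations if t == treatment]
    plt.plot(xs, ys, marker=marker, label=treatment)

plt.xlabel("Session")
plt.ylabel("Hypothetical measure of behavior")
plt.legend()
plt.show()
```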


Figure 5.4.  Conventional presentation of results in an alternating-treatments design. The lines do not connect the data points in temporal sequence; rather, they connect data points collected under like treatment conditions.

In another multielement design, the experimental treatments are arranged concurrently, with the participant choosing which to access (sometimes called simultaneous-availability-of-all-conditions design [Browning, 1967] or, more commonly in the basic literature, simply a concurrent schedule). In many cases, the goal is to assess preferences. For example, a therapist may ask the client which tactic he or she wants the therapist to use during the session or the teacher may ask students which approach to spelling they want to use that day. A problem arises, however, if the participant's choices are unconstrained: One treatment may be chosen to the exclusion of the other. Such an outcome may represent a strong preference, but it could also represent happenstance, as when a participant selects a particular option at the outset of an experiment and simply sticks with it. Without adequate exposure to all of the options, it would be inappropriate to draw conclusions about preference or, indeed, even to consider the procedure as arranging a meaningful choice. Procedures have been developed to address this problem and ensure that the participant is regularly exposed to the available treatment conditions. Some investigators devote portions of the experiment to forced-choice procedures that momentarily constrain the participant's options to a single treatment (e.g., Mazur, 1985). When the concurrent assessment involves schedules of reinforcement, the schedules can be arranged so that reinforcement

rates can be maximized only if the participant occasionally samples all of the schedules (Stubbs & Pliskoff, 1969).

We have discussed multielement designs in the context of comparisons across experimental treatments, a design tactic that Sidman (1960) called multielement manipulations. Multielement designs can also be used to measure an experimental treatment's effect on two or more different response classes or operants, a design tactic that Sidman called multielement baselines. The idea is to arrange the experimental circumstances to generate two or more behavioral baselines more or less simultaneously, which can be accomplished by arranging a multiple schedule or concurrent schedules. Once stable baselines have been established, an experimental treatment is applied to both. For example, a multiple schedule might be arranged with contingencies to engender high rates of behavior in one component and low rates in the other. In one or more experimental conditions, a drug may be administered to discover whether the effect of the drug depends on the baseline rate (e.g., Lucki & DeLong, 1983).

Multielement designs have a major strength as well as a significant limitation. Their strength is in promoting internal validity. Because multielement designs allow experimental treatments to be compared almost simultaneously (i.e., within a single session or pair of sessions), the influence of the time-related threats of history, maturation, testing, and instrumentation is equalized across the conditions. Their limitation is that the temporal juxtaposition of the two conditions may generate different effects than the conditions might generate if arranged in isolation from one another—or, put another way, the treatments may interact. The use of signals to demarcate the treatments and foster discrimination between them, as in the concurrent schedule variant, is sometimes intended to reduce the interaction. Another step is to separate the treatments in time; if the treatments are not temporally contiguous, the effect of one treatment is less likely to carry over to the next. In basic laboratory experiments, this separation is effected by interposing time outs between the components of a multiple schedule. In field experiments, the separation may arise in


the customary scheme of things—for example, when treatments are alternated across school days or across weekly therapy sessions. There is no guarantee, however, that these steps actually do prevent interacting treatments. The only sure way to allay this concern is to conduct additional research in which each treatment is studied in isolation.

Designs for Irreversible Effects

So far, we have considered two general classes of experimental designs. The first consists of the intervention-only and baseline–intervention (A-B) designs. Although these designs may be justifiable under special circumstances, they are undesirable because, in general, they provide little protection against time- and experience-related threats to internal validity. The second class of experimental designs promotes internal validity through replication of experimental conditions. The difference between the two classes can be summarized this way: In an experiment with an A-B design, any change observed from A to B might be related to the experimental treatment, but—depending on the particulars of the experiment—the change might reflect the operation of maturation, history, testing, or instrumentation. Adding replications (e.g., in A-B-A, A-B-A-B, or multielement designs) tests these alternative explanations. Each replication, if accompanied by appropriate changes in behavior, makes it less plausible that something other than the experimental treatment could have caused the changes.

To promote internal validity, the designs in the second class require that the participant experience another treatment or a return to baseline. Replicating conditions is not always possible or desirable, however, for several reasons. First, some treatment effects are not likely to disappear simply because the treatment has been discontinued (e.g., the learning of a math fact, reading skill, or social skill that allows the learner access to desirable items or activities). The use of an A-B-A design to assess such an irreversible outcome will yield ambiguous results: When the baseline condition is replicated, the behavior remains unchanged. It is not possible to say whether the outcome is the persistent effect of the treatment or the effect of some other factor.

Another problem arises in cases in which a participant's experience with one treatment has an impact on the effects produced by another treatment (e.g., being taught decoding skills can result in more rapid sight word learning). If the two treatments were compared in an alternating-treatments design, their effects would be obscured. The last problem is ethical rather than logistical: If the treatment effect is beneficial (e.g., reduction in self-injurious behavior), it would be undesirable to withdraw it and return behavior to pretreatment values even if the withdrawal might decisively demonstrate the treatment's efficacy.

Multiple-baseline designs.  One way to avoid the practical, ethical, and confounding problems of withdrawing, reversing, or alternating treatments is to arrange for the replication of a treatment's impact to occur across participants, behaviors, or settings. These multiple-baseline designs (Baer et al., 1968) were developed just for such situations. Data are collected under two or more independent baseline conditions. The baselines often come from more than one participant, but they may also come from the same participant engaged in different behaviors or from the same participant behaving in different settings. Once the baseline behavior is shown to be stable, the experimental treatment is implemented in time-staggered fashion to one baseline (i.e., one participant, behavior, or setting) at a time. The treatment is added to a second baseline only after its impact on the first baseline has become obvious. Thus, the baseline data for untreated participants, responses, or settings serve as a control for confounding variables. That is, if changes are observed when and only when the treatment is applied to each of the participants, responses, or settings, it is unlikely that other variables can account for the changes. An idealized pattern of results is shown in Figure 5.5.

Some examples may help illustrate the three common variants of the multiple-baseline design. If a teacher has experience suggesting that peer tutoring may help some of her students who struggle with solving equations, that teacher may assign a peer tutor to one struggling student at a time to observe whether each of the struggling students'


equation-solving performance improves when they begin to work with their peer tutor and not before (multiple-baseline design across participants). If a parent has heard from other parents that developing a behavior contract can be a successful means of getting their child to do their chores, that parent may create an initial contract that includes only one chore and then add chores to the contract one at a time as he or she observes that the child's completion of each chore becomes reliable only after it is added to the contract (multiple-baseline design across behaviors). If a mental health worker serves a client who has difficulty purchasing items, that mental health worker may provide modeling, guidance, and reinforcement for the client's purchasing skills at a neighborhood convenience store, then provide the same treatment at the supermarket, and if successful there provide the same treatment at the department store across town (multiple-baseline design across settings).

All of these multiple-baseline designs require the feasibility of taking frequent measures more or less concurrently across more than one participant, class of behavior, or setting. When such frequent measurement is not feasible, multiple-probe designs (Horner & Baer, 1978) are available. These designs differ from multiple-baseline designs in that instead of frequent measurements, only occasional probe measurements are taken. That is, the teacher, parent, or mental health worker mentioned in the examples arranges to measure the outcomes less often. He or she may assess the outcomes only weekly rather than daily, even though the experimental conditions (baseline or treatment) would be implemented continuously.

Figure 5.5.  A multiple-baseline design with an experimental treatment imposed in staggered temporal fashion across three independent baselines. The baselines could represent the behavior of different participants, the behavior of one participant in different settings, or the different behaviors of one participant. The strict coincidence between the imposition of the treatment and the appearance of behavior change allows the change to be attributed to the treatment.
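The staggered logic of the design can also be sketched in a few lines of code. The simulation below is purely illustrative (the baseline names, onset sessions, and response values are hypothetical), but it shows the pattern an investigator hopes to see: each series changes only after its own treatment begins.

```python
# A schematic simulation (invented numbers) of the multiple-baseline logic:
# three independent baselines with the treatment introduced at staggered
# points; each series changes only after its own treatment onset.
import random

random.seed(1)
treatment_starts = {"baseline_1": 6, "baseline_2": 11, "baseline_3": 16}  # staggered onsets
n_sessions = 20

for name, start in treatment_starts.items():
    series = []
    for session in range(1, n_sessions + 1):
        if session < start:
            series.append(round(2 + random.random(), 1))   # low, stable baseline level
        else:
            series.append(round(8 + random.random(), 1))   # higher level after treatment
    print(name, series)
```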

Changing-criterion designs.  What if the research problem is restricted to just a single baseline—only one participant, one class of behavior, or one setting—and it is not practical or ethical to withdraw or reverse treatment? We have already described two ways to deal with such a situation: the intervention-only design and the A-B design. We have also noted the weaknesses of these designs as regards internal validity. A third option, the changing-criterion design (Hartmann & Hall, 1976), offers better protection against threats to internal validity.

This design is well suited to the study of variables that can be implemented progressively. For example, a teacher may use token reinforcers to help a student develop fluency in solving math problems. After measuring the student's baseline rate of problem solving, the teacher may offer a token if the student's rate is increased by, say, 10%. Each time the student's rate of problem solving stabilizes at the new criterion for reinforcement, the criterion is raised. If, as illustrated in Figure 5.6, the student's performance repeatedly conforms to the succession of increasingly stringent criteria, it is possible to attribute the changes in performance to the changing experimental treatment. As this example implies, changing-criterion designs are especially useful when the goal is to assess treatments designed to shape skilled performances or engender novel behavior (Morgan & Morgan, 2009).
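A brief sketch of the changing-criterion arithmetic may be useful. The numbers below are hypothetical; the 10% step simply echoes the example in the text.

```python
# A small sketch of the changing-criterion logic: starting from a measured
# baseline rate, each new criterion requires roughly a 10% increase over the
# preceding one (the 10% step is the example value used in the text).
baseline_rate = 4.0          # problems solved per minute during baseline (hypothetical)
step = 0.10                  # proportional increase required at each criterion change
n_criteria = 5

criterion = baseline_rate
for phase in range(1, n_criteria + 1):
    criterion = round(criterion * (1 + step), 2)
    print(f"Phase {phase}: reinforce rates of at least {criterion} per minute")
```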

Additional Design Options

Two additional classes of single-case designs are commonly used, especially in the basic experimental analysis of behavior.


Figure 5.6.  A changing-criterion design. Reinforcement is contingent on particular rates of behavior; each time behavior adjusts to a rate criterion, a new criterion is imposed.

Parametric designs.  Experiments that compare several levels of a quantitative treatment are said to use parametric designs. The literature on choice (e.g., Chapter 14, this volume) abounds with such designs; for example, in studies of matching, a pigeon may be exposed to a series of conditions that differ in terms of the distribution of food reinforcers across a pair of concurrently available response keys. Across successive conditions, the relative rate of reinforcement might be progressively increased (an ascending order) or decreased (a descending order), or the rates may be imposed in some irregular order.

From a design standpoint, the issue is how to dissociate the effects of the experimental variable from those of maturation, history, testing, or instrumentation. If an experiment arranges five relative rates in an ascending sequence, the design might be designated an A-B-C-D-E design. It is easy to see that the fundamental logic parallels that of the A-B design, and as such, the design is vulnerable to the same threats to internal validity. If, for example, relative response rates rise across the successive conditions, the outcome may be attributed to the experimental manipulation (response allocations match reinforcer allocations), but alternative explanations in terms of maturation, history, testing, or instrumentation may also be plausible. As actually implemented, however, parametric designs rarely suffer from this problem. Three

strategies are commonly used. First, one or more conditions are replicated to separate the effects of the treatment from the effects associated with timing. For example, one could replace a deficient A-B-C-D-E design with an A-B-C-D-E-A design or perhaps an A-B-C-D-E-A-C design. If the rising relative response rates result from some time-related or experiential factor, the rates should continue to rise in the replicated conditions. If, however, the rates revert back to the values observed in the initial A (and C) conditions, one can safely attribute the behavioral effects to the manipulation of relative reinforcement rate.

The second strategy is to implement the conditions not in an ascending or descending sequence but rather in an irregular sequence. If response rates rise or fall simply in relation to the temporal position of the condition, the results may be attributed to time-related or experiential factors. If, instead, the rates are systematically related to the levels of the experimental variable (e.g., if response allocations match reinforcer allocations), the most plausible explanation would identify the experimental variable as the causal factor.

The last strategy departs from a purely single-case analysis: Different participants are exposed to the conditions in different orders. For example, one participant may experience an ascending sequence while another experiences a descending sequence and yet a third experiences an irregular sequence, or each participant may receive a different irregular order. If the behavior of all the participants shows the same relation to the experimental variable, despite the variation in the temporal order of the conditions, then it would again appear that the experimental manipulation is responsible.

It is beneficial to combine these strategies. For example, one might arrange one or more replicated conditions as part of each participant's experience, while arranging different sequences of conditions across participants. If a systematic relation between the experimental manipulation and behavior is observed under such circumstances, the case for attributing causality to the experimental manipulation becomes compelling.
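One way these strategies might be operationalized is sketched below. The participant labels and condition names are hypothetical; the point is only that ascending, descending, and irregular orders can be assigned across participants, with a replicated condition appended for each.

```python
# A sketch of one way to assign condition orders across participants in a
# parametric design: one ascending order, one descending, and the rest
# individually shuffled irregular orders (all names and values hypothetical).
import random

random.seed(7)
levels = ["A", "B", "C", "D", "E"]      # five levels of the quantitative variable
participants = ["P1", "P2", "P3", "P4"]

orders = {
    "P1": levels,                        # ascending sequence
    "P2": list(reversed(levels)),        # descending sequence
}
for p in participants[2:]:
    irregular = levels[:]
    random.shuffle(irregular)            # a different irregular order per participant
    orders[p] = irregular

for p, order in orders.items():
    # replicate the first condition at the end as a within-participant check
    print(p, order + [order[0]])
```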


Yet another approach is to combine the parametric strategy with the A-B-A reversal strategy. An investigator might begin the experiment with a baseline Condition A (or treat the first level of the quantitative independent variable as a baseline) and, after stabilizing behavior at each successive level of the quantitative variable (Conditions B, C, etc.), return to the baseline condition. Thus, an A-B-C-D-E design could be replaced with an A-B-A-C-A-D-A-E-A design. The obvious disadvantage is the large investment of time in repeating the baseline condition. The advantage is that the effect of each treatment can be evaluated relative to a fixed baseline.

Factorial designs.  Behavior is controlled by multiple variables at any given moment, and experiments may be designed to analyze such control by including all possible combinations of the levels of two or more independent variables. These factorial designs are ubiquitous in the behavioral and biomedical sciences. They tend to be associated with group statistical traditions—indeed, a staple of graduate training in psychology is to teach the statistical methods of analysis of variance in the context of factorial research designs (e.g., Keppel & Wickens, 2004). Nevertheless, the factorial strategy is by no means restricted to group statistical approaches (Smith, Best, Cylke, & Stubbs, 2000) and is readily used in single-case experiments.

As an example, consider an unpublished experiment (Wade-Galuska, Galuska, & Perone, 2004) concerned with variables that affect pausing on fixed-ratio schedules. A pigeon was trained on a multiple schedule in which 100 pecks on a response key produced either 2-second or 6-second access to mixed grain. Different key colors signaled the two schedule components, designated here as lean (ending in 2-second access to grain) and rich (ending in 6-second access). This arrangement (details are available in Perone & Courtney, 1992) made it possible to study, on a within-session basis, the effects of two factors on the pausing that took place between components: the magnitude of the reinforcer delivered before the pause (the past reinforcer, lean or rich) and the signaled magnitude of the reinforcer to be delivered on completing the next ratio (the upcoming reinforcer, lean or rich). Another factor was manipulated across successive phases of the experiment: The pigeon's body weight

was 70%, 80%, or 90% of its free-feeding weight. Thus, the experiment had a 2 × 2 × 3 factorial design (two levels of past reinforcer × two levels of upcoming reinforcer × three levels of body weight) and, therefore, 12 combinations of the levels of the three factors. The results are shown in Figure 5.7. Each panel represents one of the body weight conditions. Note that this factor was manipulated quantitatively in an ascending series (70%, 80%, 90%), with a final phase devoted to a replication of the 70% condition. In this way, following the recommendations offered earlier in the Parametric Designs section, the experiment disentangled any confound between time- or experience-related processes and the experimental variable of body weight. Within each panel are the median pauses, calculated over the last 10 sessions of each body weight condition, in each of the four possible combinations of the other two experimental variables, the past and upcoming reinforcer magnitudes. The past reinforcer is shown on the x-axis and the upcoming reinforcer is shown with filled (lean) and unfilled (rich) data points.
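For readers who think in code, the 12 cells of this design can be enumerated directly. The labels below follow the description in the text; the snippet is only an illustration of the factorial structure, not of the experimental procedure.

```python
# A short sketch enumerating the 12 cells of the 2 x 2 x 3 factorial design
# described above (past reinforcer x upcoming reinforcer x body weight).
from itertools import product

past_reinforcer = ["lean", "rich"]
upcoming_reinforcer = ["lean", "rich"]
body_weight = ["70%", "80%", "90%"]

cells = list(product(past_reinforcer, upcoming_reinforcer, body_weight))
print(len(cells))        # 12 combinations
for past, upcoming, weight in cells:
    print(f"past={past}, upcoming={upcoming}, body weight={weight}")
```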

Figure 5.7.  A factorial design to study three factors that could affect pausing on a fixed-ratio schedule: Past schedule condition (lean [L] or rich [R], represented on the x-axis), upcoming (Upc.) schedule condition (L or R, represented by filled and unfilled circles, respectively), and body weight (expressed as a percentage of free-feeding weight; each weight condition is represented in a different panel). Note the replication of the 70% body weight condition (rightmost panel). The results are from a single pigeon; shown are medians and interquartile ranges of the last 10 sessions of each condition. Data from Wade-Galuska, Galuska, and Perone (2004).


Within each condition, pausing was a joint function of the past and upcoming schedules of reinforcement. When the key color signaled that the upcoming schedule would be rich (unfilled circles), the past reinforcer had no apparent effect: Pausing was brief after both lean and rich schedules. When the key color signaled that the upcoming schedule would be lean (filled circles), however, the past reinforcer had a major impact: Pausing was extended after a rich schedule. In other words, the effect of the past reinforcer was bounded by, or depended on, the signaled magnitude of the next reinforcer. When the effect of one factor depends on the level of another factor, the factors are said to interact. In the conventional terminology of factorial research design, the interaction between the past and upcoming magnitudes of reinforcement would be called a two-way interaction.

The interaction itself depended on the level of the body weight factor: As body weight was increased, the interaction between the past and upcoming magnitudes of reinforcement was enhanced. This kind of finding constitutes a three-way interaction. Note also that in the final phase of the experiment, replicating the 70% body weight condition reduced the interaction between the magnitudes to the values observed in the first phase.

In applied research, the use of single-case factorial designs can also prove beneficial. An example is assessment of the interaction between the type of directive and reinforcement contingencies as they affect participants' compliance with the directives and disruptive behavior (Richman et al., 2001). This three-experiment sequence first established the effectiveness of various forms of directives, then assessed their effectiveness across situations, and finally assessed the interaction between the forms of the directives and targets of differential reinforcement contingencies. All of the experiments used multielement designs to determine the impact of the independent variables on the outcomes for each of the participants.

Factorial designs are prevalent in the behavioral sciences specifically because they provide a framework for describing how multiple variables interact to control behavior. The presence of an interaction sheds light on the boundaries of a variable's effect

and thereby allows for more complete and general descriptions of functional relations between environment and behavior.

Flexibility of Implementation

A strength of single-case research designs lies in the dynamics of their implementation. The examples we have offered of various single-case designs are merely the usual ways in which the single-case research design strategy is used. It is important to recognize that, in practice, the designs may be modified in response to the pattern of results that emerges as the data are collected. Indeed, this feature of the approach is what led Skinner (1956) to favor single-case designs. It is also possible to combine aspects of the basic single-case designs and even include aspects of group comparisons. This kind of flexibility can be an asset to any program of experimental research. It takes on special significance when the research topic is novel, when the investigator's ability to exert experimental control is limited by ethical or logistical considerations, and when the goal is to produce an empirically validated therapeutic result for an individual.

It is possible that once a behavior is changed, withdrawal of the treatment or reversal of the contingencies in an A-B-A design may not return the behavior to baseline values. From a therapeutic or educational standpoint, this is not a bad thing: In the long run, the therapist or teacher usually wants the participant's behavior to come under the control of, and be maintained by, the consequences it automatically produces, so that the participant no longer depends on an intervention or treatment (see Chapter 7, this volume). However, from an experimental standpoint, it is a serious problem because it leaves unanswered the question of what caused the behavior to change in the first place: Was it the experimental treatment or some process of maturation, history, testing, or instrumentation? When behavior fails to revert to baseline values in an A-B-A design, the investigator may switch to a multiple-baseline design (if data have been collected for more than one participant, behavior, or setting). Thus, it is advisable for any investigator to consider the feasibility of establishing multiple baselines from the


beginning, in case the behavior of interest does not return to the baseline value.

Multiple-baseline designs have their own set of challenges requiring dynamic decision making by the investigator. Sometimes imposing the experimental treatment on one baseline will be followed by behavioral change not only in the treated baseline but also in the as-yet-untreated baselines. This might reflect the operation of maturation, history, testing, or instrumentation—in other words, it might mean that the treatment is ineffective. Another possibility is that the treatment really is responsible for change, and the effect has spread across the baselines because they are not independent of one another. This threat to internal validity, which Cook and Campbell (1979) called diffusion of treatments, can jeopardize multiple-baseline experiments under several circumstances: (a) in a multiple-baseline across-participants design, all of the participants are in the same environment and may learn by observing the treatment being applied; (b) in a multiple-baseline across-behaviors design, all of the responses are coming from the same participant, and learning one response may facilitate the learning of other responses; or (c) in a multiple-baseline across-settings design, the same participant is responding in all of the settings, and when the response is treated and changed in one setting, it may change in the untreated settings.

The antidote, of course, is for the investigator to select participants, behaviors, or settings that experience and logic suggest will be independent of one another. Because experience and logic do not guarantee that an investigator will choose independent participants, behaviors, or settings, it is advisable to select as many baselines as is feasible so that the probability of at least some of them being independent is increased.

Interdependence of a few baselines (changes occurring concurrently across those baselines) with independence of other baselines (changes occurring only when treatment is applied) in a multiple-baseline design can be informative. The investigator has the opportunity to inspect the similarities across the baselines that change concurrently and the differences between those baselines and the baselines that change only when the treatment is applied.

These comparisons and contrasts can help to isolate the participant, behavior, and setting variables that interact with the treatment to produce the changes. For example, a teacher modeling tactics for solving various types of math problems may see students solving problems for which the solutions have yet to be modeled. If the teacher is also collecting data on the students' solving of social studies problems and does not observe those problems being solved until the solutions are modeled, one can make the case for the general effects of modeling problem solutions. This then sets the occasion for designing another investigation to systematically study the features of the modeling of the math problem solutions to determine which features are essential for which types of problems.

If having many baselines is not feasible and the investigator faces interdependence of all of the baselines, the possible design tactics include (a) withdrawing the treatment or reversing the contingencies or (b) arranging for a changing criterion within the treatment. The first choice depends on the probability of the behavior's return to baseline values and the ethical appropriateness of such a tactic. The second choice depends on the feasibility of incorporating the changing criterion into the treatment and the sensitivity of the behavior being measured to such changes. Either tactic, when successful, demonstrates the functional relation between the treatment and the outcomes. They both also set up the rationale for studying the interdependence of the baselines in a subsequent investigation. As with all efforts to investigate natural phenomena, unexpected results help to hone understanding of the phenomena and guide further investigations.

Other design combinations may be considered. Withdrawing the treatment or reversing the contingencies in a successful multiple-baseline experiment can probe for the durability of the treatment effects and add another degree of replication should the changes not be durable. Gradually removing components of interventions to assess the importance of each or the intervention's durability is another variation (a partial withdrawal design; Rusch & Kazdin, 1981). Withdrawing treatment from some participants, responses, or settings (a sequential withdrawal design; Rusch & Kazdin, 1981) to assess the


durability of treatment effects is another variation to be considered depending on the focus of the investigation. The point of all of these additional design options is that although the research question drives the initial selection of the elements of single-case design, once the data collection begins decisions about the next condition are driven by the patterns emerging in the data being collected. Unexpected patterns can and should lead the investigator to ask how best to arrange the next phase of the investigation to ensure that the original or revised research question can be answered unambiguously.

Steady-State Strategy

Behavioral experiments assess the effect of a treatment by comparing behavior measured during exposure to the treatment with behavior measured without the treatment or, if the experimental logic dictates, with behavior measured during exposure to some other treatment. In a single-case experiment, the conditions are imposed on an individual over some period of time, and behavior is measured repeatedly within each condition. Inferences about the experimental treatment's effectiveness are usually supported by demonstrations that the difference in behavior observed across the conditions clearly exceeds any variability observed within the conditions. The basic strategy is not unlike the one that underlies conventional tests of statistical inference: The F ratio associated with the analysis of variance is formed by dividing an estimate of variance between experimental groups by an estimate of variance within the groups, and only if the between-groups variance is large relative to the within-group variance does the investigator conclude that the experimental treatment made a statistically significant difference.

The prevailing approach in single-case experiments—the steady-state strategy—is to impose a condition until behavior is more or less stable from one measurement (session, lesson, etc.) to the next. The idea is to fix the environmental variables controlling behavior until the environment–behavior relation reaches equilibrium or, as Sidman (1960) put it, a steady state.

At this point, the experimental environment is rearranged to impose the next condition, again until behavior stabilizes.
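The comparison logic can be illustrated with a toy calculation. The response rates below are invented; the point is only that a convincing steady-state comparison requires the between-condition difference to dwarf the within-condition bounce.

```python
# A minimal illustration (hypothetical response rates) of the logic described
# above: the difference between conditions should clearly exceed the
# variability observed within each condition.
from statistics import mean, stdev

baseline = [42, 45, 43, 44, 46, 44]      # last six sessions of condition A
treatment = [71, 74, 73, 75, 72, 74]     # last six sessions of condition B

within_sd = (stdev(baseline) + stdev(treatment)) / 2
between_diff = abs(mean(treatment) - mean(baseline))

print(f"average within-condition SD: {within_sd:.2f}")
print(f"between-condition difference: {between_diff:.2f}")
# Here the between-condition difference dwarfs the within-condition bounce.
```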

Strategic Requirements

The steady-state strategy has three requirements (Perone, 1994):

1. The investigator must have sufficient control over extraneous variables to allow behavior to stabilize.
2. The investigator must be able to maintain each condition long enough to allow behavior to stabilize; even under ideal laboratory controls, it will take time for behavior to reach a new equilibrium when conditions are changed.
3. The investigator must be able to recognize the steady state when it is achieved.

Meeting the first two requirements is not a matter of experimental design; rather, the key issues are scientific understanding and resources, including time and access to participants' behavior. The investigator must have a reasonable idea of the extraneous variables to be eliminated or held constant to allow the potential effect of the experimental variable to become manifest. The investigator must have the wherewithal to control the extraneous variables, and he or she must have relatively unimpeded access to the behavior of interest: An A-B-A-B design, for example, may require scores of sessions distributed over several months or more if behavior is to be given time to stabilize in each phase.

In any given area of study, initial investigations will suffer from gaps in the understanding of the behavioral processes at work and, consequently, of the variables in need of control. Persistent efforts at experimental analysis will pay dividends in identifying the relevant variables and developing the means to control them. Persistence alone, however, cannot provide an investigator the access to behavior that may be needed to execute the steady-state strategy. Much depends on the nature of the topic at hand and the available resources. Investigators of topics in basic research may be in the most advantageous position, especially if they study animals. Not only are they able to control almost every facet of the animal's


living arrangements (e.g., diet, housing, light–dark cycles, opportunities to engage conspecifics), they also have unfettered access to the animal's behavior. Sessions may be conducted daily for months without interruption. Such circumstances are ideal for steady-state research.

Special problems arise when human participants replace rats and pigeons (Baron & Perone, 1998). The typical human participant lives, works, plays, eats, drinks, and sleeps outside of the investigator's influence and is thus exposed to numerous factors that may play a role in the participant's experimental behavior (only in rare cases do human participants live in the laboratory; for an interesting example, see Bernstein & Ebbesen, 1978). These limitations indicate a need for strong countermeasures, such as experimental manipulations that are "especially forcing" (Morse & Kelleher, 1977; see also Baron & Perone, 1998, pp. 68–69) and increased exposure to the laboratory environment over an extended series of sessions. Unfortunately, it is in human research—when extended access to the participant's behavior may be needed most—that such access is most difficult to attain. Monetary incentives can help bring participants to the laboratory for repeated study, of course, but even well-funded investigators will find that the number of sessions that, say, a college student will tolerate is lower than that commonly conducted in research with rats.

To address this practical constraint, some investigators arrange brief sessions, sometimes lasting as little as 10 minutes (e.g., Okouchi, 2009), and schedule a series of such sessions each time the participant visits the laboratory. Of course, the duration of the sessions is not the critical issue; rather, the question is whether one can complete an experiment in a few hours in the human laboratory and compare the results to experiments that take months or years in the animal laboratory. The answer will depend on the goals of the research as well as the investigator's judgment about the size of the anticipated effects and the speed of their onset. Relatively brief experiments can be defended when they are successful in producing stable and reproducible behavioral outcomes within and across participants. Caution is warranted in planning and interpreting such experiments, however, because the behavioral effects of the experimental

manipulations may not always develop according to the investigator's timetable. Sometimes there is no substitute for prolonged exposure to the contingencies, and what happens in the short term may not predict what happens in the long term (for an illustration, see Baron & Perone, 1998, pp. 50–52).

In applied research, logistical and ethical issues magnify the problem of behavioral access. Participants with clinically relevant repertoires may not be available in large numbers, and the nature of their problem behavior may sharply limit the duration of sessions. If the research is conducted in a therapeutic context, addressing the participant's problem will take priority over purely scientific considerations, and ethical concerns about leaving problem behavior untreated may restrict the nature of the experimental designs as well as the durations of both baseline and treatment conditions.

The steady-state strategy works best when behavior is measured repeatedly under controlled experimental conditions imposed long enough for the behavior to reach demonstrable states of equilibrium. The pages of the Journal of the Experimental Analysis of Behavior and the Journal of Applied Behavior Analysis attest that these challenges can be met. It is inevitable, however, that some experiments will fall short. In some cases, conducting single-case experiments in the absence of steady states will still be possible, as suggested by the hypothetical outcomes depicted in Figures 5.2, 5.3, and 5.6. Even in these examples, however, the number of behavioral observations is large. We suggest, therefore, that although single-case experiments may be viable in some cases without steady states, they are not likely to succeed without significant access to the behavior in the form of extensive repeated measurement (for a comprehensive discussion of this issue in the context of applied research, see Barlow et al., 2009, pp. 62–65 and 88–94, and Johnston & Pennypacker, 2009, pp. 191–218).

When the efforts to achieve steady states fall short, an investigator may consider the use of statistical tests to discriminate treatment effects from a noisy background of behavioral variability. Many arguments, both pro and con, have been made in this connection (e.g., Ator, 1999; Baron, 1999; Branch, 1999; Crosbie, 1999; Davison, 1999; Kratochwill & Levin, 2010;


Perone, 1999; Shull, 1999; Smith et al., 2000; Todman & Dugard, 2001; see Chapters 7 and 11, this volume). We are concerned that reliance on inferential statistics may retard the search for effective forms of control. By comparison, the major advantage of the steady-state strategy is that it fosters the development of strong control. Unsystematic variability (noise or bounce in the data) is addressed by reducing the influence of extraneous factors and increasing the influence of the independent variable. Systematic variability (the trend that occurs in the transition between steady states) is addressed by holding the experimental environment constant until behavior stabilizes. Put simply, the steady-state strategy demands that treatment effects be clarified by improving direct control of behavior.

Stability Criteria

The final requirement of the steady-state strategy is that of recognizing the production of a steady state. Various decision rules have been devised for this purpose. These stability criteria are often expressed in mathematical terms and indicate, in one way or another, the kind and amount of variation in behavior that will be acceptable over a series of observations. Commonly used criteria specify (a) the number of sessions or observations to be considered in assessing the evidence of a steady state, (b) that an increasing or decreasing trend must be absent, and (c) how much bounce can be tolerated in the behavior across sessions. If the most recent behavior within a condition (e.g., responding in the last six sessions) is absent of trend and reasonably free from bounce, behavior is said to be stable. Sidman (1960) provided the seminal discussion of stability criteria. Detailed descriptions of stability criteria, with examples from the human and animal literature, can be found in Baron and Perone (1998) and Perone (1991).

Perhaps the most important difference among stability criteria is in how they specify the tolerable limits on bounce. Some criteria use relative measures; for example, when considering the most recent six sessions, the mean response rate in the first three sessions and the mean in the last three sessions may differ by no more than 10% of the overall six-session mean. Other criteria may use absolute measures; for example, the mean rates in the first three sessions and last three sessions may differ by no more than five responses per minute. Not all stability criteria are expressed in quantitative terms. In some experiments, steady states are identified by visual inspection of graphed results. In other experiments, each condition is imposed for a fixed number of sessions (e.g., 30), and behavior in the last several sessions (e.g., five) is considered representative of the steady state.

As Sidman (1960) noted, the selection of a stability criterion depends on the nature of the experimental question and the investigator's judgment and experience. The visual stability criterion may be justified, for example, when the investigator's experience leads to the expectation of large or dramatic changes across conditions. The fixed-time stability criterion works well when a program of research has progressed to the point at which the investigator can confidently predict how many sessions will be needed to achieve a steady state. Even the quantitative criteria—the relative stability criterion and the absolute stability criterion—are specified in light of the experimental question and the investigator's judgment and experience. In the abstract, divorced from such considerations, it is impossible to say, for example, whether a 10% relative criterion is more or less stringent than a five-responses-per-minute absolute criterion (for a detailed discussion of the relationship between relative and absolute stability criteria, see Perone, 1991, pp. 141–144).

The adequacy of a stability criterion is assessed over the course of an experiment. A criterion is adequate, according to Sidman (1960, p. 259), if it yields orderly and replicable functional relations between the independent and dependent variables. In this connection, it is important to recognize that any stability criterion, no matter how stringent, may be met by chance, that is, in the absence of an actual steady state. However, a criterion is highly unlikely to be repeatedly met by chance across the various experimental conditions.
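The quantitative criteria lend themselves to simple decision rules. The following sketch implements the relative and absolute criteria just described for hypothetical session rates; it omits the separate check for trend that a complete criterion would also include.

```python
# A sketch of the relative stability criterion described above: over the most
# recent six sessions, the means of the first and last three sessions may
# differ by no more than 10% of the overall six-session mean. An absolute
# criterion (a fixed responses-per-minute limit) is sketched alongside.
# Trend checks are omitted for brevity.
from statistics import mean

def meets_relative_criterion(rates, proportion=0.10):
    last_six = rates[-6:]
    first_half, second_half = last_six[:3], last_six[3:]
    return abs(mean(first_half) - mean(second_half)) <= proportion * mean(last_six)

def meets_absolute_criterion(rates, max_difference=5.0):
    last_six = rates[-6:]
    return abs(mean(last_six[:3]) - mean(last_six[3:])) <= max_difference

session_rates = [55, 58, 61, 60, 59, 61, 60, 62, 61]   # hypothetical responses per minute
print(meets_relative_criterion(session_rates))          # True for these numbers
print(meets_absolute_criterion(session_rates))          # True for these numbers
```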

Once Is Not Enough

Single-case designs are single because the primary unit of analysis is the behavior of the individual


organism. Treatment effects are assessed by comparing the individual's response with different levels of the independent variable, and control is demonstrated by two kinds of replication: (a) the stability of the individual's behavior from one observation to the next under constant circumstances within a condition and (b) the stability of the change in the individual's behavior from one experimental condition to another.

The single descriptor is misleading in that single-case research rarely involves just one individual. In addition to the within-participant forms of replication that we have emphasized throughout this chapter, procedures are also replicated across participants. Single-case investigators approach interparticipant replication in two general ways, described by Sidman (1960) as direct replication and systematic replication (see also Chapter 7, this volume).

In the context of interparticipant replication, direct replication consists of repeating the experimental procedures with additional participants. A review of any representative sample of the literature of basic or applied behavior analysis will document that direct interparticipant replication is, for all intents and purposes, required to establish the credibility of single-case experimentation—even in basic laboratory research with animals, where control is at its utmost. Why, in a science devoted to the analysis of behavior in the individual organism, should this be so? Interparticipant replication is needed to show that the investigator has been successful in identifying the relevant variables and bringing them under satisfactory degrees of control. Whenever manipulation of an independent variable produces the same kind of behavioral change in a new participant, one grows increasingly confident that the investigator is both manipulating the causal factor and eliminating (or otherwise controlling) the influence of extraneous factors that could obscure the causal relation.

What if the attempt at interparticipant replication fails? Suppose, for example, that an A-B-A-B design produces a clear, reliable effect in one participant but not in another? One might be inclined to question the reality of the original result, to declare it a fluke. However, this would be a strategic error of

elephantine proportions. A result that can be replicated on an intraparticipant basis (i.e., from A to B, back to A, and back again to B) cannot be dismissed so easily. The failure to replicate the effect in another participant does not negate the original finding; rather, it unveils the incompleteness of one's understanding of the original finding. The investigator may have erred in his or her operational definition of the independent variable, his or her control of the independent variable may be defective, or the investigator may have failed to recognize other relevant variables and isolate the experiment from their influence. "If this proves to be the case," said Sidman (1960, p. 74), "failure of [interparticipant] replication will serve as a spur to further research rather than lead to a simple rejection of the original data."

Systematic replication is an attempt to replicate a functional relation under circumstances that differ from those of the original experiment. The experimental conditions might be imposed in a different order. The range of a parametric variable might be extended. The personal characteristics of a therapeutic agent or teacher might be changed (e.g., from female to male). The classification of the participants might differ (e.g., pigeons might be studied instead of rats or typically developing children instead of those with developmental delays). New behavioral repertoires might be observed (e.g., swimming instead of studying), new stimulus modalities might be activated (e.g., with auditory rather than visual stimuli), or new behavioral consequences might be arranged (e.g., attention instead of edibles or the postponement of a shock instead of the presentation of a food pellet). In this way—by replicating functional relations across a range of individuals, behaviors, and operations—investigators can discover the boundaries of a phenomenon and thereby reach conclusions about its generality. This issue is discussed in more detail in this volume's Chapter 7.

Direct replications are often considered an integral part of a given experiment: The report of a typical experiment incorporates single-case results from several participants, and the similarity of results across the participants is a key feature in assessing the adequacy of control over the variables under


study. By comparison, systematic replications are conducted across experiments and, over the course of time, across investigators, laboratories, clinics, and so forth. When investigators refer to research topics or areas of investigation, they are commonly referring to collections of systematic replications that have been designed specifically to address the limits of some behavioral phenomenon (e.g., consider the literatures on choice, conditioned reinforcement, resurgence, and response-independent presentations of previously established reinforcers, so-called "noncontingent reinforcement"). Sometimes systematic replications are a byproduct, rather than the focus, of the research, as when an investigator makes use of a previous finding to pursue a new line of inquiry. For example, the investigator might use fixed-interval schedules of reinforcement as a baseline procedure to evaluate the effects of a drug on temporally controlled behavior. Whatever may be learned about the drug's effects, the results are likely to extend understanding of fixed-interval behavior. The ubiquity of such research (Sidman [1960] called it the baseline method of systematic replication) is responsible for an array of empirical generalizations—such as schedule effects—that have come to be regarded as foundations of basic and applied behavior analysis.

Final Remarks

Single-case methods are well suited to the intensive study of behavior at the level of the individual organism. When properly designed and executed, single-case experiments offer robust protection against threats to internal validity. Their ability to support causal inferences certainly matches that of the group comparison methods that dominate the behavioral sciences, despite the small number of participants and the absence of sophisticated statistical tests. Strong methods of direct control over behavior, exemplified by the steady-state strategy, obviate the need for statistical inference, and replication within and across participants attests to the adequacy of the control.

The prevalence of single-case designs in basic and applied behavior analysis is not simply a matter of their effectiveness in assessing the effects of

experimental treatments or describing functional relations, although their effectiveness is obviously important. There is another less practical, more theoretical reason. Fundamentally, the prevalence of the approach derives from a conception of behavior as an intrinsically intraindividual phenomenon, a continuous reciprocal interaction between an organism and its environment (e.g., Skinner, 1938, 1966). By this reckoning, only methods that respect the individual character of behavior—single-case methods—are valid. The point was made forcefully by Sidman (1960), who essentially ruled that methods based on comparisons across individuals—as in group statistical methods—fall outside the boundaries of behavioral science.

To illustrate, Sidman (1960) considered the plight of an investigator interested in the effects of the number of reinforcements during acquisition of behavior on the subsequent extinction of the behavior. It is easy to see the severe limitation of a single-case experiment that would expose an individual to a series of reinforcement conditions, each followed by an extinction test. Clearly, the individual's cumulating experience with successive extinctions would be confounded with the effects of the reinforcement variable; one could expect extinction to proceed more rapidly with experience regardless of the number of reinforcements in the acquisition phase before each test. An obvious solution might be to expose separate groups of individuals to the values of the reinforcement variable and combine the results of each group's single extinction test to construct a function relating the number of reinforcements to the rate of extinction. "But," said Sidman, the function so obtained does not represent a behavioral process. The use of separate groups destroys the continuity of cause and effect that characterizes an irreversible behavioral process. . . . If it proves impossible to obtain an uncontaminated relation between number of reinforcements and resistance to extinction in a single subject, because of the fact that successive extinctions interact with each other, then the "pure" relation simply does not exist. The solution to


our problem is to cease trying to discover such a pure relation, and to direct our research toward the study of behavior as it actually exists. . . . The [investigator] should not be deceived into concluding that the group type of experiment in any way provides a more adequately controlled or more generalizable substitute for individual data. (p. 53) For investigators who endorse Sidman’s (1960) position, the adoption of single-case methods is not a pragmatic decision. To the contrary, it is a theoretical commitment: Single-case methods are seen as a defining feature of behavioral science (see also Johnston & Pennypacker, 2009; Sidman, 1990). For these investigators, to depart from single-case methods is to change fields. By investigating functional relations via the single-case approach described in this chapter, a thorough understanding of how and why behavior occurs can be achieved.

References
Ator, N. A. (1999). Statistical inference in behavior analysis: Environmental determinants? Behavior Analyst, 22, 93–97.
Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97. doi:10.1901/jaba.1968.1-91
Baer, D. M., Wolf, M. M., & Risley, T. R. (1987). Some still-current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 20, 313–327. doi:10.1901/jaba.1987.20-313
Bailey, J. S., & Burch, M. R. (2002). Research methods in applied behavior analysis. Thousand Oaks, CA: Sage.
Barlow, D. H., Nock, M. K., & Hersen, M. (2009). Single case experimental designs: Strategies for studying behavior change (3rd ed.). Boston, MA: Pearson.
Baron, A. (1999). Statistical inference in behavior analysis: Friend or foe? Behavior Analyst, 22, 83–85.

Bijou, S. W. (1955). A systematic approach to an experimental analysis of young children. Child Development, 26, 161–168.
Branch, M. N. (1999). Statistical inference in behavior analysis: Some things significance testing does and does not do. Behavior Analyst, 22, 87–92.
Browning, R. M. (1967). A same-subject design for simultaneous comparison of three reinforcement contingencies. Behaviour Research and Therapy, 5, 237–243. doi:10.1016/0005-7967(67)90038-1
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago, IL: Rand McNally.
Crosbie, J. (1999). Statistical inference in behavior analysis: Useful friend. Behavior Analyst, 22, 105–108.
Davison, M. (1999). Statistical inference in behavior analysis: Having my cake and eating it? Behavior Analyst, 22, 99–103.
Ebbinghaus, H. (1913). Memory: A contribution to experimental psychology (H. A. Ruger & C. E. Bussenius, Trans.). New York, NY: Columbia Teachers College. (Original work published 1885)
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd.
Hartmann, D. P., & Hall, R. V. (1976). The changing criterion design. Journal of Applied Behavior Analysis, 9, 527–532. doi:10.1901/jaba.1976.9-527
Horner, R. D., & Baer, D. M. (1978). Multiple-probe technique: A variation on the multiple baseline. Journal of Applied Behavior Analysis, 11, 189–196. doi:10.1901/jaba.1978.11-189
Johnston, J. M., & Pennypacker, H. S. (2009). Strategies and tactics of behavioral research (3rd ed.). New York, NY: Routledge.
Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings. New York, NY: Oxford University Press.
Kazdin, A. E. (1999). Research designs in clinical psychology. New York, NY: Allyn & Bacon.
Keppel, G., & Wickens, T. D. (2004). Design and analysis: A researcher's handbook (4th ed.). Upper Saddle River, NJ: Prentice-Hall.

Baron, A., & Perone, M. (1998). Experimental design and analysis in the laboratory study of human operant behavior. In K. A. Lattal & M. Perone (Eds.), Handbook of research methods in human operant behavior (pp. 45–91). New York, NY: Plenum Press.

Kratochwill, T. R., & Levin, J. R. (2010). Enhancing the scientific credibility of single-case intervention research: Randomization to the rescue. Psychological Methods, 15, 124–144. doi:10.1037/a0017736

Bernstein, D. J., & Ebbesen, E. B. (1978). Reinforcement and substitution in humans: A multiple-response analysis. Journal of the Experimental Analysis of Behavior, 30, 243–253. doi:10.1901/jeab.1978.30-243

Lucki, I., & DeLong, R. E. (1983). Control rate of response or reinforcement and amphetamine's effect on behavior. Journal of the Experimental Analysis of Behavior, 40, 123–132. doi:10.1901/jeab.1983.40-123


Mazur, J. E. (1985). Probability and delay of reinforcement as factors in discrete-trial choice. Journal of the Experimental Analysis of Behavior, 43, 341–351. doi:10.1901/jeab.1985.43-341

Rusch, F. R., & Kazdin, A. E. (1981). Toward a methodology of withdrawal designs for the assessment of response maintenance. Journal of Applied Behavior Analysis, 14, 131–140. doi:10.1901/jaba.1981.14-131

McSweeney, F. K., & Murphy, E. S. (2000). Criticisms of the satiety hypothesis as an explanation for within-session decreases in responding. Journal of the Experimental Analysis of Behavior, 74, 347–361. doi:10.1901/jeab.2000.74-347

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

McSweeney, F. K., & Roll, J. M. (1993). Responding changes systematically within sessions during conditioning procedures. Journal of the Experimental Analysis of Behavior, 60, 621–640. doi:10.1901/jeab.1993.60-621
Morgan, D. L., & Morgan, R. K. (2009). Single-case research methods for the behavioral and health sciences. Thousand Oaks, CA: Sage.
Morse, W. H., & Kelleher, R. T. (1977). Determinants of reinforcement and punishment. In W. K. Honig & J. E. R. Staddon (Eds.), Handbook of operant behavior (pp. 174–200). Englewood Cliffs, NJ: Prentice-Hall.
Moxley, R. A. (1998). Treatment-only designs and student self-recording as strategies for public school teachers. Education and Treatment of Children, 21, 37–61. [Erratum. (1998). Education and Treatment of Children, 21, 229.]
Okouchi, H. (2009). Response acquisition by humans with delayed reinforcement. Journal of the Experimental Analysis of Behavior, 91, 377–390. doi:10.1901/jeab.2009.91-377
Perone, M. (1991). Experimental design in the analysis of free-operant behavior. In I. H. Iversen & K. A. Lattal (Eds.), Techniques in the behavioral and neural sciences: Vol. 6. Experimental analysis of behavior, Part 1 (pp. 135–171). Amsterdam, the Netherlands: Elsevier.
Perone, M. (1994). Single-subject designs and developmental psychology. In S. H. Cohen & H. W. Reese (Eds.), Life-span developmental psychology: Methodological contributions (pp. 95–118). Hillsdale, NJ: Erlbaum.
Perone, M. (1999). Statistical control in behavior analysis: Experimental control is better. Behavior Analyst, 22, 109–116.
Perone, M., & Courtney, K. (1992). Fixed-ratio pausing: Joint effects of past reinforcer magnitude and stimuli correlated with upcoming magnitude. Journal of the Experimental Analysis of Behavior, 57, 33–46. doi:10.1901/jeab.1992.57-33
Richman, D. M., Wacker, D. P., Cooper-Brown, L. J., Kayser, K., Crosland, K., Stephens, T. J., & Asmus, J. (2001). Stimulus characteristics within directives: Effects on accuracy of task completion. Journal of Applied Behavior Analysis, 34, 289–312. doi:10.1901/jaba.2001.34-289

Shull, R. L. (1999). Statistical inference in behavior analysis: Discussant’s remarks. Behavior Analyst, 22, 117–121. Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. New York, NY: Basic Books. Sidman, M. (1990). Tactics: In reply. Behavior Analyst, 13, 187–197. Skinner, B. F. (1938). The behavior of organisms. New York, NY: Appleton-Century-Crofts. Skinner, B. F. (1956). A case history in scientific method. American Psychologist, 11, 221–233. doi:10.1037/ h0047662 Skinner, B. F. (1966). What is the experimental analysis of behavior? Journal of the Experimental Analysis of Behavior, 9, 213–218. doi:10.1901/jeab.1966.9-213 Smith, L. D., Best, L. A., Cylke, V. A., & Stubbs, D. A. (2000). Psychology without p values: Data analysis at the turn of the century. American Psychologist, 55, 260–263. doi:10.1037/0003-066X.55.2.260 Stubbs, D. A., & Pliskoff, S. S. (1969). Concurrent responding with fixed relative rate of reinforcement. Journal of the Experimental Analysis of Behavior, 12, 887–895. doi:10.1901/jeab.1969.12-887 Thorndike, E. L. (1911). Animal intelligence: Experimental studies. New York, NY: Macmillan. doi:10.5962/bhl. title.1201 Todman, J. B., & Dugard, P. (2001). Single-case and small-n experimental designs: A practical guide to randomization tests. Mahwah, NJ: Erlbaum. Ulman, J. D., & Sulzer-Azaroff, B. (1975). Multielement baseline design in educational research. In E. Ramp & G. Semb (Eds.), Behavior analysis: Areas of research and application (pp. 371–391). Englewood Cliffs, NJ: Prentice-Hall. Wade-Galuska, T., Galuska, C. M., & Perone, M. (2004). [Pausing during signaled rich-to-lean shifts in reinforcer context: Effects of cue accuracy and food deprivation]. Unpublished raw data. Watson, J. (1913). Psychology as the behaviorist views it. Psychological Review, 20, 158–177. doi:10.1037/ h0074428 Wertheimer, M. (1912). Experimentelle studien uber das sehen von bewegung [Experimental studies of the perception of motion]. Zeitschrift fur Psychologie, 61, 61–265.

Chapter 6

Observation and Measurement in Behavior Analysis
Raymond G. Miltenberger and Timothy M. Weil

You can observe a lot by watching.
—Yogi Berra, circa 1964

Throughout its history, behavior analysis has focused on building an inductive science that uses behavioral observation techniques to identify functional behavior–environment relations such that alteration of these relations may result in behavior change that is scientifically, and often socially, meaningful. Measuring behavior–environment relations in accurate and reliable ways is thus an integral tool in the process of analyzing and changing behavior.

Observable behavior has formal properties or dimensions that can be measured. These behavioral dimensions include frequency, intensity, duration, and latency (e.g., Bailey & Burch, 2002; Miltenberger, 2012). Each of these dimensions affords the observer the opportunity to measure changes in level, trend, and variability when alterations of environmental variables occur naturally or are manipulated under various programmed conditions.

Observation and measurement of behavior may take many forms and involve a variety of techniques across practically any setting. In this chapter, we discuss observation and measurement in the field of behavior analysis with a focus on identifying and measuring the target behavior, logistics of observation, recording procedures and devices, reactivity, interobserver agreement (IOA), and ethical considerations. The information discussed in this chapter is relevant to both research and practice in behavior analysis because observation and measurement of behavior is central to both endeavors.

Behavior

Regardless of whether the purpose of investigation is research or practice, it is first necessary to define behavior. Many authors define behavior in slightly different terms; however, each stresses an individual’s action or movement. According to Miltenberger (2012), behavior involves the actions of individuals, what people say and do. Malott and Trojan-Suarez (2004) suggested that behavior is anything a dead man cannot do, again suggesting that behavior consists of action or movement. Cooper, Heron, and Heward (2007) said that “behavior is the activity of living organisms. Human behavior is everything people do including how they move and what they say, think, and feel” (p. 25). Finally, Johnston and Pennypacker (1993) stated that behavior is

    that portion of an organism’s interaction with its environment that is characterized by detectable displacement in space through time of some part of the organism and that results in a measurable change in at least one aspect of the environment. (p. 23)

These definitions of behavior are rooted in the traditional characterization of an operant as observable action or movement that has some subsequent effect on (operates on) the environment (Johnston & Pennypacker, 1993, p. 25). Although the Cooper et al. definition of behavior includes thinking and feeling, these actions are nonetheless those of an individual that can be observed and recorded. Therefore, in this chapter we focus on observation and measurement of
behavior that can be detected, and thus recorded, by an observer. In some cases, the observer may be the individual engaging in the behavior.

Selecting and Defining Target Behavior

The first step in developing a plan for observing and recording behavior is to select and then define appropriate target behavior.

Selecting Target Behavior

Target behavior can be selected for a variety of overlapping reasons (see Kazdin, 2010). It may be useful but arbitrary; representative of a broader class of operants; the focus of intervention or educational efforts that occur in a particular setting (e.g., academic performance in a school setting); chosen because it causes impairment in some area of functioning for the individual; of particular concern to the individual or significant others who seek to change the behavior; or chosen because it will prevent the development of future problems (e.g., the promotion of safety skills to prevent injury).

When selecting target behavior, three general categories are considered: behavioral deficits, behavioral excesses, and problems of stimulus control. Behavioral deficits are behaviors that need to increase, such as desirable communicative responses for a child with autism who has limited language (e.g., Sundberg & Michael, 2001). Behavioral excesses are behaviors that need to decrease, such as self-injurious or aggressive behavior emitted by an individual with intellectual disability (e.g., Lerman, Iwata, Smith, & Vollmer, 1994). Problems of stimulus control are present when behaviors occur, but not at the appropriate time or place or in the appropriate context. For example, a child may learn to engage in a safety skill during training but fail to use it when the opportunity arises in the natural environment (e.g., Gatheridge et al., 2004; Himle, Miltenberger, Flessner, & Gatheridge, 2004). Likewise, a child with autism may learn to label an object but be unable to ask for the same object (e.g., failure of tact-to-mand transfer; Wallace, Iwata, & Hanley, 2006). Identifying developmentally appropriate topographies and levels of behavior is also important when selecting
the target behavior. For example, in research on stuttering, Wagaman, Miltenberger, and Arndorfer (1993) chose a criterion of 3% or fewer stuttered words as an indication of treatment success on the basis of research showing that as many as 3% of the words spoken by typical speakers were dysfluent.

A guiding factor in the selection of a target behavior in applied work is its social significance (Baer, Wolf, & Risley, 1968). Behavior is targeted that will increase the client’s meaningful and effective interactions with the environment. One index of social significance is the assessment of the social validity of the targeted behavior (Wolf, 1978). According to Wolf (1978), one of the three levels of social validity is the degree to which society validates the social significance of the goals of a behavior change procedure. In this regard, the important question posed by Wolf is, “Are the specific behavioral goals really what society wants?” (p. 207). In practice, assessment of the social validity of a target behavior or goal involves asking consumers for feedback on what behavior should be addressed, in what order, and to what extent. Of course, the target behavior selected in this way may possibly have some secondary gain or benefit for the person providing the report and thus may or may not be in the client’s best interest. The behavior analyst must be aware of this possibility and decide with the client, client surrogates, treatment team members, or some or all of these on the target behavior that best serves the client’s interests.

Although behavior analysts are interested in the behavior of clients or research participants, they are also interested in the behavior of the implementers carrying out behavior-analytic procedures. The degree to which individuals implement assessment and treatment procedures as planned is referred to as implementation fidelity, procedural fidelity, or treatment integrity (e.g., Gresham, Gansle, & Noell, 1993; Peterson, Homer, & Wonderlich, 1982). Implementation fidelity is important because higher fidelity is associated with better treatment outcomes (e.g., DiGennaro, Martens, & Kleinmann, 2007; DiGennaro, Martens, & McIntyre, 2005; DiGennaro Reed, Codding, Catania, & Maguire, 2010; Plavnick, Ferreri, & Maupin, 2010). Implementation fidelity is assessed by observing and recording the behavior
of the implementers as they observe and record the behavior of the clients or research participants and as the implementers carry out intervention procedures. Everything discussed in this chapter applies not only to observing and recording the behavior of clients or research participants, but also to measuring the behavior of the implementers.

Defining the Target Behavior

The target behavior should be defined in terms that are objective, clear, and complete (Kazdin, 2010). A behavioral definition must include active verbs describing the individual’s actions, that is, the topography or form of the action being observed. Some behavioral definitions may also include the environmental events that precede (antecedents) or follow (consequences) the behavior. Behavioral definitions cannot include category labels (e.g., aggression) or appeal to internal states or characteristics (e.g., strong willed) but rather must identify the topography of the behavior. A behavioral definition should be easy to read and should suffice as a starting point for the observer to engage in data collection. Once a
behavior analyst begins to observe instances of the behavior, the behavioral definition may be modified on the basis of those observations. Some examples of target behavior definitions used in behavior-analytic research are shown in Table 6.1. Note the precise descriptions of behavior in these examples and the inclusion of the appropriate context for the behavior (e.g., unscripted verbalizations are “verbalizations that were not modeled in the video script but were appropriate to the context of the toy”) or the necessary timing of the behavior in relation to other events (e.g., an acceptance occurs when “the child’s mouth opened . . . within 3 seconds after the food item was held within 1 inch of the mouth”).

Logistics of Observation

Once the target behavior is identified and defined, the time and place of observation must be determined and the observers identified. These logistics of observation are not insignificant, because the choices of observation periods and observers will determine the quality of the data derived from the observations.

Table 6.1
Examples of Target Behavior Definitions From Published Articles Involving Behavior Analytic Assessment and Treatment

Label: Empathy
Definition: “A contextually appropriate response to a display of affect by a doll, puppet, or person that contained motor and vocal components (in any order) and began within 3 s of the end of the display.” (p. 20)
Citation: Schrandt, Townsend, and Poulson (2009)

Label: Acceptance
Definition: “The child’s mouth opened so that the spoon or piece of food could be delivered within 3 s after the food item was held within 1 in. of the mouth.” (p. 329)
Citation: Riordan, Iwata, Finney, Wohl, and Stanley (1984)

Label: Expulsion
Definition: “Any amount of food (that had been in the mouth) was visible outside the mouth (Joan only) or outside the lip and chin area (Nancy, Jerry, Holly) prior to presentation of the next bite.” (p. 329)
Citation: Riordan et al. (1984)

Label: Activity engagement
Definition: “Facial orientation toward activity materials, appropriate use of activity materials, or comments related to the activity.” (p. 178)
Citation: Mace et al. (2009)

Label: Compliance
Definition: “The child independently completing or initiating the activity described in the instruction within 10 s.” (p. 535)
Citation: Wilder, Zonneveld, Harris, Marcus, and Reagan (2007)

Label: Unscripted verbalizations
Definition: “Verbalizations that were not modeled in the video script but were appropriate to the context of the toy [that was present].” (p. 47)
Citation: MacDonald, Sacramone, Mansfield, Wiltz, and Ahern (2009)


Time and Place of Observations

Observation periods (the time and place chosen for observation) should be scheduled when the target behavior is most likely to occur (or, in the case of behavioral deficits, when the target behavior should be occurring but is not). In some cases, the target behavior occurs mostly or exclusively in the context of specific events (e.g., behavioral acquisition training in academic sessions or athletic performance), and therefore, observation sessions have to occur at those times. A behavior analyst, however, may be interested in measuring the target behavior in naturalistic settings in which the behavior is, for the most part, free of temporal constraints. In these instances, it is possible to interview the client and significant people to narrow the observation window. In addition, it may be valuable to validate the reports by collecting initial data on the occurrence of the target behavior. For instance, scatterplot assessments, which identify at half-hour intervals throughout the day whether the behavior did not occur, occurred once, or occurred multiple times, may help identify the best time to schedule the observation period (Touchette, Macdonald, & Langer, 1985); a brief computational sketch of this tabulation appears at the end of this section. In instances in which reports on the occurrence of the target behavior are not available or when the reports are not descriptive enough, the behavior analyst should err on the side of caution and conduct a scatterplot assessment or other initial observations to narrow the observation window.

Identifying the times at which the target behavior is most likely to occur is desirable to capture the greatest number of instances. The rationale for observing and recording as many instances of the behavior as possible rests with evaluation of function during assessment and analysis of the effects of the independent variable during intervention. When it is not possible to observe enough instances of the behavior across a number of observation periods to establish clear relations between the behavior and specific antecedents and consequences, treatment implementation may be delayed. With behavior that occurs at a lower rate, a longer time frame of observation may be necessary to establish functional relations. Although a delay to intervention after a baseline may be acceptable in some situations, in others it
could be undesirable or unacceptable for the client or significant others, such as teachers or parents. Such delays can sometimes be circumvented by structuring observations in an analog setting to evaluate the effects of likely antecedent and consequent stimuli with the objective of evoking the target behavior. Alternatively, samples of the behavior might be collected in the natural environment at various times to provide a sufficient baseline that would allow making an accurate assessment of function, deciding on an appropriate intervention, or both. Circumstances such as availability of observers and the client’s availability must also be considered.

A final consideration in preparing to make observations is to select a placement within the observation environment that permits a full view of the person and the behavior of interest while at the same time minimizing disruptions to the client or others in the environment. In addition, when collecting IOA data, it is important for both observers to see the same events from similar angles and distances but simultaneously maintain their status as independent observers. Depending on the characteristics of the setting, issues may arise, such as walls or columns that impede seeing the behavior and interruptions from staff or other individuals. Disruptions in the environment are also a concern. For example, children in elementary school classrooms are notorious for approaching and interacting with an adult while he or she is observing and recording the target behavior. In addition, if the target child is identified, the other children may cause disruptions by talking to the target child or behaving in otherwise disruptive ways. Any disruption should be recorded so that it can be considered in accounting for variability in the data.
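
As a minimal illustration of the scatterplot tabulation described above (the occurrence times and dates are hypothetical, and any recording format could be used), the following Python sketch sorts timestamped occurrences into half-hour bins for each day and codes each bin with the no-occurrence, one-occurrence, or multiple-occurrence categories used by Touchette et al. (1985).

    # Minimal sketch of a scatterplot assessment grid (half-hour bins).
    # Occurrence times are hypothetical "HH:MM" strings recorded for each day.
    from collections import defaultdict

    occurrences = {
        "2010-09-15": ["09:05", "09:20", "13:40"],
        "2010-09-16": ["09:12", "14:05", "14:10", "14:25"],
    }

    def half_hour_bin(hhmm):
        """Return the 0-47 index of the half-hour bin containing a clock time."""
        hours, minutes = map(int, hhmm.split(":"))
        return hours * 2 + (1 if minutes >= 30 else 0)

    grid = defaultdict(lambda: defaultdict(int))
    for day, times in occurrences.items():
        for t in times:
            grid[day][half_hour_bin(t)] += 1

    # Code each bin: "." = no occurrence, "o" = one occurrence, "X" = multiple.
    for day in sorted(grid):
        row = "".join(
            "X" if grid[day][b] > 1 else ("o" if grid[day][b] == 1 else ".")
            for b in range(48)
        )
        print(day, row)

Scanning the printed rows across days shows the bins in which the behavior clusters, which is the information used to schedule observation periods.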

Selecting Observers

Most behavioral observations in research and practice are conducted by trained observers. Trained observers may not be needed in some laboratory settings in which permanent products are produced or equipment records the behavior automatically. However, in applied settings, individuals conducting behavioral observations must see the behavior as it occurs so data can be collected onsite or recorded for review at a later time. Individuals who could
conduct behavioral observations in the same setting as the client include participant observers (individuals who are typically present in the environment), nonparticipant observers (data collectors who are external to the typical workings of the environment), or the client whose behavior is being observed (self-monitoring).

Participant observers. According to Hayes, Barlow, and Nelson-Gray (1999), participant observers may be used in situations in which a significant other or other responsible party in the setting (e.g., parent, teacher) is available (and trained) to collect data. The primary advantage of including these individuals as observers is that they are already in the environment, which eliminates the potential logistical problems of scheduling nonparticipant observers. In addition, the likelihood of the child or student showing reactivity to observation is lessened because the person being observed is likely to have habituated to the participant observer’s presence over time. A limitation when using participant observers is that the observers may not have time to conduct observations because of their other responsibilities in the setting.

One factor to consider when arranging participant observation (and nonparticipant observation) is the possibility of surreptitious observation. With surreptitious observation, the participant observer would not announce or otherwise cue the participant to the fact that a particular observation session is taking place (e.g., Mowery, Miltenberger, & Weil, 2010). For example, in Mowery et al. (2010), graduate students were present in a group home setting to record staff behavior. However, the staff members were told that the students were there to observe and record client behavior as part of a class project (this deception was approved by the institutional review board, and participants were later debriefed). Surreptitious observation leads to less reactivity in the person being observed. For surreptitious observation to occur ethically, the client or participant must consent to observation with the knowledge that he or she will not be told, and may not be aware of, the exact time and place of observation (e.g., Wright & Miltenberger, 1987). The exception would be when a parent or guardian gives consent
for surreptitious observation of a child or a researcher gets institutional review board approval for deception and later debriefs the participants.

Nonparticipant observers. When it is either impossible or undesirable to involve a participant observer, nonparticipant observers who are not part of the typical environment are used. For instance, observations in school settings may require that an observer sit in an unobtrusive area of the classroom and conduct observations of the child at various times of the day. Three challenges of having nonparticipant observers involved in data collection are access, scheduling, and cost. Because observations tend to occur while clients are involved in social settings, it may not be permissible to observe because of the potential for disruption or for reasons of confidentiality. In the latter case, when conducting observations of the behavior of a single individual in a group setting such as a classroom, it is typical to require consent of all students in the group because all are present during the observation. This is especially true when observations of minors occur. Because observation periods may be relatively short (especially in the context of research), it may also be difficult to schedule an observer several times a day or week to collect data for only 15 to 60 minutes. In addition, the client’s schedule may restrict when observation may occur. Finally, a significant cost may be associated with the inclusion of skilled data collectors who may need to be hired to fulfill this role. Circumventing excessive costs is possible, however, if student interns or other staff already at the site are available. In addition to the monetary cost of the observers, there is cost in terms of time and effort to train the observers and conduct checks for IOA to ensure consistency in the data collected.

Self-monitoring. When the target behavior occurs in the absence of others, it may be useful to have clients observe and record their own behavior. When asking clients to record their own behavior, it is necessary to train them as you would train any observer. Although there are examples of research using data gathered through self-monitoring (e.g., marijuana use [Twohig, Shoenberger, & Hayes, 2007]; disruptive outbursts during athletic performances [Allen,
1998]; physical activity levels [Van Wormer, 2004]; binge eating [Stickney & Miltenberger, 1999]), self-monitoring is less desirable than observation by another individual because it may be unreliable. If the target behavior occurs in the absence of others, then IOA cannot be assessed. For example, Bosch, Miltenberger, Gross, Knudson, and Brower-Breitweiser (2008) used self-monitoring to collect information on instances of binge eating by young women but could not collect IOA data because binge eating occurred only when the individual was alone.

Self-monitoring is best used when it can be combined with periodic independent observations to assess IOA. Independent observations occur when a second observer records the same behavior at the same time but has no knowledge of the other observer’s recording. Thus, the recording of both observers is under the stimulus control of the behavior being observed and is not influenced by the recording behavior of the other observer. When IOA is high, it might indicate that self-monitoring is being conducted with fidelity. It is possible, however, that self-monitoring is conducted with fidelity only under the conditions of another observer being present, but not when the client is alone or away from the other observer. In some instances, it is possible to collect secondary data or product measures that can be used to verify self-monitoring. For instance, researchers measured expired carbon monoxide samples in smoking cessation research (Brown et al., 2008; Raiff, Faix, Turturici, & Dallery, 2010) and tested urine samples in research on substance abuse (Hayes et al., 2004; Wong et al., 2003).

Given the potential unreliability of self-monitoring, taking steps to produce the most accurate data possible through self-monitoring is important. Such steps might include making a data sheet or data collection device as easy to use as possible, tying data collection to specific times or specific activities to cue the client to record his or her behavior, having other people in the client’s environment cue the client to conduct self-monitoring, checking with the client frequently by phone or e-mail to see whether self-monitoring is occurring, having the client submit data daily via e-mail or text message, and praising the client for reporting data rather than for the level of the behavior to avoid influencing the data.
Even with these procedures in place, clients may still engage in data collection with poor fidelity or make up data in an attempt to please the therapist or researcher. Therefore, self-monitoring that lacks verification should be avoided as a form of data collection whenever possible.

Training Observers

Adequate observer training is necessary to have confidence in the data. Observing and recording behavior can be a complex endeavor in which the observer must record, simultaneously or in rapid order, a number of response classes following a specific protocol, often while attending to timing cues (see Sampling Procedures section). Finally, following this routine session after session may lead to boredom and set the occasion for observer drift. Observer drift is the loosening of the observer’s adherence to the behavioral definitions that are used to identify the behavioral topographies to be recorded, a decrease in attending to specific features of the data collection system, or both. When observer drift occurs, the accuracy and reliability of the data suffer, and faulty decisions or conclusions may result (see Kazdin, 1977).

One way to train observers is to use behavior skills training (Miltenberger, 2012), which involves providing instructions and modeling, having the observer rehearse the observation and recording procedures, and providing feedback immediately after the performance. Such training occurs first with simulated occurrences of the target behavior in the training setting and then with actual occurrences of the target behavior in the natural environment. Subsequent booster sessions can be conducted in which the necessary training components are used to correct problems.

To maintain adequate data collection, it is necessary to reinforce accurate data collection and detect and correct errors that occur. Several factors can influence the fidelity of data collection (Kazdin, 1977). These include the quality of initial training, consequences delivered for the target behavior (Harris & Ciminero, 1978), feedback from a supervisor for accurate data collection (Mozingo, Smith, Riordan, Reiss, & Bailey, 2006), complexity and predictability of the behavior being observed (Mash & McElwee, 1974), and the mere presence of a
supervisor (Mozingo et al., 2006). With these factors in mind, strong initial training and periodic assessment and retraining of observers are recommended for participant observers, nonparticipant observers, and individuals engaging in self-monitoring.

Recording Procedures

The procedures available for collecting data on targeted behavior are categorized as continuous recording procedures, sampling procedures, and product recording.

Continuous Recording

Continuous recording (also called event recording) procedures involve observation and recording of each behavioral event as it occurs during the observation period. Continuous recording will produce the most precise measure of the behavior because every occurrence is recorded. However, continuous recording is also the most laborious method because the observer must have constant contact with the participant’s behavior throughout the observation period.

As with all forms of data collection, continuous recording requires the behavior analyst to first identify the dimensions on which to focus. Observers should initially collect data on multiple dimensions of the behavior (e.g., frequency and duration) and then narrow their focus over the course of observations as the analysis identifies the most important dimensions. For example, in a classroom situation involving academic performance, it may be useful to count the number of math problems completed correctly, latency to initiate the task (and each problem), and the time spent on each problem. If after several observations the observer finds that it takes a while for the child to initiate the task, resulting in a low number of problems completed, focusing on measuring latency to initiate the task and frequency of correct responses may be useful. Next, we describe data collection procedures related to different dimensions of behavior. Although we discuss the procedures separately, various combinations of these procedures may produce important data for analysis that would not be apparent with a focus on a single procedure.

Frequency. Perhaps the most common form of continuous recording is frequency recording: counting the number of occurrences of the target behavior in the observation period (Mozingo et al., 2006). Frequency recording is most appropriate when the behavior occurs in discrete units with fairly consistent durations. In frequency recording, each occurrence of the target behavior (defined by the onset and offset of the behavior) is recorded in the observation period. Frequency data may be reported as total frequency—number of responses per observation session—or converted to rate—number of responses per unit of time (e.g., responses per minute). Total frequency would only be reported if the observation periods were of the same duration over time. The advantage of reporting rate is that the measure is equivalent across observation periods of different durations.

Frequency recording requires the identification of a clear onset and offset of the target behavior so each instance can be counted. It has been used with a wide range of target behavior when the number of responses is the most important characteristic of the behavior. Examples include recording the frequency of tics (Miltenberger, Woods, & Himle, 2007), greetings (Therrien, Wilder, Rodriguez, & Wine, 2005), requests (Marckel, Neef, & Ferreri, 2006), and mathematics problems completed (Mayfield & Vollmer, 2007). When it is difficult to discriminate the onset or offset of the behavior or the behavior occurs at high rates such that instances of the behavior cannot be counted accurately (e.g., high-frequency tics or stereotypic behavior), a behavior sampling procedure (i.e., interval or time-sample recording; see below) is a more appropriate recording procedure. As we elaborate on later, in sampling procedures the behavior is recorded as occurring or not occurring within consecutive or nonconsecutive intervals of time, but individual responses are not counted.

Four additional methods of recording frequency are frequency-within-interval recording, real-time recording, cumulative frequency, and percentage of opportunities. Each method has advantages over a straight frequency count.

Frequency-within-interval recording. One limitation of frequency recording is that it does not
provide information on the timing of the responses in the observation period. With frequency-within-interval recording, the frequency of the behavior is recorded within consecutive intervals of time to indicate when the behavior occurred within the observation period. To conduct frequency-within-interval recording, the data sheet is divided into consecutive intervals, a timing device cues the observer to the appropriate interval, and the observer records each occurrence of the behavior in the appropriate interval. By providing information on the number of responses and the timing of responses, more precise measures of IOA can be calculated.

Real-time recording. Combining features of frequency and duration procedures, real-time recording also allows the researcher to collect information on the temporal distribution of the target behavior over the course of an observation period (Kahng & Iwata, 1998; Miltenberger, Rapp, & Long, 1999). Through use of either video playback or computers in real time, it is possible to record the exact time of onset and offset of each occurrence of the behavior. For discrete momentary responses that occur for 1 second or less, the onset and offset are recorded in the same second. Real-time recording is especially valuable when conducting within-session analysis of behavioral sequences or antecedent–behavior–consequence relations.

Borrero and Borrero (2008) conducted real-time observations that included the recording of both target behavior and precursor behavior or events related to the target behavior. These data were then used to construct a moment-to-moment analysis (lag-sequential analysis) that provided probability values for the occurrence of the precursor given the target behavior and of the target behavior given the precursor. The probability of a precursor reliably increased approximately 1 second before the emission of the target behavior. In addition, the probability of the target behavior was greatest within 1 second after the precursor behavior or event. The real-time analysis suggested that the precursor behavior or event was a reliable predictor of the target behavior. Additional analysis showed that both the precursor behavior and the target behavior served the same function (e.g., both led to escape from demands).
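
As a minimal illustration of the arithmetic involved (the session length and onset-offset times below are hypothetical, and the variable names are ours), the following Python sketch derives frequency, rate, and total duration from a real-time record and then tallies a frequency-within-interval count from the same record.

    # Minimal sketch: deriving several dimensions from one real-time record.
    # Each event is an (onset, offset) pair in seconds from the session start;
    # the session length and the events are hypothetical.
    session_seconds = 600  # a 10-minute observation period

    events = [(12.0, 14.5), (80.0, 80.8), (304.2, 310.0), (455.5, 456.1)]

    frequency = len(events)                             # total count of responses
    rate_per_min = frequency / (session_seconds / 60)   # responses per minute
    total_duration = sum(offset - onset for onset, offset in events)

    print(f"frequency = {frequency}")
    print(f"rate = {rate_per_min:.2f} responses/min")
    print(f"total duration = {total_duration:.1f} s")

    # Frequency-within-interval tally (60-s intervals), based on response onsets.
    interval_len = 60
    counts = [0] * (session_seconds // interval_len)
    for onset, _ in events:
        counts[int(onset // interval_len)] += 1
    print("responses per 60-s interval:", counts)

Because the rate measure divides by session length, it remains comparable across observation periods of different durations, which is the advantage noted above for reporting rate rather than total frequency.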

Cumulative frequency. The origins of measuring operant behavior involved the use of an electromechanical data recording procedure that was designed to record instances of behavior cumulatively across time (Skinner, 1956). Each response produced an uptick in the data path as the pen moved across the paper and the paper revolved around a drum. The original paper records of this recording were only about 6 inches wide, and thus the pen used to record responses would, on reaching the top of the paper, reset to the bottom of the paper and continue recording the responses. Increasing slopes indicated higher response rates; horizontal lines indicated an absence of the response. This apparatus for the automatic recording of cumulative frequencies is no longer used, but the usefulness of cumulative response measures persists. In cumulative frequency graphs, data are displayed as a function of time across the x-axis and cumulative frequency along the y-axis. The frequency of responses that occur in a given time period is added to the frequency in the previous time period. Thus, data presented in a cumulative record must either maintain at a particular level (no new responses) or increase (new responses) across time but never decrease. The use of cumulative frequency plots allows one to assess frequency and temporal patterns of responding; a brief computational sketch of this transformation appears at the end of this section.

Percentage of opportunities. In some cases, recording the occurrence of a response in relation to a specific event or response opportunity is useful. In such cases, the percentage of opportunities with correct responses is more important than the number of responses that occurred. For example, in recording compliance with adult requests, the percentage of requests to which the child responds correctly is more important than the number of correct responses. Ten instances of compliance are desirable if 10 opportunities occur. However, 10 instances of compliance are much less desirable in relation to 30 opportunities. Other examples include the percentage of math problems completed correctly, percentage of free throws made in a basketball game, percentage of signals detected on a radar screen during a training exercise, and percentage of trials in which an item is labeled correctly during language
training. Considering that the number of opportunities might vary in each of these cases, the percentage of opportunities is a more sensitive measure of the behavior than a simple frequency count or a rate measure. When a percentage-of-opportunities measure is used, reporting the number of opportunities as well as the percentage of correct responses is important. If the number of opportunities is substantively different across observations, it may affect the variability of the data and the interpretation of the results. For instance, if on one occasion a child is provided with 10 spelling words and spells eight correctly, the result is 80% correct. The next day, if the two words spelled incorrectly are retested and the child spells one of the words correctly, the second performance result is 50% correct. These data are not comparable because the number of opportunities varied greatly, and inappropriate conclusions could be drawn from the results if only percentages were reported. In these instances, providing an indicator of the number of opportunities to respond in the graphical representation of the data will assist the reader in interpreting the results.

Duration. When each response lasts for an extended period of time or does not lend itself to a frequency count (e.g., as in behavior such as reading or play), it may be useful to record the duration of the behavior, that is, the time from its onset to its offset. Duration recording is desirable when the most important aspect of the behavior is the length of time for which it occurs. For example, if the interest were in sustained performance or time on task, duration recording is appropriate. If classroom teachers are concerned with sustained engagement in academic activities, the observer would identify the length of time that engagement is desired (such as in reading) and collect data on the duration of engagement to identify any discrepancy between the target duration and actual performance. Once a discrepancy is determined to exist, programming for successively longer durations could be initiated.

Other situations involve a combination of duration and frequency recording, as when the goal is to decrease a young child’s tantrum behavior. If tantrums occur multiple times per day and each tantrum
continues for a number of minutes, recording both frequency and duration will reveal whether tantrums are occurring less often and occurring for shorter periods of time after intervention.

Finally, many types of behavior targeted in applied work do not lend themselves readily to frequency counts because they consist of (a) responses that occur rapidly and repetitively over extended periods of time (such as stereotypic behavior), (b) complexes of discrete responses integrated into chains or other higher order units, or (c) both. For rapid, repetitive responses, for which each onset and offset is not easily discriminated, a duration measure can be used. In such cases, a time period in which the behavior is absent can help the observer discriminate the end of one episode and the start of the next. For target behavior consisting of multiple component behaviors, the target behavior might be defined as the entire chain, and a duration measure would then consist of recording the time from the onset of the first response in the chain to the offset of the last response in the chain. Finally, in some instances duration is used to measure a behavior with multiple component responses when it does not make sense to reduce the behavior to a frequency count of its component responses. For example, duration of play would be of greater interest to a parent trying to increase a child’s play time than would a frequency count of the number of steps the child traveled across the playground, went up and down a slide, or moved back and forth on a swing.

Latency. Latency is the length of time from the presentation of a discriminative stimulus to the initiation of the behavior. Latency is of interest when the speed of initiation of the behavior is an important feature. For example, latency is the time from the sound of the starter’s pistol to the sprinter’s movement off the starting blocks, the time it takes for a child to initiate a task after the teacher’s request, or the time it takes the wait staff at a restaurant to respond once a customer is seated. When working with a child who does not complete math problems in the allotted time, for example, latency indicates how long it takes the child to initiate the task. By contrast, duration assesses how long it takes the child to complete each problem
once he or she starts working on it. Depending on the child and the circumstance, one or both dimensions may be an important focus of assessment and intervention.

Magnitude. On occasion, evaluating the magnitude or intensity of behavior is useful. One example of response magnitude is the force exerted (e.g., muscle flexion), and another is the loudness of a verbal response (as measured in decibels). Although decreases in frequency, and perhaps duration, of undesirable behaviors such as tantrums or self-injury are important, a reduction in magnitude may be an important initial goal. In some cases, reductions in magnitude may be observed before substantial decreases occur on other dimensions, such as frequency and duration. Alternatively, magnitude may increase temporarily during an extinction burst before the behavior decreases in frequency, duration, or intensity.

Recording magnitude may be valuable when considering recovery from an accident or injury such as a knee injury for a football player. Measurement would pertain to the ability of the affected muscles to exert force after rehabilitation. In these situations, recording magnitude tends to require equipment such as that used by physical therapists to evaluate force. Direct observation of response magnitude may not always measure force, however. Observers can use intensity rating scales to evaluate the magnitude of a response. For instance, given a scale ranging from 1 to 10, a teacher may rate the intensity of each occurrence of an undesirable behavior. In using rating scales, it is important to anchor the points within the scale such that two observers can agree on the level of intensity given a variety of occurrences of the behavior (e.g., 1 = mild whining, 10 = loud screaming, throwing items, and head banging). Although anchoring categories on a scale is considered valuable to decrease the variability in responding across observers, the literature is not clear as to how many individual categories need be defined (Pedhazur & Schmelkin, 1991). Another example in which magnitude can be measured with a rating scale is the intensity of a fear response (Miltenberger, Wright, & Fuqua, 1986; Twohig, Masuda,
Varra, & Hayes, 2005) or other emotional responses (Stickney & Miltenberger, 1999). In general, intensity rating scales present issues of both reliability and validity because the ratings that might be assigned to specific instances of behavior may be ambiguous; this is especially true when rating fear or emotional responses because the magnitude of these behaviors can be rated only by the individual engaging in the behavior.
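
Returning to the cumulative-frequency measure described earlier, the transformation behind a cumulative record is simply a running sum. The following minimal Python sketch (the response times are hypothetical) converts a list of response times into the (time, cumulative count) coordinates that such a graph displays.

    # Minimal sketch: turning response times into cumulative-record coordinates.
    # Response times are seconds into the session and are hypothetical.
    response_times = [5, 9, 11, 30, 31, 33, 34, 90, 200, 201, 202]

    cumulative = [(t, count) for count, t in enumerate(response_times, start=1)]

    # Each pair is an (x, y) point on the cumulative record: time on the x-axis,
    # cumulative responses on the y-axis. The count never decreases; long gaps
    # between successive times appear as flat stretches (pauses in responding).
    for time_s, count in cumulative:
        print(time_s, count)

Plotting these pairs reproduces the familiar features of a cumulative record: steep segments where responses cluster and flat segments where responding is absent.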

Sampling Procedures

It may not always be possible to collect adequate information on the target behavior using continuous recording procedures. When the onset and offset of each instance of the behavior cannot be identified, continuous recording is not possible. Likewise, the constraints imposed by some environments, some target behaviors, or some observers may make continuous recording impossible. For example, the person exhibiting the target behavior might not be continuously in sight of the observer, the target behavior might occur almost exclusively when the individual is alone, or the observer might have other responsibilities that compete with observation. In these instances, it may be desirable, and necessary, to collect samples of the behavior that provide an estimate of the behavior’s true level. Behavior-sampling procedures include interval recording and time-sample recording. In both procedures, the observation period is divided into smaller units of time, and the observer records whether the behavior occurred in each interval.

Interval recording. Interval recording involves dividing the observation period into equal consecutive intervals and recording whether the behavior occurred in each. Interval recording is different from frequency recording or frequency-within-interval recording in that an interval is scored once regardless of whether a single instance or multiple instances of a behavior occurred during the interval. In behavior analysis research, intervals are usually short—typically 10 to 15 seconds (DiGennaro et al., 2007; Mace et al., 2009). Short intervals (usually less than 20 seconds) are valuable when behavior occurs at moderate to high frequencies or when multiple topographies of behavior are recorded.
Shorter intervals are also valuable when temporal correlations may yield information on antecedent events and potential maintaining consequences. When interested in the relation between the target behavior and antecedents and consequences, the observer records whether any of the three events occurred in each interval to examine the temporal patterns of the behavior and its potential controlling variables (e.g., Repp & Karsh, 1994). An additional condition under which shorter intervals are valuable is when an understanding of within-session temporal distribution of the behavior is necessary. For example, to determine whether self-injurious behavior is high in a functional analysis condition because of an extinction burst, the behavior analyst identifies whether more intervals were scored early in the session than later in the session (Vollmer, Marcus, Ringdahl, & Roane, 1995). Similar patterns could be discerned with cumulative frequency recording or real-time recording as well.

In some applied settings, however, intervals might be much longer—perhaps 15 or 30 minutes (Aikman & Garbutt, 2003)—when behavior occurs less frequently. Under these conditions, it may be difficult to draw useful correlations between antecedent and consequent events and the behavior as well as behavior–behavior relations. Such limitations notwithstanding, longer intervals are typically used for the convenience of data collectors (often participant observers) who can engage in other responsibilities and still collect data.

Typically, the observer has a data sheet with consecutive intervals designated for recording, and during the observation period, the observer is prompted with auditory (through headphones so as to not disrupt the ongoing behavior of the observee) or tactile (vibration) cues delivered via a timing device to move from interval to interval while observing and recording the target behavior. As time passes, the observer records the occurrence of the target behavior in the appropriate interval; a blank interval indicates the nonoccurrence of the behavior in that interval. In some cases, a computer is used for data collection, and as the observer records the behavior, the software puts the data into the proper interval. At the end of the observation period, the number of intervals in which the behavior is observed is
divided by the number of observation intervals, and the result is reported as the percentage of intervals in which the behavior occurred. A similar process is used for time-sample recording (described in the next section).

The two types of interval recording procedures are partial-interval recording and whole-interval recording. In partial-interval recording, the observer records the occurrence of the target behavior if it occurred at any time within the interval. That is, the interval is scored if the target behavior occurred briefly in the interval or throughout the entire interval. Furthermore, if the onset of the behavior occurs in one interval and its offset occurs in the next, both intervals are scored (e.g., Meidinger et al., 2005). In whole-interval recording, the interval is scored only if the target behavior occurred throughout the entire interval. Whole-interval recording is more useful with continuous behavior (e.g., play) than with discrete or quickly occurring behavior (e.g., a face slap). Typically, whole-interval recording is used when a behavior occurs over longer periods of time, as might be seen with noncompliant behavior or on-task behavior. For example, Athens, Vollmer, and St. Peter Pipkin (2007) recorded duration of on-task behavior in 3-second intervals only if the behavior was present for the entire interval.

Time-sample recording. In time-sample recording, the observation period is divided into intervals of time, but observation intervals are separated by periods without observation. Time-sample recording permits the observer to focus on other tasks when not observing the target behavior. For example, the observation period might be divided into 15-second intervals, but observation occurs only at the end of the interval. Likewise, an observation period might be divided into 30-minute intervals, but observation and recording occur only in the last 5 minutes of every 30 minutes. These intervals can be equally divided, as when an observation occurs every 15 minutes, or variably divided to provide some flexibility for the observer (such as a teacher who cannot observe exactly on the quarter hour). The data are displayed as a percentage (the number of intervals with target behavior divided by the number of intervals of observation).
For instance, if evaluating the extent of social interactions between adolescents on a wing of an inpatient psychiatric facility were desirable, conducting observations every 15 minutes might be possible. In this example, the observer would be engaged in a job-related activity and, when prompted by a timer, look up from his or her work, note whether the target behavior was occurring, and record the result. Data of this sort could identify which of the adolescents tend to engage in social interactions and the typical times at which social interactions are likely to occur. From this sampling approach, it is possible to refine the data collection process toward a more precise measure of behavior.

Interval and time-sample recording have benefits and limitations. The benefit of interval recording is that with consecutive observation intervals, no instance of the target behavior is missed during the observation period. The limitation, especially with shorter intervals, is that it requires the continuous attention of, and frequent recording by, the observer, making it difficult for the observer to engage in other activities during the observation period. A limitation of time-sample recording is that because observation intervals are separated by periods without observation, some instances of the target behavior may be missed during the observation period. However, a benefit is that the observer can engage in other activities during the periods between observation intervals, making the procedure more user friendly for participant observers such as teachers or parents.

Although interval and time-sample recording procedures are used widely in behavior-analytic research, some authors have cautioned that the results of these sampling procedures might not always correspond highly with data collected through continuous recording procedures in which every behavioral event is recorded (e.g., Rapp et al., 2007; Rapp, Colby-Dirksen, Michalski, Carroll, & Lindenberg, 2008). In summarizing the numerous studies that have compared interval and time-sample recording with continuous recording procedures, Rapp et al. (2008) concluded that interval recording tends to overestimate the duration of the behavior, time-sample procedures with small intervals tend to produce accurate estimates of duration,
and interval recording with small intervals tends to produce fairly accurate estimates of frequency. Although Rapp et al. provided several suggestions to guide decision making regarding the use of interval and time-sample procedures, they concluded by suggesting that small interval sizes in interval and time-sample procedures are likely to produce the best results.
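
As a minimal illustration of how such comparisons are made (the episodes, session length, and interval size below are hypothetical and are not a reanalysis of any cited study), the following Python sketch scores the same continuous record with partial-interval, whole-interval, and momentary time-sample procedures, the last being the end-of-interval variant of time sampling described above, and compares the estimates with the true proportion of the session occupied by the behavior.

    # Minimal sketch: partial-interval, whole-interval, and momentary time-sample
    # scoring of the same (hypothetical) continuous record of one behavior.
    session = 300          # 5-minute observation period, in seconds
    interval = 10          # 10-s intervals
    episodes = [(3, 8), (25, 60), (110, 111), (140, 182), (250, 290)]  # (onset, offset)

    def occurring(t):
        """Return True if the behavior is occurring at second t."""
        return any(onset <= t < offset for onset, offset in episodes)

    n_intervals = session // interval
    partial = whole = momentary = 0
    for i in range(n_intervals):
        start, end = i * interval, (i + 1) * interval
        seconds = [occurring(t) for t in range(start, end)]
        if any(seconds):
            partial += 1        # scored if the behavior occurred at all
        if all(seconds):
            whole += 1          # scored only if it occurred throughout
        if occurring(end - 1):
            momentary += 1      # scored only at the final moment of the interval

    true_duration = sum(offset - onset for onset, offset in episodes)
    print(f"true duration:             {100 * true_duration / session:.0f}% of session")
    print(f"partial-interval estimate: {100 * partial / n_intervals:.0f}% of intervals")
    print(f"whole-interval estimate:   {100 * whole / n_intervals:.0f}% of intervals")
    print(f"momentary time sample:     {100 * momentary / n_intervals:.0f}% of intervals")

With these hypothetical values, the partial-interval score overestimates, and the whole-interval score underestimates, the proportion of the session occupied by the behavior, which parallels the general pattern summarized above.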

Product Recording

In some cases, the outcome or the product of the behavior may be of interest, either as a primary dependent variable or as a complement to direct observation of the behavior itself. When the behavior changes the physical environment, this product can be recorded as an index of the occurrence of the behavior. In some instances, collecting data on products is valuable because measuring the behavior directly may not be possible. For example, weight is measured in weight-loss programs because measuring the behavior that produces weight loss (i.e., diet and exercise) is usually not feasible. Examples of product recording may include number of academic problems completed or number of units assembled in a factory. In each case, the occurrence of the behavior is not observed directly; rather, the products of the behavior are recorded as an indication of its occurrence. In such cases, a focus on the products of behavior is easier and more efficient than recording the behavioral events as they occur. An important note in recording permanent products is that although the focus is on results, if the results fall short of expected quantity or quality, the focus can then turn to evaluation of the behavior involved in producing the products being measured (Daniels & Daniels, 2004).

Beyond measuring the production of tangible items, product recording can be used to measure the physical damage caused by a problem behavior. For example, self-injurious behavior can produce tissue damage such as bruises, lacerations, or other bodily injuries, and product recording could be used to assess the severity of these injuries. Iwata, Pace, Kissel, Nau, and Farber (1990) developed the Self-Injury Trauma Scale to quantify surface injury resulting from self-injurious behavior. Other examples of this type of product recording include the
assessment of the size of a bald area related to chronic hair pulling (Rapp, Miltenberger, & Long, 1998; Rapp, Miltenberger, Long, Elliott, & Lumley, 1998) or the length of fingernails as a measure of nail biting (Flessner et al., 2005; Long, Miltenberger, Ellingson, & Ott, 1999). Still other examples of product recording include a measure of weight or body mass index as an indication of changes in eating (Donaldson & Normand, 2009; see also Young et al., 2006), measuring chemicals in urine samples as a measure of drug ingestion (Silverman et al., 2007), or weighing food before and after a feeding session to assess the amount of food consumed (Kerwin, Ahearn, Eicher, & Swearingin, 1998; Maglieri, DeLeon, Rodriguez-Catter, & Sevin, 2000; Patel, Piazza, Layer, Colemen, & Swartzwelder, 2005).

An advantage of product recording is that the observer does not have to be present to record the occurrence of the behavior (Miltenberger, 2012) because the product can be recorded at a more convenient time after the behavior has occurred (e.g., at the end of the class period or after the shift in a factory). A drawback of product recording, especially when used with a group of individuals, is that it might not be possible to determine which person engaged in the behavior that resulted in the product. Perhaps another student completed the academic problems or another worker helped produce the units in the factory (Jessup & Stahelski, 1999). Although product recording is valuable when the interest is in the tangible outcome of the behavior, there must be some way to determine which individual was responsible for the products being measured (e.g., did the urine sample come from the client or someone else?).

Another potential problem with some uses of product recording is that it may not identify the behavior that resulted in the product. For example, correct answers to math problems may have been produced by cheating, and weight loss may have been produced through self-induced vomiting rather than an increase in exercise or a reduction in calorie consumption.

Recording Devices

Once the appropriate recording procedure has been chosen, the next step is to choose a recording device. Because the observer must record instances

of the behavior as they occur, the observer’s behavior must result in a product that can be used later for purposes of analysis. A recording device allows the observer to produce a permanent product from the observation session. The most commonly used recording device is a data sheet structured for the type of recording procedure being conducted. Figures 6.1, 6.2, and 6.3 show sample data sheets structured for frequency recording, duration recording, and interval recording, respectively. Although data sheets are used most often for data collection, other recording devices, both low tech and high tech, can be used to record instances of the behavior. Several types of low-tech recording devices have been used, such as wrist counters for frequency recording (Lindsley, 1968) or stop watches for duration recording. Still other possibilities include activities as simple as moving beads from one side of a string to the other, placing a coin from one pocket to another, making small tears in a piece of paper, or making a hash mark on a piece of masking tape affixed to the recorder’s sleeve to record frequency (Miltenberger, 2012). In fact, it is feasible to record frequency with whatever may be available in the environment as long as the observer can produce a product that represents the occurrence of the behavior. Although recording on a data sheet is the most frequently used data collection process, with rapidly changing technologies there is a move to identify high-tech methods to streamline and automate data collection (Connell & Witt, 2004; Jackson & Dixon, 2007; Kahng & Iwata, 1998). In applied behavior analysis research, electronic devices such as a personal digital assistant (Fogel, Miltenberger, Graves, & Koehler, 2010) or hand-held or laptop computers (Gravlee, Zenk, Woods, Rowe, & Schulz, 2006; Kahng & Iwata, 1998; Repp, Karsh, Felce, & Ludewig, 1989) are frequently used for data collection. In addition, the use of bar codes and scanners (Saunders, Saunders, & Saunders, 1993) for data collection has been reported. With bar code scanners, an observer holds a small battery-powered scanning device and a sheet of paper with the bar codes ordered according to behavioral topography. When the target behavior is observed, the data collector scans the relevant bar code to record the 139


Figure 6.1. Example of a daily frequency data sheet that involves a breakdown of the frequency of the behavior by curricular areas in a general education classroom setting.

Figure 6.2. An example of a duration data sheet that provides information on the onset and offset of each occurrence of the behavior and the frequency of the behavior each day.

occurrence of the behavior and the time within the observation period. The use of bar codes is, however, only one of several ways to conduct electronic recording of behavior. In one investigation evaluating a shaping procedure to increase the reach of a pole vaulter, a photoelectric beam was used to determine the height of the vaulter’s reach just after planting the pole for the

vault (Scott, Scott, & Goldwater, 1997). Another high-tech method of data collection involves software for cell phones. These software applications, colloquially referred to as apps, allow behavior analysts to use the computing power of their phones for data collection. The advantages of this technology are numerous; however, the most obvious benefits are the use of a small, portable device that can


Figure 6.3. An example of a 10-second interval data sheet (partial or whole) that provides information on the occurrence of the target behavior and probable antecedents and consequences. In this example, the hypothesis is that the aggressive behavior occurs after the delivery of a demand by the teacher. In addition, potential responses by the teacher to the problem behavior are included. When completed, this data sheet will provide information on the temporal relationship between teacher behavior and student problem behavior.

facilitate any form of data collection mentioned thus far and the ability to graph the data. Finally, these graphs can be sent via text message to parents, teachers, or colleagues (Maher, 2009). Undoubtedly, as technology advances, even more high-tech data collection methods will emerge.

Reactivity of Observation

A long-standing concern for behavioral researchers is how observation affects performance (e.g., Parsons, 1974). Reactivity is the term used to describe changes in behavior resulting from the act of observing and recording the behavior. Typically, when reactivity occurs, the behavior changes in the

desired direction (e.g., Brackett, Reid, & Green, 2007; Mowery et al., 2010). Several researchers have evaluated the effects of staff reactivity to observations (Boyce & Geller, 2001; Brackett et al., 2007; Codding, Livanis, Pace, & Vaca, 2008; Mowery et al., 2010). Mowery et al. (2010) evaluated staff adherence to a protocol designed to increase the frequency of staff’s positive social initiations during leisure activities with adults with developmental disabilities. They evaluated the effects on staff behavior of having a supervisor absent or a supervisor present in the environment to determine whether reactivity to the supervisor’s presence would occur. Positive social interactions only


occurred at acceptable levels when the supervisor was present, suggesting that reactivity is an important issue to consider in the valid assessment of staff performance. Considering reactivity to observation in research or clinical settings is important because the target behavior may be influenced not only by the intervention but also by the act of observing. When responding comes under the stimulus control of an observer, the level of behavior in the presence of the observer is likely to be different than in the absence of the observer. Considering that most staff behavior must be performed in the absence of supervision, conducting observations without reactivity to obtain an accurate characterization of the behavior is important. There are a variety of ways in which to minimize reactivity. For instance, making observers a part of the regular environment for several sessions or weeks before actual data collection occurs may result in habituation to the presence of the observers. It is important to keep in mind that habituation to the observer is only likely to occur as long as the observer does not interact with the person being observed and no consequences are delivered by the observer or others who may be associated with the observer. If the setting permits, reactivity may be avoided in other ways. Video monitoring devices mounted unobtrusively in the setting may be used. In instances in which the cameras can be seen by the person being observed, habituation to the presence of the camera is likely to occur in the absence of feedback or consequences contingent on performance (e.g., Rapp, Miltenberger, Long, et al., 1998). In addition, using an observation room equipped with a one-way observation window provides the observer an opportunity to conduct unannounced observations. If the use of an observation window is not possible, the use of confederates may be considered. Although confederates are present in the setting to collect data (unobtrusively), a benign purpose for their presence other than data collection is provided to those being observed. That is, deception is used to conceal the true purpose of the observers’ presence. As mentioned in the Mowery et al. (2010) study, confederates may be used to increase the chances that the data collected are 142

representative of typical levels (the level expected in the absence of observation). Confederates can be any variety of individuals such as a coworker, classmate, spouse, or person external to the setting as seen in Mowery et al. (2010), in which observers were introduced as student social workers who were in the setting to observe individuals with intellectual disabilities. In recent research on abduction prevention, children were observed without their knowledge to assess their safety skills as a confederate approached and attempted to lure them in a store setting (Beck & Miltenberger, 2009). Research on child safety skills training has demonstrated that children are more likely to respond correctly when they are aware of observation than when they are not aware of observation (Gatheridge et al., 2004; Himle, Miltenberger, Gatheridge, & Flessner, 2004). An important caveat is that the use of confederates may raise ethical concerns and should be approached cautiously. The use of video or other inconspicuous monitoring systems may present the same ethical concerns and thus should also be approached with caution. Prior approval is needed when using deceptive covert observation. Additionally, for research purposes such covert observation must be approved by an institutional review board with appropriate debriefing afterward.

Interobserver Agreement

Within research and practice in applied behavior analysis, accurate data collection is important (see Chapter 7, this volume). Accuracy refers to the extent to which the recorded level of the behavior matches the true level of the behavior (Cooper et al., 2007; Johnston & Pennypacker, 1993; Kazdin, 1977). To evaluate accuracy, a researcher must be able to obtain a measure of the true level of the behavior to compare with the measurement of the behavior produced by the observer. The difficulty arises in obtaining the true level of the behavior, as most recording is done by humans who must discriminate the occurrence of the behavior from nonoccurrences (a stimulus–control problem). A “truer” level of the behavior may be obtained through mechanical means, but equipment also may fail


on occasion and produce errors. Alternatively, ­automated recording devices may fail to register responses that vary slightly in topography or location. Thus, knowing the true level of the behavior with certainty is impossible, and therefore accuracy is not measured in behavioral research. Instead, behavior analysts train observers so that the data they collect are in agreement with those collected by another observer who has received training in recording the target behavior. Although measuring agreement between observers provides no information about the accuracy of either set of observations, it does improve the believability of the data. That is, when two independent observers agree on every occurrence and nonoccurrence of behavior, one has more confidence that they are using the same definition of the target behavior, observing and recording the same responses, and marking the form correctly (Miltenberger, 2012). If a valid definition of the behavior of interest is being used, then high agreement scores increase the belief that the behavior has been recorded accurately; although, again, accuracy has not been measured. A frequently used measure of agreement between observers is simply the percentage of observations that agree, a measure commonly referred to as IOA. IOA is calculated by dividing the number of agreements (both observers recorded the occurrence or nonoccurrence of the behavior) by the number of agreements plus disagreements (one observer recorded the occurrence of the behavior and the other recorded the nonoccurrence of the behavior) and multiplying the quotient by 100. For an adequate assessment of IOA, two independent data collectors are recommended to be present during at least one third of all observation sessions across all participants and phases of a clinical intervention or research study (Cooper et al., 2007). This level of IOA assessment (one third of sessions) is an arbitrary number, and efforts to maximize the number of assessments that produce strong percentages of agreement should result in greater confidence in the data. Cooper et al. (2007) suggested that research studies maintain 90% or higher IOA but agreed that 80% or higher may be acceptable under some circumstances. Kazdin (2010) offered a different

perspective on the acceptable level of IOA and suggested that the level of agreement that is acceptable is one that indicates to the researcher that the observers are sufficiently consistent in their recording of the behavior, that the behaviors are adequately defined, and that the measures will be sensitive to change in the client’s performance over time. (p. 118) Kazdin suggested that the number and complexity of behaviors being recorded, possible sources of bias, expected level of change in the behavior, and method of computing IOA are all considerations in deciding on an acceptable level of IOA. For example, if small changes in behavior are likely with the intervention, then higher IOA would be demanded. However, if larger changes are expected, then lower levels of IOA might be tolerated. The bottom line is that behavior analysts should strive for levels of IOA as high as possible (e.g., 90% or more) but consider the factors that might contribute to lower levels and make adjustments as warranted by these factors. IOA can be calculated in a variety of ways. How it is computed depends on the dimension of the behavior being evaluated and how it is measured. Next we describe common methods for calculating IOA.

Frequency Recording

To calculate IOA on frequency recording, the smaller frequency is divided by the larger frequency. For example, if one observer records 40 occurrences of a target behavior and a second independent observer records 35, the percentage of IOA during that observation session is 35/40, or 87.5%. The limitation of IOA in frequency recording is that there is no evidence that the two observers recorded the same behavioral event even when IOA is high. For example, if one observer recorded nine instances of the behavior and the other observer recorded 10 instances, the two observers ostensibly agreed on nine of the 10 instances for an IOA of 90%. It is possible, however, that the observers were actually recording different instances of the behavior. One way to increase confidence that the two observers were agreeing on specific responses in frequency


IOA is to collect frequency data in intervals and then compare the frequency in each interval (see Frequency Within Interval section later in this chapter). Dividing the observation period into shorter, equal intervals permits a closer look at the recording of frequencies in shorter time blocks. In this way, there can be more confidence that the observers recorded the same instances of the behavior when agreement is high. To further enhance confidence that observers are recording the same behavioral events, it is possible to collect data on behavior as it occurs in real time. With real-time recording, it is possible to determine whether there is exact agreement on each instance of the behavior.
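The total-count IOA computation just described is easy to express in code. The following Python sketch is ours (the function name and example values are illustrative, not from the chapter); it simply divides the smaller total by the larger and converts the result to a percentage.

```python
def total_count_ioa(count_1, count_2):
    """Total-count IOA: smaller total divided by larger total, times 100."""
    if count_1 == count_2:  # also covers the case in which both observers recorded zero responses
        return 100.0
    return min(count_1, count_2) / max(count_1, count_2) * 100

# Example from the text: one observer records 40 responses, the other 35.
print(round(total_count_ioa(40, 35), 1))  # 87.5
```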

Real-Time Recording

When using real-time recording, the onset and offset of the behavior are recorded on a second-by-second basis. Therefore, IOA can be calculated by dividing the number of seconds in which the two observers agreed that the behavior was or was not occurring by the number of seconds in the observation session. Typically, an agreement on the onset and offset of the behavior can be defined as occurring when both observers recorded the onset or offset at exactly the same second. This form of IOA is the most stringent because agreement is calculated for every second of the observation period (e.g., Rapp, Miltenberger, & Long, 1998; Rapp, Miltenberger, Long, et al., 1998). Alternatively, IOA could be conducted on the frequency of the behavior, but an agreement would only be scored when both observers recorded the onset of the behavior at the same instant or within a small window of time (e.g., within 1 or 2 seconds of each other).
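A minimal sketch of the second-by-second agreement calculation, assuming each observer’s real-time record has been reduced to one Boolean value per second of the session (True when the behavior was scored as occurring during that second); the records shown are hypothetical.

```python
def second_by_second_ioa(record_1, record_2):
    """Percentage of seconds on which two real-time records agree that the
    behavior was, or was not, occurring."""
    assert len(record_1) == len(record_2), "records must cover the same session"
    agreements = sum(a == b for a, b in zip(record_1, record_2))
    return agreements / len(record_1) * 100

# Ten-second illustration: the two records disagree on a single second.
obs_1 = [False, True, True, True, False, True, False, False, False, False]
obs_2 = [False, True, True, True, False, False, False, False, False, False]
print(second_by_second_ioa(obs_1, obs_2))  # 90.0
```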

Duration Recording

IOA for duration recording is calculated by dividing the smaller duration by the larger duration. For example, if one observer records 90 minutes of break time taken in an 8-hour shift and the reliability observer records 85 minutes, the agreement between observers is 85/90, or 94.4%. The same limitation described earlier for IOA on frequency recording pertains to IOA on duration recording. Although the duration recorded by the two observers may be similar, unless the data are time stamped,

there is no evidence that the two observers were recording the same instances of the behavior. Real-time recording is a way to overcome this problem.
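Computationally, duration IOA is the same smaller-over-larger ratio used for total counts, applied to total durations. A brief sketch using the 85-minute versus 90-minute example from the text (the function name is ours):

```python
def duration_ioa(duration_1, duration_2):
    """Duration IOA: smaller total duration divided by larger, times 100."""
    if duration_1 == duration_2:
        return 100.0
    return min(duration_1, duration_2) / max(duration_1, duration_2) * 100

# Example from the text: 90 minutes versus 85 minutes of recorded break time.
print(round(duration_ioa(90, 85), 1))  # 94.4
```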

Interval and Time-Sample Recording

Computing IOA with interval data requires an interval-by-interval check for agreement on the occurrence and nonoccurrence of the behavior throughout the observation period. The number of intervals of agreement is then divided by the number of intervals in the observation period to produce a percentage of agreement. An agreement is defined as an interval in which both observers had a marked interval (indicating that the behavior occurred) or an unmarked interval (indicating that the behavior did not occur). Using only one target behavior for this example, consider a 10-minute observation session with data recorded at 10-second intervals (60 intervals total). If the number of intervals of observation with agreements is 56 of 60, the percentage of IOA is 56/60, or 93.3%. Two variations of IOA calculations for interval recording, which correct for chance agreement with low-rate and high-rate behavior, are occurrence-only and nonoccurrence-only calculations. The occurrence-only calculation is used with low-rate behavior (for which chance agreement on nonoccurrence is high) and involves calculating IOA using only agreements on occurrence and removing agreements on nonoccurrence from consideration (agreements on occurrence divided by agreements plus disagreements on occurrence). The nonoccurrence-only calculation is used with high-rate behavior (for which chance agreement on occurrence is high) and involves calculation of IOA using only agreements on nonoccurrence and removing agreements on occurrence from consideration (agreements on nonoccurrence divided by agreements plus disagreements on nonoccurrence).
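The interval-by-interval calculation and its occurrence-only and nonoccurrence-only variants can be sketched as follows. The interval records are hypothetical Booleans, one entry per interval, with True meaning the interval was marked; the function and variable names are ours.

```python
def interval_ioa(obs_1, obs_2, mode="all"):
    """Interval IOA.
    mode='all':           agreements on occurrence and nonoccurrence / total intervals
    mode='occurrence':    agreements on occurrence / (agreements + disagreements on occurrence)
    mode='nonoccurrence': agreements on nonoccurrence / (agreements + disagreements on nonoccurrence)
    """
    pairs = list(zip(obs_1, obs_2))
    if mode == "all":
        relevant = pairs
    elif mode == "occurrence":
        relevant = [p for p in pairs if p[0] or p[1]]          # at least one observer marked the interval
    elif mode == "nonoccurrence":
        relevant = [p for p in pairs if not (p[0] and p[1])]   # at least one observer left it unmarked
    else:
        raise ValueError(mode)
    agreements = sum(a == b for a, b in relevant)
    return agreements / len(relevant) * 100 if relevant else 100.0

# Hypothetical low-rate behavior scored across 60 ten-second intervals.
obs_1 = [i in (5, 23, 40) for i in range(60)]
obs_2 = [i in (5, 23, 41) for i in range(60)]
print(round(interval_ioa(obs_1, obs_2), 1))                     # 96.7 (inflated by chance agreement on nonoccurrence)
print(round(interval_ioa(obs_1, obs_2, mode="occurrence"), 1))  # 50.0
```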

Cohen’s Kappa

Kappa is another method of calculating observer agreement, but it corrects for the probability that two observers will agree as a result of chance alone. Kappa is computed using the following formula:

κ = (Po − Pc)/(1 − Pc),

where Po is the proportion of agreement between observers (sum of agreements on occurrences and nonoccurrences divided by the total number of intervals) and Pc is the proportion of agreement expected by chance. The latter may be obtained using the following formula:

Pc = [(O1o)(O2o) + (O1n)(O2n)]/I²,

where O1o is the number of occurrences recorded by Observer 1, O2o is the number of occurrences recorded by Observer 2; O1n and O2n are nonoccurrence counts, and I is the number of observations made by each observer. For example, if Observer 1 scored nine intervals out of 10 and Observer 2 scored eight intervals out of 10 (see Exhibit 6.1), kappa would be calculated as follows: Po = .90; Pc = (72 + 2)/10² = .74; κ = .62. Kappa values can range from −1 to 1, with 0 reflecting a chance level of agreement. No single rule for interpreting an obtained kappa value may be given because the number of different categories into which behavior may be classified will affect kappa. If only two categories are used (e.g., occurrence vs. nonoccurrence), then the probability of a chance agreement is higher than if more categories had been used. Higher probabilities of chance agreement are reflected in lower kappa values. Thus, if the preceding example had used three categories (e.g., slow-, medium-, or high-rate responding) and IOA had been the same (90%), then kappa would have been more than .62.
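The worked kappa example can be verified numerically. In the sketch below (ours, not from the chapter), the chance-agreement term is computed from each observer’s occurrence counts exactly as in the formula above, using Po = .90 and 9 and 8 scored intervals out of 10.

```python
def cohens_kappa(p_o, o1_occ, o2_occ, n_intervals):
    """Kappa from the observed proportion of agreement (p_o) and each observer's
    occurrence counts, using the chance-agreement formula given in the text."""
    o1_non = n_intervals - o1_occ
    o2_non = n_intervals - o2_occ
    p_c = (o1_occ * o2_occ + o1_non * o2_non) / n_intervals ** 2
    return (p_o - p_c) / (1 - p_c)

# Worked example from the text: Po = .90, with 9 and 8 scored intervals out of 10.
print(round(cohens_kappa(p_o=0.90, o1_occ=9, o2_occ=8, n_intervals=10), 2))  # 0.62
```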

Exhibit 6.1. Recordings for Observer 1 and Observer 2 in 10 Observation Intervals. (Observer 1 marked the behavior as occurring in nine of the 10 intervals and Observer 2 in eight; the two observers agreed on nine of the 10 intervals.)

Kappa can be affected by other factors, including the distribution of the ratings of the observers (Sim & Wright, 2005). Because these latter factors have little to do with the degree to which two observers agree, there is little agreement on what constitutes an acceptable kappa value. Within the social sciences, Kazdin (2010) suggested that a kappa value of 0.7 or higher reflects an acceptable level of agreement. Lower criteria for acceptable kappa values may be found (e.g., Fleiss, 1981; Landis & Koch, 1977), but these are as arbitrary as the cutoff suggested by Kazdin (von Eye & von Eye, 2008). Perhaps for this reason, kappa is less often used by applied behavior analysts than is IOA.

Frequency Within Interval

Calculating IOA for frequency within interval minimizes the limitation identified for IOA on frequency recording (high agreement even though the two observers might be recording different instances of behavior). For example, if agreement is calculated within each of a series of 20-second intervals, then there is no chance that a response recorded by Observer A in Interval 1 will be counted as an agreement with a different response recorded by Observer B in Interval 12. To calculate frequency-within-interval agreement, calculate a percentage of agreement between observers for each interval (smaller number divided by larger number), sum the percentages for all the intervals, and divide by the number of intervals in the observation period. Exhibit 6.2 illustrates an IOA calculation for frequency-within-interval data for 10 intervals for two observers. Each X corresponds to an occurrence of the behavior in an interval.
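A sketch of the frequency-within-interval calculation. The per-interval counts below are hypothetical; they were chosen only so that the resulting per-interval percentages match those reported in the note to Exhibit 6.2 (67, 100, 100, 100, 100, 100, 50, 100, 50, 100), not to reproduce the exhibit’s actual records.

```python
def freq_within_interval_ioa(counts_1, counts_2):
    """Mean of the per-interval smaller/larger percentages across all intervals."""
    percentages = []
    for c1, c2 in zip(counts_1, counts_2):
        if c1 == c2:  # includes intervals in which neither observer scored a response
            percentages.append(100.0)
        else:
            percentages.append(min(c1, c2) / max(c1, c2) * 100)
    return sum(percentages) / len(percentages)

# Hypothetical counts for two observers across 10 intervals.
obs_1 = [3, 1, 0, 2, 2, 1, 2, 3, 1, 2]
obs_2 = [2, 1, 0, 2, 2, 1, 1, 3, 2, 2]
print(round(freq_within_interval_ioa(obs_1, obs_2), 1))  # 86.7
```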

Ethical Considerations

Several ethical issues must be considered when conducting observations and measurement as part of research or clinical practice in applied behavior analysis (e.g., Bailey & Burch, 2005). First, a behavior analyst should observe and record the person’s behavior only after receiving written consent from the individual or the individual’s parent or guardian. As part of the consent process, the individual must be apprised of and agree to the ways in which the data will be used (research presentation or publication,


Exhibit 6.2. Frequency-Within-Interval Recordings for Two Observers and Interobserver Agreement Calculation. Note: Interobserver agreement = (67 + 100 + 100 + 100 + 100 + 100 + 50 + 100 + 50 + 100)/10 = 86.7%.

clinical decision making). If the behavior analyst identifies new uses for the data after the original consent has been obtained, new consent must be obtained from the individual for the new ways in which the data will be used. Second, the individual must know when and where observation will take place unless the individual provides written consent for surreptitious or unannounced observation. Third, observation and recording must take place in such a way that confidentiality is maintained for the individual receiving services or participating in research. To maintain confidentiality, the observer must not draw attention to the person being observed and must not inform any other people about the observations unless the individual being observed has provided written permission to do so. In addition, behavior analysts must use pseudonyms and disguise other identifying information in presentations and publications. Fourth, observers must treat the individual being observed and others in the setting with dignity and respect at all times during the course of their participation in research or as they are receiving clinical services.

Summary

Observation and measurement are at the heart of applied behavior analysis because behavior (and its controlling variables) is the subject matter of both research and practice. As discussed in this chapter,

adequate measurement of behavior requires clear definitions of the target behavior, precise specifications of recording logistics and procedures, appropriate choice of recording devices, and consideration of reactivity and IOA. The validity of conclusions that can be drawn from experimental manipulations of controlling variables or evaluations of treatment effectiveness depends on the adequacy of the observation and measurement of the behaviors targeted in these endeavors.

References

Aikman, G., & Garbutt, V. (2003). Brief probes: A method for analyzing the function of disruptive behaviour in the natural environment. Behavioural and Cognitive Psychotherapy, 31, 215–220. doi:10.1017/S1352465803002108

Allen, K. D. (1998). The use of an enhanced simplified habit reversal procedure to reduce disruptive outbursts during athletic performance. Journal of Applied Behavior Analysis, 31, 489–492. doi:10.1901/jaba.1998.31-489

Athens, E. S., Vollmer, T. R., & St. Peter Pipkin, C. C. (2007). Shaping academic task engagement with percentile schedules. Journal of Applied Behavior Analysis, 40, 475–488. doi:10.1901/jaba.2007.40-475

Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97. doi:10.1901/jaba.1968.1-91

Bailey, J. S., & Burch, M. R. (2002). Research methods in applied behavior analysis. Thousand Oaks, CA: Sage.


Bailey, J. S., & Burch, M. R. (2005). Ethics for behavior analysts. Mahwah, NJ: Erlbaum.

negative reinforcement: Effects on teacher and student behavior. School Psychology Review, 34, 220–231.

Beck, K. V., & Miltenberger, R. G. (2009). Evaluation of a commercially available program and in situ training by parents to teach abduction-prevention skills to children. Journal of Applied Behavior Analysis, 42, 761–772. doi:10.1901/jaba.2009.42-761

DiGennaro-Reed, F. D., Codding, R., Catania, C. N., & Maguire, H. (2010). Effects of video modeling on treatment integrity of behavioral interventions. Journal of Applied Behavior Analysis, 43, 291–295. doi:10.1901/jaba.2010.43-291

Borrero, C. S. W., & Borrero, J. C. (2008). Descriptive and experimental analyses of potential precursors to problem behavior. Journal of Applied Behavior Analysis, 41, 83–96. doi:10.1901/jaba.2008.41-83

Donaldson, J. M., & Normand, M. P. (2009). Using goals setting, self-monitoring, and feedback to increase calorie expenditure in obese adults. Behavioral Interventions, 24, 73–83. doi:10.1002/bin.277

Bosch, A., Miltenberger, R. G., Gross, A., Knudson, P., & Brower-Breitweiser, C. (2008). Evaluation of extinction as a functional treatment for binge eating. Behavior Modification, 32, 556–576. doi:10.1177/ 0145445507313271

Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York, NY: Wiley.

Boyce, T. E., & Geller, E. S. (2001). A technology to measure multiple driving behaviors without self-report or participant reactivity. Journal of Applied Behavior Analysis, 34, 39–55. doi:10.1901/jaba.2001.34-39 Brackett, L., Reid, D. H., & Green, C. W. (2007). Effects of reactivity to observations on staff performance. Journal of Applied Behavior Analysis, 40, 191–195. doi:10.1901/jaba.2007.112-05 Brown, R. A., Palm, K. M., Strong, D., Lejuez, C., Kahler, C., Zvolensky, M., . . . Gifford, E. (2008). Distress tolerance treatment for early-lapse smokers: Rationale, program description, and preliminary findings. Behavior Modification, 32, 302–332. doi:10.1177/0145445507309024 Codding, R. S., Livanis, A., Pace, G. M., & Vaca, L. (2008). Using performance feedback to improve treatment integrity of classwide behavior plans: An investigation of observer reactivity. Journal of Applied Behavior Analysis, 41, 417–422. doi:10.1901/ jaba.2008.41-417 Connell, J. E., & Witt, J. C. (2004). Applications of computer-based instruction: Using specialized software to aid letter-name and letter-sound recognition. Journal of Applied Behavior Analysis, 37, 67–71. doi:10.1901/jaba.2004.37-67 Cooper, J. O., Heron, T. E., & Heward, W. L. (2007). Applied behavior analysis (2nd ed.). Upper Saddle River, NJ: Pearson Education.

Flessner, C. A., Miltenberger, R. G., Egemo, K., Jostad, C., Gatheridge, B. J., Neighbors, C., . . . Kelso, P. (2005). An evaluation of the social support component of simplified habit reversal. Behavior Therapy, 36, 35–42. doi:10.1016/S0005-7894(05)80052-8 Fogel, V. A., Miltenberger, R. G., Graves, R., & Koehler, S. (2010). Evaluating the effects of exergaming on physical activity among inactive children in a physical education classroom. Journal of Applied Behavior Analysis, 43, 591–600. doi:10.1901/jaba.2010.43-591 Gatheridge, B. J., Miltenberger, R., Huneke, D. F., Satterlund, M. J., Mattern, A. R., Johnson, B. M., & Flessner, C. A. (2004). A comparison of two programs to teach firearm injury prevention skills to 6and 7-year-old children. Pediatrics, 114, e294–e299. doi:10.1542/peds.2003-0635-L Gravlee, C. C., Zenk, S. N., Woods, S., Rowe, Z., & Schulz, A. J. (2006). Handheld computers for direct observation of the social and physical environment. Field Methods, 18, 382–397. doi:10.1177/1525822X 06293067 Gresham, F. M., Gansle, K. A., & Noell, G. H. (1993). Treatment integrity in applied behavior analysis with children. Journal of Applied Behavior Analysis, 26, 257–263. doi:10.1901/jaba.1993.26-257 Harris, F. C., & Ciminero, A. R. (1978). The effects of witnessing consequences on the behavioral recording of experimental observers. Journal of Applied Behavior Analysis, 11, 513–521. doi:10.1901/jaba.1978.11-513

Daniels, A. C., & Daniels, J. E. (2004). Performance management: Changing behavior that drives organizational effectiveness. Atlanta, GA: Performance Management.

Hayes, S. C., Barlow, D. H., & Nelson-Gray, R. O. (1999). The scientist practitioner: Research and accountability in the age of managed care. Boston, MA: Allyn & Bacon.

DiGennaro, F. D., Martens, B. K., & Kleinmann, A. E. (2007). A comparison of performance feedback procedures on teachers’ treatment implementation integrity and students’ inappropriate behavior in special education classrooms. Journal of Applied Behavior Analysis, 40, 447–461. doi:10.1901/jaba.2007.40-447

Hayes, S. C., Wilson, K. G., Gifford, E., Bissett, R., Piasecki, M., Batten, S., . . . Gregg, J. (2004). A preliminary trial of twelve-step facilitation and acceptance and commitment therapy with polysubstance-abusing methadonemaintained opiate addicts. Behavior Therapy, 35, 667–688. doi:10.1016/S0005-7894(04)80014-5

DiGennaro, F. D., Martens, B. K., & McIntyre, L. L. (2005). Increasing treatment integrity through

Himle, M. B., Miltenberger, R. G., Flessner, C., & Gatheridge, B. (2004). Teaching safety skills to children to prevent


gun play. Journal of Applied Behavior Analysis, 37, 1–9. doi:10.1901/jaba.2004.37-1 Himle, M. B., Miltenberger, R. G., Gatheridge, B., & Flessner, C. (2004). An evaluation of two procedures for training skills to prevent gun play in children. Pediatrics, 113, 70–77. doi:10.1542/peds.113.1.70 Iwata, B. A., Pace, G. M., Kissel, R. C., Nau, P. A., & Farber, J. M. (1990). The Self-Injury Trauma (SIT) scale: A method for quantifying surface tissue damage caused by self-injurious behavior. Journal of Applied Behavior Analysis, 23, 99–110. doi:10.1901/ jaba.1990.23-99 Jackson, J., & Dixon, M. R. (2007). A mobile computing solution for collecting functional analysis data on a pocket PC. Journal of Applied Behavior Analysis, 40, 359–384. doi:10.1901/jaba.2007.46-06 Jessup, P. A., & Stahelski, A. J. (1999). The effects of a combined goal setting, feedback and incentive intervention on job performance in a manufacturing environment. Journal of Organizational Behavior Management, 19, 5–26. doi:10.1300/J075v19n03_02 Johnston, J. M., & Pennypacker, H. S. (1993). Readings for strategies and tactics of behavioral research (2nd ed.). Hillsdale, NJ: Erlbaum. Kahng, S. W., & Iwata, B. A. (1998). Computerized systems for collecting real-time observational data. Journal of Applied Behavior Analysis, 31, 253–261. doi:10.1901/jaba.1998.31-253 Kazdin, A. E. (1977). Artifact, bias, and complexity of assessment: The ABCs of reliability. Journal of Applied Behavior Analysis, 10, 141–150. doi:10.1901/ jaba.1977.10-141 Kazdin, A. E. (2010). Single case research designs: Methods for clinical and applied settings (2nd ed.). New York, NY: Oxford University Press. Kerwin, M. L., Ahearn, W. H., Eicher, P. S., & Swearingin, W. (1998). The relationship between food refusal and self-injurious behavior: A case study. Journal of Behavior Therapy and Experimental Psychiatry, 29, 67–77. doi:10.1016/S0005-7916(97)00040-2 Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174. doi:10.2307/2529310 Lerman, D. C., Iwata, B. A., Smith, R. G., & Vollmer, T. R. (1994). Restraint fading and the development of alternative behaviour in the treatment of self-restraint and self-injury. Journal of Intellectual Disability Research, 38, 135–148. doi:10.1111/j.1365-2788.1994.tb00369.x Lindsley, O. R. (1968). Technical note: A reliable wrist counter for recording behavior rates. Journal of Applied Behavior Analysis, 1, 77–78. doi:10.1901/ jaba.1968.1-77 Long, E. S., Miltenberger, R. G., Ellingson, S. A., & Ott, S. M. (1999). Augmenting simplified habit reversal 148

in the treatment of oral-digit habits exhibited by individuals with mental retardation. Journal of Applied Behavior Analysis, 32, 353–365. doi:10.1901/ jaba.1999.32-353 MacDonald, R., Sacramone, S., Mansfield, R., Wiltz, K., & Ahern, W. (2009). Using video modeling to teach reciprocal pretend play to children with autism. Journal of Applied Behavior Analysis, 42, 43–55. doi:10.1901/jaba.2009.42-43 Mace, F. C., Prager, K. L., Thomas, K., Kochy, J., Dyer, T. J., Perry, L., & Pritchard, D. (2009). Effects of stimulant medication under varied motivational operations. Journal of Applied Behavior Analysis, 42, 177–183. doi:10.1901/jaba.2009.42-177 Maglieri, K. A., DeLeon, I. G., Rodriguez-Catter, V. R., & Sevin, B. M. (2000). Treatment of covert food stealing in an individual with Prader-Willi syndrome. Journal of Applied Behavior Analysis, 33, 615–618. doi:10.1901/jaba.2000.33-615 Maher, E. (2009). Behavior Tracker Pro. Retrieved from http://www.behaviortrackerpro.com/btp/Welcome. html Malott, R., & Trojan-Suarez, E. A. (2004). Elementary principles of behavior (5th ed.). Upper Saddle River, NJ: Prentice Hall. Marckel, J. M., Neef, N. A., & Ferreri, S. J. (2006). A preliminary analysis of teaching improvisation with the picture exchange communication system to children with autism. Journal of Applied Behavior Analysis, 39, 109–115. doi:10.1901/jaba.2006.131-04 Mash, E. J., & McElwee, J. (1974). Situational effects on observer accuracy: Behavioral predictability, prior experience, and complexity of coding categories. Child Development, 45, 367–377. doi:10.2307/1127957 Mayfield, K. H., & Vollmer, T. R. (2007). Teaching math skills to at-risk students using home-based peer tutoring. Journal of Applied Behavior Analysis, 40, 223–237. doi:10.1901/jaba.2007.108-05 Meidinger, A. L., Miltenberger, R. G., Himle, M., Omvig, M., Trainor, C., & Crosby, R. (2005). An investigation of tic suppression and the rebound effect in Tourette’s disorder. Behavior Modification, 29, 716–745. doi:10.1177/0145445505279262 Miltenberger, R. G. (2012). Behavior modification: Principles and procedures (5th ed.). Belmont, CA: Wadsworth. Miltenberger, R., Rapp, J., & Long, E. (1999). A low tech method for conducting real time recording. Journal of Applied Behavior Analysis, 32, 119–120. doi:10.1901/ jaba.1999.32-119 Miltenberger, R. G., Woods, D. W., & Himle, M. (2007). Tic disorders and trichotillomania. In P. Sturmey (Ed.), Handbook of functional analysis and clinical psychology (pp. 151–170). Burlington, MA: Elsevier.


Miltenberger, R. G., Wright, K. M., & Fuqua, R. W. (1986). Graduated in vivo exposure with a severe spider phobic. Scandinavian Journal of Behaviour Therapy, 15, 71–76. doi:10.1080/16506078609455763 Mowery, J., Miltenberger, R., & Weil, T. (2010). Evaluating the effects of reactivity to supervisor presence on staff response to tactile prompts and self-monitoring in a group home setting. Behavioral Interventions, 25, 21–35. Mozingo, D. B., Smith, T., Riordan, M. R., Reiss, M. L., & Bailey, J. S. (2006). Enhancing frequency recording by developmental disabilities treatment staff. Journal of Applied Behavior Analysis, 39, 253–256. doi:10.1901/jaba.2006.55-05 Parsons, H. M. (1974). What happened at Hawthorne? Science, 183, 922–932. doi:10.1126/science.183. 4128.922 Patel, M. R., Piazza, C. C., Layer, S. A., Coleman, R., & Swartzwelder, D. M. (2005). A systematic evaluation of food textures to decrease packing and increase oral intake in children with pediatric feeding disorders. Journal of Applied Behavior Analysis, 38, 89–100. doi:10.1901/jaba.2005.161-02 Pedhazur, E., & Schmelkin, L. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum. Peterson, L., Homer, A. L., & Wonderlich, S. A. (1982). The integrity of independent variables in behavior analysis. Journal of Applied Behavior Analysis, 15, 477–492. doi:10.1901/jaba.1982.15-477 Plavnick, J. B., Ferreri, S. J., & Maupin, A. N. (2010). The effects of self-monitoring on the procedural integrity of behavioral intervention for young children with developmental disabilities. Journal of Applied Behavior Analysis, 43, 315–320. doi:10.1901/ jaba.2010.43-315 Raiff, B. R., Faix, C., Turturici, M., & Dallery, J. (2010). Breath carbon monoxide output is affected by speed of emptying the lungs: Implications for laboratory and smoking cessation research. Nicotine and Tobacco Research, 12, 834–838. doi:10.1093/ntr/ntq090 Rapp, J. T., Colby, A. M., Vollmer, T. R., Roane, H. S., Lomas, J., & Britton, L. N. (2007). Interval recording for duration events: A re-evaluation. Behavioral Interventions, 22, 319–345. doi:10.1002/bin.239 Rapp, J. T., Colby-Dirksen, A. M., Michalski, D. N., Carroll, R. A., & Lindenberg, A. M. (2008). Detecting changes in simulated events using partial-interval recording and momentary time sampling. Behavioral Interventions, 23, 237–269. doi:10.1002/bin.269 Rapp, J. T., Miltenberger, R. G., & Long, E. S. (1998). Augmenting simplified habit reversal with an awareness enhancement device: Preliminary findings. Journal of Applied Behavior Analysis, 31, 665–668. doi:10.1901/jaba.1998.31-665

Rapp, J. T., Miltenberger, R. G., Long, E. S., Elliott, A. J., & Lumley, V. A. (1998). Simplified habit reversal for chronic hair pulling in three adolescents: A clinical replication with direct observation. Journal of Applied Behavior Analysis, 31, 299–302. doi:10.1901/ jaba.1998.31-299 Repp, A. C., & Karsh, K. G. (1994). Hypothesis-based interventions for tantrum behaviors of persons with developmental disabilities in school settings. Journal of Applied Behavior Analysis, 27, 21–31. doi:10.1901/ jaba.1994.27-21 Repp, A. C., Karsh, K. G., Felce, D., & Ludewig, D. (1989). Further comments on using hand-held computers for data collection. Journal of Applied Behavior Analysis, 22, 336–337. doi:10.1901/jaba.1989.22-336 Riordan, M. M., Iwata, B. A., Finney, J. W., Wohl, M. K., & Stanley, A. E. (1984). Behavioral assessment and treatment of chronic food refusal in handicapped children. Journal of Applied Behavior Analysis, 17, 327–341. doi:10.1901/jaba.1984.17-327 Saunders, M. D., Saunders, J. L., & Saunders, R. R. (1993). A program evaluation of classroom data collection with bar codes. Research in Developmental Disabilities, 14, 1–18. doi:10.1016/0891-4222(93)90002-2 Schrandt, J. A., Townsend, D. B., & Poulson, C. L. (2009). Teaching empathy skills to children with autism. Journal of Applied Behavior Analysis, 42, 17–32. doi:10.1901/jaba.2009.42-17 Scott, D., Scott, L. M., & Goldwater, B. (1997). A performance improvement program for an internationallevel track and field athlete. Journal of Applied Behavior Analysis, 30, 573–575. doi:10.1901/jaba.1997.30-573 Silverman, K., Wong, C. J., Needham, M., Diemer, K. N., Knealing, T., Crone-Todd, D., . . . Kolodner, K. (2007). A randomized trial of employment-based reinforcement of cocaine abstinence in injection drug users. Journal of Applied Behavior Analysis, 40, 387–410. doi:10.1901/jaba.2007.40-387 Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Physical Therapy, 85, 257–268. Skinner, B. F. (1956). A case history in scientific method. American Psychologist, 11, 221–233. doi:10.1037/ h0047662 Stickney, M. I., & Miltenberger, R. G. (1999). Evaluation of direct and indirect measures for the functional assessment of binge eating. International Journal of Eating Disorders, 26, 195–204. doi:10.1002/(SICI)1098108X(199909)26:23.0.CO;2-2 Sundberg, M. L., & Michael, J. (2001). The benefits of Skinner’s analysis of verbal behavior for children with autism. Behavior Modification, 25, 698–724. doi:10.1177/0145445501255003 Therrien, K., Wilder, D. A., Rodriguez, M., & Wine, B. (2005). Preintervention analysis and improvement 149


of customer greeting in a restaurant. Journal of Applied Behavior Analysis, 38, 411–415. doi:10.1901/ jaba.2005.89-04 Touchette, P. E., Macdonald, R. F., & Langer, S. N. (1985). A scatterplot for identifying stimulus control of problem behavior. Journal of Applied Behavior Analysis, 18, 343–351. doi:10.1901/jaba.1985.18-343 Twohig, M. P., Masuda, A., Varra, A. A., & Hayes, S. C. (2005). Acceptance and commitment therapy as a treatment for anxiety disorders. In S. M. Orsillo & L. Roemer (Eds.), Acceptance and mindfulnessbased approaches to anxiety: Conceptualization and treatment (pp. 101–129). New York, NY: Kluwer. doi:10.1007/0-387-25989-9_4 Twohig, M. P., Shoenberger, D., & Hayes, S. C. (2007). A preliminary investigation of acceptance and commitment therapy as a treatment for marijuana dependence in adults. Journal of Applied Behavior Analysis, 40, 619–632. doi:10.1901/jaba.2007.619-632

in children. Journal of Applied Behavior Analysis, 26, 53–61. doi:10.1901/jaba.1993.26-53 Wallace, M. D., Iwata, B. A., & Hanley, G. P. (2006). Establishment of mands following tact training as a function of reinforcer strength. Journal of Applied Behavior Analysis, 39, 17–24. doi:10.1901/ jaba.2006.119-04 Wilder, D. A., Zonneveld, K., Harris, C., Marcus, A., & Reagan, R. (2007). Further analysis of antecedent interventions on preschoolers’ compliance. Journal of Applied Behavior Analysis, 40, 535–539. doi:10.1901/ jaba.2007.40-535 Wolf, M. M. (1978). Social validity: The case for subjective measurement or how applied behavior analysis is finding its heart. Journal of Applied Behavior Analysis, 11, 203–214. doi:10.1901/jaba.1978.11-203

VanWormer, J. J. (2004). Pedometers and brief e-counseling: Increasing physical activity for overweight adults. Journal of Applied Behavior Analysis, 37, 421–425. doi:10.1901/jaba.2004.37-421

Wong, C. J., Sheppard, J.-M., Dallery, J., Bedient, G., Robles, E., Svikis, D., & Silverman, K. (2003). Effects of reinforcer magnitude on data-entry productivity in chronically unemployed drug abusers participating in a therapeutic workplace. Experimental and Clinical Psychopharmacology, 11, 46–55. doi:10.1037/10641297.11.1.46

Vollmer, T. R., Marcus, B. A., Ringdahl, J. E., & Roane, H. S. (1995). Progressing from brief assessments to extended experimental analyses in the evaluation of aberrant behavior. Journal of Applied Behavior Analysis, 28, 561–576.

Wright, K. M., & Miltenberger, R. G. (1987). Awareness training in the treatment of head and facial tics. Journal of Behavior Therapy and Experimental Psychiatry, 18, 269–274. doi:10.1016/0005-7916(87) 90010-3

von Eye, A., & von Eye, M. (2008). On the marginal dependency of Cohen’s κ. European Psychologist, 13, 305–315. doi:10.1027/1016-9040.13.4.305

Young, J., Zarcone, J., Holsen, L., Anderson, M. C., Hall, S., Richman, D., . . . Thompson, T. (2006). A measure of food seeking in individuals with Prader-Willi syndrome. Journal of Intellectual Disability Research, 50, 18–24. doi:10.1111/j.1365-2788.2005.00724.x

Wagaman, J. R., Miltenberger, R. G., & Arndorfer, R. E. (1993). Analysis of a simplified treatment for stuttering


Chapter 7

Generality and Generalization of Research Findings
Marc N. Branch and Henry S. Pennypacker

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication. (Cohen, 1994, p. 997)

Confirmation comes from repetition. . . . Repetition is the basis for judging . . . significance and confidence. (Tukey, 1969, pp. 84–85)

As the general psychology research community becomes increasingly aware (e.g., Cohen, 1994; Loftus, 1991, 1996; Wilkinson & Task Force on Statistical Inference, 1999) of the limitations of traditional group designs and statistical inference methods with regard to assessing reliability and generality of research findings, we present an alternative approach that has been substantially developed in the branch of psychology now known as behavior analysis. In this chapter, we outline how individual subject methods, that is, so-called single-case designs, provide straightforward and, in principle, simple methods to assess the reliability and generality of research findings.

Overview

The chapter consists of three major sections. In the first, we summarize the limitations of traditional methods, especially as they relate to assessing reliability and generality of research findings concerning behavior. We make the case that traditional methods have obscured an important distinction that has led to psychology’s consisting of

two related, but separable, subject matters, behavioral science and actuarial science. We also focus on the issue of generality across individuals and how traditional methods can give the illusion of such generality. In the second major section, we discuss dimensions of generality in addition to generality across individuals. Here we define scientific generality and several other forms of generality as well. In so doing, we introduce the roles of replication, both direct and systematic, in assessing generality of research results. We argue that replication, instead of statistical inference, is an alternative primary method for determining not only the reliability of results but also for assessing and characterizing the generality of scientific findings. In the third major section, we discuss generalization of treatment effects, the fundamentals of technology transfer, and the practices that characterize translational research. There, we write of programming for and assessment of generalizability of scientific findings to applied settings. We expand our view then to the engineering issues of technology development (or technology transfer and translational research) as a capstone demonstration of generalization based on an understanding of generality of research findings.

Limitations of Traditional Methods

The traditional group-mean, statistical-inference approach to analyzing research results has faced

Preparation of this chapter was supported by National Institute on Drug Abuse Grant DA004074.



consistent criticism for more than 4 decades (e.g., Bakan, 1966; Carver, 1978; Cohen, 1994; Gigerenzer, Krauss, & Vitouch, 2004; Loftus, 1991, 1996; Meehl, 1967, 1978; Nickerson, 2000; Rozeboom, 1960). Most of that criticism has focused on what those methods have to say about the reliability of research findings, which is appropriate because if findings are not reliable, there is no need to assess their generality. These methods, however, have also been criticized with respect to theory testing and development, issues that directly relate to generality. We treat these two categories of criticism separately.

Significance Testing and Reliability

After all of the carefully reasoned criticism of significance testing that has been published, one would hope that a clear understanding of its limits would exist among professional psychologists. That, however, appears not to be true, as noted by Cohen (1994), who lamented that

after 4 decades of severe criticism, the ritual of null hypothesis significance testing . . . still persists. [As does] near universal misinterpretation of p as the probability that H0 is false, [and] the misinterpretation that its complement is the probability of successful replication. (p. 997)

Cohen’s assertion is supported by survey evidence revealing that a substantial majority of academic research psychologists incorrectly interpret p values and statistical significance (Haller & Krauss, 2002; Kalinowski, Fidler, & Cumming, 2008; Oakes, 1986). That a significant proportion of professional psychologists do not appreciate what statistical significance and, especially, p values represent is apparent testimony to a weakness in the training of research psychologists, a failing that lies at the feet of those of us who are engaged in teaching them. In fact, Haller and Krauss (2002) included a sample of statistical methodology instructors in their study and found that 80% of them were mistaken in their understanding of p values, so it comes as less of a surprise that the misconceptions are widespread. The following discussion, therefore, is another attempt to make clear what a p value is and what it means.

A p value, which results from a significance test, is a conditional probability. Specifically, it is the probability, if the null hypothesis is true, of obtaining data of a particular sort. That is, in algebraic symbols, it is p = P(Data|H0). The important point is that p ≠ P(H0|Data), which is what a researcher would presumably really like to know. In other words, a p value does not provide quantitative information about whether the null hypothesis is true, which is apparently widely misunderstood. Because it does not provide the oft-assumed information about the likelihood of the null hypothesis being true, a p value of .01 does not mean that the probability of the null hypothesis being true is 1 in 100. In fact, it conveys nothing quantitative about the truth of the null hypothesis. To see why, note that changing the order of conditionality in a conditional probability is crucially important. Consider such examples as P(Dead|Electrocuted) versus P(Electrocuted|Dead) or P(Cloudy|Raining) versus P(Raining|Cloudy). The first probability in each pair tells nothing about the second, just as P(Data|H0) reveals nothing about P(H0|Data). A p value, therefore, has quantitative meaning only if the null hypothesis is true, but when performing statistical tests not only does one not know whether the null hypothesis is true, one probably assumes it is not. The important fact is that a finding of statistical significance, via a small p value, does not imply that the null hypothesis is unlikely to be true. The incorrect logic underlying the mistaken conclusion (cf. Falk & Greenbaum, 1995) apparently goes as follows: If the null hypothesis is true, data of a certain sort are unlikely. I obtained data of that sort, so therefore the null hypothesis is unlikely to be true. That so-called logic is precisely the same as the following: If the next person I meet is an American, he or she is unlikely to be the President. I just met the President. Therefore, he or she is unlikely to be an American. The fundamental misunderstanding of what a p value is leads directly to the more serious problem of assuming that it indicates something quantitative about the reliability, that is, the likelihood of replication, of the finding. A common misunderstanding (see Haller & Krauss, 2002, and Oakes, 1986, for evidence) is that a p value, for example of .01, is the


complement of the probability of replication should the experiment be repeated. That is, the mistaken assumption is that if one conducted the experiment 100 times, one should replicate the result on 99 of those occasions (at least on average). If one knew that the null hypotheses were true, then that would be a correct interpretation of the p value. Of course, though, one does not know whether H0 is true (again, one usually hopes it is not). In fact, one conducts the statistical test so that one can make what one (mistakenly) hopes is an educated guess about whether it is true. Thus, to say on the basis of a small p value that a result is statistically reliable is to strain the meaning of reliable beyond reasonable limits. This limitation of statistical significance is not based on technical details of the null hypothesis. That is, the problem does not lie with whether the underlying distribution is formally normal or near normal or whether the statistical test involved is demonstrably robust with respect to violations of assumptions about the underlying distribution. The limitation is based in the logic of the approach. All the assumptions about the distributional characteristic null hypothesis might in fact be true, but that is not relevant when one is speaking of what a p value indicates. A major limitation of statistical significance, therefore, is that it does not provide direct information about the reliability of research findings. Without knowledge about reliability, no examination of generality can occur because repeatability is the most basic test of generality. Notwithstanding that limitation, however, significance testing based on group means may be seen, incorrectly, to have implications for generality of findings across subjects. Adherence to this view unfortunately gains strength as sample size increases. In fact, however, regardless of sample size, no information about intersubject generality can be extracted from a ­significance statement because no knowledge is afforded concerning the number of subjects for whom the effect actually occurred. We examine the implications of this fact in more detail below. Aside from the limits surrounding reliability just described, other characteristics of group-mean data warrant examination as we move into a discussion

of generality. It is here that we show that psychology, presumably because of the widespread use of significance testing, has developed two distinguishable subject matters.
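The difference between P(Data|H0) and P(H0|Data) can also be made concrete with a small simulation. The sketch below is ours and rests entirely on assumed values (a 50% prior probability that the null hypothesis is true, a true standardized effect of 0.5 when it is false, and 20 observations per group); it merely counts how often the null hypothesis is in fact true among the experiments that reach p < .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group, effect_size = 20_000, 20, 0.5

h0_true_and_significant = significant = 0
for _ in range(n_experiments):
    h0_true = rng.random() < 0.5                 # assumed 50% prior that the null is true
    shift = 0.0 if h0_true else effect_size
    group_a = rng.normal(0.0, 1.0, n_per_group)
    group_b = rng.normal(shift, 1.0, n_per_group)
    if stats.ttest_ind(group_a, group_b).pvalue < .05:
        significant += 1
        h0_true_and_significant += h0_true

print(f"P(H0 true | p < .05) = {h0_true_and_significant / significant:.2f}")
# Under these particular assumptions the printed proportion is roughly .13,
# far from the naive reading of .05, and it shifts whenever the prior,
# effect size, or sample size is changed.
```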

Significance Testing and Generality

Traditional significance testing approaches in psychology are generally based on data averaged across individuals. As is well known, the mean from a group of individuals (a sample) provides an estimate of the mean of the entire population from which the sample is drawn, and that estimate can be bounded by confidence intervals that provide information (not the probability, however, that the population mean falls within the interval; see Smithson, 2003) about how confident one can be that the population mean lies within such intervals. Thus, the sample mean provides information about a parameter that applies to the entire population. That fact appears to imply substantial generality; it applies to the entire population (however delimited), so generality appears maximized. This raises two important issues. First is the question of representativeness of the means, both sample and population. That is, identical or similar means can result from substantially different distributions of scores. Two examples that illustrate this fact are given in Figures 7.1 and 7.2. In Figure 7.1, four distributions of 20 scores are arrayed horizontally in the upper panel. In the top row, the values are arithmetically separated, whereas in the other three, they are clustered in various ways. Note that none of the four is particularly “normal” in appearance, that is, clustered in the middle. The four plots in the lower panel show—with the top plot corresponding to the top distribution in the upper panel, and so on—the means (solid points) and standard deviations (bars) of the four distributions. They are, as planned, identical. These data show that identical means and standard deviations, the stock in trade of inferential statistics, can be obtained from very different distributions of values. That is, in these cases the means and standard deviations do not provide a particularly informative or representative indication of what the individual values are, which implies that when dealing with averages of measures, or averages across individuals, attention must be paid to the representativeness of


Figure 7.1.  Upper panel: Four distributions of values, with each symbol representing one value on the x-axis. Lower panel: The corresponding means and standard deviations of the four corresponding distributions from the upper panel. From The Elements of Graphing Data (rev. ed., p. 215), by W. S. Cleveland, 1994, Summit, NJ: Hobart Press. Copyright 1994 by AT&T Bell Laboratories. Reprinted with permission.


Figure 7.2.  Anscombe’s quartet. Each of the four graphs shows 11 x–y pairs and the best-fitting (least-squares estimate) straight line through the points. The slopes and intercepts of the lines are identical. From “Graphs in Statistical Analysis,” by F. J. Anscombe, 1973, American Statistician, 27, pp. 19–20. Copyright 1973 by the American Statistical Association. Adapted with permission. All rights reserved.


the mean, not just its value, or even its standard deviation.

Figure 7.2, which contains what is known as Anscombe’s quartet (Anscombe, 1973), provides an even more dramatic illustration of how focusing only on the average of a set of numbers can lead one to miss important features of that set. The four graphs in Figure 7.2 plot 11 values in x–y coordinates and show the best-fitting (via the least-squares method) straight line to the data. Obviously, the distributions of points are quite different in the four sets. Nevertheless, the means for the x values are all the same, as are their standard deviations. The same is true for the y values (yielding eight instances of the sort shown in Figure 7.1). In addition, the slopes and intercepts of the straight lines are identical for all four sets, as are the sums of squared errors and sums of squared residuals. Thus, all four would yield the same correlation coefficient describing the relation between x and y.

The point of these illustrations is to indicate that a sample mean, even though a predictor of a population mean, is not necessarily a good description of individual values, so it is not necessarily a good indicator of the generality across individual measures. When the measures come from individual people (or nonhuman animals), it follows that the average of the group may not reveal, and may well conceal, much about individuals. It is important to remember, therefore, that sample means from a group of individuals permit inferences about the population average, but these means do not permit inferences to individuals unless it is demonstrated that the mean is, in fact, representative of individuals. Surprisingly, it is rare in psychology to see the issue of representativeness of an average even mentioned, although recently, in the domain of randomized clinical trials, the limitations attendant to group averages have been gaining increased mention (e.g., Penston, 2005; Williams, 2010).

Many experimental designs, nevertheless, involve comparison across groups with large numbers of subjects, which raises the question of the practicality of presenting the data of every individual. The concern is legitimate, but the problem is not solved by resorting to the study of group averages only. Excellent techniques for comparing distributions,

like stem-and-leaf plots, box plots, and quantile–quantile plots, are available (Cleveland, 1994; Tukey, 1977). They provide a more complete description of measures from individuals, or a useful subset (as can be the case with quantile–quantile plots), than do simple means and standard errors or means and confidence intervals. We presume that as null-hypothesis significance-testing approaches become less prevalent, more effort will be directed toward developing new and better techniques for comparing distributions, methods that will include and make evident the measures from individuals.
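The general point of Figures 7.1 and 7.2 is easy to reproduce with contrived values. In the sketch below (our own illustration, not data from the figures), two samples are constructed to share exactly the same mean and standard deviation yet have very different shapes; percentiles, which underlie quantile–quantile plots, make the difference visible where the summary statistics do not.

```python
import numpy as np

# One evenly spread sample and one split into two tight clusters
spread = np.linspace(2.0, 14.0, 20)
clustered = np.concatenate([np.full(10, 4.0), np.full(10, 12.0)])

def rescale(x, mean=8.0, sd=4.0):
    # Force a sample to have exactly the target mean and standard deviation
    z = (x - x.mean()) / x.std()
    return mean + sd * z

spread, clustered = rescale(spread), rescale(clustered)

for name, x in (("spread", spread), ("clustered", clustered)):
    pct = np.round(np.percentile(x, [10, 25, 50, 75, 90]), 1)
    print(f"{name:9s} mean={x.mean():.2f} sd={x.std():.2f} percentiles={pct}")
# Identical means and standard deviations, but the percentiles (and any plot of
# the raw values) show one smooth distribution and one with a hole in the middle.
```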

Two Separable Subject Matters for Psychology?

In some instances, the difference between a population parameter, such as the population average, and the activity of an individual is obvious. For example, consider the average rate of pregnancy in women between 20 and 30 years old. Suppose that rate is 7%. That, of course, is a useful statistic and can be used to predict how many women in that age category will be pregnant. More important for the present purposes, however, is that the value, 7%, applies to no individual woman. That is, no woman is 7% pregnant. A woman is either pregnant or she is not.

What of situations, however, in which an average is representative of the behavior of individuals? For example, suppose that a particular teaching technique is discovered to result in a 10% increase in performance on some examination and that the improvement is at or near 10% for every individual. Is that not a case in which a group average would permit estimation of a population mean that is, in fact, a good descriptor of the effect of the training for individuals and, because it applies to the population, has wide generality? The answer is yes and no. The point to be made here is somewhat subtle, and so we elaborate on it with an example.

Consider a situation in which a scientist is trying to determine the relation between amount of practice at solving five-letter anagrams and subsequent speed at solving six-letter anagrams. Suppose, specifically, that no practice and 10, 50, 100, and 200 anagrams of practice are to be compared. After the practice, subjects who have never previously solved anagrams, except for those seen in the practice phase, are given 50 new


anagrams to solve, and the time to complete is recorded. Because total practice might be a determinant of speed, the scientist opts to use a between-groups design, with each group being exposed to one of the practice regimens. That is, the hope is to extract the seemingly pure relation between practice and later speed, uncontaminated by prior relevant practice. The scientist then averages the data from each group and uses those means to describe the function relating amount of practice to speed of solving the new, more difficult anagrams. In an actual case, variability would likely be found among individuals within each group, so one issue would be how representative the average is of each member of each group. For our example, however, assume that the average is representative, even perfectly so (i.e., every subject in a group gives exactly the same value). The scientist has generated a function, probably one that describes an increase in speed of solving anagrams as a function of amount of prior practice. In our example, that function allows us to predict exactly what an individual would do if exposed to a certain amount of practice.

Even though the means for each group are representative and therefore permit prediction about individual behavior, the important point is that the function has no meaning for an individual. That is, the function does not describe something that would occur for an individual because no individual can be exposed to different amounts of practice for the first time. The function is an actuarial account, not a description of a behavioral process. It is, of course, to the extent that the means are representative, a useful finding. It is just not descriptive of a behavioral process in an individual.

To examine the same issue at the level of an individual would require investigation of sequences of amounts of practice, and that examination would have to include experiments that factor in the role of repeated practice. Obviously, such an endeavor is considerably more complicated than the study that generated the actuarial curve, but it is the only way to develop a science of individual behavior. The ontogenetic roots of behavior cumulate over lifetimes. In later portions of this chapter, we discuss how the complications may be confronted.


The point is not to diminish the value of actuarial data, nor to suggest that psychologists abandon the collection and analysis of such data. If means are highly representative, such data can offer predictions at the individual subject level. Even if the means are not highly representative, organizations such as insurance companies and governments can and do make important use of such information in determining appropriate shared risk or regulatory policy, respectively. The point is, using insurance rates as an example, that just because you are in a particular group, for example, that of drivers between the ages of 16 and 25, for which the mean rate of accidents is higher than for another group, does not indicate that you personally are more likely to have an automobile accident. It does mean, however, that for the insurance company to remain profitable, insurance rates need to be higher for all members of the group. Similarly, with respect to health policy, even though most people who smoke cigarettes do not get lung cancer, the incidence of lung cancer, on a relative basis, is substantially greater, on average, in that group. Because the group is large, even a low incidence rate yields a substantial number of actual lung cancer cases, so it is in the government’s, and the population’s, interest to reduce the number of people who smoke cigarettes.

The crux of the matter is that actuarial and behavioral data, although related in that the former depend on the latter, are distinguishable and, therefore, should be distinguished. Psychology, to the extent that it relies solely on the methods of inferential statistics that use averages across individuals, becomes an actuarial science, not a science of behavioral processes. The methods described in this chapter are aimed at including in psychology its oft-stated goal of being a science of behavior (or of the mind). Behavioral and inferred mental processes really make sense only at the level of the individual. (The same is true of physiology, which has become a rather exact science in part because of the influence of Claude Bernard, 1865/1957.) A person’s behavior, including thinking, imagining, and so forth, is particular to that person. That is, people do not share their minds or their behavior with others, just as they do not share their physiology. A counterargument


is that behavior and mental activity are too variable from individual to individual to permit a scientific analysis. We based this chapter on the more optimistic view that such activity is amenable to study at the level of the individual. Because a good deal of application of psychological knowledge involves dealing with individuals, for example, as in psychotherapy, understanding at the level of the individual remains a worthy goal. Support for the viewpoint that a science of individual behavior is possible, however, requires an elaboration of how an individual subject–based analysis can yield information that is more generally applicable to more than one or a few individuals.

Why Single-Case Designs Do Not Mean That N = 1

Traditional approaches, with the attendant limitations described thus far, likely arose, at least in part, because of a legitimate concern about focusing research on individual subjects who are studied repeatedly through time (more on this later). Such research is usually performed with relatively few subjects, leaving open the possibility that effects seen might be limited with respect to generality across other individuals. An example, modeled after one offered by Sidman (1960), provides a response to such misgivings.

Suppose we were interested in whether listening to classical music while solving arithmetic problems improves accuracy. Using a single-case approach, the study is started with a single subject. We might first establish a baseline of accuracy (more on this later) by measuring it over several successive exposures. Next, we would test the subject with the music present and then with it absent. Suppose we find that accuracy is increased when music is present and reverts to normal when it is not. Suppose also that unbeknownst to us, the effect music will have depends on the baseline level of accuracy; if accuracy is initially low, it is enhanced by the presence of music, whereas if it is initially high, it is reduced when the music is on. We might mistakenly conclude, on the basis of the results from the one subject, that music increases accuracy of solving the kinds of arithmetic problems used.

Let us compare how a more traditional between-groups approach might fare in dealing with the issue. We apply music to one group and not to another. What will result will depend on the distribution of baseline accuracy across individuals. Figure 7.3 shows three possible population distributions. In B, most people have low accuracy, in C most have high accuracy, and in A people fall into two groups with respect to baseline accuracy. If one performed the experiment on groups and took the group mean to be the indicator of the effect of the independent variable, the conclusion would depend on the underlying distribution. In A, the conclusion

Figure 7.3.  Three hypothetical frequency distributions characterizing the number of people displaying different baseline rates. From Tactics of Scientific Research: Evaluating Experimental Data in Psychology (p. 149), by M. Sidman, 1960, New York, NY: Basic Books. Copyright 1988 by Murray Sidman. Reprinted with permission.



might well be that music has no effect, with the lowered accuracy in people with high baseline accuracy canceling out the increases that result among those with low baseline accuracy. If the population is distributed as in B, the conclusion would be that music increases accuracy because the mean would move in the direction of improved accuracy. The important point is that simply considering the group average makes it less likely that the baseline dependency that underlies the effect will be seen.

Let us now compare what might transpire with the single-case approach, an approach based on replication. Having seen the effect in the first subject, we recruit a second and do the experiment again. Suppose that the population distribution is as depicted in Figure 7.3B. The most likely scenario is that the second subject will also have low baseline accuracy because someone sampled from the population is most likely to manifest modal characteristics. We get the same result and could, mistakenly, conclude that music enhances arithmetic accuracy. That is, we make the same mistake as with the group-average approach. The difference between the two approaches, however, is that the group mean approach makes it more difficult to discover the underlying, real effect. The single-case approach, however, if enough replications are done, will eventually and inevitably reveal the problem because sooner or later someone with high baseline accuracy will be examined and show a decrease. A key phrase in the previous sentence is “if enough replications are done.” Whether that happens is likely to depend on the perceived importance of the effect. If it is deemed important, it is likely to be subjected to additional research, which will, in turn, lead to additional replications. Thus, the single-case approach is not some sort of panacea with respect to identifying such relations, but it offers a direct path to corrective action. Of course, it is possible to ferret out the baseline dependency using a group-mean approach, but that will happen only if attention is paid to the data of individual subjects in a group. In the single-case approach, those data are automatically scrutinized.

A major point is that single case does not necessarily imply that only one or even only a few subjects be examined. Some research questions might involve examination of many subjects. (We

discuss later how to decide how many subjects to test.) What the approach involves is studying each subject essentially as an independent experiment. Generality across subjects is therefore examined directly by seeing how often the experiment’s effects are replicated. A second major point is that the apparent virtues of studying many subjects, a standard aspect of traditional research designs in psychology, are realized only if the data from each subject are individually analyzed.
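The music example can be simulated directly. The sketch below assumes the fictitious baseline-dependent rule described above and a bimodal population of baseline accuracies like that in panel A of Figure 7.3; the group average suggests no effect, whereas inspecting each subject reveals two opposite effects.

```python
import numpy as np

rng = np.random.default_rng(1)

def music_effect(baseline):
    # Hypothetical rule from the example: music helps low-baseline solvers
    # and hurts high-baseline solvers (values are arbitrary)
    return 10.0 if baseline < 50 else -10.0

# Bimodal population: half low baseline accuracy, half high
baselines = np.concatenate([rng.normal(30, 5, 50), rng.normal(80, 5, 50)])
with_music = baselines + np.array([music_effect(b) for b in baselines])
change = with_music - baselines

print(f"mean change for the group: {change.mean():+.1f}")      # close to zero
print(f"subjects who improved: {(change > 0).sum()}, who worsened: {(change < 0).sum()}")
# Averaging across subjects hides the baseline dependency; replicating the
# experiment subject by subject exposes it as soon as a high-baseline subject is run.
```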

Null-Hypothesis Significance Testing and Theory Development

A major goal in any science is the development of theory, and there is a sense in which theory has clear relevance to generality. Effective theories are those that account for a wide array of research results. That is, they apply generally. The way in which significance testing is most commonly used in psychology, however, militates against the orderly development and testing of theories and against the analysis of competing theories. The problem was first identified as a paradox by Meehl (1967; see also Meehl, 1978). The problem is a logical one based largely on the choice of the null hypothesis as “no effect.”

The logic of the common approach is as follows. An investigator has a hypothesis that imposition of a variable, X, will change another measure, Y. This hypothesis is sometimes called the alternative hypothesis. The null hypothesis is then chosen to be that X will not change Y, that is, that it will be without effect. Next, the X condition is imposed, and Y is measured. A comparison is then made of Y without X and Y with X. A statistic is then calculated that is generally a ratio of changes in Y as a result of X over changes in Y as a result of anything else. In more technical terms, the statistic is effect variance over error variance. The larger the statistic, the smaller the p value, and the more likely it is that statistical significance is achieved and the null hypothesis rejected. Standard teaching demands that even though one can decide to reject the null hypothesis, logic prevents one from accepting the alternative hypothesis. Instead, one would say that if the null hypothesis is rejected, the alternative hypothesis gains support.

The paradox noted by Meehl (1967) arises from the nature of the statistic itself. The size of the


statistic is controlled by two values, the effect size and the error variance, so it can be increased in two ways. The way of interest for this discussion is via a decrease in error variance, the denominator. A major way of decreasing error variance is through increased experimental rigor (one avenue of which is to increase the number of subjects). To the degree that extraneous variables (the “anything else” mentioned earlier) can be eliminated or held constant, error variance should decrease, making it more likely that the statistic will be large enough to warrant a decision as to statistical significance. The paradox, therefore, is that as experimental rigor is increased—that is, as experimental techniques are refined and improved—statistical significance becomes more likely, with the consequence that the alternative hypothesis gains support, no matter what the alternative hypothesis is. That does not seem like a recipe for cumulative progress in science. Simple null-hypothesis significance testing with the null hypothesis set at no effect cannot, by itself, help to develop theory.

Meehl (1967) described one approach that can obviate this paradox, which is to use significance testing with a null hypothesis that is not “no effect.” Instead, the null hypothesis is what the theory (or alternative hypothesis) predicts. Consider how the logic operates when this tactic is used. As experimental rigor increases, error variance is decreased, making it more likely that the resulting statistic will reach a critical value. When that value is achieved, the null hypothesis is rejected, but in this case it is the investigator’s theory that is rejected. Rather than increased experimental rigor resulting in its being easier for one’s theory to gain support, it results in its being easier to reject one’s theory. Increasing experimental control puts the theory to a more rigorous test, not an easier one as is the case when using the no-effect, or no-difference, null hypothesis. The harder one works to reject a theory and fails to succeed, the more confidence one has in the theory.

Training in statistical inference, at least for psychologists, does not usually emphasize that the null hypothesis need not be no effect. It can, nevertheless, as just noted, be some particular effect. Note that it has to be some specific value other than zero.
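A small numerical sketch may help. It assumes a one-sample design in which a theory predicts an effect of 5 units, the true effect is actually 4 units, and the standard deviation of scores is 10 (all values arbitrary); the two choices of null hypothesis then behave very differently as precision increases.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
prediction, true_effect, sd = 5.0, 4.0, 10.0

for n in (10, 100, 1_000, 10_000):          # increasing experimental precision
    scores = rng.normal(true_effect, sd, n)
    p_zero = stats.ttest_1samp(scores, 0.0)[1]           # null: no effect
    p_theory = stats.ttest_1samp(scores, prediction)[1]  # null: the theory's prediction
    print(f"n={n:6d}  p(null = 0): {p_zero:.4f}   p(null = prediction): {p_theory:.4f}")
# With the no-effect null, larger n makes rejection (and hence "support" for any
# alternative) ever easier. With the theory's prediction as the null, larger n
# makes the test harder to pass unless the prediction is close to the truth.
```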

The use of a particular value as the null hypothesis therefore requires that one’s theory be quantitative enough to generate a specific value. This approach is what characterizes tests of goodness of fit (those that use significance tests) of quantitatively specified functions. This approach of setting the null hypothesis at a value predicted by theory is nevertheless not immune to the previously described weaknesses of significance testing in general. If, however, significance testing is used to make decisions, at least this latter approach does not suffer from the weakness of making it easier to support a researcher’s theory, regardless of what it is, as methods improve.

In this section of the chapter, we have made the case, we hope, that commonly used psychology research methods have limitations in assessing reliability and generality of research findings. In addition, the methods have resulted in many areas of psychology being largely actuarial, group-average–focused science rather than aimed at the behavior of individuals. In the next section, we describe the basics of an alternative approach that is based on replication rather than significance testing and group averages. It is useful to remember that important science was conducted before the invention of significance testing, and what follows is a description of the application of methods used to establish most of modern physics and chemistry (and physiology) to the study of behavior. The approach focuses on understanding behavioral processes, rather than actuarial ones, and has already yielded a good deal of success, as other chapters in Volume 2 of this handbook illustrate. We should note, nevertheless, that even if the goal is actuarial prediction and influence, the methods of statistical inference are limited in what they can achieve with respect to reliability of research findings. As we argue, the only sure way to examine reliability of results is to repeat them, so replication is the key strategy for both subject matters of psychology.

Assessing Reliability and Generality via Replication

The two distinguishable categories of replication are direct replication and systematic replication, although, as we show, the distinction is not a sharp one. Most researchers are familiar with the concept of direct replication, which refers to repeating an experiment as exactly as possible. If the results are the same or similar enough, the initial effect is said to be replicated. Direct replication, therefore, is mainly used to assess the reliability of a research finding, but as we show, there is a sense in which it also provides information about generality. Systematic replication is the designation for a repetition of the experiment with something altered to see whether the effect can be observed in changed circumstances. If the results are replicated, then the generality of the finding is extended to the new circumstances. Many varieties of systematic replication exist, and it is the strategy most relevant to examining the generality of research findings.

Direct Replication: Within-Subject Reliability and Baselines

In the first part of this section, we describe the methods and roles of direct replication with the same experimental subject (i.e., a truly single-case experiment). We open with this simplest case, and with an example, not only to illustrate how the strategy can be used, but also to deal more clearly with reservations about and limitations of the approach as well as how decisions about characteristics of the replicative process may be made.

For our example, suppose that we want to measure the amount of a certain kind of food eaten after some period without food. We let our subject eat after 12 hours of fasting; suppose that she eats 250 grams. Direct replication of this observation would require that we do the same test again. One possible, but unlikely, result would be that she would eat 250 grams again, providing an exact replication. The amount eaten would more likely be slightly different, say 245 grams. We might then conduct another replication to see whether the trend toward eating less was replicable. Suppose on that occasion our subject eats 257 grams, making it less likely that there is a trend toward less ingestion with successive tests. We could repeat the process again and again. By repeatedly observing the amount eaten after a 12-hour fast, we gain more confidence with each successive measurement about how much our subject will eat of that particular food after 12 hours of not eating.

One thing that direct replication can provide, via a sequence of direct, intrasubject replications such as that just described, is a baseline. The left segment of Figure 7.4 shows that there appears to be a steady baseline amount of intake in our example. A question that might arise is how many observations are needed to establish a baseline, that is, to come up with a convincing assessment? The answer is that it depends. There is no rule or convention about how many replications are needed to render an outcome considered reliable in the eyes of the scientific community. One factor of importance is how much is already known. In some of the more advanced physical sciences, a single replication (usually by a different research team) might be adequate. In our example, the researcher might have conducted similar research previously, discovered that the baseline value does not change after 10 observations, and thus deemed 10 replications enough. The researcher who chooses replication as a strategy to determine reliability of findings, therefore, does not have the comfort of a set of conventions (akin to those available to investigators who use conventional levels of statistical significance) to decide whether to conclude if an effect is reliable enough to warrant reporting to the scientific community. Instead, the investigator’s judgment plays a role, and his or her scientific reputation is dependent to some degree on


Figure 7.4.  Hypothetical data from a series of observations of eating. The first 10 points and last six points are amounts eaten of Food 1. The middle six points are amounts eaten of Food 2.


that judgment. One of the comforts of a set of conventions is that if a researcher abides by them and results are later discovered, via failed attempts at replication, not to be reliable, that researcher’s reputation suffers little. In contrast, one can argue that there are both advantages and disadvantages to relying on replication. Important advantages are having the benefit of informed judgment, especially of a seasoned investigator, and the fact that social pressure rides more directly on the researcher’s reputation. The disadvantage comes from the lack of an agreed-on set of conventions. Principled arguments about which is better for science can be made for both positions, but we favor the view that science, as a social–behavioral activity, will fare better, or at least no worse, if researchers are held more accountable for their conclusions about reliability and generality than for their adherence to a set of arbitrary, often misunderstood conventions.

Returning to the role of a baseline construed as a set of intrasubject replications, such baselines can serve as bases of comparison for effects of experimental changes. For example, after establishing a baseline of eating of the first food, we could change what the food is, perhaps to one less tasty or more or less calorie laden. The second set of points in Figure 7.4, which in essence depict measures from a second set of replications, have been chosen to indicate a decrease. The reliability of the effect is illustrated by the successive similarity of values, and judgments about how many replications are needed would be based on the same sorts of considerations as involved in the original baseline. A usual check would involve return to the original food, and the third set of points indicates a likely result, once again with a series of replications. The overall experiment, therefore, is an example of the ubiquitous A-B-A design (see Chapter 1, this volume).

Replication, of course, need not refer only to a series of successive measurements under identical conditions to produce a baseline. If the type of finding summarized in Figure 7.4 were especially counterintuitive or at considerable odds with existing knowledge, one might well repeat the entire project, Food 1 to Food 2 to Food 1, and that, too, would constitute a direct intrasubject replication. In fact, the entire project could be carried out multiple

times if, in the investigator’s judgment, such confirmation was necessary. Each successful replication increases confidence that the independent variable, change of food type, is responsible for the change in eating.
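One way to make judgments about a baseline explicit is to state a stability criterion, even though, as noted above, no conventional rule exists. The sketch below applies one such criterion, chosen arbitrarily for illustration, to hypothetical intake values like those plotted in Figure 7.4.

```python
import numpy as np

def stable(series, k=5, tolerance=0.05):
    """One possible (not standard) criterion: the last k observations all fall
    within +/- tolerance of their own mean."""
    recent = np.asarray(series[-k:], dtype=float)
    return bool(np.all(np.abs(recent - recent.mean()) <= tolerance * recent.mean()))

# Hypothetical grams eaten across the A-B-A sequence
baseline_food1 = [250, 245, 257, 252, 249, 251, 248, 253, 250, 252]
food2 = [210, 205, 198, 202, 200, 203]
return_food1 = [247, 251, 249, 252, 250, 248]

for label, phase in (("A  (Food 1)", baseline_food1),
                     ("B  (Food 2)", food2),
                     ("A' (Food 1 again)", return_food1)):
    print(f"{label:18s} mean = {np.mean(phase):5.1f}   stable = {stable(phase)}")
# A stable return toward the original level when Food 1 is reinstated is what
# supports attributing the change in intake to the change in food.
```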

Direct Replication: Between-Subjects Reliability and Generality

After all this work, an immediate limitation is that the findings, so far as we know, may well apply only to the one person studied. Our first result is based on intrasubject replication. If the goal of the research was to see whether the change in food can influence eating, then it may be the case that no further replication is needed. It is likely, however, that our interest extends beyond what is possible to what is common. In that case, additional replication is in order, which brings us to the next type of direct replication, replication with different subjects, or intersubject replication. Intersubject replication is used to examine generality, in this case across subjects, and in this single-case design N is extended to more than 1.

Intersubject replication makes clear the fuzziness of the distinction between direct and systematic replication. The latter is generally defined as a replication with something changed (see below), and a new subject is certainly a change. We also suggest that systematic replication is a main strategy for assessing generality, and by studying a second subject, generality across individuals is on trial. It is even possible to suggest that most replications, even intrasubject replications, are, in fact, systematic. For example, in the intrasubject replication described above, time is different for successive observations, and the subject brings a different history to each observation period. It nevertheless has become standard to characterize replications in which the procedures are essentially the same as direct replications. As we outline shortly, systematic replications are characterized by changes in procedure or conditions that can be quite substantial.

As noted in the section Significance Testing and Generality earlier in this chapter, an emphasis on replication with individual subjects approaches the issue of subject generality by increasing the number of subjects studied. Suppose, for the sake of our example, we study a second subject, performing the


entire experiment, baseline, new food, baseline, and the whole sequence, over again. There are two major classes of outcomes. One, we get the same effect. Two, we do not. Let us deal initially with the former possibility. The first issue is what we would accept as “same.” The second person’s baseline level would likely not be exactly the same, and in fact, it might be notably different, averaging, say, 200 grams. Should we count that as a failure to replicate? The answer is (again), it depends. If our major concern was the exact amount eaten and the factors contributing to that, then the result might well be considered a failure to replicate. We will hold off for a bit, however, on what to do in the face of such failures, and move forward on the assumption that we are not so much concerned with the exact amount eaten as with whether the change in food results in a change in amount eaten.

In that case, we might replicate, with the second subject, the whole sequence of conditions, Food 1, Food 2, and back to Food 1. Two possibilities exist: The results are the same as for the first subject or they are not, and again, consequently, an important issue is what is meant by same. The results are unlikely, especially in behavioral science, to be identical quantitatively, and, in fact, if the baseline is different, the change in intake cannot be identical in both absolute and relative terms, so we are left to decide whether to focus on what is different or on what is similar. In this stage of the discussion, let us assume that intake decreased, as it had for the first subject. In that case, we might feel confident that an important feature of the data has been replicated. A next question, then, would be whether additional replication with other subjects is needed. In this particular example, the answer would most likely be yes, but as is generally the case, the real answer is that it depends on what the goals of the experiment are. Behavioral scientists, by and large, tend to focus on similarities rather than differences, so if features of data reveal similarity across individuals, those similarities are likely to be pursued.

Consider, therefore, a situation in which the data for the second subject are dissimilar, not only in quantitative terms but in qualitative ones as well. For example, suppose that for the second subject the change from Food 1 to Food 2 results in an increase in amount

eaten rather than a decrease. Here, there is no question that an important aspect of the first result has not been replicated. What is to be done then? The answer lies in the assumption of determinism that is at the core of behavioral science. If there is a difference observed between Subject 1 and Subject 2, that difference is the result of some other influence. That is, people do not differ for no reason. In fact, the failure to replicate the exact intake levels at baseline must also be a result of some factor. Failure to replicate, therefore, is an occasion on which to initiate a search for the variable or variables responsible for the differences in outcomes. Suppose, for example, that Subject 1 was female, and Subject 2 was male. Tests with other men and women (note the expansion of N) could reveal whether this factor was important in determining the outcome. Similarly, we have already assumed different baseline levels, so it might be the case that baseline level is related to the direction of change in intake, a hypothesis that can be examined by studying additional subjects.

It is interesting that examination of this second possibility could be aided if the issue of different baselines between Subject 1 and Subject 2 had been assumed to be a failure to replicate. In that case, we would have focused on reasons for the difference and may have identified factors that determine baseline level. If that were so, it might be possible to control the baseline levels and to change them systematically, thus providing a direct method for studying the relation between baseline level and the effect of changing the food. Another possible reason that disparate effects are observed between subjects is differing sensitivity to the particular value of the independent variable used. In the example just described, the independent variable was characterized qualitatively as a change in food type, making sensitivity to it difficult to assess. If, however, the independent variable can be characterized quantitatively, for instance by carbohydrate content in our example, the technique of systematic replication, elaborated below, can be used to examine the possibility.

An important issue in considering direct replication arises when intersubject replication succeeds but intrasubject replication does not. Taking our example, suppose that when the conditions were


changed back to Food 1 with our first subject (cf. Figure 7.4), eating remained at the lower level, which would prevent replication of the effect in Subject 1. Such a result indicates either that some variable other than the change of food was responsible for the decrease in eating or that the exposure to Food 2 has produced a long-lasting change in eating. Support for the second view can come from attempts at intersubject replication. If experiments with subsequent subjects reveal that a shift from Food 1 to Food 2 results in a relatively permanent decrease in eating, the effect is verified. When initial effects are not recaptured after intervening experience that produces a change, the change is said to be irreversible. Using replication to examine irreversible effects requires intersubject replication, so we have here another instance in which N = 1 does not mean that only one subject need be studied. Many effects in psychology are irreversible, for example, those that we call learning, so the individual subject approach requires that intersubject replication be used to assess the reliability of such effects, and in so doing the generality of the effect across subjects is automatically examined.

A focus on each subject individually, of course, does not prevent the use of traditional data analysis approaches, should an investigator be so inclined (for inferential statistical analyses appropriate to single-case research designs, see Chapters 11 and 12, this volume). Some, for example, might want to present group averages so that actuarial predictions can be made. Standard techniques can be used simply by engaging in the usual sorts of data manipulation. An emphasis on the data from individuals, nevertheless, can be used to enhance the presentation. For example, consider a study by Dunn, Sigmon, Thomas, Heil, and Higgins (2008), who compared two conditions aimed at reducing cigarette smoking. In one, vouchers were given contingent on breath samples that indicated that no smoking had occurred, whereas in the other the vouchers were given independently of whether the subject had smoked. Figure 7.5 shows some of the results. The bars show group means, and the dots show data from each individual, illustrating the degree to which effects were replicable across patients and the representativeness of the group

Figure 7.5.  Number of days of continuous abstinence from smoking cigarettes in two groups of subjects. Circles are data from individuals. Open bars and brackets show the group means and standard errors of those means. Subjects represented by the left bar received vouchers contingent on abstinence, whereas those represented by the right bar received vouchers independent of their behavior. The top bracket and asterisk indicate that the mean difference was statistically significant at the .01 level. From “Voucher-Based Contingent Reinforcement of Smoking Abstinence Among Methadone-Maintained Patients: A Pilot Study,” by K. E. Dunn, S. C. Sigmon, C. S. Thomas, S. H. Heil, and S. T. Higgins, 2008, Journal of Applied Behavior Analysis, 41, p. 533. Copyright 2008 by the Society for the Experimental Analysis of Behavior, Inc. Reprinted with permission.

averages. Such a display of data provides considerably more useful information than do presentations that include only means or results of tests of statistical significance.
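Displays of this kind are straightforward to produce. The sketch below uses invented values rather than the Dunn et al. (2008) data and simply overlays each individual's value on the group mean and its standard error.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Invented days-of-abstinence values for two hypothetical groups of 14 subjects
groups = {"Contingent": rng.integers(0, 15, 14),
          "Noncontingent": rng.integers(0, 6, 14)}

fig, ax = plt.subplots()
for i, (label, data) in enumerate(groups.items()):
    # Group mean with its standard error as an open bar ...
    ax.bar(i, data.mean(), yerr=data.std(ddof=1) / np.sqrt(len(data)),
           color="white", edgecolor="black", capsize=4)
    # ... and every individual's value as a jittered point on top of it
    ax.scatter(i + rng.uniform(-0.15, 0.15, len(data)), data, color="black", zorder=3)

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(list(groups))
ax.set_ylabel("Days of continuous abstinence")
plt.show()
```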

Systematic Replication: Parametric Experiments

To this point, our emphasis has been on the intra- and intersubject generality and reliability of effects, and we have argued that individual subject approaches can be effectively used to assess it. Generality of effects, however, is not limited to generality across individuals, and it is to other forms of generality, culminating with scientific generality, to which we now turn.

As noted earlier, systematic replication refers to replication with something changed, and, as also noted, a case can be made that replication with a new subject is a form of systematic replication in


that it is an experiment with something changed, namely the experimental subject. From such replications come assessments of the across-subject generality of effects. In this section, we discuss other sorts of changes between experiments that constitute systematic replication.

To do so, let us begin again with our example of effects of food type on eating. Suppose that after obtaining the data in Figure 7.4, we perform a systematic replication of the study rather than a direct repetition. For example, we might notice that Food 2’s carbohydrate content is higher than that of Food 1. We decide, therefore, to alter the carbohydrate content of Food 2 (and let us assume, although it is likely impossible, that we can do so without changing the taste) so that it matches that of Food 1, and repeat the experiment. Such an experiment would examine the generality of Food 2’s effect on eating to a new carbohydrate level. If adjusting Food 2’s carbohydrate amount to equal that of Food 1 resulted in the switch in foods having no effect on eating, two things can be concluded. One, the original result was not replicated. In such cases, it is often wise to replicate the original experiment to determine whether unknown variables might have been responsible. Two, carbohydrate amount is identified as a likely important variable. Thus, systematic replication is not only a method for discovering generality of effects, it is also an approach that can lead to finding controlling variables.

Continuing our description of types of systematic replication, let us assume we decide to examine more fully the role of carbohydrates in eating. Our original experiment may be conducted several times but with a different carbohydrate mix for Food 2 on each occasion. Each repetition of the experiment, then, constitutes a systematic replication because a new value of carbohydrate is used for each instance. Experiments that systematically vary the value of a variable are called parametric experiments, and they play an especially important role in assessing generality. Consider the data in Figure 7.6, which are constructed to emulate what might result if several intersubject replications of a parametric experiment were conducted. Parametric examination provides a number of advantages when assessing the reliability and generality of results. First, had only a single value of the

Figure 7.6.  Hypothetical data for three subjects showing the relationship between carbohydrate content and amount eaten.

independent variable been assessed, we might have been less than impressed with the degree of intersubject replicability of the data. The results of parametric examination, however, reveal a good deal of similarity across the three subjects: All show the same basic relation. At low percentages, the amount eaten is roughly constant within each individual. As the percentage increases, the amount eaten decreases until the percentage reaches a value above which further increases are associated with no changes in amount eaten. Second, and this is a key characteristic of parametric evaluation, the data suggest that only a range of levels of the independent variable result in a change in behavior. That is, parametric experiments permit the identification of boundary conditions, or limiting conditions, outside of which a variable is relatively ineffective. As we show later when dealing with the issue of scientific generality, information about boundary conditions can be extremely important. Figure 7.6 also illustrates how parametric experiments can help deal with the problem of lack of intersubject replicability when a single value of an independent variable is examined. Recalling our original example of comparison of food types, consider what could have happened if our first two subjects were Subjects 1 and 3 of Figure 7.6 and Food 1 had contained 20% carbohydrate and Food 2 had contained 25%. Changing the food type would have produced a change for Subject 1 but not for Subject 3,


leading to a conclusion that we had failed to replicate the food change effect across subjects. The parametric examination, however, shows that both subjects are similar in how food intake was influenced by carbohydrate content, except that behavior of the two subjects was sensitive in a slightly different range. One of the most satisfying outcomes of parametric experiments is when they reveal similarities that cannot be judged when only a single value of an independent variable is tested. It is worth noting, too, that parametric experiments can reveal that apparent intersubject replicability can be misleading regarding how a variable influences behavior. It is possible that tests with a single value of an independent variable might lead to very similar quantitative results for several subjects, whereas a parametric analysis reveals that very different functions describing the relation between the independent variable and behavior happen to cross or come close together at the particular value of the independent variable evaluated.

Parametric experiments illustrate one of the strengths of being able to characterize independent variables quantitatively. Experiments that determine how much of this yields how much of that provide more information about generality than do experiments that simply test whether a particular value of an independent variable has an effect. They can identify similarity where none is evident with a single value of an independent variable, and they can also determine whether apparent similarity is unrepresentative.

We should note that parametric experiments are not limited in application to only primary independent variables, such as that shown in our fictitious example. Any variable associated with an experiment can be systematically varied. As an example, the experiment just described could be conducted under a range of temperatures, a range of degrees of hydration of the subjects, a range of times without food before the test, and any of several other variables. Those experiments, too, would provide information about the range of conditions under which the independent variable of carbohydrate content exerts its effects in the circumstances of the experiment.

Parametric experiments, although very important, are not the only kind of systematic replications. One

other type involves using earlier findings as a starting point, or baseline, for examination of other variables. As an example, consider the phenomenon of false memory in the laboratory, produced by a procedure originally developed by Deese (1959) and later elaborated by Roediger and McDermott (1995). In these studies, subjects said they recalled or recognized words that were not presented. A great deal of research followed the original demonstrations, and these experiments varied procedural details, measurement techniques, subject characteristics, and so forth. In each instance, therefore, in which the false memory effect was reproduced, the reliability of the phenomenon was demonstrated and its generality extended. Using the reproduction of previous findings as a starting point for subsequent research, therefore, is a useful and productive technique for examining reliability and generality of research outcomes.

Sidman (1960), in his characterization of techniques of systematic replication, described a type he called “systematic replication by affirming the consequent” (p. 127). Essentially, this approach is very similar to the idea of hypothesis testing because the systematic replication is not based on simply changing some aspect of the experiment to see whether effects can still be reproduced but rather on what the investigator sees to be the implications of previous results. That is, the replication may be based on the investigator’s interpretation of what the data mean. For example, consider our fictitious study of the effects of carbohydrate content on eating. That result, and perhaps those of other experiments, might suggest that the phenomenon is not specific to eating. Carbohydrate ingestion possibly leads to general lethargy or low motivation for voluntary behavior. If we suspect that, we might devise other experiments that could be viewed as systematic replications based on the possible implications of the previous findings. If the results were consistent with the lethargy interpretation, the view would gain in credence; if they were not, the view might well be abandoned. As Sidman (1960) noted, definite conclusions may not be drawn from successful replications by affirming the consequent, but, as he also noted, the approach is essential to science. The degree to which one’s confidence in an interpretation of data grows with successful replications


depends on many things, not the least of which is how counterintuitive the predicted outcome is.

Types of Generality Assessed and Established by Systematic Replication

Johnston and Pennypacker (2009) offered a useful characterization of the dimensions along which generality can be examined. They initially suggested a dichotomy between “generality of” and “generality across.” Generality across is simple to understand. As we have already noted, replication can be used to determine generality across subjects or situations, a type of generality usually of considerable interest. Systematic replication comes to the fore in the assessment of generality across species and across settings. By definition, systematic replication is an attempt at replication with something different, so if the species is changed, or if something (or a lot) about the setting is altered, the replication attempt is a systematic one.

In both cases, the issue of what constitutes a successful replication may arise. Consider, for example, if we decided to attempt a cross-species replication of our experiments with food types, and our new species was a mouse. Obviously, mice would eat considerably less, and therefore a precise, quantitative replication would not be possible. We might (actually, probably would), however, argue that the replication was successful if the relation between carbohydrate content and eating was replicated, that is, if at low concentrations there was little effect on eating, but as carbohydrate content increased, the amount eaten decreased until some level is reached above which further decreases were not seen (cf. Figure 7.6). What if the content values at which the decreases begin and end differ between the species? For example, mice may begin to show a decline when the food reaches 15% carbohydrate, whereas with the humans, decreases are not evident until the food contains 25% carbohydrate. Is that a failure to replicate? Again, the answer is yes and no. The business of science is to find regularities in nature, so emphasis is properly placed on similarities. Differences virtually always exist, so they are easy to find. Nevertheless, they cannot be ignored entirely, but their main role is not to indicate that the similarities evident are somehow unimportant, but rather to

promote further research into the origins of the differences if the differences are judged to be important. The scientist and the scientific community make judgments about the need for further investigation of the differences that are always present in replications.

Generality of also plays an essential role in science. Johnston and Pennypacker (2009) described several categories of generality of, but here we focus on one in hopes of making the concept clear: generality of process. Our example is a behavioral process familiar to most psychologists, specifically the process of reinforcement of operant (purposive) behavior. Reinforcement refers to the increase in likelihood of behavior as a result of earlier instances being followed by certain consequences, which is the process. Systematic replications across an immense range of both behavioral activities and a very large range of consequences have been shown to provide instances of the process. For example, in addition to the traditional lever press and key peck, activities ranging from the electrical activity of an imperceptible movement of the thumb (Hefferline, Keenan, & Harford, 1959), to vocal responses of chicks (Lane, 1960), to generalized imitation in children with developmental delays (Baer, Peterson, & Sherman, 1967), to the extensive range of activities described in the use of reinforcement in the treatment of behavior disorders (e.g., Martin & Pear, 2007; Ullman & Krasner, 1966) have all been shown to be instances of the process. Similarly, the range of events used as effective consequences to produce reinforcement is also broad. Consequences such as praise, food, intravenous drug administration, opening a window, reducing a loud noise, access to exercise, and many, many others have been effectively used to produce reinforcement. All the reports may be viewed as describing systematic replications of the earliest experiments on the process (e.g., Skinner, 1932; Thorndike, 1898).

This generality of process is what stands as the rationale for speaking of reinforcement theory. The argument is similar to that offered for the motion of objects. Whatever those objects are, and whether they are falling, floating, being ballistically projected, or orbiting in outer space, they can be subsumed under the notion of gravitational attraction,


Newton’s theory of gravity. An even more dramatic example is provided by living things. All manner of plants and animals populate the earth, and their differences are obvious and virtually countless. What is less obvious but explains the variety is that all life can be considered to have developed from the operation of three processes: variation, selection, and retention (Darwin, 1859). The sameness of cellular architecture, including nuclear material (e.g., DNA and RNA), also attests to the similarity. Likewise, all the myriad instances of reinforcement suggest that considering them instances of a single process is reasonable.

As noted earlier, an important goal of science is to discover uniformities. In fact, as Duhem (1954) noted, one of the key features of explanation is identification of the like in the unlike. Objects look different, are made of different substances, and may or may not be moving in a variety of ways, but they are similar in how they are affected by gravity. Behavioral activities take on many forms, and as just noted, so can the consequences of those activities. Nevertheless, they can (on many occasions) exhibit the phenomenon known theoretically as reinforcement, an instance of generality of process.

Scientific Generality

Another extremely important concept is scientific generality, a type of generality that has some counterintuitive characteristics. Scientific generality is important for at least two reasons. One, scientific generality speaks to scientists’ ability to reproduce their own findings and those of other scientists, as well. Two, scientific generality speaks directly to the possibility of effective application and translation of laboratory findings to the world at large, as discussed more fully later in the last section of this chapter.

Scientific generality is defined by knowledgeable reproducibility. That is, it is not characterized in terms of breadth of applicability, but instead in terms of identification of factors that are required for a phenomenon to occur. To illustrate the difference between scientific generality and, for example, generality across people, consider again the fictitious experiment on food types. Suppose that the original experiments were all performed with male subjects. On an attempt at replication with female subjects, it is discovered that food type, or carbohydrate

composition, has no effect at all on eating. That, of course, would be clear indication of a limit to the across-subjects generality of the effect on eating. It would, however, represent an increase in scientific generality because it specifies more clearly the conditions required to produce the phenomenon of reduced food intake. As stated by Johnston and Pennypacker (2009), “A procedure can be quite valuable even though it is effective under a narrow range of conditions, as long as we know what those conditions are” (pp. 343–344). The vital role that systematic replication, and even failures of systematic replication, can play in establishing scientific generality therefore becomes evident. Scientific generality represents an understanding of the variables responsible for a phenomenon.

Generalization, Technology Transfer, and Translational Research

The function of any science is the acquisition of basic knowledge. A secondary benefit is often the possibility of applying that knowledge in ways that impart benefit to some element of the culture at large. For example, Galileo’s basic astronomic observations eventually led to improved navigation procedures with attendant benefits to the colonial powers of 17th-century Europe. Pasteur’s discovery in 1863 of the microorganisms that sour wine and turn it into vinegar, and the observation that heat would kill them, led eventually to the germ theory of disease and the development of vaccines. In the case of behavior analysis, a relatively young science, sufficient basic knowledge has been acquired to permit vigorous attempts at application. A discipline known as applied behavior analysis, discussed extensively elsewhere in Volume 2 of this handbook, is the primary result of these efforts, although applications of the findings of behavior analysis are to be found in a variety of other disciplines including medicine, education, and management, to name but a few.

In this section, we describe issues surrounding attempts to apply laboratory research findings in the wider world at large. Specifically, we discuss topics related to applying research findings from controlled

Specifically, we discuss topics related to applying research findings from controlled laboratory or therapeutic settings to new situations or less controlled environments. First, we describe the issue of generalization of behavioral treatment effects from treatment settings to real-world circumstances. Then we outline basic general strategies for effective transfer of technologies, taking into account the known scientific generality of behavioral processes. Finally, we offer comments on the notion of translational research, a matter of much contemporary interest.

Generalization of Applications

One of the earliest subjects of discussion that arose with the development of behavior therapy and behavior modification techniques was the issue referred to as generalization (e.g., Yates, 1970). Specifically, there was concern about whether improvements produced in a therapy setting would also appear in other, nontherapy (e.g., everyday life) situations. The term generalization was borrowed from a core behavioral process discovered by experimental psychologists: After learning to respond in a particular way in the presence of a particular stimulus, say, a tone of a particular frequency, the same behavior may also occur in the presence of other, more or less similar stimuli, say, tones of other frequencies. It is an apparently simple logical step to suggest that behavior learned in a therapy environment might also appear in nontherapy, real-world environments, and when it does so, the result can be called generalization (but see Johnston, 1979, for problems with such a simple extrapolation). Because applied behavior analysis generally involves establishing conditions that alter behavior, the issue of whether those changes are restricted to the learning situations arranged or whether they become evident in other situations is usually important. For example, if a child who engages in aggressive behavior is exposed to a treatment program to reduce aggression, a goal would be to decrease aggression not only in the treatment setting but in all settings. In a seminal article, Stokes and Baer (1977) discussed the issue of generalization of treatment effects.

A key contribution of their article was to indicate that, in general, if effects of a treatment are to be manifested in a variety of circumstances, achieving that outcome must be considered in designing the intervention intended to effect the change in behavior. That is, it is not always sufficient to simply arrange circumstances that produce a desired change in behavior in the circumscribed environment in which the treatment is undertaken. Instead, procedures should be used that increase the probability that the change will be enduring and manifested in those parts of a client's environment in which the changes are useful. That insight has been followed by the development of general strategies to enhance the likelihood that behavior changes occur not only in the treatment environment but also in other appropriate ones. For example, Miltenberger (2008) described several general strategies that can be used to promote generalization of treatment effects. The most direct strategy is to arrange for rewards to occur immediately after instances of generalization occur. Such an approach essentially entails taking treatment to the environments in which it is hoped the new behavioral patterns will occur. That is, the training environment is not explicitly delimited. Such an approach is now widespread in applied behavior analysis, partly as a consequence of an emphasis on analyzing reinforcement functions before implementing treatment (see Iwata, Dorsey, Slifer, Bauman, & Richman, 1982). This approach to problem behavior entails discovering whether the behavior is maintained by reinforcement, and if it is, identifying what the reinforcers are in the environments in which the problem behavior occurs. Once the reinforcers responsible for the maintenance of the problem behavior are identified, then procedures based on that knowledge are implemented in the situations in which the behavior occurs. A related second strategy identified by Miltenberger (2008) is consideration of the conditions operating in the environments in which the changed behavior would be occurring. The idea here is that behavior that is changed in the therapeutic setting, for example, learning effective social skills, will lead, if performed, to more satisfying social interactions in the nontherapy environment, and those successes will help to solidify the gains made in the therapy sessions.

In designing the therapeutic goals, therefore, consideration is given to what sorts of behavior are most likely to be successful in the nontherapy environment. A less obvious strategy applies when the nontherapy environment appears to offer little or no support for the changed behavior. An example is when therapy is aimed at training an adolescent to walk away from aggressive provocation in a schoolyard. Behaving in such an aggression-thwarting manner is not likely to result in positive outcomes with peers, who are instead likely to provide taunts and jeers after such actions. In such a case, it may be prudent to try to change the normal consequences in the schoolyard by having teachers or other monitors provide positive consequences (perhaps in the form of privileges, praise, etc.) for such actions when they occur. That is, the strategy here involves altering the contingencies operating in the nontherapy environment. A fourth general strategy is to try to make the therapy setting more like the nontherapy environment in which the changed behavior is to occur. A study by Poche, Brouwer, and Swearingen (1981) illustrated this approach. They taught abduction prevention skills to preschool children, but in so doing incorporated a relatively large number of abduction lures in the training. The intent was that by including a wide variety of possible lures that might be used by would-be kidnappers, the training would be more effective in real-world situations than if it had not involved those variations. The general strategy in this case was to train with as many likely relevant situations as possible. Another way to view this strategy is that it involves incorporating stimuli that are present in the nontherapy environment into the training. A fifth approach is somewhat less intuitive, but research has suggested that it may be effective. The core idea is that if a variety of different forms of effective behavior are established by the therapy or training, the chance of effective behavior occurring in the nontherapy environment is better, and as a result the successful behavior will be supported and continue to occur. As a simple illustration, Miltenberger (2008) offered the example of teaching a shy person a variety of specific ways to ask for a date, which provides the person with several actions to try, some of which are likely to be successful outside of therapy.

In this section, we focused on particular strategies for ensuring that desired changes in behavior established through therapeutic methods occur and persist in nontraining or nontherapy environments, that is, in the everyday world. Employment of tactics emerging from the strategies described has yielded many successes, and the methods are part of the armamentarium of applied behavior analysts. These techniques to promote generalization of behavior changes have emerged from a consideration of fundamental behavioral processes that have been identified and analyzed in basic research and then subsequently validated as effective through applied research. They represent, consequently, what can be called successful transfer from basic science to effective technology, namely, an instance of what has come to be called technology transfer. In the next section, we discuss some general principles of effective technology transfer.

Technology Transfer

People often use the term technology to refer to the body of applied knowledge and practices that emanate from basic science. The term technology transfer refers to the process by which the discoveries or inventions of basic science actually make their way into the body of technology and become available for use outside the laboratory by any individual willing to undergo the expense of acquiring the technology. Technology transfer can occur with its basis in any science, so general principles exist that apply across scientific disciplines. The process is somewhat complex, and pausing to review some of the basic details that apply when a technology is brought to the commercial level will be helpful. The criteria, both legal and practical, that must be met for successful technology transfer set an upper limit against which applications of science can be evaluated. A discovery or invention is an item of intellectual property. It belongs to the inventor or discoverer or whoever sponsored the research that led to its existence. The transfer process usually involves a second party taking ownership of the property and thus the right to exploit it for commercial gain. It therefore must be protected, usually by means of a patent or copyright.

Once ownership is secured, it can be transferred in exchange for a lump sum payment or, more often, a license agreement whereby the licensee acquires the exclusive right to produce and distribute the technology and the licensor is entitled to a royalty based on sales. Thus, for example, the Quaker Oats Company executed a licensing agreement with the University of Florida that allowed the company to produce and distribute Gatorade exclusively in exchange for a royalty in the form of a percentage of revenue.

Requirements for Technology Transfer

For a candidate technology to meet the eligibility requirements for transfer, it must meet three criteria: quantification, repetition, and verification. These terms arise from the engineering literature (e.g., Hench, 1990); their counterparts in the behavioral literature will become obvious in a moment. Let us consider these characteristics in more detail and see how they conform to the products of behavior-analytic research. First, we discuss quantification. Behavior analysis has long used the measurement strategies of the natural sciences (Johnston & Pennypacker, 1980), with the result that, as Osborne (1995) stated,

Physical standards of measurement bind behavior analysis to the physical and natural sciences. Interpretation of dependent variables need not change from experiment to experiment. It is a feature of . . . idemnotic measures that response frequencies on a particular parameter of a fixed-ratio schedule of reinforcement can be compared validly within sessions and across sessions, within laboratories and across laboratories, within species and across species. (p. 249)

We are therefore able to state precisely and unambiguously the quantitative characteristics of behavior resulting from application of a particular procedure. Repetition is the practical use of replication, as discussed earlier. The phenomenon must be able to be reproduced at will for it to serve as a component of a transferrable technology.

An early (late 1950s and early 1960s) example of this feature is the application by pharmaceutical companies of the reproducible effects of reinforcement schedules in evaluating drugs. A standard approach was to establish a known baseline of performance by an animal using a particular schedule of reinforcement, then evaluate the perturbation, if any, of an experimental compound on that performance. If the perturbation could be reliably reproduced (repeated), the relation between the compound and its effect on behavior was affirmed (cf. McKearney, 1975; Sudilovsky, Gershon, & Beer, 1975). Verification is most akin to the concept of generality. In establishing the generality of a behavioral process or phenomenon, researchers seek to specify the range of conditions under which the phenomenon occurs while eliminating those that are irrelevant or extraneous. Similarly, when transferring a technology, the recipient must be afforded a complete specification of the necessary and sufficient conditions for reproduction of the effects of the technology. Extensive research to nail down the parameters of generality is the only way to achieve this objective. It cannot be obtained by appeal to the results of significance tests for all of the reasons detailed earlier. A simple yet elegant example of a well-established behavioral technology that has easily transferred is the Invisible Fence for animal control, which is a direct application of the principles of signaled avoidance. A wire is buried underground around the perimeter of the enclosed area in which the animals are to remain. The animal wears a collar that receives auditory signals (beeps) and delivers electric shocks through two electrodes that contact the animal's neck. As the animal comes within a few feet of the wire, a beep sounds. If the animal proceeds, it receives a shock. The owner teaches the animal to withdraw on hearing the beep by carrying (or leading) the animal into the proximity of the beep, then shouting "No!" and carrying or leading the animal away from the beep. After several such trials, the animal is released and will eventually receive the shock. Its escape response has been well learned, and it will very likely never contact the shock again. Rather, it will avoid it when it hears the beep.

A more elaborate example of a behavioral technology that has been successfully transferred is the MammaCare Method of manual breast examination as a means of early detection of breast cancer (Pennypacker, 1986). This example is unusual because the basic research was conducted by the same individuals who eventually took the technology to market. A capsule history of the development of MammaCare and its subsequent transfer is available online (Pennypacker, 2008). In brief, a high-fidelity simulation of human breast tissue was created with the help of materials science engineers. Patent protection for this device was obtained. Basic psychophysical research using this simulation was conducted to determine the limits of detectability by human fingers of lifelike simulations of breast tumors and other abnormalities. Once these limits were known, behavioral studies isolated the components of a technique that allowed an examiner to approach the established psychophysical limits. Early translational research established that practicing the resulting techniques on the simulation enabled examiners to detect real lumps in live breast tissue at about twice their baseline accuracy. Extensive research was then undertaken to establish procedures for teaching the new examination technique, which became known as MammaCare. Technology transfer became possible when standards of performance were established and training methods were devised that could be readily repeated and whose results could be verified. As a result, individuals wanting to offer such training either to the public or to other medical professionals (e.g., people who routinely perform clinical breast examinations) may now become certified and can operate independently. Technology transfer was greatly accelerated in the United States by the passage in 1980 of the Bayh–Dole Act, which made it possible for institutions conducting research with federal funds to retain the products of that research. Offices of licensing and technology soon appeared in all of the major research universities, which in turn began licensing to private organizations the products of their sponsored research. The resulting revenue in the form of fees and royalties has been of significant benefit to these institutions. Most of this activity, however, has taken place in the hard sciences, engineering, and medicine.

Fields such as behavior analysis were not ready to enjoy the stimulative effects of the Bayh–Dole Act at the time it became law. Analogous federal attention to the clinical and behavioral sciences has emerged in the form of the National Institutes of Health's National Center for Research Resources, which makes Clinical and Translational Science Awards. Mace and Critchfield (2010) cited a statement by the acting director of the National Institutes of Health's Office of Behavioral and Social Sciences to the effect that "its [the institute's] mission is science in pursuit of fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to extend healthy life and reduce the burdens of illness and disability" (p. 307). The aim is to accelerate the application of basic biomedical research to clinical problems. We now turn our attention to this type of research.

Translational Research

From our perspective, translational research is a node along the continuum from basic bench science to the sort of application that results from technology transfer. The continuum is abstractly defined by generality. The ultimate goal of translational research may be broadly seen as establishing the limits of generality of whatever variable, process, or procedure is of interest. Translational research is therefore a somewhat less stringent endeavor than full technology transfer; its aim is to bridge the gap from bench to bedside. A distinguishing feature of this approach is that the basic scientist and clinician often collaborate in a synergistic manner. This practice will likely accelerate the development of candidate technologies because the applied aspect is undergoing constant examination and refinement. Lerman (2003) has provided an excellent overview of translational research in behavior analysis. She correctly observed that the bulk of the literature on applied behavior analysis consists of reports of translational research. She went on to describe a series of maturational stages of this endeavor, beginning with the early demonstrations of the generality of the process of reinforcement across species, individuals, and settings.

From these emerged concerns with other basic processes such as extinction, stimulus generalization, discrimination, and the effects of basic contingencies of reinforcement. As these types of demonstrations increasingly proved beneficial to clinical populations, a new dimension of this research emerged that focused on issues of training, maintenance of benefits, and even the implications of such practices for public policy. Concurrently, focus shifted from individual cases to larger entities such as schools, corporations, and even military units. At the same time, a small but growing body of translational research is explicitly aimed at hastening the development of mature technologies that can be transferred, with the usual attendant financial and cultural benefits. One such effort is a study by St. Peter Pipkin, Vollmer, and Sloman (2010), who examined the effects of decreasing adherence to a schedule of differential reinforcement of alternative behavior, first in a laboratory setting and then, using the same reinforcement schedule parameters, in an educational setting with two individuals with educational handicaps. They explored the generality of a procedure across settings and populations and further documented the effects of deliberate failure to impose the differential reinforcement of alternative behavior schedule as specified, either by not delivering reinforcement when required or by "accidentally" allowing undesirable behavior to contact reinforcement at various times. These manipulations constitute an attempt to demonstrate the consequences of breakdowns in treatment integrity, which in some cases can be highly destructive and in others may be negligible. The type of translational research just described constitutes an important step toward the development of a technology that can be transferred in the sense discussed earlier. Treatment integrity is directly analogous to what the engineers call verification. St. Peter Pipkin et al. (2010) provided guidance for establishing a range of allowable treatment integrity failure within which effectiveness may be maintained, which is akin to specifying tolerances in a manufacturing process or allowable failure rates of components in complex equipment.
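To make the idea of parametrically degraded treatment integrity concrete, the brief Python sketch below simulates how omission errors (withholding earned reinforcers for the alternative response) and commission errors (accidentally reinforcing the problem response) change which behavior actually contacts reinforcement under a nominal differential-reinforcement-of-alternative-behavior arrangement. It is a minimal illustration of the logic only; the function name, response counts, and error probabilities are hypothetical and are not taken from St. Peter Pipkin et al. (2010).

import random

def simulate_dra_session(n_alt=100, n_prob=100, p_omission=0.0, p_commission=0.0, seed=1):
    # n_alt, n_prob: occurrences of alternative and problem behavior in a session
    # p_omission: probability that an earned reinforcer for alternative behavior is withheld
    # p_commission: probability that an instance of problem behavior is accidentally reinforced
    rng = random.Random(seed)
    sr_alt = sum(1 for _ in range(n_alt) if rng.random() >= p_omission)
    sr_prob = sum(1 for _ in range(n_prob) if rng.random() < p_commission)
    total = sr_alt + sr_prob
    return sr_alt, sr_prob, (sr_alt / total if total else 0.0)

# Compare full integrity with two hypothetical levels of degraded integrity.
for label, err in [("100% integrity", 0.0), ("80% integrity", 0.2), ("40% integrity", 0.6)]:
    alt, prob, share = simulate_dra_session(p_omission=err, p_commission=err)
    print(f"{label}: reinforcers for alternative = {alt}, for problem = {prob}, "
          f"proportion to alternative = {share:.2f}")

Tabulating or plotting the obtained distribution of reinforcers against programmed integrity in this way is one simple means of seeing why some integrity failures barely disturb the intended contingency whereas others effectively reverse it.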

General Considerations for Translational Research

Mace and Critchfield (2010) offered an informative perspective on the current role of translational research in behavior analysis. They stressed the importance of conducting research that can be of more or less immediate benefit to society if substantial societal support for such research is to occur. In our view, the likelihood of such research actually attaining that criterion would be augmented to the extent that researchers keep as their ultimate goal development of a technology that can be transferred as we have discussed. In fact, very few examples of commercially transferrable technology have yet emerged from behavior analysis (Pennypacker & Hench, 1997). There is, however, sufficient promise in the replicability of the discipline's basic findings to encourage development of transferrable technologies, and the availability of substantial research support is critical. More translational research aimed at identifying and isolating the conditions under which specified procedures can be assured to consistently generate measurable and desirable effects on the behavior of individuals will hasten the emergence of such technologies.

Summing Up

The subdiscipline known as behavior analysis offers an alternative approach to assessing reliability and generality of research findings, that is, an approach that is different from that used by most psychological researchers today. The methods that provide avenues to assessing reliability and generality may be of interest to psychologists who approach the field of behavior (or mind) from perspectives other than those shared by behavior analysts. At this juncture in the history of behavioral science, the methods might be especially attractive to researchers who are coming into contact with the substantial limitations of traditional methods that rely on group averages and null-hypothesis significance testing. In the early sections of this chapter, we reiterated those weaknesses because it appears that many behavioral researchers are not aware of them. Our main thrust in this chapter has been to describe and characterize types of replication and the roles that they play in determining the reliability and generality of research outcomes.

We have especially emphasized the role replication can play in assessing the generality of research findings, both across subjects and conditions and of theoretical assertions. We have argued, in fact, that replication, both direct and systematic, currently represents the only set of methods that can determine whether results are reliable and how general they are. Our claim is a strong one, and it is not that replication is an alternative set of methods but rather that it is the only way to determine reliability and generality given current knowledge. Replication has served the more developed sciences very effectively, and it is our contention that it can serve behavioral science, too. At the very least, we hope we have convinced the reader that paying greater attention to replication will advance behavioral science more surely and more rapidly than the methods currently in fashion. In the final section of the chapter, we focused on issues of applying science to problems the world faces. Once reliability and generality of research findings have been established to an appropriate degree, it is sometimes possible to take advantage of that knowledge for the betterment of people and society. There are guideposts about how best to do that, and we have discussed some of them. The other chapters in this handbook present a wide-ranging description of the research and application domains that constitute behavior analysis. We see those chapters as testament to the coherent science and technology that can be developed when the markers of reliability and generality have been established through research founded on direct and systematic replication.

References

Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21. doi:10.2307/2682899
Baer, D. M., Peterson, R. F., & Sherman, J. A. (1967). The development of imitation by reinforcing behavioral similarity to a model. Journal of the Experimental Analysis of Behavior, 10, 405–416. doi:10.1901/jeab.1967.10-405
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423–437. doi:10.1037/h0020412
Bayh–Dole Act, Pub. L. 96-517, § 6(a), 94 Stat. 3018. (1980).
Bernard, C. (1957). An introduction to the study of experimental medicine. New York, NY: Dover. (Original work published 1865)
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.
Cleveland, W. (1994). The elements of graphing data. Summit, NJ: Hobart Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. doi:10.1037/0003-066X.49.12.997
Darwin, C. (1859). On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. London, England: John Murray.
Deese, J. (1959). On the prediction of occurrence of particular verbal intrusions in immediate recall. Journal of Experimental Psychology, 58, 17–22. doi:10.1037/h0046671
de Hefferline, R. F., Keenan, B., & Harford, R. A. (1959). Escape and avoidance conditioning in human subjects without their observation of the response. Science, 130, 1338–1339.
Duhem, P. (1954). The aim and structure of physical theory (P. P. Wiener, Trans.). New York, NY: Princeton University Press.
Dunn, K. E., Sigmon, S. C., Thomas, C. S., Heil, S. H., & Higgins, S. C. (2008). Voucher-based contingent reinforcement of smoking abstinence among methadone-maintained patients: A pilot study. Journal of Applied Behavior Analysis, 41, 527–538. doi:10.1901/jaba.2008.41-527
Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory and Psychology, 5, 75–98. doi:10.1177/0959354395051004
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research—Online, 7(1), 1–20.
Hench, L. L. (1990, August). From concept to commerce: The challenge of technology transfer in materials. MRS Bulletin, pp. 49–53.
Iwata, B. A., Dorsey, M. F., Slifer, K. J., Bauman, K. E., & Richman, G. S. (1982). Toward a functional analysis of self-injury. Analysis and Intervention in Developmental Disabilities, 2, 3–20.
Johnston, J. M. (1979). On the relation between generalization and generality. Behavior Analyst, 2, 1–6.
Johnston, J. M., & Pennypacker, H. S. (1980). Strategies and tactics of human behavioral research. Hillsdale, NJ: Erlbaum.
Johnston, J. M., & Pennypacker, H. S. (2009). Strategies and tactics of behavioral research (3rd ed.). New York, NY: Routledge.
Kalinowski, P., Fidler, F., & Cumming, G. (2008). Overcoming the inverse probability fallacy: A comparison of two teaching interventions. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 4, 152–158.
Lane, H. (1960). Control of vocal responding in chickens. Science, 132, 37–38. doi:10.1126/science.132.3418.37
Lerman, D. C. (2003). From the laboratory to community application: Translational research in behavior analysis. Journal of Applied Behavior Analysis, 36, 415–419. doi:10.1901/jaba.2003.36-415
Loftus, G. R. (1991). On the tyranny of hypothesis testing in the social sciences [Review of The empire of chance: How probability changed science and everyday life]. Contemporary Psychology, 36, 102–105.
Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171. doi:10.1111/1467-8721.ep11512376
Mace, F. C., & Critchfield, T. S. (2010). Translational research in behavior analysis: Historical traditions and imperative for the future. Journal of the Experimental Analysis of Behavior, 93, 293–312. doi:10.1901/jeab.2010.93-293
Martin, G., & Pear, J. (2007). Behavior modification: What it is and how to do it (8th ed.). Upper Saddle River, NJ: Pearson.
McKearney, J. W. (1975). Drug effects and the environmental control of behavior. Pharmacological Reviews, 27, 429–436.
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103–115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834. doi:10.1037/0022-006X.46.4.806
Miltenberger, R. G. (2008). Behavior modification: Principles and procedures (4th ed.). Belmont, CA: Thompson.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301. doi:10.1037/1082-989X.5.2.241
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. Chichester, England: Wiley.
Osborne, J. G. (1995). Reading and writing about research methods in behavior analysis: A personal account of a review of Johnston and Pennypacker's Strategies and Tactics of Behavioral Research (2nd ed.) and others. Journal of the Experimental Analysis of Behavior, 64, 247–255. doi:10.1901/jeab.1995.64-247
Pennypacker, H. S. (1986). The challenge of technology transfer: Buying in without selling out. Behavior Analyst, 9, 147–156.
Pennypacker, H. S. (2008). A funny thing happened on the way to the fortune, or lessons learned during 25 years of trying to transfer a behavioral technology. Behavioral Technology Today, 5, 1–31. Retrieved from http://www.behavior.org/resource.php?id=188
Pennypacker, H. S., & Hench, L. L. (1997). Making behavioral technology transferrable. Behavior Analyst, 20, 97–108.
Penston, J. (2005). Large-scale randomized trials—A misguided approach to clinical research. Medical Hypotheses, 64, 651–657. doi:10.1016/j.mehy.2004.09.006
Poche, C., Brouwer, R., & Swearingen, M. (1981). Teaching self-protection to young children. Journal of Applied Behavior Analysis, 14, 169–175. doi:10.1901/jaba.1981.14-169
Roediger, H., & McDermott, K. (1995). Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 803–814. doi:10.1037/0278-7393.21.4.803
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416–428. doi:10.1037/h0042040
Sidman, M. (1960). Tactics of scientific research. New York, NY: Basic Books.
Skinner, B. F. (1932). On the rate of formation of a conditioned reflex. Journal of General Psychology, 7, 274–286. doi:10.1080/00221309.1932.9918467
Smithson, M. (2003). Confidence intervals. London, England: Sage.
Stokes, T. F., & Baer, D. M. (1977). An implicit technology of generalization. Journal of Applied Behavior Analysis, 10, 349–367. doi:10.1901/jaba.1977.10-349
St. Peter Pipkin, C., Vollmer, T. R., & Sloman, K. N. (2010). Effects of treatment integrity failures during differential reinforcement of alternative behavior: A translational model. Journal of Applied Behavior Analysis, 43, 47–70. doi:10.1901/jaba.2010.43-47
Sudilovsky, A., Gershon, S., & Beer, B. (Eds.). (1975). Predictability in psychopharmacology: Preclinical and clinical correlations. New York, NY: Raven Press.
Thorndike, E. L. (1898). Animal intelligence: An experimental study of the associative processes in animals. Psychological Review, 11(4, Whole No. 8).
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83–91. doi:10.1037/h0027108
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Ullman, L. P., & Krasner, L. (Eds.). (1966). Case studies in behavior modification. New York, NY: Holt, Rinehart & Winston.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604. doi:10.1037/0003-066X.54.8.594
Williams, B. A. (2010). Perils of evidence-based medicine. Perspectives in Biology and Medicine, 53, 106–120. doi:10.1353/pbm.0.0132
Yates, A. J. (1970). Behavior therapy. New York, NY: Wiley.

Chapter 8

Single-Case Research Designs and the Scientist-Practitioner Ideal in Applied Psychology
Neville M. Blampied

I acknowledge my profound debt to Murray Sidman and thank him and Rita Sidman for their many kindnesses. I also acknowledge my enduring gratitude to the academic staff of the Department of Psychology, University of Auckland, 1964–1969, for their rigorous introduction to the science of behavior, and especially to John Irwin, who introduced me to the writings of B. F. Skinner. I am also grateful for the opportunity to discuss research methodology with my friend and colleague Brian Haig and for his comments on this chapter. I also acknowledge the assistance provided by an unpublished thesis by K. J. Newman (2000) in providing very useful background on the scientist-practitioner model of clinical psychology; helpful comments from Martin Dorhay; and many excellent suggestions from the editors, Kennon A. Lattal and Gregory J. Madden. Assistance in the preparation of this chapter was provided by Emma Marshall, who was supported by a Summer Scholarship jointly funded by the University of Canterbury and the Tertiary Education Commission of New Zealand.

DOI: 10.1037/13937-008
APA Handbook of Behavior Analysis: Vol. 1. Methods and Principles, G. J. Madden (Editor-in-Chief)
Copyright © 2013 by the American Psychological Association. All rights reserved.

Darwinism implies . . . an intense awareness that all categorical or essentialist claims about living things are overdrawn—anyone who says that all cases of this thing or that thing are naturally one way or another are saying something that isn't so. . . . Repetition is the habit of nature, but variation is the rule of life. . . . Belief in the primacy of the single case is not an illusion nurtured by fancy but a hope quietly underscored . . . by science. The general case is the tentative abstract hypothesis; the case right there is the real thing. (Gopnik, 2009, pp. 197–198)

How to cope with variation within repetition; how to balance the abstract and the particular—these issues, so deftly stated by Gopnik (2009), challenge all science, including behavioral science. In this chapter, I consider how psychology responded to these challenges by adopting the now-dominant paradigm for research design and analysis within the discipline and how this strongly influenced the adoption of an ideal model—the scientist-practitioner model—for applying science.

I then consider problems and difficulties that have arisen with both research and its application through adherence to this model and argue that the adoption of single-case research strategies provides an effective solution to many of these problems.

Psychology Defines Science—The Inference Revolution

John Arbuthnot (Gigerenzer, 1991; Kendall & Plackett, 1977), a physician and mathematician in the household of Queen Anne, is said to have proved the existence of God (a "wise creator") in 1710 by way of a kind of significance test, but it was not until the middle years of the 20th century that psychology made significance testing into a kind of god. Before these developments, psychologists had used a range of tabular and especially graphical techniques for analyzing what were often complex data with large sample sizes (Smith, Best, Cylke, & Stubbs, 2000). Smith et al. (2000) noted that "history clearly shows that worthy contributions to psychology do not inherently depend on the benediction of p values. Well into the 20th century psychologists continued to produce lasting, even canonical, achievements without using inferential statistics" (p. 262).


The changes in both research practice and data analysis that occurred in (Western) psychology in the period from 1935 to 1955 have been called an inference revolution (Gigerenzer, 1991) and have been claimed by some to have had many of the properties of a Kuhnian paradigm shift (Rucci & Tweney, 1980). The rapidity and thoroughness of this revolution were remarkable, and it had profound and lasting consequences, not least because it comprehensively redefined, in an operational way, what "science" was and is and ought to be in the context of psychological research and, therefore, what the science part of a scientist-practitioner should be. The inference revolution was foreshadowed toward the end of the 19th century by Galton, the inventor of the correlation coefficient and the control group (Dehue, 2000; Gigerenzer et al., 1989), who wrote approvingly about "scientific men" who would "devise [statistical] tests by which the value of beliefs may be ascertained" and who would "discard contemptuously whatever may be found to be untrue" (as quoted in Schlinger, 1996, p. 72). The foundations of the revolution were, however, not laid until the 1920s, when R. A. (Sir Ronald) Fisher, statistician, geneticist, and evolutionary biologist, published his profoundly influential works (Wright, 2009). Statistical Methods for Research Workers (Fisher, 1925) and The Design of Experiments (Fisher, 1935) gave the world factorial designs, randomization of research elements to treatment conditions, the analysis of variance, the analysis of covariance, the null hypothesis (i.e., the hypothesis to be nullified), and null-hypothesis significance tests (NHST; Yates & Mather, 1963). Although some fellow statisticians were suspicious (Yates & Mather, 1963), Fisher's ideas spread rapidly, especially in biology (see Appendix 8.1 for further information on, and definition of terms for, NHST). Given that Fisher was working as an agricultural scientist and geneticist at the time he wrote these books, the spread of his ideas to biology was explicable. Rather more surprising was the rapid adoption of Fisher's methods by psychologists.

Rucci and Tweney (1980) identified 1935 as the first year in which psychological research using analysis of variance was published, with 16 further examples published by 1940. These early examples were largely of applied research, and it was from applied psychology that inferential statistics spread to experimental research (Gigerenzer et al., 1989). Fisher's statistical tests were not the only practices adopted. His advocacy of factorial designs, that is, experiments investigating more than one independent variable at a time, was also influential, reflecting his view that experimental design and statistical analysis are "different aspects of the same whole" (Fisher, 1935, p. 3). There was a hiatus in the dissemination and adoption of Fisher's statistical methods during World War II, but there was then an acceleration in the postwar years, so that by the mid-1950s, Fisher's statistical tests were widely reported in published articles across many psychology journals; many textbooks had been published to assist with the teaching of these methods; and leading universities were requiring graduate students to take courses in these methods (Hubbard, Parsa, & Luthy, 1997; Rucci & Tweney, 1980). Since that time, more than 80% of published articles in psychology journals have typically reported significance tests (Hubbard et al., 1997). Thus, by the time the inference revolution was complete—in the mid-1950s—psychology, or at least the academic, English-speaking domain of the discipline, had developed a consensus about what it meant to be a science and to do science.1 The key attributes of this model of science are outlined in Exhibit 8.1. In a mutually strengthening cycle, they were taught in the methods courses, written about in the methods textbooks, practiced in the laboratory, required by editors, published in the journals, and imitated by other researchers. Until recently, this standard model of psychological science (henceforth termed the standard group statistical model, or standard model for short) has gone largely unchallenged in the mainstream of the discipline. As it also happened, this consensus about scientific methods was achieved at about the same time as major developments were also occurring in the understanding of clinical psychology, the topic to which I now turn.

1. This consensus adopted most of Fisher's approach to statistics and research design but also incorporated, in an ad hoc and anonymous way, aspects of Jerzy Neyman and Egon Pearson's perspective (Gigerenzer, 1991; Gigerenzer et al., 1989; Hubbard, 2004; see Appendix 8.1).

Exhibit 8.1
Key Characteristics of the Standard Model of Research in Psychology

Recruit as many(a) participants(b) as possible.
(Quasi)randomly allocate these participants to experimental conditions.
Acquire one or a few measures of the dependent variable from each participant.
Use measures of central tendency (mean, median, etc.) and variance to aggregate data for each dependent variable.
Compute inferential statistics one dependent variable at a time, comparing the aggregated measures for each group.
Operate under a null hypothesis (H0) for which the population mean differences are zero.
Use p values relative to a criterion to make an accept–reject decision concerning the null hypothesis and any alternative hypothesis.
Regard results of the study as being of scientific value only if H0 has been rejected at the criterion level.

(a) Power analysis permits the computation of a minimum sample size for a given level of power. Investigators are still encouraged to regard larger samples as preferable to smaller samples (e.g., Streiner, 2006).
(b) In principle, these participants are regarded as being drawn from some specified population, but in psychological research the population of interest from which the sample is considered to have been drawn is rarely specified with any precision (Wilkinson & Task Force on Statistical Inference, 1999), an omission that strictly negates the drawing of inferences from the sample.
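As a deliberately minimal illustration of the workflow summarized in Exhibit 8.1, the following Python sketch randomly allocates simulated participants to two conditions, aggregates each group by its mean and standard deviation, and then submits the groups to a null-hypothesis significance test. The sample size, the 5-point "treatment effect," and the alpha level are arbitrary placeholders chosen only to make the steps concrete.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1955)

# Recruit N participants and (quasi)randomly allocate them to two conditions.
N = 40
scores = rng.normal(loc=50, scale=10, size=N)      # one dependent measure per participant
assignment = rng.permutation(N) < N // 2           # random allocation: treatment vs. control
treatment = scores[assignment] + 5                  # a hypothetical 5-point treatment effect
control = scores[~assignment]

# Aggregate each group with a measure of central tendency and variability.
print("treatment mean (SD):", round(treatment.mean(), 1), round(treatment.std(ddof=1), 1))
print("control mean (SD):  ", round(control.mean(), 1), round(control.std(ddof=1), 1))

# Test H0 (no population mean difference) and make an accept-reject decision at alpha = .05.
t, p = stats.ttest_ind(treatment, control)
print(f"t = {t:.2f}, p = {p:.3f}, reject H0: {p < 0.05}")

Note that once the group means are computed, the individual participants play no further role in the analysis; that feature of the standard model is central to the critique developed later in this chapter.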

Clinical Psychology and the Rise of the Scientist-Practitioner Ideal

O'Donnell (1985) argued that in the period from about 1870 until World War I, the emergent discipline of psychology had to surmount three major challenges: to differentiate itself from philosophy, to avoid becoming a branch of physiology, and to become seen as sufficiently useful to society that both the discipline and its practitioners garnered institutional, social, and economic support. From this matrix of historical processes, especially the last, emerged clinical psychology.

Consistent with this argument, there is agreement among historians of psychology that clinical psychology had its immediate origins in the late 1800s (e.g., Benjamin, 2005; Bootzin, 2007; Korchin, 1983; O'Donnell, 1985; Reisman, 1966, 1991; Routh, 1998), although its origins can be traced back to antiquity (Porter, 1997; Routh, 1998). By the beginning of the 20th century, clinics in both Europe and the United States were devoted to the study of mental disease, mental retardation, and children's educational and adjustment problems, in which a nascent clinical psychology was evident (Bootzin, 2007). The most unequivocally psychological of these was the clinic established by Lightner Witmer at the University of Pennsylvania in 1896 (McReynolds, 1997). Witmer's clinic focused on children (McReynolds, 1997; Reisman, 1966), and it was a place where they could be assessed by psychologists (and social workers), drawing on the psychologists' knowledge of sensory, perceptual, and cognitive systems and, where possible, remedial actions could be prescribed. Witmer presciently emphasized two points:

one, as a teacher, the clinician was to conduct demonstrations . . . so that [students] would be instructed in the art and science of psychology; two, as a scientist, the clinician was to regard each case as in part a research experiment in which the effects of his procedures and recommendations were to be discovered. (Reisman, 1966, p. 86)

Bootzin (2007) noted that these small experiments anticipated the development of single-case research methods in psychology, with Witmer himself emphasizing his "belief that we shall more profitably investigate these causes [of development] by the study of the individual" and that he studied "not the abnormal child but . . . the individual child" (Witmer, 1908, as quoted in McReynolds, 1997, p. 142). Witmer established courses for training graduate students in practical psychology at the University of Pennsylvania in 1904–1905 and founded a journal—The Psychological Clinic—in March 1907.

In the initial issue of this journal, Witmer (1907/1996, p. 251) first used the term clinical psychology—"The methods of clinical psychology are . . . invoked wherever the status of an individual mind is determined by observation and experiment, and pedagogical treatment applied to effect a change"—and thus signaled the founding of a new field of psychology. Other U.S. universities soon established clinics in the Witmer model; a few hardy professional psychologists worked outside academia, all of them in institutions serving people with intellectual disabilities; and at least one institution offered an internship (Reisman, 1966; Routh, 2000). So was clinical psychology born. The mature form with which psychologists and the public are familiar did not develop, however, until toward the end of World War II. As a profession, psychology had grown in numbers and influence during the war. The end of the war produced a rapidly growing gap between professional resources, especially the number of clinical psychologists, and the needs of postwar society. To meet these needs, more graduates in clinical and applied psychology in general were required. A committee of the American Psychological Association, chaired by David Shakow, worked to develop appropriate graduate curricula, accreditation processes, and internship opportunities and funding (Committee on Training in Clinical Psychology, 1947; Reisman, 1991). The committee defined clinical psychology as having "systematic knowledge of human personality and . . . principles and methods by which it may use this knowledge to increase the mental well being of the individual" (Committee on Training in Clinical Psychology, 1947, p. 540). The report further defined clinical psychology as "both a science and an art" and stated that the graduate must have "applied and theoretical knowledge in three major areas: diagnosis, therapy, and research" (Committee on Training in Clinical Psychology, 1947, p. 540). Clinical training programs should include psychological research methods and statistics as well as clinical courses and incorporate a research dissertation leading to a doctorate (Committee on Training in Clinical Psychology, 1947; Thorne, 1945).

Between 1947 and 1949, the number of American Psychological Association–accredited doctoral programs in clinical psychology in the United States doubled, and this increase continued apace in subsequent years (Reisman, 1991). This growth stimulated further institutional activity, and in August 1949 a famous conference, the Boulder Conference on Graduate Education in Clinical Psychology, was held at the University of Colorado in Boulder (Raimy, 1950). This conference endorsed the recommendations of the Shakow committee and gave an official imprimatur to what is now termed the Boulder model for the training of applied and clinical psychologists as scientist-practitioners (Bootzin, 2007). This scientist-practitioner ideal, although contested and disputed, has ever since remained the dominant ideal of the clinical psychologist throughout the English-speaking world (e.g., Martin, 1989; Shapiro, 2002). What this ideal entails has been specified in different ways, but Barlow, Hayes, and Nelson (1984) suggested that it specified three aspects or roles for practitioners as (a) consumers of new research, (b) empirical evaluators of their own practice, and (c) producers of scientifically credible research from their own practice setting.2 Although developed specifically in regard to clinical psychology, the model in principle embraced a wide understanding of applied psychology, so that many forms of practice could be included (B. B. Baker & Benjamin, 2000; Raimy, 1950). It has been adopted by other applied areas, such as counseling psychology (e.g., Corrie & Callahan, 2000; Vespia & Sauer, 2006), health psychology (e.g., Sheridan et al., 1989), neuropsychology (e.g., Rourke, 1995), organizational and personnel psychology (e.g., Jex & Britt, 2008), and school (educational) psychology (e.g., Edwards, 1987).

2. Consistent with the ethos of the time (the 1950s), those who formulated the scientist-practitioner model appear to have endorsed an essentially linear, or one-way, model of the link between basic science and applied science (Reich, 2008). In psychology, this model appears not to have been much debated or challenged, but in the history and philosophy of science, the debate has been extensive (Balaram, 2008). Stokes (1997) developed a multidimensional typology for characterizing the relationship between science and its application, and it has been suggested that psychology belongs in Stokes's "Pasteurian quadrant" (Reich, 2008; see also Price & Behrens, 2003). This domain is characterized by high concurrent interest in both basic science and its applications (Reich, 2008) and clearly has links to the idea of translational research as well (Lerman, 2003).

Given the close coincidence in time of the completion of the inference revolution that operationally defined the scientific method in psychology, the affirmation of the scientist-practitioner ideal at the Boulder conference, and the rapid growth of graduate training programs accredited according to the Boulder model, it is hardly surprising that the science part of the scientist-practitioner ideal came to be identified with the standard group statistical model (Aiken, West, & Millsap, 2008; Aiken, West, Sechrest, & Reno, 1990; Rossen & Oakland, 2008). Frank (1984) has suggested that the conspicuous fact that most of clinical practice had not been derived from science was what led the field to seize on method, as endorsed in academe by researchers, as the common element linking psychologists together. By the late 1960s, influential expositions of research designs in clinical research (e.g., Kiesler, 1971) endorsed group statistical research as the primary scientific method. Although the scientist-practitioner ideal has been criticized, disputed, modified, and lamented (e.g., Albee, 2000; Lilienfeld, Lynn, & Lohr, 2003; Peterson, 2003; Stricker, 1975; Tavaris, 2003), it continues to be vigorously affirmed as an enduring ideal in applied psychology (e.g., B. B. Baker & Benjamin, 2000; T. B. Baker, McFall, & Shoham, 2008; Belar & Perry, 1992; Kihlstrom & Kihlstrom, 1998; McFall, 2007; Soldz & McCullogh, 2000) as both informing and inspiring graduate training and professional practice. Thus, by the midpoint of the 20th century a convergence of two powerful movements within psychology had occurred. One defined what science was and how it should be done; the other specified that having each graduate personally become both a scientist and a practitioner through training and supervised practice in both research and application was the way in which professional applied psychology should be conducted. As it happened and still happens, training of graduate students in mainstream settings, particularly the major universities, whether of general experimentalists or of aspiring practitioners, emphasized the standard group statistical model as the primary and necessary pathway to pure and applied knowledge (Aiken et al., 1990, 2008; Rossen & Oakland, 2008).

Some Consequences of This History

The preceding section, in brief, concerned how history unfolded for psychology in the 20th century, specifically in the combination of a prescription for doing science with an ideal form of applied science practitioner.

How have things turned out in the ensuing 50-plus years? In answering this question, I first reflect briefly on some of the consequences that followed from this development. Second, I consider whether there was any alternative to history as it happened, because if there was no alternative, then no matter the consequences, psychologists have had the best of all possible worlds. The first and most obvious consequence to note is that clinical science, the foundation for the scientist-practitioner, necessarily enjoyed both the benefits and the problems (if any) of the underpinning Fisherian model, because there can be no doubt that clinical research has overwhelmingly been conducted within this tradition. Problems, however, there were. Dar, Serlin, and Omer (1994) reported an extensive methodological review of the use of statistical tests in research published in the Journal of Consulting and Clinical Psychology in the three decades spanning from 1967 to 1988 and showed that this research almost without exception was conducted within the standard NHST paradigm. They also documented the occurrence of a large number and variety of misinterpretations of statistical tests in this research, including the abuse of p values as having meaning they do not have (see Chapter 7, this volume), the absence of confidence intervals, and the gross inflation of Type I error. Where trends existed over the decades, they were often toward the frequency of misinterpretation getting worse rather than better. Dar et al. (1994) concluded that there was "a growing misuse of null hypothesis tests" (p. 79) in psychotherapy research. In this, clinical research echoed the situation in psychological research as a whole, where widespread misinterpretation of statistical tests has been notorious for decades (Cohen, 1990, 1994; see Balluerka, Gomez, & Hidalgo, 2005, and Nickerson, 2000, for comprehensive reviews). Given the centrality of conclusions based on the outcome of NHST for determining the scientific status of research and its likelihood of publication, this state of affairs is serious for all research, including clinical research.
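One of the problems Dar et al. (1994) documented, gross inflation of Type I error, follows from elementary probability: When many significance tests are each run at a nominal alpha of .05 without correction, the chance of at least one false rejection grows quickly. The short calculation below, which assumes independent tests purely for illustration, makes the arithmetic explicit.

alpha = 0.05
for k in (1, 5, 10, 20):
    familywise = 1 - (1 - alpha) ** k   # P(at least one Type I error) across k independent tests
    print(f"{k:2d} tests at alpha = .05 -> chance of at least one false rejection = {familywise:.2f}")

With 20 uncorrected tests, the familywise error rate is roughly .64, more than 12 times the nominal level.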

Has the situation improved more recently? Fidler et al. (2005) reported on a similar survey of empirical articles published in the same eminent journal for five time periods spanning from 1993 to 2001. NHST-based methods continued to dominate research, but misinterpretations and misapplications of statistical methods also continued to be conspicuous. Some improvement over the Dar et al. (1994) survey was noted, but Fidler et al. (2005) commented tartly, "In a major journal dedicated to the research of psychotherapy . . . clinical significance [rather than statistical significance] should be relevant to more than 40% of articles" (p. 139). Similar surveys of other major research journals have suggested a common picture of continuing misinterpretation of Fisherian statistics and resistance to change (Schatz, Jay, McComb, & McLaughlin, 2005; Vacha-Haase, Nilsson, Reetz, Lance, & Thompson, 2000) despite much effort to improve statistical methods (e.g., Erceg-Hurn & Mirosevich, 2008; Rodgers, 2010; Wilkinson & Task Force on Statistical Inference, 1999). Given this lamentable state of affairs from the mid-1960s to the present, little reason exists to suppose that clinical science has been spared the deleterious outcomes some critics have suggested all psychological research has suffered as a result of using the dominant statistical paradigm (Michael, 1974; Rosenthal, 1995). These outcomes include serious lack of power to detect experimental effects (e.g., Cohen, 1962; Sedlmeier & Gigerenzer, 1989), the lack of genuinely cumulative knowledge in psychology (e.g., F. L. Schmidt, 1996), the conspicuous divergence of the methods of psychology from those used in other natural and social sciences (Blaich & Barreto, 2001; Meehl, 1978), and the failure of psychology to develop unambiguously falsifiable theories (e.g., Meehl, 1978). Indeed, Meehl (1978) said of the adoption of NHST that it was "a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology [emphasis added]" (p. 817).
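The power problem identified by Cohen (1962) and by Sedlmeier and Gigerenzer (1989) is also easy to demonstrate by simulation. The sketch below estimates how often a two-group t test at alpha = .05 would detect a genuine effect under values chosen only for illustration: a medium standardized effect (d = 0.5) and 20 participants per group.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1962)
d, n_per_group, alpha, replications = 0.5, 20, 0.05, 10_000

rejections = 0
for _ in range(replications):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(d, 1.0, n_per_group)   # a true effect of d standard deviations
    _, p = stats.ttest_ind(treatment, control)
    rejections += p < alpha

print(f"estimated power = {rejections / replications:.2f}")   # roughly .3, well below the conventional .8 target

In other words, under these assumptions an experimenter studying a real, medium-sized effect would fail to detect it about two times in three.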

extraordinarily difficult to imagine doing research that did not entail the computation of group averages as the initial step in any analysis (e.g., Cairns, 1986). There was nothing particularly remarkable about this use of averaging in the case of Fisher’s own research, dealing as he did with agricultural produce for which interindividual differences in the properties of individual grains of rice or kernels of wheat and so forth are not of particular interest, for which such items are so numerous as to preclude examination of each one anyway, and for which the commercial and economic system dealt with agricultural commodities in bulk. Given this, the development and use of sampling theory that permitted estimation of population parameters and the making of inferences from samples to populations made sense and continues to be appropriate when actuarial or populationlevel issues are at stake. Critically, however, for psychology, these practices in agricultural research meshed neatly with the pursuit by Quetelet, a century before Fisher, of “ideal” aspects of humankind according to the normal distribution (the normal law of error; Gigerenzer et al., 1989). Quetelet’s search for the ideal human led directly to Fisher’s inferential statistics (Johnston & Pennypacker, 1993). The question of whether this combination of teleological assumptions about perfect human types and a focus on the estimation of aggregate properties of commodities was a good model for clinical and other applied psychologies, in which the cases being served are often diverse and individual differences are potentially of great importance, seems not to have been asked during the inference revolution, during the Boulder conference, or since. Nevertheless, a range of scholars have criticized psychology’s addiction to averaging across cases. The evolutionary biologist Steven Jay Gould (1985), in “The Median Isn’t the Message,” a touching autobiographical memoir of his first cancer diagnosis, recounted how his realization that the distribution of survival time postdiagnosis was skewed beyond the median 8 months gave him hope. The point of his writing about this experience, however, was to make a more general point: “I believe that the fallacy of reified variation—or failure to consider the ‘full house’ of all cases—plunges us into serious error

again and again” (Gould, 1997, p. 46). The error Gould was referring to is to focus exclusively on the mean (or median) and ignore variability, forgetting that variability is as central a phenomenon in biological systems as is any central tendency. The same point had been made by the physiologist Claude Bernard (1865/1957) more than a century before (Thompson, 1984). Developmental researchers and theorists are notable among psychologists who have criticized averaging across cases (e.g., Bornstein & Lamb, 1992; Cairns, 1986), perhaps because their subject matter involves dynamic changes over time that have different trajectories in different individuals. Among these scholars, Valsiner (1986) has noted both the double standard prevailing in psychology— purported deep interest in individuals combined with the constant practice of averaging across individuals—and has cogently explained the dangers of this. He wrote, In psychological discourse (both scientific and applied), the individual subject is constantly being given high relevance. In contrast, the individual case is usually forgotten in the practice of psychological research because it is being replaced by samples of subjects that are assumed to represent some general population. Overwhelmingly, psychologists study samples of subjects and often proceed to generalise their findings to the ideal (generic, i.e., average or prototypical) abstract individual. Furthermore, characteristics of that abstracted individual may easily become attributed to particular, concrete individuals with whom [they] work and interact. The inductive inference from samples of concrete subjects to the abstract individual and from it (now already deductively) back to the multitude of concrete human beings is guided by a number of implicit assumptions . . . that obscure insight into the science and hamper its applications. (p. 2) The inductive inference from the sample to the population is consistent with, and indeed mandated

by, the underlying statistical model because hypotheses are about populations, not samples or individuals (Hays, 1963, p. 248), and yield useful knowledge when it is the population that is of concern. As Valsiner (1986) noted, however, major difficulties arise with the (largely unacknowledged) deductive steps by which psychologists get back from the population to the individual, as they must do in applied work. The problem, another version of the uniformity myth identified by Kiesler (1966), is that the induction from the sample to the population is assumed rather than demonstrated to have generality to any individual case, but the assumption is often false unless the population mean has very small variance (Rorer, 1991), which impairs the validity of the subsequent deductions. Without this deductive step, however, individual application of group research–based scientific knowledge is not possible. Was There Any Alternative to the Standard Model? I turn now to the question of whether there were, and are, alternatives to the adoption of the standard model for research and its incorporation in the scientist-practitioner model. As noted earlier, Witmer’s conceptualization of both clinical science and clinical practice was focused on the individual, and an emphasis on the individual was evident in the early definitions of clinical psychology (Reisman, 1966, 1991; see the section Clinical Psychology and the Rise of the Scientist-Practitioner Ideal). Moreover, during the time of both the early adoption of statistical methods and the development of clinical psychology, Gordon Allport emphasized the need for psychology to be both nomothetic (concerned with general laws) and ideographic (the study of unique individuals), noting, “The application of knowledge is always to the single case” (Allport, 1942, p. 58; also see Molenaar, 2004). For the most part, however, these influences did not withstand the power of the inference revolution (Barlow & Nock, 2009). There was an exception. Since the 1930s, B. F. Skinner and the science of behavior analysis he founded have sustained trenchant criticism of 183

averaging across cases and rejection of NHST while maintaining a commitment to experimentation and quantification (Gigerenzer et al., 1989; Hersen & Barlow, 1976). Skinner maintained an unwavering conviction that psychology should be a science of individual behavior—“Individual prediction is of tremendous importance, so long as the organism is to be treated scientifically as a lawful system” (Skinner, 1938, p. 444)—and eschewed averaging across cases. In his magisterial exposition of behavior-analytic research design principles, Sidman (1960) noted, Reproducible group data describe some kind of order in the universe, and as such may well form the basis of a science. It cannot, however, be a science of individual behavior except of the crudest sort. And it is not a science of group behavior in the sense that the term “group” is employed by the social psychologist. It is a science of averaged behavior of individuals who are linked together only by the averaging process itself. Where this science fits in the scheme of natural phenomena is a matter for conjecture. My own feeling is that it belongs to the actuarial statistician, and not to the investigator of behavioral processes. (pp. 274–275). Skinner (1938) and Sidman (1960) also maintained that any natural behavioral process must be demonstrable in the behavior of an individual (Sidman, 1960) and that group averaging risked creating synthetic rather than real phenomena. The principles and research practices developed by Skinner and other behavior analysts have clearly demonstrated the possibility of both a basic and an applied science of individual behavior (e.g., Cooper, Heron, & Heward, 2007; Mazur, 2009) without the use of group-average data or statistical inference procedures in the single-subject–single-case research tradition (many other terms have been used, including N = 1 and time-series designs; see Hayes, 1981). Use of the word case signals that the thing studied need not be a single individual but might be a group entity of some kind, such as a 184

f­ amily, class, work group, organization, or community (Blampied, 1999; Valsiner, 1986). Development of Applied Single-Case Research Designs The experimental research designs that Skinner and his colleagues and students developed for their experimental laboratory investigations and that were systematized in Sidman (1960) are termed intrasubject replication designs. They were strongly influenced by the concept of the steady state in physical chemistry and experimental physiology (Bernard, 1865/1957; Skinner, 1956; Thompson, 1984). To understand the influence of environmental variables on behavior, the performance of an individual subject was measured repeatedly in a standardized apparatus until performance was judged to be stable. If variability persisted, the sources of this variability in the environment were investigated so that it could be minimized. Sources of variability were thus to be understood and experimentally reduced rather than statistically controlled (Sidman, 1960), thus generating an experimental phase called the baseline. Once baseline stability was achieved, an independent variable could be introduced and behavior observed until stability returned. Direct visual comparisons, by way of graphs, of the performance in the baseline and experimental conditions permitted the detection and quantification of the experimental effect. Replications (S. Schmidt, 2009) by way of return to baseline and subsequent reinstatement of the experimental variable demonstrated the reliability of the experimental control achieved and were the basis of inferences about the independent variable being the agent of the change. Further replications were performed with the same or a small number of other subjects in the same experiment to strengthen conclusions (see Chapter 7, this volume). Systematic replications of the procedure using parametric variations of the independent variable permitted a functional analysis of the relation between the independent variable and the target behavior (Sidman, 1960), further strengthening causal inferences. Using these procedures, important processes in behavior are revealed in a continuous, orderly, and

reproducible fashion. Concepts and laws derived from such data are immediately applicable to the behavior of the individual, and they should permit us to move on to the interpretation of behavior in the world at large with the greatest possible speed. (Skinner, 1953, p. 78) Beginning in the early 1950s, Skinner, assisted by Lindsley, extended research in operant behavior from investigations of animals such as rats and pigeons to that of adult humans, particularly those hospitalized with a diagnosis of schizophrenia (Lindsley & Skinner, 1954; Skinner, 1954). Other researchers, notably Bijou (Kazdin, 1978; Morris, 2008), began to deploy similar procedures to research and assess the behavior of children. As Kazdin (1978) noted, this research had an increasingly applied focus that moved it from basic research to evaluation, and then inexorably toward intervention, using basic techniques such as shaping, reinforcement, and extinction to change symptomatic and problem behavior (Kazdin, 1978). This focus generated the field now called applied behavior analysis (Baer, Wolf, & Risley, 1968). O’Donnell’s (1985) observation that “nearly every facet of applied work originally derived from . . . interest in the child” (p. 237) was as true of the emergence of applied behavior analysis as it had been of Witmer’s clinical psychology. In moving from basic operant research to clinical applications, behavior analysis adapted the intrasubject research designs used in the laboratory. Early studies often reported data in cumulative graph form (see Ullmann & Krasner, 1966, and Ulrich, Stachnik, & Mabry, 1966, for examples), but by the early 1960s reversal designs were being presented in what is now the conventional form (e.g., Ayllon & Haughton, 1964). Problems with reversal designs led to the development of other applied designs such as the multiple baseline (Baer et al., 1968; Kazdin, 1978). All these designs involve time-series data from one or a few participants, and all involve experimental phases designated baseline and intervention. They differ in the way in which replication is used within and across participants to establish the reliability of the behavior changes detected by

comparing baseline and intervention phases and to make causal inferences. Hersen and Barlow (1976) published the first book-length systematic exposition of these applied single-case designs, and only a little innovation, although much refinement, has occurred since (McDougall, 2005). Several things are notable about this sequence of events. First, the development of experimental single-case research occurred at almost the same time as the widespread adoption of Fisherian statistics in psychology in the immediate postwar years and was explicitly regarded by its protagonists as an alternative to the Fisherian tradition (Gigerenzer et al., 1989; Skinner, 1956). Although the first textbook for teaching Fisherian statistics to psychologists appeared in 1940 (Lindquist, 1940), most universities offering graduate degrees did not begin systematically teaching statistics to students until after World War II (Gigerenzer et al., 1989; Rucci & Tweney, 1980), by which time Skinner’s (1938) first book had appeared, single-case research was being published (albeit not as prolifically as it was after the founding of the Journal of the Experimental Analysis of Behavior in 1958), and Sidman’s (1960) Tactics of Scientific Research was only a decade away. The opportunity then was for the scientist part of the scientist-practitioner ideal to be based on singlecase research rather than, or as well as (Jones, 1978), Fisherian inferential statistics. This alternative research design tradition has, however, had negligible influence on psychological research as a whole or on applied professional psychology in particular, and it has never been part of the graduate curriculum (Aiken et al., 1990, 2008; Rossen & Oakland, 2008). Why single-case research designs had such little influence on mainstream psychology and how science and practice might be different had they had more impact are interesting questions for historians and sociologists of science. Gigerenzer et al. (1989) suggested that several factors led to the dominance of Fisher’s ideas in psychology in the critical period. Many leading North American experimentalists were taught by Fisher himself, and Fisher’s influence was enhanced by his holding a visiting professorship in the United States in the 1930s. The relatively small size of the community of experimental psychologists assisted the rapid dissemination of the new statistics, 185

a process facilitated by the publication of information about them in textbooks and influential journals (Rucci & Tweney, 1980). This process was enhanced when editors of important journals, such as the Journal of Experimental Psychology, made the use of inferential statistics a requirement for acceptance of manuscripts, which happened from about 1950 onward (Gigerenzer et al., 1989). Fisher's writings were in English, and much of the critical scholarly work about statistics and research design was published in French and German (Gigerenzer et al., 1989) and was, therefore, less accessible to English-speaking workers, although Bernard's work was available in English translation by 1949. It is also true that Skinner's ideas about experimentation and inference, although broadly contemporaneous, were not as well developed or as effectively disseminated at the critical juncture, so, from an initially narrow lead, Fisherian ideas had the opportunity to become dominant in psychology.

Second, in contrast to the inference revolution in psychology, in which the key ideas about research design and analysis were imported from other fields and then applied to the subject matter of the discipline, applied behavior analysis and applied single-case research designs grew organically and directly from the underpinning experimental research in the field, both with respect to the subject matter studied (the effects of the environment on the behavior of individuals) and the methods used, which is surely a significant virtue when one considers what science a scientist-practitioner may need by way of both the content and the methods of science and how single-case research designs may help resolve problems in the field.

What Single-Case Research Can Offer the Scientist-Practitioner

In their initial exposition of applied single-case designs, Hersen and Barlow (1976) stated that single-case research was highly relevant to the scientist-practitioner ideal and that it provided an alternative to the prevailing standard model of research. They drew on the work of Bergin and Strupp (1970, 1972), who claimed that among psychotherapy researchers, "there is a growing disaffection from traditional

experimental designs and statistical procedures which are held inappropriate to the subject matter under study” (Bergin & Strupp, 1970, p. 25). The evidence presented by Bergin and Strupp (1970, 1972) provided an opportunity for psychology in general, and clinical psychology in particular, to reevaluate its commitment to the standard model of research, but this did not happen other than within the single-case research tradition. Hersen and Barlow agreed that the scientist-practitioner ideal had largely failed in practice because of a growing split between science and practice and that the split was substantively, but not exclusively, the result of the mismatch between the realities of clinical practice and what ­scientific research methods based on the standard model could inherently accomplish. Instead of the standard Fisherian model of research, Hersen and Barlow suggested that single-case designs were much better suited to the needs of clinical research, especially the evaluation of interventions. The argument developed by Hersen and Barlow (1976) has been further extended by Barlow et al. (1984); Hayes, Barlow, and Nelson-Gray (1999); and Barlow, Nock, and Hersen (2009) and endorsed by others (e.g., Blampied, 1999, 2000, 2001; Hayes, 1981; Kazdin, 2008; Morgan & Morgan, 2001, 2009). If one poses the question “What benefits does single-case research offer the scientist-practitioner?” one can find answers by considering how single-case research obviates some of the major problems inherent in reliance on the standard model, and second, from considering further positive contributions that single-case research can make to applied science.

Single-Case Research Is an Alternative to NHST-Based Research

The fact that most psychologists past and present have ignored the enduring, cogent, and powerful criticisms of the standard NHST model of psychological research is not evidence that these criticisms do not matter (Harlow, Mulaik, & Steiger, 1997; Nickerson, 2000; F. L. Schmidt & Hunter, 1997). Even the defenders of NHST and related practices have acknowledged that the criticisms have substantive merit (e.g., Cortina & Dunlap, 1997; Frick, 1996; Hagen, 1997; Harlow et al., 1997; Wainer, 1999), and proponents of reform such as the Task

Force on Statistical Inference (Wilkinson & Task Force on Statistical Inference, 1999) have sought changes in research practice, data analysis, and reporting that acknowledge much of the criticism. Unfortunately, as noted earlier, the practice of research has been glacially slow in adopting these reforms. This slowness may, in part, be because one effect of the proposed reforms is to continue to embed group statistical procedures and NHST, with all of their inchoate aspects (Gigerenzer, 1991; Hubbard, 2004), into the heart of the research enterprise while requiring that extensive additional complex procedures be added to the research toolkit. This is demanding enough for researchers, and it is even more so for practitioners. Given that single-case research principles and practice resonate with many of the criticisms of the standard model, for instance in the emphasis on graphical techniques, replication, and the rejection of NHST-based inference (Sidman, 1960), it is odd that few if any of the critics or reformers have considered the potential of single-case research as an alternative approach (Blampied, 1999, 2000). Indeed, the task force did not consider single-case research until it replied to comments on its report (American Psychological Association Board of Scientific Affairs, 2000), and even then the response was brief and unenthusiastic. Notwithstanding, single-case research designs remain a complete, quantitative, coherent, internally consistent, and proven alternative to the standard model of experimentation.
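As a small illustration of the graphical emphasis just mentioned, the sketch below is entirely hypothetical: the data values, phase labels, and plotting choices are illustrative assumptions, not material from this chapter. It shows how session-by-session data from one case might be graphed with phase-change lines for visual analysis in an A-B-A-B (reversal) arrangement.

```python
# Hypothetical illustration: plotting single-case (A-B-A-B) data for visual analysis.
# All numbers and labels are invented for this sketch.
import matplotlib.pyplot as plt

# Session-by-session frequency of a target behavior for one participant.
phases = {
    "Baseline 1":     [9, 10, 8, 9, 10],
    "Intervention 1": [6, 5, 4, 3, 3],
    "Baseline 2":     [7, 8, 9, 9],
    "Intervention 2": [4, 3, 2, 2, 2],
}

fig, ax = plt.subplots()
session = 1
for label, values in phases.items():
    xs = list(range(session, session + len(values)))
    ax.plot(xs, values, marker="o", label=label)
    session += len(values)
    if label != "Intervention 2":
        # Dashed vertical line marks the phase change, as in conventional single-case graphs.
        ax.axvline(session - 0.5, linestyle="--", color="grey")

ax.set_xlabel("Session")
ax.set_ylabel("Responses per session")
ax.set_title("Hypothetical A-B-A-B (reversal) data for one case")
ax.legend()
plt.show()
```

Graphed this way, the level, trend, and variability in each phase can be compared directly by visual inspection, with no averaging across cases and no significance test.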

Single-Case Research Is an Effective Alternative to Group-Based Research

As suggested earlier, psychology's commitment to seeking ideal types by way of averaging over individuals is as deep-rooted as its commitment to NHST-based inference. Yet this commitment poses profound difficulties, both conceptual and practical, for applied science, as Valsiner (1986) and others have so cogently noted. The conceptual difficulties (discussed earlier) relate to how abstract principles and hypotheses stated at the level of the population are to be properly applied to the highly variable individual cases with which applied scientists must deal. The practical difficulties arise from the demand for large numbers of participants (Streiner, 2006). As the number of participants in a research project multiplies, so too do the cost, complexity, and time required to complete the research (a point made by Bergin & Strupp, 1972). Multisite, multi-investigator, multimillion-dollar randomized controlled trial research designs are now the gold standard for psychotherapy research (e.g., Beutler & Crago, 1991; Wessley, 2001). Participation in such research is largely out of the question for most scientist-practitioners (indeed, it is out of reach of most researchers), so it is hardly surprising that they do not actively produce research in their practice (Barlow et al., 1984; Hayes et al., 1999). Single-case research, in contrast, done by individual practitioners with individual cases, can make useful contributions to science.

Single-Case Research Avoids Confusing Clinical With Statistical Significance

As many critics have noted, the overwhelming focus on NHST p values has seriously distorted psychological research in general and clinical research in particular (e.g., Cohen, 1990, 1994; Dar et al., 1994; Fidler et al., 2005; Hubbard & Lindsay, 2008; Meehl, 1978). The obsession with statistical significance has distorted judgment about research findings and led to persistent confusion between statistical and clinical significance, fundamentally because p is a function of sample size and, with a sufficiently large sample, infinitesimally small group-mean differences may be statistically significant (Hubbard & Lindsay, 2008). Clinical significance, in contrast, is determined at the individual level and, rather than being achieved by small degrees of change, typically requires substantive alleviation of distress, restoration of function, and attainment of quality-of-life goals (Blampied, 1999; Jacobson & Truax, 1991). Clinically, psychologists need to know that interventions have produced effective change, but null-hypothesis–based statistics do not reliably indicate this. Although techniques such as the use of effect size estimates and meta-analysis of group research (e.g., Bornstein, 1998; Kline, 2004) are an improvement over p values, large effect sizes may be associated with little clinical change (Hayes et al., 1999). Techniques to compute the magnitude of clinically significant change (e.g., Jacobson & Truax, 1991) require analysis at the level of the individual rather than the group, suggesting that determining clinical significance is an intractable problem for group research. Furthermore, contributing to the group mean may be individuals whose scores indicate large-magnitude change, those whose scores have changed only slightly, and those whose scores have deteriorated (Hersen & Barlow, 1976). For clinical work to progress, it is essential to know about changes that are more than negligible in a positive way but that are still less than clinically significant and about deterioration; otherwise, it is impossible to engage in systematic improvement of therapies or to move toward the oft-stated goal of matching therapies to clients (Hayes et al., 1999; Kazdin, 2008). Concealing a range of therapy outcomes within a small but statistically significant mean effect does not assist with this objective.

Because single-case research does not amalgamate case data but keeps the data separate and identifiable throughout the research and because it has no equivalent to the ubiquitous p value, it cannot mistake statistical for clinical significance (although it can either over- or understate the import of its findings). It can relate the magnitude of change or lack of change to each individual participant relative to that participant's own baseline. It heeds the admonition by Cohen (1994) that there is "no objective, mechanical ritual" (p. 1001) for determining the importance of research findings and substitutes the scientist-practitioner's trained, professional judgment as to the importance of the outcomes achieved in the research (Hammond, 1996). That the key data are presented graphically so that the changes achieved have to be obvious to visual inspection provides some assurance that only substantive changes will be reported and permits all other viewers to apply their judgment, thus subjecting conclusions to ongoing peer review and continuous revision as understanding of clinical significance changes with time and context.
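One widely cited way of operationalizing change at the level of the individual is the reliable change index of Jacobson and Truax (1991). The following is a minimal sketch of that computation; the client's scores and the test norms are hypothetical, and the 1.96 criterion follows the usual convention rather than anything stated in this chapter.

```python
# Minimal sketch of the Jacobson & Truax (1991) reliable change index (RCI),
# computed for one client rather than for a group mean.
# The norms (sd_pre, reliability) and the client's scores below are hypothetical.
import math

def reliable_change_index(pre, post, sd_pre, reliability):
    """RCI = (post - pre) divided by the standard error of the difference."""
    se_measurement = sd_pre * math.sqrt(1 - reliability)   # SE of a single score
    se_difference = math.sqrt(2 * se_measurement ** 2)     # SE of a pre-post difference
    return (post - pre) / se_difference

# Hypothetical client: symptom-scale score of 31 before therapy, 18 after.
rci = reliable_change_index(pre=31, post=18, sd_pre=7.5, reliability=0.85)
print(f"RCI = {rci:.2f}")
# By convention, |RCI| > 1.96 suggests change exceeding what measurement error alone would produce.
print("Reliable change" if abs(rci) > 1.96 else "Change within measurement error")
```

Because the index is computed case by case, it keeps each participant's outcome visible rather than folding it into a group mean, which is the point being made in the surrounding text.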

Single-Case Research Enhances Ethics and Accountability

One of the persisting ethical problems with standard group research protocols in which interventions are

being assessed is the necessity of assigning sometimes large numbers of individuals to control conditions, which are, by definition, not expected to produce much in the way of therapeutic effect. At the same time, those in the treatment group are exposed to whatever risks are entailed in the novel treatment (Hersen & Barlow, 1976). Single-case research does not eliminate these risks, because it imposes baseline and treatment phases that are the equivalent of control and experimental conditions. It does so, however, with one or a few participants, and so the effects of the treatment can be assessed while neither withholding treatment from a large number of individuals nor exposing a substantial number of individuals to whatever risks are inherent in the research. For this reason, at the very least, ethical principles should insist that a novel therapy be shown to be effective in a series of well-conducted single-case studies before any large-scale randomized controlled trials are embarked on. In contrast, the general practice is typically to go from uncontrolled case studies to randomized ­controlled trials in one step. Note also that wait-list ­control designs can be re-formed as multiple-­baseline-across-cases designs, with a consequent reduction in waiting time for therapy for most wait-listed cases and increases in internal validity (see Chapter 5, this volume, for a more comprehensive consideration of the multiple-baseline design). Moreover, Kratochwill and Levin (2010) have made recommendations about how single-case designs might be modified by the inclusion of randomization so as to make them more credible within the wider community of applied researchers. Institutional, policy, and political factors have all long been recognized to strongly influence the conduct of research (e.g., Bergin & Strupp, 1970, 1972), often to the detriment of innovation and conceptual advances, especially through decisions about what research to fund, publish, and incorporate in policy and application (Rozin, 2009). Changes in research practice that would lead to the wider use and dissemination of single-case research within applied and clinical science will require attention to these institutional and political factors. The powerful contingencies residing in research funding, particularly involving the assessment of the scientific merit of

proposals incorporating single-case designs, will have to change. These changes clearly cannot be accomplished by individuals acting alone. They may require concerted action by larger groups representing both scientists and practitioners in advocating changes to institutions, funding agencies, and government. Perhaps it is time to consider another task force, sponsored jointly by the American Psychological Association, the Association for Psychological Science, and the Association for Behavior Analysis International to address these inherently political issues and to advocate for reform both within and beyond the discipline (Orlitzky, 2011). A further strength of single-case research is that it enhances accountability in research and, even more important, in practice (Acierno, Hersen, & Van Hasselt, 1996). The data for each individual participant can be examined by whoever wants to judge the outcomes claimed to have been achieved. Nobody, including any participants, critics, ethical review bodies, insurers, and family members, has to rely on understanding the possibly abstruse complexities of statistical analyses to understand what the treatment achieved (Blampied, 1999). This strength does not remove the possibility of conflict and disagreement, but it does change the character of any such debate, with transparency, openness, and comprehensibility likely to be of long-term benefit to all who are concerned about the ethics of research and practice, clients and practitioners above all (F. L. Newman & Tejeda, 1996). Wolf’s (1978) seminal work on the social validity of behavior analysis and the assessment of the acceptability of interventions to multiple constituencies, including clients, families, and other interested parties, has further extended the ways in which accountability can be achieved in both research and practice.

Single-Case Research Enhances Innovation and the Exploration of Generality

Implicit in the preceding argument is the utility of single-case research in the early phases of therapy innovation. Because of its small scale, it can be used to evaluate novel ideas and new procedures with low cost and little risk. This style of research is adaptable, and protocols can be changed as developments in the research warrant (Hayes, 1981). If

treatments fail in any case, that failure is detected, and new approaches can be tried. In addition to being ideally adapted to the initial phases of research and development, single-case research is ideally suited for what is increasingly being called translational research—the systematic, collaborative pursuit of applications of science linked to basic research (Lerman, 2003; Thompson & Hackenberg, 2009). Ironically, Olatunji, Feldner, Witte, and Sorrell (2004), in discussing the link between the scientist-practitioner model and translational research, recommended that even more rigorous training in the standard model of research is necessary before the scientist-practitioner model can embrace translational research. To the contrary, I would assert that training in single-case research is needed before translational research can be widely undertaken, because much translational research, as with other applied research, needs to be applied at the individual rather than the population level. There is no point in having translational research recapitulate the errors of the past through slavish adherence to the standard model. Single-case research designs also have an underappreciated contribution to make to the establishment of generality of treatment. Consider the situation that prevails after a successful randomized trial of some new therapy. As has been widely noted (e.g., Kazdin, 2008; Seligman, 1995), the very rigor of this research protocol, requiring as it does highly selected participants with a single, carefully established diagnosis, highly trained and monitored therapists normally working from a manual, and atypical therapy settings (e.g., university research clinics), limits the generality of any findings. Indeed, such research is now termed efficacy research to distinguish it from effectiveness research—the research that establishes the utility of the therapy in typical therapy settings with typical therapists and typical clients, now also referred to as research into clinical utility (Howard, Moras, Brill, Martinovich, & Lutz, 1996). Although Chambless and Hollon (1998) acknowledged the role that single-case research might play in establishing the efficacy of psychotherapy, the focus of efficacy and effectiveness research has been almost exclusively on randomized trials (e.g., Kazdin, 2008; Kratochwill & Levin, 2010). 189

Using group-based randomized trials to systematically explore every dimension along which the participants, the therapy, the therapists, and the therapy context might be varied in the search for generality would entail a lifetime of research and a fortune just to establish the domain of general effectiveness of a single therapy. Staines (2008) argued that deficiencies in the way in which most standard psychological research is conducted severely limit the generality of its findings because of what he terms the generalization paradox, and he recommended the use of multiple studies and multiple methods as ways out of the paradox. He did not explicitly recommend the use of single-case research but might well have done so. Although still a potentially large undertaking, single-case research can be used to map the generality of any established therapy (see Drabman, Hammer, & Rosenbaum, 1979, for the initial idea of the generalization map). It does this through replication (see Chapter 7, this volume). Initially, direct replication, in which the attributes of the cases, therapy, and context are kept constant, is used to establish the reliability of the research findings. Generality can then be explored by systematic replication (Hayes et al., 1999; Sidman, 1960), in which various attributes of the cases (e.g., age, gender, symptom severity), of the treatment (e.g., number of sessions), and of the context of intervention (e.g., homes, classrooms) are systematically varied. As these dimensions are explored, the initial treatment outcome may be maintained or enhanced, or it may diminish or even disappear. Additional clinical replication, combining multicomponent treatment programs and clinically diverse clients and settings, further extends the evidence for generality (Barlow et al., 1984, 2009). As the generalization space is mapped in this way, the therapy can be adjusted for maximally effective combinations of participant, procedure, and context. Replicated failures are also important because they mark possible boundary conditions and stimulate new innovations and a new cycle of research. Note that for this reason, the reporting of such failures is to be encouraged, in contrast with the practice in the standard research tradition of not reporting failures to reject H0.
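To make the bookkeeping behind such a generalization map concrete, here is a small, purely hypothetical sketch; the dimensions, studies, and outcomes are invented for illustration and are not drawn from the chapter or from Drabman et al. (1979).

```python
# Hypothetical sketch: tallying outcomes of systematic replications across
# two dimensions of a generalization map. Every entry below is invented;
# the point is only the bookkeeping, not any real findings.
from collections import defaultdict

# Each replication records the varied attributes and whether the treatment effect held.
replications = [
    {"setting": "clinic",    "age_group": "child", "effect_maintained": True},
    {"setting": "clinic",    "age_group": "adult", "effect_maintained": True},
    {"setting": "home",      "age_group": "child", "effect_maintained": True},
    {"setting": "home",      "age_group": "adult", "effect_maintained": False},
    {"setting": "classroom", "age_group": "child", "effect_maintained": False},
]

# Summarize successes and failures for each cell of the map.
cells = defaultdict(lambda: {"success": 0, "failure": 0})
for rep in replications:
    key = (rep["setting"], rep["age_group"])
    cells[key]["success" if rep["effect_maintained"] else "failure"] += 1

for (setting, age_group), counts in sorted(cells.items()):
    print(f"{setting:>9} / {age_group:<5}: {counts['success']} success, {counts['failure']} failure")
# Cells with replicated failures mark possible boundary conditions on the therapy's generality.
```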

Single-Case Research Resolves the Double Standard Around Psychology's Focus on the Individual

If one believes, as the founders of clinical and applied psychology clearly did, that "the individual is of paramount importance in the clinical science of human behavior change" (Hersen & Barlow, 1976, p. 1), then single-case research is essential because it delivers a scientific understanding of phenomena at the individual level and removes the need for the inductive–deductive contortions exposed by Valsiner (1986; see the section Some Consequences of This History). It can, in practice, remove the distinction between the scientist and the practitioner; science and practice become one coherent enterprise. This grounding of applied clinical science in the individual—both the individual client and the individual practitioner—is probably the most important contribution of single-case research to the scientist-practitioner model, because it gives the ideal model real substance (Barlow et al., 1984, 2009; Blampied, 1999, 2001; Hayes et al., 1999; Hersen & Barlow, 1976). "Repetition is the habit of nature, but variation is the rule of life" (Gopnik, 2009, p. 197), and whether the cases dealt with by a scientist-practitioner are individuals, families, or social groups or are larger and more complex entities such as classes, work groups, and organizations or even health systems (Hayes et al., 1999; Morgan & Morgan, 2009), each case will be inherently unique. Yet, no matter how exceptional and singular a particular case may be, single-case research designs permit a scientific analysis to be done—a claim that cannot be made about the standard model. This is why there is such goodness of fit between the scientist-practitioner model and single-case research and why the failure to recognize the synergies between them has been a seriously consequential missed opportunity for the discipline.

Conclusion

I can confidently conclude that the goodness of fit between single-case research principles and practice and the scientist-practitioner model of applied practice is multifaceted, substantial, and potentially

highly beneficial, although to achieve these benefits will require many changes to the way psychologists teach students, fund, analyze and publish research, and practice psychology (Blampied, 2001; Rozin, 2009). Historically, this alliance could have happened, but it did not, which can be seen as a very considerable misfortune for the discipline and its aspirations to enhance human welfare. Instead, the scientist-practitioner ideal has been distorted, distressed, and thwarted by attempting to pursue science with a scientific methodology inappropriate to the purpose. History is the story of what has happened, not what might have happened or what one wishes or thinks should have happened. The generation and adoption of the scientist-practitioner model for clinical and applied psychology is deservedly seen as an inspired and noble choice. It continues as an inspiring ideal to the present day. That the conceptualization of what the science part of the scientistpractitioner duality was thought to be should have been captured by the view of science prevailing in the wider discipline was probably inevitable. What is much more regrettable is how little consideration the discipline’s forebears gave to the aptness of their choice of scientific method, especially in the light of Bergin and Strupp’s (1970, 1972) research. Equally as regrettable is how they, and we, have persistently ignored criticism of the chosen method and how ­fervent, enduring, and exclusive adherence to the method has been (Ziliak & McCloskey, 2008). Also regrettable has been the continuing blindness of those entrusted with the scientist-practitioner ideal to the mismatch between the methods of the adopted science and the needs of practice, despite the warning of prophets such as Bergin and Strupp and Meehl and despite the existence of an alternative in the work of Skinner and those who developed and practiced applied single-case research. But it is not too late to change. I’ve said this before (quoting Stricker, 2000; see Blampied, 2001), but I will say it again: Gandhi, once asked what he thought about Western civilization, replied that it was a good idea and that somebody should try it sometime. Even more so, an alliance between the scientist-practitioner model of clinical practice and single-case research is a

very good idea, and it should be adopted now, for "scientist-practitioners [will] have no difficulty finding interesting work in [the] future. They are trained to solve behavioral problems, and the world promises to provide no shortage of those" (Benjamin, 2005, p. 27).

Appendix 8.1: Terminology and Key Aspects of Null-Hypothesis Statistical Testing

For any set of numerical data, one can compute measures of central tendency (medians, means, etc.) and of variation (standard deviations, etc.) as well as other measures such as correlations. These measures are called descriptive statistics. If one takes the view that the individuals who provided these data are a sample from some population of interest (and one does not necessarily have to take this view), then one can regard these descriptive statistics as estimators of the corresponding "true" population scores and can compute further statistics, such as the standard error of the mean and a confidence interval, that tell one how good an approximation these estimations are. Going a step further, if one has two or more data sets and wants to know whether they represent samples from the same population or from different populations, one uses inferential statistics. As noted in the chapter's introduction, thanks to the work of Fisher and others, the core of such inferential statistics is the null-hypothesis significance test (NHST), of which Student's t is the prototype. The null hypothesis (termed H0) is normally (in psychology) a nil hypothesis, that is, the samples are from the same population and therefore the true mean difference is zero (nil), and is so to an infinite number of decimal places. NHST assumes H0 to be true and computes a probability (p), which is the theoretical probability that, had samples of the same size as those observed been drawn from the same population, the test statistic (e.g., the value of t) would be as large as (or larger than) it is observed to be. For Fisher, the p value was one important piece of evidence, to be considered along with other evidence, in the judgment made by the experimenter as to the implausibility of the observations assuming H0. If the p value suggests that the data are rare

(under H0), this constitutes inductive evidence against H0 and H0 can be rejected, and the smaller the value of p (other things being equal), the stronger this inductive evidence (cf. Wagenmakers, 2007). Fisher came to accept that p < .05 was a convenient (but not sanctified) value for experimenters to use in making such judgments in single experiments, but he also agreed that facts were best established by multiple experiments that permitted consistent rejection of H0 (Wright, 2009).

The alternative Neyman–Pearson paradigm is termed hypothesis testing because it involves the identification of two hypotheses, H0 and an alternative hypothesis, HA. Fisher bitterly contested this viewpoint and never accepted the alternative paradigm, but it is the version that has become dominant in psychology, generally without reference to the initial controversy. This paradigm postulates both H0 and HA. HA is generally an assertion that some known factor or treatment was responsible for the observed mean difference. In this paradigm, if the obtained value of p is smaller than some long-run, prespecified error rate criterion (e.g., p < .05), called alpha, then H0 may be rejected, a result that is said to be statistically significant at the alpha level, and one may accept HA. With two hypotheses available, two errors may be made. One may reject H0 when it is true (Type I error), or one may fail to reject H0, and thus fail to accept HA, when H0 is false (Type II error); the probability of a Type II error is beta. The power of a test to accept HA given that it is true or, better stated, the power to detect an experimental effect, is 1 − β. Neyman (1950) argued that control of Type I errors was more important than control of Type II errors, hence the emphasis in psychology on alpha level and continuing indifference to the power of studies (Cohen, 1994; Sedlmeier & Gigerenzer, 1989). The Neyman–Pearson paradigm does not permit inductive inference about hypotheses to be made. Rather, it permits inductive behavior, that is, the making of decisions on the basis of evidence from statistical tests, these decisions being akin to those made in industrial quality-control contexts to control the long-run rate of production of defective products (Hubbard, 2004; Hubbard & Lindsay, 2008; see also Nickerson, 2000, and Wagenmakers, 2007).
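As a concrete companion to this terminology, the sketch below is not part of the original appendix; the sample size, effect size, and random seed are illustrative assumptions. It simulates many two-group experiments to estimate the long-run Type I error rate under H0 and the power (1 − β) under one particular HA, using a two-sample t test at α = .05.

```python
# Illustrative simulation of NHST long-run error rates; all parameters are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha, n_per_group, n_experiments = 0.05, 30, 10_000

def rejection_rate(true_difference):
    """Proportion of simulated experiments in which H0 (no mean difference) is rejected."""
    rejections = 0
    for _ in range(n_experiments):
        group_a = rng.normal(0.0, 1.0, n_per_group)
        group_b = rng.normal(true_difference, 1.0, n_per_group)
        _, p = stats.ttest_ind(group_a, group_b)
        rejections += p < alpha
    return rejections / n_experiments

# Under H0 (true difference = 0) the rejection rate approximates alpha (Type I error rate).
print(f"Type I error rate under H0: {rejection_rate(0.0):.3f}")
# Under this particular HA (true difference = 0.5 SD) the rejection rate estimates power, 1 - beta,
# which with only 30 participants per group falls well short of 1.
print(f"Power when the true difference is 0.5 SD: {rejection_rate(0.5):.3f}")
```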

References Acierno, R., Hersen, M., & van Hasselt, V. B. (1996). Accountability in psychological treatment. In V. B. van Hasselt & M. Hersen (Eds.), Sourcebook of psychological treatment manuals for adult disorders (pp. 3–20). New York, NY: Plenum Press. Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD programs in North America. American Psychologist, 63, 32–50. doi:10.1037/0003-066X.63.1.32 Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics methodology and measurement in psychology. American Psychologist, 45, 721–734. doi:10.1037/0003-066X.45.6.721 Albee, G. W. (2000). The Boulder model’s fatal flaw. American Psychologist, 55, 247–248. doi:10.1037/ 0003-066X.55.2.247 Allport, G. W. (1942). The use of personal documents in psychological science. New York, NY: Social Science Research Council. American Psychological Association Board of Scientific Affairs, Task Force on Statistical Inference. (2000). Narrow and shallow. American Psychologist, 55, 965–966. doi:10.1037/0003-066X.55.8.965 Ayllon, T., & Haughton, E. (1964). Modification of symptomatic verbal behavior of mental patients. Behaviour Research and Therapy, 2, 87–97. doi:10.1016/00057967(64)90001-4 Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97. doi:10.1901/jaba.1968.1-91 Baker, B. B., & Benjamin, L. T. (2000). The affirmation of the scientist-practitioner: A look back at Boulder. American Psychologist, 55, 241–247. doi:10.1037/0003-066X.55.2.241 Baker, T. B., McFall, R. M., & Shoham, V. (2008). Current status and future prospects of clinical psychology: Toward a scientifically principled approach to mental and behavioral health care. Psychological Science in the Public Interest, 9, 67–103. Balaram, P. (2008). Science, invention, and Pasteur’s quadrant. Current Science, 94, 961–962. Balluerka, N., Gomez, J., & Hidalgo, D. (2005). The controversy over null hypothesis significance testing revisited. Methodology: European Journal of Research Methods for the Behavioural and Social Sciences, 1, 55–70. doi:10.1027/1614-1881.1.2.55 Barlow, D. H., Hayes, S. C., & Nelson, R. O. (1984). The scientist practitioner: Research and accountability in educational settings. New York, NY: Pergamon Press.

Barlow, D. H., & Nock, M. K. (2009). Why can’t we be more idiographic in our research? Perspectives on Psychological Science, 4, 19–21. doi:10.1111/j.17456924.2009.01088.x Barlow, D. H., Nock, M. K., & Hersen, M. (2009). Singlecase experimental designs: Strategies for studying behavior change (3rd ed.). Boston, MA: Pearson. Belar, C. D., & Perry, N. W. (1992). The National Conference on Scientist-Practitioner Education and Training for the Professional Practice of Psychology. American Psychologist, 47, 71–75. doi:10.1037/0003066X.47.1.71 Benjamin, L. T. (2005). A history of clinical psychology as a profession in America (and a glimpse of its future). Annual Review of Clinical Psychology, 1, 1–30. doi:10.1146/annurev.clinpsy.1.102803.143758 Bergin, A. E., & Strupp, H. H. (1970). New directions in psychotherapy research. Journal of Abnormal Psychology, 76, 13–26. doi:10.1037/h0029634 Bergin, A. E., & Strupp, H. H. (1972). Changing frontiers in the science of psychotherapy. New York, NY: Aldine. Bernard, C. (1957). An introduction to the study of experimental medicine (H. C. Green, Trans.). New York, NY: Dover. (Original work published 1865) Beutler, L. E., & Crago, M. (Eds.). (1991). Psychotherapy research: An international review of programmatic studies. Washington, DC: American Psychological Association. doi:10.1037/10092-000 Blaich, C. F., & Barreto, H. (2001). Typological thinking, statistical significance, and the methodological divergence of experimental psychology and economics. Behavioral and Brain Sciences, 24, 405.

Bornstein, M. H., & Lamb, M. E. (1992). Development in infancy: An introduction (3rd ed.). New York, NY: McGraw-Hill. Cairns, R. B. (1986). Phenomena lost: Issues in the study of development. In J. Valsiner (Ed.), The individual subject and scientific psychology (pp. 97–111). New York, NY: Plenum Press. Chambless, D. L., & Hollon, S. D. (1998). Defining empirically supported therapies. Journal of Consulting and Clinical Psychology, 66, 7–18. doi:10.1037/0022006X.66.1.7 Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153. doi:10.1037/ h0045186 Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. doi:10.1037/0003-066X. 45.12.1304 Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. doi:10.1037/0003-066X. 49.12.997 Committee on Training in Clinical Psychology. (1947). Recommended graduate training program in clinical psychology. American Psychologist, 2, 539–558. doi:10.1037/h0058236 Cooper, J. O., Heron, T. E., & Heward, W. L. (2007). Applied behavior analysis (2nd ed.). Upper Saddle River, NJ: Pearson. Corrie, S., & Callahan, M. M. (2000). A review of the scientist-practitioner model: Reflections on its potential contribution to counselling psychology within the context of current health care trends. British Journal of Medical Psychology, 73, 413–427. doi:10.1348/000711200160507

Blampied, N. M. (1999). A legacy neglected: Restating the case for single-case research in cognitive-behaviour therapy. Behaviour Change, 16, 89–104. doi:10.1375/ bech.16.2.89

Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161–172. doi:10.1037/1082-989X. 2.2.161

Blampied, N. M. (2000). Comment: Single-case research designs: A neglected alternative. American Psychologist, 55, 960. doi:10.1037/0003-066X.55.8.960

Dar, R., Serlin, R. C., & Omer, H. (1994). Misuse of statistical tests in three decades of psychotherapy research. Journal of Consulting and Clinical Psychology, 62, 75–82. doi:10.1037/0022-006X.62.1.75

Blampied, N. M. (2001). The third way: Single-case research, training, and practice in clinical psychology. Australian Psychologist, 36, 157–163. doi:10.1080/00050060108259648 Bootzin, R. R. (2007). Psychological clinical science: Why and how we got to where we are. In T. R. Treat, R. R. Bootzin, & T. B. Baker (Eds.), Psychological clinical science (pp. 1–28). New York, NY: Taylor & Francis. Bornstein, M. (1998). The shift from significance testing to effect size estimation. In N. R. Schooler (Ed.), Comprehensive clinical psychology: Vol. 3. Research and methods (pp. 313–349). Amsterdam, the Netherlands: Elsevier.

Dehue, T. (2000). From deception trials to control reagents: The introduction of the control group about a century ago. American Psychologist, 55, 264–268. doi:10.1037/0003-066X.55.2.264 Drabman, R. S., Hammer, D., & Rosenbaum, M. S. (1979). Assessing generalization in behavior modification with children: The generalization map. Behavioral Assessment, 1, 203–219. Edwards, R. (1987). Implementing the scientist-practitioner model: The school psychologist as data-based problem solver. Professional School Psychology, 2, 155–161. doi:10.1037/h0090541

Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods: An easy way to maximize the accuracy and power of your results. American Psychologist, 63, 591–601. doi:10.1037/0003-066X. 63.7.591 Fidler, F., Cumming, G., Thomason, N., Pannuzzo, D., Smith, J., Fyffe, P., . . . Schmitt, R. (2005). Toward improved statistical reporting in the Journal of Consulting and Clinical Psychology. Journal of Consulting and Clinical Psychology, 73, 136–143. doi:10.1037/0022-006X.73.1.136 Fisher, R. A. (1925). Statistical methods for research workers. London, England: Oliver & Boyd. Fisher, R. A. (1935). The design of experiments. London, England: Oliver & Boyd. Frank, G. (1984). The Boulder model: History, rationale, and critique. Professional Psychology: Research and Practice, 15, 417–435. doi:10.1037/0735-7028. 15.3.417 Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379–390. doi:10.1037/1082-989X.1.4.379 Gigerenzer, G. (1991). From tools to theories: A heuristic of discovery in cognitive psychology. Psychological Review, 98, 254–267. doi:10.1037/0033-295X. 98.2.254 Gigerenzer, G., Swijtink, Z., Porter, T., Datson, L., Beatty, J., & Kruger, L. (1989). The empire of chance: How probability changed science and everyday life. Cambridge, England: Cambridge University Press. Gopnik, A. (2009). Angels and ages: A short book about Darwin, Lincoln, and modern life. London, England: Quercus. Gould, S. J. (1985). The median isn’t the message. Discover, 6, 40–42. Gould, S. J. (1997). Life’s grandeur. London, England: Vintage. Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15–24. doi:10.1037/0003-066X.52.1.15 Hammond, G. (1996). The objections to null hypothesis testing as a means of analysing psychological data. Australian Journal of Psychology, 48, 104–106. doi:10.1080/00049539608259513 Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum. Hayes, S. C. (1981). Single-case research designs and empirical clinical practice. Journal of Consulting and Clinical Psychology, 49, 193–211. doi:10.1037/0022006X.49.2.193 Hayes, S. C., Barlow, D. H., & Nelson-Gray, R. O. (1999). The scientist practitioner: Research and accountability 194

in the age of managed care (2nd ed.). Boston, MA: Allyn & Bacon. Hays, W. L. (1963). Statistics for psychologists. New York, NY: Holt, Rinehart & Winston. Hersen, M., & Barlow, D. H. (1976). Single-case experimental designs: Strategies for studying behavior change. Oxford, England: Pergamon Press. Howard, K. I., Moras, K., Brill, P. L., Martinovich, Z., & Lutz, W. (1996). Evaluation of psychotherapy: Efficacy, effectiveness, and patient progress. American Psychologist, 51, 1059–1064. doi:10.1037/ 0003-066X.51.10.1059 Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory and Psychology, 14, 295–327. doi:10.1177/0959354304043638 Hubbard, R., & Lindsay, R. M. (2008). Why p values are not a useful measure of evidence in statistical significance testing. Theory and Psychology, 18, 69–88. doi:10.1177/0959354307086923 Hubbard, R., Parsa, R. A., & Luthy, M. R. (1997). The spread of statistical testing in psychology. Theory and Psychology, 7, 545–554. doi:10.1177/0959354 397074006 Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19. doi:10.1037/0022006X.59.1.12 Jex, S. M., & Britt, T. W. (2008). Organizational psychology: A scientist-practitioner approach (2nd ed.). Hoboken, NJ: Wiley. Johnston, J. M., & Pennypacker, H. S. (1993). Readings for strategies and tactics of behavioral research (2nd ed.). Hillsdale, NJ: Erlbaum. Jones, R. R. (1978). A review of: Single-case experimental designs: Strategies for studying behavior change by Michel Hersen and David H. Barlow. Journal of Applied Behavior Analysis, 11, 309–313. doi:10.1901/ jaba.1978.11-309 Kazdin, A. E. (1978). History of behavior modification: Experimental foundations of contemporary research. Baltimore, MD: University Park Press. Kazdin, A. E. (2008). Evidence-based treatment and practice: New opportunities to bridge clinical research and practice, enhance the knowledge base, and improve patient care. American Psychologist, 63, 146–159. doi:10.1037/0003-066X.63.3.146 Kendall, M. G., & Plackett, R. L. (Eds.). (1977). Studies in the history of statistics and probability (Volume 2). High Wycombe, England: Griffin. Kiesler, D. J. (1966). Some myths of psychotherapy research and the search for a paradigm. Psychological Bulletin, 65, 110–136. doi:10.1037/h0022911


Kiesler, D. J. (1971). Experimental designs in psychotherapy research. In A. E. Bergin & S. L. Garfield (Eds.), Handbook of psychotherapy and behavior change (pp. 36–74). London, England: Wiley. Kihlstrom, J. F., & Kihlstrom, L. C. (1998). Integrating science and practice in an environment of managed care. In D. K. Routh & R. J. DeRubes (Eds.), The science of clinical psychology: Accomplishments and future directions (pp. 281–293). Washington, DC: American Psychological Association. doi:10.1037/10280-012 Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association. doi:10.1037/10693-000 Korchin, S. J. (1983). The history of clinical psychology: A personal view. In M. Hersen, A. E. Kazdin, & A. S. Bellack (Eds.), The clinical psychology handbook (pp. 5–19). New York, NY: Pergamon Press. Kratochwill, T. R., & Levin, J. R. (2010). Enhancing the credibility of single-case intervention research: Randomization to the rescue. Psychological Methods, 15, 124–144. doi:10.1037/a0017736 Lerman, D. C. (2003). From the laboratory to community application: Translational research in behavior analysis. Journal of Applied Behavior Analysis, 36, 415–419. doi:10.1901/jaba.2003.36-415

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834. doi:10.1037/0022-006X. 46.4.806 Michael, J. (1974). Statistical inference for individual organism research: Mixed blessing or curse. Journal of Applied Behavior Analysis, 7, 647–653. doi:10.1901/jaba.1974.7-647 Molenaar, P. C. M. (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement: Interdisciplinary Research and Perspective, 2, 201–218. Morgan, D. L., & Morgan, R. K. (2001). Singleparticipant research design: Bringing science to managed care. American Psychologist, 56, 119–127. doi:10.1037/0003-066X.56.2.119 Morgan, D. L., & Morgan, R. K. (2009). Single-case research methods for the behavioral and health sciences. Los Angeles, CA: Sage. Morris, E. K. (2008). Sidney W. Bijou: The Illinois years, 1965–1975. Behavior Analyst, 31, 179–203. Newman, F. L., & Tejeda, M. J. (1996). The need for research that is designed to support decisions in the delivery of mental health services. American Psychologist, 51, 1040–1049. doi:10.1037/0003066X.51.10.1040

Lilienfeld, S. O., Lynn, S. J., & Lohr, J. M. (2003). Science and pseudoscience in clinical psychology. In S. O. Lilienfeld, S. J. Lynn, & J. M. Lohr (Eds.), Science and pseudoscience in clinical psychology (pp. 1–14). New York, NY: Guilford Press.

Newman, K. J. (2000). The current implementation status of the Boulder model. Unpublished master’s thesis, University of Canterbury, Christchurch, New Zealand.

Lindquist, E. F. (1940). Statistical analysis in educational research. Boston, MA: Houghton Mifflin.

Neyman, J. (1950). First course in probability and statistics. New York, NY: Holt.

Lindsley, O. R., & Skinner, B. F. (1954). A method for the experimental analysis of the behavior of psychotic patients. American Psychologist, 9, 419–420.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301. doi:10.1037/1082-989X.5.2.241

Martin, P. R. (1989). The scientist-practitioner model and clinical psychology: Time for a change? Australian Psychologist, 24, 71–92. doi:10.1080/00050068 908259551 Mazur, J. (2009). Learning and behavior. Upper Saddle River, NJ: Prentice Hall. McDougall, D. (2005). The range-bound changing criterion design. Behavioral Interventions, 20, 129–137. doi:10.1002/bin.189 McFall, R. M. (2007). On psychological clinical science. In T. A. Treat, R. R. Bootzin, & T. B. Baker (Eds.), Psychological clinical science: Papers in honor of Richard M. McFall (pp. 363–396). New York, NY: Psychology Press. McReynolds, P. (1997). Lightner Witmer: His life and times. Washington, DC: American Psychological Association. doi:10.1037/10253-000

O’Donnell, J. M. (1985). The origins of behaviorism: American psychology, 1870–1920. New York, NY: New York University Press. Olatunji, B. O., Feldner, M. T., Witte, T. H., & Sorrell, J. T. (2004). Graduate training of the scientist-practitioner: Issues in translational research and statistical analysis. Behavior Therapist, 27, 45–50. Orlitzky, M. (2011). How can significance tests be deinstitutionalized? Organizational Research Methods. Advance online publication. doi:10.1177/1094428111428356 Peterson, D. R. (2003). Unintended consequences: Ventures and misadventures in the education of professional psychologists. American Psychologist, 58, 791–800. doi:10.1037/0003-066X.58.10.791 Porter, R. (1997). The greatest benefit to mankind. London, England: HarperCollins.


Price, R. H., & Behrens, T. (2003). Working Pasteur’s quadrant: Harnessing science and action for community change. American Journal of Community Psychology, 31, 219–223. doi:10.1023/A:10239 50402338

Schatz, P., Jay, K. A., McComb, J., & McLaughlin, J. R. (2005). Misuse of statistical tests in Archives of Clinical Neuropsychology publications. Archives of Clinical Neuropsychology, 20, 1053–1059. doi:10.1016/j.acn.2005.06.006

Raimy, V. C. (Ed.). (1950). Training in clinical psychology (Boulder conference). New York, NY: Prentice Hall.

Schlinger, H. D. (1996). How the human got its spots. Skeptic, 4, 68–76.

Reich, J. W. (2008). Integrating science and practice: Adopting the Pasteurian model. Review of General Psychology, 12, 365–377. doi:10.1037/1089-2680. 12.4.365

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129. doi:10.1037/1082-989X.1.2.115

Reisman, J. M. (1966). The development of clinical psychology. New York, NY: Appleton-Century-Crofts.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 38–64). Mahwah, NJ: Erlbaum.

Reisman, J. M. (1991). A history of clinical psychology (2nd ed.). New York, NY: Taylor & Francis. Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65, 1–12. doi:10.1037/a0018326 Rorer, L. G. (1991). Some myths of science in psychology. In D. Cicchetti & W. M. Grove (Eds.), Thinking clearly about psychology: Vol. 1. Matters of public interest (pp. 61–87). Minneapolis: University of Minnesota Press. Rosenthal, R. (1995). Progress in clinical psychology: Is there any? Clinical Psychology: Science and Practice, 2, 133–150. doi:10.1111/j.1468-2850.1995.tb00035.x Rossen, E., & Oakland, T. (2008). Graduate preparation in research methods: The current status of APAaccredited professional programs in psychology. Training and Education in Professional Psychology, 2, 42–49. doi:10.1037/1931-3918.2.1.42 Rourke, B. P. (1995). The science of practice and the practice of science: The scientist-practitioner model in clinical neuropsychology. Canadian Psychology/ Psychology Canadienne, 36, 259–277. Routh, D. K. (1998). Hippocrates meets Democritus: A history of psychiatry and clinical psychology. In A. S. Bellack & M. Hersen (Eds.), Comprehensive clinical psychology: Vol. 1. Foundations (pp. 2–48). Oxford, England: Elsevier. Routh, D. K. (2000). Clinical psychology training: A history of ideas and practices prior to 1946. American Psychologist, 55, 236–241. doi:10.1037/0003-066X. 55.2.236

Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13, 90–100. doi:10.1037/a0015108 Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316. doi:10.1037/ 0033-2909.105.2.309 Seligman, M. E. P. (1995). The effectiveness of psychotherapy: The Consumer Reports study. American Psychologist, 50, 965–974. doi:10.1037/0003-066X. 50.12.965 Shapiro, D. (2002). Renewing the scientist-practitioner model. Psychologist, 15, 232–234. Sheridan, E. P., Perry, N. W., Johnson, S. B., Clayman, D., Ulmer, R., Prohaska, T., . . . Beckman, L. (1989). Research and practice in health psychology. Health Psychology, 8, 777–779. doi:10.1037/h0090321 Sidman, M. (1960). Tactics of scientific research. New York, NY: Basic Books. Skinner, B. F. (1938). The behavior of organisms. New York, NY: Appleton-Century-Crofts. Skinner, B. F. (1953). Some contributions of an experimental analysis of behavior to psychology as a whole. American Psychologist, 8, 69–78. doi:10.1037/ h0054118 Skinner, B. F. (1954). A new method for the experimental analysis of the behavior of psychotic patients. Journal of Nervous and Mental Disease, 120, 403–406.

Rozin, P. (2009). What kind of empirical research should we publish, fund, and reward? Perspectives on Psychological Science, 4, 435–439. doi:10.1111/j.1745-6924.2009.01151.x

Skinner, B. F. (1956). A case history in scientific method. American Psychologist, 11, 221–233. doi:10.1037/ h0047662

Rucci, A. J., & Tweney, R. D. (1980). Analysis of variance and the “second discipline” of scientific psychology: A historical account. Psychological Bulletin, 87, 166–184. doi:10.1037/0033-2909.87.1.166

Smith, L. D., Best, L. A., Cylke, V. A., & Stubbs, A. D. (2000). Psychology without p values: Data analysis at the turn of the 19th century. American Psychologist, 55, 260–263. doi:10.1037/0003-066X.55.2.260



Soldz, S., & McCullogh, L. (Eds.). (2000). Reconciling empirical knowledge and clinical experience: The art and science of psychotherapy. Washington, DC: American Psychological Association. doi:10.1037/ 10567-000 Staines, G. L. (2008). The causal generalization paradox: The case of treatment outcome research. Review of General Psychology, 12, 236–252. doi:10.1037/10892680.12.3.236

Valsiner, J. (1986). Where is the individual subject in scientific psychology? In J. Valsiner (Ed.), The individual subject and scientific psychology (pp. 1–14). New York, NY: Plenum Press.

Stokes, D. E. (1997). Pasteur’s quadrant: Basic science and technological innovation. Washington, DC: Brookings Institution Press.

Vespia, K. M., & Sauer, E. M. (2006). Defining characteristic or unrealistic ideal: Historical and contemporary perspectives on scientist-practitioner training in counselling psychology. Counselling Psychology Quarterly, 19, 229–251. doi:10.1080/09515070600960449

Streiner, D. L. (2006). Sample size in clinical research: When is enough enough? Journal of Personality Assessment, 87, 259–260. doi:10.1207/s15327752jpa8703_06

Wagenmakers, E. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin and Review, 14, 779–804.

Stricker, G. (1975). On professional schools and professional degrees. American Psychologist, 30, 1062–1066. doi:10.1037/0003-066X.30.11.1062 Stricker, G. (2000). The scientist-practitioner model: Gandhi was right again. American Psychologist, 55, 253–254. doi:10.1037/0003-066X.55.2.253 Tavris, C. (2003). The widening scientist-practitioner gap. In S. O. Lilienfeld, S. J. Lynn, & J. M. Lohr (Eds.), Science and pseudoscience in clinical psychology (pp. ix–xviii). New York, NY: Guilford Press. Thompson, T. (1984). The examining magistrate for nature: A retrospective review of Claude Bernard’s An introduction to the study of experimental medicine. Journal of the Experimental Analysis of Behavior, 41, 211–216. doi:10.1901/jeab.1984.41-211 Thompson, T., & Hackenberg, T. D. (2009). Introduction: Translational science lectures. Behavior Analyst, 32, 269–271. Thorne, F. C. (1945). The field of clinical psychology, past, present, future [Editorial]. Journal of Clinical Psychology, 1, 1–20. Ullmann, L. P., & Krasner, L. (Eds.). (1966). Case studies in behavior modification. New York, NY: Holt, Rinehart & Winston. Ulrich, R., Stachnik, T., & Mabry, J. (1966). Control of human behavior. Glenview, IL: Scott, Foresman. Vacha-Haase, T., Nilsson, J. E., Reetz, D. R., & Thompson, B. (2000). Reporting practices and APA editorial policies regarding statistical significance and effect size. Theory and Psychology, 10, 413–425. doi:10.1177/0959354300103006

Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212–213. doi:10.1037/1082-989X.4.2.212 Wessley, S. (2001). Randomised controlled-trials: The gold standard. In C. Mace, S. Moorey, & B. Roberts (Eds.), Evidence in the psychological therapies (pp. 46–60). Hove, England: Brunner-Routledge. Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604. doi:10.1037/0003-066X.54.8.594 Witmer, L. (1996). Clinical psychology. American Psychologist, 51, 248–251. (Original work published 1907) doi:10.1037/0003-066X.51.3.248 Wolf, M. M. (1978). Social validity: The case for subjective measurement or how applied behavior analysis is finding its heart. Journal of Applied Behavior Analysis, 11, 203–214. doi:10.1901/jaba.1978.11-203 Wright, D. B. (2009). Ten statisticians and their impacts for psychologists. Perspectives on Psychological Science, 4, 587–597. doi:10.1111/j.1745-6924.2009.01167.x Yates, F., & Mather, K. (1963). Ronald Aylmer Fisher. Biographical Memoirs of Fellows of the Royal Society of London, 9, 91–120. Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance. Ann Arbor: University of Michigan Press.


Chapter 9

Visual Analysis in Single-Case Research
Jason C. Bourret and Cynthia J. Pietras

The visual analysis, or inspection, of graphs showing the relation between environmental (independent) variables and behavior is the principal method of analyzing data in behavior analysis. This chapter is an introduction to such visual analysis. We begin by describing the components, and construction, of some common types of graphs used in behavior analysis. We then describe some techniques for analyzing graphic data, including graphs from common single-subject experimental designs.

Types of Graphs and Their Construction

Of the many ways to graph data (see Harris, 1996), the graph types most frequently used by behavior analysts are cumulative frequency graphs, bar graphs, line graphs, and scatterplots. Each of these is described in more detail in this section.

Cumulative Frequency Graphs

Cumulative frequency graphs show the cumulative number of responses across observation periods. The earliest, and most common, such graph used by behavior analysts is the cumulative record. In the other graph types discussed in this section, measures of behavior during an observation period are collapsed into a single quantity (e.g., mean response rate during a session) that is represented on a graph by a single data point. By contrast, cumulative records show each response and when it occurred during an observation period. Thus, cumulative records provide a detailed picture of within-session

behavior patterns (see Ferster & Skinner, 1957/1997). An example of a cumulative record is shown in Figure 9.1. On a cumulative record, equal horizontal distances represent equal lengths of time; equal vertical distances represent equal numbers of responses. The slope of the curve in a cumulative record indicates the rate of responding. Researchers sometimes include an inset scale on cumulative records to indicate the rate of responding represented by different slopes, although usually more precise calculations of rate are also provided. Small vertical lines oblique to the prevailing slope of the line, traditionally called pips, typically indicate reinforcer deliveries. When the response pen reaches the top of the page, it resets to the bottom, producing a straight vertical line. Researchers may also program the response pen to reset at designated times (e.g., when a schedule change occurs), to visually separate data collected under different conditions. Cumulative records were traditionally generated by now-obsolete, specially designed machines (cumulative recorders). More recently, computer software programs that record and plot each response as it occurs have been used to generate these records. Cumulative records can also be constructed after data collection is complete, if the time of occurrence of each response and all other relevant events during a session are recorded. Although the cumulative record was one of the most commonly used graphs in the early years of the experimental analysis of behavior, it has since fallen out of favor as researchers increasingly present data averaged across single or multiple sessions. It is





Figure 9.1.  Example of patterns that may be observed on a cumulative record. Cumulative responses are shown on the vertical axis, and time is shown on the horizontal axis. Each response moves the response pen a constant distance in the vertical direction, and the paper moves vertically at a constant speed. Shown (right to left) are smooth curves indicating constant rates of responding, grainy curves indicating irregular patterns of responding, shallow curves indicating low rates of responding, and steep curves indicating high rates of responding. Flat portions indicate no responding. Pips (downward deflections of the response pen) usually indicate reinforcer deliveries. Movements of the event pen are used to signal changes in experimental contingencies or stimulus conditions. Data are hypothetical.

useful in the experimental analysis of behavior not as a primary means of data analysis, but as a means of monitoring within-session performance. Another notable exception to the decline of the cumulative record is research published by Gallistel and his colleagues (e.g., Gallistel et al., 2007). In this research line, cumulative records (some of them quite creatively constructed with more than simple responses on the vertical axis) are used extensively to better understand the process by which organisms allocate their behavior between concurrently available sources of food.
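Because the time of each response is all that is needed, a cumulative record of the kind just described can be reconstructed offline from recorded timestamps. The following sketch is not from the chapter; the data, the reinforcement arrangement, and the plotting choices are hypothetical, and pen resets at the top of the page are omitted for brevity. It uses Python with NumPy and matplotlib to step-plot a cumulative count and to mark reinforcer deliveries as pips.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: times (s) of each response; every 40th response reinforced.
rng = np.random.default_rng(0)
response_times = np.sort(rng.uniform(0, 600, size=400))
reinforcer_times = response_times[39::40]

# Cumulative count at each response; a step plot approximates the paper record.
cum_count = np.arange(1, len(response_times) + 1)

fig, ax = plt.subplots(figsize=(6, 3))
ax.step(response_times, cum_count, where="post", color="k", lw=1)

# "Pips": short tick marks at each reinforcer delivery.
pip_y = np.searchsorted(response_times, reinforcer_times, side="right")
ax.plot(reinforcer_times, pip_y, linestyle="none", marker="|",
        markersize=8, color="k")

ax.set_xlabel("Time (s)")
ax.set_ylabel("Cumulative responses")
ax.set_title("Cumulative record (hypothetical data)")
plt.tight_layout()
plt.show()

The slope of the plotted record over any stretch of time is the local response rate, which is precisely what visual inspection of a cumulative record exploits.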

Bar Graphs

Bar graphs show discrete categories (i.e., a nominal scale; Stevens, 1946) along the horizontal (x) axis and values of the dependent variable on the vertical (y) axis (Shah, Freedman, & Vekiri, 2005). In behavior analysis, bar graphs are often used to show percentages (e.g., of correct responses) or average performance across stable sessions or conditions. As shown in Figure 9.2, bar graphs facilitate comparisons of performances (i.e., the height of each bar)

across conditions. Typically, bars are separated from each other, but related bars may be grouped together. One variation of the standard (vertical) bar graph is a horizontal bar graph, in which the categorical variable is plotted on the y-axis. On these graphs, the length of the bar along the x-axis shows the value of the dependent variable. Bar graphs may also be drawn so that bars can deviate in either direction from a center line. Such graphs may be used, for example, to show increases or decreases from baseline values that are represented by a center horizontal line at zero. Bar graphs are similar to histograms, but histograms (in which the bars touch each other) have interval x-axis values and typically show frequency distributions (see Figure 9.3). In bar graphs, the y-axis scale usually begins at the lowest possible value (e.g., zero) but may begin at a higher value if the low range would be devoid of data. Constraining the lower range of y-axis values will make differences across conditions appear bigger than they would have been had the range extended to zero, a factor to consider when evaluating the clinical (or other) importance of the visually

apparent difference. When bar graphs show measures of central tendency (e.g., means), error bars (the vertical lines in Figure 9.2) should be included to depict the variance in the data (e.g., between-session differences in percentage correct).

Figure 9.2.  Examples of bar graphs. Data are hypothetical.

Figure 9.3.  Example of a histogram. Data are hypothetical.

When group statistical designs are used, bar graphs frequently summarize the differences in group mean performances. Reliance on statistical analyses of grouped data may lead to the omission of error bars from such graphs, a practice that obscures the size of individual differences within groups. Figure 9.4 shows a variation on the between-groups bar graph that displays the performance of individual participants (individual data points) while maintaining the ease of comparing measures of central tendency (height of the bars). Graphs constructed in this way make it possible to evaluate to what extent the height of the bar describes the behavior of individual participants. A second advantage of this type of bar graph is that it allows readers to evaluate whether the within-group variance is normally distributed, an important factor when evaluating the appropriateness of the statistical analyses used. Finally, the graphing conventions of Figure 9.4 encourage the researchers to consider those individuals in the treatment group for whom the treatment produced no positive effect. As often noted by Sidman (1960), the data from these individuals serve to illustrate that the behavior is not fully understood and, by investigating the factors affecting these individuals’ behavior further, more effective interventions will follow.

Figure 9.4.  A bar graph that shows mean performance and the performance of individuals making up the mean. The height of the bar shows the mean percentage of urine samples negative for cocaine and opiates, and the closed circles show the percentage of negative samples for each individual undergoing employment-based abstinence reinforcement treatment for cocaine dependence. From “Employment-Based Abstinence Reinforcement as a Maintenance Intervention for the Treatment of Cocaine Dependence: A Randomized Controlled Trial,” by A. DeFulio, W. D. Donlin, C. J. Wong, and K. Silverman, 2009, Addiction, 104, p. 1535. Copyright 2009 by John Wiley & Sons, Ltd. Used with permission.
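A graph of the kind shown in Figure 9.4 (group means as bars with each participant overlaid as a point) is straightforward to produce with standard plotting tools. The sketch below is hypothetical and is not the DeFulio et al. (2009) analysis; the group labels, sample sizes, and values are invented for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical percentage data for two groups of participants.
rng = np.random.default_rng(1)
groups = {
    "Control": rng.uniform(10, 60, size=15),
    "Treatment": rng.uniform(30, 100, size=15),
}

fig, ax = plt.subplots(figsize=(4, 4))
for i, (label, values) in enumerate(groups.items()):
    ax.bar(i, values.mean(), width=0.6, color="lightgray", edgecolor="k")
    # Overlay each participant so readers can judge how well the mean
    # describes individuals (x positions are jittered slightly for visibility).
    x = np.full(values.size, i) + rng.uniform(-0.15, 0.15, size=values.size)
    ax.plot(x, values, "ko", markersize=4)

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups.keys())
ax.set_ylabel("Negative samples (%)")
ax.set_ylim(0, 100)
plt.tight_layout()
plt.show()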

Line Graphs (Time Series and Relational)

Line graphs are used to depict behavior across time (time-series line graphs) or relations between two variables (relational line graphs; see Tufte, 1983). With time-series line graphs (e.g., see Figure 9.5), the horizontal axis (referred to as the abscissa or the x-axis) illustrates the time point at which each data point was collected, and behavior is plotted at each of these time points on the vertical y-axis (ordinate). On relational line graphs, the x-axis shows values of the independent variable, and the y-axis shows a central tendency measure of the dependent variable. Data points in both types of line graphs are connected with straight lines. Figure 9.5 shows the parts of a time-series line graph. Axis labels indicate the variables plotted. If multiple graphs appear in a figure with the same axes, then, to reduce the amount of text in the figure, only the x- and y-axes on the bottom and leftmost graphs, respectively, are labeled (for an example, see Figure 9.6). The scales of both the x- and y-axes are divided into intervals. Successive intervals on an axis are equally sized and are marked with short lines, called tick marks, that intersect the

Figure 9.5.  Diagram of parts of a time-series line graph. Data are hypothetical.

axis. Tick marks can point inward or outward, but they should point outward if inward-pointing ticks would interfere with the data. Tick marks are labeled to indicate the interval value. Tick-mark intervals should be frequent enough that a reader can determine the value of a data point, but not so frequent that the axis becomes cluttered. Intervals between major ticks may be marked with unlabeled minor ticks. In Figure 9.5, for example, every fifth interval is labeled, and minor tick marks indicate intervals between them. If multiple graphs appear in a figure with the same axis scale, then only the tick marks on the bottom and left graphs, respectively, are labeled. The x- and y-axis scales should be large enough to encompass the full range of the data; however, they should not exceed the data range, or the graph will contain empty space and compress the data unnecessarily. Starting the axis scales at values below the minimum data range may make data points at the minimum value (e.g., zero) more visible by preventing them from falling directly on the x- or y-axis. As with bar graphs, line graphs normally start at zero, but if the data range is great, there may be a break in the axis with the larger numbers indicated after the break. When plotting data for individual participants in separate graphs within a single figure, it is ­sometimes not realistic to represent data for each

Figure 9.6.  Examples of linear and logarithmic (log) scales. The upper two graphs show data plotted using linear y-axis scales. The lower two graphs show the same data, but plotted using log (base 10) y-axis scales. Data are hypothetical.

participant within a single range on the y-axis (e.g., the range for one participant may be between one and 10 responses and for another, between 400 and 450 responses). It is always better to use the same ranges on the y-axis, but when this is not possible, different ranges for different participants may be used, and this deviation must be noted. The aspect ratio, or the height-to-width ratio of the y- and x-axes, should not distort the data. Too great an aspect ratio (i.e., a very tall graph) may magnify variability, or small effects, and too small an aspect ratio (i.e., a very long graph) may obscure important changes in behavior or variability in a data set (Parsonson & Baer, 1986). A 2:3 y:x aspect ratio (Parsonson & Baer, 1978) or a 1.0:1.618 aspect ratio (Tufte, 1983) has been recommended. Breaks on the y-axis may be used if there are outlier data points. Outliers are idiosyncratic data points that far exceed the range of other data (e.g., data points more than 3 standard deviations from the mean). Breaks on the x-axis may be used if there are breaks in data collection. Data points are marked with symbols, and a data path is created by connecting the data points with straight lines. When multiple data paths are shown on a graph, each data type is represented by a distinct symbol, and a central figure legend provides a concise description of each path. A common graphing convention in applied behavior analysis is to describe each data path with text and an arrow pointing from the text to the corresponding data path. Using a central legend, as in Figure 9.5, facilitates the transmission of information because scanning the graph is not required to find figure legend information. A second advantage of the central legend is that it avoids the possibility that arrows pointing to a particularly high or low data point may influence the visual analysis of the data. Phase changes in time-series line graphs are denoted by vertical lines extending the length of the y-axes. Phase-change lines are placed between the last data point of a condition and the first data point of the new condition. Data paths are broken across phase-change lines to avoid the appearance that behavior change produced by an independent variable occurred before the change in condition. Descriptive phase labels are centered at the top of

the space allocated to each phase. Figure legends, and phase labels, are usually placed within the rectangular space created by the x- and y-axes. Figure captions are placed below graphs and describe what is plotted, the axes, any abbreviations or symbols that appear in the graph, and any axis breaks. Linear interval scales are the most common scales used in time-series line graphs, but logarithmic (log) interval scales and semi-log interval scales (in which the x- or y-axis is log scaled and the other is linearly scaled) are also used. Log scales are helpful in more normally distributing data sets that are skewed toward large values (Cleveland, 1994), transforming curvilinear data into linear data (Shull, 1991) and showing proportional changes in behavior (Cooper, Heron, & Heward, 1987). Because the logarithm of zero is undefined, log scales have no zero. Log base 10, base 2, and base e (natural logs) are the most common log scales (see Cleveland, 1994, for some recommendations for the use of various log bases). Illustrations of data plotted on both a linear scale and a semi-log (base 10) scale are shown in Figure 9.6. In the upper left graph, response rates in Phase B appear to increase more quickly between sessions than in Phase A. This difference, however, may be attributed to the greater absolute value of the response rate. Plotted on a log scale (lower left graph), it is visually apparent that the rate of change is similar in both phases. In the upper right graph, there appears to be a large shift in performance from Phase A to Phase B. The arithmetic scale of the y-axis, however, compresses the low rates in Phase A. When data are plotted on a log scale, the low rates are more visible and the increase in responding in Phase B can be seen to be part of an increasing trend that began in Phase A.

Scatterplots

Scatterplots present a dependent variable in relation to either an independent variable (in which case the graph may be described as a relational graph; see Tufte, 1983) or another dependent variable. When both measures are dependent variables, either measure can be plotted on the horizontal axis, although if one measure is conceptualized as a predictor variable, it is plotted on the x-axis, and the other, the


Figure 9.7.  Example of a scatterplot. Data are hypothetical.

criterion variable, is plotted on the y-axis. An example of a scatterplot is shown in Figure 9.7. In this figure, which shows data from a hypothetical experiment investigating choice between two concurrently available reinforcement schedules, the log of the ratio of response rates on the right and left alternatives is plotted on the y-axis and the log of the ratio of reinforcement rates is plotted on the x-axis. In scatterplots, data points are not connected with lines, usually because measures are independent of each other (e.g., they are data points from different conditions or participants) or because they are not sequentially related. Sometimes, however, lines or curves are fit to data on scatterplots to indicate the form of the relation between the two variables (see Interpreting Relational Graphs section). The line in Figure 9.7 shows the best-fitting linear regression line. That data points fall near this line indicates a linear relation (matching) between response rates and reinforcement rates.
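A minimal sketch of the analysis behind a scatterplot like Figure 9.7, with invented data, is shown below: log response ratios are regressed on log reinforcement ratios with an ordinary least-squares fit (here via numpy.polyfit). Describing the slope and intercept as sensitivity and bias parameters follows the generalized matching framework; that interpretation, like the data, is an assumption of this example rather than something taken from the chapter.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical obtained reinforcer and response rates (right, left) across
# several concurrent-schedule conditions.
rft_right = np.array([10, 20, 40, 60, 80])
rft_left = np.array([80, 60, 40, 20, 10])
resp_right = np.array([12, 25, 42, 55, 78])
resp_left = np.array([75, 58, 38, 22, 11])

x = np.log10(rft_right / rft_left)     # log reinforcement ratios
y = np.log10(resp_right / resp_left)   # log response ratios

# Least-squares regression line through the points.
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots(figsize=(4, 4))
ax.plot(x, y, "ko")
xs = np.linspace(x.min(), x.max(), 100)
ax.plot(xs, slope * xs + intercept, "k-",
        label=f"slope = {slope:.2f}, intercept = {intercept:.2f}")
ax.set_xlabel("log (reinforcement rate ratio)")
ax.set_ylabel("log (response rate ratio)")
ax.legend()
plt.tight_layout()
plt.show()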

Other Types of Graphs

The types of graphs we have discussed do not represent an exhaustive list of the types of graphs used by behavior analysts and subjected to visual analysis. For example, sometimes examining data across time and within individual sessions is useful, in which case a three-dimensional graph would be appropriate, with the dependent variable on the y-axis, within-session time on the x-axis, and successive sessions on the third (z) axis (e.g., Cançado & Lattal, 2011). Three-dimensional graphs may also

be used to show other types of interactions, such as changes in interresponse time distributions on a reinforcement schedule across sessions (Gentry, Weiss, & Laties, 1983) or effects of different drug doses on response run length distributions (Galbicka, Fowler, & Ritch, 1991). Other graphing techniques have been used to depict specific kinds of relations. Staddon and Simmelhag (1971), for example, used detailed flow charts to graphically show the conditional probabilities of different responses given an initial response. Davison and Baum (2006) depicted the number of responses to different alternatives in a choice experiment as different-sized circles (bubbles). This technique could also be useful in showing, for example, time allocated to playing with multiple toys by a child across successive time periods. These examples are but a few of specialized graphs that may be useful in enhancing the visual depiction of specific data sets or aspects of data sets. For a more complete description of graph types, see Harris (1996). Cleveland and McGill (1984, 1985) offered some useful advice on how to choose graph types to show data with maximum clarity. In undertaking a graphical analysis of data, there are no immutable rules concerning which graphs to use for depicting what. Use is based on precedence, but investigators also need to think outside the axes (so to speak) in using graphs to tell the story of their data.
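A three-dimensional arrangement of the kind described (a dependent variable plotted against within-session time and successive sessions) can be sketched with matplotlib's 3-D axes. The data and the declining within-session pattern below are hypothetical and are invented purely to show the layout.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical within-session response rates (responses/min in successive
# 5-min bins) across 10 sessions, with the within-session decline becoming
# steeper over sessions.
bins = np.arange(1, 13)                      # within-session time bins
sessions = np.arange(1, 11)

fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot(projection="3d")
for s in sessions:
    rates = 60 * np.exp(-0.05 * s * bins)    # hypothetical decline
    ax.plot(bins, np.full(bins.size, s), rates, color="k", lw=1)

ax.set_xlabel("Within-session bin")
ax.set_ylabel("Session")
ax.set_zlabel("Responses per minute")
plt.tight_layout()
plt.show()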

General Recommendations for Graph Construction

Many features of a graph influence a reader’s reaction to the data. Even small details such as tick marks, axis scaling, data symbols, aspect (y:x) ratio, and so forth can affect a graph’s impact, and poor graphing methods can lead to misinterpretations of results (Cleveland, 1994). Creating graphs that are accurate, meaningful, rich in information, yet readily interpretable, therefore, requires planning, experimenting, reviewing, and close attention to detail (Cleveland, 1994; Parsonson & Baer, 1992). For some additional recommendations on producing useful graphs, see Baron and Perone (1998), Cleveland (1994), Johnston and Pennypacker (1993), Parsonson and Baer (1978, 1986), and Tufte (1983).


When preparing graphs for publication, the Publication Manual of the American Psychological Association (American Psychological Association, 2010) also offers valuable advice.

Interpreting Graphical Data

In the sections that follow, we describe some strategies for visually analyzing graphical data presented in cumulative records, bar graphs, time-series line graphs, and scatterplots. We also discuss the visual analysis of graphical data generated by some commonly used single-subject experimental designs. If the graph is published, the first step in visual analysis is to determine what is plotted by reading all of the text describing the graph, including the axis labels, condition labels, figure legend, and figure caption. The next step is the analysis of patterns in the graphical data.

Interpreting Cumulative Records

In cumulative records, changes in rate of responding and variability in responding are analyzed by examining changes in the slope of the records (Johnston & Pennypacker, 1993). Several patterns that may be distinguished in cumulative records are shown in Figure 9.1, a hypothetical cumulative record. The first smooth curve shows responding occurring at a steady, constant rate, whereas the second shows grainy responding, or responding occurring in unsystematic bouts of high and low rates separated by varying periods of not responding. The flat portion of the third curve indicates no responding. The greater slope of the fourth curve compared with the third curve indicates a higher rate of responding. Cumulative records also allow an analysis of responding at a more local level. For example, in Figure 9.1, the second curve from the left, between the third and fourth pip, shows that responding occurred first at a low rate, then rapidly increased, then gradually decreased again before the reinforcer delivery. Such a fine-grained analysis is not possible with other graph types.
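Because the slope of a cumulative record is the response rate, local rates can be read from a record or computed directly from the underlying timestamps. The short sketch below (hypothetical data, invented window size) counts responses in successive 1-min windows, which is the numerical counterpart of judging slope across segments of the record.

import numpy as np

# Hypothetical response timestamps (s) within a 600-s session.
rng = np.random.default_rng(2)
response_times = np.sort(rng.uniform(0, 600, size=300))

# The slope of the cumulative record over any interval is the local response
# rate: responses in the interval divided by its duration.
window = 60.0                                    # 1-min windows
edges = np.arange(0, 600 + window, window)
counts, _ = np.histogram(response_times, bins=edges)
local_rates = counts / (window / 60.0)           # responses per minute

for start, rate in zip(edges[:-1], local_rates):
    print(f"{start:5.0f}-{start + window:5.0f} s: {rate:5.1f} resp/min")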

Interpreting Bar Graphs

When visually analyzing bar graphs, the relative heights of bars are compared across conditions (see

Figure 9.2). When making this comparison, attention should be given to the y-axis scale to determine whether the range has been truncated. In bar graphs depicting average performance within a phase, a critical element to evaluate is the length of the error bars. Very long error bars suggest that the performance may not have been stable, and so it will be important to evaluate the stability criterion used. If the range of the data is used, long error bars may also occur if an outlying data point was included in the data set depicted in the graph. If so, then the average value depicted by the height of the bar may not represent most of the data; in such cases, the median would be a better measure of central tendency. Error bars also indicate the overlap of data points across conditions. For example, Figure 9.2 shows results from a hypothetical experiment that evaluated the effects of three time-out durations after incorrect responses on match-to-sample performance. The height of the bar is the mean, and the error bar shows the standard deviation. In the top graph, the error bars are long, and the mean of the 20-second condition overlaps with the variance in the 5-second condition. Thus, differences between the 5- and 20-second time-out duration conditions are less convincing than the difference between the no time-out and the 20-second conditions. Error bars provide no information about trends in the data, however, and a reader must look to the text of the article or to other graphs for evidence that the performances plotted in a bar graph represent stable responding. Care should also be taken to consider which measure of variability is represented by the error bars. The standard deviations plotted in Figure 9.2 quantify the average deviation of each data point from the condition mean and, therefore, are an appropriate measure of variability when mean values are reported (interquartile ranges usually accompany medians). Some error bars will depict the standard error of the mean, and readers should interpret these with caution. The standard error of the mean is used to estimate the standard deviation among many different means sampled from a normally distributed population of values. As such, it tells one less about the variability in the data than does the standard deviation. Moreover, the standard


error of the mean is calculated by dividing the sample standard deviation by the square root of n (i.e., the number of values used to calculate the mean); thus, error bars depicting the standard error of the mean will be increasingly more narrow than the standard deviation as the number of data points included in the data set increases. If the standard error of the mean had been plotted in Figure 9.2 instead of the standard deviation, the visually apparent difference between all three conditions would seem greater even with no change in the data set plotted. The general strategies we have outlined (i.e., consider the difference in the measure of central tendency in light of the variability in the data to evaluate how convincing the difference is) are formalized by common inferential statistical tests. Behavior analysts wanting to reach a broader audience of scientists (including extramural grant reviews), professionals, and public policymakers may wish to use these tests in addition to conducting a careful visual analysis of the data.
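The arithmetic described above is easy to verify: the standard error of the mean is the sample standard deviation divided by the square root of n, so it shrinks as more data points are added even when the underlying spread does not. A short sketch with simulated (hypothetical) percentage-correct scores:

import numpy as np

rng = np.random.default_rng(3)
for n in (5, 20, 80):
    scores = rng.normal(loc=75, scale=10, size=n)   # same underlying spread
    sd = scores.std(ddof=1)                         # sample standard deviation
    sem = sd / np.sqrt(n)                           # standard error of the mean
    print(f"n = {n:3d}: mean = {scores.mean():5.1f}, SD = {sd:4.1f}, SEM = {sem:4.2f}")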

Analyzing Time-Series Data in Line Graphs

The analysis of time-series data is the most prevalent visual inspection practice in behavior analysis. Basic and applied behavior analysts use visual analysis techniques to determine when behavior has stabilized within a phase, to judge whether behavior has changed between phases, and to evaluate the evidence that the experimental variable has affected individual behavior.

Evaluating stability.  Once an experiment is underway, one of the first decisions a researcher must make is, When should a phase change be made? In most cases, this question is answered by evaluating the stability (i.e., consistency) of responding over time (see Chapter 5, this volume). If behavior is not stable before the condition change (e.g., there is a trend in the direction of the anticipated treatment effect), then attributing subsequent shifts in responding to the experimental manipulation will be unconvincing (Johnston & Pennypacker, 1993). Moreover, an unstable baseline (i.e., one containing a great deal of between-session variability) will inhibit one’s ability to detect small but clinically important effects of an experimental manipulation (Sidman, 1960). Thus, whenever possible, conditions should remain unchanged until stability is achieved. Stability of time-series data may be assessed by visual inspection or quantitative criteria (see Perone, 1991; Sidman, 1960; Chapter 5, this volume). Both will evaluate bounce and trend, with the former catching patterns that may be missed by the quantitative criterion. Bounce refers to unsystematic changes in behavior across successive data points, whereas trend refers to a systematic (directional) change (Perone, 1991).

Figure 9.8.  Hypothetical baseline data with added mean lines (solid horizontal lines), range lines (long dashed horizontal lines), trimmed range lines (short dashed horizontal lines), and regression lines (solid trend lines).

Figure 9.8 shows baseline data for two participants. The baseline depicted in the top panel has considerably more between-session variability than

that depicted in the lower panel. The extent to which the researcher will be concerned with this bounce in the data will depend on how large the treatment effect is likely to be. If a very large effect is expected, then the intervention data should fall well outside of the baseline range, and therefore the relatively weak experimental control established in the baseline phase would be acceptable. If, however, a smaller effect is anticipated, then the intervention data are unlikely to completely separate from the range of data in the baseline, making detection of an intervention effect impossible. Under these conditions, the researcher would be well served to further identify the source of variability in the baseline. Indeed, if the researcher succeeds in this endeavor, a potent behavior-change variable may be identified. Visually analyzing bounce may be facilitated by the use of the horizontal lines shown in Figure 9.8. The solid horizontal line shows the mean of the entire phase (i.e., the mean level of the data path) and allows one to see graphically how much each data point deviates from an ostensibly appropriate measure of central tendency.1 The dashed lines furthest from the mean line illustrate the range of the data (i.e., they are drawn through the single data point furthest from the mean), whereas the dashed lines within these dashed lines show a trimmed range in which the furthest data point from the mean is ignored (see Morley & Adams, 1991). Drawing range and trimmed range lines may be useful when considering how large the intervention effect will have to be to discriminate a difference between the baseline and intervention data. Clearly, to produce a visually apparent difference, the intervention implemented in the top panel will have to produce a much larger effect than that implemented in the bottom panel. Neither range lines nor trimmed range lines will depict changes in variability within a condition, however. To visualize changes in variability within conditions, Morley and Adams (1991) suggested plotting trended range lines. To construct these, the data in a condition are divided in half along the x-axis, and the middle x-axis value of each half is located. For each half, the

minimum and maximum y-axis data points are located, and those values are marked at the corresponding x-axis midpoint. Finally, two lines on the graph are drawn connecting the two minimum data points from each half and the two maximum data points from each half. Converging lines suggest decreasing variability (bounce) across the phase, diverging lines suggest increasing variability, and parallel lines suggest that the variability is constant. The next characteristic of the baseline data to be considered, when deciding when to change phases, is the extent to which there is a trend in the data. A first step can be to plot a line of best fit (i.e., linear regression) through the baseline data. Any graphing software package will suffice. Researchers should be aware, however, that a line of best fit can be unduly affected by outliers (Parsonson & Baer, 1992). One alternative to linear regression that was recommended by Cleveland (1994) is the curve-smoothing loess technique. The loess technique is less sensitive to outliers and does not assume that data will conform to any particular function (e.g., a straight line). This technique smoothes data and makes patterns more visible by plotting, for each x-axis value, an estimate of the center of the distribution of y values falling near that x value (akin to a moving average; for descriptions, see Cleveland, 1994; Cleveland & McGill, 1985). Linear regression, however, has the advantage of being a more widely used technique, and it quantifies the linear relation between the two variables (i.e., estimates the slope and y-intercept). In the upper panel of Figure 9.8, the line of best fit indicates an upward trend in the baseline data, suggesting that if no intervention is implemented, the rate of response will continue to increase over time. This is problematic if one expects the intervention to increase response rates. In the lower panel of Figure 9.8, the trend line is horizontal and overlaps with the mean line. Thus, in the absence of an experimental manipulation, the best prediction about future behavior is that it will remain stable with little between-session variability. Baseline data need not be completely free of trends before a phase

1. Plotting a mean line is appropriate only if the data in the phase are free of extreme values that will pull the mean line away from the center of the distribution. In such cases, a median line would be a better visual analysis tool.

is ended and the intervention is begun. If the baseline data are trending down (up), and the intervention is anticipated to increase (decrease) responding, then the baseline trend is of little concern. A modest trend in the direction of the anticipated intervention effect is also acceptable as long as the intervention proves to produce a large change in trend and mean level. Finally, continuing a baseline until it is free of trends and the bounce is minimal is sometimes impractical. In applied settings, it may be unethical to continue a baseline until stability has been achieved because to do so delays the onset of treatment. These practical and ethical concerns, however, must be balanced with the goal of constraining the effects of extraneous variables so that orderly effects of independent variable manipulations can be observed in subsequent conditions. It may be of little use (and may also be considered unethical) to conduct an intervention if the data are so variable that it is impossible to interpret the effects of treatment. Thus, researchers should be especially concerned with achieving stability when treatment effects are unknown or are expected to be small (i.e., when one is conducting research rather than practice). Visual inspection of trends within a data set sometimes reveals nonlinear repetitive patterns, or cycles. Some cycles result from feedback loops created by self-regulating behavior–environment interactions (Baum, 1973; Sidman, 1960), whereas others result from extraneous variables. Identifying the source of cyclical patterns is sometimes necessary to produce behavior change. In Figure 9.9, for example, every other data point is higher than the preceding one. Such a pattern could be the result of

Figure 9.9.  Example of a figure showing a cyclical pattern. Data are hypothetical.

different experimenters conducting sessions, changes in levels of food deprivation, or perhaps practice effects if two sessions are conducted each day. Cycles may be difficult to detect if there is a good deal of between-session variability, but plotting data in various formats may help reveal cyclical patterns. For example, plotting each data point as a deviation from the mean using a vertical bar graph can make patterns in the variability more apparent (see Morley & Adams, 1991). The same strategies for evaluating the stability of baseline data are used to evaluate the stability of data in an intervention phase. Figure 9.10 repeats the data from Figure 9.8, but adds data from an intervention phase. In the upper panel, the line of best fit reveals an upward trend in the intervention phase, although the final four data points suggest that the behavior may have asymptoted. The researcher collecting these data should continue the intervention phase to determine whether the performance has reached an asymptote or will increase further given continued exposure to the intervention.
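The visual aids discussed in this section (mean line, range and trimmed-range lines, a least-squares trend line, and the trended range lines of Morley and Adams, 1991) can all be overlaid on a phase with a few lines of plotting code. The sketch below uses hypothetical baseline data; splitting the phase into two halves for the trended range lines follows the verbal description given earlier, and any resemblance to the published figures is coincidental.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical baseline responses per minute across 16 sessions.
rng = np.random.default_rng(4)
sessions = np.arange(1, 17)
baseline = 30 + 0.5 * sessions + rng.normal(0, 4, size=sessions.size)

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.plot(sessions, baseline, "ko-")

mean = baseline.mean()
ax.axhline(mean, color="k")                              # mean line

ax.axhline(baseline.max(), color="k", linestyle="--")    # range lines
ax.axhline(baseline.min(), color="k", linestyle="--")

# Trimmed range: ignore the single point farthest from the mean.
keep = np.delete(baseline, np.argmax(np.abs(baseline - mean)))
ax.axhline(keep.max(), color="k", linestyle=":")
ax.axhline(keep.min(), color="k", linestyle=":")

# Least-squares trend line through the baseline.
slope, intercept = np.polyfit(sessions, baseline, 1)
ax.plot(sessions, slope * sessions + intercept, "k-", lw=2)

# Trended range lines (Morley & Adams, 1991): split the phase in half,
# mark each half's minimum and maximum at its x-axis midpoint, and connect them.
halves = np.array_split(np.arange(sessions.size), 2)
mids = [sessions[h].mean() for h in halves]
ax.plot(mids, [baseline[h].min() for h in halves], "k--", lw=1)
ax.plot(mids, [baseline[h].max() for h in halves], "k--", lw=1)

ax.set_xlabel("Session")
ax.set_ylabel("Responses per minute")
plt.tight_layout()
plt.show()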

Figure 9.10.  Hypothetical baseline and intervention data with added mean lines (solid horizontal lines), range lines (long dashed horizontal lines), trimmed range lines (short dashed horizontal lines), and regression lines (solid trend lines). Baseline data are the same as in Figure 9.8.


In the lower panel, a similar upward trend may be observed in the intervention phase, but over the final 10 sessions of the phase, the performance has stabilized because there is little bounce around the mean line and no visually apparent trend. Evaluating differences across phases.  The second use of visual analysis of time-series data involves comparing the baseline and intervention data to determine whether the difference makes a compelling case that behavior has changed between phases. Determining whether behavior change was an effect of the intervention (assuming a compelling difference is observed) is a different matter, and one that we consider in more detail next. Five characteristics of the data should control the evaluation of behavior change across phases. The first is the change in level. Level refers to the immediate change in responding from the end of one phase to the beginning of the next (Kazdin, 1982). Level is assessed by comparing the last data point from a condition to the first data point of the subsequent condition. In the top graph of Figure 9.10, the change in level was a decrease from about 46 responses per minute to about 26 per minute. In the lower panel, the level increased from about 30 to 48 responses per minute. Level may be used to evaluate the magnitude of treatment effect. Large changes in level suggest a potent independent variable, but only when the data collected in the remainder of the intervention phase continue at the new level, as in the lower panel of Figure 9.10. The level change in the upper panel of Figure 9.10 is inconsistent with most of the remaining intervention data and, therefore, appears to be another instance of uncontrolled between-session variability. As this example illustrates, a level change is neither necessary nor sufficient to conclude that behavior changed in the intervention phase. The second, related characteristic that will affect judgments of treatment effects is latency to change. Latency to change is the time required for change in responding to be detected after the onset of a new experimental condition (Kazdin, 1982). To evaluate latency to change, a researcher must examine multiple data points after the condition change to determine whether a consistent change in level or a

change in trend occurs (at least three data points are required to detect a trend). A short latency to change indicates that the experimental manipulation produced an immediate effect on behavior, whereas a long latency to change indicates either that an extended exposure to the change in the independent variable is required before behavior changes (such as during extinction) or that the change is caused by an extraneous variable. Again, we consider the question of the causal relation between the behavior change and the intervention later in the chapter. In the top panel of Figure 9.10, approximately six sessions were required before the trend and mean level in the intervention phase appear distinguishable from baseline. In the lower graph, changes in trend and mean level were observed in the first three sessions after the phase change, showing more clearly that the data in the two phases are distinct. Although short latencies to change suggest that behavior has changed across phases, this change may be temporary and, therefore, additional observations should be made until one is convinced that the change is enduring. How many additional observations are necessary will be affected by factors such as baseline variability (as in the top panel of Figure 9.10) and how novel the finding is (skeptical scientists prefer to have many observations when the intervention is novel). Under most conditions, an intervention that produces a large but temporary behavior change is of limited utility. The third characteristic of time-series data that is used when visually evaluating differences across phases is the mean shift (Parsonson & Baer, 1992). Mean shift refers to the amount by which the means differ across phases. In both panels of Figure 9.10, there is an upward mean shift from baseline to intervention. The bottom graph, however, illustrates a shift that is visually more compelling. The reason for this takes us to the fourth characteristic controlling visual analysis activities: between-phase overlap. In the upper panel of Figure 9.10, as the range lines illustrate, five of eight data points in the intervention condition fall within the range of the preceding baseline data, and, therefore, the difference is not convincing. Perhaps, in the top graph, if additional data were collected during the intervention phase, and assuming responding remained at


the upper plateau characterizing the final intervention sessions, the difference might be compelling. In the lower graph of Figure 9.10, the level change, mean shift, and limited between-phase overlap in range make the difference visually apparent. The fifth characteristic of the data that will affect visual evaluation of between-phase differences is trend. As noted earlier, if the baseline data are trending up (or down) and responding increased (or decreased) during the intervention phase (upper graph of Figure 9.10), then the mean shift will not be convincing unless the trend line is much steeper in the intervention phase than at baseline. In the upper graph of Figure 9.10, the baseline data show a slight upward trend. Data in the subsequent intervention phase show a steeper trend. The greater the difference in trend is, the clearer it is that the mean shift in the intervention is not simply a continuation of the baseline trend. When evaluating mean shifts, floor or ceiling effects must be considered. These effects occur when performance has reached a minimum or maximum, respectively, beyond which it cannot change further. For example, if baseline response rates are low and an intervention is expected to decrease responding, mean shifts may be small because response rates have little room to further decrease. Readers skeptical of visual analysis practices may be unsettled by the use of terms and phrases such as judgment, visually apparent, and much steeper. How much steeper is “much steeper?” Although any interpretation of data requires that the researcher make a variety of judgment calls (e.g., which statistic to use, how to handle missing data), Fisher, Kelley, and Lomas (2003) sought to reduce the number of judgments by developing the conservative dual-criterion (CDC) technique to aid the visual analysis of single-case data. This method, illustrated in Figure 9.11, involves extending the baseline mean and trend lines into the intervention phase and raising both of these lines by 0.25 standard deviation (or lowering the lines by 0.25 standard deviation, if the intervention is anticipated to decrease responding). A difference across conditions is judged as meaningful when the number of intervention-phase data points falling above both lines (or below both lines in the case of an intervention designed to decrease a behavior)

Figure 9.11.  Example of the conservative dual-criterion technique applied to hypothetical intervention data. In the baseline phase, solid horizontal lines are mean lines, and solid trend lines are regression lines. To analyze intervention effects, these lines are superimposed onto the intervention phase and raised by 0.25 standard deviation from the baseline means. See text for details.

exceeds a criterion based on the binomial equation (i.e., exceeds the number that would be expected by chance). Fisher et al. found that the use of CDC procedures improved agreement on visual inspection (data were hypothetical, and intervention effects were computer generated). Figure 9.11 shows the CDC applied to the data shown in Figure 9.10. In the top panel, three data points in the intervention phase fall above the two criterion lines. Following the table presented in Fisher et al. (2003) for treatment conditions with eight data points, seven data points should be above both lines to conclude that a compelling difference exists between phases. In the lower panel, the CDC requires that 12 of the 15 data points in the intervention condition appear above both lines, a criterion easily met, so the researcher may conclude that behavior changed across phases. Although the CDC method appears to improve the accuracy of judgments of behavior change, only


a few studies have yet investigated this technique (Fisher et al., 2003; Stewart, Carr, Brandt, & McHenry, 2007). The Fisher et al. (2003) procedure is, of course, but one technique for making visual assessment of data more objective. It is ultimately incumbent on the investigator or therapist to provide convincing evidence of an effect, whether through some formal set of rules as illustrated by Fisher et al. or by amplifying the effect to the point at which reasonable people agree on it, through increased control over both independent and extraneous variables. Assuming that appropriate decisions were made about stability and a visually apparent behavior change was observed from baseline to the intervention phase, the next task is to evaluate the role of the intervention in that behavior change. Evaluating the causal role of the intervention requires that an appropriate experimental design be used, a topic falling under the scope of Chapter 5, this volume. Here, we largely confine our discussion to the visual analysis techniques appropriate to the most commonly used single-case research designs. In these sections of the chapter, the visual analysis focuses on answering the question, “Did the intervention change behavior?”
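Before turning to specific designs, the CDC computation described above can be made concrete with a minimal Python sketch. The data are hypothetical responses per minute, and the chance criterion is approximated here with a one-tailed binomial test (p = .5, alpha = .05); this reproduces the values cited in the text (7 of 8 and 12 of 15 points) but is offered only as an approximation of the table published by Fisher et al. (2003), not a substitute for it. Whether the 0.25-standard-deviation shift uses the sample or the population estimate is a detail specified in the original article; the sketch uses the sample estimate.

import numpy as np
from scipy.stats import binom

def cdc_test(baseline, intervention, increase_expected=True, alpha=0.05):
    # Conservative dual-criterion check in the spirit of Fisher, Kelley, and Lomas (2003).
    base = np.asarray(baseline, dtype=float)
    shift = 0.25 * base.std(ddof=1) * (1 if increase_expected else -1)
    mean_line = base.mean() + shift                         # criterion 1: shifted baseline mean line
    slope, intercept = np.polyfit(np.arange(len(base)), base, 1)
    x_int = np.arange(len(base), len(base) + len(intervention))
    trend_line = slope * x_int + intercept + shift          # criterion 2: shifted baseline trend line
    interv = np.asarray(intervention, dtype=float)
    if increase_expected:
        beyond = int(np.sum((interv > mean_line) & (interv > trend_line)))
    else:
        beyond = int(np.sum((interv < mean_line) & (interv < trend_line)))
    n = len(interv)
    # Smallest count of points beyond both lines that chance (p = .5) would produce
    # with probability less than alpha; n + 1 means the criterion cannot be met.
    required = next((k for k in range(n + 1) if binom.sf(k - 1, n, 0.5) < alpha), n + 1)
    return beyond, required

baseline = [12, 14, 13, 15, 14]                  # hypothetical responses per minute
intervention = [18, 21, 20, 22, 23, 24, 22, 25]
print(cdc_test(baseline, intervention))          # for these values, all 8 points exceed the required 7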

Comparison designs.  The data shown in the lower panel of Figure 9.11 come from a comparison design (or A-B design). There is evidence of a convincing change in behavior across conditions, level and mean level differ, variability and overlap of the data points across conditions are not interfering, and the latency to change is short. Despite stable data, one cannot conclude that the intervention produced the visually apparent behavior change. Although the rapid level change suggests an intervention effect, one cannot rule out extraneous variables that may have changed at the same time that the intervention was introduced (e.g., in addition to the intervention, Participant 2 may have been informed that if his productivity did not improve, his job would be in jeopardy). When visually analyzing data, a difference in behavior between two phases is insufficient evidence that the intervention, and not extraneous variables, produced the change.

Reversal designs.  In a reversal design, the experimental variable is introduced and removed, and systematic behavior changes with each manipulation provide evidence for a causal relation. Figure 9.12 shows the previously considered data set now extended to include a second baseline and a second intervention condition. The bottom graph is easily interpreted. The visually apparent difference between the first baseline phase and the first intervention phase is reversed in the return to baseline. In the second baseline phase, responding was well outside the range in which data should have fallen had no experimental manipulation been implemented. Further evidence for an intervention effect is that responding returned to the level observed in the original baseline. The reintroduction of the intervention (fourth phase) reverses the downward trend in the second baseline, yielding a striking level shift, mean shift, and minimal variability. There is very little overlap in the data across conditions, the latency to change is short, and there are no trends that make interpretation difficult. The mean level is close to the mean level obtained in the first exposure to the intervention, thus replicating the effect. These data thus make a strong case for the intervention as an effective means of influencing behavior.

Figure 9.12.  Example of a reversal design. Data from the first baseline and intervention phases are the same as shown in Figures 9.8 and 9.10. Data are hypothetical.


The upper panel of Figure 9.12 tells a different story. When the baseline conditions are reestablished in the third phase, there is a precipitous downward trend in behavior. Although this behavior change is consistent with the removal of an effective intervention, the between-session variability in the preceding condition renders the argument for a between-phase behavior change unconvincing. Clearly, the hypothetical researcher who collected these data failed to continue the first intervention phase long enough for a stable pattern of behavior to develop. If responding had stabilized in the upper plateau reached at the end of the first intervention phase, the sharp reduction in responding in the second baseline might have been more compelling. When the intervention is again introduced, the downward trend levels off, but the data points overlap considerably with the data points for the preceding baseline condition. Furthermore, the mean level in the second intervention phase did not closely replicate the mean level of the first intervention phase.


Multielement designs.  Figure 9.13 shows data from three hypothetical multielement experimental designs (Barlow & Hayes, 1979). In this design, conditions alternate (often after every session), and consistent level changes are suggestive of a functional (causal) relation between variables arranged in the condition and the behavior of interest. Visual analysis of multielement design data requires evaluation of sequence effects in addition to variability, mean shift, trend, overlap, and latency to change. Detection of sequence effects requires close attention to the ordering of conditions and patterning in the data (i.e., if responding in one condition is consistently elevated when preceded by one of the other conditions). The data in Figure 9.13 represent response rate in three different conditions, two interventions, and a no-intervention control condition. The top graph is easily interpreted. The mean level in Intervention 1 is higher than in the other two conditions, and there is no difference in mean level between Intervention 2 and the no-intervention condition. The data are relatively stable (i.e., there is little variability), and there are no trends to complicate interpretation. Thus, the difference in behavior between the


Figure 9.13.  Examples of data from multielement designs. Data are hypothetical.

conditions is obvious. The effects of each experimental manipulation on response rate are reliable (each repetition of a condition allows a test of reliability), and it would be extremely unlikely that some extraneous variable would happen to produce increases in response rate in each Intervention 1 session and none in any other session. Finally, the effect of Intervention 1 does not appear to be dependent on the prior condition, and in at least one case, the effect lasts when two consecutive Intervention 1 sessions are completed. These data provide compelling evidence that Intervention 1 is responsible for producing higher response rates than either Intervention 2 or the no-intervention condition. The middle graph contains more within-condition variability. The mean level is higher in the


Intervention 1 condition; however, there is considerable overlap in the range of response rates observed in each condition. Because it is not clear that behavior is distinct across conditions, the question of causation is moot. In the third graph, the mean level of the data in Intervention 1 is high, the mean level in the no-intervention condition is low, but the data in Intervention 2 are more difficult to interpret. During some sessions, the response rate is low; during others, it is high. This graph shows a hypothetical sequence effect. Each time an Intervention 2 session follows an Intervention 1 session, the response rate is high; otherwise, the response rate is low. The rate during the Intervention 2 sessions is affected by the preceding condition, which complicates interpretation of the effects of Intervention 2. If the high-rate Intervention 2 sessions were merely a carry-over effect of Intervention 1, then the no-intervention sessions that follow Intervention 1 sessions should show comparable high rates. A researcher who obtains findings of this sort will conduct further experimentation to clarify the processes responsible for the sequence effect.
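The conditional comparison just described can also be quantified directly. The short Python sketch below uses hypothetical session-by-session values loosely patterned on the description of the third graph (none of these numbers come from the chapter's figures) and computes the mean Intervention 2 response rate separately for sessions that do and do not follow an Intervention 1 session.

sessions = [            # hypothetical (condition, responses per minute), in session order
    ("I1", 32), ("I2", 30), ("NI", 8), ("I2", 9),
    ("I1", 34), ("I2", 31), ("NI", 7), ("I2", 10),
]
after_i1, after_other = [], []
for prev, cur in zip(sessions, sessions[1:]):
    if cur[0] == "I2":
        (after_i1 if prev[0] == "I1" else after_other).append(cur[1])
print(sum(after_i1) / len(after_i1))        # mean Intervention 2 rate when preceded by Intervention 1
print(sum(after_other) / len(after_other))  # mean Intervention 2 rate otherwise

A large difference between the two means is the numerical counterpart of the visual pattern described above.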


Multiple-baseline designs.  Multiple-baseline designs are frequently used in applied settings, either when it would be unethical to remove the treatment or because the treatment is expected to produce an irreversible effect. The design involves a series of comparison designs in which the researcher implements the treatment variable at different times across participants, behaviors, or contexts. A researcher visually analyzing data from a multiple-baseline design must evaluate whether there are convincing changes in mean level from baseline to treatment conditions, whether the effects are replicated across baselines, and whether changes in behavior occur only when the treatment is implemented for each baseline. Figure 9.14 illustrates data from a multiple-baseline design. In the top panel, a brief baseline precedes the intervention. The intervention produces level changes and mean shifts easily discriminated as behavior change. There is no latency to change, and there are no trends or overlap between data points across conditions to complicate data interpretation. As noted in the Comparison


Figure 9.14.  Example of data from a multiple-baseline design. Data are hypothetical.

Designs section, however, these data alone are insufficient to support causal statements. The behavior change could also have been caused by an extraneous variable that changed at the same time as the intervention (e.g., a change in classroom contingencies). If the latter were true, then one might expect this variable to affect behavior in the other baselines. To evaluate this, one examines the other baselines for behavior change that corresponds with the introduction of the intervention in the first graph (i.e., at Session 5). Figure 9.14 shows no evidence of this, which strengthens the case that the intervention produced the behavior change observed in the top panel. Further evidence that the intervention is related to the behavior change must be gathered in the remaining


panels of Figure 9.14 because the intervention is implemented at different points in time across the baselines. In the second graph, the data in baseline show an upward trend. Because the effect of the intervention is a decrease in the mean level, this trend does not negatively affect identifying the change in behavior in the second phase. Examination of the third graph shows that these baseline data were unaffected by the phase change depicted in the second graph, which provides further evidence that the behavior change observed in the first two graphs is a function of the intervention and not some uncontrolled variable. In the third graph, the baseline is relatively stable, and there is a large, immediate reduction in response rate after implementation of the intervention, which replicates the effects observed in the first two graphs, providing strong evidence of the effects of the intervention.

Changing-criterion designs.  In changing-criterion designs, a functional relation between the intervention and behavior change is established by (a) periodically changing the contingency specifying which responses (e.g., those with a force between 20 and 30 g) will lead to experimenter-arranged consequences and (b) observing that behavior approximates the specified criterion (Hall & Fox, 1977; Hartmann & Hall, 1976). Typically, the criterion in graphs of changing-criterion designs is indicated by horizontal lines at each phase. Visual analysis of changing-criterion designs, as with that of other designs, requires an assessment of variability, level, mean shift, trend, overlap, and latency to change but also requires an assessment of the relation between behavior and the criterion. Figure 9.15 shows data from Hartmann and Hall (1976), who used a changing-criterion design to assess the effectiveness of a smoking cessation program. During the intervention, the participant was fined a small amount of money for smoking above a criterion number of cigarettes and earned a small amount of money for smoking fewer cigarettes. In the top graph, the number of cigarettes smoked per day is shown across successive days of the intervention. In baseline, the number of cigarettes smoked per day was stable over the first 6 days but fell precipitously on Day 7. Ideally, the researchers would not have begun the intervention on Day 8, as

Figure 9.15.  Example of data from a changing-criterion design. The figure shows the number of cigarettes smoked per day. Solid horizontal lines depict the criterion for each phase. From "The Changing Criterion Design," by D. P. Hartmann and R. V. Hall, 1976, Journal of Applied Behavior Analysis, 9, p. 529. Copyright 1976 by the Society for the Experimental Analysis of Behavior, Inc. Used with permission.

they did, because if a trend line was drawn through these baseline data, the subsequent decreases in smoking would be predicted to occur in the absence of an intervention. Had the researchers collected more baseline data, they would likely have found that Day 7 was uncharacteristic of this individual’s rate of smoking and could have more clearly established the stable rate of baseline smoking. In subsequent phases (B–G), the criterion number of cigarettes was systematically decreased, as indicated by the horizontal line in each phase. Changes in the criterion tended to produce level changes, with many subsequent data points falling exactly on the criterion specified in that phase. Each phase establishes a new mean level approximating the criterion. There are no long latencies to change, and the variability in the data and overlap in data points across conditions are not sufficient to cause concern. Finally, with the exception of Phase F, there is no downward trend in any phase, suggesting that if the criterion remained unchanged, smoking would remain at the current depicted level. Thus, the visual analysis of these data raises concerns about the downward trend in the baseline, but these concerns are largely addressed by the repeated demonstrations of control over smoking rate in each condition. If the study were ongoing and concerns remained, the


researchers could set the next criterion (Phase H) above the last one. If smoking rate increased to the new criterion, then additional evidence for intervention control would be established while nullifying concerns about the downward trend in baseline.
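One simple way to quantify the correspondence between behavior and the criterion on which this design relies is to compute each phase's mean and its deviation from that phase's criterion. The following minimal sketch uses invented values for illustration only; they are not Hartmann and Hall's (1976) data.

phases = {              # phase label: (criterion, hypothetical cigarettes per day)
    "B": (15, [16, 15, 15, 14, 15]),
    "C": (12, [13, 12, 12, 12, 11]),
    "D": (10, [10, 10, 11, 10, 9]),
}
for name, (criterion, counts) in phases.items():
    mean = sum(counts) / len(counts)
    print(f"Phase {name}: criterion {criterion}, mean {mean:.1f}, deviation {mean - criterion:+.1f}")

Small deviations in every phase are the numerical analogue of data points falling on or near the criterion lines in the graph.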

Interpreting Relational Graphs

Researchers who conduct time-series research may report their outcomes using relational graphs. In these cases, each data point represents the mean (or another appropriate measure of central tendency) of steady-state responding from a condition. When evaluating these data, measures of variability, such as error bars, are also assessed to help determine whether responding was stable (see Interpreting Bar Graphs section). Data on relational graphs are evaluated by analyzing the clustering and trend of the data points. Data that appear horizontal across all values of the x-axis indicate that the independent or predictor variable has no effect on behavior or that there is no correlation between the two dependent variables. Sometimes behavior changes in a linear fashion across the range of x-axis values of the independent variable. When both axes of the graph are scaled linearly, a linear relation indicates that changing the independent variable produces a constant increase or decrease in behavior. Nonlinear relations indicate that the behavioral effect of the independent variable changes across x-axis values. Figure 9.16 provides an example. Here, the subjective value of a $10 reward is plotted as a function of the delay to its delivery. Both axes are linear, and the relation between the variables is nonlinear. Relational graphs, when properly constructed, allow the researcher to quickly evaluate the relation between variables. It is common, however, to evaluate relational data more precisely with quantitative methods, including curve-fitting techniques (e.g., linear and nonlinear regression), Pearson's correlation coefficient, and quantitative models (see Chapters 10 and 12, this volume). Curve-fitting techniques clarify the form of the relation between the independent and dependent variables, and Pearson's correlation coefficient quantifies the relation between two dependent variables. Quantitative models may describe more complex behavior–environment relations and are used to make predictions about behavior. Even

Figure 9.16.  Example of curve fitting. From "Delay or Probability Discounting in a Model of Impulsive Behavior: Effect of Alcohol," by J. B. Richards, L. Zhang, S. H. Mitchell, and H. de Wit, 1999, Journal of the Experimental Analysis of Behavior, 71, p. 132. Copyright 1999 by the Society for the Experimental Analysis of Behavior, Inc. Used with permission.

when quantitative methods are used to describe data, however, visual analysis is used as a supplement. For example, visual analysis can help researchers choose which type of curve to fit to the data, evaluate whether data trends are linear and thus appropriate for calculating Pearson correlation coefficients, or determine whether the data have systematic deviations from the fit of a quantitative model. Figure 9.16, for instance, shows the best fits of both exponential and hyperbolic models to the subjective value of delayed money. The figure shows that the exponential model systematically predicts a lower y-axis value than that obtained at the highest x-axis value. Reference lines may be added to relational graphs (or other graph types) to provide a point of comparison to the data. For instance, in a graph showing discrete-trial performances, such as matching-to-sample, reference lines may be plotted at values expected by chance. In a graph depicting choice data, reference lines might be plotted at values indicative of indifference.
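To illustrate the curve-fitting step described for Figure 9.16, the following Python sketch fits hyperbolic and exponential forms to hypothetical subjective-value data for a $10 reward and compares the residual error of the two fits. The delays and values are invented for illustration; they are not the data of Richards et al. (1999).

import numpy as np
from scipy.optimize import curve_fit

delays = np.array([0, 1, 7, 30, 90, 180, 365], dtype=float)   # days (hypothetical)
values = np.array([10.0, 9.2, 7.8, 5.6, 3.9, 2.9, 2.0])       # subjective value of $10 (hypothetical)

def hyperbolic(d, k):
    return 10.0 / (1.0 + k * d)

def exponential(d, k):
    return 10.0 * np.exp(-k * d)

for label, model in (("hyperbolic", hyperbolic), ("exponential", exponential)):
    (k,), _ = curve_fit(model, delays, values, p0=[0.01])
    sse = float(np.sum((values - model(delays, k)) ** 2))
    print(f"{label}: k = {k:.4f}, sum of squared residuals = {sse:.2f}")

Plotting the residuals (observed minus predicted values) against delay is the visual check described above for detecting systematic deviations from a model's fit.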


Conclusion

Graphs provide clear and detailed summaries of research findings that can guide scientific decisions and efficiently communicate research results. Visual analysis, as with any form of data analysis, requires training and practice. The use of visual analysis as a method of data interpretation requires graph readers to make sophisticated decisions, taking into account numerous aspects of the data. This complexity can make the task seem daunting or subjective; however, visual analysis in conjunction with rigorous experimental procedures is a proven, powerful, and flexible method for generating scientific knowledge. The development of effective behavioral technologies provides evidence of the ultimate utility of the visual analysis techniques used in behavior-analytic research. Data analyzed by means of visual inspection have contributed to a technology that produces meaningful behavior change in individuals across a wide range of skill domains and populations, including individuals with no diagnoses and those with diagnoses including attention deficit/hyperactivity disorder, autism, an array of developmental disabilities, pediatric feeding disorders, and schizophrenia, to name a few (Didden, Duker, & Korzilius, 1997; Lundervold & Bourland, 1988; Weisz, Weiss, Han, Granger, & Morton, 1995). Because of visual inspection's history of effective application and its advantages for the study of the behavior of individuals, behavior analysts remain committed to it as a primary method of data analysis.

References

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author. Barlow, D. H., & Hayes, S. C. (1979). Alternating treatments design: One strategy for comparing the effects of two treatments in a single subject. Journal of Applied Behavior Analysis, 12, 199–210. doi:10.1901/jaba.1979.12-199 Baron, A., & Perone, M. (1998). Experimental design and analysis in the laboratory study of human operant behavior. In K. A. Lattal & M. Perone (Eds.), Handbook of research methods in human operant behavior (pp. 45–91). New York, NY: Plenum Press. Baum, W. M. (1973). The correlation-based law of effect. Journal of the Experimental Analysis of Behavior, 20, 137–153. doi:10.1901/jeab.1973.20-137 Cançado, C. R. X., & Lattal, K. A. (2011). Resurgence of temporal patterns of responding. Journal of the Experimental Analysis of Behavior, 95, 271–287.

Cleveland, W. S. (1994). The elements of graphing data (rev. ed.). Summit, NJ: Hobart Press. Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79, 531–554. doi:10.2307/2288400 Cleveland, W. S., & McGill, R. (1985). Graphical perception and graphical methods for analyzing scientific data. Science, 229, 828–833. doi:10.1126/ science.229.4716.828 Cooper, J. O., Heron, T. E., & Heward, W. L. (1987). Applied behavior analysis. Columbus, OH: Merrill. Davison, M., & Baum, W. M. (2006). Do conditional reinforcers count? Journal of the Experimental Analysis of Behavior, 86, 269–283. doi:10.1901/ jeab.2006.56-05 DeFulio, A., Donlin, W. D., Wong, C. J., & Silverman, K. (2009). Employment-based abstinence reinforcement as a maintenance intervention for the treatment of cocaine dependence: A randomized controlled trial. Addiction, 104, 1530–1538. doi:10.1111/j.13600443.2009.02657.x Didden, R., Duker, P. C., & Korzilius, H. (1997). Metaanalytic study on treatment effectiveness for problem behaviors with individuals who have mental retardation. American Journal on Mental Retardation, 101, 387–399. Ferster, C. B., & Skinner, B. F. (1997). Schedules of reinforcement. Acton, MA: Copley. (Original work published 1957) doi:10.1037/10627-000 Fisher, W. W., Kelley, M. E., & Lomas, J. E. (2003). Visual aids and structured criteria for improving inspection and interpretation of single-case designs. Journal of Applied Behavior Analysis, 36, 387–406. doi:10.1901/jaba.2003.36-387 Galbicka, G., Fowler, K. P., & Ritch, Z. J. (1991). Control over response number by a targeted percentile schedule: Reinforcement loss and the acute effects of d-amphetamine. Journal of the Experimental Analysis of Behavior, 56, 205–215. doi:10.1901/jeab.1991.56-205 Gallistel, C. R., King, A. P., Gottlieb, D., Balci, F., Papachristos, E. B., Szalecki, M., & Carnone, K. S. (2007). Is matching innate? Journal of the Experimental Analysis of Behavior, 87, 161–199. doi:10.1901/jeab. 2007.92-05 Gentry, G. D., Weiss, B., & Laties, V. G. (1983). The microanalysis of fixed-interval responding. Journal of the Experimental Analysis of Behavior, 39, 327–343. doi:10.1901/jeab.1983.39-327 Hall, R. V., & Fox, R. G. (1977). Changing-criterion designs: An alternative applied behavior analysis procedure. In B. C. Etzel, J. M. LeBlanc, & D. M. Baer (Eds.), New developments in behavioral research:


Theory, method, and application (pp. 151–166). Hillsdale, NJ: Erlbaum. Harris, R. L. (1996). Information graphics: A comprehensive illustrated reference. Atlanta, GA: Management Graphics. Hartmann, D. P., & Hall, R. V. (1976). The changing criterion design. Journal of Applied Behavior Analysis, 9, 527–532. doi:10.1901/jaba.1976.9-527 Johnston, J. M., & Pennypacker, H. S. (1993). Strategies and tactics of behavioral research (2nd ed.). Hillsdale, NJ: Erlbaum. Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings. New York, NY: Oxford University Press. Lundervold, D., & Bourland, G. (1988). Quantitative analysis of treatment of aggression, self-injury, and property destruction. Behavior Modification, 12, 590–617. doi:10.1177/01454455880124006 Morley, S., & Adams, N. I. (1991). Graphical analysis of single-case time series data. British Journal of Clinical Psychology, 30, 97–115. doi:10.1111/j.2044-8260. 1991.tb00926.x Parsonson, B. S., & Baer, D. M. (1978). The analysis and presentation of graphic data. In T. R. Kratochwill (Ed.), Single-subject research: Strategies for evaluating change (pp. 101–165). New York, NY: Academic Press. Parsonson, B. S., & Baer, D. M. (1986). The graphic analysis of data. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances (pp. 157–186). New York, NY: Plenum Press. Parsonson, B. S., & Baer, D. M. (1992). The visual analysis of data, and current research into the stimuli controlling it. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case research design and analysis: New directions for psychology and education (pp. 15–40). Hillsdale, NJ: Erlbaum. Perone, M. (1991). Experimental design in the analysis of free-operant behavior. In I. H. Iversen & K. A. Lattal

(Eds.), Experimental analysis of behavior, Parts 1 and 2. Techniques in the behavioral and neural sciences (Vol. 6, pp. 135–171). New York, NY: Elsevier. Richards, J. B., Zhang, L., Mitchell, S. H., & de Wit, H. (1999). Delay or probability discounting in a model of impulsive behavior: Effect of alcohol. Journal of the Experimental Analysis of Behavior, 71, 121–143. doi:10.1901/jeab.1999.71-121 Shah, P., Freedman, E. C., & Vekiri, I. (2005). The comprehension of quantitative information in graphical displays. In P. Shah & A. Miyake (Eds.), The Cambridge handbook of visuospatial thinking (pp. 426– 476). New York, NY: Cambridge University Press. Shull, R. L. (1991). Mathematical description of operant behavior: An introduction. In I. H. Iversen & K. A. Lattal (Eds.), Experimental analysis of behavior, Parts 1 and 2. Techniques in the behavioral and neural sciences (Vol. 6, pp. 243–282). New York, NY: Elsevier. Sidman, M. (1960). Tactics of scientific research. Oxford, England: Basic Books. Staddon, J. E. R., & Simmelhag, V. L. (1971). The “superstition” experiment: A reexamination of its implications for the principles of adaptive behavior. Psychological Review, 78, 3–43. doi:10.1037/h0030305 Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680. doi:10.1126/science.103. 2684.677 Stewart, K. K., Carr, J. E., Brandt, C. W., & McHenry, M. M. (2007). An evaluation of the conservative dualcriterion method for teaching university students to visually inspect AB-design graphs. Journal of Applied Behavior Analysis, 40, 713–718. Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press. Weisz, J. R., Weiss, B., Han, S. S., Granger, D. A., & Morton, T. (1995). Effects of psychotherapy with children and adolescents revisited: A meta-analysis of treatment outcome studies. Psychological Bulletin, 117, 450–468. doi:10.1037/0033-2909.117.3.450


Chapter 10

Quantitative Description of Environment–Behavior Relations Jesse Dallery and Paul L. Soto

In 1687, Sir Isaac Newton invented a profound new way of thinking about how objects move across space and time. Newton tackled familiar empirical facts such as the observation that objects fall when dropped, but the description of falling objects was elevated to a new realm of precise, quantitative analysis. Among other feats, his analysis produced the universal law of gravitation, which yielded considerable predictive and practical advantages over previous accounts (Kline, 1959). The stunning successes of classical mechanics, of landing rockets on the moon, would not have been possible before Newton’s framework. In addition to its predictive and practical advantages, the universal law of gravitation unified seemingly disparate phenomena: It described moving bodies both here on Earth and in the heavens. Newton’s account also provided a foundation for attempts to explain gravity. As sciences have matured, from astronomy to zoology, so too has their use of quantitative analysis and their ability to describe, unify, and explain. The field of behavior analysis has witnessed similar transformations in how people describe an organism’s behavior across space and time. Fortunately, one does not need Newton’s calculus to understand and appreciate these exciting advances in behavioral science. In this chapter, we explain key techniques involved in quantitative analysis. We describe how quantitative models are evaluated and compared. To provide some theoretical backbone to our explication of techniques, we use a model of choice known as matching theory as

a case study in quantitative analysis. Although we have selected matching theory as our case study because of its widespread application in behavior analysis, the techniques and issues we discuss could be generalized to any quantitative model of environment–behavior relations, and where appropriate we highlight these extensions.

On the Utility of Quantitative Models

Models specify relations among dependent variables and one or more independent variables. These relations can be specified using words (a verbal model) or mathematics (a mathematical or quantitative model). Verbal models can be just as useful as quantitative models in describing the causes of behavior. For instance, an applied behavior analyst does not need quantitative models to assess determinants of self-injurious behavior or to generate verbal behavior for a child with autism. However, there are also examples in which quantitative models are useful in the applied realm (Critchfield & Reed, 2009; Fisher & Mazur, 1997; Mazur, 2006; McDowell, 1982). Quantitative models of choice can help the analyst tease apart controlling variables in a naturalistic setting (McDowell, 1982; see Fuqua, 1984, for some caveats to this assertion), or they may be useful in terms of evaluating preferences for treatment options (Fisher & Mazur, 1997). Even if the benefits are not immediate, knowledge of quantitative accounts could lead to alternative treatments


(McDowell, 1982), or it could inspire new insights into the determinants of problem behavior (Critchfield & Kollins, 2001; Fisher & Mazur, 1997; Nevin & Grace, 2000; see Volume 2, Chapters 5 and 7, this handbook). Behavioral scientists should appreciate and understand a wide variety of analytic methods for identifying the determinants of behavior. Quantitative analysis entails a rich and powerful set of tools. From 1970 to 2000, basic behavior-analytic science saw an increase from 10% to almost 30% in articles published in the Journal of the Experimental Analysis of Behavior that used equations to describe behavior (Mazur, 2006). In addition to understanding the advances described in these articles, other intellectual and practical payoffs are derived from knowledge of quantitative analysis. These benefits may be delayed, but they can be profound. The English philosopher Roger Bacon (c. 1214–1294, as quoted in Kline, 1959) noted, Mathematics is the gate and the key of the sciences. . . . Neglect of mathematics works injury to all knowledge, since he who is ignorant of it cannot know the other sciences or the things of this world. And what is worse, men who are thus ignorant are unable to perceive their own ignorance and so do not seek a remedy. (p. 1) Quantitative models have at least four benefits. First, quantitative models force a greater degree of precision than their verbal counterparts. Assume several reinforcer rates have been experimentally arranged, from low to high, and the response rates obtained at each reinforcer rate have been measured. A verbal description might assert that at low reinforcer rates, increases in reinforcer rate produce rapid increases in response rate, but that response rate eventually reaches a maximum at higher reinforcer rates. The same description of the relation between response rate and reinforcer rate is embodied by Herrnstein's hyperbolic equation (described in more detail later). Herrnstein's equation states succinctly and precisely how behavior will change with increases in reinforcement. Moreover, Herrnstein's equation precisely asserts how reinforcement from the experimentally arranged schedule of

reinforcement interacts with other, background sources of reinforcement. Assumptions about and descriptions of interactions between variables can be particularly troublesome and obscure if formulated in verbal terms. Quantitative models force specificity about the nature of these interactions. For example, Staddon (1984) analyzed a model of social dynamics by expressing the model quantitatively. Staddon’s quantitative analysis revealed several inconsistencies (and even contradictions) in the corresponding verbal formulation. Staddon concluded that “unaided verbal reasoning is almost useless in the analysis of the dynamics of interactions, human or mechanical” (p. 507). Second, the predictions of a quantitative model are more specific than the predictions of a verbal description, which allows one to observe minor but systematic deviations from the predictions. For instance, one could verbally describe where a projectile might land after being launched into the air (e.g., “over there”), or one could use an equation to make a precise prediction (e.g., 157 feet, 2 inches, due north). Any deviation from the latter location may be precisely measured, and records may be kept to determine whether the equation makes systematic errors (e.g., it consistently overestimates the distance traveled). Similarly, one could hypothesize verbally about how much reinforcement would be necessary to establish some level of appropriate behavior (e.g., one might say one needs to identify a “powerful” reinforcer or deliver the reinforcer “frequently”). As with the estimate of the projectile, a hypothesis might be qualitatively correct, but it will be more precise if one uses established equations of reinforced responding. Admittedly, this degree of precision may not be necessary in an applied context. Nevertheless, a hallmark of behavioral science is the predictive power of its explanations, and, as these examples imply, quantitative models surpass verbal models in making predictions. Third, to the extent that the predictions and conditions under which one should observe them are more precise, the more falsifiable the theory becomes. As the philosopher of science Karl Popper (1962) noted, falsifiability is a virtue of scientific theory. Consider Sir Arthur Eddington’s test of Albert Einstein’s theory of general relativity. One of


general relativity theory’s critical predictions is that gravity bends light. To test this prediction, Eddington measured the amount of shift in light from stars close to the sun (Silver, 1998). The sun’s powerful gravitational field, according to general relativity theory, should produce measurable shifts in the light emanating from the nearby stars. Eddington waited for an eclipse (the eclipse of 1919), when the stars adjacent to the sun were observable, and he measured the amount of shift in light emanating from those stars. He found not only that the light from these stars did indeed bend around the sun, but also that the exact amount of shift was as predicted by Einstein’s theory. An observation (or more realistically, a series of observations) to the contrary would have falsified the theory. Similarly, Herrnstein (1974) noted that certain predictions made by his quantitative theory of reinforced responding were especially vulnerable to “empirical confirmation or refutation” (p. 164). All good scientific theories, whether verbal or quantitative, are falsifiable. One virtue of most quantitative theories is that the conditions under which one can refute them are relatively clear (McDowell, 1986). Fourth, quantitative modeling encourages researchers to unify diverse phenomena; to see similarities in the determinants of behavior across settings, cultures, and species (Critchfield & Reed, 2009; Lattal, 2001; Mazur, 2006; see Chapter 7, this volume). Another way to put this is that a goal of a science of behavior is to discover invariance, or regularity in nature. For example, people see the diversity of structure in the animal kingdom as a result of evolutionary processes and the diversity of geological events as a result of tectonic processes. The mathematician Bell (1945, p. 420) defined invariance as “changelessness in the midst of change, permanence in a world of flux, the persistence of configurations that remain the same despite the swirl and stress of countless transformations” (also quoted in Nevin, 1984, p. 422). Skinner (1950/1972) also saw the discovery of invariance as a worthy goal: Beyond the collection of uniform relationships lies the need for a formal

representation of the data reduced to a minimal number of terms. A theoretical construction may yield greater generality than any assemblage of facts. It will not stand in the way of our search for functional relations because it will arise only after relevant variables have been found and studied. Though it may be difficult to understand, it will not be easily misunderstood. (p. 100) (A formal representation here means a mathematical representation.) Skinner's (1950/1972) assessment of theory, however, was tempered by a recommendation that psychologists must first establish an experimental analysis of how relevant variables affect behavior. Ultimately, however, quantitative theory increases the precision and generality of explanations and improves the ability to predict and influence behavior.

Structure and Function of Quantitative Models

Although quantitative models can generate new predictions and explanations of behavior, they are often developed inductively on the basis of descriptions of empirical facts, and this is where we start. After rigorous, parametric, experimental analysis, researchers plot a dependent variable (e.g., response rate, interresponse times, ratio of time spent engaged in one activity over another activity) as a function of an independent variable (e.g., reinforcer rate, time in session, reinforcers delivered for one activity relative to another activity) in graphical space. They examine the shape of the relation. Is it described by a straight line or a curve? In behavioral science, relations characterized by straight lines or monotonic curves (curves that change in one direction) are common. Researchers may also model behavioral processes that show a bitonic relation, which is a curve that changes in two directions (e.g., a curve that increases and then decreases or vice versa). Although these shapes are not complicated, the equations that describe them may appear to be. If the equations are intimidating at first, start with the shapes and the specific environment–behavior


relations described by the shapes. Therefore, use careful visual inspection (see Chapter 9, this volume) of the data. The importance of visual analysis is illustrated nicely by Anscombe’s quartet (Anscombe, 1973), which is a series of four datasets that are shown in Figure 10.1. Across the four panels of the figure, the mean and variance of the data points in each panel are equivalent, as is the linear equation describing the relation between the independent (x-axis) and dependent (y-axis) variables: y = 0.5x + 3. Even a casual visual analysis, however, reveals that the shape of each dataset is remarkably distinct and that only the data in the upper left panel are appropriately described by a linear equation. A purely quantitative analysis alone is not sufficient. Careful use of quantitative analysis involves careful visual inspection of the data (Parsonson & Baer, 1992; see Chapter 9, this volume), not to mention rigorous experimental analysis (Perone, 1999; Sidman, 1960; see Chapter 5, this volume). After a visual analysis of the relation between independent and dependent variables, one must

dissect the anatomy of the equation that describes the relation. In behavioral and psychology journals, equations may appear in the introduction of an article. Thus, dissecting the equation may require some detective work and skipping ahead in the article to find a graph that shows the shape described by the equation. Comparing the equation with the graph is useful. Here is an example of a common equation in behavioral science:

R = k[r/(r + re)].  (1)

This equation, known as Herrnstein’s hyperbolic equation, describes a hyperbolic relation between reinforcer rate, r, and response rate, R. Figure 10.2 shows two examples of this hyperbolic shape described by the equation. The first step to understanding a new equation is to identify the variables in the equation, or the environmental and behavioral events that are measured directly in the experiment. In the case of Herrnstein’s hyperbolic equation, the experimenter measures response rate,

Figure 10.1.  Anscombe's (1973) quartet. The four graphs show different data sets that are described by the same linear equation, which is indicated by the straight line in each panel. The datasets also have the same mean and variance. From "Graphs in Statistical Analysis," by F. J. Anscombe, 1973, American Statistician, 27, pp. 19–20. Reprinted with permission from The American Statistician. Copyright 1973 by the American Statistical Association. All rights reserved.


Figure 10.2.  Two examples of Herrnstein’s hyperbolic equation, which predicts how response rate changes with increases in reinforcer rate. The only difference between the two curves is the value of re.

R, as a function of reinforcer rate, r. For example, assume one measured rates of engaging in disruptive behavior as a function of rates of social attention for this behavior. The dependent variables (e.g., response rates) will always appear on the left side of the equation, and the independent variables (e.g., reinforcer rates) will always appear on the right side of the equation. Next, identify the parameters. The parameters are the quantities in the equation that are not directly measured; rather, they are estimated statistically (see Evaluating Linear Models: Linear Regression and Evaluating Nonlinear Models: Nonlinear Regression sections). The parameters affect the shape and position of the curve (or line), and a computer program that conducts regression analyses will find values for the parameters so that the curve (or line) comes as close to the data as possible (called best-fit parameter values). To illustrate how the value of a parameter affects the shape of the curve in Figure 10.2, we used two values for the parameter re and the same value for the parameter k. If one considers just the top curve, for example, the curve predicts response rates given one value of k, one value of re, and all reinforcer rates between zero and 200 reinforcers per hour. The curve can also predict response rates for higher reinforcer rates, but for practical purposes, we do not show these in the graph. The value of re is small in the top curve. In terms of the equation, the value of re does not produce a large impact in the denominator, and so

response rates increase rapidly with increases in reinforcer rates. As re increases, however, as in the bottom curve, the effects of increases in reinforcer rates are essentially dampened. Because r and re are summed in the denominator, smaller changes occur in R with increases in reinforcer rates. Parameters usually reflect the operation of environmental, behavioral, or physiological processes. For example, the parameter re in Herrnstein's hyperbolic equation is thought to reflect the rate of background, extraneous reinforcers. In our example, these extraneous reinforcers represent all of the reinforcers in the child's environment (video games, food, etc.) except for the reinforcers delivered as a consequence of disruptive behavior. The degree to which the interpretations of parameters are confirmed empirically is an interesting and exciting feature of quantitative modeling. As we discuss in more detail in the Evaluating Nonlinear Models: Nonlinear Regression section, even if an equation describes a dataset extremely well (e.g., the curve comes very close to the obtained data), it does not necessarily mean that the interpretations of the parameters are supported. Eventually, such inferences should be supported by direct experimental analysis of the controlling variables, whether biological or environmental (Dallery & Soto, 2004; Shull, 1991). For instance, if re is thought to measure extraneous reinforcers, its value should increase or decrease as reinforcers are added or subtracted, respectively, from the environment in which the target behavior is measured. One way to understand an equation is to plot the equation using multiple parameter values as in Figure 10.2. Creating multiple plots of an equation can easily be done using a spreadsheet program such as Microsoft Excel (at the time of this writing, Excel is perhaps the most common spreadsheet program in use). As described earlier, Figure 10.2 illustrates that the smaller the value of re, the steeper the curve rises toward its asymptote, k. Although not shown, plotting Equation 1 with different values of k illustrates how increases in k increase the maximum value reached by the curve (100 in Figure 10.2). Understanding how the shape of the function changes as the parameter values change is essential because changes in the shape reflect changes in behavior.
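The same exploration can be done in a few lines of code instead of a spreadsheet. In the sketch below, k is set to 100 (the asymptote shown in Figure 10.2); the two re values are arbitrary choices used only to contrast a small and a large value and are not taken from any data set.

def herrnstein(r, k=100.0, re=5.0):
    # Predicted response rate from Equation 1: R = k[r/(r + re)].
    return k * r / (r + re)

for r in (10, 50, 100, 200):                                       # reinforcers per hour
    print(r, round(herrnstein(r, re=5.0), 1), round(herrnstein(r, re=60.0), 1))

With the small re, predicted rates climb quickly toward k; with the larger re, they rise much more gradually, which is the pattern described for the two curves in Figure 10.2.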


In an applied context, increasing re as depicted in Figure 10.2 would mean that rates of disruptive behavior would decrease, even if reinforcer rates for disruptive behavior remain the same (compare the predicted response rates between curves when the reinforcer rate is 100 reinforcers/hour; McDowell, 1986). There is obviously more to dissecting and digesting a quantitative model than examining the shape defined by the equation. Researchers need to ask whether they have enough data to evaluate the model (e.g., a general rule of thumb is to have twice as many data points as the number of parameters), whether there are known properties of behavior or alternative models that they need to consider in evaluating the model, and whether they need to consider statistical or theoretical requirements for estimated parameter values (we discuss these considerations in more detail in the Evaluating Linear Models: Linear Regression and Evaluating Nonlinear Models: Nonlinear Regression sections). This brief introduction provides a starting point from which to approach equations in behavioral science. All equations in behavioral science share the same general structural and functional characteristics. The tactics presented in this section represent a broad, yet effective strategy to dissect equations. They may be summarized as follows: Visually inspect the shape of the dataset when plotted in graphical space, decompose the variables and parameters in the equation, explore how the plot of the equation changes as the parameter values change, and consider what these changes mean in terms of behavior and its causes.

Introduction to a Quantitative Description of Choice

One purpose of quantitative analysis is to understand the complex interaction between the determinants of behavior (e.g., biological, environmental, pharmacological) and some measure of behavior (e.g., response rate, latency). For instance, researchers may be interested in when and why a bee travels to a new field of flowers for nectar, why a shopper on a diet picks up some candy in the checkout line, or why a pigeon in an operant chamber impulsively

pecks a lighted key for an immediately delivered small amount of grain instead of waiting for a larger amount. These examples of choice, of choosing one activity over another, are amenable to quantitative analysis. One common way in which psychologists study choice in the operant laboratory is by using concurrent schedules of reinforcement. In the pigeon example, two concurrent, mutually exclusive alternatives were available (pecking for small payoffs or waiting for larger payoffs). This situation is also common in naturalistic, clinical settings: A child may choose to engage in disruptive behavior rather than appropriate behavior (Borrero & Vollmer, 2002); a smoker may choose to light up or forgo a cigarette (Dallery & Raiff, 2007). The relative rates at which the behavior occurs—in the laboratory or in the world outside the laboratory—can be powerfully governed by relative rates of reinforcement earned for engaging in them. To quantitatively model the relation between relative responding and reinforcement over time, the response and reinforcer rates need to be denoted in mathematical terms. In the case of the pigeon responding on a two-alternative concurrent schedule of reinforcement, the rate of responding on one alternative can be represented by R1, and the rate of responding on the other by R2. The rate at which reinforcers are delivered on the two alternatives can be represented by r1 and r2. Some authors use different letters, but they signify the same quantities. In 1961, Richard Herrnstein examined the relation between rates of reinforcement and responding in a seminal study. His subjects were pigeons, and they could peck two lighted keys that were mounted on the front panel (one on the left and one on the right) of an experimental operant chamber. Pecking the keys sometimes resulted in brief access to food. Specifically, the rates of food access varied according to variable-interval (VI) schedules of reinforcement, which delivered a reinforcer for the first response after some interval had elapsed. For example, the left key may have been scheduled to provide access to food every 30 seconds on average, and the right key may have only provided access to food once every 60 seconds on average. After varying these rates of reinforcement across a wide range of values, Herrnstein plotted the proportion of responses allocated to one


Figure 10.3.  Examples of matching and deviations from matching. The left panel shows perfect matching. The right panel shows two common deviations from perfect matching. The dashed curve is an example of bias, and the dotted curve represents an example of insensitivity.

key (in our example, responding on the left key is designated as R1) as a function of the proportion of reinforcers delivered on that key. Herrnstein found that the shape of the data path was a straight line. See the diagonal line in the left panel of Figure 10.3. Data points indicate hypothetical obtained response and reinforcer proportions. The straight line predicts that the proportion of responses should equal, or match, the proportion of reinforcers obtained at that alternative. So, for example, if the proportion of reinforcers on the left key equaled .25, then the proportion of behavior allocated to that side would also equal .25. This shape can be described by an equation:

R1/(R1 + R2) = r1/(r1 + r2).  (2)

Equation 2 describes a straight line passing through the origin (x = 0, y = 0) with a slope of 1—a relation termed perfect matching. Herrnstein’s (1961) study, and Equation 2, has inspired hundreds of studies on choice. The applications have ranged from the laboratory to clinical settings (Borrero & Vollmer, 2002; Murray & Kollins, 2000; St. Peter et al., 2005; Taylor, Lincoln, & Foster, 2010; see Volume 2, Chapter 7, this handbook), social dynamics (Borrero et al., 2007; Conger & Killeen, 1974), and education (Martens & Houk, 1989). The matching law has also served as a basis for studies in behavioral ecology (Houston, 1986) and behavioral neuroscience (Lau & Glimcher, 2005).

As the predictions of Equation 2 were compared with behavior, however, researchers soon discovered that perfect matching was actually a special case of a more generalized equation (Baum, 1974; Staddon, 1968). Obtained data tended to deviate in two ways from the straight line shown in the left panel of Figure 10.3. First, animals sometimes showed a bias for one alternative over another. Perhaps one response key was easier to peck or, in a naturalistic setting, the attention derived from disruptive behavior was "better" than the minor attention for appropriate behavior. In the right panel of Figure 10.3, the impact of such asymmetries on behavior is shown by the dashed curve. Because the curve is bent upward, one can deduce in the context of our example that left responses are occurring more frequently than predicted by Equation 2 (the solid line). Note that this bias toward the left alternative is not caused by the rates of reinforcement associated with the left and right alternatives. Second, animals also showed some degree of insensitivity to changes in the rates of reinforcement. The dotted curve shows an example of such insensitivity; note how the curve bends toward indifference, or a proportion of .5. The curve is above the straight line of equality at x-axis values less than .5 and below the straight line at x-axis values above .5. A single equation describing sensitivity to relative reinforcer rate across the two alternatives and bias toward one alternative over the other (not


resulting from differences in reinforcer rate) would be useful. Unfortunately, bias and sensitivity are hard to detect visually and to quantify when proportions are used. They are hard to detect because a plot of an animal's behavior may show biased responding (e.g., the plot tends to bend upward) and insensitivity (the plot also bends toward indifference) at the same time. Also, data from live organisms usually demonstrate more variability than our graphical example, which further complicates visual detection of bias and insensitivity. Staddon (1968) found that plotting ratios instead of proportions aided in visually detecting deviations from strict matching. The ratio form of matching is algebraically equivalent to the proportional form (to obtain the ratio form from the proportional form, invert both sides of Equation 2, separate the terms involved, and subtract one from both sides), and it is written

R1/R2 = r1/r2.  (3)

Baum (1974) modified Equation 3 to account for deviations from matching by adding two parameters, a and b, to Equation 3:

\frac{R_1}{R_2} = b\left(\frac{r_1}{r_2}\right)^{a},  (4)

where a quantifies sensitivity to the ratio of reinforcer rates on alternatives 1 and 2, and b quantifies bias. Equation 4 is known as the generalized matching equation, or power–function matching, because the independent variable is raised to a power, a. Traditionally, Equation 4 is evaluated by plotting the logarithm of the response ratio as a function of the logarithm of the reinforcer ratio. Logarithmic transformation2 of Equation 4 results in a straight line when the equation's predictions are plotted in graphical space. The logarithmic form of the generalized matching equation is

\log\left(\frac{R_1}{R_2}\right) = a\,\log\left(\frac{r_1}{r_2}\right) + \log b,  (5)

where a is the slope, log b is the intercept, and the logged response ratio and logged reinforcer ratios are the dependent and independent variables, respectively. This straight line is different from the line described by Equation 2 in that its slope and intercept can vary; perfect matching specifies a slope of 1 and an intercept of 0. Because the logarithmic form of the generalized matching equation is a straight line, one can evaluate the equation's descriptive accuracy by fitting a line to the data. Bias is indicated by the intercept of the straight line, log b, and sensitivity is indicated by the slope of the line, a. One example of a line described by Equation 5 is shown in Figure 10.4. First, note that the slope is less than 1.0. This means that for every unit change in the reinforcer ratio, there is less than a one-unit change in the response ratio. Considering what the value of the slope means in terms of unit changes in the dependent and independent variables is often useful. Second, note that the line is elevated above the origin, which shows some bias for the response alternative in the numerator, which is R1, or the left key, in our example. Of course, the line could be steeper than 1.0, and it could intersect the ordinate below the origin.
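To make the algebra concrete, the short sketch below (Python; the reinforcer rates and parameter values are illustrative assumptions, not data from the chapter) evaluates the power form (Equation 4) and its logarithmic form (Equation 5) at the same reinforcer ratios and confirms that the two forms make identical predictions, and that setting a = 1 and b = 1 recovers strict matching.

```python
import numpy as np

def power_form(r1, r2, a, b):
    """Equation 4: R1/R2 = b * (r1/r2) ** a."""
    return b * (r1 / r2) ** a

def log_form(r1, r2, a, log_b):
    """Equation 5: log(R1/R2) = a * log(r1/r2) + log b."""
    return a * np.log10(r1 / r2) + log_b

# Hypothetical reinforcer rates (reinforcers per hour) on the two alternatives.
r1 = np.array([5.0, 10.0, 20.0, 40.0])
r2 = np.array([20.0, 10.0, 5.0, 2.0])

a, b = 0.8, 1.26  # undermatching (a < 1) and a bias toward alternative 1 (b > 1)
ratio_pred = power_form(r1, r2, a, b)
log_pred = log_form(r1, r2, a, np.log10(b))

# The log of the power-form prediction equals the log-form prediction.
assert np.allclose(np.log10(ratio_pred), log_pred)

# With a = 1 and b = 1, the generalized equation reduces to strict matching (Equation 3).
assert np.allclose(power_form(r1, r2, 1.0, 1.0), r1 / r2)
```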

Figure 10.4.  An example of the generalized matching equation. The line represents the best-fitting line obtained via least-squares regression to the obtained response and reinforcer rates. See the text for the equation. The data points in columns E and F from Table 10.2 are plotted.

2. The logarithm of the product of two numbers is the sum of the logarithms of the individual numbers, and the logarithm of an exponentiated number is the product of the exponent and the logarithm of the number.



Consider what this would mean in terms of behavior. It would mean that the change in response ratios is more than one unit for every unit change in reinforcer ratios (called overmatching; Baum, 1979) and that there is a bias toward the response alternative in the denominator, the right key. Now one has a single, elegant equation to describe how two variables (i.e., reinforcer rates on two alternatives), other asymmetries in the properties of the responses or reinforcers (i.e., changes in the intercept or bias), and the degree of insensitivity to changes in reinforcer rates (i.e., change in the slope or sensitivity) affect choice.

One log transforms the ratios to evaluate the generalized matching equation for several reasons. The main reason, as implied earlier, is that it is easier to detect departures from strict matching. The log transform also turns the curves produced by bias and insensitivity into straight lines (McDowell, 1989), and thus bias and insensitivity are reflected by changes in intercept and slope, respectively. Indeed, quantitative analysis has several useful data transformations, which are used for a variety of reasons, for example, to increase interpretability of a graph or to meet assumptions for statistical analysis (e.g., normality). Some balk at the idea of transforming behavioral data; after all, one does not observe logarithms of behavior. This argument misses the point of transforms, which is to magnify the ability to describe and explain environment–behavior relations. For instance, a logarithmic transform might help one to see the spread of data across the range of values of the independent variable, especially when that range is very large. That is, when plotting logarithms of the data, the spacing of data between numbers 1 and 10 is the same as the spacing between the numbers 10 and 100. Indeed, even a simple measure of response rate is a transformation. Transformations can have advantages, much as the metric system has advantages over the U.S. system of measurements (see Shull, 1991, and Tukey, 1977, for further discussion of transformations).

Evaluating Linear Models: Linear Regression

The generalized matching equation states that the relation between log reinforcer and log response

ratios can be described by a straight line—but is it? If visual inspection suggests that the data are orderly, sensible, and linear, one can evaluate this theory of choice statistically by performing linear regression. More generally, when researchers want to evaluate a relation between an independent and a dependent variable, and the relation between them can be graphed as a straight line, they use linear regression (if the variables are not linearly related, they use nonlinear regression, which is covered in the Evaluating Nonlinear Models: Nonlinear Regression section). When the two variables are not causally related, researchers use a correlation coefficient. For example, one would not want to perform regression to describe the relation between response rate and response latency. As a rule of thumb when deciding between regression and correlation, ask whether the independent variable plotted along the abscissa is causing the changes in the dependent variable. If it is, use regression. If the variable plotted along the abscissa is a dependent variable (response latency), and so is the variable plotted along the ordinate (e.g., response rate), then a correlation is more appropriate. Also, before conducting a so-called ordinary regression analysis, whether linear or nonlinear, it is important to ensure that the assumptions of ordinary regression are not violated. For convenience, these assumptions are listed in Table 10.1.

Consider the data shown in Table 10.2. These data show how a human might interact during a discussion with two confederates. The dependent variable is how much the individual interacts with each confederate or, more specifically, the rate of words per minute directed toward each confederate (columns C and D). The independent variable is the rate of social reinforcers per hour (e.g., agreement statements, nods; columns A and B) delivered by each of the confederates. The social reinforcers are delivered according to VI schedules. The rates of reinforcement are varied across 10 conditions. Thus, when the reinforcer rate is 5.97 reinforcers per hour for Confederate 1 and 4.63 reinforcers per hour for Confederate 2, the rates of words per minute directed toward each confederate are 6.84 and 4.66, respectively. It is important to remember the units in which the variables are expressed; they are important in understanding any quantitative model.


Table 10.1
Assumptions of Ordinary Least-Squares Regression

■ The values of the independent variable are known precisely. All of the variability in the data is in the values of the dependent variable.
■ The variability in the dependent variable, the error, follows a normal bell-shaped distribution.
■ The degree of error, or scatter, in the dependent variable is the same at all points along the line (or curve). Another way to say this is that the standard deviation of the scatter must be the same at all points along the line (or curve), which is known as homoscedasticity.
■ The observations are independent.

Note. From Fitting Models to Biological Data Using Linear and Non-Linear Regression: A Practical Guide to Curve Fitting (p. 57), by H. Motulsky and A. Christopoulos, 2004, Oxford, England: Oxford University Press. Copyright 2004 by Oxford University Press. Adapted by permission of Oxford University Press, Inc.

Table 10.2
Regression Calculations for a Linear Model

Condition   A: rnf rate    B: rnf rate    C: rsp rate    D: rsp rate    E: log rnf   F: log rsp    G: predicted log   H: Y − Y′   I: (Y − Y′)²
            by Conf. 1     by Conf. 2     to Conf. 1     to Conf. 2     ratio        ratio (Y)     rsp ratio (Y′)
 1           5.97           4.63           6.84           4.66            .11          .17            .19            −0.02       0.0004
 2           2.35           5.46           4.40           3.77           −.37          .07           −.19             0.26       0.0676
 3           7.09           2.31           6.29           1.55            .49          .61            .49             0.12       0.0144
 4           1.55          11.15           2.50          11.38           −.86         −.66           −.58            −0.08       0.0064
 5          11.08           2.20           8.60           2.01            .70          .63            .66            −0.03       0.0009
 6           2.42           3.55           1.20           1.96           −.17         −.21           −.03            −0.18       0.0324
 7           4.96           1.17           2.06           0.55            .63          .57            .60            −0.03       0.0009
 8           2.46           4.04           1.80           2.07           −.22         −.06           −.07             0.01       0.0001
 9           4.57           2.55           2.48           1.42            .25          .24            .30            −0.06       0.0036
10           1.13           5.29           0.91           2.48           −.67         −.44           −.43            −0.01       0.0001

Note. Columns A–D show response and obtained reinforcer rates for a subject engaging in speech directed at Conf. 1 or 2. Columns E and F present the logged rnf. and rsp. ratios, which are the data used to perform linear regression. Column G shows the predicted log rsp. ratios calculated by Equation 5 with a = .80 and log b = .10. Column H shows the residuals, and column I shows the squared residuals. The sum of the squared residuals is SSREG = 0.13, and the sum of squared deviations from the mean is SSTOT = 1.78. The r2 is calculated as 1 − (SSREG/SSTOT) = .93. Conf = confederate; rnf = reinforcement; rsp = response.

Columns E and F show the logged reinforcer ratios (r1/r2) and response ratios (R1/R2), respectively. The data from columns E and F are the data plotted graphically in Figure 10.4. The independent variable, the logged reinforcer ratio, is always plotted along the abscissa; it is important that the independent variable be plotted along the abscissa and the dependent variable be plotted on the ordinate when performing linear regression.
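As a quick check on the table, the sketch below (Python; the rates are copied from columns A through D of Table 10.2) computes the logged reinforcer and response ratios that appear in columns E and F.

```python
import numpy as np

# Reinforcer rates (per hour) and response rates (words per minute) from Table 10.2.
rnf_1 = np.array([5.97, 2.35, 7.09, 1.55, 11.08, 2.42, 4.96, 2.46, 4.57, 1.13])
rnf_2 = np.array([4.63, 5.46, 2.31, 11.15, 2.20, 3.55, 1.17, 4.04, 2.55, 5.29])
rsp_1 = np.array([6.84, 4.40, 6.29, 2.50, 8.60, 1.20, 2.06, 1.80, 2.48, 0.91])
rsp_2 = np.array([4.66, 3.77, 1.55, 11.38, 2.01, 1.96, 0.55, 2.07, 1.42, 2.48])

log_rnf_ratio = np.log10(rnf_1 / rnf_2)   # column E, the independent variable
log_rsp_ratio = np.log10(rsp_1 / rsp_2)   # column F, the dependent variable (Y)

for e, f in zip(log_rnf_ratio, log_rsp_ratio):
    print(f"{e:6.2f}  {f:6.2f}")          # reproduces .11/.17, -.37/.07, and so on
```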

Linear regression may be calculated with a statistics program or by an algorithm in a spreadsheet program. These programs start with initial values of the slope and intercept and then adjust these values over a series of iterations so that the line is as close to the obtained data as possible.3 The initial values used by programs may be unknown to the user. If one uses a spreadsheet with regression capabilities, for example, one needs to enter the initial values

3. The parameters of a line can also be calculated directly from formulas derived from linear regression theory rather than minimizing the residual sum-of-squares. We chose to illustrate the curve-fitting approach to calculating the parameter values because the logic should be easier to understand in the context of a more familiar function form (i.e., a line) relative to less familiar forms (i.e., curves) and because this method can be applied to both linear and nonlinear regression.



into the spreadsheet. For now, for pedagogical purposes, assume the starting values must be entered. The starting values will depend on the particular dataset being evaluated. One shortcut in determining starting values is to inspect the plotted data. If the slope of the line is positive, one can use 1 as its starting value. If it is negative, one can use −1. One can also examine where the line will intersect the ordinate at zero on the abscissa, and then enter that value as the starting value for the intercept. Once these values are entered, the equation makes predictions about the response ratios. For example, let us use an intercept of 0 and a slope of 1 and predict the logged response ratio when the logged reinforcer ratio is 0.11. Insert these values into the right side of Equation 5:

\log\left(\frac{R_1}{R_2}\right) = a\,\log\left(\frac{r_1}{r_2}\right) + \log b,

\log\left(\frac{R_1}{R_2}\right) = 1(0.11) + 0, \text{ and}

\log\left(\frac{R_1}{R_2}\right) = 0.11.

The predicted log response ratio is 0.11. The prediction is not too far off from what we observed (0.17; first row in column F). Because the regression analysis has already been conducted, however, we know the prediction could be better. Note that one could, of course, select values of a and b such that the prediction at one x value exactly equaled the obtained y value, but the key is to bring the fitted line as close as possible to all the data points. The vertical distances or deviations between the calculated predictions and the obtained data points are called residuals. We do not show all of the calculations here, but there will be residuals for all of the data points; for our single data point the residual is 0.17 − 0.11 = 0.06. At first, given the initial parameter values, the residuals will be large. The algorithm has not adjusted the parameter values to make these distances smaller, which is the same thing as making the line fall closer to the data. How does the algorithm make these distances smaller? Whether the distances are positive or negative does not

matter; one can square the residuals and then sum them. (There is another important reason to square the residuals, to which we return in the Rationale for Squaring the Deviations in Regression Analysis section.) We want this sum of squared residuals to be as small as possible, so the algorithm seeks to minimize this sum. The procedure is as follows: The program increases or decreases the parameter values and asks, is the sum of squared residuals smaller than the last iteration? The program continues to change the values to make the sum smaller. At some point, when changes in the parameter values produce very small changes in the sum of squared residuals, the algorithm stops and reports the best-fitting parameter values. The description of Table 10.2 lists the best-fitting parameter values obtained via least-squares regression (which is called least squares because of the process of minimizing the sum-of-squared deviations between the line and the data). In column G, we show the predicted data points, labeled Y′. Equation 5, the generalized matching equation, predicts these data points—they can be calculated by hand by inserting the slope and intercept values given in the table description and any one of the log reinforcer ratios in Table 10.2. The deviation between the obtained data, Y, and the predicted data, Y′, is shown in column H. If each of these values is squared, we obtain the squared deviations, shown in column I. The sum of these deviations in column I is what the algorithm minimizes to obtain the best-fitting parameter values. To show the accuracy of the regression model visually, vertical lines can be drawn between each data point and the predicted line (these lines represent the residuals), which is shown in the left panel of Figure 10.5. The distances are very small. In linear (and nonlinear) regression, these distances are compared with the distances given by another model, the null hypothesis. The null model asserts that the mean of the obtained data can account for (predict) the results better than the more complicated two-parameter model. The null model is more parsimonious. There are no parameters, just the calculated mean of the data. The accuracy of this model is shown in the right panel of Figure 10.5. To quantitatively characterize this visual comparison


Figure 10.5.  A visual representation of how linear regression is a matter of model comparison. The plot on the left shows the regression model, the alternative hypothesis. The plot on the right shows the null model. In linear regression, one compares the vertical distances between the data points and the line on the left to the vertical distances between the data points and the horizontal line on the right.

between the two models, sum the squared residuals in the regression model (denoted SSREG), and sum the squared deviations in the null model (denoted SSTOT; TOT stands for total deviation from the mean). The ratio of these quantities is subtracted from 1, which provides the first index of whether the model is accurate:

r^{2} = 1 - \frac{SS_{\mathrm{REG}}}{SS_{\mathrm{TOT}}}.  (6)
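The sketch below (Python, using scipy.optimize.minimize as a stand-in for whatever iterative routine a statistics or spreadsheet program employs) carries out the procedure just described: starting from rough initial values, it adjusts a and log b to minimize the sum of squared residuals and then computes r2 from Equation 6. With the data in Table 10.2, it should return approximately a = 0.80, log b = 0.10, SSREG = 0.13, and r2 = .93.

```python
import numpy as np
from scipy.optimize import minimize

# Logged reinforcer (x) and response (y) ratios from columns E and F of Table 10.2.
x = np.array([0.11, -0.37, 0.49, -0.86, 0.70, -0.17, 0.63, -0.22, 0.25, -0.67])
y = np.array([0.17, 0.07, 0.61, -0.66, 0.63, -0.21, 0.57, -0.06, 0.24, -0.44])

def sse(params):
    """Sum of squared residuals for Equation 5: predicted y = a*x + log b."""
    a, log_b = params
    return np.sum((y - (a * x + log_b)) ** 2)

fit = minimize(sse, x0=[1.0, 0.0])        # start at strict matching: slope 1, intercept 0
a, log_b = fit.x
ss_reg = fit.fun                          # sum of squared residuals for the fitted line
ss_tot = np.sum((y - y.mean()) ** 2)      # squared deviations from the mean (null model)
r_squared = 1 - ss_reg / ss_tot           # Equation 6

residuals = y - (a * x + log_b)
print(a, log_b, r_squared)                # approximately 0.80, 0.10, 0.93
print(np.corrcoef(x, residuals)[0, 1])    # near zero: no obvious trend in the residuals
```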

The r2 resulting from the linear regression analysis is shown in Table 10.2. If the value of SSREG is close to SSTOT, then the value of the ratio will be close to 1, which will result in an r2 close to 0. If SSREG is very small compared with SSTOT, the ratio will be small and the r2 value will be close to 1. Because r2 is calculated from a ratio, it can be negative when the null model predicts the data better than the regression model. In other words, a negative value is obtained when SSREG is larger than SSTOT. Some authors designate the r2 in different ways, such as by reporting the variance accounted for in terms of a percentage (r2 multiplied by 100), abbreviated VAF or VAC. Whether expressed as r2, variance accounted for, or variance explained, all of these represent a relative comparison of distances between the data and the predicted line (or curve; see the section Evaluating Nonlinear Models: Nonlinear Regression later in this chapter)

and the distances between the data and a horizontal line at the mean. In other words, r2 is a ratio of how well the model predicts the data relative to the simpler null model. The r2 is just the first step in evaluating the model's accuracy. Examine the left panel in Figure 10.5, particularly the patterns of the data points around the line. One can see that about the same number of data points are above or below the line, that the data points are approximately the same distance from the line, and that the deviations, positive or negative, are not systematically related to the x-axis values, a condition known as homoscedasticity. Violations of homoscedasticity, a condition known as heteroscedasticity, mean that there is a serious problem with the data or the model, even if one has a high r2. To understand why, examine Figure 10.6. The left and middle panels show different examples of heteroscedasticity. In the left panel, the model (the line) obviously and systematically misses the data. When the log reinforcer ratio is negative, the line overpredicts and then underpredicts the data. When the log reinforcer ratio is positive, the line tends to overpredict the data. The systematic pattern of deviations violates the requirement of homoscedasticity. In the middle panel, it is harder to see how the data depart from homoscedasticity, which is often the case in quantitative modeling. Although the r2 may be high, it does not tell the whole story. Careful visual


Figure 10.6.  Two examples of fits of a linear equation to data showing heteroscedasticity (left and middle panels) or homoscedasticity (right panel). The left panel shows that the model underpredicts the data when reinforcer ratios are smaller. The middle panel shows a more subtle example, and it may be difficult to see heteroscedasticity. The right panel shows what appears to be a homoscedastic pattern, in which the data are evenly spread along the range of the fit.

inspection reveals that more data points are above the line and that within a certain range of the x-axis, approximately −0.8 to −0.2, all the data points fall above the line. Finally, in the right panel, one can see that the number of points above and below the line are approximately equal, that the deviations above and below are roughly equal, and that there is no systematic relation between positive and negative deviations and x value, in keeping with the requirement of homoscedasticity. To be sure whether the model produces systematic departures from homoscedasticity, one needs to perform a finer grained analysis, which is called residual analysis. Most programs routinely report residual analyses, and we recommend careful inspection and reporting of these results. Residual analysis can be performed in several ways. All of them use the same general procedure: Evaluate the pattern of residuals as a function of the independent variable or as a function of the predicted data (e.g., each Y′ in column G of Table 10.2) to determine whether the residuals are evenly and randomly distributed. For example, one of the most basic ways to perform residual analysis is to plot the raw residuals (i.e., the difference between the predicted and obtained values) as a function of the independent variable. The residuals from Figure 10.6 are plotted in Figure 10.7 as a function of their corresponding reinforcer ratios. The order of the panels corresponds to the panels in Figure 10.6. Thus, the left panel shows obvious departures from homoscedasticity. The middle panel in Figure 10.7

shows more clearly than that of Figure 10.6 that the data are heteroscedastic. The right panel shows a desirable pattern of residuals, one that shows homoscedasticity. To perform a residual analysis using the data from Table 10.2, plot the data in column H, the raw residuals, as a function of the independent variable shown in column E, the logged reinforcer ratios. Residual analysis can be performed in other ways (Cook & Weisberg, 1982). The standardized residual could be plotted as a function of the independent variable. The standardized residual is calculated by dividing each raw residual by the standard deviation of residuals. Standardized residuals allow one to more easily detect the presence of outliers (e.g., a residual ±2 standard deviations from the mean of the residuals). To detect patterns in the residuals, one could use inferential statistics to supplement visual analysis. For example, in some cases a correlation between the residual and the independent variable could be calculated (e.g., Dallery, Soto, & McDowell, 2005; McDowell, 2005). A significant correlation would indicate systematic departures from homoscedasticity, and a nonsignificant correlation (which is desirable) indicates homoscedasticity. There are more advanced techniques to detect patterns in the residuals, which involve fitting curves to the residuals. Some authors use cubic, quadratic, and other polynomial equations to detect these deviations from homoscedasticity (e.g., see Sutton, Grace, McLean, & Baum, 2008). Additionally, there are techniques that involve examining the number


Figure 10.7.  Residual plots of data shown in Figure 10.6. The order of panels reflects the order in Figure 10.6. See text for further details.

of residuals that have a positive or negative sign and the number of consecutive residuals of the same sign, called runs of residuals (McDowell, 2004; Motulsky & Christopoulos, 2004). Given the logic of residual analysis, the latter runs test should show that an approximately equal number of residuals fall above and below the line and that there are no long runs of residuals of the same sign. There are several potential consequences if nonrandom residuals are obtained. First, it may mean that the assumption of random scatter is violated and that using an ordinary least-squares regression technique may not be appropriate. We explain why this is the case in the Rationale for Squaring the Deviations in Regression Analysis section. Second, the equation might be incorrect. Perhaps the equation form is incorrect, a parameter is missing from the equation, or the parameters have not been adjusted properly (see Motulsky & Christopoulos, 2004, on troubleshooting bad fits). Something may also possibly have gone wrong with the experiment. Perhaps some uncontrolled variable was responsible for the systematic departures, or an error was made in transcription or data analysis. Before dismissing a model because of nonrandom residuals, one should be sure that an error in the experiment or analysis is not the source of the nonrandom residuals. Although residual analysis is critical in evaluating the accuracy and validity of a particular model, for example, the generalized matching equation, residual analysis can also be used to compare models. Sutton et al. (2008) used several datasets to distinguish two models of choice and performed what might be called a meta-analysis of residuals. Because residuals can show deviations at a finer grained level

than overall measures of model accuracy such as r2, they allow researchers to more easily evaluate model accuracy (see also Dallery et al., 2005). Sutton et al. concluded that although the r2s of the two models were equivalent, analysis of the residuals yielded subtle, yet important distinctions between the models. Thus far, we have considered three essential steps in performing linear regression: careful visual inspection of the data, calculating r2, and performing residual analysis. Researchers must be sure that a line describes the data. If it does not, they may need to evaluate a different model, but only if the alternative model makes theoretical sense when applied to the data. This issue requires critical analysis. There are almost always theoretical reasons for choosing a particular model, and thus regression analysis in particular and quantitative modeling in general are not simply a matter of obtaining a good fit.

Additional Questions When Evaluating Linear Regression

Do the Parameter Values Make Sense?

Estimated parameter values must be plausible, both in terms of precedent and in relation to the dataset to which the equation is fitted. For example, in the case of linear regression, the y-intercept should be plausible given the obtained data in the experiment. What should one expect to see if the value of the independent variable is 0? In the case of generalized matching, the independent variable is 0 when the reinforcer rates are equal. Using the example from the preceding section, assume each confederate delivered 60 reinforcers per hour; thus, 60/60 is 1, and the log of 1 is 0. A safe initial assumption,


therefore, is that one would observe equal response rates on the two alternatives (i.e., a log response ratio close to 0), and some deviation might be expected if the response alternatives differ in some way (e.g., one confederate is more genuine or the quality of feedback is different). That is, one might expect some bias. If the obtained parameter value is widely discrepant from the existing literature, then something is most likely wrong. Solving the discrepancy may require some detective work. Perhaps the problem is lack of experimental control or human or measurement error, or some variable other than reinforcer rate is exerting an effect.

Is There Error in the Values of the Independent Variable?

When researchers perform regression analysis, they are predicting the values of the dependent variable at specific values of the independent variable, which means that they need to obtain precise numbers for their obtained data. In our earlier example, this would mean precise reinforcer rates. For example, if one uses an averaged value of 6.0 reinforcers per hour as one of the independent variable values in linear regression, one must be sure that the reinforcer rates going into this average are extremely close to 6.0 reinforcers per hour. If the measurement of a particular value of the independent variable has a lot of error or variability, one cannot make precise predictions. Regression analysis assumes that there is no, or very little, error variance in the independent variable. In the case of a generalized matching analysis, there will be some variability in the independent variable, the obtained log reinforcer ratio, because the obtained log reinforcer ratio is plotted and depends in part on the rate of responding at each alternative. (One could plot the experimentally programmed reinforcer ratio, which has no error, but the programmed ratio may not be what the animal experiences.) Because the schedules for each alternative are VI schedules, the variability in reinforcer rates will be minimal compared with the variability in response rates, and thus the resulting reinforcer ratio will have minimal variability. For example, under a VI 60-second schedule, a pigeon could press once a minute or 100 times a minute and receive about the same number of reinforcers.

Therefore, the amount of variability is not enough to invalidate the use of ordinary linear regression techniques. If the variability in the independent variable is high, alternative regression approaches are available (Brace, 1977; Riggs, Guarnieri, & Addelman, 1978; although the same considerations apply for nonlinear regression, these approaches apply only to linear regression). To our knowledge, there are no specified rules about how much variability is too much to warrant an alternative approach.

Is Another Model More Appropriate?

Even when the model accounts for the variability in the data points (i.e., a high r2), the residuals are homoscedastic, and the parameter values are sensible, there may be other questions regarding the model that cannot be answered by linear regression analysis. For example, one may ask whether another model of concurrent choice can account for more of the variance in the data. In the case of concurrent schedules of reinforcement, in addition to the generalized matching law, alternative models can account for choice (e.g., Dallery, McDowell, & Soto, 2004; McDowell & Kessel, 1979). One may also ask whether a more parsimonious model can account for the same amount of variance. In general, a more parsimonious model would be one that contained fewer parameters. To our knowledge, no current model can account for the data and is more parsimonious than generalized matching. In other areas of research, however, questions about parsimony and accuracy are still quite relevant (e.g., Mazur, 2001; McKerchar et al., 2009). Questions about accuracy and parsimony can be addressed using model comparison approaches, which we discuss later (see also Mazur, 2001, for an example of model comparison). The underlying assumptions of the selected model should also be examined. Indeed, a highly accurate model may be based on faulty assumptions. There are no tidy rules for discovering these assumptions; it takes critical thinking and a thorough understanding of the mathematical model. One assumption of generalized matching, for example, is that sensitivity to reinforcer rates (i.e., a in Equation 5) should be constant for an individual subject across changes in reinforcer magnitude or quality. By constant, we mean that a should not


change if, for instance, a is estimated using small food pellets or large food pellets. If sensitivity to reinforcer rates does change when different reinforcer magnitudes are used, then the assumptions underlying the generalized matching equation have been violated, and the application of the equation is invalid. At least one study has shown that sensitivity to reinforcer rates does change depending on reinforcer magnitude (Elliffe, Davison, & Landon, 2008). Elliffe et al.'s (2008) findings question a matching-based understanding of how multiple independent variables—reinforcer rate and magnitude—interact and govern choice. Whether these findings will be confirmed by future research is an important issue, and such work may inspire new models that will yield new insights into the determinants of choice.

Rationale for Squaring the Deviations in Regression Analysis

The main feature of the regression analysis is minimizing the sum-of-squared deviations. The story behind the reason researchers seek to minimize the squared deviations, or perform so-called least-squares regression, is an interesting one. In 1801, at the age of 23, Carl Friedrich Gauss used the method of least squares to predict, with unprecedented precision, the path of a dwarf planet named Ceres. He made about 22 observations over a 40-day period before Ceres went behind the sun. To predict where it would reappear almost half a year later, he developed the method of least squares. The particular equation he used to predict the path of the planet need not concern us. What made the method work was Gauss's assumption that any error in the observations followed a normal distribution. To illustrate why, we provide a more familiar example than planetary motion. Consider one of the more highly controlled experimental preparations in behavioral science: a pigeon pecking a lighted key for brief access to grain. After every 20 pecks, access is granted for a brief period. As response rate is measured from session to session, the numbers will vary. One day the response rate is higher, the next it is lower. However, all of the numbers revolve around some average. The distribution of the numbers is the

key—the scatter of response rates is normally distributed, or Gaussian. Most of the numbers will be close to the average, but some scatter will also be observed further away from the average. In other words, scatter close to the mean is more likely than scatter far away from the mean. Assume the average is known because the pigeon has been observed for 100 consecutive days. The average is 50 responses per minute. On Days 101 and 102, one student takes measurements, and on Days 103 and 104, another student takes measurements. The first student obtains 45 and 55, and the second obtains 49 and 59. What if the same students made two more measurements each? Which pair of observations would be more likely the second time around? This is a question about predicting behavior and how to maximize accuracy. To quantify accuracy, calculate deviations from the mean. Smaller deviations mean better accuracy. What if the researcher just calculates the deviations from the mean for both pairs? Ten units would be obtained for both pairs of observations (5 + 5 = 10 and 1 + 9 = 10). The researcher would conclude that both pairs are equally as likely because they have equal total deviations from the mean. However, these calculations do not take into account a property of the deviations around the mean: the normal, Gaussian distribution. To take this property into account, the deviations are squared, that is, 52 + 52 = 50 and 12 + 92 = 82. On the basis of these calculations, the first pair of observations is more likely (see also Motulsky & Christopoulos, 2004). The reason that squaring the deviations produces the most accurate predictions can be seen in Figure 10.8. The abscissa shows the values of the observations, and the ordinate shows the frequency of their occurrence. Consistent with the example, behavior has been measured for 100 days. On the basis of the distribution, one can see how the observations made by Student A are more likely; the likelihood of Student A’s observations of 45 and 55 is greater than that of the observations made by Student B. If the observational error in a system under study shows a Gaussian distribution, whether planetary motion or a pigeon’s pecking, then least-squares regression will be most accurate in generating the best parameter values, the values that make the best


Figure 10.8.  An example of a Gaussian distribution. The graph shows the frequency of observations, in this case response rates, over the course of 100 sessions. The distribution has a mean of 50. Two pairs of observations are taken, one pair by Student A (indicated by the As in the graph) and one pair by Student B (indicated by the Bs in the graph).

predictions. The error could be intrinsic to the system itself, or perhaps the variation is the result of some extrinsic factor such as measurement error or how much the experimenter feeds the pigeon between sessions. Regardless of the source, the observations over time, such as response rates over the course of 100 sessions, should show a Gaussian distribution. It is important to understand how and why the sum-of-squared deviations incorporate the Gaussian distribution in making predictions. Other methods to perform regression do not assume a Gaussian distribution. Usually, least-squares regression is a justified practice to assess a model, but there are times when the distribution of observations does not show Gaussian scatter; sometimes the scatter of observations increases as the value of the independent variable increases, perhaps the scatter is wider than a Gaussian distribution, or perhaps the distribution is skewed and therefore non-Gaussian. For example, distributions of interresponse times are often skewed, and therefore using linear regression on mean interresponse times would not be appropriate in these cases. Techniques are available to perform regression on the basis of these distributions that involve different methods of weighting the residuals when calculating the best fit.
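The sketch below (Python; the standard deviation of 5 is an arbitrary assumption used only for illustration) restates the two-student example numerically: the pair of observations with the smaller sum of squared deviations from the mean is also the pair with the higher likelihood under a Gaussian distribution, which is why minimizing squared deviations and assuming normally distributed error go hand in hand.

```python
import numpy as np
from scipy.stats import norm

mean, sd = 50.0, 5.0                      # sd is assumed here purely for illustration
student_a = np.array([45.0, 55.0])
student_b = np.array([49.0, 59.0])

for label, obs in (("A", student_a), ("B", student_b)):
    ss = np.sum((obs - mean) ** 2)                        # 50 for A, 82 for B
    log_lik = np.sum(norm.logpdf(obs, loc=mean, scale=sd))
    print(label, ss, round(log_lik, 3))

# Student A's pair has the smaller sum of squares and the larger (less negative)
# log-likelihood, so it is the more probable pair of observations.
```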

We do not present these techniques because of space limitations (see Motulsky & Christopoulos, 2004), but we do want to emphasize why least-squares regression would not be appropriate for these distributions. In a nutshell, it is because some residuals will have undue influence on the fit (similar to the effect of an extremely large or small value on the average of a set of values). The logic also explains why outliers require some special consideration in regression, even if one is confident that the scatter of observations follows a Gaussian distribution. Because the deviation for the outlier will be large, it will have a large impact when one attempts to fit the curve (or line). Consequently, the outliers will have a disproportionate impact on the values of the parameters, which would be unfortunate if the outlier was a fluke observation (e.g., the food dispenser stopped working one day and so did the animal). There are methods to reduce the influence of outliers, and these methods are called robust least-squares regression. Computer programs will differ in which robust methods are offered, so we advise a search for these robust methods if outliers are present (and the reason for the outlier has been critically analyzed and including the outlier in the regression analysis is justified).
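As one sketch of the robust options just mentioned (assuming SciPy is available; the data and the outlier are invented for illustration and are not from the chapter), scipy.optimize.least_squares can down-weight large residuals through its loss argument, so that a single aberrant point pulls the fitted line around far less than it does under ordinary least squares.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 12)
y = 0.8 * x + 0.1 + rng.normal(0, 0.03, x.size)   # data consistent with a = 0.8, log b = 0.1
y[3] = 1.5                                        # a single fluke observation (outlier)

def residuals(params):
    a, log_b = params
    return y - (a * x + log_b)

ordinary = least_squares(residuals, x0=[1.0, 0.0])                   # squared loss
robust = least_squares(residuals, x0=[1.0, 0.0], loss="soft_l1",
                       f_scale=0.1)                                  # reweights the outlier

print("ordinary:", ordinary.x)   # slope and intercept distorted by the outlier
print("robust:  ", robust.x)     # much closer to the generating values of 0.8 and 0.1
```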


An Extension of Matching Theory's Quantitative Description of Choice

Just as the development of linear models often begins with data, so too does the development of a nonlinear model. For example, imagine we have arranged a laboratory situation in which a human has one person to talk to, not two, and the rates of reinforcer delivery for talking about a certain topic, say politics, are varied across conditions. In this case, reinforcers consist of brief periods of eye contact (Beardsley & McDowell, 1992). Eye contact is provided according to several VI schedules of reinforcement (manipulated across conditions), but only when politics are discussed. We find that the rates of political speech increase with reinforcer rates. The relation is not a straight line, however; it is a curve. (Two examples of such curves are depicted in Figure 10.2.) At first, responding increases rapidly when reinforcer rates increase, and eventually responding levels off. The equation, of course, will reflect this hyperbolic shape in the observed data, and in that sense it does not take us beyond our observations. Our model, however, should extend our observations in several ways. Can it unify our observations of human behavior with the behavior of other organisms? Can it make novel and testable assertions about why the data take the shape they do? Before answering these questions, let us sketch the development of the hyperbolic equation. In 1968, Catania and Reynolds published an enormous and elegant dataset on how pigeons' rate of pecking changed with changes in rate of VI reinforcement. The relation, as in our example of social dynamics, appeared hyperbolic. Thus, researchers knew that any quantitative account of reinforced responding on a single VI alternative must be consistent with a hyperbolic form. In 1970, Herrnstein demonstrated that one can obtain a hyperbolic model from the original matching equation, Equation 2. In deriving a hyperbolic equation from Equation 2, Herrnstein made several assumptions. First, he assumed that even when only one alternative is specifically arranged, other extraneous, unarranged behavioral alternatives exist. Consider the pigeon's situation in Catania and Reynolds's experiment. In addition to pecking the lighted key, the

pigeon may engage in other behavior such as walking around, wing flapping, and so on, and presumably, each behavior produces some natural reinforcer. Similarly, a human engaged in political speech could talk about a different topic, fidget, or stare into space. Therefore, a single-alternative arrangement is really a concurrent schedule, in which one choice is the target alternative and the other is the aggregate of all extraneous behaviors. Responding in a single-alternative situation can be expressed as

\frac{R}{R + R_{e}} = \frac{r}{r + r_{e}},  (7)

where Re and re represent the rates of extraneous responding and reinforcement, respectively (see Herrnstein, 1974, for a further discussion). In other words, Re and re represent any responding and reinforcement extraneous to the target alternative. The subscripts have been dropped from the target alternative because only one schedule of reinforcement is explicitly arranged. Second, Herrnstein (1970) also assumed that R (target responding) and Re (extraneous responding) are exhaustive of the total amount of behavior possible in a given environment. In other words, R + Re is all an animal can do in a given situation. Equation 7 can be solved for R by letting k = R + Re and then rearranging, which produces the familiar Equation 1:

R = k\left(\frac{r}{r + r_{e}}\right).  (1)
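A brief numeric sketch (Python; the parameter values are invented for illustration, not fitted estimates) shows the hyperbolic shape that Equation 1 predicts: response rate rises quickly at low reinforcer rates and levels off near k as r grows large relative to re.

```python
def herrnstein(r, k, r_e):
    """Equation 1: R = k * r / (r + r_e)."""
    return k * r / (r + r_e)

k, r_e = 35.0, 20.0                    # illustrative values only
for r in (5, 20, 80, 320, 1280):       # reinforcers per hour
    print(r, round(herrnstein(r, k, r_e), 1))
# Output climbs toward, but never exceeds, the asymptote k.
```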

The parameter k is now interpreted as total behavior possible in a situation. By assuming that all behavior is choice and that the total amount of behavior in a situation is constant, Herrnstein (1970) derived Equation 1 from Equation 2. Herrnstein’s (1970) theory consists of a quantitative statement about how reinforcer rate affects response rate (i.e., hyperbolically). More important, the theory also explains why reinforcement affects behavior in the manner specified by Equation 1: Reinforcement alters the distribution of a constant amount of behavior (Herrnstein, 1970, 1974). Now, at least conceptually, one can see how Herrnstein’s model links responding in situations in which only


one choice is experimentally arranged and situations in which two choices are experimentally arranged. Both represent choice situations. Indeed, Herrnstein went so far as to say that all behavior is choice. A pigeon's key pecking and a person's talking can be described by the same equation. The theory also explains why one feature of the hyperbolic shape is seen: asymptotic responding. Once enough reinforcers are delivered for some activity, the limits of the distribution of behavior are presumably reached (i.e., exclusive responding on the arranged alternative), and thus asymptotic levels of responding are observed.

Evaluating Nonlinear Models: Nonlinear Regression

The preceding conceptual analysis highlights some of the surprising and interesting consequences of quantitative models. Researchers can unify phenomena and explain why they see similar functional relations across situations and species. These consequences are moot unless the model is accurate, however, and as such we must return to analysis. Table 10.3 shows empirical observations in the single-alternative situation involving social interaction and political speech, and it follows the same general format as Table 10.2. Figure 10.9 shows a plot of the data in columns A and B, the response

and reinforcer rates, and the curve represents the least-squares fit of the hyperbolic equation. Least-squares regression for nonlinear models works in the same way as it does for linear models. As before, start with some initial parameter values for k and re. Here, it may be more difficult to guess where to start for the initial values; it is easier if one examines a graph of the data. Inspect the data points in Figure 10.9. The value of the asymptote k is measured in the units of the dependent variable—usually responses per minute.

Table 10.3
Regression Calculations for a Nonlinear Model

A: rnf/hr   B: rsp/min (Y)   C: predicted rsp/min (Y′)   D: Y − Y′   E: (Y − Y′)²
180         32.0             31.46                        0.54        0.2916
138.6       30.1             30.45                       −0.35        0.1225
 98         28.0             28.78                       −0.78        0.6084
 22.5       18.8             17.68                        1.12        1.2544
 15         16.3             14.14                        2.16        4.6656
  9          6.7             10.09                       −3.39       11.4921

Note. Columns A and B show reinforcer and response rates, and column C shows predicted response rates calculated by Equation 1 with k = 35.41 and re = 22.57. Other conventions are identical to those in Table 10.2. Regression calculations: SSTOT = 478.43, SSREG = 18.43, R2 = 0.96. rnf = reinforcer; rsp = response.

Figure 10.9.  The curve shows Herrnstein's (1970) hyperbolic equation fitted via least-squares regression to the data in Table 10.3.


Because k reflects the asymptote, one could start with the maximum response rate observed, or 32 responses per minute. Mathematically, the parameter re represents the reinforcer rate necessary to obtain half the asymptotic response rate, k. The reinforcer rate for the maximum response rate was 180 reinforcers per hour, so half of this reinforcer rate could be used as the initial value of re. We should note that, by definition, re represents the aggregate rate of reinforcement from sources extraneous to the target alternative. These extraneous reinforcers are typically not measured directly; rather, they are estimated in terms of the reinforcer delivered on the instrumental alternative. For example, if political speech were reinforced with eye contact, then an re of 10 would be read as 10 eye contacts per hour (the time unit for re is usually reinforcers per hour). In performing nonlinear regression, having a sense of what the parameters mean allows the researcher to use the data to determine initial parameter values. Then, the computer program (or algorithm) inserts the initial estimates (32 responses per minute and 90 reinforcers per hour) into Herrnstein's (1970) equation to obtain a predicted response rate for each of the measured reinforcer rates. The computer program adjusts the parameter values to minimize the sum-of-squared residuals. Table 10.3 shows the best-fitting parameter values of k and re obtained via least-squares regression. The predicted data based on these values are shown in column C. The first index of the model's accuracy is the R2 (by convention, results of nonlinear regression are reported using R, and results of linear regression are reported using r). Similar to what was depicted in Figure 10.5, the R2 compares the vertical distances of the data to the curve with the vertical distances of the data to a horizontal line at the mean. More specifically, a ratio of the sum-of-squared residuals to the total sum-of-squares is calculated. In our example, SSREG is 18.43, and SSTOT is 478.43. Subtracting the ratio of SSREG to SSTOT from 1 yields an R2 of .96. A residual analysis is also performed to assess whether there are systematic departures from homoscedasticity. The R2 tells only part of the story. Again, one could perform several different kinds of residual analysis (see earlier), but for simplicity the

Figure 10.10.  Plot of the residuals of the fit of Herrnstein's (1970) hyperbolic equation. The residual plot does not suggest systematic deviations from homoscedasticity.

residual is plotted as a function of the independent variable, which is shown in Figure 10.10. One can see that the scatter of residual values is relatively even across the range of reinforcer rates. More data may be needed to assess the assumptions of least-squares regression, but the analysis suggests that there are no obvious departures from the assumptions.

Additional Questions When Evaluating Nonlinear Regression

Do the Parameters Make Sense?

To answer this question, as in linear regression, one has to know what the parameters mean and in what units they are measured. If researchers obtain an estimate of k of 1,000, and they have measured response rates in units of words per minute, this estimate of asymptotic response rates would not be feasible. Moreover, even if the maximum observed response rate were 100 or so, one should be suspicious of such a parameter value because it exceeds the observations by orders of magnitude. That is, if the parameter value exceeds what is reasonable on the basis of observed values, the researchers should think carefully about the obtained parameter value. It is indeed possible to obtain very large ks, but this


is usually because the researcher has not collected enough data under high reinforcer rates (i.e., in which response rate is close to asymptote). In choosing the VI schedules, the researchers have not sampled the parameter space effectively. For example, one would say the parameter space has not been sampled adequately if only the three lowest reinforcer rates shown in Figure 10.9 (three leftmost data points) were experimentally arranged. Even visually, using these three data points, the researchers would not be able to make a reasonable guess about the asymptote of the function. Their guess might be large (and uncertain). Similarly, the estimate produced by the regression program might be large (and uncertain, as measured by the standard error of the parameter estimate, for example; Motulsky & Christopoulos, 2004). Given that the parameter space has been adequately sampled, as shown in Figure 10.9, the obtained k of 35.41 responses per minute makes sense given the observations. The obtained re of 22.57 reinforcers per hour also makes sense. Although it may seem odd that extraneous reinforcers are measured in units of the delivered reinforcer, eye contacts, the value of eye contacts is plausible given the observed data.
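The sketch below (Python, using scipy.optimize.curve_fit; the starting values follow the reasoning given earlier in this section) fits Equation 1 to the rates in Table 10.3. It should reproduce estimates close to k = 35.41 responses per minute, re = 22.57 reinforcers per hour, and R2 = .96, along with standard errors that index how uncertain each estimate is.

```python
import numpy as np
from scipy.optimize import curve_fit

rnf = np.array([180.0, 138.6, 98.0, 22.5, 15.0, 9.0])   # reinforcers per hour (Table 10.3)
rsp = np.array([32.0, 30.1, 28.0, 18.8, 16.3, 6.7])     # responses per minute

def herrnstein(r, k, r_e):
    """Equation 1: R = k * r / (r + r_e)."""
    return k * r / (r + r_e)

# Initial guesses: k near the largest observed response rate,
# r_e near half the largest reinforcer rate.
(k, r_e), cov = curve_fit(herrnstein, rnf, rsp, p0=[32.0, 90.0])
predicted = herrnstein(rnf, k, r_e)

ss_reg = np.sum((rsp - predicted) ** 2)
ss_tot = np.sum((rsp - rsp.mean()) ** 2)
r_squared = 1 - ss_reg / ss_tot
std_errors = np.sqrt(np.diag(cov))      # uncertainty of the k and r_e estimates

print(k, r_e, r_squared)                # approximately 35.4, 22.6, 0.96
```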

Is There Error in the Values of the Independent Variable?

As noted for linear regression, when one performs nonlinear regression analysis, one is predicting the values of the dependent variable at specific values of the independent variable. We do not go into this issue in detail because the arguments are the same as outlined in the Evaluating Linear Models: Linear Regression section. Briefly, recall that if there is a lot of variability in the independent variable, one cannot answer with precision the question of how the dependent variable changes with changes in the independent variable.

Is Another Model More Appropriate?

As with linear models, when evaluating a nonlinear model, one should consider whether there is evidence that questions the assumptions underlying the model or evidence for other candidate models that seek to describe the same environment–behavior relation. For example, although Equation 7

typically accounts for more than 90% of the variance in response rates under VI schedules in laboratory, applied, and naturalistic settings (see de Villiers & Herrnstein, 1976; Fisher & Mazur, 1997; McDowell, 1988; and Williams, 1988, for reviews), there are some reasons to think it may not be wholly correct. As noted by Shull (1991), “The fact that an equation of particular form describes a set of data well does not mean that the assumptions that gave rise to the equation are supported” (p. 247). The assumptions underlying Herrnstein's (1970) hyperbola have been examined in terms of how the parameters change with manipulations of environmental variables (see Dallery & Soto, 2004, for a review). According to the equation, re should vary directly with changes in the extraneous reinforcer rate, and this prediction has some empirical support (Belke & Heyman, 1994; Bradshaw, Szabadi, & Bevan, 1976). Other studies, however, have questioned this prediction (Bradshaw, 1977; Soto, McDowell, & Dallery, 2005; White, McLean, & Aldiss, 1986). Similarly, there are conflicting findings for the prediction of a constant k as reinforcer properties are manipulated. Recall that this is required by Herrnstein's (1974) theory: Reinforcement alters the distribution of a constant amount of behavior among the available alternatives (de Villiers & Herrnstein, 1976, p. 1151; Herrnstein, 1974). Thus, the evidence has suggested some limitations to the assumptions underlying the hyperbola, and these limitations are independent of the criteria by which one evaluates the appropriateness of the model in describing a particular data set. Moreover, other models account for the same hyperbolic relations as Herrnstein's model and predict the conditions under which k should remain constant or vary (Dallery, McDowell, & Lancaster, 2000; Dallery et al., 2004; McDowell, 1980; McDowell & Dallery, 1999). There is another reason to question the validity of Herrnstein's equation. Recall that the original, strict matching equation, on which Herrnstein's equation is based, was modified to account for bias and insensitivity. If bias and insensitivity are ubiquitous in concurrent schedules, they should be incorporated into Herrnstein's (1974) account of single-alternative responding. We do not develop the equation that incorporates these parameters into Herrnstein's original equation (see McDowell, 2005,


for the derivation). As one might imagine, incorporating these parameters makes the equation appear more complicated. However, the equation describes the same hyperbolic shape, the same environment–behavior relation. Evaluating the equation involves the same steps outlined earlier. Indeed, in comparing the family of equations that make up the so-called modern theory of matching with the classic theory of matching, McDowell and colleagues (Dallery et al., 2005; McDowell, 2005) made extensive use of residual analysis to detect subtle, yet important deviations from the original, classic theory of matching. Although Dallery et al. (2005) and McDowell (2005) concluded that the modern theory represented a superior theory, alternative quantitative accounts also provide useful, general, and informative descriptions of the same environment–behavior relations (e.g., Killeen & Sitomer, 2003; Rachlin, Battalio, Kagel, & Green, 1981).

Model Comparison

Evaluating the accuracy and generality of a single quantitative theory is an important first step in describing environment–behavior relations. Comparing the model with other accounts is also a critical feature of the scientific process. Indeed, making distinctions between models can help further isolate controlling variables, develop critical empirical tests to distinguish the models (Mazur, 2001), and potentially lead to new insights about the processes responsible for behavior change in applied settings (Nevin & Grace, 2000). In the next section, we present two techniques to compare models. One technique, the extra sum-of-squares F test (Motulsky & Christopoulos, 2004), is based on conventional inferential statistics, and the other technique, Akaike's information criterion (AIC; Akaike, 1974; Motulsky & Christopoulos, 2004), is based primarily on information theory (Burnham & Anderson, 2002). The objective of each technique is to provide a quantitative basis on which to judge whether one model provides a better account of the data than does another. Both methods evaluate goodness of fit relative to model complexity (i.e., the number of parameters). This is because the more parameters a model has, the better it will describe the data, and

thus any improvement in goodness of fit relative to another model must be considered in the context of this increase in the number of parameters. Before comparing models using either the sum-ofsquares F test or AIC, researchers must determine whether both models provide a reasonable and accurate account of the data: (a) Do they describe the data well (i.e., high r2 or R2), (b) are the residuals randomly distributed, and (c) are the parameter values sensible? If the answer to each of these three questions is yes, then researchers can proceed to model comparison with the F test or AIC. If the answer to one of the questions is no, then researchers must either resolve the issue (e.g., if nonsensible parameters are the result of experimental error) or choose between the models on the basis of the residuals or goodness of fit (i.e., if one model has nonrandom residuals or does a poor job of describing the data, then choose the other model). In addition, researchers must ensure that all datasets are expressed in the same units. They need to compare apples to apples. If the dependent variable in one data set is expressed in responses per second and in another it is expressed in responses per minute, the sum-of-squared deviations will be much larger in the former than in the latter. A comparison with different units is not appropriate. The researcher must either reexpress the data so that the units are identical (e.g., multiply or divide by 60 in the previous example) or find a way to standardize, or normalize, the data before making the comparison. The appropriate model comparison test depends on whether the models are nested. Models are nested when one model is a simpler case of the other model. For instance, the original (Equation 2) and generalized matching (Equation 5) equations are nested models because Equation 4 can be reduced to Equation 2 when a and b are equal to 1. If the models are nested, they can be compared with an extra sum-of-squares F test or with the AIC. If they are not nested, they can only be compared using AIC (or an alternative statistical approach using parametric or nonparametric methods).

Extra Sum-of-Squares F Test to Compare Nested Models

Let us use strict matching and generalized matching as an example of model comparison using the data in


Table 10.2. Perfect matching is described by Equation 5 with a and b set to 1. Generalized matching is also described by Equation 5, but the parameters are free to vary. Assume that both models describe the data well, have randomly distributed residuals, and have sensible parameter values. The logic of the extra sum-of-squares F test is relatively simple. Goodness-of-fit is being compared along with the complexity of each model. The goodness-of-fit is measured by the sum-of-squared residuals. The complexity is measured by the number of parameters in each model. More parameters that are free to vary mean more complexity. The rule of parsimony asserts that simpler models should be preferred, so any increase in complexity should be more than offset by an increase in accuracy. Before going into the specific calculations, the F ratio compares accuracy (sum-of-squares) and complexity (parameters). If the calculated F ratio is high enough (and the corresponding p value is low enough), then the more complicated model can be accepted (generalized matching, in which the parameters are free to vary). If not, the simpler model is accepted (strict matching, in which the parameters are not free to vary). To calculate the extra sum-of-squares F test, one begins by calculating the sum-of-squared residuals for each model. The sum-of-squared residuals for strict matching in this case is 0.34; for generalized matching, it is 0.13 (see Table 10.4). A smaller sumof-squares is expected for generalized matching, and in general for models with more parameters that are free to vary. Next, the degrees of freedom for each model are calculated as the number of data points minus the number of parameters. For strict matching, the degrees of freedom are 10 (10 data points minus 0 free parameters); for generalized matching, the degrees of freedom are 8 (10 data points minus 2 parameters). Now, the relative difference in sumsof-squares and the relative difference in degrees of freedom are calculated. The relative difference in sums-of-squares is calculated by subtracting the alternative hypothesis sum-of-squares from the simpler hypothesis4 sum-of-squares, and then dividing that difference by the alternative hypothesis

Table 10.4
Obtained Logged Reinforcer and Response Ratios and Predicted Log Response Ratios and Corresponding Regression Calculations for Strict Matching and Generalized Matching

A: log rnf ratio    B: log rsp ratio    C: predicted log rsp ratio     D: predicted log rsp ratio
                                        from strict matching           from generalized matching
 0.11                0.17                0.11                           0.19
−0.37                0.07               −0.37                          −0.19
 0.49                0.61                0.49                           0.49
−0.86               −0.66               −0.86                          −0.58
 0.70                0.63                0.70                           0.66
−0.17               −0.21               −0.17                          −0.03
 0.63                0.57                0.63                           0.60
−0.22               −0.06               −0.22                          −0.07
 0.25                0.24                0.25                           0.30
−0.67               −0.44               −0.67                          −0.43

Regression calculations
SSTOT (both models): 1.78
SSREG: 0.34 (strict matching), 0.13 (generalized matching)
r2: 0.81 (strict matching), 0.93 (generalized matching)

Note. Columns A and B show the logged reinforcer and response ratios, and columns C and D show the predicted log response ratios from strict matching (Equation 5 with a and b equal to 1) and generalized matching (Equation 5 with a = 0.80 and log b = 0.10), with their respective regression calculations. rnf = reinforcer; rsp = response.

Similarly, the relative difference in degrees of freedom is calculated by subtracting the alternative hypothesis degrees of freedom from the simpler hypothesis degrees of freedom and then dividing that difference by the alternative hypothesis degrees of freedom. These numbers are summarized in Table 10.5. The relative difference in the sums-of-squares is 1.62 ([0.34 − 0.13]/0.13), and the relative difference in the degrees of freedom is 0.25 ([10 − 8]/8). The F ratio is the ratio of the relative difference in the sums-of-squares to the relative difference in the degrees of freedom, and it is a measure of how much better the alternative model describes the data relative to how many more degrees of freedom the alternative model uses.

4. In model comparison, we prefer the term simpler hypothesis to null hypothesis. We agree with Rodgers (2010) that the term null hypothesis has too much baggage and often refers to no effect or to chance processes. The modeling approach does not posit a null (or nil) effect; rather, it entails a specific, but simpler, hypothesis.



Table 10.5
Calculations Involved in the Extra Sum-of-Squares F Test for Model Comparison of the Strict Matching and Generalized Matching Accounts of the Data in Table 10.4

Model                                            SS      df
Simpler hypothesis (strict matching)             0.34    10.00
Alternative hypothesis (generalized matching)    0.13     8.00
Difference                                       0.21     2.00
Relative difference                              1.62     0.25
Ratio (F)                                        6.48
p                                                 .02

Note. Strict matching = Equation 5 with a and b equal to 1; generalized matching = Equation 5.

In this case, the value is 6.48 (1.62/0.25), and the associated p value is .02. The p value can be found in an F-ratio statistics table or by using an online calculator. Remember to use the correct degrees of freedom when looking up or calculating a p value: In this case, the degrees of freedom for the numerator are 2 (the difference between the degrees of freedom of the simpler and alternative hypotheses, 10 and 8, respectively), and the degrees of freedom for the denominator are 8 (the alternative hypothesis degrees of freedom). Typically, the criterion alpha value is set at .05, so in this case one would reject the simpler hypothesis, strict matching, and conclude that the more complex model, generalized matching, provides a significantly better account of the data.
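The arithmetic above can be reproduced in a few lines of code. The sketch below is ours, not the authors' (Python with SciPy; the function name and arguments are hypothetical); the slight discrepancy from the in-text F of 6.48 reflects the rounded intermediate values (1.62 and 0.25) used above.

from scipy.stats import f as f_dist

def extra_ss_f_test(ss_simple, df_simple, ss_alt, df_alt):
    """Extra sum-of-squares F test for two nested models.
    ss_*: residual sum-of-squares; df_*: data points minus free parameters."""
    rel_ss = (ss_simple - ss_alt) / ss_alt          # relative difference in sums-of-squares
    rel_df = (df_simple - df_alt) / df_alt          # relative difference in degrees of freedom
    f_ratio = rel_ss / rel_df
    p = f_dist.sf(f_ratio, df_simple - df_alt, df_alt)   # numerator df = 2, denominator df = 8 here
    return f_ratio, p

# Strict matching (0 free parameters) vs. generalized matching (2 free parameters), 10 data points
f_ratio, p = extra_ss_f_test(ss_simple=0.34, df_simple=10, ss_alt=0.13, df_alt=8)
print(round(f_ratio, 2), round(p, 3))   # approximately 6.46 and .02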

Akaike’s Information Criterion to Compare Nested Models

Another technique used for comparing models is AIC. As noted earlier, AIC is based on information theory. The basis of the AIC calculation is beyond the scope of this chapter (for a brief discussion of the basis of AIC, see Motulsky & Christopoulos, 2004; for a more detailed discussion, see Burnham & Anderson, 2002). The AIC calculation allows one to determine which model is more likely to be correct (the one with the lower AIC value) and how much more likely it is to be correct. Unlike the extra sum-of-squares F test, AIC can be used to compare nested or nonnested models.

We use our previous example to illustrate the use of AIC for model comparison. The logic is the same as in the extra sum-of-squares F test: Accuracy and complexity are compared, and increases in complexity must be offset by increases in accuracy. The first step in using AIC for model comparison is to calculate the AIC for each model using the equation

AIC = N \ln\left( \frac{SS_{REG}}{N} \right) + 2K,    (8)

where N is the number of data points, K is the number of parameters in the model plus 1, and SSREG is the model’s residual sum-of-squares. Consider Equation 8. The AIC value depends both on how well the model describes the data (how small the residual sum-of-squares, SSREG, is) and on how many parameters the model has (K). When comparing two models, the model more likely to be correct is the one with the lower AIC value because a lower AIC value represents a better balance between goodness of fit and number of parameters. When the number of data points is small relative to the number of parameters, which may often be the case in behavioral experiments, the AIC should be corrected (Motulsky & Christopoulos, 2004). The corrected AIC formula is

AIC_C = AIC + \frac{2K(K + 1)}{N - K - 1}.    (9)
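Equations 8 and 9 translate directly into code. The sketch below is ours (plain Python; the function names are hypothetical) and applies them to the sums-of-squares from Table 10.4.

from math import log

def aic(ss_reg, n, k):
    """Equation 8: AIC from the residual sum-of-squares (ss_reg),
    the number of data points (n), and k = number of free parameters + 1."""
    return n * log(ss_reg / n) + 2 * k

def aic_corrected(ss_reg, n, k):
    """Equation 9: small-sample corrected AIC."""
    return aic(ss_reg, n, k) + (2 * k * (k + 1)) / (n - k - 1)

# Strict matching: 0 free parameters (k = 1); generalized matching: 2 free parameters (k = 3)
print(round(aic_corrected(0.34, 10, 1), 2))   # about -31.31
print(round(aic_corrected(0.13, 10, 3), 2))   # about -33.43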

In the previous example, the residual sum-of-squares was 0.34 for strict matching and 0.13 for generalized matching. The number of data points is the same for each model (10), and K is 1 for strict matching and 3 for generalized matching. Thus, one AIC value can be calculated for each model. The corrected AIC for strict matching is −31.31, and the corrected AIC for generalized matching is −33.43. Because the AIC value for generalized matching (designated as AICB) is lower than the AIC value for strict matching (designated as AICA), one can say that generalized matching is more likely to be correct. Some authors simply present the individual AICs to make comparisons between models; the model with the smaller AIC is preferred. Another approach is to calculate the probability that one model is more likely to be correct by using the following equation:


\text{Probability} = \frac{e^{-0.5(AIC_B - AIC_A)}}{1 + e^{-0.5(AIC_B - AIC_A)}}.    (10)

If the corrected AIC for generalized matching, AICB (−33.43), and the corrected AIC for strict matching, AICA (−31.31), are substituted into Equation 10, the probability that generalized matching is correct is found to be approximately .74. Alternatively, it can be said that the chance that generalized matching is correct is 74%. Note that the difference could be calculated the other way around, which would yield a probability of approximately .26 that strict matching is correct relative to generalized matching. Finally, an evidence ratio that quantifies how much more likely one model is to be correct relative to another can be calculated by dividing the probability that one model is correct by the probability that the other model is correct. For the current example, the probability that generalized matching is correct (.74) is divided by the probability that strict matching is correct (.26), resulting in an evidence ratio of 2.85, which means that generalized matching is 2.85 times more likely to be correct than strict matching. These calculations are summarized in Table 10.6.
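Continuing the sketch above (ours; plain Python), Equation 10 and the evidence ratio follow directly from the two corrected AIC values.

from math import exp

def aic_probability(aic_better, aic_worse):
    """Equation 10: probability that the model with the lower AIC is correct."""
    delta = aic_better - aic_worse            # negative when the first model has the lower AIC
    return exp(-0.5 * delta) / (1 + exp(-0.5 * delta))

p_general = aic_probability(-33.43, -31.31)   # generalized vs. strict matching
p_strict = 1 - p_general
print(round(p_general, 2), round(p_strict, 2))   # about .74 and .26
print(round(p_general / p_strict, 2))            # evidence ratio of roughly 2.9 (2.85 in the text, from the rounded .74/.26)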

Akaike’s Information Criterion to Compare Nonnested Models

When comparing models that are not nested (i.e., when neither model is a simpler version of the other), researchers must use AIC. They must scrutinize each model (i.e., r2, residuals, parameters) before subjecting them to the AIC; the extra sum-of-squares F test cannot be used with nonnested models. We do not provide an example of using the AIC with nonnested models because the calculations described in the previous section are exactly the same whether AIC is used for comparing nested or nonnested models. When evaluating the probability that the more likely model is correct relative to another model (Equation 10), subtract the AIC for the less likely model (i.e., the model with the larger AIC value) from the AIC for the more likely model (i.e., the model with the smaller AIC value).

One area in which AIC could be applied to model comparison is research on temporal discounting. Temporal discounting equations describe the rate at which a reinforcer loses value as the delay to its receipt increases (Critchfield & Kollins, 2001; Mazur, 1987). Several nonnested quantitative models (e.g., exponential decay, hyperbolic decay) have been proposed that account for high proportions of the variance in responding (for a recent comparison of four discounting models, see McKerchar et al., 2009). Assuming that residual analysis does not reveal departures from randomness, nonnested AIC analysis may be useful in making comparisons between these models.
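As an illustration of how such a nonnested comparison might be carried out, the sketch below is ours rather than the chapter's (Python with NumPy and SciPy; the delay and indifference-point values are hypothetical, not data from any cited study). It fits a hyperbolic and an exponential discounting model to the same data and compares their corrected AICs exactly as in the nested case.

import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(d, a, k):
    """Hyperbolic decay: value = a / (1 + k * delay)."""
    return a / (1 + k * d)

def exponential(d, a, k):
    """Exponential decay: value = a * exp(-k * delay)."""
    return a * np.exp(-k * d)

def aic_corrected(ss_reg, n, k):
    """Equations 8 and 9 combined: small-sample corrected AIC."""
    return n * np.log(ss_reg / n) + 2 * k + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical delays (days) and indifference points (proportion of the immediate value)
delays = np.array([1.0, 7.0, 30.0, 90.0, 180.0, 365.0])
values = np.array([0.95, 0.80, 0.55, 0.35, 0.25, 0.15])

aiccs = {}
for name, model in [("hyperbolic", hyperbolic), ("exponential", exponential)]:
    params, _ = curve_fit(model, delays, values, p0=[1.0, 0.01])
    ss_reg = np.sum((values - model(delays, *params)) ** 2)
    aiccs[name] = aic_corrected(ss_reg, n=len(values), k=len(params) + 1)  # k = free parameters + 1

print(aiccs)   # the model with the lower corrected AIC is the more likely of the two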

Table 10.6
Calculations Involved in Akaike’s Information Criterion for Comparison of the Strict (Equation 5 With a and b Equal to 1) and Generalized Matching (Equation 5) Accounts of the Data in Table 10.4

Model                                            SSREG    N     Parameters    Akaike’s information criterion    Probability    Evidence ratio
Simpler hypothesis (strict matching)             0.34     10    0             −31.31                            .26
Alternative hypothesis (generalized matching)    0.13     10    2             −33.43                            .74            2.85


Model Comparison to Evaluate Whether an Intervention Changed Behavior

The model comparison approach can also be used to assess whether some intervention (e.g., an environmental or pharmacological intervention) produced a significant change in behavior. For example, the model comparison approach could be used to evaluate whether a drug or some other intervention produced a change in the dose–response curve in a behavioral pharmacology experiment.

Similarly, one could ask whether a curve, say a demand curve (which relates consumption to response requirement), changed as a function of a manipulation such as income (number of reinforcers available per day) or degree of food deprivation. Here, the comparison is not a change in the value of a single data point, as in an analysis of variance comparing means; the comparison is between two models (i.e., equations) of the relation across many data points (e.g., comparing a fit of the generalized matching law under control and drug conditions).

One could also assess whether a particular parameter value changed as a result of some intervention. For example, one could evaluate whether the sensitivity parameter in the generalized matching equation varies as a function of some manipulation such as drug administration. Similarly, researchers in behavioral pharmacology may be interested in whether a drug alters a parameter value in a quantitative model (e.g., Pitts & Febbo, 2004). Such changes may indicate that certain behavioral mechanisms (e.g., changes in sensitivity to reinforcer amount or delay) are responsible for a particular drug effect (e.g., an increase in impulsive choice). In the interest of brevity, we provide a detailed example of model comparison and only the general logic and procedure for parameter comparison.

The central question in model comparison is, did an intervention produce a significant change in the function describing the relation between the independent and dependent variables? Consider a case in which the researcher has collected two datasets, one obtained before and one obtained after an intervention. The model comparison approach asks whether separate fits of an equation to the data, pre- and postintervention, are better than a single fit of the equation to both datasets. If the separate fits are not better, the result suggests that the intervention had no effect on behavior. Note that the basic logic of all model comparison is entailed by this procedure: A more complicated model (one equation fitted separately to each dataset) is being compared with a more parsimonious model (one equation fitted to both datasets). An example of this comparison is shown graphically in Figure 10.11. The left panel shows the alternative hypothesis, and the right panel shows the simpler hypothesis (more details about the figure are discussed later). Either the F test or the AIC can be used to determine whether a particular intervention changed behavior, but we present only the F test. (For the AIC, one would need to calculate the sum-of-squared residuals, N, and K for each model; the AIC calculations were given earlier.) Motulsky and Christopoulos (2004) provided a succinct discussion of some factors that may lead a researcher to favor one test over the other.
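A sketch of this global-versus-separate-fits F test is given below. It is our illustration, not the authors' code (Python with NumPy and SciPy; the function and variable names are hypothetical), and it uses the fact that the generalized matching equation is linear in logarithmic coordinates, so each fit is an ordinary least-squares regression. The arrays hold the log reinforcer and response ratios from the discriminative-stimuli example and Table 10.7 discussed below.

import numpy as np
from scipy.stats import f as f_dist

def fit_line(x, y):
    """Fit log(B1/B2) = a*log(r1/r2) + log b by least squares; return the residual sum-of-squares."""
    a, log_b = np.polyfit(x, y, 1)
    residuals = y - (a * x + log_b)
    return np.sum(residuals ** 2)

# Log reinforcer (x) and response (y) ratios, without and with discriminative stimuli (Table 10.7)
x_no = np.array([0.11, -0.37, 0.49, -0.86, 0.70, -0.17, 0.63, -0.22, 0.25, -0.67])
y_no = np.array([0.17, 0.07, 0.61, -0.66, 0.63, -0.21, 0.57, -0.06, 0.24, -0.44])
x_ds = np.array([-0.30, -0.25, 1.00, -0.95, 0.43, -0.84, -0.77, 0.43, -0.85, 0.06])
y_ds = np.array([-0.22, -0.12, 1.60, -0.99, 0.62, -1.20, -0.83, 0.43, -0.85, 0.08])

# Simpler hypothesis: one global fit to all 20 points (2 free parameters)
ss_global = fit_line(np.concatenate([x_no, x_ds]), np.concatenate([y_no, y_ds]))
df_global = 20 - 2

# Alternative hypothesis: separate fits to each dataset (4 free parameters in total)
ss_separate = fit_line(x_no, y_no) + fit_line(x_ds, y_ds)
df_separate = 20 - 4

f_ratio = ((ss_global - ss_separate) / ss_separate) / ((df_global - df_separate) / df_separate)
p = f_dist.sf(f_ratio, df_global - df_separate, df_separate)
print(round(f_ratio, 2), round(p, 3))   # a small p favors the separate fits, i.e., an effect of the stimuli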

Figure 10.11. A model comparison approach for nested models. The left panel shows the more complicated alternative hypothesis. The alternative hypothesis is that separate fits of the equation provide a better account of the data. The more parsimonious, simpler hypothesis, shown in the right panel, is that one equation can account for all of the data. The data points represent the logged response and reinforcer ratios presented in Table 10.7.


In this example, assume we have conducted an experiment with pigeons in which we have manipulated the presence of discriminative stimuli. The discriminative stimuli, such as different colored lights, signal the availability of reinforcement at each of the two alternatives. If behavior is more sensitive to changes in reinforcer rates when discriminative stimuli are used, we can make a prediction: A fit of the generalized matching equation with discriminative stimuli will differ from the fit of the generalized matching equation without discriminative stimuli. The first dataset, obtained without discriminative stimuli, and the second dataset, obtained with discriminative stimuli, are presented in Table 10.7 and plotted in the left and right panels of Figure 10.11. The simpler hypothesis is that the discriminative stimuli have no effect on behavior and that a single fit describes all the data. Thus, we fit the generalized matching equation to all the data pooled together, which is also called global fitting. The global fit is indicated by the solid line in the right panel of Figure 10.11. The values of the parameters for the global fit are a = 1.09 and log b = 0.09. The alternative hypothesis is that the discriminative stimuli did indeed change behavior and that separate fits of the generalized matching equation are better (one fit for each data set).

Table 10.7
Log Response and Reinforcer Ratios Obtained Under Conditions With and Without Discriminative Stimuli

Without discriminative stimuli          With discriminative stimuli
Log rnf ratio    Log rsp ratio          Log rnf ratio    Log rsp ratio
 0.11             0.17                  −0.30            −0.22
−0.37             0.07                  −0.25            −0.12
 0.49             0.61                   1.00             1.60
−0.86            −0.66                  −0.95            −0.99
 0.70             0.63                   0.43             0.62
−0.17            −0.21                  −0.84            −1.20
 0.63             0.57                  −0.77            −0.83
−0.22            −0.06                   0.43             0.43
 0.25             0.24                  −0.85            −0.85
−0.67            −0.44                   0.06             0.08

Note. rnf = reinforcer; rsp = response.

Thus, we also fit the generalized matching equation to each data set separately. The alternative hypothesis is shown graphically by the dashed and dotted lines in the left panel of Figure 10.11. The parameter values for our dataset without discriminative stimuli are a = 0.80 and log b = 0.10; the parameter values for our dataset with discriminative stimuli are a = 1.28 and log b = 0.11. As always, before moving on to model comparison using an F test or the AIC, one must evaluate each model separately in terms of visual inspection, R2, residuals, and the other criteria listed earlier. It would not be appropriate to perform model comparison if one of the models produces nonrandom residuals or unrealistic parameter values (e.g., Dallery et al., 2005). Because no problems occurred with ei