

Statistical Methods for Testing and Evaluating Defense Systems
Interim Report

Panel on Statistical Methods for Testing and Evaluating Defense Systems
Committee on National Statistics
Commission on Behavioral and Social Sciences and Education
National Research Council

National Academy Press, Washington, D.C., 1995


NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine. The members of the committee responsible for the report were chosen for their special competencies and with regard for appropriate balance. This report has been reviewed by a group other than the authors according to procedures approved by a Report Review Committee consisting of members of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.

The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce Alberts is president of the National Academy of Sciences.

The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievements of engineers. Dr. Harold Liebowitz is president of the National Academy of Engineering.

The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of the Institute of Medicine.

The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy's purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce Alberts and Dr. Harold Liebowitz are chairman and vice chairman, respectively, of the National Research Council.

The project that is the subject of this report is supported by funds from the Office of the Director of Operational Test and Evaluation at the U.S. Department of Defense.

Copyright 1995 by the National Academy of Sciences. All rights reserved.

Additional copies of this report are available from: Committee on National Statistics, National Research Council, 2101 Constitution Avenue, N.W., Washington, D.C. 20418

Printed in the United States of America


PANEL ON STATISTICAL METHODS FOR TESTING AND EVALUATING DEFENSE SYSTEMS

JOHN E. ROLPH (Chair), School of Business Administration, University of Southern California
MARION BRYSON, U.S. Army (Retired), Marina, California
HERMAN CHERNOFF, Department of Statistics, Harvard University
JOHN D. CHRISTIE, Logistics Management Institute, McLean, Virginia
LOUIS GORDON, Filoli Information Systems, Palo Alto, California
KATHRYN B. LASKEY, Department of Systems Engineering and Center of Excellence in C3I, George Mason University
ROBERT C. MARSHALL, Department of Economics, Pennsylvania State University
VIJAYAN N. NAIR, Department of Statistics, University of Michigan
ROBERT T. O'NEILL, Division of Biometrics, Food and Drug Administration, U.S. Department of Health and Human Services, Rockville, Maryland
STEPHEN M. POLLOCK, Department of Industrial and Operations Engineering, University of Michigan
JESSE H. POORE, Department of Computer Science, University of Tennessee
FRANCISCO J. SAMANIEGO, Division of Statistics, University of California, Davis
DENNIS E. SMALLWOOD, The RAND Corporation, Santa Monica, California

DUANE L. STEFFEY, Study Director
MICHAEL L. COHEN, Senior Program Officer
ANU PEMMARAZU, Research Assistant
ERIC M. GAIER, Consultant
CANDICE S. EVANS, Project Assistant


COMMITTEE ON NATIONAL STATISTICS 1994-1995

NORMAN M. BRADBURN (Chair), National Opinion Research Center, University of Chicago
JOHN E. ROLPH (Vice Chair), School of Business Administration, University of Southern California
JOHN F. GEWEKE, Department of Economics, University of Minnesota
JOEL B. GREENHOUSE, Department of Statistics, Carnegie Mellon University
ERIC A. HANUSHEK, W. Allen Wallis Institute of Political Economy, University of Rochester
ROBERT M. HAUSER, Institute for Research on Poverty, University of Wisconsin, Madison
NICHOLAS JEWELL, School of Public Health, University of California, Berkeley
WILLIAM NORDHAUS, Department of Economics, Yale University
JANET NORWOOD, The Urban Institute, Washington, D.C.
EDWARD B. PERRIN, Department of Health Services, University of Washington
KEITH F. RUST, Westat, Inc., Rockville, Maryland
DANIEL L. SOLOMON, College of Physical and Mathematical Sciences, North Carolina State University

MIRON L. STRAF, Director


Contents

PREFACE  vii

EXECUTIVE SUMMARY  1

1  INTRODUCTION  7
   Study Context  7
   Panel Objectives  9
   Statistics and Information Management in Defense Testing  10
   This Report and Future Work  13

2  USE OF EXPERIMENTAL DESIGN IN OPERATIONAL TESTING  15
   Case Study #1: Apache Longbow Helicopter  16
   Case Study #2: ATACMS/BAT System  19
   Future Work  23

3  TESTING OF SOFTWARE-INTENSIVE SYSTEMS  24
   Role for Statistical Methods  25
   Activities to Date  26
   Future Work  27

4  SYSTEM RELIABILITY, AVAILABILITY, AND MAINTAINABILITY  28
   Reliability, Availability, and Maintainability Testing and Evaluation in the Military Services  29
   Variability in Reliability, Availability, and Maintainability Policy and Practice  30
   Industrial (Nonmilitary) Standards  31
   Future Work  32

5  USE OF MODELING AND SIMULATION IN OPERATIONAL TESTING  33
   Scope, Procedures, and Progress to Date  34
   Concerns  35
   Future Work  39

6  EFFORTS TOWARD A TAXONOMIC STRUCTURE OF DOD SYSTEMS FOR OPERATIONAL TESTING  41
   Preliminary Work Toward a Taxonomic Structure  41
   Future Work  46

APPENDICES
A  The Organizational Structure of Defense Acquisition  49
B  A Short History of Experimental Design, with Commentary for Operational Testing  55
C  Selecting a Small Number of Operational Test Environments  62
D  Individuals Consulted  72
E  DoD and the Army Test and Evaluation Organization  75

REFERENCES  77

BIOGRAPHICAL SKETCHES OF PANEL MEMBERS AND STAFF  81


Preface

The Committee on National Statistics of the National Research Council (NRC) has had a long-standing goal of helping to develop and encourage the use of state-of-the-art statistical methods across the federal government. As a result of this interest, discussions began several years ago during meetings of the Committee on National Statistics about the possibility of conducting a study for the U.S. Department of Defense (DoD). Mutual interest between the committee and the DoD Office of Program Analysis and Evaluation in greater application of statistics within DoD led to a meeting of key DoD personnel and several NRC staff. As a result of this meeting, system testing and evaluation emerged as an area where statistical science could prove useful.

Consequently, at the request of DoD, the Committee on National Statistics, in conjunction with the NRC Committee on Applied and Theoretical Statistics, held a two-day workshop in September 1992 on experimental design, statistical modeling, simulation, sources of variability, data storage and use, and operational testing of weapon systems. The workshop was sponsored by the Office of the Director of Operational Test and Evaluation, and the Office of the Assistant Secretary of Defense for Program Analysis and Evaluation. The overarching theme of the workshop was that using more appropriate statistical approaches could improve the evaluation of weapon systems in the DoD acquisition process.

Workshop participants expressed the need for a study to address in greater depth the issues that surfaced at the workshop. Therefore, at the request of DoD, a multiyear panel study was undertaken by the Committee on National Statistics in early 1994. The Panel on Statistical Methods for Testing and Evaluating Defense Systems was established to recommend statistical methods for improving the effectiveness and efficiency of testing and evaluation of defense systems, with emphasis on operational testing. The 13-member panel comprises experts in the fields of statistics (including quality management, decision theory, sequential testing, reliability theory, and experimental design), operations research, software engineering, defense acquisition, and military systems. The study is sponsored by the DoD Office of the Director of Operational Test and Evaluation.

Early in its work, the panel formed seven working groups to study particular aspects of defense testing: (1) design of experiments; (2) uses of modeling and computer simulation; (3) system reliability, availability, and maintainability; (4) software-intensive systems; (5) organizational context; (6) taxonomy of defense systems in operational testing; and (7) development of case studies. An eighth working group, on methods for combining information, has been merged into two of the other working groups: modeling and simulation, a primary application of such methods; and organizational context, which will consider organizational aspects of combining information in the testing process. This interim report presents the results of the panel's work to date in these areas.

We have two goals in preparing this report: (1) to provide the sponsor and the defense testing community with feedback based on the panel's ongoing review of current test practices and (2) to present our current approaches and plans so that interested parties can provide input—for example, additional literature or expert testimony—for our final report. Because the report represents work in progress, we include few conclusions and no recommendations at this time.

From the beginning of this study we have enjoyed the cooperation and participation of many people. We particularly wish to acknowledge the support of Philip Coyle, director, and Ernest Seglie, science advisor, Office of the Director, Operational Test and Evaluation, U.S. Department of Defense (the study sponsors); Henry Dubin, technical director, U.S. Army Operational Test and Evaluation Command; James Duff, technical director, U.S. Navy Operational Test and Evaluation Force; Marion Williams, technical director, U.S. Air Force Operational Test and Evaluation Center; and Robert Bell, technical director, U.S. Marine Corps Operational Test and Evaluation Activity. In addition, we are grateful to many other representatives from the military services, the Office of the Secretary of Defense, and private organizations in the testing community. Appendix D provides a more comprehensive list of the panel's contacts.

As the study moves into its final phase of work, the panel will investigate further the issues addressed in this report. The final report is planned for publication in December 1996.

John E. Rolph, Chair
Panel on Statistical Methods for Testing and Evaluating Defense Systems


Executive Summary

The Panel on Statistical Methods for Testing and Evaluating Defense Systems was formed to assess current practice related to operational testing in the Department of Defense (DoD) and to consider how the use of statistical techniques can improve that practice. This interim report has two purposes: (1) to provide the sponsor and the defense acquisition community with feedback on the panel's work to date and (2) to present our current approaches and plans so that interested parties can provide input—for example, additional literature or expert testimony—for our final report. Since this report represents work in progress, it includes relatively few conclusions and no recommendations.

Chapters of this report describe our progress to date in five major areas being addressed by working groups of the panel: use of experimental design; testing of software-intensive systems; system reliability, availability, and maintainability; use of modeling and simulation in operational testing; and efforts to develop a taxonomic structure for operational testing of new defense systems. Also, as discussed below, we have been led to take a broader view of how operational testing fits into the acquisition process, with the possibility that we may identify areas in the larger acquisition process in which changes could make operational testing more informative. The rest of this summary presents our key interim findings and outlines topics that the panel intends to consider further in the remainder of the study.

KEY ISSUES

Experimental Design

The goal of an operational test is to measure the performance, under various conditions, of diverse aspects of a newly developed DoD system to determine whether the system satisfies a number of criteria. Some of these conditions can be completely controlled, and some are not subject to control. Since the size, scope, and duration of operational testing are constrained by budgetary, legal, and time considerations, the sample size available for a test is typically small. Thus, there is a benefit in designing a test as efficiently as possible so it can produce useful information about the performance of the system under the various conditions of interest. In its work to date, the panel has found much to commend about current practice in DoD operational testing. However, we do have several concerns related to experimental design:

• Uninformative scenarios in comparative testing. The choice of test scenarios does not always reflect consideration of the relative strengths of a new system compared to an existing control in these scenarios (when a control is relevant). It is important to use a priori assessments of which scenarios will discriminate most in identifying situations in which a new system might dominate an existing system, or vice versa, in terms of performance.

• Testing inside the envelope. A related concern is that operational testing tends to focus on the environments that are most typical in the field. Although this approach has the advantage that one directly estimates the system performance for the more common environments, the disadvantage is that little is known about the performance when the system is strongly stressed.

• Subjective scoring rules. Scoring rules with respect to which events are considered unusable are vaguely defined, as is precisely what constitutes a trial. Further, the definition of an outlier in an operational test is not always made as objectively as possible. Testers need to be more precise about the objective of each operational test. Sometimes, understanding the performance of the system in the most demanding of several environments is paramount, so that the objective is to estimate a lower bound on system performance; at other times, a measurement of the average performance of the system across environments is needed.

• Measurement inefficiencies. Data that measure system effectiveness, especially with respect to hit rates, are often treated as binary (zero-one) data, but such reduced data usually contain much less information than the original data on a continuous scale. For example, information on target miss distance can be used in modeling the variability of a shot about its mean, which in turn can be used to improve estimation of the hit rate.
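
To illustrate the last point with purely hypothetical numbers, the sketch below compares a hit-rate estimate computed from binary hit/miss indicators with one computed from the recorded miss distances under an assumed circular-normal (Rayleigh) aim-error model. The data, lethal radius, and model are illustrative assumptions, not results from any actual test.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical operational test: 20 shots with recorded radial miss distance
# (meters); a shot within a 5 m lethal radius is scored as a hit.
miss = rng.rayleigh(scale=4.0, size=20)
radius = 5.0

# Binary analysis: reduce each shot to hit/no-hit and use the sample proportion.
hits = miss < radius
p_binary = hits.mean()
se_binary = np.sqrt(p_binary * (1 - p_binary) / hits.size)

# Continuous analysis: model the radial miss distance directly (Rayleigh, i.e.,
# circular-normal aim error) and derive the hit probability from the fitted scale.
sigma = np.sqrt(np.mean(miss**2) / 2)                 # maximum likelihood estimate
p_model = 1 - np.exp(-radius**2 / (2 * sigma**2))

# Delta-method standard error of the model-based estimate.
dp_dsigma = -(radius**2 / sigma**3) * np.exp(-radius**2 / (2 * sigma**2))
se_model = abs(dp_dsigma) * sigma / (2 * np.sqrt(miss.size))

print(f"binary estimate of hit rate : {p_binary:.2f} (SE {se_binary:.2f})")
print(f"model-based estimate        : {p_model:.2f} (SE {se_model:.2f})")
```

When the assumed miss-distance model is reasonable, the model-based standard error is typically smaller for the same number of shots, which is the sense in which reducing continuous data to binary indicators discards information.
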
Testing of Software-Intensive Systems

Defense systems are becoming increasingly complex and software-intensive. Early in the panel's work, it became clear that software is a critical path through which systems achieve their performance objectives. We therefore recognized the need for special attention to software-intensive systems and have sought to understand how operational testing of such systems is conducted across the military services. On the basis of our work to date, we note several concerns about current practice:

• Barriers to effective software testing. Several barriers limit effective software test and evaluation. One important barrier is that DoD has not acknowledged or addressed the criticality of software to systems' operational requirements early enough in the acquisition process. There is a perception that software is secondary to hardware and can be fixed later. Also, we concur with the findings of others who have identified three related barriers: (1) DoD has not developed, implemented, or standardized decisionmaking tools and processes for measuring or projecting software system cost, schedule, and performance risks; (2) DoD has not developed a test and evaluation policy that provides consistent guidance regarding software maturity; and (3) DoD has not adequately defined and managed software requirements (U.S. General Accounting Office, 1993).

• Evolutionary acquisition of software. In evolutionary acquisition, the software code that is evaluated in operational testing is being continuously changed, so that what is operationally tested is not necessarily what is deployed. Thus, evolutionary acquisition may compromise the utility of operational testing. We plan to study this issue and its implications further.

System Reliability, Availability, and Maintainability

Considerations of operational suitability—including reliability, availability, and maintainability—are likely to have different implications for the design and analysis of operational tests than considerations of effectiveness, and consequently merit distinct attention by the panel in its work. Our overall goal in this area is to contribute to the improved use of statistics in reliability, availability, and maintainability assessment by reviewing best current practices within DoD, in other parts of the federal government, and in private industry, with respect to both technical aspects of statistical methodology and policy aspects of reliability, availability, and maintainability testing and evaluation processes. At this time, we make the following observations:

• Variability in reliability, availability, and maintainability policy and practice. Considerable differences in organization and methodology exist among the services, as well as within the testing community in each service. Such differences may be partly attributable to variability in the training and expertise of developmental and operational testing personnel. A related concern is the reliance on standard modeling assumptions (e.g., exponentiality) in circumstances in which they may not be tenable, and we are currently assessing the possible consequences for test design and evaluation (a brief illustration of such a check follows this list).

• No accepted set of best reliability, availability, and maintainability practices. Efforts to achieve more efficient (i.e., less expensive) decision making by pooling data from various sources require documentation of the data sources and of the conditions under which the data were collected, as well as clear and consistent definitions of various terms. Such efforts underscore the potential value of standardizing reliability, availability, and maintainability testing and evaluation across the services and encouraging the use of best current practices. Industry models include the International Organization for Standardization (ISO 9000) series and existing documents on reliability, availability, and maintainability practices in the automobile and telephone industries.
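
The illustration referred to above is sketched here: fitting a Weibull distribution to hypothetical times between failures and checking whether the exponential special case (shape parameter equal to 1, i.e., a constant failure rate) is tenable. The data and the use of SciPy are assumptions for illustration only and do not represent any particular system.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical times between operational mission failures, in hours.
tbf = rng.weibull(1.8, size=30) * 120.0

# Fit a two-parameter Weibull (location fixed at zero); a shape of 1 would
# correspond to the exponential assumption of a constant failure rate.
shape, _, scale = stats.weibull_min.fit(tbf, floc=0)
print(f"fitted Weibull shape = {shape:.2f} (1.0 is the exponential case)")

# Likelihood-ratio comparison of the exponential fit against the Weibull fit.
ll_weibull = stats.weibull_min.logpdf(tbf, shape, 0, scale).sum()
ll_expon = stats.expon.logpdf(tbf, 0, tbf.mean()).sum()
lr = 2 * (ll_weibull - ll_expon)
print(f"likelihood-ratio statistic = {lr:.1f}, p = {stats.chi2.sf(lr, df=1):.3f}")
```
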

Use of Modeling and Simulation in Operational Testing

The panel's work in this area is intended to address how statistical methods might be used to assess the use of and to validate simulations for developmental or, especially, operational testing.1 It seems clear that few if any of the current collection of simulations were designed for use in developmental or operational testing. The original purpose was typically to assist in training and doctrine. Therefore, the primary question concerns the extent to which simulations, possibly with some adjustments and enhancements, can be used for testing purposes, with the important objectives of saving limited test funds, enhancing safety, effectively increasing the sample size of a test, and possibly permitting the extrapolation of test results to untested scenarios. Although we applaud efforts made throughout DoD to make operational testing more cost-effective through the use of simulation, we have identified several concerns:

1 We use the term “simulation” to mean both modeling and simulation.


• Infrequent and sometimes incorrectly applied attempts at rigorous validation. Rigorous validation of simulations, although difficult or expensive, is often absent or incorrectly applied in many operational testing applications. When applied, external validation can sometimes be used to overfit a model to field experience. In such cases, close correspondence between a “tuned” simulation and operational results does not necessarily imply that the simulation will predict performance well in any new scenario. The considerable literature on statistical validation of complex computer models apparently has not been effectively disseminated in the defense modeling community.

• Little use of statistical methods in designing simulations and interpreting results. Statistical methods can be used in characterizing relationships between inputs and outputs, planning efficient simulation runs, interpolating results for cases that were not run, detecting and analyzing unusual cases (outliers), and estimating uncertainties in simulation results. We have seen little evidence of awareness and use of statistical methods for these purposes. The DoD simulation policy and directives literature is generally deficient in its statistical content.

• Impossibility of identifying the “unknown unknowns.” Although appropriately validated simulations can supplement the knowledge gained from operational testing, no simulation can discover a system problem that arises from factors that were not included in the models on which the simulation is built. Often, unanticipated problems become apparent only during field testing of the system in an operational environment (or, in some cases, after deployment).

• Lack of treatment of reliability, availability, and maintainability. Many models and simulations used in the acquisition process apparently assume perfect availability and reliability of the system. Also, this observation seems to hold more generally for other aspects of suitability, such as logistics support, interoperability, transportability, operator fatigue, and state-of-training realities. Despite their inherent limitations, simulations that purport to assess a system's operational value should incorporate, to the extent possible, estimates of the reliability, availability, and maintainability of that system.
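
To make the second concern above concrete, the following sketch plans a small set of simulation runs with a space-filling (Latin hypercube) design and fits a Gaussian-process emulator that interpolates the output to untried input settings and attaches an uncertainty to each prediction. The input factors, the stand-in simulation function, and the reliance on SciPy and scikit-learn are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Plan 20 runs over three hypothetical simulation inputs (target range in km,
# closing speed in m/s, clutter level) using a Latin hypercube design.
sampler = qmc.LatinHypercube(d=3, seed=0)
design = qmc.scale(sampler.random(n=20), [1.0, 50.0, 0.0], [10.0, 300.0, 1.0])

def run_simulation(x):
    # Stand-in for an actual simulation run (illustrative only).
    noise = np.random.default_rng(int(1e6 * x.sum())).normal(0, 0.02)
    return np.exp(-x[0] / 8.0) * (1.0 - 0.3 * x[2]) + noise

y = np.array([run_simulation(x) for x in design])

# Emulate the simulation: predictions with uncertainties for cases not run.
kernel = ConstantKernel() * RBF(length_scale=[2.0, 60.0, 0.3])
emulator = GaussianProcessRegressor(kernel, alpha=1e-4, normalize_y=True)
emulator.fit(design, y)

untried = np.array([[5.0, 200.0, 0.5]])
mean, sd = emulator.predict(untried, return_std=True)
print(f"predicted output {mean[0]:.3f} +/- {sd[0]:.3f} at an untried setting")
```
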
TOPICS FOR FURTHER STUDY

• Optimal allocation of test resources. An important general problem concerns how to allocate a small sample of test objects to several environments optimally so that the test sample is maximally informative about the overall system performance. It may be possible to improve on the common practice of alternately varying one factor of a central test scenario. The panel is considering several alternative statistical approaches to this problem involving such techniques as multidimensional scaling, fractional factorial designs, and Bayesian methods (see Appendix C).
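
As one concrete version of the fractional factorial idea mentioned above, the sketch below constructs a resolution IV half fraction of a 2^4 design, covering four hypothetical two-level test factors in eight runs rather than sixteen. The factor names are invented for illustration and are not taken from any actual test plan.

```python
from itertools import product

# Hypothetical two-level test factors for an operational test.
factors = ["terrain", "visibility", "threat density", "crew experience"]
levels = {-1: "low", 1: "high"}

# Half-fraction 2^(4-1) design with generator D = ABC (defining relation
# I = ABCD, resolution IV): eight runs instead of sixteen.
runs = []
for a, b, c in product((-1, 1), repeat=3):
    d = a * b * c
    runs.append((a, b, c, d))

for i, run in enumerate(runs, start=1):
    setting = ", ".join(f"{name}={levels[lvl]}" for name, lvl in zip(factors, run))
    print(f"run {i}: {setting}")
```
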

• Training effects. Operational test results can be affected significantly by the degree of training that soldiers receive prior to testing and by player learning that occurs during the test. Therefore, we are examining the potential use of statistical methods, including trend-free and other experimental designs, to address and correct for these confounding effects.

• Test sample size. With respect to evaluation, a real problem is how to decide what sample size is adequate for making decisions with a stated confidence. The panel will examine this question—sometimes referred to as "How much testing is enough?"—for discussion in our final report. This effort may involve notions of Bayesian inference, statistical decision theory, and graphical presentation of uncertainty.

• Alternatives to hypothesis testing. We firmly believe that applying the standard approach to hypothesis testing in the operational test context is not appropriate for several reasons, including the asymmetry of the approach and its limited value in problems involving small sample sizes. The primary objective of operational test and evaluation is to provide information to the decision maker in the form that is most valuable in deciding on the next course of action. Therefore, in analyzing operational test results, one should concentrate on estimating quantities of interest and assessing the risks associated with possible decisions. We will continue to explore how decision-theoretic formulations might be used in evaluating operational test results.
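
The following is a minimal sketch, under invented numbers, of the estimation-oriented view described in the last two items: a Beta-binomial analysis reports a posterior interval for a success probability and the probability that a requirement is met, and shows how the interval narrows as trials are added, which is one way of framing "how much testing is enough." The prior, trial counts, and requirement are hypothetical.

```python
from scipy import stats

# Hypothetical small operational test: 17 successes in 20 trials, with a
# requirement that the success probability exceed 0.75.
successes, trials, requirement = 17, 20, 0.75

# Beta(1, 1) prior (uniform) updated by the binomial test results.
posterior = stats.beta(1 + successes, 1 + trials - successes)

lo, hi = posterior.ppf([0.05, 0.95])
prob_meets = posterior.sf(requirement)

print(f"posterior mean        : {posterior.mean():.3f}")
print(f"90% credible interval : ({lo:.3f}, {hi:.3f})")
print(f"Pr(p > requirement)   : {prob_meets:.2f}")

# How the interval width shrinks with more trials at the same observed rate.
for n in (10, 20, 40, 80):
    post = stats.beta(1 + 0.85 * n, 1 + 0.15 * n)
    width = post.ppf(0.95) - post.ppf(0.05)
    print(f"n = {n:3d}: 90% interval width = {width:.3f}")
```
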

• Statistical applications in software testing. The panel believes that statistical methods can and should play a role in the testing and evaluation of software-intensive systems, particularly because not every user scenario can be tested. Consequently, in order to understand current practice in operational testing of software-intensive systems and how statistical methods might be applied, we continue to seek answers in this specific context to a set of general questions related to the statistical design and analysis of experiments: How does one characterize the population of scenarios to test and the environments of use? How does one select scenarios to test from the population? How does one know when to stop testing? What are the stopping criteria? How does one generalize from the information gained during testing to the population of scenarios not tested? How does one plan for optimal use of test resources and adjust the test plan as the testing unfolds?

• Implications of "intended use" concept for software testing. A shift to a new paradigm is taking place, driven by the concept of "intended use" articulated in the definition of operational testing. To implement this paradigm, it would be necessary to prescribe certain criteria that, if met, would support a decision that the software (or system containing the software) is fit for field use. These criteria might involve experiments, observational studies, or other means of evaluation, and they would have to be prescribed in technical detail—including specification of costs, schedules, and methods—thus establishing requirements and constraints on the design and development process.
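
Returning to the scenario-selection questions raised under statistical applications in software testing, one commonly cited statistical device is an operational profile: scenarios are sampled for testing in proportion to their expected frequency of use in the field. The profile below is entirely hypothetical; it is meant only to show the mechanics, not to suggest that profile sampling alone answers the panel's questions.

```python
import random

# Hypothetical operational profile: relative frequency with which each class
# of user scenario is expected to occur in fielded use.
profile = {
    "route planning":        0.35,
    "message handling":      0.25,
    "sensor fusion display": 0.20,
    "degraded-mode ops":     0.15,
    "startup/shutdown":      0.05,
}

def draw_test_scenarios(n, seed=0):
    """Sample n test scenarios in proportion to expected field usage."""
    rng = random.Random(seed)
    names = list(profile)
    weights = [profile[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# A usage-weighted suite of 12 scenarios; rare-but-critical scenarios can be
# forced into the suite separately if the profile alone under-represents them.
print(draw_test_scenarios(12))
```
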
• Opportunities and methods for combining reliability, availability, and maintainability data. We have concluded that operational tests of combat equipment are not, as a rule, designed primarily with reliability, availability, and maintainability issues in mind. Addressing these issues typically involves experiments of longer duration than is feasible in operational testing. Consequently, "certification" of operational suitability can be accomplished better through other means of testing. For example, data collected during training exercises, developmental testing, component testing, bench testing, and operational testing, along with historical data on systems with similar suitability characteristics, might be appropriately combined in an inference scheme that would be much more powerful than schemes in current use. In future work, we will seek to clarify the role hierarchical modeling might play in reliability, availability, and maintainability inference from such a broadened perspective.
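
A minimal sketch of the pooling idea under a simple assumed model: failure counts from several sources are treated as Poisson, with a common Gamma population distribution over per-hour failure rates, so that each source's estimate is shrunk toward a pooled value. The counts, exposure hours, and Gamma-Poisson structure are illustrative assumptions, not a method endorsed by the panel.

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical failure counts and exposure hours from different data sources
# for the same class of equipment (illustrative numbers only).
sources = ["bench test", "developmental test", "training exercise", "operational test"]
failures = np.array([14, 9, 40, 2])
hours = np.array([2000.0, 1200.0, 3500.0, 400.0])

# Assumed model: per-hour failure rates vary across sources as Gamma(a, b),
# and each observed count is Poisson(rate * hours). The marginal distribution
# of a count is then negative binomial, which lets us fit a and b directly.
def neg_marginal_loglik(log_params):
    a, b = np.exp(log_params)                   # keep both parameters positive
    p = b / (b + hours)
    return -stats.nbinom.logpmf(failures, a, p).sum()

fit = optimize.minimize(neg_marginal_loglik, x0=[0.0, 5.0], method="Nelder-Mead")
a_hat, b_hat = np.exp(fit.x)

raw = failures / hours
pooled = (a_hat + failures) / (b_hat + hours)   # per-source posterior means

for name, r, s in zip(sources, raw, pooled):
    print(f"{name:18s}  raw {1000 * r:5.1f}   shrunk {1000 * s:5.1f}   failures per 1000 h")
```
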

• Prescriptions for use of modeling and simulation. In formulating a position on the use of simulation in operational testing, the panel will continue to seek answers to several specific questions: Can simulations built for other purposes be used in their present state in either developmental or operational testing? What modifications might generally improve their utility for this purpose? How can their results, either in original form or suitably modified, be used to help plan developmental or operational tests? How can the results from simulations and either developmental or operational tests be combined to obtain a defensible assessment of effectiveness or suitability? These questions are all related to the degree to which simulations can approximate laboratory or field tests, and their answers involve verification, validation, and accreditation.

• A taxonomic structure for defense systems. Various attributes of military systems require distinctive approaches to operational testing and to the application of statistical techniques. Because of the many different factors that must be considered in any particular test, the panel decided to undertake development of a scheme for classifying weapon systems and associated testing issues. Such a taxonomic structure, if developed, should satisfy three general objectives: (1) reflect the prevalence of various types of systems; (2) highlight attributes that might call for different statistical approaches, affect decision tradeoffs, or involve qualitatively different consequences; and (3) facilitate the integration of commercial approaches by helping to align military and commercial contexts. Preliminary work on this topic suggests that producing a taxonomic structure will be difficult and that its appropriate scope, depth, and nature will depend strongly on its intended uses.

• Broader issues concerning operational testing and defense acquisition. In assessing how to make optimum use of best statistical practices in operational testing, we have repeatedly been led to consider various aspects of the larger acquisition process of which operational testing is a part. For example, starting operational testing earlier in the acquisition process—an idea that has won support in the DoD community and among members of our panel—has implications for how statistical methods would be applied. Similarly, the operational test design or evaluation of system performance might conceivably make use either of optimal experimental design methods that depend on parameters that must be estimated or of statistical techniques that “borrow strength” from data earlier in the process. These approaches might use information from developmental testing, but concern about preserving the independence of operational and developmental testing could make such ideas controversial. Organizational constraints and competing incentives complicate the application of sequential testing methods, as well as some statistical ideas about how to allocate operational testing resources as a function of the total system budget. Further, ideas of quality management that have gained great acceptance in industry seem relevant to the task of developing military systems, despite the obvious contextual differences, and the implementation of such ideas would require a complete understanding of the DoD acquisition process. In future work, we hope to articulate general principles and a philosophy of managing information and quality, drawn from broad experience in industry and government.

The panel will continue to gather information from all military services, conduct additional site visits, examine industrial practice, study federal agencies that engage in product development and testing, and explore international experience in military testing and acquisition. We will seek to learn about the degree of statistical training in the test agencies and the expertise that agency personnel can draw on for difficult statistical issues. From our colleagues in the defense and statistical communities and from readers of this report, we welcome suggestions, especially advice concerning additional sources of information that would be useful in advancing our work.


1 Introduction

STUDY CONTEXT

The Defense Acquisition Process

The DoD acquisition process comprises a series of steps that begin with a recognized need for a new capability, advance through rough conceptual planning to the development of system prototypes, and lead, ultimately, to a new system in full production that meets the stated need. This is a difficult process to manage well. Many of the technologies involved in a new system have never been used for that particular application (or may not even exist), the system can be extremely expensive, and the finished product can have significant implications for national security. Simply put, the stakes are high, and it is very difficult to prevent the occurrence of unanticipated problems.

Appendix A provides a description of the military acquisition process as currently executed for many military systems, although this process is changing rapidly. We summarize the process here as the context for the panel's general approach to its mission. We also define some terminology used in the remainder of the report.

The procurement of a major weapon system is divided into milestones. When a new capability is needed, the relevant service prepares a Mission Needs Statement that describes the threat this new capability is addressing. If it is determined that the new capability can be met only with new materiel, and a fairly specific approach is agreed upon, a new acquisition program is initiated. The most expensive systems are given the acquisition category (ACAT) I designation. After some additional review, the program passes milestone 0. Between milestone 0 and milestone I the program plans become more specific. At milestone I the budget process begins, and a program office and the first of (potentially) many program managers are assigned to the program. The program manager is given the job of ensuring that the program passes the next milestone in the process. In addition, between milestones 0 and I, various planning documents are prepared. In particular, the Operational Requirements Document details the link between the mission need and specific performance parameters, and the Test and Evaluation Master Plan provides the structure of the test and evaluation program, including schedule and resource implications. Other documents also provide program specifications; because these documents have different purposes and are produced by different groups, the specifications may not agree with those in the Operational Requirements Document. Furthermore, different documents prepared at different stages of the process may contain different program specifications.

Between milestones I and II the program undergoes further refinement, including some developmental test and evaluation. Developmental testing is carried out on prototypes by the developing agency to help in designing the system to technical specifications.1 In this demonstration and validation phase, the capabilities of the system become better understood: that is, it becomes clearer whether the specifications can be achieved. Milestone II is Development Approval, at which point it is decided whether the system is mature enough to enter low-rate initial production. This decision is based on an assessment of the system's affordability and the likelihood that the parameters specified in the Operational Requirements Document can be achieved. Furthermore, the resource requirements for operational testing are specified by the testing community of the relevant service and also by the Director of Operational Test and Evaluation.

The phase between milestones II and III is called engineering and manufacturing development. The objective is to develop a stable, producible, and cost-effective system design. Operational testing is a major activity that takes place at this time; some developmental testing continues, too. Operational testing is field testing carried out on production units under realistic operating conditions by typical users to verify that systems are operationally effective and suitable for their intended use and to provide essential information for assessment of acquisition risk. At milestone III, the decision is made whether to go into full production; this decision is based heavily on the results of the operational testing. Figure 1-1 is a diagram of the DoD acquisition process.

Operational Testing as Part of the Acquisition Process

In assessing how to make optimum use of best statistical practices in operational testing conducted as part of the acquisition process, it is sometimes necessary to consider various aspects of the larger acquisition process. For example, starting operational testing earlier in the acquisition process—an idea that has won support in the DoD community and among members of our panel—has implications for how statistical methods would be applied. Similarly, the operational test design or evaluation of system performance might conceivably make use either of optimal experimental design methods that depend on parameters that must be estimated or of statistical techniques that “borrow strength” from data earlier in the process. These approaches might use information from developmental testing, but concern about preserving the independence of operational and developmental testing could make such ideas controversial. Organizational constraints and competing incentives complicate the application of sequential testing methods, as well as some statistical ideas about how to allocate operational testing resources as a function of the total system budget. Furthermore, ideas of quality management that have gained great acceptance in industry seem relevant to the task of developing military systems, despite the obvious contextual differences, and the implementation of such ideas would require a complete understanding of the DoD acquisition process.

1 The terms "development test and evaluation" and "developmental testing" are used synonymously in this report. Similarly, the terms "operational test and evaluation" and "operational testing" are also used synonymously.


FIGURE 1-1 DoD Acquisition Process. SOURCE: Adapted from Lese (1992).

PANEL OBJECTIVES

The above context motivates the following blueprint for the panel's activities. The panel is working in both reactive and proactive modes. In the former mode, we are investigating current operational testing practice in DoD and examining how the use of statistical techniques can improve this practice. We will suggest improvements in four areas that overlap substantially: (1) assessment of reliability, availability, and maintainability; (2) use of modeling and simulation; (3) methods for software testing; and (4) use of experimental design techniques. We expect that suggested improvements in these areas can be implemented almost immediately because they will require no adjustment to the general acquisition process as currently structured. We also hope to develop a taxonomic structure for characterizing systems that require different types of operational test procedures.

In its more proactive mode, the panel anticipates taking a longer and more expanded view of how operational testing fits into the acquisition process. The prospectus for the panel's study anticipated the need for breadth in the scope of the study: “In addition to making recommendations on how to improve operational testing under current requirements and criteria, the panel would also take a longer term perspective and consider whether and to what extent technical, organizational, and legal requirements and criteria constrain optimal decision making.” Furthermore, the prospectus mentions a major point expressed in the workshop that gave rise to the panel study: “An idea that was expressed often [at the workshop] is to encourage moving away from the present advocacy environment surrounding quality assurance, in which one party or the other is characterized as being at fault. Several speakers urged moving toward a more neutral and cooperative environment in managing quality in weapon systems,” with an emphasis on achieving final quality rather than clearing interim hurdles at program milestones (see Rolph and Steffey, 1994, for a final report of the workshop). After its initial stage of activity, the panel is inclined to echo this sentiment.

In future work, we hope to articulate general principles and a philosophy of managing information and quality, drawn from broad experience in industry and government. From this perspective, we may suggest directions for change in the general acquisition process that would make operational testing more informative. The panel understands that the acquisition process as a whole satisfies many needs and goals and that numerous interdependencies have arisen in support of this process. Thus, changes to the acquisition process would have wide-ranging effects that would be difficult to foresee. The panel is also aware that it is not constituted in a way that would permit recommendations concerning a major restructuring of the acquisition process. However, certain changes in the process could expand the usefulness of operational testing, and we believe we have relevant experience in how information on product development should be managed and analyzed. The next section presents some preliminary thoughts on testing in product development and how statistics can be used to improve it.

STATISTICS AND INFORMATION MANAGEMENT IN DEFENSE TESTING

In a wide variety of applications, statistics has provided plans to meet sequences of information needs during product development cycles, methods and strategies for controlling and assuring the quality of products or systems, and designs of experimental studies to demonstrate performance outcomes. Thus, statistical science can make broad contributions in the development, testing, and evaluation of such complex entities as defense systems.

Operational Testing of Complex Systems

Overview

Modern methods of manufacturing and product development recognize that operational test and evaluation is a necessary part of placing any product or system into widespread public use. Evaluation of the performance of products and systems in operational use against prespecified performance standards, such as effectiveness, suitability, efficacy, or safety criteria, is usually the last experimental testing stage in an evolutionary sequence of product development. At least four aspects of operational testing contribute to its difficulty and complexity:

• The operational testing paradigm often does not lead to a pass/fail decision. Instead, testing can involve redesign, iteration on concepts, or changes in subcomponents. This aspect especially characterizes the operational testing of complex systems for which no competing capability exists. The statistical methodology appropriate for one-at-a-time pass/fail decisions is inappropriate for sequential problems; thus there is a need for more appropriate sequential methods that will increase the information derived from tests of this type.

• Operational testing involves realistic engagements in which circumstances can be controlled only in the broadest sense. Human intervention, training, and operator skill level often defy control, and can play as important a role in the performance outcome as the system hardware and software.

• Operational tests are often expensive. With increasingly constrained budgets, there is enormous pressure to limit the amount of operational testing solely because of cost considerations. Experiments with sparse data cannot produce information with the associated levels of statistical uncertainty and risk traditionally used to support decision making.
Such sources include training data on operators

A Continuum of Information Gathering

The development, testing, and evaluation of modern complex systems do not often fall into easily segmented phases that can be assessed separately. Therefore, prospective planning is frequently important in guiding the collection and use of information during the learning and confirmation phases of system development. The iterative nature of the testing process is especially important for new, one-of-a-kind, state-of-the-art technologies that are often key components of prospective systems, because the specific capabilities of the system cannot be determined completely in advance. Information from early stages of development can provide feedback for recalibrating operational criteria. Without such recalibration, operational testing standards may be set in an unrealistic or unachievable manner. Interestingly, an Office of the Secretary of Defense memorandum requires a link between the measures of effectiveness used in cost and operational effectiveness analysis and in operational testing (Yockey et al., 1992). However, notwithstanding this required linkage, DoD and Congress have placed constraints on the sharing of experimental data between developmental and operational testing.2 (These constraints were imposed to ensure objectivity—and the appearance of objectivity—in operational testing.)

Furthermore, it is important to collect in-use data (e.g., from training exercises or actual combat) on the effectiveness and suitability of a system after it has been deployed. Comparisons of operational test and in-use data can be very instructive; discrepancies might reveal flaws in the operational test design, execution, or analysis. Alternatively, differences may reflect deployment of the system in unanticipated operating environments or with significantly different functional requirements. Reconciliation of these two sources of data can yield valuable information about operational test procedures as well as the system that is now in the field.

Complex Testing Conditions

Complex systems have multiple measures of performance, can operate in many possible scenarios, and require a high degree of interaction between the system and the operator. These characteristics often complicate the application of classic experimental designs in operational testing. Also, uncontrolled and imperfectly controlled scenarios and conditions are part of the testing environment. In most cases, for example, operational testing involves human interaction as part of product or system usage, and the isolation and control of human factors usually cannot be achieved to the same extent as is the case, say, with certain environmental or physical factors.

2 Public Law 99-661 states, “In the case of a major defense acquisition program, no person employed by the contractor for the system being tested may be involved in the conduct of the operational test and evaluation required under subsection (a).” This has been interpreted to mean that the processing and evaluation of test data must be carried out so that there is no possibility or even appearance of involvement on the part of the system contractor in the operational test in any way other than as it would be involved with the system in combat.

Effects of Constrained Test Resources

Statistical designs typically control a limited number of factors and permit the drawing of conclusions with a degree of statistical confidence that a product or system meets predetermined standards. For test results to possess a certain degree of statistical confidence, these experimental study designs may require sample sizes that exceed the number of available products or systems manufactured, or are infeasible because of cost and budget constraints. The statistical problems faced in operational testing almost always derive from the need to make acquisition decisions with sparse amounts of data. These decisions involve a higher degree of uncertainty (and, therefore, risk) than is typically desirable, but there is no easy solution to this predicament. Furthermore, acquisition decisions must, by their nature, depend on a variety of subjective inputs in addition to operational test data.

Because of the significant costs of operational testing with enough samples under the wide range of plausible scenarios to provide the level of confidence desired by the public, government agencies, or Congress, it is especially important that the continuum of information derived from data collection, testing, and evaluation be used effectively. The complexities and costs associated with operational testing underscore the need to take full advantage of supplementary sources of information. Such sources include results from developmental testing, operational testing of similar systems or subsystems, and post-acquisition data from training exercises and actual combat. The effective use of supplemental data requires prospective planning during the learning and confirmation phases of system development.

Testing and Evaluation in Nonmilitary Applications

Constructive analogies to defense testing can be drawn from other application areas, such as the development, testing, and approval of pharmaceutical products and medical devices. Use of pilot studies is quite common in these applications. In manufacturing industries, the focus of quality improvement efforts has shifted upstream to the product design and development phases. Frequently heard expressions such as “quality by design” and “do it right the first time” express the new philosophy that quality should be built into the product at the design stage. Statistical methods such as experimental design and reliability engineering are now used up front to compare designs and vendors, as well as to optimize product and process designs. There is also less reliance on end-product testing and inspection to assure quality, especially in settings in which operational testing of the manufactured product cannot feasibly be carried out under all likely scenarios of use.

The panel has not concluded its examination of the parallels between the activities of product design, end-product testing, and information and quality management as practiced by DoD and by private industry or other federal agencies. However, there are certainly other paradigms that are regularly used and have real advantages as compared with the current DoD acquisition process, and thus deserve further examination for their relevance to DoD acquisition.

Conclusion

In this section we have introduced several of the broader principles we will consider in what we have referred to as our proactive mode. In our deliberations, we will apply these principles in attempting to formulate recommendations for improving the testing and evaluation of defense systems. Because some of these principles imply production processes organized very differently from those used today, we anticipate that some of our recommendations will be in the form of long-term goals rather than changes that can be implemented in the existing acquisition process.

THIS REPORT AND FUTURE WORK

The remainder of this report presents results of the panel's work to date in five areas being addressed by our working groups:

• Use of experimental design in operational testing (Chapter 2)
• Testing of software-intensive systems (Chapter 3)
• System reliability, availability, and maintainability (Chapter 4)
• Use of modeling and simulation in operational testing (Chapter 5)
• Efforts toward a taxonomic structure for DoD systems for operational test (Chapter 6)

In addition, five appendices are provided: Appendix A describes in detail the organizational structure of defense acquisition; Appendix B presents a short history of experimental design, with commentary for operational testing; Appendix C addresses the optimal selection of a small number of operational test environments; Appendix D lists the individuals consulted by the panel; and Appendix E provides charts showing the organization of test and evaluation within DoD overall and within the Army.

In view of the study's objectives, there are at least two distinct audiences for this report: the defense testing community and the statistical community. Therefore, we may sometimes present material that is obscure to one audience yet obvious to another. Appendix A and Appendix B are intended to provide relevant background information for the statistical and testing communities, respectively. Of the material in this report, Appendix C is written at the highest mathematical level. Despite the assumptions made about some of our readers, we hope this report will nevertheless advance the general goal of increasing the level of interaction among testers, other defense analysts, and statisticians.

As noted above, further work is required before the panel will be able to offer recommendations. Chapters 2 through 6, respectively, describe our planned future work in experimental design; software-intensive systems; reliability, availability, and maintainability; modeling and simulation; and development of a taxonomic structure. In addition, the panel expects to undertake several other tasks before issuing its final report.

First, we plan to perform comparisons between the current acquisition-operational testing structure in DoD and its counterparts in (1) other defense communities, such as those in Great Britain, Australia, Israel, France, Japan, and Russia; (2) nonmilitary federal agencies, such as the National Aeronautics and Space Administration and the Food and Drug Administration; and (3) private industry, such as the automobile, semiconductor, and telephone industries. These three areas are extremely broad, and we aim simply to understand their major components. While the current DoD acquisition world is relatively singular, we hope that by investigating how others have dealt with the difficult problem of developing unique, high-cost, technologically advanced products, we will discover interesting ideas that can be modified for application to the DoD context.

Also, for purposes of background, and to provide a context for readers new to this area, we will prepare a history of operational testing in DoD, focusing on the years 1970 to 1995. It will detail the births (and deaths) of agencies responsible for testing, both developmental and operational; the roles of various advisory and oversight groups; the interaction with Congress; and in general the place and necessity of operational testing in DoD acquisition.

In the area of organizational context, we will examine case studies of how systems have progressed from the Mission Needs Statement through final production or termination of the system to see how the acquisition system works and what opportunities exist for change, for example, to incorporate current state-of-the-art industrial practices. To this end, we will study the role of the program manager. We will also look at the current role in the acquisition process of various oversight and consulting groups, such as the Institute for Defense Analyses and the Office of the Director of Operational Test and Evaluation. Finally, we will address issues involving the extent to which additional statisticians, statistical training, or access to expert statisticians would improve operational testing.

Our efforts in this area will include studying recent General Accounting Office reports examining the manner in which systems have progressed through the stages of development (milestones). In addition, we will interview principals involved in the systems identified for case study and general experts in the acquisition process for their perspectives on how the process works and what changes might produce improvements. We will also seek some indication of the degree of statistical training of the various members of the operational testing community.

2 Use of Experimental Design in Operational Testing

The goal of an operational test is to measure the performance, under various conditions, of diverse aspects of a newly developed defense system to determine whether the system satisfies certain criteria or specifications identified in the Test and Evaluation Master Plan, the Cost and Operational Effectiveness Analysis, the Operational Requirements Document, and related documents. There are typically a large number of relevant conditions of interest for every system, including terrain, weather conditions, types of attack, and use of obscurants and other countermeasures or evasionary tactics. Some of these conditions can be completely controlled; some can be controlled at times, but are often constrained by factors such as budget and schedule; and some are not subject to control. DoD assigns weights to combinations of these conditions in the Operational Mode Summary and Mission Profiles, with greater weight given to those combinations perceived as more likely to occur (some combinations might even be impossible) and those of greater military importance.

As noted in Chapter 1, the typical sample size available for test is small, because the size, scope, and duration of operational testing are dominated by budgetary and scheduling considerations. Therefore, it is beneficial to design the test as efficiently as possible so it can produce useful information about the performance of the system—both in the test scenarios themselves and through some modeling—in untested combinations of the types of conditions cited above.

Experimental design provides the theory and methods to be applied in selecting the test design, which includes the choice of test scenarios and their time sequence; determination of the sample size; and the use of techniques such as randomization, controls, matching or blocking, and sequential testing. (For an historical discussion of these techniques, see Appendix B.) Other decisions—such as whether force-on-force testing is needed, whether the test is one-on-one or many-on-many or part of a still larger engagement, and whether to test at the system level or test individual components separately—are integral to the test design as well, and have important statistical consequences.

A special aspect of DoD operational testing is the expense of the systems and the individual test articles involved. Even modest gains in operational testing efficiency can result in enormous savings either by correctly identifying when there is a need for cancellation, passing, or redesign of a system, or by reducing the need for test articles that can cost millions of dollars each.
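As a minimal illustration of some of these techniques (not drawn from any DoD test plan), the following Python sketch builds a small blocked, randomized assignment of test scenarios; the factor names, levels, and trial counts are assumptions made purely for illustration.

import itertools
import random

random.seed(1)  # fixed seed so the illustrative design is reproducible

# Hypothetical factors, not taken from any Test and Evaluation Master Plan.
terrains = ["desert", "wooded"]            # blocking factor (e.g., test site terrain)
attack_types = ["frontal", "flanking"]     # controlled factor
visibility = ["day", "night"]              # controlled factor

# One replicate of every attack-by-visibility combination within each terrain block.
combinations = list(itertools.product(attack_types, visibility))

design = []
for block, terrain in enumerate(terrains, start=1):
    runs = combinations[:]
    random.shuffle(runs)                   # randomize run order within the block
    for order, (attack, vis) in enumerate(runs, start=1):
        design.append((block, terrain, order, attack, vis))

for row in design:
    print(row)

Randomizing the run order within each terrain block guards against confounding the factors of interest with trends such as crew learning or changing weather; an actual operational test would add replication, the scenario weights discussed above, and constraints that this sketch ignores.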

As in industrial applications of experimental design for product testing, various circumstances can conspire to cause some runs to abort or cause the operational test designer to choose factors that were not planned in advance, thus compromising features such as the orthogonality of a design, causing confounding, and reducing the efficiency of the test. General discussion of these issues may be found in Hahn (1984). The panel is interested in understanding these real-life complications in DoD operational testing. As an example, the expense involved prohibits the use of more than one battleship in operational testing of naval systems, so it is impossible to measure the effects of differences in crew training on ship-to-ship variation.

A further complication is that there are typically a large number of system requirements or specifications, referred to as measures of performance or effectiveness, for DoD systems. It is not uncommon to have as many as several dozen important specifications, and thus it is unclear which single specification should be used in optimizing the design. One could choose a “most important” specification, such as the hit rate, and select a design that would be efficient in measuring that output. However, the result might be inefficiency in measuring system reliability, since reliability measurement typically involves a broader distribution of system ages and a larger test sample size than a test of hit rate. Various statistical methods can be applied in an attempt to compromise across outputs. The panel does not address this problem in this interim report (though it is discussed briefly in Appendix B), but we intend to examine it for our final report.

This chapter first presents the progress made by the panel in understanding how experimental design is currently used in DoD operational testing. This is followed by discussion of some concerns about certain aspects of the use of experimental design that we have investigated, although only cursorily. We next describe a novel approach proposed in an operational test conducted by the Army Operational Test and Evaluation Command (OPTEC) for the Army Tactical Missile System/Brilliant Anti-Tank system (a missile system that targets moving vehicles, especially tanks), which involves the use of experimental design in designing a small number of field tests to be used in calibrating a simulation model for the system. The panel is interested in understanding this approach, which may become much more common in the future as a method for combining simulations and field tests. The final section of the chapter describes the future work we plan to undertake before issuing our final report. As background material for the discussion in this chapter, a short history of experimental design is provided in Appendix B.

Some elements of our discussion—for example, testing at the center versus the edge of the operating envelope—focus on the question of how to define the design space (and associated factor levels). Other elements—for example, possible uses of fractional factorial designs—are concerned with approaches for efficiently testing a well-defined design space. Such considerations are thus complementary and are among the diverse issues that must be addressed in designing a sound operational test that yields accurate and valuable information about system performance.

CASE STUDY #1: APACHE LONGBOW HELICOPTER

To become better acquainted with the environment, characteristics, and constraints of DoD operational testing, the panel began by investigating a single system and following it from the test design, through the testing process, and finally to evaluation of the test results. After considering the current systems under test by OPTEC, the panel decided to examine the operational testing of the Apache Longbow helicopter.1

1 The panel is extremely grateful for the cooperation of OPTEC in this activity, especially Henry Dubin, Harold Pasini, Carl Russell, and Michael Hall.

An initial meeting at OPTEC acquainted the panel with the overall test strategy: an operational test gunnery phase at China Lake, California, with eight gunnery events, followed by a force-on-force phase at Fort Hunter Liggett, composed of 30 trials (15 involving a previous version of the helicopter for purposes of control and 15 using the newer version). The gunnery phase was intended to measure the lethality of the Longbow, while the force-on-force test would measure survivability and suitability. There are 46 measures of performance associated with this system.

The panel focused its attention on the force-on-force trial, visiting the test site shortly before the operational test occurred to examine the test facility; understand the methods for registering the simulated effects of live fire; and gain a better understanding of constraints involving such factors as the facility, the soldiers, and the maintenance of the test units. Demonstrations provided at this time included the simulation of various enemy systems; scoring methods; and data collection, interpretation, and analysis that would be used to help understand why targets were missed during the test by studying whether there might have been some anomaly in the test conduct. The panel was also interested in assessment of reliability for the system.

The force-on-force operational test involved two major factors under the control of the experimenter: (1) mission type—movement to contact (deep), deliberate attack (deep), deliberate attack (close), or hasty attack (close); and (2) day or night. Either zero, two, or three replications were selected for each combination of mission type with day/night; zero replications were assigned to impossible combinations (for example, deep attacks are not carried out during daylight because of the formidable challenges to achieving significant penetration of enemy lines during periods of high visibility).

The panel continued its examination of the Apache Longbow's operational testing through a presentation by OPTEC staff outlining the evaluation of the test results after the testing had been completed. This presentation emphasized the collection, management, and analysis of data resulting from the operational test, including how these activities were related to various measures of performance, the treatment of outliers, and an outline of the decision process with respect to the final hurdle of milestone III. However, this briefing did not address some details in which we were interested concerning the summarization of information related to the 46 measures of performance and how the Defense Acquisition Board will use this information in deciding whether to proceed to full production for this system. Therefore, we intend to examine the evaluation process in more depth in completing our study of this operational test.

In examining the Apache Longbow operational test, the panel inspected various documents related to the test, including the Test and Evaluation Master Plan and the Operational Mode Summary and Mission Profiles. In addition, we examined some more general literature on the use of experimental design in operational testing, especially Fries (1994). Completion of this case study will only modestly acquaint the panel with DoD operational testing practice, since it will represent experience with only one system for one service. Before issuing our final report, we will determine the extent to which the Apache Longbow experience is typical of Army operational testing practice.
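The quasi-factorial allocation described above can be laid out explicitly. In the following Python sketch the cell-by-cell replication counts are hypothetical (the test documents fix the actual numbers, which are not reproduced here); the sketch is meant only to show how infeasible combinations receive zero replications and how the trials for one helicopter version are tallied.

# Hypothetical replication counts (0, 2, or 3 per cell) for one helicopter version;
# the same pattern is repeated with the baseline version used as a control.
replications = {
    ("movement to contact (deep)", "day"):   0,  # deep missions are not flown in daylight
    ("movement to contact (deep)", "night"): 3,
    ("deliberate attack (deep)", "day"):     0,  # deep missions are not flown in daylight
    ("deliberate attack (deep)", "night"):   3,
    ("deliberate attack (close)", "day"):    2,
    ("deliberate attack (close)", "night"):  3,
    ("hasty attack (close)", "day"):         2,
    ("hasty attack (close)", "night"):       2,
}

per_version = sum(replications.values())   # 15 trials with the new version
total_trials = 2 * per_version             # plus 15 control trials, 30 in all
print("trials per version:", per_version, "total force-on-force trials:", total_trials)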
Some Issues and Concerns

The panel is impressed to see that basic principles are often used in operational tests. There seems to be much to commend about current practice. We believe the use of a control in the Apache Longbow operational test, which we consider important, is typical of Army tests. The quasi-factorial design of the Apache Longbow test also appears to be typical and indicates that the Army understands the efficiency gained from simultaneous change of test factors. At the same time, our visit to Fort Hunter Liggett made it clear that routine application of standard experimental design practice will not always be feasible given the constraints of available troops, scheduling of tests, small sample sizes, and the weighting of test environments.

Choice of Test Scenarios

Various considerations combine in determining the test scenarios used in operational testing. Certainly the perceived likelihood or importance of an application of the new system should be considered when selecting scenarios for test. Also, if the system is to be used in a variety of situations, it is important to measure its performance in each situation relative to that of a control, subject, of course, to limitations of the available test resources. This was done in the Apache Longbow testing we observed. However, the panel did not see enough evidence that another important factor was considered in selecting the test scenarios: the a priori assessment of which scenarios would be discriminating in identifying situations where the new system might dominate the control, or vice versa. For instance, if 80 percent of the applications of a new system are likely to be at night, but the new system has no perceived advantage over the control at night, there is less reason to devote a large fraction of the test situations to night testing. Of course, the a priori wisdom may be wrong, but even if it is only approximately correct, substantial inefficiencies can result from allocating too much of the test to situations where both systems are likely to perform similarly.

Uncertain Scoring Rules, Measurement Inefficiencies

The panel has noted two concerns with respect to scoring and measurement. There is some evidence that the scoring rules for such factors as hits on a target by a missile are vaguely defined, specifically with respect to which events are considered unusable or precisely how a trial is defined. We present three examples of this. First, although reliability, availability, and maintainability failures are usually defined for a given system, there is often much disagreement within the scoring conference on whether a given system failure should be counted.2 The question of what constitutes an outlier in an operational test, while difficult to answer in advance, should be addressed as objectively as possible. Second, when a force-on-force trial is aborted while in progress for a reason such as accident, weather, or instrumentation failure, the issue arises of how the data generated before the failure should be used. Third, the problem persists of how to handle data contaminated by failure of instrumentation or test personnel.

It is also important to be more precise about the objective of each operational test. Sometimes what is desired is to understand the performance of the system in the most demanding of several environments, so the objective is to estimate a lower bound on system performance; at other times what is needed is a measurement of the average performance of the system across environments. Sometimes what is desired is to find system defects; at other times the objective is to characterize the operational profile. Ultimately, optimal test design—maximizing information for fixed test cost—depends on what the goal of testing is. For example, if one is testing to find faults, it is better to use experimental designs that “rig” the test in informative ways. As a further complication, operational tests can be used to calibrate and evaluate simulation models, and the design implications may be different for those objectives.

2 A scoring conference is a group of about six to eight people—including testers, evaluators, and representatives of the program manager—who review information about possible test failures and determine whether to count such events as failures or exclude them from the regular part of the test evaluation.

A related point is that averaging across different environments could mask important evidence of differentials in system effectiveness. For example, an average 60 percent hit rate could result either from a 60 percent hit rate in all environments or from a 100 percent hit rate in 60 percent of the environments and a 0 percent hit rate in 40 percent of the environments; clearly these are two very different types of performance. Thus, expressing results in terms of an average success rate is not wholly satisfactory.

A final point with respect to measurement is that much of the data that measure system effectiveness, especially with respect to hit rates, are zero-one data. It is well known that zero-one data are much less informative than data that indicate by how much a target was missed. This information can be used for better modeling the variability of a shot about its mean, which in turn can be used for better estimating the hit rate. The panel has been informed that this issue is being examined by testers at Fort Hunter Liggett.

Testing Only “Inside the Envelope”

Experimental design theory and practice suggest that in testing a system to determine its performance for many combinations of environmental factors, a substantial number of tests should be conducted for fairly extreme inputs, as well as those occurring more commonly. Modeling can then be used to estimate the performance for various intermediate environments. Operational testing, however, tends to focus on the environments most likely to occur in the field. While this approach has the advantage that one need not use a model to estimate the system performance for the more common environments, the disadvantage is that little is known about the performance when the system is strongly stressed. Certainly, this issue is explored in developmental testing. However, the failure modes and failure rates of operational testing tend to be different than those of developmental testing.3

3 One possible reason that operational testing usually does not involve more extreme scenarios is the risk that performance results in these scenarios will be misinterpreted as indicative of the system's typical performance, without proper consideration of the relative likelihood of the scenarios.
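To illustrate the earlier point that zero-one hit data are less informative than miss distances, the following simulation sketch compares the spread of hit-rate estimates obtained from binary outcomes with those obtained by first estimating shot dispersion from radial miss distances. The dispersion, target radius, and sample size are arbitrary assumed values, and the circular-normal error model is a simplification used only for illustration.

import math, random

random.seed(2)

SIGMA = 1.0      # assumed true dispersion of shots (illustrative value)
RADIUS = 1.0     # assumed target radius (illustrative value)
N_SHOTS = 10     # a small test, as in operational testing
N_SIM = 5000     # Monte Carlo replications of the whole test

true_hit = 1.0 - math.exp(-RADIUS**2 / (2 * SIGMA**2))

binary_estimates, distance_estimates = [], []
for _ in range(N_SIM):
    # radial miss distance of each shot (Rayleigh when x and y errors are normal)
    misses = [math.hypot(random.gauss(0, SIGMA), random.gauss(0, SIGMA))
              for _ in range(N_SHOTS)]
    # estimator 1: proportion of zero-one hits
    binary_estimates.append(sum(d <= RADIUS for d in misses) / N_SHOTS)
    # estimator 2: estimate the dispersion from miss distances, then convert to a hit rate
    sigma2_hat = sum(d * d for d in misses) / (2 * N_SHOTS)
    distance_estimates.append(1.0 - math.exp(-RADIUS**2 / (2 * sigma2_hat)))

def spread(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

print("true hit rate:", round(true_hit, 3))
print("spread of estimates from zero-one data: ", round(spread(binary_estimates), 3))
print("spread of estimates from miss distances:", round(spread(distance_estimates), 3))

Under these assumed values the estimator based on miss distances varies noticeably less from test to test than the simple proportion of hits, which is the sense in which the continuous measurements are more informative; the size of the advantage depends, of course, on the adequacy of the assumed error model.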

CASE STUDY #2: THE ATACMS/BAT SYSTEM

In the Army Tactical Missile System/Brilliant Anti-Tank (ATACMS/BAT) system, OPTEC proposes to use a relatively novel test design in which a simulation model, when calibrated by a small number of operational field tests, will provide an overall assessment of the effectiveness of the system under test. BAT submunitions use acoustic sensors to guide themselves toward moving vehicles and an infrared seeker to home terminally on targets. The submunitions are delivered to the approximate target area by the ATACMS, which releases the submunitions from its main missile. The ATACMS/BAT system is very expensive, costing several million dollars per missile. The experimental design issue is how to choose the small number of operational field tests such that the calibrated simulation will be as informative as possible. The panel is examining this problem. Here we provide a description of the ATACMS/BAT operational testing program to illustrate one context in which to think about alternative approaches to operational test design.

Plans for Operational Testing

The ATACMS/BAT operational test will not be a traditional Army operational test involving force-on-force trials, but rather will be similar to a demonstration test. The Army would use a model to simulate what might happen in a real engagement. To calibrate the simulation, various kinds of data will be collected, for example, from individual submunition flights and other types of trials, including the operational test trials. A relatively novel aspect of this operational test is the use of a limited number of operational test events along with a simulation to evaluate a new weapons system. This approach has been taken because budgetary limitations on the sample size and the limited availability of equipment such as radio-controlled tanks for testing make it infeasible to develop a program of field testing that could answer the key questions about the performance of this system in a real operational testing environment. Operational testing of the ATACMS/BAT system is scheduled to take place in 1998. According to the Test and Evaluation Master Plan:

This portion of the system evaluation includes the Army ATACMS/BAT launch, missile flight, dispense of the BAT submunitions, the transition to independent flight, acoustic and infrared homing, and final impact on targets. Evaluation of this discrete event also includes assessment of support system/subsystem RAM [reliability, availability, and maintainability] requirements, software, terminal accuracy, dispense effectiveness, kills per launcher load, and BAT effectiveness in the presence of countermeasures.

Initial operational test and evaluation is the primary source of data for assessing these system capabilities. There is no baseline system for comparison. The number of armored vehicle kills (against a battalion of tanks) is the bottom-line measure of the system's success. Tank battalions vary in size, but typically involve about 150 vehicles moving in formation. (Unfortunately, every country moves its tanks somewhat differently.) Under the test scoring rules, no credit is given if the submunition hits the tank treads or a truck or if two submunitions hit the same tank. There is one operational test site, and the Army has spent several million dollars developing it.

There will be materiel constraints on the operational test. Only eight missiles, each of which has a full set of 13 BAT submunitions, are available for testing. Also, the test battalion will involve only 21 remotely controlled vehicles. Thus, the Army plans to use simulation as an extrapolation device, particularly in generalizing from 21 tanks to a full-size battalion (approximately 150 tanks).

Important Test Factors and Conditions

As stated above, all stages of missile operation must be considered in the operational test, particularly acoustic detection, infrared detection, and target impact. Factors that may affect acoustic detection of vehicles include distance from target (location, delivery error), weather (wind, air density, rain), vehicle signature (type, speed, formation), and terrain. For example, the submunitions are not independently targeted; they are programmed with logic to go to different targets. Their success at picking different targets can be affected by such factors as wind, rain, temperature, and cloud layers. Obviously, one cannot learn about system performance during bad weather if testing is conducted only on dry days. However, it is difficult to conduct operational tests in rain because the test instrumentation does not function well, and much data can be lost. Such factors as weather (rain, snow) and environment (dust, smoke) can also affect infrared detection of vehicles. Factors affecting the conditional probability of a vehicle kill given a hit include the hardness of the vehicle and the location of the hit. Possible countermeasures must also be considered. For example, the tanks may disperse at some point, instead of advancing in a straight-line formation, or may try to employ decoys or smoke obscuration.
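The sketch below is not the Army's simulation model; it is a toy Monte Carlo built on the plainly simplified assumption that each functioning submunition selects a tank uniformly at random, with the functioning probability and the number of submunitions fired chosen arbitrarily. Its only purpose is to show why unique tank kills do not scale proportionally from a 21-vehicle test formation to a battalion of roughly 150 vehicles: duplicate targeting, which earns no credit under the scoring rules, is far more common when targets are few.

import random

random.seed(3)

def expected_unique_kills(n_tanks, n_submunitions, p_work=0.8, n_sim=20000):
    """Toy model: each working submunition picks a tank uniformly at random;
    a tank counts at most once (no credit for two submunitions on the same tank)."""
    total = 0
    for _ in range(n_sim):
        hit = set()
        for _ in range(n_submunitions):
            if random.random() < p_work:      # submunition functions and reaches some tank
                hit.add(random.randrange(n_tanks))
        total += len(hit)
    return total / n_sim

# Two missiles' worth of submunitions (2 x 13 = 26) against each formation size.
for n_tanks in (21, 150):
    kills = expected_unique_kills(n_tanks, 26)
    print(n_tanks, "tanks:", round(kills, 1), "expected unique kills")

Because the per-tank duplication rate differs so much between the two formation sizes, extrapolating from the 21-vehicle test to a full battalion calls for a model of how submunitions select targets rather than a simple rescaling of field-test results, which is the role the calibrated simulation is intended to play.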

The operational test design, or shot matrix, in the Test and Evaluation Master Plan (see Table 2-1) lists eight test events that vary according to such factors as range of engagement, target location error, logic of targeting software, type of tank formation, aimpoint, time of day, tank speed and spacing, and threat environment. Three levels are specified for the range of engagement: near minimum, near maximum, and a medium range specified as either “2/3s” or “ROC,” the required operational capability. Target location error has two levels: median and one standard deviation (“1 sigma”) above the median level. (The distinction between centralized and decentralized is unimportant in this context.) The logic of targeting software (primary vs. alternate) and type of tank formation (linear vs. dispersed) are combined into a single, two-level factor. Aimpoint distance is either short or long, and aimpoint direction is either left or right. The aimpoint factors are not expected to be important in affecting system performance. (The payload inventory is also unimportant in this context.) Tanks are expected to travel at lower speeds and in denser formations during night operations; therefore, tank speed and spacing are combined with time of day into a single two-level factor (day vs. night). Three different threat environments are possible: benign, Level 1, and Level 2 (most severe). Clearly, in view of the limited sample size, many potentially influential factors are not represented in the shot matrix.

Possible Statistical Methods and Aspects for Further Consideration

One approach to operational testing for the ATACMS/BAT system would be to design a large fractional factorial experiment for those factors thought to have the greatest influence on the system performance. The number of effective replications can be increased if the assumption that all of the included design factors are influential turns out to be incorrect. Assuming that the aimpoint factors are inactive, a complete factorial experiment for the ATACMS/BAT system would require 2³ × 3² = 72 design points. However, fractional factorial designs with two- and three-level factors could provide much information while using substantially fewer replications than a complete factorial design. Of course, these designs are less useful when higher-order interactions among factors are significant. (For a further discussion of factorial designs, see Appendix B, as well as Box and Hunter, 1961.)

Another complication is that environment (or scenario) is a factor with more than two settings (levels). In the extreme, the ATACMS/BAT operational test results might be regarded as samples from several different populations representing test results from each environment. Since it will not be possible to evaluate the test in several unrelated settings, some consolidation of scenarios is needed. It is necessary to understand how to consolidate scenarios by identifying the underlying physical characteristics that have an impact on the performance measures, and to relate the performance of the system, possibly through use of a parametric model, to the underlying characteristics of those environments. This is essentially the issue discussed in Appendix C.

While the above fractional factorial approach has advantages with respect to understanding system performance equally in each scenario, we can see some benefits of the current OPTEC approach if we assume that the majority of interest is focused on the “central” scenario, or the scenario of most interest. In the current OPTEC approach, the largest number of test units are allocated to this scenario, while the others are used to study one-factor-at-a-time perturbations around this scenario, such as going from day to night or from linear to dispersed formation. This approach could be well suited to gathering information on such issues while not losing too much efficiency at the scenario of most interest. And if it turns out that changing one or more factors has no effect, the information from these settings can be pooled to gain further efficiency at the scenario of most interest.
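To make the design-point arithmetic above concrete, the Python sketch below enumerates the 2³ × 3² = 72 combinations implied by three two-level factors and two three-level factors and then draws a small subset of eight runs, one per available missile. The factor labels echo the shot-matrix discussion, but the subset shown is purely illustrative and is not a recommended fraction.

import itertools
import random

random.seed(4)

factors = {
    "TLE": ["median", "1 sigma"],                                      # two levels
    "targeting/formation": ["primary/linear", "alternate/dispersed"],  # two levels
    "day/night": ["day", "night"],                                     # two levels
    "range": ["near min", "2/3s or ROC", "near max"],                  # three levels
    "threat": ["benign", "Level 1", "Level 2"],                        # three levels
}

names = list(factors)
full_factorial = list(itertools.product(*factors.values()))
print("full factorial size:", len(full_factorial))                     # 2**3 * 3**2 = 72

# A purely random subset of eight runs is shown only to fix ideas; a designed
# fraction (for example, an orthogonal array) would instead be constructed to keep
# factor levels balanced so that main effects remain estimable.
for run in random.sample(full_factorial, 8):
    print(dict(zip(names, run)))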

TABLE 2-1 Army Tactical Missile System Block II/Brilliant Anti-Tank Operational Test Shot Matrix

[The shot matrix lists eight test events, DT/OT 1, DT/OT 2, OT 1, OT 2, OT 3,c OT 4,c OT 5,d and OT 6,d with columns for Test,a Range, TLE (Method of Control), Target/Logic, Aimpoint, Payload, and Environment.b]

NOTE: AMC, Army Materiel Command; C3, Command, Control and Communications; DT, Developmental Testing; OT, Operational Testing; ROC, Required Operational Capability.

a This shot matrix reflects a two-flight developmental testing/operational testing program of engineering and manufacturing development assets conducted by the Test and Evaluation Command, and a six-flight operational testing program of Low-Rate Initial Production brilliant anti-tank assets conducted by the Test and Experimentation Command. This shot matrix is subject to threat accreditation.
b Flights will be conducted during daylight hours for safety and data collection. Night/day environments pertain to vehicle speed and spacing.
c OT 3 and OT 4 are ripple firings.
d Flight objectives may be revised based on unforeseen data shortfalls.
FUTURE WORK

The panel does not yet understand the extent to which experimental design techniques, both routine and more sophisticated, have become a part of operational test design. Therefore, this will be a major emphasis of our further work. Relatively sophisticated techniques might be needed to overcome complications posed by various constraints discussed above. Bayesian experimental design might provide methods for incorporating information from developmental testing in the designs of operational tests. The notion of pilot testing has been discussed and will be further examined.

We will examine the use of experimental design in current operational test practice in the Air Force and the Navy. This effort will include studying relevant literature that provides examples of current practice and the directives presenting test policy in these services. It will include as well investigating the potential opportunity for and benefit from the use of more sophisticated experimental design techniques. Also, the panel will continue to follow progress in the design of the operational test for the ATACMS/BAT system. We will also devote some additional effort to understanding the use of experimental design in operational testing in industry and in nonmilitary federal agencies.

The panel is also interested in a problem suggested by Henry Dubin, technical director of OPTEC: how to allocate a small sample of test objects optimally to several environments so that the test sample is maximally informative about the overall performance of the system. Appendix C presents the panel's thoughts on this topic to date, but we anticipate revisiting the problem and refining our thinking.

The panel understands that both the degree of training received prior to an operational test and the learning curves of soldiers during the test are important confounding factors. While developmental testing generally makes use of subjects relatively well trained in the system under test, operational testing makes use of subjects whose training more closely approximates the training a user will receive in a real application. It is important to ensure that comparisons between a control and a new system are not confounded by, say, users' being more familiar with the control than the new system, or vice versa. The panel is examining the possibility of using trend-free and other experimental designs that address the possible confounding of learning effects (Daniel, 1976; Hill, 1960). For example, binary data on system kills are typically not binomial; instead they are dependent because of the learning effects during trials of operational tests. Player learning is generally not accounted for in current test practice. At best, there is side-by-side shooting in which, perhaps, learning occurs at the same rate during comparative testing of the baseline and prospective systems.

With respect to evaluation, a real problem is how to decide what sample size is adequate for making decisions with a stated confidence. The panel will examine this question—sometimes referred to as “How much testing is enough?”—for discussion in our final report. This effort may involve notions of Bayesian inference, decision theory, and graphical presentation of uncertainty. It is important to note in this context that experimental design principles can help make effective use of available test resources, but no design can provide definitive conclusions when insufficient data have been collected.

The panel is interested in examining explicitly the tradeoff between cost and benefit of testing in our final report. Furthermore, the panel believes that the hypothesis-testing framework of operational testing is not sensible. The object of operational testing should be to provide to the decision maker the data most valuable for deciding on the next course of action. The next course of action belongs in a continuum ranging from complete acceptance to complete rejection. Therefore, in operational testing one should concentrate on estimation procedures with statements of attendant risks. We also plan to explore the utility of other methods for combining information for purposes of evaluation, including hierarchical modeling.
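As one small illustration of the “How much testing is enough?” question, the classical zero-failure binomial calculation below gives the number of independent trials needed so that, if no failures are observed, a stated level of performance is demonstrated at a stated confidence. The reliability and confidence values plugged in are arbitrary examples, and the calculation deliberately ignores the complications noted above, such as dependence induced by learning effects, multiple measures of performance, and prior information.

import math

def zero_failure_sample_size(reliability, confidence):
    """Smallest n such that n successes in n independent trials are inconsistent,
    at the given confidence, with a true success probability below `reliability`.
    Derived from reliability**n <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

# Arbitrary example values, not drawn from any requirements document.
for reliability in (0.80, 0.90, 0.95):
    for confidence in (0.80, 0.90):
        n = zero_failure_sample_size(reliability, confidence)
        print(f"demonstrate {reliability:.2f} at {confidence:.0%} confidence: "
              f"{n} failure-free trials")

Even these modest requirements imply dozens of failure-free trials once the target reliability approaches 0.95, which is one quantitative expression of the tension between desired confidence and the small sample sizes available in operational testing.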

3 Testing of Software-Intensive Systems

Early in the panel's work, it became clear that software is a critical path through which systems achieve their performance objectives. We therefore recognized the need for special attention to software-intensive systems and better understanding of how operational testing is conducted on software-intensive systems across the military services. It has been reported that in the last three years, over 90 percent of Army initial operational test and evaluation slipped because software was not ready (U.S. Army Materiel Systems Analysis Activity, 1995). Since the 1970s, software problems discovered during operational testing have adversely affected the cost, schedule, and performance of major defense acquisition systems. In some cases, significant performance shortfalls have been identified after systems have been produced and put into operational use. Findings show that software-intensive systems generally do not meet user requirements because the systems are certified as ready for operational testing before their software is fully mature.

Several barriers have been identified that limit effective software test and evaluation. One such barrier is that DoD has not acknowledged or addressed the criticality of software to systems' operational requirements early enough in the acquisition process. There is a perception that software is secondary to hardware and can be fixed later. Other barriers to effective test and evaluation of software include the following: (1) DoD has not developed, implemented, or standardized decision-making tools and processes for measuring or projecting weapon system cost, schedule, and performance risks; (2) DoD has not developed testing and evaluation policy that provides consistent guidance regarding software maturity; and (3) DoD has not adequately defined and managed software requirements. Although DoD has carefully studied what needs to be done to develop and test quality software and to field software-intensive systems, it has not effectively implemented long-standing recommendations. On the other hand, despite the lack of a DoD-wide coordinated strategy, the individual military services have made attempts to improve their software development processes (U.S. General Accounting Office, 1993).

Given the above concerns, the panel formed a working group to focus on defense systems that are either software products or systems with significant software content. The group's goal is to prescribe statistical methods that will support decisions on the operational effectiveness and suitability of software-intensive defense systems. The notion is that these methods will identify unfit systems relatively quickly and inexpensively through iterative use of techniques that promote progress toward passing operational test. A checklist associated with the methods would also be helpful to developers in knowing when their systems are ready for operational testing.

The remainder of this chapter addresses the potential role for statistical methods in operational testing of software-intensive systems and describes the panel's activities to date and planned future work in this area.

The notion is that these methods will identify unfit systems relatively quickly and inexpensively through iterative use of techniques that promote progress toward passing operational test. A checklist associated with the methods would also be helpful to developers in knowing when their systems are ready for operational testing. The remainder of this chapter addresses the potential role for statistical methods in operational testing of software-intensive systems and describes the panel's activities to date and planned future work in this area.

ROLE FOR STATISTICAL METHODS

The panel sees a strong role for the use of statistical methods in the test and evaluation of software-intensive systems. Recognizing that not every scenario can be tested, we have formulated the following set of questions in order to understand current practices for operational testing of software-intensive systems and areas where statistical methods might be applied:

• How does one characterize the population of scenarios to test and the environments of use?
• How does one select scenarios to test from the population?
• How does one know when to stop testing? What are the stopping criteria?
• How does one generalize from the information gained during testing to the population of scenarios not tested?
• How does one plan for optimal use of test resources and adjust the test plan as the testing unfolds?

These questions are quite similar to those asked in addressing experimental design issues. However, sample sizes are typically much larger for software systems than for hardware systems, and therefore the answers to these questions will likely lead to different procedures. The panel believes that its greatest potential for significant contribution in this area will be achieved by concentrating on complex future systems, since there is potential for greater impact in targeting systems that have not yet passed through various developmental or operational test phases.
The current paradigm appears to be bottom-up, with software emerging from a bewildering variety of methods, tools, and cultures, and each operational testing project having to struggle to find the money, time, and methods to test and evaluate the software to the extent necessary to put it into field use. Currently, operational testing of software-intensive systems is compromised because its methods are allowed to be driven by software development practices. The record of the software development community does not warrant adoption of its methods for operational testing. It is in the nature of software that it can be made needlessly complex, beyond any threshold of evaluation and testability.
However, the panel sees a shift taking place to a new, top-down paradigm driven by the concept of “intended use” as articulated in the definition of operational testing. To implement this idea, it would be necessary to prescribe certain criteria that, if met, would support a decision that the software (or the system containing the software) is fit for field use. These criteria might include experiments, tests, and other means of evaluation. The criteria, including costs, schedules, and methods, would have to be prescribed in technical detail, in turn becoming requirements and constraints on the design and development process. While such constraints will not limit the potential of software, they may induce change in development methods, schedules, and budgets, making them more effective and realistic for developing systems that will pass operational testing. Software can be designed so that it will satisfy well-defined criteria for evaluation and testing.
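
Returning to the stopping-criteria question raised above: as a purely illustrative sketch, and not a procedure the panel has evaluated or endorsed, the Python fragment below works through a standard reliability-demonstration calculation. If test scenarios are drawn at random from a defined operational profile and a run of consecutive scenarios completes without an operational mission failure, the per-scenario failure probability can be bounded at a stated confidence level; the target failure probability and confidence level used here are invented for the example.

    import math

    def scenarios_needed(p_max, confidence):
        """Number of independent, randomly selected scenarios that must run
        failure-free to conclude, at the given confidence level, that the
        per-scenario failure probability is at most p_max.
        Solves (1 - p_max)**n <= 1 - confidence for n."""
        return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_max))

    def demonstrated_bound(n_failure_free, confidence):
        """Upper confidence bound on the per-scenario failure probability
        after observing n_failure_free scenarios with no failures."""
        return 1.0 - (1.0 - confidence) ** (1.0 / n_failure_free)

    # Hypothetical illustration: demonstrating a 1-in-200 failure rate at 80% confidence.
    print(scenarios_needed(0.005, 0.80))     # about 322 scenarios
    print(demonstrated_bound(322, 0.80))     # roughly 0.005

Analogous calculations underlie many stopping rules in the software reliability literature; the point is only that the questions listed above admit quantitative answers once the scenario population and the sampling scheme have been defined.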

In an effort to gain a full understanding of test and evaluation of software-intensive systems, the panel sought analogies between the methods and problems encountered in DoD testing of such systems and those encountered in testing by other federal agencies and industry. Process was an apparent common theme among all the analogies examined, which included the Food and Drug Administration; the nuclear industry; commercial aviation; and such for-profit industries as telecommunications, banking, and highly automated manufacturing industries that have serious dependencies on software (RTCA, Inc., 1992; Food and Drug Administration, 1987; U.S. Nuclear Regulatory Commission, 1993; Scott and Lawrence, 1994).
It is important to recognize that most software errors stem from design flaws that can ultimately be traced to poor requirements. Correcting design flaws is especially difficult because one must ensure that the changes do not have unintended side effects that create problems in other segments of the code.
The panel recognizes that there are several important software engineering issues involved in the defense system life cycle that are not in our purview. These include configuration control during development and after deployment, so that every change made to a program is controlled and recorded to ensure the availability of the correct version of the program, as well as such issues as software reliability and upgrades. It is beyond the panel's charge to address directly the fundamentals of software engineering and current best practice for creating and maintaining software across the full system development life cycle. We view operational testing as a special moment or “snapshot” in the total life cycle of a software system. On the other hand, we recognize that software engineering is critical to successful software development and revision. If the software engineering process is flawed, then the statistical measurements and analysis used in operational testing will be out of context.

ACTIVITIES TO DATE

A primary goal of the panel has been to identify and develop working relationships with representatives from the different services who have primary operational test and evaluation responsibility for designing and performing experiments for software-intensive systems so that we can understand operational testing from their perspective, the difficulties they experience, and the areas where they might seek panel assistance. We have also sought to establish contacts with those who in effect work with the results of operational testing and have responsibility for reporting those results to Congress. Through our activities to date, we have identified and established contact with several key players who are involved in the operational testing of software-intensive systems within both the individual services and DoD.
The panel has been engaged in information gathering through meetings, focused group discussions, conference calls, and a few site visits. In late April 1995, we made a one-day visit to Navy Operational Test and Evaluation Force headquarters to learn about the Navy's approaches to testing software-intensive systems. In conjunction with that visit, we also visited the Dam Neck Naval Surface Warfare Center to get a first-hand look at some of the Navy's software-intensive systems. In addition, we held an interservice session in Washington, D.C., with representatives from the services and DoD.
This session allowed us to learn more about the services' approaches to testing software-intensive systems, while providing an opportunity for the service representatives in attendance to exchange views and share their experiences. The panel has learned about several ongoing efforts to resolve long-standing software testing problems. The Army's Software Test and Evaluation Panel is seeking to streamline and unify the software test and evaluation process through the use of consistent terms, the identification of key players in the software test and evaluation community, and the definition of specific software test and evaluation procedures (U.S. General Accounting Office, 1993; Paul, 1993, 1995). Additional objectives of the
Software Test and Evaluation Panel are to improve management visibility and control by quantifying and monitoring software technical parameters using software metrics, and to promote earlier user involvement.
Also with regard to the Army, we were informed of a fairly new operational testing strategy for expediting the fielding of software-intensive systems. This new strategy allows partial fielding of software-intensive systems once successful operational testing of a representative sample has been accomplished. Traditional operational testing of weapon systems requires that the entire system successfully complete operational testing of production-representative items before fielding (Myers, 1993).
In late 1991, the Air Force developed a process improvement program that was used in its software development activities. The Air Force has also been developing standardized procedures for software test and evaluation for use in both developmental and operational tests. The Navy has been engaged in an effort to improve test and evaluation of software, and has taken actions to improve its software development and testing processes.
Through our meetings and conversations with DoD personnel, the panel has become aware of a category of acquisition known as evolutionary acquisition. With evolutionary acquisition systems, such as the Naval Tactical Command System-Afloat,1 the software code that is evaluated in operational testing is not necessarily what is deployed on the ship. Although several versions of the software code may be tested, it is possible that none of those versions is representative of the software to be used in the field. We are concerned that evolutionary acquisition compromises the utility of operational testing. We plan to pursue this issue further and attempt to develop a better understanding of the concept of evolutionary acquisition.

FUTURE WORK

A major goal will be to obtain more information about the development of software-intensive systems for the four services. We will also be developing a recommended protocol for software development and software testing, drawing from state-of-the-art industrial practice. The panel is planning more interaction with Air Force Operational Test and Evaluation Center software experts and continued interaction with software contacts in the other services. In addition, we will further examine for a potential case study the Naval Tactical Command System-Afloat system.

1The Naval Tactical Command System-Afloat is the Navy's premier command and control system and is described as an all-source, all-knowing system that is installed in most ships and many more shore sites. It provides timely, accurate, and complete all-source information management, display, and dissemination activities—including distribution of surveillance and intelligence data and imagery to support warfare mission assessment, planning, and execution—and is a current segment of a large strategy system known as the Joint Maritime Command Information System.

4 System Reliability, Availability, and Maintainability

The panel has undertaken an inquiry into current policies and statistical practices in the area of system reliability, availability, and maintainability as related to operational testing in the DoD acquisition process. As noted earlier, operational testing is intended to assess the effectiveness and suitability of defense systems under consideration for procurement. For this purpose, operational suitability is defined in DoD Instruction 5000.2 (U.S. Department of Defense, 1991) as follows:

    The degree to which a system can be placed satisfactorily in field use with consideration given to availability, compatibility, transportability, interoperability, wartime usage rates, maintainability, safety, human factors, manpower supportability, logistics supportability, natural environmental effects and impacts, documentation, and training requirements.1

Considerations of suitability, including reliability, availability, and maintainability, are likely to have different implications for the design and analysis of operational tests than considerations of effectiveness, and consequently merit distinct attention by the panel in its work. The panel's inquiry in this area has a threefold purpose. First, we will characterize the range of statistically based reliability, availability, and maintainability activities, resources, and personnel in the various branches of DoD in order to gauge the current breadth, depth, and variability of DoD reliability, availability, and maintainability applications. Second, we will examine several systems and activities in detail, with a view toward assessing the scope of reliability, availability, and maintainability practices in particular applications. Third, we will study the technological level of current reliability, availability, and maintainability practices as a foundation for recommendations about the potential applicability of recently developed reliability, availability, and maintainability methodology, or the need for new statistical developments.
The next section reviews reliability, availability, and maintainability testing and evaluation in the military services. This is followed by an examination of variability in reliability, availability, and maintainability policy and practice. Next is a brief look at industrial (nonmilitary) reliability, availability, and maintainability standards. The chapter ends with a summary and a review of the panel's planned future work in this area.

1Curiously, this definition does not explicitly include reliability as a consideration.

RELIABILITY, AVAILABILITY, AND MAINTAINABILITY TESTING AND EVALUATION IN THE MILITARY SERVICES

The panel has examined a substantial collection of government documents that touch on some aspect of reliability, availability, and maintainability testing and evaluation. We have carefully reviewed several of these sources that have been identified as widely used or cited, including DoD's “RAM Primer” (U.S. Department of Defense, 1982) and the Air Force Operational Test and Evaluation Center's Introduction to JRMET and TDSB: A Handbook for the Logistics Analyst (1995). Other documents have been scanned, including Sampling Procedures and Tables for Life and Reliability Testing (U.S. Department of Defense, 1960) and a substantial number of relevant military standards.
To gain a better understanding of how reliability, availability, and maintainability testing and evaluation is conducted in the military services, the panel held telephone conferences with Army and Navy operational test and evaluation personnel and visited the Air Force Operational Test and Evaluation Center, reviewing a variety of reliability, availability, and maintainability organizational procedures and technical practices. We received briefings on the recent operational tests of the B-1B bomber and the C-130H cargo transport, as well as demonstrations of major software packages in use for test design and analysis. In addition to these activities, we addressed reliability, availability, and maintainability topics as part of our site visit to the Army Test and Experimentation Command Center at Fort Hunter Liggett, during which we observed preparations for operational testing of the Apache Longbow helicopter.
Evaluation of reliability, availability, and maintainability in Navy operational testing occurs within the four major divisions of the Navy Operational Test and Evaluation Force: air warfare, undersea warfare, surface warfare, and command and control systems. Analysts work as part of operational test teams that are typically directed by military personnel with significant operational experience. Many analysts have received graduate training in operations research and statistics at the Naval Postgraduate School.
The Army appears to have achieved the greatest degree of integration between developmental and operational testing in evaluating reliability, availability, and maintainability. The reliability, availability, and maintainability division at the Army Materiel Systems Analysis Activity is the organization that concentrates most on reliability, availability, and maintainability issues, but other units are also involved, including the Test and Evaluation Command, Operational Evaluation Command, Army Materiel Command, Program Evaluation Office, and Training and Doctrine Command. In the Army, reliability, availability, and maintainability data for a system are scored by a joint committee involving personnel from the Operational Evaluation Command, the Army Materiel Systems Analysis Activity, the Training and Doctrine Command system manager, and the program manager.
In the Air Force, reliability, availability, and maintainability evaluation is part of the mission of the Logistics Studies and Analysis Team within the Air Force Operational Test and Evaluation Center's Directorate of Systems Analysis. Each of approximately ten analysts works concurrently on 10 to 12 different systems. Most Air Force analysts have a background in engineering or operations research, and may receive further training in statistics from courses offered by the Air Force Institute of Technology.

VARIABILITY IN RELIABILITY, AVAILABILITY, AND MAINTAINABILITY POLICY AND PRACTICE

The panel has developed a reasonably comprehensive understanding of the range of documents that guide the majority of DoD reliability, availability, and maintainability applications and of the professional training of technical personnel engaged in reliability, availability, and maintainability applications. A distinct impression that has emerged is that there is a high degree of variability in reliability, availability, and maintainability policy and practice among the services, as well as within the testing community in each service branch. For example, it is only recently that some agreements have been forged regarding a common vocabulary in this area. There are certain units in the individual services in which reliability, availability, and maintainability practices are modern and rigorous; some of these have members with advanced training (i.e., M.S. or Ph.D.) in statistics or a related field. On the other hand, the opportunities for advanced reliability, availability, and maintainability-related coursework within the services (whether delivered in house or through special contracts) appear to be quite uneven.
The Navy has unique access to an excellent technical resource—the Naval Postgraduate School in Monterey—and often refers its more complex reliability, availability, and maintainability problems to the school's faculty directly or engages graduates of the school in addressing them. Ironically, and possibly as a consequence of this mode of operation, there seems to be less reliability, availability, and maintainability-related technical expertise in residence at Navy operational test and evaluation installations than is found among the other services.
Although the Army appears to have a larger corps of reliability, availability, and maintainability professionals at work, the distribution of this specialized workforce among various developmental and operational testing installations appears to be quite uneven. The reliability group (Army Materiel Systems Analysis Activity) at Aberdeen Proving Ground, for example, has a strong educational profile. The group's use of modern tools of reliability analysis, including careful attention to experimental design, model robustness questions, and the integration of simulated and experimental inputs, is impressive and commendable. At Fort Hunter Liggett, the profile of the statistical staff is quite different, in terms of both size and years of advanced statistical training. The sample work products the panel has seen from these two installations are noticeably different, the former being more technical and analytical and the latter more descriptive. (The Training and Doctrine Analysis Command is another example of a group for whom an upgrading of capabilities in statistical modeling and analysis could pay some big dividends.)
In its review of Air Force Operational Test and Evaluation Center procedures and practices, the panel was impressed by the care with which materials for training suitability analysts were assembled, and with the coordinated way in which reliability, availability, and maintainability procedures were carried out on specific testing projects. The level of energy, dedication to high standards, and careful execution of statistical procedures were commendable. Certain areas of potential improvement were also noted. Among these, the need for more personnel with advanced degrees in the field of statistics seemed most pressing.
It was clear that certain statistical methods in use could be improved through the recognition of failure models other than those resulting in an exponential distribution of time between failures. Some groups with whom the panel conferred described the RAM Primer as their main reference, while others indicated they view that document as more of an introduction to these topics, providing a management perspective. This again underscores the extent of variability within the DoD testing
community. We have noted a similar variability in the way different testing groups mix civilian and military analysts in the teams assigned to their reliability, availability, and maintainability applications. The panel recognizes that different phases of the acquisition process may well have different technical requirements; thus, it seems clear that the level of the personnel involved might vary from phase to phase in accordance with the task or mission of the responsible group. We nonetheless observe that the execution of a sequence of tasks and the validity of the cumulative recommendation benefit from technical expertise at each step in the process.

INDUSTRIAL (NONMILITARY) STANDARDS

In parallel with various military documents in the reliability, availability, and maintainability area, the panel has reviewed a collection of documents describing reliability, availability, and maintainability industrial standards and practices. In the course of doing so, we have found that the existence of an accepted set of best practices is a goal much closer to being realized in industrial than in military settings. Models for such developments include the International Organization for Standardization (ISO) 9000 series and existing documents on practices in the automobile and telephone industries. In our final report, the panel will want to comment on the possibility that DoD might learn from industrial practices in such areas as documentation, uniform standards, and the pooling of information.
Documentation of processes and retention of records (for important decisions and valuable data) are practices now greatly emphasized in industry. The same should be true for DoD, especially for the purposes of operational testing. Efforts to achieve more efficient (i.e., less expensive) decision making by pooling data from various sources require documentation of the data sources and of the conditions under which the data were collected, as well as clear and consistent definitions of various terms. Such efforts complement attempts to standardize practices across the services and encourage the use of best current practices.
The panel does believe that reliability, availability, and maintainability “certification” can better be accomplished through combined use of data collected during training exercises, developmental testing, component testing, bench testing, and operational testing, along with historical data on systems with similar suitability characteristics. In our final report, the panel will seek to clarify the role hierarchical modeling might play in reliability, availability, and maintainability inference from such a broadened perspective.
One complication here is that the panel has encountered operational testing reports (one example is an Institute for Defense Analyses report on the Javelin) that put forth raw estimates of parameters of interest with no indication of the amount of uncertainty involved. This practice does not appear to be particularly rare. When such an outcome is combined with other evidence of a system's performance (regardless of the quality of this additional information), the decision maker is at a loss to describe in a definitive way the risks associated with the acquisition of the system, and combination of information from various sources is extremely difficult. Retention of records may involve some nontrivial costs, but is clearly necessary for accountability in the decision-making process.
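
To make the pooled-data and uncertainty points concrete, the sketch below shows one very simple way that mission failure counts from several test events could be combined with an empirical Bayes (beta-binomial) calculation, so that each event's failure rate is reported with an interval and is shrunk toward a pooled value rather than quoted as a raw point estimate. The event names and counts are invented for illustration; this is not a reconstruction of any DoD or Institute for Defense Analyses analysis, and a serious treatment would model the real differences among test environments rather than treat the events as exchangeable.

    # Illustrative empirical Bayes pooling of failure/trial counts from several
    # hypothetical test events (all numbers are invented for this sketch).
    data = {                      # event: (failures, trials)
        "bench test":          (2, 45),
        "developmental test":  (4, 60),
        "training exercise":   (6, 50),
        "operational test":    (5, 30),
    }

    # Method-of-moments fit of a beta prior to the event-level failure rates.
    # (This crude fit assumes the between-event spread exceeds binomial noise alone.)
    rates = [f / n for f, n in data.values()]
    m = sum(rates) / len(rates)
    v = sum((r - m) ** 2 for r in rates) / (len(rates) - 1)
    prior_n = m * (1 - m) / v - 1            # implied prior "sample size" alpha + beta
    alpha, beta = m * prior_n, (1 - m) * prior_n

    for event, (f, n) in data.items():
        post_a, post_b = alpha + f, beta + n - f
        mean = post_a / (post_a + post_b)    # shrunken failure probability
        sd = (post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))) ** 0.5
        print(f"{event:20s} raw {f / n:.3f}  pooled {mean:.3f} +/- {2 * sd:.3f}")

A full hierarchical Bayesian analysis, of the kind the panel intends to examine in its final report, would handle between-event differences more carefully, but even this crude version shows the value of carrying uncertainty statements through to the decision maker.
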
The trend in industry is to empower employees by giving them more responsibility in the decision-making process, but along with this responsibility comes the need to make people accountable for their decisions. This consideration is likely to be an important organizational aspect of the operational testing of defense systems. In addition, effective retention of information allows one to learn from historical data and past practices in a more systematic manner than is currently the case. It should not be necessary for DoD or the individual services to develop an approach to uniform standards from scratch; there is no question that existing industry guidelines can be adapted to yield
much of what is needed. A related observation concerns the need to realize the potential for more effective use of reservoirs of relevant knowledge outside DoD, including experts at the Institute for Defense Analyses and other federally funded research and development centers, faculty and students from the military academies, and personnel from public and private civilian institutions.

FUTURE WORK

In summary, the panel's preliminary reactions to its findings in the reliability, availability, and maintainability area are as follows:

• The variability in expertise and level of reliability, availability, and maintainability practices across and within services bears further investigation, and may well lead to recommendations regarding minimal training and more comprehensive written resources.
• DoD should consider emulating industrial models for establishing uniform reliability, availability, and maintainability standards.
• There is a need for modernizing military reliability, availability, and maintainability practices, extending standard analyses beyond their present, restricted domains of applicability, which often involve an untenable assumption of exponentiality.

The panel has not yet identified a suitable collection of military systems that should play the role of case studies in the context of a more detailed look at DoD reliability, availability, and maintainability practices. We have looked carefully at the Apache Longbow in this connection, and are still entertaining that as a candidate case study. Proceeding with the case study phase of our work remains our principal unfinished task.
In the months ahead, the panel will consider matters such as appropriate research priorities in the reliability, availability, and maintainability area, and in statistics generally, given the array of complex inference problems with which the DoD testing community is currently engaged. The agenda for the next phase of the panel's reliability, availability, and maintainability-related work will include, as high priorities, increased contact with the Air Force and Marine Corps and assessment of the quality and appropriateness of current reliability, availability, and maintainability practices, together with the formulation of possible amendments aimed at greater precision, efficiency, and protection against risk. We will undertake two main activities: identifying relevant materials related to the identified case studies; interacting with Navy Operational Test and Evaluation Force staff to increase our familiarity with their procedures and with the Army Materiel Systems Analysis Activity to better understand their role in Army reliability, availability, and maintainability methodology for developmental testing. We will also undertake some Monte Carlo simulation to examine the robustness of current reliability, availability, and maintainability practices in the design and execution of life tests.
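
As a small, self-contained illustration of the kind of Monte Carlo study just mentioned (the sample size, true mean time between failures, and the Weibull alternative below are our own hypothetical choices, not values drawn from any DoD test), the following Python fragment checks how far the actual coverage of a standard exponential-theory confidence interval for mean time between failures can drift from its nominal level when the underlying failure times in fact follow a Weibull distribution with increasing hazard.

    import math
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, mtbf, shape, alpha = 10, 100.0, 2.0, 0.20   # hypothetical test size and true MTBF
    scale = mtbf / math.gamma(1.0 + 1.0 / shape)   # Weibull scale giving the desired mean

    def exponential_ci(times, alpha):
        """Two-sided CI for the mean, exact only if the times are exponential
        (complete, uncensored sample)."""
        total, df = times.sum(), 2 * len(times)
        return (2 * total / stats.chi2.ppf(1 - alpha / 2, df),
                2 * total / stats.chi2.ppf(alpha / 2, df))

    cover = 0
    for _ in range(20000):
        times = scale * rng.weibull(shape, n)      # true model: increasing hazard
        lo, hi = exponential_ci(times, alpha)
        cover += lo <= mtbf <= hi
    print(f"nominal {1 - alpha:.0%} interval, actual coverage {cover / 20000:.1%}")

Extending such a study to censored observations and to the specific test designs the services actually use is precisely the kind of robustness assessment contemplated above.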

5 Use of Modeling and Simulation in Operational Testing

The apparent success of many military simulations for the purposes of training, doctrine development, investigation of advanced system concepts, mission rehearsal, and assessments of threats and countermeasures has resulted in their increased use for these purposes.1 At the same time, the DoD testing community, expecting substantial reductions in its budget in the near future, has expanded the use of simulations to assist in operational testing, since they offer the potential of safe and inexpensive “testing” of hardware. Our specific charge in this area is to address how statistical methods might be appropriately used to assess and validate this potential.

1We use the term “simulation” to mean both modeling and simulation.

It seems clear that few if any of the current collection of simulations were designed for use in developmental or operational testing. Constructive models, such as JANUS and CASTFOREM, have been used for comparative evaluations of specific capabilities among candidate systems or parameter values, but not to justify with any validity those systems or value comparisons within an actual combat setting.2 Therefore an important issue must be explored: the extent to which simulations, possibly with some adjustments and enhancements, could be used to assist in developmental or operational testing to save limited test funds, enhance safety, effectively increase the sample size of a test, or perhaps permit the extrapolation of test results to untested scenarios.

2JANUS and CASTFOREM are multipurpose interactive war-game models used to examine tactics.

The goals of building simulations for training and doctrine are not necessarily compatible with the goals of building simulation models for assessing the operational effectiveness of a system. Some specific concerns are as follows:

• Can simulations built for other purposes be of use in either developmental or operational testing in their present state?
• What modifications might generally improve their utility for this purpose?
• How can their results, either in original form or suitably modified, be used to help plan developmental or operational tests?
• How can the results from simulations and either developmental or operational tests be combined to obtain a defensible assessment of effectiveness or suitability?

For example, simulation might be used to identify weak spots in a system before an operational test is conducted so one can make fixes, alleviate problems in the testing environment (e.g., adjusting for units that were not killed because of instrumentation difficulties), or identify scenarios to test by selecting those that cover a spectrum of differences in performance between a new system and the system it is replacing. These questions are all related to the degree to which simulations can approximate laboratory or field tests, and their answers involve verification, validation, and accreditation. Thus we have also investigated the degree to which simulations have been (or are supposed to be) validated.
Two issues relate specifically to the use of statistics in simulation. First is the treatment of stochastic elements in simulations, whether the simulations are used for their original purposes or for operational testing, and associated parameter estimation and inference. Second is the use of simulations in the development of requirements documents, since this represents part of the acquisition process and might play a role in a more continuous evaluation of weapons systems.
The next section describes the panel's scope, procedures, and progress to date in the modeling and simulation area. This is followed by discussion of several concerns raised by our work thus far. The chapter ends with a summary and a review of the panel's planned future work in this area.

SCOPE, PROCEDURES, AND PROGRESS TO DATE

To carry out our charge in this area, the panel has examined a number of documents that describe the current use of simulations both for their original purposes and for the purpose of operational testing. Some documents that have been particularly useful are Improving Test and Evaluation Effectiveness (Defense Science Board, 1989), a 1987 General Accounting Office report on DoD simulation (U.S. General Accounting Office, 1987), Systems Acquisition Manager's Guide for the Use of Models and Simulation (Defense Systems Management College, 1994), and A Framework for Using Advanced Distributed Simulation in Operational Test (Wiesenhahn and Dighton, 1993). The panel's working group on modeling and simulation participated in a one-day meeting at the Institute for Defense Analyses, where briefings were given on various simulations and their utility for operational testing. Our experiences to date have focused on Army simulations (though the one-day meeting involved simulations for both the Navy and the Air Force).
Before the panel's final report is issued, we intend to (1) convene one or two more meetings and examine more of the relevant literature; (2) investigate more and different types of simulations; (3) familiarize ourselves with more of the standards and reference documents from the services other than the Army; and (4) examine more of the relevant practices from industry, from defense acquisition in other countries, and from the federal government in organizations such as NASA.
The panel's investigation in this area is clearly related to topics being treated by some of our other working groups.
Therefore, relevant discussion for some of the issues raised in this chapter may be found in the chapters on experimental design, software reliability, and reliability, availability, and maintainability. This is because (1) the planning of a simulation exercise involves considerations of experimental design; (2) simulations are usually software-intensive systems and therefore raise the issue
of software reliability; and (3) reliability, availability, and maintainability issues should necessarily be included in any system that attempts to measure suitability or operational readiness. Unfortunately, we have found that military system simulations often do not incorporate notions of reliability, availability, and maintainability and instead implicitly assume that the simulated system is fully operational throughout the exercise.
Even though much of the discussion below focuses on problems, we applaud efforts made throughout DoD to make use of simulation technology to save money and make more effective use of operational testing. Much of the DoD work on simulation is exciting and impressive technologically. The panel is cognizant that budgetary constraints are forcing much of this interest in simulation. However, there is a great deal of skepticism in the DoD community about the benefits of using simulations designed for training and other purposes for the new purpose of operational testing. Our goal is not to add to this pessimistic view, but to assist DoD in its attempt to use simulation intelligently and supportably for operational testing.

CONCERNS

Rigorous Validation of Simulations Is Infrequent

Two main activities that make up simulation validation are broadly defined as follows: (1) external validation is the comparison of model output with “true” values, and (2) sensitivity analysis is the determination of how inputs affect various outputs (an “internal” assessment).
External validation of simulations is difficult and expensive in most applications, and the defense testing application is no exception. There are rarely “true” values available since engagements are, fortunately, relatively rare; they occur in specific environments; and when they occur, data collection is quite difficult. Furthermore, operational tests do not really provide true values since they are also simulations to some extent; for example, no live ammunition is used, the effects of the use of weapons are simulated through the use of sensors, personnel and equipment are not subject to the stresses of actual battle, and maneuver areas are constrained and to some extent familiar to some testing units.
While these arguments present genuine challenges to the process of external validation, they should not be taken to imply that such external validation is impossible. The panel is aware of a few attempts to compare simulation results with the results of operational tests. We applaud these efforts. Any discrepancies between simulation outputs and operational test results (relative to estimates of the variance in simulation outputs) indicate cause for concern. Such discrepancies should raise questions about a simulation's utility for operational testing, and should trigger careful examination of both the model and the test scenario to identify the reason. It should be noted that a discrepancy may be due to infidelity of the test scenario to operational conditions, and not necessarily to any problem with the simulation.
There is an important difference (one we suspect is not always well understood by the test community in general) between comparing simulation outputs with test results and using test results to “tune” a simulation. Many complex simulations involve a large number of “free” parameters—those that can be set at different values by the analyst running the simulation. Some of these parameters are set on the basis of prior field test data from the subsystems in question.
Others, however, may be adjusted specifically to improve the correspondence of simulation outputs with particular operational testing results with which they are being compared. Particularly when the number of free parameters is large in relation to the amount of available operational test data, close correspondence between a “tuned” simulation and operational results does not necessarily imply that the simulation would be a good
predictor in any scenario differing from those used to tune it. A large literature is devoted to this phenomenon, known as “overfitting.”3

3Overfitting is said to occur for a model-data set combination when a simple version of the model in a model hierarchy, formed by setting some parameters to fixed values (typically zero), is superior in predictive performance to a more complicated version formed by estimating those parameters from the data set. To be less abstract, an objective definition of overfitting is possible through correspondence with a statistic that measures it, e.g., the Cp statistic for multiple regression models. Thus, multiple regression models with a high Cp statistic could be defined as being overfit.

On the other hand, in some cases a simulation may be incapable of external validation in the strict sense of correspondence of its outputs with the real world, but may still be useful in operational testing. For example, suppose that it is generally agreed that a simulation is deficient in a certain respect, but the direction of the deficiency's impact on the outcome in question is known. This might occur, for example, under the following conditions:

• An otherwise accurate simulation fails to incorporate reliability, availability, and maintainability factors.
• System A is known to be more reliable than system B.
• It is agreed that increased reliability would improve the overall performance of system A relative to system B.

Then if system A performs better in a simulation than system B, it can be argued that the result would have been even more strongly in favor of system A had reliability, availability, and maintainability factors been incorporated (for other examples, see Hodges and Dewar, 1992).
The use of sensitivity analysis is more widespread and applied sensibly in the validation of DoD simulations. However, the high dimensionality of the input space for many simulations necessitates use of more efficient sampling and evaluation tools for learning about the features of these complex simulations. Therefore, to choose input samples more effectively, sampling techniques derived from fractional factorial designs (see Appendix B) or Latin hypercube sampling (see McKay et al., 1979) should be used rather than one-at-a-time sensitivity analysis. To help analyze the resulting paired input-output data set, various methods for fitting surfaces to collections of test points, either parametrically with response surface modeling (see Appendix B) or nonparametrically (e.g., multivariate adaptive regression splines; see Friedman, 1991), should be used.
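
To illustrate what such a design-and-analysis cycle might look like in practice, the short Python sketch below draws a small Latin hypercube sample over three notional simulation inputs, runs a stand-in function in place of a real constructive simulation, and fits a first-order response surface to the resulting input-output pairs. Everything in it (the input names, their ranges, and the toy response) is hypothetical; it is meant only to show the mechanics, not to reproduce any DoD model.

    import numpy as np

    rng = np.random.default_rng(7)
    inputs = {                      # hypothetical input ranges for a constructive simulation
        "threat_density":  (0.5, 3.0),
        "sensor_range_km": (5.0, 20.0),
        "repair_time_hr":  (1.0, 8.0),
    }

    def latin_hypercube(n_runs, n_dims):
        """One random Latin hypercube design on the unit cube."""
        strata = (np.arange(n_runs) + rng.random((n_dims, n_runs))) / n_runs
        return np.array([rng.permutation(row) for row in strata]).T

    def toy_simulation(x):
        """Stand-in for an actual simulation run (returns a notional exchange ratio)."""
        td, sr, rt = x
        return 2.0 + 0.8 * sr / 20 - 0.6 * td - 0.2 * rt / 8 + rng.normal(0, 0.1)

    n_runs = 20
    unit = latin_hypercube(n_runs, len(inputs))
    lows = np.array([lo for lo, hi in inputs.values()])
    highs = np.array([hi for lo, hi in inputs.values()])
    X = lows + unit * (highs - lows)                   # scale design to the input ranges
    y = np.array([toy_simulation(x) for x in X])

    # First-order response surface fitted by least squares, with standard errors.
    A = np.column_stack([np.ones(n_runs), X])
    coef, rss, *_ = np.linalg.lstsq(A, y, rcond=None)
    dof = n_runs - A.shape[1]
    sigma2 = float(rss[0]) / dof if rss.size else float(np.var(y - A @ coef)) * n_runs / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    for name, c, s in zip(["intercept", *inputs], coef, se):
        print(f"{name:16s} estimate {c:+.3f}  std. error {s:.3f}")

With a larger design, the same machinery supports interpolation to untested cases, and confidence statements for the fitted surface follow from standard regression theory.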

At this point in the panel's investigation, we have yet to see a thorough validation of any simulation used for operational testing. Furthermore, we have yet to be provided evidence that modern statistical procedures are in regular use for simulation validation. (This is in contrast to the effort—and obvious successes—in the verification process.) On the other hand, there are well-validated simulations used for “providing lessons” on concepts. For example, EADSIM (a force-on-force air defense simulation) was apparently validated against field tests with favorable results.
While the DoD directives we have examined are complete in their discussion of the need for validation and documentation, we have had difficulty in determining how well these directives have been followed in practice. Not much seems to have changed since the following was noted seven years ago (U.S. General Accounting Office, 1988:11, 46):

    In general, the efforts to validate simulation results by direct comparison to data on weapons effectiveness derived by other means are weak, and it would require substantial work to increase their credibility. Credibility would have been helped by . . . establishing that the simulation results were statistically representative. Perhaps the strongest contribution to credibility came from efforts to test the parameters of models and to run the models with alternative scenarios.
    Although many attempts have been made to develop procedures for assessing the credibility of a model/simulation, none have gained widespread acceptance. At the present time, there is no policy or process in place in DoD to assess the credibility of specific models and simulations to be used in the test and evaluation and the acquisition process.

We have found little evidence to show that this situation has changed substantially. There is a large literature on the statistical validation of complicated computer models. While this is an active area of research, a consensus is developing on what to do. This literature needs to be more widely disseminated among the DoD modeling community: important references include McKay (1992), Hodges (1987), Hodges and Dewar (1992), Iman and Conover (1982), Mitchell and Wilson (1979), Citro and Hanushek (1991), Doctor (1989), and Sacks et al. (1989).

Little Evidence Is Seen for Use of Statistical Methods in Simulations

There are a number of ways statistical methods can be brought to bear in the design of simulation runs and the analysis of simulation outputs. The panel has seen little evidence of awareness of these approaches. Specific areas that might be incorporated routinely include the following:

• Understanding of the relationship between inputs and outputs. Methods from statistical design of experiments can be applied to plan a set of simulation runs to provide the best information about how outputs vary as inputs change. Response surface methods and more sophisticated smoothing algorithms can be applied to interpolate results to cases not run.
• Estimation of variances, and use of estimated variances in decision making. There are recently developed, easily applied methods for estimating variances of simulation outputs. Analysis-of-variance methods can be used to draw inferences about whether observed differences are real or can be explained by natural variation. These methods can be applied in the comparison of results for different systems, or in the validation of simulations. For example, the Army Operational Test and Evaluation Command (OPTEC) performed a study comparing live, virtual (SIMNET), and constructive (CASTFOREM) simulations for the M1A2 (an advanced form of the tank used in Operation Desert Storm). The results demonstrated the limitations of simulation in approximating operational testing; this finding was supported by the difference between the results from the live and the virtual and constructive simulations. However, no confidence intervals were reported for the differences in output (which is not to say that the development of confidence intervals would not have required sophisticated techniques). Therefore, there was no formal basis for inferring whether the differences found were real or simply due to natural variation. It is likely that analysis-of-variance and related techniques could have been used to examine whether the difference between these simulations was due to natural variation or to systematic differences between simulations (a simple sketch of such an analysis follows this list).
• Detection and analysis of outliers. Outliers should be examined separately to determine the reasons for the unusual values. Outliers in operational test data may be due to coding errors or to unusual circumstances not representative of combat. If this is determined to be the case, then the outlier in question should be handled separately from other data points. In simulation results, outliers are crucial for identifying regions of the input-output space in which the behavior of the simulation changes qualitatively.
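
The sketch below indicates the kind of analysis meant in the second item. The outcome values are invented stand-ins for a per-trial measure of effectiveness from live, virtual, and constructive trials; they are not the OPTEC M1A2 data, which are not reproduced here. The code carries out a one-way analysis of variance across the three modes and reports an approximate confidence interval for the live-versus-constructive difference, which is the formal step missing from the study described above.

    import numpy as np
    from scipy import stats

    # Hypothetical per-trial measure of effectiveness under three test modes;
    # the numbers are invented for illustration only.
    live         = np.array([4.1, 5.2, 3.8, 4.7, 5.0, 4.4])
    virtual      = np.array([5.6, 6.1, 5.9, 6.4, 5.7, 6.0])
    constructive = np.array([6.2, 6.8, 6.5, 7.1, 6.6, 6.9])

    # One-way analysis of variance: are the mode-to-mode differences larger than
    # the trial-to-trial variation would explain?
    f_stat, p_value = stats.f_oneway(live, virtual, constructive)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

    # Approximate 95% confidence interval for the live-minus-constructive difference.
    diff = live.mean() - constructive.mean()
    se = np.sqrt(live.var(ddof=1) / live.size + constructive.var(ddof=1) / constructive.size)
    dof = live.size + constructive.size - 2            # simple pooled-dof approximation
    half = stats.t.ppf(0.975, dof) * se
    print(f"live - constructive: {diff:.2f} +/- {half:.2f}")

If the interval excludes zero, the observed gap between live and constructive results cannot be attributed to natural trial-to-trial variation alone; if it does not, the comparison is inconclusive at the chosen confidence level.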

DoD Literature on Use of Simulations Often Lacks Statistical Content

As momentum builds for the use of simulation in operational testing, there is clearly concern in the DoD community that simulation be applied correctly and cost-effectively. Most of the literature and discussions on the use of simulations in operational testing encountered by the panel were thoughtful and constructive. (A particularly good discussion can be found in Wiesenhahn and Dighton, 1993.) However, the panel is concerned with the extremely limited discussion of statistics and the almost total lack of statistical perspective in the literature we encountered.
For example, there is serious concern in the community that simulations be appropriately validated. The literature repeatedly stresses that validation must be related to the purpose at hand (see, e.g., Hodges and Dewar, 1992). The panel agrees with this point, applauds the concern with proper validation, and found the discussions to be thorough and correct. However, there is little discussion of what it means to demonstrate that a simulation corresponds to the real world. Given the level of uncertainty in the results of operational testing scenarios and the stochastic nature of many simulations, it is clear that correspondence with the real world can be established only within bounds of natural variability. Thus, validation of simulation is largely a statistical exercise. Yet there is almost no discussion of statistical issues in the various directives and recommendations we encountered on the use of simulations in operational testing.
Where there is discussion of statistical issues, the treatment is generally not adequate. As noted above, any discussion of statistics is usually concerned with the precision with which expectations can be estimated. We found little evidence of concern for estimating variances or for proper consideration of variability in the use of results for decision making. Moreover, statistical procedures designed for fixed sample sizes have been applied inappropriately in sequential processes. (See, e.g., U.S. General Accounting Office, 1987, which discusses uncritically a procedure in which confidence intervals were constructed repeatedly using additional runs of a simulation, until the interval was sufficiently narrow that it was determined no further runs were needed.)

Distributed Interactive Simulation Raises Additional Concerns

Distributed interactive simulation is a relatively new technology. For systems in which command and control is an important determinant of operational effectiveness, traditional constructive simulations not incorporating command and control are of limited use in evaluating operational effectiveness. Distributed interactive simulation with man-in-the-loop may have the potential to incorporate command and control at less expense than a field test. Although actual military applications to date have been limited, it is widely believed that distributed interactive simulation can contribute to effective use of simulation in the operational testing process, and widely claimed that it can improve some aspects of realism in the operational test environment. For example, semiautomated forces can be used to simulate threat densities not possible in field tests.
Our concerns regarding statistical issues apply with even more gravity to distributed interactive simulation. Running a distributed interactive simulation is likely to be more time-consuming and expensive than running most conventional constructive simulations.
This raises questions about the ability to obtain sufficient sample sizes to estimate results with reasonable precision. Moreover, there is a temptation to presume that the conditions built into a distributed interactive simulation are in fact directly analogous to controllable or independent variables, and are thus subject to the same kinds of statistical treatments. In fact, the elements of most human-introduced aspects of a distributed interactive simulation are peculiar to the setting in which the simulation is run (reflecting such factors as fatigue,


TABLE 5-1 Reliability Assessment of Command Launch Unit of Javelin in Several Testing Situations

Test      Troop Handling    Mean Time Between Operational Mission Failures (in hours)
RQT I     None               63
RQT II    None              482
RDGT      None              189
PPQT      None               89
DBT       Limited            78
FDTE      Limited            50
IOT       Typical            32

NOTE: RQT I, Reliability Qualification Test I; RQT II, Reliability Qualification Test II; RDGT, Reliability Development Growth Test; PPQT, Preproduction Qualification Test; DBT, Dirty Battlefield Test; FDTE, Force Development Test and Experimentation; IOT, Initial Operational Test.

Simulations Cannot Identify the “Unknown Unknowns”

Information one gains from simulation is obviously limited by the information put into the simulation. While simulations can be an important adjunct to testing when appropriately validated for the purpose for which they are used, no simulation can discover a system problem that arises from factors not included in the models on which the simulation is built. As an example, one system experienced unexpected reliability problems in field tests because soldiers were using an antenna as a handle, causing it to break. This kind of problem would rarely be discovered by means of a simulation.

As another example, consider Table 5-1, which reports the mean time between operational mission failures for the command launch unit of the Javelin (a man-portable anti-tank missile) in several testing situations. Note that as troop handling grows to become typical of use in the field, the mean time between operational mission failures decreases. It is therefore reasonable to assume that the failure modes differ for the various test situations (granting that some were removed during the development process). However, since a simulation designed to incorporate reliability would most likely include only the failure modes typical of developmental testing (rather than operational testing), such a simulation could never replace operational tests. Thus the challenge is to identify the most appropriate ways simulation can be used in concert with field tests in an overall cost-effective approach to testing.
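Whether differences like those in Table 5-1 exceed natural variability can be judged only with interval estimates, which require the underlying test exposure (total operating hours and number of observed failures) rather than the point estimates alone. The sketch below is illustrative only: it assumes exponentially distributed times between failures, failure-truncated testing, and hypothetical failure counts, none of which are reported in the table, and computes standard chi-square confidence intervals for the mean time between operational mission failures.

    # Illustrative interval estimates for mean time between failures (MTBF),
    # assuming exponential times between failures and failure-truncated testing.
    # The failure counts below are hypothetical; Table 5-1 gives only point estimates.
    from scipy.stats import chi2

    def mtbf_interval(total_hours, failures, conf=0.95):
        """Point estimate and two-sided chi-square confidence interval for MTBF."""
        alpha = 1.0 - conf
        point = total_hours / failures
        lower = 2.0 * total_hours / chi2.ppf(1.0 - alpha / 2.0, 2 * failures)
        upper = 2.0 * total_hours / chi2.ppf(alpha / 2.0, 2 * failures)
        return point, lower, upper

    # Hypothetical exposure consistent with two of the point estimates in Table 5-1,
    # e.g., 10 observed failures in each test situation.
    for name, mtbf_hat, failures in [("RDGT (no troop handling)", 189, 10),
                                     ("IOT (typical handling)", 32, 10)]:
        total_hours = mtbf_hat * failures
        point, lo, hi = mtbf_interval(total_hours, failures)
        print(f"{name}: MTBF {point:.0f} hours, 95% CI ({lo:.0f}, {hi:.0f})")

If intervals computed this way for two test situations do not overlap, the observed difference is unlikely to be an artifact of sampling variability; with small failure counts the intervals are wide, which is precisely the panel's point about reporting variability as well as means.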


FUTURE WORK

The panel is concerned that (1) rigorous validation of models and simulations for operational testing is infrequent, and external validation is at times used to overfit a model to field experience; (2) there is little evidence of the use of statistical methods to help interpret results from simulations; (3) the literature on the use of simulations is deficient in its statistical content; and (4) simulations cannot identify the “unknown unknowns.”

A number of positions can be taken on the question of the use of simulation for operational testing. These range from (1) the view that simulation is the future of operational testing; through the intermediate positions that (2) simulation, when properly validated, can play a role in assessing effectiveness, but not suitability, (3) simulation can be useful in helping to identify the scenarios most important to test, and (4) simulation can be useful in planning operational tests only with respect to effectiveness; to (5) the view that simulation in its current state is relatively useless in operational testing. The panel is not ready to express its position on this question.

Simulations typically do not consider reliability, availability, and maintainability issues; do not control carefully for human factors; and are not “consolidative models” (Bankes, 1993), that is, they do not consolidate known facts about a system and cannot, for the purpose at hand, safely be used as surrogates for the system itself. Simulations often are not entirely based on physical, engineering, or chemical models of the system or system components. Clearly the absence of these features reduces the utility of simulations for operational testing. This raises questions: At what level should a simulation focus in order to replicate, as well as possible, the operational behavior of a system? Should the simulation model the entire system or individual components? The panel is more optimistic about the simulation of system components than of entire systems.

As our work goes forward, we need to expand our understanding of current practice in the Navy and Air Force, especially with respect to their validation of simulations. We will also examine the use of distributed interactive simulation in operational testing to determine how statistical methods apply in that area. Key issues requiring further investigation because of their complexity include the proper role of simulation in operational testing, the combination of information from field tests with results from simulations, and the proper use of probabilistic and statistical methodology in simulation.

To accomplish the above, we intend to meet with experts on simulation from the Navy Operational Test and Evaluation Force and the Air Force Operational Test and Evaluation Center to determine their current procedures, and to examine the procedures used to validate simulations used or proposed for use in operational testing. We will also meet with experts on simulation from the Institute for Defense Analyses and DoD to solicit their views on the proper role of simulation in operational testing.


6

Efforts Toward a Taxonomic Structure of DoD Systems for Operational Testing

The problems raised by operational testing of military systems may be quite different depending on the nature of the system under test. For example, systems that have a major software component create different testing problems than systems that are chiefly hardware. Systems whose failure could result in loss of life must be tested differently from those whose failure could not. Systems that are modest improvements over existing systems raise issues different from those that embody entirely new technologies. A number of attributes of military systems necessitate differing approaches to operational testing and create distinctions that should be kept in mind when applying statistical techniques.

Because of the many different factors that need to be considered, the panel decided it would be worthwhile to consider the development of a scheme for classifying weapon systems and weapon system testing issues. The utility of examining aspects of operational tests that are linked to features of the systems under test is clear; however, attempts to make progress have raised fundamental issues regarding the scope, depth, and structure of such a scheme. This chapter is intended to raise some of these issues and to promote discussion. It begins by presenting the results of our preliminary work toward developing a taxonomic structure and then briefly describes our planned future activities in this area.

PRELIMINARY WORK TOWARD A TAXONOMIC STRUCTURE

The objective of the panel's efforts in this area is to develop a taxonomic structure that can support, and help structure, analyses of the use of statistical techniques for the efficient testing and evaluation, especially operational testing, of military systems. The term “taxonomic structure” is used to emphasize that the exact nature of any proposed scheme is still under consideration and will evolve as the work proceeds. The structure, when developed, should serve the following purposes:


• Reflect the prevalence of various types of systems.
• Highlight attributes that might call for different statistical approaches, affect decision tradeoffs, or involve qualitatively different consequences.
• Facilitate the integration of commercial approaches by helping to align military and commercial contexts.

With these general purposes in mind, one is led to think of taxonomy dimensions such as the following:

• Cost of system and of testing
  —What is the cost of a test item?
  —What is the number of items to be procured?
  —Is the testing destructive?
• Role of software
  —Is the system a software product?
  —Does the system have significant software content?
  —Does the system use a dedicated computer or require the development of new computer hardware?
• Environment of use. How stressful is the environment within which computer hardware, sensors, motors, electronics, etc. must operate?
• Environment of test and evaluation
  —How close are test environments to actual-use (combat) environments?
  —What is the relevance of simulation?
  —To what extent are performance evaluations dependent upon indirect measurements and inference?
  —To what extent is relevant prior knowledge available and able to be used (1) in the design of evaluation studies or (2) in drawing conclusions from test and evaluation?
• New versus evolutionary system
  —Is the system a de novo development?
  —Is it an upgrade?
  —Is it a modification?
  —Is it a derived design?
  —Is it a replacement for another system?
• Testing consequences
  —What are the consequences of not achieving a successful replacement?
  —What are the consequences of achieving a replacement at a much higher cost than anticipated?
  —What are the consequences of receiving it at a much later date than planned?
  —What are the consequences of receiving it at a much lower level of performance than promised?

A useful taxonomic structure might be developed simply by expanding on this list, adding, deleting, or elaborating as deemed useful. But addressing questions of what to put in and what to leave out raises other questions about the various uses and purposes of the taxonomic structure. Does one wish to recognize all distinctions that may be significant for characterizing:


• The nature of the weapon system?
• The intended combat environment(s)?
• Other (possible) combat environments?
• The intended role of the weapon system and other (possible) roles of the system?
• The range of possible decisions that might be appropriate, given the outcome of this operational test?
• The cost of repairing and supporting the system?
• The logistics costs of fielding the system?

It quickly becomes clear that the taxonomic structure could be developed with more or less ambitious purposes in mind. The choice of purposes might well affect the number of dimensions and the necessary levels of disaggregation within each dimension. It might appear that some of these dimensions go well beyond the objective of characterizing the weapon system. But the goal of operational testing goes beyond testing, per se, to evaluation. Decisions whether to proceed to produce and field a weapon system often hinge not simply on whether the system can perform a physical function, but also on whether it can be employed to perform that function so as to provide a decisive advantage in combat and whether the range of contexts in which it could do so justifies its cost. Given issues of assessing the value of a weapon system, the taxonomic structure might include dimensions such as the following:

• Scenario dependence. To what extent is the value of the weapon system, or its operational performance, affected by the testing scenario? For example, does the scenario correspond to operations on the first day of the war or after air superiority has been achieved? Does it correspond to a scenario in which we have ample warning or are caught by surprise? Is it assumed that air bases are available nearby, or that operations must be adapted for primitive air strips?
• Roles and missions
  —Could this weapon system perform in roles and missions different from those which are tested?
  —Could this system provide a backup for other systems, in case they perform badly or are seriously attrited?
  —Does the operational testing provide information (direct or indirect) regarding possible alternative uses of the system?
• Force flexibility
  —Would this weapon system significantly improve the flexibility inherent in our fielded portfolio of weapon systems?
  —Would it allow us to perform new missions, or to perform existing missions in more than one way?
  —Would this system free up other systems for more valuable uses?

It might be said that these questions go beyond the narrower issues that are normally addressed in operational testing. But to the extent that operational tests can be designed to shed light on such questions, they will provide valuable information that bears directly on the decisions operational testing is meant to inform.

Discussions of force flexibility and roles and missions raise another set of considerations, relating to whether the weapon system in question provides radical new capabilities or, alternatively, simply can do the same job a bit better than the existing fielded system. If a weapon system represents a radical advance, it is important to recognize that its value may well not be entirely appreciated at the time a decision is made.


Thus it might be useful to address in the taxonomic structure the following questions related to tactics and doctrine:

• To what extent do the capabilities inherent in the system raise questions about the nature of tactical operations or even about existing doctrine?
• Have the potential users and testers had adequate time to develop tactics that will utilize this weapon system most effectively?
• Is it plausible that the system opens up opportunities for radical approaches that are not yet well understood?

These questions relate to what is, or is not, revealed about the potential uses of the weapon system through operational testing. But it is also important to recognize that capabilities that are not explicit, not revealed, or not even tested could be more significant than those that are. A complete evaluation must recognize yet another dimension that relates to the characterization of weapon systems—deterrence. The taxonomic structure might address the following questions related to this dimension:

• To what extent could this weapon system create fear among potential adversaries about just what capabilities might be demonstrated in the midst of a conflict?
• To what extent does the system affect the perceptions of adversaries (and allies), as well as actual capabilities?

Considerations such as these all relate to the question of whether the nature of this weapon system (and its potential uses) is now understood and the extent to which it will be better understood after the operational tests are concluded. That question leads in turn to another candidate dimension for the taxonomic structure—human factors. Questions related to this dimension include the following:

• To what extent does the performance of the weapon system depend on the training of those who will operate it?
• To what extent does it depend on the training of those who will operate any “enemy systems” during the testing?
• To what extent may the assessed performance of the weapon system be affected by the training of those who will collect, reduce, and interpret the data collected during the operational test?

Note that while the competence of the operators is always recognized as a factor that may be critical to the assessed performance of a weapon system, it is also important to recognize the extent to which test results may be influenced by those who are collecting and interpreting the data. That point leads to another significant dimension that relates simultaneously to the weapon system and the test range—instrumentation. Questions here include the following:

• To what extent is test range instrumentation adequate for assessing system performance during the operational tests?
• To what extent might the act of instrumenting the test articles interfere with their performance?

It is clear that assessing performance is more difficult with some weapon systems than with others. It is also clear that difficulties may relate to both the nature of the weapon system under test and the capabilities of the test range. Thus one is led to want to characterize not just the weapon system, but the weapon system/test range as a combined entity.


At the same time, thinking about the weapon system and the test range as two components of one unitary problem is itself a faulty premise for a taxonomic structure of testing issues and contexts, because very few weapon systems are themselves unitary. They often combine subsystems and components, with the performance of the overall system depending on the overall operation of and interaction among those elements. Thus it seems quite important to recognize dimensions such as the following:

• Segmentation
  —To what extent is it possible to segment the weapon system into subsystems and components, particularly ones that can be tested independently?
  —To what extent do systems integration problems and subsystem interactions interfere with the validity of “segmented tests”?
• Architecture
  —To what extent does the design or nature of this system allow for improvements on a subsystem-by-subsystem basis?
  —To what extent is system performance affected by the current state of development of the individual subsystems?

Recognition of these dimensions raises another set of relevant considerations: while the milestone paradigm is based on notions of phases called development, production, and operations and support, it is increasingly true that development continues over the lifetime of many modern weapon systems. Yet it is also often the case that one may not understand how the current version of a weapon system “works,” in particular, what the fault modes are of complex electronic subsystems, even after we have begun to field it. Thus the taxonomic structure might include the following dimensions:

• Maturation
  —To what extent is this system matured?
  —To what extent is its performance likely to improve markedly as it is better understood? After it has been fielded? After test and operational data have accumulated?
• Sources of information. To what extent will there be continued reliance on data collected through operational tests or ongoing use (with proper documentation) in understanding the performance, reliability, and nature of this weapon system?
• Process perfection. To what extent will the performance of this weapon system gradually improve as production, testing, repair, and support processes are perfected?
• Heterogeneity
  —To what extent will differences among produced items be testable, or recognizable, before the items are used?
  —To what extent will military commanders be able to either manipulate or hedge against apparent heterogeneity among fielded units of this weapon system?

The 19 dimensions noted above are not meant to be exhaustive or definitive, only to illustrate some of the directions in which the taxonomic scheme could be extended. One important conclusion is that a set of dimensions should be developed only after there is agreement on what purposes the taxonomic structure should serve, taking into account its various potential uses. It is also apparent that one objective of a conceptual taxonomy could be to list exhaustively the dominant sources of variability that are relevant to the testing, evaluation, and decision-making contexts for different types of weapon systems. Clearly that represents a very ambitious goal, but not an unthinkable one.


The panel could define a family of taxonomies, attempt to implement only one or two relatively simple versions, but also present and define more complex versions. For some purposes, a taxonomic structure is helpful only if it entails considerable aggregation. There is always a tradeoff between the homogeneity of the cells of a taxonomy at the end of the process and the parsimony of the cell definitions. Obviously, the number of cells grows very quickly with tree depth (see the sketch at the end of this section). The question arises of how to arrive at the point of greatest utility. One might either add branches to a simple structure or prune from a complex structure.

We have not yet decided how to proceed, or even whether to proceed toward developing a taxonomy. Some panel members believe a tree structure could never work, because many of the branch definitions cross-reference one another while others do not. Instead, some suggest a list of features, that is, a checklist, consisting of features that are either present in one or another form, or absent, with no overriding structure. This would work if many of the levels of branches did not depend on the presence or absence of other characteristics. Then, instead of every member of a cell receiving a different test methodology, as would happen with a usual taxonomy, one would have a collection of test features that relate to individual properties of the checklist.

Clearly, the panel is in the preliminary stages of its work on this topic. We have determined that such a taxonomic structure would be difficult to produce and that its appropriate scope, depth, and nature depend on the uses one has in mind for it. The panel would find any information about previous efforts in this area of great interest.
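To make the aggregation tradeoff concrete, the sketch below is illustrative only: the dimensions and level counts are hypothetical examples, not proposals by the panel. It counts the cells implied by a full cross-classification of a few dimensions and contrasts that with a flat checklist description of the same attributes, under which test features would attach to individual attributes rather than to whole cells.

    # Illustrative comparison of a cross-classified taxonomy with a flat checklist.
    # The dimensions and their levels are hypothetical examples, not panel proposals.
    from math import prod

    dimensions = {
        "role of software": ["none", "significant", "software product"],
        "environment of use": ["benign", "moderate", "severe"],
        "new vs. evolutionary": ["upgrade", "derived design", "de novo"],
        "testing destructive": ["no", "yes"],
        "loss of life on failure": ["no", "yes"],
    }

    # A full cross-classification assigns every system to one cell.
    cells = prod(len(levels) for levels in dimensions.values())
    print(f"cells in the full cross-classification: {cells}")   # 3*3*3*2*2 = 108

    # A checklist treats each attribute separately: a system is described by one
    # recorded value per dimension, and test features attach to individual
    # attributes rather than to whole cells.
    example_system = {
        "role of software": "significant",
        "environment of use": "severe",
        "new vs. evolutionary": "upgrade",
        "testing destructive": "no",
        "loss of life on failure": "yes",
    }
    print(f"checklist entries needed to describe a system: {len(example_system)}")

Even this small hypothetical example yields more cells than there are systems likely to be tested in any given period, which is why considerable aggregation, or a checklist representation, may be needed.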

FUTURE WORK

Building on the preliminary work described above, the panel will develop a taxonomic structure that provides categories of defense systems that require qualitatively different test strategies. To this end we will examine various databases for their utility in classifying recent and current systems and in helping us determine the relative sizes of various cells of the taxonomic structure.

Appendices


APPENDIX A

The Organizational Structure of Defense Acquisition

This appendix provides a description of the salient features of defense acquisition, with emphasis on operational testing.

The procurement of a major weapon system follows a series of event-based decisions called milestones. At each milestone, a set of criteria must be met if the next phase of the acquisition is to proceed. In response to a perceived threat, the relevant military service will undertake a Mission Area Analysis and Mission Needs Analysis. For example, because of a change in threat, it may be determined that there is a need to impede the advance of large armored formations well forward of the front lines. In light of this analysis or assessment, the relevant service, or possibly a unified or specified command, prepares a Mission Needs Statement. This is a conceptual document that is supposed to identify the broadly stated operational need, not a specific solution to counter the perceived threat. In practice, however, the military service will sometimes try to write the Mission Needs Statement so that only a certain preconceived solution will meet the mission need. The mission need may be satisfied by either materiel or nonmateriel innovations. Within the context of the above example, the Mission Needs Statement could eventually result in a new acquisition program (e.g., a new aircraft-delivered weapon), a new concept of operations for existing forces and equipment, or possibly some combination of both.

If the Mission Needs Statement results in a determination that a new materiel solution is required, a concept formulation effort is begun. If that effort justifies a particular approach, and the decision makers believe the new approach has enough merit to warrant further resource commitment, a new acquisition program is begun. At this stage in the process, the program is assigned to an acquisition category (ACAT). The ACAT designation is important because it determines both the level of review required by law and the level at which milestone decision authority rests in DoD. The ACAT assignment is made primarily according to the program's projected costs, but other considerations may result in the program receiving a higher level of review within DoD. Of the four acquisition categories, ACAT I through ACAT IV, ACAT I contains the highest-cost systems. This appendix focuses on ACAT I programs.


The relevant authority for these “major defense acquisition programs” rests with the Undersecretary of Defense for Acquisition and Technology. The Mission Needs Statements for needs that can or are expected to result in major programs are forwarded to the Joint Requirements Oversight Council for review, evaluation, and prioritization. The council, chaired by the Vice Chairman of the Joint Chiefs of Staff, validates those Mission Needs Statements it believes require a materiel solution and forwards its recommendations to the Under Secretary of Defense for Acquisition and Technology. If the under secretary concurs with the council's recommendation, a meeting of the Defense Acquisition Board is convened for a milestone 0 review. Upon milestone 0 approval from the Undersecretary of Defense for Acquisition and Technology, a concept exploration study is undertaken. The purpose of this phase 0 in the acquisition process is to determine the most promising solution(s) to meet the broadly stated mission need. At milestone I, when a concept is approved by the undersecretary, and the Deputy Secretary of Defense agrees that it warrants funding in the DoD programming and budget process, the new acquisition program is formally begun, the relevant service establishes a program office, and a program manager is assigned. It is at this stage that DoD is sufficiently committed to the program to request budget authority from Congress to begin development. Congress must, of course, appropriate funding each year thereafter for the acquisition program to continue.

Once an acquisition program has formally begun, the program office creates an acquisition strategy that provides an overview of the planning, direction, and management approach to be used during the multiyear development and procurement process. Other important documents supporting the program and requiring approval are the Operational Requirements Document and the Acquisition Program Baseline. The Operational Requirements Document describes in some detail the translation from the broadly stated mission need to the system performance parameters that the users and the program manager believe the system must have to justify its eventual procurement. In the context of the Operational Requirements Document, performance parameters or requirements may be characteristics such as radar cross-section (visibility to the enemy), probability of kill, range, and minimum rate of fire.

The Acquisition Program Baseline is like a contract between the milestone decision authority and the relevant service, including the program manager and his/her immediate superiors. It has three sections: one dealing with performance characteristics, a second with projected costs for various phases of the acquisition, and a third with the projected schedule. In each case there are objectives and less stringent thresholds, the thresholds being the minimum acceptable performance standards or the maximum acceptable costs or time periods for achieving certain objectives. If any of these thresholds are violated or projected not to be met upon program completion, the desirability of continuing or completing the program is supposed to be reexamined, and the program possibly terminated. The Acquisition Program Baseline is a summary document that is supposed to include only those parameters of performance, cost, and schedule critical to the success of the program and to the acquisition decision authorities. Clearly the program manager and the contractors must manage and control many more parameters than are contained in this document.

A problem that sometimes surfaces during operational testing is that the tests are evaluated against all the parameters specified in the Operational Requirements Document, which may be more demanding than the thresholds specified in the Acquisition Program Baseline. At that point, the military service may believe that the system is still good enough to procure even if it does not meet all of the requirements in the Operational Requirements Document. However, when the acquisition process was in its early stages, with the program competing for resources against other programs, the service may have believed that if it did not specify parameters that later proved difficult if not impossible to meet, it would not get approval for the program.


Also, in the early stages of a program, the service and others in the acquisition community may have believed they could achieve better performance than later proved to be the case. The difficulty this presents is that the credibility of the service and of DoD with Congress is damaged if the acquisition decision authorities attempt to change performance goals near the end of the process, when operational testing is imminent or in process. Related to this issue is the tendency to view operational testing as a pass/fail process, rather than as part of an overall process for managing risk and balancing cost and performance tradeoffs, with the objective of acquiring quality, cost-effective systems and getting them into the hands of the forces who may have to use them in combat.

The primary purposes of DoD test and evaluation are to (1) provide essential information for the assessment of acquisition risk and for decision making, (2) verify attainment of technical performance specifications and objectives, (3) verify that systems are operationally effective and suitable for their intended use, and (4) provide essential information in support of decision making. In technical terms and in the management of acquisition programs, the boundaries between developmental and operational testing are not always clear or distinct. Over the years, however, organizational boundaries have developed between the two within DoD. This occurred as a result of congressional concerns, which eventually resulted in a law establishing the separation and reporting requirements for the Office of the Director of Operational Test and Evaluation. In many respects, these boundaries make the management of testing and evaluation in DoD more complex than it would otherwise be, even though the two test communities communicate well with each other.

For all acquisition programs in DoD, test planning is supposed to begin very early in the acquisition process and to involve both the developmental and operational testers. Both are involved in preparing the Test and Evaluation Master Plan, which is a requirement for all acquisition programs. For all ACAT I programs, this plan must be approved by the Director, Operational Test and Evaluation, and the Deputy Director, Defense Research and Engineering (Test and Evaluation), the analogous head of developmental testing. The Test and Evaluation Master Plan documents the overall structure and objectives of the test and evaluation program, provides a framework for generating detailed test and evaluation plans, and documents associated schedule and resource implications. It relates program schedule, test management strategy and structure, and required resources to (1) critical operational issues, (2) critical technical parameters, (3) minimum acceptable operational performance requirements, (4) evaluation criteria, and (5) milestone decision points. It is prepared by the program office, with input from system testers in both developmental and operational testing, service representatives, and other technical experts. (In the Army, those in charge of requirements assist the program manager in preparing the operational testing portion of the Test and Evaluation Master Plan, while in the Air Force and Navy, those in charge of requirements do not formally participate in writing the plan.)

The Test and Evaluation Master Plan has five components. First, it contains a statement of requirements for the system, which is simply an interpretation of the Operational Requirements Document from the viewpoint of the testing community. Second, it contains an integrated test program summary and schedule, including the identification of which testing agencies will provide information to the program manager and when that information will be provided. Third, it contains detailed information about the criteria for the developmental tests, in which each component of the system will be evaluated. Fourth, it contains the operational test master plan, in which the critical operational issues for operational testing are described and broken into two groups, one for effectiveness and one for suitability. Finally, it identifies the resources projected to be available for purposes of testing the system, including personnel, test ranges, models, and funding. The Test and Evaluation Master Plan is updated during the various phases of the acquisition process for the program. It is reported that congressional staff members sometimes become involved in reviewing the plan revisions.


At milestone I, the Defense Acquisition Board reviews the acquisition strategy, the Operational Requirements Document, the Test and Evaluation Master Plan, the initial Acquisition Program Baseline, and an independent cost evaluation report from the Cost Analysis Improvement Group. If the Undersecretary of Defense for Acquisition and Technology approves the program from an acquisition perspective, and the Deputy Secretary of Defense approves the necessary near- and long-term funding in the context of the overall defense program, the new acquisition program is formally begun, and the demonstration and validation phase begins.

In phase I of the acquisition process, the demonstration and validation phase that occurs between milestone I and milestone II, the objectives are to provide confidence that the technologies critical to the concept can be incorporated into a system design and to define more fully the expected capabilities of the system. This is the first stage of development in which tradeoffs can be addressed based on developmental data rather than just analytical models. Thus, this phase provides an opportunity to obtain some confidence that the parameters specified in the Operational Requirements Document will be achieved as development progresses, or to recommend that changes or tradeoffs be made because the original objectives appear to be too stringent. One of the minimum required accomplishments for this phase of the acquisition process is to identify the major cost, schedule, and performance tradeoff opportunities.

At the completion of the demonstration and validation phase, the acquisition program comes to the milestone II decision point for development approval. The milestone decision authority must rigorously assess the affordability of the program at this point and establish a development baseline (a refinement or revision of the initial Acquisition Program Baseline approved at milestone I). The low-rate initial production quantity to be procured before completion of initial operational testing is also determined by the milestone decision authority at milestone II, in consultation with the Director, Operational Test and Evaluation. The quantities of articles required for operational testing are also specified at this point by the testing community, and, specifically, by the Director, Operational Test and Evaluation for ACAT I programs. (A major challenge for DoD is to balance the desire to perform operational testing on production versions of the system against the need to complete operational testing before entering full-scale production. This is especially difficult when there are large costs associated with continuing low-rate production or halting it temporarily to complete some testing and evaluation or to make fixes to problems found in initial operational testing.)

Engineering and manufacturing development is phase II of the acquisition process, which follows a successful milestone II decision point. The objectives in this phase are to translate the system approach into a stable, producible, and cost-effective system design; validate the manufacturing or production process; and demonstrate through testing that system capabilities meet contract specification requirements, satisfy the mission need, and meet minimum operational performance requirements. Thus, both further developmental and operational testing are accomplished in this phase before full-rate production is approved at milestone III (for systems successfully completing the engineering and manufacturing development phase). Moreover, DoD Instruction 5000.2, “Defense Acquisition Management Policies and Procedures,” specifies that, “when possible, developmental testing should support and provide data for operational assessment prior to the beginning of formal initial operational test and evaluation by the operational test activity.” Ideally, developmental testing is conducted prior to final operational testing, and the system is required to pass an operational test readiness review, which is certified by the program executive officer, the program manager's direct supervisor. However, developmental and operational testing often overlap.


The result may be that contractors have little opportunity to make fixes or improve the system based on lessons learned from developmental test results. In addition, as mentioned above, low-rate initial production begins during the engineering and manufacturing development phase, well before all the operational test results are known, because of the desire to use production versions in operational testing and the desire to avoid the increased costs associated with stopping low-rate production while awaiting operational test results.

When a system is scheduled for operational testing, the exact details of the tests are prepared by the testers and evaluators per the Test and Evaluation Master Plan. However, resource constraints may prevent certain characteristics of the system from being ascertained. In such cases, the testers identify what they can accomplish given the constraints. The amount of control the program manager has over the testing budget for the system varies from service to service. In theory, the operational testers are meant to be wholly independent of the test result evaluators. This separation is preferred so that those who evaluate the results will not be tempted to design tests that are relatively easy to evaluate, rather than tests that are more difficult to evaluate but will produce the most informative results. In practice, the testers and evaluators work together in designing the tests. In all the services, the testing agencies are independent of the program office and any of its direct supervisory management.

The results of operational testing are interpreted by several separate organizations, including the relevant service's operational test agency and the Director, Operational Test and Evaluation. The Office of the Director, Operational Test and Evaluation is a congressionally created oversight office within DoD, reporting to the Secretary of Defense (rather than the Undersecretary for Acquisition and Technology). It prepares independent reports concerning operational testing, which are provided to the Defense Acquisition Board, the Secretary of Defense, and Congress. Prior to the publication of a report on a specific system, the program manager has the opportunity to comment on the report and ask for revisions. If the Director, Operational Test and Evaluation refuses the revisions, the service may append the program manager's comments to the report. In addition, each of the services has its own agency to interpret the test results, with input from the program manager. The reports from these organizations are sent to the Defense Acquisition Board for its milestone III consideration after review by the service acquisition board (e.g., the Army Acquisition Review Board for the Army). In making the milestone III recommendation to initiate full-scale production, the Defense Acquisition Board considers the developmental test results and the reports of the Director, Operational Test and Evaluation and the service test and evaluation organizations. If approval for full-scale production is granted by the Undersecretary of Defense for Acquisition and Technology, the procurement request is included in the DoD budget request submitted by the Secretary of Defense to Congress (or the dollars included in the prior budget request are approved for obligation by the service). The full-scale production contracts are then awarded, consistent with any DoD or congressional restrictions. Follow-on operational testing is performed during the early stages of the production phase to monitor system performance and quality.

In the entire acquisition process for a specific system, there will be a number of program managers because of the multiyear length of the development and procurement. These program managers are the individuals most affected by the success or failure of the program. Specifically, if the program has major problems or is terminated, the career of the program manager at that time may be significantly damaged. The program manager should be focused on overseeing the management of the program in all phases of the acquisition; particularly in the early stages of the program, however, he or she is under pressure to act as a salesman or advocate for the program rather than as an independent manager. As a result, the program manager is encouraged to have a “can do” attitude rather than to consider the possibility that meeting the original objective may not be feasible and that some tradeoffs must be made before proceeding.


There is also a strong tendency or pressure for program managers not to bring problems forward without solutions, even if those problems were not a result of their actions (or lack of action). Thus, test results indicating that a system needs further development or fixes before the program proceeds may adversely affect the career of the program manager, even if such results are in no way tied to that individual's performance. These pressures on program managers can lead to unnecessary and unproductive tensions in the overall acquisition process, and in the test and evaluation portions of the process in particular.


APPENDIX B

A Short History of Experimental Design, with Commentary for Operational Testing

Some of the most important contributions to the theory and practice of statistical inference in the twentieth century have been those in experimental design. Most of the early development was stimulated by applications in agriculture. The statistical principles underlying design of experiments were largely developed by R. A. Fisher during his pioneering work at Rothamsted Experimental Station in the 1920s and 1930s. The use of experimental design methods in the chemical industry was promoted in the 1950s by the extensive work of Box and his collaborators on response surface designs (Box and Draper, 1987).

Over the past 15 years, there has been a tremendous increase in the application of experimental design techniques in industry. This is due largely to the increased emphasis on quality improvement and the important role played by statistical methods in general, and design of experiments in particular, in Japanese industry. The work of the Japanese quality consultant G. Taguchi on robust design for variation reduction has shown the power of experimental design techniques for quality improvement. Experimental design techniques are also becoming popular in the area of computer-aided design and engineering using computer/simulation models, including applications in manufacturing (automobile and semiconductor industries), as well as in the nuclear industry (Conover and Iman, 1980). Statistical issues in the design and analysis of computer/simulation experiments are discussed in Sacks et al. (1989).

Robust design uses designed experiments to study the response surfaces associated with both mean and variation, and to choose the factor settings judiciously so that both variability and bias are made simultaneously small. Variability is studied by identifying important “noise” variables and varying them systematically in offline experiments. Robust design ideas have been used extensively in industry in recent years (see Taguchi, 1986; Nair, 1992).
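As an illustration of the robust design idea just described, the short sketch below crosses a few control-factor settings with a set of noise conditions and summarizes each control setting by the mean and standard deviation of a simulated response. The factor names, levels, and response function are invented for illustration and are not drawn from the report or from any of the cited works.

    # Minimal sketch of a crossed-array robust design study.  The control
    # settings, noise conditions, and response function are hypothetical;
    # the point is the summary: for each control setting, examine both the
    # mean response and its variability across the noise conditions.
    import itertools
    import numpy as np

    rng = np.random.default_rng(7)

    control_settings = list(itertools.product([0.0, 0.5, 1.0],    # factor A
                                              [10.0, 20.0]))      # factor B
    noise_conditions = [-1.0, 0.0, 1.0]                           # e.g., a temperature shift

    def response(a, b, noise):
        # Hypothetical response: factor A amplifies sensitivity to the noise variable.
        return b + 5.0 * a * noise + rng.normal(scale=0.5)

    for a, b in control_settings:
        y = np.array([response(a, b, z) for z in noise_conditions])
        print(f"A={a:.1f}, B={b:>4.1f}:  mean={y.mean():6.2f}  sd={y.std(ddof=1):5.2f}")
    # A robust choice keeps the mean near its target while keeping the sd small;
    # in this constructed example, smaller values of factor A reduce sensitivity
    # to the noise variable.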


Some basic insights of experimental design have had revolutionary impact, but many of these insights are not well known among scientists without specialized training in statistics, partly because elementary texts and first courses seldom cover the topic at all, or cover it only superficially. For example, the role of randomization and the inefficiency of the practice of varying one factor at a time are not well appreciated. To the extent that this is true of the operational testing community, it should be surprising, since many of the applications and much of the support for research in experimental design derived from problems faced by DoD during and shortly after World War II. The reason may be that practical considerations in carrying out operational testing often impose such complex restrictions on the nature of the experimental design that one cannot rely on standard formulae to optimize the design. Here, as in many other applications of statistical theory to practice, it seems likely that the limited standard textbook rules and dogmas are inadequate for dealing intelligently with the problem. What is required is the kind of expertise that can adapt underlying basic principles to the current situation, an expertise rarely found outside the scope of well-trained statisticians who understand the relation of standard rules to underlying principles.

Both to serve as a reference point for later discussion and to help summarize the progress made in this field, we describe a few of the basic principles and tools of experimental design in barest outline. It is our hope that appreciation of the basic principles will thus be enhanced, and the potential for more sophisticated applications developed.

THE VALUE OF CONTROLS, BLOCKING, AND RANDOMIZATION

Several basic principles of design of experiments are widely understood. One is the need for a control. In comparing two systems, a new one and a standard one whose behavior is relatively well known, there used to be a natural tendency to test and evaluate the new system separately. The result of such an evaluation tends to be biased by a “halo” factor, because the new system is being evaluated under conditions somewhat different from the everyday conditions under which the old system has been used. To avoid this bias, it is commonplace to test both systems simultaneously under similar circumstances. With complicated weapon systems, satisfactory control may require careful consideration of the training of the personnel handling the system.

The use of controls has an additional advantage besides that of eliminating a potential inadvertent bias. This advantage stems from the factors that contribute to the variability in the outcomes of individual tests. Ordinarily, the outcome of an experiment depends not only on the overall quality of the system, but also on more or less random variations, some of which are due to the general environment. To the extent that the two systems are tested in the same environment, which is likely to have a similar effect on both systems, the difference in performance is less likely to be affected by the environment, and the experiment yields a more precise estimate of the overall difference in performance of the two systems. If natural variations in the environment have a relatively large effect on the variability in performance, the ability to match pairs has a correspondingly large effect on increasing the precision of conclusions. When this principle of matching is generalized to more than two systems, it is referred to as blocking, a term derived from agricultural experiments in which several treatments are applied in each of many blocks of land. In the context of operational testing, a series of prototypes and controls are tested simultaneously under a variety of conditions defined by such factors as terrain, weather, degree of training of troops, and type of attack.
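The gain from matching described above can be made concrete with a small simulation. The sketch below is illustrative only: it assumes a hypothetical environment (block) effect that is large relative to the residual test-to-test noise, and it compares the standard error of the estimated difference between two systems when both are tested in the same blocks (paired) and when each sees independently drawn environments (unpaired).

    # Minimal sketch of the precision gain from blocking/pairing.
    # All variance components and the true difference are hypothetical.
    import numpy as np

    rng = np.random.default_rng(3)
    TRUE_DIFF = 1.0     # true performance advantage of the new system
    SIGMA_ENV = 3.0     # block (environment) effect: terrain, weather, troops, ...
    SIGMA_ERR = 1.0     # residual test-to-test variation
    N_BLOCKS = 8        # number of test conditions (blocks)
    REPS = 20000

    paired_est, unpaired_est = [], []
    for _ in range(REPS):
        env = rng.normal(0.0, SIGMA_ENV, N_BLOCKS)
        # Paired: both systems observed in the same blocks; the block effect cancels.
        new = TRUE_DIFF + env + rng.normal(0.0, SIGMA_ERR, N_BLOCKS)
        old = env + rng.normal(0.0, SIGMA_ERR, N_BLOCKS)
        paired_est.append(np.mean(new - old))
        # Unpaired: each system sees its own independently drawn environments.
        env2 = rng.normal(0.0, SIGMA_ENV, N_BLOCKS)
        new_u = TRUE_DIFF + env + rng.normal(0.0, SIGMA_ERR, N_BLOCKS)
        old_u = env2 + rng.normal(0.0, SIGMA_ERR, N_BLOCKS)
        unpaired_est.append(np.mean(new_u - old_u))

    print(f"std. error of estimated difference, paired:   {np.std(paired_est):.2f}")
    print(f"std. error of estimated difference, unpaired: {np.std(unpaired_est):.2f}")

With the hypothetical variance components above, the paired comparison is markedly more precise because the shared environment effect cancels in the within-block difference; the larger the block-to-block variation relative to the residual variation, the larger the gain.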
The process of blocking raises another issue. How should the various treatments be distributed within a block? In an agricultural experiment, if one assumes that position within the block has no effect, position will not matter. But if there is a systematic gradient in soil fertility in one direction, the use of a systematic allocation might introduce a bias. One way to deal with this possibility is to anticipate the bias and allocate within the various blocks in a clever fashion designed to cancel out the


extraneous and irrelevant gradient effect. This is tricky, and the history of such attempts is full of misguided failures. Another approach to reducing the bias is to select the allocation within the block by randomization. Often in operational testing applications with a small number of test articles, randomization may not be necessary, and small systematic designs can be used safely. Or one can select a design at random from a restricted class of “reasonably safe” designs. However, in larger and more complicated experiments where there are many blocks, the possible biasing effect due to “unfortunate” randomizations is very likely to be minimal. Moreover, one byproduct of randomization is that it permits the statistician to ignore the complications due to many poorly understood potential biasing phenomena in constructing the probabilistic model on which to base the analysis.

VARYING MORE THAN ONE FACTOR AT A TIME

Perhaps one of the most important insights of experimental design is that the traditional policy of varying one factor at a time is inefficient; that is, the resulting estimates have higher variance than estimates derived from experiments with the same number of replications in which several factors are simultaneously varied. We illustrate with two examples.

One example, due to Hotelling and based on work by Yates, involves the weighing of eight objects whose weights are wi, 1 ≤ i ≤ 8. A chemist's scale is used, which provides a reading equal to the weight in one pan minus the weight in the other pan plus a random error with mean 0 and variance σ². Hotelling proposes the design represented by the equations:

X1 = w1 + w2 + w3 + w4 + w5 + w6 + w7 + w8 + u1
X2 = w1 + w2 + w3 − w4 − w5 − w6 − w7 + w8 + u2
X3 = w1 − w2 − w3 + w4 + w5 − w6 − w7 + w8 + u3
X4 = w1 − w2 − w3 − w4 − w5 + w6 + w7 + w8 + u4
X5 = w1 + w2 − w3 + w4 − w5 + w6 − w7 − w8 + u5
X6 = w1 + w2 − w3 − w4 + w5 − w6 + w7 − w8 + u6
X7 = w1 − w2 + w3 + w4 − w5 − w6 + w7 − w8 + u7
X8 = w1 − w2 + w3 − w4 + w5 + w6 − w7 − w8 + u8

where Xi is the observed outcome of the ith weighing, a +1 before a wj means that the jth object is in the first pan, a −1 means that it is in the other pan, and ui is the random error for the ith weighing and is not observed directly. We estimate the wj by solving the equations derived by assuming all ui = 0. This gives, for example, the estimate ŵ1 of w1, where:

ŵ1 = (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8)/8

or

ŵ1 = w1 + (u1 + u2 + u3 + u4 + u5 + u6 + u7 + u8)/8

Since the ui are the errors resulting from independent weighings, we assume that they are independent


with mean 0 and variance σ². Then a straightforward computation yields the result that the ŵj have mean wj and variance σ²/8, and are uncorrelated. If one had applied 8 weighings to the first object alone, no better result would have been obtained for w1. Thus a design in which each object is weighed separately would require 64 weighings to achieve our results, which were derived from 8 weighings.

Another example, from Mead (1988), confronts the practice of varying one factor at a time more directly. Suppose that the outcome of a treatment is affected by three factors, p, q, and r, each of which can be controlled at two levels, p0 or p1, q0 or q1, and r0 or r1. We are allowed 24 observations. In one experiment we use:

• p0q0r0 and p1q0r0 four times each
• p0q0r0 and p0q1r0 four times each
• p0q0r0 and p0q0r1 four times each

An alternative second experiment uses each of the eight combinations p0q0r0, p0q0r1, p0q1r0, p1q0r0, p1q1r0, p1q0r1, p0q1r1, and p1q1r1 three times. We are interested in estimating the difference in average effect due to the use of p1 rather than p0. Assume that effects of the factors are additive, and the observations have a common variance σ² about their expectation. Then the variance of the estimate of the difference due to p1qkrm rather than p0qkrm is σ²/2 in the first experiment and σ²/6 in the second. The same holds for the differences due to the second and third factors. A threefold reduction in variance can be achieved by a design that varies several factors at once.

The more efficient design consisted of replicating the eight-case block three times. This design also has the advantage of allowing the designer to select quite distinct environments for each block without worrying much about the contribution of the environmental factors to the overall effect being studied. In case the variations in environment have a large effect on the result, the blocking aspect of the design is useful in increasing the efficiency of the estimation of the contrasting effects of p, q, and r over a design that ignores blocking. Moreover, the design is well balanced in a technical sense, permitting simple analyses of the resulting data, as well as efficiency of the resulting estimates. The simplicity of the analysis, even in this day of cheap and fast computing, retains an advantage in permitting the analyst to present the results in a convincing way to those without a background in statistics.

An experiment in which each combination of controllable factors is considered at several levels is called a factorial experiment. (Factorial designs were developed by Fisher and Yates at Rothamsted.) So, for example, if one has four factors involving five levels each, a factorial experiment would require 5^4 = 625 distinct observations. Such a large number could be impractical. For such cases, an elegant mathematical theory of incomplete block designs was developed, supplemented by a theory dealing with fractional factorial designs, latin squares, and graeco-latin squares for studying the main effects and low-order interactions in a small number of runs. These designs tend to achieve efficiency and balance while reducing potential biases, leading to relatively simple analysis. Fractional factorial designs were introduced by Finney (1945). Orthogonal arrays, recently popularized by Taguchi, include the fractional factorial designs developed by Finney, the designs developed by Plackett and Burman (1946), and the orthogonal arrays developed by Rao (1946, 1947), Bose and Bush (1952), and others.
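The variance claims for the weighing design can be checked directly from the pattern of signs in the eight equations above. The following sketch is not from the report; it is a short numerical check in Python, with the design matrix transcribed from the equations.

import numpy as np

# Signs of w1..w8 in the eight weighings X1..X8, transcribed from the equations above.
H = np.array([
    [ 1,  1,  1,  1,  1,  1,  1,  1],
    [ 1,  1,  1, -1, -1, -1, -1,  1],
    [ 1, -1, -1,  1,  1, -1, -1,  1],
    [ 1, -1, -1, -1, -1,  1,  1,  1],
    [ 1,  1, -1,  1, -1,  1, -1, -1],
    [ 1,  1, -1, -1,  1, -1,  1, -1],
    [ 1, -1,  1,  1, -1, -1,  1, -1],
    [ 1, -1,  1, -1,  1,  1, -1, -1],
])

# The columns are orthogonal, so H'H = 8I and the least-squares estimate is H'X / 8.
print(H.T @ H)  # 8 on the diagonal, 0 elsewhere

# Covariance of the estimated weights is sigma^2 * (H'H)^{-1}, i.e., sigma^2/8 on the diagonal.
sigma2 = 1.0
cov_w_hat = sigma2 * np.linalg.inv(H.T @ H)
print(np.diag(cov_w_hat))  # each entry equals sigma^2 / 8 = 0.125

Because the columns are orthogonal, the least-squares estimates of all eight weights are uncorrelated with variance σ²/8, matching the statement in the text.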
OPTIMAL EXPERIMENTAL DESIGNS

A major advance in the theory of experimental design was the introduction of optimal experimental design. This theory provides asymptotically optimal or efficient designs for estimating a single


unknown parameter for problems in which the relationship between the outcome Y and the independent variables x1, x2, etc. is well understood and easily modeled as a function of a few unknown parameters. While this theory has some limitations in an applied setting, its results can be useful in pointing out targets of efficiency that one should try to approximate and in indicating where one should aim in order to get reasonably good designs.

There are several such limitations. First, since the theory is a large-sample theory, except in the case of regression models, it may only approximate good designs for situations in which limited sample sizes are available. Second, the optimal designs often depend on the value of the unknown parameter. For example, if the reliability r of a device under stress x is given by r(x) = exp(−θx), then the optimal design for estimating the unknown parameter θ consists of stressing a sample of devices with the stress x = 1.6/θ. In these cases, one must rely on some prior knowledge about the unknown parameter or carry out some preliminary, less efficient experiments to “get the lay of the land.” The latter is often good policy if it is feasible and not inconvenient.

Third, the optimality may depend on an assumed model that is incorrect, causing the resulting design to be suboptimal and possibly even noninformative. For example, consider a linear regression for probability of hit Y, which is a linear function of distance x for x in the range 3 to 4; i.e., Y = α + βx + u, where it is desired to estimate the slope β. (Of course, this model makes sense only for a relatively short range of x, since there is the danger of predicting probabilities that are less than 0 or greater than 1.) For each value of x between 3 and 4, one may observe the corresponding value of Y, which depends not only on x but also on the random noise u, which is assumed to have mean 0 and constant variance (independent of x), and is not observed. Then an optimal experiment would consist of selecting half of the x values at 3 and the other half at 4. However, if this model were wrong and a more suitable model for Y as a function of distance were, instead, Y = α + βx + γx² + u, adding a quadratic term, then an optimal design for estimating β would require the use of three values of x, and the above design that is concentrated on two values of x could not be used to estimate this three-parameter model. Note that for this quadratic model, the slope is no longer constant, and β represents the slope at x = 0. This raises the additional question of whether β is the parameter we wish to estimate if the regression is not linear in x. More likely one would want to estimate β + 7γ, the slope at the half-way point. On the other hand, if one were fairly certain that the linear model was an adequate approximation, but were somewhat concerned with the possibility that γ was substantial, and so wanted to be highly efficient for the linear model with some recourse in case the quadratic model was appropriate, then minor variations from the optimal design for the linear model could be used to reveal deviations from the model without affecting the efficiency greatly should the linear model be appropriate.

Finally, in many cases the object of the experiment involves the estimation of more than one unknown parameter. It is rarely possible to design an experiment that is simultaneously maximally efficient for estimating each of these parameters. In such cases, it is necessary to establish an appropriate criterion for measuring how well an experimental design does. Several criteria have been advanced. One possibility is to convert estimates of the parameters to estimates of performance of the equipment for each of several environments likely to be encountered. For each such environment, the estimate would have a variance. One could then determine a design that would minimize an average, over the range of environments, of the variances of estimated performance.
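The fragility of the two-point design under model misspecification, discussed above, can be illustrated numerically. The sketch below is not from the report; it is a minimal Python illustration with assumed values, comparing the slope variance of the two-point design with that of an evenly spread design and showing that the two-point design cannot support a quadratic fit.

import numpy as np

def slope_variance(x, sigma2=1.0):
    """Variance of the least-squares slope estimate for Y = a + b*x + noise."""
    x = np.asarray(x, dtype=float)
    return sigma2 / np.sum((x - x.mean()) ** 2)

n = 10  # hypothetical number of test points
two_point = np.array([3.0] * (n // 2) + [4.0] * (n // 2))   # optimal for the linear model
spread = np.linspace(3.0, 4.0, n)                           # evenly spread over [3, 4]

print("Var(slope), two-point design:", slope_variance(two_point))  # sigma^2 / 2.5
print("Var(slope), spread design:   ", slope_variance(spread))     # larger

# With only two distinct x values, the quadratic model a + b*x + c*x^2 is not estimable:
X_quad = np.column_stack([np.ones(n), two_point, two_point ** 2])
print("rank of quadratic design matrix (two-point):", np.linalg.matrix_rank(X_quad))  # 2 < 3

The two-point design gives the smallest slope variance under the linear model, but its design matrix for the quadratic model has rank 2, so the three parameters of that model are not estimable; a design with at least three distinct x values would be required.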
RESPONSE SURFACE DESIGNS

The choice of control settings is typically the subject of response surface design and analysis.


Response surfaces are simple linear or quadratic functions of independent variables that are used to approximate a more complex relationship between a response and those independent variables. Two types of optimality are studied. In one case, the response is measured by a simple output that is to be maximized as a function of several control variables. This type of study requires the estimation of a surface that is typically quadratic in the relevant neighborhood. The usual 2^n factorial design in which each variable is examined at two levels is inadequate for estimating the necessary quadratic effects for locating the optimal setting. However, a 3^n factorial design may involve too many settings. Composite designs that supplement the 2^n factorial designs with additional points contribute useful information about quadratic effects. In particular, there is a useful class of rotatable designs that are efficient and easy to analyze and comprehend.

In a second kind of problem, there may be several output variables to deal with, with good performance requiring that each of these lie within certain acceptable bounds. In many cases, each of these output variables behaves in a roughly linear fashion as a function of the control variables in the region under discussion. Then a 2^n factorial design may be appropriate for estimating the linear trends necessary to determine control settings that will contribute satisfactory results.

Typically, the goal in many industrial experiments is to identify the important factors that affect one or more responses from among a large set of factors. Highly fractional, typically main-effect plans are used as screening designs to identify the important factors. The high cost of industrial experimentation limits the number of runs; hence fractional designs with factors typically at two levels are used in these experiments. Once a smaller set of important factors has been identified, the response surface can be studied more thoroughly using designs with more than two levels, and process/product performance can be optimized. This is the rationale behind the response surface methodology developed by Box and others (see Box and Draper, 1987). It should be pointed out that most of the industrial applications along the lines of Taguchi focus on product or process development, and so are closer to the application of developmental testing.

In selecting the settings of the controls in a factorial design, an experimenter must use some background information on what to expect. It would be useless to carry out an experiment in which all the values of a factor were too extreme or too similar. Thus the operational tester must depend on information cumulated from previous experience, for example, from developmental testing, at least to establish what constitutes a useful design from which an analyst, who may be willing to discard that previous history, can draw useful conclusions. To the extent that an experimenter depends on an educated intuition about likely outcomes of an experiment or appropriate models, he or she tends to be subjective. This subjective element can never be fully removed from the design of an experiment, and in the minds of many, not even from the analysis of the resulting outcomes.
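A small sketch may help make the composite-design idea concrete. The code below is not from the report; it is a minimal Python illustration, assuming two coded control variables, a hypothetical quadratic response, and a central composite design formed by adding center and axial points to a 2^2 factorial.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Central composite design in two coded factors: 2^2 factorial + center + axial points.
factorial = np.array(list(product([-1.0, 1.0], repeat=2)))
center = np.zeros((3, 2))                        # replicated center points
alpha = np.sqrt(2.0)                             # axial distance giving rotatability for k = 2
axial = np.array([[ alpha, 0], [-alpha, 0], [0,  alpha], [0, -alpha]])
X = np.vstack([factorial, center, axial])

# Hypothetical true response surface (unknown to the experimenter) plus noise.
def true_response(x1, x2):
    return 10 + 2*x1 + 1*x2 - 1.5*x1**2 - 1.0*x2**2 + 0.5*x1*x2

y = true_response(X[:, 0], X[:, 1]) + rng.normal(0, 0.1, len(X))

# Fit the full quadratic model by least squares.
D = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1],
                     X[:, 0]**2, X[:, 1]**2, X[:, 0]*X[:, 1]])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
b0, b1, b2, b11, b22, b12 = coef

# Stationary point of the fitted surface (candidate optimal control setting).
B = np.array([[2*b11, b12], [b12, 2*b22]])
x_star = np.linalg.solve(B, -np.array([b1, b2]))
print("estimated optimal coded settings:", x_star)

The center and axial points are what allow the pure quadratic terms to be estimated; the 2^2 factorial alone could not separate them.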
BAYESIAN AND SEQUENTIAL EXPERIMENTAL DESIGNS

The theory of Bayesian inference deals with the formalism of subjective beliefs by assigning prior probabilities to such beliefs. With care, this theory can be used productively in both the design and analysis of experiments. One advantage of such a theory is that it forces users to lay out their assumptions explicitly. It also provides a convenient way of expressing the effect of the experiment in the user's posterior probability. Care is required, for priors that seemingly express general ignorance about a parameter sometimes assume much more information than the user thinks.

During World War II, a theory of sequential analysis was developed in connection with weapons testing. According to this theory, there is no point in proceeding further with expensive tests if the results of the first few trials are overwhelming. For example, if a fixed-sample-size test might reject a


weapon that failed in 3 out of 10 trials, it would make sense to stop testing and reject if the weapon failed on the first two trials. This theory led to tests that were as effective as previous fixed-sample procedures, with considerable savings in the cost of experimentation.

Although the initial theory confined attention to experiments in which identical trials were repeated, the concept is naturally extended to sequential experimentation. Here, after each trial or experiment, the analyst-designer can decide whether to stop experimentation and make a terminal decision, or continue experimentation. If the decision is to continue, the analyst-designer can then elect which of the alternative trials or experiments to carry out next. Two-stage experiments, in which a preliminary experiment is devoted to gaining useful information for the design of a final stage, are special cases of sequential experimentation.

Finally, two active areas of research in experimental design (not specifically Bayesian or sequential) are the use of screening experiments, in which one wishes to discover which of many possible factors has a substantial effect on some response, and designs for testing computer software. The panel is interested in pursuing the application of these two new areas of research as they relate to operational testing.
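Returning to the early-stopping idea above, the savings can be illustrated with a curtailed version of a fixed-sample test. The sketch below is not from the report; it assumes, purely for illustration, a rule that rejects a system with three or more failures in ten trials, and it simulates how many trials are needed on average when testing stops as soon as the decision is determined.

import numpy as np

rng = np.random.default_rng(42)

def curtailed_trials(p_fail, n=10, reject_at=3, n_sims=100_000):
    """Average number of trials used when the 'reject at >= reject_at failures
    out of n' decision is announced as soon as it is determined."""
    fails = rng.random((n_sims, n)) < p_fail
    cum_fail = np.cumsum(fails, axis=1)
    cum_succ = np.cumsum(~fails, axis=1)
    # The decision is determined once failures reach reject_at (reject)
    # or successes reach n - reject_at + 1 (accept).
    done = (cum_fail >= reject_at) | (cum_succ >= n - reject_at + 1)
    stop_index = done.argmax(axis=1) + 1   # first trial at which the decision is known
    return stop_index.mean()

for p in (0.05, 0.2, 0.5):
    print(f"failure prob {p}: about {curtailed_trials(p):.1f} trials on average instead of 10")

The curtailed rule reaches exactly the same accept/reject decision as the fixed-sample rule, but with fewer trials on average, which is the point of the sequential approach described above.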


APPENDIX C

Selecting a Small Number of Operational Test Environments

In view of the reduction in the potential military threat to Western Europe and the increased prominence of armed conflict in geographic areas as varied as Somalia, Haiti, and the Persian Gulf, the defense community has grown more interested in the performance of military equipment in a greater variety of possible operating environments. However, this increased level of interest has not caused a commensurate increase in the number of prototypes budgeted for testing of new defense systems under development. Rather, constraints on test resources have more likely tightened. Thus, a fundamental problem faced in operational testing of defense systems is how to maximize the information gained from a small number of tests in order to assess the suitability and effectiveness of the system in a wide variety of environments.

Of course, even before this new attention to diverse operating environments, the number of test articles has often been quite small because of high unit cost. Also, although one environment (i.e., Europe) was of primary strategic interest, the design of operational tests has always been complicated by the need to consider such factors as time of day, weather conditions, type of attack, and use of countermeasures. Therefore, the problem of how to allocate scarce test resources is not new, but has merely grown more complex.

Before proceeding, we note that our consideration of this problem has potential implications for other panel activities. Our discussion and the example below emphasize considerations for system hardware, but this design problem is also relevant to software testing. Software can be subjected to thousands of test scenarios; nevertheless, the set of possible applications is often much larger than what can be executed in a test with limited time. Further, possible solutions to this problem involve the use of statistical models for extrapolating to untested environments or scenarios, which suggests, in turn, the use of simulation methods to help in that extrapolation. The combination of information from field tests and simulations is not addressed in this interim report, but is expected to be addressed in the panel's final report.

Because this problem was suggested to the panel for study by Henry Dubin, Technical Director of the Army Operational Test and Evaluation Command, we refer to it as Dubin's challenge. Responding to this challenge might conceivably lead to the development of new statistical theory and methods that


are applicable beyond the defense acquisition process. The field of statistics has often been enriched by cross-disciplinary investigations of this nature. Similarly, the defense testing community might gain valuable insights by approaching their duties from a more formal statistical perspective—for example, by considering the concepts and tools of experimental design and their implications for current practice. In this spirit, we present the ideas in this appendix to members of both communities.

DUBIN'S CHALLENGE AS A SAMPLING PROBLEM

Dubin's challenge can be viewed as posing the following question: How should one allocate a given number, p, of test units among m different scenarios (with m and p being close in size)? Considering the different scenarios as strata in a population, we know from the sample survey literature that the best allocation scheme depends on (1) the goal or criterion of interest, (2) the variability within each stratum, and (3) the sampling costs (see, e.g., Cochran, 1977, ch. 5).

Suppose the goal is to estimate Y, an important measure of performance, in an individual environment or scenario of greatest probability or importance. Then the best thing to do, assuming that no strong model relates performance to environmental characteristics, is to allocate all p units to that scenario. The variance of the estimate is then σ²/p. (Note that focusing on a single response, Y, is a major simplification, since most military systems have dozens of critical measures of importance. Clearly, if different Y's have different relationships to the factors that underlie the scenarios, designs that are effective for some responses may be ineffective for others.)

On the other hand, if the goal is to estimate the average performance across all different scenarios, possibly weighted by some probabilities of occurrence, one can work out the optimal allocation scheme as a function of the costs and variances (which would have to be estimated either subjectively, through developmental testing, or through pilot operational testing). For a simple example, if the weights, variances, and costs happen to be all equal for different scenarios, then equal allocation is optimal, and one has the same degree of precision for the estimate of average performance as for the single scenario, i.e., σ²/p. Therefore, nothing is really lost in terms of our overall criterion. However, one cannot estimate Y well in each individual scenario. The variance of those estimates is m(σ²/p).

The question of interest can thus be reexpressed as when or how can we obtain better performance estimates for the individual scenarios. If the m different scenarios are intrinsically quite different, one has essentially m different subpopulations, and there is no way one can pool information to gain efficiency. It is only when there is some structure between the scenarios that one can try to pool the information from them. As a simple example, suppose m = p = 8, and the eight different scenarios can be viewed as the eight possible combinations of three factors each at two levels. Suppose also it is reasonable to assume that the interactions among these three factors are all zero. Then, even though the mean Y's for the eight scenarios are all different, one can still estimate each with a variance of σ²/4 rather than σ², which would be the case were the interactions nonzero. Thus, the structure buys us additional efficiency. Fractional factorial experiments represent another statistical approach to this design problem.
Again, if one can assume that there are no interactions, one can study more scenarios with the same number of test points by using such designs (see also the discussion in Appendix B). Unfortunately, the interactions among factors are likely to be important in many situations involving operational testing. Also, many of the factor levels will be nested: for example, humidity will vary with temperature, which in turn will depend on time of day. As a result, classical factorial experimentation might be of limited utility for operational testing. Another possible approach involves testing at scenarios that differ from a central scenario with


respect to only one factor of interest. These one-factor-at-a-time perturbations seem to be commonly, if informally, applied in current practice and reflect an attempt to balance primary interest in assessing system performance in one particular operating environment with secondary interest in collecting information about performance in other environments (see Chapter 2 for further discussion of this approach).

In summary, operational testing typically involves a complicated structure of environments that are functions of factors with interactions, nestings, and related complexities. If one is simultaneously interested in estimating system performance in each environment or scenario and wants to obtain more precise estimates, then it is natural to examine the structure underlying the scenarios and to attempt to model this structure.

AN APPROACH TO MODELING THE SCENARIO STRUCTURE

We next describe a promising approach to modeling the scenario structure through use of an example. While the example is artificial, we believe it is sufficiently realistic to permit discussion of the important ideas involved. We note that the panel's deliberations on this subject are ongoing, and we anticipate further discussion of Dubin's challenge in our final report.

Consider an electric generator, say, that may be required to function in many environments. Assume that one is interested in measuring its time to failure. We list eight possible scenarios or environments (m in the discussion above) and characterize these by using 18 environmental variables or factors. For a given environment, each factor is assigned a numerical value between 0 and 10 as a measure of the perceived level of stress that the factor places on the generator's reliability in that environment. Large values indicate a perceived high level of stress. (In a real application, the values for these 18 stress variables could be determined by subjective assessments, and some interaction between the developer and testing personnel would be required to decide how much more stressful, say, an average temperature of 80° is than an average temperature of 70°.)

Our stress matrix, an 18 × 8 matrix, is listed in Table C-1, followed by a brief description of the rows (variables) and columns (environments). The entry in the (i,j)th cell of the matrix, aij, denotes the level of stress that the ith factor places on the generator in the jth environment. The degree to which two different environments are similar can be assessed using a standard measure of distance or dissimilarity, D = ||d(j, j′)||, such as the Euclidean distance

d(j, j′) = [ Σi (aij − aij′)² ]^(1/2),

where the sum runs over the 18 stress variables i.
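As an illustration of this computation, the following sketch (not part of the report) reproduces the Euclidean distance matrix of Table C-2 from the stress values of Table C-1 in Python; the array is a transcription of Table C-1.

import numpy as np

# Stress matrix A: rows are the 18 stress variables, columns the 8 environments (Table C-1).
A = np.array([
    [9, 8, 1, 1, 5, 6, 4, 6],
    [2, 1, 9, 8, 5, 6, 7, 5],
    [4, 2, 2, 3, 5, 7, 8, 5],
    [9, 1, 8, 8, 5, 6, 7, 5],
    [1, 9, 2, 2, 5, 4, 3, 6],
    [3, 3, 3, 3, 5, 7, 7, 6],
    [7, 3, 1, 1, 5, 6, 4, 6],
    [3, 5, 1, 1, 5, 7, 5, 6],
    [7, 4, 7, 9, 5, 7, 8, 4],
    [7, 5, 8, 10, 5, 7, 8, 5],
    [2, 2, 2, 10, 2, 6, 8, 3],
    [8, 5, 5, 5, 5, 5, 4, 3],
    [2, 5, 2, 2, 5, 6, 7, 7],
    [3, 4, 3, 3, 5, 6, 6, 8],
    [8, 3, 3, 7, 8, 6, 3, 9],
    [8, 2, 5, 8, 8, 7, 7, 9],
    [7, 5, 4, 4, 5, 5, 3, 8],
    [7, 4, 7, 7, 5, 4, 6, 4],
], dtype=float)

# Euclidean distance between environments j and j' over the 18 stress variables.
diff = A[:, :, None] - A[:, None, :]          # shape (18, 8, 8)
D = np.sqrt((diff ** 2).sum(axis=0))          # 8 x 8 distance matrix, as in Table C-2
print(np.round(D, 2))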

Table C-2 presents the resulting distance matrix. Under the constraints on sample size often encountered in operational testing, the tasks of optimally selecting environments for testing and, subsequently, estimating performance in all environments are extremely difficult unless the environments can be characterized as functions of a very small number of core factors. If the number of factors can be reduced, then one may be able to use statistical modeling to infer system performance for combinations of these core factors that were not included in the tested scenarios.


TABLE C-1 Stresses in Various Environments

                          Environment
Stress Variable     1    2    3    4    5    6    7    8
 1                  9    8    1    1    5    6    4    6
 2                  2    1    9    8    5    6    7    5
 3                  4    2    2    3    5    7    8    5
 4                  9    1    8    8    5    6    7    5
 5                  1    9    2    2    5    4    3    6
 6                  3    3    3    3    5    7    7    6
 7                  7    3    1    1    5    6    4    6
 8                  3    5    1    1    5    7    5    6
 9                  7    4    7    9    5    7    8    4
10                  7    5    8   10    5    7    8    5
11                  2    2    2   10    2    6    8    3
12                  8    5    5    5    5    5    4    3
13                  2    5    2    2    5    6    7    7
14                  3    4    3    3    5    6    6    8
15                  8    3    3    7    8    6    3    9
16                  8    2    5    8    8    7    7    9
17                  7    5    4    4    5    5    3    8
18                  7    4    7    7    5    4    6    4

NOTE: In this fictitious example of a generator under test in many environments, the rows represent different features or stresses of the environment and the columns represent the environments.


TABLE C-2 Distance Matrix D

               Environment
Environment      1      2      3      4      5      6      7      8
     1        0.00  16.12  14.56  15.46  10.39  12.37  15.30  13.42
     2       16.12   0.00  16.67  20.07  11.75  14.18  16.73  13.86
     3       14.56  16.67   0.00   9.95  12.65  14.46  13.04  16.85
     4       15.46  20.07   9.95   0.00  14.39  14.00  11.87  17.46
     5       10.39  11.75  12.65  14.39   0.00   7.00  10.86   6.00
     6       12.37  14.18  14.46  14.00   7.00   0.00   6.40   8.06
     7       15.30  16.73  13.04  11.87  10.86   6.40   0.00  12.65
     8       13.42  13.86  16.85  17.46   6.00   8.06  12.65   0.00

NOTE: See Table C-1 and text for data and discussion.

More concretely, for our example, we have 8 environments or scenarios of interest, each characterized by the set of 18 stress variables. We would like to test fewer than 8 scenarios and use the test information—expressed in terms of the 18 common variables—to extrapolate from the scenarios tested to those not tested. Therefore, we seek a representation in which the 18 variables are “noisy” measures of an underlying low-dimensional “scenario-type” variable—e.g., involving some combination of factors, such as temperature-altitude-humidity, dust-demand, and fuel-service—so that one can test a small number of representative scenarios and extrapolate the results to the remaining scenarios. Another advantage of identifying a small number of core factors will be easier communication of test results to a general (i.e., nonstatistical) audience.

Therefore, we need a technique to assist us in determining whether the many factors describing the stresses of the environments can be briefly summarized by a very small number of core factors. In the generator example, we used a procedure called multidimensional scaling,1 suggested by Kruskal (1964a, 1964b), to attempt to determine whether such core factors exist. This procedure solves the problem of representing n objects geometrically by n points in k-dimensional Euclidean space, so that the interpoint distances correspond in some sense (monotonically) to experimental dissimilarities between objects. In our application, multidimensional scaling with k = 2 locates eight points in 2-dimensional space such that the distances between pairs of these points in the 2-dimensional space are monotonically related to the distances (given in Table C-2) between corresponding pairs of points (environments) in the original 18-dimensional space. Note that the 2-dimensional space may or may not be easy to interpret (see further discussion of this point in the next section). The results of the multidimensional scaling are presented in the first three columns of Table C-3.

We point out that entries in the distance matrix D may not be as easily obtained as in our artificial example and may depend partly on subjective intuitions of the users. Also, there may be no natural dimension for the reduced environment space; instead, one could consider different values of k to find the lowest dimension in which the Euclidean distances between points are still consistent (monotonically) with values in the D matrix. After a mapping of the environments into k-dimensional space has been selected, optimal selection of environments for testing can proceed without regard to the size of k. At this stage of the analysis, other relevant factors can be incorporated. For example, the last three

1 The S-Plus procedure cmdscale was used.
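For readers who wish to reproduce this step without S-Plus, classical (metric) multidimensional scaling can be written in a few lines. The sketch below is not from the report; it assumes the 8 × 8 distance matrix D of Table C-2 is available as a NumPy array (for example, computed as in the earlier sketch) and uses the standard double-centering construction. Coordinates obtained this way are comparable to Table C-3 only up to rotation, reflection, and differences among scaling algorithms.

import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) multidimensional scaling of a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:k]         # keep the largest k eigenvalues
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

# Example usage (D computed from Table C-1 as in the earlier sketch):
# Z = classical_mds(D, k=2)   # 8 x 2 coordinates for the eight environments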


TABLE C-3 Results of Multidimensional Scaling

Environment (j)     zj1      zj2    Strategic Weight (sj)   Frequency Weight (fj)   Overall Weight (wj)
      1           −0.89    −2.47             3                       5                      15
      2           −7.88    −6.95             2                       3                       6
      3            7.03    −5.25             1                       1                       1
      4           10.06    −0.08             1                       1                       1
      5           −3.22     0.52             5                       5                      25
      6           −1.81     4.90             3                       5                      15
      7            2.80     5.40             1                       2                       2
      8           −6.08     3.93             5                       5                      25

NOTE: The second and third columns, zj = (zj1, zj2), are the coordinates in the reduced 2-dimensional environment space. The next three columns contain the weights representing the strategic importance and the prior expected frequency of the environment, and the overall weight (product of the strategic and frequency weights). The prior frequencies are not normalized to sum to 1. See text for discussion.

columns of Table C-3 contain a strategic weight (sj) indicating the importance of success in the jth environment; a frequency weight (fj) proportional to the prior probability of deployment in the jth environment; and their product, wj, a final weight. (For new military systems in development, the strategic importance of success in a particular operating environment and the probability of deployment in that environment can typically be found in the Operational Mode Summary/Mission Profile.)

In the generator example, we assume that resource constraints permit testing in only two environments. Therefore, we wish to select two points x1 = (x11, x12) and x2 = (x21, x22) in the 2-dimensional space to represent testing environments that optimize some appropriate criterion. The idea is to select two experiments that provide the maximum amount of information for the complete set of eight environments. For this example, let us use as our objective that of maximizing:

V = min { [I(x1, zj) + I(x2, zj)] / wj : j = 1, ..., 8 }

where wj is the weight at the jth environment at zj = (zj1, zj2), and I(x,zj) is the information that an experiment at x contributes to knowledge about performance at the jth environment. The wj in the denominator encourages the selection of test points that are more informative for environments considered to be more probable and/or more important. With introspection and possibly with a reasonable statistical model, one could conceivably construct an information matrix I(x,z) representing the information from an experiment at x for use at z. The lower-dimensional representation should be helpful in describing how much information an experiment in one environment gives to the user who is interested in another environment. Subjective assessments may also be feasible. In the generator example, an expert might be asked, say, how much information can be obtained from an experiment in Saudi Arabia for use in a temperate urban environment and vice versa. Presumably, the closer x is to z, the greater the value of I(x,z).2 In this example, we have assumed

2 Actually, that need not be the case, when one considers the potential advantages of accelerated stress testing.


TABLE C-4 Optimizing Points and Values of V for Different Values of b

b        x11      x12      x21      x22      V
0.05   −3.220    0.520   −6.080    3.930   7.2 × 10^−2
0.10   −3.220    0.520   −6.080    3.930   6.6 × 10^−2
0.15   −3.041    0.289   −6.024    3.856   5.9 × 10^−2
0.20   −2.884   −0.326   −5.832    3.654   5.1 × 10^−2
0.25   −3.866   −1.344   −5.097    3.638   4.1 × 10^−2
0.30   −2.741   −2.804   −4.828    3.172   3.0 × 10^−2
0.35   −1.375   −4.206   −4.642    2.195   2.0 × 10^−2
0.40   −0.865   −3.852   −4.577    1.244   1.3 × 10^−2
0.45   −0.781   −1.453   −5.323    0.120   8.2 × 10^−3
0.50   −0.172   −0.115   −6.735   −0.133   6.2 × 10^−3
0.55    0.305    1.280   −6.795   −0.369   4.5 × 10^−3
0.60    0.653    1.448   −6.784   −0.454   3.3 × 10^−3
0.70    1.221    1.667   −6.810   −0.574   1.8 × 10^−3
0.80    1.588    1.420   −6.839   −0.665   1.0 × 10^−3
0.90    1.927    1.271   −6.847   −0.747   5.8 × 10^−4
1.00    2.116    1.106   −6.421   −0.881   3.3 × 10^−4
1.25    2.369    0.957   −5.249   −1.196   6.1 × 10^−5
1.50    2.525    0.863   −4.898   −1.289   1.1 × 10^−5
2.00    2.721    0.748   −4.759   −1.260   3.9 × 10^−7
3.00    2.919    0.632   −4.789   −1.142   0.0 +
4.00    3.020    0.573   −4.813   −1.080   0.0 +

NOTE: There is a discontinuity that takes place around b = 0.41 as the result of a bifurcation whereby one local maximum changes from a local to a global maximum; this may not be the only place where this happens. It also seems quite clear that not all the maxima here are global maxima. The parameter b is a measure of how dissimilar two experiments at a fixed distance are; x11 and x21 are the first coordinates of the two test experiments corresponding to various values of b, and x12 and x22 are the second coordinates. V measures the minimum “information” provided to the 8 environments by the two test environments.

that I(x,z) is a function of the distance from the representations of x and z in the 2-dimensional space on which the environments have been mapped. We have assumed that I is a decreasing function of the distance and, in particular, that it can be represented by exp(−b|x − z|), where b is a parameter to be selected. That choice could be made by fitting this function to preliminary subjective assessments. The optimizing points x1 and x2 and the corresponding value of V depend on the choice of b. Table C-4, based on numerical calculations, represents the dependence on b of the optimizing points. It is not clear from this table how one should choose b, but a few cases could be tried to see whether the answers provided are reasonable.

ISSUES AND ALTERNATIVES

This section reviews the above example step by step so that we can raise some debatable issues and discuss alternatives to the methods proposed. The first step in the analysis was the construction of a stress matrix A. In creating this stress matrix, one runs the substantial risk of ignoring the possibility of the illuminating surprises that accompany operational testing, i.e., failure modes that are much more typical of operational than developmental testing. These surprises are not incorporated in this model. To construct A, one must employ enough expertise to imagine the various features or variables of the many environments that might affect the effectiveness or suitability of the system under test. Note also


that, in this example, two of the variables are hot and cold. It might initially seem strange to list these as separate variables, but the stresses imposed by extremes of heat can be regarded as distinct in nature from those imposed by extremes of cold or, for that matter, extreme shifts between cold and hot.

Implicit in the quantification of stress is the notion that A can be used to generate a measure of distance or dissimilarity between pairs of environments. The measure of distance used here to generate the matrix D, Euclidean distance, is naive. It might be that the expert could bypass A and go directly to D. Otherwise, the expert might find some reasonable alternative to our definition of D. Implicitly, the definition used here weights each variable equally. If some of the stress variables were highly correlated because they tended to measure the same underlying factor, our measure D could effectively give this underlying factor more weight than other equally important factors. We can compensate for that phenomenon, if it is understood, by replacing the distance with some other metric. (In addition, the unit of measurement for some variable may be such that its scale is not comparable to those of other variables, incorrectly causing the Euclidean distance to emphasize distances in that dimension. This possibility has been generally avoided here by measuring stresses on a common scale between 0 and 10.)

With our measure of dissimilarity, we are effectively measuring distances of eight points in an 18-dimensional Euclidean space. Each environment is represented by one of these points. (Other measures of dissimilarity may not even be mappable into points in such a Euclidean space.) It is difficult to comprehend any analysis involving such high dimensionality. Besides the approach indicated above, a number of techniques have appeared in the statistical literature to cope with representing high-dimensional phenomena in terms of a low-dimensional Euclidean space. These methods have various names and are considered to be variations of factor analysis. Principal components is one such alternative.

We fixed the dimension of the reduced environment space at k = 2. In practice, multidimensional scaling methods suggest a particular dimension k by minimizing a so-called “stress” criterion (different from our previous use of the term “stress”)—a type of goodness-of-fit criterion—which can be used to measure the quality of the approximation of the high-dimensional space by the lower-dimensional one. In our example, we have not examined the situation for k = 3 or higher to see if there would be an improvement as measured by this stress criterion.

After the points have been mapped into a low-dimensional space, the analyst often tries to label certain directions in the k-dimensional space as measuring certain underlying factors that, in some sense, are combinations of the original factors. Interpretation of the lower-dimensional space can be nontrivial but very useful; the ability to label certain directions with intuitive interpretations, if possible, will facilitate communication with a decision maker who is reluctant to decide on the basis of the output of a black box or a mysterious algorithm. Returning to our example, in Figure C-1 we have plotted the 8 rescaled points in the 2-dimensional plane and labeled the points by the environment number (1 through 8).
Moving diagonally from the upper left to lower right corner seems to indicate movement from a temperate to an intemperate environment. Moving from left to right seems to indicate passage from more humid to drier environments. These characteristics are not, by any measure, a complete description of the environments (and, given the dimensions, they could not be); it is helpful to retain the original labels to identify the actual environments.

The key step in the analysis was to construct a criterion (V above) for a good design—i.e., a choice of several experimental environments x1, ..., xp. In this example, we selected p = 2 experimental environments. (The number of test environments does not have to equal the dimensionality of the space on which the environments were mapped.) Each xi was weighted equally in the expression of V that


calculated the information about the jth environment zj. In practice, one might have to spend more assets or money on one test than another and, in that case, one could use a (further) weighted sum that would take into account how many assets were used in testing at each point xi. Given that two equally weighted test points will be selected, the maximin approach suggested here is to select the two test points that maximize the minimum information provided for each of the eight listed environments z1, ..., z8. That is, the information about zj is cumulated over the test points and divided by the corresponding weight wj at zj. The overall weight wj incorporates the frequency of occurrence and the strategic importance of the jth environment.

One potentially serious problem with this multidimensional scaling approach is that it assumes one can carry out an experiment in an environment corresponding to any point in the reduced 2-dimensional space. In reality, much of this space may not be available for testing. (For an obvious example, there are no simultaneously very hot and very cold environments.) It may therefore be difficult to construct an environment that would be mapped into a given point in the 2-dimensional space. Or there may be too many different real environments that would correspond to the same point in the 2-dimensional space.

FIGURE C-1 Optimal test environments as a function of b. NOTE: For a discussion of b and of the axes and for definitions, see text.

Table C-4 shows the impact of changing the parameter b. When b is small, the effect of distance is slight, and it is important to make sure that the vital or highly weighted points get maximum information. Thus, the optimal design puts the two x values at the two most highly weighted environments.


When b is large, information diminishes greatly with distance from the testing point. In this case, it is necessary to move the x values to some compromise positions that provide at least some minimal information for estimating the performance in less important environments.

The impact of the choice of b is also evident in Figure C-1. Here, the position of one optimal experiment is indicated by a “+” and the second by a “−”. The environments with the highest weights, wj, from Table C-3 are 5 and 8. For very small values of b, all environments are comparable; therefore, the two highest-weighted environments are the logical positions at which to test. This result is indicated by a “+” in the circle labeled 5 and a “−” in the circle labeled 8. As b is increased, the environments become less comparable, and the algorithm correctly selects experimental environments that are compromises of all eight environments of interest. Environments 5 and 8 are still favored, but the optimal points are now true compromises of all eight points in this 2-dimensional space. For the largest value of b, the “+” experiment is a compromise of environments 7, 4, and 3, and the “−” experiment is a compromise of environments 8, 6, 5, 1, and 2.

While this behavior may seem sensible, it depends heavily on buying into the criterion proposed. Both the exponential decline and the maximin aspects should be questioned and alternatives considered. The present formulation also ignores the possibility of replacing the single-valued information function I(x,z) by a higher-dimensional measure such as an information matrix. At the example's current level of simplicity, it seems premature to consider this latter extension.

Undoubtedly, as these alternatives and other approaches are examined, modifications of these ideas will be suggested, and useful approaches will become clear. However, because the performance of a particular statistical method may depend greatly on the structure of the individual problem, it will be essential to apply these methods to various real problems in defense testing in order to gauge their true utility.
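To make the computation behind Table C-4 concrete, the following sketch (not part of the report) performs a crude grid search for the pair of test points maximizing the maximin criterion, assuming I(x, z) = exp(−b|x − z|), the coordinates and weights of Table C-3, and the form of V described earlier. A finer grid or a continuous optimizer would be needed to approach the tabulated optima closely; this is an illustration only, not the panel's original computation.

import numpy as np
from itertools import combinations

# Coordinates zj and overall weights wj from Table C-3.
Z = np.array([[-0.89, -2.47], [-7.88, -6.95], [7.03, -5.25], [10.06, -0.08],
              [-3.22,  0.52], [-1.81,  4.90], [2.80,  5.40], [-6.08,  3.93]])
w = np.array([15, 6, 1, 1, 25, 15, 2, 25], dtype=float)

def V(x1, x2, b):
    """Maximin criterion: minimum over environments of pooled information / weight."""
    info = np.exp(-b * np.linalg.norm(x1 - Z, axis=1)) \
         + np.exp(-b * np.linalg.norm(x2 - Z, axis=1))
    return np.min(info / w)

def best_pair(b, step=1.0):
    """Crude grid search over pairs of candidate test points in the 2-d space."""
    g1 = np.arange(-8.0, 10.5, step)
    g2 = np.arange(-7.0, 6.0, step)
    candidates = [np.array([a, c]) for a in g1 for c in g2]
    return max(combinations(candidates, 2), key=lambda pair: V(pair[0], pair[1], b))

x1, x2 = best_pair(b=0.5)
print("approximate optimizing points for b = 0.5:", x1, x2)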


APPENDIX D

Individuals Consulted

SPONSORS

Philip Coyle, Director, Operational Test and Evaluation
Ernest Seglie, Office of the Director, Operational Test and Evaluation

GENERAL

Captain Suzanne Beers, Air Force Operational Test and Evaluation Center
Robert Bell, Marine Corps Operational Test and Evaluation Activity
Henry Dubin, Operational Test and Evaluation Center
James Duff, Operational Test and Evaluation Force
Donald Gaver, Naval Post-Graduate School
Ric Sylvester, Office of the Deputy Undersecretary of Defense for Acquisition Reform
Marion Williams, Air Force Operational Test and Evaluation Center

MODELING AND SIMULATION

Peter Brooks, Institute for Defense Analyses
Will Brooks, Army Materiel Systems Analysis Activity
William Buchanan, Institute for Defense Analyses
Gary Comfort, Institute for Defense Analyses
Angie Crawford, Air Force Operational Test and Evaluation Center
Robert Dighton, Institute for Defense Analyses
Dick Fejfar, Institute for Defense Analyses
Christine Fossett, General Accounting Office
Sam Frost, Army Materiel Systems Analysis Activity
Terry Hines, Defense Modeling and Simulation Office


Major Brian Ishihara, Air Force Test and Evaluation Plans and Programs
Irwin Kaufman, Institute for Defense Analyses
Dwayne Nuzman, Army Materiel Systems Analysis Activity
Dale Pace, Johns Hopkins University
Charles Pate, Training and Doctrine Command
Patricia Sanders, Office of the Undersecretary of Defense for Acquisition and Technology, Director, Test Systems Engineering and Evaluation
Brian Simes, Air Force Operational Test and Evaluation Center
Brad Thayer, Institute for Defense Analyses
Charles Walters, The MITRE Corporation
Susan Wright, Army Digitization Office
Bill Yeakel, Army Materiel Systems Analysis Activity

RELIABILITY, AVAILABILITY, AND MAINTAINABILITY

Captain David Blanks, Air Force Operational Test and Evaluation Center
Captain David Crean, Air Force Operational Test and Evaluation Center
Paul Ellner, Army Materiel Systems Analysis Activity
Captain Tim Gooley, Air Force Operational Test and Evaluation Center
Cy Lorber, Army Materiel Command
Mike Malone, Air Force Operational Test and Evaluation Center
Captain Terence Mitchell, Air Force Operational Test and Evaluation Center
Paul Mullen, Army Materiel Systems Analysis Activity
Ken Murphy, Air Force Operational Test and Evaluation Center
Lieutenant Ronald Reano, Air Force Operational Test and Evaluation Center
Nozer Singpurwalla, George Washington University
Jim Streilein, Army Materiel Systems Analysis Activity
Lt. Colonel Larry Wolfe, Air Force Operational Test and Evaluation Center

SOFTWARE

Lt. Commander Tom Beltz, Navy Operational Test and Evaluation Force
Henry Betz, Army Materiel Systems Analysis Activity
Lt. Colonel Hebert, Air Force Operational Test and Evaluation Center
Austin Huangfu, Office of the Director, Operational Test and Evaluation
Linda Kimball, Army Materiel Systems Analysis Activity
Scott Lucero, Army Operational Test and Evaluation Command
Lt. Colonel John Manning, Operational Evaluation Command
Margaret Myers, Office of the Secretary of Defense
Tom Nolan, Army Materiel Systems Analysis Activity
Ray Paul, Office of the Secretary of Defense
Major Frederick Thornton, Marine Corps Operational Test and Evaluation Analysis
Scott Weisgerber, Air Force Operational Test and Evaluation Center
Steven Whitehead, Navy Operational Test and Evaluation Force
Lieutenant Cynthia Womble, Navy Operational Test and Evaluation Force


EXPERIMENTAL DESIGN
Art Fries, Institute for Defense Analyses
Major Michael Hall, Army Operational Evaluation Command
Kent Haspert, Institute for Defense Analyses
Anil Joglekar, Institute for Defense Analyses
Captain Greg Kokoskie, Army Operational Evaluation Command
Larry Leiby, Army Operational Test and Evaluation Command
John McVey, Army Operational Test and Evaluation Command
Major Ed Miller, Army Operational Evaluation Command
Harold Pasini, Army Operational Evaluation Command
Hank Romberg, Army Operational Test and Evaluation Command
Patrick Sul, Army Operational Evaluation Command
Tom Zeberlein, Army Operational Evaluation Command

ORGANIZATIONAL CONTEXT
Charles Adolph, Science Applications International Corporation
Al Burge, Office of the Secretary of Defense, Developmental Test and Evaluation/Modeling and Simulation Software Evaluation
Thomas Christie, Institute for Defense Analyses
David Chu, The RAND Corporation
Mark Forman, Senate Governmental Affairs
Jackie Guin, General Accounting Office
Walter Hollis, Deputy Under Secretary of the Army for Operations Research
Lt. General Howard Leaf, Director of Test and Evaluation, U.S. Air Force
Louis Rodrigues, General Accounting Office
Steve Ronnel, Senate Armed Services Committee
Donald Yockey, formerly Under Secretary of Defense for Acquisition

APPENDIX E

DoD and the Army Test and Evaluation Organization

This appendix consists of two charts: one showing the organization of the DoD test and evaluation community, and the other showing the organization of test and evaluation within the Army.

FIGURE E-1 DoD Test and Evaluation Community.

FIGURE E-2 Army Test and Evaluation Organization Chart.

References

Bankes, Steven 1993 Exploratory modeling for policy analysis. Operations Research 41(3):435-449.
Bose, R.C., and K.A. Bush 1952 Orthogonal arrays. Annals of Mathematical Statistics 23:508-524.
Box, G.E.P., and N.R. Draper 1987 Empirical Model-Building and Response Surfaces. New York: Wiley.
Box, G.E.P., and J.S. Hunter 1961 The 2^(k−p) fractional factorial designs: Part I. Technometrics 3(3):311-350.
Citro, C., and E. Hanushek, eds. 1991 Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling. Two volumes. Panel to Evaluate Microsimulation Models for Social Welfare Programs, Committee on National Statistics, National Research Council. Washington, D.C.: National Academy Press.
Cochran, W.G. 1977 Sampling Techniques. 3rd edition. New York: Wiley.
Conover, W.J., and R.L. Iman 1980 Small sample sensitivity analysis techniques for computer models, with an application to risk assessment. Communications in Statistics, Theory and Methods A9(17):1749-1842.
Daniel, C. 1976 Applications of Statistics to Industrial Experimentation. New York: Wiley.
Defense Science Board 1989 Improving Test and Evaluation Effectiveness. Task force report.
Defense Systems Management College 1994 Systems Acquisition Manager's Guide for the Use of Models and Simulations. Report of the DSMC 1993-1994 Military Fellows. Fort Belvoir, Va.: Defense Systems Management College.
Doctor, P. 1989 Sensitivity and uncertainty analysis for performance assessment modeling. Engineering Geology 26:411-429.
Finney, D.J. 1945 The fractional replication of factorial arrangements. Annals of Eugenics 12:291-301.
Food and Drug Administration 1987 Guideline on General Principles of Process Validation. Prepared by the Center for Drug Evaluation and Research, the Center for Biologics Evaluation and Research, and the Center for Devices and Radiological Health. Rockville, Md.: U.S. Department of Health and Human Services.
Friedman, J.H. 1991 Multivariate adaptive regression splines. Annals of Statistics 19:1-141.
Fries, A. 1994 Design of experiments in operational test and evaluation: Where should we go next? ITEA Journal 14(4):20-33.
Hahn, G. 1984 Experimental design in the complex world. Technometrics 26(1):19-31.
Hill, H.M. 1960 Experimental designs to adjust for time trends. Technometrics 2:67-82.
Hodges, J. 1987 Policy, models and uncertainty. Statistical Science 2:259-291.
Hodges, J.S., and J.A. Dewar 1992 Is It You or Your Model Talking?: A Framework for Model Validation. R-4114-A/AF/OSD. Santa Monica, Calif.: RAND.
Iman, R., and W. Conover 1982 A distribution free approach to inducing rank correlation among input variables. Communications in Statistics, Simulation and Computation B11(3):311-334.
Kruskal, J.B. 1964a Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika 29:1-29.
Kruskal, J.B. 1964b Nonmetric multidimensional scaling: A numerical method. Psychometrika 29:115-129.
Lese, W. 1992 Cost and Operational Effectiveness Analyses (COEAs) and the Acquisition Process. Paper presented at the Committee on National Statistics' Workshop on Statistical Issues in Defense Analysis and Testing, September 24-25, 1992. Office of the Assistant Secretary of Defense for Program Analysis and Evaluation, U.S. Department of Defense.
McKay, M. 1992 Latin Hypercube Sampling as a Tool in Uncertainty Analysis of Computer Models. Paper presented at the Winter Simulation Conference, Arlington, Va.
McKay, M.D., W.J. Conover, and R.J. Bechman 1979 A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 22(2):239-245.
Mead, R. 1988 The Design of Experiments: Statistical Principles for Practical Application. New York: Cambridge University Press.
Mitchell, T.J., and D.G. Wilson 1979 Energy Model Validation: Initial Perceptions of the Process. ORNLS/CSD-50. Oak Ridge, Tenn.: Oak Ridge National Laboratory.
Myers, M.E. 1993 New OT&E strategy expedites fielding of software-intensive systems. Journal of Air Force Operational Test and Evaluation Center (AFOTEC RP 190-1) 6(1).
Nair, V.N., ed. 1992 Taguchi's parameter design: A panel discussion. Technometrics 34:127-161.
Paul, R. 1993 Metrics guided software risk management and maintainability. Journal of Air Force Operational Test and Evaluation Center 1(1).
Paul, R. 1995 Presentation given at software working group meeting, April 21, 1995, Washington, D.C.
Plackett, R.L., and J.P. Burman 1946 The design of optimum multi-factor experiments. Biometrika 33:305-325.
Rao, C.R. 1946 Hypercubes of strength d leading to confounded designs in factorial experiments. Bulletin Calcutta Mathematics Society 38:67-78.
Rao, C.R. 1947 Factorial experiments derivable from combinatorial arrangements of arrays. Journal of the Royal Statistical Society Supplement 9:128-139.
Rolph, J.E., and D.L. Steffey, eds. 1994 Statistical Issues in Defense Analysis and Testing: Summary of a Workshop. Committee on National Statistics and Committee on Applied and Theoretical Statistics, National Research Council. Washington, D.C.: National Academy Press.
RTCA, Inc. 1992 Software Considerations in Airborne Systems and Equipment Certification. Document No. RTCA/DO-178B. Washington, D.C.: RTCA, Inc.
Sacks, J., W.J. Welch, T.J. Mitchell, and H.P. Wynn 1989 Design and analysis of computer experiments. Statistical Science 4:409-423.
Scott, J.A., and J.D. Lawrence 1994 Testing Existing Software for Safety-Related Applications. Draft manuscripts prepared for the U.S. Nuclear Regulatory Commission. Lawrence Livermore National Laboratory, Livermore, Calif.
Taguchi, G. 1986 Introduction to Quality Engineering. Tokyo, Japan: Asian Productivity Organization.
U.S. Air Force Operational Test and Evaluation Center 1995 Introduction to Joint Reliability/Maintainability Evaluation Team and Test Data Scoring Board: A Handbook for the Logistics Analyst. AFOTEC Course OT&E 410, 4th edition (PDS Code XFX). Directorate of Systems Analysis. Kirtland Air Force Base, N.M.: U.S. Air Force.
U.S. Army Materiel Systems Analysis Activity 1995 Handout from presentation given at Software Working Group Meeting, April 21, 1995, Washington, D.C.
U.S. Department of Defense 1960 Sampling Procedures and Tables for Life and Reliability Testing. MIL-HBK 108. Office of the Assistant Secretary of Defense (Supply and Logistics). Washington, D.C.: U.S. Department of Defense.
U.S. Department of Defense 1982 Test and Evaluation of System Reliability, Availability, and Maintainability: A Primer. Director Test and Evaluation, Office of the Under Secretary of Defense for Research and Engineering. DoD 3235.1-H. Washington, D.C.: U.S. Department of Defense.
U.S. Department of Defense 1991 Defense Acquisition Management Policies and Procedures. Department of Defense Instruction 5000.2. Washington, D.C.: U.S. Department of Defense.
U.S. General Accounting Office 1987 DoD Simulation: Improved Assessment Procedures Would Increase the Credibility of Results. GAO/PEMD-88-3. Washington, D.C.: U.S. Government Printing Office.
U.S. General Accounting Office 1988 Weapons Testing: Quality of DoD Operational Testing and Reporting. GAO/PEMD-88-32BR. Washington, D.C.: U.S. Government Printing Office.
U.S. General Accounting Office 1993 Test and Evaluation: DoD Has Been Slow in Improving Testing of Software-Intensive Systems. Washington, D.C.: U.S. Government Printing Office.
U.S. Nuclear Regulatory Commission 1993 Software Reliability and Safety in Nuclear Reactor Protection Systems. Prepared by J.D. Lawrence of Lawrence Livermore National Laboratory. NUREG/CR-6101, UCRL-ID-114839. Washington, D.C.: U.S. Nuclear Regulatory Commission.
Wiesenhahn, R.D., and D.F. Dighton 1993 A Framework for Using Advanced Distributed Simulation in Operational Test. Alexandria, Va.: The Institute for Defense Analysis.
Yockey, D., D. Chu, and R. Duncan 1992 Memorandum prepared for the Assistant Secretary of the Army (Research, Development and Acquisition), the Assistant Secretary of the Navy (Research, Development and Acquisition), and the Assistant Secretary of the Air Force (Acquisition). February 21.

Biographical Sketches of Panel Members and Staff

JOHN E. ROLPH (Chair) is professor of statistics and chair of the Department of Information and Operations Management in the University of Southern California School of Business. He previously was on the research staff of the RAND Corporation. He has also held faculty positions at University College London, Columbia University, the RAND Graduate School for Policy Studies, and the Health Policy Center of RAND/University of California, Los Angeles. His research interests include empirical Bayes methods and the application of statistics to health policy, civil justice, criminal justice, and other policy areas. He is editor of the American Statistical Association magazine Chance, and he currently serves as vice chair of the National Research Council's Committee on National Statistics. He is a fellow of the American Statistical Association, the Institute of Mathematical Statistics, and the American Association for the Advancement of Science, and he is a member of the International Statistical Institute. He received A.B. and Ph.D. degrees in statistics from the University of California, Berkeley.

MARION R. BRYSON has retired after holding many positions in the federal government, spending 22 years primarily in the operational test arena. He served as scientific advisor at CDEC, director of CDEC, and technical director of the Test and Experimentation Command. Prior to his government service, he taught in several colleges and universities, including Duke University. He is a past president and fellow of the Military Operations Research Society. He is the recipient of the Vance Wanner Memorial Award in Military Operations Research and the Samuel S. Wilks Award in Army Experimental Design. He holds a Ph.D. degree in statistics from Iowa State University.

HERMAN CHERNOFF is professor of statistics in the Department of Statistics at Harvard University. He previously held professorships at the Massachusetts Institute of Technology, Stanford University, and the University of Illinois at Urbana. His current research centers on applications of statistics to genetics and molecular biology, and his past work specialized in large sample theory, sequential analysis, and optimal design of experiments. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences and has served as president of the Institute of Mathematical Statistics and as associate editor of several statistical journals. He is a fellow of the Institute of Mathematical Statistics and the American Statistical Association. He received a B.S. degree in mathematics from City College of New York, Sc.M. and Ph.D. degrees in applied mathematics from Brown University, an honorary A.M. degree from Harvard University, and honorary Sc.D. degrees from the Ohio State University and Technion.

JOHN D. CHRISTIE is a senior fellow and assistant to the president at the Logistics Management Institute, a nonprofit institution in McLean, Virginia. Before joining the institute, he was the Director of Acquisition Policy and Program Integration for the Undersecretary of Defense (Acquisition) in the U.S. Department of Defense. Prior to that, he was vice president of two professional service firms, while also serving for 7 years as a member of the Army Science Board. During an earlier period of government service he held various positions at the Federal Energy Administration and the Defense Department. Previously, he was a member of the Bell Labs technical staff. He holds S.B., S.M., E.M.E., and Sc.D. degrees from the Massachusetts Institute of Technology, all in mechanical engineering.

MICHAEL L. COHEN is a senior program officer for the Committee on National Statistics. Previously, he was a mathematical statistician at the Energy Information Administration, an assistant professor in the School of Public Affairs at the University of Maryland, a research associate at the Committee on National Statistics, and a visiting lecturer in the Department of Statistics at Princeton University. His general area of research is the use of statistics in public policy, with particular interest in census undercount and model validation. He is also interested in robust estimation. He received a B.S. degree in mathematics from the University of Michigan and M.S. and Ph.D. degrees in statistics from Stanford University.

CANDICE S. EVANS is a project assistant with the Committee on National Statistics. She is also currently working with the Panel on Retirement Income Modeling and has been steering the report of the Panel on International Capital Transactions, Following the Money: U.S. Finance in the World Economy, through the review process to final publication.

LOUIS GORDON is a statistician at the Filoli Information Systems Corporation. He has previously held academic appointments at the University of Southern California and at Stanford University. He has also worked as a statistician in industry and in the federal government. He has held J.S. Guggenheim and Fulbright fellowships. His research interests are in nonparametric statistics.

KATHRYN BLACKMOND LASKEY is an associate professor of systems engineering at George Mason University. She was previously a principal scientist at Decision Science Consortium, Inc. Her primary research interest is the study of decision-theoretically based knowledge representation and inference strategies for automated reasoning under uncertainty. She has worked on methods for automated construction of Bayesian belief networks and for recognizing when a system's current problem model is inadequate. She has worked with domain experts to develop Bayesian belief network models to be used in automated reasoning. She received a B.S. degree in mathematics from the University of Pittsburgh, an M.S. degree in mathematics from the University of Michigan, and a joint Ph.D. in statistics and public affairs from Carnegie Mellon University.

ROBERT C. MARSHALL is a professor and head of the Department of Economics at Penn State University. Previously, he taught at Duke University. His research, which uses theoretical, empirical, and numerical methods of analysis, has included a broad range of topics: housing, labor, the expected utility paradigm, and measurements of mobility. He is best known for his work on auctions and procurements, which has focused on collusion by bidders. He received an A.B. degree from Princeton University and a Ph.D. degree from the University of California, San Diego.

VIJAYAN N. NAIR is professor of statistics and professor of industrial and operations engineering at the University of Michigan, Ann Arbor. Previously, he was a research scientist at AT&T Bell Laboratories. His research interests include statistical methods in manufacturing, quality improvement, robust design, design of experiments, process control, and reliability engineering. He has taught courses and workshops in these areas both at Michigan and at Bell Labs. He also has extensive practical experience in applying these methods in industry. He is a fellow of the American Statistical Association, a fellow of the Institute of Mathematical Statistics, an elected member of the International Statistical Institute, and a senior member of the American Society for Quality Control. He is a past editor of Technometrics and a past coordinating editor of the Journal of Statistical Planning and Inference, and he currently serves on the editorial boards of five journals. He has a B. Econs. (Hons.) degree from the University of Malaya and a Ph.D. in statistics from the University of California, Berkeley.

ROBERT T. O'NEILL is director of the Office of Epidemiology and Biostatistics and acting director of the Division of Epidemiology and Surveillance in the Center for Drug Evaluation and Research (CDER) of the Food and Drug Administration. He is responsible for postmarketing surveillance and safety of new drugs and for providing statistical support to all programs of CDER, including advice in all drug and disease areas on the design, analysis, and evaluation of clinical trials performed by sponsors seeking approval to market new drugs. He is a fellow of the American Statistical Association and a former member of the board of directors of the Society for Clinical Trials, and he is active in several professional societies. He received a B.A. degree from the College of the Holy Cross and a Ph.D. degree in mathematical statistics and biometry from Catholic University of America.

ANU PEMMARAZU is a research assistant with the Committee on National Statistics, National Research Council. In addition to the Panel on Statistical Methods for Testing and Evaluating Defense Systems, she is currently working on projects related to public health performance partnership grants and priorities for data on the aging population. She previously worked on the Panel on the National Health Care Survey and the Panel to Evaluate Alternative Census Methods. She received a B.S. degree in mathematics from the University of Maryland, College Park, and is currently pursuing a master's degree in computer and information science.

STEPHEN M. POLLOCK is professor of industrial and operations engineering at the University of Michigan, Ann Arbor. Previously, he served as a consultant at Arthur D. Little, Inc., and as a member of the faculty at the Naval Postgraduate School. He teaches courses in stochastic processes, decision analysis, and reliability and mathematical modeling and has engaged in a variety of research areas and methods, including search theory, sequential detection of change, queuing systems, criminal recidivism, police patrol, and filling processes. He also serves as a consultant to more than 30 companies and other organizations. He is a fellow of the American Association for the Advancement of Science and has been senior editor of IIE Transactions, area editor of Operations Research, and president of the Operations Research Society of America. He holds a B. Eng. Phys. from Cornell and S.M. and Ph.D. degrees in physics and operations research from the Massachusetts Institute of Technology.

JESSE POORE is professor of computer science at the University of Tennessee and president of Software Engineering Technology, Inc. He conducts research in cleanroom software engineering and teaches software engineering courses. He has held academic appointments at Florida State University and Georgia Tech; he has also served as a National Science Foundation rotator, worked in the Executive Office of the President, and was executive director of the Committee on Science and Technology in the U.S. House of Representatives. He is a member of ACM and IEEE and a fellow of the American Association for the Advancement of Science. He holds a Ph.D. in information and computer science from Georgia Tech.

FRANCISCO J. SAMANIEGO is professor in the Intercollege Division of Statistics and director of the Teaching Resources Center at the University of California, Davis. He has held visiting appointments in the Department of Statistics at Florida State University and in the Department of Biostatistics at the University of Washington. His research interests include mathematical statistics, decision theory, reliability theory and survival analysis, and statistical applications, primarily in the fields of education, engineering, and public health. He is a fellow of the American Statistical Association, the Institute of Mathematical Statistics, and the Royal Statistical Society and is a member of the International Statistical Institute. He received a B.S. degree from Loyola University of Los Angeles, an M.S. degree from Ohio State University, and a Ph.D. from the University of California, Los Angeles, all in mathematics.

DENNIS E. SMALLWOOD is a senior economist with RAND, where he conducts research related to national security, including defense acquisition, industrial base, and costing issues. He has held previous positions at the Pentagon, working on strategic arms control issues; he also served as head of the Economic Analysis and Resource Planning Division in the Office of the Assistant Secretary of Defense for Program Analysis and Evaluation. He was previously an associate professor of economics at the University of California, San Diego, where he worked on issues related to the economics of health and of law. He received B.A. and M.A. degrees in mathematics from the University of Michigan and a Ph.D. degree in economics from Yale University.

DUANE L. STEFFEY is senior program officer with the Committee on National Statistics, and he served as the panel study director until July 1995. Concurrently, he is associate professor of mathematical sciences at San Diego State University, where he teaches courses in Bayesian statistics, statistical computing, and categorical data analysis. He previously worked at Westinghouse, where he was involved in conducting probabilistic risk assessments of commercial nuclear energy facilities. He engages broadly in interdisciplinary research and consulting, and his current professional interests include applications of statistics in environmental monitoring, transportation demand modeling, and census methodology. He received a B.S. degree and M.S. and Ph.D. degrees in statistics, all from Carnegie Mellon University.