

Rasch Measurement Models:

Interpreting WINSTEPS/BIGSTEPS and FACETS Output

Richard M. Smith
Data Recognition Corporation

JAM Press
Maple Grove, Minnesota
2003

Copyright © 1999 by Richard M. Smith All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without permission of the author. Printed in the United States of America

JAM Press P.O. Box 1283 Maple Grove, MN 55311

The distinction between measurement and statistics is one that is often blurred, if not overlooked.

The analysis stage of research is at least a two-step process that requires the researcher first to quantify the phenomena observed, achieving the most appropriate level of measurement for those data, and then to apply statistical procedures that answer the research question of interest and match the level of measurement achieved with the data. Hays (1988) notes that "the problem of measurement, and especially of attaining interval scales, is an extremely serious one for the social and behavioral sciences. It is unfortunate that in their search for quantitative methods researchers sometimes overlook the question of level of measurement and tend to read quite unjustified meanings into their results .... However, the core problem of level of measurement lies outside the province of mathematics and statistics" (p. 71). Measurement has to be considered a precursor to statistical analysis. Development of a theory of psychological measurement began in the late 1800s. Most of the measurement theory developed during the first half of the 1900s was based on the true-score model, also known as classical test theory, though many consider it less than classic. This model is characterized by the x = t + e notation, where x is the observed score, t is the true score, and e is measurement error.

Since 1960 there has been an increasing trend away from the true-score model to latent-trait models. One of the best ways to understand latent trait models, which include item response models, is to start with the deficiencies of the true-score model. In Kuhn's theory of scientific revolutions (1970), paradigm shifts occur when there are sufficient anomalies in an existing paradigm to provoke change; that is to say, there are enough perceived deficiencies in the current theory to inhibit its further development.

This occurred in physics when relativity replaced the Newtonian model. It is currently happening in measurement as latent trait models replace true score models as the predominant paradigm. The deficiencies of the true score model are apparent to any first-time user and frustrating to the seasoned veteran. To begin, the estimates of item difficulty and person ability in the true score model are sample dependent.

The p-values (proportions of correct responses) that estimate item difficulty depend on the ability distribution of the persons who took the items. For the same item, the more able the group taking the item, the higher the item p-value; the less able the group, the lower the p-value. Thus, the p-values are sample dependent, and estimates of item difficulty cannot be directly compared unless the estimates come from the same sample or assumptions are made about the comparability of the different samples. The same is true for the person ability estimate, usually the number or proportion of correct items on the test. The number of correct responses to a fixed-length test depends on the distribution of the item difficulties. For the same distribution of persons, the higher the mean item p-value, the higher the raw scores; the lower the mean item p-value, the lower the raw scores. This item dependency of person raw scores (and of any linear transformation of those raw scores) is one reason that many users of the true score model try to make as many people as possible take a single form of the test, eliminating the need to equate forms. In this sense equating implies adjusting the raw scores or the transformed raw scores from two different forms of the same instrument to remove differences in scores that result from different distributions of item difficulty.
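This sample dependence is easy to verify by simulation. The sketch below is a minimal illustration, assuming a simple logistic response process and hypothetical ability distributions; it administers the same item to a less able and a more able group and compares the resulting p-values.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_correct(ability, difficulty):
    """Probability of a correct response under a logistic response model."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

item_difficulty = 0.0  # the same item is given to both groups

# Two samples that differ only in their ability distribution.
low_group = rng.normal(-1.0, 1.0, 5000)
high_group = rng.normal(+1.0, 1.0, 5000)

for label, group in [("low-ability", low_group), ("high-ability", high_group)]:
    responses = rng.random(group.size) < p_correct(group, item_difficulty)
    print(f"{label} group p-value: {responses.mean():.2f}")

# The same item yields a much higher p-value for the more able group,
# so the p-value describes the sample as much as it describes the item.
```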

The next deficiency that usually confronts people is that the interpretation of the person raw scores depends on complete records. In multiple choice testing, assuming that an unanswered item is incorrect (scored 0) may not be problematic, but as soon as the true score model is expanded to psychological inventories or preference surveys that use either dichotomous scoring or Likert scales, this assumption becomes problematic. When a person declines to answer a question about the presence or absence of a particular symptom on a psychological checklist, it does not directly imply the assignment of the lowest score category to that person. However, if the interpretation of the raw scores depends on each of the persons responding to the same set of items, this omitting behavior causes a problem.

A similar situation is developing in multiple choice testing with the increased presence of computer adaptive testing, where it is possible for each person taking a test to encounter a different distribution of item difficulties. Another commonly perceived problem with the true score model is the estimate of the standard error of measurement (SEM) for the raw scores. In many cases, and particularly in classroom tests, the standard error is ignored completely, with the raw scores or proportion correct scores often assumed to be errorless estimates of a person's ability. When the standard error is taken into account, as is often the case with standardized tests that use standard score scales, the fact that the same standard error is used with all scores is conspicuously inadequate. The fact that extremely high or low scores are reported to have the same accuracy as scores in the middle of the score distribution is misleading, particularly when floor or ceiling effects are present. It is also obvious that the distribution of item difficulties on a test has an effect on the standard errors. For example, a test designed to have a rectangular distribution of item difficulties across a range from 0.90 proportion correct to 0.30 proportion correct has different standard errors than a test where item difficulties are normally distributed with a mean of 0.60 proportion correct and a standard deviation of 0.10. The next deficiency is a little less obvious to the casual test user, but equally, if not more, important than the previous deficiencies.

Raw scores and their linear transformations are non-linear; that is to say, they are not on an interval scale (Stevens, 1946). The non-linearity of raw scores is easy to demonstrate. Simply divide the items on a test into two groups based on item difficulty (proportion correct), the easiest and the hardest items. Then calculate a person raw score for each of those item subsets and plot the two raw scores for each person against each other. If the raw score metric were linear, the resulting plot would be a unit-slope line. But, in fact, the plot is a curve, with the scores on the easy subset necessarily reaching their ceiling much before the scores on the hard subset.
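The following sketch carries out this demonstration by simulation; it is a minimal illustration assuming Rasch-generated dichotomous responses with hypothetical person and item distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

n_persons, n_items = 1000, 40
abilities = rng.normal(0.0, 1.5, n_persons)
difficulties = np.linspace(-2.0, 2.0, n_items)  # an easy-to-hard spread

# Generate dichotomous responses from a Rasch process.
prob = 1.0 / (1.0 + np.exp(-(abilities[:, None] - difficulties[None, :])))
responses = (rng.random(prob.shape) < prob).astype(int)

easy = responses[:, difficulties < 0].sum(axis=1)   # score on the easy half
hard = responses[:, difficulties >= 0].sum(axis=1)  # score on the hard half

# Mean hard-half score at each easy-half score: the relation bends, with
# the easy scores reaching their ceiling well before the hard scores.
for e in range(0, 21, 4):
    mask = easy == e
    if mask.any():
        print(f"easy score {e:2d} -> mean hard score {hard[mask].mean():5.2f}")
```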

The same can be said about the raw scores of items, though few people think about whether the item difficulties, usually converted into proportions correct, are on ordinal or interval scales. The existence of a different metric for the persons and the items highlights another deficiency, namely the fact that it is difficult to predict the outcome of the interaction between a given person and a given item. Consider the case where a person who has answered 80% of the items administered correctly is presented an item that 80% of the previous people have answered correctly: what is the expected outcome? With the true score model there is no method for predicting the person's performance, or for developing a mathematical expression of the person's expected score. This is due primarily to the fact that the person abilities and the item difficulties are measured on different scales or metrics. This is a serious consequence because statistical analysis is replete with interesting ways of summarizing information about the differences between observed (did the person answer the item correctly) and expected (what we would have predicted the person would do) scores. The remaining deficiencies are a little more esoteric and even less obvious to the casual user, but they pose serious problems nonetheless. The first two are somewhat related, since they both concern the issue of quality control. There are few techniques available in the true score model for validating response patterns, and the few that are available are seldom used. One example might be which score would best represent a person who answered a majority of the hard items correctly and a majority of the easy items incorrectly.

The total score for that person, approximately 50%, would seem misleading to say the least, but which would be a better estimate, the 70% correct on the hard items or the 30% correct on the easy items? These two scores seem to be pointing in diametrically different directions. This is as much of a problem outside of the normal test applications of the true score model. Similarly, with a Likert scale, what do several strongly disagree responses to statements that are very easy to endorse say about a person's total score when the same person indicated agreement to several hard-to-endorse statements? A related issue is the ability to identify and control for guessing in multiple choice tests. No other topic of quality control has been addressed more widely. The usual guessing corrections in the true score model are based on the idea that a given proportion of the correct responses are not actually due to knowledge of the correct answer, but are based on "lucky" guesses. To correct for this guessing, a proportion of the incorrect answers is subtracted from the observed raw score of every person. This procedure makes two assumptions: first, that all persons guess to the same degree; and second, that there are only two possible states of knowledge, in which the examinee either knows the correct answer or randomly selects from all of the available responses. Neither of these assumptions is plausible. It is interesting to note that there is a second, rarely used, method for correcting for guessing, namely adding a proportion of the omitted responses to the person's score. Conceptually this procedure seems more realistic and, in fact, results in no disordering of the raw score distribution when compared to the punitive technique mentioned above.

Finally, the last major deficiency of the true score model relates to the issue of estimated test reliability. In the true score model there is a circularity in the definitions of reliability and the SEM that requires one to estimate the reliability in a way other than from the conceptualized ratio of true to observed variance in the raw score distribution.

To get around the circularity, various other methods for estimating reliability have been developed: test-retest, split-half, the Kuder-Richardson formulas, and coefficient alpha. In all of these methods the assessment of reliability is confounded by the presence of misfitting items. Thus, these techniques may systematically misestimate the reliability of the instrument. The first latent trait model was developed by Georg Rasch in 1945 (Wright and Stone, 1979) in response to a request by the Danish Department of Defense to standardize a group intelligence test. By 1952 he had extended this model to two tests of oral reading. These models, which ultimately became known as Rasch models, can be used to analyze either dichotomous response data or data that fit a Poisson counts model. The publication of Rasch's book (Rasch, 1960/1980) and further developments in computer estimation procedures (Choppin, 1968, and Wright and Panchapakesan, 1969) led to increased attention to the dichotomous model. Subsequent development of models for rating scale data (Andrich, 1978) and partial credit data (Wright and Masters, 1982, and Masters, 1982), and the many-facet extensions of the Rasch model (Linacre, 1989), have further extended interest in this family of psychometric models. The distinguishing characteristic of this family of models is that they are exponential models with additive parameters.

The general form of the model is given as

\[
\pi_{nix} = \frac{\exp \sum_{j=0}^{x} (\beta_n - \delta_{ij})}{\sum_{k=0}^{m_i} \exp \sum_{j=0}^{k} (\beta_n - \delta_{ij})} \tag{1}
\]

where π_nix is the probability of person n scoring in category x on item i, β_n is the measure of person n, δ_ij is the measure of item i in the format appropriate for the item response categories, and m_i is the number of category boundaries in the item. When written as the log of the odds of scoring in two adjacent score categories, the additive structure of the model is exposed:

\[
\log\!\left(\frac{\pi_{nij}}{\pi_{ni(j-1)}}\right) = \beta_n - \delta_i - \tau_j \tag{2}
\]

Equation 2 illustrates the rating scale model, in which β_n is the measure of the person, δ_i is the calibration of the item, and τ_j is the difficulty of the jth category transition. To illustrate the concept of category boundaries, consider the following different response formats. In a dichotomous item there is only one boundary, the transition between wrong and right, or 0 and 1. In this format δ_ij reduces to a single difficulty δ_i for each item. For a ten item dichotomous item set there would be 10 item parameters estimated. In a three category rating scale (scored 0, 1, and 2) there are two category boundaries, the transition between 0 and 1 and the transition between 1 and 2. In rating scale data the same category boundary structure is hypothesized for each item, so the δ_ij reduces to a single item difficulty for each item and a single set of category thresholds (τ_1 and τ_2) that apply to all items (δ_i + τ_j). For a ten item rating scale with three response categories there would be 10 item parameters and 2 threshold parameters estimated. In a three category partial credit scale there are again two steps, the transitions between 0 and 1 and between 1 and 2, but there is no specification of a common threshold structure for the items, so the threshold measure is double subscripted (δ_ij). For a ten item partial credit scale with three response categories there would be 20 item-threshold parameters estimated. For Poisson data the δ_ij is written as δ_i + log j. In the case of Rasch's oral reading test, δ_i is the difficulty of the passage and j is the number of misreadings.
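These category probabilities are straightforward to compute directly. The sketch below is a minimal illustration of equation 1 for the rating scale case, where δ_ij = δ_i + τ_j; the person, item, and threshold values are hypothetical.

```python
import numpy as np

def category_probs(beta, delta, taus):
    """Category probabilities under equation 1 for the rating scale case,
    where delta_ij = delta_i + tau_j.  The numerator for category 0 is
    exp(0), since its defining sum is empty."""
    steps = beta - (delta + np.asarray(taus))        # beta_n - delta_ij
    cum = np.concatenate(([0.0], np.cumsum(steps)))  # sums for x = 0..m
    numer = np.exp(cum)
    return numer / numer.sum()

# A three category item (two thresholds), hypothetical values throughout:
# person at 0.5 logits, item at -0.2, thresholds tau_1 = -0.8, tau_2 = 0.8.
p = category_probs(0.5, -0.2, [-0.8, 0.8])
print(p.round(3), p.sum())   # three probabilities that sum to 1
```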

Different item formats can also be combined in one analysis of a single composite instrument, i.e., dichotomous and rating scale items on the same instrument. In the typical two facet models there is usually one person parameter (β_n) per person and a number of item parameters per item that varies according to the structure of the response data, as described above. But in many situations there are intervening agents in the measurement process, such as judges, raters, or tasks, that are not parameterized in the true score model or the two facet model. Linacre's many facet Rasch model (MFRM) incorporates these intervening agents into the measurement model, and their quantitative effect on the measurement process can be estimated.

The MFRM log odds is given by

\[
\log\!\left(\frac{\pi_{nijk}}{\pi_{nij(k-1)}}\right) = \beta_n - \delta_i - \gamma_j - \tau_k \tag{3}
\]

where π_nijk is the probability that person n is rated in category k of item i by rater j, β_n is the measure of person n, δ_i is the calibration of item i, γ_j is the severity of rater j, and τ_k is the measure for the transition between step k-1 and k. This is a rating scale form of the model because there is a single step structure that applies to all items. In the MFRM the item parameterization can take any of the dichotomous, rating scale, or partial credit interpretations discussed above. There is no mathematical limit on the number of facets that can be added to the model, though most applications do not go beyond two or three facets in addition to the usual person and item facets.

The use of any Rasch measurement model specifies two requirements for the data. First, the data must approximate unidimensionality, i.e., most of the items must provoke data along the same underlying construct (see Smith, 1996, and Wright, 1996, for a further discussion of this point). Second, the data must demonstrate approximate local independence, i.e., the probability of responding correctly to one item must not be influenced by the particular response to another item. Local dependence can occur when a series of items is based on a single stimulus, such as a passage in a reading test; in that case, the reading level of the passage may have a fixed effect on all of the items associated with that passage. These two requirements are routinely tested by the fit statistics available in the Rasch model. Under certain conditions it can be useful to supplement fit analysis with principal component analysis of the raw scores or the residuals (Smith, 1996; Wright, 1996; Linacre, 1998). When the data fit a Rasch model, a series of useful conditions follows. First, the estimates of all parameters are reported on the same interval scale in a common metric. Second, all parameter estimates are freed from the distributional properties of the incidental parameters, i.e., the parameter estimates are neither sample nor test dependent. This means that missing data are not a problem, nor are non-parallel or non-overlapping tests such as those common to computer adaptive testing. Third, the probability expression based on equation 1 can be used to combine any person's estimated measure with any item's estimated measure to produce expected response values. These expected response values can be compared with the observed responses and combined in various useful ways to produce direct tests of the fit of an item, a person, a rater, or a subset of items, persons, or raters to the model. This framework makes it possible to detect a variety of aberrant response patterns, including but not limited to guessing and carelessness, and to evaluate and correct for the impact of these unexpected responses on the person measures (see Smith, 1991, for a complete discussion).

Fourth, the person and item measures have asymptotic standard error estimates for each discrete raw score that are based on the information function. Fifth, since the asymptotic standard errors of measurement for the person measures are known, the issues of fit of the data to the measurement model and of measurement error can be separated in the calculation of the reliability coefficient. Thus, the family of Rasch models presents solutions to all of the deficiencies cited for the true score model. The use of Rasch measurement has been facilitated by the development of a number of computer programs that apply this family of measurement models. Two facet measurement (items and persons) is easily handled by programs such as BIGSTEPS/WINSTEPS (and earlier programs such as BIGSCALE, CREDIT, and BICAL) developed at the University of Chicago and distributed by MESA Press. Other programs that can be used with two facet models include QUEST, available through the Australian Council on Educational Research, and RASCAL, available through Assessment Systems. Measurement models that involve three or more facets require programs such as FACETS, distributed by MESA Press. The remainder of this chapter will focus on the interpretation of output for two types of data. The first example is an analysis of a six category rating scale with the response categories ranging from "strongly agree" (coded 0) to "strongly disagree" (coded 5). There are 14 items and 378 persons in this data set. This analysis uses BIGSTEPS and is based on the two facet rating scale model, where the parameters estimated are 378 person measures, 14 item measures, and 5 category thresholds relating to the transition points between the six response categories (see Smith, 1991, 1992, for a complete description of these data).

The second example is based on the evaluation of 16 abstracts submitted for consideration at a conference. The abstracts were evaluated on six characteristics (organization, choice of problem, technical merit, scientific importance, audience appeal, and overall recommendation) using a 4 category rating scale ranging from 1 (excellent) to 4 (poor) for 5 of the 6 items. There were 12 reviewers, each reviewing 4 abstracts, and each abstract was reviewed by three different reviewers. This analysis used FACETS and is based on the three facet rating scale model, where the parameters estimated are 16 abstract measures, six characteristic measures, 12 reviewer severity measures, and 3 category thresholds.

BIGSTEPS/WINSTEPS

The two facet rating scale analysis was completed with the BIGSTEPS program (Wright and Linacre, 1995, version 2.60, although a Windows based version of the program, WINSTEPS, is now available). As this program is frequently updated, it is possible that the output format presented in this chapter may not agree exactly with earlier or later versions of the program. However, the interpretation of the results remains consistent across the various versions of the program. The version of the BIGSTEPS/WINSTEPS program and its creation date are always listed in the author information table on the first page of output. The BIGSTEPS/WINSTEPS program currently produces 22 tables of output in an ASCII file and a variety of other output files that can be used in subsequent analysis. The program allows the user to choose which of the 22 tables are produced through a control statement (TABLES=). The interpretation of the results usually begins with Table 3.1 (See Figure 1). The first step is to check the "real" person and "real" item separation, which are shown in the lower right corner of the person and item summary boxes. Real here means that the estimated standard errors of measurement have been adjusted for any misfit encountered in the data. The real person separation reliability, approximately equivalent to coefficient alpha, of 0.87 suggests that the scale discriminates between the persons well.

The real item separation reliability of 0.99 indicates that the items create a well defined variable. Given the acceptable values for the real separation indices, the next step is to check the location of the persons relative to the items. The mean item measure is always set to 0.0, unless an anchoring or rescaling option is used. The mean person measure here is -0.96, implying that the items are hard for these persons, i.e., that the persons found it hard to assign high scores. Here a high score would mean that the persons disagreed with the statements (more on this point in Table 12.1).

[Figure 1. BIGSTEPS Table 3.1: summary statistics for the 378 measured (non-extreme) persons (measure mean -.96, S.D. .96, real RMSE .35, separation 2.56, reliability .87) and the 14 measured (non-extreme) items (measure mean .00, S.D. .76, real separation 13.01, reliability .99).]
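The separation and reliability indices in Figure 1 follow from two reported quantities: the standard deviation of the measures and the root mean square measurement error. A minimal sketch of these standard relationships, using the person values from Figure 1:

```python
import math

def separation_stats(sd, rmse):
    """Separation and separation reliability from the observed S.D. of the
    measures and the root mean square measurement error (RMSE)."""
    adj_sd = math.sqrt(max(sd**2 - rmse**2, 0.0))  # "true" S.D., error removed
    separation = adj_sd / rmse
    reliability = adj_sd**2 / sd**2
    return separation, reliability

# Person values reported in Figure 1: S.D. = 0.96, real RMSE = 0.35.
sep, rel = separation_stats(0.96, 0.35)
print(f"separation = {sep:.2f}, reliability = {rel:.2f}")  # about 2.56 and 0.87
```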

The next statistics to check are the item and person OUTFIT ZSTD statistics. The OUTFIT ZSTD statistics are the standardized unweighted item and person fit statistics, a cube root normalization of the mean square (MNSQ) listed in the previous column. When the data fit the model, these statistics are approximate t-statistics with a mean of 0.0 and a standard deviation of 1.0. Here the person OUTFIT ZSTD has a mean of -0.4 and a standard deviation of 1.5, implying that there is a little more variability in the fit of these persons than expected, while the mean fit is lower than expected. The outliers producing these summary statistics can be found in Tables 5.1 and 6.1. The mean item OUTFIT ZSTD is -0.6 with a standard deviation of 3.0. The mean is below the expected value of 0.0, suggesting that these responses are more consistent than the model expects (a condition often indicating that the persons have overused the middle categories in responding to the statements). The standard deviation is high, suggesting that some items misfit.
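The relation between the MNSQ and ZSTD columns can be sketched directly. In the following minimal illustration the standardized residuals are hypothetical, and the model variance of the mean square, which involves higher moments of the response distribution, is supplied as an argument rather than computed:

```python
import numpy as np

def outfit_mnsq(z):
    """Unweighted (outfit) mean square: the mean of the squared
    standardized residuals for one item or one person."""
    z = np.asarray(z, dtype=float)
    return float(np.mean(z ** 2))

def zstd(mnsq, q):
    """Cube root (Wilson-Hilferty) normalization of a mean square;
    q is the model standard deviation of the mean square."""
    return (mnsq ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

residuals = [0.4, -1.1, 2.5, 0.2, -0.6, 3.1, -0.3, 0.8]  # hypothetical
ms = outfit_mnsq(residuals)
print(f"MNSQ = {ms:.2f}, ZSTD = {zstd(ms, q=0.5):.2f}")  # q = 0.5 is illustrative
```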

The identification of the item misfit can begin in Table 14.1 (See Figure 2). This table presents a summary of the individual item statistics.

(This is one of three possible item tables; here the items are shown in entry order. It is possible to get the same information with the items listed in measure order or in alphabetic order based on the item name.) The item measures and the standard errors associated with those measures are shown in columns four and five. The OUTFIT statistics are listed in the column under OUTFIT ZSTD. Inspection of this column indicates that there are four items above the usual cutoff of +2.0. The large negative values (fit values less than -2.0) are due to overfit. Since there are 14 quasi-independent t-tests, there is the stochastic possibility that one of the values may be a Type I error, but all four items should be reviewed. Table 14.2 (See Figure 3) lists the frequency of the responses for each item by distractor. For item 1 there were zero omits, 54 Strongly Agree (0), 71 Moderately Agree (1), 134 Slightly Agree (2), 61 Slightly Disagree (3), 42 Moderately Disagree (4), and 16 Strongly Disagree (5). There is nothing in this response distribution to suggest the cause of the misfit for this item. Table 12.1 (See Figure 4) pictures the variable map for this calibration.

[Figure 2. BIGSTEPS Table 14.1: item statistics in entry order, showing for each of the 14 items the raw score, count, measure, error, INFIT and OUTFIT MNSQ and ZSTD, and point biserial correlation.]

This map shows the distribution of both the persons and the items on the common logit metric. The persons are shown on the left, and the items are labeled on the right. The values listed at the left margin are the logit scale developed for this calibration. The mean item difficulty is always set to 0.0, unless an anchoring or rescaling option is used. The mean of both the person and item distributions is shown with the M on each side of the vertical line. The S represents ±1 standard deviation, while the Q represents ±2 standard deviations. In this case, because the items are scored 5 to 0 rather than 0 to 5, the easiest items to agree with are listed at the top, while the hardest items to agree with are listed at the bottom. The offset between the items and persons noted in the summary table in Figure 1 is easy to see in this table, as represented by the distance between the two "M"s.

[Figure 3. BIGSTEPS Table 14.2: option/distractor frequencies for each item and response category.]

Table 9.1 (See Figure 5) plots the OUTFIT information presented in Figure 2. The letters used to plot each item correspond to the letters listed in the PTBIS CORR. column in Figure 6. The largest misfits, A and B, are plotted at the top of the graph; the largest overfit, N, is plotted at the bottom. The horizontal dashed lines represent the expected mean (0) and the two usual cutoff values, ±2. The +2 and -2 must be interpreted differently, since values above +2 usually represent too much variability for the probabilistic model, while values less than -2 represent too little variability for the probabilistic model.

[Figure 4. BIGSTEPS Table 12.1: map of the 378 persons and 14 items on the common logit metric.]

[Figure 5. BIGSTEPS Table 9.1: plot of item OUTFIT ZSTD against item measure, with the person distribution shown along the bottom.]

The person distribution is shown at the bottom of the graph, with the vertical line, here at -1, marking the mean of the person distribution. The location of A and B relative to this line indicates that the items that misfit are close to the mean of the person distribution; the misfitting items are not extreme in difficulty for this population. Note that A and B are more misfitting than C and the others. The information in Table 10.1, shown in Figure 6, duplicates the item information found in Figure 2. In this table the item information is presented in OUTFIT order rather than in item entry or difficulty order. The two most misfitting items, EMR and Disp, appear at the top of the table; the most overfitting items appear at the bottom. To help diagnose the nature of the misfit, Table 10.4 (See Figure 7) contains a listing of the most unexpected responses for the 12 most misfitting items. The two items of primary interest are item 1, EMR, and item 14, Disp. The items (rows) in this table are listed in difficulty order, with the hardest-to-endorse items at the top and the easiest-to-endorse items at the bottom. The rows show the unexpected responses of the most misfitting persons to the listed items. Responses in the expected range are shown as "."; responses producing large residuals are shown with the actual response. The persons who make up the columns are also listed in measure order. Persons with a willingness to disagree with the statements (high scores) are shown on the left, while persons with a willingness to agree (low scores) are shown on the right. Looking across the row for item 14 (Disp), there are 13 unexpected responses: four unexpected Strongly Agree responses (0) by persons at the right who tended not to agree with items, and nine unexpected Disagree responses (5, 4, or 3) by persons at the left who tended to agree with other items. More detail on this item can be found in Table 11.1 (See Figure 8). This table lists all of the significant residuals among the 378 responses to this item. The residuals are listed in rows of 25, with the person sequence numbers for the rows listed at the beginning.

The observed responses are listed on the

[Figure 6. BIGSTEPS Table 10.1: item statistics in OUTFIT order, with the plotting letters A through N shown beside the OUTFIT ZSTD values.]

[Figure 7. BIGSTEPS Table 10.4: most unexpected responses, with items in measure order and persons in measure order.]

[Figure 8. BIGSTEPS Table 11.1: table of poorly fitting items, showing the responses and truncated standardized residuals for item 1 (EMR).]

The observed responses are listed on the first of each pair of lines for a row of persons, and the second line lists the standardized residual for each response. These are truncated standardized residuals, and any residual with a magnitude less than 2 is not listed. A total of 29 residuals with magnitudes greater than 2 are shown, 19 positive and 10 negative. The positive residuals indicate unexpectedly high responses, while the negative residuals indicate unexpectedly low responses. The expected response is based on the measure estimated from the person's total raw score and the measure estimated from the item's total raw score. Residuals in the 3 range are more unexpected than residuals in the 2 range. There are four positive residuals in the 3 range and only one negative residual in the 3 range. Most of the misfit seems to be coming from the unexpected positive residuals of 2 or more by about 19 of the 378 persons (5 percent of the total sample). This table is available for all items with OUTFIT statistics greater than 2.0, but the remaining items are not reproduced here.
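These residuals follow directly from the model probabilities. A minimal sketch, reusing the rating scale probabilities of equation 1 with hypothetical measures and the five step values estimated for this scale (reported in Figure 9, below):

```python
import numpy as np

def category_probs(beta, delta, taus):
    """Rating scale category probabilities (equation 1)."""
    cum = np.concatenate(([0.0], np.cumsum(beta - delta - np.asarray(taus))))
    numer = np.exp(cum)
    return numer / numer.sum()

def standardized_residual(x, beta, delta, taus):
    """(observed - expected) / model standard deviation for one response."""
    p = category_probs(beta, delta, taus)
    k = np.arange(p.size)
    expected = np.sum(k * p)                  # model expected score
    variance = np.sum((k - expected) ** 2 * p)
    return (x - expected) / np.sqrt(variance)

# A person 1.5 logits below an item who nevertheless responds
# Strongly Disagree (5) produces a large positive residual (about 4.3).
taus = [-1.28, -1.43, 0.41, 1.07, 1.23]
print(round(standardized_residual(5, beta=-1.5, delta=0.0, taus=taus), 1))
```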

The last part of Table 3.1 (see Figure 9) contains information on the fit of the step structure to the measurement model. This information is useful in checking the fit of the single step structure to all of the items in the rating scale model. The logit step measures for the items and the associated errors of estimate are shown in the seventh and eighth columns. There is one slightly disordered step value here. It is easier to go from Moderately Agree (1) to Slightly Agree (2), with a step value of -1.43, than it is to go from Strongly Agree (0) to Moderately Agree (1), with a step value of -1.28. The reason for this can be seen in the observed counts for the categories in column three. The larger numbers of responses in Strongly Agree (0) and Slightly Agree (2), both close to 1,500, compared with Moderately Agree (1), with a count of less than 1,000, caused the step reversal. There is a satisfactory progression of the average measures, shown in column 4, from -2.20 logits for those choosing the strongly agree option to 0.80 logits for those choosing the strongly disagree option. Overall the fit of the steps is within an acceptable range, with all of the OUTFIT MNSQ values less than 1.27 (shown in columns 5 and 6).

[Figure 9. BIGSTEPS Table 3.1 (continued): summary of measured steps, showing for each of the six categories the observed count (1485, 979, 1563, 749, 324, 192), average measure (-2.20 to .80), INFIT and OUTFIT MNSQ, step value (NONE, -1.28, -1.43, .41, 1.07, 1.23), step error, and expected score measures.]

[Figure 10. BIGSTEPS Table 21.1: category probability curves plotted against the person minus item measure.]

The plot of the category probability curves in Table 21.1 (See Figure 10) reflects the disordering observed in Figure 9. The probability curve for Moderately Agree, shown as 1 in the figure, is never higher than either Strongly Agree (0) or Slightly Agree (2). Tracing down the 0 curve from the y-axis on the left, there is an immediate transition to 2. These probability curves suggest that the persons responding to this rating scale are using only three or four categories rather than the six categories offered. This set of curves applies to all items, with an adjustment for item difficulty to position each item's probability curves on the logit metric. The probability curves in Figure 10 enable the calculation of an expected score map. This map is found in Table 2.2 (See Figure 11). The first thing to notice is that the expected score transitions have exactly the same spacing across the items, with a horizontal adjustment for the difficulty of each item. This is a characteristic of the rating scale model, which requires all items to have the same step structure. The items are listed in the table in measure order, with the items easiest to agree with at the top of the chart and the items most difficult to agree with at the bottom. The person measure distribution is shown at the bottom of the page. This table can be used as a self-scoring form for the administration of the questionnaire, simply by circling a person's responses and finding the approximate horizontal center of the circled responses on the logit metric shown at the top and bottom of the page. The dispersion of the circled responses can also be used as a measure of person fit, because the tighter the horizontal clustering of the responses, the more consistent the responses. Responses that fall far from the horizontal center of the majority of responses indicate misfit, with responses falling far to the left having large negative residuals and responses falling far to the right having large positive residuals.

[Figure 11. BIGSTEPS Table 2.2: expected score map, with items listed in measure order and the person distribution shown at the bottom.]

The figure can also be used to estimate, by eye, expected scores for a person at any measure. Simply find the person's measure on the horizontal axis and draw a perpendicular line through that point. The response categories nearest that vertical line are the person's most likely responses. A person with a measure of 0.0 would have an expected score of 4 (Moderately Disagree) on the two items most difficult to endorse (5 and 3) and an expected score of 3 (Slightly Disagree) on the next four hardest-to-endorse items (14, 13, 8, and 1). An expected score of 2 (Slightly Agree) would apply to the next six items (4, 10, 2, 11, 6, and 7). For item 9 the expected response is equally likely to be 1 or 2 (Moderately Agree or Slightly Agree). For item 12, the easiest item to endorse, an expected score of 1 (Moderately Agree) is indicated.
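The same expected scores can be computed exactly rather than read by eye. A minimal sketch, using the item measures for Deaf and Diab from Figure 2 and the step values from Figure 9:

```python
import numpy as np

def expected_score(beta, delta, taus):
    """Model expected score for a person at measure beta on an item at delta."""
    cum = np.concatenate(([0.0], np.cumsum(beta - delta - np.asarray(taus))))
    p = np.exp(cum) / np.exp(cum).sum()
    return float(np.sum(np.arange(p.size) * p))

taus = [-1.28, -1.43, 0.41, 1.07, 1.23]   # step values from Figure 9
for name, delta in [("Deaf (item 5)", -1.19), ("Diab (item 12)", 1.25)]:
    print(name, round(expected_score(0.0, delta, taus), 2))
# About 3.7 for Deaf and 1.2 for Diab, close to the values of 4 and 1
# read from the expected score map for a person at 0.0.
```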

The person distribution at the bottom of the chart indicates that a person with a measure of 0.0 would be approximately one standard deviation above the mean for this sample of persons. The next series of tables provides the same information for persons as Figures 5 through 8 provided for items. Table 18.1 (See Figure 12) is one of three possible person statistic tables. This table is in entry order, but it is possible to get a listing of persons in measure order or in alphabetic order based on the person id field. The information in Figure 12 is abbreviated to show only the first 15 and last 9 persons. The table lists the entry order of the record, the raw score, the count (the number of items with legitimate responses), the logit measure, the error of estimation for the measure (which has a different magnitude depending on the person's measure), two columns of fit information, a person point biserial correlation, and the first five characters of the person id field. Misfitting persons are identified by OUTFIT ZSTD or INFIT ZSTD values greater than +2.0. (Values less than -2.0 usually indicate a pattern of responses too consistent for a probabilistic model.) There is one person with a large OUTFIT ZSTD value, Person 8, with a value of 3.7. This is a good indication of unexpected responses. This person was also listed in Figure 7, where the most significantly misfitting persons were listed to help identify the sources of item misfit. Person 8 is the seventh person from the left margin in the table in Figure 7. The person identifying number is shown at both the top and bottom of the table in a vertical format. Connecting the two "8"s with a line indicates that there are two unexpected "Strongly Disagree" responses for person 8, to items 6 and 7, which cause the misfit. The plot of the person OUTFIT values against person measure in Table 5.1 (See Figure 13) gives an overall picture of the fit of the persons to the model. Again, the usual critical values would be ±2. There are 25 persons with values plotted above +2.0. These points are quasi-independent cube root transformations that approximate t-distributions. Given the sample size, between 13 and 15 values greater than +2.0 would be expected to occur by chance.

[Figure 12. BIGSTEPS Table 18.1: person statistics in entry order, abbreviated to the first 15 and last 9 persons.]

With about twice that number identified, there is an indication of some person misfit in these data. A list of the most misfitting persons, identified in Figure 13 with both capital and small letters, is provided in Table 6.1 (See Figure 14). The same information is provided in this table as was provided in Figure 12; in this case only the misfitting persons are listed, and the listing is in OUTFIT order. The initial letter in the person PTBIS CORR. column corresponds to the person's location on the plot in Figure 13. It is interesting to note the person point biserial correlation values for the persons who overfit.

[Figures 13 through 16. BIGSTEPS Tables 5.1, 6.1, and 7.1: the plot of person OUTFIT ZSTD against person measure, the listing of the most misfitting persons in OUTFIT order, and the table of poorly fitting persons (items in entry order).]

The final four tables of BIGSTEPS output provide different types of information that can be useful in specific situations. Table 20.1 (See Figure 17a) provides the raw score to logit conversion for all possible raw scores. It is important to remember that this table applies only to complete records; it would not apply to persons who omitted responses. Although it is easy to estimate ability for persons with missing data, this table does not provide that information. The raw scores, between 0 and 70 (5 x 14 items), are listed along with the logit measures for those raw scores. Although measures for zero and perfect scores are undefined by the model, a Bayesian estimate is provided. That estimate is marked with an "E" to indicate that it is an extreme measure. The increasing size of the standard errors of estimate as the measures move away from the center of the test indicates that the extreme measures contain less information. This occurs because the measures are further away from the concentration of items, which, in this case, lies between ±1, as shown in Figure 4. The plot of the logit metric against the raw score metric is presented in the continuation of Table 20.1 (See Figure 17b). This plot shows the nonlinearity of the raw score metric. The distributions of the person and item measures are shown at the bottom of the graph to locate the two distributions relative to the test characteristic curve. The person distribution is closer to the lower end of the raw score range. This will skew the raw score distribution.
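The score-to-measure conversion in Table 20.1 can be sketched as a Newton-Raphson solution of the equation "sum of expected item scores equals raw score." In the sketch below the item measures are hypothetical stand-ins for the calibrated values, so the output only approximates the table:

```python
import numpy as np

def item_expectation_variance(beta, delta, taus):
    """Expected score and variance for one item under the rating scale model."""
    cum = np.concatenate(([0.0], np.cumsum(beta - delta - np.asarray(taus))))
    p = np.exp(cum) / np.exp(cum).sum()
    k = np.arange(p.size)
    e = np.sum(k * p)
    return e, np.sum((k - e) ** 2 * p)

def score_to_measure(raw, deltas, taus, tol=1e-6):
    """Newton-Raphson solution of sum of expected scores = raw score.
    Valid for complete records with a non-extreme raw score."""
    beta = 0.0
    for _ in range(100):
        ev = [item_expectation_variance(beta, d, taus) for d in deltas]
        expected = sum(e for e, _ in ev)
        info = sum(v for _, v in ev)           # test information at beta
        step = (raw - expected) / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1.0 / info ** 0.5             # measure and its standard error

# Hypothetical item spread of -1.2 to +1.2 logits; step values from Figure 9.
deltas = np.linspace(-1.2, 1.2, 14)
taus = [-1.28, -1.43, 0.41, 1.07, 1.23]
print(score_to_measure(22, deltas, taus))
# Roughly -0.9 logits with a standard error near 0.28, in the neighborhood
# of the -.91 and .28 that Table 20.1 reports for a raw score of 22.
```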

Table 20.2 (See Figure 18) provides additional sample information, which includes the raw score frequency and percent, cumulative frequency and percent, and percentiles for each raw score. Approximately 20 scores have been omitted from this table to reduce its size. Finally, Table 22.1 (See Figure 19) provides a Guttman scalogram of the raw data.

Rasch Measurement Figure 17a BIGSTEPS Table 20.1 TABLE 20.1 Mainstreaming

378 PERSONS 14 ITEMS

Rating scale

ANALYZED:

pracmain.on2

378 PERSONS 14 ITEMS

Aug 29 13:43 1996

6 CATEGORIES

TA8LE OF MEASURES ON COMPLETE TEST SCORE MEASURE 0 1 2 3 4 5 6 7 8 9 10 11

12 13 14 15 16 17 18 19 20 21 22 23

-4.96E -4.31 -3.67 -3.30 -3.03 -2.81 -2.63 -2.47 -2.32 -2.19 -2.06 -1.94 -1.83 -1.72 -1.62 -1.52 -1.43 -1.33 -1.24 -1.16 -1.07 -.99 -.91 -.83

S.E. 1.38 .96 _68 .56 .49 .45 .41 .39 .37 .36 .35 .34 .33 .32 .32 .31 .31 .30 .30 .29 .29 .29 .28 .28

SCORE

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

MEASURE -.75 ..68 -.60 -.53 -.46 -.39 -.32 -.24 -.17 -.11 -.04 .03 .10 .17 .24 .31 .38 .45 .52 .59 .67 .74 .81 .89

S.E. .28 .27 .27 .27 .27

.27 .27

.27 .26 .26 .26 .26 .26 .26 .26 .26 .26 .27 .27 .27 .27 .27 .27

.27

SCORE MEASURE 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70

.96 1.04 1.12 1.19 1.27 1.36 1.44 1.53 1.61 1.71 1.80 1.90 2.01 2.12 2.25 2.38 2.53 2.71 2.92 3.18 3.56 4.21 4.87E

S.E. .27 .28 .28 .28 .28 .29 .29 .30 .30 .31 .31 .32 .33 .34 .36 .38 .40 .44 .48 .56 .68 .97 1.39

Models 33

Rasch Measurement Models 34 Figure l7b BIGSTEPS Table 20.1

,

7D 68 66 64

'

S C

6D 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28

o

26

R E

24

E X P E C T E D

•• ••



•• •• •• •• • •

• •

••

•• • •

• • •

•• •• ••

20 18 16 14 12

••

10 8 6 2

E

•• ••

•• ••

22

4



••

62

.,

,

~AW SC?RE-MEASURE OGIVE fOR CO~PLETE TEST

••

• •

••

• •

••

••

, o 4--i----f-+---+--+-+----+---+-+---+J ~ , I "r 4 3 2 1 o -2 -1 -3 -4 -5 I

I

MEASURE

PERSON

2

1 2 22 2231123111 4 48580525208787996088584221 Q

ITEMS

S

M

S

2 31 Q

S

Q

1221 11 M

S

Q

I

5

Figure 18 BIGSTEPS Table 20.2

TABLE 20.2 Mainstreaming Rating Scale    pracmain.on2    Aug 29 13:43 1996
ANALYZED: 378 PERSONS, 14 ITEMS, 6 CATEGORIES
TABLE OF SAMPLE FREQUENCIES CORRESPONDING TO COMPLETE TEST

[For each raw score from 0 to 70 the table lists the measure, its model standard error, the normed score and its standard error, and the sample frequency, percent, cumulative frequency, cumulative percent, and percentile. The measures and standard errors repeat those of Table 20.1.]

Figure 19 BIGSTEPS Table 22.1

TABLE 22.1 Mainstreaming Rating Scale    pracmain.on2    Aug 29 13:43 1996
ANALYZED: 378 PERSONS, 14 ITEMS, 6 CATEGORIES
GUTTMAN SCALOGRAM OF RESPONSES

[Persons are listed down the rows in raw score order and items across the columns in difficulty order; each entry is the person's observed rating. Only the 20 highest-scoring and 20 lowest-scoring persons are reproduced.]

Persons are ordered from highest raw score (less likely to endorse) to lowest raw score (most likely to endorse) down the rows. For brevity, only the highest and lowest 20 scores are shown in this figure. This table provides a quick way to check the fit of the persons. In the center of the table are the three persons (358, 235, and 172) with the lowest OUTFIT ZSTD identified in Figures 13 and 14. It is easy to see the nearly perfect Guttman patterns for these persons, whose responses are too consistent for the probabilistic model. Contrast these response patterns with those of the three most misfitting persons (294, 57, and 8), which are shown directly above them in the center of Figure 19. The lack of orderliness relative to the difficulty of the items is obvious, indicating why even the probabilistic model objects to these patterns as too noisy. For these positive misfits it is clear that the raw scores do not provide a good indication of those persons' locations on the variable.
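The contrast between these two kinds of response strings can be quantified with the unweighted (outfit) person fit statistic: the mean of the squared standardized residuals across a person's responses. A minimal dichotomous sketch follows; the difficulties and response strings are invented for illustration, and BIGSTEPS computes the polytomous analogue of the same quantity.

import math

def outfit_mean_square(responses, theta, difficulties):
    """Mean squared standardized residual for one person (dichotomous).
    Values near 1.0 fit; well below 1.0 is Guttman-like overfit; well
    above 1.0 is noise."""
    z_squares = []
    for x, delta in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))
        z_squares.append((x - p) ** 2 / (p * (1.0 - p)))
    return sum(z_squares) / len(z_squares)

# Items ordered easy to hard (made-up difficulties).
deltas = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
guttman = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # perfectly ordered: overfits
noisy   = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]  # many reversals: misfits

print(outfit_mean_square(guttman, 0.25, deltas))  # about 0.4, well below 1
print(outfit_mean_square(noisy, 0.25, deltas))    # about 3.9, well above 1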

The BIGSTEPS output contains many other tables that have been omitted in this discussion. Most of these tables present the same information shown in Figures 1 to 19 in various formats. The simplest way to use BIGSTEPS output is to select all of the output tables and then edit the ASCII output file to print only the tables that are of interest. This is usually quicker than having to rerun the program to produce tables that were not specified.

FACETS

The analysis of the three-facet rating scale example was completed with the FACETS program (Linacre and Wright, 1993; version 2.81). As this program is frequently updated, it is possible that the figures presented in this chapter may not agree completely with earlier or later versions of the program. However, the interpretation of the results remains consistent across the various versions of the program. The version of the FACETS program and its creation date are always listed in the author information at the beginning of the output. This particular analysis

was completed with FACETS version 2.81. The FACETS program produces 13 tables of output in an ASCII file and a variety of other output files that can be used in subsequent analyses. Unlike BIGSTEPS, there is no control statement to specify particular tables for output. The interpretation of the FACETS analysis results usually begins with Table 1 (See Figure 20). This table provides a summary of the specifications entered for the analysis. In this case it is a three-facet analysis with 12 reviewers, 16 papers, and 6 criteria on which the papers were evaluated.

This analysis is a rating scale model with 4 categories (except for criterion 6, which is rated on a three-category scale). The data file locations for the input and output are listed under the title of the analysis in Figure 20.

Figure 20 FACETS Table 1

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 1. Specifications from file "APA.DAT".

Title = 'PEER REVIEW PAPER EVALUATION'
Data file = APA.DAT
Output file = APA.OUT
; Data specification
Facets = 3
Non-centered = 1
Positive = 1
Labels = 1,REVIEWERS (elements = 12)
         2,PAPERS (elements = 16)
         3,CRITERIA (elements = 6)
Model = ?,?,6,R3,1
        ?B,?B,?,R,1
Rating (or other) scale = QUALITY,R4,General,Ordinal
; Output description
Arrange tables in order = 1MN,2MN,3MN
Unexpected observations reported if standardized residual >= 3
; Convergence control
Convergence = .5, .01
Iterations (maximum) = 100
Xtreme scores adjusted by = .3, .5   ; (estimation, bias)

Table 2 (See Figure 21) summarizes the data read from the input file. Altogether 48 lines of data were read (12 reviewers times 4 papers per reviewer). There were 5 evaluations that were left blank by the reviewers, so the total number of non-blank responses is 283 rather than 288 (12 x 4 x 6). There were two different rating scales used for these items. For the first 5 items there was a four-point rating scale with 1 representing "excellent" and 4 representing "poor". The sixth item was a three-point rating scale with 1 representing "definitely accept" and 3 representing "reject". Table 3 (See Figure 22) lists the number of iterations and a variety of convergence data. The most important piece of information is found in the last line, indicating that the subset connection is "O.K." This means that the reviewers and the papers are fully crossed, so that all of the estimated parameters can be placed on a common scale with the same origin. In some three-facet analyses there are no subset connections. This can be caused by one set of reviewers reading only the papers from one site and another set of reviewers reading the papers from a second site. In that case there would be no common papers or reviewers to link the two sites, and the message would indicate a "loosely connected subset".
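Connectedness is a property of the judging plan itself, so it can be checked before any estimation is attempted: treat each reviewer and each paper as a node, treat each rating as an edge, and ask whether a single connected component results. A small union-find sketch, with hypothetical reviewer and paper identifiers:

def find(parent, x):
    """Find the root of x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def is_connected(ratings):
    """ratings: (reviewer_id, paper_id) pairs. True if one subset."""
    nodes = {("R", r) for r, _ in ratings} | {("P", p) for _, p in ratings}
    parent = {n: n for n in nodes}
    for r, p in ratings:
        parent[find(parent, ("R", r))] = find(parent, ("P", p))
    return len({find(parent, n) for n in nodes}) == 1

# Two sites whose reviewers never cross: two disjoint subsets.
site_a = [(1, "A1"), (1, "A2"), (2, "A1"), (2, "A2")]
site_b = [(3, "B1"), (3, "B2"), (4, "B1"), (4, "B2")]
print(is_connected(site_a + site_b))                # False: disconnected
print(is_connected(site_a + site_b + [(2, "B1")]))  # True: one linking rating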

Figure 21 FACETS Table 2

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 2. Data Summary Report.

Total lines in data file = 48
Total non-blank responses found = 283
Responses with unspecified elements = 0
Valid responses used for estimation = 283
Responses matched to model: ?,?,6,R3 = 48
Responses matched to model: ?B,?B,?,R = 235
Responses not matched to any model and ignored = 0
Available free memory space = 384848

Figure 22 FACETS Table 3

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 3. Iteration Report.

[Three PROX iterations and 33 UCON iterations are listed, showing the maximum score residual (for elements and categories) and the maximum logit change (for elements and steps) shrinking at each iteration. The report ends: Subset connection O.K.]

The interpretation of the FACETS results begins with Table 4.1 (See Figure 23). This table lists the unexpected responses. In running this analysis the default option was used to select misfitting residuals, as reported in Figure 20: only responses with standardized residuals of at least |3| are printed in this table. Only 5 of the 283 residuals analyzed had a value that large. Interestingly, four of the five residuals listed were from reviewer 1 evaluating paper 2. In each of these cases the rating exceeded (numerically) what was expected by the model, resulting in a positive residual. In this case the ratings were actually poorer ratings than were expected. In the case of reviewer nine rating paper 6 there was one unexpected negative residual. The reviewer assigned a 1 to organization when the expected score was 2.9, resulting in the negative residual. The fit information continues in Table 5 (See Figure 24). Of greatest interest is the mean and standard deviation of the standardized residuals. The expected values are a mean of 0.0 and a standard deviation of 1.0. The observed values of 0.0 and 1.2 are close to their expected values. The increased variability is due primarily to the five residuals shown in Figure 23.

Figure 23 FACETS Table 4.1

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 4.1 Unexpected Responses.

| Cat  Step  Exp.  Resd  StRes | Nu REVIEWER | Nu PAPERS | N CRITERIA       |
|  2    2    1.0    1.0    6   |  1 R-ONE    |  2 P-TWO  | 1 ORGANIZATION   |
|  2    2    1.1    0.9    3   |  1 R-ONE    |  2 P-TWO  | 3 TECH MERIT     |
|  2    2    1.1    0.9    3   |  1 R-ONE    |  2 P-TWO  | 5 AUD APPEAL     |
|  2    2    1.0    1.0    9   |  1 R-ONE    |  2 P-TWO  | 6 OVERALL RECOMM |
|  1    1    2.9   -1.9   -3   |  9 R-NINE   |  6 P-SIX  | 1 ORGANIZATION   |

Figure 24 FACETS Table 5

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 5. Measurable Data Summary.

|       Cat  Step  Exp.  Resd  StRes |
| Mean  1.8  1.8   1.8   -0.0   0.0  |  (Count: 283)
| S.D.  0.7  0.7   0.5    0.5   1.2  |

Count of measurable responses = 283
Count of independently estimable parameters = 35
Data-to-model global fit chi-square: 411.9   d.f.: 248   significance: .00
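The Exp., Resd, and StRes columns in Tables 4.1 and 5 follow mechanically from the model: the residual is the observed rating minus its modeled expectation, and the standardized residual divides that difference by the square root of the modeled variance of the rating. A sketch with made-up step and measure values (so it will not reproduce the table entries exactly):

import math

def category_probs(logit, steps):
    """Rating scale probabilities for categories 1..len(steps)+1 at the
    combined logit (person minus reviewer severity minus criterion)."""
    logits, run = [0.0], 0.0
    for step in steps:
        run += logit - step
        logits.append(run)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def standardized_residual(observed, logit, steps):
    """Return (expected, residual, standardized residual) for one rating."""
    probs = category_probs(logit, steps)
    expected = sum(k * p for k, p in enumerate(probs, start=1))
    variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs, start=1))
    resd = observed - expected
    return expected, resd, resd / math.sqrt(variance)

# A rating of 1 where nearly 3 was expected, as for reviewer nine on
# paper six; the logit and step values here are invented.
e, r, z = standardized_residual(1, 3.0, [-0.6, 0.6])
print(round(e, 1), round(r, 1), round(z, 1))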

The FACETS variable map is seen in Table 6.0 (See Figure 25). The logit metric is shown in the left-hand column. The zero of the metric is set at the average item difficulty (labeled criteria in this analysis). The reviewers are listed by measure in the second column, with the top of the scale identifying strict reviewers, who gave high ratings, and the bottom of the scale representing lenient reviewers, who gave low ratings. It is interesting to see the variability of the reviewers in this analysis. Usually, rater severity is ignored in this type of analysis, reducing the problem to a two-facet analysis. The papers rated are listed in measure order in the third column, with the best rated papers (low scores) at the top and the poorest rated papers (high scores) at the bottom. The variability in the quality of the papers is approximately equal to the variability in the severity of the reviewers. The measures of the six criteria on which the papers were reviewed are shown in the fourth column. The hardest criteria on which to receive a good rating are listed at the top and the easiest criteria are at the bottom. Interestingly, the three-point criterion, "overall recommendation", is the criterion on which papers were most likely to earn a high rating. The last two columns show the step structure. The threshold structure for the first five items is shown under S.1. The step structure for the last item is shown under S.2. The numbers listed in these two columns are the expected integer scores.

A summary of the facets information is shown in Tables 6.1, 6.2, and 6.3 (See Figures 26-28). The format of the information in these tables is the same. The M printed on each line represents the mean of the distribution, S represents ±1 standard deviation, Q represents ±2 standard deviations, and X represents ±3 standard deviations. The reviewer information is shown in Figure 26. The paper information is in Figure 27, and the criteria information is in Figure 28. Each line is a graphic presentation of the information. The "Logit" line shows the distribution of reviewer measures. The S.E. line shows the distribution of the standard errors of

Figure 25 FACETS Table 6.0

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 6.0 All Facet Vertical Summary.

[The vertical map displays, on a common logit scale running from about -6 to +3: the 12 reviewers (+REVIEWERS, from R-SEVEN and R-NINE near the top down to R-ONE and R-FOUR at the bottom), the 16 papers (-PAPERS, from P-EIGHT at the top to P-SIX at the bottom), the 6 criteria (-CRITERIA, from ORGANIZATION at the top down to OVERALL RECOMM), and the expected score zones of the two rating scales under S.1 and S.2.]

Figure 26 FACETS Table 6.1

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 6.1 REVIEWERS Facet Summary.

[Distribution lines (M = mean, S = ±1 S.D., Q = ±2 S.D., X = ±3 S.D.) for the 12 reviewers: Logit measures, S.E., Infit and Outfit mean squares, Infit and Outfit standardized values, Count of ratings, Raw score, Average rating, and Fair average rating.]

Figure 27 FACETS Table 6.2

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 6.2 PAPERS Facet Summary.

[The same distribution lines as Figure 26 (Logit, S.E., Infit and Outfit mean squares and standardized values, Count, Raw score, Average, and Fair average), here for the 16 papers.]

Figure 28 FACETS Table 6.3

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 6.3 CRITERIA Facet Summary.

[The same distribution lines as Figures 26 and 27, here for the 6 criteria.]

measurement for the estimated measures in the previous line. The two MnSq lines show the distributions of the infit (weighted total fit statistic) and outfit (unweighted total fit statistic) mean squares. These mean squares have an expected value of 1 and a standard deviation that varies from reviewer to reviewer. The two Std lines give the cube-root normalizations of the mean squares in the two previous lines. These statistics have approximate unit normal distributions. Values larger than +2 are usually investigated. There is one reviewer with an infit statistic greater than +2 and two reviewers with an outfit statistic greater than +2. The "Count" line shows the number of ratings made by each reviewer. Since there are missing data in this analysis, not all of the values are 24 (4 papers x 6 items). One reviewer omitted one item and another reviewer omitted 4 items. The line labeled "Raw" shows the raw score distribution of the reviewers. The "Average" line shows the average response in the ordinal metric for each reviewer. The "Fair" line shows the average observed response adjusted for differences in rater severity. The fit information in Figure 27 shows that there is one paper with an infit and outfit statistic greater than +2.0. The fit information in Figure 28 shows that there is one criterion with an outfit value greater than +2.0.
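The MnSq and Std lines are linked by the cube-root (Wilson-Hilferty) normalization, of the kind examined in Smith (1991a). A sketch of both computations for one element's standardized residuals; the residual values below are invented to show how a few large residuals inflate the outfit statistic:

import math

def outfit_mnsq(z):
    """Unweighted fit: the mean of the squared standardized residuals."""
    return sum(v * v for v in z) / len(z)

def mnsq_to_std(mnsq, dof):
    """Cube-root normalization of a mean square whose sampling variance
    is approximately 2/dof; the result is roughly unit normal."""
    q = math.sqrt(2.0 / dof)
    return (mnsq ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

z = [0.3] * 20 + [6.0, 3.0, 3.0, 9.0]  # 24 ratings, four of them surprising
m = outfit_mnsq(z)
print(round(m, 1), round(mnsq_to_std(m, len(z)), 1))  # large MnSq, large Std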

The misfitting reviewers, papers, and criteria are identified in Figures 29, 30, and 31. FACETS Table 7.1.1 (See Figure 29) lists the measurement information for the reviewer facet. This is the same information that is graphically displayed in Figure 26. The information is presented in three panels. The first, on the left, contains the raw score information: the total raw score, the count of the number of ratings, the observed average for those ratings, and the fair (adjusted for rater severity) average. The second panel contains the calibration information. The reviewers are listed in measure order with the lenient raters (low scores) at the top and the severe raters (high scores) at the bottom. The logit measure and

Figure 29 FACETS Table 7.1.1

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 7.1.1 REVIEWERS Measurement Report (arranged by 1MN).

| Obsvd  Obsvd  Obsvd  Fair | Calib  Model | Infit      Outfit     |
| Score  Count  Avrge  Avrge| Logit  Error | MnSq  Std  MnSq  Std  | Nu REVIEWERS
|  29     24    1.2    1.1  | -5.25  0.55  | 0.7    0   0.4    0   |  4 R-FOUR
|  28     24    1.2    1.2  | -4.32  0.63  | 2.1    1   7.3    2   |  1 R-ONE
|  29     20    1.5    1.4  | -3.32  0.45  | 0.9    0   0.9    0   |  8 R-EIGHT
|  43     24    1.8    1.5  | -2.70  0.38  | 1.6    1   1.6    1   | 11 R-ELEVEN
|  40     24    1.7    1.8  | -1.87  0.37  | 0.9    0   0.9    0   | 10 R-TEN
|  44     24    1.8    1.8  | -1.63  0.37  | 0.6   -1   0.6   -1   | 12 R-TWELVE
|  48     24    2.0    1.9  | -1.39  0.36  | 0.5   -2   0.5   -2   |  5 R-FIVE
|  45     24    1.9    1.9  | -1.24  0.36  | 0.7   -1   0.6   -1   |  2 R-TWO
|  52     24    2.2    2.0  | -1.09  0.36  | 0.8    0   0.8    0   |  6 R-SIX
|  47     24    2.0    2.1  | -0.67  0.35  | 0.8    0   0.7    0   |  3 R-THREE
|  49     23    2.1    2.3  | -0.26  0.37  | 2.1    3   2.0    2   |  9 R-NINE
|  54     24    2.3    2.3  | -0.07  0.35  | 0.5   -2   0.4   -2   |  7 R-SEVEN
| 42.3   23.6   1.8    1.8  | -1.99  0.41  | 1.0  -0.4  1.4  -0.4  | Mean (Count: 12)
|  8.7    1.1   0.3    0.4  |  1.54  0.09  | 0.6   1.6  1.8   1.6  | S.D.

RMSE 0.42   Adj S.D. 1.49   Separation 3.56   Reliability 0.93
Fixed (all same) chi-square: 128.7   d.f.: 11   significance: .00
Random (normal) chi-square: 10.7   d.f.: 10   significance: .38

associated standard error are listed. The last panel lists the fit information. As noted in Figure 26, there are two reviewers with standardized fit statistics greater than +2.0. These are Reviewer One and Reviewer Nine. This is not unexpected, since these two reviewers were identified in Figure 23 as the only reviewers with individual residuals greater than |3|. Only one of the 23 residuals has caused Reviewer Nine to misfit, while four of the 24 residuals for Reviewer One caused the misfit. The amount of misfit in these data is relatively small. The separation reliability for the reviewer facet is listed at the bottom of the page. The value of 0.93 indicates that there is a strong differentiation of rater severity among these raters. The fixed chi-square at the bottom of the page tests the hypothesis that all of the reviewers have the same severity. This chi-square is highly significant, causing one to reject the hypothesis that all of these reviewers are rating with the same level of severity.
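These summary statistics can be recovered from the measures and standard errors in the table itself. A sketch under the usual definitions: the RMSE is the root mean square of the model errors, the adjusted S.D. removes error variance from the observed S.D. of the measures, separation is their ratio, and the fixed chi-square tests whether all elements share one measure.

import math

def separation_stats(measures, errors):
    """Separation, reliability, and fixed chi-square for one facet."""
    n = len(measures)
    mean = sum(measures) / n
    obs_var = sum((m - mean) ** 2 for m in measures) / n
    mse = sum(e * e for e in errors) / n            # RMSE squared
    adj_var = max(obs_var - mse, 0.0)               # "true" variance
    separation = math.sqrt(adj_var / mse)           # Adj S.D. / RMSE
    reliability = adj_var / (adj_var + mse)
    weights = [1.0 / (e * e) for e in errors]
    wmean = sum(w * m for w, m in zip(weights, measures)) / sum(weights)
    chi_sq = sum(w * (m - wmean) ** 2 for w, m in zip(weights, measures))
    return separation, reliability, chi_sq          # chi-square d.f. = n - 1

# Reviewer measures and model errors from Table 7.1.1 (Figure 29).
measures = [-5.25, -4.32, -3.32, -2.70, -1.87, -1.63,
            -1.39, -1.24, -1.09, -0.67, -0.26, -0.07]
errors = [0.55, 0.63, 0.45, 0.38, 0.37, 0.37,
          0.36, 0.36, 0.36, 0.35, 0.37, 0.35]
sep, rel, chi = separation_stats(measures, errors)
print(round(sep, 2), round(rel, 2), round(chi, 1))  # about 3.56, 0.93, 128.7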

The paper information in Table 7.2.1 (See Figure 30) is the crux of this application, since the purpose of the review was to select the best 12 papers for presentation at a conference. The information is provided in the same format as the reviewer information. The separation reliability is slightly less than for the reviewers, 0.88 vs. 0.93. This is primarily due to the fact that there were only three reviewers for each paper, while each reviewer rated four papers. Consequently, the standard errors for the papers are slightly larger than those for the reviewers (paper RMSE of 0.46 vs. reviewer RMSE of 0.42). Two papers misfit, Paper Two and Paper Six, the same two papers identified as having unexpected residuals in Figure 23. This small amount of misfit does not invalidate these ratings. The papers are listed in measure order from poorest rated (high scores) at the top to best rated (low scores) at the bottom. Since the measures have been adjusted for reviewer severity, they are a fair estimate of the paper quality. There is a clear break of almost 1 logit between Paper Four (at -0.54 logits) and Paper One (at -1.59 logits), signifying a clear cutoff between accepted (listed below) and rejected (listed above) papers. It is informative to compare the fair and observed average ratings for the papers. Any disordering in these two rank orderings indicates the extent to which reviewer severity changed the raw score results. In this case two papers would have changed from accepted to rejected and vice versa if the severity of the reviewers had not been taken into account. Paper Seven, with an observed rating of 2.1, would not have been accepted, and Paper Three, with an observed rating of 1.9, would have been accepted, had the assumption that all reviewers rated with the same severity been accepted. However, when reviewer severity was taken into account, the ordering of these two papers changed. This was due to the fact that Paper Three went to more lenient reviewers, by chance, while Paper Seven went to more severe reviewers.
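A fair average is the rating an element would be expected to receive from elements of average measure on the other facets, so it depends only on the element's own measure and removes the luck of the reviewer draw. The sketch below illustrates the idea; the sign convention and the reuse of the Table 8.1 step calibrations are simplifying assumptions, so the printed values are illustrative rather than a reproduction of the Fair column.

import math

STEPS = [-2.95, 0.26, 2.69]  # step calibrations reported in Table 8.1

def expected_rating(logit, steps, bottom=1):
    """Expected rating on categories bottom..bottom+len(steps)."""
    logits, run = [0.0], 0.0
    for step in steps:
        run += logit - step
        logits.append(run)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return sum((bottom + k) * e / total for k, e in enumerate(exps))

def fair_average(paper_measure, mean_severity=-1.99, mean_criterion=0.0):
    """Expected rating when a paper meets a reviewer of mean severity and
    a criterion of mean difficulty (assumed orientation of the facets)."""
    return expected_rating(mean_severity - paper_measure - mean_criterion, STEPS)

# A low-measure (poorly rated) and a high-measure (well rated) paper:
print(round(fair_average(-2.34), 2), round(fair_average(2.43), 2))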

Figure 30 FACETS Table 7.2.1

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 7.2.1 PAPERS Measurement Report (arranged by 2MN).

| Obsvd  Obsvd  Obsvd  Fair |Measure  Model | Infit      Outfit     |
| Score  Count  Avrge  Avrge|  Logit  Error | MnSq  Std  MnSq  Std  | Nu PAPERS
|  41     18    2.3    3.2  |  -2.34  0.42  | 2.1    2   2.0    2   |  6 P-SIX
|  46     18    2.6    3.0  |  -1.75  0.40  | 1.2    0   1.2    0   | 12 P-TWELVE
|  32     17    1.9    3.0  |  -1.62  0.45  | 0.6   -1   0.5   -1   |  3 P-THREE
|  38     18    2.1    3.0  |  -1.59  0.41  | 0.9    0   0.9    0   |  1 P-ONE
|  30     17    1.8    2.6  |  -0.54  0.44  | 0.7    0   0.8    0   |  4 P-FOUR
|  38     18    2.1    2.6  |  -0.47  0.40  | 0.9    0   0.8    0   |  7 P-SEVEN
|  32     17    1.9    2.5  |  -0.23  0.43  | 0.7   -1   0.8    0   |  9 P-NINE
|  36     18    2.0    2.4  |  -0.07  0.41  | 0.5   -1   0.5   -1   | 10 P-TEN
|  27     17    1.6    2.2  |   0.30  0.51  | 1.1    0   0.8    0   | 14 P-FOURTEEN
|  28     17    1.6    2.2  |   0.39  0.46  | 0.8    0   0.7    0   | 13 P-THIRTEEN
|  27     18    1.5    2.2  |   0.49  0.50  | 0.3   -2   0.2   -1   | 11 P-ELEVEN
|  28     18    1.6    2.2  |   0.51  0.48  | 0.5   -1   0.4   -1   | 15 P-FIFTEEN
|  25     18    1.4    1.9  |   1.29  0.49  | 1.1    0   1.1    0   | 16 P-SIXTEEN
|  26     18    1.4    1.8  |   1.57  0.52  | 1.7    1   9.0    3   |  2 P-TWO
|  31     18    1.7    1.8  |   1.61  0.43  | 1.0    0   1.1    0   |  5 P-FIVE
|  23     18    1.3    1.6  |   2.43  0.58  | 1.0    0   0.6    0   |  8 P-EIGHT
| 31.8   17.7   1.8    2.4  |  -0.00  0.46  | 0.9  -0.4  1.3  -0.2  | Mean (Count: 16)
|  6.2    0.5   0.3    0.5  |   1.31  0.05  | 0.4   1.3  2.0   1.4  | S.D.

RMSE 0.46   Adj S.D. 1.23   Separation 2.67   Reliability 0.88
Fixed (all same) chi-square: 129.9   d.f.: 15   significance: .00
Random (normal) chi-square: 14.9   d.f.: 14   significance: .38

The criteria measurement information in Table 7.3.1 (See Figure 31) provides similar information for the criteria on which the papers were evaluated. The separation reliability is a little lower, due primarily to the fact that the criteria measures are relatively homogeneous compared to the reviewer and paper measures. There is only one criterion with a standardized fit value greater than 2.0. That is item 6, the overall recommendation. Only one of the large residuals identified in Figure 23 was associated with this item, but it was the largest standardized residual found in this analysis. That one residual (+9) was large enough to cause this criterion to misfit. Recall that there are only 48 residuals in this fit statistic and that it is the outfit statistic that is most sensitive to

Figure 31 FACETS Table 7.3.1

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 7.3.1 CRITERIA Measurement Report (arranged by 3MN).

| Obsvd  Obsvd  Obsvd  Fair |Measure  Model | Infit      Outfit     |
| Score  Count  Avrge  Avrge|  Logit  Error | MnSq  Std  MnSq  Std  | N CRITERIA
|  77     48    1.6    2.4  |  -0.69  0.26  | 0.9    0   2.8    2   | 6 OVERALL RECOMM
|  86     43    2.0    2.6  |  -0.39  0.28  | 1.0    0   1.2    0   | 3 TECH MERIT
|  94     48    2.0    2.6  |  -0.31  0.27  | 0.9    0   1.1    0   | 5 AUD APPEAL
|  87     48    1.8    2.4  |   0.19  0.28  | 0.8    0   0.8    0   | 4 SCIENTIFIC IMPORT
|  84     48    1.8    2.3  |   0.44  0.28  | 0.8    0   0.8    0   | 2 PROBLEM
|  80     48    1.7    2.2  |   0.76  0.29  | 1.2    0   1.7    1   | 1 ORGANIZATION
| 84.7   47.2   1.8    2.4  |  -0.00  0.28  | 0.9  -0.3  1.4   0.6  | Mean (Count: 6)
|  5.4    1.9   0.1    0.1  |   0.50  0.01  | 0.1   0.6  0.7   1.2  | S.D.

RMSE 0.28   Adj S.D. 0.42   Separation 1.53   Reliability 0.70
Fixed (all same) chi-square: 20.0   d.f.: 5   significance: .00
Random (normal) chi-square: 5.0   d.f.: 4   significance: .29

unexpected outliers (Smith, 1986, 1991a). The category statistics are in Tables 8.1 and 8.2. The four-category rating scale used with criteria one through five is in Table 8.1 (See Figure 32). The frequency of category use is reported in the panel on the left of this table. The scale calibrations are in the second panel from the right, and the fit information is in the panel on the extreme right. The low frequency of use for category 4 (poor) suggests that there may be only three effective categories for this scale. The fit of the other three categories gives no cause for concern. The rating scale model appears to work for these data. A similar situation is found for the three-category rating scale used for criterion six. This information is in Table 8.2 (See Figure 33). Again, the lowest category is underutilized, but there is little problem with the fit of the step structure. The remaining tables are the bias analysis specified in the model statement (?B,?B,?,R) that controls this analysis. In this bias analysis all estimates developed in the main analysis just reported are fixed; then the program estimates the size of all residuals from the main analysis.

Figure 32 FACETS Table 8.1

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 8.1 Category Statistics.   Model = ?B,?B,?,R

| Categ |  DATA COUNTS   | MOST PROBABLE | THURSTONE    | EXPECTATION     | SCALE CALIBRATIONS | INFIT | OUTFIT | AVGE  |
| Label | Found Used  %  | from          | THRESHOLD at | Measure at +0.5 | Step Measure  S.E. | MNSQ  | MNSQ   | MEASR |
|   1   |  81   81   34% |  low          |  low         | (-4.05)         |                    |  1.1  |  1.5   | -3.67 |
|   2   | 116  116   49% |  -2.95        |  -2.99       |  -1.34          |  2  -2.95   .18    |  1.0  |   .5   | -1.77 |
|   3   |  34   34   14% |    .26        |   0.21       |   1.49          |  3    .26   .22    |  1.5  |  1.4   |  -.13 |
|   4   |   4    4    2% |   2.69        |   2.76       |  (3.82)         |  4   2.69   .55    |  2.3  |  2.3   |  1.74 |

[A scale structure graphic beneath the table locates the modal, median, and mean category boundaries on the relative logit scale from -5.0 to +4.0.]
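The category statistics in Tables 8.1 and 8.2 follow from the Rasch-Andrich rating scale structure: each step calibration is the point on the latent variable at which adjacent category probability curves cross, so it marks where the next category becomes the most probable response. A sketch using the step calibrations printed in Table 8.1:

import math

STEPS = [-2.95, 0.26, 2.69]  # Table 8.1 scale calibrations, categories 1-4

def category_probs(rel_logit, steps):
    """Rating scale category probabilities at a logit measured relative
    to the criterion difficulty."""
    logits, run = [0.0], 0.0
    for step in steps:
        run += rel_logit - step
        logits.append(run)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def most_probable_category(rel_logit, steps, bottom=1):
    probs = category_probs(rel_logit, steps)
    return bottom + max(range(len(probs)), key=probs.__getitem__)

# Below -2.95 category 1 is modal, between -2.95 and .26 category 2,
# between .26 and 2.69 category 3, and above 2.69 category 4.
for logit in (-4.0, -1.0, 1.5, 3.5):
    print(logit, most_probable_category(logit, STEPS))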

In this case, the analysis is looking for bias interactions between reviewers and papers (facets one and two). Table 9.2 (See Figure 34) contains the results of the bias iterations. There are potentially 192 bias terms in this analysis (12 reviewers x 16 papers). Table 10.2.1 (See Figure 35) indicates that there were no unexpected observations with standardized residuals greater than |3| in this analysis. Thus, the 5 residuals identified in Figure 23 no longer misfit when the bias is taken into account by the model. Table 11.2 (See Figure 36) contains summary information for the bias analysis and is similar to the information in Figure 24. In this case the standard deviation of the standardized residuals has dropped from 1.2 to 0.7 when the bias is included in the model. This indicates that most of the misfit in the data can be accounted for by the bias. Table 12.2 (See Figures 37a and 37b) contains the graphical summary of the bias analysis. This table is similar

Figure 33 FACETS Table 8.2

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 8.2 Category Statistics.   Model = ?,?,6,R3

| Categ |  DATA COUNTS  | MOST PROBABLE | THURSTONE    | EXPECTATION     | SCALE CALIBRATIONS | INFIT | OUTFIT | AVGE  |
| Label | Found Used  % | from          | THRESHOLD at | Measure at      | Step Measure  S.E. | MNSQ  | MNSQ   | MEASR |
|   1   |  27   27  56% |  low          |  low         | (-1.82)         |                    |   .9  |   .5   | -2.06 |
|   2   |  13   13  27% |  -.57         |  -0.79       |   0.00          |  2  -.57   .37     |   .9  |  9.0   | -1.25 |
|   3   |   8    8  17% |   .57         |   0.78       |  (1.84)         |  3   .57   .49     |   .9  |   .7   |  1.23 |

[A scale structure graphic beneath the table locates the modal, median, and mean category boundaries on the relative logit scale from -2.0 to +2.0.]

Figure 34 FACETS Table 9.2

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 9.2 Bias Iteration Report.

Bias/Interaction analysis specified by Model: ?B,?B,?,R
There are potentially 192 Bias terms

| Iteration | Max. Score Residual | Max. Logit Change |
|  BIAS 1   |      -3.7514        |      1.0000       |
|  BIAS 2   |      -3.3758        |      1.0000       |
|  BIAS 3   |      -2.5827        |     -1.0000       |
|  BIAS 4   |      -1.2610        |     -0.8113       |
|  BIAS 5   |       0.1066        |      0.0594       |
|  BIAS 6   |       0.0004        |      0.0003       |

Figure 35 FACETS Table 10.2.1

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 10.2.1 Responses unexpected after correcting for Bias.

*** No unexpected observation with StRes >= 3

Figure 36 FACETS Table 11.2

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 11.2 Bias/Interaction Measurement Summary.

|       Cat  Step  Exp.  Resd  StRes |
| Mean  1.8  1.8   1.8   -0.0  -0.0  |  (Count: 283)
| S.D.  0.7  0.7   0.6    0.4   0.7  |

Count of measurable responses = 283

to the graphical information in Figure 26. These data fit the Rasch three-facet rating scale model well. This means that all of the desirable properties of Rasch models discussed earlier in this chapter apply to this analysis. The facets model illustrated here has been shown to be useful in removing the effects of reviewers or judges from the evaluation process (Linacre, 1989; Lunz, Wright, and Linacre, 1990; Lunz and Stahl, 1993; and Fisher, 1993).
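Each bias term examined in these tables is a reviewer-by-paper interaction: the logit shift that, with every main-analysis estimate held fixed, makes the expected total for that reviewer-paper cell equal its observed total. A schematic sketch of that idea (the cell data, step values, and fixed logits are hypothetical):

import math

def expected_rating(logit, steps, bottom=1):
    """Expected rating on categories bottom..bottom+len(steps)."""
    logits, run = [0.0], 0.0
    for step in steps:
        run += logit - step
        logits.append(run)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return sum((bottom + k) * e / total for k, e in enumerate(exps))

def bias_term(cell, steps):
    """cell: (observed_rating, fixed_logit) pairs for one reviewer-paper
    combination. Solve sum(observed) = sum(expected at logit + b) for b
    by bisection; a positive b means higher (poorer) ratings than expected."""
    target = sum(obs for obs, _ in cell)
    lo, hi = -6.0, 6.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        total = sum(expected_rating(l + mid, steps) for _, l in cell)
        lo, hi = (mid, hi) if total < target else (lo, mid)
    return (lo + hi) / 2.0

# Hypothetical cell: a reviewer rates one paper 2 on four criteria where
# ratings below 2 were expected, producing a positive bias term.
cell = [(2, -2.4), (2, -2.1), (2, -2.2), (2, -2.5)]
print(round(bias_term(cell, [-2.95, 0.26, 2.69]), 2))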

Figure 37a FACETS Table 12.2

'PEER REVIEW PAPER EVALUATION'    10-22-1996 09:56:43
Table 12.2 Bias/Interaction Summary Report.

[Distribution lines (M = mean, S = ±1 S.D., Q = ±2 S.D., X = ±3 S.D.) for the estimated reviewer-by-paper bias terms: Bias Logit, S.E., Z-score, Infit and Outfit mean squares, and Counts.]

Figure 37b FACETS Table 12.2 Continued

[Continuation of the bias/interaction summary: observed Score, Expected score, and their Difference, plus observed Average, expected Average, and their Difference.]

References

Andrich, D. (1978). A binomial latent trait model for the study of Likert-style attitude questionnaires. British Journal of Mathematical and Statistical Psychology, 31, 84-98.

Andrich, D. (1988). Rasch models for measurement. Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-068. Beverly Hills: Sage Publications.

Choppin, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.

Fisher, A.G. (1993). The assessment of IADL motor skills: An application of many-faceted Rasch analysis. American Journal of Occupational Therapy, 47, 213-329.

Hays, W.L. (1988). Statistics (4th ed.). Fort Worth: Holt, Rinehart and Winston.

Kuhn, T.S. (1970). The structure of scientific revolutions (2nd edition). Chicago: University of Chicago Press.

Linacre, J.M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J.M. and Wright, B.D. (1993). A user's guide to FACETS: Many-facet Rasch analysis. Chicago: MESA Press.

Linacre, J.M. (1998). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2, 266-283.

Lunz, M.E., Wright, B.D., and Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331-345.

Lunz, M.E. and Stahl, J.A. (1993). The effect of rater severity on person ability measures: A Rasch model analysis. American Journal of Occupational Therapy, 47, 311-317.

Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960. (Expanded edition, Chicago: University of Chicago Press, 1980)

Smith, R.M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46, 359-372.

Smith, R.M. (1988). The distributional properties of Rasch standardized residuals. Educational and Psychological Measurement, 48, 657-667.

Smith, R.M. (1991a). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541-565.

Smith, R.M. (1991b). IPARM: Item and person analysis with the Rasch model. Chicago: MESA Press.

Smith, R.M. (1992). Applications of Rasch measurement. Chicago: MESA Press.

Smith, R.M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3, 25-40.

Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Wright, B.D. (1996). Comparing Rasch measurement and factor analysis. Structural Equation Modeling, 3, 3-24.

Wright, B.D. and Bell, S.R. (1984). Item banks: What, why, how. Journal of Educational Measurement, 21, 331-345.

Wright, B.D. and Linacre, J.M. (1995). BIGSTEPS: Rasch analysis for all two-facet models. Chicago: MESA Press.

Wright, B.D. and Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.

Wright, B.D. and Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B.D. and Stone, M. (1979). Best test design. Chicago: MESA Press.

Complete FACETS Output

FACETS Version No. 2.81   Copyright (c) 1987-1994, John M. Linacre

[The complete nine-page FACETS output file for the 'PEER REVIEW PAPER EVALUATION' analysis (10-22-1996 09:56:43), reproducing Tables 1 through 12.2 in full; the substance of each table is shown in Figures 20 through 37b above.]