Language Testing ( آزمون سازی زبان )


[Preface in Persian, too garbled in this scan to reconstruct; it credits the works of Arthur Hughes, James Dean Brown, Douglas Brown, and Heaton as the book's principal sources.]

Contents

Chapter 1: Preliminaries of Language Testing
  WHY TESTING
  BENEFITS/IMPORTANCE OF TESTING
  MEASUREMENT, TEST, EVALUATION
  ASSESSMENT
  NORM-REFERENCED vs. CRITERION-REFERENCED TESTS
  TEACHER-MADE TESTS vs. STANDARDIZED TESTS
  THE CONSEQUENCES OF STANDARDIZED TESTING
  WASHBACK
  TEST BIAS
  ETHICAL ISSUES: CRITICAL LANGUAGE TESTING
  AUTHENTICITY
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 2: Language Test Functions
  TWO MAJOR FUNCTIONS OF LANGUAGE TESTS
  CONTRASTING CATEGORIES OF LANGUAGE TESTS
  COMPUTER-ADAPTIVE TESTING
  A GENERAL FRAMEWORK
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 3: Forms of Language Tests
  STRUCTURE OF AN ITEM
  CLASSIFICATION OF ITEM FORMS
  TYPES OF ITEMS
  ALTERNATIVE vs. TRADITIONAL ASSESSMENT
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 4: Basic Statistics in Language Testing
  STATISTICS
  TYPES OF DATA
  TABULATION OF DATA
  GRAPHIC REPRESENTATION OF DATA
  DESCRIPTIVE STATISTICS
  NORMAL DISTRIBUTION
  DERIVED SCORES
  CORRELATION
  CORRELATIONAL INDEXES
  CORRELATIONAL FORMULAS
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 5: Test Construction
  DETERMINING FUNCTION AND FORM OF THE TEST
  PLANNING
  PREPARING ITEMS
  REVIEWING
  PRETESTING
  VALIDATION
  EXTRA POINTS TO REMEMBER
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 6: Characteristics of a Good Test
  RELIABILITY: THE GENERAL CONCEPT
  RELIABILITY IN TESTING: CLASSICAL TRUE SCORE THEORY (CTS)
  APPROACHES TO ESTIMATING RELIABILITY
  FACTORS INFLUENCING RELIABILITY
  STANDARD ERROR OF MEASUREMENT
  OTHER RELIABILITY THEORIES
  RELIABILITY OF CRITERION-REFERENCED TESTS
  VALIDITY
  FACTORS INFLUENCING VALIDITY
  THE RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY
  PRACTICALITY
  EXTRA POINTS TO REMEMBER
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 7: History of Language Testing
  GRAMMAR-TRANSLATION APPROACH
  DISCRETE-POINT APPROACH
  INTEGRATIVE APPROACH
  FUNCTIONAL-COMMUNICATIVE APPROACH
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 8: Cloze and Dictation Type Tests
  CLOZE PROCEDURE
  VARIETIES OF CLOZE TEST
  CLOZE TASK
  SCORING A CLOZE TEST
  DICTATION
  VARIETIES OF DICTATION
  SCORING A DICTATION
  VALIDITY OF CLOZE AND DICTATION
  RELIABILITY OF CLOZE AND DICTATION
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 9: Communicative-Functional Testing
  SELECTION OF THE FUNCTION
  SOCIAL FACTORS
  THE PERFORMANCE CRITERIA
  DEVELOPING TEST STEM
  SCORING SYSTEM
  EXTRA POINTS TO REMEMBER
  State University Questions and Answers
  Azad University Questions and Answers

Chapter 10: A Sketch of Testing the Four Skills
  TESTING LISTENING COMPREHENSION
  TESTING ORAL PRODUCTION
  TESTING READING COMPREHENSION
  TESTING WRITING
  State University Questions and Answers
  Azad University Questions and Answers

MA 98 Questions
MA 98 Answers
REFERENCES

Chapter 1
Preliminaries of Language Testing

• Why Testing
• Benefits/Importance of Testing
• Measurement, Test, Evaluation
• Assessment
• Norm-Referenced vs. Criterion-Referenced Tests
• Teacher-Made Tests vs. Standardized Tests
• The Consequences of Standardized Testing
• Washback (or Backwash)
• Test Bias
• Ethical Issues: Critical Language Testing
• Authenticity

Preliminaries of Language Testing

If you hear the word test in any classroom setting, your thoughts are not likely to be positive, pleasant, or affirming. The anticipation of a test is almost always accompanied by feelings of anxiety and self-doubt, along with a hope that you will come out of it alive. Tests seem as unavoidable as tomorrow's sunrise in virtually every kind of educational setting. For all the inconvenience and trouble a test brings, why do we test? What are the benefits of testing?

1. WHY TESTING
Education is the most important enterprise in any society. In fact, a considerable amount of budget, time and energy is put into it every year by governments. More than one-fourth of the nation's population attends school. Education is truly a giant and important undertaking and, therefore, it is crucial that its process and products be evaluated. In fact, evaluation is a major consideration in any educational setting:
• Students, teachers, administrators and parents all work toward achieving educational goals and it is quite natural that they want to ascertain the degree to which those goals have been realized. In this sense, testing serves as a monitoring device for learning.
• Government and private sectors, which pay teachers and which afterwards employ the students, are interested in having precise information about students' abilities.
• Most importantly, through testing, accurate information is obtained on the basis of which educational decisions are made (from the entrance exam to the universities to placing students in the right level). When a decision is made, whether the decision is great or small, it should be based on as much and as accurate information as possible. The more accurate the information upon which a decision is made, the better that decision is likely to be.

2. BENEFITS/IMPORTANCE OF TESTING
Tests can benefit students, teachers, and even administrators by confirming progress that has been made and showing how we can best redirect our future efforts. Tests can benefit students in the following ways:
• Testing can create a positive attitude toward class and will motivate students in learning the subject matter. Tests of appropriate difficulty, announced well in advance and covering skills scheduled to be evaluated, can contribute to a positive tone, and also create a sense of achievement by demonstrating the teacher's spirit of fair play and consistency with course objectives.
• Testing can help students prepare themselves and thus learn the materials in three ways. First, learners are helped when they study for exams and again when exams are returned and discussed. Next, where several tests are given, learning can be enhanced by students' growing awareness of the objectives and the areas of emphasis in the course. Finally, tests can foster learning by their diagnostic characteristics; they confirm what each person has mastered, and they point up those language items needing further attention.
• Since tests tend to direct students' learning efforts toward the objectives being measured, they can be used as tools for increasing the retention and transfer of classroom learning, if tests are aimed at measuring learning outcomes at the understanding, application, and interpretation levels rather than the knowledge level. By including measures of these more complex learning outcomes in our tests, we can direct attention to their importance.
• A major aim of all education is to assist individuals to understand themselves better so that they can make more intelligent decisions and can more effectively evaluate their own performance. Periodic testing gives them an insight into the things they can do well and the misconceptions that need correction. Such information provides students with a more objective basis for planning their study program, for selecting future educational experiences, and for developing self-evaluation skills.

Testing can also benefit teachers:
• Testing helps teachers to diagnose their efforts in teaching. It answers the question, "Have I been effective in my instruction?" and therefore testing enables teachers to increase their own effectiveness by making adjustments in their teaching to enable certain groups of students or individuals in the class to benefit more. As we record the test scores, we might well ask the following questions: Are my lessons on the right level? Am I aiming my instruction too low or too high? Am I teaching some skills effectively but others less effectively? What areas do we need more work on? Which points need reviewing?
• Testing can also help teachers gain insight into ways to improve the evaluation process itself: Were the test instructions clear? Was everyone able to finish in the allotted time? Did the test cause unnecessary anxiety or resentment?

3. MEASUREMENT, TEST, EVALUATION
Before we look at tests and test design in second language education, we need to understand three basic interrelated concepts: measurement, test, and evaluation. These terms are sometimes used interchangeably, but some educators make distinctions among them.

Measurement is the process of quantifying the characteristics of persons according to explicit procedures and rules.
• Quantification involves the process of assigning numbers, and this distinguishes measures from qualitative descriptions such as a verbal account or visual representation.


Non-numerical categories or rankings such as letter grades (A, B, C, ...) may have the characteristics of measurement because their focus of attention is comparison of testees.
• Characteristics: We can assign numbers to both physical and mental characteristics of persons. Physical attributes such as height and weight can be observed directly. In testing, however, we are almost always interested in quantifying mental attributes and abilities, sometimes called traits or constructs, which can only be observed indirectly. These mental attributes include characteristics such as aptitude, intelligence, motivation, attitude, native language, fluency in speaking, and achievement in reading comprehension.
• Rules and procedures: Haphazard assignment of numbers to characteristics of individuals cannot be regarded as measurement. In order to be considered a measure, an observation of an attribute must be replicable, for other observers, in other contexts and with other individuals. Practically, anyone can rate another person's speaking ability. But while one rater may focus on pronunciation accuracy, another may find vocabulary to be the most salient feature. Such ratings are not considered measurement because the different raters in this case did not follow the same criteria or procedures for arriving at their ratings. Measures are characterized by the explicit procedures and rules upon which they are based. There are many different types of measures in the social sciences, including observations, rankings, rating scales, and tests.

Test is a measurement instrument. Test often connotes the presentation of a set of questions to be answered, to obtain a measure (that is, a numerical value) of a characteristic (that is, a mental attribute or ability) of a person in a given domain (language, math, etc.). What distinguishes a test from other types of measurement is that it is designed to obtain a specific sample of behavior from which one can make inferences about certain characteristics of an individual. Let's review two examples to illustrate the difference between measurement and test. A qualified interviewer might be able to rate an individual's oral proficiency in a given language according to a rating scale, on the basis of several years' informal contact with that individual, and this could constitute a measure of that individual's oral proficiency. This measure could not be considered a test, however, because the rater did not use an elicitation procedure (e.g. a set of activities or a set of questions) to obtain a specific sample of behavior. Or, the rating of a collection of personal letters based on a rating scale is considered measurement, while asking a person to write an argumentative editorial (to elicit a specific sample of behavior) for a news magazine constitutes a test.

Evaluation has been defined in a variety of ways:
1) The process of delineating, obtaining, and providing useful information for judging decision alternatives.
2) The determination of the congruence between performance and objectives.
3) A process that allows one to make a judgment about the desirability or value of a measure.
Generally, the purpose of evaluation is to gather information systematically for the purpose of making decisions. This information need not be exclusively quantitative. Verbal descriptions, ranging from performance profiles to letters of reference, as well as overall impressions, can provide important information for evaluating individuals, as can measures, such as ratings and test scores.

The relationship among measurement, test, and evaluation is illustrated in the following figure. As can be seen, all tests are measurement but not all tests are evaluation.

[Figure: Relationship among measurement, test, and evaluation]

Note: It is important to point out that we never measure or evaluate people. We measure or evaluate characteristics or properties of people such as mental attributes and abilities.
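To see why "explicit procedures and rules" matter, here is a minimal sketch of how a speaking rating becomes a measure once every rater applies the same fixed criteria. The rubric criteria, the 0-5 scale, and the weights are hypothetical illustrations, not taken from this book:

```python
# A minimal sketch of measurement as rule-governed quantification.
# The criteria, 0-5 scale, and weights below are hypothetical
# illustrations, not taken from this book.

RUBRIC_WEIGHTS = {
    "pronunciation": 0.4,
    "vocabulary": 0.3,
    "fluency": 0.3,
}

def rate_speaking(ratings):
    """Turn per-criterion ratings (0-5) into a single score by a fixed rule.

    Because the criteria, scale, and weights are explicit, two raters who
    observe the same performance and follow the procedure arrive at the
    same number; that replicability is what turns a rating into a measure,
    as opposed to one rater attending to pronunciation and another to
    vocabulary.
    """
    for criterion in RUBRIC_WEIGHTS:
        if criterion not in ratings or not 0 <= ratings[criterion] <= 5:
            raise ValueError(f"rating for {criterion!r} violates the procedure")
    return sum(w * ratings[c] for c, w in RUBRIC_WEIGHTS.items())

# Two raters following the same explicit procedure produce the same measure:
print(rate_speaking({"pronunciation": 4, "vocabulary": 3, "fluency": 5}))  # 4.0
```

A haphazard rating, by contrast, has no such shared rule, so two observers of the same performance can legitimately disagree, which is exactly why it does not count as measurement.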

4. ASSESSMENT
Assessment is appraising or estimating the level or magnitude of some attribute of a person. In educational practice, assessment is an ongoing process that encompasses a wide range of methodological techniques. Whenever a student responds to a question, offers a comment, or tries out a new word or structure, the teacher subconsciously makes an appraisal of the student's performance. Assessment can be classified on two continua: informal/formal and formative/summative.

4.1. Informal vs. Formal Assessment

Informal assessment can take a number of forms, starting with incidental, unplanned comments and responses, along with coaching and other impromptu feedback to the student. Examples include saying "Nice job!"; "Did you say can or can't?"; or putting a smiley face on some homework. Informal assessment does not stop there. A good deal of a teacher's informal assessment is embedded in classroom tasks designed to elicit performance without recording results and making fixed conclusions about a student's competence. Informal assessment is virtually always non-judgmental, in that you as a teacher are not making ultimate decisions about the student's performance. Examples at this end of the continuum are marginal comments on papers, responding to a draft of an essay, offering advice about how to better pronounce a word, or suggesting a strategy for compensating for a reading difficulty.

On the other hand, formal assessments are exercises or procedures specifically designed to tap into a storehouse of skills and knowledge. They are systematic, planned sampling techniques constructed to give teacher and student an appraisal of student achievement.

Note: Is formal assessment the same as a test? We can say that all tests are formal assessments, but not all formal assessment is testing. For example, a systematic set of observations of a student's frequency of oral participation in class is certainly a formal assessment, but it is hardly what anyone would call a test.


4.2. Formative vs. Summative Assessment

Formative tests are given (at the end of a small segment of material) to evaluate students in the process of 'forming' their competencies and skills, with the goal of helping them to continue that growth process. The key to such formation is the delivery (by the teacher) and internalization (by the student) of appropriate feedback on performance, with an eye toward the future continuation of learning. Subsequently, compensatory exercises and activities are provided to help the students fill in the gaps in their learning. Formative tests are either self-graded or no grade is given.

Note: For all practical purposes, virtually all kinds of informal assessment are (or should be) formative. They have as their primary focus the ongoing development of the learner's language. So when you give a student a comment or a suggestion, or call attention to an error, that feedback is offered to improve the learner's language ability.

Summative tests are given at the end of a course or unit of instruction and the results are used primarily for assigning course grades, or for certifying student mastery of the instructional objectives. Such tests measure or sum up what the students have learned from the course.

Note: Formative testing is ongoing and implies the observation of the "process" of learning, while summative testing is concerned with the "product" of learning.

5. NORM-REFERENCED TEST vs. CRITERION-REFERENCED TEST
This distinction refers to different interpretations of scores. After a test has been administered and the scores have been computed, the basic issue is how we derive meaning from the scores. To attain interpretive results, two ways of interpretation are identified: norm-referenced and criterion-referenced. When test scores are interpreted in relation to the performance of other testees or a particular group of testees, we speak of a norm-referenced interpretation. If, on the other hand, they are interpreted with respect to a specific level or domain of ability, we speak of a criterion-referenced interpretation.

In norm-referenced tests (NRT), test results may be interpreted with reference to the performance of a given group, or norm. The 'norm group' is typically a large group of individuals who are similar to the individuals for whom the test is designed. In the development of NRTs, the norm group is given the test, and then the characteristics, or norms, of this group's performance are used as reference points for interpreting the performance of other students who take the test. In other cases, NR test results are interpreted and reported solely with reference to the actual group taking the test, rather than to a separate norm group. Perhaps the most familiar example of this is what is sometimes called grading on the curve, where, say, the top ten percent of the students receive an 'A' on the test and the bottom ten percent fail, irrespective of the absolute magnitude of their scores.

Evaluation may, at times, be simply carried out to determine whether testees have achieved certain objectives; it is not intended to differentiate testees. Criterion referencing, as this approach is called, focuses on what the testees can do with what they know. An example would be the case in which students are evaluated in terms of their relative degree of mastery of course content, rather than with respect to their relative ranking in the class. Hence, a basic concern in developing a CRT is that it adequately represents the criterion ability level or content domain to be evaluated. Instead of comparing a person's performance to that of others, his performance is compared to a predetermined criterion or standard. Often but not always, the application of CRTs involves the use of cut-off scores that separate competent from incompetent examinees. Usually, a testee passes the test only when he has given the right answer to a specific number of test items. Since the purpose of testing is to see if the testee has arrived at mastery, a higher score would make no difference.

The two families of tests can be compared point by point:

Type of Interpretation
  Norm-referenced: Relative (a student's performance is compared to those of all other students in percentile terms).
  Criterion-referenced: Absolute (a student's performance is compared only to the amount, or percentage, of material learned).
Type of Scores Reported
  Norm-referenced: A percentile rank or standard scores such as z-score, T-score, stanine score, etc.
  Criterion-referenced: A statement of whether or not a student has achieved a predetermined percentage or number correct.
Type of Measurement
  Norm-referenced: To measure general language abilities or proficiencies.
  Criterion-referenced: To measure specific objectives-based language points.
Purpose of Testing
  Norm-referenced: Spread students out along a continuum of general abilities or proficiencies.
  Criterion-referenced: Assess the amount of material known or learned by each student.
Distribution of Scores
  Norm-referenced: Normal distribution of scores around the mean.
  Criterion-referenced: Varies; often non-normal. Students who know the material should score 100%.
Test Structure
  Norm-referenced: A few relatively long subtests with a variety of item contents.
  Criterion-referenced: A series of short, well-defined subtests with similar item contents.
Knowledge of Questions
  Norm-referenced: Students have little or no idea of what content to expect in test items.
  Criterion-referenced: Students know exactly what content to expect in test items.
Missed Items
  Norm-referenced: When a great number of testees miss an item, it is eliminated from the test.
  Criterion-referenced: When a test item is missed by a great number of testees, the instructional materials are revised or additional work is given.
Example of Test Interpretation
  Norm-referenced: "You performed better on this test than approximately 75% of the students in the group against which you are being compared."
  Criterion-referenced: "You have answered 60% of the items for this unit correctly, so you may move on to the next unit."

Note: NRT helps administrators and teachers make program-level decisions, such as admission, proficiency and placement decisions, and the other family helps teachers make classroom-level decisions (that is, assessing what the students have learned through diagnostic or achievement testing).
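The contrast between the two interpretations can be made concrete with a short sketch. The class scores, the 30-item test length, and the 80% cut-off below are hypothetical illustrations: the same raw score of 24 is read relative to the group under a norm-referenced interpretation, and against a fixed standard under a criterion-referenced one.

```python
# Contrast of norm-referenced vs. criterion-referenced interpretation.
# The group scores, test length, and 80% cut-off are hypothetical.
from statistics import mean, pstdev

scores = [12, 15, 18, 20, 21, 24, 26, 28, 29, 30]  # raw scores of the group
raw, total, cutoff = 24, 30, 0.80

# Norm-referenced: locate the testee relative to the other testees.
percentile = 100 * sum(s < raw for s in scores) / len(scores)
z = (raw - mean(scores)) / pstdev(scores)  # a derived (standard) score
print(f"NRT: better than {percentile:.0f}% of the group, z = {z:+.2f}")

# Criterion-referenced: compare only to the predetermined standard.
pct_correct = raw / total
print(f"CRT: {pct_correct:.0%} correct ->",
      "mastery" if pct_correct >= cutoff else "no mastery")
```

Note that the norm-referenced reading changes whenever the comparison group changes, while the criterion-referenced reading stays the same as long as the cut-off does.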

6. TEACHER-MADE TEST vs. STANDARDIZED TEST
In any consideration of educational testing, a distinction must be drawn between teacher-made and standardized instruments. A teacher-made test is a small-scale, classroom test which is generally prepared, administered and scored by one teacher. In this situation, test objectives are based directly on course objectives, and test content is derived from specific course content. Such tests have the following advantages:
• They measure students' progress based on the classroom activities.
• They provide an opportunity for the teacher to diagnose students' weaknesses concerning a given subject matter.
• They help the teacher make plans for remedial instruction, if needed.
• They motivate students.

On the other hand, standardized tests are commercially prepared by skilled test-makers and measurement experts. They provide methods of obtaining samples of behavior under uniform procedures. By a uniform procedure it is meant that the same fixed set of questions is administered with the same set of directions, time restrictions, and scoring procedures. Scoring is usually based on an objective procedure. Such tests have a wide range of coverage, that is, they cover more material. They are used to assess either one year's learning or more than one year's learning. Most elementary and secondary schools in the US have standardized achievement tests to measure children's mastery of the standards or competencies that have been prescribed for specified grade levels. College entrance exams such as the Scholastic Aptitude Test (SAT®) are part of the educational experience of many high school seniors seeking further education in the US. Examples of standardized language proficiency tests are TOEFL and IELTS.

Comparison of teacher-made and standardized tests:

Direction for Administration and Scoring
  Teacher-made: Usually no uniform directions specified.
  Standardized: Specific, culture-free directions for every testee to understand; standardized administration and scoring procedures.
Sampling of Content
  Teacher-made: Both content and sampling are determined by the classroom teacher.
  Standardized: Content determined by curriculum and subject-matter experts; involves extensive investigations of existing syllabi, textbooks, and programs; sampling of content done systematically.
Construction
  Teacher-made: May be hurried and haphazard; often no test blueprints, item tryouts, item analysis or revision; quality of test may be quite poor.
  Standardized: Uses meticulous construction procedures that include constructing objectives and test blueprints, employing item tryouts, item analysis, and item revisions.
Norms
  Teacher-made: Only local classroom norms are available, i.e. they are determined by the school or a department.
  Standardized: In addition to local norms, standardized tests typically make available national and school-district norms.
Purpose and Use
  Teacher-made: Best suited for measuring particular objectives set by the teacher and for intra-class comparisons.
  Standardized: Best suited for measuring broad curriculum objectives and for inter-class, school and national comparisons.
Quality of Items
  Teacher-made: Unknown; usually lower than standardized tests due to the limited time and skill of the teacher.
  Standardized: High; written by specialists, pretested and selected on the basis of effectiveness.
Reliability
  Teacher-made: Unknown; usually high if carefully constructed.
  Standardized: High.
Type of Interpretation
  Teacher-made: Criterion-referencing.
  Standardized: Norm-referencing.

7. THE CONSEQUENCES OF STANDARDIZED TESTING
The widespread global acceptance of standardized tests as valid procedures for assessing individuals in many walks of life brings with it a set of consequences that fall under the category of consequential validity. Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test's interpretation and use. One of the aspects of consequential validity which has drawn special attention is the effect of test preparation courses and manuals on performance. McNamara cautions against test results that may reflect socioeconomic conditions such as opportunities for coaching, that are "differentially available to the students being assessed (for example, because only some families can afford coaching, or because children with more highly educated parents get help from their parents)."

8. WASHBACK (or Backwash)

A facet of consequential validity is washback. Consider the following scenario: you are working in an institution that gets more funding if the number of students reaching a certain standard on the standardized test at the end of the year increases. As a result, at the end of the year, your director will be keeping tabs on how many of your students make the standard for funding. Do you think that would affect your teaching? How much would your teaching change? Would you be more likely to teach material that is related to the test? Material that you know will actually be found on the test? This cluster of issues is about washback. Washback (also called measurement-driven instruction, curriculum alignment, or backwash) generally refers to the effects the tests have on instruction/pedagogy/learning/education in terms of how students prepare for the test. 'Cram courses' and 'teaching to the test' are examples of such washback. Another form of washback that occurs more in classroom assessment is the information that washes back to students in the form of useful diagnoses of strengths and weaknesses. Students' incorrect responses can become windows of insight into further work. Their correct responses need to be praised, especially when they represent accomplishments in a student's interlanguage. Teachers can suggest strategies for success as part of their coaching role. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment.


Washback can vary along two dimensions: in terms of degree (from strong to weak) and in terms of kind (positive or negative). The degree and kind of washback depend on: the degree to which the test counters current teaching practices, what teachers and textbook writers think are appropriate test preparation methods, how much teachers and textbook writers are willing and able to innovate, and the status of the test (and the level of stakes involved). The issue of stakes is divided into low-stakes versus high-stakes situations. Low-stakes situations typically involve classroom testing, which is being used for learning purposes or research. For students, high-stakes situations usually involve more important decisions like admissions, promotion, placement, or graduation decisions that are directly dependent on test scores. The washback effect is obviously much stronger in high-stakes situations than in low-stakes situations. In terms of kind, we have the following definitions:

• Negative (or harmful) washback is said to occur when test items are based on an outdated view of language which bears little relationship to the teaching curriculum, i.e. when the test content and testing techniques are at variance with the objectives of the course. An instance of this would be where students are following an English course which is meant to train them in the language skills necessary for university study in an English-speaking country, but where the language test which they have to take in order to be admitted to a university does not test those skills directly. If the skill of writing, for example, is tested only by multiple-choice items, then there is great pressure to practice such items rather than practice the skill of writing itself.
• Positive (or beneficial) washback is said to result when a testing procedure encourages good teaching practice. For example, the consequence of many reading comprehension tests is a possible development of the reading skills. As another example, the use of an oral interview in a final examination may encourage teachers to practice conversational language use with their students.

A number of suggestions have been made over the years for ways to promote positive washback. The following list is adapted from Brown (2005, p. 254).

Test design
1. Base the test on sound theories of language teaching.
2. …
3. Sample widely and unpredictably.
4. …
5. Use a variety of examination formats, including written, oral, and practical.
6. Use direct testing.
7. Foster learner autonomy and self-assessment.
8. Use criterion-referenced tests.

Test content
1. Test the abilities whose development you want to encourage.
2. Use more open-ended items.
3. Make examinations reflect the full curriculum, not merely a limited aspect of it.
4. Assess higher-order cognitive skills to ensure they are taught.
5. Base achievement tests on objectives.
6. Do not limit skills to be tested to academic areas.
7. Use authentic tasks and texts.

Logistics
1. Insure that test-takers, teachers, administrators, and curriculum designers understand the purpose of the test.
2. Make sure language-learning goals are clear.
3. Where necessary, provide assistance to teachers to help them understand the tests.
4. Provide feedback to teachers and others so meaningful change can be effected.
5. Provide detailed and timely feedback to schools on levels of pupils' performance and areas of difficulty in public examinations.
6. Make sure teachers and administrators are involved in different phases of the testing process because they are the people who will have to make changes.
7. Provide detailed score reporting.

Interpretation/Analysis
1. Make sure exam results are believable, credible, and fair to test takers and score users.
2. Consider factors other than teaching effort in evaluating published examination results and national rankings.
3. Conduct predictive validity studies of public examinations.
4. Improve the professional competence of examination authorities, especially in test design.
5. Insure that each examination board has a research capacity.
6. Have testing authorities work closely with curriculum organizations and with educational administrators.
7. Develop regional professional networks to initiate exchange programs and to share common interests and concerns.

9. TEST BIAS
It is no secret that standardized tests involve a number of types of test bias. Some of the sources of bias are background knowledge, native language, cultural background, race, gender, age, cognitive characteristics, and learning styles. A test or item can be considered to be biased if one particular section of the candidate population is advantaged or disadvantaged by some feature of the test or item which is not relevant to what is being measured. An item that is biased against one group of people is testing something in addition to what it was originally designed to test, and such an item cannot provide clear and easily interpretable information. For instance, consider an IQ item where the answer hinges on understanding the differences between the terms rain, snow, sleet, and hail. Such an item might naturally be biased against students who grew up in a tropical area because many of them have never seen anything resembling snow, sleet, or hail. An obvious example of bias is shown by the item below, which appeared in the State Examination of English Language for Elementary Level:

Mia: What should I do with this martabak?
Mom: Just put them on a (a) drawer, (b) plate, (c) stove, (d) mug

Examinees who come from Java or are somehow familiar with Indonesian culture will find the item above easy to answer. Yet, for some others coming from different areas, the word martabak may be entirely new, and therefore they do not know whether a martabak refers to a kind of stationery, food, a cooking device, or a kind of beverage. Despite their excellent English proficiency, there is no way they can get at the right answer. The item, in other words, is culturally biased against these examinees.

Let's examine another example, adapted from a listening comprehension item taken from the TOEFL Test Preparation Kit Workbook:

Man: I'm taking up a collection for the jazz band. Would you like to give?
Woman: Just a minute while I get my wallet.
(narrator) What will the woman probably do next?
a. Put some money in her wallet
b. Buy a band-concert ticket
c. Make a donation
d. Lend the man some money

The right answer is c. However, it is very unlikely that examinees of non-Western culture are familiar with the habit of collecting money for a band in the US culture. Therefore, being largely unfamiliar with the meaning of "taking up a collection" and looking at the word "give" from the man, they may be misled into thinking that the answer is d. Alternatively, if they have no idea whatsoever that a band in the US may need to collect some money, they may choose b.

Note: Fairness can be defined as the degree to which a test treats every student the same, or the degree to which it is impartial. Teachers would generally like to ensure that their personal feelings do not interfere with fair assessment of the students or bias the assignment of scores. The aim in maximizing objectivity is to give each student an equal chance to do well. Equitable treatment in terms of testing conditions, access to practice materials, performance feedback, retest opportunities, and other features of test administration, including providing reasonable accommodation for test takers with disabilities when appropriate, are important aspects of fairness under this perspective. This tendency to seek objectivity has led to the proliferation of 'objective' tests which minimize the possibility of varying treatment for different students.

10. ETHICAL ISSUES: CRITICAL LANGUAGE TESTING
Shohamy sees the ethics of testing as an extension of what educators call critical pedagogy, or more precisely in this case, critical language testing. For a better understanding of critical language testing, we need to know what critical pedagogy is. As language teachers we have to remember that we are all driven by convictions about what this world should look like, how its people should behave, how its governments should control that behavior, and how its inhabitants should be partners in the stewardship of the planet. We embody in our teaching a vision of a better and more humane life. However, critical pedagogy brings with it the reminder that our learners must be free to be themselves, to think for themselves, to behave intellectually without coercion from a powerful elite, to cherish their beliefs and traditions and cultures without the threat of forced change. In our classrooms, where the dynamics of power and domination permeate the fabric of classroom life, we are alerted to a possible covert political agenda beneath our overt technical agenda. One of the byproducts of a rapidly growing testing industry is the danger of an abuse of power. As Shohamy claims, "Tests represent a social technology deeply embedded in education, government and business; as such they provide the mechanism for enforcing power and control. Tests are most powerful as they are often the single indicators for determining the future of individuals". Proponents of a critical approach to language testing claim that large-scale standardized testing is not an unbiased process, but rather is the "agent of cultural, social, political, educational, and ideological agendas that shape the lives of individual participants, teachers, and learners". The issues of critical language testing are numerous:
• Psychometric traditions are challenged by interpretive, individualized procedures for predicting success and evaluating ability.
• Test designers have a responsibility to offer multiple modes of performance to account for varying styles and abilities among test-takers.
• Tests are deeply embedded in culture and ideology.
• Test-takers are political subjects in a political context.

One of the problems of critical language testing surrounds the widespread conviction that standardized tests designed by reputable test manufacturers are infallible in their predictive validity. Universities, for example, will deny admission to a student whose TOEFL score falls one point below the requisite score (usually around 500), even though that student, if offered other measures of language ability, might demonstrate abilities necessary for success in a university program. One standardized test is deemed to be sufficient; follow-up measures are considered to be too costly.

A further problem with our test-oriented culture lies in the agendas of those who design and those who utilize the tests. Tests are used in some countries to deny citizenship. Tests are by nature culture-biased and therefore may disenfranchise members of a non-mainstream value system. Test givers are always in a position of power over test-takers and therefore can impose social and political ideologies on test-takers through standards of acceptable and unacceptable items. Tests promote the notion that answers to real-world problems have unambiguous right and wrong answers with no shades of gray. A corollary to the latter is that tests presume to reflect an appropriate core of common knowledge and acceptable behavior; therefore the test-taker must buy into such a system of beliefs in order to make the cut.

Shohamy (1998) pointed out that politicians had capitalized on language tests for tackling thorny political issues that they failed to address by other policy-making processes. They could set the benchmark for passing a language test for immigration purposes without any justification, thereby allowing them the flexibility to create immigration quotas. For example, the government of Australia drew on language tests to manipulate the number of immigrants and to determine if refugees could be accepted or rejected. Similarly, Latvia used strict language tests to prevent Russians from obtaining citizenship in the wake of its independence.


11. AUTHENTICITY
Authenticity is the degree of correspondence of the characteristics of a given language test task to the features of target language use (TLU) tasks¹. Essentially, when you make a claim for authenticity in a test task, you are saying that this task is likely to be enacted in the real world. This concept is shown in the following figure.

[Figure: characteristics of test task <-- correspondence --> characteristics of TLU tasks]

Many test item types fail to simulate real-world tasks. The sequencing of items that bear no relationship to one another lacks authenticity. One does not have to look very long to find reading comprehension passages in proficiency tests that do not reflect a real-world passage. In a test, authenticity may be present in the following ways:
• The language in the test is as natural as possible.
• Items are contextualized rather than isolated.
• Topics are meaningful for the learner.
• Some thematic organization to items is provided, such as through a story line or episode.
• Tasks represent, or closely approximate, real-world tasks.

The authenticity of test tasks in recent years has increased noticeably. Reading passages are selected from real-world sources that test-takers are likely to have encountered or will encounter. Listening comprehension sections feature natural language with hesitations, white noise, and interruptions. More and more tests offer items that are "episodic" in that they are sequenced to form meaningful units, paragraphs, or stories.

1- TLU tasks are those tasks which the learner is likely to face in real-life contexts.

State University Questions

1- To answer the question 'Why have a test at all?', which one of the following do you find irrelevant? (State University, 81)
1) Why does this learner fit in our teaching program?
2) What is the learner's general level of language ability?
3) How much has the learner learned from a particular course?
4) What are the learner's particular strengths and weaknesses?

2- Backwash effect can be defined as ---------. (State University, 83)
1) the influence of testing on teaching
2) the importance of analyzing test results
3) the importance of contrastive analysis to test development
4) the impact of language sub-skills on communication skills

3- The purpose of norm-referenced tests is to ---------. (State University, 84)
1) measure communicative competence
2) relate one testee's performance to that of others
3) use objective linguistic norms to measure proficiency
4) classify people in terms of their ability to perform a set of tasks

4- Achieving beneficial backwash requires ---------. (State University, 84)
1) sampling widely
2) ensuring that the test format is unknown to testees
3) using indirect testing
4) developing proficiency rather than achievement tests

5- --------- tests are prepared on the basis of instructional objectives to determine to what degree the students have learned the material presented in class. (State University, 85)
1) Criterion-referenced
2) Norm-referenced
3) Proficiency
4) Standardized

6- "Backwash effect" refers to the effect of ---------. (State University, 85)
1) unsystematic sources of variance on the observed score
2) face validity considerations on test form selection
3) item difficulty on the true score
4) testing on pedagogy

3- Formative evaluation ---------. (State University, 85)
1) refers to the need for testing students to elicit information
2) is the ongoing evaluation involved in all phases of teaching programs
3) refers to the formal exams administered at the end of teaching programs
4) is intended to check students' progress in regard to their mastery of linguistic forms

7- In a norm-referenced test, ---------. (State University, 86)
1) a higher score would make no difference
2) the goal is to select the examinees with the complete mastery of a skill
3) standard scores and percentile ranks show a testee's relative position
4) the focus is on assuring that testees have achieved certain objectives

8- Norm-referenced measurement helps us ---------. (State University, 87)
1) evaluate the success of an educational program
2) determine the extent to which students have met educational objectives
3) choose the best students to receive a particular type of education
4) determine whether we need to revise our current teaching activities

9- Which of the following distinguishes "evaluation" from "testing"? (State University, 88)
1) Decision making
2) Comparison of measures
3) Reliance on numerical values
4) Quantitative procedures used

10- A perfect norm-referenced test has all of the following properties EXCEPT that ---------. (State University, 88)
1) its content areas are variable
2) it enjoys acceptable reliability
3) it may be based on a theory of language proficiency
4) its administration and scoring procedures are uniform

11- A change in testing leading to a change in teaching is known as ---------. (State University, 88)
1) washback
2) test facet
3) curricular validity
4) communicative interaction

12- "Cram courses" and "teaching to the test" are examples of ---------. (State University, 89)
1) test washback
2) test tasks
3) authenticity
4) directed response

13- When one designs a test with an eye to its impact on the teaching enterprise, one has technically concerned oneself with ---------. (State University, 90)
1) washback
2) proficiency
3) aptitude
4) knowledge

14- Norm-referenced tests rely on ---------. (State University, 91)
1) course objectives
2) teacher-made items
3) a continuum in rank order
4) giving test-takers feedback on specific lesson objectives

15- The claim that "tests are deeply rooted in culture and ideology" is most likely made in ---------. (State University, 92)
1) communicative language testing
2) critical language testing
3) integrative language testing
4) task-based assessment

16- In --------- tests, each candidate's score is interpreted relative to the scores of all other candidates who take the test. (State University, 93)
1) criterion-referenced
2) placement
3) norm-referenced
4) aptitude

17- Which of the following does NOT represent authenticity in a given test? (State University, 93)
1) Contextualization of test items
2) Ease of scoring the test items
3) Naturalness of the language used in the test
4) Resemblance of test items to real-world tasks

18- It is NOT true that a norm-referenced test ---------. (State University, 94)
1) measures general language abilities
2) includes a variety of test content
3) is based on what students exactly expect of test questions
4) relies on the normal distribution of scores around a mean

19- Washback in language testing ---------. (State University, 94)
1) can be either summative or formative
2) refers to the effect of testing in large-scale assessment
3) is limited to formative assessment
4) is a feature of consequential validity

20- In interpreting a student's score on a criterion-referenced test, ---------. (State University, 95)
1) there is no need for any reference to the actual number of test questions the student has answered correctly
2) the primary focus is on how much of the material the student has learned in relative terms
3) there is no need for any reference to the performances of other students
4) the focus must be on the student's percentile rank

21- All of the following statements are TRUE regarding testing, assessment and evaluation EXCEPT ---------. (State University, 96)
1) all tests are formal assessments, but not all formal assessment is testing
2) assessment is usually time-constrained and draws on a limited sample of behavior
3) evaluation is a process that allows us to judge the value or desirability of a measure
4) a test is a prepared administrative procedure that occurs at an identifiable time in a curriculum

22- Norm-referenced and criterion-referenced tests differ in all of the following characteristics EXCEPT the ---------. (State University, 96)
1) purposes of testing
2) type of measurement
3) length of the test
4) type of interpretation

23- Which of the following tips does NOT foster beneficial washback? (State University, 96)
1) Test the abilities whose development you wish to promote
2) Base achievement tests on objectives
3) Sample widely and unpredictably
4) Use indirect testing

24- Percentage and percentile are the terms used to capture the difference between --------- tests, respectively. (State University, 97)
1) criterion-referenced and norm-referenced
2) norm-referenced and criterion-referenced
3) direct and indirect
4) indirect and direct

25- Authenticity in a test may be present in all of the following ways EXCEPT when ---------. (State University, 97)
1) some thematic organization to items is provided
2) the items are contextualized rather than isolated
3) the difficulty level of the test presents a reasonable level of challenge
4) topics are meaningful to the test-taker


State University Answers
1. Choice 1
2. Choice 1
Refer to Section 8.
3. Choice 2
The type of interpretation in an NRT is relative, i.e. a student's performance is compared to those of all other students in percentile terms.
4. Choice 1
Refer to Section 8.
5. Choice 1
The type of interpretation in a CRT is absolute, i.e. a student's performance is compared only to the amount, or percentage, of material learned.
6. Choice 4
Refer to Section 8.
3. Choice 2
Refer to Section 4.2.
7. Choice 3
In an NRT the relative position of examinees is reported in terms of percentile rank.
8. Choice 3
Choice 3 describes the function of a placement test (see Chapter 3), which is an NRT. Choices 1, 2 and 4 are all characteristics of CR tests.
9. Choice 1
Refer to Section 3.
10. Choice 1
The quintessential NR test is the standardized test, which has three characteristics. First, standardized tests are based on fixed, or standard, content, which does not vary from one form of the test to another. The content may be based on a theory of language proficiency or it may be based on a specification of language users' expected needs. Second, there are standard procedures for administering and scoring the test, which do not vary from one administration of the test to the next. Finally, standardized tests have been thoroughly tried out, and through a process of empirical research and development, their characteristics are well known. Specifically, their measurement properties have been examined, so that we know what type of measurement scale they provide, that their reliability and validity have been carefully investigated and demonstrated for the intended uses of the test, that their score distribution norms have been established with groups of individuals similar to those for whom the test is intended, and that if there are alternate forms of the test, these are equated statistically.
11. Choice 1
Refer to Section 8.
12. Choice 1
Refer to Section 8.
13. Choice 1
Refer to Section 8.
14. Choice 3
Refer to Section 5.
15. Choice 2
Refer to Section 10.
16. Choice 3
Refer to Section 5.
17. Choice 2
Refer to Section 11.
18. Choice 3
In a CRT students know exactly what content to expect in test items.
19. Choice 4
Refer to Section 8.
20. Choice 3
We make a reference to the performances of other students in the case of NRTs.
21. Choice 2
Choice two is the definition of a test.
22. Choice 3
Refer to the table in Section 5.
23. Choice 4
Refer to the table in Section 8.


Azad University Questions

1- Evaluation is different from testing in that the former is mostly designed for ---------. (Azad University, 83)
1) measurement
2) making decisions
3) generalization
4) quantification

2- In the process of evaluation, we measure ---------. (Azad University, 84)
1) people themselves
2) people's characteristics
3) people's goals
4) people's needs

3- In --------- testing, students are treated as individuals rather than being compared to other students. (Azad University, 84)
1) criterion-referenced
2) integrative
3) norm-referenced
4) functional

2- The tests which are given at the end of an instructional course for assessing course grades are --------- tests. (Azad University, 84)
1) summative
2) formative
3) selection
4) placement

4- The determination of congruence between performance and objective is interpreted as ---------. (Azad University, 85)
1) testing
2) evaluation
3) assessment
4) measurement

5- When tests are used for program evaluation, they can be misinterpreted. Which of the following difficulties can arise when tests are used to assess the efficacy of various instructional techniques? (Azad University, 86)
I. Schools may not be similar.
II. Teachers can have different styles.
III. There is no valid test measuring this concept.
IV. The children being taught may be different.
1) I and III only
2) II and IV only
3) I only
4) I, II and IV only

6- Results of standardized tests ---------. (Azad University, 86)
I. show how well children are reading compared to other children
II. help determine in which reading group a child should be placed
III. determine in which skills students are deficient
IV. are expressed in grade and age scores
1) I and II only
2) II and III only
3) I, II, and III only
4) I, II, and IV only

7- In terms of instructional planning, which of the following does research tell us about the use of tests in the classroom? (Azad University, 86)
1) Several large tests given per term facilitate learning and retention.
2) Many tests taken at short intervals are superior in terms of learning and retention to one long exam.
3) Essay tests have higher reliability than objective examinations.
4) Instead of using objective exams to facilitate learning, it is best to give groups of students cooperative learning projects.

8- --------- tests are generally prepared by a group of testing specialists. (Azad University, 86)
1) Communicative
2) Standardized
3) Integrative
4) Functional

9- In --------- tests a testee's performance is compared to that of the other testees. (Azad University, 86)
1) speed
2) criterion-referenced
3) summative
4) norm-referenced

10- The purpose of --------- is to generate scores that spread the students out along a continuum of general abilities so that any existing differences among the individuals can be distinguished. (Azad University, 89)
1) criterion-referenced tests
2) norm-referenced tests
3) achievement tests
4) speed tests

11- Concerning the functions of language tests, which of the following statements is INCORRECT? (Azad University, 90)
1) The content of a standardized test is decided by the curriculum.
2) The directions in standardized tests follow a global procedure and are culture specific, so that different examinees can understand them.
3) The norms in a teacher-made test may be influenced by the teacher's taste.
4) In standardized tests the norms are closely described by the experts.

12- In a test, authenticity may be obtained in the following ways EXCEPT ---------. (Azad University, 91)
1) the language in the test is as natural as possible
2) items are contextualized not discrete
3) topics are meaningful for the learner
4) drills represent significant aspects of the language

13- Concerning the functions of language tests, which of the following statements is INCORRECT? (Azad University, 92)
1) The content of a standardized test is decided by the curriculum.
2) The directions in standardized tests follow a global procedure, and are culture specific so that different examinees can understand them.
3) The norms in a teacher-made test may be influenced by the teacher's taste.
4) In standardized tests the norms are closely described by the experts.

14- Which one of the following statements can best describe norm-referenced tests? (Azad University, 92)
1) They can be carried out to determine whether testees have achieved certain objectives.
2) We can interpret the testee's performance by comparing it to some specific expectation.
3) Measures constructed according to this tradition are designed to yield maximum discrimination among examinees.
4) Usually a testee passes the test only when he has given the right answer to all or a specific number of test items.

15- Which of the following statements is true about criterion-referenced tests? (Azad University, 93)
1) They have a pre-determined cut-off score.
2) They can be used as competition tests.
3) The distribution of scores is normal.
4) Each student's score is interpreted relative to the scores of other students.

16- Which of the following statements is true about evaluation in educational settings? (Azad University, 93)
1) It is a form of assessment.
2) It is the process of quantifying the characteristics of persons according to explicit rules.
3) It refers to the systematic gathering of information for making decisions.
4) It refers to the procedure for measuring ability, knowledge or performance.

Azad University Answers

1. Choice 2
Refer to Section 3.
2. Choice 2
Refer to Section 3.
3. Choice 1
Refer to Section 5.
2. Choice 1
Refer to Section 4.2.
4. Choice 2
Refer to Section 3.
5. Choice 2
6. Choice 3
7. Choice 2
When teachers give tests at short intervals, the testees find the opportunity to know about their weaknesses in each unit, chapter, etc. and take remedial action.
8. Choice 2
Refer to Section 6.
9. Choice 4
Refer to Section 5.
10. Choice 2
Refer to Section 5.
11. Choice 2
Refer to Section 6.
12. Choice 4
Refer to Section 11.
13. Choice 2
Refer to Section 6.
14. Choice 3
Refer to Section 5.
15. Choice 1
Often, but not always, the application of CRTs involves the use of cut-off scores that separate competent from incompetent examinees. Usually, a testee passes the test only when he has given the right answer to all or a specific number of test items.
16. Choice 3
Generally, the purpose of evaluation is to gather information systematically for the purpose of making decisions. This information need not be exclusively quantitative.

Chapter 2

Language Test Functions
• Two Major Functions of Language Tests
• Contrasting Categories of Language Tests
• Computer-based Testing
• A General Framework


Language Test Functions

The function of a test refers to the purpose for which the test is designed. A test user should clearly identify the function for which a test is to be used. Otherwise, employing a test for inappropriate purposes would lead to unjustified decisions and thus to undesirable consequences.

1. TWO MAJOR FUNCTIONS OF LANGUAGE TESTS
According to FJB, tests serve two major functions: prognostic and evaluation of attainment. Prognostic tests are not directly related to a particular course of instruction, while evaluation of attainment tests are based on a clearly specified course of instruction.

[Diagram: TEST FUNCTIONS divide into EVALUATION OF ATTAINMENT (Achievement, itself subdivided into General, Diagnostic, and Progress; Knowledge; Proficiency) and PROGNOSTIC (Placement, Selection, Aptitude).]

1.1. Evaluation of Attainment Tests
Attainment evaluation tests measure to what extent examinees have learned the intended skill, performance, knowledge, etc. in a given area. Attainment tests include achievement, knowledge, and proficiency tests.

1.1.1. Achievement tests
Achievement tests are related to clearly specified courses of instruction. These tests are limited to particular material covered in a curriculum within a particular time frame and thus measure the degree of students' learning from a period of instruction. The specification for an achievement test should be determined by
a) the objectives of the lesson, unit, or course being assessed,
b) the relative importance (or weight) assigned to each objective,
c) the tasks employed in classroom lessons during the unit of time,
d) practicality issues, such as the time frame for the test and turnaround time, and
e) the extent to which the test structure lends itself to formative washback.
Achievement tests are divided into three kinds: general, diagnostic, and progress.
• General achievement tests are (standardized) tests which deal with a body of knowledge. Constructors of such tests rarely teach the students being tested. One example is a test to measure students' achievement in the first year of high school.
The content of a final achievement test may be based directly on a detailed course syllabus or on the books and other materials used. This has been referred to as the syllabus-content approach. It has an obvious appeal, since the test only contains what it is thought that the students have actually encountered, and thus can be considered a fair test. The disadvantage is that if the syllabus is badly designed, or the books and other materials are badly chosen, the results of the test can be very misleading. Successful performance on the test may not truly indicate successful achievement of course objectives. For example, a course may be intended to prepare students for university study in English, but the syllabus may not include listening to English delivered in lecture style on academic topics. In this case, test results will fail to show what students have achieved in terms of course objectives.
The alternative approach is to base the test directly on the objectives of the course. This has a number of advantages. First, it compels course designers to be explicit about objectives. Secondly, it makes it possible for performance on the test to show just how far students have achieved those objectives. This in turn puts pressure on those responsible for the syllabus and for the selection of books and materials to ensure that these are consistent with the course objectives. Tests based on objectives work against the perpetuation of poor teaching practice, something which course-content-based tests fail to do. Hughes supports the second approach because he thinks it will provide more accurate information about individual and group achievement, and it is likely to promote a more beneficial backwash effect. The disadvantage of this method might be its unfairness to students: if the course content does not fit well with the objectives, examinees will be expected to do things for which they have not been prepared.
• Diagnostic tests measure the degree of students' achievement on a particular subject or topic and, specifically, the detailed elements of an instructional topic. They show the weaknesses and strengths of students so that teachers can modify the instructional procedure, and remedial action can be taken if the number of students is big. Diagnostic tests could also be administered at the beginning of the term, and via the collected data teachers are able to determine the potential problems of students and accordingly plan instructional activities to alleviate the students' problems. A test in pronunciation, for example, might diagnose the phonological features of English that are difficult for learners and should therefore become part of a curriculum.

The following table contrasts the qualities of achievement and diagnostic tests (both criterion-referenced):

Test Qualities | Achievement Test | Diagnostic Test
Details of Information | Specific | Very specific
Focus | Terminal objectives of course or program | Terminal and enabling objectives of courses
Purpose of Decision | To determine the degree of learning of what has been already taught, for advancement or graduation | To inform students and teachers of objectives needing more work in the future
Relationship to Program | Directly related to objectives | Directly related to objectives still needing work
When Administered | End of courses | Beginning and/or middle of courses
Interpretation of Scores | Overall number and percentage of objectives learned | Percentage of each objective in terms of strengths and weaknesses

• Progress tests attempt to measure portions of the materials taught in the course. As their name suggests, progress tests are intended to measure the progress that students are making towards the achievement of course objectives. These tests are linked to a particular set of teaching materials or a particular course of instruction. For example, teachers need to make decisions about when to move on to another unit of instruction. If subsequent units assume mastery of the objectives of the current unit, the teacher must be assured that students have mastered these short-term objectives before moving on. In such cases the teacher needs to give a progress test. In addition to more formal achievement tests which require careful preparation, teachers should feel free to set their own 'pop quizzes'. These serve both to make a rough check on students' progress and to keep students on their toes. Since such tests will not form part of formal assessment procedures, their construction and scoring need not be too rigorous.

1.1.2. Knowledge tests
Knowledge tests are used when the medium of instruction is a language other than the examinees' mother tongue. So, they don't measure language ability but knowledge of a scientific subject. For example, a physics test or a psychology test written in English would be considered a knowledge test. Therefore, these tests are designed to measure examinees' knowledge of a scientific subject through a second language.

1.1.3. Proficiency tests
If your aim in a test is to tap overall language ability, i.e. global competence, then you are, in conventional terminology, testing proficiency. A proficiency test is not limited to any one course, curriculum, or single skill in the language; rather, it tests the result of the individual's cumulative learning experiences. What language proficiency measures focus upon is determining the extent of examinees' ability to utilize language at the time of examination. How an examinee has acquired that degree of capability is not of significance to the examiner (i.e., the learner's educational background is not of significance). Some typical examples of standardized proficiency tests are the Test of English as a Foreign Language (TOEFL), the Cambridge First Certificate in English examination (FCE), the Cambridge Certificate of Proficiency in English examination (CPE), and the International English Language Testing System (IELTS).
More precisely, these tests measure:
• the degree of knowledge a learner has accumulated in language components; and
• the degree of his capability to demonstrate his knowledge in language use.
Proficiency tests are almost always summative (see below) and norm-referenced. They provide results in the form of a single score, which is a sufficient result for the gate-keeping role they play of accepting or denying someone passage into the next stage of a journey.
Note: A key issue in testing proficiency is the difficulty that centers on the complexity of defining the term 'proficiency' (the construct of language). This difficulty renders the construction of proficiency tests more difficult in comparison to the development of other types of tests.

1.2. Prognostic Tests
Prognostic tests are not related to a particular course of instruction. Their objective is to predict and make decisions about the future success and actions of examinees based on present capabilities. They include placement, selection, and aptitude tests.

1.2.1. Placement tests
Placement tests are used to determine the most appropriate channel of education for examinees. The purpose of placement tests is merely to measure the capabilities of applicants in pursuing a certain path of language learning and to place them into an appropriate level or section of a language curriculum or school. A student's performance on a placement test should indicate the point at which the student will find material neither too easy nor too difficult but appropriately challenging.
Teachers benefit from placement decisions because they end up with classes that have students with relatively homogeneous ability levels. As a result, teachers can focus on the problems and learning points appropriate for that level of students.
If there is a mismatch between the placement test and what is taught in a program, the danger is that the groupings of similar ability levels will simply not occur. For instance, consider an elementary school English program in which a general grammar test is used for placement. If the focus of the program is on oral communication at three levels, numerous problems may arise. Such a test places the children into levels on the basis of their written grammar abilities. While grammar ability may be related to oral proficiency, other factors may be more important to successful oral communication. Thus, the placement tests that are most successful are those constructed for particular situations. They depend on the identification of the key features at different levels of teaching in the institution. They are tailor-made rather than bought off the peg. This usually means that they have been produced 'in house'. The work that goes into their construction is rewarded by the saving in time and effort through accurate placement.
After students have been placed at different levels, they should be given the proper instruction to alleviate the existing differences among the groups and help them reach a criterion level. Here, the purpose of placement procedures is to help those who need more instruction. To achieve this objective, two different methods can be applied. The first method regards the length of instruction: those at lower levels should receive instruction for a longer time than those at the higher levels. The second method concerns the intensity of instruction: that is, students at lower levels would receive more intensive instruction than those at higher levels.
Note: A placement test, usually but not always, includes a sampling of material to be covered in various courses in a curriculum.
Note: There is no pass or fail in placement tests.

1.2.2. Selection tests
The purpose of selection tests is to provide information upon which the examinees' acceptance or non-acceptance into a particular program can be determined. For example, to obtain a driver's license, applicants take a selection test and they either pass or fail.
When a criterion is determined, there should be no limitation for participants who pass the criterion. However, as many candidates obtain the criterion, due to administrative restrictions, these tests turn into competition tests. In such situations, students are ranked from highest to lowest; the last person accepted in the program marks the passing criterion. To avoid this situation, there are two options: the first is to increase the facilities in order to admit more applicants; the second is to set a higher criterion.
Note: In contrast to placement tests, testees pass or fail in selection tests.
A type of selection test is the readiness (or screening) test. The term screening refers to a preliminary step of a selection process usually undertaken to separate individuals who merit or require more extensive evaluation from those who do not and, in turn, to assist in a decision of who should be allowed to participate in a particular program of instruction. Also, a readiness test measures the extent to which an individual has achieved a certain degree of maturity or acquired certain skills or information needed for successfully undertaking some new learning activity.

1.2.3. Aptitude tests
Aptitude tests are used to predict applicants' success in achieving certain objectives in the future. In order to take an aptitude test, the examinee doesn't need to have prior knowledge of the subject being tested. Aptitude tests can contribute to making decisions on the appropriate major fields of study, learning suitable foreign languages, future occupations of the students, etc. One word of caution is that we should be careful about their predictive power before making any suggestions.
Language aptitude tests are designed to measure a person's capability or general ability to learn a foreign language a priori and to be successful in that undertaking. Language aptitude tests were ostensibly designed to apply to the classroom learning of any language. These tests do not tell us who will succeed or fail in learning a foreign language. They attempt to predict the rate at which certain students will be able to acquire a language. Language aptitude tests usually consist of several different tests which measure the following cognitive abilities.
• Sound coding ability (or phonetic coding): the ability to identify and remember new auditory phonetic material in such a way that this material can be recognized, identified, and remembered over a period longer than a few seconds. This is a rather unique auditory component of foreign language aptitude.
• Grammatical coding ability: the ability to identify the grammatical functions of different parts of sentences in a variety of contexts.
• Memorization (or rote learning ability): the ability to remember words, rules, etc. in a new language. Rote learning ability is a kind of general memory, but individuals seem to differ in their ability to apply their memory to the foreign language situation.
• Inductive learning ability: the ability to work out linguistic forms, rules, patterns, and meanings from new linguistic content with a minimum of supervision or guidance.
Two standardized aptitude tests were once used in the United States: the Modern Language Aptitude Test (MLAT) and the Pimsleur Language Aptitude Battery (PLAB). Aptitude tests are seldom used today. Instead, attempts to measure language aptitude more often provide learners with information about their preferred styles and their potential strengths and weaknesses, with follow-up strategies for capitalizing on the strengths and overcoming the weaknesses. Any test that claims to predict success in learning a language is flawed, because we now know that with appropriate self-knowledge, active strategic involvement in learning, and/or strategies-based instruction, virtually everyone can eventually succeed. To pigeon-hole learners a priori, before they have even attempted to learn a language, is to presuppose failure or success without substantial cause.

2. CONTRASTING CATEGORIES OF LANGUAGE TESTS
In this section we will outline rather briefly some other ways tests can be classified. Understanding contrasting test types can be helpful to teachers, since tests of one kind may not always be successfully substituted for those of another kind.

2.1. Knowledge vs. Performance Tests
Knowledge tests are used in various school subjects to determine the subjects' knowledge of facts or concepts about the foreign language or other fields of study (such as geography and literature). In the field of language, knowledge tests show how well students know facts about the language. In performance tests the ability to perform particular tasks, usually associated with job or study requirements, is assessed. Instead of just offering paper-and-pencil single-answer tests of possibly discrete items, performance-based testing of typical school subjects involves group projects, labs, etc. In the field of language, these types of tests come in the form of various interactive language tests. This means such tests have to involve people in actually performing the behavior that we want to measure. Thus authenticity is of great importance.

2.2. Language Sub-skills vs. Communication Skills Tests
Tests of language sub-skills measure the separate components of English, such as vocabulary, grammar, and pronunciation. Communication skills tests, on the other hand, show how well students can use the language in actually exchanging ideas and information.


2.3. Speed vs. Power Tests
Speed tests are those in which the items are comparatively easy and all of them are within the ability level of the testees, but the time limits are so short that few or none of the candidates can complete all items. Such tests aim at determining the speed with which the testees perform certain tasks. These tests are contrasted with power tests, in which item difficulty generally increases gradually and where ample time is given for all, or at least most, of the candidates to attempt every item. Power tests include some items too difficult for anyone to solve, so that no one can get a perfect score. The aim is to determine how much an individual is able to do, not how rapidly.

2.4. Direct vs. Indirect Tests
Testing is said to be direct when it requires the candidate to perform precisely the skill we wish to measure. For example, if we want to know how well candidates can write compositions, we get them to write compositions. Similarly, if we want to know how well they pronounce a language, we get them to speak. Compared to indirect tests, direct tests
• are generally easier to construct;
• use tasks and texts which are authentic;
• are easier to carry out when it is intended to measure the productive skills of speaking and writing;
• are less reliable (see Chapter 6);
• foster a positive backwash effect;
• provide results which are quite straightforward to interpret and assess.
Indirect tests attempt to measure the abilities which underlie the skills in which we are interested. Such a test does not require the test taker to perform tasks that directly reflect the kind of language use that is the target of assessment; rather, an inference is made from performance on more artificial tasks. For example, an indirect test of writing ability may include items requiring the test taker to identify grammatical or spelling errors in written sentences.
The main appeal of indirect testing is that it seems to offer the possibility of testing a representative sample of a finite number of abilities which underlie a potentially indefinitely large number of manifestations of them. By contrast, direct testing is inevitably limited to a rather small sample of tasks, which may call on a restricted and possibly unrepresentative range of grammatical structures.
The main problem with indirect tests is that the relationship between performance on them and performance of the skills in which we are usually more interested tends to be rather weak in strength and uncertain in nature. We do not know enough about the component parts of, say, composition writing to predict composition writing ability accurately.

2.5. Formative vs. Summative Assessment
See Chapter 1.
2.6. Norm-Referenced vs. Criterion-Referenced Tests
See Chapter 1.
2.7. Teacher-Made vs. Standardized Tests
See Chapter 1.
2.8. Proficiency vs. Achievement Tests
See above.
2.9. Subjective vs. Objective Tests
See Chapter 3.
2.10. Productive vs. Receptive Tests
See Chapter 3.
2.11. Alternative vs. Traditional Assessment
See Chapter 3.
2.12. Discrete-Point vs. Integrative Tests
See Chapter 7.

3. COMPUTER-BASED TESTING
Recent years have seen a burgeoning of assessment in which the test-taker performs responses on a computer. Students receive prompts in the form of spoken or written stimuli from the computerized test and are required to type their responses. A specific type of computer-based test, the computer-adaptive test, has been available for many years but has recently gained momentum. In a computer-adaptive test (CAT), each test-taker receives a set of questions that meet the test specifications and that are generally appropriate for his or her performance level. For this reason, it has also been called tailored testing. The CAT starts with questions of moderate difficulty. As test-takers answer each question, the computer scores the question and uses that information, as well as the responses to previous questions, to determine which question will be presented next. As long as examinees respond correctly, the computer typically selects questions of greater or equal difficulty. Incorrect answers, however, typically bring questions of lesser or equal difficulty. The computer is programmed to fulfill the test design as it continuously adjusts the difficulty level until a dependable estimate is arrived at after collecting responses to a relatively small number of items. This dependable estimate is based on statistical analysis (item response theory). In CATs, the test-taker sees only one question at a time, and the computer scores each question before selecting the next one. As a result, test-takers cannot return to questions in any earlier part of the test.
Computer-based testing, with or without CAT technology, offers these advantages:
• classroom-based testing;
• self-directed testing on various aspects of a language (grammar, discourse, one or all of the four skills, etc.);
• practice for upcoming high-stakes standardized tests;
• some individualization, in the case of CATs;
• large-scale standardized tests that can be administered easily to thousands of test-takers at many different stations, then scored electronically for rapid reporting of results.
Of course, some disadvantages are present in the current predilection for computerized testing. Among them:
• Lack of security and the possibility of cheating are inherent in classroom-based, unsupervised computerized tests.
• Occasional home-grown quizzes that appear on unofficial websites may be mistaken for validated assessments.
• The multiple-choice format preferred for most computer-based tests contains the usual potential for flawed item design.
• Open-ended responses are less likely to appear because of the need for human scorers, with all the attendant issues of cost, reliability, and turn-around time.
• The human interactive element is absent.
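To make the adaptive logic above concrete, here is a minimal sketch, not from the book, and far simpler than the IRT-based selection a real CAT uses: difficulty moves up after a correct answer, down after an incorrect one, and the test stops early once the level stabilizes. All names (run_cat, item_bank, ask) are illustrative.

```python
def run_cat(item_bank, ask, max_items=20):
    """Sketch of a computer-adaptive test loop.

    item_bank: dict mapping a difficulty level (1-9) to a list of items.
    ask: callback taking one item, returning True for a correct answer.
    """
    level, history = 5, []              # start at moderate difficulty
    for _ in range(max_items):
        if not item_bank.get(level):
            break                       # no items left at this level
        item = item_bank[level].pop()
        correct = ask(item)             # one question at a time, scored at once
        history.append((level, correct))
        # harder (or equal) after a correct answer, easier after an incorrect one
        level = min(level + 1, 9) if correct else max(level - 1, 1)
        # stop early once the last few levels settle into a narrow band
        if len(history) >= 6 and len({lv for lv, _ in history[-4:]}) <= 2:
            break
    if not history:
        return None
    # crude ability estimate: mean difficulty of the items administered
    return sum(lv for lv, _ in history) / len(history)

# Example: a test-taker who answers items below difficulty level 7 correctly.
bank = {lv: [f"item-{lv}-{i}" for i in range(5)] for lv in range(1, 10)}
print(run_cat(bank, lambda item: int(item.split("-")[1]) < 7))
```

The loop converges around the level where the test-taker starts failing, which is the intuition behind tailored testing: most administered items carry information about the ability boundary rather than being far too easy or far too hard.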

4. A GENERAL FRAMEWORK*
There is a general distinction between system-referenced and performance-referenced tests. System-referenced tests assess knowledge of language as a system. As Baker puts it, their aim is to provide information about proficiency in a general sense, without reference to any particular use or situation. Performance-referenced tests, in contrast, seek to provide information about the ability to use language in specific contexts; they are directed at assessing a particular performance, for example, the ability of a trainee pilot to understand and respond to messages from the control tower when landing an aircraft. Thus, whereas system-referenced tests are more construct-oriented, drawing on some explicit theory of language proficiency, performance-referenced tests are more content-oriented, drawing on a work-sample approach to test design.
Both system-referenced and performance-referenced tests can be direct or indirect. Direct tests are based on a direct sampling of the criterion performance. Such tests are holistic in nature and aim to obtain a contextualized sample of the testee's use of language. In this type of assessment, learners are required to reproduce, in a testing situation, the communication behaviors they will need to carry out in real life. Indirect tests are based on an analysis of the criterion performance in order to obtain measures of the specific features or components that comprise it. They seek to assess proficiency by means of specific linguistic measures, which are obtained from the test itself; for example, a cloze test provides a score based on the number of blanks the testee has been able to fill in correctly. Such tests are less contextualized and more artificial. Juxtaposing these two dimensions, four basic types of assessment can be identified, as shown below.

Direct (holistic)
• System-referenced: traditional tests of general language ability, e.g. free composition; oral interview
• Performance-referenced: tests based on observing real-world tasks; simulations of real-world tasks; information-transfer tests (information-gap, opinion-gap, reasoning-gap)

Indirect (analytic)
• System-referenced: discrete-item tests of linguistic knowledge, e.g. multiple-choice grammar or vocabulary tests; elicited imitation of specific linguistic features; error-identification tests; integrative tests, e.g. cloze, dictation
• Performance-referenced: tests that seek to measure specific aspects of communicative proficiency discretely, e.g. specific-purpose tests of academic sub-skills (such as the ability to cite from a published work) or of the ability to perform specific functions or strategies (such as the ability to write a definition of a technical term)

* Due to some unfamiliar terminologies, it is advised that you study this framework after studying Chapter 7.

State University Questions

1- There is a contrast between each of the following pairs of tests EXCEPT ---------. (State University, 83)
1) speed and power  2) knowledge and performance  3) productive and integrative  4) proficiency and achievement

2- In system-referenced testing ---------. (State University, 84)
1) items directly reflect criterion performance
2) tests may be either direct or indirect
3) tests are constructed with reference to a particular use
4) the test's target is usually contextualized

4- Analytical tests are generally ---------. (State University, 86)
1) form-focused, direct, and valid
2) use-based, indirect, and form-focused
3) productive, norm-referenced, and direct
4) indirect, usage-based, and discrete-point

5- The test types referred to as diagnostic test, achievement test, placement test, and proficiency test are respectively ---------. (State University, 89)
1) NRT, CRT, NRT, CRT  2) NRT, NRT, CRT, CRT  3) CRT, NRT, CRT, NRT  4) CRT, CRT, NRT, NRT

6- Which of the following is a direct system-referenced language test? (State University, 90)
1) Oral interviews  2) Simulation-based tests  3) Multiple-choice grammar or vocabulary tests  4) University entrance tests

7- All of the following are regarded as tests that aim at evaluating the degree of attainment EXCEPT ---------. (State University, 90)
1) Achievement  2) Proficiency  3) Aptitude  4) Knowledge

8- The "syllabus-content" approach is most relevant to ---------. (State University, 91)
1) test-wiseness  2) achievement tests  3) objectivity of a test  4) the negative washback of a test

9- One of the key features of indirect testing is ---------. (State University, 91)
1) authenticity of texts
2) its inability to measure comprehension skills
3) measuring the abilities which underlie a skill
4) the ease to carry out when it is intended to measure production

10- Diagnostic tests do NOT intend to identify ---------. (State University, 91)
1) students' weaknesses  2) students' strengths  3) the necessary future teaching  4) the washback of testing to teaching

11- A traditional composition test in which candidates are required to write about a given topic is an example of a(n) --------- test. (State University, 93)
1) indirect performance-referenced  2) direct performance-referenced  3) direct system-referenced  4) indirect system-referenced

12- In a computer-adaptive test, the computer is programmed to ---------. (State University, 95)
1) let test takers answer questions in an order already fixed
2) score each question before selecting the next one
3) start with easy questions that almost all test takers can answer
4) let test takers see the whole set of questions before they start the test

13- If there is a mismatch between a placement test and what is taught in a program, the danger is that ---------. (State University, 95)
1) there won't be a wide range of abilities in the program
2) the groupings of similar ability levels will not occur
3) classes will contain students with relatively homogeneous ability levels
4) there will not be a criterion level against which students' performances can be judged

14- All of the following are among the problems with task-based assessment EXCEPT ---------. (State University, 96)
1) reliability  2) authenticity  3) validity  4) representativeness

15- Which of the following are prognostic tests? (State University, 96)
1) Selection, placement, and aptitude
2) Selection, placement, and proficiency
3) Achievement, knowledge, and aptitude
4) Achievement, proficiency, and knowledge

16- What kind of test is a multiple-choice test of grammar? (State University, 96)
1) Direct and system-referenced  2) Direct and performance-based  3) Indirect and system-referenced  4) Indirect and performance-based

17- Which of the following statements is true about placement tests? (State University, 97)
1) They are often used for selection purposes in most language schools.
2) They are used to identify learners' strengths and weaknesses.
3) They should be tailor-made rather than bought off the peg.
4) They belong to the category of evaluation of attainment function of language tests.

18- Which of the following is true about the syllabus-content approach when it comes to designing an achievement test? (State University, 97)
1) It fosters a more beneficial backwash effect on teaching.
2) It compels syllabus designers to be explicit about the objectives.
3) It will provide more accurate information about individual and group achievement.
4) If the syllabus is poorly designed, the results of the test could be very misleading.

19- Evaluating students in the process of forming their competencies and skills, with a view to helping them continue that growth process, amounts to --------- assessment. (State University, 97)
1) summative  2) formative  3) subjective  4) objective

State University Answers

1. Choice 3
Refer to Section 2.
2. Choice 2
Refer to Section 4.
4. Choice 4
Refer to Section 4.
5. Choice 4
The first and most basic distinction in language testing involves two families of tests that perform two very different functions: one family helps administrators and teachers make program-level decisions, such as proficiency and placement decisions, and the other family helps teachers make classroom-level decisions, such as diagnostic and achievement decisions. In the technical jargon of testing, these two families are called norm-referenced tests and criterion-referenced tests.
6. Choice 1
Refer to Section 4.
7. Choice 3
Though we can imagine cases in which proficiency tests are used to evaluate what examinees have learned in the long run, aptitude tests are never used for such a purpose.
8. Choice 2
Refer to Section 1.1.1.
9. Choice 3
Refer to Section 2.4.
10. Choice 4
Refer to Section 1.1.1.
11. Choice 3
Refer to Item 2.
12. Choice 2
Refer to Section 3.
13. Choice 2
Refer to Section 1.2.1.
14. Choice 3
According to Ellis, a number of problems are common to all kinds of testing, but some are specific to task-based testing. These problems include: representativeness, authenticity, generalizability, inseparability, and reliability.
15. Choice 1
Refer to the diagram in Section 1.
16. Choice 3
Refer to the table in Section 4.
17. Choice 3
Refer to Section 1.2.1.
18. Choice 4
Refer to Section 1.1.1.
19. Choice 2
Refer to Section 2.5.

Azad University Questions

1- Generally, proficiency tests are designed regardless of the --------- of the testees. (Azad University, 83)
1) knowledge  2) skills  3) preparedness  4) background trainings

3- Which one is NOT true about proficiency tests? (Azad University, 84)
1) They cannot be easily defined, i.e. in terms of the trait they are measuring.
2) They are used to measure overall language ability.
3) They have wide application in the world.
4) They are designed to measure the covered materials in a course.

4- Some students were tested in terms of their tolerance to predict if they could be good kindergarten teachers. Such a test is a(n) --------- test. (Azad University, 84)
1) aptitude  2) knowledge  3) achievement  4) diagnostic

5- Teachers can evaluate and modify their instructional procedures through a --------- test. (Azad University, 84)
1) proficiency  2) aptitude  3) placement  4) diagnostic

6- In Iran, due to administrative limitations, the --------- tests function as competition tests. (Azad University, 84)
1) aptitude  2) proficiency  3) knowledge  4) selection

7- Which of the following are the primary goals of classroom tests? (Azad University, 86)
I. To rank students in the class
II. To practice what has been learned
III. To find out what has been learned to guide further instruction
IV. To determine eligibility for special pull-out programs
1) I only  2) III only  3) II and III only  4) III and IV only

8- An achievement test is used to evaluate the testees' progress as well as the effectiveness of ---------. (Azad University, 86)
1) instruction  2) administration  3) assessment  4) validation

9- To determine the individual's specific strengths and weaknesses, a --------- test is used. (Azad University, 88)
1) prognostic  2) diagnostic  3) placement  4) selection

10- In what kind of test do we usually expect to observe a higher variance? (Azad University, 88)
1) proficiency  2) diagnostic  3) knowledge  4) achievement

11- The purpose of --------- is merely to measure the capabilities of applicants in pursuing a certain path of language learning. There is no pass or fail in these tests. (Azad University, 89)
1) placement tests  2) achievement tests  3) selection tests  4) aptitude tests

12- The relationship between test input and response can best be defined with reference to "feedback" and "interaction". Taking this into account, adaptive tests are the type in which ---------. (Azad University, 89)
1) both feedback and interaction are present
2) neither feedback nor interaction is present
3) there is interaction but not feedback
4) there is feedback but no interaction

13- A type of assessment in which examinees must do authentic tasks and the success and failure in the outcome of the tasks are of great concern is called ---------. (Azad University, 89)
1) dynamic assessment  2) performance assessment  3) criterion-referenced testing  4) norm-referenced testing

14- Concerning the functions of language tests, which of the following statements is INCORRECT? (Azad University, 90)
1) A formative test typically covers some predefined segment of instruction (e.g. a unit, a chapter, or a particular skill).
2) We call the testing process direct when it requires the candidates to perform precisely what we wish to measure.
3) A usage test is a test which taps the test taker's ability to compose correct sentences with no regard to the communicative context.
4) If item difficulty decreases gradually in a test, it is assumed that the test can be categorized under the category of pure power tests.

15- Which of the following tests is mainly intended to determine the applicants' success or failure in achieving certain objectives in the future? (Azad University, 91)
1) Diagnostic tests  2) Aptitude tests  3) Placement tests  4) Achievement tests

16- Achievement tests play an important role in education. A good achievement test is one which should ---------. (Azad University, 91)
1) measure an adequate sample of the learning outcomes and subject-matter content that may not be included in the instruction
2) be made as reliable as possible and then be interpreted without caution
3) be designed to fit all types of uses to be made of the results
4) measure clearly defined learning outcomes that are in harmony with the instructional objectives

17- At some schools, where the teachers teach other subject matters in a language other than the students' mother tongue and test them in that language, the tests are considered as --------- tests. (Azad University, 91)
1) achievement  2) proficiency  3) knowledge  4) diagnostic

18- Concerning the functions of language tests, which of the following statements is INCORRECT? (Azad University, 92)
1) A formative test typically covers some predetermined segment of instruction (e.g. a unit, a chapter, or a particular test or skill).
2) We call the testing process direct when it requires the candidates to perform precisely the skill which we wish to measure.
3) A usage test is a test which taps the test taker's ability to compose correct sentences without regard to their communicative context.
4) A pure power test is one in which item difficulty decreases gradually and has a time limit long enough to permit everyone to attempt all items.

19- All of the following are characteristics of achievement tests EXCEPT ---------. (Azad University, 93)
1) They determine the extent of examinees' ability to utilize language at the time of examination.
2) They are usable before an instructional program starts.
3) They measure the degree of students' learning and determine their problems.
4) They judge the examinees' past performance.

Azad University Answers

1. Choice 4
Refer to Section 1.1.3.
3. Choice 4
Proficiency tests are not related to a particular course of instruction or its objectives.
4. Choice 1
Aptitude tests are used to predict applicants' success in achieving certain objectives in the future.
5. Choice 4
Refer to Section 1.1.1.
6. Choice 4
Refer to Section 1.2.2.
7. Choice 2
A pull-out program is one in which gifted children are taken out of their regular classroom for one or more hours a week and provided with enrichment activities and instruction. Finding out which students fit a pull-out program is not the function of classroom tests.
8. Choice 1
One of the functions of achievement tests is their diagnosis. Through this property, the weaknesses of testees and even of instruction are revealed, and this leads teachers to revise their teaching.
9. Choice 2
Refer to Section 1.1.1.
10. Choice 1
Since proficiency tests are norm-referenced, the test must provide scores that form a wide distribution so that interpretations of the differences among students will be as fair as possible.
11. Choice 1
Refer to Section 1.2.1.
12. Choice 1
Refer to Section 3.
13. Choice 2
Performance assessment is distinguished from other types of testing in that (a) examinees must perform tasks, (b) the tasks should be as authentic as possible, and (c) success or failure in the outcome of the tasks, because they are performances, must usually be rated by qualified judges.
14. Choice 4
Refer to Section 2.3.
15. Choice 2
Refer to Section 1.2.3.
16. Choice 4
Achievement tests should measure clearly defined learning outcomes that are in harmony with the instructional objectives. Achievement tests can be designed to measure a variety of learning outcomes, such as knowledge of specific facts, knowledge of terms, etc. The first order of business in constructing an achievement test is the identification and definition of the learning outcomes to be measured. These should logically grow out of the instructional objectives of the course in which the test is to be used.
17. Choice 3
Refer to Section 1.1.2.
18. Choice 4
Refer to Section 2.3.
19. Choice 2
Achievement tests are administered at the end of the course to measure the amount of learning by learners.

Chapter 3

Forms of Language Test
• Structure of an Item
• Classification of Item Forms
• Types of Items
• Traditional vs. Alternative Assessment

Forms of Language Test

The form of a test refers to its physical appearance. Language testers should use the most suitable form of the test according to the nature of the attribute and the function of the test.

1. STRUCTURE OF AN ITEM
An item, the smallest unit of a test, consists of two parts: the stem and the response. The purpose of the stem (or lead) is to elicit information; it may appear in the form of a question, a statement, etc. The response is the information elicited from the examinee, which may be a single morpheme or an extended piece of writing. In the case of multiple-choice items, the stem is followed by three or more responses (or alternatives; options; choices), one of which is the correct option (or answer; key) and the others are called distractors. The purpose of the distractors is to distract the majority of poor students from the correct option.

1) How many functions do language tests serve? (stem)
   two → key
   (the other alternatives → distractors)
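For concreteness, the anatomy of an item can be mirrored in a small data structure; this is a minimal sketch, not from the book, and the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    stem: str            # the lead that elicits the information
    options: list[str]   # the alternatives shown to the examinee
    key: int             # index of the correct option

    def distractors(self):
        """Every option except the key is a distractor."""
        return [opt for i, opt in enumerate(self.options) if i != self.key]

item = MultipleChoiceItem(
    stem="How many functions do language tests serve?",
    options=["two", "three", "four"],
    key=0,
)
print(item.distractors())  # ['three', 'four']
```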

2. CLASSIFICATION OF ITEM FORMS

2.1. Subjective vs. Objective Items
Subjective items are those in which the scorer must make an opinionated judgment about the correctness of the responses, based on his subjective interpretation of the scoring criteria. For example, in a test of translation, there is the problem of inconsistency of scores when two raters rate a piece of translation, due to the fact that each has his own criteria. Therefore, these items suffer from unreliability, or lack of consistency of scores. Other examples include essay writing, oral interviews, etc. In such tests, examiners tend to spend relatively little time on setting the questions but considerable time on marking.
Objective items are those in which the correctness of the test taker's response is determined by predetermined, objective criteria, so that no judgment is required on the part of the scorer. As a result, objective tests have only one correct answer (or, at least, a limited number of correct answers), and they can be scored mechanically. In an objective test the tester spends a great deal of time constructing each test item as carefully as possible but little time on scoring. Multiple-choice (MC) and true-false items are popular kinds of the so-called objective measures.
Note: You should know that objectivity and subjectivity refer to the way a test item is scored and have little or nothing to do with the form of a test. Though the following two items have the same form, the first one is subjective (because each person might have a different opinion about the most beautiful season) and the second one is objective (because everyone agrees that there are four seasons in a year).
The most beautiful season is ......
1) spring  2) summer  3) fall  4) winter
There are ...... seasons in a year.
1) four  2) three  3) two  4) five

2.2. Essay-Type vs. Multiple-Choice Items
Essay-type items are those in which the examinee is required to produce language elements. They are advantageous in that they require learners to draw together knowledge and demonstrate understanding rather than simply remember isolated facts. The following are such examples:
• Describe one of your family members.
• What is the stance of the author towards the issue?
Multiple-choice items are those in which the examinee is required to select the correct response from among given alternatives. This type of item has the advantage of being easy to examine statistically. Multiple-choice type items include multiple-choice, true-false, matching, etc.
This classification is unsatisfactory in that the production of, for example, a single word is equated with the production of, for example, an essay, while each requires certain types of activity.

2.3. Suppletion vs. Recognition Items
The only improvement here is the change of terms. Suppletion (or production; recall) items, like essay-type items, require the examinee to supply the missing part(s) of a sentence

or complete an incomplete sentence. Recognition items, like multiple-choice items, require the examinee to select an answer from a list of possibilities. Recognition-type items include multiple-choice, true-false, matching, etc.

2.4. Psycholinguistic Classification
In this type of classification, the form of the item is determined by taking theoretical principles of language processing into account. These theoretical assumptions include psychological and linguistic principles.
• Psychological principles: This classification attempts to benefit from the principles of psychology, because responding to an item requires psychological processes. It is clear that from the first moment of encountering a single item, psychological processes are involved to process written or oral materials. The two major psychological processes which are crucial to language processing are recognition and production, between which could be placed comprehension and comprehension/production.
• Linguistic principles: An item involves linguistic theories, because an item which is presented and responded to in a certain form of language involves linguistic elements. Language can be manifested in three different ways, referred to as the modality or instrumentality. Modality deals with the ways through which language is manifested:
Modality:
• Verbal: oral; written
• Non-verbal: pictures, maps, etc.
Generally, this two-dimensional model could be presented in a table like the following. Birjandi and Mosallanejad give examples for each cell.

Linguistic \ Psychological | Recognition | Comprehension | Comp/Prod | Production
Oral | listening and recognition | listening comprehension | oral interview | oral presentation
Written | reading and recognition | reading comprehension | summary or paraphrase | essay
Pictorial | watching and recognition | watching comprehension | look and say | drawing items

3. TYPES OF ITEMS
Below we will investigate five general classes of items, i.e. receptive items, fill-in items, short-response items, task items, and personal response items. Other items, such as comprehension-production and production forms, will be discussed in subsequent chapters.

3.1. Receptive Response Items
Receptive response items require the student to select a response rather than actually produce one. In other words, the responses involve receptive language in the sense that the item responses from which students must select are heard or read, receptively.

3.1.1. True-false items
True-false items are technically called alternative-response items. Such items require testees to mark true or false (T/F), right or wrong (R/W), correct or incorrect (C/I), yes or no (Y/N). The most common use of true-false items is to measure the ability of examinees to identify the accuracy of the information provided through a statement. Consequently, such items are usually used to assess simple learning outcomes. In other words, comprehension is the major psychological process involved in answering true-false items.
True-false items have the following advantages:
• They are easy to construct.
• They allow the test developers to use a large number of items in a given test.
True-false items have the following disadvantages:
• They very much depend on chance; namely, the examinees have a fifty percent chance of giving a correct response.
• They are limited to measuring simple learning activities in language. Complex tasks cannot be measured validly through true-false items.

3.1.2. Matching items
Matching items present the students with two columns of information; students must then find and identify matches between the two sets of information. The information given in the left-hand column is called the matching item premise, and that shown in the right-hand column is labeled the options. Matching items are usually used for measuring facts based on simple associations. In language testing, one of the uses of matching items is to check the students' ability in recognizing and comprehending synonyms and antonyms. In comparison to true-false items, matching items require more complex mental activities. An example of matching items is as follows:

Match the words in column I with their antonyms in column II.
Column I: construct, refuse, clarify
Column II: accept, approach, destroy, confuse

3.1.3. Muitiple-choice (MC)items Multiple-choice items are undoubtedly one of the most widely used types of items in-objective

tests. MC items havethe following advantages. *

Because of the highly structured nature ofthese items, the test writer can get directly at

many ofthe specific skills and learning he wishes to measure. This in turn leads to their diagnostic function. The test writer can include a large numberofdifferent tasks in the testing session. Thus they have practicality. Scoring can be done quickly and involves no judgments as to degrees of correctness. Thusthey havereliability.

However, these items are disadvantageous on the following grounds:
• They are passive, i.e. such items test only recognition knowledge but not language communication.
• They may have harmful washback.
• They expose students to errors.
• They are de-contextualized.
• They are one of the most difficult and time-consuming types of items to construct.
• They are simpler to answer than subjective tests.
• They encourage guessing.
In response to the last criticism, proponents of MC items have stated: the constructor of a standardized test not only selects and constructs the items carefully but analyses student performance on each item and rewrites the items where necessary so that the final version of his test discriminates widely. Also, there is a way to compensate for students' guessing on tests. That is, there is a mathematical way to adjust or correct for guessing. This statistical procedure, properly named the guessing correction formula, is:

Score = Right − Wrong / (n − 1)

where n = the number of options

Example: In a test which consisted of 80 items with four options, a student answered 50 items correctly and gave 30 wrong answers. After applying the guessing correction formula his score would be ---------.
1) 45  2) 35  3) 40  4) 30
Score = Right − Wrong/(n − 1) = 50 − 30/(4 − 1) = 50 − 10 = 40

Example: A student took a 100-item test; 70 of her answers turned out to be correct, 20 wrong, and 10 items were left blank. If each item had 5 choices, what would be her score after applying the guessing formula?
1) 65  2) 75  3) 55  4) 85
Score = Right − Wrong/(n − 1) = 70 − 20/(5 − 1) = 70 − 5 = 65

Example: In a test which consists of 70 items, a student gave 20 wrong answers and left 20 items blank. If each item had three options, her score would be ---------.
1) 40  2) 20  3) 37  4) 47
Score = Right − Wrong/(n − 1) = 30 − 20/(3 − 1) = 30 − 10 = 20
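Since the guessing correction is pure arithmetic, the worked examples above are easy to check mechanically. The following is a minimal Python sketch (the function name `corrected_score` is ours, not the book's):

```python
def corrected_score(right: int, wrong: int, n_options: int) -> float:
    """Guessing-corrected score: Right - Wrong / (n - 1)."""
    return right - wrong / (n_options - 1)

# The three worked examples above:
print(corrected_score(50, 30, 4))  # 40.0
print(corrected_score(70, 20, 5))  # 65.0
print(corrected_score(30, 20, 3))  # 20.0
```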

As was mentioned before, the incorrect options in a multiple-choice item are called distractors. These distractors are best based on:
• mistakes in the students' own written work;
• their answers in previous tests;
• the teachers' experience; and
• a contrastive analysis between the native and target languages.

3.2. Fill-in Items
Fill-in items are those wherein a word or phrase is replaced by a blank in a sentence or longer text, and the student's job is to fill in that missing word or phrase.

3.3. Short-Response Items
Short-response items are usually items that the students can answer in a few phrases or sentences. In these items it is advisable to use partial credit. Partial credit entails giving some credit for answers that are not 100 percent correct. For instance, on one short-response item, a student might get two points for an answer with correct spelling and correct grammar, but only one point if either grammar or spelling were wrong, and no points if both grammar and spelling were wrong.

3.4. Task Items
Task items are defined as any of a group of fairly open-ended item types that require students to perform a task in the language that is being tested. A task test might include a series of communicative tasks, a set of problem-solving tasks, and a writing task. In another alternative that has become increasingly popular in the last decade, students are asked to perform a series of writing tasks and revisions during a course and put them together into a portfolio.

3.5. Personal Response Items (or alternative assessment options)
So far, the focus has been on formal tests in the classroom. In recent years, language teachers have stepped up efforts to develop non-test assessment options that are nevertheless carefully designed and that adhere to the criteria for adequate assessment. Such innovations are referred to as personal response items that encourage the students to produce responses that hold personal meaning. In other words, the responses allow students to communicate in ways and about things that are interesting to them personally. Personal response item formats include self-assessments, journals, conferences, and portfolios. These personal response type items are also called alternative assessment.

3.5.1. Self-Assessment
Self-assessment is defined as any item wherein students are asked to rate their own knowledge, skills, or performances. Thus, self-assessments provide the teacher with some idea of how the students view their own language abilities and development. Research has shown a number of advantages of self-assessment and peer-assessment: speed, direct involvement of students, the encouragement of autonomy, and increased motivation because of self-involvement in the process of learning. Of course, the disadvantage of subjectivity looms large. Students may be either too harsh on themselves or too self-flattering, or they may not have the necessary tools to make an accurate assessment. Also, especially in the case of direct assessments of performance (see below), they may not be able to discern their own errors.

Note: The related concept of peer-assessment is simply a variation on this theme that requires students to rate each other. In addition to autonomy and intrinsic motivation, peer-assessment appeals to the principle of cooperative learning.
It is important to distinguish among several different types of self- and peer-assessment and to apply them accordingly. There are five categories of self- and peer-assessment:

• Direct assessment of a specific performance: a student typically monitors himself in either oral or written production and renders some kind of evaluation of performance.
• Indirect assessment of general competence: this type of assessment targets large slices of time with a view to rendering an evaluation of general ability, as opposed to one specific, relatively time-constrained performance.
• Metacognitive assessment: a student sets goals and maintains an eye on the process of their pursuit.
• Socioaffective assessment: such assessment requires looking at oneself through a psychological lens. When learners resolve to assess and improve motivation, to gauge and lower their own anxiety, and to find mental or emotional obstacles to learning and then plan to overcome those barriers, an all-important socioaffective domain is invoked.
• Student-generated tests: a final type of assessment that is not usually classified as self- or peer-assessment is the technique of engaging students in the process of constructing tests themselves.

3.5.2. Journal
A journal is a log or account of one's thoughts, feelings, reactions, assessments, ideas, or progress toward goals, usually written with little attention to structure, form, or correctness. Learners can articulate their thoughts without the threat of those thoughts being judged later. Journals can range from language learning logs, to grammar journals, to responses to readings, to strategies-based learning logs, to self-assessment reflections, to acculturation logs, to attitudes and feelings about oneself.
Most classroom-oriented journals are what have now come to be known as dialogue journals. They imply an interaction between a reader (the teacher) and the student through dialogues or responses. For the best results, those responses should be dispersed across a course at regular intervals, perhaps weekly or biweekly. One of the principal objectives in a student's dialogue journal is to carry on a conversation with the teacher. Through dialogue journals, teachers can become better acquainted with their students, in terms of both their learning progress and their affective states, and thus become better equipped to meet students' individual needs.
Journals obviously serve important pedagogical purposes: practice in the mechanics of writing, using writing as a thinking process, individualization, and communication with the teacher. At the same time, the assessment qualities of journal writing have assumed an important role in the teaching-learning process. Because most journals are a dialogue between student and teacher, they afford a unique opportunity for a teacher to offer various kinds of feedback. On the other side of the issue, it is argued that journals are too free in form to be assessed accurately. With so much potential variability, it is difficult to set up criteria for evaluation. For some English language learners, the concept of free and unfettered writing is anathema. Certain critics have expressed ethical concerns: students may be asked to reveal an inner self, which is virtually unheard of in their own culture.


3.5.3. Conference
Conferences are defined as any assessment procedures that involve students visiting the teacher's office, alone or in groups, for brief meetings. Conferencing has become a standard part of the process approach to teaching writing, as the teacher, in a conversation about a draft, facilitates the improvement of the written work. Such interaction has the advantage of one-on-one interaction between teacher and student, and of the teacher's being able to direct feedback toward a student's specific needs. Of course, the list of possible functions and subject matter for conferencing is substantial:

• commenting on drafts of essays and reports
• reviewing portfolios
• responding to journals
• advising on a student's plan for an oral presentation
• assessing a proposal for a project
• giving feedback on the results of performance on a test
• clarifying understanding of a reading
• exploring strategies-based options for enhancement or compensation
• focusing on aspects of oral production
• checking a student's self-assessment of a performance
• setting personal goals for the near future
• assessing general progress in a course

Conferences must assume that the teacher plays the role of a facilitator and guide, not of a master controller and deliverer of a formal assessment. In this intrinsically motivating atmosphere, students need to understand that the teacher is an ally who is encouraging self-reflection and improvement. So that the student will be as candid as possible in self-assessing, the teacher should not consider a conference as something to be scored or graded. Conferences are by nature formative, not summative, and their primary purpose is to offer positive washback.
Note: Discussions of alternatives in assessment usually encompass one specialized kind of conference: an interview. This term is intended to denote a context in which a teacher interviews a student for a designated assessment purpose.


3.5.4. Portfolio

One of the most popular alternatives in assessment, especially within a framework of communicative language teaching, is portfolio development. A portfolio is a purposeful collection of students' work that demonstrates to students and others their efforts, progress, and achievements in given areas. Portfolios include materials such as
• essays and compositions in draft and final forms;
• poetry and creative prose;
• reports, project outlines;
• art work, photos, newspaper or magazine clippings;
• video- or audiotape recordings of a student's oral production;
• journals, diaries, and other personal reflections;
• tests, test scores, and written homework exercises;
• notes on lectures; and
• self- and peer-assessments: comments, evaluations, and checklists.


A developmental scheme has been developed for considering the nature and purpose of portfolios, using the acronym CRADLE to designate six possible attributes of a portfolio:
• Collecting is the process in which the students gather materials to include in their portfolio. It is left to the students to decide what to include. However, it is still necessary for the teacher to provide clear guidelines in terms of what can potentially be selected.
• Reflecting happens through the student thinking about the work they have placed in the portfolio. This can be demonstrated in many different ways. Common ways to reflect include the use of journals in which students comment on their work. Another way, for young students, is the use of checklists: students simply check the characteristics that are present in their work. As such, the teacher's role is to provide class time so that students are able to reflect on their work.
• Assessing involves checking and maintaining the quality of the portfolio over time. Normally, there should be a gradual improvement in work quality in a portfolio. This is a subjective matter that is negotiated by the student and teacher, often in the form of conferences.
• Documenting serves more as a reminder than an action. Simply, documenting means that the teacher and student maintain the importance of the portfolio over the course of its usefulness. This is critical, as it is easy to forget about portfolios under the pressure of the daily teaching experience.
• Linking is the use of a portfolio to serve as a mode of communication between students, peers, teachers, and even parents. Students can look at each other's portfolios and provide feedback. Parents can also examine the work of their child through the use of portfolios.
• Evaluating is the process of receiving a grade for this experience. For the teacher, the goal is to provide positive washback when assessing the portfolios. The focus is normally less on grades and more qualitative in nature.
The advantages of engaging students in portfolio development have been extolled in a number of sources. A synthesis of those characteristics gives us a number of potential benefits. Portfolios:
• foster intrinsic motivation, responsibility, and ownership,
• promote student-teacher interaction with the teacher as facilitator,
• individualize learning and celebrate the uniqueness of each student,
• provide tangible evidence of a student's work,
• facilitate critical thinking, self-assessment, and revision processes,
• offer opportunities for collaborative work with peers, and
• permit assessment of multiple dimensions of language learning.

4. ALTERNATIVE vs. TRADITIONAL ASSESSMENT
Personal response items are referred to as alternative assessment, if only to distinguish them from traditional formal tests. Early in the decade of the 1990s, in a culture of rebellion against the notion that all people and all skills could be measured by traditional tests, a novel concept emerged that began to be labeled 'alternative' assessment. As teachers and students were becoming aware of the shortcomings of standardized tests, an alternative to standardized testing and all the problems found with such testing was proposed. That proposal was to assemble additional measures of students in an effort to triangulate data about students. For some, such alternatives held 'ethical potential' in their promotion of fairness and the balance of power relationships in the classroom. Alternative assessments:
• use real-world contexts or simulations;
• use tasks that represent meaningful instructional activities;
• allow students to be assessed on what they normally do in class every day;
• ensure that people, not machines, do the scoring, using human judgment;
• tap into higher-level thinking and problem-solving skills; and
• are multi-culturally sensitive when properly administered.
The following table highlights differences between traditional and alternative assessment:

Traditional Assessment            Alternative Assessment
One-shot, standardized exams      Continuous long-term assessment
Timed, multiple-choice format     Untimed, free-response format
Decontextualized test items       Contextualized communicative tasks
Scores suffice for feedback       Individualized feedback and washback
Norm-referenced scores            Criterion-referenced scores
Focus on the right answer         Open-ended, creative answers
Summative                         Formative
Oriented to product               Oriented to process
Non-interactive performance       Interactive performance
Fosters extrinsic motivation      Fosters intrinsic motivation


State University Questions

1- The results obtained from objective tests can indirectly reflect the performance of ---------. (State University, 82)
1) individual testees  2) course instructor
3) the testees as a whole  4) the items composing the tests

2- In a composition test which is based on a picture, the three terms --------- refer to the item, response, and psychological process, respectively. (State University, 83)
1) pictorial, written, recognition  2) pictorial, written, production
3) written, pictorial, recognition  4) written, pictorial, comprehension

3- Typical objections to multiple-choice tests of reading include all of the following EXCEPT that they ---------. (State University, 84)
1) are passive  2) encourage guessing
3) expose students to errors  4) make the answer rely on facts

4- In a test consisting of 60 four-choice items, a student got 15 answers right through guessing. After applying the correction for guessing, the score will be ---------. (State University, 84)
1) 0  2) 45  3) 4  4) 30

5- All of the following are among the shortcomings of multiple-choice items EXCEPT ---------. (State University, 86)
1) the difficulty of writing successful items
2) the restriction related to what is tested
3) the unknowable effect of guessing on testees' scores and choice selection
4) the impossibility of applying the guessing correction formula

6- The so-called 'subjective' items are those in which ---------. (State University, 87)
1) the test taker needs to produce the language
2) both comprehension and production are necessary
3) the rater has biases for or against some test takers
4) there is more than one correct response for each item

7- In the psycholinguistic categorization of tests, the cell related to the comprehension and written modality of the input is exemplified by ---------. (State University, 87)
1) fill-in-the-blank cloze  2) open-ended summary writing
3) the true-false test of listening  4) the multiple-choice test of reading

8- If a student took a 100-item multiple-choice test with each item having five choices and attempted all the test items but got 76 on the test, his/her score would be --------- after the guessing-correction formula is applied. (State University, 88)
1) 24  2) 51  3) 68  4) 70

9- In contrast to traditional assessment, alternative assessment ---------. (State University, 89)
1) prefers non-interactive performance  2) is oriented to process
3) highlights norm-referenced scores  4) employs standardized exams

10- One takes a 50-item multiple-choice test with five alternatives for each item. The test taker attempts all the items but scores 40 on the test. This test taker's guessing-corrected score would be ---------. (State University, 90)
1) 35.5  2) 38.5  3) 37.5  4) 36.5

11- In contrast to traditional assessment, alternative assessment ---------. (State University, 91)
1) is oriented to the learning process  2) uses standardized exams
3) fosters extrinsic motivation  4) focuses on the right answer

12- As a form of assessment, portfolios ---------. (State University, 92)
1) are considered as an important peer-assessment strategy
2) are a procedure for cooperative test construction
3) were mostly encouraged in the psychometric-structuralist era
4) are a purposeful selection of students' work

13- Self-assessment is considered to be disadvantageous in terms of ---------. (State University, 92)
1) discouraging learner autonomy  2) subjectivity of assessment
3) students' direct involvement  4) speed of assessment

14- Portfolios are likely to promote all the following EXCEPT ---------. (State University, 93)
1) a continuous record of language development
2) students' experiences in and outside of school
3) student involvement in assessment
4) responsibility for self-assessment

15- Which of the following is NOT true about alternative assessment? (State University, 94)
1) It is one-shot and presents decontextualized test items.
2) It provides individualized feedback and washback.
3) It is a process-oriented continuous form of assessment.
4) It is based on an untimed, free-response format.

16- Which of the following is a feature of alternative assessment? (State University, 96)
1) It is product-oriented.  2) It fosters extrinsic motivation.
3) It focuses on the right answer.  4) It provides individualized feedback.

17- The acronym CRADLE is associated with --------- assessment. (State University, 97)
1) performance  2) traditional  3) portfolio  4) dynamic

State University Answers

1. Choice 2
2. Choice 2
The test is based on a picture and thus the item is pictorial. Besides, composition is a type of production in written form.
3. Choice 4
Refer to Section 3.1.3.
4. Choice 1
The student provided 45 wrong answers:
Score = Correct − Wrong/(n − 1) = 15 − 45/(4 − 1) = 15 − 15 = 0
5. Choice 4
Refer to Section 3.1.3.
6. Choice 4
Refer to Section 2.1.


7. Choice 4
Fill-in-the-blank cloze has the written modality of input; in terms of psychological processes, it involves both comprehension and production. Open-ended summary writing may have the oral modality of input. The true-false test of listening has the oral modality of input.
8. Choice 4
Score = Right − Wrong/(n − 1) = 76 − 24/(5 − 1) = 76 − 6 = 70
9. Choice 2
Refer to Section 4.
10. Choice 3
Score = Correct − Wrong/(n − 1) = 40 − 10/(5 − 1) = 40 − 2.5 = 37.5
11. Choice 1
Refer to Section 4.
12. Choice 4
Refer to Section 3.5.4.
13. Choice 2
Refer to Section 3.5.1.
14. Choice 2
Some of the devices used in continuous assessment, especially in language assessment, are journals, work samples, teacher observation, interviews, learner profiles, self-evaluation, peer evaluation and portfolios. A portfolio is a purposeful collection of students' work that demonstrates to others the students' efforts, progress, and achievements in given areas. When we apply the devices outlined above to second language, assessment can have a very specific focus, such as writing, or a broad focus that includes examples of all aspects of language development. Portfolios are not just a collection of documents: they provide a continuous record of language development, students can share learning with their parents, teachers and other educators, and they give opportunities for collaborative assessment and goal-setting with students. Furthermore, the devices used in continuous assessment promote responsibility for self-assessment, student involvement in assessment, interaction with teachers, parents and students about learning, critical thinking about one's work, motivation towards learning, and so on. However, despite all the benefits stated above, continuous assessment does not meet all the needs of different learners in different societies, though it does offer many benefits while working towards developing the full potential of our students.
15. Choice 1
Refer to the table in Section 4.
16. Choice 4
Refer to the table in Section 4.
17. Choice 3
Refer to Section 3.5.4.

Azad University Questions

1- Sentence completion-type items are sometimes preferred to MC-type items because they ---------. (Azad University, 82)
1) can tap production to some extent  2) are quite valid
3) are scored very easily  4) are used by most teachers

2- The smallest independent unit of a test is called ---------. (Azad University, 83)
1) an option  2) a subtest  3) an item  4) a choice

3- The terms subjective and objective refer to the way a test is ---------. (Azad University, 84)
1) constructed  2) administered  3) scored  4) interpreted

4- In psycholinguistic classification of item forms, the form of the item is determined by considering

Chapter 3/ Forms ofLanguage Test

owe. ial KeeNTE

the --------- of language.

(Azad University, 85)

1) theoretical principles  2) production procedures
3) applied procedures  4) psychometric principles

5- Based on the psycholinguistic classification of language tests, a true-false item can be either a --------- or --------- type item. (Azad University, 85)
1) recognition – comprehension  2) recognition – production
3) comprehension – production  4) comprehension/production – recognition

6- Why do people consider essay examinations to be more valid than objective tests? (Azad University, 85)
1) Essay questions are easier to make up.
2) Essay tests lend themselves to split-half reliability studies.
3) Students need to study less intensively to answer an essay question.
4) Students draw together knowledge and demonstrate understanding rather than simply remember isolated facts.

7- Multiple-choice items are ---------. (Azad University, 87)
1) difficult to score  2) highly structured  3) easy to construct  4) less content-based

8- The main purpose of the stem is to ---------. (Azad University, 88)
1) provide the test-takers with redundant information
2) present the problem clearly and concisely
3) lead the abler test-taker to choose the wrong options
4) encourage test-takers to make intelligent guesses

9- One of the pitfalls of limited-response items is that they are ---------. (Azad University, 88)
1) purely subjective  2) slow to correct  3) not easy to construct  4) not informative

10- In testing, which of the following is NOT among the main factors involved in self-assessment? (Azad University, 91)
1) gender  2) entry level  3) world-knowledge  4) culture

Azad University Answers

1. Choice 1
Carefully constructed completion items are a useful means of testing a student's ability to produce acceptable and appropriate forms of language. They are frequently preferable to multiple-choice items since they measure production rather than recognition, testing the ability to insert the most appropriate words in selected blanks in sentences.
2. Choice 3
Refer to Section 1.
3. Choice 3
Refer to Section 2.1.
4. Choice 1
Refer to Section 2.4.
5. Choice 1
As testees are not supposed to produce anything in a true-false item, it cannot be of the production type.
6. Choice 4
In essay-type items testees should produce linguistic elements, and sometimes testees are required to integrate skills or sub-skills to produce language. In this way some prefer essay examinations.
7. Choice 2
Refer to Section 3.1.3.
8. Choice 2
Refer to Section 1.
9. Choice 4
Limited-response items require subjects to make their own responses which are, however, constrained or restricted, thus ensuring objective marking provided that the items have been properly worked out. As the name reveals, and since the answers are not varied, they are easily scorable and they are easy to construct (compared to multiple-choice items, for example). Free-response items have to be most carefully worded if they are to be fair to the candidates and objectively scorable. Let us suppose that we are testing knowledge of oblique factors. The question 'What are oblique factors?' is too broad and should be more specific, as in 'What information of psychological significance can be obtained from the angles between oblique factors?'
10. Choice 3
Thompson points out the main factors influencing self-assessment: gender, culture, previous knowledge of the target language, writing script, entry level to the target language, and maturity.

Chapter 4
Basic Statistics in Language Testing
• Statistics
• Types of Data
• Tabulation of Data
• Graphic Representation of Data
• Descriptive Statistics
• Normal Distribution
• Derived Scores
• Correlation
• Correlational Indexes
• Correlational Formulas

1. STATISTICS
The purpose of describing the results of a test is to provide test developers and test users with a picture of how the students performed on it. Therefore, students in the field of language testing must somehow demonstrate knowledge of the language and methods of statistics.
Statistics involves collecting numerical information called data, analyzing them, and making meaningful decisions on the basis of the outcome of the analyses. Mathematicians have divided the field of statistics into two major areas called descriptive statistics and inferential statistics. The statistics used to summarize data are called descriptive statistics. In descriptive statistics we describe sample data. Each characteristic of the sample is called a statistic. The description summarizes only the sample data. From the statistic, i.e. the characteristic of a given sample, we can make inferences about the characteristic of a given population, called a parameter. This is possible through the utilization of the methods of inferential statistics. Inferential statistics is much more complicated than descriptive statistics and is not therefore the focus of our discussion.
Suppose that you want to find out if teaching vocabulary to Iranian high school students increases their comprehension of texts. Since you can't do your experiment on all Iranian high school students (due to lack of enough time, budget, facilities, etc.), you decide to pick thirty such students. This smaller number of students is, then, called the sample. A sample is drawn from a pool of students called the population. In this example, the thirty students you have chosen as subjects of the study are the sample, which has been drawn from all Iranian high school students, i.e. the population. After you carry out the study, you give a test and obtain thirty scores. But these data are not very revealing. You need to make them more comprehensible by summarizing them. This summarizing is done through descriptive statistics. For example, you may calculate the mean or standard deviation of the scores of the sample. These two characteristics (i.e. mean and standard deviation) are called statistics. However, after you have looked at the data summary, you will undoubtedly want to know whether the results mean anything for other Iranian high school students. The principles that let you expand the findings of the sample to predictions about other learners in the population are those of inferential statistics. Having applied inferential statistics to descriptive statistics, you come to certain characteristics of the population. These characteristics are referred to as parameters.

2. TYPES OF DATA
The way the data are coded will depend, in part, on the scales you have used to measure the variables. Typically, four types of scales appear in the language teaching literature. The four scales represent four different ways of observing, organizing, and quantifying language data.

2.1. Nominal Data (or nominal scale)
As the name implies, nominal data names an attribute or category and classifies the data according to presence or absence of the attribute. Some of the most common categories or groupings for people could be 'gender,' 'nationality,' 'native language,' 'educational background'. In order for a scale to be nominal, one condition must always be met: each observation on the scale must be independent; that is, each observation must fall into one and only one category. In recording this type of information:
• You can use a yes/no notation. For example, with regard to native speakers of Persian you can categorize people as yes (native) or no (non-native).

• You can also use a +/− system where (+) means native speaker and (−) means non-native speaker.
• Still a third way is to assign an arbitrary number to each possibility. So in the above example we can say a native person is assigned 1 and a non-native person is assigned 2.
Nominal data do not have to be dichotomous, however. For example, the nominal variable native language could be recorded as 1 = Persian, 2 = Italian, 3 = German, 4 = Turkish.
Note: It is important to know that the numbers assigned to represent different categories of a nominal variable have no arithmetic value. An average language of 2.5 based on 1 = Persian, 2 = Italian, 3 = German, 4 = Turkish is meaningless.

2.2. Ordinal Data (or ordinal scale)
Like the nominal scale, an ordinal scale names a group of observations, but, as its label implies, an ordinal scale also orders, or ranks, the data. For example, if you want to measure happiness, there is no reliable method of measuring this concept. However, you can say that a person is very unhappy – unhappy – happy – very happy. These can be assigned numbers from 1 to 4. We could also obtain ordinal data by a rating scale:

You are happy today.    1  2  3  4

Another instance of ordinal data is obtained when you as a teacher, based on your subjective judgment, rank the students' essays from the best quality to the least quality.
Note: In this case, the numbers do have arithmetic value. Persons rated 4 are ordered higher than those with a 3, and those with a 3 higher than those with a 2. While it is true that the scales in ordinal measurement have arithmetic value, the value is not precise. That is, a person ranked 4 is not precisely twice as happy as one ranked 2. Nor is the distance between 1 and 2 necessarily equal to that from 3 to 4. The points on the scales, and the numbers used to represent those points, are not equal intervals.
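Both notes above are easy to demonstrate in code: arithmetic on nominal codes produces nonsense, while ordinal ranks order correctly but say nothing about distances. A minimal Python sketch (the data and labels are ours, for illustration only):

```python
# Nominal codes: the numbers are only labels
native_language = {1: "Persian", 2: "Italian", 3: "German", 4: "Turkish"}
codes = [1, 2, 4, 1, 2]
avg = sum(codes) / len(codes)   # 2.0 -- "the average language is Italian"?
# The average of nominal codes is meaningless.

# Ordinal ranks: the order is real, the distances are not
happiness = ["very unhappy", "unhappy", "happy", "very happy"]  # ranks 1..4
# Rank 4 is ordered above rank 2, but is not "twice as happy".
```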


2.3. Interval Data (or interval scale)
Interval data represent the ordering of a named group of data, but they provide additional information. Interval data also show the (more) precise distances between the points in the rankings. They differ from ordinal data in that each interval has the same value, so that units can be added or subtracted. You can't do that with ordinal measurement. You can't say 'fair + fair = good,' not even if you've assigned numbers to fair and to good.
Note: Test scores are considered as though they were interval data. In a set of test scores, if two students obtain scores of 95 and 100, respectively, we say that one is better than the other to the extent of the value of those five points. In other words, we assume that those five points are of equal value.

2.4. Ratio Data (or ratio scale)
Ratio data, also called absolute interval data, are similar to interval data except that they have an absolute zero. As a result of this new characteristic, with ratio data we can say 'this point is two times as high as that point,' but with interval data this is not possible. Therefore, you can't say a score of 20 is twice as big as 10, because in testing there is no such concept as a score of zero. Some of the common ratio data are age, time, heat, weight, height, etc.
Note: Nominal and ordinal data are also called discrete data.
Note: Interval and ratio data are also called continuous data.
The following shows the four scales of measurement along with the type of information each includes:

Scale      Names categories   Shows ranking   Gives distances   Absolute zero
Nominal    yes
Ordinal    yes                yes
Interval   yes                yes             yes
Ratio      yes                yes             yes               yes

3. TABULATION OF DATA
Before quantitative data can be understood and interpreted, it is usually necessary to summarize them. Suppose that a table shows the reading scores of twelve students (a through l) on an achievement test; the individual scores are the ones summarized in the frequency table below (100, 100, 100, 100, 97, 96, 95, 95, 93, 92, 92, 92).

3.1. Rank Order
Ordinarily, the first step is to arrange the scores in order of size, usually from highest to lowest. If two testees received the same score, we divide the sum of their ranks by two. For example, in the table below 95 is ranked as (7 + 8)/2 = 7.5. In case of three similar scores, we divide the sum of their ranks by three, and so forth; e.g. three students received 92: (10 + 11 + 12)/3 = 11.

Score:       100  100  100  100  97  96  95   95   93  92  92  92
Rank order:  2.5  2.5  2.5  2.5  5   6   7.5  7.5  9   11  11  11

3.2. The Frequency Distribution
The list of scores can be made shorter by arranging the scores from high to low in a frequency distribution, sometimes simply called a distribution. Frequency is the number of times a score occurs. Also called simple or absolute frequency, this type of distribution is shown by the small letter f. For example, 2 shows there are two testees who obtained 95; 3 shows three testees obtained 92, and so forth. The next table shows the same scores, with two minor differences: first, each score appears once in the table; second, the rank order column is deleted.

Score   Frequency (f)   Relative frequency   Percentage        Cumulative frequency (F)   Percentile
100     4               0.33                 0.33 × 100 = 33   12                         100
97      1               0.08                 0.08 × 100 = 8    8                          66
96      1               0.08                 0.08 × 100 = 8    7                          58
95      2               0.16                 0.16 × 100 = 16   6                          50
93      1               0.08                 0.08 × 100 = 8    4                          33
92      3               0.25                 0.25 × 100 = 25   3                          25

3.3. Relative Frequency
Relative frequency refers to the simple frequency of each score divided by the total number of scores. See columns three and four in the above table. For example, the relative frequency of the score 93 is:
Relative frequency = 1/12 = 0.08

3.4. Percentage
When the relative frequency index is multiplied by 100, the result is called percentage. Let's calculate the percentage of the score 93:
Percentage = 0.08 × 100 = 8
The index means that 8 percent of those who took the test scored 93.

3.5. Cumulative Frequency
Cumulative frequency indicates the standing of any particular score in a group of scores. Cumulative frequency, represented by F, shows how many scores fall at or below the given score in a distribution. In other words, this index shows how many students received a particular score or less than that. This frequency is calculated by adding the frequencies of successive intervals in the previous column. The lowest cumulative frequency always equals the lowest frequency; in this case the cumulative frequency of 92 equals 3. The next cumulative frequency is calculated by adding the cumulative frequency of 92 to the frequency of 93, i.e. 3 + 1 = 4. The third cumulative frequency is calculated by adding the cumulative frequency of 93 to the frequency of 95, i.e. 4 + 2 = 6, and so forth.

3.6. Percentile
When the cumulative frequency index is divided by the total number of learners and multiplied by 100, the result is the percentile. Percentile rank shows what percentage of students received a particular score or below it. Let's calculate the percentile rank of 93:
Percentile rank = 4/12 × 100 = 33
The index means that 33% of the students who took the test scored at or below 93.

4. GRAPHIC REPRESENTATION OF DATA
Frequency data can be displayed in far more graphic and appealing ways than the plain, ordinary frequency distributions. Graphic representations serve three purposes:
• better comprehension of data than is possible with textual matter alone;
• more penetrating analysis of the subject than is possible in written text;
• a check of accuracy.
Graphic displays of scores generally come in one of three forms: a bar graph, a histogram, or a frequency polygon. All three are drawn on two axes: a horizontal line (abscissa) and a vertical line (ordinate).

4.1. The Bar Graph
In bar graphs, the vertical line (ordinate) represents the frequency and the horizontal line (abscissa) represents score values.
Example: The following are the results of a group of students on an achievement test:
75, 76, 76, 76, 79, 80, 80, 86, 86, 86, 86, 89, 90, 90
(Figure: bar graph of the scores above, with score values on the abscissa and frequencies on the ordinate.)

4.2. The Histogram
A histogram is a series of columns, each having as its base one class interval and as its height the number of cases, or frequency, in that class.
Example: Let's draw the histogram for the achievement test scores given above.
(Figure: histogram of the same scores grouped into class intervals, e.g. 75-78, 79-82, 83-86, 87-90.)
Note: A histogram is similar to a bar graph in that both use bars to represent frequency.
Note: A histogram is different from a bar graph in that on the horizontal line the histogram shows class intervals while the bar graph shows score values. Besides, the histogram uses connected bars while the bar graph uses discrete bars.
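All of the tabulation indices from Section 3 can be checked with a few lines of code. Below is a minimal Python sketch (the variable names are ours) that reproduces the frequency table above:

```python
from collections import Counter

scores = [100, 100, 100, 100, 97, 96, 95, 95, 93, 92, 92, 92]
n = len(scores)
freq = Counter(scores)  # absolute frequency f of each score

cumulative = 0
for score in sorted(freq):           # from the lowest score upward
    cumulative += freq[score]        # cumulative frequency F
    rel = freq[score] / n            # relative frequency
    percentile = round(cumulative / n * 100)
    print(score, freq[score], round(rel, 2), round(rel * 100), cumulative, percentile)
# e.g. the row for 93: f = 1, relative frequency 0.08, percentage 8, F = 4, percentile 33
```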


4.3. The Frequency Polygon (or line chart)
For the polygon we follow the steps for drawing a histogram except that instead of using bars to show frequencies, a point is located above the midpoint of each class interval and then the points are connected.
Example: Let's draw the frequency polygon for the achievement test scores given above.
(Figure: frequency polygon over the class intervals of the histogram.)

5. DESCRIPTIVE STATISTICS
As was explained previously, descriptive statistics are used to summarize the data obtained from a sample. Two aspects of group behavior are crucial in descriptive statistics: measures of central tendency and measures of variability.

DESCRIPTIVE STATISTICS
• Central tendency: mode, median, mean
• Variability: standard deviation, variance

5.1. Measures of Central Tendency
Centrality measures describe the most typical behavior of a group. They include the mode, median, mean and midpoint.

5.1.1. Mode
The most easily obtained (and the least accurate) measure of central tendency is the mode. The mode is the score that occurs most frequently in a set of scores. A memory device to keep the mode straight in mind is that the word mode can mean fashionable. Thus, the mode would be that score which is most fashionable.
Example: In the following set of scores, 88 is the mode since it is repeated three times:
80, 81, 81, 85, 88, 88, 88, 93, 94, 94, 95
Note: Sometimes there might be more than one mode. Such distributions of scores are referred to as bimodal (if there are two peaks) or trimodal (if there are three peaks).
Example: In the following set of scores, 87 and 89 are the modes (each repeated three times):
82, 85, 85, 87, 87, 87, 88, 89, 89, 89, 90
Note: When all of the scores in a group occur with the same frequency, it is customary to say that the group of scores has 'no mode'.
Example: The following set of scores has no mode because each score is repeated three times:
83, 83, 83, 88, 88, 88, 90, 90, 90, 95, 95, 95
Note: When two adjacent scores have the same frequency and this common frequency is greater than that for any other score, the mode is the average of the two adjacent scores.
Example: The mode of the following set of scores is 86.5:
80, 82, 82, 85, 85, 85, 88, 88, 88, 90
Mode = (85 + 88)/2 = 86.5

5.1.2. Median
The median (Md) is the score at the 50th percentile in a group of scores. It is the score that divides the ranked scores into halves, such that half of the scores are larger than the median, and the other half of the scores are smaller.
Example: In the following set of scores 85 is the median:
81, 81, 82, 84, 85, 86, 86, 88, 89
Note: If the data are an even number of scores, the median is the point halfway between the central values when the scores are ranked.
Example: In the following set of scores, 84.5 is the median:
81, 81, 82, 84, 85, 86, 86, 88
Median = (84 + 85)/2 = 84.5
Note: In order to calculate the mode and the median you have to rank order the scores first; otherwise, you would get the wrong result.

5.1.3. Mean
The mean (X̄) is probably the single most often reported indicator of central tendency. It is the same as the arithmetic average. The formula for obtaining the mean is:


X̄ = ΣX / N
where: Σ = sum of; X = each score; N = the total number of scores
Example: Using the above formula, the mean of the following set of scores is 86:
82, 84, 86, 88, 90
X̄ = (82 + 84 + 86 + 88 + 90)/5 = 86

The mean has the following characteristics.
First, if we subtract the mean X̄ from the score X, the resulting difference is a deviation score (D); if we were to find the deviations of the scores from the mean of the set, the sum would be exactly zero. This property is illustrated in the following table:

Score   Mean   Compute   Deviation score
82      86     82 − 86   −4
84      86     84 − 86   −2
86      86     86 − 86   0
88      86     88 − 86   +2
90      86     90 − 86   +4
                         ΣD = 0

Second, the sum of the squared deviations of scores from their arithmetic mean is less than the sum of the squared deviations around any point other than the mean. This property is illustrated by the following example:

Around the mean (86)   Around 87          Around 84
(82 − 86)² = 16        (82 − 87)² = 25    (82 − 84)² = 4
(84 − 86)² = 4         (84 − 87)² = 9     (84 − 84)² = 0
(86 − 86)² = 0         (86 − 87)² = 1     (86 − 84)² = 4
(88 − 86)² = 4         (88 − 87)² = 1     (88 − 84)² = 16
(90 − 86)² = 16        (90 − 87)² = 9     (90 − 84)² = 36
ΣD² = 40               ΣD² = 45           ΣD² = 60

Note: The mean is the only measure of centrality which takes into account all the scores.
Note: Although the mean is the most frequently used measure of central tendency, it has a limitation. It is seriously sensitive to extreme scores. If there is an outlier score, the median is preferred to the mean.

5.1.4. Midpoint
The midpoint in a set of scores is the point halfway between the highest score and the lowest score on the test. The formula for calculating the midpoint is:
Midpoint = (highest score + lowest score) / 2
Example: In the following set of scores the midpoint is 84.5:
81, 81, 82, 84, 85, 86, 86, 88
Midpoint = (81 + 88)/2 = 84.5

5.2. Measures of Variability (or dispersion)
Although measures of centrality are very useful in analyzing a set of data, these measures only locate the center of the distribution. In certain situations the location of the center may not be adequate to provide a logical picture of the data. Suppose we gave a test to measure reading speed to two different classes and they both turned out to have the same mean scores. Does this imply that the two classes are really the same? No, of course it doesn't. The variability among the scores, how they spread out from the central point, may be quite different in the two groups. Compare the following figure for two distributions both of which have the same mean score.
(Figure: two overlaid distributions with the same mean; class A is spread more widely than class B.)
Notice that in class A the spread of scores is larger than in class B. Therefore, to be able to talk about data more accurately, we have to measure the degree of variability of the data from our measure of central tendency. There are three major ways to show how the data are spread out: the range, the standard deviation and the variance.

5.2.1. Range
The range is the simplest measure of dispersion and is defined as the number of points between the highest score on a measure and the lowest score (plus one).
Example: The range of the following set of scores is 8:
92, 95, 95, 97, 98, 98, 100
Range = 100 − 92 = 8
Note: The range changes drastically with the magnitude of extreme scores (or outliers). If you had test score data where one person simply didn't have the chance to study well, the range would dramatically change just because of that one score. For example, if someone scored 50 in the above set of scores, the range would be 50:
50, 92, 95, 95, 97, 98, 98, 100
Range = 100 − 50 = 50
As you see, the range has increased unrealistically.
Note: In order to calculate the range, you need to rank order the scores.
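A short Python sketch (ours, not the book's) pulls the centrality and dispersion indices of Sections 5.1 and 5.2.1 together for one set of scores:

```python
from statistics import mean, median, mode

scores = [80, 81, 81, 85, 88, 88, 88, 93, 94, 94, 95]

print(mode(scores))                      # 88, the most frequent score
print(median(scores))                    # 88, the middle ranked score
print(round(mean(scores), 2))            # 87.91, the arithmetic average
print((max(scores) + min(scores)) / 2)   # 87.5, the midpoint
print(max(scores) - min(scores))         # 15, the range (highest - lowest)
```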


5.2.2. Standard deviation (SD)
The most frequently used measure of variability is the standard deviation. Put simply, the standard deviation is the average difference of all scores from the mean. Take the following two sets of scores and the accompanying figures:
4, 5, 6, 7, 8        Mean: 6
1, 3, 6, 10, 10      Mean: 6
In each figure a line is drawn from any given score to the mean of the set. If we calculated the average size (mean) of the lines, we would see that on average the length of the arrows in the second figure is greater than that in the first figure. Let's do the calculation.
Mean of the arrows' length in the first figure = (2 + 1 + 0 + 1 + 2)/5 = 1.2
Mean of the arrows' length in the second figure = (5 + 3 + 0 + 4 + 4)/5 = 3.2
Therefore we can say the scores in the second figure on average deviate more from their mean than do the scores in the first set, i.e. they have a higher standard deviation. In its technical sense, the standard deviation is the square root of the sum of the squared differences from the mean divided by the number of cases. To calculate the standard deviation we can use the following formula:

SD = √( Σ(X − X̄)² / (N − 1) )

Let us illustrate the formula by calculating the standard deviation of a set of scores. (Note: you are never asked to calculate the standard deviation in University Entrance Exams; the example aims to let you see the procedure for calculating the SD.)
Example: The standard deviation of the following speed test scores is 6.89:
33, 36, 37, 44, 50
First, we need to calculate the mean of the scores:
X̄ = ΣX/N = (33 + 36 + 37 + 44 + 50)/5 = 40
Next, we need to subtract the mean from each score as shown in the second column, and then square these differences as done in the third column.

Score (X)   Difference from Mean (X − X̄)   Square of Difference (X − X̄)²
33          (33 − 40) = −7                 (−7)² = 49
36          (36 − 40) = −4                 (−4)² = 16
37          (37 − 40) = −3                 (−3)² = 9
44          (44 − 40) = +4                 (+4)² = 16
50          (50 − 40) = +10                (+10)² = 100
            Σ(X − X̄) = 0                   Σ(X − X̄)² = 190

Now we can substitute the numbers in the formula:
SD = √(190/(5 − 1)) = √47.5 = 6.89

The standard deviation tells us about the degree of dispersion of scores in a distribution. By comparing the standard deviations of different groups we would know to what extent they are homogeneous. The larger the standard deviation index, the wider the range of distribution (i.e. the more heterogeneous the testees); the smaller the standard deviation index, the more similar the scores, and the more tightly clustered the data are around the mean (i.e. the more homogeneous the testees).
(Figure: two normal curves, a wide flat one with a large SD and a narrow peaked one with a small SD.)
Example: If you were a teacher new to the English Language Institute and were told that you could pick one class to teach from four sections of Intermediate English, which of the classes in the following table would you pick?

Section   X̄      SD
1         55.6   4.5
2         56.4   12.8
3         39.1   28.1
4         53     18.7

First, you should look at the mean scores and decide that sections 1 and 2 have the highest means. However, there is not a big difference between the two indexes. Then you should look at the standard deviations. Since students in section 2 are more widely spread than those in section 1 (the SD of scores in section 2 is higher than that of scores in section 1), the best choice would be section 1.
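To verify the arithmetic of the worked example, here is a minimal Python sketch of the SD formula with N − 1 in the denominator (the function name is ours):

```python
import math

def sample_sd(scores):
    """Standard deviation: square root of the sum of squared deviations over N - 1."""
    m = sum(scores) / len(scores)                    # the mean, 40 here
    squared_devs = [(x - m) ** 2 for x in scores]    # 49, 16, 9, 16, 100
    return math.sqrt(sum(squared_devs) / (len(scores) - 1))

print(round(sample_sd([33, 36, 37, 44, 50]), 2))     # 6.89
```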


Due to practical difficulties we can also use raw scores to calculate the SD through the raw-score equivalent of the formula above (supplied here since the printed formula is unreadable in the source):

SD = √( (ΣX² − (ΣX)²/N) / (N − 1) )

Sometimes, another slightly different formula is used for the variance and the SD. In this new version, the denominator is N instead of N − 1. This new formula is used when the number of scores is bigger than 30. Therefore, the new formulas would look like:

SD = √( Σ(X − X̄)² / N )        Variance = Σ(X − X̄)² / N

5.2.3. Variance
In most statistical analyses the variance is used as the measure of variability. To find it, you simply stop short of the last step in calculating the standard deviation: you do not need to bother with finding the square root.

Variance = SD² = Σ(X − X̄)² / (N − 1)

6. NORMAL DISTRIBUTION
The outcome of almost all human behavior is a normal distribution. A normal distribution means that most of the scores cluster around the mean of the distribution, and the number of scores gradually decreases on either side of the mean. The resulting figure is a symmetrical bell-shaped curve.
(Figure 4.1: the symmetrical bell-shaped normal curve.)
No matter what kind of scale is used and no matter what kind of behavior is investigated, the distribution of scores of large samples tends to be normal. For example, suppose we wanted to find out how long it takes most people to learn 100 new vocabulary items. We should then begin tabulating how long it took each person. As we tested more and more people, we would find a curve emerging with most people scoring around a middle point on the distribution. The other scores would be distributed out from that point (look at Figure 4.1). Contrary to common sense, the normal distribution does not exist: we never get a completely normal distribution in the real world.
We can name four characteristics for normal distributions. These characteristics will be explained using the following example. Suppose that 250 students took part in a language test and the test scores resulted in a frequency polygon like Figure 4.2.
(Figure 4.2: a normal frequency polygon over the scores 39 to 99, centered on 69.)
Zero Score. The normal distribution does not have a zero score; the tails of the curve never meet the horizontal line.
Central Tendency. The second characteristic is that in a perfect normal distribution all four indicators of central tendency, i.e. mean, median, mode, and midpoint, fall on exactly the same score value, as shown in Figure 4.2, right in the middle of the distribution. Note that in this figure all measures of central tendency equal the same value, 69.
Dispersion. In a perfect normal distribution we expect the lowest score on the test (39 in Figure 4.2) and the highest score (99 in Figure 4.2) to be exactly the same distance from the center, or mean. This is apparently true in the example: both are 30 points above or below the mean. Thus, in a normal distribution, the range is symmetrical. The other indicator of dispersion is the standard deviation. The SD in Figure 4.2 is a nice round number, 10. Typically, the SD in a normal distribution will fall in the pattern shown in the figure: one standard deviation above the mean (+1SD) will fall on the score which is equal to the Mean + 1SD, or in this case 69 + 10 = 79. Also, one standard deviation below the mean will fall on the score which is equal to the Mean − 1SD, or in this case 69 − 10 = 59. Similarly, two standard deviations above the mean will be equal to the Mean + 2SD, or in this case 69 + 20 = 89, and so forth. In short, the standard deviation is used to mark off certain portions of the distribution, each of which is equal in length.
Percentage. If the distribution of the scores is normal, the standard deviation can give us a great deal of information. First, recall that the mean, median, mode and midpoint should all be the same in a normal distribution. Also recall that the median is the score below which 50 percent of the cases should fall, and above which 50 percent should be. Given these facts, we can predict that 50

percent of testees' scores will be above the median or mean and 50 percent of testees' scores will be below the median or mean in a normal distribution.
In like manner, researchers have shown that about 34% of the scores will fall between the mean and 1 standard deviation above the mean, as shown in Figure 4.3. That means that about 34 percent of testees scored between 69 and 79 points. Since the distribution is normal (bell-shaped; symmetrical), about 34 percent of the testees are likely to score between the mean and one standard deviation below the mean, or in this case between 59 and 69 points. Thus, in a normal distribution, about 68 percent of the students (34 + 34 = 68) are likely to fall within one standard deviation above and below the mean. You can see the detailed information about what proportion or what percent of all scores fall in any area of the normal distribution in Figure 4.3.
(Figure 4.3: the normal curve marked off by standard deviations, with the areas 0.13%, 2.14%, 13.59%, 34.13%, 34.13%, 13.59%, 2.14%, 0.13% from left to right.)

Example: In a vocabulary test, the mean and standard deviation are calculated to be 82 and 4, respectively. In this test, 68% of students fall between ---------.
1) 82 and 86  2) 76 and 80  3) 74 and 78  4) 78 and 86
In such cases you had better draw the normal distribution and show the information on the distribution. Therefore, you need to write the mean (82) in the middle. As you move to the right, you should add the value of the SD, in this case 4; thus 82 + 4 = 86, next 86 + 4 = 90, and so forth. As you move to the left, you should subtract the value of the SD; thus 82 − 4 = 78, next 78 − 4 = 74, and so forth. Since the item asks for 68%, we should look for one SD above and one SD below the mean: 78 and 86 (choice 4).

Example: In a test the SD and mean are 1.5 and 15. Then we are sure that 95% of testees fall between ---------.
1) 15 and 18  2) 12 and 18  3) 13.5 and 15  4) 16.5 and 19.5
Following the instructions in the last example, the middle line is the mean and equals 15. Moving to the right we add one SD to the mean; thus 15 + 1.5 = 16.5, next 16.5 + 1.5 = 18, and so forth. On the other hand, moving to the left, we subtract one SD from the mean; thus 15 − 1.5 = 13.5, then 13.5 − 1.5 = 12, and so forth. Since the item asks for 95%, as the figure shows, we should look for two SDs above and two SDs below the mean: 12 and 18 (choice 2).

Example: In a test the mean and variance are 50 and 4. A student is --------- probable to obtain a score higher than 46.
1) 34%  2) 98%  3) 16%  4) 2.5%
First, the stem gives the value for the variance, and therefore the SD equals 2. Next, since the item asks for the probability of obtaining higher than 46, we need to add up the percentages above 46:
13.59% + 34.13% + 34.13% + 13.59% + 2.14% + 0.13% = 98%
More simply, we know that 50% of students fall above the mean. Thus:
13.59% + 34.13% + 50% = 98%

Example: If in a class the SD and mean are 2 and 59, it is --------- probable that a testee's score exceeds 61.
1) 68%  2) 34%  3) 16%  4) 2.5%
This example is different from the last two examples. Here, the point is the probability that a student scores higher than 61. In this case, using Figure 4.3, we should add up the percentages above 61:
13.59% + 2.14% + 0.13% = 16%

One more important point to learn from the normal distribution concerns the percentile. You should remember from earlier pages that the percentile is defined as the total percentage of students who scored equal to or below a given point (in the normal distribution). The points marked off by the SD can also tell us about percentiles. Given this definition, what percentile would a score of 79 represent in Figure 4.2? The answer is about 84. How? We know that 79 is one standard deviation above the mean. In order to calculate its percentile we need to add up all the percentages below 79:
34.13% + 34.13% + 13.59% + 2.14% + 0.13% = 84%
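The percentages in Figure 4.3 are just the areas of the standard normal curve, so the worked examples above can be checked in code against the same band values. A minimal Python sketch (the function and constant names are ours; scores are assumed to sit on SD boundaries):

```python
# Percent of a normal distribution in each SD band, mirroring Figure 4.3:
# bands [-4,-3], [-3,-2], [-2,-1], [-1,0], [0,1], [1,2], [2,3], [3,4]
BANDS = [0.13, 2.14, 13.59, 34.13, 34.13, 13.59, 2.14, 0.13]

def percent_above(score, mean, sd):
    """Percent of testees expected to score above `score`."""
    k = int((score - mean) / sd)   # how many SDs above/below the mean
    return sum(BANDS[4 + k:])      # add up the bands to the right

print(percent_above(46, 50, 2))        # 97.71 -> about 98%
print(percent_above(61, 59, 2))        # 15.86 -> about 16%
print(100 - percent_above(79, 69, 10)) # 84.14 -> the 84th percentile
```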


34.13% + 34.13% + 13.59% + 2.14% + 0.13% = 84%

Example: The mean and SD of a set of scores are 60 and 3. A student who obtained 66 has outperformed --------- of testees.
1) 68%   2) 95%   3) 34%   4) 98%
Since the question asks for outperformance, we should calculate percentile rank, i.e. we need to add up all percentages below 66. As a characteristic of normal distributions we know that 50% of testees fall below the mean, i.e. 60. To this we add 34.13 and 13.59:
50% + 34.13% + 13.59% = 98%
[Figure: the normal curve for a mean of 60 and an SD of 3, marked off at 51, 54, 57, 60, 63, 66 and 69]

Example: In a class with 50 students, the mean is calculated to be 16 and the standard deviation 0.5. A student who scored 15 has outperformed --------- of testees.
1) 47%   2) 97%   3) 2%   4) 98%
Just like the previous example we need to add up the percentage(s) below the score 15:
0.13% + 2.14% = 2.27% ≈ 2%
[Figure: the normal curve for a mean of 16 and an SD of 0.5, marked off at 14.5, 15, 15.5, 16, 16.5, 17 and 17.5]
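The band arithmetic used in all of these examples can be scripted. The following Python sketch is our own illustration, not part of the original text (the function name percent_above is ours); it adds up the one-SD band percentages of the normal curve that lie above a given score, and only works for scores that fall exactly on whole-SD points:

def percent_above(score, mean, sd):
    """Percentage of testees expected to score above `score`,
    using the whole-SD band percentages of the normal curve."""
    z = (score - mean) / sd                  # distance from the mean in SD units
    bands = [0.13, 2.14, 13.59, 34.13, 34.13, 13.59, 2.14, 0.13]
    edges = [-4, -3, -2, -1, 0, 1, 2, 3]     # left edge (in SDs) of each band
    return sum(p for e, p in zip(edges, bands) if e >= z)

print(round(percent_above(61, mean=59, sd=2)))  # -> 16, as worked above
print(round(percent_above(46, mean=50, sd=2)))  # -> 98, as worked above

The percentile rank of a score is simply 100 minus this value.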

6.1. Skewed Distributions
For a variety of reasons, the distributions of language students' scores may not always have the prototypical symmetrical 'bell' shape. In this case there is the phenomenon of skewness, i.e. the piling up of scores with greatest frequency not in the middle but near one end of a distribution. Skewing usually occurs because the test was either too easy or too difficult for the group of students.
The scores may be piling up toward the higher end of the scale, as shown by Figure 4.4, in which case the distribution is negatively skewed (also called skewed left or skewed low). Therefore in such distributions, most of the students have scored well. Due to ceiling effect (the effect of most candidates scoring near the top of the scale on a particular test) the usual measures of dispersion will be depressed (pay attention to the relationship between ceiling + high scores). In this case the distribution will likely have indicators that vary from low to high as follows: midpoint, mean, median, and mode.
[Figure 4.4: a negatively skewed distribution, with the mean, median and mode marked from left to right]
The scores may instead be scrunched up toward the lower end of the scale, as shown in Figure 4.5. In this case, the distribution would be considered positively skewed (also called skewed right or skewed high). Therefore in such distributions, most of the students have obtained low scores. Due to floor effect (the effect of most candidates scoring near the bottom of the scale on a particular test) the usual indicators of dispersion will be depressed (again pay attention to the relationship between floor + low scores). In this case the distribution will likely have indicators that vary from low to high as follows: mode, median, mean and midpoint.
[Figure 4.5: a positively skewed distribution, with the mode, median and mean marked from left to right]
Note: The assignment of the negative and positive distinctions in discussions of skewedness is always confusing. To keep them straight, you could try to remember that skewed distributions characteristically have a 'tail' pointing in one of the two possible directions. When the tail is pointing in the direction of the lower scores (-), the distribution is said to be negatively skewed; when the tail points toward the higher scores (+), the distribution is positively skewed.

6.2. Kurtosis
Looking at a distribution curve, it is also possible to see whether the data are 'flat' or 'peaked'. Flatness or peakedness of the curve is called kurtosis. Kurtosis is one way of looking at the degree to which the curve in the middle of a distribution is steep or peaked.
• Leptokurtic: The data spread out in a very sharp peaked curve (i.e., a relatively large proportion of the scores are close to the mean).
• Platykurtic: The data spread out in a very flat curve (i.e., the scores are relatively far from the mean).
• Mesokurtic: The data spread out in a standard normal distribution.
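The low-to-high orderings given above for skewed distributions are easy to verify numerically. A minimal sketch, assuming a hypothetical set of scores piling up near the top of the scale (negatively skewed):

from statistics import mean, median, mode

scores = [20, 20, 20, 19, 19, 18, 17, 14, 10]   # hypothetical, piled up high

midpoint = (min(scores) + max(scores)) / 2
print(midpoint, mean(scores), median(scores), mode(scores))
# -> 15.0  17.44...  19  20 : midpoint < mean < median < mode,
#    the negatively skewed ordering described above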

7, DERIVED SCORES


Raw scores are obtained simply by counting the number of right answers. A raw score as such has very limited meaning; without other information we cannot tell whether a raw score of, let us say, 50 out of 100 possible points represents superior, average, or poor performance. A raw score without context can have little real meaning.
Measurement usually uses a set of arbitrary scoring conventions. There is no rule to tell us how


to score a given test. That is, you can assign one, two, three points, or any value you prefer, to a correct response on a test. The scores depend on arbitrary scoring decisions set by the test developers. This arbitrariness creates many problems when we want to make comparisons. Suppose a simple case where Sara wanted to compare how well she performed in her Math test compared with her IQ test. She received 70 for Math and 75 for IQ. In this case, Sara achieved a higher mark in her IQ test, 75 out of 100. However, just because her IQ score (75) is higher than her Math score (70), we shouldn't assume that she performed better in her IQ test compared to her Math test. The question therefore arises: How well did Sara perform in the two tests? Clearly, the two scores come from different distributions. The distribution of examinees who took the IQ test has a mean of 65 and a standard deviation of 10. The examinees who took the Math test, on the other hand, have a mean of 60 and a standard deviation of 5. This gives us the following:

            Score    Mean    Standard Deviation
IQ test       75      65            10
Math test     70      60             5

Now, is it logical to compare the two scores as they stand and say that she performed better in the IQ test compared to the Math test? The answer is probably 'No'. To avoid such problems and make scores obtained from different tests comparable we need to come up with a solution. One method is to determine the average score on the test so that each subject's performance may be compared with the average. Another way is to convert the raw scores into percentile or standard scores.
• Percentile scores indicate how a given student's score relates to the test scores of the entire group of students. Thus a student with a percentile score of 84 has a score equal to or higher than 84 percent of the other students in the distribution.
• Standard scores are obtained by taking into account the mean and SD of any given set of scores. Standard scores represent a student's score in relation to how far the score varies from the test mean in terms of standard deviation units. The most commonly reported types of standard scores are z, T and CEEB scores.
Both of these are useful: percentile ranks are easily understood by testees, but standard scores are advantageous in that they express equal units. Since we have already talked about percentile scores, below we will just discuss the different types of standard scores.

7.1. z score

The easiest way to understand a 'z score' is to think of the standard deviation as a measure (a ruler which is one standard deviation long). The z score just tells you how many standard deviations above or below the mean any score or observation might be. So to compute it, all we have to do is subtract the mean from our individual score. That shows us how high above or below the mean the score is. Then we divide that by the standard deviation to find out how many standard deviations away from the mean we are. That's the z score:

z = (X - X̄) / SD

If we turn the raw scores in the case of Sara into z-scores, this would give us:

z(IQ) = (X - X̄) / SD = (75 - 65) / 10 = 1
z(Math) = (X - X̄) / SD = (70 - 60) / 5 = 2

The z scores highlight that Sara is one standard deviation above the mean (z = 1) on the IQ test, and two standard deviations above the mean (z = 2) on the Math test.
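As a sketch (ours, not the book's), the computation is a one-liner in Python:

def z_score(raw, mean, sd):
    """How many SD units a raw score lies above (+) or below (-) the mean."""
    return (raw - mean) / sd

print(z_score(75, 65, 10))  # Sara's IQ score   -> 1.0
print(z_score(70, 60, 5))   # Sara's Math score -> 2.0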

[Figure: Sara's two scores located on their own distributions; the Math test distribution runs from 45 to 75 in steps of 5 (mean 60), and the IQ test distribution runs from 35 to 95 in steps of 10 (mean 65)]

Using the normal distribution and the z score, we find that Sara performed above 'average' in both subjects. However, the key point here is that Sara performed better in the Math exam than in her IQ exam, as her Math score is further away from the mean than her IQ score.

Example: In a set of scores where mean and SD are 41 and 10, what is the z score of a student who obtained 51?

z = (X - X̄) / SD = (51 - 41) / 10 = +1

This student's z score would be +1, or one standard deviation unit above the mean. As was discussed before, the percentile rank of such a student is 84%.

Example: In a test the mean and SD are 80 and 5. What is the test score of a student whose z-score is 1.3?

z = (X - X̄) / SD  →  1.3 = (X - 80) / 5  →  6.5 = X - 80  →  X = 86.5

A quick look at the following figure will reveal that z scores are exactly in the same position as those points marked off for the standard deviations just above them. Observe that the mean for the z scores is zero and that, logically, the standard deviation for any set of z scores will be 1.
[Figure: the z-score scale (-4 to +4) aligned with the SD points under the normal curve]


7.2. T score
Now that you understand the concept of standard scores and can compute z scores, let's consider T scores. z scores often contain decimal points and they may be either positive or negative. That makes for error in reporting if you are not very careful, and thus T scores are used instead. The formula for calculating T score is:

T score = 10z + 50

As can be seen from the figure, with T scores the mean of the T distribution is set at 50 instead of at zero as in the z score.
[Figure: the T-score scale running from 10 to 90 under the normal curve, with a mean of 50]

Example: If the z score of a student were -1, what would be his T score?
T score = 10z + 50 = 10(-1) + 50 = 40

7.3. CEEB Score
Another version of the z score is the CEEB (College Entrance Examination Board) score. To convert a z score to a CEEB score use the following formula:

CEEB score = 100z + 500

As can be seen from the figure, with CEEB scores the mean of the CEEB distribution is set at 500.
[Figure: the CEEB scale running from 100 to 900 under the normal curve, with a mean of 500]

Example: In a set of scores where mean and SD are 41 and 10, what is the CEEB score of a student who obtained 41?
To find the CEEB score we need to find the z score first:
z = (X - X̄) / SD = (41 - 41) / 10 = 0
Now we can use the formula to obtain the CEEB score:
CEEB score = 100z + 500 = 100(0) + 500 = 500

The following figure compares the different standard scores.
[Figure: the z-score scale (-4 to +4), the T-score scale (10 to 90) and the CEEB scale (100 to 900) aligned under the normal curve]
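Both conversions are linear rescalings of z, so they can be sketched in a few lines (illustrative code, not from the text):

def t_score(z):
    return 10 * z + 50      # mean 50, SD 10

def ceeb_score(z):
    return 100 * z + 500    # mean 500, SD 100

print(t_score(-1))                  # -> 40, as in the T-score example
print(ceeb_score((41 - 41) / 10))   # -> 500.0, as in the CEEB example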

8. CORRELATION
Sometimes we are interested in determining whether a relationship exists between two or more variables. For example, if a researcher wonders whether high vocabulary knowledge is related to success in completing a cloze test, she might want to know the relationship of students' scores on a vocabulary test and their scores on a cloze test. Did students who scored high on the vocabulary test also do well on the cloze test? On the other hand, is it true that a student who did poorly on the vocabulary test will necessarily perform poorly on the cloze test? To this end we can perform correlation analysis to obtain a correlation coefficient.
Correlation analysis refers to a family of statistical analyses that determines the degree of relationship between two sets of numbers. The numerical value representing the degree to which two variables are related (co-vary, or vary together) is called the correlation coefficient. In other words, correlation analysis is used to examine how the scores on two tests compare with regard to spreading out students. Correlation is the go-togetherness of two sets of scores.
Note: Correlation analysis is also used to estimate reliability and validity of tests (see Chapter 6).
In the following sections, first, you'll be introduced to the types of relation between two sets of numbers, which could be either positive, negative, zero, or curvilinear. Then you'll become familiar with the different formulas used to calculate the correlation coefficient.


Positive Correlation. To help us understand what is meant by a correlation coefficient, let us consider a teacher who has just received the scores of an IQ test and an English test. The teacher may be interested in determining if there is any correlation between these sets of scores. For example, he may wish to know whether a student who scored high on the IQ test will also score high on the English test.
The easiest way to see the relationship between the two sets of scores is to represent them graphically. This representation, called a scatter plot or scattergram, is done by plotting the scores. In the following scattergram one of the axes represents the IQ test and the other axis represents the English test. It doesn't matter which test we assign to which axis.

Students    IQ test    English test
Dean           2            3
Randy          3            4
Joey           4            5
Jeanne         5            6
Kimi           6            7
Shenan         7            8

[Figure: scattergram of the six students' scores; all the points fall on a straight line]

This diagram depicts a linear perfect positive relationship. A perfect relationship is where all the points on the scattergram would fall on a straight line. The fact that this is a perfect correlation means that the relative position of the participants is exactly the same for each test. In other words, the student who scored the highest on the IQ test also scored the highest on the English test; the same is true for the second highest, third highest, and so forth. In the above diagram Shenan obtained the highest score in the IQ test and the English test. Next, Kimi, who obtained the second highest score in the IQ test, is also the person who obtained the second highest score in the English test, and so forth. Finally, Dean is the testee who obtained the lowest score in the IQ test and the English test.
In a positive relationship the points go from the bottom left-hand corner to the top right-hand corner, i.e. high scores on one test are associated with high scores on the other test; conversely, low scores on one test are associated with low scores on the other test.
In reality we rarely face perfect positive correlations. More often the two sets of scores are not ordered in exactly the same way. Imagine that the teacher administers the same IQ test and English test to another class. He wants to see whether there is a relationship between scores on these two tests.

[Table: the second class's English test and IQ test scores for the same six students]

The teacher decides to allot IQ to the vertical axis and English marks to the horizontal axis. The graphic representation of these scores is like:
[Figure: scattergram of the second class's scores; the points fall close to, but not on, a straight line]
You can see from this scattergram that high IQs tend to be associated with high English scores, and low IQ scores tend to be associated with low English scores. Of course, in this instance the correlation is not perfect. This is a linear positive correlation. This scatter plot is an example of a linear correlation because, as you see, the points arrange themselves in the form of an approximate straight line. This is still a positive relationship because the points form a discernible pattern going from the bottom left-hand corner to the top right-hand corner.

Negative Correlation. The relationship between two sets of data is not always positive. The following data show a teacher's recording of the students' days of absence and their English scores.

Students    Days of absence    English scores
Dean               8                 20
Randy              7                 30
Joey               6                 40
Jeanne             5                 50
Kimi               4                 60
Shenan             3                 70

The teacher has assigned the number of absence days to the vertical line and the English scores to the horizontal line.
[Figure: scattergram with English scores (10 to 70) on the horizontal axis and days of absence on the vertical axis; the points fall on a straight line running from top left to bottom right]
This diagram depicts a linear perfect negative relationship. With a perfect negative linear relationship the dots still fall in a straight line, but this time they go from the top left-hand corner down to the bottom right-hand corner. In negative correlation, high scores on one variable are associated with low scores on the other variable. In other words, the student who was most absent scored the lowest on the English test and the student who was least absent gained the highest score. In the above diagram Shenan, who obtained the highest score in the English test, has the least number of absent days, and Dean, who was absent for 8 days, gained the lowest score.


In reality we rarely face perfect negative correlations. More often the two sets of scores are not ordered in exactly the same way. Suppose the same teacher records the same data in another class and comes up with the following data.

[Table: the other class's days of absence (ranging from 2 to 9) and English scores (ranging from 90 down to 10) for the same six students]

As with the above scatterplot, the days of absence are represented on the vertical axis and the English scores are represented on the horizontal axis.
[Figure: scattergram with English scores (10 to 90) on the horizontal axis; the points fall close to, but not on, a straight line running from top left to bottom right]
This is a linear negative correlation. In this relationship the dots do not fall on a straight line (i.e. the correlation is not perfect), but they still form a discernible pattern going from the top left-hand corner down to the bottom right-hand corner. You can see from this scattergram that more absence days tend to be associated with low English scores, and low absence days tend to be associated with high English scores.

Example: What type of correlation (i.e. positive or negative) do the following situations depict?
1) Number of hours spent studying and performance in examinations
You would expect that the number of hours spent studying would have a positive relationship with examination performance: the more hours a student spends studying, the better the performance.
2) Age of driver and accidents
Age of driver is associated with motor accidents, but this time the relationship is negative. It's been found that age is correlated with accident rate, young male drivers being more likely to have accidents.

Zero Correlation. It is possible that the distribution of scores on the scattergram will not clearly show either a positive or negative relationship. The points may form a circle, for example. In such cases, we say there is no relationship between the two sets of data.

Curvilinear Correlation. Some relationships are not linear. An example of such a relationship is that between level of anxiety and performance. As individuals' anxiety level increases, so does their performance, but only up to a point. With further increases in anxiety, performance decreases. Another example is the amount of care people require during the course of a lifetime: it is high in the early and late years and usually relatively low in the middle years. The scatterplot of the first example would produce the following curve.
[Figure: an inverted-U curve with Anxiety on the horizontal axis and Test Performance on the vertical axis]
In the above curvilinear relationship, as the values on the horizontal axis increase, the values on the vertical axis increase up to a point, at which further increases in the values of the horizontal axis are associated with decreases in the values of the vertical axis. The following diagrams show the different possibilities of curvilinear correlation.
[Figure: different possible curvilinear patterns]

Let's summarize what we know about correlations already. If high scores in one set are associated with high scores on the other set, there is a positive relationship between the two sets of scores. If high scores in one set are associated with low scores on the other set, there is a negative relationship between the two sets of scores. If there is no systematic pattern between high and low scores, there will be no relationship between the two sets of scores. Finally, if the dots suggest different types of curves, the relationship is referred to as curvilinear. Thus, there may be four correlational patterns between two sets of scores, as represented in the following figures.
[Figure: the four correlational patterns: positive, negative, zero and curvilinear]

9. CORRELATIONAL INDEXES
When we have large sample sizes it is time-consuming to do all the scatter plotting by hand. Also, scatter plots do not give us any quantitative measure of the degree of relationship between the two sets of scores. Therefore, we use certain statistical formulas which have been developed to measure the degree of relationship. The quantity obtained, previously referred to as the correlation coefficient, indicates how closely the two sets of scores are related. But before explaining the formulas, other pieces of information are in order.


If there is a perfect relationship between the two sets of scores (either positive or negative), the magnitude of the correlation coefficient would be either +1 or -1. A +1 correlation coefficient indicates a perfect positive correlation, a -1 correlation coefficient indicates a perfect negative correlation, and a zero correlation indicates no relationship between the two sets of scores. Therefore, the magnitude of the correlation coefficient will vary from -1 to 0 to +1. As a result a correlation coefficient gives us two pieces of information:
• the direction of the relationship, i.e. whether it is positive, negative or zero
• the strength or magnitude of the relationship between the two sets of data (called the correlation coefficient), which varies from 0 to -1 and from 0 to +1.
Both +1 and -1 are considered perfect correlations.
[Figure: several scattergrams with points clustered more or less tightly around a straight line]
The more tightly the points are clustered around a straight line, the stronger the relationship between the two variables. Accordingly, the leftmost figure is closest to +1, and the rightmost figure shows the least correlation.
Note: The sign (- or +) of the correlation coefficient doesn't have any effect on the degree of association, only on the direction of the association. You should not be negatively affected by the sign (-) and think of it as being undesirable or weak. As was mentioned before, a -1 index is a perfect correlation, not a weak one.

10. CORRELATIONAL FORMULAS
So far we have been talking about correlational coefficients like +1, -0.4 and so forth. However, we have not said anything about how these indexes are obtained. Having made sure that the relationship between two sets of data is linear, we can establish the degree of strength. This is done through different formulae which are chosen according to the scale of data. In the remainder of this chapter, we will look at three formulas to calculate correlational coefficients.

10.1. Pearson Product-moment Correlation
Karl Pearson (1895) developed a coefficient of linear correlation which demonstrates the strength of a relationship between two sets of continuous data. The parametric Pearson product-moment correlation coefficient is usually symbolized by r_xy (or r):

r_xy = [N(ΣXY) - (ΣX)(ΣY)] / √{[N(ΣX²) - (ΣX)²][N(ΣY²) - (ΣY)²]}

where:
X = label for one of the tests
Y = label for the other test
N = number of pairs of scores

Example: The following table shows ten testees' scores on a cloze test (X) and a writing test (Y), together with the squared values and cross-products needed by the formula:

Testees     X (cloze)   Y (writing)     X²       Y²       XY
Ali            52           48         2704     2304     2496
Ricky          49           49         2401     2401     2401
Stanly         26           27          676      729      702
Steve          44           40         1936     1600     1760
Charls         28           24          784      576      672
Betty          63           59         3969     3481     3717
Douglas        70           72         4900     5184     5040
Noam           32           31         1024      961      992
Leila          49           50         2401     2500     2450
George         51           49         2601     2401     2499
            ΣX = 464     ΣY = 449    ΣX² =    ΣY² =    ΣXY =
                                     23,396   22,137   22,729

r_xy = [N(ΣXY) - (ΣX)(ΣY)] / √{[N(ΣX²) - (ΣX)²][N(ΣY²) - (ΣY)²]}
     = [10 × 22,729 - 464 × 449] / √{[10 × 23,396 - (464)²][10 × 22,137 - (449)²]}
     = 18,954 / √{(18,664)(19,769)}
     = 0.9867

The correlation coefficient of the scores in the table is 0.98. A coefficient of this magnitude indicates that there is a strong positive correlation between these two groups of data. In other words, the two tests are spreading the students out in very much the same way.
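The whole computation can be reproduced with a short Python sketch (ours, not the book's; the function name pearson_r is an assumption) fed with the ten score pairs from the table:

from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation from raw score pairs."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)             # sum of X squared
    syy = sum(v * v for v in y)             # sum of Y squared
    sxy = sum(a * b for a, b in zip(x, y))  # sum of the cross-products
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

cloze   = [52, 49, 26, 44, 28, 63, 70, 32, 49, 51]
writing = [48, 49, 27, 40, 24, 59, 72, 31, 50, 49]
print(round(pearson_r(cloze, writing), 3))  # -> 0.987, matching the value above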

There are four assumptions underlying the Pearson product-moment correlation coefficient:
• Continuous: As was mentioned before, the two sets of numbers must both be continuous scales.
• Linear: The relationship between the two sets of data should be linear. If the actual relationship between the two sets is not linear, the value of r will be very low and might even be zero.
• Normally distributed scores: Neither of the two distributions can be skewed. If one or the other is not normal, the magnitude of any resulting correlation coefficient will be affected. Typically, if either distribution is skewed the value of the correlation coefficient will be depressed to an unpredictable degree.
• Independence: Requires that each pair of scores be unrelated to all other pairs of scores. In other words, when the pairs of test scores are in two columns, no student should appear twice (because, for example, he took the two tests twice and thus created two pairs of scores related to each other). In short, to properly apply the Pearson r, there must be no systematic association between pairs of scores.

10.1.1. Coefficient of determination, r²
A correlation coefficient gives a measure of the relationship between two variables. It tells us very little, however, about the nature of that relationship, only that it exists and that it is either relatively high or low. A fuller grasp of correlation is gained if we consider the coefficient of determination. Here, the concept is common variance (or overlapping variance). The coefficient of determination, r², gives the percentage of variance in one variable that is accounted for by (associated with or common with) the variance in the other.
You can think of a test score as being the result of many separate elements. Some of these elements are characteristics of the person taking the test, some are characteristics of the situation in which the test is taken, and some are characteristics of the test itself. Now, if a person takes two tests, some elements will have their effects on both scores measured, and the name given to this is the coefficient of determination (you should know that to find the common factor we need to perform two experiments: one study to show that there is a relationship between the two tests and one study to find the common factor).
Suppose a cloze and a reading comprehension test were administered and the correlation turned out to be 0.736. Here, the correlation is positive, meaning that high scores on the cloze test tend to go with high scores on the reading comprehension test. In other words, if a person scored high on the cloze test, we would predict that the person scored relatively high on the reading comprehension test, and if the person scored low on the cloze test we would predict he or she also scored low on the reading comprehension test. When the correlation is squared the result is approximately 0.54. We would therefore say that about 54% of the variability in the cloze test can be explained by (or "be accounted for by", or "is attributable to") differences in reading comprehension ability.
Further suppose that another experiment was conducted which revealed the reason for variability in the cloze and reading comprehension tests, i.e. the source of difference in test results, and it was found that vocabulary knowledge is the source of all variation. All in all we can say that in the first experiment it was found that the correlation between the cloze and reading comprehension tests is 0.736 and that 54% of the differences in cloze test scores could be explained by differences in the reading comprehension test (and vice versa). The second experiment looked for the element(s) that affected both test results. More concisely, the two tests measure vocabulary to the extent of 54%.
The concept of determination can be graphically shown in the following figures:
• If the correlation between test X and test Y is zero, then the coefficient of determination is zero: none of the factors accounting for variability are common to both tests.
• If the correlation between test A and test B is 0.9, then the coefficient of determination is (0.9)² = 0.81 × 100 = 81%: 81% of the factors accounting for variability are common to both tests.
• If the correlation between test X and test Y is 1, then the coefficient of determination is 1² = 1 × 100 = 100%.
[Figure: pairs of overlapping circles showing 0%, 81% and 100% common variance]
Note: One cannot obtain a negative coefficient of determination, since the correlation value has to be squared. The range of the coefficient of determination is 0 ≤ r² ≤ 1.
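In code, the coefficient of determination is just the square of r; a one-line check of the cloze/reading figures above (illustrative, not from the text):

r = 0.736                    # cloze vs. reading comprehension correlation
print(round(r ** 2 * 100))   # -> 54 (% of common, or overlapping, variance)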

10.2. Spearman Rank-Order Correlation
The second formula applies when the two sets of data are ranks rather than continuous scores. The Spearman rank-order correlation coefficient, rs, is computed as:

rs = 1 - (6ΣD²) / [N(N² - 1)]

where:
D = the difference between paired ranks
ΣD² = the sum of the squared differences between ranks
N = number of paired ranks

Note: This formula is used if the number of paired variables is more than 9 but fewer than 30.

Example: Suppose two teachers (A and B) ranked ten students in an interview. The following table shows the rankings given by the two teachers:

[Table: teacher A's and teacher B's rankings of the ten testees, the difference (D) for each pair of ranks, and the squared differences; ΣD² = 28]

rs = 1 - (6ΣD²) / [N(N² - 1)] = 1 - 6(28) / [10(100 - 1)] = 1 - 168/990 = 1 - 0.17 = +0.83
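A short Python sketch of the formula; the two rank lists below are hypothetical, chosen only so that ΣD² = 28 for N = 10, matching the worked example:

def spearman_rho(ranks_a, ranks_b):
    """Spearman rank-order correlation for two lists of untied ranks."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # teacher A's rankings (hypothetical)
b = [1, 3, 5, 2, 4, 8, 6, 10, 9, 7]   # teacher B's rankings (hypothetical)
print(round(spearman_rho(a, b), 2))    # -> 0.83, as in the example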

10.3. Point Biserial Correlation
This formula is used when one set of data is continuous and the other set is nominal. The nominal variable is dichotomous, which can take only the values of 1 or 0. The point biserial formula is one of several simplifications of the Pearson product-moment correlation coefficient formula. For instance, I might be interested in the degree of relationship between being male or female and language aptitude scores as measured by the Modern Language Aptitude Test. Do you think that there would be any relationship between students' gender and their performance on such a test? The point biserial correlation coefficient could help me find out:

r_pb = [(X̄p - X̄q) / Sx] × √(pq)

where:
X̄p = the mean score of the testees coded 1
X̄q = the mean score of the testees coded 0
Sx = the standard deviation of all the scores
p = the proportion of testees coded 1
q = the proportion of testees coded 0

Example: In the following table the calculation of r_pb is illustrated on data gathered to relate sex and scores of 15 students on an aptitude test. Notice that girls are assigned 0 and boys 1.

[Table: the 15 students' gender codes (0 or 1) and aptitude-test scores]

Here X̄p = 64.25 and X̄q = 61.14; substituting these values, together with Sx, p and q computed from the table, gives r_pb = 0.391.
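A sketch of the computation in Python, using hypothetical data (eight scores, not the 15-student table) and assuming the population SD for Sx:

def point_biserial(groups, scores):
    """r_pb for a dichotomous (0/1) variable against a continuous one."""
    n = len(scores)
    ones  = [s for g, s in zip(groups, scores) if g == 1]
    zeros = [s for g, s in zip(groups, scores) if g == 0]
    p, q = len(ones) / n, len(zeros) / n
    mean_p, mean_q = sum(ones) / len(ones), sum(zeros) / len(zeros)
    mu = sum(scores) / n
    sx = (sum((s - mu) ** 2 for s in scores) / n) ** 0.5   # population SD
    return (mean_p - mean_q) / sx * (p * q) ** 0.5

genders = [1, 1, 1, 1, 0, 0, 0, 0]           # boys = 1, girls = 0 (hypothetical)
scores  = [66, 65, 63, 62, 63, 62, 61, 60]
print(round(point_biserial(genders, scores), 2))   # -> 0.67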

1), then V = 4; in choice 4, if V = 0.64 (V < 1), then SD = 0.8. From these examples it is clear that SD can be smaller than V if and only if SD > 1.
9. Choice 2
Refer to Section 7.1.

10. Choice 1
With a mean of 100 and an SD of 15, the score of 130 would be two SDs above the mean. The question asks for the chances to exceed 130, so we should add up the percentages above 130 in the figure:
2.14% + 0.13% = 2.27% ≈ 2.3%
[Figure: normal curve marked off at 55, 70, 85, 100, 115, 130 and 145]
11.

Test Scores   Frequency   Cumulative Frequency
16                1                14
14                1                13
12                2                12
11                1                10
10                2                 9
 9                2                 7
 8                3                 5
 7                1                 2
 6                1                 1

percentile rank = (cf / N) × 100 = (5 / 14) × 100 ≈ 35
12. Choice 3
When a student ranks 14th, the rest of the students scored below him. Thus:
(65 - 14) / 65 × 100 = 78.4% ≈ 79%
13. Choice 3
Refer to Section 5.1.3.
14. Choice 2
Range is the difference between the largest number in the distribution and the smallest number.
15. Choice 4
Refer to Section 5.1.1.
16. Choice 1
Refer to Section 10.3.
17. Choice 1
Refer to Section 5.1.3.
18. Choice 3
Refer to item 1.
19. Choice 3
If a student's score is two SDs above the mean we could say his percentile rank is almost 98, because 50% of students fall below the mean, 34.13% between the mean and +1SD, and 13.59% between +1SD and +2SD. When added up, his percentile rank would be exactly 97.72.
20. Choice 3
A derived score is any type of score other than the raw score. A derived score is calculated by converting a raw score or scores into units of another scale. In stanines, standardized scores are arranged on a nine-step scale. This scale provides a system of scores with a mean of 5.
21. Choice 2
Refer to Section 8.
22. Choice 1
Refer to Section 7.
23. Choice 4
Refer to Section 10.1.1.
24. Choice 3
Refer to item 1.
25. Choice 3
Refer to Section 5.2.2.
26. Choice 1
Refer to Section 6.1.
27. Choice 2
Refer to Section 3.3.
28. Choice 4
Refer to Section 5.1.3.
29. Choice 3
With a mean of thirty and an SD of five, the score one SD above the mean is 35 and the score one SD below the mean is 25. What's more, about 68% of testees (two thirds) fall between ±1SD.
30. Choice 2
Refer to Section 3.5.
31. Choice 3
Refer to Section 10.3.
32. Choice 4
Refer to Section 2.1.
33. Choice 1
Sometimes there might be more than one mode. Such distributions of scores are referred to as being bimodal (if there are two peaks) or trimodal (if there are three peaks).
34. Choice 1
The stem gives the definition of mode.
35. Choice 4
To make scores from different distributions comparable, we need to turn raw scores into standardized scores such as z-scores, T-scores and CEEB scores.
36. Choice 1
68.26% of testees fall between ±1 SD. 95.44% of testees fall between ±2 SD. 99.72% of testees fall between ±3 SD.

Chapter 5
Test Construction

• Determining Function and Form of the Test
• Planning
• Preparing Items
• Reviewing
• Pretesting
• Validation
• Extra Points to Remember

To construct a test the following steps should be taken. First, the function and the form of the test should be decided upon. Second, the content of the items should be specified; this step is also called planning (or "specifying the content of the test"). Third, the items should be prepared in accordance with the specified content. Fourth, the items should be reviewed. Fifth, the items should be pretested in order to determine their statistical characteristics. And finally, the test should be validated. Thus the steps in developing a test can be listed as:
1) Determining function and form of the test
2) Planning
3) Preparing items
4) Reviewing items
5) Pretesting items
6) Validation
In this chapter we shall briefly describe each step in the process of test construction as applied to the testing of English as a second language.

1. DETERMINING FUNCTION AND FORM OF THE TEST
In order to determine the function and form of a test, three factors should be taken into account:
• Characteristics of the examinee: The nature of the population is important. If the testees are a group of youngsters, for example, test items in pictorial mode would be more appropriate than a test with written modality. If the same mother tongue is shared by all the testees, the task of sampling is made slightly easier even though they may have attended different schools or followed different courses. They will all experience problems of a similar nature as a result of the interference of their first-language habits, and thus the test could include items based on the findings of contrastive analysis on the elements of the source and the target language.
• Specific purpose of the test: According to the test function, the content of the test would differ. For example, the content of a proficiency test differs from that of an achievement test.
• Scope of the test: Whether a test is to be used within the scope of a classroom, a school, or a country influences the structure of the test. As the scope of the test widens, the degree of care to be taken, along with the amount of time and energy to be spent, increases, because the decision to be made would influence a larger population.

2. PLANNING
It is important for the tester to decide on the area of knowledge to be measured. The content of the test depends on both function and form. For example, the content of a placement test should logically differ from that of an aptitude test. Moreover, a comprehension test in multiple-choice format allows the examiner to include as many items as he feels necessary, whereas in a production type test, limitations of item and space force the test developer to limit the test to a manageable number of items.
In order to determine the content of a test:
First, instructional objectives should be examined. That is, the course content should be outlined to include a list of major structural points covered during the instruction.
Second, major topics should be divided into their specific components.
Third, a table of specification should be prepared.
The main purpose of a table of specification is to assure the test developer that the test includes a representative sample of the materials covered in a particular course. The following table of specification is prepared for a grammar course whose final exam consists of 10 items:

Content area           Number of items
Reported speech               3
Subjunctive                   2
Dangling structure            5

As can be seen, the table of specification specifies what is to be tested (in this case reported speech, subjunctive and dangling structure), the aspect of achievement to be tested, and the number of items (in this case 10 items).

3. PREPARING ITEMS
This is to write items. In the process of item preparation, even the most experienced teacher is apt to make defective items. As one writes an item, it is essential to try to look at it through the eyes of test takers and imagine how they might misinterpret the item (in which case it will need to be rewritten). Even if there is no possibility of misinterpretation, test takers may find responses that are different from, but equally valid as, the one intended.
When constructing MC items, the test constructor should bear in mind the following guidelines. (For guidelines on constructing true/false and matching items, refer to FAJAB, Chapter 5.)
a) In constructing multiple-choice items, only one feature/skill at a time should be tested. Such an item is called a pure item: for example, listening to a text and then subsequently hearing a series of true/false questions to be marked on the answer sheet. Pure items are contrasted with hybrid items, in which the examinee is tested on more than one skill, such as dictation, which requires the testee to listen to a text and then write it. As another example, the following test item tests both word order and sequence of tenses:

I never knew where ......... .
1) had the boys gone   2) have the boys gone   3) the boys have gone   4) the boys had gone
Note that it may sometimes be necessary to construct such impure items at the very elementary levels because of the severely limited number of distractors generally available.
b) The context should be at a lower level than the actual problem which the item is testing. For example, a grammar test item should not contain other grammatical features as difficult as the area being tested; as another example, a vocabulary item should not contain more difficult semantic features in the stem than the area being tested.
c) Every item should have one correct or clearly best answer. This answer must be absolutely correct unless the instruction specifies choosing the best option (as in some vocabulary tests). In the following item choices two and three could both be taken as the answer.
When he called, I ......... the house.
1) left   2) had left   3) was leaving   4) have left

d) The stem should be quite clear and state the point to be tested unambiguously. The following item is thus deficient:
1) one of the students passing the test   2) one of the students failing the test
3) the man who left the testing session early   4) the man who did participate in the test
e) The stem should not contain extraneous information or irrelevant clues, thereby confusing the problem being tested. Unless students understand the problem being tested, there is no way of knowing whether or not they could have handled the problem correctly. In the following example, the clause after 'because' could be eliminated without making the item impossible to answer.
Children were not ......... to watch the violent part of the movie because in that part two criminals attack a citizen in the street and kill her with a gun.
1) displayed   2) allowed   3) perceived   4) proven
f) The stem should include as much of the item as possible. Any word or phrase that is shared by all alternatives should be placed in the stem. The following item is deficient because who should have been included in the stem.
The person ......... is called an author.
1) who writes a book   2) who prints a book   3) who reviews a book   4) who sells a book
g) The stem should not provide any grammatical clue which might help the examinee find the correct response without understanding the item. In the following example, the article 'an' leads the testee to the correct response.
He picked an ......... off the tree and gave it to his guest.
1) apple   2) tangerine   3) banana   4) peach
Similarly, in the following item, the preposition of reveals that scornful is the correct option.
John was ......... of the efforts of his friends to deceive the old man.
1) scornful   2) interested   3) popular   4) absorbed
h) The stem should allow for the number of choices which have been decided upon. This is particularly relevant, for example, when comparisons are involved in reading comprehension. There is no possible fourth option which can be added in the following item.
Tom was ......... the other two boys.
1) taller than   2) smaller than   3) as tall as
i) The stem should not start with a blank. This recommendation originates from the concept of meaningful learning. According to the cognitive-code learning theory, language processes start with known information and move towards unknown information. Starting a stem with a blank means that the testees should move from unknown to known information; thus the flow of information is in the direction opposite to the normal flow of information.
j) Negative statements should be avoided because they are likely to be ignored by the examinees.
k) If the options have a natural order (e.g., figures, dates), it is advisable to keep to this order, as shown in the following example.
Blackwell started his career as a lawyer in ......... .
1) 1921   2) 1925   3) 1933   4) 1939

l) Each option should be grammatically correct when placed in the stem, except of course in the case of specific grammar test items. For example, stems ending with the determiner a, followed by options in the form of nouns or noun phrases, sometimes trap the unwary test constructor.
Someone who designs houses is a ......... .
1) designer   2) builder   3) architect   4) plumber
Stems ending in are, were, etc. may have the same weaknesses as the following and will require complete rewriting.
The boy's hobbies referred to in the first paragraph of the passage were ......... .
1) camping and fishing   2) tennis and golf   3) collecting stamps   4) rowing and swimming
Any fairly intelligent student would soon be aware that option three is obviously not in the tester's mind when constructing the item because it is an ungrammatical answer.
Stems ending in prepositions may also create certain difficulties. In the following reading comprehension item, option one can be ruled out immediately.
John soon returned to ......... .
1) home   2) school   3) the prison   4) work

m) All of the alternatives must be grammatically correct by themselves and consistent with the stem. Choice three in the following item is impossible and thus should be eliminated.
Last year, incoming students ......... on the first day of school.
1) enrolled   2) will enroll   3) will enrolled   4) should enroll
It is worth mentioning that using wrong distractors would expose examinees to wrong forms of the language, which might negatively influence the students' language learning process.
n) Using 'all of the above' or 'none of the above' as an alternative is not recommended.
o) Each option should belong to the same word class as the word in the stem. Option two of the following item is rejected because it is of a different part of speech and doesn't fit the context.
Have you heard the planning committee's ......... for solving the city's traffic problems?
1) purpose   2) propose   3) design   4) theory
p) There is some disagreement concerning the relationship of the options to the problem area being tested. Some test writers argue that the options should be related to the same general topic or area, while others prefer as wide a range of associations as possible. Unless the vocabulary item being tested has a very low frequency count (i.e. is very rarely used), however, the item writer is advised to limit the options to the same general area of activity where possible.

Item 1            Item 2
apparition        apparition
1) skeleton       1) scenery
2) ghost          2) ghost
3) nightmare      3) magician
4) corpse         4) castle

If item 2 were set in a test, students who had read a few ghost stories would probably select option two because they would associate apparition with the stories they had read. In item 1, however, students are required to show a much greater control over vocabulary.
q) It is advisable to avoid using a pair of synonyms as distractors: if the testees recognize the synonyms, they may realize immediately that neither is the correct option, since there can be only one correct answer.
The old woman was always courteous when anyone spoke to her.
1) polite   2) glad   3) kind   4) pleased
Even such near synonyms as glad and pleased are sufficient to indicate to intelligent students that the choice must be between polite and kind, since if glad were correct, pleased would probably also be correct.
r) Each distractor, or incorrect option, should be reasonably attractive and plausible. It should appear right to any testee who is unsure of the correct option. Items should not be constructed in such a way that testees can discard obviously incorrect options at first sight. The following item contains two absurd options which are eliminated at first sight because they are illogical.
How did Picard first travel in space?
1) He travelled in a space-ship.   2) He used a large balloon.
3) He went in a submarine.   4) He jumped from a tall building.
s) All distractors should be of similar length and level of difficulty. Also, the correct option should be approximately the same length as the distractors. This principle applies especially to vocabulary tests and tests of reading and listening comprehension, where there is a tendency to make the correct option longer than the distractors simply because it has to be qualified to be absolutely correct. An example of such a 'giveaway' item is:
He began to choke while he was eating the fish.
1) die   2) cough and vomit   3) be unable to breathe because of something in the windpipe   4) grow very angry
t) Distractors should not be too difficult nor demand a higher proficiency in the language than the correct option. If they are too difficult, they will succeed only in distracting the good student, who will be led into considering the correct option too easy (and a trap). There is a tendency for this to happen, particularly in vocabulary items. In the following item, proficient testees may consider choice three, which is the key, to be too easy and thus ignore it.
You need a ......... to enter that military airfield.
1) permutation   2) perdition   3) permit   4) perspicuity
Similarly, if the correct option is more difficult than the distractors, the testees will arrive at the correct answer by process of elimination. Thus, the test may have a negative effect on the testees: i.e. they will select the correct option not because they know it is correct but only because they know the other options are wrong. The following item measures the testees' knowledge of the distractors rather than their familiarity with the correct option.
theatrical
1) angry   2) histrionic   3) proud   4) foolish

u) All distractors should be plausible. That is, distractors which do not logically belong to the point being tested will be discarded by the testees. In the following item a verb is to be tested, and therefore choice four is unacceptable.
To call on someone means ......... .
1) to visit   2) to talk   3) to telephone   4) a curious person

4. REVIEWING
The writing of successful items is extremely difficult. No one can expect to be able consistently to produce perfect items. Some items will have to be rejected, others reworked. The best way to identify items that have to be improved or abandoned is through reviewing. Reviewing, or the process of moderation, is the scrutiny of proposed items by at least two colleagues or outsiders, neither of whom is the author of the items being examined. Through this stage, problems unnoticed by the test developers are likely to be observed by the reviewers. Thus the purpose is to try to find weaknesses in the items and, where possible, remedy them. Where successful modification is not possible, they must reject the item. Of course, the reviewers' comments would be subjective and are, thus, not sufficient for the development of a reasonable test. For a test to be scientifically defensible, it should be examined objectively. Such a scrutiny will be possible through the next stage of test construction.

5. PRETESTING
Reviewing is important because it indirectly helps the pretesting stage. That is, improved items would make pretesting more fruitful. Pretesting is defined as administering the newly developed test to a group of examinees with characteristics similar to those of the target group. For example, if a test is designed for children, it should be pretested with a group of children. Otherwise, the goal of pretesting will not be achieved. Through pretesting the tester collects the numerical data, based on which the efficiency and/or shortcoming of the items is going to be checked.
It has to be accepted that, for a number of reasons, trialling of this kind is often not feasible. In some situations a group for trialling may simply not be available. In other situations, although a suitable group exists, it may be thought that the security of the test might be put at risk. It is often the case, therefore, that faults in a test are discovered only after it has been administered to the target group.
There are two kinds of analysis that should be carried out. The first is called item analysis, where the test developer determines, objectively, the characteristics of the individual items. The second is to determine one of the characteristics of the items altogether, i.e. reliability. In this chapter we will study the first kind of analysis, that is, investigate item analysis procedures for NRTs. The effectiveness of NRT items is determined through item facility (IF), item discrimination (ID), and choice distribution (CD).

5.1. Item Facility (IF)
Item facility refers to the easiness of an item. Technically, it is defined as the proportion of correct responses for every single item. Proportion means that all correct responses should be divided by the total number of responses. This idea can be illustrated by the following formula:

IF = ΣC / N

where:
ΣC = sum of the correct responses
N = total number of responses

Note: The item facility equation assumes that items left blank are incorrect answers.

The table below shows the hypothetical results of a grammar test with 10 items. The actual responses are recorded with a 1 for each correct answer and a 0 for a wrong answer. As can be seen from the 'Total' column, Haruka and Rie obtained the highest scores, i.e. 9, while Kana received the lowest score, i.e. 1.

[Table: the nine subjects' (Haruka, Rie, Arisa, Saki, Natsumi, Tomomi, Momo, Yuuka and Kana) responses, 1 or 0, to the ten items, with row totals 9, 9, 7, 6, 6, 3, 3, 2 and 1]

Let's calculate IF for items 2, 5, 7 and 10.
In item 2, out of nine students, five students provided the correct answer:
IF₂ = ΣC / N = 5/9 = 0.55
In item 5, out of nine students, nine students provided the correct answer:
IF₅ = ΣC / N = 9/9 = 1
In item 7, out of nine students, three students provided the correct answer:
IF₇ = ΣC / N = 3/9 = 0.33
In item 10, out of nine students, none of the students provided the correct answer:
IF₁₀ = ΣC / N = 0/9 = 0
It can be understood from the examples above that the maximum item facility, when all examinees get an item correct, equals 1. By the same token, the most difficult item is the one to which no one gives a correct response, i.e. the item facility is zero. Therefore the range of the IF index is: 0 ≤ IF ≤ 1.
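IF is a simple proportion and can be sketched as follows; the response vectors are illustrative columns with the same totals as items 2, 5, 7 and 10 above, not the table's exact rows:

def item_facility(responses):
    """IF = proportion of correct (1) responses; blanks count as 0."""
    return sum(responses) / len(responses)

item2  = [1, 1, 1, 0, 1, 0, 0, 1, 0]   # 5 of 9 correct
item5  = [1] * 9                        # everyone correct
item7  = [1, 1, 1, 0, 0, 0, 0, 0, 0]   # 3 of 9 correct
item10 = [0] * 9                        # no one correct

for col in (item2, item5, item7, item10):
    print(round(item_facility(col), 2))
# -> 0.56, 1.0, 0.33, 0.0 (the 0.55 above is 5/9 truncated)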

4:0 0 0

1

«21

0 0 10

«0

0 0

4

4 2

3

0

0

oO

0

1

A

E 3

. Take, for example, item number 2 where item facility is 0.5 which is a perfectly ideal index. However, as can be seen the five top proficiency testees have missed the. item and the five low

proficiency testees have answered it correctly. This situation renders item number 2 as a lowquality item while by relying solely on IF we should consider it a high-quality item. Therefore it

seems that item facility alone wouldn’t give us a realistic image of the quality of an item. Besides item facility we need another index called item discrimination. Item discrimination refers to the extent to which a particular item discriminates more knowledgeable examinees from less knowledgeable ones. The index of discrimination tells us

whether those students who performed well in the whole test tended to do well or badly on each item in the test. It is presupposed that the total score on the test is a valid measure of the student’s ability (i.e. the good student tends to do well on the test as whole and the poor student badly). On this basis, the score on the wholetest is accepted as the criterion measure,andit thus becomes possible to separate the ‘good’ students from the ‘bad’ ones in performances on

individual items. If ‘good’ students tend to do well on an item and the ‘poor’ students badly on the same item, then the item is a good one becauseit distinguishes the ‘good’ from the ‘bad’ in the same wayasthe total test score. This is the argument underlying the index of discrimination.

To compute item discrimination, the following formula should be used:

ID = (C_high − C_low) / (½N)

where:
C_high = the number of correct responses to that particular item given by the examinees in the high group
C_low = the number of correct responses to that particular item given by the examinees in the low group
½N = the total number of responses divided by 2

Note: In calculating item discrimination, the test constructor first needs to rank order the students based on their scores.

The acceptable range of item discrimination is: ID > +0.4

Note: The closer the value of item discrimination to unity, the more ideal the item.

Example: If in a class with 50 students, 20 students in the high group and 10 students in the low group answered an item correctly, then ID equals ---------
1) 0.36  2) 0.4  3) 0.47  4) 0.63
ID = (C_high − C_low) / (½N) = (20 − 10) / (½ × 50) = 10/25 = +0.4

Let's calculate item discrimination for items 1, 2, 8 and 10 (a class of ten students, five in each group):
In item 1, all students in the high group gave correct answers while all students in the low group gave wrong answers:
ID = (C_high − C_low) / (½N) = (5 − 0) / (½ × 10) = +1
In item 2, all students in the high group gave wrong answers while all students in the low group gave correct answers:
ID = (C_high − C_low) / (½N) = (0 − 5) / (½ × 10) = −1
In item 8, two students in the high group and three students in the low group gave correct answers:
ID = (C_high − C_low) / (½N) = (2 − 3) / (½ × 10) = −0.2
In item 10, four students in the high group and one student in the low group gave correct answers:
ID = (C_high − C_low) / (½N) = (4 − 1) / (½ × 10) = +0.6

It can be understood from the examples above that the maximum item discrimination, when all examinees in the high group answer an item correctly and all examinees in the low group answer wrongly, equals +1. By the same token, the minimum item discrimination, when all examinees in the high group answer an item wrongly and all examinees in the low group answer it correctly, equals −1. Therefore the range of the ID index is:

−1 ≤ ID ≤ +1

Example: If all the 30 students in the low group answer the first item in a test correctly and half of the students in the high group answer the same item correctly, the item discrimination index is ---------
1) −0.29  2) 0.4  3) −0.5  4) 0.31
ID = (C_high − C_low) / (½N) = (15 − 30) / (½ × 60) = −15/30 = −0.5

The upper and lower groups are sometimes defined as the upper and lower third, or 33 percent. In such problems the denominator should be ⅓N:
Example: In a group of 120 students, 20 students in the high group and 10 students in the low group answered an item correctly. If each group consists of 33% of the testees, then ID is ---------
1) 0.58  2) 0.42  3) 0.33  4) 0.25
ID = (C_high − C_low) / (⅓N) = (20 − 10) / (⅓ × 120) = 10/40 = 0.25

The upper and lower groups are sometimes defined as the upper and lower fourth, or 25 percent. In such problems the denominator should be ¼N:
Example: In a group of 100 students, 15 students in the high group and 10 students in the low group answered an item correctly. If each group consists of 25% of the testees, ID is ---------
1) 0.27  2) 0.3  3) 0.2  4) 0.36
ID = (C_high − C_low) / (¼N) = (15 − 10) / (¼ × 100) = 5/25 = +0.2

The upper and lower groups are sometimes defined as the upper and lower fifth, or 20 percent. In such problems the denominator should be ⅕N.
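Since all of these variants differ only in the fraction of the group used in the denominator, they can be checked mechanically. Below is a minimal Python sketch of the formula; the function name and arguments are illustrative, not from the original text.

    def item_discrimination(c_high, c_low, n_total, group_fraction=0.5):
        """ID = (C_high - C_low) / (group_fraction * N).

        group_fraction is the share of all examinees placed in each of the
        upper and lower groups: 0.5 for halves, 1/3 for thirds, 0.25 for
        quarters, 0.2 for fifths.
        """
        return (c_high - c_low) / (group_fraction * n_total)

    # Worked examples from the text:
    print(item_discrimination(20, 10, 50))           # 0.4  (upper/lower halves)
    print(item_discrimination(20, 10, 120, 1 / 3))   # 0.25 (upper/lower thirds)
    print(item_discrimination(15, 10, 100, 0.25))    # 0.2  (upper/lower quarters)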

35- An item for which the upper group has an IF of 0.20 and the lower group an IF of 0.60 --------- (State University, 95)
1) would have an item discrimination of −0.8
2) is somehow testing something quite different from the rest of the test
3) is a good candidate for retention in any revised version of the test
4) would have a low but positive item discrimination index

36- The process of administering a newly developed test to a group of examinees with characteristics similar to those of the target group is known as ---------. (State University, 95)
1) objective testing  2) validation  3) indirect testing  4) pretesting

37- Suppose 10 students took a test in which 4 individuals from the high-performing group answered item #1 correctly, while only 1 person from the low-performing group provided the correct answer to it. What would be the discrimination index of this item? (State University, 96)
1) 0.4  2) 0.5  3) 0.6  4) 0.8

38- What is the problem with the following vocabulary item in which the candidates should choose the best definition for the underlined word? (State University, 96)
The old man was always courteous when people approached him.
A. polite  B. happy  C. kind  D. pleased
1) There is a pair of synonyms used as distractors.
2) The stem does not provide sufficient contextual clues.
3) The stem provides a grammatical clue as to what the correct answer is.
4) The correct option and the distractors are not at the same level of difficulty.

39- Suppose that "A" is the correct response in the items below. Which distractor is a malfunctioning one? (State University, 96)

Item    A     B     C     D
1       60    15    25    0
2       45    44    5     6
3       50    17    14    19

1) B in item #2  2) C in item #3  3) D in item #1  4) D in item #3

40- What is the problem with the following grammar item? (State University, 97)
I got a phone call from the man ......... I had sold my car to.
A. which  B. who  C. whom  D. why
1) The stem does not provide sufficient contextual clues.
2) It is an item that tests different levels of formality.
3) It confuses students by having them read unnecessary material.
4) It reflects a native English-speaker error rarely made by non-native speakers.

41- The optimal range for the facility index of test items lies between ---------. (State University, 97)
1) 0.3 and 0.7  2) 0.2 and 0.8  3) 0.25 and 0.75  4) 0.37 and 0.63

State University Answers
2. Choice 4
Determiner 'an' is a contextual clue by the help of which the testee selects alarm.
3. Choice 4
Distractors should not be too difficult nor demand higher proficiency in the language than the correct option. If they are too difficult, they will succeed only in distracting the good student, who will be led into considering the correct option too easy.
4. Choice 4
5. Choice 1
IF = C/N = 40/50 = 0.8; 0.8 × 100 = 80%
6. Choice 2
Determiner 'an' is a contextual clue by the help of which the testee eliminates umbrella.
7. Choice 4
In item 1, choice D is nonfunctioning. In item 2, choice B is malfunctioning. In item 3, choice A is malfunctioning. In choice four, most testees have selected the correct option and the distractors have almost equally attracted the same number of examinees.
8. Choice 3
Choices one and four refer to choice distribution. Choice two refers to item discrimination.
9. Choice 3
Choice A is nonfunctioning because it hasn't attracted anyone; choice B is malfunctioning because it has attracted more students than the correct answer; choice C is difficult because IF = 0.25.
10. Choice 3
In an item of grammar, all distractors should stand grammatically correct; therefore choice three is poorly constructed.
11. Choice 3
Refer to Section 5.2.
12. Choice 4
Refer to Section 1.
13. Choice 4
In choice one, the item is easy and the point-biserial correlation (r_pb) is low. In choice two, the item is easy. Though the indexes in choices three and four are standard, the index in choice four is more desirable.

14. Choice 3
Each group consists of 20% of the total population: 50 ÷ 5 = 10; ID = (6 − 2)/10 = 0.4
15. Choice 4
The test constructor who is measuring the use of third person singular -s has chosen a difficult stem for this purpose.
16. Choice 2
In choice 1, B is nonfunctioning; in choice 3, A is malfunctioning; in choice 4, A is malfunctioning.
17. Choice 4
18. Choice 3
19. Choice 3
Refer to Section 5.
20. Choice 2
Some test writers argue that the options should be related to the same general topic or area, e.g. Apparition means ......
a) skeleton  b) ghost  c) corpse  d) nightmare
21. Choice 2
Refer to Section 5.2.
22. Choice 3
Because the stem is not long enough, there is more than one correct response to the item.
23. Choice 1
Refer to Section 5.
24. Choice 3
Refer to the section Item Facility in CRTs.
25. Choice 3
Refer to Section 3.
26. Choice 2
One of the guidelines for MC construction is "All distractors should be plausible". In the given item 'to have learn' is grammatically poorly constructed.
27. Choice 4
Refer to Section 5.1. If there were, for example, 20 students in the class, then:
ID = (C_high − C_low) / (½N) = (10 − 10) / (½ × 20) = 0
28. Choice 2
If a test is too easy or a test item is too difficult, it can tell us nothing about the differences in ability within the test population. A difficulty level falling in the middle of the range guarantees some variance in scores among the examinees.
29. Choice 3
Refer to Section 7.1.
30. Choice 4
When all the examinees in the high group and none of the examinees in the low group answer an item correctly, ID = +1. When all the examinees in the low group and none of the examinees in the high group answer an item correctly, ID = −1.
31. Choice 1
Since each group consists of 25% of the population, 80 ÷ 4 = 20 examinees per group: ID = 10/20 = 0.5
32. Choice 4
Using the data, item discrimination is 0.5: ID = (C_high − C_low) / (½N) = (5 − 0) / (½ × 20) = +0.5
33. Choice 2
'Keeping' in choice C is different from the other choices.
The student provided 60 correct answers: Score = R − W/(n − 1) = 60 − 20/2 = 50
34. Choice 2
Refer to Section 1.
35. Choice 2
This item, for which the upper group had an IF of .20 and the lower group an IF of .60, would have an ID of −.40 (.20 − .60 = −.40). This ID index indicates that the item is somehow testing something quite different from the rest of the test, because those who scored low on the whole test managed to answer this item correctly more often than those who scored high on the total test.
36. Choice 4
The stem gives the definition of pretesting.
37. Choice 3
ID = (C_high − C_low) / (½N) = (4 − 1) / (½ × 10) = +0.6
38. Choice 1
In the construction of vocabulary items, there should be no pair of synonyms.
39. Choice 1
This distractor has attracted almost as many testees as the correct response, hence malfunctioning.
40. Choice 2
In constructing grammar items, it is a good idea to avoid items that test divided usage, or items that only test different levels of formality.
41. Choice 4
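Two other quantities that recur in these answers are item facility and the correction-for-guessing score. Below is a small Python sketch under the same conventions as the keys above; the function names are mine, and the three-choice items in the second call are an assumption based on the divisor of 2 in the worked answer.

    def item_facility(n_correct, n_examinees):
        """IF = number of correct responses / number of examinees."""
        return n_correct / n_examinees

    def corrected_score(right, wrong, n_choices):
        """Correction for guessing: Score = R - W / (n - 1)."""
        return right - wrong / (n_choices - 1)

    print(item_facility(40, 50))          # 0.8, i.e. 80% (answer 5)
    print(corrected_score(60, 20, 3))     # 50.0 (the worked score above)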

Azad University Questions
1- If item facility equals 1, item discrimination will be ---------. (Azad University, 83)
1) 0  2) 1  3) 0.33  4) 0.67
2- Some of the shortcomings of the items in a test can be accounted for through item ---------. (Azad University, 83)
1) facility  2) discrimination  3) difficulty  4) analysis
3- If 6 testees from among a total number of 24 answer a question correctly, what is the item facility index of it? (Azad University, 83)
1) 0.18  2) 0.25  3) 0.75  4) 0.50
4- If the number of the correct responses in the lower group exceeds that of the higher one, the test is regarded to be ---------. (Azad University, 83)
1) very easy  2) very difficult  3) inappropriate  4) valid and reliable
5- 15 testees in the lower group and 45 in the higher group answered a question correctly. If all of the subjects in the upper group have answered correctly, what is the item discrimination of this given item? (Azad University, 83)
1) 0.75  2) 0.90  3) 0.86  4) 0.67
6- A distractor is malfunctioning if it attracts ---------. (Azad University, 84)
1) high and low students equally  2) more low than high students  3) more high than low students  4) some of the students
7- Determining the characteristics of an item altogether is called ---------. (Azad University, 84)
1) item facility  2) choice distribution  3) item discrimination  4) validation
8- If the item discrimination of an item equals 1, what would its item facility be? (Azad University, 84)
1) 1  2) −1  3) 0.0  4) 0.5
9- The most important task in pretesting is ---------. (Azad University, 84)
1) analyzing the result  2) reviewing the items  3) preparing directions  4) planning the test
10- To decide on the form of a test, certain elements such as the following are involved EXCEPT the ---------. (Azad University, 85)
1) nature of the attribute  2) function of the test  3) authenticity of the attribute  4) physical appearance
11- The purpose of pretesting is to determine the characteristics of the individual items including the following EXCEPT ---------. (Azad University, 85)
1) IF  2) ID  3) CD  4) SD
12- In constructing a test, which of the following points must be determined in advance? (Azad University, 85)
1) the form of the test  2) the function of the test  3) the content of the test  4) the time needed
13- Which one refers to the power of distinguishing among the testees on the basis of their language proficiency? (Azad University, 86)
1) item facility  2) item analysis  3) item difficulty  4) item discrimination
14- A number of scholars believe that an impure MC item is one which ---------. (Azad University, 87)
1) involves non-English options  2) tests more than one thing at a time  3) presents a single problem  4) has lengthy distractors
15- In the process of --------- a test, it undergoes some changes to fit the needs and purpose of the particular language program. (Azad University, 89)
1) scoring  2) developing  3) administering  4) adapting
16- In the multidimensional process of test construction, which of the following steps should be taken first? (Azad University, 89)
1) Determining the content of the test  2) Determining the form and function of the test  3) Preparing the test items  4) Examining the instructional objectives
17- If the IF value of a test is 0.1, we can conclude ---------. (Azad University, 91)
1) it is a very easy item for all test takers
2) it provides the maximum differentiation among test takers
3) it is a very difficult item for all except the most qualified applicants
4) it is an acceptable value to differentiate qualified from unqualified applicants

Azad University Answers
1. Choice 1
Refer to Section 5.2.
2. Choice 4
Refer to Section 5.
3. Choice 2
IF = 6/24 = 0.25
4. Choice 3
In calculating item discrimination, when we divide students into two halves, it is supposed that more students in the high group would answer an item correctly than in the low group. If the case is reversed, there is some problem with the item.
5. Choice 4
Since all 45 students in the high group answered the item correctly, we infer that in the low group there are also 45 students, i.e. the total number of testees is 90:
ID = (C_high − C_low) / (½N) = (45 − 15) / 45 = 0.66 ≈ 0.67
6. Choice 3
Refer to Section 5.3.
7. Choice 4
Refer to Section 6.
8. Choice 4
When the item discrimination of an item is 1, it means all the students in the high group answered it correctly and all the students in the low group answered incorrectly; that is, half the learners answered correctly. Therefore item facility equals 0.5.
9. Choice 1
Refer to Section 5.
10. Choice 3
Refer to Section 1.
11. Choice 4
Refer to Section 5.
12. Choice ?
According to FJB, p. 83, choices 1, 2, and 3 should be determined in advance.
13. Choice 4
Refer to Section 5.2.
14. Choice 2
Refer to Section 7.
15. Choice 4
A newly developed test may work fairly well in a program, but perhaps not as well as was originally hoped. Such a situation would call for further adapting of the test so that it better fits the needs and purposes of the particular language program. The process of adapting a test to a specific situation will involve some variant of the following steps:
1) Administer the test in the particular program, using the appropriate teachers and students;
2) Select those test questions that work well at spreading out the students (for NRTs), or are efficient at measuring the learning of the objectives (for CRTs) in this particular program;
3) Develop a shorter, more efficient revision of the test, one that fits the program's purposes and works well with its students; and
4) Evaluate the quality of the newly revised test.
16. Choice 2
Refer to Section 1.
17. Choice 3
The closer the index of IF to zero, the harder the item.

Chapter 6
Characteristics of a Good Test

• Reliability: the General Concept
• Reliability in Testing
• Classical True Score Theory (CTS)
• Approaches to Estimating Reliability
• Factors Influencing Reliability
• Standard Error of Measurement
• Other Reliability Theories
• Reliability of Criterion-Referenced Tests
• Validity
• Factors Influencing Validity
• The Relationship Between Reliability and Validity
• Practicality
• Extra Points to Remember

In the previous chapter we mentioned that tests are evaluated at two levels. At the first level, the items are investigated individually. The item analysis procedures carried out here include, for NR tests, item facility (IF), item discrimination (ID) and choice distribution (CD), and for CR tests, the difference index (DI) and the B-index. On the basis of these criteria, defective items should be either modified or discarded. Having good items, however, does not necessarily lead to a good test, because a test as a whole is more than a mere combination of individual items. Therefore, we need to go through the second level, which involves investigating the characteristics of the items as a whole. These characteristics are reliability, validity, and practicality. The notion of authenticity, as a fourth characteristic of a good test, was discussed in chapter one.

1. RELIABILITY: THE GENERAL CONCEPT
A reliable person, for instance, is a person whose behavior is consistent, dependable, and predictable, i.e., what he will do tomorrow and next week will be consistent with what he does today and what he has done last week. We say he is stable. An unreliable person, on the other hand, is one whose behavior is much more variable. Sometimes he does this, sometimes that. He lacks stability. We say he is inconsistent. So it is with tests and measurements; they are more or less variable from occasion to occasion. They are stable and relatively predictable or they are unstable and relatively unpredictable; they are consistent or not consistent. If they are reliable, we can depend on them. If they are unreliable, we cannot depend upon them.

2. RELIABILITY IN TESTING
If one could take a test over and over again, he would probably agree that his average score over all the tests is an acceptable estimate of what he really knows. On a reliable test, one's score on its various administrations would not differ greatly. That is, one's score would be quite consistent. On an unreliable test, on the other hand, one's score might fluctuate from one administration to the other. That is, one's score on its various administrations will be inconsistent. The notion of consistency of one's score with respect to one's average score over repeated administrations is the central concept of reliability.

In order to investigate the concept of reliability or consistency of a test, first a theoretical framework should be established. Different theories have been developed to explain the concept of reliability in scientific terms. Each theory makes certain assumptions which are similar to axioms and should be taken for granted. In this chapter classical true score theory (CTS), generalizability theory (G-theory) and item response theory (IRT) will be explained.

3. CLASSICAL TRUE SCORE THEORY (CTS)
CTS states that an observed score an examinee obtains on a test comprises two factors or components: a true score that is due to an individual's level of ability, and an error score, that is, a score due to factors other than the ability being tested. This assumption can be represented in a formula. Let's assume that someone takes a test. Since all measurement devices are subject to error, the score one gets on a test cannot be a true manifestation of one's ability in that particular trait or subject matter. In other words, the score contains one's true ability along with some error. If this error part could be eliminated, the resulting score would represent an errorless measure of that ability. By definition, this errorless score is called a 'true score'. This true score is always different from the score one gets, which is called the 'observed score'. Since the observed score includes the measurement error, i.e. the 'error score', it can be greater than, equal to, or smaller than the true score. Therefore, if the observed score is represented by X, the true score by T and the error score by E, the relationship between the observed and true score can be illustrated as follows:

Observed score = True score + Error score
X = T + E

This relationship is illustrated in the following figure.

[Figure: the observed score X shown as the sum of a true score component T and an error component E.]

It is important to keep in mind that we observe the X score; we never actually see the true (T) or error (E) score. For instance, a student may get a score of 85 on a language test. That's the score we observe, an X of 85. But the reality might be that the student is actually better at language than that score indicates. Let's say the student's true language ability is 89 (i.e., T = 89). That means that the error for that student is 4. What does this mean? Well, while the student's true language ability is 89, he may have had a bad day, may not have had breakfast, or may have had an argument. Factors like these can contribute to errors in measurement that make the student's observed ability appear lower than his true or actual ability. Sometimes errors will lead you to perform better on a test than your true ability (e.g., you had a good day guessing!). If the student scored 91 on the same language test, assuming that his true score is 89, this means that the error score for that student is 2.

According to CTS, reliability or unreliability is explained as follows. A measure is considered reliable if it would give us the same result over and over again (assuming that what we are measuring is not changing). However, if one takes two measures of the same attribute, e.g. verbal knowledge, the two measures will not resemble each other exactly. The fact that repeated measurements of some attribute of the same individual almost never duplicate one another is called 'unreliability'. On the other hand, repeated measurements of the same attribute of the same person will show some consistency. The tendency toward consistency from one set of measurements to the next is called 'reliability'.

It has been demonstrated that the theory of true and error scores developed over multiple samplings of the same person (i.e., taking a listening test 1000 times) holds over to a single administration of a test over multiple persons (i.e., administering a listening test to a group of 1000 people once). Therefore, we don't speak of the reliability of a measure for 'an individual'; reliability is a characteristic of a measure that is taken 'across individuals'. Therefore we can say that the performances of students on any test will tend to vary from each other, but their performances can vary for a variety of reasons. In the best of all possible worlds, all the variance in test scores would be directly related to the purposes of the test. For example, consider a test of listening comprehension. At first glance, teachers might think that the variance in students' performance on such a test could be attributed entirely to their ability to listen to texts. Unfortunately, reality is not quite that simple and clear. Many other factors may be potential sources of score variance on this listening test. In summary, the sources of variance in a set of scores fall into two general categories: (a) meaningful variance: those creating variance related to the purposes of the test or subject matter area being tested, and (b) error variance: those generating variance due to other extraneous sources.

Meaningful Variance. Part of the variation in a set of scores represents testees' differences in terms of their communicative ability. In order for test scores to be most informative, the concept being tested must be very carefully defined and thought out so that the items are a straightforward reflection of the purpose for which the test was designed. In order to achieve this precision we need to refer to a framework outlining the components of language competence. One such framework is Bachman and Palmer's (1996) model of language knowledge. By considering this theoretical model, we may decide to focus on students' comprehension of cohesion in academic lectures (see under textual knowledge within organizational competence), for example. Therefore, the purpose of testing becomes comprehension of cohesion in academic lectures, and any variance in score due to ability or inability to distinguish the cohesive devices could be regarded as meaningful variance. This variation, which would be predictable, is called systematic variation and contributes to reliability.

[Figure: Bachman and Palmer's (1996) model of language knowledge. Language competence branches into organizational competence and pragmatic competence. Organizational competence comprises grammatical competence (vocabulary, morphology, syntax, phonology/graphology) and textual competence (cohesion, rhetorical organization). Pragmatic competence comprises illocutionary competence (ideational, manipulative, heuristic, and imaginative functions) and sociolinguistic competence (sensitivity to dialect or variety, to register, to naturalness, and to cultural references and figures of speech).]

Error Variance (or residual variance; measurement error; random error). Unfortunately, other factors, unrelated to the purpose of the test, almost inevitably enter into the performances of the students. Error variance is a term that describes the variance in scores on a test that is not directly related to the purpose of the test (see the list below for the potential sources of error variance).

Sources of error variance:
• Variance due to environment: location, space, ventilation, noise, lighting, weather
• Variance due to administration procedures: directions, equipment, timing, mechanics of testing
• Variance due to scoring procedures: errors in scoring, subjectivity, evaluator biases, evaluator idiosyncrasies
• Variance due to the test and test items: test booklet clarity, answer sheet format, particular sample of items, item types, number of items, item quality, test security
• Variance attributable to examinees: health, fatigue, physical characteristics, motivation, emotion, memory, concentration, forgetfulness, impulsiveness, carelessness, test-wiseness, guessing, task performance speed, chance knowledge of item content, comprehension of directions

For instance, in the set of scores from the listening comprehension example, other potential sources of score variance might include: variables in the environment like noise, heat, etc.; the adequacy of administration procedures; factors like health and motivation in the examinees themselves; the nature and correctness of scoring procedures; or even the characteristics of the set of items selected for this particular test. All these factors might be contributing to the success or failure of students on the test, factors that are not directly related to the students' listening comprehension. For the most part, variation under these conditions and many others which may not be predictable is called unsystematic variation. The important thing about random error is that it does not have any consistent effects across the entire sample. Instead, it pushes observed scores up or down randomly (hence the name random error) and contributes to unreliability.

The relationship between true, error and observed scores, which was stated by a simple equation, has a parallel equation at the level of the variance of a measure. That is, across a set of scores, we assume that:

V_x = V_t + V_e

where V_x is the observed score variance, V_t is the true score variance, and V_e is the error score variance component. Since error variance is included in the observed variance, the variance of the observed scores is always greater than the variance of the true scores. In other words, the magnitude of the observed variance equals the magnitude of the true variance plus the magnitude of the error variance.

From this formula, it can easily be understood that there is a close relationship between the degree of error in measurement and the estimate of the true score: while the variance of the true scores doesn't change, the variance of the observed scores fluctuates with the extent of the error of measurement. Logically, we can say that when a small part of the observed score is due to measurement error, the estimate of the true score will approximate its real value. With this in mind, we can now define reliability more precisely. Reliability is a ratio: it is expressed as the ratio of the variance of true scores to the variance of observed scores. Notationally, this relationship is presented as:

r = V_t / V_x

Since V_t = V_x − V_e, the formula can also be written as:

r = (V_x − V_e) / V_x

Below are some possible combinations of true and error variances:

[Figure: circles showing different mixes of true and error variance within the observed variance, from all error (r = 0) through intermediate cases (e.g. r = 0.25) to no error (r = +1).]

As you can see in the third circle from the left, when all the observed variance is error variance, i.e. when V_e = V_x, reliability equals zero. This can be demonstrated through the formula:

r = (V_x − V_e) / V_x = 0

Also, as can be seen in the last circle from the left, when there is no error variance, i.e. when V_x = V_t, reliability equals one. This can be demonstrated through the formula too:

r = (V_x − V_e) / V_x = V_x / V_x = +1

This means that when there is the greatest amount of error in measurement, the reliability will equal zero. Thus, the magnitude of reliability can range from zero to one. The reliability of zero, which is the minimum, means that all observed variation is due to error. That is, the test is completely unreliable. On the other hand, the reliability of 1 indicates that there is no error in measurement and the test is perfectly reliable.
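The ratio definition can be made concrete with a small simulation: generate true scores, add random error, and compare the variance ratio with its expected value. This is an illustrative Python sketch, not part of CTS itself; the chosen variances are arbitrary.

    import random

    random.seed(1)
    N = 10_000

    # X = T + E: true scores with V_t = 100, errors with V_e = 25.
    true_scores = [random.gauss(50, 10) for _ in range(N)]
    errors = [random.gauss(0, 5) for _ in range(N)]
    observed = [t + e for t, e in zip(true_scores, errors)]

    def variance(scores):
        mean = sum(scores) / len(scores)
        return sum((s - mean) ** 2 for s in scores) / len(scores)

    # r = V_t / V_x; the expected value here is 100 / (100 + 25) = 0.8.
    print(variance(true_scores) / variance(observed))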

4. APPROACHES TO ESTIMATING RELIABILITY
Within the CTS model, there are three approaches to estimating reliability, each of which addresses different sources of error. Stability estimates indicate how consistent test scores are over time, equivalence estimates provide an indication of the extent to which scores on alternate forms of a test are equivalent, and internal consistency estimates are concerned primarily with sources of error from within the test and scoring procedures. The estimates of reliability that these approaches yield are called reliability coefficients.

4.1. Stability (Test-retest method)
Test-retest reliability provides a measure of the variability that can be expected due to day-to-day fluctuations in a number of different factors, such as concentration, fatigue, etc. In this method, reliability is obtained through administering a given test twice and then calculating the correlation between the two sets of scores. The first step in this strategy is to administer whatever test is involved two times to the same group of students. Once the tests are administered twice and the pairs of scores for each student are lined up in two columns, a Pearson product-moment correlation coefficient between the two sets of scores is calculated to show the extent of consistency. In using this method the assumption is that no significant change occurs in the examinees' knowledge during the interval between the two administrations.

Note: Since there has to be a reasonable amount of time (a two-week interval) between the two administrations, this kind of reliability is referred to as the reliability/consistency of scores over time, or temporal reliability. Therefore, in reporting test-retest reliability coefficients, it is important to include the time interval. For example, a report might state "the stability of test scores obtained on the same form over a three-month period was 0.9".

This approach has three drawbacks:
• It requires two administrations. Obviously it is difficult to arrange two testing sessions for the same group of examinees. Furthermore, preparing similar conditions under which the administrations take place adds to the complications of this method.
• Human beings are intelligent and dynamic creatures. They are always involved in the process of learning. Thus, their abilities are most likely to change from one administration to another, especially when the interval between the two testing sessions is long.
• Practice effect: Generally speaking, the more often we perform an operation, the more proficient we become at it. This is as true of test-taking as it is of driving a car or baking a cake. Thus we may expect subjects who are repeating a test to score somewhat higher than they did the first time, due to familiarity with the test procedure and also specific test items, even if their knowledge of the subject being tested has not itself increased.
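Operationally, the test-retest estimate is just the Pearson product-moment correlation between the two administrations. Below is a minimal Python sketch; the paired score lists are invented purely for illustration.

    from math import sqrt

    def pearson_r(xs, ys):
        """Pearson product-moment correlation between paired score lists."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
        sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sx * sy)

    first_admin = [12, 15, 18, 20, 22, 25, 27, 30]    # first administration
    second_admin = [13, 14, 19, 19, 23, 24, 28, 29]   # same students, two weeks later
    print(pearson_r(first_admin, second_admin))       # test-retest (stability) estimate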

4.2. Equivalence (Parallel-forms method)
In the parallel-forms method, two similar or parallel forms of the same test are administered to a group of examinees just once. Then, using the Pearson product-moment formula, the correlation coefficient between the two sets of scores obtained from the two forms will be an estimate of test score reliability.

In order to construct the two parallel forms, the table of specifications for the two forms of the test must be the same. It means that all the components of the two tests should be the same. For example, the number of sub-tests and items should be equal in the two tests. The sameness doesn't mean that the surface forms of the items should be the same. Each item in one form has a stem which differs from its counterpart in the other form. What remains consistent across the items is the element to be tested. For example, the following items are parallel because they are testing present tense with different surface forms:
He usually ......... tennis every day.
a) plays  b) play  c) playing
My brother often ......... a cup of tea every morning.
a) drinking  b) drinks  c) drink

The two parallel forms of a test should have the following characteristics:
• the means and standard deviations are quite similar;
• the two forms correlate about equally with some third measure.

This method has two drawbacks:
• Constructing two parallel forms of a test is not an easy task.
• Ordering effect is the effect on test performance of presenting two or more forms of the test to the same participant. Since administering two alternate forms to the same group of individuals will necessarily involve giving one form first and the other second, there is a possibility that individuals will perform differently because of the order in which they take the two tests; this is referred to as the ordering effect, which can be minimized through the use of a counterbalanced test design. To do counterbalancing, testers need to develop two parallel forms (for instance, forms A and B) of the test so that they are very similar. During the pre-test, half of the students take Form A and half take Form B. Then the first half takes Form B and the second half takes Form A. This is illustrated below.

             Time 1    Time 2
Half I       Form A    Form B
Half II      Form B    Form A

4.3. Internal Consistency
To avoid the work and complexity involved in the test-retest and equivalent-forms strategies, testers most often use internal-consistency strategies to estimate the go-togetherness of test items. As the name implies, internal-consistency reliability strategies estimate the consistency of a test using only information internal to the test, that is, the consistency of test takers' performances on the different parts of the test with each other. Thus, only information available in one administration of a single test is required. Inconsistencies in performance on different parts of tests can be caused by a number of factors. Performance on the parts of a reading comprehension test, for example, might be inconsistent if passages are of differing lengths and vary in terms of their syntactic and lexical complexity, or involve different topics. Similarly, performance on different items of a multiple-choice test that includes items with different formats (some with blanks to be completed and others with words underlined that may be incorrect) may not be internally consistent. The main assumptions underlying the internal consistency method are that:
a) test scores are unidimensional, which means that the parts or items of a given test all measure the same, single ability, i.e. the items comprising a test are homogeneous. For example, grammatical points, vocabulary, reading and listening comprehension are all subparts of the trait called language ability;
b) the items or parts of a test are locally independent. That is, we assume that an individual's response to a given test item does not depend upon how he responds to other items that are of equal difficulty, i.e. the items comprising a test are independent. This is a prized characteristic of multiple-choice items.

Numerous strategies exist for estimating internal consistency, including Flanagan's coefficient, Rulon's coefficient, split-half, KR-20, KR-21 and Cronbach's alpha. In this survey we will present the last four strategies. The strategies chosen were selected on the basis of their conceptual clarity, ease of calculation, accuracy of results, and frequency of appearance in the language testing literature.

4.3.1. Split-half reliability estimates
One approach to examining the internal consistency of a test is the split-half method, in which we divide the test into two halves and then determine the extent to which scores on these two halves are consistent with each other. In so doing, we are treating the halves as parallel forms; of course, parallel forms of the items in a single test, not the parallel forms of two separate tests. In this method one important question should be answered, namely, how should the test be divided? A convenient way of splitting a test into halves might be to simply divide it into the first and second halves. The problem with this is that language test items are mostly designed with the easiest questions at the beginning and the questions becoming progressively more difficult. In tests such as this, the second half would obviously be more difficult than the first, so the performance on the two halves will not be inter-related. A procedure that is adequate for most purposes is to split the test on the basis of odd- and even-numbered items. The odd-numbered and even-numbered items are scored separately as though they were two different forms. Through this procedure, easy and difficult items will be equally distributed in the two halves.


The Spearman-Brown split-half estimate
Once the two half-scores have been obtained for each person, they may be correlated through the Pearson product-moment correlation coefficient. It should be noted, however, that this correlation actually gives the reliability of only half of the test. For example, if the entire test consists of 100 items, the correlation is computed between two sets of scores each of which is based on only 50 items. Now, the question is how to compensate for the reduced number of items. Since the length of the test, i.e. the number of items, is an important factor in test score reliability, by dividing the test into two halves, the length of the test will be reduced to half of the length of the total test. Thus, we must correct the obtained correlation for this reduction in length. The most commonly used formula for this is a special case of the Spearman-Brown prophecy formula (Spearman, 1910; Brown, 1910), which yields a split-half reliability coefficient:

r_total = (2 × r_half) / (1 + r_half)

where:
r_total = reliability of the full-length test
r_half = reliability of half of the test (the correlation between the two halves)

Example: The reliability of half of a grammar test is calculated to be 0.35. By applying the Spearman-Brown prophecy formula, the total reliability would be ---------.
1) 0.52  2) 0.63  3) 0.45  4) 0.7
r_total = (2 × 0.35) / (1 + 0.35) = 0.7 / 1.35 ≈ 0.52

Two assumptions must be met in order to use this method. First, since we are in effect treating the two halves as parallel tests, we must assume that they are equivalent, i.e. they have equal means and variances (an assumption we can check). Second, we must assume that the two halves are experimentally independent of each other (an assumption that is very difficult to check). That is, an individual's performance on one half does not affect how he performs on the other. This assumption doesn't mean that the two halves will not be correlated with each other, rather that the correlation will be due to the fact that they are both measuring the same trait or ability, and not to the fact that performance on one half depends upon performance on the other.

The Guttman split-half estimate
Another approach to estimating reliability from split-halves is that developed by Guttman (1945), which does not assume equivalence of the halves, and which does not require computing a correlation between them. This split-half reliability coefficient is based on the ratio of the sum of the variances of the two halves to the variance of the whole test:

r = 2 × (1 − (S²_h1 + S²_h2) / S²_x)

where:
S²_h1 = variance of the first half
S²_h2 = variance of the second half
S²_x = variance of the whole test

Since this formula is based on the variance of the total test, it provides a direct estimate of the reliability of the whole test. Therefore, unlike the correlation between the halves that is the basis for the Spearman-Brown reliability coefficient, the Guttman split-half estimate does not require an additional correction for length.

Example: Suppose that the standard deviation of the odd-numbered items of a test is 2.66, of the even-numbered items is 2.8, and of the whole test is 4.97. What is the reliability of this test?
1) 0.85  2) 0.80  3) 0.75  4) 0.38
r = 2 × (1 − (2.66² + 2.8²) / 4.97²) = 2 × (1 − (7.07 + 7.84) / 24.7) = 2 × (1 − 0.6) = 2 × 0.4 = 0.8

4.3.2. Reliability estimates based on item variances
In estimating reliability with the split-half approach, there are many different ways in which a given test could be divided into halves. Since not every split will yield halves with exactly the same characteristics in terms of their equivalence and independence, the reliability coefficients obtained with different splits are likely to vary, so that our estimates of reliability will depend greatly on the particular split we use. One way of avoiding this problem would be to split the test into halves in every way possible, compute the reliability coefficients based on these different splits, and then find the average of these coefficients. For example, in a four-item test, the three possible splits would be: (1) items 1 and 2 in one half and items 3 and 4 in the other; (2) 1 and 3, 2 and 4; (3) 1 and 4, 2 and 3. This approach soon becomes impractical, however, beyond a very small set of items, since the number of reliability coefficients to be computed increases dramatically.

KR-21 method
This formula, developed by Kuder and Richardson (1937), was based on the assumptions that (a) all items in a test are designed to measure a single trait, (b) all items are of equal difficulty, and (c) all items are scored dichotomously (no weighting scheme):

KR-21 = (K / (K − 1)) × (1 − (X × (K − X)) / (K × V))

where:
K = the number of items in the test
X = the mean score
V = the variance

Example: If a test with 30 items has a variance and mean of 10 and 20, then the reliability of the test would be ---------.
1) 0.44  2) 0.34  3) 0.29  4) 0.52
KR-21 = (30/29) × (1 − (20 × (30 − 20)) / (30 × 10)) = (30/29) × (1 − 200/300) ≈ 0.34

Note: It can be inferred from the last example that there is a negative relationship between the standard error of measurement and reliability. When there is no measurement error, reliability equals +1:
0 = S_x × √(1 − r), so √(1 − r) = 0 and r = 1
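All three internal-consistency estimates presented so far reduce to one-line computations once the half statistics are in hand. Below is a Python sketch reproducing the worked examples; the function names are mine, not from the original text.

    def spearman_brown(r_half):
        """Full-length reliability from the correlation between the halves."""
        return (2 * r_half) / (1 + r_half)

    def guttman_split_half(var_half1, var_half2, var_total):
        """Guttman estimate from the two half variances and the total variance."""
        return 2 * (1 - (var_half1 + var_half2) / var_total)

    def kr21(k, mean, variance):
        """KR-21 from the number of items, the mean score and the variance."""
        return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

    print(spearman_brown(0.35))                           # ~0.52 (grammar-test example)
    print(guttman_split_half(2.66**2, 2.8**2, 4.97**2))   # ~0.79, rounded to 0.8
    print(kr21(30, 20, 10))                               # ~0.34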

This estimate of the SEM indicates how far a given obtained score is from its true score. Conceptually, the standard error of measurement is used to determine a band around a student's score within which that student's true score would probably fall. Since the SEM is the standard deviation of the error scores, we can use the constant percentages associated with the normal distribution to determine the probability that the individual's true score will fall within this band. Since the item asks for 68% of the times, the shaded area covering between 12.5 and 17.5 is the answer.

[Figure: a normal curve centered on the observed score of 15; the shaded band from 12.5 to 17.5 (plus or minus one SEM) is the answer.]

Example: In a set of scores the variance is 4 and the reliability is 0.64. If a student obtained 20, we can be sure that 95% of the time her score fluctuates between ---------.
1) 18.8 and 21.2  2) 16.4 and 20  3) 17.6 and 22.4  4) 20 and 23.6
First we need to calculate the SEM (note that the stem provides the index for variance):
SEM = S_x × √(1 − r) = 2 × √(1 − 0.64) = 2 × √0.36 = 2 × 0.6 = 1.2
Since we are looking for the probability of 95%, the band is 20 ± 2 SEM = 20 ± 2.4, i.e. between 17.6 and 22.4; the shaded area shows the answer.

[Figure: a normal curve centered on 20 with tick marks at 16.4, 17.6, 18.8, 20, 21.2, 22.4 and 23.6; the shaded area runs from 17.6 to 22.4.]


Note: You should be careful not to mix this type of problem with the type in chapter 4. Though both are based on the normal distribution, they have fundamental differences. First, the diagrams in chapter four were divided on an SD basis, while the problems in this chapter are divided based on the SEM. More importantly, in the diagrams in chapter four the mean of the class is placed in the middle, while in these diagrams an individual's observed score is located under the middle line.
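The band computations in the examples above can be scripted directly; following the text, plus or minus 1 SEM is read as roughly 68% and plus or minus 2 SEMs as roughly 95%. A minimal Python sketch (names mine):

    from math import sqrt

    def sem(sd, reliability):
        """Standard error of measurement: SEM = S_x * sqrt(1 - r)."""
        return sd * sqrt(1 - reliability)

    def score_band(observed, sd, reliability, n_sems):
        """Band of +/- n_sems SEMs around an observed score
        (1 SEM ~ 68%, 2 SEMs ~ 95% under the normal distribution)."""
        half_width = n_sems * sem(sd, reliability)
        return observed - half_width, observed + half_width

    # The worked example: variance 4 (so SD = 2), r = 0.64, observed score 20.
    print(sem(2, 0.64))                 # 1.2
    print(score_band(20, 2, 0.64, 2))   # (17.6, 22.4), the 95% band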

7. OTHER RELIABILITY THEORIES
Classical true score theory is but one of the theories used to estimate the reliability of tests. Due to the shortcomings of CTS, two other theories, i.e. generalizability theory and item response theory, were proposed. Below we will review them very briefly.

7.1. Generalizability Theory (G-theory)
A broad model for investigating the relative effects of different sources of variance in test scores has been developed by Cronbach and his colleagues. This model, which they call generalizability theory, provides a conceptual framework and a set of procedures for examining several different sources of measurement error simultaneously. Using G-theory, test developers can determine the relative effects, for example, of using different scoring procedures, and can thus estimate the reliability, or generalizability, of tests more accurately.

7.2. Item Response Theory (IRT)
Item response theory refers to a group of models which are based on the fundamental theorem that an individual's expected performance on a particular test question, or item, is a function of the level of difficulty of the item and the individual's level of ability. Essentially, IRT models are used to establish a uniform, sample-free scale of measurement, which is applicable to individuals and groups of widely varying ability levels and to test content of widely varying difficulty levels. Item response models include three families of analytical procedures:
• The one-parameter model (or Rasch model) is probabilistic in nature in that the persons and items are not only graded for ability and difficulty, but are judged according to the probability or likelihood of their response patterns given the observed person ability and item difficulty. In this model, the discrimination of all the items is assumed to be equal, and it is also assumed that there is no guessing. Therefore, it is based only on the difficulty of a set of items.
• The two-parameter model does everything that the one-parameter model does. In this model it is assumed that individuals of low ability will have virtually no probability of a correct response. This model omits the chance response parameter, and is appropriate when the effects of guessing on test performance can be negligible.
• The three-parameter logistic model not only does everything that the one- and two-parameter models do, but it also takes guessing into account.
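The three families differ only in which item parameters are left free. The sketch below uses the standard logistic form of these models, which is not spelled out in the text above; parameter names follow common IRT notation (theta for ability, b for difficulty, a for discrimination, c for the guessing floor).

    from math import exp

    def p_correct(theta, b, a=1.0, c=0.0):
        """Probability of a correct response under the 3PL model:
        P = c + (1 - c) / (1 + exp(-a * (theta - b))).

        a = 1, c = 0  -> one-parameter (Rasch) model: difficulty only
        c = 0         -> two-parameter model: difficulty and discrimination
        a, b, c free  -> three-parameter model: guessing taken into account
        """
        return c + (1 - c) / (1 + exp(-a * (theta - b)))

    theta = 0.5  # an examinee's ability on the latent scale
    print(p_correct(theta, b=0.0))                 # 1PL
    print(p_correct(theta, b=0.0, a=1.7))          # 2PL
    print(p_correct(theta, b=0.0, a=1.7, c=0.25))  # 3PL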


8. RELIABILITY OF CRITERION-REFERENCED TESTS
As noted previously, CRTs will not necessarily produce normal distributions, especially if they are functioning correctly. Hence, a CRT that produces little variance in scores is an ideal that testers seek in developing CRTs. In other words, a low standard deviation may be a positive by-product of developing a sound CRT. This is quite the opposite of the goals and results when developing a good NRT, which ideally should approximate a normal distribution of scores to the greatest extent possible. Therefore, the appropriateness of using correlational strategies for estimating the reliability of CRTs is questioned, because such analyses all depend in one way or another on a normal distribution and a large standard deviation.

Suppose, for example, that we have developed an achievement test based on specific course objectives. If we give this test to beginning language learners at the start of the course, we would expect them to obtain uniformly low scores. By the end of the course, if instruction had been equally effective for all students, we would expect students to obtain uniformly high scores, again with little variation among them. If we estimated the internal consistency of each set of scores, we would probably obtain disappointingly low reliability coefficients. And if we computed the correlation between the two sets of scores, this would probably give a very low estimate of stability, since there is likely to be very little variance in either the pre-test or the post-test scores. Thus, the problem with using classical NR estimates of reliability with CR test scores is that such estimates are sensitive primarily to inter-individual variations, which are of little interest in CR interpretation.

Although NR reliability estimates are inappropriate for CR test scores, it is not the case that reliability is of no concern in such tests. On the contrary, consistency, stability, and equivalence are equally important for CR tests. However, they take on different aspects in the CR context, and therefore require different approaches to both estimation and interpretation. Fortunately, many other strategies have been worked out for investigating CRT consistency. In general, these strategies fall into three categories: threshold loss agreement, squared-error loss agreement, and domain score dependability.

Note: Sometimes the term reliability is reserved for an NR test, and the consistency of scores of a CR test is referred to as agreement or dependability.

8.1. Threshold Loss Agreement Approaches
Two of the threshold loss agreement statistics that are prominent in the literature are the agreement coefficient (Po) and the kappa coefficient (κ). Both of these coefficients measure the consistency of master/non-master classifications as they were defined in Chapter 5. Recall that a master is a student who knows the material or has the skill being tested, while a non-master is a student who does not. These two threshold loss agreement approaches are sometimes called "decision consistency" estimates because they gauge the degree to which decisions that classify students as masters or non-masters are consistent. In principle, these estimates require the administration of a test on two occasions.
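For two administrations, both statistics can be computed from the master/non-master classifications alone. Below is a minimal Python sketch under that assumption; the helper and the toy classification lists are mine, not from the text.

    def decision_consistency(first, second):
        """Agreement coefficient (Po) and kappa for master (1) / non-master (0)
        classifications obtained from two administrations of a CRT."""
        n = len(first)
        p_o = sum(a == b for a, b in zip(first, second)) / n
        # Agreement expected by chance, from the marginal 'master' proportions.
        p1, p2 = sum(first) / n, sum(second) / n
        p_chance = p1 * p2 + (1 - p1) * (1 - p2)
        kappa = (p_o - p_chance) / (1 - p_chance)
        return p_o, kappa

    occasion_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
    occasion_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
    print(decision_consistency(occasion_1, occasion_2))   # (0.8, ~0.58)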


8.2. Squared-Error Loss Agreement Approaches
Threshold loss agreement coefficients focus on the degree to which classifications in clear-cut categories (master or non-master) are consistent. Squared-error loss agreement strategies also do this, but they do so with "sensitivity to the degrees of mastery and non-mastery along the score continuum", that is, to the distances that students are from the cut-point. Thus, squared-error loss agreement approaches attempt to account for the degree of mastery and non-mastery rather than just the dichotomous categorization. One such statistic is the phi(lambda) dependability index (Φ(λ)), which can be estimated using one test administration.

8.3. Domain Score Dependability
All the threshold loss and squared-error loss agreement coefficients described previously have been criticized because they are dependent in one way or another on the cut-score. Alternative approaches, called the domain score estimates of dependability, have the advantage of being independent of the cut-score. However, they apply to domain-referenced interpretations rather than to all criterion-referenced interpretations. One way of analyzing the consistency of domain-referenced tests (and by extension, objective-referenced tests) is the phi coefficient. Phi assumes that the items are sampled from a well-defined domain and gives no information about the reliability of the individual objectives-based subtests.

Note: One last point on CRT dependability is the confidence interval (CI). The CI functions for CRTs in a manner analogous to the standard error of measurement for NRTs. More explicitly, the CI can be used to estimate a band around each student's score in a CRT.

9. VALIDITY
Many discussions of reliability and validity emphasize the differences between these two qualities, rather than their similarities. However, these concepts can be better understood by recognizing them as complementary aspects of a common concern in measurement: identifying, estimating, and controlling the effects of factors that affect test scores. The investigation of reliability is concerned with answering the question, "How much of an individual's test performance is due to measurement error, or to factors other than the language ability we want to measure?" and with minimizing the effects of these factors on test scores. Validity, on the other hand, is concerned with the question, "How much of an individual's test performance is due to the language ability we want to measure?" and with maximizing the effects of these abilities on test scores. The concerns of reliability and validity can thus be seen as leading to two complementary objectives in designing and developing tests: (a) to minimize the effects of measurement error, and (b) to maximize the effects of the language abilities we want to measure.

In the preceding sections, the concept of reliability and the ways to estimate it were presented. It would be most desirable if the concept of validity could be treated in the same way. Unfortunately, for most tests it is not so, because validity is a more test-dependent concept than reliability is. Furthermore, reliability is a purely statistical parameter. That is, it can be determined fairly independently of the test itself. For example, given a set of scores obtained from a given test, the degree of reliability can be determined without even referring to the test. Validity, on the other hand, depends mostly on the peculiarities of the test. Therefore, just obtaining a certain number of scores from a test will not enable test users to establish its validity. Validity refers to the extent to which a test measures what it is supposed to measure. The question of validity is concerned with whether the test is achieving what it is intended to or not. As an example, if a test is designed to measure students' ability on a particular trait, it will be desirable to observe that the test actually provides information on the intended trait rather than something else. There are the following types of validity.

9.1. Content Validity

Content validity refers to the degree of correspondence between the test content and the content of the materials to be tested. The content of materials may be broadly defined to include both subject matter content and instructional objectives:
• Subject matter content is concerned with the topics, or subject matters, to be covered;
• The instructional objectives part is concerned with the degree of learning that students are supposed to achieve.
The key aspect in content validity is that of sampling. A test is always a sample of the many questions that could be asked. Content validity may be defined as the extent to which a test contains a representative sample of the larger universe it is supposed to represent, or the content to be tested at the intended level of learning. The focus of content validity, then, is on the appropriacy of the sample of elements included in the test. That's why content validity is sometimes called the appropriateness of the test.

Suppose a test is to be developed to measure examinees' ability in recognizing the correct grammar of the English language. To assure that the content of the test corresponds to the content to be tested, the test should include a representative sample of grammatical items. This would be possible through utilization of the table of specifications. Furthermore, the test items should meet, quite accurately, the level of learning expected from the examinees. Complex items would not be appropriate for elementary level students. Thus, in dealing with content validity, both the content of the test and the student for whom the test is designed should be taken into account. Since there is no commonly used numerical expression for content validity, it provides the most useful subjective information about the appropriateness of the test. Of course, this subjectivity is a drawback in itself. To reduce subjectivity, two measures can be taken:
• To have the test reviewed by more than one expert.
• To define the content to be tested in as detailed terms as possible and transfer the detailed definition onto a table of specifications.

9.2. Face Validity
Face validity is the way the test looks to the examinees, test administrators, educators, and the like. Obviously, this is not validity in the technical sense. Yet its importance should not be underestimated, for if the content of a test appears irrelevant, silly, or inappropriate, knowledgeable administrators will hesitate to adopt the test. Moreover, the students' motivation is maintained if a test has good face validity, for most students will try harder if the test looks sound. If, on the other hand, the test appears to have little of relevance in the eyes of the student, it will clearly lack face validity. Possibly as a direct result, the student will not put maximum effort into performing the tasks set in the test. An example might be an English reading comprehension test designed for American children which is given to adult foreign learners of English just because the two groups are thought to have a similar degree of proficiency in the language.
It is possible for a test to include all the components of a particular teaching program being followed and yet at the same time lack face validity. For example, a reading test for engineers could have most of the grammatical features of the language of engineering (e.g. frequent use of the passive voice) as well as most of the language functions and notions associated with reading and writing engineering texts. Even so, however, the test will lack face validity if the subject of the reading text concerns, say, public institutions in Britain.

Note: It should be noted that a test may have a desirable level of content validity, but not face validity. For example, a test such as a cloze test, which is experimentally shown to have a reasonable validity, may not be acceptable, on the face of it, as a good test of, say, grammar. Thus, test developers should not be very much concerned with face validity. They should, however, be very careful about establishing content validity.

Face validity will likely be high if learners encounter
• a well-constructed, expected format with familiar tasks,
• a test that is clearly doable within the allotted time limit,
• items that are clear and uncomplicated,
• directions that are crystal clear,
• tasks that relate to their course work (content validity),
• a difficulty level that presents a reasonable challenge, and
• no surprises in the test.

9.3. Criterion-Related Validity
Criterion-related validity investigates the correspondence between the scores obtained from the newly-developed test and the scores obtained from some independent outside criterion. The criteria can range from a relatively subjective or inexact nature (e.g. supervisors' ratings, a teacher's subjective judgment, or grades in a course) to standardized objective tests. To obtain criterion-related validity, the newly-developed test has to be administered along with the criterion measure to the same group. Then, using the Pearson product-moment formula, the correlation between the two sets of scores will be an indication of the criterion-related validity of the test. Depending on the time of administration of the criterion measure, two types of criterion-related validity are established.
• Concurrent (or status) validity: In concurrent validity, a test developed to measure a particular trait is administered concurrently with another well-known test the validity of which is already established. Then the correlation is computed between the newly developed test and the criterion measure. The degree of correlation is an indication of the concurrent validity of the test. For example, if we use a test of English as a second language to screen university applicants and follow up the test immediately by having an English teacher rate each student's English proficiency on the basis of his class performance during the first week and correlate the measures, we are seeking to establish the concurrent validity of the test.
• Predictive validity: Predictive validity is just like concurrent validity in that it depends on some sort of correspondence between the scores obtained from the new test and those obtained from an already established one. The difference is in the procedure taken, in that the administration of the two tests is not concurrent but separated by some time interval. For example, if we use a test of English as a second language to screen university applicants and then correlate test scores with grades made at the end of the first semester, we are attempting to determine the predictive validity of the test.
In using criterion-related validity, some points should be taken into account.
• When the newly-developed test is validated against a subjective criterion (such as a supervisor's rating), the validity index will be low. Also, the test we are validating may itself be a somewhat imprecise measure (e.g. a composition or scored interview), in which case the validity will be comparatively low. In short, criterion-related validity depends in large part on the reliability of both test and criterion measure.
• The criterion measure should possess all the characteristics of a good test; that is, it must have a reasonable index of reliability and validity.
• The content of the criterion measure must be in the same domain as that of the new test.
• One should be cautious in the interpretation of the validity index. It is important for educators and test users to bear in mind that the correspondence between the new test and the criterion test can get into a circular question. For example, test A, the new test, is validated against test B, the criterion measure. But test B itself should have been validated against still another measure, let's say C. And so the process goes on. This means that criterion-related validity should be interpreted cautiously, because all validity indexes depend on the very test against which all other tests have been subsequently validated.
• Since estimates of criterion-related validity are usually expressed in terms of coefficients of correlation like those commonly used in estimating test reliability, this kind of validity is also known as empirical or statistical validity.
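As a rough illustration of the procedure just described, the following sketch computes a Pearson product-moment coefficient between a hypothetical new test and a criterion measure taken by the same group; all scores are made up:

```python
# A minimal sketch, with invented scores, of estimating criterion-related
# validity as the Pearson product-moment correlation between a new test
# and an established criterion measure taken by the same group.
from math import sqrt

new_test  = [12, 15, 9, 18, 14, 11, 16, 10]   # hypothetical scores on the new test
criterion = [50, 61, 42, 70, 58, 47, 66, 45]  # hypothetical criterion scores

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# The closer the coefficient is to +1, the stronger the evidence of
# criterion-related (empirical) validity against this criterion.
print(round(pearson(new_test, criterion), 3))
```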

9.4. Construct Validity
An understanding of the concept of a psychological construct is prerequisite to understanding construct validity. A psychological construct is an underlying ability/trait defined or hypothesized in psychological theories. Communicative competence, self-esteem, and self-regulation are some instances of constructs. Constructs occur inside the brain. In the domain of language testing, the word construct refers to any underlying ability that is hypothesized in a theory of language ability. In the field of testing, this job falls to the language tester. A test has construct validity to the extent to which the psychological reality of a trait or construct (e.g., language proficiency) can be established.
A teacher needs to be satisfied that a particular test is an adequate definition of a construct. Let's say you have been given a procedure for conducting an oral interview. The scoring analysis for the interview weighs several factors into a final score: pronunciation, fluency, grammatical accuracy, vocabulary use, and sociolinguistic appropriateness. The justification for these five factors lies in a theoretical construct that claims those factors as major components of oral proficiency. So, if you were asked to conduct an oral proficiency interview that accounted only for pronunciation and grammar, you could be justifiably suspicious about the construct validity of that test.
Another definition, offered by Heaton, states that if a test has construct validity, it is capable of measuring certain specific characteristics in accordance with a theory of language behavior and learning.

Note: Construct validity is difficult to determine because it requires utilization of a sophisticated statistical procedure called "factor analysis".
Note: Construct validity is the most important type of validity, which can dominate all others. The reason is quite simple: if a particular trait or construct does not exist, then there will be no sense in attempting to measure it.
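Full construct validation calls for factor analysis, which is beyond a short example, but a crude first look at the same logic is to inspect whether subscores claimed to tap a single construct correlate with one another. The sketch below uses invented interview subscores and Python's statistics.correlation (3.10+):

```python
# A toy illustration (not from the book) of one kind of evidence used in
# construct validation: if the five interview subscores are claimed to tap
# one construct (oral proficiency), their intercorrelations should be
# substantial. All scores below are invented.
from itertools import combinations
from statistics import correlation  # requires Python 3.10+

subscores = {
    "pronunciation":   [3, 4, 2, 5, 4, 3],
    "fluency":         [3, 5, 2, 5, 4, 2],
    "grammar":         [4, 4, 2, 5, 5, 3],
    "vocabulary":      [3, 4, 1, 5, 4, 3],
    "sociolinguistic": [2, 4, 2, 4, 4, 3],
}

for (name_a, a), (name_b, b) in combinations(subscores.items(), 2):
    print(f"{name_a} ~ {name_b}: r = {correlation(a, b):.2f}")
```

Uniformly low intercorrelations would be a warning that the subscores may not be measuring one underlying trait; a proper factor analysis would then be the next step.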

10. FACTORS INFLUENCING VALIDITY
Many factors tend to influence the validity of test scores. Some factors to which the test user should pay close attention are discussed below.

10.1. Directions
Directions of the test should be quite clear and simple to ensure that the testees understand what they are expected to do. For instance, the examinees should be informed whether they have to answer on answer sheets or on the test papers. Disregarding this point will reduce the validity of test scores, because the obtained scores may fluctuate due to these elements rather than the examinees' real ability.

10.2. Difficulty Level of the Test
Too easy or too difficult items will jeopardize test validity because such items are either below or above the ability level of the testees, and thus they will not measure the testees' real ability.

10.3. Sample Truncation
Sample truncation, or artificial restriction of the range of ability represented in the examinees, will result in the underestimation of both reliability and validity.

10.4. Structure of Items
Poorly constructed and/or ambiguous items will contribute to the invalidity of the test because such items do not allow testees to perform to their potential. If a test taker misses a poorly constructed item, it will not be known whether he missed the item because he did not know the correct response or because he simply did not understand it.

10.5. Arrangement of Items and Correct Responses
Test items are usually arranged in the order of difficulty. That is, a test typically starts with easy items and progresses toward difficult ones. Furthermore, item responses are usually arranged randomly to avoid any identifiable pattern for the correct response.

11. THE RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY
Although reliability and validity were discussed under separate topics, in most cases they are closely inter-related. For most kinds of validity, reliability is a necessary but not sufficient condition. Stated in another way, it is possible for a test to be reliable without being valid for a specified purpose. For example, if a particular test on mathematics produces consistent results, it is not, by any means, a valid measure of one's language ability, even if it is a valid measure of mathematics in a particular context. On the other hand, it is not possible for a test to be valid without first being reliable. That is, if a test does not measure something consistently, it follows that it cannot always be measuring it accurately. In sum,
• If a test is reliable, it may or may not be valid.
• If a test demonstrates a certain degree of validity, it is to some extent reliable.
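The asymmetry described above can be simulated: a test that taps a stable but irrelevant attribute will show a high test-retest correlation (reliability) yet essentially no correlation with the trait of interest (validity). A toy simulation, with all numbers invented:

```python
# A toy simulation (not from the book) of a test that is reliable but not
# valid: two administrations agree with each other (high test-retest
# correlation), yet neither correlates with the trait we claim to measure.
import random
from statistics import correlation  # requires Python 3.10+

random.seed(1)
trait = [random.gauss(50, 10) for _ in range(200)]       # the ability we want to measure
irrelevant = [random.gauss(0, 10) for _ in range(200)]   # what the test actually taps
form_a = [x + random.gauss(0, 1) for x in irrelevant]    # administration 1
form_b = [x + random.gauss(0, 1) for x in irrelevant]    # administration 2

print("reliability (A vs B):", round(correlation(form_a, form_b), 2))   # high
print("validity (A vs trait):", round(correlation(form_a, trait), 2))   # near zero
```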

12. PRACTICALITY
The practicality issues that will be addressed have to do with physically putting tests into place in a program. Generally, practicality refers to the ease of administration, ease of scoring, and ease of interpretation and application. In addition to these three major considerations, some other factors, such as the cost of testing and the availability of comparable forms, contribute to the practicality of a test.

12.1. Ease of Test Administration
The degree to which a test is easy to administer will depend on the clarity and simplicity of directions. This is the most significant aspect, specifically in foreign language situations with low proficiency level students. Also, the amount of time required for a test is important, because sitting for a test for a long time will tire the students. The number of subtests involved is still a third issue: the fewer the number of subtests, the easier the test administration. Other factors contributing to difficulty of administration are the amount of equipment and materials required to administer the test and the amount of guidance that the students need during the test.

12.2. Ease of Test Scoring
Since scoring is difficult and time consuming, the trend is toward objectivity, simplicity, and machine scorability. Ease of scoring seems to be inversely related to the ease of constructing a test type. In other words, the easiest types of tests to construct initially (composition, translation) are usually the most difficult to score and the least objective, while those test types which are more difficult to construct initially (multiple-choice) are usually the easiest to score and the most objective.

12.3. Ease of Interpretation and Application

The purpose of a test is to make decisions on a certain aspect of the test taker's life. The most crucial point about a test is the meaningfulness of the scores obtained from that test. We should decide in advance what it means to get a score of X, how clear-cut a decision can be made, etc. In a nutshell, the scores should make sense to us so that we can use them appropriately and interpret them effectively.

12.4. Ease of Test Construction
Special considerations with regard to test construction can range from deciding how long the test should be to considering what types of questions to use. Naturally, the more the number of items, the higher the reliability; however, the construction of many test items puts a burden on teachers' shoulders. Thus, one goal of many test development projects is to find the 'happy medium', that is, the shortest test length that does a consistent and accurate job of testing the students. Another test construction issue involves the degree to which different types of tests are easy or difficult to produce. Some test types, for instance a composition test, are relatively easy to construct; others, like a cloze test, may be more demanding.

12.5. The Cost Issue (or Economy)
It is the possibility of obtaining a relatively large amount of information in a short period of time and without an inordinate amount of energy expended by the instructor and students.

13. EXTRA POINTS TO REMEMBER
13.1. Validity depends on the purpose of the test. By changing the purpose of the test, validity can completely disappear; i.e. a test may be quite valid for one purpose but not as valid for another. For example, a diagnostic test has a particular content and function which serves validly to reveal learners' strengths and weaknesses. However, the same valid test would lose its validity if it is used to measure learners' global knowledge of language.
13.2. Even though we often speak of the reliability of a given test, strictly speaking, reliability refers to the test scores, and not the test itself. Furthermore, since reliability will be a function not only of the test, but of the performance of the individuals who take the test, any given estimate of reliability based on the CTS model is limited to the sample of test scores upon which it is based.
13.3. None of the measures of internal consistency should be used with speed tests, since a falsely high estimate of reliability will result. Test-retest or parallel forms are the methods best adapted to the measurement of speed-test reliability.
13.4. Harris also reports two other sources which affect test scores: the coaching effect and the test compromise effect.
• Coaching effect is the effect on test scores of 'teaching to the test'. The term teaching to the test implies doing something in teaching that may not be compatible with teachers' own values and goals, or with the values and goals of the instructional program. Coaching can be defined as short-term instruction in test wiseness and in answering types of questions similar to those appearing on the target examination.
• Test compromise effect is the acquisition of prior knowledge of test content. Tests and test answers may be acquired by unauthorized persons in a number of ways: the test books or scoring keys may be stolen, or each of a group of examinees may memorize a certain number of items according to a prearranged plan so as to reconstruct the test.
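The claim in 12.4 and 13.2 that reliability grows with test length (for a given sample of scores) is usually quantified with the Spearman-Brown prophecy formula, which several of the exam items below also rely on. A small sketch with illustrative numbers:

```python
# Spearman-Brown prophecy formula: predicted reliability when a test's
# length is multiplied by a factor k. The sample numbers mirror typical
# exam items (e.g., doubling a test whose reliability is 0.40).
def spearman_brown(r, k):
    """r: current reliability; k: factor by which test length changes."""
    return (k * r) / (1 + (k - 1) * r)

print(round(spearman_brown(0.40, 2), 2))   # doubling: 0.8 / 1.4 = 0.57
print(round(spearman_brown(0.95, 2), 3))   # whole test from halves of 0.95: 0.974
```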


State University Questions

1- Standard Error of Measurement (the standard deviation of the error scores) is an index of ---------. (State University, 81)
1) test validity  2) test reliability  3) the length of the test  4) level of difficulty of the test

2- When a test is capable of measuring certain specific characteristics in accordance with a theory of language behavior and learning, it is said to have --------- validity. (State University, 81)
1) empirical  2) content  3) construct  4) predictive

3- The main disadvantage of the --------- method of estimating reliability is developing a test with homogeneous items. (State University, 82)
1) KR-21  2) split-half  3) test-retest  4) parallel-forms

4- The validity obtained by comparing the results of two tests is called --------- validity. (State University, 82)
1) face  2) empirical  3) construct  4) content

5- One of the factors affecting the validity of a test is ---------. (State University, 82)
1) the length of the test  2) the content of the test  3) the scoring procedure  4) the administration procedure

6- The type of validity obtained as a result of comparing the results of a test with a criterion measure is called --------- validity. (State University, 83)
1) content  2) construct  3) empirical  4) predictive

7- The formula $r_{total} = \frac{2(r_{half})}{1 + r_{half}}$ should be used with the --------- method of measuring reliability. (State University, 83)
1) split-half  2) test-retest  3) parallel-forms  4) rational equivalence

8- Achievement tests and proficiency tests should be primarily validated as to their --------- and --------- validity, respectively. (State University, 84)
1) construct, content  2) concurrent, face  3) content, criterion-related  4) criterion-related, construct

9- Empirical validity ---------. (State University, 84)
1) is based on two criterion tests  2) draws on the correlation between two tests  3) evaluates tests in terms of their construct  4) should be used with a time gap between two administrations

10- To increase the reliability of a 50-item test from 0.40 to about ---------, the length should be doubled. (State University, 84)
1) 0.62  2) 0.57  3) 0.67  4) 0.72

11- All of the following are among the uses of the correlation coefficient EXCEPT ---------. (State University, 84)
1) finding a less time-consuming test  2) measuring the content validity of tests  3) deciding if two tests measure the same construct  4) eliminating sub-tests from a battery

12- To calculate the mark/remark reliability, the test ---------. (State University, 84)
1) should be divided into two halves  2) is scored by two raters  3) is administered on different occasions  4) should reflect progress between two administrations

13- If the V of a test is 4 and its reliability is 0.64, the limits of the score of a testee who obtained 15 will be --------- within one standard error. (State University, 84)
1) 10.2 - 19.8  2) 11.8 - 18.2  3) 13.4 - 16.6  4) 13.8 - 16.2

14- The application of the rational-equivalence method requires ---------. (State University, 84)
1) the homogeneity of items  2) the equality of testees' proficiency levels  3) the measurement of various traits in a test  4) the use of correlational procedure

15- Content validity refers to ---------. (State University, 85)
1) the psychological reality of what is being tested  2) whether or not a test gives coverage to all aspects of a given area  3) the relationship between the target test and an already established measure in the same area  4) whether or not a test can measure the extent to which good students perform differently from poor ones

16- If the reliability of the split halves of a test is .95, the reliability of the whole test will be about --------- percent. (State University, 85)
1) 96.5  2) 97  3) 97.5  4) 98

17- All of the following refer to a validity EXCEPT ---------. (State University, 85)
1) concurrent  2) statistical  3) internal  4) predictive

18- The homogeneity of test items has a direct effect on ---------. (State University, 85)
1) reliability  2) test objectivity  3) the error score  4) item difficulty

19- It is true that ---------. (State University, 85)
1) face validity refers to the technical sense of validity  2) as the length of a test increases, so does its reliability  3) content validity is a subcategory of criterion-related validity  4) there is a positive correlation between test difficulty and its reliability

20- The content validity of a test ---------. (State University, 86)
1) is either concurrent or predictive  2) is enhanced by drawing up a table of test specifications  3) should be measured against a theory of language learning  4) constitutes the most important type of validity in proficiency exams

21- The reliability of a test ---------. (State University, 86)
1) depends on the washback effect of the test  2) can be calculated through the rational equivalence  3) decreases when the variance of the true score approaches that of the observed score  4) improves in case test items are heterogeneous

22- If the reliability and variance of a test are 0.36 and 4, we can be 95% sure that 15 as a score lies between ---------. (State University, 86)
1) 14.28 and 15.72  2) 13.4 and 16.6  3) 13.56 and 16.44  4) 11.8 and 18.2


23- An 11-item test taken by 20 students has the mean of 7 and standard deviation of 2; the reliability of the test would be ---------. (State University, 86)
1) 0.15  2) 0.30  3) 0.40  4) 0.70

24- The Spearman-Brown prophecy formula is employed to ---------. (State University, 86)
1) measure reliability when internal consistency is important  2) find the causal relationship between two variables  3) calculate reliability when a test is re-administered  4) compensate for the reduced test size in the split-half method

25- Inter-rater reliability ---------. (State University, 86)
1) can best be measured by means of KR-21  2) is a major concern in subjective tests  3) is inherent to reading comprehension tests  4) can best be guaranteed through the test-retest method

26- If r = 0.60, N = 20, and k = 10, then $r_{total}$ will be ---------. (State University, 87)
1) 0.72  2) 0.75  3) 0.82  4) 0.85

27- To ensure content validity, we sometimes need to ---------, which could lead to the underestimation of reliability. (State University, 87)
1) increase item difficulty  2) include a variety of items  3) reduce the length of the test  4) increase the number of items

28- It is usually through experts' judgment rather than numerical values that we establish the --------- validity of a test. (State University, 87)
1) content  2) concurrent  3) face  4) predictive

29- An assumption underlying the use of KR-21 is that ---------. (State University, 87)
1) test scores are nominal  2) test items measure the same trait  3) the test consists of multiple-choice items  4) test items are different in terms of item facility

30- The construct validity of a test is concerned with ---------. (State University, 87)
1) the relationship between the test and a standardized test  2) the relationship between the test and course objectives  3) the extent to which the test measures a theoretical trait  4) the degree to which the test is developed in terms of test construction principles

31- The size of a test is a factor particularly underestimating reliability when measured through the --------- method. (State University, 87)
1) KR-21  2) split-half  3) test-retest  4) parallel-form

32- If you advise someone who is, say, going to take the TOEFL to review several sample tests before the test day, you think that "---------" can improve test scores. (State University, 88)
1) coaching  2) test-wiseness  3) test method  4) test item bias

33- The more reliable the test, the less likely the individual's estimated --------- score is to --------- the mean of the distribution. (State University, 88)
1) true - regress toward  2) error - regress toward  3) true - move away from  4) error - move away from

34- In KR-21, a formula for the estimation of test reliability, ---------. (State University, 88)
1) test variance is taken into account  2) the number of participants is the key factor  3) test-retest results in the highest reliability coefficient  4) the correlation between two measures is calculated

35- When the learners' scores on a test are correlated with their performance on some important task at some future point, the focal point is ---------. (State University, 88)
1) power tests  2) additive marking  3) predictive validity  4) test-retest reliability

36- The application of the generalizability theory enables the test taker to ---------. (State University, 88)
1) find out if the trait being tested has construct validity  2) study the effect of different sources of variance in test scores  3) identify the best test method to be used to elicit reliable data  4) measure the relevance of the purpose for which the test is being used

37- The underlying assumption in the definition of reliability as a correlation between two parallel tests is that ---------. (State University, 88)
1) all factors systematically affecting test scores are eliminated  2) the observed variance and the true score variance are identical  3) the observed scores on the two tests are experimentally independent  4) all measurement error can be determined and taken into account

38- Which of the following is part of pragmatic competence? (State University, 88)
1) Textual  2) Grammatical  3) Organizational  4) Illocutionary

39- Which one of the following statements is TRUE? (State University, 88)
1) Reliability is the function of test score variance.  2) The length of a given test does not influence test score consistency.  3) The less homogeneous the test items in a test, the more consistent scores it will produce.  4) When the testees' ability varies greatly in regard to a given attribute, the reliability will be underestimated.

40- If a translation paper is scored by one teacher, the main concern will be ---------. (State University, 89)
1) intra-rater reliability  2) face validity  3) test administration reliability  4) criterion-referenced validity

41- It is NOT true that empirical validity ---------. (State University, 89)
1) depends on the reliability of both the test and criterion measure  2) is of predictive and concurrent kinds  3) requires a careful analysis of the skill being tested  4) is usually estimated and expressed in terms of coefficients of correlation

42- Which of the following statements is FALSE? (State University, 89)
1) A test that has a small SEM is more consistent than one with a large SEM.  2) A low standard deviation is a shortcoming in NRTs.  3) Correlational strategies are inappropriate for estimating the reliability of CRTs.  4) A person's true score is within ±1 SEM of their observed score 95% of the time.

43- If a test of vocabulary knowledge has a variance of 16 and reliability of 0.75, the true score of a test taker with the raw score of 15 will fluctuate 68% of the time between ---------. (State University, 89)
1) 13 and 17  2) 11 and 19  3) 14 and 16  4) 13.5 and 16.5

44- Which of the following statements is FALSE? (State University, 89)
1) Reliability is equal to the ratio of true score variance to observed score variance.  2) A correlation of -1 indicates that there is no relationship between the two sets of scores.  3) The split-half method for estimating reliability measures the internal consistency of the test scores.  4) When a distribution is not normal, the median is more meaningful than the mean.

45- The magnitude of reliability will equal +1 when ---------. (State University, 89)
1) the true score is greater than the observed score  2) the estimate of the true score approximates its real value  3) systematic variation is greater than error variance  4) there is no unsystematic variation in measurement

46- When reliability is defined as the correlation between parallel tests, the assumption is that ---------. (State University, 90)
1) there are no random error scores on either test  2) the error scores on one test cancel out those on the other  3) the observed scores on the two tests are experimentally independent  4) the error scores on each test are correlated with the true scores

47- A test aiming to measure examinees' knowledge of how utterances are knitted together to form texts is concerned with an area of knowledge known as ---------. (State University, 90)
1) ideational  2) organizational  3) manipulative  4) regulatory

48- A correlation coefficient obtained by comparing the performance of some candidates on a well-established test and a newly developed one is an indication of ---------. (State University, 90)
1) face validity  2) empirical validity  3) construct validity  4) content validity

49- The reliability of a test calculated via the split-half method is ---------. (State University, 90)
1) a very rigorous estimation of reliability  2) always higher than that of the whole test  3) used less often due to practical considerations  4) always lower than that of the whole test

50- If we need some indicator of how much we expect an individual test score to vary, given a particular level of reliability, the best indicator is ---------. (State University, 90)
1) common variance  2) the generalizability coefficient  3) the ratio of true variance to observed score variance  4) the standard error of measurement

51- The unreliability of a multiple-choice test has two origins: ---------. (State University, 91)
1) the test format and scoring  2) intra-rater and inter-rater inconsistencies  3) the content of items and the number of items  4) the features of the test and the administration condition

52- Standard error of measurement ---------. (State University, 91)
1) ranges between -1 and +1  2) is the same as standard deviation squared  3) is zero in multiple-choice tests  4) has a negative relationship with reliability

53- If one rater gives different scores on two occasions, what is a matter of concern is ---------. (State University, 91)
1) concurrent validity  2) predictive validity  3) intra-rater reliability  4) inter-rater reliability

54- The term "construct" in construct validity refers to ---------. (State University, 91)
1) any underlying trait hypothesized in a theory of language ability  2) the four language skills and their components  3) the correlation between theory and practice  4) theories of second language learning and teaching

55- Scorer reliability ---------. (State University, 92)
1) is almost always higher than test reliability  2) should be applied to objective tests  3) is needed when the construct is defined theoretically  4) is most likely to fall between 0.5 and 1

56- Rasch analysis has all of the following characteristics EXCEPT ---------. (State University, 92)
1) identifying test takers whose behavior does not fit the model  2) determining items that do not belong in the test  3) being an effective type of classical item analysis  4) being based on item response theory

57- Traditional reliability measures are NOT appropriate to estimate the reliability of CRTs because these tests ---------. (State University, 92)
1) are based on different measurement scales  2) do not aim to generate variability  3) cannot discriminate among extreme scores  4) are compared with an ideal criterion

58- Which of the following statements is NOT correct about the index of correlation between two tests as an estimate of validity when it is squared? (State University, 92)
1) It indicates the relationship between the validity and reliability of the two tests.  2) It indicates the extent to which the two tests are reliable.  3) It indicates the common variance between the two tests.  4) It indicates the extent to which the two tests are valid.

59- A test of writing which requires candidates to write the translation equivalents of 100 words in their own language --------- test of writing. (State University, 93)
1) is both a reliable and valid  2) is neither a reliable nor a valid  3) might be a valid but not a reliable  4) might be a reliable but not a valid

60- Comparing the results of a test with a teacher's ratings obtained later suggests that we are interested in --------- validity of the test. (State University, 93)
1) concurrent  2) predictive  3) construct  4) content

61- Which of the following guidelines is NOT related to promoting test reliability? (State University, 93)
1) Not allowing candidates too much freedom.  2) Taking enough samples of behavior.  3) Providing a detailed scoring key.  4) Using direct testing techniques.

62- If the reliability and standard deviation of a test are 0.64 and 10 respectively, then the standard error of measurement would be ---------. (State University, 93)
1) 7  2) 4  3) 6  4) 8

63- Suppose that a test has a standard deviation of 3 and a candidate scores 14 on this test. In this case, we can be 68 percent sure that this person's true score lies between ---------. (State University, 93)
1) 11-14  2) 11-17  3) 14-17  4) 14-20

64- To use the KR-21 formula, one needs to know about ---------. (State University, 93)
1) the mean score and the standard deviation  2) the number of items and the standard deviation  3) the number of items, the mean score, and the variance  4) the number of candidates, the mean score, and the variance

65- What would be the standard error of measurement for a test with a reliability index of 0.75 and a standard deviation of 10? (State University, 94)
1) 25  2) 5  3) 75  4) 50

66- When a single teacher has unclear criteria, fatigue, or bias, the main concern is ---------. (State University, 94)
1) intra-rater reliability  2) inter-rater reliability  3) test reliability  4) test-retest reliability

67- Which of the following validation procedures is needed to ensure that the items in a reading test are really measuring the sub-skills of reading ability? (State University, 94)
1) Concurrent validation  2) Predictive validation  3) Construct validation  4) Content validation

68- Making sure that sound amplification is clearly audible to every test-taker in the room is an attempt toward achieving ---------. (State University, 94)
1) reliability  2) authenticity  3) validity  4) practicality

69- Which of the following is the concept of scorer reliability related to? (State University, 94)
1) Test/re-test reliability  2) Mark/re-mark reliability  3) Backwash effect  4) Profile reporting

70- A test is more reliable in all of the following ways EXCEPT ---------. (State University, 94)
1) having scores dispersed  2) making the test longer  3) administering the test to heterogeneous students  4) assessing different language materials

71- Which of the following is NOT needed to estimate the reliability of a test through the KR-21 formula? (State University, 94)
1) Variance  2) Mean score  3) Number of test-takers  4) Number of test items

72- What does the measurement assumption that test scores are unidimensional mean? (State University, 95)
1) An individual's expected performance on an item is a function of the individual's level of ability and the level of difficulty of the item.  2) An individual's response to a given test item does not depend upon how he responds to other items that are of equal difficulty.  3) Test scores can be used to relate an individual's test performance to that individual's level of ability.  4) The parts or items of a given test all measure the same single ability.

73- Which of the following statements is NOT true about systematic measurement error? (State University, 95)
1) Its general and specific effects tend to increase the generalizability of test scores.  2) It increases estimates of reliability but may decrease validity.  3) Systematic errors tend to be correlated across measures.  4) It introduces bias into our measures.

74- What is the major limitation of content validity as the sole basis for validity? (State University, 95)
1) It is difficult to demonstrate that the contents of a test accurately represent a given domain of ability.  2) It affects the performance of different test takers in the same way.  3) It will vary across different groups of examinees.  4) It focuses on tests, rather than test scores.

75- What is construct validation research mainly concerned with? (State University, 95)
1) Estimating the ability levels of test takers and the characteristics of test items  2) Eliciting language test performance that is characteristic of language performance in non-test situations  3) The relationships between performance on language tests and the abilities that underlie this performance  4) A conceptual framework and a set of procedures for examining different sources of error

76- A classroom test designed to assess mastery of a point of grammar in communicative use will have criterion validity if ---------. (State University, 95)
1) test scores are at least 1 standard deviation above the mean  2) it predicts students' subsequent performance in the course accurately  3) it requires the test-takers to use the grammar point in question to perform certain communicative tasks  4) test scores are corroborated by other communicative measures of the grammar point in question

77- Which of the following is a problem with the classical true score theory (CTS)? (State University, 95)
1) It considers all errors to be systematic.  2) It treats error variance as homogeneous in origin.  3) It distinguishes systematic errors from random errors.  4) It defines reliability in terms of true score variance.

78- Compared to the K-R 21, K-R 20 ---------. (State University, 95)
1) is less difficult to calculate  2) is a less accurate estimate of reliability  3) requires multiple administrations of a test  4) avoids the problem of underestimating reliability

79- Think-aloud and retrospection are two principal ways of obtaining evidence about ---------. (State University, 96)
1) concurrent validity  2) content validity  3) construct validity  4) predictive validity

80- What is the relationship between reliability and standard error of measurement? (State University, 96)
1) The lower the standard error of measurement, the lower the reliability index.  2) The higher the standard error of measurement, the higher the reliability index.  3) A very low standard error of measurement indicates the test is not reliable at all.  4) The lower the standard error of measurement, the higher the reliability index.


81- All of the following are guidelines intended to make tests more reliable EXCEPT ---------. (State University, 96)
1) offering candidates a choice of questions and allowing them freedom to choose from  2) excluding items that do not discriminate well between poor and strong candidates  3) constructing items that permit scoring which is as objective as possible  4) providing uniform and non-distracting conditions of administration

82- Empirical validity is obtained on the basis of comparing the results of the test with the results of all the following criteria EXCEPT ---------. (State University, 96)
1) the teacher's ratings given some time later  2) the teacher's ratings given at the same time  3) the candidates' subsequent performance on a certain task measured by a valid test  4) an existing test which may not necessarily be valid administered at the same time

83- Which parameters are needed to use the Kuder-Richardson 21 formula? (State University, 97)
1) The number of items, the mean score, and the range  2) The number of candidates, the mean score, and the variance  3) The number of items, the mean score, and the variance  4) The number of candidates, the number of items, and the range

84- All of the following statements are true about face validity EXCEPT ---------. (State University, 97)
1) it is the degree to which a test looks right  2) it is ensured by selection of relevant content  3) it is often evaluated based on the subjective judgment of the test-taker  4) it is not something to be empirically evaluated by a testing expert

85- A test is said to be practical when it ---------. (State University, 97)
1) treats all test-takers the same  2) has a positive backwash effect  3) measures what is supposed to measure  4) has a specific and time-efficient scoring procedure

86- A test consisting of chiefly multiple-choice items of grammar and vocabulary administered at the end of a communicative course will lack --------- validity. (State University, 97)
1) construct  2) concurrent  3) consequential  4) predictive

State University Answers

1. Choice 2
Refer to section 6.
2. Choice 3
Refer to section 9.4.
3. Choice 2
Though both split-half and KR-21 are methods of internal consistency and are based on the assumption of homogeneity of items, the split-half method has undue reliance on this assumption.
4. Choice 2
Refer to section 9.3.
5. Choice 2
Content validity refers to the degree of correspondence between the test content and the content of the materials to be tested.
6. Choices 3 & 4
Criterion-related validity, which is also called empirical validity, is of two types: predictive and concurrent.
7. Choice 1
Refer to section 4.3.1.
8. Choice 3
In an achievement test, the test constructor must ensure that all the materials of the course are included in the test, and thus content validity is important. On the other hand, since language proficiency is a construct, the test constructor is concerned with construct validity. Also, because a proficiency test should provide results similar to a standard proficiency test, criterion-related validity is of importance. But this type of validity is not a concern of achievement tests.
9. Choice 2
Refer to section 9.3.
10. Choice 2
Since the stem asks for the reliability when the number of items is doubled, k = 2:
$r_k = \frac{k \cdot r}{1+(k-1)r} = \frac{2 \times 0.4}{1+0.4} = \frac{0.8}{1.4} \approx 0.57$

11. Choice 2
Among the three types of validity, only through criterion-related validity are numerical values obtained.
12. Choice ?
All the choices are false. Refer to section 5.4.
13. Choice 4
The SEM of the test is 1.2:
$SEM = S_x\sqrt{1-r} = 2 \times \sqrt{1-0.64} = 2 \times \sqrt{0.36} = 2 \times 0.6 = 1.2$
Since the stem asks for one SEM: $15 - 1.2 = 13.8$ and $15 + 1.2 = 16.2$.
14. Choice 1
The main idea behind the internal-consistency type of reliability is that the items comprising a test are homogeneous, and we know that rational equivalence is a method of internal consistency.
15. Choice 2
Refer to section 9.1.
16. Choice 2
$r_{total} = \frac{2(r_{half})}{1+r_{half}} = \frac{2 \times 0.95}{1+0.95} = \frac{1.9}{1.95} \approx 0.974$, i.e. about 97 percent.
17. Choice 3
Refer to section 9.
18. Choice 1
Refer to section 5.2.
19. Choice 2
Refer to section 5.2.
20. Choice 2
Refer to section 9.1.
21. Choice 2
Refer to section 4.3.3. When the variance of the true score approaches that of the observed score, reliability increases.
22. Choice 4
In the case of 95% we should add and subtract two SEMs:
$SEM = S_x\sqrt{1-r} = 2 \times \sqrt{1-0.36} = 2 \times \sqrt{0.64} = 2 \times 0.8 = 1.6$
Then: $15 - 2(1.6) = 11.8$ and $15 + 2(1.6) = 18.2$.
23. Choice 3
To estimate the reliability of this test we need to use the KR-21 method:
$r = \frac{K}{K-1}\left(1 - \frac{\bar{X}(K-\bar{X})}{K \cdot V}\right) = \frac{11}{10}\left(1 - \frac{7 \times (11-7)}{11 \times 4}\right) = 1.1\left(1 - \frac{28}{44}\right) = 0.4$
24. Choice 4
Refer to section 4.3.1.
25. Choice 2
26. Choice 2
$r_{total} = \frac{2 \times r_{half}}{1+r_{half}} = \frac{2 \times 0.6}{1+0.6} = \frac{1.2}{1.6} = 0.75$
27. Choice 2
Validity refers to the correspondence between test content and course objectives. By including a variety of items we ensure that most or all of the course objectives are included in the test, and thus content validity increases.
28. Choice 1
Refer to section 9.1.
29. Choice 2
Refer to section 4.3.
30. Choice 3
Refer to section 9.4.
31. Choice 2
By dividing the length of the test into two halves and calculating the correlation between the two halves, we obtain an estimate of the test scores' reliability. This correlation is the reliability of one half of the test, not the total test. To compensate for the loss in the degree of reliability due to dividing the test into halves, the Spearman-Brown prophecy formula is used. Refer to section 5.5.
32. Choice 2
33. Choice 1
The true score reflects what the test is validly measuring and tends to be stable from one measurement to another. The error component includes extraneous factors which are nonsystematic. These can include lack of concentration, inappropriate testing conditions, or improper question wording. This component is the part that averages out from one measurement to another. What this means is that an individual who scored extremely high on the first test is likely to score lower on a second test simply because the error component is likely to be less extreme and closer to the average. Thus, the more reliable the test, the less likely regression of the true score toward the mean will occur.
34. Choice 1
Refer to section 5.4.
35. Choice 3
Refer to section 9.3.
36. Choice 2
Refer to section 7.1.
37. Choice 3
In parallel tests the true score on one test is equal to the true score on the other test. If we want the two tests to be highly parallel (observed scores to be highly correlated), the effect of error scores should be minimized. This said, we can derive a definition of reliability as the correlation between the observed scores on two parallel tests. Fundamental to this definition is the assumption that the observed scores on the two tests are experimentally independent. That is, an individual's performance on the second test should not depend on how she performs on the first. Also, refer to Bachman (1990), p. 169.
38. Choice 4
Refer to section 3.
39. Choice 1
Reliability is expressed as the ratio of the variance of true scores to the variance of observed scores.
40. Choice 1
Refer to section 5.4.
41. Choice 3
42. Choice 4
You should remember that 68% of the time a person's true score is within ±1 SEM of his observed score.
43. Choice 1
$SEM = S_x\sqrt{1-r} = 4 \times \sqrt{1-0.75} = 4 \times \sqrt{0.25} = 4 \times 0.5 = 2$
Thus: $15 - 2 = 13$ and $15 + 2 = 17$.
44. Choice 2
A correlation of -1 or +1 indicates a perfect correlation, while a correlation of 0 indicates no relationship.
45. Choice 4
$r = \frac{V_t}{V_x} = \frac{V_x - V_e}{V_x}$
When there is no error variance, i.e. when $V_t = V_x$, reliability equals one: $r = \frac{V_x}{V_x} = 1$.
46. Choice 3
Refer to item 37.
47. Choice 2
Refer to section 3.
48. Choice 2
Refer to section 9.3.
49. Choice 4
One of the factors affecting the reliability of a test is the number of items. Thus, when the test constructor calculates the reliability of the whole test from the reliability of a half, this index definitely increases.
50. Choice 4
Refer to section 6.
51. Choice 4
52. Choice 4
According to $SEM = S_x\sqrt{1-r}$, as reliability increases, SEM decreases and vice versa. Therefore, there is a negative relationship between SEM and reliability.
53. Choice 3
Refer to section 5.4.
54. Choice 1
Refer to section 9.4.
55. Choice 1
Refer to section 5.4.
56. Choice 3
Rasch analysis is a model for estimating the reliability of a test and is irrelevant to classical item analysis.
57. Choice 2
Refer to section 8.
58. Choice 2
59. Choice 4
Since the results of such an almost objective test wouldn't change much, reliability is expected to be high. On the other hand, this test partially measures a testee's writing ability and leaves out other components such as coherence, cohesion, topic development, mechanics, etc.; thus such a test lacks validity.
60. Choice 2
Refer to section 9.3.
61. Choice 4
Direct tests, which are mainly subjective, tend to be unreliable compared to indirect tests, which are mainly objective and tend to be reliable.
62. Choice 3
$SEM = S_x\sqrt{1-r} = 10 \times \sqrt{1-0.64} = 10 \times 0.6 = 6$
63. Choice 2
To find the person's true score we need the SEM of the scores, but the stem doesn't provide this piece of information. Therefore, strictly speaking, there is no answer to this item. However, it seems that since the item asks for 68%, the test constructor meant: $14 - 3 = 11$ and $14 + 3 = 17$.
64. Choice 3
Refer to section 4.3.3.
65. Choice 2
$SEM = S_x\sqrt{1-r} = 10 \times \sqrt{1-0.75} = 10 \times 0.5 = 5$
66. Choice 1
In the case of a single teacher, the concern is intra-rater reliability.
67. Choice 3
When the test constructor is concerned with ensuring that all test items measure the sub-skills of reading comprehension, he is attempting to construct a test which actualizes the trait of reading in an appropriate manner. Thus he is concerned with construct validity.
68. Choice 1
The variation created as a result of poor test administration is unsystematic and contributes to test unreliability.
69. Choice 2
Mark/re-mark reliability is another name for intra-rater reliability.
70. Choice 4
When a test measures different language materials, the items are heterogeneous. This lowers the reliability index.
71. Choice 3
According to the KR-21 formula, knowing the number of test-takers is unnecessary:
$KR\text{-}21 = \frac{K}{K-1}\left(1 - \frac{\bar{X}(K-\bar{X})}{K \cdot V}\right)$
72. Choice 4
Refer to section 4.3.
73. Choice 1
Two different effects are associated with systematic error: a general effect and a specific effect. The general effect of systematic error is constant for all observations; it affects the scores of all individuals who take the test. The specific effect varies across individuals; it affects different individuals differentially. Suppose a situation in which every form of a reading comprehension test contained passages from the area of, say, economics. We can say passage content is the same for all tests, and so will have a general effect on all test scores. If individuals differ in their familiarity with economics, this systematic error will have different specific effects for different individuals, increasing some scores and lowering others. In terms of variance components, the general effect of systematic error is that associated with a main effect, while the specific effect is associated with an interaction between persons and the facet. This effect in turn can be distinguished from random error variance, which is associated with residual, or unpredicted, variance. Both the general and the specific effects of systematic error limit the generalizability of test scores as indicators of universe scores. Thus, we would be reluctant to make inferences about reading comprehension in general on the basis of scores from tests containing passages only from the field of economics. We might be willing to accept such scores for students of economics by redefining the ability we are interested in measuring to reading comprehension of passages in economics.
74. Choice 1
The major limitation of content relevance as the sole basis for validity, however, is that demonstrating that the contents of a test accurately represent a given domain of ability does not take into consideration, or account for, how individuals actually perform on the test. Even though the content of a given test does not vary across different groups of individuals to which it is given, the performance of these individuals may vary considerably, and the interpretation of test scores will vary accordingly. Content validity is a test characteristic. It will not vary across different groups of examinees; however, the validity of test score interpretation will vary from one situation to another. Furthermore, the score of an individual on a given test is not likely to be equally valid for different uses. For example, although a rating from an oral interview may be a valid indicator of speaking ability, it may not be valid as a measure of teaching ability. The primary limitation of content validity, then, is that it focuses on tests, rather than test scores.
75. Choice 3
Research into the relationships between performance on language tests and the abilities that underlie this performance (construct validation research) dates at least from the 1940s, with Carroll's pioneering work.
76. Choice 4
In the case of teacher-made classroom assessments, criterion-related evidence is best demonstrated through a comparison of results of an assessment with results of some other measure of the same criterion. For example, in a course unit whose objective is for students to be able to orally produce voiced and voiceless stops in all possible phonetic environments, the results of one teacher's unit test might be compared with an independent assessment (possibly a commercially produced test in a textbook) of the same phonemic proficiency. A classroom test designed to assess mastery of a point of grammar in communicative use will have criterion validity if test scores are corroborated either by observed subsequent behavior or by other communicative measures of the grammar point in question.
77. Choice 2
One problem with the CTS model is that it treats error variance as homogeneous in origin. Each of the estimates of reliability in CTS addresses one specific source of error, and treats other potential sources either as part of that source, or as true score. Equivalent forms estimates, for example, do not distinguish inconsistencies between the forms from inconsistencies within the forms. In estimating the equivalence of forms, therefore, any inconsistencies among the items in the forms themselves will lower our estimate of equivalence. Thus, if two tests turn out to have poor equivalence, the test user will not know whether this is because they are not equivalent or whether they are not internally consistent, unless he examines internal consistency separately.
78. Choice 4
Refer to section 4.3.2.
79. Choice 3
One way of obtaining evidence about the construct validity of a test is to investigate what test takers actually do when they respond to an item. Two principal methods are used to gather such information: think-aloud and retrospection. In the think-aloud method, test takers voice their thoughts as they respond to the item. In retrospection, they try to recollect what their thinking was as they responded. In both cases their thoughts are usually tape-recorded, although a questionnaire may be used for the latter. The problem with the think-aloud method is that the very voicing of thoughts may interfere with what would be the natural response to the item. The drawback to retrospection is that thoughts may be misremembered or forgotten. Despite these weaknesses, such research can give valuable insights into how items work.
80. Choice 4
Reliability and SEM have a negative relationship. Therefore, as one increases the other decreases.
81. Choice 1
Opposite to choice one, to increase reliability, the tester should not allow candidates too much freedom.
82. Choice 4
To be suitable as a criterion, the test should be highly valid.
83. Choice 3
Refer to section 4.3.2.
84. Choice 2
Choice two relates to content validity.
85. Choice 4
According to Brown, an effective test is practical. This means that it (a) is not excessively expensive, (b) stays within appropriate time constraints, (c) is relatively easy to administer, and (d) has a scoring/evaluation procedure that is specific and time-efficient.
86. Choice 1
The ability to communicate is more than knowledge of grammar and vocabulary. Other aspects like discourse competence, sociolinguistic competence, etc. should be included in a test of a communicative course, or else the test lacks construct validity.
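For readers who want to verify the arithmetic in the key above, the following snippet re-derives the SEM-based answers (items 13, 22, 43, 62, 65) and the KR-21 answer (item 23) from the chapter's formulas; it is a self-check, not part of the original key:

```python
# Re-deriving the numeric answers above from the chapter's formulas:
# SEM = S * sqrt(1 - r) and KR-21 = K/(K-1) * (1 - mean*(K-mean)/(K*variance)).
from math import sqrt

def sem(sd, r):
    """Standard error of measurement from standard deviation and reliability."""
    return sd * sqrt(1 - r)

def kr21(k, mean, variance):
    """KR-21 reliability estimate from number of items, mean, and variance."""
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))

print(sem(2, 0.64))    # item 13 -> 1.2, so 15 +/- 1.2 gives 13.8 to 16.2
print(sem(2, 0.36))    # item 22 -> 1.6, so 15 +/- 2*1.6 gives 11.8 to 18.2
print(sem(4, 0.75))    # item 43 -> 2.0, so 15 +/- 2 gives 13 to 17
print(sem(10, 0.64))   # item 62 -> 6.0
print(sem(10, 0.75))   # item 65 -> 5.0
print(round(kr21(11, 7, 4), 2))  # item 23 -> 0.4
```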

Azad University Questions

1- The content validity of --------- tests can easily be determined. (Azad University, 81)
1) achievement  2) proficiency  3) power  4) speed

2- (Azad University, 81)
2) to exactly determine the true scores  4) to standardize the raw scores

3- Unreliability is a matter of ---------.
1) systematic errors contributed to X  2) inconsistency in observed scores  3) repeated measurement  4) fluctuations in a given attribute

4- The changes in subjects' observed scores ---------.
1) are due to systematic variation  3) are due to standard error of measurement

5- Which of the following is most likely NOT true?
1) X = T  2) X > T  3) X < T

14- The --------- estimates the limits within which an individual's obtained score on a test is likely to diverge from his/her true score.