Language Testing (Summary of Chapters 1-6)


Chapter One: Preliminaries of Language Testing

1. WHY TESTING
In general, testing has the following purposes:
- Students, teachers, administrators, and parents want to ascertain the degree to which educational goals have been realized.
- Government and private-sector employers of the students are interested in having precise information about students' abilities.
- Most importantly, testing yields accurate information on the basis of which educational decisions are made.

Tests can benefit students in the following ways:
- Testing can create a positive attitude toward the class and motivate students to learn the subject matter.
- Testing can help students prepare themselves and thus learn the materials.

Testing can also benefit teachers:
- Testing helps teachers diagnose the effectiveness of their teaching efforts.
- Testing can also help teachers gain insight into ways to improve the evaluation process.

2. TEST, MEASUREMENT, EVALUATION
Measurement is the process of quantifying the characteristics of persons according to explicit procedures and rules. A test is an instrument, often connoting the presentation of a set of questions to be answered, used to obtain a measure of a characteristic of a person.
- Note: What distinguishes a test from other types of measurement is that it is designed to obtain a specific sample of behavior.
Evaluation has been defined in a variety of ways:
1) The process of delineating, obtaining, and providing useful information for judging decision alternatives.
2) The determination of the congruence between performance and objectives.
3) A process that allows one to make a judgment about the desirability or value of a measure.
- Note: It is important to point out that we never measure or evaluate people. We measure or evaluate characteristics or properties of people.

3. NORM-REFERENCED TESTS vs. CRITERION-REFERENCED TESTS
If we compare the score of a testee to the scores of other testees, this is norm referencing. If, however, we interpret a testee's performance by comparing it to some specific criterion, without concern for how other testees performed, this is criterion referencing. In criterion referencing, a testee usually passes the test only when he has given the right answer to all, or a specified number, of the test items.

| Characteristic | NRT | CRT |
| --- | --- | --- |
| Type of interpretation | Relative (a student's performance is compared to those of all other students, in percentile terms) | Absolute (a student's performance is compared only to the amount, or percentage, of material learned) |
| Type of measurement | To measure general language abilities or proficiencies | To measure specific, objectives-based language points |
| Distribution of scores | Normal distribution of scores around the mean | Varies; often non-normal (students who know the material should score 100%) |
| Purpose of testing | To spread students out along a continuum of general abilities or proficiencies | To assess the amount of material known or learned by each student |
| Test structure | A few relatively long subtests with a variety of item contents | A series of short, well-defined subtests with similar item contents |
| Knowledge of questions | Students have little or no idea of what content to expect in test items | Students know exactly what content to expect in test items |
| Missed items | When a great number of testees miss an item, it is eliminated from the test | When an item is missed by a great number of testees, the instructional materials are revised or additional work is given |

4. TEACHER-MADE TESTS VS. STANDARDIZED TESTS
A teacher-made test is a small-scale classroom test which is generally prepared, administered, and scored by one teacher. Standardized tests, on the other hand, are commercially prepared by skilled test-makers and measurement experts. They provide methods of obtaining samples of behavior under uniform procedures.

| Characteristic | Teacher-Made Test | Standardized Test |
| --- | --- | --- |
| Type of interpretation | Criterion-referencing | Norm-referencing |
| Directions for administration and scoring | Usually no uniform directions specified | Specific, culture-free directions every testee can understand; standardized administration and scoring procedures |
| Sampling of content | Both content and sampling are determined by the classroom teacher | Content determined by curriculum and subject-matter experts; involves extensive investigation of existing syllabi, textbooks, and programs; sampling of content done systematically |
| Construction | May be hurried and haphazard; often no test blueprints, item tryouts, item analysis, or revision; quality of the test may be quite poor | Meticulous construction procedures, including stating objectives, preparing test blueprints, and employing item tryouts, item analysis, and item revision |
| Norms | Only local classroom norms are available, i.e. norms determined by the school or a department | In addition to local norms, standardized tests typically make available national and school-district norms |
| Quality of items | Unknown; usually lower than standardized tests due to the limited time and skill of the teacher | High; written by specialists, pretested, and selected on the basis of effectiveness |
| Reliability | Unknown; usually high if carefully constructed | High |

5. WASHBACK
A facet of consequential validity is washback. Washback generally refers to the effects tests have on instruction in terms of how students prepare for the test.
- Note: 'Cram' courses and 'teaching to the test' are examples of such washback.
In classroom assessment, washback refers to the information that washes back to students in the form of useful diagnoses of strengths and weaknesses.
- Harmful washback is said to occur when the test content and testing techniques are at variance with the objectives of the course.
- Beneficial washback is said to result when a testing procedure that encourages good teaching practice is introduced.

6. TEST BIAS
A test or item can be considered biased if one particular section of the candidate population is advantaged or disadvantaged by some feature of the test or item which is not relevant to what is being measured.
- Fairness can be defined as the degree to which a test treats every student the same, or the degree to which it is impartial. Equitable treatment in terms of testing conditions, access to practice materials, performance feedback, retest opportunities, and other features of test administration, including reasonable accommodation for test takers with disabilities when appropriate, are important aspects of fairness under this perspective.

7. AUTHENTICITY
Authenticity is the degree of correspondence between the characteristics of a given language test task and the features of a target language task. In an authentic test:
- The language in the test is as natural as possible.
- Items are contextualized rather than isolated.
- Topics are meaningful for the learner.
- Some thematic organization of items is provided, such as through a story line or episode.
- Tasks represent, or closely approximate, real-world tasks.

Chapter Two: Language Test Functions

1. TWO MAJOR FUNCTIONS OF LANGUAGE TESTS

1.1. Evaluation of attainment tests
Attainment evaluation tests measure to what extent examinees have learned the intended skill, performance, knowledge, etc. in a given area.

1.1.1. Achievement tests
Such tests are related directly to classroom lessons, units, or even a total curriculum.
- General achievement tests are (standardized) tests which deal with a body of knowledge. Constructors of such tests rarely teach the students being tested. One example is a test measuring students' achievement in the first year of high school.
The content of a final achievement test may be based directly on a detailed course syllabus or on the books and other materials used. This has been referred to as the syllabus-content approach.
- Since the test contains only what the students are thought to have actually encountered, it can be considered a fair test.
- However, if the syllabus is badly designed, or the books and other materials are badly chosen, the results of such a test can be very misleading.
The alternative approach is to base the test directly on the objectives of the course.
- It makes it possible for performance on the test to show just how far students have achieved those objectives. This in turn puts pressure on those responsible for the syllabus and for the selection of books and materials to ensure that these are consistent with the course objectives. Tests based on objectives work against the perpetuation of poor teaching practice, something which course-content-based tests fail to do.
- However, it can be unfair: if the course content does not fit well with the objectives, examinees will be expected to do things for which they have not been prepared.
Diagnostic tests measure the degree of students' achievement on a particular subject or topic, and specifically on the detailed elements of an instructional topic. They show students' weaknesses and strengths so that teachers can modify the instructional procedure; remedial action can be taken if the number of affected students is large. Both achievement and diagnostic tests are criterion-referenced:

| Test Qualities | Achievement | Diagnostic |
| --- | --- | --- |
| Details of information | Specific | Very specific |
| Focus | Terminal objectives of course or program | Enabling objectives of courses |
| Purpose of decision | To determine the degree of learning for advancement or graduation | To inform students and teachers of objectives needing more work |
| When administered | End of courses | Middle of courses |

1.1.2. Knowledge tests
These tests are used when the medium of instruction is a language other than the examinees' mother tongue.

1.1.3. Proficiency tests
If the aim of a test is to tap overall language ability, i.e. global competence, then, in conventional terminology, it is testing proficiency. A proficiency test is not limited to any one course, curriculum, or single skill in the language; rather, it tests overall ability. More precisely, these tests measure the degree of an examinee's capability to demonstrate his knowledge in language use, and the degree of his capability in language components.
- Note: A key issue in testing proficiency is the difficulty of defining the term 'proficiency' (the construct of language). This difficulty makes the construction of proficiency tests difficult.

1.2. Prognostic tests
Prognostic tests are not related to a particular course of instruction. Their objective is to predict and make decisions about the future success and actions of examinees based on present capabilities.

1.2.1. Placement tests
Placement tests are used to determine the most appropriate channel of education for examinees. The purpose of placement tests is merely to measure the capabilities of an applicant for pursuing a certain path of language learning and to place them into an appropriate level or section of a language curriculum or school.
- Note: There is no pass or fail in placement tests.
- Note: Teachers benefit from placement decisions because they end up with classes of students with relatively homogeneous ability levels.
- Note: If there is a mismatch between the placement test and what is taught in a program, the danger is that groupings of similar ability levels will simply not occur.

1.2.2. Selection tests
The purpose of selection tests is to provide information upon which the examinees' acceptance or non-acceptance into a particular program can be determined.
- Note: In contrast to placement tests, testees pass or fail selection tests.
- Note: Since more candidates reach the criterion than can be admitted under administrative restrictions, these tests turn into competition tests.

1.2.3. Aptitude tests
These tests are used to predict applicants' success in achieving certain objectives in the future. Language aptitude tests are designed to measure a person's capability, or general ability, to learn a foreign language a priori and to be successful in that undertaking. They were ostensibly designed to apply to the classroom learning of any language. These tests do not tell us who will succeed or fail in learning a foreign language; they attempt to predict the rate at which certain students will be able to acquire a language. Language aptitude tests usually consist of several different tests which measure the following cognitive abilities:
- Sound coding ability (or phonetic coding): the ability to identify and remember new auditory phonetic material in such a way that this material can be recognized, identified, and remembered over a period longer than a few seconds. This is a rather unique auditory component of foreign language aptitude.
- Grammatical coding ability: the ability to identify the grammatical functions of different parts of sentences in a variety of contexts.
- Memorization (or rote learning ability): the ability to remember words, rules, etc. in a new language. Rote learning ability is a kind of general memory, but individuals seem to differ in their ability to apply their memory to the foreign language situation.
- Inductive learning ability: the ability to work out linguistic forms, rules, patterns, and meanings from new linguistic content with a minimum of supervision or guidance.

Chapter Three: Forms of Language Test

The form of a test refers to its physical appearance.

1. STRUCTURE OF AN ITEM
An item, the smallest unit of a test, consists of two parts: the stem and the response. In the following item, the question is the stem, and the alternatives make up the response; the correct alternative is the key, and the rest are distractors:

1) How many functions do language tests serve? (stem)
a) two (key)
b) three (distractor)
c) four (distractor)
d) five (distractor)

2. CLASSIFICATION OF ITEM FORMS

2.1. Subjective vs. objective items
Subjective items are those in which the scorer must make an opinionated judgment. Objective items are those in which the correctness of the test taker's response is determined by predetermined, objective criteria.
- Note: Objectivity and subjectivity refer to the way a test item is scored.

Subjective item: The most beautiful season is ……
1) spring  2) summer  3) fall  4) winter

Objective item: There are …… seasons in a year.
1) four  2) three  3) two  4) five

2.2. Essay-type vs. multiple-choice items
Essay-type items are those in which the examinee is required to produce language elements. Multiple-choice items are those in which the examinee is required to select the correct response from among given alternatives.

2.3. Suppletion vs. recognition items
Suppletion (or production/supply) items require the examinee to supply the missing part(s) of a sentence or to complete an incomplete sentence. Recognition items require the examinee to select an answer from a list of possibilities.

3. TYPES OF ITEMS

3.1. Receptive response items
Multiple-choice (MC) items are undoubtedly one of the most widely used types of items in objective tests. MC items have the following advantages:
- Because of the highly structured nature of these items, the test writer can get directly at many of the specific skills and learnings he wishes to measure. This in turn leads to their diagnostic function.
- The test writer can include a large number of different tasks in the testing session. Thus they have practicality.
- Scoring can be done quickly and involves no judgments as to degrees of correctness. Thus they have reliability.
However, these items are disadvantageous on the grounds that they:
- are passive, i.e. they test only recognition knowledge, not language communication,
- may have harmful washback,
- expose students to errors,
- are de-contextualized,
- are one of the most difficult and time-consuming types of items to construct,
- are simpler to answer than subjective tests,
- encourage guessing.
There is a way to compensate for students' guessing on tests: a mathematical procedure to adjust or correct for guessing. This statistical procedure, aptly named the guessing correction formula, is:

Score = Right − Wrong / (n − 1)

where n refers to the number of options.

- Example: In a test which consisted of 80 items with four options, a student answered 50 items correctly and gave 30 wrong answers. After applying the guessing correction formula, his score would be ---------
1) 45  2) 35  3) 40  4) 30

Score = Right − Wrong / (n − 1) = 50 − 30 / (4 − 1) = 50 − 10 = 40

3.2. Personal response items (or alternative assessment options)
In recent years, language teachers have stepped up efforts to develop non-test assessment options. Such innovations are referred to as personal response items: items that encourage students to produce responses that hold personal meaning.

3.2.1. Self-assessment
Self-assessment is defined as any item wherein students are asked to rate their own knowledge, skills, or performances. Self-assessments thus provide the teacher with some idea of how the students view their own language abilities and development. Their characteristics include:
- speed,
- direct involvement of students, leading to increased motivation,
- the encouragement of autonomy,
- subjectivity (a drawback).
There are at least two categories of self-assessment:
- Direct assessment of a specific performance: a student typically monitors himself in either oral or written production and renders some kind of evaluation of that performance.
- Indirect assessment of general competence: this type of assessment targets large slices of time with a view to rendering an evaluation of general ability, as opposed to one specific, relatively time-constrained performance.

3.2.2. Journal
Journals can range from language learning logs, to grammar discussions, to responses to readings, to attitudes and feelings about oneself. One of the principal objectives of a student's dialogue journal is to carry on a conversation with the teacher. Through dialogue journals, teachers can become better acquainted with their students, in terms of both their learning progress and their affective states, and thus become better equipped to meet students' individual needs.
- Because journal writing is a dialogue between student and teacher, journals afford a unique opportunity for a teacher to offer various kinds of feedback to learners.
- However, journals are too free in form to be assessed accurately.
- Certain critics have expressed ethical concerns.

CHAPTER FOUR: BASIC STATISTICS IN LANGUAGE TESTING

1. STATISTICS
Statistics involves collecting numerical information called data, analyzing it, and making meaningful decisions on the basis of the outcome of the analyses. Statistics is of two types: descriptive and inferential.

2. TYPES OF DATA

2.1. Nominal Data
Nominal data, as the name implies, name an attribute or category and classify the data according to the presence or absence of that attribute, e.g. 'gender,' 'nationality,' 'native language,' etc.

2.2. Ordinal Data
Like a nominal scale, an ordinal scale names a group of observations, but, as its label implies, an ordinal scale also orders, or ranks, the data. For example, degree of happiness may be shown by: very unhappy – unhappy – happy – very happy.

2.3. Interval Data
Interval data represent the ordering of a named group of data, but they provide additional information: they also show the (more) precise distances between the points in the rankings, e.g. test scores.

2.4. Ratio Data
Ratio data are similar to interval data except that they have an absolute zero. As a result of this characteristic, with ratio data we can say 'this point is two times as high as that point.'

| Scale | Shows categories | Gives ranking | Equal distances | Absolute zero |
| --- | --- | --- | --- | --- |
| Nominal | yes | no | no | no |
| Ordinal | yes | yes | no | no |
| Interval | yes | yes | yes | no |
| Ratio | yes | yes | yes | yes |

3. TABULATION OF DATA
Suppose that the following table shows the reading scores of students on an achievement test.

| Student | a | b | c | d | f | g | h | i | j | k | l |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Score | 93 | 95 | 92 | 95 | 100 | 96 | 92 | 96 | 92 | 95 | 92 |

3.1. Rank Order
The first step is to arrange the scores in order of size, usually from highest to lowest. If two or more testees received the same score, each is assigned the average of the ranks they jointly occupy, i.e. the sum of those ranks divided by the number of tied scores.

| Score | 100 | 96 | 96 | 95 | 95 | 95 | 93 | 92 | 92 | 92 | 92 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rank order | 1 | 2.5 | 2.5 | 5 | 5 | 5 | 7 | 9.5 | 9.5 | 9.5 | 9.5 |
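For larger score sets, the tie-averaging step can be automated. The following Python sketch (ours, not from the text; the helper name `rank_order` is illustrative) reproduces the table above.

```python
# A minimal sketch (not from the text) of rank ordering with ties averaged;
# the helper name `rank_order` is illustrative.
def rank_order(scores):
    """Map each score to its rank (highest = 1), averaging ranks over ties."""
    ordered = sorted(scores, reverse=True)
    positions = {}
    for rank, score in enumerate(ordered, start=1):
        positions.setdefault(score, []).append(rank)
    return {score: sum(r) / len(r) for score, r in positions.items()}

scores = [93, 95, 92, 95, 100, 96, 92, 96, 92, 95, 92]
print(rank_order(scores))  # {100: 1.0, 96: 2.5, 95: 5.0, 93: 7.0, 92: 9.5}
```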

The next table shows the same scores. The remaining terms used in the tabulation of data will be presented according to this table.

| Score | Frequency (f) | Relative Frequency | Percentage | Cumulative Frequency (F) | Percentile |
| --- | --- | --- | --- | --- | --- |
| 100 | 1 | 0.09 | 0.09 × 100 = 9 | 11 | 100 |
| 96 | 2 | 0.18 | 0.18 × 100 = 18 | 10 | 90 |
| 95 | 3 | 0.27 | 0.27 × 100 = 27 | 8 | 72 |
| 93 | 1 | 0.09 | 0.09 × 100 = 9 | 5 | 45 |
| 92 | 4 | 0.36 | 0.36 × 100 = 36 | 4 | 36 |
| | Total = 11 | | | | |

3.2. The Frequency Distribution
Frequency (f), also called simple or absolute frequency, is the number of times a score occurs.

3.3. Relative Frequency
Relative frequency refers to the simple frequency of each score divided by the total number of scores.

3.4. Percentage
When the relative frequency index is multiplied by 100, the result is called a percentage.

3.5. Cumulative Frequency
Cumulative frequency (F) indicates the standing of any particular score in a group of scores. This index shows how many students received a particular score or less.

3.6. Percentile
When the cumulative frequency index is divided by the total number of learners and multiplied by 100, the result is a percentile. A percentile rank shows what percentage of students received a particular score or below it.

4. DESCRIPTIVE STATISTICS

4.1. Measures of Central Tendency

4.1.1. Mode
The most easily obtained measure of central tendency is the mode: the score that occurs most frequently in a set of scores. For example, 88 is the mode in 80, 81, 81, 85, 88, 88, 88, 93, 94, 94.
Note: When all of the scores in a group occur with the same frequency, it is customary to say that the group of scores has 'no mode,' as in 83, 83, 83, 88, 88, 88, 90, 90, 90, 95, 95, 95.
Note: When two adjacent scores have the same (highest) frequency, the mode is the average of the two adjacent scores, so 86.5 is the mode in 80, 82, 84, 85, 85, 88, 88, 90, 94.
Note: When two non-adjacent scores have the same (highest) frequency, the distribution is bimodal, as in 82, 82, 85, 85, 85, 87, 88, 88, 88, 90, 94.

4.1.2. Median
The median (Md) is the score at the 50th percentile in a group of scores; e.g. 85 is the median in 81, 81, 82, 84, 85, 86, 86, 88, 89.
Note: If the data contain an even number of scores, the median is the point halfway between the two central values when the scores are ranked, e.g. 85 in 81, 81, 82, 84, 86, 86, 88, 90.

4.1.3. Mean
The mean is probably the single most often reported indicator of central tendency. It is the same as the arithmetic average:

X̄ = ∑X / N

Note: If we were to find the deviations of the scores from the mean of the set, their sum would be exactly zero.
Note: The limitation of the mean is that it is seriously sensitive to extreme scores.

4.1.4. Midpoint
The midpoint of a set of scores is the point halfway between the highest score and the lowest score on the test. The formula for calculating the midpoint is:

Midpoint = (High + Low) / 2
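As a quick check, the following Python sketch (ours, not from the text) computes these measures for the 11 reading scores tabulated earlier.

```python
# A minimal sketch (not from the text) computing the measures of central
# tendency above for the 11 reading scores tabulated earlier.
from statistics import mean, median, mode

scores = [93, 95, 92, 95, 100, 96, 92, 96, 92, 95, 92]

print(mode(scores))                     # 92, the most frequent score
print(median(scores))                   # 95, the score at the 50th percentile
print(round(mean(scores), 2))           # 94.36, the arithmetic average
print((max(scores) + min(scores)) / 2)  # 96.0, the midpoint
```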

4.2. Measures of Variability

4.2.1. Range
The range is the simplest measure of dispersion and is defined as the number of points between the highest score on a measure and the lowest score; e.g. the range is 8 in the set 92, 95, 95, 97, 98, 98, 100.
Note: The range changes drastically with the magnitude of extreme scores (outliers).

4.2.2. Standard Deviation (SD)
The most frequently used measure of variability is the standard deviation. The SD is the average difference of all scores from the mean. The idea can be seen by comparing two sets of scores, each with a mean of 6 (originally shown as arrow diagrams on a number line from 1 to 11, each arrow running from the mean to a score):

Set 1: 3, 5, 5, 8, 9 (mean: 6)
Mean of the arrow lengths (the deviations from the mean) in the first set = (1 + 1 + 3 + 3 + 2) / 5 = 2

Set 2: 1, 4, 5, 10, 10 (mean: 6)
Mean of the arrow lengths in the second set = (5 + 2 + 1 + 4 + 4) / 5 = 3.2

Therefore, we can say the scores in the second set on average deviate more from their mean than do the scores in the first set, i.e. they are more spread out. The standard deviation captures this:

SD = √( ∑(X − X̄)² / N )

4.2.3. Variance
To find the variance, you simply stop short of the last step in calculating the standard deviation: you do not need to bother with finding the square root.

SD = √( ∑(X − X̄)² / N )

Variance = ∑(X − X̄)² / N

You will frequently find the variance written as S²:
- Variance = S² (the standard deviation squared)
- Standard deviation = √variance
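The two formulas translate directly into code. Here is a minimal Python sketch (ours, not from the text), applied to the two sets compared above.

```python
# A minimal sketch (not from the text) of the population variance and SD
# formulas, applied to the two sets compared above.
from math import sqrt

def variance(scores):
    """Population variance: mean of the squared deviations from the mean."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

def sd(scores):
    """Population standard deviation: the square root of the variance."""
    return sqrt(variance(scores))

print(round(sd([3, 5, 5, 8, 9]), 2))    # 2.19: set 1 clusters near its mean
print(round(sd([1, 4, 5, 10, 10]), 2))  # 3.52: set 2 is more spread out
```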

5. NORMAL DISTRIBUTION
A normal distribution means that most of the scores cluster around the mean of the distribution, and the number of scores gradually decreases on either side of the mean. The resulting figure is a symmetrical bell-shaped curve.

- Example: In a vocabulary test, the mean and standard deviation are calculated to be 82 and 4, respectively. On this test, 68% of students fall between ---------

- Example: The mean and SD of a set of scores are 45 and 5. A student who obtained 55 has a percentile rank of ---------.
[Figure: normal curve with the mean and ±1, ±2, ±3 SD marked at 30, 35, 40, 45, 50, 55, 60]

- Example: In a test, the mean and standard deviation are 32 and 3. A student is --------- probable to obtain a score higher than 29.
[Figure: normal curve with the mean and ±1, ±2, ±3 SD marked at 23, 26, 29, 32, 35, 38, 41]
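Assuming the scores are normally distributed, these examples can be checked with the standard library's `NormalDist`; the sketch below is ours, not from the text.

```python
# A minimal sketch (not from the text) checking the examples above with the
# standard library's NormalDist (Python 3.8+).
from statistics import NormalDist

# Example 1: mean 82, SD 4; 68% of scores lie within one SD of the mean.
print(82 - 4, 82 + 4)                              # 78 86

# Example 2: mean 45, SD 5; a score of 55 is two SDs above the mean.
print(round(NormalDist(45, 5).cdf(55) * 100))      # roughly the 98th percentile

# Example 3: mean 32, SD 3; chance of scoring above 29 (one SD below mean).
print(round((1 - NormalDist(32, 3).cdf(29)) * 100))  # about 84% probable
```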

6. DERIVED SCORES
Raw scores are obtained simply by counting the number of right answers. Raw scores from two different tests are not comparable. To solve this problem, we can convert the raw scores into percentile or standard scores. Percentile scores indicate how a given student's score relates to the test scores of the entire group of students. Standard scores are obtained by taking into account the mean and SD of a given set of scores: they represent a student's score in terms of how far it varies from the test mean, in standard deviation units.

6.1. z score
The z score simply tells you how many standard deviations above or below the mean any score or observation might be:

z = (X − X̄) / SD

- Example: In a set of scores where the mean and SD are 41 and 10, what is the z score of a student who obtained 51?

z = (X − X̄) / SD = (51 − 41) / 10 = +1

6.2. T score
The formula for calculating the T score is:

T = 10z + 50

Therefore, the T score of the student in the previous example would be 60.
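Both conversions are one-liners in code; here is a minimal Python sketch (ours, not from the text).

```python
# A minimal sketch (not from the text) of the z-score and T-score
# conversions above.
def z_score(x, mean, sd):
    """How many SD units a raw score lies above or below the mean."""
    return (x - mean) / sd

def t_score(z):
    """T score: a standard score rescaled to mean 50 and SD 10."""
    return 10 * z + 50

z = z_score(51, mean=41, sd=10)
print(z)           # 1.0, as in the worked example
print(t_score(z))  # 60.0
```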

8. CORRELATION
Correlation analysis refers to a family of statistical analyses that determine the degree of relationship between two sets of numbers. The numerical value representing the degree to which two variables are related (co-vary, or vary together) is called the correlation coefficient. Correlation is the 'go-togetherness' of two sets of scores. Let's take the following hypothetical sets of scores, which would normally be represented on scatter plots.

Positive correlation:

| Student | Dean | Randy | Joey | Jeanne | Kimi | Shenan |
| --- | --- | --- | --- | --- | --- | --- |
| Test A | 2 | 3 | 4 | 5 | 6 | 7 |
| Test B | 3 | 5 | 7 | 9 | 11 | 13 |

This is a linear, perfect positive correlation. The two sets of scores may not necessarily be ordered in exactly the same way. Here is another set of hypothetical scores with its scattergram:

| Student | Dean | Randy | Joey | Jeanne | Kimi | Shenan |
| --- | --- | --- | --- | --- | --- | --- |
| Test A | 2 | 4 | 6 | 8 | 9 | 12 |
| Test B | 2 | 3 | 7 | 7 | 10 | 11 |

This is a linear (but not perfect) positive correlation.

Negative correlation:

| Student | Dean | Randy | Joey | Jeanne | Kimi | Shenan |
| --- | --- | --- | --- | --- | --- | --- |
| Days of absence | 8 | 7 | 6 | 5 | 4 | 3 |
| English score | 20 | 30 | 40 | 50 | 60 | 70 |

This is a linear, perfect negative correlation. Here is another set of hypothetical scores, showing a negative correlation that is not perfect:

| Student | Dean | Randy | Joey | Jeanne | Kimi | Shenan |
| --- | --- | --- | --- | --- | --- | --- |
| Days of absence | 2 | 3 | 5 | 8 | 8 | 9 |
| English test | 90 | 60 | 40 | 40 | 30 | 10 |

Note: If high scores in one set are associated with low scores in the other set, there is a negative relationship between the two sets of scores.
Note: If high scores in one set are associated with high scores in the other set, there is a positive relationship between the two sets of scores.

Zero correlation: [scatter plot omitted]

Curvilinear relationship: [scatter plot omitted]

9. CORRELATIONAL FORMULAS
Correlation values are named after their strength:
- Both ±1 are considered perfect correlations.
- Values with −0.4 ≤ r ≤ +0.4 are considered weak correlations.
- Values with −1 < r ≤ −0.8 or +0.8 ≤ r < +1 are considered strong correlations.
Note: The sign (− or +) of the correlation coefficient has no effect on the degree of association, only on the direction of the association.

9.1. Pearson Product-Moment Correlation
Karl Pearson developed a correlation coefficient which demonstrates the strength of the relationship between two sets of continuous-scale data:

r = [N∑XY − (∑X)(∑Y)] / √{ [N∑X² − (∑X)²] [N∑Y² − (∑Y)²] }

9.2. Rank-Order Correlation
The Spearman rho (ρ) correlation coefficient is used only when the data exist in ranked (ordinal) form:

ρ = 1 − 6∑D² / [N(N² − 1)]

9.3. Point-Biserial Correlation
The point-biserial coefficient is used when one set of data is continuous and the other set is nominal, where the nominal variable is dichotomous and can take only the values 1 or 0. The correlation between a single test item (nominal scale) and the total test score (continuous scale) can be computed by this formula:

r_pb = [(X̄p − X̄q) / Sx] · √(p·q)

Note: Correlation does not show causality between two variables. It shows that relative positions in one variable are associated with relative positions in the other variable.

CHAPTER FIVE: TEST CONSTRUCTION

1. DETERMINING FUNCTION AND FORM OF THE TEST
In order to determine the function and form of a test, three factors should be taken into account: (a) the characteristics of the examinees, (b) the specific purpose of the test, and (c) the scope of the test.

2. PLANNING (SPECIFYING THE TEST CONTENT)
It is important for the tester to decide on the area of knowledge to be measured. In order to determine the content of a test, a table of specifications should be prepared.
- The main purpose of a table of specifications is to assure the test developer that the test includes a representative sample of the materials covered in a particular course.

| Instructional objectives / Content | Number of items |
| --- | --- |
| Reported speech | 3 |
| Subjunctive | 2 |
| Dangling structure | 5 |

3. PREPARING ITEMS

Last year, incoming students ……… on the first day of school.
1) enrolled  2) will enroll  3) will enrolled  4) should enroll

Have you heard the planning committee's ……… for solving the city's traffic problems?
1) purpose  2) propose  3) design  4) theory

4. REVIEWING
It is highly recommended that the produced test be reviewed by an outsider, in order to obtain his independent judgments and evaluation of the test.

5. PRETESTING
Pretesting is defined as administering the newly developed test to a group of examinees with characteristics similar to those of the target group. The purpose of pretesting is to determine, objectively, the characteristics of the individual items as well as the characteristics of the items taken together.

Item Facility (IF)
Item facility refers to the easiness of an item:

IF = ∑C / N

where ∑C is the sum of the correct responses and N is the total number of responses.

- Example: On a test, 20 testees answered an item correctly. If 50 students took the exam, what would be the item facility?

IF = ∑C / N = 20 / 50 = 0.4

- Example: A test item was given to 75 examinees: 50 answered correctly, 10 answered wrongly, and 15 left the item blank. What is the facility value?

IF = ∑C / N = 50 / 60 = 0.83
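A minimal Python sketch (ours, not from the text) mirrors the two worked examples; note that in the second example blanks are not counted as responses.

```python
# A minimal sketch (not from the text) of the item facility index, mirroring
# the two worked examples above.
def item_facility(correct, responses):
    """IF = sum of correct responses / total number of responses."""
    return correct / responses

print(item_facility(20, 50))            # 0.4
print(round(item_facility(50, 60), 2))  # 0.83: the 15 blanks are not counted
```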

- Note: The range of the IF index is 0 ≤ IF ≤ 1.
- Note: The acceptable range of the IF index is 0.37 ≤ IF ≤ 0.63.
- Note: The ideal IF index is 0.5.
- Note: By determining item facility, the test constructor can easily find item difficulty, calculated by the following formula: Item Difficulty (ID) = 1 − IF.

The following matrix of pretest responses (1 = correct, 0 = wrong) is used in the analyses below. The first five subjects form the higher-proficiency group; the last five form the lower-proficiency group.

| Subjects | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Item 9 | Item 10 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Shenan | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
| Robert | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 8 |
| Millie | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 7 |
| Kimi | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 7 |
| Jeanne | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 5 |
| Corky | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 4 |
| Dean | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 4 |
| Bill | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 |
| Randy | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 |
| Mitsuko | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

Item Discrimination (ID)
Item discrimination refers to the extent to which a particular item discriminates more knowledgeable examinees from less knowledgeable ones. To compute item discrimination, the following formula is used:

ID = (∑C_high − ∑C_low) / (½ N)

where ∑C_high is the number of correct responses to a particular item by the examinees in the high group, ∑C_low is the number of correct responses to that item by the examinees in the low group, and ½ N is the total number of responses divided by 2.

- Example: If, in a class of 50 students, 20 students in the high group and 10 students in the low group answered an item correctly, then ID equals ---------

ID = (∑C_high − ∑C_low) / (½ N) = (20 − 10) / (½ × 50) = 10 / 25 = +0.4

- Example: All 30 testees in the high group and one-third of the 30 students in the low group answered item number one correctly. Given that there were 100 items in the test, what are the IF and ID of this item?

ID = (∑C_high − ∑C_low) / (½ N) = (30 − 10) / (½ × 60) = 20 / 30 = +0.66

IF = ∑C / N = 40 / 60 = 0.66

- Note: The range of the ID index is −1 ≤ ID ≤ +1.
- Note: The acceptable range of item discrimination is ID ≥ +0.4.
- Note: If all students answered a question correctly (IF = 1), the item is not only too easy but also non-discriminating (ID = 0). Similarly, if none of the students answered an item correctly (IF = 0), the item is not only too difficult but also non-discriminating (ID = 0).

Choice Distribution (CD)
Choice distribution refers to: (1) the frequency with which alternatives are assigned as the correct answer, and (2) the frequency with which alternatives are selected by the examinees in a multiple-choice item. Accordingly, there are three types of distractors:
- Functioning: a distractor which attracts more low-scoring students, who have not mastered the subject.
- Non-functioning: a distractor which attracts no one, not even the poorest examinees.
- Mal-functioning: a distractor which attracts more high than low scorers.

Example (choice C is the answer):

| Choice | Highs | Lows | Total |
| --- | --- | --- | --- |
| A | 3 | 8 | 11 |
| B | 7 | 3 | 10 |
| C | 14 | 5 | 19 |
| D | 0 | 0 | 0 |
| (Sum) | (20) | (20) | (40) |

IF = ∑C / N = 19 / 40 = 0.47

ID = (∑C_high − ∑C_low) / (½ N) = (14 − 5) / (½ × 40) = 9 / 20 = 0.45
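The same computation can be scripted; below is a minimal Python sketch (ours, not from the text; the function name `analyze_item` is illustrative).

```python
# A minimal sketch (not from the text) of the IF and ID computations for the
# choice-distribution example above, where choice C is the key.
def analyze_item(correct_high, correct_low, n_responses):
    """Return (IF, ID) from correct counts in the high and low halves."""
    if_index = (correct_high + correct_low) / n_responses
    id_index = (correct_high - correct_low) / (n_responses / 2)
    return if_index, id_index

if_index, id_index = analyze_item(correct_high=14, correct_low=5, n_responses=40)
print(round(if_index, 2))  # 0.47: near the ideal facility of 0.5
print(round(id_index, 2))  # 0.45: acceptable discrimination (ID >= +0.4)
```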

6. VALIDATION
Through validation, which is the last step in the test construction process, validity, as a characteristic of the test as a total unit, is determined.

CHAPTER SIX: RELIABILITY

1) RELIABILITY
On a 'reliable' test, one's scores across its various administrations would not differ greatly; that is, one's score would be quite consistent. The notion of consistency of one's score with respect to one's average score over repeated administrations is the central concept of reliability.

2) CLASSICAL TRUE SCORE THEORY (CTS)
CTS states that an observed score an examinee obtains on a test comprises two factors or components: a true score and an error score. If the observed score is represented by X, the true score by T, and the error score by E, the relationship between the observed and true score can be illustrated as follows:

X = T + E
(observed score = true score + error score)

Only when there is no measurement error does the true score equal the observed score (T = X); otherwise the error component makes the true score fall above or below the observed score (T > X or T < X).
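The model is easy to illustrate with simulated data; the sketch below (ours, not from the text) shows observed scores varying around a fixed true score as random error is added on each administration.

```python
# A minimal sketch (not from the text) simulating the classical true score
# model X = T + E over repeated administrations.
import random

random.seed(1)
true_score = 80  # T: the examinee's hypothetical error-free score

# Each administration adds a random error E, so the observed score X varies
# around T; across many administrations the errors average out near zero.
observed = [true_score + random.gauss(0, 3) for _ in range(1000)]
print(round(sum(observed) / len(observed), 1))  # close to 80.0, the true score
```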