Current Issues in Language Evaluation, Assessment and Testing : Research and Practice [1 ed.] 9781443889964, 9781443885904


English Pages 340 Year 2016


Current Issues in Language Evaluation, Assessment and Testing: Research and Practice

Edited by Christina Gitsaki and Christine Coombe

This book first published 2016

Cambridge Scholars Publishing
Lady Stephenson Library, Newcastle upon Tyne, NE6 2PA, UK

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Copyright © 2016 by Christina Gitsaki, Christine Coombe and contributors

All rights for this book reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

ISBN (10): 1-4438-8590-8
ISBN (13): 978-1-4438-8590-4

TABLE OF CONTENTS

List of Tables
List of Figures
List of Appendices
List of Abbreviations
Preface
Christina Gitsaki and Christine Coombe

Issues in the Analysis and Modification of Assessment Tools and Tests

Chapter One
How Well do Cloze Items Work and Why?
James Dean Brown, Jonathan Trace, Gerriet Janssen, and Liudmila Kozhevnikova

Chapter Two
Estimating Absolute Proficiency Levels in Small-Scale Placement Tests with Predefined Item Difficulty Levels
Kazuo Amma

Chapter Three
Bilingual Language Assessment in Early Intervention: A Comparison of Single- versus Dual-Language Testing
Caroline A. Larson, Sarah Chabal, and Viorica Marian

Chapter Four
Frequency and Confidence in Language Learning Strategy Use by Greek Students of English
Penelope Kambakis-Vougiouklis and Persephone Mamoukari

Chapter Five
The Development of a Vocabulary Test to Assess the Breadth of Knowledge of the Academic Word List
Lee-Yen Wang

Issues in the Creation of Assessment and Evaluation Tools

Chapter Six
Assessment for Learning; Assessment for Autonomy
Maria Giovanna Tassinari

Chapter Seven
Cultivating Learner Autonomy through the Use of English Learning Portfolios: A Longitudinal Study
Beilei Wang

Chapter Eight
Sharing Assessment Power to Promote Learning and Autonomy
Carol J. Everhard

Chapter Nine
Developing a Tool for Assessing English Language Teacher Readiness in the Middle East Context
Sadiq Midraj, Jessica Midraj, Christina Gitsaki, and Christine Coombe

Chapter Ten
Foreign Language Teachers’ Proficiency: The Implementation of the EPPLE Examination in Brazil
Douglas Altamiro Consolo and Vera Lúcia Teixeira da Silva

Issues in Language Assessment and Evaluation

Chapter Eleven
Vocabulary Size Assessment as a Predictor of Plagiarism
Marina Dodigovic, Jacob Mlynarski, and Rining Wei

Chapter Twelve
What is the Impact of Language Learning Strategies on Tertiary Students’ Academic Writing Skills? A Case Study in Fiji
Zakia Ali Chand

Chapter Thirteen
Speaking Practice in Private Classes for the TOEFL iBT Test: Student Perceptions
Renata Mendes Simões

Chapter Fourteen
Assessing the Level of Grammar Proficiency of EFL and ESL Freshman Students: A Case Study in the Philippines
Selwyn A. Cruz and Romulo P. Villanueva Jr.

Chapter Fifteen
Methodology in Washback Studies
Gladys Quevedo-Camargo and Matilde Virginia Ricardi Scaramucci

Contributors
Index

LIST OF TABLES

Table 1-1: Descriptive statistics for 50 cloze passages and reliability–Japan sample (adapted and expanded from Brown, 1998).
Table 1-2: Descriptive statistics for 50 cloze passages and reliability–Russia sample (adapted and expanded from Brown, 1998).
Table 1-3: Frequencies and percentages of tests that functioned well or poorly in terms of internal consistency reliability.
Table 1-4: Frequencies and percentages of items that functioned well in terms of Item Facility (IF).
Table 1-5: Frequencies and percentages of items that functioned well in terms of Item Discrimination (ID).
Table 1-6: Summary fit statistics for analyses of 1,496 items by country of origin.
Table 1-7: Vertical rulers for test taker ability, test version difficulty, and test item difficulty for Japan.
Table 1-8: Vertical rulers for test taker ability, test version difficulty, and test item difficulty for Russia.
Table 1-9: Frequencies and percentages of items that functioned well in terms of CTT ID and Rasch Fit.
Table 1-10: Item misfit by linguistic background for different parts of speech for 1,496 items.
Table 1-11: Item misfit for word type (content & function) in the Japan and Russia data for 1,496 items.
Table 1-12: Item misfit for word origin (Latinate & Germanic) in the Russia and Japan data for 1,496 items.
Table 1-13: Item misfit as a function of word frequency (in the Brown Corpus found in Francis & Kučera, 1979) in the Japan and Russia data for 1,496 items.
Table 2-1: Descriptors of overall reading comprehension (Source: Council of Europe, 2001, p. 69).
Table 2-2: Item difficulty levels and required proficiency in the placement test.
Table 2-3: Description of imaginary test items.
Table 2-4: Summary report of a candidate (Adapted from Linacre 1987, p. 25).
Table 2-5: Simulation 1.
Table 2-6: Simulation 2.
Table 2-7: Item difficulty and responses (arranged in the order of difficulty levels).
Table 2-8: Statistics of the candidates’ proficiency levels and confidence intervals.
Table 2-9: Item difficulty and responses by exact scoring.
Table 2-10: Adjusted responses for Candidate A (part).
Table 2-11: Adjusted responses for Candidate B (part).
Table 2-12: Comparison of estimated proficiency levels and confidence intervals.
Table 3-1: Demographic information for study participants.
Table 3-2: Receptive language ability as indexed by The Rossetti (1990) Language Comprehension subtest.
Table 3-3: Expressive language ability as indexed by The Rossetti (1990) Language Expression subtest.
Table 5-1: Participant information.
Table 5-2: Distribution of the sample words across the AWL Levels.
Table 5-3: Omitted AWL words with Level information.
Table 5-4: Pseudowords with Level information and the original AWL words.
Table 5-5: Descriptive statistics (freshmen = 50, seniors = 39).
Table 5-6: Comparison between freshmen and seniors in their knowledge of non-omitted/omitted academic words using ANOVA.
Table 5-7: Reliability analysis.
Table 5-8: Descriptive statistics for the 52 and 38 omitted items.
Table 5-9: Omitted academic words and their COCA ranks and frequencies.
Table 6-1: Reasons for choice of components (more than one answer possible).
Table 6-2: Effects of the self-assessment on participants (more than one answer possible).
Table 6-3: Comparison of studies on learner autonomy.
Table 7-1: Student participants in the study.
Table 7-2: Demographic data of case study participants.
Table 7-3: Phases of treatment in the junior high school.
Table 7-4: Independent samples test of LA and its subscales.
Table 7-5: Planning and reflecting.
Table 7-6: Alice’s monthly learning plan (A-P1-9).
Table 7-7: David’s weekly plan.
Table 8-1: AARP assessment overview (Writing 1 and Writing 2).
Table 8-2: AARP assessment overview (Oral Presentation).
Table 8-3: AARP mean scores for Writing 2, for Groups C and D, with ANOVA results for self-assessment variations.
Table 8-4: Paired t-test results for Writing 2 self-assessment Variations (AY2).
Table 8-5: Responses to questions about peer- and self-assessment.
Table 8-6: Modified extract from the AARP model showing the relationship between assessment and degrees of autonomy (Based on Everhard 2014, 2015a; Harris & Bell, 1990).
Table 9-1: Example performance indicator and achievement indicators (Source: TESOL 2010, p. 69).
Table 9-2: Criteria for the external review of the indicators and the MCIs.
Table 10-1: Marks in the FCE and in the TEPOLI (Source: Baffi-Bonvino, 2007, pp. 276-277).
Table 10-2: Grammatical accuracy as measured in TEPOLI (Source: Borges-Almeida & Consolo, 2010).
Table 10-3: Grammatical complexity as measured in TEPOLI (Source: Borges-Almeida & Consolo, 2010).
Table 11-1: Correlation coefficient (effect size) values for lexical errors.
Table 12-1: Correlations between strategy use, gender and ethnicity.
Table 12-2: Results of the error analysis.
Table 12-3: Correlation between strategy use and language proficiency.
Table 13-1: Classification of production variables in oral communication tasks (Adapted from Ellis, 2003, p. 117).
Table 13-2: Data collection procedures in each phase of the study.
Table 13-3: Student perceptions of their language skills before the course.
Table 13-4: Change in students’ perceptions (N = 17).
Table 13-5: Average perceived overall improvement in language skills.
Table 13-6: Students’ final TOEFL iBT score.
Table 13-7: Comparative table of the TOEFL iBT scores.
Table 14-1: A scale for measuring grammar proficiency.
Table 14-2: Mean and standard deviation of participants’ total score based on nationality.
Table 14-3: Means and standard deviations for the ESL participants.
Table 14-4: Means and standard deviations for the EFL participants.
Table 15-1: Studies distributed by continent.
Table 15-2: Number of instruments used in the studies.

LIST OF FIGURES

Figure 2-1: Sample representation of logistic regression analysis of a test item (Adapted from Amma, 2001).
Figure 2-2: Sample representation of logistic regression analysis of a candidate (Adapted from Amma, 1990).
Figure 2-3: Logistic regression analysis of Simulation 1.
Figure 2-4: Logistic regression analysis of Simulation 2.
Figure 2-5: Logistic fit of Candidate A with confidence interval. Bottom layer = ‘fail’.
Figure 2-6: Logistic fit of Candidate B with confidence interval. Bottom layer = ‘fail’.
Figure 3-1: Participants’ receptive language assessment results using The Rossetti (1990) Language Comprehension subtest. Error bars represent standard errors and asterisks indicate significant differences at p < .05.
Figure 3-2: Participants’ expressive language assessment results using The Rossetti (1990) Language Expression subtest. Error bars represent standard errors and asterisks indicate significant differences at p < .05.
Figure 3-3: Percent language delay in primary-language-only assessment and in dual-language assessment using The Rossetti (1990) Language Comprehension and Language Expression subtests. Error bars represent standard errors and asterisks indicate significant differences at p < .05.
Figure 4-1: An example from the SILL questionnaire employing the [01] bar for frequency and confidence.
Figure 6-1: The dynamic model of learner autonomy (Source: Tassinari, 2010, p. 203).
Figure 7-1: The process of the ELP practice.
Figure 9-1: Project stages.
Figure 10-1: The EPPLE examination, Oral Test, Part 1.
Figure 10-2: The EPPLE examination, Oral Test, Part 3.
Figure 13-1: Difficulties with language at the course (N = 17).
Figure 13-2: Student perceptions of their language ability at the end of the course (N = 17).
Figure 13-3: Student self-perceived learning (N = 17).
Figure 13-4: Students’ perceptions of their language skills before the course (N = 17).
Figure 13-5: Students’ perceptions of their language skills after the course (N = 17).
Figure 13-6: Student improvement in Task 1 and Task 2 (N = 17).
Figure 15-1: Studies per year of publication.
Figure 15-2: Data-collection instruments in the reviewed studies.

LIST OF APPENDICES

Appendix 1-A: List of participating universities.
Appendix 1-B: Example cloze test (Adapted from Brown, 1989).
Appendix 1-C: Actual word frequencies.
Appendix 1-D: FACETS input file for 50 cloze tests with 10 anchor items.

LIST OF ABBREVIATIONS

A: Self-Assessor (in Tukey-Kramer formulae)
AARP: Assessment for Autonomy Research Project
ANOVA: One-way Analysis of Variance
AS-unit: Analysis of Speech Unit
ASC: Ascendente (rising intonation)
ASHA: American Speech-Language-Hearing Association
AUTh: Aristotle University of Thessaloniki
AVL: Academic Vocabulary List
AWL: Academic Word List
AY: Academic Year
B: Peer-Assessor (in Tukey-Kramer formulae)
BNC: British National Corpus
C: Teacher-Assessor (in Tukey-Kramer formulae)
CAEP: Council for the Accreditation of Educator Preparation
CAPES: Coordenação de Aperfeiçoamento de Pessoal do Ensino Superior - Brazilian Agency for the Development of Graduate Studies
CEEC: College Entrance and Examination Center
CEFR: Common European Framework of Reference
CEPA: Common Educational Proficiency Assessment
CG: Control Group
CI: Confidence Interval
CILL: Centre for Independent Language Learning
CNPq: Conselho Nacional de Desenvolvimento Científico e Tecnológico
COCA: Corpus of Contemporary American English
Comm Arts 1: Communication Arts and Skills 1
Corr. Coeff.: Correlation Coefficient
CRAPEL: Centre de Recherches et d’Applications Pédagogiques en Langues
CTT: Classical Test Theory
EAL: English as an additional language
EFL: English as a Foreign Language
EG: Experimental Group
ELF: English as a Lingua Franca
ELL: English Language Learners/Learning
ELP: English Learning Portfolio
ELT: English Language Teaching
EMI: English Medium Instruction
ENAPLE-CCC: Ensino e Aprendizagem de Língua Estrangeira: Crenças, construtos e competências
Eng AN: Introduction to Language Arts
EPPLE: Exame de Proficiência para Professores de Língua Estrangeira
ERWL: English Reference Word List
ESL: English as a Second Language
ESOL: English for Speakers of Other Languages
ESP: English for Specific Purposes
ETS: Educational Testing Service
F: Female
FATEC: Faculdades de Tecnologia
FCE: First Certificate in English
FEU: Far Eastern University
FL: Foreign Language
FUB: Freie Universität Berlin
GSEAT: General Scholastic English Ability Test
GSL: General Service List
HE: Higher Education
ID: Item Discrimination
IELTS: International English Language Testing System
IF: Item Facility
INCOMP: Incomprehensible
IRT: Item Response Theory
L2: Second Language
LA: Learner Autonomy
LLS: Language Learning Strategies
LPFLT: Language Proficiency of Foreign Language Teachers
M: Male
Max: Maximum
MCI: Multiple-Choice Item
MENA: Middle East and North Africa
MFRM: Many Facet Rasch Measurement
Min: Minimum
MOE: Ministry of Education
N: Number (of participants)
n.s.: Non-significant differences in One-way ANOVA
NAPO: National Admissions and Placement Office
NBPTS: National Board for Professional Teaching Standards
NCATE: National Council for Accreditation of Teacher Education
NECS: National English Curriculum Standards
NNS: Non-Native Speakers
NRT: Norm-Referenced Test
NS: Native Speaker
OPI: Oral Proficiency Interview
OPT: Oxford Placement Test
P-A: Peer-Assessment
P-T: Peer-Assessment and Teacher Assessment
r: Reliability
S-A: Self-Assessment
S-T: Self-Assessment and Teacher Assessment
SAT: Scholastic Assessment Test
SD: Standard Deviation
SEM: Standard Error of Measurement
Sig.: Significance
SILL: Strategy Inventory for Language Learning
SOE: School of English
SPSS: Statistical Package for the Social Sciences
SVA: Subject Verb Agreement
T-A: Teacher Assessment
T-TRI: TESOL Teacher Readiness Inventory
TECOLI: Teste de Compreensão Oral em Língua Italiana
TEPOLI: Teste de Proficiência Oral em Língua Inglesa
TESOL: Teaching English to Speakers of Other Languages
The Rossetti: The Rossetti Infant-Toddler Language Scales
TOEFL: Test of English as a Foreign Language
TOEFL-iBT: Test of English as a Foreign Language–Internet-based Test
TOEIC: Test of English for International Communication
UEM: Universidade Estadual de Maringá
UERJ: Universidade Estadual do Rio de Janeiro
UnB: Universidade de Brasília
UNESP: Universidade Estadual Paulista
UNIP: Universidade Paulista
UNISEB/Estacio: Centro Universitário do Sistema Educacional Brasileiro
VST: Vocabulary Size Test
Y/N: Yes/No

PREFACE

Language assessment, whether formative or summative, plays an important role in second language learners’ educational experience and learning outcomes. Whether assessment is used for student initial screening, placement, or progression in a language course, it always involves gathering, interpreting and evaluating evidence of learning. Such information collected through the different assessment and evaluation tools allows educators to identify student needs and plan a course of action to address these needs, provides feedback about the effectiveness of teaching practice, guides instruction and curriculum design, and provides accountability for the system. For language educators, assessment is perhaps one of the most difficult and demanding tasks they have to perform given that designing valid and reliable assessment tools requires specialized skills, while decisions about assessment, especially ‘high-stakes’ exams, can have a lasting impact on students’ progress and life. Furthermore, in order for assessment to be useful, it must align itself with the mandated standards and academic expectations of the specific context where it occurs. Since no single type of assessment can provide all the information that is necessary to gauge students’ progress and language proficiency levels, educators need to incorporate a variety of assessment techniques into their practice and be aware of approaches and methods that can help provide valid and reliable evidence of student learning.

The edited volume presented here, Current Issues in Language Evaluation, Assessment and Testing: Research and Practice, is a collection of papers that address relevant issues in language assessment from a variety of contexts and perspectives. The book is divided into three major sections. The first section addresses Issues in the Analysis and Modification of Assessment Tools and Tests.
In Chapter One, JD Brown, Jonathan Trace, Gerriet Janssen, and Liudmila Kozhevnikova present a comparative study of cloze tests analyzed with Classical Test Theory (CTT) item analysis and multifaceted Rasch analysis. Drawing on almost 7,500 cloze tests taken by university students studying English as a foreign language (EFL) in Japan and Russia, they found the Rasch analyses to be more appropriate than CTT item analysis. In Chapter Two, Kazuo Amma proposes using a logistic regression analysis with predefined item difficulty levels in order to properly assess a student’s true proficiency level and the confidence interval, avoiding the pitfalls that may occur from estimating proficiency based on the total test score. In Chapter Three, Caroline Larson, Sarah Chabal, and Viorica Marian examine the use of The Rossetti Infant-Toddler Language Scale with Spanish-English speaking children. Their findings suggest that when the instrument is used in the children’s primary language only, their language skills are underestimated and their language delay is overestimated, leading to inappropriate Early Intervention referrals. In order to maximize the efficacy, reliability and validity of The Rossetti, the researchers recommend administering it in both the primary and the secondary languages of the child. In Chapter Four, Penelope Kambakis-Vougiouklis and Persephone Mamoukari present the results of their investigation of language learning strategy use of Greek EFL students. Their study is a pilot of a modified version of the Strategy Inventory for Language Learning (SILL) using a bar instead of a Likert scale, measuring frequency of strategy use as well as student confidence in the effectiveness of the different strategies, and administering the instrument orally rather than in writing. Their modifications of the SILL allowed them to detect discrepancies in student understanding of language learning strategies and their effectiveness which would not have been otherwise evident, as well as problematic items within the SILL that need further modification prior to future administrations of the instrument. In the last chapter of this section, Chapter Five, Lee-Yen Wang describes the use of a Yes/No test to measure University EFL students’ vocabulary acquisition of academic words that were left out of the Ministry-defined vocabulary list for schools in Taiwan. The analyses of the data highlighted the limitations of having a centrally controlled national wordlist.
The next part of the volume addresses Issues in the Creation of Assessment and Evaluation Tools. In Chapter Six, Maria Giovanna Tassinari discusses the creation of a dynamic model for assessing language learner autonomy and provides evidence of the use of the model with foreign language learners in a German University context. Her findings indicate that the model initiated and maintained pedagogic dialogue between the students and their teachers, raised students’ awareness of the different dimensions of learner autonomy, and enhanced their reflexive learning. In Chapter Seven, Beilei Wang describes the use of a three-dimensional learner autonomy scale with junior high school EFL students in China. Findings revealed that the use of the English Learning Portfolios was conducive to helping students gain learner autonomy. Learner autonomy was also the primary interest of Carol Everhard’s study in Chapter Eight. Greek EFL students in a university context participated in self- and peer-assessment activities of their oral and writing skills over the course of the 5-year study. Findings suggest that the use of such formative assessment techniques activated students’ criterial thinking and metacognitive awareness of their learning process. The next two chapters in this section describe the creation of assessment and evaluation tools for measuring teacher proficiency. In Chapter Nine, Sadiq Midraj, Jessica Midraj, Christina Gitsaki, and Christine Coombe describe the process of compiling a contextually relevant resource for independent learning and self-assessment in order to strengthen EFL teachers’ content knowledge, pedagogical knowledge, and professional dispositions. The resource was created for teachers in the Gulf Region using internationally accepted professional standards in teaching English to speakers of other languages (TESOL). In Chapter Ten, Douglas Altamiro Consolo and Vera Lúcia Teixeira da Silva discuss a meta-analysis of a string of research studies conducted within the framework of designing assessment tools for evaluating foreign language teachers’ proficiency in the foreign language they teach. Their meta-analysis is motivated by the need to revise and improve the EPPLE examination of foreign language teachers in Brazil.

The third and final section of the volume, Issues in Language Assessment and Evaluation, comprises studies that implemented different instruments to measure learner proficiency in different language skills. In Chapter Eleven, Marina Dodigovic, Jacob Mlynarski, and Rining Wei describe how they used instruments such as Grammarly and the Vocabulary Size Test (VST) to investigate possible correlations between academic plagiarism and vocabulary knowledge in the academic writing of University EFL students in China. Through their investigation, poor vocabulary command emerged as a major cause of plagiarism.
In Chapter Twelve, Zakia Ali Chand investigated whether there is a correlation between language strategy use and academic writing proficiency, using the SILL. The study involved University ESL students in Fiji, and the preliminary results showed that students were moderate users of language learning strategies and that their academic writing was not influenced by their use of language learning strategies, indicating the need for more strategy training in the language classroom. In Chapter Thirteen, Renata Mendes Simões addresses an area of language teaching that has not received much attention in research, that of one-to-one tutorials focusing on candidate preparation for high-stakes exams. The study involved Brazilian students preparing for the Test of English as a Foreign Language (TOEFL). The process of assessing students’ progress over the course of several weeks is described and enhanced by the use of qualitative and quantitative data in order to provide a deeper understanding of the effectiveness of English for Specific Purposes (ESP) courses for exam preparation. In Chapter Fourteen, Selwyn Cruz and Romulo Villanueva describe the development and administration of a grammar proficiency test in order to investigate issues in the grammatical proficiency of Korean and Filipino students studying in an English Medium University in the Philippines. The final paper in the volume, Chapter Fifteen, addresses issues of washback from language assessment. Gladys Quevedo-Camargo and Matilde Virginia Ricardi Scaramucci discuss the results of a meta-analysis of research studies on washback from around the globe and their respective methodologies and provide an account of the diverse instruments used to investigate washback.

All fifteen papers included in this volume were selected through a rigorous double-blind peer review process that involved a number of notable academics. The papers underwent further review and editing before being published in this book. Below is the list of academics who were involved in the double-blind review process:

Thomaï Alexiou          Aristotle University, Greece
Ramin Akbari            Modares Tarbiat, Iran
Deena Boraie            American University of Cairo, Egypt
Helene Demirci          Higher Colleges of Technology, UAE
Aymen Elsheikh          New York Institute of Technology, UAE
Atta Gebril             American University of Cairo, Egypt
Melanie Gobert          Higher Colleges of Technology, UAE
Tony Green              University of Bedfordshire, UK
Sahbi Hidri             Sultan Qaboos University, Oman
Elisabeth Jones         Zayed University, UAE
Mary Lou McCloskey      EDUCO, USA
Josephine O’Brien       Zayed University, UAE
Sufian Abu Rmaileh      UAE University, UAE

The volume presents research studies conducted in a variety of contexts (from early childhood to University and post-graduate studies) from around the world covering an equally diverse range of issues in language assessment and evaluation. It is hoped that it will be of use to both new and seasoned researchers in the field of Applied Linguistics and TESOL as well as teacher educators, language teachers, curriculum and assessment designers.

Christina Gitsaki and Christine Coombe

ISSUES IN THE ANALYSIS AND MODIFICATION OF ASSESSMENT TOOLS AND TESTS

CHAPTER ONE

HOW WELL DO CLOZE ITEMS WORK AND WHY?

JAMES DEAN BROWN, JONATHAN TRACE, GERRIET JANSSEN, AND LIUDMILA KOZHEVNIKOVA

Abstract

This study examined item-level data from fifty 30-item cloze tests that were randomly administered to university-level examinees from Japan (N = 2,298) and Russia (N = 5,170). A single 10-item anchor cloze test was also administered to all students. The analyses investigated differences between the two nationalities in terms of both classical test theory (CTT) item analysis and multifaceted Rasch analysis (the latter allowed us to estimate test-taker ability and item difficulty measures and fit statistics simultaneously across 50 cloze tests separately and combined for the two nationalities). The results indicated that considerably larger proportions of items functioned well in the Rasch item analyses than in the traditional CTT item analysis. Rasch analyses also turned out to be more appropriate for our cloze test analysis and revision purposes than did traditional CTT item analyses. Linguistic analyses of items that fit the Rasch model revealed that blanks representing certain categories of words (i.e., function words rather than content words, and Germanic-origin words rather than Latin-origin words), and to a greater extent relatively high frequency words, were more likely to work well for norm-referenced test (NRT) purposes. In addition, this study found that different items were functioning well for the two nationalities.


Introduction

Taylor (1953) first proposed the use of cloze tests for evaluating the readability of reading materials in US elementary schools. In the 60s and 70s, a number of studies appeared on the usefulness of cloze for English as a second language (ESL) proficiency or placement testing (see Alderson, 1979 for a summary of this early ESL research). Since then, as Brown (2013) noted, this research on using cloze in ESL proficiency or placement testing has continued, but has been inconsistent at best with reported reliability estimates ranging from .31 to .96 and criterion-related validity coefficients ranging from .43 to .91.

While the literature has focused predominantly on fixed interval (i.e., every nth word) deletion cloze tests, other bases have been used for developing cloze tests. For example, rational deletion cloze was developed by selecting blanks based on word classes (cf. Bachman, 1982, 1985; Markham, 1985). Tailored cloze involved using classical test theory (CTT) item analysis techniques to select items and thereby create cloze tests tailored to a particular group of students (cf. Brown, 1988, 1989, 2013; Brown, Yamashiro, & Ogane, 1999, 2001; Revard, 1990).

For the most part, cloze studies have been based on CTT. However, Item Response Theory (IRT), including Rasch analysis, has been applied to cloze in a few cases. Baker (1987) used Rasch analysis to examine a dichotomously scored cloze test and found that “observed and expected item characteristic curves show reasonable conformity, though with some instances of serious misfit…no evidence for departure from unidimensionality is found for the cloze data…” (p. iv). Hale, Stansfield, Rock, Hicks, Butler, and Oller (1988) found that IRT provided stable estimates for cloze in their study of the degree to which groups of cloze items related to different subparts of the overall Test of English as a Foreign Language (TOEFL). Hoshino and Nakagawa (2008) used IRT in developing a multiple-choice cloze test authoring system. Lee-Ellis (2009) used Rasch analysis in developing and validating a Korean C-test. However, Rasch analysis has not been used to study the effectiveness of individual items.

The Study

Certainly, no work has investigated the degree to which cloze items function well when analyzed using both CTT and IRT frameworks, and little research has examined the functioning of cloze items in terms of their linguistic characteristics. To address these issues and others, the following research questions were posed, all focusing on the individual items involved in 50 cloze tests that were administered to university-level examinees from two linguistically different backgrounds:

1. How do the CTT descriptive statistics, reliability, and item analyses differ for test-taker groups from different linguistic backgrounds?
2. How do Rasch item difficulty measures differ for the test-takers from different linguistic backgrounds?
3. How do the proportions of functioning cloze items differ between the CTT and IRT analyses, when based on the test-taker responses from different linguistic backgrounds?
4. In what ways do Rasch item fit patterns differ in terms of factors such as linguistic background and four cloze item linguistic features: parts of speech, word type, word origin, and word frequency?
5. Which linguistic characteristics will increase the probability of cloze items functioning well during piloting?

Participants

A total of 7,468 English as a foreign language (EFL) students participated in this study: 2,298 of these EFL students were studying at 18 different universities in Japan as part of their normal classroom activities; the remaining 5,170 EFL students were studying at 38 universities in Russia (see Appendix 1-A for a list of the participating universities in both countries). In Japan, about 38.3% of the participants were women, and 61.7% of the participants were men; in Russia, 71.7% of the participants were women, and 28.0% were men, with the remaining 0.3% giving no response. The participants in Japan were between 18 and 24 years old, while in Russia they were between 14 and 46 years old. The data from Japan were collected as part of Brown (1993, 1998); the data from Russia were collected in 2012-2013 and served as the basis of Brown, Janssen, Trace, and Kozhevnikova (2013).

Though these samples were convenience samples (i.e., not randomly selected), they were relatively large, which is important as this sample size permits robust analyses of these cloze data. It is critically important to stress that in this study we are interested in how linguistic background affects different analyses; we do not make any claims for the generalizability of these results to the EFL populations of all undergraduate students in university-level institutions in these countries. In fact, we want to stress that the samples from Japan and Russia cannot be said to be comparable given the sampling procedures, the very different proportions of university seats per million people available in the two countries, the proportions of young people who go to university, and so forth. Thus, any interpretations of these data to indicate that the English proficiency of students in either country is higher than in the other country are unwarranted and indefensible.

Measures

The 50 cloze tests used in this study were first created and used for Brown (1993). The 50 passages were randomly selected from among the adult-level books at a public library in Florida. Passages were chosen from each book by randomly selecting a page then working backwards for a reasonable starting place. Passages were between 366 and 478 words long with a mean of 412.1 words. Each passage contained 30 items, and the deletion pattern was every 12th word, which created a fairly high degree of item independence relative to the more typical 7th-word deletion pattern. The first and last sentences of all passages were left intact to provide context. Appendix 1-B shows the layout of the directions, example items, and answer key.

A 10-item cloze passage was also administered to all participants to act as anchor items (i.e., items that provide a common metric for making comparisons across tests and examinee samples). This anchor-item cloze was first created in a study by Brown (1989), wherein it was found that these 10 items were functioning effectively.

To check the degree to which the English in the cloze passages was representative of typical written English, the lexical frequencies for all 50 passages combined were calculated (see Appendix 1-C) and compared to the frequencies reported for the same words in the well-known Brown Corpus (Francis & Kučera, 1979, 1982; Kučera & Francis, 1967). We felt justified in comparing the 50 passages to this particular corpus for two reasons. First, following Stubbs (2004), though the Brown Corpus is relatively small, it is

“still useful because of their careful design ... one million words of written American English, sampled from texts published in 1961: both informative prose, from different text types (e.g., press and academic writing), and different topics (e.g., religion and hobbies); and imaginative prose (e.g., detective fiction and romance).” (p. 111)

Second, we found that the logarithmically transformed word frequencies of the cloze test items (to normalize the Zipfian nature of vocabulary distributions) and the logarithmically transformed frequencies of these same words in the Brown Corpus correlated strongly at .93 (Brown, 1998). Thus, we felt reasonably certain that these passages and cloze items were representative of the written English language, or at a minimum the genres of English found in US public library books.

Procedures

The 50 cloze tests were distributed to intact classes by teachers such that every student had an equal chance of receiving each of the 50 cloze test passages. In Japan, 42-50 participants completed each cloze test, with a mean of 46.0 participants completing each passage. In Russia, 90-122 participants completed each cloze test (Mean = 103.4). All examinees in both countries completed the 10-item anchor cloze. Twenty-five minutes were allowed for completing the tests. Exact-answer scoring was used (i.e., only the word found in the original text was counted as correct). This was done for two reasons: (a) we wanted each item to be interpretable as fillable by a single lexical item for analysis purposes; and (b) with the hundreds of items and thousands of examinees in this study, using acceptable-answer scoring or any of the other available scoring schemes would clearly have been beyond our resources.
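Exact-answer scoring as described above can be illustrated with a minimal sketch. The function name `score_exact`, the three-blank key, and the responses are all hypothetical, and the case and whitespace normalisation is an assumption of this sketch rather than a detail reported in the chapter.

```python
def score_exact(responses, key):
    """Score one examinee's cloze responses with exact-answer scoring.

    A blank earns 1 only if the response is the word from the original
    text; any other response, including an empty one, earns 0.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    # Assumption: ignore letter case and surrounding whitespace.
    return [
        1 if given.strip().lower() == original.strip().lower() else 0
        for given, original in zip(responses, key)
    ]


# Hypothetical three-blank example (not taken from the study's passages).
key = ["the", "of", "house"]
responses = ["the", "in", "home"]
item_scores = score_exact(responses, key)
total = sum(item_scores)
```

Note that "home" would receive no credit here even if a rater judged it contextually acceptable, which is the practical difference between exact-answer and acceptable-answer scoring.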

Analyses

Initially, CTT statistics were used to analyze the cloze test data. These statistics included: the mean, standard deviation, minimum and maximum scores, reliability, item facility, and item discrimination. Rasch analyses were also used in this study to calculate item difficulty measures and to identify misfitting test items. We used FACETS (Linacre, 2014a) analysis rather than WINSTEPS because the former allowed us to easily analyze our nested design (i.e., multiple tests administered to different groups of examinees). Or as Linacre put it, “Use Winsteps if the data can be formatted as a rectangle, e.g., persons-items … Use Facets when Winsteps won’t do the job” (Linacre, 2014b, n.p.).

We performed the analyses in several steps. Initially, we needed to determine anchor values through a separate FACETS analysis of only the 10 anchor items that were administered across all groups of participants. Then, we created a FACETS input file to link our 50 cloze tests by using our 10 anchor items (see Appendix 1-D for a description of the actual code that was used). There were three facets in this analysis: test-takers, test version, and test items. By using the FACETS program, we were able to combine the 50 different cloze procedures for both nationalities into a single analysis using anchor items, and put all of the items onto the same true interval scale for ease of comparison (see, e.g., Bond & Fox, 2007, pp. 75-90). Four of the total 1,500 items had blanks that were either missing or made no sense; thus, the total number of valid cloze items was 1,496.

Appendix 1-D also shows how we coded the data for the analysis. In order to analyze separate tests in a single analysis using a common set of anchor items, each examinee required two lines of response data. The first line corresponds to the set of items for the particular cloze procedure, set up by examinee ID, test version, the range of applicable items (e.g., 101-130 for items 1-30 on Test 1), followed by the observed response for each item. An additional line was also needed for examinee performance on anchor items, with the same coding format as above except for a common range of items for all examinees (31-40). The series of commas within the data indicates items that were removed as explained above. Using the same setup, we were able to run the program separately for the Russia, Japan, and Combined samples (i.e., with the two samples analyzed together as one).

Results

Classical Test Theory

Descriptive Statistics. As most previous item analyses of cloze tests have been based on CTT, we began our analysis by focusing on the CTT characteristics of our cloze tests and their items. Tables 1-1 and 1-2 show the descriptive statistics and internal consistency reliability estimates for our 50 cloze tests in test number order for each nationality. In general, the means are low for the 30-item cloze tests, indicating that the items (scored for exact answers) were quite difficult for the students. Tables 1-1 and 1-2 indicate that the Russia sample generally produced higher means and standard deviations than the Japan sample.

Reliability. Tables 1-1 and 1-2 also show the reliability estimates of the various cloze passages for the two nationalities. These cloze tests functioned somewhat less reliably with the Japan sample (ranging from .17 to .87) than with the Russia sample (ranging from .65 to .92). This pattern could be a consequence of the greater variation and perhaps the larger sample sizes in Russia. A synthesis of the cloze passages’ reliability estimates is shown in Table 1-3.
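The chapter does not name the internal consistency coefficient that was computed. As a sketch only, and assuming Cronbach's alpha (which equals K-R20 for dichotomously scored items when population variances are used), the calculation for one cloze test could look as follows. The function name and the small score matrix are hypothetical.

```python
def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a list of examinee rows of 0/1 item scores."""
    n_items = len(score_matrix[0])

    def variance(values):
        # Population variance (divides by n), so alpha matches K-R20
        # for dichotomous items.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    item_variances = [
        variance([row[i] for row in score_matrix]) for i in range(n_items)
    ]
    total_variance = variance([sum(row) for row in score_matrix])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 4 examinees by 3 items (not from the study).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(scores)
```

Because the coefficient depends on the ratio of item variance to total score variance, a sample with little spread in total scores will tend to produce a lower estimate, which is consistent with the explanation offered above for the lower values in the Japan sample.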

Table 1-1: Descriptive statistics for 50 cloze passages and reliability – Japan sample (adapted and expanded from Brown, 1998).

Test     Mean     SD     Min     Max       N      r
1        5.23   3.16       0      15      48   0.71
2        4.21   3.42       0      13      47   0.86
3        2.02   2.13       0      10      48   0.74
4        7.54   3.87       2      16      46   0.80
5        3.98   2.79       0      13      47   0.73
6        5.11   3.23       0      14      47   0.80
7        6.14   3.41       0      16      43   0.83
8        3.16   2.27       0       8      45   0.46
9        2.85   2.46       0      11      46   0.77
10       2.54   2.31       0       8      46   0.83
11       5.94   3.36       0      16      46   0.74
12       8.98   3.97       0      21      47   0.79
13       2.87   1.71       0       8      46   0.50
14       3.23   2.50       0       9      47   0.68
15       9.18   3.42       4      18      49   0.68
16       1.36   1.41       0       6      48   0.65
17       1.38   1.25       0       5      46   0.35
18       1.02   1.09       0       3      50   0.50
19       4.76   2.88       0      10      50   0.70
20       4.38   3.24       0      15      47   0.86
21       9.92   4.44       0      19      48   0.84
22       3.70   2.86       0      11      47   0.84
23       3.64   2.40       0      11      43   0.65
24       2.96   2.26       0       9      47   0.44
25       5.36   2.74       0      12      46   0.63
26       2.68   1.56       0       5      47   0.17
27       2.34   2.72       0      13      47   0.87
28       2.58   2.17       0       8      43   0.57
29       2.32   1.77       0       7      44   0.64
30       9.56   3.28       3      16      48   0.72
31       3.78   3.08       0      15      46   0.83
32       3.83   2.53       0       9      42   0.77
33       2.14   1.87       0       6      44   0.63
34       5.87   2.92       0      13      45   0.82
35       6.63   3.66       0      17      45   0.72
36       5.00   2.05       0       9      46   0.51
37       5.46   3.66       0      13      48   0.77
38       1.71   1.57       0       8      48   0.75
39       2.51   1.98       0       9      47   0.65
40       3.49   1.90       0       9      43   0.66
41       2.87   2.51       0      10      43   0.76
42       4.41   3.10       0      18      44   0.81
43       1.43   1.45       0       7      44   0.19
44       3.24   2.52       0      10      46   0.67
45       6.55   3.87       0      16      42   0.79
46       2.16   1.82       0       7      47   0.31
47       3.79   2.33       0      11      43   0.69
48       2.69   2.12       0      11      42   0.74
49       4.56   2.81       0      11      49   0.75
50       2.49   2.70       0      12      45   0.77
Mean     4.11   2.61    0.18   11.34   45.96   0.69

Note: SD = Standard Deviation; Min = Minimum score; Max = Maximum score; N = number of examinees; r = reliability.

Table 1-2: Descriptive statistics for 50 cloze passages and reliability – Russia sample (adapted and expanded from Brown, 1998).

Test     Mean     SD     Min     Max        N      r
1        6.78   3.99       0      16      120   0.75
2        7.06   4.94       0      19      102   0.85
3        3.94   3.71       0      14      103   0.81
4        9.82   6.12       0      21      105   0.89
5        6.54   4.38       0      22      106   0.82
6        5.34   4.19       0      16      102   0.83
7        8.07   6.22       0      20      103   0.90
8        3.13   3.67       0      24      101   0.86
9        4.08   3.67       0      23      105   0.81
10       3.77   4.24       0      22      102   0.87
11       5.74   4.53       0      17      101   0.85
12       9.27   4.86       0      20      115   0.83
13       3.30   3.89       0      17      105   0.86
14       5.10   4.70       0      17      107   0.87
15       8.10   5.60       0      21      106   0.89
16       2.30   2.70       0      11      115   0.77
17       2.55   2.29       0      10      109   0.65
18       1.60   2.27       0      15      100   0.78
19       6.15   5.08       0      30      102   0.88
20       5.41   5.01       0      24       97   0.89
21      10.32   6.96       0      27      103   0.92
22       3.74   3.64       0      14      102   0.83
23       3.58   3.36       0      14      102   0.79
24       2.13   2.37       0      10      101   0.71
25       4.63   4.55       0      15      102   0.87
26       4.35   3.25       0      21      100   0.77
27       3.48   3.07       0      15      100   0.75
28       4.01   3.81       0      18      102   0.84
29       3.39   2.70       0      11      102   0.70
30      12.82   5.39       0      22      111   0.83
31       4.88   3.89       0      14      101   0.82
32       4.96   3.22       0      12      101   0.79
33       2.82   2.57       0      10      102   0.71
34       7.11   4.43       0      18      102   0.83
35       6.72   5.54       0      25      103   0.87
36       4.81   4.11       0      16       96   0.83
37       8.38   5.46       0      24      103   0.87
38       2.42   2.44       0      14      106   0.74
39       3.62   3.44       0      12      103   0.80
40       3.87   4.39       0      24       90   0.88
41       4.53   3.56       0      14      101   0.79
42       4.78   4.10       0      20       93   0.84
43       2.09   2.56       0      15       99   0.76
44       4.80   4.28       0      19      102   0.85
45       9.24   6.59       0      21      101   0.91
46       3.69   3.49       0      14       93   0.80
47       3.19   2.79       0      12      104   0.73
48       2.98   3.36       0      18      108   0.75
49       4.39   4.10       0      15      122   0.86
50       3.57   3.04       0      13      109   0.76
Mean     5.07   4.05       0   17.52   103.40   0.82

Note: SD = Standard Deviation; Min = Minimum score; Max = Maximum score; N = number of examinees; r = reliability.

Table 1-3: Frequencies and percentages of tests that functioned well or poorly in terms of internal consistency reliability.

              Below .49   .50-.59   .60-.69   .70-.79   .80-.89   .90-.92   Total
Frequencies
  Japan            6          4        11        17        12         0       50
  Russia           0          0         1        16        30         3       50
  Total            6          4        12        33        42         3      100
Percentages
  Japan          12%         8%       22%       34%       24%        0%     100%
  Russia          0%         0%        2%       32%       60%        6%     100%
  Total           6%         4%       12%       33%       42%        3%     100%

Item Analyses. One aspect of CTT item analysis that testers often examine while developing norm-referenced tests (NRTs) is item facility (IF). Brown (2005) recommends keeping items with an IF in the range between .30 and .70 and discarding or replacing any items with IF values outside that range. Table 1-4 shows at the bottom of the third column of numbers that only 19.1% (Japan = 14.0%; Russia = 24.1%) of the items overall were functioning well in the .30-.70 range. Interestingly, 27.6% of the items (Japan = 31.7%; Russia = 23.5%) were not functioning at all (i.e., nobody answered correctly, hence IF = .00) and over 50% of the items were in the .01 to .29 range, which further confirms that these items were generally too difficult for these two samples.

Table 1-4: Frequencies and percentages of items that functioned well in terms of Item Facility (IF).

Range            .00    .01-.29   .30-.70   .71-1.00      Total
Frequencies
  Japan          474       776       210        36        1,496
  Russia         351       759       360        26        1,496
  Total          825     1,535       570        62        2,992
Percentages
  Japan        31.7%     51.9%     14.0%      2.4%       100.0%
  Russia       23.5%     50.7%     24.1%      1.7%       100.0%
  Total        27.6%     51.3%     19.1%      2.1%      *100.1%

Note: * = Does not add up to exactly 100% because of rounding error.
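The item facility calculation and the bands of Table 1-4 can be sketched as follows. The function names are hypothetical, and the exact cut points applied to unrounded IF values are an assumption of this sketch, since the table reports the bands only to two decimal places.

```python
def item_facility(item_scores):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)


def if_band(if_value):
    """Assign an IF value to one of the bands used in Table 1-4."""
    if if_value == 0:
        return ".00"        # nobody answered correctly
    if if_value < 0.30:
        return ".01-.29"    # too difficult for NRT purposes
    if if_value <= 0.70:
        return ".30-.70"    # range recommended by Brown (2005)
    return ".71-1.00"       # too easy for NRT purposes


# Hypothetical item: 12 of 46 examinees supplied the exact word.
hypothetical_item = [1] * 12 + [0] * 34
band = if_band(item_facility(hypothetical_item))
```

An item like the hypothetical one above, answered correctly by roughly a quarter of the examinees, falls in the .01-.29 band, where just over half of the items in both samples were found.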

Another aspect of CTT item analysis that testers often consider when developing NRTs is item discrimination (ID, calculated using the point-biserial correlation coefficient in this study). ID values can range from .00 for items that do not discriminate at all between the high and low performing examinees to 1.00 for items that discriminate perfectly between the high and low examinees. Negative discrimination values, which can range down to -1.00, indicate the degree to which items are measuring differently from the total scores on the test. Generally, CTT test designers try to use items with the highest positive ID values available when developing and revising NRTs. Ebel (1979) suggests the following ID value ranges for test development: poor ID (.00-.19), marginal ID (.20-.29), good ID (.30-.39), and very good ID (.40 and higher). Table 1-5 shows the frequencies and percentages of cloze items in terms of different ID value ranges for both nationalities. The Total row at the bottom shows that on average 28.8% of the items contributed nothing to the discrimination of these tests, though that percentage was considerably higher in Japan (41.2%) than in Russia (16.4%). While large proportions of items were Very Poor, Poor, or Marginal discriminators when used with both test-taker samples, the cloze tests when used in Russia had considerably more items in the Good and Very Good categories (19.1% and 10.1%, respectively) than in Japan (9.2% and 2.1%, respectively).

Table 1-5: Frequencies and percentages of items that functioned well in terms of Item Discrimination (ID).

              Non-functional   Very poor     Poor     Marginal     Good     Very good
              .00 or negative   .01-.09    .10-.19    .20-.29    .30-.39      .40+       Total
Frequencies
  Japan             616           194        258        258        138         32        1,496
  Russia            246           173        292        348        286        151        1,496
  Total             862           367        550        606        424        183        2,992
Percentages
  Japan           41.2%         13.0%      17.2%      17.2%       9.2%       2.1%        100%
  Russia          16.4%         11.6%      19.5%      23.3%      19.1%      10.1%        100%
  Total           28.8%         12.3%      18.4%      20.3%      14.2%       6.1%        100%
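To make the ID categories of Table 1-5 concrete, the sketch below computes a point-biserial correlation between a 0/1 item and total scores and assigns the result to a category. The function names and data are hypothetical. Two details are assumptions rather than reported procedure: whether the total score included the item itself, and the treatment of items with no variance (for example, items nobody answered correctly), which are counted here as non-functional.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item and the total scores."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        # Assumption: an item (or test) without variance cannot discriminate.
        return 0.0
    return cov / sqrt(var_x * var_y)


def id_category(id_value):
    """Assign an ID value to the categories used in Table 1-5."""
    if id_value <= 0:
        return "non-functional"
    if id_value < 0.10:
        return "very poor"
    if id_value < 0.20:
        return "poor"
    if id_value < 0.30:
        return "marginal"
    if id_value < 0.40:
        return "good"
    return "very good"


# Hypothetical data for four examinees (not from the study).
item = [1, 1, 0, 0]
totals = [10, 8, 3, 1]
category = id_category(point_biserial(item, totals))
```

In this toy case the two highest scorers are the only ones who answered the item correctly, so the item lands in the top category; with real cloze data most items fell well below that.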

Many-Faceted Rasch Measurement (MFRM)

Many-Faceted Rasch measurement (MFRM) was used to complement our CTT analyses. Using FACETS (Linacre, 2014a), we calculated item difficulty measures (analogous to the item facility, or IF, discussed for CTT) for each test item, and infit mean squares were used to identify misfitting test items (i.e., items that were not functioning properly within the MFRM model). Table 1-6 summarizes the fit statistics for Japan, Russia, and the two combined.

The # Extreme Items column in Table 1-6 shows how many items did not fit the model because nobody answered them correctly. There were 475, 195, and 171 of these items for the Japan, Russia, and Combined samples, respectively. Thus, about 31.8%, 13.0%, and 11.4% of the items were extreme items, respectively. Clearly, the percentage of such items was much higher in the Japan samples than in the Russia samples and the two sets of samples combined.

The # Underfitting Items column in Table 1-6 shows how many items “did not fit the general pattern of responses in the matrix, and can thus be classified as relatively misfitting…” (McNamara, 1996, p. 171). Notice that there were 32, 34, and 43 underfitting items in the analyses for the Japan, Russia, and Combined samples, respectively. Thus, only 2.1%, 2.3%, and 2.9% of the items were underfitting, respectively.

The Total # Misfitting column in Table 1-6 represents the total of both the extreme and underfitting items: 507 (33.9%), 229 (15.3%), and 214 (14.3%) for the Japan, Russia, and Combined samples, respectively. Conversely, the # Fitting Items column represents the total number of items that fit the model according to the Rasch analyses. For the 1,496-item analysis, there were 989, 1,267, and 1,282 of these for the Japan, Russia, and Combined samples, respectively. Thus, about 66.1%, 84.7%, and 85.7% of the items fit the model. These findings underscore two commonsense notions: that many test items when piloted will not function properly and that item function may be related to test-taker origin.

Table 1-6: Summary fit statistics for analyses of 1,496 items by country of origin.

Sample      # Extreme   # Underfitting   Total #      # Fitting   Item    Item         Item          χ²
            Items*      Items            Misfitting   Items       RMSE    Separation   Reliability
Japan          475           32             507          989      1.16      1.53          .70        .00
Russia         195           34             229        1,267      0.82      2.62          .87        .00
Combined       171           43             214        1,282      0.76      2.84          .89        .00

Note: * = Extreme items are those for which no examinee answered correctly.
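The Item Separation and Item Reliability columns of Table 1-6 are linked by a standard Rasch identity, R = G² / (1 + G²), where G is the separation index. The short sketch below applies that identity to the separation values in the table; the function and variable names are introduced here for illustration only.

```python
def reliability_from_separation(separation):
    """Rasch reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


# Item separation values as reported in Table 1-6.
separations = {"Japan": 1.53, "Russia": 2.62, "Combined": 2.84}

reliabilities = {
    sample: round(reliability_from_separation(g), 2)
    for sample, g in separations.items()
}
```

The resulting values (.70, .87, and .89) agree with the Item Reliability column, so the two columns carry the same information on different scales.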

The Item RMSE is the root mean square standard error statistic used in calculating the separation index discussed next, but it can be interpreted on its own as a standard error. The lower the value of RMSE, the better the data fit the model. In this study, the RMSE values were high (ranging from .76 to 1.16), which indicates that a number of items were not fitting the model as well as might be desired (as discussed above).

The item separation index is an estimate of the spread of the item estimates relative to their precision, “the number of standard errors of spread among the items” (Bond & Fox, 2007, p. 59), which is to say that this measure reports reliability in units of standard error. Higher values are desired in this case. Notice that the separation index is higher for the combined data than for the single nationalities, and also higher for the samples in Russia than for those in Japan. In all three cases, the index indicates that the item difficulty estimates are spread out relative to their precision.

The item reliability statistic shown in Table 1-6 is similar to Cronbach’s alpha (Bond & Fox, 2007), and as with Cronbach’s alpha, is on a scale from 0.00 to 1.00. High reliability for items in this case indicates that items’ measures of difficulty are predicted to be ordered similarly across different iterations of Rasch modeling. The analyses indicated moderate item reliability of .70 for the Japan samples, considerably higher reliability of .87 for the Russia samples, and even higher reliability at .89 for the combined samples.

The chi-square (fixed) values test the following hypothesis: “Can this set of elements be regarded as sharing the same measure after allowing for measurement error?” Thus, for this design, the following hypothesis is being tested: Can these items be thought of as equally difficult? Clearly, since all of the chi-square (fixed) statistics in this study were found to be significant (at p < .00), this hypothesis must be rejected; that is, the answer is no, the items cannot be thought of as equally difficult. Thus, the variations in item difficulty estimates are probably due to factors other than chance.

Rasch Vertical Rulers. Tables 1-7 and 1-8 present the vertical rulers that resulted from our FACETS analyses for Japan and Russia. The first column shows the scale for the vertical ruler, which represents the range of scores on a true interval logit scale, centered on 0. Note that FACETS requires at least one category to be fixed (i.e., centered on 0.00) in order to set the parameters of the scale. We chose to center tests on 0.00 because they were the same for both groups.
Because persons and items were the more interesting categories, they were set to float (i.e., non-centered) to reveal their positions relative to one another. In these rulers, the range of logit scores is from low scores at about -6 to high scores at +7. The second column shows the test-taker ability measures for Japan ranging from about 2.5 down to -6. The third column shows the relative difficulty of the 50 cloze tests when used in Japan. The fourth column shows the test item difficulties for Japan with a number of items at +4 (i.e., maximally difficult because nobody answered them correctly) and others ranging down below -5. The fifth column shows the logit scores again, and the sixth column shows the test takers in Russia on the scale ranging from about 6.5 down to -5. The seventh column shows the test versions in Russia. The eighth column shows the test item difficulties in Russia with a number of items at +7 (i.e., maximally difficult in that nobody answered them correctly) and the others ranging down to below -3.

Table 1-7: Vertical rulers for test taker ability, test version difficulty, and test item difficulty for Japan.

[Vertical ruler not reproduced. Columns: Logits (+7 to -6); Test Taker Ability (* = 23, . = 4); Test Version Difficulty (* = 3, . = 1); Test Item Difficulty (X = 48, * = 24).]

Table 1-8: Vertical rulers for test taker ability, test version difficulty, and test item difficulty for Russia.

[Vertical ruler not reproduced. Columns: Logits (+7 to -6); Test Taker Ability (* = 34, . = 28); Test Version Difficulty (* = 4, . = 1); Test Item Difficulty (* = 20, . = 10).]

Comparison of CTT and Rasch Item Analysis Results Table 1-9 shows the frequencies and percentages of items that functioned well in terms of CTT ID and Rasch fit statistics. Notice that according to the CTT framework, only 11.4% and 29.2% of the items for the Japan samples and Russia samples, respectively, worked well in the sense that they produced ID values of .30 or higher, which by extension means that 88.6% and 70.8%, respectively, of the items for the two samples were .29 or lower. In short, based on the CTT ID values, a fairly small number of items—no more than 29.2%—can be said to have been working well for NRT development purposes even in the Russia sample.

Table 1-9: Frequencies and percentages of items that functioned well in terms of CTT ID and Rasch Fit.

              CTT ID    CTT ID       Rasch    Rasch
              .30+      Below .30    Fit      Misfit
Frequencies
  Japan         170       1,326        989      507
  Russia        437       1,059      1,267      229
Percentages
  Japan       11.4%       88.6%      66.1%    33.9%
  Russia      29.2%       70.8%      84.7%    15.3%

The Rasch analysis results in Table 1-9 provide a brighter picture, with larger proportions of items, 66.1% and 84.7% for the Japan samples and Russia samples, respectively, fitting the model and thus being analyzable and useable. The association between nationality and fit (analyzed in a 2 x 2 contingency table with Japan and Russia on one dimension and misfit and fit on the other) indicated a small but demonstrable degree of association (22%) between these two factors (χ² = 139.26, df = 1, p < .00; phi = .22). We conclude from Table 1-9 that the CTT item analysis statistics indicated that only small numbers of items were functioning well, while the Rasch analysis indicated that relatively larger numbers of items fit the Rasch model and were thus analyzable, interpretable, and useable for NRT purposes.

Further analyses provided additional detail. It turned out that the number (and percentage) of the 1,496 items that were functioning well in the CTT item analysis (i.e., with ID = .30+) were as follows: 106 (7.1%) worked for both nationalities; 331 (22.1%) worked uniquely in the Russia samples; and 64 (4.3%) worked uniquely in the Japan samples. Hence, in the CTT analyses, considerable differences occurred in which items were working for each of the nationalities. Similarly, in the Rasch analyses, the number (and percentage) of the 1,496 items which fit were as follows: 938 (62.7%) fit for both nationalities; 329 (22.0%) fit uniquely for the Russia test-takers; and 51 (3.4%) fit uniquely for the Japan test-takers. Again, considerable differences surfaced for which items were fitting for each of the nationalities, but less so than in the CTT analyses.
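The 2 x 2 association between nationality and fit can be recomputed from the misfit and fit counts in Table 1-9. The sketch below uses the standard Pearson chi-square shortcut for a 2 x 2 table without a continuity correction, which is an assumption about how the reported statistic was obtained; the function and variable names are illustrative.

```python
from math import sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Misfit and fit counts from Table 1-9.
japan_misfit, japan_fit = 507, 989
russia_misfit, russia_fit = 229, 1267

chi2 = chi_square_2x2(japan_misfit, japan_fit, russia_misfit, russia_fit)
n_total = japan_misfit + japan_fit + russia_misfit + russia_fit
phi = sqrt(chi2 / n_total)  # effect size for a 2 x 2 table

print(f"chi-square = {chi2:.2f}, phi = {phi:.2f}")
```

Run on these counts, the sketch gives a chi-square of 139.26 and a phi of .22, matching the values reported above.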

Linguistic Characteristics and Rasch Item Fit

Given that the Rasch analysis turned out to be more sensitive and appropriate for analyzing the effectiveness of the items in this study, the remaining analyses were based on the Rasch item fit statistics. The linguistic characteristics considered here were (a) the part of speech of the word in each blank, (b) whether the word was a content or function word, (c) whether it was of Latinate or Germanic origin, and (d) the frequency of the word in the Brown Corpus. Based on previous research (Brown, 1992), we had a reasonable expectation that these linguistic characteristics would have some relationship with item performance.

Parts of Speech and Rasch Item Fit. Table 1-10 shows the item misfit and fit in the Rasch analyses separately for Japan and Russia for the different parts of speech across all 1,496 items, presented in alphabetical order by part of speech. Note that the chi-square (χ²) statistic for items that fit for parts of speech by Japan and Russia was only 13.68 with df = 16, which is not significant (p > .50). The Cramer’s V statistic was .078. Thus, counter to our expectations, the pattern of item fit for the parts of speech does not vary for the two nationalities beyond what would be expected by chance alone, and the association between test-taker linguistic background and parts of speech is only about 7.8%. Of course, none of this means that these frequencies are significantly the same as what could reasonably be expected by chance alone, but it did convince us that there were no variations worth further scrutiny in this table.

Content/Function Words and Rasch Item Fit. Table 1-11 shows the number of content and function words that misfit or fit depending on nationality for the 1,496 items. The χ² statistic for the word type by nationality 2 x 2 contingency table for the frequencies of fitting items was 5.02 with df = 1, which is significant at the .025 level. The phi statistic here was .05. Thus, these fluctuations in frequencies are significantly different from chance at .025 and were associated to a very small degree (5%) with the content versus function distinction. Visual inspection of the percentages shown in this table indicated that (a) fewer items in both the content and function word categories fit in the Japan sample, (b) in the Russia sample, a somewhat higher percentage of function words fit the Rasch model than content words, and (c) in the Japan sample, a considerably higher percentage of function words fit than content words.
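The two effect sizes reported above can be recovered from the chi-square values with the usual formula for Cramer's V, which reduces to phi for a 2 x 2 table. In this sketch the sample size is taken to be the number of fitting items across both samples (989 + 1,267), an assumption based on the statement that the tests were run on the frequencies of fitting items; the function name is illustrative.

```python
from math import sqrt


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V for an n_rows x n_cols table; equals phi when 2 x 2."""
    return sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Assumption: n is the number of fitting items in both samples combined.
n_fitting = 989 + 1267

# Parts of speech: 17 categories by 2 nationalities, chi-square = 13.68.
v_parts_of_speech = cramers_v(13.68, n_fitting, 17, 2)

# Word type: 2 categories by 2 nationalities, chi-square = 5.02.
phi_word_type = cramers_v(5.02, n_fitting, 2, 2)
```

Under that assumption the values round to .078 and .05, in line with the reported statistics, which illustrates how a significant chi-square (as for word type) can still correspond to a very weak association when the number of items is large.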

Table 1-10: Item misfit by linguistic background for different parts of speech for 1,496 items.

Part of Speech        Japan    Japan   Russia   Russia   Total   Japan   Russia   %
                      Misfit   Fit     Misfit   Fit              % Fit   % Fit    Difference
Adjective                86      70       45      111     156     44.9    71.2      26.3
Adverb                   40      63       15       88     103     61.2    85.4      24.3
Auxiliary                 1      22        0       23      23     95.7   100.0       4.3
Conjunction              16      69        7       78      85     81.2    91.8      10.6
Contraction               1       3        1        3       4     75.0    75.0       0.0
Copula                    2      29        2       29      31     93.5    93.5       0.0
Definite Article         13      96       10       99     109     88.1    90.8       2.8
Gerund                    3       8        1       10      11     72.7    90.9      18.2
Indefinite Article        8      27        0       35      35     77.1   100.0      22.9
Modal                     2       7        1        8       9     77.8    88.9      11.1
Noun                    169     187       79      277     356     52.5    77.8      25.3
Number                    7       6        1       12      13     46.2    92.3      46.2
Personal Pronoun          7      61        3       65      68     89.7    95.6       5.9
Preposition              45     174       15      204     219     79.5    93.2      13.7
Pronoun                  10      46        1       55      56     82.1    98.2      16.1
Proper Noun              14      42        8       48      56     75.0    85.7      10.7
Verb                     83      79       40      122     162     48.8    75.3      26.5
Total                   507     989      229    1,267   1,496     66.1    84.7      18.6

Table 1-11: Item misfit for word type (content & function) in the Japan and Russia data for 1,496 items.

Word Type   Japan   Japan  Russia  Russia  Total  Japan  Russia  % Difference
            Misfit  Fit    Misfit  Fit            % Fit  % Fit
Content        424    601     197     828  1,025   58.6    80.8          22.2
Function        83    388      32     439    471   82.4    93.2          10.8
Total          507    989     229   1,267  1,496   66.1    84.7          18.6
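The chi-square, phi, and Cramer's V values reported for the fitting items can be checked directly from the fit columns of Tables 1-10 and 1-11. The following is a minimal sketch, assuming an uncorrected Pearson chi-square (no continuity correction), which is an assumption about how the reported values were obtained rather than something the chapter states. The function names `chi_square` and `cramers_v` are hypothetical helpers written for this illustration.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square (no continuity correction) for an r x c table of counts.

    Returns (chi2, df, n), where n is the total count in the table.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, n


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V; for a 2 x 2 table this is the same quantity as phi."""
    return sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Fitting items only; each row is (Japan fit, Russia fit).
WORD_TYPE_FIT = [
    [601, 828],  # content words (Table 1-11)
    [388, 439],  # function words (Table 1-11)
]
PART_OF_SPEECH_FIT = [  # Table 1-10, in the table's alphabetical order
    [70, 111], [63, 88], [22, 23], [69, 78], [3, 3], [29, 29],
    [96, 99], [8, 10], [27, 35], [7, 8], [187, 277], [6, 12],
    [61, 65], [174, 204], [46, 55], [42, 48], [79, 122],
]

chi2_wt, df_wt, n_wt = chi_square(WORD_TYPE_FIT)
phi_wt = cramers_v(chi2_wt, n_wt, 2, 2)

chi2_pos, df_pos, n_pos = chi_square(PART_OF_SPEECH_FIT)
v_pos = cramers_v(chi2_pos, n_pos, len(PART_OF_SPEECH_FIT), 2)

print(f"Word type:      chi2 = {chi2_wt:.2f}, df = {df_wt}, phi = {phi_wt:.2f}")
print(f"Part of speech: chi2 = {chi2_pos:.2f}, df = {df_pos}, V = {v_pos:.3f}")
```

The same two functions can be applied to the fit columns of any of the other fit-by-nationality tables in this section.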

Germanic/Latinate Word Origin and Rasch Item Fit. Table 1-12 shows the number of Latinate and Germanic words that misfit or fit in the Rasch models constructed for the 1,496 items, used with test-takers from two nationality backgrounds. The χ² statistic for the word origin by nationality 2 × 2 contingency table for the frequencies of fitting items was 6.17 with df = 1, which is significant at the .025 level. The phi statistic here was .05. Thus, these fluctuations in frequencies are significantly different from chance at the .025 level, but nationality was associated to only a very small degree (5%) with the Latinate versus Germanic distinction. Visual inspection of the row percentages shown in this table indicates again that (a) a smaller proportion of items fit in the Japan sample than in the Russia sample, (b) in both the Russia and Japan samples, a considerably higher proportion of the Germanic words fit the Rasch model than Latinate words, and (c) Latinate words were less likely to fit than Germanic words in the Russia sample and considerably less likely in the Japan sample.

Word Frequency and Rasch Item Fit. The point-biserial correlation coefficients between item fit (fit versus misfit) and the raw frequencies of the words in the blanks were .222 for the Japan data and .123 for the Russia data. Because the frequencies had skewed distributions, we also transformed them using a natural log transformation. The point-biserial correlation coefficients between item fit and the transformed frequencies were .363 for the Japan data and .279 for the Russia data, which were significant at p < .01. These results indicate that Rasch item fit was somewhat related to word frequency.
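As an illustration of the point-biserial analysis just described, the sketch below correlates a dichotomous fit indicator (1 = fit, 0 = misfit) with raw and log-transformed word frequencies. The frequency values and fit flags are hypothetical toy data, not the study's data, and the use of log(frequency + 1) to accommodate words with a corpus count of zero is an assumption (the chapter says only that a natural log transformation was used).

```python
from math import log, sqrt


def pearson_r(xs, ys):
    """Ordinary Pearson product-moment correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def point_biserial(flags, values):
    """Point-biserial r between a 0/1 variable and a continuous variable.

    Uses r = (M1 - M0) / s * sqrt(p * q), with s the population standard
    deviation of the continuous variable; this equals Pearson's r.
    """
    n = len(values)
    ones = [v for f, v in zip(flags, values) if f == 1]
    zeros = [v for f, v in zip(flags, values) if f == 0]
    p = len(ones) / n
    mean_all = sum(values) / n
    sd = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_diff / sd * sqrt(p * (1 - p))


# Hypothetical toy data: corpus frequency of the word in each blank and
# whether the item fit the Rasch model (1) or misfit (0).
raw_freq = [69971, 5000, 1200, 300, 45, 12, 3, 800]
item_fit = [1, 1, 1, 0, 1, 0, 0, 1]

# Assumption: add 1 before taking the natural log so zero counts are defined.
log_freq = [log(f + 1) for f in raw_freq]

r_raw = point_biserial(item_fit, raw_freq)
r_log = point_biserial(item_fit, log_freq)
print(f"r_pbi (raw) = {r_raw:.3f}, r_pbi (log) = {r_log:.3f}")
```

Because the point-biserial coefficient is a special case of Pearson's r, either function can be used; the separate formula is shown only to make the computation explicit.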

Table 1-12: Item misfit for word origin (Latinate & Germanic) in the Russia and Japan data for 1,496 items.

Word Origin  Japan   Japan  Russia  Russia  Total  Japan  Russia  % Difference
             Misfit  Fit    Misfit  Fit            % Fit  % Fit
Latinate        219    181     114     286    400   45.3    71.5          26.2
Germanic        288    808     115     981  1,096   73.7    89.5          15.8
Total           507    989     229   1,267  1,496   66.1    84.7          18.6

Pearson correlation coefficients between the item logit scores and the raw word frequencies were -.215 (p < .05) for the Japan data and -.294 (p < .01) for the Russia data. The Pearson correlation coefficients between the item logit scores and the transformed word frequencies were -.356 for the Japan data and -.430 for the Russia data, both of which were significant at p < .01. These correlations were negative because, as we expected, the frequencies decreased as the magnitude of the difficulty estimates increased. Nonetheless, these results indicate that Rasch item difficulty estimates were also somewhat related to word frequency.

Table 1-13 shows the Rasch item fit frequencies separately for the different levels of vocabulary frequency in the Brown Corpus across all 1,496 items for the two nationalities. Examining the percentages of item fit on the right side of Table 1-13, it is easy to see that, below a certain frequency threshold (i.e., as the words become less and less frequent), the infrequent items did not fit the models well. For instance, lexical items that occurred fewer than 1,000 times in the Brown Corpus were much less likely to fit (i.e., accounted for a much lower percentage) than more frequent lexical items in the Japan data. Similarly, lexical items that occurred fewer than 50 times in the Brown Corpus were much less likely to fit than lexical items that occurred more often in the Russia data. This result is intuitive in that less frequent lexical items are likely to be more unpredictably known or not known by test takers than the more frequent items and thus are less likely to fit.

Table 1-13: Item misfit as a function of word frequency (in the Brown Corpus found in Francis & Kučera, 1979) in the Japan and Russia data for 1,496 items.

Frequency in   Japan   Japan  Russia  Russia  Total  Japan  Russia  % Difference
Brown Corpus   Misfit  Fit    Misfit  Fit            % Fit  % Fit
30000+             17    142      11     148    159   89.3    93.1           3.8
20000-29999        31    116       7     140    147   78.9    95.2          16.3
10000-19999         1     20       2      19     21   95.2    90.5          -4.7
5000-9999          18    128       6     140    146   87.7    95.9           8.2
1000-4999          44    166      18     192    210   79.0    91.4          12.4
100-999           143    210      53     300    353   59.5    85.0          25.5
50-99              56     69      20     105    125   55.2    84.0          28.8
10-49              99     73      49     123    172   42.4    71.5          29.1
0-9                98     65      63     100    163   39.9    61.3          21.4
Total             507    989     229   1,267  1,496   66.1    84.7          18.6
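The percentage columns of Table 1-13, and the frequency thresholds discussed in the text (1,000 for the Japan data, 50 for the Russia data), can be recomputed from the misfit and fit counts. The sketch below is illustrative only: the counts are copied from Table 1-13, while the helper names `percent_fit` and `pooled_percent_fit` and the pooling of bands above and below a threshold are additions made for this example.

```python
# (band label, lower bound, Japan misfit, Japan fit, Russia misfit, Russia fit)
BANDS = [
    ("30000+", 30000, 17, 142, 11, 148),
    ("20000-29999", 20000, 31, 116, 7, 140),
    ("10000-19999", 10000, 1, 20, 2, 19),
    ("5000-9999", 5000, 18, 128, 6, 140),
    ("1000-4999", 1000, 44, 166, 18, 192),
    ("100-999", 100, 143, 210, 53, 300),
    ("50-99", 50, 56, 69, 20, 105),
    ("10-49", 10, 99, 73, 49, 123),
    ("0-9", 0, 98, 65, 63, 100),
]


def percent_fit(misfit, fit):
    """Percentage of items that fit, rounded to one decimal place."""
    return round(100 * fit / (misfit + fit), 1)


def pooled_percent_fit(country, keep):
    """Percent fit pooled over every band whose lower bound satisfies keep()."""
    misfit_col, fit_col = (2, 3) if country == "Japan" else (4, 5)
    misfit = sum(band[misfit_col] for band in BANDS if keep(band[1]))
    fit = sum(band[fit_col] for band in BANDS if keep(band[1]))
    return percent_fit(misfit, fit)


# Rebuild the three percentage columns of the table, band by band.
rows = []
for label, _lower, j_mis, j_fit, r_mis, r_fit in BANDS:
    japan = percent_fit(j_mis, j_fit)
    russia = percent_fit(r_mis, r_fit)
    rows.append((label, japan, russia, round(russia - japan, 1)))

for label, japan, russia, diff in rows:
    print(f"{label:>12}  Japan {japan:5.1f}  Russia {russia:5.1f}  diff {diff:5.1f}")

# Pooled fit on either side of the thresholds named in the text.
print("Japan  < 1000:", pooled_percent_fit("Japan", lambda lo: lo < 1000))
print("Japan >= 1000:", pooled_percent_fit("Japan", lambda lo: lo >= 1000))
print("Russia  < 50:", pooled_percent_fit("Russia", lambda lo: lo < 50))
print("Russia >= 50:", pooled_percent_fit("Russia", lambda lo: lo >= 50))
```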

Discussion

A number of research studies (Brown, 1988, 1989, 2013; Brown, Yamashiro, & Ogane, 1999, 2001; Revard, 1990) have examined the issues involved in developing cloze items based on CTT item analysis. This chapter expands on those analyses and is one of a few that apply Rasch analysis to cloze testing. It is also one of the first studies to systematically examine the performances of large numbers of students from two distinct language backgrounds on 50 different cloze tests. Recall that the title of the study was How well do cloze items work and why? With regard to the first part of that question, How well do cloze items work?, the literature on the topic has shown considerable variation in how well cloze tests function, with reliability estimates ranging from .31 to .96 and validity coefficients ranging from .43 to .91 (Brown, 2013). The present study was designed to look more closely at these issues by addressing five research questions.


How do the CTT descriptive statistics, reliability, and item analyses differ for the test taker populations from two different linguistic backgrounds?

The descriptive statistics in Tables 1-1 and 1-2 indicated that the raw score means were low for the 50 thirty-item cloze tests administered to samples from two different linguistic backgrounds. Reliability estimates varied widely depending on the cloze test form and the language background of the group. IF statistics indicated that about one fifth of the test items were contributing effectively to the variance in cloze scores for these different test-taker populations; similarly, ID values indicated that about one fifth were Good or Very Good discriminators. From a CTT perspective, then, many of the 50 tests in this study were not functioning particularly well in terms of central tendency, dispersion, and reliability. These mixed results are consistent with the literature and may be due to the fact that large numbers of the 1,496 individual items were either not working at all or were not particularly effective as NRT items in terms of IF and ID. These data potentially indicate that random deletion patterns may not be the most effective way to build cloze tests; indeed, item creation based on rational selection that considers lexical items with relatively high word frequency may be a more productive strategy for the development of cloze items, either manually or through automatic generation (as advocated by Coniam, 2013).

How do Rasch item difficulty measures differ for the test takers from different linguistic backgrounds?

In the Rasch analyses, the items also proved to be difficult; that is, they were generally suitable for students of high or very high ability levels, as indicated by the logit scores for the two test-taker groups. For both test-taker groups, ability estimates were lower than the item logits in many cases. Thus, as in the CTT analyses, a number of items were suitable for students above the general ability levels of these samples. Nonetheless, the same vertical rulers indicate that a fairly high proportion of items was also suitable for the examinees in these samples.

How do the proportions of functioning cloze items differ between the CTT and IRT analyses, when based on the test taker responses from different linguistic backgrounds?

Table 1-9 presented the frequencies and percentages of items that functioned well (in terms of CTT ID and Rasch fit statistics). As shown there, higher percentages of items were analyzable and usable if considered within the constraints of Rasch analysis. It appears then that Rasch analysis was better suited than CTT item analysis to accounting for items like the ones found in our cloze tests.

In what ways do Rasch item fit patterns differ in terms of factors such as linguistic background and four cloze item linguistic features: parts of speech, word type, word origin, and word frequency?

Given that the Rasch analyses proved to be more sensitive and appropriate for analyzing the effectiveness of the items in this study, we investigated the degree to which parts of speech, word types, word origins, and word frequencies accounted for any differences in how items fit the Rasch model in the Russia and Japan samples. No statistically significant association (p > .50) surfaced in the numbers of items that fit for the two samples due to their parts of speech. However, significant associations (at p < .025) were found for the proportions of items in the Russia and Japan data that fit for both word type (content vs. function) and word origin (Latinate vs. Germanic). Phi statistics indicated that each of these associations was very small (about 5%). In both samples, function words (with 93.2% and 82.4% fitting for the Russia and Japan samples, respectively) were somewhat more likely to fit than content words (with 80.8% and 58.6% fitting for the Russia and Japan samples, respectively), and Germanic words (with 89.5% and 73.7% fitting for the Russia and Japan samples, respectively) were more likely to fit than Latinate words (with 71.5% and 45.3% fitting for the Russia and Japan samples, respectively).

The analyses of the relationship between word frequency and item fit showed somewhat more association. Point-biserial correlation coefficients (rpbi) between whether or not items fit and the (natural log) transformed frequencies of the words in the blanks were .279 for the Russia sample and .363 for the Japan sample. In more detail, items in the Russia sample were more likely to fit (with 84% or more fitting) if they were based on words with frequencies of 50 or more in the Brown Corpus. The results for the Japan sample were a bit less clear. However, in general, items based on words with frequencies of 1,000 or more were considerably more likely to fit the Rasch model.


Which linguistic characteristics will increase the probability of cloze items functioning well during piloting?

What does all of this mean for cloze test design? Based on the results of these analyses, it appears that using higher proportions of frequent words (i.e., with frequencies over 50 in the Brown Corpus) should help produce items that fit for samples like that in Russia. However, for samples like that in Japan, items based on words with frequencies of 1,000 or more appear to be more likely to fit the Rasch model. It is also advisable to use higher proportions of items requiring Germanic words (as opposed to Latinate words), as doing so could help produce somewhat more items that fit for samples like those in Russia and Japan (though somewhat more so for samples like that in Russia). Using higher proportions of items requiring function words (as opposed to content words) may help produce somewhat more items that fit for samples like those in Russia and Japan (though more so for samples like that in Russia). Note that increasing the proportions of function words might increase the degree to which grammar knowledge is being tested. Certainly, sound test development practices (and the results of this study) dictate that the best strategy for producing effective cloze tests is to pilot those tests with larger numbers of items and then use Rasch analysis to select those items that fit the sorts of examinees being tested. If that is not possible, it may help to select items that tend to require function words and words of Germanic origin in the blanks, or more importantly, words that occur frequently in English.

Conclusion

Implicit in the second part of the question posed by the title of this paper is the question: Why do cloze items and tests operate the way they do? Clearly, one reason for the items being as difficult as they proved to be is that they were natural cloze items (i.e., cloze tests developed from passages randomly selected from a large collection of native-speaker texts). As first demonstrated in Brown (1993), such natural cloze tests tend to be difficult even for university-level students of English, especially when scored using the exact-answer scoring method, as was the case in this study. The generally wider dispersion of scores found for the Russia samples may have occurred because (a) these test-takers varied more widely in ability levels than the Japan samples, (b) the potential for variation was greater as a result of their higher means, or (c) both. The differences in reliability may have been due to the higher means in the Russia samples, to the greater variance, to the larger sample sizes, or to differences in the test-takers (e.g., higher motivation, more familiarity with cloze format, etc.).

The fact that much larger proportions of items functioned well in the Rasch item analyses than in the traditional CTT analysis may be explained by the nature of Rasch analyses, which provide item difficulty estimates based on the probability of an average test-taker answering a given item correctly rather than on the proportion of examinees who answered correctly, as in CTT. As a result, the Rasch item difficulty estimates were not sample dependent and were not affected as much as the CTT item analysis statistics by the relative difficulty of the items in these 50 tests for both samples. Hence, we were able to identify greater numbers of functioning items, even those items which might be challenging for the test-takers, and understand at least partially why and how the items were functioning linguistically in interesting and interpretable ways. In addition, the fact that we used multifaceted Rasch analysis in the form of FACETS analysis made it possible to simultaneously analyze (using 10 anchor items administered to all test takers) 50 cloze tests with test-takers and items nested within tests (i.e., with different test takers and items on each test). FACETS analysis also made it possible to link the 50 cloze tests from two nationalities and thereby put the test-taker ability estimates and item difficulty estimates on the same scales for all tests and both nationalities. Thus, we were able to learn that Rasch analysis is more appropriate for our cloze test analysis and revision purposes, indeed, considerably better than traditional CTT analyses.
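The contrast drawn above between a Rasch difficulty estimate and a CTT proportion correct can be illustrated with the dichotomous Rasch model, in which the probability of a correct answer depends only on the difference between test-taker ability and item difficulty (both in logits). The sketch below is illustrative rather than a description of the FACETS estimation used in the study; the ability values and the helper `expected_item_facility` are hypothetical.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """P(correct) under the dichotomous Rasch model; both arguments in logits."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


def expected_item_facility(abilities, difficulty):
    """Expected proportion correct (a CTT-style IF) for a group on one item."""
    return sum(rasch_probability(a, difficulty) for a in abilities) / len(abilities)


# One hypothetical item with a fixed difficulty of 1.0 logits ...
item_difficulty = 1.0

# ... taken by two hypothetical groups that differ in ability.
lower_group = [-1.5, -1.0, -0.5, 0.0, 0.5]
higher_group = [0.0, 0.5, 1.0, 1.5, 2.0]

# The model's difficulty parameter is the same for both groups, but the
# expected proportion correct changes with the group taking the item.
if_lower = expected_item_facility(lower_group, item_difficulty)
if_higher = expected_item_facility(higher_group, item_difficulty)
print(f"difficulty = {item_difficulty:.2f} logits")
print(f"expected IF, lower group  = {if_lower:.2f}")
print(f"expected IF, higher group = {if_higher:.2f}")
```

When ability equals difficulty the probability of success is .50, which is why a test whose item logits sit above most ability estimates yields low raw scores.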
In addition, we found that blanks representing certain categories of words (i.e., function words and words of Germanic origin, and to a greater extent relatively high-frequency words) are more likely to work well for NRT purposes. If we were to revise any or all of the 50 cloze tests in this study by selecting only those items that functioned well from a CTT perspective (as described in Brown, 1988) or from a Rasch perspective (based on the results of this study), we are convinced that very different tests would surface for the Russia and Japan samples because different items were functioning well in the two samples. This is consistent with the starting point for this study, which was that cloze items are just another “family of item types” (Mullen, 1979, p. 21). In fact, cloze is no more than “a technique for producing tests, like any other technique” (Alderson, 1979, p. 226), though, according to this study, cloze tests provide a more effective set of items from the Rasch perspective than from the CTT viewpoint. Thus, there is no reason to “…think that cloze tests are somehow different from other tests” (Brown, 2013, p. 26), and we should no doubt pilot and revise cloze tests just as we would any other tests to tailor them to the specific range of abilities involved. However, we should also consider selecting items based on the recommendations of this study, and thereby extend the notion of rational deletion in useful ways.

References

Alderson, J. C. (1979). The cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13(2), 219-227.
Bachman, L. F. (1982). The trait structure of cloze test scores. TESOL Quarterly, 16(1), 61-70.
Bachman, L. F. (1985). Performance on cloze tests with fixed-ratio and rational deletions. TESOL Quarterly, 19(3), 535-555.
Baker, R. L. (1987). An investigation of the Rasch model in its application to foreign language proficiency testing. Doctoral thesis, University of Edinburgh, UK.
Bond, T., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Brown, J. D. (1988). Tailored cloze: Improved with classical item analysis techniques. Language Testing, 5(1), 19-31.
Brown, J. D. (1989). Cloze item difficulty. JALT Journal, 11(1), 46-67.
Brown, J. D. (1993). What are the characteristics of natural cloze tests? Language Testing, 10(2), 93-116.
Brown, J. D. (1998). An EFL readability index. JALT Journal, 20(2), 7-36.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New York: McGraw-Hill.
Brown, J. D. (2013). My twenty-five years of cloze testing research: So what? International Journal of Language Studies, 7(1), 1-32.
Brown, J. D., Janssen, G., Trace, J., & Kozhevnikova, L. (2013). Using cloze passages to estimate readability for Russian university students: A preliminary study. In M. A. Kulinich, V. A. Levchenko, E. G. Kashina, L. A. Kozhevnikova, & E. A. Sokolova (Eds.), «Профессиональное развитие преподавателей английского языка в условиях модернизации образовательной системы». Материалы международной научно-практической конференции. Самара, 25-26 марта 2013. [English language teacher professional development: Scaling New Heights. A Collection of Conference Papers. Samara, March 25th–26th, 2013]. Samara, Russia: Samara State University.
Brown, J. D., Yamashiro, A. D., & Ogane, E. (1999). Tailored cloze: Three ways to improve cloze tests. University of Hawai‘i Working Papers in ESL, 17(2), 107-129.
Brown, J. D., Yamashiro, A. D., & Ogane, E. (2001). The Emperor’s new cloze: Strategies for revising cloze tests. In T. Hudson & J. D. Brown (Eds.), A focus on language test development (pp. 143-161). Honolulu, HI: University of Hawai‘i Press.
Coniam, D. (2013). A preliminary inquiry into using corpus word frequency data in the automatic generation of English language cloze tests. CALICO Journal, 14(2-4), 15-33.
Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Francis, W. N., & Kučera, H. (1979). Brown Corpus manual. Providence, RI: Brown University Department of Linguistics.
Francis, W., & Kučera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.
Hale, G. A., Stansfield, C. W., Rock, D. A., Hicks, M. M., Butler, F. A., & Oller, J. W. (1988). Multiple-choice cloze items and the Test of English as a Foreign Language: ETS research report series #1. Princeton, NJ: Educational Testing Service.
Hoshino, A., & Nakagawa, H. (2008). A cloze test authoring system and its automation. In Advances in web based learning–ICWL 2007 (pp. 252-263). Berlin: Springer.
Kučera, H., & Francis, W. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lee-Ellis, S. (2009). The development and validation of a Korean C-Test using Rasch Analysis. Language Testing, 26(2), 245-274.
Linacre, J. M. (2014a). Facets many-facet Rasch analysis computer program (v. 3.71.4). Retrieved from: http://www.winsteps.com/facets.htm
Linacre, J. M. (2014b). Winsteps & Facets comparison. Retrieved from: http://www.winsteps.com/winfac.htm
Markham, P. L. (1985). The rational deletion cloze and global comprehension in German. Language Learning, 35(3), 423-430.
Mullen, K. (1979). More on cloze tests as tests of proficiency in English as a second language. In E. J. Briere, & F. B. Hinofotis (Eds.), Concepts in language testing: Some recent studies (pp. 21-32). Washington, DC: TESOL.
Revard, D. (1990). Tailoring the cloze to fit: Improvement of cloze tests through classical item analysis. Unpublished scholarly paper. Honolulu, HI: University of Hawai‘i at Manoa.
Stubbs, M. (2004). Language corpora. In A. Davies & C. Elder (Eds.), The handbook of applied linguistics (1st ed.) (pp. 106-132). Malden, MA: Blackwell.
Taylor, W. L. (1953). Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30(4), 414-438.

Appendix 1-A: List of participating universities.

In Japan:
Dokkyo University
Fukuoka Teacher’s College
Fukuoka University of Education
Fukuoka Women’s University
International Christian University
International University of Japan
Kanazawa University
Kansei Gakuin University
Meiji University
Saga University
Seinan Gakuin University
Soai University
Sophia University
Tokyo University of Agriculture and Technology
Toyama College of Foreign Languages
Toyama University
Toyo Women’s Junior College
Waseda University

In the Russian Federation:
Chelyabinsk Law Institute
Chelyabinsk State University
International Market Institute (Samara)
Kazan branch of the Russian International Academy for Tourism
Kazan Military Institute
Kazan State Technical University
Kolomna State Pedagogical University
Krasnodar State University
Mordovian State University
Moscow State University
Novocherkassk Polytechnic Institute
Novosibirsk State University
Orenburg State University
Presidential Cadet College (Orenburg)
Rostov/Don Institute of Management, Business and Law
Rostov/Don State University
Ryazan State University
Samara Aerospace University
Samara State Academy of Social Sciences and Humanities
Samara State University
Samara State University of Architecture and Civil Engineering
Saratov State Pedagogical University
Saratov State University
Smolensk University for the Humanities
Solykamsk State Pedagogical University
South-Ural State University
St. Petersburg State University
Surgut State University
Syktyvkar State University
Syzran branch of Samara State Technical University
Taganrog Institute of Management and Economics
Taganrog State Pedagogical University
Togliatti Academy of Management
Tomsk Polytechnic University
Ulyanovsk State University Institute for International Relations
Volga State University of Technology (former Mari State Technical University)
Voronezh State University
Voronezh State University of Architecture and Civil Engineering

Appendix 1-B: Example cloze test (Adapted from Brown, 1989).

Name_________________________________________________
(Last) (First)
Native Language________________________________________

DIRECTIONS:
• Read the passage quickly to get the general meaning.
• Write only one word in each blank. Contractions (example: don’t) and possessives (John’s bicycle) are one word.
• Check your answers.
NOTE: Spelling will not count against you as long as the scorer can read the word.
EXAMPLE: The boy walked up the street. He stepped on a piece of ice. He fell (1)____________, but he didn’t hurt himself.

A FATHER AND SON

Michael Beal was just out of the service. His father had helped him get his job at Western. The (1)____________ few weeks Mike and his father had lunch together almost every (2)____________. Mike talked a lot about his father. He was worried about (3)____________ hard he was working, holding down two jobs. “You know,” Mike (4)____________, “before I went in the service my father could do just (5)____________ anything. But he’s really kind of tired these days. Working two (6)____________ takes a lot out of him. He doesn’t have as much (7)____________. I tell him that he should stop the second job, but (8)____________ won’t listen. (continues for 30 items...)

Answer Key
past, day, how, said, about, jobs, energy, he, …

Appendix 1-C: Actual word frequencies.

Count  Words
109    the
50     of
44     and
37     to
33     a
32     in
18     as
16     he
13     at, it, with
12     for
11     is, was
10     I, that
9      on
8      be, his, not, you
7      about, all, had, how, no, they, were
6      by, first, said, so, who
5      are, but, from, or, up, very, what
4      been, Christ, feet, full, have, own, people, Piglet, she, some, still, their, there, these, we, went, when, which
3      behind, boy, cage, could, day, fire, found, good, groups, him, if, inches, into, Jacob, light, long, might, more, never, only, powder, shy, than, think, thought, two, wings, work
2      against, also, America, an, any, because, before, bills, book, both, came, cattle, children, company, compared, control, dead, develop, did, dollar, during, each, every, father, few, fish, friends, Galileo, got, here, hit, immediately, Kano, knows, let, limited, M.I.6, many, May, me, mine, months, most, need, negative, new, off, one, our, out, percent, possible, quartz, religion, Saint, say, school, see, something, spread, system, tall, then, this, though, too, town, while, wife, working, world
1      Other words with 1 each

Total  1,496  1,337

Appendix 1-D: FACETS input file for 50 cloze tests with 10 anchor items.

The abbreviated FACETS input file shown below illustrates how multiple tests can be entered into a single analysis. Note that this analysis contained three facets (test-takers, test version, and items), and the coding for each is displayed, with comments separated from analyzable data by a semicolon. Test-takers were coded by group, test number, and examinee ID for ease of identification. Test versions were coded by the corresponding cloze test number. Items were coded both by test version and item number (e.g., item 430 = item 30 on test 4). Notice that the first set of items is designated as anchor items and includes a measurement value.

Facets = 3            ; Test takers, test version, items
*
Labels =
1, Test-Takers        ; Coded by group, test number, and examinee ID
100101                ; Group 1, test 01, examinee ID 01
100102                ; Group 1, test 01, examinee ID 02
…
105041                ; Group 1, test 50, examinee ID 41
*
2, test version       ; 50 tests
01 = cloze 1          ; Cloze test 01
…
50 = cloze 50         ; Cloze test 050
*
3, Items, A           ; Test items with anchor values (A)
31 = A31,-0.5         ; Anchor item 1 with a Measure of -0.50
…
40 = A40,-0.17        ; Anchor item 10 with a Measure of -0.17
101 = T01-01          ; Cloze test 01, item 01
…
130 = T01-30          ; Cloze test 01, item 30
201 = T02-01          ; Cloze test 02, item 01
…
5030 = T50-30         ; Cloze test 50, item 30
*
Data =                ; Data with two response lines per examinee.
100101,1,101-130,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0; Examinee #1 cloze item responses
100102,1,101-130,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
…
105041,50,5001-5030,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
100101,1,31-40,0,0,0,,,0,0,0,0,0; Test taker #1 anchor item response
100102,1,31-40,0,0,1,,,0,1,0,0,0
…
105041,50,31-40,0,1,1,,,1,1,0,0,0
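The coding scheme described in this appendix can be unpacked with a few lines of code. The sketch below is a reading aid, not part of FACETS: the function names are hypothetical, the six-digit test-taker layout (one digit for group, three for test, two for examinee) is inferred from the examples 100101 and 105041, and returning test number 0 for the anchor items 31-40 is a convenience adopted here. Blank responses are treated as missing.

```python
def decode_item(code):
    """Split an item code into (test number, item number).

    Example from the appendix: 430 is item 30 on test 4. Anchor items
    (31-40) carry no test prefix, so they come back with test number 0.
    """
    test, item = divmod(int(code), 100)
    return test, item


def decode_test_taker(code):
    """Split a test-taker code into (group, test number, examinee ID).

    Assumes a 1 + 3 + 2 digit layout, inferred from 100101 and 105041.
    """
    digits = f"{int(code):06d}"
    return int(digits[0]), int(digits[1:4]), int(digits[4:6])


def expand_data_line(line):
    """Expand one data line into (test taker, test, item code, response) tuples.

    Anything after a semicolon is treated as a comment. Empty response
    fields become None (missing).
    """
    line = line.split(";", 1)[0].strip()
    taker, test, item_range, *responses = line.split(",")
    first, last = (int(part) for part in item_range.split("-"))
    items = range(first, last + 1)
    if len(responses) != len(items):
        raise ValueError("number of responses does not match the item range")
    return [
        (int(taker), int(test), item, int(resp) if resp.strip() else None)
        for item, resp in zip(items, responses)
    ]


anchor_line = "100101,1,31-40,0,0,0,,,0,0,0,0,0; Test taker #1 anchor item response"
observations = expand_data_line(anchor_line)

print(decode_item(430))           # item 30 on test 4
print(decode_test_taker(105041))  # group 1, test 50, examinee 41
print(observations[:4])
```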

CHAPTER TWO
ESTIMATING ABSOLUTE PROFICIENCY LEVELS IN SMALL-SCALE PLACEMENT TESTS WITH PREDEFINED ITEM DIFFICULTY LEVELS
KAZUO AMMA

Abstract In traditional placement tests a candidate’s proficiency level is assessed based on the total test score. This estimation is inappropriate for two reasons. Firstly, it may be affected by the arbitrary combination of test items with varying difficulties. Too many relatively difficult items might lead to underestimating a candidate’s true proficiency level, while too many easy items might result in overestimation. Secondly, looking at the total placement test score is not informative enough because it does not show the absolute proficiency level of the candidates. This chapter proposes using a logistic regression analysis, which properly assesses a candidate’s proficiency level as well as the confidence interval, provided the difficulty level of individual test items is defined in advance. The difficulty scale can be any standardized proficiency scale (e.g., the Common European Framework of Reference) as long as the test item difficulty is projected on it. This technique can further allow continuous estimation for incomplete performance (i.e., when an open-ended answer is partially correct) as well as binary scoring (i.e., correct/incorrect). As the main output of this kind of analysis is the proficiency level on the difficulty scale, candidates can be placed directly in the corresponding class level. As long as item difficulty information is provided, the estimation can be conducted regardless of the number of candidates taking the placement test and it can be applied to individualized online learning programs.


Introduction The goal of a placement test is to place the candidates on a proficiency scale with properly defined descriptions of the target language behaviour. The assessment of the candidates’ proficiency level should be absolutely stipulated (i.e., criterion-referenced) and thus should not be affected by arbitrary addition/deletion of test items. In other words, the candidate’s proficiency measure must stay constant however many items there are in the test that are above or below his/her proficiency level. If test items are too difficult, the candidate will hardly answer them correctly; if they are too easy, the candidate will quite probably get the answers right. But these results should not affect a candidate’s proficiency assessment. As an alternative to traditional score-based assessment, this chapter proposes a psychometric solution to estimate the candidates’ proficiency level which refers to a set of proficiency criteria defined in advance. Such assessment has often been done manually and subjectively with reduced reliability as a result. Logistic regression analysis, however, is a statistical tool that calculates the estimated mean proficiency level of a candidate as well as the range of the confidence interval. The chapter also reports on two candidates in a small-scale placement test and shows how the logistic regression analysis describes the different characteristics of their proficiency. The discussion that follows employs a psychometric rather than a psychological perspective (Henning, 1992). Although placement testing involves a number of issues such as test design, reliability, validity, and decision-making (Fulcher, 1997; Plakans & Burke, 2013; Wall, Clapham, & Alderson, 1994), the setting of item difficulty and the streaming of difficulty levels in the examples included are assumed to be accurate and reliable, in order to make the argument simple and clear.
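To make the idea concrete, the following is a minimal sketch of how a logistic regression can turn one candidate's right/wrong responses on items of known difficulty into a single proficiency level. It is not the chapter's own procedure: the response pattern is hypothetical, the ten levels follow the ten difficulty levels mentioned later in the chapter, taking the proficiency estimate to be the level at which the fitted probability of a correct answer is .50 is an assumption, and the confidence interval is not computed here.

```python
from math import exp


def fit_logistic(levels, correct, iterations=50, tol=1e-10):
    """Fit P(correct) = 1 / (1 + exp(-(a + b * (level - mean)))) by Newton-Raphson.

    Returns (a, b, mean). Levels are centred on their mean for numerical
    stability. Assumes the responses are not perfectly separated by level.
    """
    n = len(levels)
    mean = sum(levels) / n
    xs = [x - mean for x in levels]
    a = b = 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + exp(-(a + b * x))) for x in xs]
        # Gradient of the log-likelihood.
        grad_a = sum(y - p for y, p in zip(correct, ps))
        grad_b = sum(x * (y - p) for x, y, p in zip(xs, correct, ps))
        # Information matrix entries (negative Hessian).
        ws = [p * (1.0 - p) for p in ps]
        h_aa = sum(ws)
        h_ab = sum(w * x for w, x in zip(ws, xs))
        h_bb = sum(w * x * x for w, x in zip(ws, xs))
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * grad_a - h_ab * grad_b) / det
        step_b = (h_aa * grad_b - h_ab * grad_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b, mean


def estimated_level(levels, correct):
    """Difficulty level at which the fitted probability of success is .50."""
    a, b, mean = fit_logistic(levels, correct)
    return mean - a / b


# Hypothetical candidate: one item at each of ten difficulty levels,
# 1 = correct, 0 = incorrect.
item_levels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
responses = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

a, b, mean = fit_logistic(item_levels, responses)
level = estimated_level(item_levels, responses)
print(f"slope = {b:.3f}, estimated proficiency level = {level:.2f}")
```

Because the estimate depends on where the candidate's responses change from mostly correct to mostly incorrect along the difficulty scale, adding further very easy or very hard items affects it far less than it would affect a raw total score.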

Background

Although the estimation procedure presented here is independent of any particular language proficiency/difficulty model in theory, its practical application is based on a linear grading system of proficiency/difficulty. Among various second language (L2) proficiency scales the most comprehensive and influential is the Common European Framework of Reference for Languages (CEFR). It describes what a learner can do when he/she has reached a certain proficiency level. The following are descriptors of overall reading comprehension in six levels of proficiency (Council of Europe, 2001, p. 69). Table 2-1 refers only to reading skills since the placement test to be described later in the chapter deals almost exclusively with reading comprehension.

Table 2-1: Descriptors of overall reading comprehension (Source: Council of Europe, 2001, p. 69).

Level  Descriptor
A1     Can understand very short, simple texts a single phrase at a time, picking up familiar names, words and basic phrases and rereading as required.
A2     Can understand short, simple texts on familiar matters of a concrete type which consist of high frequency everyday or job-related language. Can understand short, simple texts containing the highest frequency vocabulary, including a proportion of shared international vocabulary items.
B1     Can read straightforward factual texts on subjects related to his/her field and interest with a satisfactory level of comprehension.
B2     Can read with a large degree of independence, adapting style and speed of reading to different texts and purposes, and using appropriate reference sources selectively. Has a broad active reading vocabulary, but may experience some difficulty with low frequency idioms.
C1     Can understand in detail lengthy, complex texts, whether or not they relate to his/her own area of speciality, provided he/she can reread difficult sections.
C2     Can understand and interpret critically virtually all forms of the written language including abstract, structurally complex, or highly colloquial literary and non-literary writings. Can understand a wide range of long and complex texts, appreciating subtle distinctions of style and implicit as well as explicit meaning.

In the application of the estimation procedure to the placement test discussed in this chapter, cognitive complexity was assumed to grow in equal steps as the level rises. Although the CEFR scale does not guarantee this assumption, the independent variable (i.e., the difficulty/proficiency level) had to be on an interval scale in order to conduct a logistic regression analysis, which will be explained in the next section. With the CEFR scale in mind as a reference, 10 difficulty levels were set in the actual placement test. The five major grades from Basic to Advanced were arranged in progressive steps, and each major grade had two minor steps to allow adjustment of relative difficulty/easiness within the major grade (see Table 2-2).

Table 2-2: Item difficulty levels and required proficiency in the placement test (Adapted from Amma, 2013).

Levels  Grade              Task required
10, 9   Advanced           Can critically evaluate the text (corresponding to C2).
8, 7    High-Intermediate  Can reformulate messages (corresponding to C1).
6, 5    Low-intermediate   Can understand simple implicatures (corresponding to B1 and B2).
4, 3    Beginning          Can understand literal meaning of sentences (corresponding to A2).
2, 1    Basic              Can understand short and simple ideas (corresponding to A1).

In Table 2-2 above, the correspondence between the tasks involved in the placement test and the can-do descriptors in the CEFR is not clearly guaranteed. As Weir (2005) contends, “CEFR is not seen as a prescriptive device but rather a heuristic, which can be refined and developed by language testers to better meet their needs” (p. 298). The CEFR is vague and fails to take account of contextual factors such as purpose of task completion, response format (e.g., true/false or short answer), time constraints for the task, channel of communication (telephone, face to face, etc.), discourse type, text length, topic or content knowledge, lexical competence, and structural and functional competence (Weir, 2005). Thus, the CEFR should include descriptions comparable to language tasks and “operationalize criterial distinctions between levels in their [test writers’] tests” (Weir, 2005, p. 298).

Attempts to clarify the target proficiency do not seem to have been very successful so far. In a recent study on the identification of criterial features of learner language, Hawkins and Buttery (2010) described learner error types of grammatical structures comparable to the CEFR levels, based on the understanding that the “CEFR levels are underspecified with respect to key properties that examiners look for when they assign candidates to a particular proficiency level and score in a particular L2” (p. 2). However, the more specific the descriptor becomes, the less likely the target task is to be generalisable, since the context, linguistic or extralinguistic, may act as a determining factor, including such performance variables as content knowledge and willingness to communicate. For instance, the statement ‘It’s a fine day’ uttered by a cyclist in heavy rain can be either incorrect, if taken literally, or correct, if taken ironically. Thus, it is often difficult to establish an accurate correspondence between a single language item, especially in grammar and vocabulary, and a proficiency level.

Staying in the realm of competence, Harsch and Rupp (2011) propose a ‘level-specific approach’ to assessing writing skills in which “tasks are used that are each targeted at one specific level,” thus “written responses of the students are then assessed by having trained raters assign a fail/pass rating using level-specific rating instruments” (p. 2). The contrast between ‘level-specific’ and ‘multilevel’ approaches is comparable to certain aspects of language literacy. For instance, whether a foreign language learner can write his/her name correctly in the generally accepted orthography of the target language pertains to the level-specific domain, i.e., a learner at a certain level can perform the task either correctly or incorrectly, with little grey area in between. On the other hand, the absolute size and depth of the learner’s vocabulary can be assessed only through the relative relevance of its use in the discourse he/she creates, i.e., a learner with a poor vocabulary may be able to communicate better than someone with a richer vocabulary in certain situations. According to Harsch and Rupp (2011), some examples of specific purposes of tasks for B1 are “pass on information” and “provide reasons for actions and comments”, and text types for B1 are “write notes and messages”, “write personal letters, simple formal letters and emails”, and “write reports/articles” (p. 12). Although the tasks are specifically defined, the success/failure of the candidates’ performance still depends to a large extent on the rater’s subjective judgment.
Meanwhile, the task characteristics assigned to the CEFR levels by Harsch and Rupp (2011) are linearly prescribed, from “very short prompts on concrete topics” (A1) to “topics of personal as well as general interest” (B1) to “concrete as well as abstract, also unfamiliar topics” (C1) (p. 12). The present study first follows this binary pass/fail judgment criterion in the placement test, and considers later whether a partial acknowledgement of success (e.g., assigning Level 4 to an incomplete performance for a Level 6 task) is plausible.

Throughout this chapter, it is assumed that the CEFR levels and the proficiency levels in the present placement test refer to points on a proficiency/difficulty scale, not to bands of proficiency/difficulty that stretch from one point on the scale to another. The names of grades in Table 2-2 were given merely to help the reader understand the direction of the scale. Thus, if a candidate’s proficiency is judged as above Level 4 but below Level 5, for instance, he/she should be placed in a class ‘Low-intermediate (lower)’.

To conclude this section, the criteria in Table 2-2 are one manifestation of the general guidelines of the CEFR. The accuracy of the criteria of the placement test may well affect the result of the analysis, but this is a separate issue; the focus here is on the psychometric procedure of placement, on the assumption that the criteria adopted in this study are appropriate for the present purpose.

Logistic Regression Analysis

The main problem with traditional test scoring pertains to its reliability. Traditionally, the proficiency level of test takers is estimated by the total score of a test, while the validity of test items, i.e., whether their difficulty matches the test taker’s ability, is not guaranteed. As a result, if a test writer has written too many easy items, the candidate will get a relatively high score. In contrast, if the test writer prepares too many difficult items, the candidate is more likely to get a lower score than when he/she is provided with appropriate items. Consider an imaginary case of item difficulty levels (Table 2-3), where it is assumed that the difficulty rises in the following order: A1 < A2 < B1 < B2 < C1.

Table 2-3: Description of imaginary test items.

Item        #1  #2  #3  #4  #5
Points      10  10  10  10  10
Difficulty  A1  A2  B1  B2  C1

If a candidate’s true proficiency (which is yet unknown) is just above A2, she/he will most likely pass #1 and #2 but fail the rest of the items, thus obtaining 20 points out of 50 (or 40%). The rationale behind this prediction is based on the relative distance between item difficulty and test-taker ability. Henning (1987) describes the logic concisely as follows: “The farther person ability is below item difficulty, the more unlikely will be success in responding to the item. Similarly, the farther person ability is above item difficulty, the more unlikely will be failure in responding to the item.” (p. 122)

This pass rate will change if we add an item #6 with C1 difficulty: (10 + 10 + 0 + 0 + 0 + 0) / 60 = 33.3%. If we instead add an item #6 with A1 difficulty, it becomes (10 + 10 + 0 + 0 + 0 + 10) / 60 = 50%. Thus, setting a cut-off score by an absolute point is arbitrary and unreliable. As long as the candidates are placed in relative order (as in ordinary entrance examinations), this scoring system will work. But if the decision maker wants to know what they can do, score-based assessment does not provide the necessary information.

Another problem with the traditional approach is that it is difficult to specify the test taker’s proficiency level on an absolute scale. In the case of ordinary entrance examinations, the decision maker may accept the candidates from the top scorer down to someone on the borderline. But in placement tests, information on the rank order of test takers alone does not help the decision maker understand what proficiency level they are at.

To solve these problems, the use of nominal logistic regression analysis is proposed here. Simply put, it is a nonlinear regression model for estimating the probabilities of the categories of a dependent variable when a continuous variable is given as the independent variable. In the example of Table 2-3 above, if a candidate repeatedly passes items with A2 difficulty and fails items with B1 difficulty, his/her ability is estimated to lie somewhere between A2 and B1, even if the responses include some errors. Nominal logistic regression analysis deals with this kind of case, where the item difficulty level is a continuous variable and the pass/fail response is a categorical variable. In mathematical terms, the probability of a candidate’s performance (i.e., pass/fail) is predicted by equation (1):

$$p(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \tag{1}$$

where $x$ represents the difficulty level and $p(x)$ represents the probability of either passing or failing the item. $\beta_0$ and $\beta_1$ are parameters that characterize this candidate, and $\exp$ denotes the exponential function, i.e., a power of $e$, the base of the natural logarithm (approximately 2.71828). The candidate’s ability matches the item difficulty at the point where the probability of the response is 0.5. This formula for logistic regression can be found in various books on multivariate statistics, e.g., Lloyd (1999).

Logistic regression is an analytic method often used in item analysis. Figure 2-1 below shows an example of a graphical representation of logistic regression analysis applied to a test item in an actual test conducted for Japanese university students of English as a foreign language (EFL) (N = 316), in which one grammatically correct sentence had to be chosen out of four options: (a) Grandma went shopping, (b) Grandma went to shop, (c) Grandma went shop, and (d) Grandma went to shopping (Amma, 2001). Note that the responses are reduced to either ‘Pass’ (for option (a)) or ‘Fail’ (for other options or no answer).


The graphical representation of probability of responses created by the statistical package JMP (SAS Institute, 2012) shows how likely one is to pass this item given a certain proficiency level. The horizontal axis shows the candidates’ proficiency by means of the total score of the test. The vertical width of the bottom layer at the total score of a certain candidate indicates his/her probability of correct response. The total score at which the vertical line crosses the 50% probability level corresponds to the estimated difficulty of the item.

Figure 2-1: Sample representation of logistic regression analysis of a test item (Adapted from Amma, 2001).
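Equation (1) and the 50% point it defines can be evaluated numerically. The following is a minimal sketch, not the chapter’s JMP procedure: the function names and the parameter values `beta0 = 9.0` and `beta1 = -1.5` are hypothetical, chosen only so that the probability of passing falls as difficulty rises.

```python
import math


def pass_probability(x, beta0, beta1):
    """Equation (1): probability of the modelled response at difficulty level x."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def level_at_half(beta0, beta1):
    """Difficulty level at which the probability is 0.5, i.e. beta0 + beta1 * x = 0."""
    return -beta0 / beta1


# Hypothetical parameters for one candidate (not taken from the chapter).
# The negative slope means that passing becomes less likely as items get harder.
beta0, beta1 = 9.0, -1.5

for level in range(1, 11):
    print(level, round(pass_probability(level, beta0, beta1), 3))

print("level at which p = 0.5:", level_at_half(beta0, beta1))  # 6.0
```

Setting the exponent in equation (1) to zero gives the level at which the two responses are equally likely, which is why the estimate reduces to the ratio of the two parameters.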

Unlike ordinary logistic regression analyses, in which all test takers’ responses to a single item are analysed together, the author proposes a single-person summary of all responses in one test. If we accumulate the pass/fail responses to items of varying difficulty, we can estimate the candidate’s proficiency level, which is indicated by the vertical line that crosses the 50% pass/fail probability level. Figure 2-2 is a sample graphical output of a logistic regression analysis, taken from another study (Amma, 1990). It illustrates how an S-shaped logistic curve predicts the transition of the pass/fail probability (vertical axis) of test items as the difficulty of the items (horizontal axis) varies. The proficiency of this candidate (ID = S005) corresponds to the point on the horizontal axis at which the probability measures 0.5, indicated by the dotted line. An S-shaped (or reverse-S-shaped, depending on the order of categories in layers) logistic curve can be observed in any sound proficiency test. Nominal logistic regression is a kind of non-linear regression for estimating categorical probability when the independent variable is continuous and the dependent variable is categorical (JMP, 2002).

Figure 2-2: Sample representation of logistic regression analysis of a candidate (Adapted from Amma, 1990).

Linacre (1987) describes a very basic mechanism of calculating proficiency in logits: “Let’s say that we know that a particular person has a 75% chance of succeeding on a question about state capitals. This 75% probability of success means that he has 3 chances succeeding to 1 of failing, so that the scale value is the natural logarithm of 3/1 = log (3) = 1.1 logarithmic units (“logits”). A 50% chance of success, or a 50% chance of failure would be 1 chance of succeeding to 1 chance of failing, giving a scale value of the logarithm of 1/1 = log (1) = 0.” (pp. 4-5)
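The arithmetic in this quotation can be checked in a few lines. This is only an illustrative sketch; the function name `logit` is chosen here and is not taken from Linacre’s BASIC program.

```python
import math


def logit(p_success):
    """Natural logarithm of the odds of success, in logits."""
    return math.log(p_success / (1.0 - p_success))


print(round(logit(0.75), 2))  # 3 chances of succeeding to 1 of failing -> 1.1
print(round(logit(0.50), 2))  # 1 chance to 1 -> 0.0
```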

Linacre’s BASIC program can produce a summary report of a candidate after he/she has taken 10 test items (see Table 2-4 below). Because Linacre’s interest was in picking a proper item for efficient estimation of a person’s proficiency in a computer adaptive test (CAT), the item difficulty information was already stored in the item bank, and this information was to be re-estimated by Rasch model calculation every time the test was administered. This is where a placement test differs from CAT. In CAT, the candidate’s proficiency is obtained by convergence over several trials, and the item to be presented next is chosen after every trial. In a placement test, in contrast, all items must be prepared in advance and the candidate’s proficiency is estimated after all the items have been presented. In other words, a placement test may include improper items that are either too easy or too difficult, even though the level setting of the individual items is correct, and such items are likely to affect the result in a traditional assessment method.

Table 2-4: Summary report of a candidate (Adapted from Linacre, 1987, p. 25).

Item  Difficulty  Right/Wrong
2     96          Right
24    99          Right
1     104         Wrong
25    114         Right
7     106         Wrong
13    111         Right
12    105         Right
15    109         Wrong
3     85          Surprisingly Wrong
18    103         Right

Note: This candidate scored in the range from 101 to 115 at about 108 after 10 questions.

The advantage of logistic regression analysis is that the estimated proficiency level is independent of the combination of easy items and difficult items, as long as the responses are consistent. Figure 2-3 below is the result of a simulation using the data in Table 2-5, where there are supposedly too many easy items.

Table 2-5: Simulation 1.

Item  Difficulty  Response
#1    2           Pass
#2    2           Pass
#3    2           Pass
#4    2           Pass
#5    2           Pass
#6    2           Pass
#7    2           Pass
#8    2           Pass
#9    8           Fail
#10   8           Fail

Figure 2-3: Logistic regression analysis of Simulation 1.

The estimated proficiency level is 5.0. Compare it with Figure 2-4 based on another set of simulation data in Table 2-6 where there are too many difficult items.

Table 2-6: Simulation 2.

Item  Difficulty  Response
#1    2           Pass
#2    2           Pass
#3    8           Fail
#4    8           Fail
#5    8           Fail
#6    8           Fail
#7    8           Fail
#8    8           Fail
#9    8           Fail
#10   8           Fail

Figure 2-4: Logistic regression analysis of Simulation 2.

The estimated proficiency level is again 5.0. This robustness alone makes logistic regression analysis advantageous over the traditional method of simple summing.
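For contrast, the raw-score side of this comparison can be worked through directly. The sketch below recomputes the pass rates for the imaginary items of Table 2-3 and for the two simulations; the helper `pass_rate` and the one-point-per-item scoring of the simulations are assumptions made for illustration only.

```python
def pass_rate(points_earned, points_possible):
    """Raw score as a percentage of the points available."""
    return 100.0 * sum(points_earned) / sum(points_possible)


# Table 2-3: a candidate just above A2 passes #1 and #2 only (10 points each).
base = [10, 10, 0, 0, 0]
print(round(pass_rate(base, [10] * 5), 1))         # five original items: 40.0
print(round(pass_rate(base + [0], [10] * 6), 1))   # extra C1 item, failed: 33.3
print(round(pass_rate(base + [10], [10] * 6), 1))  # extra A1 item, passed: 50.0

# Simulations 1 and 2: the same behaviour (pass at Level 2, fail at Level 8)
# with a different mix of easy and difficult items, one point per item.
simulation_1 = [1] * 8 + [0] * 2
simulation_2 = [1] * 2 + [0] * 8
print(pass_rate(simulation_1, [1] * 10))  # 80.0
print(pass_rate(simulation_2, [1] * 10))  # 20.0
```

The same response behaviour thus yields raw scores of 80% and 20%, depending only on how many easy and difficult items the test writer happened to include.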


Estimation of Proficiency Level and Confidence Interval

The purpose of the present placement test was to judge whether the two candidates who wished to be transferred to another department were capable of the language tasks involved in the regular course. That is, whether the candidates pass or fail depends entirely on their estimated proficiency levels. Unlike in ordinary entrance examinations, the decision maker can reject candidates if they do not reach the required proficiency level. The test material included several types of questions, from lower-order reading comprehension to higher-order inferencing and paraphrasing or summarising. Formal grammar and vocabulary were also involved as part of the reading skill. The test items were partly multiple-choice and partly open-ended. Below are two sample test items, one multiple-choice and the other open-ended.

1. Education
The literacy rate in Finland is 99% and the number of newspapers and books printed per capita is one of the highest in the world. The nine-year comprehensive school (peruskoulu) is one of the most equitable systems in the world: tuition, books, meals and commuting to and from school are free. All Finns learn Swedish and English in school and many also study German or French. (Source: Lehtippu, 1996, p. 31).

#3. How much is the education cost in Finland?
a. Basically free.
b. Parents pay 10% of the entire cost.
c. The state supports about half of the expense.
d. Free only for primary and secondary education.

2. The Conquest of Reality
The idea of a revival was closely connected in the minds of the Italians with the idea of a rebirth of ‘the grandeur that was Rome’. The period between the classical age, to which they looked back with pride, and the new era of rebirth for which they hoped was merely a sad interlude, ‘The Time Between’. Thus the idea of a rebirth or renaissance was responsible for the idea that the intervening period was a Middle Age, and we still use this terminology. (Source: Gombrich, 1984, pp. 167-169).

#14. Why was the period before the Renaissance called the ‘Middle Age’ (underlined part)?

What was special about this test was that, unlike in ordinary entrance examinations, the test writer specified a difficulty level for each item as he wrote the test. Since the test writer was in contact with the teaching staff, who were familiar with the level of teaching and the goals of the curriculum, it was easy to connect the difficulty levels to the proficiency required in the specific course (see Table 2-2). After the test was administered, the candidates’ individual responses to open-ended questions were judged as either pass or fail, depending on whether they satisfied the required proficiency of the item in question. In the case of multiple-choice items, the responses were simply pass or fail. The rater’s job was to calculate the estimated proficiency level of the candidates using the pass/fail information. For example, Candidate A passed items #6, #10, and #17, whose difficulty levels were all 6, but failed items #8 and #12, which were at Level 7. From this fact alone we may infer that her proficiency estimate is somewhere between 6 and 7. But we also have to consider the contradictory responses. She failed the relatively easy item #19 (Level = 6), and passed the relatively difficult item #14 (Level = 7). Candidate B had more such contradictory responses (see Table 2-7).

Table 2-7: Item difficulty and responses (arranged in the order of difficulty levels).

Item  Level  Candidate A  Candidate B
#1    2      Pass         Pass
#2    2      Pass         Pass
#3    4      Pass         Pass
#4    4      Pass         Fail
#5    4      Pass         Fail
#7    5      Pass         Pass
#11   5      Pass         Pass
#6    6      Pass         Pass
#9    6      Pass         Pass
#10   6      Pass         Fail
#19   6      Fail         Fail
#17   6      Pass         Fail
#8    7      Fail         Fail
#12   7      Fail         Fail
#14   7      Pass         Pass
#16   8      Fail         Fail
#20   8      Fail         Fail
#13   10     Fail         Fail
#15   10     Fail         Fail
#18   10     Fail         Fail


A logistic regression analysis conducted by JMP yielded Figures 2-5 and 2-6 indicating the probability of pass and fail as a function of the difficulty level. The difficulty level (‘Predicted Adjusted level’) at which the fail rate is 0.50 is the point of the candidate’s proficiency level. The two candidates’ proficiency levels and their 95% confidence intervals (indicated by the arrows in Figures 2-5 and 2-6) were calculated by the ‘inverse prediction’ procedure (Table 2-8).

Figure 2-5: Logistic fit of Candidate A with confidence interval. Bottom layer = ‘fail’.


Figure 2-6: Logistic fit of Candidate B with confidence interval. Bottom layer = ‘fail’.

Table 2-8: Statistics of the candidates’ proficiency levels and confidence intervals.

Candidate  Level  Lower 95%  Upper 95%
A          6.924  6.600      7.482
B          6.925  6.216      8.001

In the actual data, Candidate A and Candidate B are estimated to have similar levels of proficiency, close to High-Intermediate in Table 2-2. But the wider confidence interval for Candidate B shows that this estimate is less stable than that for Candidate A. Now that the proficiency levels have been obtained, the next step is to judge whether each candidate will pass or fail, depending on the critical proficiency level that the institution requires.
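The single-person estimation described in this section can be sketched in plain Python. The code below fits a logistic curve to Candidate A’s pass/fail responses from Table 2-7 by Newton-Raphson and reads off the level at which the pass probability is 0.5. Several choices are assumptions of this sketch rather than the chapter’s procedure: every item counts once (no weights), and the interval is a delta-method approximation, which may not be what JMP’s inverse prediction computes. Its output should therefore not be expected to reproduce Table 2-8.

```python
import math

# Candidate A's responses from Table 2-7 as (difficulty level, 1 = pass, 0 = fail).
# Every item counts once; no item weights are applied (an assumption of this sketch).
CANDIDATE_A = [
    (2, 1), (2, 1), (4, 1), (4, 1), (4, 1), (5, 1), (5, 1),
    (6, 1), (6, 1), (6, 1), (6, 0), (6, 1),
    (7, 0), (7, 0), (7, 1), (8, 0), (8, 0),
    (10, 0), (10, 0), (10, 0),
]


def score_and_information(data, b0, b1):
    """Score vector and information matrix of the pass/fail log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # P(pass) at level x
        w = p * (1.0 - p)
        g0 += y - p
        g1 += x * (y - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11


def fit_logistic(data, max_iter=50, tol=1e-10):
    """Newton-Raphson maximum-likelihood estimates (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = score_and_information(data, b0, b1)
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


def inverse_prediction(data, b0, b1, z=1.96):
    """Level at which P(pass) = 0.5, with a delta-method 95% interval."""
    _, _, h00, h01, h11 = score_and_information(data, b0, b1)
    det = h00 * h11 - h01 * h01
    v00, v01, v11 = h11 / det, -h01 / det, h00 / det  # inverse information
    level = -b0 / b1
    variance = (v00 + 2.0 * level * v01 + level ** 2 * v11) / b1 ** 2
    half_width = z * math.sqrt(variance)
    return level, level - half_width, level + half_width


b0, b1 = fit_logistic(CANDIDATE_A)
level, lower, upper = inverse_prediction(CANDIDATE_A, b0, b1)
print(f"estimated level {level:.3f}, 95% interval {lower:.3f} to {upper:.3f}")
```

A placement decision would then compare the estimate, and possibly the lower limit of the interval, with the level the institution requires.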

Advanced Estimation of Proficiency Level and Confidence Interval

The analyses so far were based on a simplified version of the data. The actual test included some items with open-ended questions instead of multiple-choice ones. If the answer to such an item was not perfect, the rater directly assigned the exact level to which the performance corresponded. For instance, #15 was originally Level 10, but Candidate A’s performance was judged as equivalent to Level 5. Candidate A had three such cases and Candidate B had five (Table 2-9). The problem here was that the rater had to deal with two different kinds of data, binary and continuous, on one scale. However, an exact rating does not fit logistic regression analysis, since the dependent variable must be categorical in nominal logistic regression analysis. So the rater doubled the responses, changed the difficulty level to the rated level, split the weight, and finally made one response “Pass” and the other “Fail”, because an exact rating means that the probability of the candidate’s answering an item of that level correctly is 50% (Tables 2-10 and 2-11).

Table 2-9: Item difficulty and responses by exact scoring.

Item  Level  Weight  Candidate A  Candidate B
#1    2      3       Pass         Pass
#2    2      3       Pass         Pass
#3    4      3       Pass         Pass
#4    4      3       Pass         Fail
#5    4      3       Pass         Fail
#7    5      3       Pass         Pass
#11   5      3       Pass         Pass
#6    6      3       Pass         Pass
#9    6      4       Pass         Pass
#10   6      4       Pass         5
#19   6      6       Fail         Fail
#17   6      8       Pass         4
#8    7      3       Fail         Fail
#12   7      3       Fail         Fail
#14   7      8       Pass         Pass
#16   8      8       Fail         7
#20   8      8       6            Fail
#13   10     8       Fail         4
#15   10     8       5            Fail
#18   10     10      6            4

Table 2-10: Adjusted responses for Candidate A (part).

Item  Level  Weight  Candidate A
#11   5      3       Pass
#15   5      4       Pass
#15   5      4       Fail
#10   6      4       Pass
#20   6      4       Pass
#20   6      4       Fail
#18   6      5       Pass
#18   6      5       Fail
#19   6      6       Fail
#17   6      8       Pass
#12   7      3       Fail
#14   7      8       Pass
#16   8      8       Fail
#13   10     8       Fail

Table 2-11: Adjusted responses for Candidate B (part).

Item  Level  Weight  Candidate B
#13   4      4       Pass
#13   4      4       Fail
#17   4      4       Pass
#17   4      4       Fail
#18   4      5       Pass
#18   4      5       Fail
#10   5      2       Pass
#10   5      2       Fail
#11   5      3       Pass
#19   6      6       Fail
#12   7      3       Fail
#16   7      4       Pass
#16   7      4       Fail
#14   7      8       Pass
#20   8      8       Fail
#15   10     4       Fail
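The adjustment behind Tables 2-10 and 2-11 (double the response, move it to the rated level, split the weight, and mark one copy ‘Pass’ and the other ‘Fail’) can be sketched as a small transformation. The function name `adjust_responses` and the tuple layout are hypothetical; the sample rows are four of Candidate A’s entries in Table 2-9.

```python
def adjust_responses(rows):
    """Turn exact ratings into pairs of half-weight Pass/Fail rows.

    Each row is (item, level, weight, response), where response is 'Pass',
    'Fail', or the level (a number) that the rater assigned to the answer.
    """
    adjusted = []
    for item, level, weight, response in rows:
        if response in ("Pass", "Fail"):
            adjusted.append((item, level, weight, response))
        else:
            half = weight / 2
            # The rated level replaces the original difficulty level.
            adjusted.append((item, response, half, "Pass"))
            adjusted.append((item, response, half, "Fail"))
    # Stable sort by level, as in Tables 2-10 and 2-11.
    return sorted(adjusted, key=lambda row: row[1])


candidate_a_sample = [
    ("#20", 8, 8, 6),
    ("#13", 10, 8, "Fail"),
    ("#15", 10, 8, 5),
    ("#18", 10, 10, 6),
]
for row in adjust_responses(candidate_a_sample):
    print(row)
```

Because each split row carries half of the original weight, the total weight of an item is unchanged by the adjustment.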


In the case of Candidate A, item #15 has a new level: it is now Level 5 because the rater judged her performance as corresponding to Level 5. The weight is 4, half the original weight, and one response is ‘Pass’ while the other is ‘Fail’. The result of this process shows an expansion of the confidence interval as well as a drop in the proficiency level, even more notably with Candidate B (see Table 2-12 below).

Table 2-12: Comparison of estimated proficiency levels and confidence intervals.

Candidate  Scoring      Level  Lower 95%  Upper 95%
A          dichotomous  6.924  6.600      7.482
A          exact        6.552  6.089      7.253
B          dichotomous  6.925  6.216      8.001
B          exact        5.401  3.872      6.774

Discussion

It appears that the exact estimation leads to decreased accuracy even though it was intended to increase it. One reason for this seeming decrease in reliability would be the possible inconsistency of the subjective rating with the rest of the dichotomous data; the more finely the rater pinpoints the level, the less accurate the conclusion becomes compared with a rough estimation. That is, an exact judgement that a candidate’s proficiency is at Level 6.0 when his/her true proficiency is at Level 6.5 is less accurate than a vague judgement that the candidate’s proficiency is somewhere below Level 10. It is a matter of whether we trust the rater’s case-by-case judgement or the latent ability structure that the candidate is assumed to follow. In other words, should our analysis be data-driven or model-driven? This question may remind us of the contrast between Item Response Theory (IRT) and Rasch Model analysis. If we understand the nature of variability in human behaviour, however, the reality may more likely lie in obscurity than in clearly focused measurement.

One reservation has to be made about this estimation method using logistic regression analysis. The present data happened to have no missing data. Where there are some, they will be excluded from the calculation because the corresponding data point for equation (1) cannot be obtained. Revuelta (2004) points out that raters cannot distinguish a simple accidental absence of data from intentional avoidance of answering, and proposes a correction programme. Although he deals with self-adaptive tests, the same possibility may occur in placement tests as well as other time-constrained tests when the candidate discards an item after attempting to solve it and finally deciding that it is too difficult.

To end our discussion, let us consider the application of the present technique to two cases. First, we can guarantee the equivalence of two tests, to some extent, where both test items and test takers are completely different. A good example is the National Center Test in Japan (National Center for University Entrance Examinations, n.d.). It is a nationwide public examination for admission to most of the universities/colleges in Japan, taken by roughly 530,000 students (about 45% of the population at age 18). Although it is one of the largest high-stakes tests in Japan, there has been no known valid means of equating test results across years. But, since the test writing criteria on which the test writers rely remain more or less the same, the test takers’ absolute proficiency can be estimated with reference to a standardised proficiency scale (such as the CEFR) by specifying the difficulty levels of items at the time of writing them. Because it is a large-scale examination, it is convenient to re-evaluate the item difficulty and to adjust the criteria for assigning difficulty levels.

Another application can be made to small-scale online proficiency tests such as M-Reader (M-Reader, n.d.). M-Reader is a collection of quizzes for graded readers intended to promote extensive reading among students. The result of a student’s taking a quiz serves to verify that he/she has read the book. The quizzes are written by volunteer teachers, so assigning a difficulty level to each item may be an extra job, but a little additional effort as an option would let the students realise which proficiency level they are at. It would also help quiz writers to control the difficulty level by referring to the difficulty criteria. It is worth adopting logistic regression analysis in estimating the students’ proficiency because the combination of easy items and difficult items is not controlled. Besides, since the estimation is carried out for individual students, rather than for a large group of test takers at one time as in institutional examinations, the proposed means of estimation is advantageous for its adaptability to on-demand requests.

Conclusion This chapter described how the use of logistic regression analysis made it possible to successfully estimate test-takers’ absolute proficiency levels, which was impossible by traditional placement based on raw scores. Proficiency was described in terms of a set of levels, which included the rough can-do statements (Table 2-2). This qualitative information is useful when the candidates are streamed into classes. The information of the

60

Chapter Two

confidence interval helps judge whether the estimated proficiency level is reliable or not. The strongest advantage of the present procedure is its application to small-size tests. Thus, even if there are only one or two test-takers—in which case IRT cannot be used—an estimation of the test-takers’ proficiency levels is obtainable, as long as the item difficulty is reliably specified. A large-scale standardised test, such as the Test of English for International Communication (TOEIC) or one of the Cambridge tests, may contain reliable item information, and in that sense it is advised to take such a test. However, it is more meaningful for teachers to write test items reflecting the local context and judge the acceptability of candidates by matching their proficiency descriptions with the curriculum contents. In fact, large-scale standardised tests do not provide detailed feedback of individual item responses. The present procedure is capable of further refinement. Instead of the rough descriptors of difficulty levels, we could use the CEFR, IELTS (International English Language Testing System), OPI (Oral Proficiency Interview), or any other predefined proficiency scale. There is a high practical need for estimating candidates’ proficiency with reference to the established scales, but studies in criterion-referenced testing (CRT) have only referred to the application of IRT in the connection of item difficulty with person ability. Brown and Hudson (2002), for example, explain how examinees are characterised by specific can-do statements in combination with IRT. When the item level information is predefined, we do not need IRT, which requires a vast number of test takers. A possible drawback in the proposed process would be the difficulty in securing stable item difficulty information, hence the need for test-writer training. 
The question of whether the difficulty levels are accurately specified is crucial to the validity/reliability of the entire test. The validity/reliability is also affected by the number of test items prepared. Obviously, when the quality of test items is high in terms of internal consistency, a relatively small number of test items will suffice. However, these two are independent issues outside the present focus, and it is assumed that the difficulty specification is done with high credibility.

Further study is needed on the treatment of partial assignment of proficiency levels. Some open-ended items can end up with responses whose degree of perfection is judged by a rater. These partial assignments, provided that they correspond to the difficulty levels initially defined, are an absolute estimation of the candidate’s proficiency. They carry more information than responses reduced to pass/fail. Further validation is needed of how these responses can be incorporated harmoniously with dichotomous responses, for which the true proficiency level is expressed as an inequality (i.e., if one passes an item of Level 5, his/her true level is at 5 or above).

References

Amma, K. (1990). Unpublished internal document. [A grammar test conducted at a private university in Tokyo.]
Amma, K. (2001). Variations of parsing strategies among EFL learners of different proficiency levels. Ronso (Bulletin of the Faculty of Humanities, Tamagawa University), 41, 79-115.
Amma, K. (2013). Criterion-referenced testing in small-scale placement: A case study. Paper presented at the English Language Education and the CEFR in Japan, JACET Kanto Chapter Meeting, June 16, Aoyama Gakuin University, Tokyo, Japan.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University Press.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
Fulcher, G. (1997). An English language placement test: Issues in reliability and validity. Language Testing, 14(2), 113-139.
Gombrich, E. H. (1984). The story of art. Oxford: Phaidon Press.
Harsch, C., & Rupp, A. A. (2011). Designing and scaling level-specific writing tasks in alignment with the CEFR: A test-centered approach. Language Assessment Quarterly, 8(1), 1-33. doi: 10.1080/15434303.2010.535575.
Hawkins, J. A., & Buttery, P. (2010). Criterial features in learner corpora: Theory and illustrations. English Profile Journal, 1(1), 1-23. doi: 10.1017/S2041536210000103.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Rowley, Massachusetts: Newbury House.
Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing, 9(1), 1-11.
JMP. (2002). JMP Version 5: Statistics and graphics guide. Cary, North Carolina: SAS Institute.
Lehtippu, M. (1996). Finland: A lonely planet travel survival kit. Hawthorn, Victoria, Australia: Lonely Planet Publications.
Linacre, J. M. (1987). A computer program for adapting testing by microcomputer. MESA Psychometric Laboratory, Memorandum No.40, University of Chicago.
Lloyd, C. (1999). Statistical analysis of categorical data. New York: John Wiley & Sons.
M-Reader (n.d.). Retrieved from: http://mreader.org/mreaderadmin/s/html/about.html
National Center for University Entrance Examinations (n.d.). Retrieved from: http://www.dnc.ac.jp/
Plakans, L., & Burke, M. (2013). The decision-making process in language program placement: Test and nontest factors interacting in context. Language Assessment Quarterly, 10(2), 115-134.
Revuelta, J. (2004). Estimating ability and item-selection strategy in self-adapted testing: A latent class approach. Journal of Educational and Behavioral Statistics, 29(4), 379-396.
SAS Institute. (2012). JMP version 10.0.0. Cary, North Carolina: SAS Institute.
Wall, D., Clapham, C., & Alderson, J. C. (1994). Evaluating a placement test. Language Testing, 11(3), 321-344.
Weir, C. (2005). Limitations of the Common European Framework for developing comparable examinations and tests. Language Testing, 22(3), 281-300.

CHAPTER THREE

BILINGUAL LANGUAGE ASSESSMENT IN EARLY INTERVENTION: A COMPARISON OF SINGLE- VERSUS DUAL-LANGUAGE TESTING

CAROLINE A. LARSON, SARAH CHABAL, AND VIORICA MARIAN

Abstract

Despite a growing number of bilingual children enrolled in Early Intervention language services, methods of administering language assessments to bilingual children are not standardized. This study reports clinically meaningful differences in bilingual children’s receptive and expressive language outcomes when their language skills are assessed in the primary language versus in both the primary and secondary languages. Eleven Spanish-English speaking children (ages 1;11 to 2;11) with language delay enrolled in Early Intervention were assessed using The Rossetti Infant-Toddler Language Scale (Rossetti, 1990) in their primary language only, and then in both their primary and secondary languages. When assessed in only one language, bilingual children’s language skills were underestimated by 1.4 months for receptive language and 2.2 months for expressive language; language delay was overestimated by 4.7% for receptive language and by 7.8% for expressive language. Single-language assessments would lead to inappropriate Early Intervention referral for 3 of the 11 tested children. It is therefore suggested that assessing bilingual children in only one language leads to a significant underestimation of receptive and expressive language abilities and a significant overestimation of language delay. Consequently, the efficacy, reliability, and validity of the assessment are compromised and best practice as mandated by speech-language pathology certification organizations is not achieved.


Introduction

The number of bilingual children in the United States, as well as throughout the world, is rapidly growing, due, in part, to globalization, migration, and an increased prevalence of bilingual education options. For example, of school-age children in the United States, 22% speak a language other than English in the home (Lowry, 2011). Within certain areas, such as large cities, an even higher percentage of families speak more than one language in the home. For instance, a language other than English is spoken in 35.5% of Chicago residences (United States Census Bureau, 2013). Children in these homes who are developing more than one language are generally believed to have language disorders at a similar rate as children acquiring only one language (Kohnert, 2010). As a result, the caseload makeup for speech-language pathologists often includes children with language delay who are developing bilinguals.

When young monolingual and bilingual children fail speech-language screenings or are referred by pediatricians due to speech-language concerns, they undergo language assessment to determine eligibility for Early Intervention services. For example, in Illinois a child is considered eligible for speech-language services when he or she demonstrates a 30% or more delay in one or more areas of speech, language, or communication, when he or she presents with a medical diagnosis that typically results in developmental delay, or when he or she is determined to be at risk of substantial developmental delay (Illinois Department of Children and Family Services, 2003; Illinois Department of Human Services Community Health and Prevention Bureau of Early Intervention, 2009). Eligibility for speech-language services through the Early Intervention program in the United States is often determined based on assessment outcomes of The Rossetti Infant-Toddler Language Scale (Rossetti, 1990).
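The three routes to eligibility described for Illinois can be written as a simple rule. The following is a minimal sketch for illustration only, not an official implementation: the function name, the argument names, and the dictionary layout are assumptions; only the 30% threshold and the three alternative criteria come from the text.

```python
from typing import Mapping

DELAY_THRESHOLD = 30.0  # percent delay cited in the text for Illinois


def is_eligible(percent_delay_by_area: Mapping[str, float],
                has_qualifying_diagnosis: bool = False,
                at_risk_of_substantial_delay: bool = False) -> bool:
    """Return True if any one of the three eligibility routes applies."""
    # Route 1: a delay of 30% or more in at least one assessed area.
    meets_delay = any(delay >= DELAY_THRESHOLD
                      for delay in percent_delay_by_area.values())
    # Routes 2 and 3: a qualifying medical diagnosis or at-risk status.
    return (meets_delay
            or has_qualifying_diagnosis
            or at_risk_of_substantial_delay)
```

Under this sketch, a child with a 30% receptive delay and a 10% expressive delay would be eligible, whereas a child with delays of 20% and 25% would qualify only through a diagnosis or at-risk determination.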
The Rossetti is a criterion-referenced assessment of preverbal and verbal areas of communication and interaction for children up to three years of age. The skill age at which all criteria are demonstrated and the resulting percent of receptive or expressive language delay relative to chronological age decide the children’s eligibility for Early Intervention. The Rossetti is often used in the Early Intervention program as it is familiar to Early Intervention clinicians across disciplines (e.g., occupational therapists, social workers) (Marchman & Martinez-Sussmann, 2002) and because few other assessment tools cover a similar breadth of developmental domains within the birth to three age range. Like many assessments structured for use with young children (e.g., Bzoch, League, & Brown, 2003; Hedrick, Prather, & Tobin, 1984; Marchman & Martinez-Sussmann, 2002; Rescorla, 1989; Wetherby & Prizant, 1993), The Rossetti is primarily informal, follows a checklist format, and involves multiple sources reporting the presence or absence of specified skills. The Rossetti is often preferred over other assessments due to ease of administration in the home environment and applicability to the Early Intervention program assessment requirements (Illinois Department of Human Services Community Health and Prevention Bureau of Early Intervention, 2009).

Background

Despite its use within the Early Intervention program, methods of administering The Rossetti assessment to bilingual children are not standardized. When The Rossetti is used to assess bilingual children, accepted practices include measuring language abilities in only the child’s primary language, in only the child’s secondary language, or across both developing languages. One concern with assessing bilingual children’s language skills in only their primary or secondary language is that developing bilinguals with language delay often display uneven skill distribution and shifting development across languages, as well as individual variation in their developmental trajectories (Kohnert, 2010). For example, a child may have relatively even expressive vocabulary skills in Spanish and English, but demonstrate more advanced verb conjugation skills in English. Even in typically developing bilingual children, language acquisition is characterized by variable timeframes and patterns of development, which cause difficulty in obtaining valid assessment outcomes (e.g., Kohnert & Goldstein, 2005; Marian, 2008; Marian, Faroqi-Shah, Kaushanskaya, Blumenfeld, & Sheng, 2009). Therefore, single-language assessment of developing bilinguals may not accurately reflect their language abilities and may not be best practice.

Indeed, previous research with school-age bilinguals suggests that both languages should be measured and considered as a composite in order to reduce the risk of misdiagnosis and inappropriate individualized education plans (Kohnert, 2008; Kohnert, 2010; Marian et al., 2009; Roseberry-McKibbin, Brice, & O’Hanlon, 2005). While such risks in the school-age population are well documented, there is little research examining language assessment methods with birth to three-year-old bilingual children who have language delays (Dollaghan & Horner, 2011). Within typically developing populations, the language(s) of assessment can affect measures of young bilinguals’ total vocabulary size (Core, Hoff, Rumiche, & Señor, 2013; Hoff, Core, Place, Rumiche, Señor, & Parra, 2012; Thordardottir, Rothenberg, Rivard, & Naves, 2006), grammatical ability (Hoff et al., 2012), and syntax (Thordardottir et al., 2006). Similar metrics are likely to be impacted by the language of assessment for children with language delays.

Understanding how bilingual children’s assessments are impacted by the use of single- or dual-language practices is important for early and accurate detection of language disorders. Early assessment allows for the provision of Early Intervention speech-language services to the young bilingual population, which results in faster gains and possible prevention or minimization of deficits (National Joint Committee on Learning Disabilities, 2006; Paul, 2007; Woods & Wetherby, 2003). Because the Early Intervention population often includes bilingual children with language delays, the current study aimed to determine whether young bilingual children’s language assessment outcomes were different when evaluated in only one language as opposed to in both of the children’s developing languages.

The Study

The study reported in this chapter examined differences in expressive and receptive language measures on The Rossetti for birth-to-three-year-old bilingual children with language delay when they were assessed in their primary language versus in both their primary and secondary languages. It was hypothesized that assessment outcomes provide a more accurate picture of the developing bilingual’s language level when skills are measured across both developing languages. Therefore, it was predicted that administering The Rossetti to young bilingual children with language delay in only one language would underestimate language abilities and overestimate language delay.

Participants

Participants were 11 children (2 girls; 9 boys) of Hispanic descent ranging in age from 1;11 to 2;11 (Mean = 2;5, SD = 0;4.8), born in the United States to bilingual Spanish-English speaking parents. All participants included in the study were assigned to Early Intervention speech-language services and required annual or 6-monthly Early Intervention mandated reassessment. All participants passed a hearing screening within one year of the testing date. Verbal consent was obtained from the participants’ parents prior to the evaluation. Information about participants’ demographics, linguistic backgrounds, and language skills was obtained from parent reports and Early Intervention initial evaluation reports (see Table 3-1). Five participants were reported to use English as their primary language; six participants were reported to use Spanish as their primary language. On average, participants made 78% of their expressions in their primary language (SD = 10.8%) and 22% of their expressions in their secondary language (SD = 10.8%).

Table 3-1: Demographic information for study participants.

| Participant | Gender | Age | Primary Language | Secondary Language | % Expression in Primary Language | % Expression in Secondary Language | Type of Diagnosed Delay |
|---|---|---|---|---|---|---|---|
| 1 | M | 1;11 | English | Spanish | 80 | 20 | Language |
| 2 | M | 2;6 | English | Spanish | 85 | 15 | Language |
| 3 | F | 2;4 | Spanish | English | 80 | 20 | Language |
| 4 | M | 2;11 | Spanish | English | 75 | 25 | Language |
| 5 | M | 2;4 | English | Spanish | 85 | 15 | Language |
| 6 | M | 2;6 | English | Spanish | 90 | 10 | Developmental |
| 7 | M | 1;11 | English | Spanish | 85 | 15 | Developmental |
| 8 | M | 2;11 | Spanish | English | 60 | 40 | Language |
| 9 | F | 2;1 | Spanish | English | 90 | 10 | Language |
| 10 | M | 2;11 | Spanish | English | 60 | 40 | Language |
| 11 | M | 2;6 | Spanish | English | 70 | 30 | Language |

Materials

Participants were assessed according to Early Intervention standards using The Rossetti Infant-Toddler Language Scale (Rossetti, 1990) at home in the presence of a parent, the treating therapist (first author), and an interpreter who had been assigned by the program to the child’s case at the onset of service provision. The Rossetti assesses skills across developmental domains including Interaction-Attachment (e.g., ‘Plays away from familiar people’), Pragmatics (e.g., ‘Uses words to protest’), Gesture (e.g., ‘Gestures to request action’), Play (e.g., ‘Stacks and assembles toys and objects’), Language Comprehension (e.g., ‘Identifies four objects by function’), and Language Expression (e.g., ‘Uses sentence-like intonational patterns’), and is used with young children from birth to three years of age (Rossetti, 1990). Because our interest is in the assessment of children’s language skills, the current study focused on the expressive and receptive language domains of The Rossetti.

Within each domain, children’s skills were assessed within three-month intervals (e.g., 21-24 months of age). Receptive language measures included: total number of words understood; the ability to follow two-step directions; the ability to identify body parts; the ability to answer wh-questions; and the ability to identify objects by category. Expressive measures included: total number of words spoken; the frequency with which the child expressed two-word phrases; the ability to verbalize two different needs; the ability to use words to interact with others; and the ability to imitate animal sounds. Because a child may not spontaneously produce all of these behaviors within the context of a single session with the clinician, scores on each domain were credited with equal weight based on parent report, assessor observation, and/or assessor elicitation. Behaviors not observed or elicited by the parent or assessor were considered not yet present.

The evaluator was a state-licensed and Early Intervention credentialed practicing speech-language pathologist-clinical fellow. All interpreters were Early Intervention credentialed Spanish-English bilinguals who were familiar to the child and family. Interpreters were assigned to each child at the onset of Early Intervention service provision.

Procedure

The Rossetti parent questionnaire and test criteria are available in Spanish and English; however, this study’s administration used only the English questionnaire and test criteria, as an interpreter was present to translate the questions from English to Spanish. Parent interviews were completed in English with Spanish interpretation prior to the assessment to determine participants’ demographic and linguistic backgrounds, and then with The Rossetti parent questionnaire during the assessment period. Follow-up questions and clarification questions were used as needed to ensure adequate and appropriate interpretation of assessment questions.

Within each assessment period, The Rossetti was administered twice: first only in the participant’s primary language (i.e., credit was only given for skills demonstrated or reportedly observed in the primary language), and then in the child’s primary and secondary languages (i.e., credit was given for skills in either and/or both languages). During primary language administration, all activities were conducted in the child’s primary language only, and the child received credit for skills demonstrated in that language only. For example, a child whose primary language was Spanish would not receive credit for a skill demonstrated in English. During dual-language administration, all activities were conducted in a ratio that matched the parent-reported ratio of Spanish to English expression. Children were awarded credit for all skills, regardless of their language of demonstration.

Because the assessment accounts for skills that parents have observed but that may not have been demonstrated during the assessment period, and because it is a criterion-referenced assessment with general skill benchmarks, practice effects across single- and dual-language assessments were not problematic. The parent interview, primary language assessment, and dual-language assessment occurred within the same contact period. Sessions lasted approximately one hour and involved child-directed and therapist-directed structured play activities, similar to a typical therapy session (e.g., shared storybook reading, symbolic play with a toy farm, and putting together puzzles).
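The difference between the two crediting rules can be made concrete with a small sketch. The data layout below (each skill mapped to the set of languages in which it was demonstrated or reported) and the function name are hypothetical and chosen only for illustration; the skill names are examples, not actual scored records from the study.

```python
from typing import Dict, Set


def credited_skills(observed: Dict[str, Set[str]], primary: str,
                    dual: bool) -> Set[str]:
    """Return the skills that earn credit under one administration.

    observed maps a skill name to the set of languages in which it was
    demonstrated or reported (hypothetical layout, for illustration).
    """
    if dual:
        # Dual-language administration: a skill shown in any language counts.
        return {skill for skill, langs in observed.items() if langs}
    # Primary-language administration: only the primary language counts.
    return {skill for skill, langs in observed.items() if primary in langs}


# Illustrative child whose primary language is Spanish.
observed = {
    "uses two-word phrases": {"English"},
    "identifies body parts": {"Spanish", "English"},
    "answers wh-questions": set(),
}
primary_only = credited_skills(observed, primary="Spanish", dual=False)
both = credited_skills(observed, primary="Spanish", dual=True)
```

In this illustration the child is credited with one skill under primary-language administration and two under dual-language administration, because the two-word phrases were produced only in English.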

Scoring and Data Analysis

The assessment was scored and analyzed by the treating therapist with adherence to testing procedures. Skills observed or elicited by the assessing therapist were scored online, and parent-reported skills were credited offline within one week of administration. All assessment reports were reviewed by the assessing therapist’s clinical fellowship mentor.

The Rossetti assigns age levels based on the presence of all skills within a domain’s three-month interval. In order to be scored within an age range, the child must have demonstrated all skills within that interval (i.e., if one or more skills from a given level were not present, the child was assigned a lower age level for that domain). Skills were awarded if they were observed or elicited by the parent, evaluator, or other reporter (e.g., daycare teacher or caregiver). Children were assessed at the highest reported skill level (e.g., if parents reported that the child used two-word phrases frequently but the evaluator elicited two-word phrases only occasionally, the skill was assigned as ‘uses two-word phrases frequently’) (Rossetti, 1990).

Percent language delay was calculated by dividing the child’s lowest assessed age by his or her chronological age, multiplying that number by 100, and then subtracting the result from 100 (Rossetti, 1990; West Virginia Department of Human Resources, 2009). For example, a child with a chronological age of 30 months who demonstrated a receptive language age of 21-24 months would present with a 30% delay in receptive language (21 / 30 × 100 = 70; 100 − 70 = 30).
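The percent-delay calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and arguments are invented for the example, and the formula is written as 100 minus the ratio of skill age to chronological age (times 100) so that the result agrees with the 30% figure in the worked example.

```python
def percent_delay(skill_age_months: float,
                  chronological_age_months: float) -> float:
    """Percent delay of skill age relative to chronological age.

    Computed as 100 - (skill age / chronological age * 100), so that a
    delay is reported as a positive percentage.
    """
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    ratio = skill_age_months / chronological_age_months
    return 100.0 - ratio * 100.0


# Worked example from the text: a 30-month-old scoring in the 21-24 month
# interval is credited with the lowest age in that interval (21 months).
interval = (21, 24)
delay = percent_delay(min(interval), 30)  # approximately 30 (percent)
```

Using the lower bound of the interval follows the text's reference to the child's "lowest assessed age"; a child whose skill age equals his or her chronological age would show a delay of zero.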

Results

All data were analyzed using paired t-tests to compare outcomes when assessments were conducted in only the child’s primary language versus his or her primary and secondary languages. Results revealed that single-language outcomes underestimated the participants’ receptive and expressive language skills.

Primary Language Testing

When assessed in the primary language only, participants’ average receptive language skill age was 20.5 months (SD = 5.8 months), representing a mean delay of 28.4% (SD = 16.2%). Average expressive language skill age was 18.8 months (SD = 6.4 months), representing an average delay of 34.1% (SD = 18.3%). When including scattered skills (i.e., all ages at which skills were demonstrated), single-language assessment revealed a highest receptive skill-age average of 21.8 months (SD = 5.9 months) across participants and a highest expressive skill-age average of 22.4 months (SD = 5.1 months).

Dual-Language Testing

When assessed in both primary and secondary languages, participants’ receptive skill age was 21.8 months (SD = 6.2 months), representing a delay of 23.6% (SD = 15.8%). Expressive skill age was 21 months (SD = 6.6 months), representing an average delay of 26.3% (SD = 19%). When accounting for scattered skills (i.e., skill distribution) across both languages, average highest receptive skill-age was 24.3 months (SD = 4.9 months) and average highest expressive skill-age was 24.3 months (SD = 4.9 months).

Single- Versus Dual-Language Testing

The data were compared using paired t-tests. The results of the analyses suggest that assessment in only the primary language significantly underestimated receptive skill age by an average of 1.4 months (SD = 1.6 months, t(10) = 2.8868, p < .05) (see Table 3-2 and Figure 3-1) and expressive skill age by an average of 2.2 months (SD = 1.4 months, t(10) = 5.1640, p < .05) (see Table 3-3 and Figure 3-2). Primary language assessment also significantly overestimated the language delay by 4.7% (SD = 5.7%, t(10) = 2.7368, p < .05) for receptive skills and by 7.8% (SD = 5.4%, t(10) = 4.8348, p < .05) for expressive skills (see Figure 3-3). The findings also suggest that single-language assessment significantly underestimated scattered skills by 2.5 months (SD = 1.8 months, t(10) = 4.5000, p < .05) in the receptive domain and 1.9 months (SD = 2.0 months, t(10) = 3.1305, p < .05) in the expressive domain.
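The paired comparison of skill ages can be retraced from the participant-level values in Tables 3-2 and 3-3. The sketch below uses only the Python standard library; the variable and function names are illustrative assumptions, the skill ages are transcribed from the tables, and the critical value of 2.228 is the standard two-tailed value of t for 10 degrees of freedom at the .05 level. It is offered as a check on the arithmetic, not as the authors' analysis script.

```python
from math import sqrt
from statistics import mean, stdev

# Skill ages in months for participants 1-11 (Tables 3-2 and 3-3).
receptive_primary = [15, 24, 27, 27, 21, 12, 15, 18, 18, 30, 18]
receptive_dual = [18, 24, 27, 30, 21, 12, 18, 21, 18, 33, 18]
expressive_primary = [18, 24, 21, 24, 24, 9, 12, 18, 15, 30, 12]
expressive_dual = [21, 24, 24, 27, 24, 9, 15, 21, 18, 33, 15]

T_CRITICAL_DF10 = 2.228  # two-tailed critical t, alpha = .05, df = 10


def paired_t(first, second):
    """Return (mean difference, SD of differences, t) for second - first."""
    diffs = [b - a for a, b in zip(first, second)]
    sd = stdev(diffs)  # sample SD, n - 1 in the denominator
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return mean(diffs), sd, t


rec_mean, rec_sd, rec_t = paired_t(receptive_primary, receptive_dual)
exp_mean, exp_sd, exp_t = paired_t(expressive_primary, expressive_dual)
```

Rounded to the precision reported in the chapter, the sketch yields a mean receptive difference of 1.4 months (SD = 1.6, t ≈ 2.8868) and a mean expressive difference of 2.2 months (SD = 1.4, t ≈ 5.1640), both above the critical value for 10 degrees of freedom.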

Discussion

The results of the present study confirm that assessing bilingual children in only one language leads to a significant underestimation of participants’ receptive and expressive language abilities and a significant overestimation of their language delay. Scattered skill measurement, which provides treatment planning and skill distribution information, was also significantly underestimated. As a result of obtaining inaccurate assessment outcomes, eligibility determination and treatment planning are therefore compromised when assessing language skills in only one language, and implementation of best practice (ASHA, 2010) is not achieved. We conclude that clinicians working with bilingual children must measure highest skill levels across both languages to obtain accurate diagnostic and treatment planning information.

Table 3-2: Receptive language ability as indexed by The Rossetti (1990) Language Comprehension subtest.

| Participant | Skill Age: Primary Language Only (months; % delay) | Skill Age: Dual Language (months; % delay) | Skill Age Difference (months) | % Delay Difference | Highest Skill Age: Primary Language Only (months) | Highest Skill Age: Dual Language (months) | Highest Skill Age Difference (months) |
|---|---|---|---|---|---|---|---|
| 1 | 15; 35% | 18; 21% | -3 | -14% | 15 | 21 | -6 |
| 2 | 24; 20% | 24; 20% | 0 | 0% | 24 | 27 | -3 |
| 3 | 27; 4% | 27; 4% | 0 | 0% | 27 | 27 | 0 |
| 4 | 27; 25% | 30; 17% | -3 | -8% | 27 | 30 | -3 |
| 5 | 21; 25% | 21; 25% | 0 | 0% | 24 | 27 | -3 |
| 6 | 12; 60% | 12; 60% | 0 | 0% | 15 | 18 | -3 |
| 7 | 15; 35% | 18; 22% | -3 | -13% | 15 | 18 | -3 |
| 8 | 18; 49% | 21; 40% | -3 | -9% | 21 | 24 | -3 |
| 9 | 18; 14% | 18; 14% | 0 | 0% | 21 | 21 | 0 |
| 10 | 30; 14% | 33; 6% | -3 | -8% | 33 | 33 | 0 |
| 11 | 18; 31% | 18; 31% | 0 | 0% | 18 | 21 | -3 |
| Mean | 20.5; 28.4% | 21.8; 23.6% | -1.4* | -4.7%* | 21.8 | 24.3 | -2.5* |

Note: * = significant difference at p < .05

[Figure: y-axis, Age in Months (0 to 30); x-axis, Skill Age and Highest Skill Age; series, Primary Language and Primary and Secondary Language; both comparisons marked with an asterisk.]

Figure 3-1: Participants’ receptive language assessment results using The Rossetti (1990) Language Comprehension subtest. Error bars represent standard errors and asterisks indicate significant differences at p < .05.

[Figure: y-axis, Age in Months (0 to 30); x-axis, Skill Age and Highest Skill Age; series, Primary Language and Primary and Secondary Language; both comparisons marked with an asterisk.]

Figure 3-2: Participants’ expressive language assessment results using The Rossetti (1990) Language Expression subtest. Error bars represent standard errors and asterisks indicate significant differences at p < .05.

Table 3-3: Expressive language ability as indexed by The Rossetti (1990) Language Expression subtest.

| Participant | Skill Age: Primary Language Only (months; % delay) | Skill Age: Dual Language (months; % delay) | Skill Age Difference (months) | % Delay Difference | Highest Skill Age: Primary Language Only (months) | Highest Skill Age: Dual Language (months) | Highest Skill Age Difference (months) |
|---|---|---|---|---|---|---|---|
| 1 | 18; 21% | 21; 9% | -3 | -12% | 18 | 21 | -3 |
| 2 | 24; 20% | 24; 20% | 0 | 0% | 24 | 27 | -3 |
| 3 | 21; 25% | 24; 14% | -3 | -11% | 21 | 27 | -6 |
| 4 | 24; 33% | 27; 25% | -3 | -8% | 33 | 33 | 0 |
| 5 | 24; 14% | 24; 14% | 0 | 0% | 27 | 27 | 0 |
| 6 | 9; 70% | 9; 70% | 0 | 0% | 27 | 27 | 0 |
| 7 | 12; 48% | 15; 35% | -3 | -13% | 21 | 24 | -3 |
| 8 | 18; 49% | 21; 40% | -3 | -9% | 21 | 24 | -3 |
| 9 | 15; 28% | 18; 14% | -3 | -14% | 18 | 18 | 0 |
| 10 | 30; 14% | 33; 6% | -3 | -8% | 21 | 24 | -3 |
| 11 | 12; 53% | 15; 42% | -3 | -11% | 15 | 15 | 0 |
| Mean | 18.8; 34.1% | 21; 26.3% | -2.2* | -7.8%* | 22.4 | 24.3 | -1.9* |

Note: * = significant difference at p < .05

[Figure: y-axis, Percent Language Delay (0.00% to 35.00%); x-axis, Comprehension and Expression; series, Primary Language and Primary and Secondary Language; both comparisons marked with an asterisk.]

Figure 3-3: Percent language delay in primary-language-only assessment and in dual-language assessment using The Rossetti (1990) Language Comprehension and Language Expression subtests. Error bars represent standard errors and asterisks indicate significant differences at p < .05.

Clinical Implications

The results of the present study are relevant for Early Intervention initial evaluation and ongoing assessment methods. Frequently, initial evaluations assess developing bilinguals in the primary language or secondary language only, or the evaluation report does not discuss the language of assessment. Consequently, questions may be raised about the accuracy of children’s eligibility determination, as well as their speech-language treatment planning. For example, if dual-language assessment protocols are not followed, three of the eleven tested participants would receive inappropriate referral for Early Intervention services. Although these three participants would meet the 30% delay criterion when assessed in only one language and could therefore be eligible for Early Intervention services, when assessed across both of their languages, these participants’ language skills would fall within the average range for bilingual children.

Assessing children in only one language and inappropriately referring them for services may cause these children’s families to direct limited familial resources to the children’s treatment, as well as cause undue stress on the family. Additionally, committing a finite number of clinicians and limited funding to these children is not warranted. Children who are significantly delayed and who actually meet the eligibility requirements may linger on a waitlist or receive no services while children whose development is age-appropriate receive treatment. Furthermore, not accounting for a child’s second language perpetuates negative bias against bilingual language learners and the differences in their course of language development as compared to monolingual language development.

Appropriate treatment planning may also be impacted by single-language assessment, as treating therapists develop therapeutic goals and establish the language of treatment based on the children’s initial evaluation reports. Developing a treatment plan based on inaccurate assessment outcomes and skill distribution information is not best practice, and may hinder the child in reaching his or her full communicative potential. Also, due to a lack of continuity and infrequent contact between assessing and treating therapists in Early Intervention, the treating therapist may not be able to determine how and in what language the child’s skills were measured when the language of assessment is unreported or inaccurately reported in the initial evaluation reports. Consequently, the Early Intervention language assessment process must accurately and thoroughly account for developing bilinguals’ composite language skills.

The research presented here has direct implications for how language assessments should be structured. Prior to initiating the assessment process for children who are developing more than one language, the assessor must complete a thorough case history with the child’s parent or caregiver, utilizing interpretation services as necessary. The case history should include information related to medical history and current health status (e.g., birth weight, hospitalizations, familial medical history), developmental milestones (e.g., age the child first walked, first words), linguistic environment (e.g., primary language, language input/output, community language), and concerns regarding the child’s language skills (e.g., the child uses fewer than five true words and jargoning to communicate). Assessments should then measure the child’s highest language skill across both developing languages, as well as scattered skills and other qualitative information (e.g., the child produces the pronouns ‘I’ and ‘me’ in English and ‘me’ in Spanish independently, but is able to also produce ‘yo’ in Spanish given support). For example, a child with a primary language of English and secondary language of Spanish who is able to follow 2-step directions in English and 1-step directions in Spanish should receive credit for following 2-step directions. Measuring the highest reported and observed language skills across languages ensures that all of the child’s skills are given credit. As a result, the assessment yields a more appropriate eligibility determination.

Conclusion

To conclude, we have shown that assessing bilingual children in only the primary language can underestimate their language abilities, and may result in inaccurate eligibility determination and over-identification of language delays. Therefore, it is vital that language assessments in children acquiring multiple languages account for abilities across all developing languages. Measuring children’s skills in all developing languages (as opposed to skills in only one language) yields a more accurate and complete assessment, which has immediate benefits for appropriate service eligibility determination and treatment planning.

While our current findings provide support for the use of dual-language assessments when determining children’s eligibility for Early Intervention services, future research will need to explore the use of single- versus dual-language assessments as evaluated by independent raters. Although concerns of examiner bias in the present study were minimized because all evaluations were thoroughly reviewed and approved by a non-treating clinician not involved in the present study, more rigorous evaluation methods are prudent to ensure that the differences between single- and dual-language assessments are reproducible across a variety of contexts and populations. Assessment outcomes will also need to be evaluated across other diagnostic tools (e.g., Communication and Symbolic Behavior Scales by Wetherby & Prizant (1993); The Language Development Survey by Rescorla (1989)). Finally, future research will need to investigate the magnitude of misdiagnoses by expanding the participant selection to more diverse groups of language speakers (e.g., sequential language learners) and demographic makeups (e.g., high versus low socioeconomic status). By ensuring that all children receive accurate diagnoses and referrals for Early Intervention treatment, best practice standards will be met and increased therapeutic success will be achieved.

References

American Speech-Language-Hearing Association. (2010). Code of ethics. Retrieved from: http://www.asha.org/Code-of-Ethics/
Bzoch, K. A., League, R., & Brown, V. (2003). The receptive-expressive emergent language scale third edition. Austin, TX: Pro-Ed.
Core, C., Hoff, E., Rumiche, R., & Señor, M. (2013). Total and conceptual vocabulary in Spanish-English bilinguals from 22 to 30 months: Implications for assessment. Journal of Speech, Language, and Hearing Research, 56(5), 1637-1649.
Dollaghan, C. A., & Horner, E. A. (2011). Bilingual language assessment: A meta-analysis of diagnostic accuracy. Journal of Speech, Language, and Hearing Research, 54, 1077-1088.
Hedrick, D. L., Prather, E. M., & Tobin, A. R. (1984). Sequenced inventory of communication development-Revised. Torrance, CA: Western Psychological Services.
Hoff, E., Core, C., Place, S., Rumiche, R., Señor, M., & Parra, M. (2012). Dual language exposure and early bilingual development. Journal of Child Language, 39(1), 1-27.
Illinois Department of Children and Family Services. (2003). Title 89: Social services. Retrieved from: www.wiu.edu/ProviderConnections/pdf/Rule_500.pdf
Illinois Department of Human Services Community Health and Prevention Bureau of Early Intervention. (2009). Early Intervention service descriptions, billing codes and rates: Early Intervention provider handbook. Retrieved from: www.wiu.edu/ProviderConnections/pdf/ServiceDescriptionManual0910.pdf
Kohnert, K. (2008). Language disorders in bilingual children and adults. San Diego, CA: Plural Publishing.
Kohnert, K. (2010). Bilingual children with primary language impairment: Issues, evidence and implications for clinical actions. Journal of Communication Disorders, 43(6), 456-473.
Kohnert, K., & Goldstein, B. (2005). Speech, language, and hearing in developing bilinguals: From practice to research. Language, Speech, and Hearing Services in Schools, 36(3), 169-171.
Lowry, L. (2011). Bilingualism in young children: Separating fact from fiction. Retrieved from: www.hanen.org/Helpful-Info/Articles.aspx
Marchman, V. A., & Martinez-Sussmann, C. (2002). Concurrent validity of caregiver/parent report measure of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research, 45(5), 983-997.
Marian, V. (2008). Bilingual research methods. In J. Altarriba, & R. R. Heredia (Eds.), An introduction to bilingualism: Principles and processes (pp. 13-38). Mahwah, NJ: Lawrence Erlbaum.
Marian, V., Faroqi-Shah, Y., Kaushanskaya, M., Blumenfeld, H. K., & Sheng, L. (2009). Bilingualism: Consequences for language, cognition, development, and the brain. The ASHA Leader, 14, 10-13.
National Joint Committee on Learning Disabilities. (2006). Learning disabilities and young children: Identification and intervention. Retrieved from: www.ldonline.org/article/11511/
Paul, R. (2007). Language disorders from infancy through adolescence: Assessment and intervention. St. Louis, MO: Mosby Elsevier.
Rescorla, L. (1989). The Language Development Survey: A screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders, 54(4), 587-599.
Roseberry-McKibbin, C., Brice, A., & O’Hanlon, L. (2005). Serving English language learners in public school settings. Language, Speech, and Hearing Services in Schools, 36(1), 48-61.
Rossetti, L. (1990). The Rossetti Infant-Toddler Language Scale: A measure of communication interaction. East Moline, IL: Linguisystems.
Thordardottir, E., Rothenberg, A., Rivard, M. E., & Naves, R. (2006). Bilingual assessment: Can overall proficiency be estimated from separate measurement of two languages? Journal of Multilingual Communication Disorders, 4(1), 1-21.
United States Census Bureau. (2013). State and county QuickFacts. Retrieved from: quickfacts.census.gov/qfd/states/17000.html
West Virginia Department of Human Resources. (2009). WV birth to three percentage conversion chart. Retrieved from: www.wvdhhr.org/birth23/files/wvbtt_perc_conv_%20chart.pdf
Wetherby, A. M., & Prizant, B. (1993). Communication and symbolic behavior scales. Baltimore, MD: Paul H. Brookes Publishing.
Woods, J. J., & Wetherby, A. M. (2003). Early identification of and intervention for infants and toddlers who are at risk for Autism Spectrum Disorder. Language, Speech, and Hearing Services in Schools, 34, 180-193.

CHAPTER FOUR

FREQUENCY AND CONFIDENCE IN LANGUAGE LEARNING STRATEGY USE BY GREEK STUDENTS OF ENGLISH

PENELOPE KAMBAKIS-VOUGIOUKLIS AND PERSEPHONE MAMOUKARI

Abstract

The study reported in this chapter involved twelve Greek learners of English in an oral administration of a translated and validated version (Gavriilidou & Mitits, 2013) of the Strategy Inventory for Language Learning (SILL) questionnaire (Oxford, 1990). The study introduced two innovations. First, participants reported not only the frequency with which they used each language learning strategy (LLS), but also their confidence in the effectiveness of each strategy. This extra parameter allowed the researcher to identify factors in learner strategy use that frequency ratings alone do not usually reveal. Second, the bar (Kambaki-Vougioukli & Vougiouklis, 2008) was used instead of the usual Likert scale, as it can be more flexible for both the participants and the researcher. The results showed deviations between the frequency of strategy use and students’ confidence in the effectiveness of the strategies, indicating that learners either appreciated the effectiveness of a strategy without knowing how to use it, or used a strategy without firmly believing in its usefulness. These findings suggest the need for pedagogical interventions to raise learners’ awareness of language learning strategies and how to use them effectively. Additionally, more proficient learners reported higher frequency of and confidence in LLS use than their less proficient peers, while the age of the learners did not seem to affect LLS use.


Introduction

Language learning strategies (LLS) are the conscious or semiconscious mental processes employed for language learning and language use (Cohen, 2003). Research has shown that strategies may facilitate language learning, and strategic behavior has consequently received considerable attention in language learning research (Chamot, 2007; Wharton, 2000). Moreover, there is convincing evidence that language learning strategies can and should be taught (Chamot, 2005; Cohen & Macaro, 2007; Graham & Macaro, 2008). Research has also indicated that patterns of LLS use are often unclear, since they depend on various factors, such as the learners’ age, their target language proficiency, and the socio-cultural context (see Tragant & Victori, 2012 and references therein). Moreover, the different methodological tools selected to investigate LLS use may lead to discrepancies between studies.

Background

Strategy Inventory for Language Learning (SILL)

Oxford’s (1990) Strategy Inventory for Language Learning (SILL) questionnaire has maintained its reliability, validity, and utility (Oxford, 1996) and, consequently, its popularity among researchers for more than three decades. SILL measures how frequently learners use memory, cognitive, compensation, metacognitive, affective and social language learning strategies, as described by Oxford (1990). SILL is used to identify the level of strategy use (low, medium, high), and frequency is measured on a 5-point Likert scale. Most studies on LLS have employed this measurement to obtain comparable results. Recently, however, some researchers have argued that SILL has considerable potential that has not yet been investigated. For instance, Bull and Ma (2001, p. 174) introduced the Learning Style-Learning Strategies addition to SILL to measure “similarity between individual learning strategies”, which may raise learner awareness of LLS use and usefulness.

Confidence is also an important, yet not systematically studied, factor in the process of language learning. It has been investigated in association with communication strategies (Kambaki-Vougioukli, 1992a, 1992b, 2001) and among regular student populations in Greece (Mathioudakis & Kambaki-Vougioukli, 2010). Also, Intze and Kambaki-Vougioukli (2009) and Intze (2010) investigated confidence in association with the strategy


of guessing among Muslim learners of Greek as a second/foreign language and found statistically significant differences between males and females, with the latter being better at guessing and more confident than their male peers.

The use of the SILL questionnaire hides a potential danger that may be related to the actual time of the questionnaire completion and raises certain questions:

• How familiar are the learners with the strategies mentioned in the questionnaire?
• Are they sure they really employ the strategies they claim to use, or do they think so because they have heard the teacher or their peers emphasize their importance?

Although one would assume that when learners claim they use a strategy, they are most likely to consider it effective, we have reasons to believe that this might not be the case. A series of studies (Kambaki-Vougioukli, 2012, 2013) included confidence along with frequency in the SILL questionnaire; that is, the learners were asked to specify not only how frequently they used each strategy but also how confident they felt of its effectiveness. Results from these studies indicate that when learners claim they use a strategy, this does not necessarily imply that they consider it effective, as evidenced by the low confidence scores. There have also been cases where learners claimed they did not use a strategy but seemed confident that this strategy would help them in language learning. Finally, the close relation between the learners’ proficiency and the frequency of strategy use would be of particular interest together with the measurement of their confidence in strategy effectiveness.

Moreover, SILL questionnaires are generally in written form and their data analysis is usually quantitative. However, the oral administration of SILL may yield important insights by stimulating the learners’ individual experiences and by allowing the expression of attitudes, feelings and behaviors, possibly opening up new topic areas. Through a qualitative analysis of such results, alongside a quantitative one, a researcher might be better able to explain why a particular response was given.


Proficiency and Age in LLS Use

It has been found that more advanced learners are usually more proficient in LLS use than less advanced learners (Magogwe & Oliver, 2007). However, there are studies that show no such connection (e.g., Phillips, 1991). Discrepancies across studies in this respect may be due to differences in the participants’ cultural background (Psaltou-Joycey, 2008) and/or to the different ways in which proficiency is measured, namely, on the basis of the learners’ grades, the learners’ or teachers’ opinions, or independent proficiency tests (Tragant & Victori, 2012). In addition, there is the question of whether advanced strategy use is the outcome of or the reason for high proficiency levels; there seems to be a bidirectional relationship between the two, with influence in both directions (Green & Oxford, 1995; MacIntyre, 1994; McDonough, 1999).

The literature is similarly inconclusive about how age affects LLS use. In short, while more mature learners are expected to be more resourceful in LLS use, this expectation is not verified in all studies (Psaltou-Joycey & Sougari, 2010). As far as the interaction between age and proficiency is concerned, Tragant and Victori (2006) conducted a study with Spanish adolescent learners of English as a foreign language (EFL), where the learners’ English proficiency was based on their school grades. Results from this study showed that LLS use is affected by age, irrespective of proficiency. It should be mentioned, however, that the methodological instrument in the latter study was not the SILL questionnaire.

Strategy Use by Learners in Greece

Kazamia (2003) investigated the strategy profile of Greek adults learning English as a foreign language and the way they perceive the tolerance of ambiguity while learning a foreign language. Kambakis-Vougiouklis, Mamoukari, Agathopoulou and Alexiou (2013), in their research with Muslim children learning English as a foreign language, who had Turkish as their native language and Greek as a second language, also recorded the students’ level of confidence in the usefulness of the strategies. Psaltou-Joycey (2008) mainly focused on the effect of factors such as age, proficiency and cultural background of university students learning Greek as a second language. Gavriilidou and Psaltou-Joycey (2009) shed light on issues such as the definition of strategies, ways of recording them, strategies employed by effective learners, factors


that influence the choice of strategies and the teaching of strategies. Another study by Gavriilidou and Papanis (2010) investigated the effectiveness of direct strategy teaching with suggested activities for Muslim students. Psaltou-Joycey and Kantaridou (2009) investigated multilingualism in relation to the use of learning strategies as well as learning styles. Learning strategies are also investigated in the project “Θαλής 2012”, which involves the translation and validation of the SILL questionnaire in Greek and Turkish and aims at the collection of useful data regarding learning strategy use.

The previous studies suggest that there is a close relation between the learners’ proficiency and the frequency of strategy use. However, there has been no recording of the learners’ confidence that the strategies they employ are actually effective for their learning.

The study reported in this chapter is part of a larger research project that involved both Turkish-Greek bilingual students and native-Greek students. Students provided their responses to the SILL questionnaire in the form of oral protocols, i.e., face-to-face interviews, in order to have their frequency of strategy use and confidence in strategy effectiveness recorded. The oral administration of the SILL allowed interviewees to ask for clarifications and the researcher to pose further questions and reach more accurate conclusions about students’ use of language learning strategies.

The Study

This study is part of a wider investigation conducted in Thrace, Greece with two groups of learners of English, one Muslim, i.e., native speakers of Turkish, and the other Christian, i.e., native speakers of Greek. The terms Christian and Muslim are conventionally used to distinguish the two groups. The Muslim group results have already been presented elsewhere (Kambakis-Vougiouklis, Mamoukari, Agathopoulou, & Alexiou, 2013). In this chapter, we focus on the LLS of the Greek native speakers, examining both their frequency of strategy use and their confidence in the effectiveness of those strategies. The study was guided by the following research questions:

1. How and to what extent does investigating the learners’ confidence in the effectiveness of a strategy provide more information regarding LLS use?
2. Are there any problematic items in the initial version of the SILL questionnaire, i.e., items that are not well understood by the learners?
3. Is the learners’ strategic behavior affected by their proficiency in English (in combination with their age), and if so, how?

Participants

The learners in our study were all Greek and were recruited from the first three grades of a public secondary school in Thrace (a prefecture in the north-east of Greece). There were a total of 12 participants (six male and six female), aged 12-15 years and learning English as a foreign language. The sample comprised four students from each grade: two of low and two of high level in English, with one male and one female at each proficiency level. The learners’ level of English language proficiency was estimated by their English teacher, who was also one of the investigators in the present study, according to their performance in class and their course grades. Learners of intermediate English language proficiency were not included in the sample because previous research found differences in LLS use only between learners of low and high proficiency in the target language (Magogwe & Oliver, 2007).
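The sampling design described above can be checked with a short sketch. It assumes only what the paragraph states (three grades, two proficiency levels, one male and one female per level); the variable names and the grade labels are illustrative and not taken from the study.

```python
from itertools import product

# Design factors as described in the Participants section.
# The labels themselves are illustrative assumptions.
grades = ["grade 1", "grade 2", "grade 3"]
levels = ["low", "high"]
genders = ["male", "female"]

# One participant per combination of grade, proficiency level and gender.
design_cells = list(product(grades, levels, genders))


def count_by(position):
    """Count design cells per value of the factor at the given tuple position."""
    counts = {}
    for cell in design_cells:
        counts[cell[position]] = counts.get(cell[position], 0) + 1
    return counts


print(len(design_cells))  # total number of participants implied by the design
print(count_by(0))        # participants per grade
print(count_by(1))        # participants per proficiency level
print(count_by(2))        # participants per gender
```

Crossing the three factors yields the twelve participants reported, with four per grade and six per proficiency level and per gender.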

An Alternative Statistical Tool: The [01] Bar

Kambaki-Vougioukli and Vougiouklis (2008) and Kambaki-Vougioukli et al. (2011), investigating the possible hidden potential in the SILL questionnaire, introduced an alternative way of measuring the learners’ responses: the use of a bar [01] instead of the conventionally used Likert scales, on the assumption that such a tool facilitates the collection and processing of the data. Responses on the bar, i.e., 0_________________________________1, range from 0, which represents a completely negative answer or attitude, to 1, which represents a completely positive answer or attitude. The greatest difference between using a Likert scale and using the bar lies in the fact that the completion of the SILL questionnaire using a Likert scale assumes that the learners fully understand the usually fine difference between the different grades of the scale. With the bar, on the other hand, learners indicate their use of or attitude towards a strategy by cutting the bar at any point that they think expresses it. Their responses are not influenced by their linguistic knowledge, as it is mostly a hands-on


procedure that requires them to ‘feel’ or sense their position on the bar, rather than consciously think of the wording or choose from a division pre-arranged for them. Replacing the discrete character of Likert scales by a fuzzy one, such as that of the bar, seems even more suitable when a questionnaire is not in the learners’ mother tongue and insufficient linguistic knowledge of the target language may distort the validity of the questionnaire. Similarly, at the results processing stage, researchers using a Likert scale must decide in advance how many divisions will be used. By contrast, no such predetermined decision is required with the bar. Moreover, it is possible to process the same data using different subdivisions, for a number of reasons including comparability with different research studies.

The bar was first introduced at a length of 10 cm but was later shortened to 6.2 cm, which approximates the golden section of 10 (10/1.618 ≈ 6.2). The reason for this change is that, as argued, since human eyes are used to the decimal system, people can easily divide a 10 cm long bar equally, which is not desirable in our case. A bar length of 6.2 cm, on the other hand, avoids familiar divisions, leaving the participant free to choose from an infinite number of points (Vougiouklis & Kambaki-Vougioukli, 2011). Finally, Kambaki-Vougioukli et al. (2011) compared the fuzzy bar with the Likert scale in an application of a departmental evaluation questionnaire among all students of the Department of Education in Alexandroupolis, Greece, asking the students to specify which method they preferred. An overwhelming majority of 98% favoured the bar.
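The point that the same bar data can be processed with different subdivisions can be illustrated with a minimal sketch. It assumes that a response is recorded as the distance of the cut from the 0 end of the bar, and that subdivisions are equal-width intervals; neither detail is specified in the text, and the function names `bar_score` and `to_category` are hypothetical.

```python
BAR_LENGTH_CM = 6.2  # bar length reported in the text


def bar_score(mark_cm, bar_length_cm=BAR_LENGTH_CM):
    """Convert a cut position (cm from the 0 end of the bar) to a score in [0, 1]."""
    if not 0 <= mark_cm <= bar_length_cm:
        raise ValueError("the mark must lie on the bar")
    return mark_cm / bar_length_cm


def to_category(score, divisions):
    """Assign a [0, 1] score to one of `divisions` equal-width categories, numbered from 1.

    Equal-width binning is an assumption made for illustration only.
    """
    if not 0 <= score <= 1:
        raise ValueError("score must lie in [0, 1]")
    if divisions < 1:
        raise ValueError("at least one division is required")
    # int() truncates, so category k covers [(k-1)/divisions, k/divisions);
    # a score of exactly 1 is placed in the top category.
    return min(int(score * divisions) + 1, divisions)


score = bar_score(4.65)        # hypothetical mark about three quarters along the bar
print(to_category(score, 5))   # the response expressed on a 5-division scale
print(to_category(score, 3))   # the same response on a 3-division scale
```

Because the raw mark is kept as a continuous value, the number of divisions becomes an analysis-time choice, which is what allows comparison with studies that used different Likert formats.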

Instruments and Data Collection Procedure

The questionnaire used in this study was the Greek version of the 50-item SILL (Oxford, 1990), translated and validated by Gavriilidou and Mitits (2013). Each question was followed by two separate bars: the first for measuring frequency of LLS use and the second for measuring confidence in the effectiveness of each strategy, as exemplified in Figure 4-1.


Figure 4-1: An example from the SILL questionnaire employing the [01] bar for frequency and confidence.

The questionnaire was orally administered to all learners during individual interviews with their English teacher. The learners explained their decision each time they marked where they cut either of the bars. The learners had been briefly instructed by the teacher-researcher about how to fill in the SILL questionnaire using the bar, which was something completely new to them; they seemed to understand it straight away. Then, they were asked to pay attention to the fact that not only did they have to indicate how often they used a strategy, but also how confident they felt with each strategy, or, in other words, how effective they thought each strategy was. At this specific moment, most learners reacted by saying that if they claim they often use a strategy, this implies they consider this strategy effective. They were, then, told that this might not be necessarily so and that it was an issue to be investigated. All interviews were recorded with the learners’ consent.

It was hypothesized that when confidence was higher than frequency, then this strategy might need to be systematically taught to learners. If, on the other hand, there was lower confidence than the actual frequency, one could assume that the learners probably used the strategy as a routine, not really appreciating its value.

Results

Within the content-analysis technique, all the answers were normalized into groups on the basis of two criteria: (a) confidence, where the deviation between frequency of use and confidence in the effectiveness of each strategy for every single question was examined; and (b) the questionnaire comprehension (wording of the questions that might have caused some problems). Also, a decision was made on the (arbitrary) convention that if the difference between the confidence and the frequency scorings was 6 on the 6.2 bar, then it was negligible and no further investigation was necessary. If it were higher, we estimated that it would need investigation.
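The working hypothesis (confidence above frequency suggests a strategy to be taught; confidence below frequency suggests routine use) and the convention of ignoring small differences can be sketched as a simple classification rule. This is an illustration under stated assumptions rather than the authors' procedure: the function name, the labels, and the tolerance of 0.6 cm used in the example calls are hypothetical, and both scores are assumed to be measured in the same units on the 6.2 cm bar.

```python
def classify_item(frequency, confidence, tolerance):
    """Label one learner's response to one item by comparing the two bar scores.

    `tolerance` is the largest absolute difference treated as negligible;
    it is left as a parameter because the analyst chooses it by convention.
    """
    difference = confidence - frequency
    if abs(difference) <= tolerance:
        return "negligible"
    if difference > 0:
        # Strategy is valued but underused: a candidate for systematic teaching.
        return "teach"
    # Strategy is used often but not trusted: possibly used as a routine.
    return "routine"


# Hypothetical marks in cm on the 6.2 cm bar; the tolerance is illustrative only.
print(classify_item(frequency=1.0, confidence=5.0, tolerance=0.6))
print(classify_item(frequency=4.0, confidence=4.3, tolerance=0.6))
print(classify_item(frequency=5.5, confidence=2.0, tolerance=0.6))
```

Keeping the tolerance explicit makes it easy to check how sensitive the grouping of items is to the chosen cut-off.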

Confidence and Strategy Use

The main questions that concern this part of the analysis were:

1. Are the learners confident that the strategy they employ is effective, so that they score high in confidence where they score high in frequency of use?
2. Do they use certain strategies often without being sure of their effectiveness, so that they score lower in confidence?
3. Do they rarely use a strategy but nevertheless score high in confidence in this strategy?

Certain SILL items drew our attention regarding the way the learners perceived and answered them, always in relation to the confidence factor. The items of greatest interest are the following:

Q3 (Memory Strategy): Combining the image with the sound of a new word. Eight out of the twelve participants scored equally in frequency and confidence. Two scored lower in confidence, while two students appeared to be rather puzzled, and one of them paused for quite some time before scoring. A long pause was interpreted as problematic behavior and was recorded accordingly: the student either did not understand the description of the strategy or was unsure whether s/he used it.

Q5 (Memory Strategy): I use flashcards in order to remember the new words (with the new word on one side and the definition or other information on the other side). Two out of the twelve students scored with no apparent deviation between frequency and confidence, while ten students scored much lower in frequency than in confidence. One of the students asked what exactly was implied by the word ‘flashcards’. The interviewer explained the word so that the student could proceed with the scoring. The majority of the students did not use the strategy even though they considered it to be rather effective. This could be interpreted as a need for instruction on how to make better use of the strategy, or as an effect of the participants’ age: teenagers may not use flashcards for learning as much as very young learners do.

Q6 (Memory Strategy): I physically act out new English words. Four students scored equally in frequency and confidence, whereas eight


students scored remarkably lower in frequency. However, it is believed that the students were not fully aware of the meaning of the verb ‘to act out’, as one of them commented positively and said that she would speak in English with her aunt. Here again the strategy is underused in the learning process, even though most subjects feel confident that it is helpful. The age of the students helps explain why they did not make use of the strategy: low self-confidence and embarrassment at exposing oneself are quite common during adolescence. More reinforcement would be necessary to enhance the use of this language learning strategy.

Q14 (Cognitive Strategy): I watch English language TV shows spoken in English or go to movies. Six students scored minimal frequency and very high confidence, while only one student scored higher in frequency. Four students had the same level in both frequency and confidence. This is another case of insufficient instruction in the strategy. Since the majority of students scored lower in frequency, clearly denoting that they do not make use of the strategy, the need for instruction is apparent, so that the students can fully exploit its benefits and achieve higher levels of language proficiency.

Q15 (Cognitive Strategy): I read books and magazines. Seven out of the twelve students scored very high in confidence, despite scoring low in frequency. Three students scored equally on both bars, whereas two students marked a much lower score in frequency than in confidence. The majority thus scored higher in confidence than in frequency. It is possible that if the students had been taught the use and value of this strategy, they could have incorporated it in the learning process.

Q37 (Meta-Cognitive Strategy): I have clear goals for improving my English skills. Seven students scored equally on both bars (confidence and frequency). Five students out of the twelve gave a higher score on the confidence bar. Three students referred to their grades and commented on them. However, they did not make any reference to their goals, as if they did not fully understand the meaning of the English word “goals” or what setting goals entails, and so could not fully understand what they were supposed to do. In this case, almost equal numbers of students scored higher in confidence or equally on both bars. Still, instruction could be given to clarify what exactly most students feel helps them learn the language.


Q44 (Affective Strategy): I talk about the way I feel when learning English. Eight out of the twelve students scored equally on both bars, while four students marked a higher score on the confidence bar. Most of them made no comments, apart from one who admitted that he does feel stressed when he talks in English and another who, wanting to make sure he had understood, repeated the statement as if seeking further explanation (which was not provided, as he immediately proceeded with the scoring). The students seem to be aware of the strategy and confident that it is helpful.
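The per-item tallies reported in this section can be compared by expressing the number of learners whose confidence exceeded their frequency as a share of the sample. The sketch below transcribes only the items whose counts are stated without ambiguity (Q5, Q6, Q37 and Q44); the dictionary layout and the function name are illustrative assumptions, not part of the study's analysis.

```python
# Tallies transcribed from the item descriptions in this section (12 learners per item).
# Items with incomplete or ambiguous counts are deliberately left out.
tallies = {
    "Q5": {"equal": 2, "confidence_higher": 10, "frequency_higher": 0},
    "Q6": {"equal": 4, "confidence_higher": 8, "frequency_higher": 0},
    "Q37": {"equal": 7, "confidence_higher": 5, "frequency_higher": 0},
    "Q44": {"equal": 8, "confidence_higher": 4, "frequency_higher": 0},
}


def confidence_gap_share(counts):
    """Share of learners whose confidence score exceeded their frequency score."""
    return counts["confidence_higher"] / sum(counts.values())


# Rank items from the largest to the smallest share of 'valued but underused' responses.
ranked = sorted(tallies, key=lambda item: confidence_gap_share(tallies[item]), reverse=True)

for item in ranked:
    print(item, round(confidence_gap_share(tallies[item]), 2))
```

Ordering the items this way puts the flashcard and acting-out strategies first, which matches the section's reading that these are the strategies most in need of explicit instruction.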

Problematic Perception of Questionnaire Items

During the administration of the SILL, there were certain items that were not easily understood by the participants and needed further explanation.

Question 4 (Memory Strategy): The use of rhymes. Eight students scored higher in confidence whereas only one scored higher in frequency. One of the students (advanced level) paused to understand the question and another student (beginner), somewhat puzzled, asked for further clarification and did not seem to fully understand the question. Both of these students were girls.

Question 41 (Affective Strategy): Rewarding oneself when achieving a goal in the target language. Six students scored higher in confidence, three scored lower in confidence, and three scored equally on the confidence and frequency bars. Two students were embarrassed, and one of them giggled, not knowing what to answer; he was encouraged by the interviewer to respond. Another student made a rather long pause, as if she was trying to understand the question. One student directly asked for clarification, not understanding what the interviewer meant by ‘rewarding’.

Question 46 (Social Strategy): Asking English speakers to correct me when I talk. Six students scored equally in frequency and confidence, whereas four had a higher score in confidence. Three students were unable to reply to the question, probably because they had never interacted with an English speaker before, so the interviewer had to provide a real-life example taken from their experience as students in order to encourage them to score on the bars.


Question 21. (Cognitive Strategy): Finding the meaning of an English word by dividing it into parts that I understand. Five students gave an equal score to both bars, and four out of the twelve students had considerably lower scores in confidence than frequency. In this particular question, the students had a hard time providing an answer, as some of them could not understand what way they should answer. More specifically, five students gave a wrong score, meaning they scored very high in the frequency bar, whereas they reported that they always tend to translate the words into Greek in order to understand them. Four of those students were beginners. There were also six cases that required further clarification, four of whom directly asked the interviewer what the question actually meant, and what the required information was. Question 27. (Compensatory Strategy): Reading English without looking up every new word. Six students had equal scoring and four scored higher in confidence. The interviewer noticed that what the students reported orally, was not in accordance with the score they marked on the frequency bar, indicating that they may not be fully aware of the actual meaning of the reported strategy. In the case of a fifth student, the above mistake passed unnoticed, and there was no cohesion of the verbal and the written data. A sixth student had difficulty figuring out the meaning of the statement, and made a rather awkward pause before scoring. The advanced students tended to make awkward pauses in order to think about the answer they should give. They were not interrupted until some time had passed, so they were asked if there was anything wrong. Similar issues were encountered with the following questions: Q9: I say or write new English words several times. Q11: I practice the sounds of English. Q12: I start conversations in English. Q22: I make summaries of information that I read or hear in English. 
Q34: I plan my schedule so I will have enough time to study English. Q39: I try to relax whenever I feel afraid of using English. Q42: I notice if I am tense or nervous when I am using English. On the other hand, most of the somewhat ‘problematic’ questions that troubled the beginners led the students to express their uncertainty and ask for clarification, so as to make sure they were on the right track. Such cases arose in the following questions: Q18-I look for words in my own language that are similar to new words in English (twice); Q20-I try to find the meaning of a new word by dividing it into parts; Q33-I try to find out how to be a better learner of English; Q43-I
encourage myself to speak English even when I am afraid of making a mistake. There were also cases of long pauses or hesitations in questions: Q16-I write notes, messages, letters in English; Q17-I first skim an English passage, then go back and read it; Q19-I try to find patterns in English; Q29-If I can’t think of an English word, I use a word or phrase that means the same thing. As the scoring on the bars shows, most of the students scored very high in confidence, although their scores on the frequency bar were not as high. This could be interpreted as a need for further instruction, so that the students are not only confident that a particular strategy is of great importance but also know how to use it to enhance their language learning.

Other Problematic Areas

There were certain questions in the questionnaire that were particularly problematic and may need attention or revision before any future administration of the instrument. For example, the question ‘I try not to translate word-for-word and I read English without looking up every new word’ caused a lot of confusion. Although some learners stated that they avoid looking up words, they still scored towards the left end of the bar (0), which means that they actually translate and look up words in dictionaries. Others stated that they do look up words but scored towards the right end of the bar (1), as if stating that they avoid it. Most of the learners asked for clarification; others, who did not, gave scores that contradicted the comments they made. In total, 4 out of the 6 beginner students either gave a wrong score or had to ask for clarification, and 3 out of the 6 advanced students did the same. This demonstrates an important advantage of administering the SILL orally in combination with individual interviews: it allows clarifications and may prevent subjects from making wrong assumptions about any of the SILL items. Lastly, it became apparent that the negatively worded items were particularly problematic and may need to be reworded in future administrations of the instrument. There were also questions that appeared to be quite similar, e.g., ‘I ask English speakers to correct me when I talk’ and ‘I ask for help from English speakers’. Learners told the researcher that one of them could have been eliminated. To be more precise, 4 out of the 6 beginner students gave almost identical scores to both questions, and so did all of the advanced students. The comments that the learners made indicated that these two questions were identical in the mind of the students, and
therefore they extracted similar information from them. The questions that were more problematic than others, in the sense that they needed further explanation or were misinterpreted when no clarification was given, were Q21-I try not to translate word-for-word and Q27-I read English without looking up every new word. In addition, questions Q46-I ask English speakers to correct me when I talk and Q48-I ask for help from English speakers were treated as if they expressed the same strategy and therefore had the same impact on the students. With regard to these two questions, ‘I ask English speakers to correct me when I talk’ and ‘I ask for help from English speakers’, there was one more interesting observation. The lower level students reported that they do not use these strategies but believe that seeking help and correction from others could help them. In contrast, the majority of the advanced learners scored lower in confidence, and some of them stated verbally that they do not wish to be corrected or that they do not consider it a helpful strategy, i.e., they do not like it. This could be explained as a reluctance of the higher proficiency students to be corrected, as these students are the ones who usually perform well, not only in English but in other subjects as well. Consequently, they might consider correction by a native speaker or by their teacher a failure or a negative exposure that could cause them to ‘lose face’.

Discussion

Concerning the first question of this research study, about whether confidence affects learners’ choice of strategy, it was shown that in a number of items there was great deviation between frequency of use and confidence in the effectiveness of the strategy. This could imply a need for strategy instruction, as the learners appear to be confident that the specific strategy might help them, even if their frequency of use indicates that they do not use the strategy often, or in some cases not at all. This is an important finding as it demonstrates the difference between what is used and what is considered useful. However, such an inference would have been impossible without the introduction of the parameter of confidence and without the use of the bar. As for the second question, a number of items in the questionnaire were identified as problematic. These items caused confusion among learners or were considered similar, and often resulted in incorrect responses. Such items probably need to be revised and reworded before this instrument is used again (see Dörnyei, 2003, as well as Roszkowski & Soven, 2010, for similar suggested improvements
in questionnaires). Finally, concerning the third research question, namely how proficiency in English affects the learners’ strategic behaviour, it is evident that it does, as the level of the students (beginner or advanced) seems to influence not only their perception of the actual items of the questionnaire but also their recorded responses.

Conclusion

The current study has the obvious limitation of a small number of participants, since it was a pilot study. Future research with a larger sample would allow quantitative analyses and correlations that would provide more valid conclusions. This limitation should be taken into serious consideration because the way the suggested instrument was administered (oral administration and individual interviews) cannot be applied to a large number of learners and therefore yields very restricted data. As a general conclusion, apart from certain improvements and/or changes that need to be made to the questionnaire to make it more appropriate for the specific learners, the need for instruction is apparent, as it would boost the learners’ strategy use, make them more efficient and autonomous, and probably encourage and reinforce their self-study. Moreover, the format of the data collection could be adapted so that a larger number of participants could be included, and more valid information could therefore be extracted, through a differentiated format of the same questionnaire aimed at administration to larger groups of learners.

Acknowledgement

This study is part of the Thales Project MIS 379335. It was carried out within the National Strategic Reference Framework (Ε.Σ.Π.Α.) and was co-funded by the European Union (European Social Fund) and national resources.

References

Bull, S., & Ma, Y. (2001). Raising learner awareness of language learning strategies in situations of limited resources. Interactive Learning Environments, 9(2), 171-200.
Chamot, A. U. (2005). Language learning strategy instruction: Current issues and research. Annual Review of Applied Linguistics, 25, 112-130.
Chamot, A. U. (2007). Accelerating academic achievement of English language learners: A synthesis of five evaluations of the CALLA Model. In J. Cummins & C. Davison (Eds.), The international handbook of English language learning (pp. 317-331). Norwell, MA: Springer Publications.
Cohen, A. D. (2003). The learner’s side of foreign language learning: Where do styles, strategies, and tasks meet? International Review of Applied Linguistics in Language Teaching, 41(4), 279-291.
Cohen, A. D., & Macaro, E. (2007). Language learner strategies: Thirty years of research and practice. Oxford: Oxford University Press.
Dörnyei, Z. (2003). Questionnaires in second language research: Construction, administration, and processing. Mahwah, NJ: Lawrence Erlbaum Associates.
Gavriilidou, Z., & Mitits, L. (2013). Adaptation of the Strategy Inventory for Language Learning (SILL) for students aged 12-15 into Greek: A pilot study. Paper presented at the 21st International Symposium of Theoretical and Applied Linguistics, April 5th-7th, School of English, Aristotle University of Thessaloniki, Greece.
Gavriilidou, Z., & Papanis, A. (2010). The effect of strategy instruction on strategy use by Muslim pupils learning English as a second language. Journal of Applied Linguistics, 25, 47-63.
Gavriilidou, Z., & Psaltou-Joycey, A. (2009). Language learning strategies: An overview. Journal of Applied Linguistics, 25, 11-25.
Graham, S., & Macaro, E. (2008). Strategy instruction in listening for lower-intermediate learners of French. Language Learning, 58(4), 747-783.
Green, J. M., & Oxford, R. (1995). A closer look at learning strategies, L2 proficiency, and gender. TESOL Quarterly, 29(2), 261-297.
Intze, P. (2010). Accuracy and confidence in Modern Greek vocabulary of native and non-native speakers in Western Thrace (in Greek). Doctoral thesis, Democritus University of Thrace, Greece.
Intze, P., & Kambaki-Vougioukli, P. (2009). Lexical guessing: Accuracy and confidence of pupils of Greek as a first or second language. Journal of Applied Linguistics, 25, 65-83.
Kambaki-Vougioukli, P. (1992a). Greek and English readers’ accuracy and confidence when inferencing meanings of unknown words. Proceedings of the 6th International Symposium on the Description and/or comparison of English and Greek (pp. 89-112). Thessaloniki, Greece: Aristotle University.
Kambaki-Vougioukli, P. (1992b). Accuracy and confidence of Greek learners guessing English word meaning. Doctoral thesis, University of Wales, U.K.
Kambaki-Vougioukli, P. (2001). Γλωσσική αποκατάσταση παιδιών επαναπατριζομένων από την πρώην ΕΣΣΔ: Μια ψυχογλωσσική προσέγγιση. PHASIS, ICPBMGS, 4, 71-81.
Kambaki-Vougioukli, P. (2012). SILL revisited: Confidence in strategy effectiveness and use of the bar in data collecting and processing. In Z. Gavriilidou, A. Efthymiou, E. Thomadaki, & P. Kambaki-Vougioukli (Eds.), Selected papers of the 10th ICGL (pp. 342-353). Komotini, Greece: Democritus University of Thrace.
Kambaki-Vougioukli, P. (2013). Bar in SILL questionnaire for multiple results processing: Users’ frequency and confidence. Sino-US English Teaching, 10(3), 184-199.
Kambaki-Vougioukli, P., & Vougiouklis, T. (2008). Bar instead of a scale. Ratio Sociologica, 3, 9-56.
Kambaki-Vougioukli, P., Karakos, A., Lygeros, N., & Vougiouklis, T. (2011). Fuzzy instead of discrete. Annals of Fuzzy Mathematics and Informatics (AFMI), 2(1), 81-89.
Kambakis-Vougiouklis, P., Mamoukari, P., Agathopoulou, E., & Alexiou, T. (2013). Oral application of SILL questionnaire using the bar for frequency and evaluation of strategy use by Muslim pupils in Thrace. Paper presented at the 21st ISTAL, April 5th-7th, Aristotle University of Thessaloniki, Greece.
Kazamia, V. (2003). Language learning strategies of Greek adult learners of English: Volumes I and II. Doctoral thesis, The University of Leeds, UK.
MacIntyre, P. D. (1994). Toward a social psychological model of strategy use. Foreign Language Annals, 27(2), 185-195.
Magogwe, J. M., & Oliver, R. (2007). The relationship between language learning strategies, proficiency, age and self-efficacy beliefs: A study of language learners in Botswana. System, 35(3), 338-352.
Mathioudakis, N., & Kambaki-Vougioukli, P. (2010). The adjectival identity of Ulysses in Kazantzakis’ ODYSSEY through the fuzzy sets. Paper presented at the 4th European Congress of Greek Studies, September 9th-12th, University of Granada, Spain. Retrieved from: http://www.eens.org/EENS_congresses/2010/Mathioudakis_Nikolaos_Kambaki-Vougioukli_Penelope.pdf
McDonough, S. H. (1999). Learner strategies. Language Teaching, 32(1), 1-18.
Oxford, R. L. (1990). Language learning strategies: What every teacher should know. New York: Newbury House.
Oxford, R. L. (1996). Employing a questionnaire to assess the use of language learning strategies. Applied Language Learning, 7(1&2), 25-45.
Phillips, V. (1991). A look at learner strategy use and ESL proficiency. CATESOL Journal, 4, 57-67.
Psaltou-Joycey, A. (2008). Cross-cultural differences in the use of learning strategies by students of Greek as a second language. Journal of Multilingual and Multicultural Development, 29(3), 310-324.
Psaltou-Joycey, A., & Kantaridou, Z. (2009). Plurilingualism, language learning strategy use and learning style preferences. International Journal of Multilingualism, 6(4), 460-474.
Psaltou-Joycey, A., & Sougari, A. (2010). Greek young learners’ perceptions about foreign language learning and teaching. In A. Psaltou-Joycey & M. Mattheoudakis (Eds.), Advances in research on language learning and teaching: Selected papers (pp. 387-401). Thessaloniki: The Greek Applied Linguistics Association.
Roszkowski, M. J., & Soven, M. (2010). Shifting gears: Consequences of including two negatively worded items in the middle of a positively worded questionnaire. Assessment & Evaluation in Higher Education, 35(1), 113-130.
Tragant, E., & Victori, M. (2006). Reported strategy use and age. In C. Muñoz (Ed.), Age and foreign language learning rate. The BAF project (pp. 208-236). Clevedon: Multilingual Matters.
Tragant, E., & Victori, M. (2012). Language learning strategies, course grades, and age in EFL secondary school learners. Language Awareness, 21(3), 293-308.
Vougiouklis, T., & Kambaki-Vougioukli, P. (2011). On the use of the bar. China-USA Business Review, 10(6), 484-489.
Wharton, G. (2000). Language learning strategy use of bilingual foreign language learners in Singapore. Language Learning, 50(2), 203-244.

CHAPTER FIVE

THE DEVELOPMENT OF A VOCABULARY TEST TO ASSESS THE BREADTH OF KNOWLEDGE OF THE ACADEMIC WORD LIST

LEE-YEN WANG

Abstract

This research study investigated English as a foreign language (EFL) college students’ vocabulary acquisition of a group of 52 academic words that were excluded from the national wordlist for high school students in Taiwan. The study found that the 52 omitted words were acquired significantly less by both the freshman and senior students in Taiwan compared with the other 518 non-omitted academic words. In addition, 38 of the 52 omitted words are also on the Academic Vocabulary List (AVL), which was made available in 2013 (Davies, 2012), as part of the Corpus of Contemporary American English (COCA). Again, these 38 shared words were also acquired significantly less than the non-omitted academic words by the same groups of freshman and senior students in this research study. These findings highlight the limitations of having a centrally controlled national wordlist for students to learn, as anything omitted from that list will have a high probability of being missed in later acquisition.

Introduction

In Taiwan, before students can be admitted to a college, they have to take at least one mandatory national exam, and English is a required subject. Much like other Asian countries, in Taiwan college entrance examinations are high-stakes exams (Guo, 2005; Ng & Renshaw, 2009) and students are encouraged to study hard for them early in their
childhood. In this environment, students’ English proficiency in high school highly correlates with how they will perform in college (Luo, 2005). Therefore, vocabulary acquisition during this period in students’ education can be crucial. The vocabulary included in the English teaching materials in Taiwan has to conform to the guidelines established by the English Reference Word List (ERWL). This list determines what teachers teach and what students learn. In general, teaching, learning, and testing form a highly connected relationship. Due to the pressure of the high-stakes exams, high school teachers try their best to cram all the listed words into their lessons while students work hard to master them. After high school, there is no word list available to guide English language teaching and learning and the Ministry of Education (MOE) in Taiwan does not specify other vocabulary lists to augment the ERWL for college students. In previous research (Wang, 2015), it was found that there were 52 academic words omitted from the ERWL. This finding provided a unique opportunity for investigating the subsequent acquisition of the 52 omitted academic words relative to the rest of the ERWL words.

Background

High school English education in Taiwan is under the guidance of the ERWL, which is published and released by the College Entrance and Examination Center (CEEC), a non-profit organization commissioned by the MOE in Taiwan to administer the nationwide college entrance examination (CEEC, 2015). Wang (2015) compared the ERWL with West’s General Service List (GSL) (West, 1953), Coxhead’s (2000) Academic Word List (AWL), the 5,000 Frequency Dictionary (Davies & Gardner, 2010) from the Corpus of Contemporary American English (COCA), and the most frequent 500,000 words in the COCA. The study concluded that the ERWL is a reasonable wordlist, but the comparison identified a set of 52 AWL words that were omitted from the ERWL, probably because Jeng (2005) was unaware of the availability of the AWL (Coxhead, 2000) when compiling the ERWL. This study leverages this finding as a prism to investigate the acquisition of vocabulary by English as a foreign language (EFL) college students in Taiwan from the perspective of these 52 academic words.

Vocabulary learning is essential for language development. Wilkins (1972) states that “without vocabulary, nothing can be conveyed” (p. 111), not ignoring the importance of grammar, but stressing the role of vocabulary in conveying ideas. Lexical knowledge is essential in all skill
areas, including writing (Engber, 1995; Laufer & Waldman, 2011), listening (Chang, 2007; Nation, 2006), speaking (Joe, 1998; Koizumi, 2013), and fluency (Harrington, 2006; Pellicer-Sanchez & Schmitt, 2012). Vocabulary is, in general, categorized into three main groups: the GSL-based high-frequency words (West, 1953), academic vocabulary, and technical and low-frequency words (Nation, 2001). West’s GSL was built using a corpus of 5 million words (West, 1953), and it can cover between 80% and 90% of the words in a text. Technical vocabulary is highly subject-dependent and its usage is discipline-specific. Researchers and scholars have also explored vocabulary items or phrases that occur with a higher frequency in academic texts than in other genres (Anderson, 1980; Coxhead, 2000; Cummins & Man, 2007; Gardner & Davies, 2014; Simpson & Mendis, 2003), namely, academic vocabulary. The AWL, a list of 570 word families, was created from academic texts in four main areas: the arts, law, science, and commerce. These four areas were further divided into 28 subjects (Coxhead, 2000, 2011) and the academic words were categorized in ten sublevels, with the highest frequency group in Level 1 (Coxhead, 2000). Hyland and Tse (2007) replicated Coxhead’s corpus study with a corpus of 3.3 million words. Their findings showed that Coxhead’s AWL has a skewed distribution among the subcorpora, and 27% of the AWL items have very low occurrences in at least one of the three subcorpora. Furthermore, it was found that the AWL coupled with the GSL can only account for 78% of the words, rather than the usual 85% reported by Coxhead (Hyland & Tse, 2007). This led Hyland and Tse to conclude that there may be no universal academic vocabulary that can satisfy all the academic disciplines.
Eldridge (2008) argued that Coxhead compiled a set of more general vocabulary than the GSL, and that the AWL serves a supportive function and the words are “not likely to be glossed by the content teacher” (Flowerdew, 1993, p. 236). However, Coxhead (2011) reviewed the impact and significance of the AWL and showed that the AWL has had a deep influence in the field of English language teaching and learning. While the AWL (Coxhead, 2000) achieves 10.1% text coverage in the 3.5 million words of academic texts in her collection, Gardner and Davies (2014) reported that their academic vocabulary list (AVL) covers 14% of their academic corpus of about 120 million words. This is achieved with 1,991 academic word families and 3,015 lemmas, but without the GSL as the base. Since COCA’s AVL is a recent product, this study will also investigate how many of the 52 omitted AWL words
overlap with COCA’s AVL and further examine the acquisition of these overlapped words by Taiwanese students. Properly assessing a learner’s vocabulary knowledge plays a significant role in facilitating efficient language teaching and learning. Assessment can evaluate vocabulary from a variety of perspectives: size, depth, fluency, and other cognitive and association skills (Meara, 2002; Meara & Wolter, 2004; Read, 2000; Schmitt, 2014; Sonbul & Schmitt, 2013; Tannenbaum, 2006). Read (1993, 2000) developed tests to assess word association, including knowledge of collocation, derivative forms of a stem word, and polysemous meaning senses. Nation (2001) differentiated the passive vocabulary that a person can understand in reading and listening, from the active vocabulary, which a person can use in speaking and writing. A convenient way to measure students’ passive or receptive vocabulary knowledge is through a checklist, where students mark YES/NO (Y/N) to self-report whether they know these words (Meara & Buxton, 1987). These Y/N tests are incapable of surveying vocabulary depth (Laufer & Goldstein, 2004; Read, 2000), but they are simple to administer and they are effective in assessing vocabulary size or breadth (Read, 2007). However, the self-reporting nature of Y/N tests requires researchers to implement further checks to assess the reliability of the self-reporting. Pseudowords were introduced to the Y/N tests in the reading comprehension assessment by Anderson and Freebody (1983), with two approaches to creating a pseudoword. The first one is to add a prefix or suffix to a real word, e.g., ‘steal’ becomes ‘stealment.’ Modifying the vowel or the consonant, one or two at a time, forms the second method of constructing a pseudoword. 
Pseudowords cannot be counted as real words, so responses to them are classified as a ‘hit’ or a ‘false alarm.’ A hit indicates that the pseudoword is correctly recognized as a pseudoword, while a false alarm is when the test-taker claims to know a pseudoword as if it were a real word. The relative numbers of these two measures can indicate whether the test result can be trusted (Meara, 1992).

The Study

This study is based on the finding by Wang (2015) that there is a set of 52 AWL words that were omitted from the ERWL. A question that arises is whether the omitted vocabulary items can be acquired later in college and how they are acquired. In addition, because the COCA also released the AVL, this study leverages the AVL to investigate the number of overlaps between the omitted AWL words and the AVL and the
acquisition of these overlapped words. To solicit students’ knowledge about vocabulary, a Y/N test was used as a convenient tool, knowing it could be affected by incorrect self-reports from respondents. To address this deficiency, a popular approach is to embed questions with pseudowords in the test.

Research Questions

In the current study, three questions were posited to investigate the acquisition of the 52 omitted academic words in the AWL relative to the set of non-omitted AWL words:

1. Is there a significant difference between the acquisition of the non-omitted academic words and the omitted academic words between freshman and senior college students in Taiwan?
2. Does the inclusion of pseudowords affect the design of the Y/N questions in the vocabulary test?
3. For the omitted words that are both on the AWL and the AVL, is there a significant difference in the acquisition of the overlapped words between freshman and senior college students in Taiwan?

Participants

Two classes of freshman students and one class of senior students from a private university in Taipei, Taiwan participated in this study. This university has a policy of assigning students to freshman English classes by the level of their General Scholastic English Ability Test (GSEAT), which is a mandatory exam every high school senior student in Taiwan has to take in early February of each year. There are 15 levels in the English subject. The freshman students who participated in the study were at GSEAT Level 13 or above, which is in one of the top percentiles of 76% to 82% of all students who took the exam. One student was at Level 15, which is in the top percentile of 88% to 100%. These two classes of freshman students had the highest GSEAT scores in the university. A total of 50 freshman students participated in this study. A further 39 senior students from the English department also participated. Table 5-1 summarizes some general information about the study participants.

Table 5-1: Participant information.

Participants   No.   Male   Female   Age Range   First Language
Freshmen       50    21     29       18-20       Chinese
Seniors        39    4      35       21-24       Chinese

Compilation of Y/N Checklist

There are 570 word families in the AWL. It is not feasible to ask students to report on so many items at one time using a Y/N checklist. With 52 omitted words, this leaves 518 words to be sampled. The formula introduced by Krejcie and Morgan (1970) was used to determine the sample size. Using this method, a total of 220 words from the AWL were selected for inclusion in the Y/N checklist, at a 5.02% margin of error. A further 82 pseudowords were chosen for inclusion in the Y/N checklist, with a margin of error of 9.07%, again calculated from the formula by Krejcie and Morgan (1970). If a 5% margin of error were to be maintained, 160 pseudowords would be required, and the overall checklist would exceed the size of an A3 sheet of paper. The pseudowords were created by changing a consonant or a vowel according to the approaches of Anderson and Freebody (1983). In this approach, pseudowords are fabricated as synforms (Laufer, 1988), which can be confusing to Y/N test-takers. Finally, these two sets (220 words sampled from the AWL and 82 pseudowords) were combined with the 52 omitted academic words to form the final Y/N checklist of 354 words. The sampling process did not randomly select words from each of the 10 levels in the AWL but treated all the words as a single pool. Table 5-2 shows how many of the 354 words on the Y/N checklist belonged to each AWL Level, while Table 5-3 provides a list of the 52 omitted AWL words. Since the number of words per level differed, there was a need to check whether the words were uniformly distributed across the ten academic levels defined in the AWL by Coxhead (2000). On average each level should have 35.4 words, because the total is 354 words across 10 levels. A Chi-square goodness-of-fit analysis was performed using 35.4 as the expected value for each level, checked against the subtotal in each level, and the result was not statistically significant (χ²(9) = 9.842, p = .3634, i.e., p > .05). Therefore, the distribution of the sampled words across the ten levels did not differ significantly from a uniform distribution.
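The sample sizes reported above can be reproduced with a minimal sketch of the Krejcie and Morgan (1970) formula. The function name, the rounding up to a whole item, and the population of 272 used for the pseudoword calculation are assumptions made for illustration; the chapter does not state that population, and 272 (the 220 sampled words plus the 52 omitted words) is used here only because it yields the reported figures.

```python
import math

# Chi-square value for 1 degree of freedom at the 95% confidence level,
# as used in the Krejcie and Morgan (1970) formula.
CHI_SQUARE_95 = 3.841


def sample_size(population: int, margin: float, proportion: float = 0.5) -> int:
    """Krejcie-Morgan sample size, rounded up to a whole item (rounding is an assumption)."""
    spread = proportion * (1 - proportion)
    numerator = CHI_SQUARE_95 * population * spread
    denominator = margin ** 2 * (population - 1) + CHI_SQUARE_95 * spread
    return math.ceil(numerator / denominator)


print(sample_size(518, 0.0502))  # 220 non-omitted AWL words
print(sample_size(272, 0.0907))  # 82 pseudowords (population of 272 is an assumption)
print(sample_size(272, 0.05))    # 160 pseudowords at a 5% margin of error
```

Holding the population fixed and tightening the margin from 9.07% to 5% roughly doubles the required number of pseudowords, which is the trade-off described in the text.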

Table 5-2: Distribution of the sample words across the AWL Levels.

AWL Level             1    2    3    4    5    6    7    8    9    10   Total
Set 1 (non-omitted)   26   31   18   32   24   22   23   14   23   7    220
Set 2 (omitted)       1    0    5    4    3    7    7    10   5    10   52
Set 3 (pseudowords)   12   10   7    4    10   9    8    11   8    3    82
Total                 39   41   30   40   37   38   38   35   36   20   354
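As an illustrative check, the Chi-square goodness-of-fit statistic can be recomputed from the per-level totals in Table 5-2. This is a sketch rather than the SPSS procedure used in the study; the function name is hypothetical, and the critical value of 16.919 (9 degrees of freedom, alpha = .05) is taken from a standard Chi-square table.

```python
# Totals per AWL Level (Levels 1 to 10) from the last row of Table 5-2.
observed = [39, 41, 30, 40, 37, 38, 38, 35, 36, 20]

# Standard table value for 9 degrees of freedom at alpha = .05.
CRITICAL_VALUE_DF9 = 16.919


def chi_square_uniform(counts):
    """Goodness-of-fit statistic against equal expected counts in every category."""
    expected = sum(counts) / len(counts)  # 354 / 10 = 35.4
    return sum((count - expected) ** 2 / expected for count in counts)


statistic = chi_square_uniform(observed)
print(round(statistic, 3))             # 9.842
print(statistic < CRITICAL_VALUE_DF9)  # True: uniformity is not rejected
```

The statistic falls well below the critical value, which agrees with the non-significant result reported in the text.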

Table 5-3: Omitted AWL words with Level information.

Words        Level   Words            Level   Words             Level
adjacent     10      discrete         5       intrinsic         10
aggregate    6       domain           6       invoke            10
albeit       10      empirical        7       legislate         1
amend        5       entity           5       levy              10
append       8       fluctuate        8       negate            3
arbitrary    8       hierarchy        7       notwithstanding   10
attribute    4       hypothesis       4       offset            8
automate     8       ideology         7       ongoing           10
concurrent   9       implicate        4       paradigm          7
constrain    3       incidence        6       parameter         4
convene      3       incorporate      6       practitioner      8
criteria     3       infrastructure   8       predominant       8
deduce       3       inhibit          6       protocol          9
denote       8       innovate         7       qualitative       9
deviate      8       integral         9

Table 5-4 presents the list of the 82 pseudowords included in the Y/N checklist. In this table, the pseudowords, the AWL Levels, and the original words for the pseudowords are presented. For instance, ‘abandon’ was modified to ‘abendon’ using vowel modification. The word ‘availabler’ was created by adding the suffix ‘-er’ to the word ‘available’. The word ‘breif’ was created from ‘brief’ by swapping ‘ie’ for ‘ei’. The purpose was to gauge whether pseudowords derived from legitimate academic words in the AWL would produce different responses between freshman and senior students.

Table 5-4: Pseudowords with Level information and the original AWL words.

Pseudoword      L    Original W     Pseudoword       L    Original W
abendon         8    abandon        coerperate       6    cooperate
acadamy         5    academy        coopler          7    couple
accompeny       8    accompany      crecter          1    create
acheive         2    achieve        desing           2    design
acknowladge     6    acknowledge    dymension        4    dimension
assast          2    assist         emphesis         3    emphasis
attein          9    attain         enhiance         6    enhance
availabler      1    available      enture           3    ensure
banefit         1    benefit        evaruate         2    evaluate
breif           6    brief          eventuel         8    eventual
chellange       5    challenge      exband           5    expand
chairter        8    chart          espend           5    expand
circumstiance   3    circumstance   flexibel         6    flexible
coite           6    cite           fourmula         1    formula
clerify         8    clarify        genertion        5    generation
coharent        9    coherent       hypothisis       4    hypothesis
commance        1    commence       imblicit         8    implicit
commision       2    commission     implisit         8    implicit
cmunicate       4    communicate    imbly            3    imply
consast         1    consist        inkome           1    income
contrdict       8    contradict     individaul       1    individual
ratioe          5    ratio          salect           2    select
regulete        2    regulate       spedific         1    specific
restract        2    restrict       sdyle            5    style
refeal          6    reveal         supmit           7    submit
ritid           9    rigid          tehnology        3    technology
skope           6    scope          salect           2    select
whereask        5    whereas        widspread        8    widespread
inevatable      8    inevitable     morme            9    norm
indegrity       10   integrity      nowithstanding   10   notwithstanding
invoulve        1    involve        occubie          4    occupy
jornual         2    journal        overlape         9    overlap
justefy         3    justify        persive          2    perceive
layert          3    layer          percist          10   persist
leglel          1    legal          predede          6    precede
marture         9    mature         precite          5    precise
midia           7    media          prioridy         7    priority
purcue          5    pursue         prosibit         7    prohibit
qualitetive     9    qualitative    morme            9    norm
thoery          1    theory         uniforn          8    uniform
thezis          7    thesis         uniqeu           7    unique
violit          9    violate        vesible          7    visible

Note: W = Word; L = Level.
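The construction approaches described above (adding an affix, changing a single letter, and swapping adjacent letters) can be sketched as small string operations. The helper names below are hypothetical and only illustrate the mechanics; the chapter does not describe any software used to build the pseudowords, and the actual items were presumably hand-picked.

```python
def add_affix(word: str, prefix: str = "", suffix: str = "") -> str:
    """Attach a prefix and/or suffix to a real word."""
    return prefix + word + suffix


def substitute_letter(word: str, index: int, replacement: str) -> str:
    """Replace the letter at a zero-based position with another letter."""
    return word[:index] + replacement + word[index + 1:]


def swap_adjacent(word: str, index: int) -> str:
    """Swap the letter at a zero-based position with the one that follows it."""
    letters = list(word)
    letters[index], letters[index + 1] = letters[index + 1], letters[index]
    return "".join(letters)


print(add_affix("steal", suffix="ment"))     # stealment
print(substitute_letter("abandon", 2, "e"))  # abendon
print(swap_adjacent("brief", 2))             # breif
```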

Final Y/N Checklist

The words in each of the three sets were tagged (omitted, non-omitted, and pseudowords) and then combined and randomized. The randomized words were numbered consecutively from 1 to 354. The Y/N checklist was administered on paper, and the instructions asked students to indicate whether they knew each word. Since no previous study had asked students to complete such an extensive checklist, a pilot test was conducted with a group of six students (four freshmen and two seniors) to see how much time students would need to complete it. Completion time in the pilot ranged from 11 to 16 minutes, so the time allowed for the formal checklist was set at 20 minutes, which proved sufficient. The checklists for both senior and freshman students were administered and collected in the first three weeks of the fall semester, when the freshmen had just started their college study in Taiwan. SPSS PASW Statistics 18 was used for data analysis.

An appropriate metric for this study was the relative ratio between the omitted words and the non-omitted words. The following variables are needed to explain this metric:

• OC: the total count of ‘Yes’ responses for the omitted AWL words in the checklist.
• OT: the total number of omitted AWL words in the ERWL (N = 52).
• NC: the total count of ‘Yes’ responses for the non-omitted AWL words.
• NT: the total number of non-omitted words determined by the sampling process (N = 220).
• PC: the total number of ‘hits’ (i.e., ‘Yes’ responses) for pseudowords.
• PT: the total number of pseudowords (N = 82).
• PP: PC/PT (the total number of ‘hits’ for pseudowords divided by the total number of pseudowords).
• OP: OC/OT (the total count of ‘Yes’ responses for the omitted AWL words divided by the total number of omitted AWL words in the ERWL).
• NP: NC/NT (the total count of ‘Yes’ responses for the non-omitted AWL words divided by the total number of non-omitted words determined in the sampling process).
• RP: NP/OP (the value of NP divided by the value of OP).

The concept and calculation of PP, OP and NP are straightforward; RP, however, needs to be explained. If a student’s knowledge of the omitted AWL words has the same distribution as his/her knowledge of the non-omitted words, the relative ratio will be 1. For instance, if a student reports 110 ‘Yes’ out of 220 non-omitted words and 26 ‘Yes’ out of the 52 omitted words, we will have the following:

NP = NC/NT = 110/220 = 0.5
OP = OC/OT = 26/52 = 0.5
RP = NP/OP = 1

If this student keeps the same NC, 110 ‘Yes’, but reports 39 ‘Yes’ out of 52 (OT), the calculation will then be as follows:

NP = NC/NT = 110/220 = 0.5
OP = OC/OT = 39/52 = 0.75
RP = NP/OP = 0.667

On the other hand, if this student keeps the same NC, 110 ‘Yes’, but reports only 13 ‘Yes’ out of 52 (OT), the calculation will then be as follows:

NP = NC/NT = 110/220 = 0.5
OP = OC/OT = 13/52 = 0.25
RP = NP/OP = 2

Evidently, if the relative ratio (RP) between NP and OP is much larger than 1, the number of omitted words that a student has indicated that he/she knows is relatively much smaller than the number of non-omitted words that he/she has indicated knowing, and vice versa. When the RP value is close to 1, the two proportions are also close to each other. If OP is zero (i.e., a student did not know any of the omitted words), NP/OP cannot be computed; in that case the calculation for RP is reversed, and the reversal applies to the whole calculation.
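The three worked examples above can be reproduced with a short script. This is only an illustrative sketch: the function name `relative_ratio` and its arguments are assumptions introduced here, the original analysis was run in SPSS, and the zero-OP case is simply flagged rather than handled in the way the study did.

```python
# Illustrative sketch only: names are hypothetical, not from the original study.

def relative_ratio(nc, oc, nt=220, ot=52):
    """Return (NP, OP, RP) for one student's 'Yes' counts.

    nc: 'Yes' responses on the non-omitted words (out of nt)
    oc: 'Yes' responses on the omitted words (out of ot)
    """
    np_ = nc / nt          # proportion of non-omitted words reported as known
    op = oc / ot           # proportion of omitted words reported as known
    if op == 0:
        # NP/OP is undefined here; the chapter treats this case separately.
        return np_, op, None
    return np_, op, np_ / op


# The three scenarios discussed in the text (NC fixed at 110).
for oc in (26, 39, 13):
    np_, op, rp = relative_ratio(nc=110, oc=oc)
    print(f"OC={oc}: NP={np_:.3f}, OP={op:.3f}, RP={rp:.3f}")
```

Returning `None` when OP is zero keeps the undefined case visible instead of silently producing a division error.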

Results and Discussion

With regard to the first research question, whether there is a significant difference between freshman and senior college students in Taiwan in the acquisition of the non-omitted and the omitted academic words, the results of the statistical analyses are shown below in Tables 5-5 and 5-6. Table 5-5 shows that NC for the freshman students averaged 160.80 words out of 220 (73.1%), while NC for the senior students averaged 176.41 words out of 220 (80.2%). The mean count of ‘Yes’ responses for the omitted AWL words (variable OC) was 9.74 (18.7%) for the freshman students and 16.69 (32.1%) for the senior students. It is clear that freshman students knew an average of 73.1% of the academic words that are on the ERWL, but only 18.7% of the academic words that are not on the list. The situation improves considerably for the senior students, but the percentages are still 80.2% versus 32.1%. The NP/OP ratio compares the proportion of non-omitted words known with the proportion of omitted academic words known. This was 4.869 and 3.183 for the freshmen and seniors respectively, and both are significantly greater than 1. Table 5-6 presents the results of an ANOVA comparison, and it shows that there is a statistically significant difference between freshman and senior students in this ratio measurement.


Table 5-5: Descriptive statistics (freshmen = 50, seniors = 39).

Year      DV     Mean    SD      Min    Max     Range   95% CI Lower  95% CI Upper
Freshmen  NC     160.80  26.439  102    204     102     144.23        174.12
Freshmen  NP     .731    .1202   .4634  .927    .464    .6556         .791437
Freshmen  OC     9.74    4.861   2      21      19      8.80          16.24
Freshmen  OP     .187    .093    .0385  .404    .365    .1693         .312
Freshmen  NP/OP  4.869   2.400   2.202  13.473  11.270  3.226         5.733
Seniors   NC     176.41  31.263  100    218     118     176.27        191.52
Seniors   NP     .802    .142    .455   .991    .536    .80122        .871
Seniors   OC     16.69   8.037   2      35      33      14.39         19.82
Seniors   OP     .321    .1545   .039   .673    .635    .277          .381
Seniors   NP/OP  3.183   2.068   1.45   13.710  12.264  2.322         4.123

Note: SD = Standard Deviation; DV = Dependent Variable; CI = Confidence Interval; NC, NP, OC, OP, and NP/OP are defined in the previous section.

Table 5-6: Comparison between freshmen and seniors in their knowledge of non-omitted/omitted academic words using ANOVA.

Source   Type III Sum of Squares  df  Mean Square  F       Sig    η²     Observed Power
Between  62.256                   1   62.256       12.177  0.001  0.123  0.932
Within   444.802                  87  5.113
Sum      507.058                  88

Note: df = degrees of freedom; F = ANOVA result; Sig = Significance level; η² = effect size.
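As a quick consistency check, the F ratio and the effect size reported in Table 5-6 can be recomputed from the sums of squares and degrees of freedom in the same table. The sketch below is an assumption-laden illustration (variable names are hypothetical, and eta squared is taken as the between-groups sum of squares divided by the total sum of squares); it is not the SPSS procedure used in the study.

```python
# Recompute F and eta squared from the values printed in Table 5-6.
# Variable names are illustrative; the figures are copied from the table.

ss_between = 62.256    # Type III sum of squares, between groups
ss_within = 444.802    # Type III sum of squares, within groups
df_between = 1
df_within = 87

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # mean square within groups
f_ratio = ms_between / ms_within       # F = MS_between / MS_within

# Eta squared as the share of total variation attributable to group.
eta_squared = ss_between / (ss_between + ss_within)

print(f"MS within = {ms_within:.3f}")
print(f"F({df_between}, {df_within}) = {f_ratio:.3f}")
print(f"eta squared = {eta_squared:.3f}")
```

The recomputed values agree with the table to three decimal places, which suggests the table was reconstructed from the garbled layout correctly.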

The combined results from Tables 5-5 and 5-6 indicate that the ratio of the non-omitted to the omitted academic words was lower for senior students than for freshman students, and that both ratios are significantly greater than 1, which indicates that the students do not know these two groups of words in equal proportion. This can be attributed to the washback effect (Shohamy, Donitsa-Schmidt, & Ferman, 1996), as high school teachers in Taiwan are not teaching words that are not covered in the ERWL. Instead, they heavily drill the words that are included in the ERWL. It is not surprising then that even senior English majors fail to acquire a significant number of the omitted AWL words. These results are a clear indication that words that are not in the ERWL will likely not be learned by the students.

The second research question in this study concerned the inclusion of the pseudowords and how this might affect the design of the Y/N test. The pseudowords were included in order to assess how the respondents handled the checklist and the quality of their self-reports. Some researchers argue against the value of pseudowords on the grounds that overestimation and false alarms are not a concern, since the error rate due to false alarms is low for high proficiency students in general and Asian students in particular (Harrington & Carey, 2009; Mochida & Harrington, 2006; Stubbe, 2012). In this study, both the freshman and the senior groups were composed of high proficiency students. The internal consistency of the test was calculated using Cronbach’s Alpha (see Table 5-7). It was lowest for Set 3 (pseudowords), most likely because the pseudowords were harder to recognize than the real words. For Set 1 (non-omitted words) Cronbach’s Alpha was 0.969, which was higher than for the other two sets. The total checklist, composed of the three sets combined (S1+S2+S3), had a Cronbach’s Alpha of 0.964, which was lower than that of S1+S2 (0.973). Usually, the longer the questionnaire, the higher the value of Cronbach’s Alpha.

Since the inclusion of pseudowords in the test decreased Cronbach’s Alpha, and since the pseudoword set was itself a low reliability construct, as evidenced by its Cronbach’s Alpha of 0.759, the conclusion of this study does not favor including pseudowords in the checklist, as has also been suggested in other research (Harrington, 2006; Harrington & Carey, 2009; Stubbe, 2012). The key question is then: did the use of pseudowords in the Y/N checklist make a difference between freshman and senior students? A MANOVA was performed to see if the vector mean of the 82 pseudowords in Set 3 was different between freshman and senior students. The results showed that there was no difference between the two groups of students (F (10, 78) = 1.223, p = .095, η² = 0.905, Observed Power = 0.422).


Table 5-7: Reliability analysis.

Category          Set 1  Set 2  Set 3  S1+S2  S1+S2+S3
No. of items      220    52     82     272    354
Cronbach’s Alpha  0.969  0.871  0.759  0.973  0.964

Note: Set 1 = non-omitted words; Set 2 = omitted words; Set 3 = pseudowords.
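For readers who wish to see how an internal consistency figure of this kind is obtained, the sketch below computes Cronbach’s Alpha for a small matrix of Yes/No responses using the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The response data and the function name `cronbach_alpha` are invented for illustration; they are not the study’s data, and the study’s own values were produced in SPSS.

```python
# Minimal sketch of Cronbach's Alpha for dichotomous (Yes = 1, No = 0) items.
# The data below are hypothetical and only illustrate the calculation.
from statistics import variance


def cronbach_alpha(responses):
    """responses: list of rows, one row per student, one column per item."""
    k = len(responses[0])                                  # number of items
    item_vars = [variance([row[i] for row in responses])   # variance of each item
                 for i in range(k)]
    total_var = variance([sum(row) for row in responses])  # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Five hypothetical students answering four checklist items.
sample = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"Cronbach's Alpha = {cronbach_alpha(sample):.3f}")
```

Sample variances are used throughout; because the same estimator is applied to the items and to the totals, the choice between sample and population variance does not change the result.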

Finally, the last question this study attempted to answer was whether there was a significant difference between freshman and senior students in the acquisition of the omitted words that were on both the AWL and the AVL. Gardner and Davies (2014) reported that 451 of the 570 AWL word families are among the 4,000 most frequent lemmas in the COCA. This raises an interesting question about how many of the 52 omitted words are in the AVL, which would show how important the omitted words are from COCA’s perspective. A comparison was performed, and it was found that 38 of them were in COCA’s AVL. This indicates that, from the perspective of COCA’s AVL, these 38 words should not have been omitted from the ERWL. Table 5-8 shows the comparison in students’ performance between the 52 omitted academic words and the subset of 38 words (see Table 5-9) that are in the new academic vocabulary list released by Gardner and Davies (2014).

Table 5-8: Descriptive statistics for the 52 and 38 omitted items.

Set       Year     Mean   Std    Min    Max     Range   95% CI Lower  95% CI Upper
52 Items  NP1/OP1  4.869  2.4    2.202  13.473  11.27   3.226         5.733
52 Items  NP4/OP4  3.183  2.068  1.45   13.71   12.264  2.322         4.123
38 Items  NP1/OP1  5.559  5.654  1.62   28.5    26.88   3.114         8.004
38 Items  NP4/OP4  3.229  2.091  1.37   10.02   8.65    2.418         4.040

Note: 1 = freshmen; 4 = seniors; RP = NP/OP; Std = Standard deviation; CI = Confidence Interval.


Table 5-9: Omitted academic words and their COCA ranks and frequencies.

Word            AWL Level  COCA AVL Rank  COCA Frequency
implicate       4          319            12,440
hypothesis      4          345            11,533
incorporate     6          370            10,580
attribute       4          373            10,491
domain          6          481            7,319
constrain       3          485            7,237
underlie        6          499            6,818
ongoing         10         533            6,114
hierarchy       7          537            6,012
innovate        7          545            5,901
scenario        9          555            5,761
practitioner    8          571            5,482
empirical       7          591            5,127
entity          5          619            4,578
parameter       4          631            4,426
paradigm        7          637            4,356
infrastructure  8          673            3,913
predominant     8          720            3,330
integral        9          777            2,891
invoke          10         805            2,685
thesis          7          813            2,613
aggregate       6          854            2,334
arbitrary       8          905            1,990
concurrent      9          996            1,571
whereby         10         1003           1,535
discrete        5          1008           1,516
adjacent        10         1026           1,390
denote          8          1028           1,377
fluctuate       8          1202           856
intrinsic       10         1243           784
simulate        7          1288           714
negate          3          1357           607
deviate         8          1396           544
deduce          3          1413           517
qualitative     9          1444           470
append          8          1736           203
offset          8          1916           68
legislate       1          1970           37

Note: AWL = Academic Word List; COCA = Corpus of Contemporary American English; AVL = Academic Vocabulary List.

Looking at the results of the analysis, the shorter list (38 items) was clearly more difficult for the freshman students than for the senior students: for the senior students the relative ratio (NP4/OP4) was 3.183 for the 52 items and 3.229 for the 38 items, whereas for the freshmen the two values were 4.869 and 5.559, respectively. Furthermore, the ANOVA comparison between the freshmen and the seniors for the short list was found to be significant (F (1, 87) = 10.401, p = .002, η² = 0.107, Observed Power = 0.891), indicating that there is a significant difference in the acquisition of the 38 words between the freshman and the senior students, with seniors outperforming the freshmen. Nevertheless, the NP/OP ratio for the senior students is still far from the ideal value of 1, which would occur if the acquisition of the omitted words was the same as the acquisition of the non-omitted words.

Conclusion

The study reported in this chapter shows that there is indeed a significant difference between freshman and senior college students in Taiwan in their receptive knowledge of academic words that are on the ERWL and those that are not. The freshmen reported knowing an average of 160.8 of the 220 sampled words (73.1%), while the seniors reported knowing 176.41 of the 220 (80.18%). In contrast, the freshman and the senior students reported knowing 9.74 (18.7%) and 16.69 (32.09%) of the 52 omitted words, respectively. The average ratio of the non-omitted words to the omitted words was 4.869 for the freshman students and 3.183 for the seniors. The gap became smaller as students learned more English, but it was still far from the ideal 1:1 ratio. Further research is needed to find out why senior English majors fail to acquire the omitted AWL words. In addition, pseudowords were found not to discriminate between freshman and senior students in this study. A cross-examination with the new academic vocabulary list from COCA found that 38 of the omitted AWL words are also present on the new list (AVL), and there is also a significant difference between the two groups of students in the acquisition of these overlapping words. Although vocabulary is better tested in context (Read, 2004; Read & Chapelle, 2001), this 354-item checklist serves to show a significant learning discrepancy. The findings of this study can help the CEEC refine and augment the ERWL, while college and high school English language instructors can use the information in this study to find ways to teach the omitted words to their students.

References

Anderson, J. (1980). The lexical difficulties of English medical discourse for Egyptian students. English for Specific Purposes Newsletter, 37, 4.
Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the assessment and acquisition of word knowledge. In B. A. Hutson (Ed.), Advances in reading/language research, Vol. 2 (pp. 231-256). Greenwich, CT: JAI Press.
Bauer, L., & Nation, P. (1993). Word families. International Journal of Lexicography, 6(4), 253-279.
CEEC. (2015). GSAT and AST. Retrieved from: http://www.ceec.edu.tw/CeecEnglishWeb/E07Process_GSAT.aspx
Chang, A. C. S. (2007). The impact of vocabulary preparation on L2 listening comprehension, confidence and strategy use. System, 35(4), 534-550.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
—. (2011). The academic word list ten years on: Research and teaching. TESOL Quarterly, 45(2), 355-362.
Cummins, J., & Man, E. Y. F. (2007). Academic language: What is it and how do we acquire it? In J. Cummins & C. Davison (Eds.), International Handbook of English Language Teaching (pp. 797-810). New York: Springer.
Davies, M., & Gardner, D. (2010). A frequency dictionary of contemporary American English: Word sketches, collocates, and thematic lists. London: Routledge.
Davies, M. (2012). Corpus of contemporary American English (1990-2012). Retrieved from: http://corpus.byu.edu/coca/
Eldridge, J. (2008). “No, there isn’t an ‘academic vocabulary,’ but…”: A reader responds to K. Hyland and P. Tse’s “Is there an ‘academic vocabulary’?”. TESOL Quarterly, 42(1), 109-113.
Engber, C. A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4(2), 139-155.
Flowerdew, J. (1993). Concordancing as a tool in course design. System, 21(2), 231-244.
Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied Linguistics, 35(3), 305-327.
Guo, Y. (2005). Asia’s educational edge: Current achievement in Japan, Korea, Taiwan, China, and India. Lanham, MD: Lexington Books.
Harrington, M. (2006). The Yes/No test as a measure of receptive vocabulary knowledge. Language Testing, 23(1), 73-98.
Harrington, M., & Carey, M. (2009). The on-line Yes/No test as a placement tool. System, 37(4), 614-626.
Hyland, K., & Tse, P. (2007). Is there an “academic vocabulary”? TESOL Quarterly, 41(2), 235-253.
Jeng, H. H. (2005). The methodologies and principles for compiling the vocabulary list of college entrance examination center. Education Research Monthly, 138, 5-17.
Joe, A. (1998). What effect do text-based tasks promoting generation have on incidental vocabulary acquisition? Applied Linguistics, 19(3), 357-377. doi: 10.1093/applin/19.3.357.
Koizumi, R. (2013). Vocabulary and speaking. The Encyclopedia of Applied Linguistics. Hoboken, NJ: Wiley.
Krejcie, R. V., & Morgan, D. W. (1970). Determining sample size for research activities. Educational and Psychological Measurement, 30(3), 607-610.
Laufer, B. (1988). The concept of ‘synforms’ (similar lexical forms) in vocabulary acquisition. Language and Education, 2(2), 113-132.
Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size, strength, and computer adaptiveness. Language Learning, 54(3), 339-436.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A corpus analysis of learners’ English. Language Learning, 61(2), 647-672.
Meara, P. (1992). EFL vocabulary tests. ERIC Clearinghouse.
—. (2002). Review Article: The rediscovery of vocabulary. Second Language Research, 18(4), 393-407.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary tests. Language Testing, 4(2), 142-154.
Meara, P., & Wolter, B. (2004). V_links: Beyond vocabulary depth. Angles on the English Speaking World, 4, 85-96.
Nation, I. S. P. (1983). Testing and teaching vocabulary. Guidelines, 5(1), 12-25.
—. (2001). Learning vocabulary in another language. New York: Cambridge University Press.
—. (2006). How large a vocabulary is needed for reading and listening? Canadian Modern Language Review, 63(1), 59-81.
Ng, C. H., & Renshaw, P. D. (2009). Reforming learning: Concepts, issues and practice in the Asia-Pacific region: An introduction. The Netherlands: Springer.
Pellicer-Sanchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests: Reaction time vs. nonword approaches. Language Testing, 29(4), 489-509.
Read, J. (1993). The development of a new measure of L2 vocabulary knowledge. Language Testing, 10(3), 355-371.
—. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
—. (2004). Plumbing the depths: How should the construct of vocabulary knowledge be defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary in a second language: Selection, acquisition and testing (pp. 209-227). Amsterdam: John Benjamins.
—. (2007). Second language vocabulary assessment: Current practices and new directions. International Journal of English Studies, 7(2), 105-125.
Read, J., & Chapelle, C. (2001). A framework for second language vocabulary assessment. Language Testing, 18(1), 3-32.
Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research shows. Language Learning, 64(4), 913-951.
Shohamy, E., Donitsa-Schmidt, S., & Ferman, I. (1996). Test impact revisited: Washback effect over time. Language Testing, 13(3), 298-317.
Simpson, R., & Mendis, D. (2003). A corpus-based study of idioms in academic speech. TESOL Quarterly, 37(3), 419-441. doi: 10.2307/3588398.
Sonbul, S., & Schmitt, N. (2013). Explicit and implicit lexical knowledge: Acquisition of collocations under different input conditions. Language Learning, 63(1), 121-159.
Stubbe, R. (2012). Do pseudoword false alarm rates and overestimation rates in Yes/No vocabulary tests change with Japanese students’ English ability level? Language Testing, 29(4), 471-488.
Tannenbaum, K. R. (2006). Relationships between word knowledge and reading comprehension in third-grade children. Scientific Studies of Reading, 10(4), 381-398.
Wang, L-Y. (2015). A study of the national high school English wordlist in Taiwan. In C. Gitsaki, M. Gobert, & H. Demirci (Eds.), Current issues in reading, writing and visual literacy: Research and practice (pp. 134-156). Newcastle, UK: Cambridge Scholars Publishing.
West, M. (1953). A general service list of English words. London: Longman, Green and Company.
Wilkins, D. (1972). Linguistics in language teaching. London: Arnold.

ISSUES IN THE CREATION OF ASSESSMENT AND EVALUATION TOOLS

CHAPTER SIX

ASSESSMENT FOR LEARNING; ASSESSMENT FOR AUTONOMY

MARIA GIOVANNA TASSINARI

Abstract

It is generally accepted that autonomy is a matter of degree, or degrees, which fluctuate. It can therefore be assumed that the way in which language teaching and learning is approached can make a significant difference to the degree of autonomy and that, in turn, the degree of autonomy may influence language learning. It is also well documented that assessment plays an influential role in learning and that, like autonomy, assessment is also a matter of degrees: the greater the degree of involvement of the learner in the assessment process, the greater the degree of autonomy that can be achieved. Although assessment in language learning and teaching contexts is usually understood as assessment of language competencies, this chapter intends to show that assessment of learning competencies and of competencies for autonomy should play a role in a curriculum that aims to foster learner autonomy and reflexive learning. The research project reported in this chapter was conducted in a German higher education context and involved the development and use of a dynamic model of autonomy. Once the nature of autonomy had been examined and the views of theorists and practitioners in the field had been taken into account, dimensions of autonomy and their sub-elements were integrated within a dynamic model for initiating and continuing pedagogic dialogue between students and their teachers/advisers at the Freie Universität Berlin. The model of autonomy proved to be reliable, provided a clear picture to learners and their teachers/advisers, and showed its potential to be used iteratively. Being online, the model is available for anyone to use, and its value lies in the formulation of a profile for learners, which helps them understand their learning and their degree of autonomy, builds self-awareness and enhances their metacognitive skills.

Introduction

Assessment plays a central role in language teaching and learning. In institutional forms of assessment, such as tests or certifications, we can often observe that entire curricula and syllabi are directed towards enabling learners to ‘pass the test’ (Prodromou, 1995). In teaching and learning settings that aim at the development of learner autonomy, the capacity of learners to assess and evaluate their own progress and their learning process is pivotal (Dam, 1995). Empowering learners to assess their own language competencies and their learning process is therefore one of the main challenges for language educators. This can be done by putting in place formative assessment modes, in which learners play an active role through self- or peer-assessment. These forms of assessment are referred to in the literature as assessment for learning and assessment as learning (Boud, 2000; Colbert & Cummings, 2014). Although assessment in second language education is mostly understood as assessment of language competencies, researchers in the field of learner autonomy have started to investigate not only ways of assessing language competencies but also ways of assessing the learner’s disposition and capacity for autonomy, i.e., for self-directing their own learning (Sheerin, 1997). This chapter describes an approach towards assessment of/for autonomy developed in a German higher education (HE) setting, at the Freie Universität, Berlin (FUB). The project was undertaken as part of the author’s doctoral studies, with the aim of encouraging and promoting the autonomy of the language learners involved, young adult language learners in HE, using tools which, combined with advising, aimed to increase learner awareness and to explore the possibility of greater learner empowerment in the language learning process.
The study was conducted while the author was setting up a self-access centre, the Centre for Independent Language Learning (CILL), at the FUB. Recognizing that an essential element of the self-access centre was to provide learners and learning facilitators, such as teachers and advisers, with various forms of support for developing learner autonomy, the study was undertaken with the aim of (i) operationalizing the notion of learner autonomy on the basis of a critical analysis of existing definitions and descriptions, and (ii) developing a tool for supporting the learners’ awareness and reflection on their learning competencies.


After describing the theoretical background of the study, the exploration which led to the development of the dynamic model of autonomy, and the different stages of its development, this chapter illustrates the investigation conducted with learners and teachers at the Language Centre of the FUB, and its results, in particular the learners’ feedback, as a way to deal with the crucial questions concerning assessment of and for autonomy. Afterwards, a brief comparison between the dynamic model of learner autonomy and other existing models for assessing or measuring learner autonomy is made and finally conclusions are drawn.

Operationalizing Learner Autonomy

The dynamic model of learner autonomy, which is the focus of this chapter, is based upon extensive research and a critical analysis of several definitions and descriptions of learner autonomy, which elicited core aspects, components, contextual aspects and possible degrees of learner autonomy (see Benson, 2001; Dickinson, 1987; Holec, 1981; Little, 1991; Littlewood, 1996, 1999; Martinez, 2005; Oxford, 2003). The first step towards the model was formulating a definition of learner autonomy focusing on learners’ general competencies in different learning contexts and situations. As a result of this research, learner autonomy was defined as the metacapacity, i.e., the second order capacity, of the learner to take control of their learning process to different extents and in different ways according to the learning situation. Learner autonomy is a complex construct, a construct of constructs, entailing various dimensions and components. Essential components of learner autonomy are:

• a cognitive and metacognitive component (cognitive and metacognitive knowledge, awareness, learners’ beliefs);
• an affective and a motivational component (feelings, emotions, willingness, motivation);
• an action-oriented component (skills, learning behaviours, decisions);
• a social component (learning and negotiating learning with partners, advisers, teachers).

Learner autonomy is a metacapacity inasmuch as it essentially involves the learner’s capacity to activate an interaction and a balance among these dimensions in different learning contexts and situations.


Based upon this definition and on descriptions of the characteristics of autonomous learners and of learners’ attitudes, behaviours and strategies (see, among others, Breen & Mann, 1997; Candy, 1991; Oxford, 1990), a set of descriptors of attitudes towards learning and of language learning competencies and behaviours was developed for each component of learner autonomy. The components were then related to each other within the dynamic model. Within an explorative-interpretative research approach, the first versions of the dynamic model and the descriptors were discussed in workshops with experts, first at the Centre de Recherches et d’Applications Pédagogiques en Langues (CRAPEL), at the Université de Lorraine, and then at the Language Centre of the FUB; subsequently, both the dynamic model and the descriptors were further developed according to the results of the validation. The first versions of the dynamic model and the descriptors were in German and French. The validated German version was tested with students at the CILL and with students and teachers of a language module at the Language Centre, and then translated into English. The research approach therefore combined a theoretical-analytical approach, which comprised the analysis of definitions and descriptions in the literature, the development of a definition of learner autonomy and the development of the dynamic model, with an empirical, explorative-interpretative approach, which consisted of the validation of the dynamic model by experts and the testing of the model with students and teachers.

The Dynamic Model of Learner Autonomy

The dynamic model of learner autonomy (available at http://www.sprachenzentrum.fu-berlin.de/slz/index.html), as shown in Figure 6-1, takes the form of a sphere and entails the dimensions previously identified as being characteristic of learner autonomy: an action-oriented dimension, a cognitive and metacognitive dimension and an affective and motivational dimension. A social component is integrated into each of these dimensions. Distinctions between these dimensions remain on the theoretical level, since they are all closely intertwined; on the practical level, however, these distinctions are useful to learners and to teachers or advisers, because they make it easier to reflect on competencies, skills and strategies and to use opportunities that present themselves in the language learning context to improve and enhance them. Each component of the dynamic model has a set of descriptors, with specific statements about competencies, skills and behaviours, formulated as can-do statements (for example, ‘I can organise a time and a place for my learning’; ‘I can set myself a task’; ‘I can recognize my strengths and weaknesses as a learner and reflect on these’). Together, these descriptors constitute a checklist covering all the main aspects of autonomous language learning; however, they are not intended to be exhaustive or normative, but rather serve as a spectrum of competencies which function as a tool for raising learners’ awareness of autonomous language learning processes in a higher education setting. The full list of descriptors can be seen online.

Figure 6-1: The dynamic model of learner autonomy (Source: Tassinari, 2010, p. 203).

This model is structurally and functionally dynamic. It is structurally dynamic because each component is directly related to all the others (as shown by the arrows in Figure 6-1). It is functionally dynamic because learners can decide to enter the model from any component and move freely from one component to another without following a given path, according to their needs and purposes. For example, they can start with ‘planning’ if they would like to focus on this aspect of the learning process, and then move to ‘evaluating’, or to ‘motivating myself’, or to any other component they want to reflect on. This dynamic feature is an essential characteristic of the model, and makes it possible both to account for the complexity of learner autonomy and to operationalize it by breaking it down into smaller portions. In the online version of the dynamic model, the interrelationships among the components and the descriptors are represented by hypertext links.

Assessment and Autonomy

Learner autonomy, as many researchers and practitioners agree, is a matter of degree or stages within a continuum, namely the autonomy-heteronomy continuum (Everhard, 2015). The variety of forms learner autonomy can take in disparate contexts and situations calls, on the one hand, for a deeper analysis and differentiation of this complex capacity, which can be undertaken, among other means, through specific assessment tools, in order to better investigate the construct and to help researchers and practitioners to better scaffold its development. On the other hand, assessing how autonomous learners are, especially in institutional contexts, exposes us to the risk of “unnecessarily inserting autonomy into the regimes of accountability and assessment that dominate our professional lives” (Benson, 2015, p. xi). One of the crucial questions related to the assessment of autonomy is: Why should we assess autonomy (see also Lai, 2011, cited in Everhard, 2015, p. 29)? The next relevant question is: Who should conduct the assessment, the learner or the teacher/adviser? The aim of the assessment approach underlying the dynamic model is to make learners (more) aware of their potential as autonomous learners, and to allow them to be the initiators of, and responsible for, the (self-)assessment process. The results of the self-assessment should not be considered from the perspective of assessment of learning, or of accountability; on the contrary, they should be seen as a chance for reflection in and on the learning process, that is, as assessment for learning.

Self-Assessment with the Dynamic Model of Learner Autonomy
A prerequisite for the self-assessment process is the learners’ willingness to assess their own learning competencies. Therefore, the use of the dynamic model is voluntary, rather than imposed. The self-assessment process can be conducted either in language advising or in classroom settings and entails four steps: a) Getting started; b) Choosing components


and descriptors; c) Assessing one’s own competencies; and d) Comparing perspectives. These steps may differ slightly depending on the setting. In the first stage, information is elicited about learners’ previous experience of and beliefs about autonomous language learning. In advising settings, this reflection can be conducted by the learner alone, using the questions in the ‘Getting started’ section, or with the language adviser during the advising session. In classroom settings, the reflection can be conducted with a partner or in small groups and eventually discussed in plenary. This process of reflection can be very useful, since learners’ perceptions, beliefs and previous experiences may strongly influence their attitude towards (autonomous) language learning, their decision-making, and, ultimately, the learning process itself (see Barcelos & Kalaya, 2011; Cotterall, 1995, 1999; Navarro & Thornton, 2011). In the second stage (i.e., choosing components and descriptors), it is important that learners actually select the aspects of the learning process that they would like to reflect upon. Whereas in classroom settings the choice of components can be linked to specific tasks, such as implementing an individual learning plan, or choosing resources and tasks for individual objectives, in self-directed learning and language advising settings learners should themselves select the focus of their self-assessment according to their needs. For each component, learners can also choose which descriptors they believe are relevant to their particular language learning process. The third step is self-assessment, where learners can choose between three answers for each descriptor: ‘I can’, ‘I want to learn this’ and ‘This isn’t important for me’.
In advising settings, this step can be tackled when the student is alone, rather than in the advising session, so that they are free of pressure regarding time and they can answer the questions within an environment in which they feel comfortable. In classroom settings, the self-assessment can be done individually or with a partner. For learners who are not experienced in self-assessment, it may be useful to exchange views and experiences with a partner. The final stage (i.e., comparing perspectives) involves discussing the outcome of the self-assessment either with the adviser or with the teacher. This is a key element in the evaluation process, involving a pedagogical dialogue where the adviser/teacher and learner reflect together in order to compare their perspectives on the learner’s competencies and on the learning process. This pedagogical dialogue is the core of the assessment process and crucial for the development of learner autonomy (Little, 1995). In language advising settings, the process followed in the dialogue is that suggested by Kelly (1996). Firstly, the adviser listens to the learner


(active listening). The adviser then asks questions seeking further details and clarification, reformulates the learner’s statements, summing up the information elicited. Finally, the adviser aims to focus on what the learner’s next priorities are and asks for their next steps. This style of pedagogical dialogue is useful in that learners are not left alone to cope with self-assessment, in which they may have the tendency to be too kind to themselves or too strict. Without specific criteria with which to judge, learners might have difficulty in assessing their competencies in various situations. Having the descriptors allows the inner perspective of the language learner to interact and be compared with an external perspective on autonomous language learning. Most importantly, the dialogue with the adviser/teacher or peers is real and, as such, it has the potential to unleash powerful and meaningful interaction, where internalized understandings can be brought to the surface and become externalised. The benefits to learners are that they are enabled to reflect deeply, without constraints. They can initiate the topics for discussion and by doing so they gain insights into their own attitudes and competencies and establish the basis for decision-making. This capacity for reflection and consequent action is both the aim and the outcome of the evaluative reflection process. The descriptors in the model are not provided with a numeric answer system, because giving a numeric score to the different answers would imply a hierarchy among the components and the descriptors and would severely compromise the learners’ ability to freely choose the components and the descriptors upon which they would like to reflect. Furthermore, a scored test is not advisable from a pedagogical point of view, since it could give learners the false impression that there is a full score to reach. 
Therefore, the assessment is purely formative and qualitative and serves to enhance metacognitive processes. Most importantly, it can be repeated, when and as needed, with the learner modifying and changing the focus, as required. All of these features contribute to making the model dynamic.

Testing the Dynamic Model: The Empirical Research Procedure
Whereas the first part of the research study aimed at developing the dynamic model and the descriptors and was based, as mentioned above, upon a critical analysis of the literature and a validation with experts in discussions and workshops, the second part of the research, which followed the creation of the dynamic model, aimed at testing the dynamic


model of autonomy and its descriptors with learners and teachers, to ensure the model was a comprehensible and a useful support tool in the development of their autonomy. The research was conducted over a period of nine months both in language classroom settings (with 15 student participants and 2 teachers) and in the CILL (with 6 student participants). The language course followed in the classroom setting focused both on promoting language competencies (French) and competencies specific to learner autonomy (see Denorme, 2006 for a description of the approach used). Additionally, students had access to the CILL, which is part of the Language Centre at the University. The CILL offers materials in 30 languages and different forms of support for autonomous language learning such as a language advising service, workshops and study guides. The use of the CILL is optional and language learning in the CILL can be undertaken as individual language learning, as a language learning partnership (tandem learning) or within tutorials. Sessions with advisers are open to all, can be attended once or more and aim to encourage reflection on learning, support in decision-making, offer help in choosing tasks and learning materials, or in evaluating progress and/or the learning process. Of the 21 student participants, 14 were native German speakers and seven others were variously speakers of Italian, Chinese, Hungarian, Turkish and Farsi. Eight of the participants were enrolled in courses for language specialists, while the remaining 13 were attending non-specialist language courses. The languages the participants were learning were French (17), Spanish (1), German (2) and English (1). The learners who were following a CILL mode of learning did so for various reasons, which included the undertaking of remedial language work beyond the classroom, following an individual learning plan, learning in tandem or working on a project. 
The approach taken to gathering data was qualitative, making use of: a) preliminary questionnaires and interviews with learners and teachers in order to determine their understandings of autonomy; b) self-assessment using the dynamic model and descriptors (which, at the time of the research, was conducted in paper format); c) follow-up feedback questionnaires and interviews related to the self-assessment; and d) discussion of the learner profile which emerged from learner self-assessment, through an advising session.


In the case of the CILL learners, their understandings of learner autonomy were gleaned through individual interviews, while in the case of the classroom learners, the same information was gleaned through questionnaires and follow-up discussions. Learners were then invited to complete the self-assessment, choosing the components and descriptors which they wished to reflect upon, without the need to work through all of them. The CILL learners completed the self-assessment on their own, whereas the classroom learners of the language module did the same, individually, during one session in the classroom. Feedback on the self-assessment was gathered through questionnaires and interviews. Two teachers participated in the research study, the focus of their interviews being their feedback on the usefulness of the dynamic model and the descriptors in an autonomy-oriented language module. Similarly, the teachers were first interviewed about their understanding of learner autonomy, and then they were asked to give their feedback on the self-assessment with the dynamic model in the classroom setting.

Results and Discussion
Learner opinions on what constitutes learner autonomy were quite diverse. Some regarded it as simply ‘learning without a teacher’, others as ‘self-aware learning’, while others saw autonomy as the ability to choose among tasks, having the ability to correct oneself, choosing the pace and rate of learning and, above all, being able to self-assess their own learning. According to the learners’ answers, discipline, motivation, time management, appropriate resources and learning environments play a role in successful autonomous learning. Of the 21 participants, 20 chose between two and eight components for self-assessment, while only one student chose to assess herself in all components. The choice of components and descriptors for self-assessment shows that all areas are almost equally represented, with a slight preference for ‘motivating oneself’ and ‘dealing with my feelings’ (chosen by 16 students), whereas ‘monitoring’ and ‘evaluating’ were less focused on (chosen by 12 and 9 students respectively). Asked to give a reason for their choice, some said they chose components in accordance with the areas they felt were troublesome, while others chose areas in which they felt confident and able to manage well. Some simply followed the steps as they were presented, while other students, mostly the students enrolled in the language module, gave no particular reason for their choices (see Table 6-1).


Table 6-1: Reasons for choice of components (more than one answer possible).

Reason for Choice of Components                      Total
It is problematic, difficult for me                  7
I followed the sequence of learning                  3
It is relevant for autonomous language learning      2
It is easy for me, something I can do                2
I am interested in it                                1
No answer given                                      11

What was encouraging was that all but one of the 21 participants gave positive feedback concerning the self-assessment. The majority found the model useful and thought that the self-assessment process gave them the impetus for self-reflection, increased awareness of their learning processes and helped them set further goals to improve their language learning. Learners felt that the model also made them become more conscious of the choices and opportunities open to them and more competent in making decisions with regard to their future learning (see Table 6-2). Such decisions might include choosing to undertake new learning tasks, trying new strategies, joining a learning group or finding a tandem partner. Learners might also decide to leave a course or change courses in order to meet their more specific learning needs. Difficulties that the learners identified related to autonomous learning were variations in motivation levels, the challenges involved in self-assessment of their language skills, and their ability to select suitable materials, to plan and to manage their time efficiently. The results of the investigation showed that the dynamic model is a valid tool which supports evaluation, raises awareness, reflection and decision-making. Through reflection on skills and competencies, learners were brought to a state of greater awareness, they could identify their strengths and shortcomings and recognise areas in which they needed support. This contributed to improving the learning process and to greater regulation by learners. The following comments by two of the students illustrate these points: “I have learned that I have a problem with managing: I always learn, but before [the self-assessment] I wasn’t aware of this problem. 
I start learning, then I get side-tracked and I don’t make progress.” (Student 19) “[The descriptors] allow you to understand which things you prioritize when […] learning a language autonomously, and allow you to understand how many opportunities you can exploit and which opportunities you do


actually exploit and which you do not, what can be improved and, as for many things, if I have only my own point of view, maybe I am only able to see certain things. […] It’s a test that, since it has no grade, one can do it freely and it allows you to realize your own pros and contras.” (Student 4)

Table 6-2: Effects of the self-assessment on participants (more than one answer possible).

Effect of the Self-Assessment Using the Dynamic Model                            Total
Stimulates reflection                                                            8
Gives an overview about different methods for language learning                  6
Enhances decision-making                                                         4
Helps awareness                                                                  3
Enhances awareness of priorities                                                 3
It is stimulating, motivating                                                    2
It is interesting                                                                1
Helps awareness of issues with autonomous learning in institutional context      1
It is frustrating                                                                1
I think I can do it quite well                                                   1
No answer                                                                        1

The only negative feedback on the self-assessment came from a learner at the CILL, who had been learning English in self-access mode for a long time (i.e., he had been working very meticulously through various video courses over several months). He found the self-assessment with the checklist frustrating, difficult to understand and too time consuming. Interestingly, the preliminary interview with him showed that although he was working in a self-access mode, he was not interested in developing learner autonomy. On the contrary, he defined himself as dependent on the materials with which he was learning. His feedback is of great value for this research study, since it shows that self-assessment of learner autonomy may be useful only in contexts in which learner autonomy is an educational goal; otherwise it may be useless or even counterproductive. As stated by some of the participants, the descriptors offer learners another perspective on autonomous language learning. Reading the descriptors and reflecting on the competencies and/or behaviours they describe gives learners another point of view on their own learning, stimulating self-reflection and self-assessment. Comparing perspectives is crucial to the self-assessment and evaluation process. One of the key steps of this self-assessment process is precisely


the pedagogical dialogue in which learner and adviser, or learner and teacher share and discuss their own perspectives. This dialogue is valuable for both learners and advisers or teachers: for advisers and teachers, it provides insights regarding the competencies or skills with which learners need further guidance or support. For learners, it is an opportunity to reflect on their own language learning, to externalise their own beliefs and attitudes towards learning, to review or confirm their own assessment, or to address questions, if need be. At the end of this process, learners should be able to make more informed decisions about their further learning. However, since learners may not be used to this reflection, it is the duty of the adviser and/or teacher to choose settings and pedagogic practices which enhance reflection and which always take into account the needs and attitudes of the learners. The possible “learner resistance”, as Cotterall and Malcolm (2015) put it, may be “one of the greatest hurdles to implementing new assessment procedures. Introducing more innovative assessment practices demands transformative, personal change […]; willingness to participate […]; an inclination to adopt an unfamiliar role […]; and the investment of time for no immediate reward […]. Overcoming such resistance requires mechanisms for engaging learners in reflecting on and articulating their beliefs about their and their teacher’s role in language learning, and the means and purposes of assessment.” (p. 171)

Therefore, the adviser and/or teacher also has to maintain a careful balance between the focus on language itself and the focus on learning processes and learning competencies, since learners may not always be concerned about the latter.

Comparison between the Dynamic Model and other Approaches to Assessment or Measurement of Autonomy
During the development and testing of the dynamic model of learner autonomy, other researchers were investigating similar research questions, identifying dimensions of learner autonomy and developing instruments for describing, assessing or measuring autonomy (Cooker, 2012; Dixon, 2011; Murase, 2010) (see Table 6-3). This indicates the relevance, and to some extent the urgency, of inquiry in this field.


Table 6-3: Comparison of studies on learner autonomy.

Cooker, 2012
Aims: Operationalizing learner autonomy; Developing a tool for self-assessment and development of learner autonomy
Dimensions: Learner control; Metacognitive awareness; Critical reflection; Motivation; Learning range; Confidence; Information literacy
Method: Q-methodology
Instrument: Formative (self-) assessment tool: a learner generated instrument, potentially unlimited; Languages: English
Use/Setting: Self-access learning

Dixon, 2011
Aims: Developing a quantitative instrument for measuring learner autonomy; Comparing results of the quantitative instrument with teachers’ evaluation; Helping teachers to help learners to develop learner autonomy
Dimensions: Autonomy is a multidimensional concept; Autonomy is variable; Autonomy is a capacity; Autonomy is demonstrated; Autonomy requires metacognition; Autonomy involves responsibility; Autonomy involves motivation; Autonomy involves social interaction; Autonomy is political
Method: Critical reflexive mixed methods: first exploratory and then quantitative
Instrument: Questionnaire: Long List (256 items) and Short List (50 items); Languages: English and Chinese
Use/Setting: Classroom learning, self-access learning

Murase, 2010
Aims: Operationalizing learner autonomy; Developing an instrument for measuring learner autonomy
Dimensions: Technical (behavioural, situational); Psychological (motivational, metacognitive, affective); Political-philosophical (group/individual, freedom); Socio-cultural autonomy (social-interactive, cultural)
Method: Quantitative
Instrument: Measuring Instrument for Language Learner Autonomy (MILLA) (113 items); Languages: Japanese and English
Use/Setting: Classroom learning

Tassinari, 2010
Aims: Operationalizing learner autonomy; Developing an instrument for reflection, self-assessment & learning support
Dimensions: Cognitive and metacognitive; Motivational; Affective; Action-oriented; Social
Method: Exploratory-interpretative, qualitative
Instrument: Dynamic model with descriptors (133 descriptors in total); Languages: German and English
Use/Setting: Self-access learning, language advising, classroom learning


These research approaches have some aspects in common, but differ in others. The primary aim of the research studies is describing and operationalizing the construct of learner autonomy, but with different scopes and forms: Cooker (2012) and Tassinari (2010) created a tool for self-assessment, reflection and the development of learner autonomy, whereas Dixon (2011) and Murase (2010) developed a quantitative instrument to measure learner autonomy. The formative self-assessment tool developed by Cooker is a learner generated instrument, based on statements and learner profiles; the dynamic model, as described above, is a self-assessment tool for learners which can be used in an adaptive and recursive way; whereas both Dixon and Murase developed a questionnaire with a Likert scale. Based upon a critical analysis of the literature, all these researchers define autonomy as a multidimensional construct, whose main dimensions are cognitive, metacognitive, motivational, affective and social. Although slight differences can be observed in the perspective and in the focus on some dimensions (e.g., Murase (2010) and Dixon (2011) stress the political dimension; Cooker (2012) and Tassinari (2010) place more emphasis on the aspects of control and critical reflection), many similarities exist in the statements describing the learner’s competencies, attitudes, beliefs and behaviours. More relevant for the purpose of the present chapter are the differences in the research methodology used in order to develop and validate the instruments. The methods range from a purely quantitative approach (Murase, 2010) to a predominantly qualitative one (Tassinari, 2010). Concentrating on the learner perspective, Cooker (2012) makes use of techniques and procedures of the Q-methodology, which allows for a systematic understanding of subjectivity and of the participants’ point of view about the subject of investigation.
Finally, Dixon (2011) chose a mixed methods approach with critical reflexive methods, starting with an exploratory phase in order to explore the viability and appropriateness of a quantitative approach for his research purpose, and continuing with quantitative methods in order to develop and validate his questionnaire. Among the findings of the research, I would like to point out the conclusion in Dixon’s (2011) study. After having tested his questionnaires (a Long List, with 256 items and a Short List, with 50 items) and compared the results of the measurement with the questionnaires and the teachers’ evaluation of the learners’ autonomy, he concluded that the closed-item questionnaire he developed “cannot be claimed to be a measure of autonomy”; rather, the data provided by the questionnaire need to be viewed “in context and in consultation with the learner” (Dixon, 2011, pp. 313-314). In other words, according to Dixon, autonomous learning cannot be measured in the abstract, but has to be considered as situated in a real context; and it cannot be measured only through questionnaires or statements, but the learners’ voice has to be listened to.

Conclusion
The aim of the research illustrated in this chapter was to create a dynamic model of learner autonomy which would offer a comprehensive description of language learning competencies, skills, attitudes and behaviours, to be used to support reflection in autonomous language learning processes. The self-assessment proposed by the dynamic model offers a learner-centred, dynamic and recursive approach which involves the collaboration of learners and advisers or teachers within a pedagogical dialogue and can be renegotiated according to the changing focus of the learning context and/or situation. The research participants were positive about their use of the dynamic model, stressing that the self-assessment stimulated their awareness and their reflection on their learning competencies, and helped them recognize strengths and/or issues in their learning process. Out of this reflection, they could better focus on priorities and make decisions for their further learning. With the use of the dynamic model, learners reached greater awareness of themselves as autonomous learners, through the processes of critical thinking and evaluation, which encouraged metacognitive development. However, self-assessment, both of language and of learning competencies, can be very difficult for learners who are not used to it. The descriptors in the model provide learners with criteria for conducting the self-assessment. In addition, the pedagogical dialogue with advisers and/or teachers gives learners the opportunity to compare their own perspective with an external perspective, and therefore to enhance their self-awareness and critical reflection, which are key aspects of learner autonomy. Thus, the pedagogical dialogue aims at encouraging learners to reflect and take on a more agentic role than they might previously have been accustomed to in their language learning.
Since learners (and maybe even teachers) may be reluctant to engage in such innovative practices of self-assessment, it is necessary to integrate self-assessment in learning and teaching settings that foster autonomy and to support it through reflection on learners’ and teachers’ roles and beliefs. Due to the complex, developmental and fluctuating nature of learner autonomy, a qualitative research approach, such as the one adopted in this


inquiry, seems to be appropriate for investigating questions related to the assessment of learner autonomy. In the same way as learners’ reflection enhances understanding of the learning process, critical, reflexive, and exploratory research methods help the researcher both gain better insight into the research object and remain open-minded and without preconceptions regarding the research findings. From the results of the present investigation some questions arise which could be explored in further research. One area worthy of further investigation concerns the effects of self-assessment of learner autonomy and of awareness-raising intervention in a longitudinal study, in order to see what role they play in the development of learner autonomy and/or learning competencies and whether they influence changes in learners’ learning behaviours and habits. Another research field could focus on forms and techniques of the pedagogical dialogue both in the classroom and language advising setting in order to provide practitioners with appropriate tools and a wide repertoire of skills for fostering their own and their learners’ development.

References
Barcelos, A., & Kalaya, P. (2011). Introduction to beliefs about SLA revisited. System, 39(3), 281-289.
Benson, P. (2001). Teaching and researching autonomy in language learning. Harlow, Essex, UK: Pearson Education.
Benson, P. (2015). Foreword. In C. J. Everhard & L. Murphy (Eds.), Assessment and autonomy in language learning (pp. viii-xi). Basingstoke, UK: Palgrave Macmillan.
Boud, D. (2000). Sustainable assessment: Rethinking assessment for the learning society. Studies in Continuing Education, 22(2), 151-167.
Breen, M. P., & Mann, S. J. (1997). Shooting arrows at the sun: Perspectives on a pedagogy for autonomy. In P. Benson & P. Voller (Eds.), Autonomy and independence in language learning (pp. 132-149). London: Longman.
Candy, P. C. (1991). Self-direction for lifelong learning. San Francisco: Jossey Bass.
Colbert, P., & Cummings, J. J. (2014). Enabling all students to learn through assessment. In C. Wyatt-Smith, V. Klenowski & P. Colbert (Eds.), Designing assessment for quality learning (Vol. 1, pp. 211-231). Heidelberg, Germany: Springer.
Cooker, L. (2012). Formative (self-)assessment as autonomous language learning. Doctoral thesis, University of Nottingham, UK.
Cotterall, S. (1995). Readiness for autonomy: Investigating learner beliefs. System, 23(2), 195-205.
Cotterall, S. (1999). Key variables in language learning: What do learners believe about them? System, 27(4), 493-513.
Cotterall, S., & Malcolm, D. (2015). Epilogue. In C. J. Everhard & L. Murphy (Eds.), Assessment and autonomy in language learning (pp. 167-175). Basingstoke, UK: Palgrave Macmillan.
Dam, L. (1995). Learner autonomy, 3: From theory to practice. Dublin: Authentik.
Denorme, M. (2006). Autoformation et diversification des publics. Mémoire professionnel. Compte rendu d’expérience: l’élaboration d’un curriculum dans le cadre d’un dispositif de langue autonomisant. Université Charles-De-Gaulle, Lille 3.
Dickinson, L. (1987). Self-instruction in language learning. Cambridge: Cambridge University Press.
Dixon, D. (2011). Measuring language learner autonomy in tertiary-level learners of English. Doctoral thesis, University of Warwick, UK. Retrieved from: http://wrap.warwick.ac.uk/58287
Everhard, C. J. (2015). The assessment-autonomy relationship. In C. J. Everhard & L. Murphy (Eds.), Assessment and autonomy in language learning (pp. 8-34). Basingstoke, UK: Palgrave Macmillan.
Holec, H. (1981). Autonomy in foreign language learning. Oxford: Pergamon.
Kelly, R. (1996). Language counselling for learner autonomy: The skilled helper in self-access language learning. In R. Pemberton, E. S. L. Li, W. W. F. Or & H. Pierson (Eds.), Taking control: Autonomy in language learning (pp. 93-113). Hong Kong: Hong Kong University Press.
Lai, J. (2011). The challenge of assessing learner autonomy analytically. In C. J. Everhard & J. Mynard (Eds.), Autonomy in language learning: Opening a can of worms (pp. 43-49). Canterbury, UK: IATEFL.
Little, D. (1991). Learner autonomy 1: Definitions, issues and problems. Dublin: Authentik.
Little, D. (1995). Learning as dialogue: The dependence of learner autonomy on teacher autonomy. System, 23(2), 175-181.
Littlewood, W. (1996). Autonomy: An anatomy and a framework. System, 24(4), 427-435.
Littlewood, W. (1999). Defining and developing autonomy in East Asian contexts. Applied Linguistics, 20(1), 71-94.
Martinez, H. (2005). Lernerautonomie: Ein konzeptuelles Rahmenmodell für den Fremdsprachenunterricht...und für die Fremdsprachenforschung. Fremdsprachen Lehren und Lernen, 34, 65-82.
Murase, F. (2010). Developing a new instrument for measuring learner autonomy. Doctoral thesis, Macquarie University, Sydney, Australia.
Navarro, D., & Thornton, K. (2011). Investigating the relationship between belief and action in self-directed language learning. System, 39(3), 290-301.
Oxford, R. L. (1990). Language learning strategies: What every teacher should know. Englewood Cliffs, NJ: Newbury House.
Oxford, R. L. (2003). Toward a more systematic model of L2 learner autonomy. In D. Palfreyman & R. C. Smith (Eds.), Learner autonomy across cultures (pp. 75-91). Hampshire / New York: Palgrave Macmillan.
Prodromou, L. (1995). The backwash effect: From testing to teaching. English Language Teaching Journal, 49(1), 13-25. doi: 10.1093/elt/49.1.13
Sheerin, S. (1997). An exploration of the relationship between self-access and independent learning. In P. Benson & P. Voller (Eds.), Autonomy and independence in language learning (pp. 54-65). London: Longman.
Tassinari, M. G. (2010). Autonomes Fremdsprachenlernen: Komponenten, Kompetenzen, Strategien. Frankfurt am Main, Germany: Peter Lang.

CHAPTER SEVEN
CULTIVATING LEARNER AUTONOMY THROUGH THE USE OF ENGLISH LEARNING PORTFOLIOS: A LONGITUDINAL STUDY
BEILEI WANG

Abstract
The study reported in this chapter explored whether portfolio-based assessment is effective in fostering learner autonomy through a longitudinal study in a Chinese junior high school. A three-dimensional learner autonomy scale was administered to both the experimental and control groups. The questionnaire findings revealed that English Learning Portfolios (ELP) were conducive to helping students gain learner autonomy, which was further supported by the case study results. The study also showed that the ELP template and development need constant negotiation and adjustment in accordance with learner needs and environmental constraints. Therefore, some implications and suggestions are provided in this regard.

Introduction
Learner autonomy has been a popular research topic in the past thirty years since Holec (1981) first used the word ‘autonomy’ in his report of the Council of Europe’s Modern Language Project. In China, during the past decade, learner autonomy, as a remedy for the conventional problem of teacher-centeredness, has found its way into the National English Curriculum Standards (NECS) (2001), a national English teaching syllabus for primary and secondary English language education in China. To help learners assume a more active role in English language learning, formative assessment has been suggested by the NECS, in addition to summative assessment, so that learners can assess their own
performance and that of their peers. In doing so, it is believed that learners can attach more importance to the learning process than the learning results as learning and assessment are reciprocally integrated (Little & Erickson, 2015). In fact, the issue of test-oriented, time-consuming but ineffective English language teaching in China has received a lot of criticism since the late 1990s (Dai & Hu, 2009). However, research by Shu (2004) has shown that most English teachers in secondary schools were unaware of the exact requirements or suggestions put forth in 2001 by the NECS. In practice, teachers in many schools still immersed their students in exercises and discrete point quizzes and tests, activities which were not designed to improve student language competence. Because of the high-stakes nature of the senior high school entrance examinations, tests were still considered to be the most powerful measure of students’ performance and teachers’ teaching abilities. The issue of over-emphasis on teaching and formal assessment is also evident in a core language journal published in Chinese, namely, Foreign Language Teaching in Schools. For example, even a decade after the NECS was introduced, the themes of the published papers in the 2011 journal were classroom teaching, which constituted 51.4% of all journal content, followed by high-stakes tests, which covered 13.8% of the articles. Articles on formative assessment were conspicuously lacking. To bridge the gap between societal needs, educational policy and reality, researchers and scholars launched collaborative university-school English language teaching (ELT) projects (Wang & Zhang, 2014). The present study was but a part of a collaborative ELT project between a junior high school and a foreign language university. The goal of the three-year longitudinal study was to empower teachers and learners with innovative development in course design, assessment and teacher training. 
The present study focused on reform in assessment by integrating the English learning portfolio (ELP) into the assessment system in an effort to foster learner autonomy.

Background
Learner autonomy (LA), though considered “an elusive notion” (Bown, 2009, p. 572) and embodied in various terms (Dickinson, 1987; Sheerin, 1991; Wenden, 1991; White, 1999; Zimmerman & Schunk, 2001), is a three-dimensional concept in the present study: metacognitive, affective and social. Metacognition is accepted as a key element in LA as learners are supposed to take charge of their own learning (Holec, 1981) and “take
ownership (partial or total) of many processes which have traditionally belonged to the teacher” (Littlewood, 1999, p. 71). That is, learners assume responsibilities of self-planning, self-management and self-reflection. The affective dimension was also recognized by Cotterall (1995), Benson (2005), and Confessore and Park (2004), and it addresses students’ willingness and readiness to learn English. In addition, the social dimension, whereby learners cannot learn without social interaction, was supported by Little (2005) and Benson (2005). Moreover, LA is not a steady state achieved by learners, but it is progressive and changeable. Nunan (1997) suggested that learners grow through five stages, while Littlewood (1999) proposed two levels of autonomy: proactive autonomy and reactive autonomy. Littlewood’s two levels of autonomy were adopted for the present study. His view was that learners might move back and forth along the continuum due to the influence of contextual factors. The main purpose of the current study was to cultivate learners’ reactive autonomy and help them move towards proactive autonomy.
Portfolios, originally an alternative to assessment (Banfi, 2003; Cummins & Davesne, 2009; Nunes, 2004), began to be used as a way to promote language learner autonomy especially after the enactment of the Common European Framework of Reference (CEFR) and the European Language Portfolio. The language portfolio has been implemented at different levels, ranging from tertiary institutions in Spain (González, 2009) to primary schools (Little, 2009), fourth and fifth graders in Turkey (Yilmaz & Akcan, 2011), and Spanish language classrooms in 23 American high schools (Moeller, Theiler & Wu, 2012). The portfolio helped make invisible learning outcomes visible and helped students develop a “metacognitive understanding of language” (Kohonen, 2006, p. 34).
The most widely used portfolio was the European Language Portfolio, the first draft of which was published in 1995 and revised in 2001, as part of the European Year of Languages (Trim & Bailly, 2001). The European Language Portfolio comprised three obligatory components: a Language Passport, a Language Biography and a Dossier. The portfolio was adapted into LinguaFolio, an American version, which was adopted in the official project for the 2005 Year of Languages by the National Council of State Supervisors of Foreign Languages in the USA (Moeller, Theiler & Wu, 2012). In China, there were limited studies on the use of portfolios and learner autonomy. Xu’s (2007) book on learner autonomy at the tertiary level was a comprehensive one, touching upon the portfolio but failing to cover the
relevant research about the portfolio and learner autonomy. Gong and Luo (2002) introduced what portfolio assessment is and how to develop it in schools by giving some examples and practical suggestions. However, there was no systematic description about any empirical studies on this issue. Rao (2006) carried out a 6-month study during which he integrated the portfolio in his class instruction among university students and gained positive feedback from students, but his study did not include a control group. Lo (2010) recorded how she used a reflective portfolio to promote learner autonomy in a journalism course. In her study, questionnaires were administered to learn more about learners’ gains in journalism.
The longitudinal study reported in this chapter posited two research questions:
1. Can portfolio assessment promote learner autonomy among younger learners in China?
2. What are the learner perceptions of the English learning portfolio (ELP)?
The first research question attempted to examine whether portfolio assessment can be a possible solution to the existing English education problem in China, by helping learners take an active role in the learning process. The second research question further explored learners’ viewpoints about ELP and LA for the sake of further improvements on ELP practice.

The Study
This study took place in a Chinese junior high school as part of a large ELT project, in collaboration with a university research team, which was made up of six doctoral candidates and their supervisor. The needs analysis in the preparatory phase revealed that English teaching at that junior high school was still teacher-dominant and test-oriented, though teachers and students agreed on the importance of the communicative function of English. The ELP was thus used as a means of assessment in order to change the test-oriented teaching and learning and to promote learner autonomy. A quasi-experiment was conducted to see the effectiveness of portfolio assessment in fostering learner autonomy over a period of three years. There were two groups of students: the experimental group (EG), who received the ELP assessment intervention, and the control group (CG), who received the traditional assessment.
Questionnaires were administered to both the EG and CG. Furthermore, a separate case study was conducted with six of the EG students.

Participants
The number of participants grew over the duration of the study as every year new students were enrolled in the school. The newly-enrolled students were assigned to twenty parallel classes according to their performance on the placement test covering three subjects: Chinese, math and English. Every year, two out of the twenty classes were randomly selected to form the EG. In Year 1, the EG was composed of EG1-1 (Year 1, Class 1) and EG1-2 (Year 1, Class 2), with a total of 109 students. Two English teachers participated in the project on a voluntary basis. In the following two years, another two classes were recruited into the EG. Similarly, the CG also grew in number in the three consecutive years, Year 1, Year 2 and Year 3. Table 7-1 displays the numbers of study participants over the three years of data collection.

Table 7-1: Student participants in the study.

Participants     | Year 1 | Year 2 | Year 3
EG (7th graders) | 109    | 111    | 114
CG (7th graders) | 110    | 110    | 112

This study excluded data collected from the learners enrolled in Year 3 because of the drastic changes in the local educational policy and the forthcoming entrance examinations, but the questionnaires were still administered. The participants for the case study were selected by stratified case sampling (Duff, 2007). A total of 6 participants were selected according to their language proficiency, gender, and learning goals: Alice from EG1-1, Bob and Elian from EG1-2, Cathy and Jack from EG2-1 (Year 2, Class 1), David from EG2-2 (Year 2, Class 2). Pseudonyms have been used in the reporting of the data to protect the privacy and identity of the students. The demographic information for the case study participants is displayed in Table 7-2.

Table 7-2: Demographic data of case study participants.

                    | Alice      | Bob        | Elina   | Cathy      | Jack  | David
Class               | EG1-1      | EG1-2      | EG1-2   | EG2-1      | EG2-1 | EG2-2
Gender              | Female     | Male       | Female  | Female     | Male  | Male
Learning goals      | a          | a, c       | a, b, c | c          | c     | d
English proficiency | CEFR A1-A2 | CEFR A1-A2 | CEFR A2 | CEFR A1-A2 | CEFR  |

[Table 8-1: assessment overview for Writing 1 and Writing 2. Columns: Year & Group; One-way ANOVA & Tukey-Kramer; S-T Pearson corr. coeff.; P-T Pearson corr. coeff.]

Note: n.s. = non-significant; ANOVA - A = Self; B = Peer; C = Teacher; PEARSON - S-T = Self-Teacher; P-T = Peer-Teacher; AY = Academic Year; AY5* = Only participants in the Post-Study Intervention exercises have been included.

Chapter Eight
Sharing Assessment Power to Promote Learning and Autonomy

In terms of Pearson correlation coefficients, there were interesting correlations for S-T for Groups C, G and H, with values of 0.48, 0.47 and
0.66 respectively, revealing a tendency towards alignment between self-assessment and teacher assessment.

In the Post-Study (AY5), despite the time spent on training for peer-assessment, the same pattern emerged as in the years before, with significant differences in the first group of the year, Group I, and no significant differences, as shown by ANOVA, for the second group, Group J. Only Group J displayed alignment between peer-, self- and teacher assessment in both the writing assignments. The Tukey-Kramer pattern of A>B>C for Group I in Writing 1 indicated complete non-alignment of peer-, self- and teacher assessment, while for Writing 2, the non-alignment for the same group was between self- and peer-assessment. The Pearson correlation coefficient for P-T in Group J was 0.66 for Writing 1 and in Group I it was 0.65 for Writing 2, indicating inconsistency between the two groups.

The results derived from learner involvement in the assessment of oral skills on the AARP are altogether different from those for the writing skills. This is most likely due to the fact that in the case of oral peer-assessment, the mean peer grade was arrived at by whole-group criterial thinking rather than the thinking and decision-making of an individual, as was the case with the assessment of writing. The alignment between peer-, self- and teacher assessment which is present in the assessment of oral presentations in the Pre-Study from both Groups A and B is repeated five more times in the Main Study and by both groups who were involved in the oral intervention in the Post-Study (see Table 8-2). Not only is alignment shown by the ANOVA test, but Pearson correlation coefficients also produced P-T values of 0.75 and 0.51 respectively for Groups A and B in the Pre-Study (AY1), and P-T values of 0.46, 0.44 and 0.52 for Groups F, G and H respectively in the Main Study (AY2-AY4). The P-T correlation value for Group I in the Post-Study (AY5) was 0.51. These correlation values show a tendency towards alignment between peer-assessment and teacher assessment. At the same time, correlations were also produced for S-T, with correlation coefficients of 0.45, 0.42 and 0.45 for Groups C, F and H respectively in the Main Study, indicating a tendency towards alignment between self-assessment and teacher assessment.

Group E seemed to fit the profile of a completely rogue group, since there is no alignment between peer-, self- and teacher assessment for Writing 1, Writing 2 or the Oral presentation assignments, making it unique. It is possible that members of Group E aimed to demonstrate their autonomy by refusing to conform regarding assessment. Elsewhere, plausible reasons for ‘cheating’ within this group, pertaining to
collaboration on oral presentations, have been suggested (Everhard, 2015b).

Another interesting phenomenon, previously referred to, was uncovered through asking learners to self-assess their second writing assignment twice. In Groups C and D, in the Main Study (AY2), there was a unique variation in the way self-assessment of Writing 2 was conducted since learners conducted self-assessment of their assignments twice-over: (1) on submission of the assignment, and (2) in the usual way, with a delay after peer-assessment had been completed.

Table 8-2: AARP assessment overview (Oral Presentation).

Year & Group  | One-way ANOVA & Tukey-Kramer | S-T Pearson corr. coeff. | P-T Pearson corr. coeff.
AY1 Group A   | n.s.                         |                          | 0.75
AY1 Group B   | n.s.                         |                          | 0.51
AY2 Group C   | n.s.                         | 0.45                     |
AY2 Group D   | n.s.                         |                          |
AY3 Group E   | p < .001 A=B>C               |                          |
AY3 Group F   | n.s.                         | 0.42                     | 0.46
AY4 Group G   | n.s.                         |                          | 0.44
AY4 Group H   | n.s.                         | 0.45                     | 0.52
AY5* Group I  | n.s.                         |                          | 0.51
AY5* Group J  | n.s.                         |                          |

Note: n.s. = non-significant; ANOVA - A = Self; B = Peer; C = Teacher; PEARSON - S-T = Self-Teacher; P-T = Peer-Teacher; AY = Academic Year; AY5* = Only participants in the Post-Study Intervention exercises have been included.
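The S-T and P-T columns report Pearson correlation coefficients between paired sets of marks. As a rough illustration of what such a coefficient measures, the sketch below computes it from first principles. The function name `pearson_r` and the four pairs of marks are invented for illustration only; they are not AARP data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of marks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two marks")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical marks out of 10 for four learners (illustration only).
peer_marks = [6, 7, 8, 9]
teacher_marks = [5, 7, 7, 9]

r_pt = pearson_r(peer_marks, teacher_marks)
print(f"P-T correlation for the invented marks: {r_pt:.2f}")
```

A coefficient near 1 means that the two sets of marks rise and fall together. It does not show whether one assessor is systematically more generous than another, which is what the comparison of means by ANOVA addresses.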

Unfortunately, not all of the students from Groups C and D participated in this variation, so that only 20 of the 22 participants in Group C and 13 of the 18 participants in Group D were involved. Nevertheless, it seems important to report the outcome of this particular experiment and the results given in Table 8-3 might go some way towards
throwing light on the results for these groups shown in Table 8-1. In Table 8-3:
1. ‘Actual’ refers to the means derived from all the participants who were involved in self-assessment after conducting peer-assessment, which was the normal procedure on the AARP and it is that on which the results in Table 8-1 are based.
2. ‘Variation 1’ presents the means derived from learner-assessment for those students from Groups C and D, who submitted self-assessment with their assignments, before peer-assessment processes took place. The S-A means presented in Variation 1 are therefore those based on their first self-assessment, before involvement in peer-assessment.
3. ‘Variation 2’ presents the same means as in ‘Actual’ (i.e., self-assessment conducted after peer-assessment), but with the means derived only from those students who performed self-assessment twice, i.e., only those students who took part in Variation 1.

Table 8-3: AARP mean scores for Writing 2, for Groups C and D, with ANOVA results for self-assessment variations.

            |     | AY2 - Group C            | AY2 - Group D
            |     | Mean | SD   | SEM   | N  | Mean | SD   | SEM   | N
Actual      | S-A | 8.80 | 0.67 | 0.142 | 22 | 8.44 | 1.05 | 0.248 | 18
            | P-A | 7.94 | 1.12 | 0.240 | 22 | 7.41 | 1.26 | 0.297 | 18
            | T-A | 7.42 | 0.81 | 0.174 | 22 | 7.41 | 1.03 | 0.242 | 18
Variation 1 | S-A | 8.54 | 0.70 | 0.156 | 20 | 8.11 | 1.19 | 0.329 | 13
            | P-A | 7.86 | 1.15 | 0.258 | 20 | 7.29 | 1.31 | 0.362 | 13
            | T-A | 7.51 | 0.79 | 0.177 | 20 | 7.29 | 1.08 | 0.300 | 13
Variation 2 | S-A | 8.87 | 0.62 | 0.139 | 20 | 8.71 | 1.08 | 0.300 | 13
            | P-A | 7.86 | 1.15 | 0.258 | 20 | 7.29 | 1.31 | 0.362 | 13
            | T-A | 7.51 | 0.79 | 0.177 | 20 | 7.29 | 1.08 | 0.300 | 13

Note: S-A = Self-Assessment; P-A = Peer-Assessment; T-A = Teacher Assessment; AY = Academic Year; SD = Standard Deviation; SEM = Standard Error of Measurement; N = Number of participants.
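As a quick consistency check on Table 8-3, the sketch below recomputes the SEM column as SD/√N and derives the gap between the S-A and T-A means for each condition. The dictionary layout and the helper names (`table_8_3`, `sem_gap`, `sa_minus_ta`) are illustrative choices, not part of the study; the figures are copied from the table. The reported SEM values agree with SD/√N to within rounding, which is the formula for the standard error of the mean.

```python
from math import sqrt

# (mean, SD, reported SEM, N) per assessor, copied from Table 8-3.
table_8_3 = {
    ("C", "Actual"): {"S-A": (8.80, 0.67, 0.142, 22),
                      "P-A": (7.94, 1.12, 0.240, 22),
                      "T-A": (7.42, 0.81, 0.174, 22)},
    ("D", "Actual"): {"S-A": (8.44, 1.05, 0.248, 18),
                      "P-A": (7.41, 1.26, 0.297, 18),
                      "T-A": (7.41, 1.03, 0.242, 18)},
    ("C", "Variation 1"): {"S-A": (8.54, 0.70, 0.156, 20),
                           "P-A": (7.86, 1.15, 0.258, 20),
                           "T-A": (7.51, 0.79, 0.177, 20)},
    ("D", "Variation 1"): {"S-A": (8.11, 1.19, 0.329, 13),
                           "P-A": (7.29, 1.31, 0.362, 13),
                           "T-A": (7.29, 1.08, 0.300, 13)},
    ("C", "Variation 2"): {"S-A": (8.87, 0.62, 0.139, 20),
                           "P-A": (7.86, 1.15, 0.258, 20),
                           "T-A": (7.51, 0.79, 0.177, 20)},
    ("D", "Variation 2"): {"S-A": (8.71, 1.08, 0.300, 13),
                           "P-A": (7.29, 1.31, 0.362, 13),
                           "T-A": (7.29, 1.08, 0.300, 13)},
}


def sem_gap(sd, reported_sem, n):
    """Absolute difference between SD / sqrt(N) and the reported SEM."""
    return abs(sd / sqrt(n) - reported_sem)


def sa_minus_ta(group, condition):
    """Gap between the self-assessment and teacher-assessment means."""
    rows = table_8_3[(group, condition)]
    return round(rows["S-A"][0] - rows["T-A"][0], 2)


largest_gap = max(
    sem_gap(sd, sem, n)
    for rows in table_8_3.values()
    for (_mean, sd, sem, n) in rows.values()
)
print(f"largest SEM mismatch: {largest_gap:.4f}")

for group in ("C", "D"):
    first = sa_minus_ta(group, "Variation 1")
    second = sa_minus_ta(group, "Variation 2")
    print(f"Group {group}: S-A minus T-A goes from {first} to {second}")
```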

Some interesting points emerged from these data. Firstly, from the teacher-researcher’s point of view, it is gratifying to see that there seems to be consistency in her assessment of Groups C and D, in that the T-A
means were calculated at 7.42 for Group C and 7.41 for Group D. With regard to both peer-assessment and self-assessment, Group C tends to be a little more generous in its assessment when compared with Group D. What is very interesting is the exact match between peer-assessment and teacher assessment in Group D. It is strange that this same alignment does not appear in Group D self-assessment, with the discrepancy between S-A means and T-A means rising from 0.82 for the first self-assessment to 1.42 for the second self-assessment. There is also an increase in discrepancy from 1.03 to 1.36 in Group C, when comparing the means for S-A and T-A, between the first self-assessment and the second. This increase in discrepancy is alarming, when one would actually expect closer alignment between S-A and T-A after the experience of peer-assessment. What is most significant is the fact that the highest means in all cases, whether Actual, Variation 1 or Variation 2 for both Groups C and D, were produced by S-A, with higher S-A values in each case awarded by Group C, as compared with Group D. Most interesting also is the fact that the second self-assessment exceeds the first self-assessment, with S-A means rising from 8.54 to 8.87 in Group C and even more steeply, from 8.11 to 8.71 in Group D. This inflated self-assessment differs from findings in the Far East where modesty prevails and self-assessment means tend to be lower than T-A and P-A, both for writing (Matsuno, 2007, 2009) and speaking (Chen, 2006). With regard to Standard Deviation (SD), the highest level of SD was displayed by P-A in all cases, indicating that peers were awarding a wider range of grades than both S-A and T-A. One-way ANOVA revealed significant p values for both Groups C and D in the Actual and in Variation 2, but in the case of Variation 1, while Group C still produced a p value of .002, which is considered significant, in the case of Group D, the p value was .147, which is considered non-significant.
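The Variation 1 comparison can be approximated from the summary statistics in Table 8-3 alone. The sketch below computes a one-way ANOVA F statistic from group means, SDs and Ns, on the assumption that the S-A, P-A and T-A marks are treated as three independent groups; the chapter does not state how the ANOVA was set up, so this is only one plausible reading. The function name `one_way_f` is hypothetical, and the code stops at F because converting F to a p value needs the F distribution, which the Python standard library does not provide.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic from (mean, sd, n) summaries per group."""
    total_n = sum(n for _mean, _sd, n in groups)
    grand_mean = sum(mean * n for mean, _sd, n in groups) / total_n
    # Between-groups and within-groups sums of squares.
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _sd, n in groups)
    ss_within = sum((n - 1) * sd ** 2 for _mean, sd, n in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# (mean, SD, N) for S-A, P-A and T-A in Variation 1, from Table 8-3.
group_c_var1 = [(8.54, 0.70, 20), (7.86, 1.15, 20), (7.51, 0.79, 20)]
group_d_var1 = [(8.11, 1.19, 13), (7.29, 1.31, 13), (7.29, 1.08, 13)]

f_c, df1_c, df2_c = one_way_f(group_c_var1)
f_d, df1_d, df2_d = one_way_f(group_d_var1)
print(f"Group C: F({df1_c}, {df2_c}) = {f_c:.2f}")
print(f"Group D: F({df1_d}, {df2_d}) = {f_d:.2f}")
```

Under this assumption Group C gives an F of about 6.75 and Group D an F of about 2.03. The conventional 5% critical value is a little above 3 for these degrees of freedom, which is in line with the significant result reported for Group C and the non-significant one for Group D.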
Tukey-Kramer Multiple Comparison Tests revealed very similar results in Groups C and D (Actual) and Groups C and D (Variation 2), with the pattern A>B=C, indicating alignment between peer-assessment and teacher assessment. In order to understand better the differences in assessment behaviour between Variation 1 and Variation 2, a paired sample t-test of the two self-assessments of the two groups was conducted and the results are shown in Table 8-4. Both the ANOVA and the t-test revealed significant increases in S-A values for both Groups C and D, with the paired t-test revealing an increase of 0.33 for Group C and an increase of 0.60 for Group D between the first self-assessment conducted and the second. These increases occur after peer-assessment processes, which we would have expected to have been a form of training in using the criteria and to have had more of a
regulatory effect on subsequent self-assessment processes. On the contrary, peer-assessment appears to have had a negative effect on self-assessment, since S-A means have increased the second time around.

Table 8-4: Paired t-test results for Writing 2 self-assessment Variations (AY2).

Group       | N  | Mean Difference | SD   | SEM   | 95% CI for mean difference | t-value | p
AY2 Group C | 20 | 0.33            | 0.54 | 0.121 | (0.077; 0.583)             | 2.73    | .013
AY2 Group D | 13 | 0.60            | 0.63 | 0.17  | (0.221; 0.979)             | 3.45    | .005

Note: CI = Confidence Interval; SD = Standard Deviation; SEM = Standard Error of Measurement; N = Number of participants.
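The figures in Table 8-4 follow the usual paired t-test arithmetic: the t-value is the mean difference divided by SD/√N, and the 95% confidence interval is the mean difference plus or minus a critical t value times that same quantity. The sketch below reproduces the calculation from the reported summary values. The function name `paired_t_summary` is hypothetical, and the critical values (2.093 for 19 degrees of freedom, 2.179 for 12) are taken from standard two-tailed t tables rather than from the chapter.

```python
from math import sqrt


def paired_t_summary(mean_diff, sd_diff, n, t_crit):
    """Return the t-value and 95% CI for a paired t-test from summary values."""
    sem = sd_diff / sqrt(n)          # standard error of the mean difference
    t_value = mean_diff / sem
    half_width = t_crit * sem        # half-width of the confidence interval
    return t_value, (mean_diff - half_width, mean_diff + half_width)


# Mean difference, SD and N from Table 8-4; t_crit from standard t tables.
t_c, ci_c = paired_t_summary(0.33, 0.54, 20, t_crit=2.093)  # df = 19
t_d, ci_d = paired_t_summary(0.60, 0.63, 13, t_crit=2.179)  # df = 12

print(f"Group C: t = {t_c:.2f}, CI = ({ci_c[0]:.3f}; {ci_c[1]:.3f})")
print(f"Group D: t = {t_d:.2f}, CI = ({ci_d[0]:.3f}; {ci_d[1]:.3f})")
```

The Group C values match the table to the precision shown. The Group D values differ slightly in the second decimal place, which is what one would expect when working back from an SD that has been rounded to 0.63.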

The results overall could be seen as an indication that in Group D (though the sample is small), and possibly also in Group C, there was sufficient assessment “maturity” (Ritter, 1998, p. 79), before peer-assessment processes, for these students to self-assess reasonably accurately, indicating that practice through peer-assessment with these groups perhaps led to less rather than more self-assessment accuracy. It is otherwise hard to explain why students had the tendency to overrate in self-assessment processes, in Groups C and D, to the extent that they did, after an experience of peer-assessment which had proved to be rather successful.

Qualitative Data
Qualitative data was gathered from participants at the end of each AARP semester by means of a post-questionnaire. In the case of Groups C and D, completed questionnaires were received from only 13 participants from Group C and 16 students from Group D. T-analysis of their responses to 10 of the questions showed lack of agreement between the groups concerning the matter of objectivity in peer-assessment, concerning how easy self-assessment was and how easy it was to be objective in self-assessment. Table 8-5 shows some responses by students to these questions, which seem to be representative of opinions in these groups.

Table 8-5: Responses to questions about peer- and self-assessment.

Question 4 – How easy did you find it to be objective in peer-assessment?
C9 – In the beginning it was difficult to assess ourselves and our classmates, and if we didn’t give a good grade, we felt bad.
D8 – I had never found myself before in similar circumstances and I confess that it caused me anxiety that I would perhaps assess someone more strictly than I should.
Question 5 – How easy was it to assess yourself (self-assessment)?
C4 – I learned to be more objective with myself, although it was hard to grade myself.
D3 – The problem was that I couldn’t really assess myself. I found everything perfect!
Question 6 – How easy did you find it to be objective in self-assessment?
C11 – You try to evaluate yourself and the others objectively and you learn from this procedure.
D4 – I had never assessed with such criteria before and thus the whole concept was a little complicated.

Many researchers (including Benson, 2001; Everhard, 2015a; Nunan, 1997; Sinclair, 1997, 2000, 2011) suggest that autonomy is a matter of degree. Some stress the essential link between assessment, particularly self-assessment and autonomy (Dam, 1995; Harris, 1997; Little, 2009). Others suggest that autonomy cannot exist without the ability to self-assess (Hunt, Gow, & Barnes, 1989). This view seems to be substantiated by Harris and Bell (1990), who see both autonomy and assessment as a matter of degree, asserting, as does Dickinson (1987), that the greater the degree of self involvement in assessment, the greater the degree of autonomy that can be enjoyed. Everhard (2014), in her model of autonomy for the AARP, visualises autonomy as being on a continuum. When decisions about learning are fully teacher-controlled, Everhard suggests that there is a state of heteronomy. The more that learners are involved in decision-making and assessment processes, the greater the degree of autonomy that is promoted.
A modified extract from the model is shown in Table 8-6. The details in Table 8-6 are derived from Everhard (2014), Everhard (2015a) and Harris and Bell (1990) and show how degrees of autonomy are linked to degrees of assessment. Summative forms of assessment tend to promote intellectual heteronomy, while learner-centred and more
formative types of assessment involve the learners in metacognitive thinking and decision-making, leading to transformative, liberatory and sustainable learning, with learners taking greater responsibility and initiative.

Table 8-6: Modified extract from the AARP model showing the relationship between assessment and degrees of autonomy (Based on Everhard 2014, 2015a; Harris & Bell, 1990).

Intellectual heteronomy – no autonomy. | Language instructor evaluation. Criteria for assessment may be hidden. | Traditional educational assessment – summative assessment.
Academic autonomy – low degree of autonomy. | Language instructor defines assessment mechanisms, but learners may provide some justifications for answers and solutions. | Collaborative assessment – formative assessment.
Academic autonomy – medium degree of autonomy. | Collaborative assessment involving peer, self and instructor. | Peer assessment – formative assessment.
Intellectual autonomy – high degree of autonomy. | On-going realistic self-assessment of learning achievements and success. | Self-assessment – sustainable assessment.

Conclusion
While Tassinari’s research project in Berlin (this volume; Tassinari, 2015) focuses on learning competencies, this project in Thessaloniki focuses on language competencies; however, there are some similarities between the two studies. Before proceeding to peer- and self-assessment tasks, participants in the AARP were encouraged to reflect on their speaking and writing skills by means of the Learner Contract and small-group counselling meetings with the teacher. As with Tassinari’s dynamic model, where learners reach awareness of themselves as autonomous learners, in the case of the AARP they become aware of their strengths and weaknesses in the productive skills and in their approach to language
learning in general. Together with the teacher/counsellor, learners in Thessaloniki were asked to consider some aims and goals which they could have and to reflect on the means by which they could achieve these goals. This is very similar to the procedures in the Berlin study and, in both cases, assessment of these abilities promoted metacognition. AARP participants are encouraged to take action along with other group members who have similar weaknesses and similar targets. Goals and possible ways of achieving them are recorded on the contract and it is signed and dated, to make it binding, by the teacher and the learner. It then remains for the learners to decide how to proceed during their own time. The means by which this agreement is reached in both the Berlin and Thessaloniki projects is through pedagogical dialogue, aimed at encouraging learners to reflect and take on a more agentic role, as opposed to the more passive role they may have previously been used to during their language learning careers.

The combination of the pedagogical approach and the assessment approach, which involved learners in decision-making processes, raised their awareness of themselves as learners, helped them, through developing critical thinking skills, to recognise both their strengths and their weaknesses in their speaking and writing, and helped them move beyond the state of ‘learned helplessness’ (Dörnyei, 1994, p. 276) and heteronomy which may have been promoted in their previous language learning experiences. The aims of the triangulated assessment implemented, and of the AARP overall, were not to promote agreement between learner-awarded and teacher-awarded marks, because, as Rühlemann (2002, p. 4) points out, such expectations would have been “futile”. Orsmond et al. (1996; 1997) and Sadler (1989) insist that it is worth taking the risk with marks in order to develop what are necessary skills. Black, Harrison, Lee, Marshall, and Wiliam (2003) agree that it is only through practice in peer- and self-assessment that critical thinking can be developed. Janssen-van Dieten (1989) points out that poor results in self-assessment show the need for it rather than its abandonment and feels it should be pursued.

Although sharing assessment power with learners is not without risks, teachers and advisers are beginning to recognise its usefulness as a “teaching tool”, more than a “grading procedure” (Orsmond et al., 1996, p. 245). There is the realization that “validity of judgments” has to take precedence over “reliability of grading” (Sadler, 1989, p. 122), and that avoidance of learner involvement in assessment of the kinds described here denies learners opportunities to learn from assessment, develop greater autonomy and sustain this learning throughout their lives.

References
Benson, P. (2001). Teaching and researching autonomy in language learning (1st ed.). Harlow, UK: Pearson Education.
Black, P., & Jones, J. (2006). Formative assessment and the learning and teaching of MFL: Sharing the language learning road map with the learners. Language Learning Journal, 34(1), 4-9.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2003). Assessment for learning: Putting it into practice. Maidenhead, Berks. & New York, NY: Open University Press.
Boud, D. (1995). Enhancing learning through self-assessment. London: RoutledgeFalmer.
Bowen, J. (1988). Student self-assessment. In S. Brown (Ed.), Assessment: A changing practice (pp. 47-70). Edinburgh, UK: Scottish Academic Press.
Chen, Y.-M. (2006). Peer and self-assessment for English oral performance: A study of reliability and learning benefits. English Teaching & Learning, 30(4), 1-22.
Dam, L. (1995). Learner autonomy, 3: From theory to practice. Dublin: Authentik.
Dickinson, L. (1987). Self-instruction in language learning. Cambridge, UK: Cambridge University Press.
Dickinson, L. (1992). Learner autonomy, 2: Learner training for language learning. Dublin: Authentik.
Dörnyei, Z. (1994). Motivation and motivating in the foreign language classroom. The Modern Language Journal, 78(3), 273-284.
Everhard, C. J. (2012). Re-placing the jewel in the crown of autonomy: A revisiting of the ‘self’ or ‘selves’ in self-access. Studies in Self-Access Learning Journal, 3(4), 377-391. Retrieved from: http://sisaljournal.files.wordpress.com/2009/12/everhard1.pdf
Everhard, C. J. (2014). Exploring a model of autonomy to live, learn and teach by. In A. Burkert, L. Dam & C. Ludwig (Eds.), The answer is learner autonomy: Issues in language teaching and learning (pp. 29-43). Faversham: IATEFL.
Everhard, C. J. (2015a). The assessment-autonomy relationship. In C. J. Everhard & L. Murphy (Eds.), Assessment and autonomy in language learning (pp. 8-34). Basingstoke: Palgrave Macmillan.
Everhard, C. J. (2015b). Investigating peer- and self-assessment of oral skills as stepping-stones to autonomy in EFL higher education. In C. J. Everhard & L. Murphy (Eds.), Assessment and autonomy in language learning (pp. 114-142). Basingstoke: Palgrave Macmillan.
Harris, D., & Bell, C. (1990). Evaluating and assessing for learning (2nd ed.). London & New York: Kogan Page & Nichols Publishing.
Harris, M. (1997). Self-assessment of language learning in formal settings. English Language Teaching Journal, 51(1), 12-20.
Heron, J. (1981). Assessment revisited. In D. Boud (Ed.), Developing student autonomy in learning (pp. 55-68). London: Kogan Page.
Hunt, J., Gow, L., & Barnes, P. (1989). Learner self-evaluation and assessment – a tool to autonomy in the language learning classroom. In V. Bickley (Ed.), Language teaching and learning styles within and across cultures (pp. 207-217). Hong Kong: Institute of Language in Education, Education Department.
Janssen-van Dieten, A.-M. (1989). The development of a test of Dutch as a second language: The validity of self-assessment by inexperienced subjects. Language Testing, 6(1), 30-46.
Little, D. (2009). The European language portfolio: Where pedagogy and assessment meet. Strasbourg: Council of Europe.
Matsuno, S. (2007). Self-, peer-, and teacher-assessment in Japanese university EFL writing classrooms. Doctoral thesis, Temple University, Tokyo, Japan.
Matsuno, S. (2009). Self-, peer-, and teacher-assessments in Japanese university EFL writing classrooms. Language Testing, 26(1), 75-100.
Murphy, L. (2015). Autonomy in assessment: Bridging the gap between rhetoric and reality in a distance language learning context. In C. J. Everhard & L. Murphy (Eds.), Assessment and autonomy in language learning (pp. 143-166). Basingstoke: Palgrave Macmillan.
Nguyen, T. T. H., & Walker, M. (2016). Sustainable assessment for lifelong learning. Assessment & Evaluation in Higher Education, 41(1), 97-111. doi: 10.1080/02602938.2014.985632
Nunan, D. (1997). Designing and adapting materials to encourage learner autonomy. In P. Benson & P. Voller (Eds.), Autonomy and independence in language learning (pp. 192-203). Harlow, Essex: Longman.
Orsmond, P., Merry, S., & Reiling, K. (1996). The importance of marking criteria in the use of peer assessment. Assessment & Evaluation in Higher Education, 21(3), 239-250.
Orsmond, P., Merry, S., & Reiling, K. (1997). A study in self-assessment: Tutor and students’ perceptions of performance criteria. Assessment & Evaluation in Higher Education, 22(4), 357-368.
Orsmond, P., Merry, S., & Reiling, K. (2000). The use of student derived marking criteria in peer and self-assessment. Assessment & Evaluation in Higher Education, 25(1), 23-38.

176

Chapter Eight

Pope, N. K. L. (2005). The impact of stress in self- and peer assessment. Assessment & Evaluation in Higher Education, 30(1), 51-63. Raappana, S. (1997). Metacognitive skills, planning and self-assessment as a means towards self-directed learning. In H. Holec & I. Huttunen (Eds.), Learner autonomy in modern languages: Research and development (pp. 123-137). Strasbourg: Council of Europe. Ritter, L. (1998). Peer assessment: Lessons and pitfalls. In S. Brown (Ed.), Peer assessment in practice: SEDA Paper 102 (pp. 79-85). Birmingham, UK: SEDA. Rühlemann, C. (2002). Sharing the power: Action research into learner and teacher co-evaluation. Humanising Language Teaching, 4(1), 111. Retrieved from: http://www.hltmag.co.uk/jan02/mart.htm Sadler, D. R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18(2), 119-144. Sinclair, B. (1997). Learner autonomy: The cross-cultural question. IATEFL Newsletter, 139, 12-13. —. (2000). Learner autonomy: The next phase? In B. Sinclair, I. McGrath & T. Lamb (Eds.), Learner autonomy, teacher autonomy: Future directions (pp. 4-14). Harlow: Pearson Education. —. (2011). Learner training. In C. J. Everhard & J. Mynard, with R. Smith (Eds.), Autonomy in language learning: Opening a can of worms (pp. 91-98). Canterbury: IATEFL. Tassinari, M. G. (2015). Assessing learner autonomy: A dynamic model. In C. J. Everhard & L. Murphy (Eds.), Assessment and autonomy in language learning (pp. 64-88). Basingstoke: Palgrave Macmillan.

CHAPTER NINE

DEVELOPING A TOOL FOR ASSESSING ENGLISH LANGUAGE TEACHER READINESS IN THE MIDDLE EAST CONTEXT

SADIQ MIDRAJ, JESSICA MIDRAJ, CHRISTINA GITSAKI, AND CHRISTINE COOMBE

Abstract

The effectiveness of teachers significantly correlates with students’ educational achievement. Ensuring teachers are well prepared for the classroom is of primary importance in the educational system of any country. In the field of English as an additional language (EAL), a number of tools have been developed to gauge teacher readiness for taking on the profession. However, the existing tools and resources have a North-American orientation and are not a good fit for other contexts that may require a different cultural sensitivity on the part of the teacher. This problem is particularly acute in situations where teachers trained in the West are employed in non-western contexts, as is the case in the United Arab Emirates (UAE). The purpose of the project described in this chapter was the creation of a contextually relevant resource for independent learning and self-assessment to strengthen EAL teachers’ content knowledge, pedagogical knowledge, and professional dispositions. This chapter describes the context of the study and the process of developing the resource by first compiling EAL standards and indicators that are culturally responsive and cater to the needs of EAL teachers in the UAE and the greater Gulf Region, and then by producing over 300 assessment items designed to measure specific EAL indicators. Valuable insights from the project leaders are shared in an effort to facilitate the process of developing similar resources for other EAL contexts. It is envisioned that the resource will support teacher learning and measure their progress in applying strategies, methods, and theories in EAL teaching/learning-related situations.

Introduction

Recent studies at the classroom level have found that teacher effectiveness is a strong determinant of differences in student learning (Cavalluzzo, Barrow, Henderson et al., 2015; Cowan & Goldhaber, 2015; Hawk, Coble, & Swanson, 1985; Tchoshanov, 2011). Empirical evidence has shown that students who have high quality teachers make significant and lasting learning gains. These findings make the identification and evaluation of teacher effectiveness a major priority in today’s classrooms. Despite the importance of highly effective teachers, there is no one scale or agreed-upon list of criteria to evaluate what makes an effective teacher in general or an effective second language teacher in particular. This dearth of research carries over into the Middle East context, where studies about teacher effectiveness are lacking.

An important research strand in teacher effectiveness, and one that is central to the conceptualization and development of the teacher resource described in this chapter, is the knowledge base on what is known as self-efficacy. Self-efficacy, a belief in one’s capability to execute the actions necessary to achieve a certain level of performance, is an important influence on behavior and affect, relating to individuals’ goal setting, effort expenditure and levels of persistence (Deemer & Minke, 1999; Bandura, 1993). When applied to teachers, the self-efficacy (or teacher efficacy) construct has been associated with teachers’ instructional practices and attitudes toward students (Ashton, Webb, & Doda, 1983; Bender, Vail, & Scott, 1995; Midgley, Anderman, & Hicks, 1995). In addition, teacher efficacy has been defined as both context specific and subject-matter specific. In terms of context, the effects of efficacy have been studied with pre-service, novice and in-service teachers at various levels of education (i.e., elementary, middle and secondary school) and in various contexts (i.e., urban, suburban and rural).
However, no empirical findings to date have been presented on the efficacy of teachers of English as an additional language (EAL) in the Gulf or Middle East context. It was this lack of empirical evidence and of adequate resources for measuring language teacher effectiveness that served as the motivation for the creation of this context-specific localized teacher resource.

Finally, teacher efficacy has also been associated with subject-matter knowledge and the belief that teachers must have both declarative and procedural knowledge to successfully navigate today’s classroom (Pasternak & Bailey, 2004). Declarative knowledge refers to knowledge about something, for example, knowledge of grammatical rules. Procedural knowledge refers to “the knowledge of how”, “the ability to do things” (Pasternak & Bailey, 2004, p. 157), for example, how to use the grammatical rules. The resource described in this chapter aimed to increase the effectiveness of EAL teachers in the Gulf Region by addressing the content knowledge, the pedagogical knowledge and the professional dispositions they need in order to improve student learning outcomes.

Teacher Effectiveness in Relation to Student Learning

Research suggests that teacher effectiveness affects student learning. Previous studies on teacher effectiveness tend to focus on teacher certification, student achievement as measured by standardized test scores, pedagogical practice as measured by classroom observation, and the relationship of teacher content knowledge to student achievement. Many of the measures of teacher effectiveness have focused on a combination of the aforementioned criteria (Cavalluzzo, Barrow, Henderson et al., 2015; Cowan & Goldhaber, 2015; Kane & Staiger, 2008; Rockoff, 2004).

One important finding from the available research is that teachers who are professionally certified are more effective than non-certified teachers. Moreover, students who are taught by professionally certified teachers are more likely to achieve better learning outcomes (Cavalluzzo et al., 2015; Cowan & Goldhaber, 2015). In one recent study, Cavalluzzo et al. (2015) analyzed the scores of thousands of secondary school students between 2000 and 2012 in Chicago and Kentucky and correlated their performance on an American College Testing suite of assessments with their teachers’ assessment scores. The results showed that students taught by board-certified teachers did better than students who were taught by non-board-certified teachers (Cavalluzzo et al., 2015). Similarly, Cowan and Goldhaber (2015) investigated the effectiveness of board-certified teachers on elementary and middle school students’ scores in Washington State in the United States. The results showed that teachers certified by the National Board for Professional Teaching Standards (NBPTS) were about “0.01-0.05 student standard deviations more effective” than non-NBPTS-certified teachers with similar levels of experience (Cowan & Goldhaber, 2015, p. 2). In addition, the teachers’ primary NBPTS assessment results predicted student achievement.
Furthermore, research by Hawk, Coble, and Swanson (1985) revealed a strong positive effect of teacher certification on student achievement. The researchers used controlled matched comparison groups. They matched teachers by level of experience and school, and used pre- and post-tests of student achievement at the beginning and end of the academic year in the specific curriculum taught. The study results showed that, in both general mathematics and algebra, students of teachers certified in mathematics performed significantly better than those taught by teachers not certified in mathematics.

With regard to early career teachers, Tchoshanov (2011) maintains that there are significant correlations between new teachers’ content knowledge, knowledge of student learning variables, and the quality of their lessons. Content knowledge isolated from other teacher knowledge, such as pedagogical content knowledge, curriculum knowledge, epistemological knowledge, and knowledge of the learners, may not give a complete picture of the relationship between teacher content knowledge and student achievement.

The studies reviewed here show that students of teachers with professional knowledge perform better than those taught by teachers without professional pedagogical and content knowledge. Ensuring that English language teachers, in the Gulf Region and beyond, have the required professional knowledge is crucial to increasing student achievement.

Independent Learning and Self-Assessment

Given that the instrument resulting from the present study is intended for self-assessment and independent learning, it is important to explore the benefits of such an approach. Self-assessment is defined as assessment that involves learners in making judgments about their performance, learning, and/or attitudes as they reflect on and judge how their work relates to the established standards so that they can determine the subsequent steps in their learning process (Midraj, in press). With reference to the social cognitive theory of self-regulation (Bandura, 1991), independent learning necessitates learners’ self-regulation of their learning as they take responsibility for it. Underpinning this is motivation towards achieving their own professional goals and aspirations, developing metacognitive strategies to achieve their goals and monitoring their progress towards achieving them through self-evaluation (Bandura, 1991, 1983; Eggen & Kauchak, 2007).

Research studies show that self-assessment can enhance the use of strategic planning and self-monitoring of one’s learning (Punhagui & De Souza, 2013). Tamjid and Birjandi (2011) conducted a quasi-experimental study using self- and peer-assessment activities. The results showed that incorporating such activities in a course of study can promote learner independence and enhance learner autonomy. These findings also indicate that self- and peer-assessments improve learners’ metacognitive skills and strategies and ultimately lead to improvements in their achievement (Blue, 1994; Tamjid & Birjandi, 2011). In well-structured independent learning and self-assessment, the expectations are high and transparent, while success and satisfaction in meeting these expectations and achieving the goals develop high levels of positive academic and general self-esteem, and self-efficacy. Experiencing this learning process enables teachers to develop the skills necessary for their own ongoing professional development as well as to provide opportunities for, and support, self-regulation in their students as lifelong learners.

The UAE Context

English language teaching and learning plays a central role in education in the United Arab Emirates (UAE). The UAE is a multilingual and multicultural country where 85% of the population comprises expats from other Arab, Western, and Asian countries (Central Intelligence Agency, n.d.). With over 150 different languages in use across the country (International Business Publications, 2013), English has gained a significant place as the language of communication among expats, while Arabic is the official language of the country and the language of instruction in public schools, where English is taught as a foreign language. Given the increasingly important role of English in the lives of Emiratis, a number of bilingual (English/Arabic) education curricula have recently been implemented in public schools (e.g., New Model School programme, Madares Al Ghad (Schools of the Future) programme) in an effort to raise Emirati students’ English language proficiency in preparation for attending undergraduate programmes in higher education institutions where the language of instruction is English (for a review see Layman, 2011).

Efforts in implementing bilingual education programmes have also been followed by a complete overhaul of the national English language curriculum, which is the guiding document for English language teaching in public schools. The new curriculum aims to graduate students at a high English language proficiency level, equivalent to Band 7 to 8 on the International English Language Testing System (IELTS) (Ministry of Education, 2014). Recognizing the key role that teacher effectiveness plays in the success of these education reforms, Marwan Al-Sawaleh, Assistant Undersecretary of Support Services at the Ministry of Education (MOE), UAE, in an interview with Haneen Dajani and Roberta Pennington on June 3, 2014, maintained that the MOE is also in the process of implementing a licensure system for about 60,000 teachers in the UAE public schools.

English language teachers in public schools are predominantly expats who are educated and trained in other Arab countries (e.g., Egypt, Syria, Jordan) or Western countries (e.g., Britain, Australia, USA). Their training and background can differ significantly, as can their subject knowledge and professional dispositions (Gitsaki & Bourini, 2012). Given this diversity, being evaluated for licensing purposes presents a number of challenges for EAL teachers. How can they prepare for the licensure test, which will be specific to the UAE context? How can they make sure their knowledge and skills are sufficient for the UAE context? Using one of the well-established tools for measuring EAL teacher professional knowledge and effectiveness, such as the TESOL Standards for P-12 Teacher Education Programs (Teaching English to Speakers of Other Languages, 2010) or the Praxis Series tests (Educational Testing Services, 2011), is not going to be particularly effective, as such tools were designed for EAL teachers in North America. While for the most part they apply to international contexts, there are elements, in both the indicators of knowledge and professionalism and the test items themselves, that would not contribute positively to in-service or pre-service teacher preparation for the licensure evaluation.

In the midst of this change and evolution in the English language education programme in the UAE and the challenges it creates for in-service and pre-service teachers, the current study was conceived as a way to support teacher preparation for the EAL teachers’ licensure.
It aimed at providing EAL teachers in the UAE and the greater Gulf Region with a contextually relevant resource for independent learning and self-assessment in an effort to strengthen content knowledge, pedagogical knowledge, and professional dispositions.

The Study

The study reported in this chapter is part of a larger project that aimed to design and implement a self-evaluation instrument for assessing EAL teacher readiness in the context of the UAE and the greater Gulf Region. It was envisioned that the creation of the instrument would involve five stages (see Figure 9-1). In the first stage (Stage 1), an initial review of the existing tools for assessing EAL teacher readiness would be performed by the research team. In Stage 2, EAL indicators would be compiled based on the tools reviewed and new indicators would also be described to address context-specific localized needs. In Stage 3, assessment items would be created to address each of the performance indicators. In Stage 4, the validity and reliability of the items would be tested. In the final stage of the project (Stage 5), the resource would be published and made available for use. This chapter provides an overview of the first three stages of the project.

Stage 1: Review resources on EAL teacher standards.
Stage 2: Identify context-specific standards and indicators.
Stage 3: Develop items to test specific indicators.
Stage 4: Pilot the resource and evaluate its validity and reliability.
Stage 5: Make resource available to in-service and pre-service EAL teachers in the UAE.

Figure 9-1: Project stages.

Review of the Tools for EAL Teachers

One of the major stages in the process of creating our contextually relevant and localized teacher development resource was examining what other resources (i.e., standards documents, tests, etc.) were available for teacher use. In an initial review of the literature, many sources of standards for EAL teacher performance were found to exist. We decided to base the development of our standards and indicators on two of the most widely used in the public domain: the TESOL Professional Teaching Standards (2010) and ETS’s Praxis exam specifications (2011). A brief description of each of these resources follows.

The TESOL Standards

The TESOL Standards for P-12 Teacher Education Programs (TESOL, 2010) address the professional expertise needed by EAL educators to work with language minority students. The Council for the Accreditation of Educator Preparation (CAEP) uses these performance-based standards for national recognition of teacher education programmes. Also known as the TESOL Professional Teaching Standards, these standards can be used to assess programmes that prepare and license K-12 EAL educators, as well as other teacher educator programmes. In this widely used standards document, there are five domains of knowledge that are deemed essential for EAL teachers:

Domain 1: Language
Domain 2: Culture
Domain 3: Planning, Implementing and Managing Instruction
Domain 4: Assessment
Domain 5: Professionalism

More information about the TESOL Standards can be found at: https://www.tesol.org/advance-the-field/standards/tesol-caep-standards-for-p-12-teacher-education-programs

The review of the TESOL Standards revealed that they are not culturally relevant for all student groups, as some indicators use terminology specific to North American contexts and the laws and regulations that apply in those contexts. Table 9-1 provides an example of a rubric for assessing Standard 5.a. ESL Research and History, which would be unsuitable for the Gulf context, where the laws and regulations in play differ from those of North America:

Table 9-1: Example performance indicator and achievement levels (Source: TESOL 2010, p. 69).

Suggested Performance Indicator: 5.a.2. Demonstrate knowledge of the evolution of laws and policy in the ESL profession.
Approaches Standard: Candidates are aware of the laws, judicial decisions, policies, and guidelines that have shaped the field of ESL.
Meets Standard: Candidates use their knowledge of the laws, judicial decisions, policies, and guidelines that have influenced the ESL profession to provide appropriate instruction for students.
Exceeds Standard: Candidates use their knowledge of the laws, judicial decisions, policies, and guidelines that have influenced the ESL profession to design appropriate instruction for students. Candidates participate in discussions with colleagues and the public concerning federal, state, and local guidelines, laws, and policies that affect ELLs.

Such wording can be confusing to teachers who work in countries outside of North America and can call the usefulness of such resources into question.

Praxis

The Praxis Series tests developed by Educational Testing Services (ETS) measure teacher candidates’ knowledge and skills and are used for licensing and certification processes. The suite of exams includes:

• Praxis Core Academic Skills for Educators (Core): These tests measure academic skills in reading, writing and mathematics. They were designed to provide comprehensive assessments that measure the skills and content knowledge of candidates entering teacher preparation programmes.
• Praxis Subject Assessments (formerly the Praxis II tests): These tests measure subject-specific content knowledge, as well as general and subject-specific teaching skills, that teachers need for beginning teaching.

The Praxis Core Academic Skills for Educators tests and the Praxis Subject Assessments feature selected-response and essay questions that measure the content and pedagogical knowledge necessary for a beginning teacher (ETS, 2015). For more information about the Praxis test see: https://www.ets.org/praxis/

Upon reviewing the Praxis standards, it became apparent that they too had a North-American orientation, as seen in the following example of an indicator and associated questions for Cultural Understanding in Module 2 (Cultural and Professional Aspects of the Job) of Praxis:

4CA.10 Knows how to explain United States cultural norms to English-language learners
1. What are some common U.S. cultural norms, values, and patterns of behavior that should be made explicit to ELLs?
2. How can teachers help students build intercultural competencies?
(ETS, 2011, p. 25)

Here is an example of a test item outlining the process of referring a student to special education, which would be considered unsuitable for the Gulf context:

Question 63. A middle school ESOL student who has been in the United States for two years is being discussed in a team meeting. It is noted that the student is still at the beginning ESOL level, has difficulty focusing on assignments, has poor recall, and displays several inappropriate behaviors. The teachers have checked the student’s educational history, which indicates that the same problems were seen the year before. Which of the following would be an appropriate next step?
(A) Wait at least six more months because the student has not been in the United States long enough to be evaluated for special education services.
(B) Send a letter home to the student’s parents urging them to help stop the inappropriate behaviors from occurring.
(C) Develop a pre-referral intervention plan to improve the student’s classroom and study skills.
(D) Refer the student to the special education team and ask for testing and a physiological evaluation.
Explanation: Your awareness of appropriate channels for evaluating the special education needs of ESOL students is tested here. Before ESOL students are referred for special education evaluations, pre-referral interventions should be attempted. Based on the response to the intervention, the student might subsequently be referred for special education. The correct answer, therefore, is (C). (ETS, 2011, p. 69)

There is no one standardized instrument for teacher competency that is used across the globe or even within one country. Customized versions of the Praxis II are tailored to meet the context and politics of several different States in the US. In addition to Praxis, there are many other standardized instruments, such as the New York State Teacher Certification Examinations, the New Mexico Assessment of Teacher Basic Skills, the English Major Field Test, and State Education Exams in several states, to name a few. The goal of this project was to take into consideration the standards developed by TESOL and other professional organisations and to create not a standardized test, but a resource guide for independent learning and self-assessment for EAL teachers working in the Gulf context.

Procedure

The project was initiated at the College of Education in one of the federal higher education institutions in the UAE. The College of Education had a special EAL teacher education programme which had received recognition from the TESOL International Association (http://www.tesol.org) as part of the National Council for Accreditation of Teacher Education/Council for the Accreditation of Educator Preparation (NCATE/CAEP) accreditation process. The lead faculty of the EAL programme headed the project team. Fifteen academics from notable higher education institutions in the region were selected to participate in the project. All team members were working in countries within the Gulf region, namely, the UAE, Oman, and Qatar.

The 15 team members were selected based on a number of factors, including academic background and qualifications, research publications in the field, and work experience in EAL teacher training. The project coordinators also sought representation from the three local federal institutions and other regional institutions and entities. As far as academic credentials are concerned, two team members had master’s degrees and the rest of the team (13) had terminal degrees in language education or general education. Seven of the 15 members were Arabic-English bilingual academics with many years of teaching experience in K-12 and tertiary institutions. Three members were language assessment supervisors at their respective institutions. One of the language assessment supervisors was serving at the time as the President of TESOL International Association. Six team members were applied linguists and one of the applied linguists was serving as the Associate Dean for the largest English as a foreign language programme in the UAE. One team member was a psychologist and another one was a special needs expert. The Chair of the project was the Dean of the College of Education that had undertaken the project and the principal investigator was a language education academic with over 25 years of teaching experience, 7 years in K-12 and 18 years in EAL teacher education.

Once assembled, the team members were divided into five groups and a leader was appointed for each group. The five groups worked on the creation of the resource by first reviewing EAL teacher international standards and associated indicators using a set of shared resources. These resources comprised core course textbooks and supporting material as well as teacher eligibility guidelines from different countries, such as the following:

• The TESOL International standards and indicators;
• European and Australasian standards on teaching English;
• Teachers of English eligibility instruments;
• EAL programme curricula and course syllabi; and
• Curriculum crosswalk that includes the programme standards, learning outcomes, topics covered and teaching/learning materials.

All resources were assembled and placed in an electronic folder which was shared with each project member. Given the geographical distance between the members, working remotely and conducting meetings virtually was a necessity. At first, each group reviewed the set of materials and the related indicators. The resource was divided into five domains or modules and each group worked on a specific module: Group 1 worked on language foundations; Group 2 worked on culture and language; Group 3 worked on instruction and pedagogy; Group 4 worked on language assessment; and Group 5 worked on professionalism and research.

Upon review of the different standards and indicators in the various resources, the TESOL Standards for P-12 Teacher Education Programs (TESOL, 2010) were deemed to be the most useful resource for the project even though there were examples of content that was not relevant to our context and therefore not suitable for inclusion in the new assessment instrument. Upon discussion with the research team, it was decided that we would adapt the TESOL Standards (TESOL, 2010) by rewording problematic indicators and adding those that we felt were missing. The indicators for each domain were then contextualized and adapted for the Gulf Region and the greater Middle East and North Africa (MENA) Region. Once agreement among the members of each group had been reached as to the set of indicators to be used for each domain, the groups proceeded with the production of contextually relevant multiple-choice items that covered the standards and indicators for each domain respectively.

Prior to item production, training was held for all item writers on how to generate objective closed-response multiple-choice items (MCIs) using internationally accepted guidelines (see Coombe, Folse, & Hubley, 2007; Rodriguez, 1997; Statman, 1988). Moreover, item writers were to link their questions to Bloom’s Taxonomy (Anderson & Krathwohl, 2001; Bloom, Englehart, Furst, Hill, & Krathwohl, 1956). Writing balanced MCIs for higher-order cognitive skills proved to be more challenging than writing items for basic or factual knowledge. In addition to each MCI, the members of each team also had to provide an explanation as to why a particular answer was the right answer and/or why the distractors were wrong answer options. This feature of the assessment instrument was there to ensure that the EAL teachers who would use it would be able to learn from it and not simply find out from an answer key which of their answers were wrong and which were right.

Once the MCIs and their explanations were written, each group reviewed the items they had created for their assigned domain before sending them for internal review. For the internal review process, each group was assigned the MCIs from another group to review, make recommendations for improvements, and provide feedback. Once the internal review of the items was completed, each group revised their own MCIs and then the items were sent to a group of external reviewers.
The external reviewers chosen for the project comprised professionals in the field of language assessment who were responsible for compiling and administering large-scale assessment instruments. Their task was to review the indicators and the MCIs per indicator using a set of specific criteria (see Table 9-2).

Table 9-2: Criteria for the external review of the indicators and the MCIs.

A. Criteria for the Indicators
1. The indicators are closely related to the 2010 TESOL International Standards.
2. The indicators are contextually relevant to the MENA region.
3. The indicators are politically and culturally sensitive to the MENA region.
4. The indicators are free of grammar, punctuation, and spelling errors.

B. Criteria for the Individual Items
5. The item is not culturally biased (inclusive of all MENA region cultures).
6. The item generally reads well.
7. The item includes relevant and correct information.
8. The item is accurately aligned to the indicator it purports to assess.
9. The item is accurately aligned with Bloom’s Taxonomy. (1 = Recall, 2 = Comprehension, 3 = Application, 4 = Analysis, 5 = Evaluation, 6 = Creation)
10. The maximum number of words in the item does not exceed 225 words (up to 110 words maximum for the stem and 110 for the answer options).
11. The item is free of grammar, punctuation, and spelling errors.

C. Criteria for the Stems
12. There is no usage of the word ‘EXCEPT’ in the stem unless it is required for item clarity.
13. The stem is worded in the positive form unless a significant learning outcome requires the negative form.
14. The stem is a question or a statement that clearly identifies a problem.

D. Criteria for the Answer Options
15. There is only one unambiguously correct key.
16. The alternatives are plausible and concise.
17. The alternatives are mutually exclusive.
18. The alternatives are homogenous in length, grammar and content to avoid giving extraneous clues.
19. The alternatives do not include ‘All’, ‘Never’, ‘none of the above’ or ‘all of the above.’
20. The alternatives/key do not include different combinations of options such as A & C, D & B.
21. The alternatives are presented in a logical order such as alphabetically or at random.
22. The alternatives do not contain words like ‘all’, ‘none’, or ‘always’.

E. Other Criteria
23. The item is of appropriate level of difficulty for ESL teachers with a Bachelor’s degree.
24. The item is in the active voice unless a significant learning outcome requires the passive form.
25. The explanation for the correct answer includes ‘Therefore, the correct answer is . . . .’
26. The item explanation does not need to include in-text citations. However, footnotes and the bibliography that accompanies each module may include references.
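A few of the criteria in Table 9-2 are mechanical enough to be pre-screened automatically before an item reaches a human reviewer. The sketch below is only an illustration of that idea and is not part of the project described in this chapter: the function name `check_item`, the constants, and the messages are all hypothetical, and the word limits are simply those stated in criterion 10. A flagged item would still need a reviewer's judgment, since criterion 12, for example, allows 'EXCEPT' when clarity requires it.

```python
import re

# Limits taken from criterion 10 of Table 9-2; the names are hypothetical.
MAX_STEM_WORDS = 110
MAX_OPTION_WORDS = 110   # all answer options combined
MAX_ITEM_WORDS = 225
# Phrases named in criterion 19.
BANNED_OPTIONS = ("all of the above", "none of the above")


def check_item(stem, options):
    """Return a list of possible problems for one multiple-choice item."""
    problems = []
    stem_words = len(stem.split())
    option_words = sum(len(option.split()) for option in options)

    if stem_words > MAX_STEM_WORDS:
        problems.append("stem exceeds 110 words (criterion 10)")
    if option_words > MAX_OPTION_WORDS:
        problems.append("answer options exceed 110 words (criterion 10)")
    if stem_words + option_words > MAX_ITEM_WORDS:
        problems.append("item exceeds 225 words (criterion 10)")

    # Criterion 12: flag for review rather than reject outright.
    if re.search(r"\bEXCEPT\b", stem):
        problems.append("stem uses EXCEPT (criterion 12)")

    for option in options:
        if option.strip().lower().rstrip(".") in BANNED_OPTIONS:
            problems.append("banned answer option (criterion 19): " + option)
    return problems
```

Called on a stem and a list of options, `check_item` returns an empty list when none of these surface checks fire; the remaining criteria (plausibility, cultural sensitivity, alignment with the indicator) are matters of expert judgment and are left to the reviewers.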

The external reviewers reviewed every item from each group; a total of 450 MCIs were reviewed. The review revealed that writing MCIs with plausible distractors is a challenging task: about 33% of the items were discarded once the review process was complete. The results of the external reviews were then collated, and each of the five groups revised their MCIs accordingly.

Sample Test Items

A total of 300 out of 450 items met the standards and criteria set by the research team. The following are sample MCIs from the resource that are contextually and culturally relevant to the EAL teachers in the Gulf Region.

Item 1
Paul’s family moved to Oman. The family adopted several of the values and traditions of the Omanis like celebrating Eid, but held on to some characteristics of their own culture such as traditional American dishes. This is an example of cultural __________.
(A) segregation
(B) dehumanization
(C) integration
(D) discrimination
Explanation: Cultural integration occurs when one cultural group preserves some distinctive aspects of its own culture, while adopting many of the values, attitudes, and traditions of the dominant culture. Therefore, the correct answer is (C).

Item 2
Who are polychronic time-oriented people?
(A) People who do many things simultaneously.
(B) People who do one thing at a time.
(C) People who like to treat other people the same way.
(D) People who like to make exceptions for certain people.
Explanation: Because Western cultures are mostly monochronic and Middle Eastern cultures are polychronic, it is important for Arab students to know the difference. Polychronic time-oriented people are people who adjust their time to suit their needs and may have to do many things simultaneously. Therefore, the correct answer is (A).

Item 3
Fatima’s family moved from the United Arab Emirates to the United Kingdom. While she appreciates many aspects of the new culture, she prefers to wear Emirati traditional clothing and eat Emirati food, and she makes a special effort to continue learning about Emirati heritage. Fatima exhibits a high level of __________.
(A) assimilation
(B) cultural identity
(C) linguistic diversity
(D) bigotry
Explanation: Cultural identity is part of people’s self-perception, as they prefer to follow the traditions, nationality, language, ethnicity and social class of their distinct culture. Assimilation is when people adapt to the prevailing culture of the majority. Linguistic diversity is having a variety of languages, and bigotry is an act of prejudice and racism. Therefore, the correct answer is (B).

Item 4
A student ended an academic note to his teacher with this: ‘Wish peace be with you.’ This is an example of:
(A) Code-switching
(B) L1 interference
(C) Avoidance
(D) Displacement
Explanation: Code-switching is when a student that speaks English and Arabic uses words from both languages in the same sentence. Avoidance is a communication strategy used by a learner when he avoids talking about a topic because he does not have the necessary language resources to talk about it. Displacement is a linguistic term indicating the capability of language to communicate about things that are not immediately present. The student example in this question is a direct translation from his first language (Arabic), also known as L1 interference. Therefore, the correct answer is (B).

Item 5
Children in the UAE attend bilingual schools where all subjects are taught in both English and Arabic. At the end of school these children will be:
(A) Natural bilinguals
(B) Coordinate bilinguals
(C) Minimal bilinguals
(D) Compound bilinguals
Explanation: According to the definitions of bilingualism, a natural bilingual is someone who has not undergone any specific training in a second language; a coordinate bilingual is someone whose two languages are learned in distinctly separate contexts; a minimal bilingual is someone with only a few words and phrases in a second language; and a compound bilingual is someone whose two languages are learned at the same time, often in the same context such as the New School Model bilingual programme in the UAE. Therefore, the correct answer is (D).

Item 6
Children from an Arab background will typically be diglossic as they use a dialectal variety of some sort at home, but will be taught in Standard Arabic and English at school. The pattern of errors in English that these children make will be more influenced by:
(A) Dialectal Arabic, the older the child.
(B) Dialectal Arabic, the younger the child.
(C) Dialectal and Standard Arabic, the older the child.
(D) Dialectal and Standard Arabic, the younger the child.
Explanation: An Arab child becomes a relatively proficient user of Standard Arabic after 5 to 6 years of formal education, that is, by the age of 12-13. The dialectal variety, as a real mother tongue, is mastered at a much earlier age. So the younger the child, the more likely it is that his/her English errors will be influenced by his/her dialectal Arabic. Conversely, the older the child the more likely that his/her Standard Arabic is well established and affects his/her English performance. Therefore, the correct answer is (B).

Item 7
When a language learner says: “Inshallah, I will see you next week”, it is an example of:
(A) L1 interference
(B) Fossilization
(C) Code switching
(D) Pidginization
Explanation: This question tests your understanding of common theoretical terms in the field of language acquisition. When a student uses a direct translation from his first language (Arabic) into English, it is called L1 interference. Fossilization refers to the loss of progress in the acquisition of a second language despite continuous exposure to the second language. Pidginization is when a group of people, who do not have a common language, use a simplified language for communication. When words from L1 and L2 end up in the same sentence, it is referred to as code-switching. In this case the student used “Inshallah”, which is an Arabic word, in the same sentence with English words. Therefore, the correct answer is (C).

Item 8
The magazine TESOL Arabia Perspectives is most likely to contain which of the following?
(A) Articles about the teaching and learning of English with a focus on the United States and abroad.
(B) Articles about the teaching and learning of English with a focus on the Middle East and North Africa.
(C) Articles about the teaching and learning of Arabic with a focus on the Middle East and North Africa.
(D) Helpful tips for teachers of Arabic and English in the Middle East.
Explanation: TESOL Arabia Perspectives is the quarterly publication of the TESOL Arabia association (tesolarabia.org) and discusses the teaching and learning of English with a focus on the Middle East and North Africa. Therefore, the correct response is (B).

Item 9
In order to study at federal tertiary institutions in the United Arab Emirates, prospective Emirati students must take the _____.
(A) Common Educational Proficiency Assessment-English (CEPA-English)
(B) Scholastic Assessment Test (SAT)
(C) National Admissions and Placement Office (NAPO)
(D) International English Language Testing System (IELTS)
Explanation: The Common Educational Proficiency Assessment-English (CEPA-English), administered through the National Admissions and Placement Office, is a requirement for admission to federal tertiary institutions in the UAE. Therefore, the correct response is (A).

Item 10
A teacher is designing a new grammar assessment for her English class. She wants to design an assessment that reflects a situation the student is likely to encounter in the “real world”. Therefore, the selected theme and context is visiting Yas Mall. The teacher is considering the _____ of the assessment.
(A) transparency
(B) practicality
(C) washback
(D) authenticity
Explanation: Authenticity in assessment involves designing “real-life” tasks in which students use and apply their knowledge and skills. Yas Mall is a major shopping center in Abu Dhabi, and people from all over the region travel to shop there. Therefore, the correct response is (D).

The TESOL Teacher Readiness Inventory (T-TRI)

The resource produced through the process outlined above, namely the TESOL Teacher Readiness Inventory (T-TRI), was then prepared for piloting with in-service and pre-service teachers in the UAE. Piloting is an important stage in the development of the instrument, because the items included in the resource must be shown to be valid and reliable. To this end, three types of validity are deemed crucial to the success of the project: content validity, criterion validity and face validity. In terms of content validity, the project team ensured that the resource items were based on the agreed-upon standards and the associated indicators. Resource items that were not adequately matched to the indicators were either amended or discarded. During the piloting of the resource, pre-service and in-service teachers will review the resource items for both face validity and contextual relevance. In terms of criterion validity, during the rollout phase of the resource, the investigators will measure the relationship between the test-takers’ performance on the items and their actual status as novice, developing, or proficient achievers. During the pilot stage, item analysis statistics will be generated for all items and used to analyze the effectiveness of individual items. These statistics will provide the item level of difficulty, item discrimination, frequencies and distribution, and reliability coefficients, and will help the research team and item writers improve items and eliminate ambiguous or non-discriminating ones. Using items that pretested well, a final resource (to be published in hard copy and in an online format, or possibly as an app) will be created, ensuring that the required instrument specifications, such as item difficulty and content coverage, are honored.
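To make the item analysis statistics mentioned above more concrete, the following is a minimal sketch of three standard classical-test-theory calculations: item difficulty (the proportion of correct answers), item discrimination (here taken as the correlation between an item and the total score on the remaining items), and the KR-20 reliability coefficient. The chapter does not say which formulas or software the research team will use, so these choices, the function names, and the small `PILOT` data set are assumptions made purely for illustration.

```python
from math import sqrt


def item_difficulty(scores):
    """Proportion of test-takers answering each item correctly (0 to 1)."""
    n_people = len(scores)
    n_items = len(scores[0])
    return [sum(row[i] for row in scores) / n_people for i in range(n_items)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_discrimination(scores):
    """Correlation of each item with the total score on all other items."""
    totals = [sum(row) for row in scores]
    result = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        rest = [t - x for t, x in zip(totals, item)]
        result.append(pearson(item, rest))
    return result


def kr20(scores):
    """Kuder-Richardson 20 reliability for items scored 0 or 1."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / len(totals)
    var_total = sum((t - mean_total) ** 2 for t in totals) / len(totals)
    pq = sum(p * (1 - p) for p in item_difficulty(scores))
    return (k / (k - 1)) * (1 - pq / var_total)


# Invented pilot data: one row per test-taker, one column per item (1 = correct).
PILOT = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
```

In a sketch like this, an item with a discrimination value near zero or below would be a candidate for revision or removal, which corresponds to the "non-discriminating items" the text refers to; the cut-off values used in practice are a matter for the research team.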
The final resource will allow users to calculate their level of proficiency on a given scale based on the number of correct answers to the MCIs in each domain. In the online version, the teacher proficiency profile will be generated automatically, with specific information on which domain(s) require more attention, along with references and resources to help the test-taker acquire specific knowledge in domains where weaknesses have been detected. The project is ongoing, and the research team plans to produce revised editions of the T-TRI periodically, adding new items that reflect changes and improvements in learning strategies and that respond to new requirements of the educational authorities in the Gulf Region and to changes the TESOL International Association makes to the standards. Currently, the T-TRI includes items that address cultural elements from different Gulf countries. A future edition of the instrument should include MCI writers and reviewers from the wider MENA Region in order to make the resource more responsive to the linguistic and cultural differences between the MENA countries and the needs of EAL teachers working in these countries.
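The per-domain scoring described above could be sketched as follows. This is an illustrative assumption rather than the T-TRI's actual scoring scheme: the chapter does not state the scale boundaries, so the cut-offs in `BANDS` are invented, and the function name `domain_profile` and its arguments are hypothetical. The band labels reuse the novice, developing, and proficient categories mentioned in the text.

```python
from collections import defaultdict

# Hypothetical cut-offs (share of correct answers); not taken from the chapter.
BANDS = [(0.8, "proficient"), (0.5, "developing"), (0.0, "novice")]


def domain_profile(answers, key, domain_of):
    """Score a test-taker per domain.

    answers:   item id -> option chosen by the test-taker
    key:       item id -> correct option
    domain_of: item id -> domain name
    Returns domain -> (number correct, number of items, band label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item_id, right_option in key.items():
        domain = domain_of[item_id]
        total[domain] += 1
        if answers.get(item_id) == right_option:
            correct[domain] += 1

    profile = {}
    for domain in total:
        share = correct[domain] / total[domain]
        # BANDS is ordered from highest to lowest cut-off.
        band = next(label for cutoff, label in BANDS if share >= cutoff)
        profile[domain] = (correct[domain], total[domain], band)
    return profile
```

A profile of this kind could then be used to point the test-taker to the references and resources for any domain that falls in the lowest band.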

Conclusion

The project outlined in this chapter will contribute to the body of knowledge of EAL teacher education by developing a contextually relevant independent study and self-assessment resource for EAL pre-service and in-service teachers in educational institutions in the Gulf Region. The independent learning and self-assessment resource will support teacher learning and assessment of critical knowledge and understanding that go beyond basic factual knowledge. The resource will provide opportunities for teachers to become not only direct beneficiaries, but also stakeholders in determining the nature of support required, in a format and at a time that is most beneficial to them, through self-assessment and self-regulation of their learning. Teacher education instructors may use the resource as part of the formative assessment of the EAL teacher candidates’ learning in the core curriculum courses. The use of the instrument may generate quantitative data that teacher education units can use to improve their curricula, since the data may show the strengths and the challenges of teacher education programmes. One of the long-term goals for this project is to turn the resource from a hard-copy format into an app that teachers can use on mobile devices. In addition, the resource will be designed to provide differentiated learning opportunities along with synchronous (real-time) reporting to test-takers that can help them identify areas of strength and areas to improve, and plan strategies to support their learning. The resource will be designed to reinforce and evaluate teachers’ ability to solve problems and analyze teaching-learning situations, to understand relationships that contribute to effective teaching and successful learning, and to predict and interpret test-takers’ progress and achievement. Therefore, by using the resource, EAL teachers may improve their professional content knowledge and pedagogical capabilities, thus increasing their effectiveness in the classroom and their employability in schools, and meeting the requirements of a competitive labor market as well as the licensure standards adopted by educational authorities.

References

Anderson, L. W., & Krathwohl, D. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. New York: Longman.
Ashton, P., Webb, R. B., & Doda, N. (1983). A study of teachers’ sense of self-efficacy (Final Report, National Institute of Education Contract No 400-79-0075). Gainesville, FL.: University of Florida (ERIC document number ED 231 834).
Bandura, A. (1993). Perceived self-efficacy in cognitive development and functioning. Educational Psychologist, 28(2), 117-148.
Bandura, A. (1991). Social cognitive theory of self-regulation. Organizational Behavior and Human Decision Processes, 50(2), 248-287.
Bandura, A. (1983). Self-efficacy determinants of anticipated fears and calamities. Journal of Personality and Social Psychology, 45(2), 464-469.
Bender, W. N., Vail, C. O., & Scott, K. (1995). Teachers’ attitudes toward increased mainstreaming: Implementing effective instruction for students with learning disabilities. Journal of Learning Disabilities, 28(2), 87-94.
Bloom, B., Engelhart, M., Furst, E., Hill, W., & Krathwohl, D. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York, Toronto: Longmans, Green.
Blue, G. (1994). Self-assessment of foreign language skills: Does it work? CLE Working Paper, 3, 18-35.
Cavalluzzo, L., Barrow, L., Henderson, S. et al. (2015). From large urban to small rural schools: An empirical study of National Board Certification and teaching effectiveness. CNA Analysis and Solutions. Retrieved from: http://www.cna.org/sites/default/files/research/IRM2015-U-010313.pdf
Central Intelligence Agency. (n.d.). The World Factbook. Middle East: United Arab Emirates. Retrieved from: https://www.cia.gov/library/publications/the-worldfactbook/geos/print/country/countrypdf_ae.pdf
Coombe, C., Folse, K., & Hubley, N. (2007). A practical guide to assessing English language learners. Ann Arbor, MI: University of Michigan Press.
Cowan, J., & Goldhaber, D. (2015). National Board Certification: Evidence from Washington State. CEDR Working Paper 2015-3. Seattle, WA: University of Washington.
Dajani, H., & Pennington, R. (2014, June 4). New licensing system for teachers in the UAE. The National. Retrieved from: http://www.thenational.ae/uae/education/new-licensing-system-forteachers-in-the-uae
Deemer, S., & Minke, K. (1999). An investigation of the factor structure of the Teacher Efficacy Scale. The Journal of Educational Research, 93(1), 3-10.
Eggen, P., & Kauchak, D. (2007). Educational psychology: Windows on classrooms. Pearson: Merrill Prentice Hall.
Educational Testing Services. (2015). Praxis. Retrieved from: www.ets.org/praxis
Educational Testing Services. (2011). The Praxis Series: The Official Study Guide: English to Speakers of Other Languages. ETS.
Gitsaki, C., & Bourini, A. (2012). Innovative approaches to teaching: A teacher professional development program for grades 6-9. In H. Emery & F. Gardiner-Hyland (Eds.), Contextualizing EFL for young learners (pp. 3-24). Dubai, UAE: TESOL Arabia.
Hawk, P., Coble, C. R., & Swanson, M. (1985). Certification: It does matter. Journal of Teacher Education, 36(3), 13-15.
International Business Publications. (2013). United Arab Emirates Country Study Guide. Volume 1, Strategic Information and Developments. Washington D.C.: IBP.
Kane, T., & Staiger, D. (2008). Estimating teacher impacts on student achievement: An experimental evaluation. NBER Working Paper No. 14607. Retrieved from: http://www.nber.org/papers/w14607
Layman, H. (2011). A contribution to Cummin’s Thresholds Theory: The Madaras Al Ghad Program. Master’s dissertation, the British University in Dubai, Dubai, UAE.
Midgley, C., Anderman, E., & Hicks, L. (1995). Differences between elementary and middle school teachers and students: A goal theory approach. Journal of Early Adolescence, 15, 90-113.
Midraj, J. (In Press). Self-Assessment. In J. Liontas & M. DelliCarpini (Eds.), The TESOL Encyclopedia of English Language Teaching. New York: Wiley.
Ministry of Education. (2014). English as an International Language (EIL): National Unified K–12 Learning Standards Framework. Dubai: Ministry of Education.
Pasternak, M., & Bailey, K. M. (2004). Preparing nonnative and native English-speaking teachers: Issues of professionalism and proficiency. In L. D. Kamhi-Stein (Ed.), Learning and teaching from experience: Perspectives on nonnative English-speaking professionals (pp. 155-175). Ann Arbor, MI: University of Michigan Press.
Punhagui, G. C., & De Souza, N. A. (2013). Self-regulation in the learning process: Actions through self-assessment activities with Brazilian students. International Education Studies, 6(10), 47-62.
Rockoff, J. (2004). The impact of individual teachers on student achievement: Evidence from panel data. American Economic Review, 94(2), 247-252.
Rodriguez, M. C. (1997). The art and science of item writing: A meta-analysis of multiple choice item format effects. Paper presented at the Annual Meeting of the American Educational Research Association, April, Chicago, IL.
Statman, S. (1988). Ask a clear question and get a clear answer: An enquiry into the question/answer and the sentence completion formats of multiple-choice items. System, 16(3), 367-376.
Tamjid, N. H., & Birjandi, P. (2011). Fostering learner autonomy through self- and peer-assessment. International Journal of Academic Research, 3(5), 245-251.
Tchoshanov, M. (2011). Relationship between teacher knowledge of concepts and connections, teaching practice, and student achievement in middle grades mathematics. Educational Studies in Mathematics, 76(2), 141-164.
Teaching English to Speakers of Other Languages. (2010). Standards for the Recognition of Initial TESOL Programs in P-12 ESL Teacher Education. Alexandria, VA: TESOL Publications. Retrieved from: http://www.tesol.org/docs/books/the-revised-tesol-ncate-standards-forthe-recognition-of-initial-tesol-programs-in-p-12-esl-teachereducation-(2010-pdf).pdf?sfvrsn=0

CHAPTER TEN

FOREIGN LANGUAGE TEACHERS’ PROFICIENCY: THE IMPLEMENTATION OF THE EPPLE EXAMINATION IN BRAZIL

DOUGLAS ALTAMIRO CONSOLO AND VERA LÚCIA TEIXEIRA DA SILVA

Abstract

Defining the linguistic aspects and the domain within which to operate in order to assess foreign language teachers’ proficiency has been a challenge in language assessment. Foreign language teachers’ proficiency is understood as how and when linguistic knowledge and the competence for communication lead to effective language use in teaching contexts. In Brazil, the discussion about the characteristics and the quality of teachers’ language is justified by the results of several studies that attest to the low proficiency level, mainly in oral skills, achieved by pre-service language teacher trainees and in-service teachers in various teaching contexts. Given the importance of investigating teachers’ language, and the connections between the domain, the testing instruments and the criteria on which to base a valid assessment of their language proficiency, researchers in Brazil have been exploring the implementation of the EPPLE, a language examination for foreign language teachers. This chapter reports on some results from these investigations, focusing on lexical frequency and variety, as well as accuracy and grammatical complexity. Data were collected by means of instruments designed especially to assess foreign language teachers’ proficiency: the TECOLI (a test for listening comprehension in Italian), the TEPOLI (a test for oral proficiency in English) and the EPPLE examination. The data and discussion presented in this chapter can support revisions of the construct for the EPPLE examination, and contribute to the areas of foreign language teaching, language testing and teacher education.

Introduction

Language proficiency is a requirement for foreign language teachers who are non-native speakers (NNS) of the language they teach, yet neither the higher nor the lower levels of proficiency in teacher talk have been fully established by means of comprehensive scales that could be used to assess, for example, foreign language teachers in a large country such as Brazil, with its variety of schooling contexts. Defining both the linguistic aspects and the domain in which to operate to assess this proficiency has been a challenge in language assessment. Teachers’ proficiency is understood as teachers’ linguistic performance on occasions where linguistic knowledge and communicative competence lead to effective language use in teaching contexts. Our discussion about the domain, the linguistic aspects and the quality of teachers’ language is justified by the results of several studies that attest to the low proficiency level, mainly in oral skills, among students in a number of pre-service and in-service teacher education courses, as well as in teaching contexts, especially in ELT in regular schools. In order to investigate the testing instruments and the criteria on which to base a valid assessment of foreign language teachers’ proficiency, researchers from four Brazilian public universities (State University of Sao Paulo (UNESP), State University of Rio de Janeiro (UERJ), State University of Maringa (UEM) and University of Brasilia (UnB)) and three tertiary level institutions (Faculty of Technology (FATEC), Paulista University (UNIP) and UNISEB/Estacio, a private university centre in the city of Ribeirao Preto, in the state of Sao Paulo) have been investigating foreign language teachers’ proficiency through the implementation of the EPPLE (Exame de Proficiência para Professores de Língua Estrangeira), a language examination for foreign language (FL) teachers (Consolo, 2008; Consolo, Lanzoni, Alvarenga, Concário, Martins, & Teixeira da Silva, 2009, 2010).
The researchers interested in the development of the EPPLE examination are also members of the ENAPLE-CCC (Ensino e Aprendizagem de Línguas: Crenças, construtos e competências), a research group hosted by UNESP and recognized by the CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), the national council for research and technology. This chapter reports on four studies within the EPPLE project, the present main interest of the ENAPLE-CCC group, focusing on the testing of lexical frequency and variety, and on accuracy and grammatical complexity in spoken language. The results of these investigations can shed light on the washback effects of examinations such as the EPPLE on foreign language teacher development. The studies contribute to the validation process for the EPPLE, to the development of more adequate and valid test items and, consequently, to the establishment of assessment criteria that can produce better rating scales for the examination as well as better guidance in rater training. The research results reported here were compiled from four studies in FL teaching contexts in Brazil, namely Baffi-Bonvino (2010), Borges-Almeida (2009a), Silva Neto (2014) and Veloso (2012), and the data in each study were collected by means of instruments designed especially to assess FL teachers’ proficiency: the TECOLI, a test for listening comprehension in Italian; the TEPOLI, a test for oral proficiency in English; and the EPPLE examination.

The Study

Our research contexts comprise Letters courses in public universities in Brazil, as well as in-service teacher courses, in which the data for the studies we review were collected. The Letters course (Curso de Letras) is a four-year, sometimes five-year, university course in Brazil that educates language teachers, in Portuguese and in other languages. Letters students can usually opt for certification in one language or in two languages, which they should be able to teach after graduation. All the data and information discussed in this chapter derive from studies conducted to investigate the implementation and validation of the EPPLE examination. This means that, rather than collecting new data for this discussion, we draw on empirical data generated by our colleagues, available for public consultation, and on information available to members of our research group. The participants in the studies by Baffi-Bonvino (2010), Borges-Almeida (2009a) and Silva Neto (2014) were graduating students from Letters courses (English and Portuguese languages), some of whom were already working as teachers of English as a foreign language (EFL) in private language schools. In Veloso’s (2012) study, participants were students graduating from Letters courses (Italian and Portuguese), and certified teachers of Italian as a FL. The research instruments were two different tests of FL proficiency, especially designed for teachers and teachers-to-be, and the oral test of the EPPLE examination. Each of these instruments is described below.

The TECOLI

The TECOLI, fully described in Veloso (2012), is an abbreviation for the Teste de Compreensão Oral em Língua Italiana (Listening Comprehension Test in Italian). It is a product of Veloso’s doctoral investigation, based on six versions of a listening comprehension test designed for and administered to (future) teachers of Italian as a FL, in Brazil and in Italy. The final, revised version of the TECOLI is based on data from five contexts of pre-service teacher education courses (Letters courses) with a focus on Italian as a FL in Brazil. The TECOLI can be a reference for tasks that test listening skills by means of language samples and test items that were reviewed by teachers of Italian as a FL and also underwent a detailed statistical analysis. The bulk of the detailed results from the investigation conducted by Veloso (2012) thus provides grounds for the testing of FL teachers’ listening comprehension skills and can support the assessment of oral skills in the EPPLE examination.

The TEPOLI The TEPOLI, short for Teste de Proficiência Oral em Língua Inglesa, (Consolo, 2004), is a test of oral proficiency in English and consists of an interview based on a set of pictures, some of which are accompanied by short texts, taken from EFL course books and magazines. The pictures work as visual prompts in the first testing task, on which the topics for the oral interaction between examiner and examinee(s) are based. Examinees can take the TEPOLI individually or in pairs, and when two examinees are tested, this task is geared towards encouraging the examinees to interact not only with the examiner but also with each other. As of 2003, the TEPOLI includes a second test task that consists of a role-play that aims at assessing the production of oral language that encompasses the metalanguage EFL teachers are expected to use in teaching contexts, for example, when they explain or talk about the English language with their students. This task has been incorporated in the EPPLE oral test, as described in the following section. The data on EFL student-teachers’ proficiency in English discussed by Baffi-Bonvino (2010) and Borges-Almeida (2009a) are largely based on results of the TEPOLI and the levels of oral proficiency in English achieved by students graduating from Letters courses in Brazil. All the student-teachers who participated in Baffi-Bovino’s and in BorgesAlmeida’s studies, as well as in the study by Silva Neto (2014), reported

Foreign Language Teachers’ Proficiency


below, were in the last year of an undergraduate teacher development course in Brazil.

The EPPLE

EPPLE stands for Exame de Proficiência para Professores de Línguas Estrangeiras. It is a proficiency examination for FL teachers that evaluates and classifies their linguistic proficiency, henceforth LPFLT (language proficiency of foreign language teachers), a type of language proficiency that is both general and specialized. The examination is aimed at teachers-to-be, that is, undergraduates about to conclude an initial teacher education programme at a tertiary level, usually in a Letters course (in the case of Brazil), and at FL teachers already engaged in the profession and responsible for foreign language lessons to young children, in primary and secondary education, at university and in private language schools. The examination may also be taken by teachers engaged in further education at a postgraduate level. LPFLT includes the abilities of comprehension and production of the foreign language focused on in a given version of the EPPLE, in both oral and written modes. The EPPLE has so far been designed and piloted only in English. However, a detailed study of items to test listening comprehension in Italian has already been carried out and presented by Veloso (2012), as reported earlier in this chapter, and our proposal includes plans to produce the EPPLE in other foreign languages taught in Brazil such as French, German and Spanish. General language proficiency, seen as part of the LPFLT, is characterized by the quality of performance in a given language as it is used by the majority of its speakers in a variety of everyday situations: in informal and formal conversations, when reading informative texts and usual documentation, when understanding oral language in verbal messages and short videos, and in the production of e-mail messages and written texts aimed at social networking, for example.
The specialized proficiency of FL teachers, the part of LPFLT that mostly determines language proficiency for professional demands, encompasses the use of a given FL for educational purposes, for example, to manage classroom discourse and communication in language teaching contexts (Consolo, 2007). In this sense, teachers’ language includes providing information and giving instructions, giving pedagogical explanations, evaluating students’ performance, reading academic texts and teaching materials, understanding audio and video for pedagogical purposes, and producing materials and instruments to evaluate students. More


Chapter Ten

detailed reviews of teachers’ language have been the focus of studies by members of the EPPLE research team and their supervisees, such as Andrelino (2014), Colombo (2014), Ducatti (2010) and Fernandes (2011).

The EPPLE examination comprises two tests: a paper for reading comprehension and written production, and a test of listening comprehension and speaking skills. The written test contains comprehension questions about texts of general interest for language teachers, and items in which the candidate must deal with writing tasks likely to be carried out by FL teachers, such as writing questions for a reading comprehension exercise and correcting mistakes in texts produced by language students. Tasks that require the production of argumentative texts, short messages sent by electronic mail, or summaries of academic texts can also be included in this paper. A sample of the written test is available at www.epplebrasil.org.

The speaking test, if given in its face-to-face format, is conducted with pairs of candidates in the presence of two examiners: one who manages the test and one who acts more as an observer and a rater. A fully electronic version of the EPPLE in English was produced and has been administered to student-teachers since 2011. The ‘electronic EPPLE’ is a computer-based examination that includes the tasks for the oral test, to be done in around 25 minutes, and the tasks for the written test, to be done in the second part. The whole electronic examination lasts around two and a half hours. The electronic EPPLE includes recorded instructions given to the candidates at the beginning of the examination, and a task to test the camera and the microphone connected to the computer before the oral test starts. Candidates can report faulty equipment, if that is the case, and the examiner(s) and/or technician(s) in charge of the computer laboratory can help solve technical problems that may occur.
The answers produced by the candidates are recorded on the computers and stored in a data bank to be rated at a later date. The speaking test, in both the face-to-face and the electronic versions, has four parts. In the first part the candidates are expected to speak about themselves: about personal and professional information, previous experiences as FL students, and professional expectations for the future. The second part of the oral test is based on a brief video segment, or on two short video extracts, which must first be understood so that a discussion about the content of the video(s) can be carried out by the candidates, with each other and with the examiner conducting the test. In the computer-based version of the EPPLE, candidates answer questions about the video, and the screen design for Part 1 of the video tasks is shown in Figure 10-1.


Figure 10-1: The EPPLE examination, Oral Test, Part 1.

In the third part of the test the candidates must show their proficiency in using metalanguage, that is, specific language for pedagogical purposes. Situations of problems usually faced by FL students are presented and the examinees are expected to offer solutions for linguistic doubts likely to be raised in language lessons. Examinees are expected to explain linguistic rules, for example, as well as produce pedagogical explanations. A reproduction of the screen for this part is shown in Figure 10-2 below.

In the last part of the oral test candidates are asked for their opinions about the oral test they just finished, to provide data about the EPPLE from the examinees’ perspective, which may also provide data from a perspective of self-assessment. The examinees are expected to feel free to express their opinions about the oral test and about their own performance.

In the face-to-face oral test the two examiners meet after the test so as to evaluate the candidates’ performance and classify the quality of their speaking according to a proficiency scale of a holistic nature. For the answers produced and recorded in the electronic version of the test, the recordings are rated by two examiners later on using the same scale.

The data analysed and discussed by Silva Neto (2014), reported below, are based on results from the EPPLE oral test in its face-to-face format, as


well as on a preliminary semi-electronic version of the test developed in PowerPoint and administered in a computer laboratory.

Figure 10-2: The EPPLE examination, Oral Test, Part 3.

Results and Discussion

Vocabulary, Lexical Competence and the Validation of the EPPLE Oral Test

Oral data collected by means of the TEPOLI and two mock versions of the Cambridge FCE (First Certificate in English) oral test were analysed by Baffi-Bonvino (2010) and her study recommended a revision of the proficiency rating scale for the EPPLE oral test, and also indicated reliability in the results of a test that was designed and piloted so as to provide grounds for the design of the EPPLE examination. In this section we present data concerning the oral production of undergraduate students in test situations, and report on the comparison of the language produced and the criteria of the band scales of the two tests used in Baffi-Bonvino’s study.


The extensive analysis of the language produced in test situations, as fully reported in Baffi-Bonvino (2010), was conducted by means of the RANGE software and the vocabulary lists derived from the British National Corpus (BNC). In the RANGE software there are 16 lists of words. Lists 1 to 14 contain 1,000 word families each, in descending order of frequency, that is, from the most frequent words in English (List 1) to the least frequent ones (List 14). List 15 contains names and List 16 contains exclamations, hesitations, etc.; the list named ‘Not in the lists’ contains the words which were said incorrectly. The analysis focuses on the results from the categories Types and Families because these are the categories which represent the greatest variation in the use of lexical items, that is, rich vocabulary (Scaramucci, 1997). Tokens were not considered because they represent the number of words produced, not the quality of those words. Further details of this lexical analysis were reported in Consolo and Teixeira da Silva (2011).

Below are some data from the TEPOLI, produced by AL, one of the participants in Baffi-Bonvino’s study who had an excellent performance in the test, to illustrate our discussion. The data and the table shown below, discussed in Baffi-Bonvino’s doctoral thesis, were first published in Baffi-Bonvino’s MA dissertation (Baffi-Bonvino, 2007), a study in which initial comparisons between TEPOLI and two FCE mock oral tests were analysed.

Extract 1

AL: uh (+) well MC (+) I (+) I was listening to (+) to you and (+) and your colleague in the class during the (+) the roleplay (+) and I (+) and I noticed that (+) you know (+) you’ve got some problems during your speaking (+) uh (+) and (+) and in your grammar too (+) but (+) I (+) you know (+) one thing that is really (+) worrying (+) is (+) uh (+) the USE of (+) auxiliary words (+) like when you said (+) uh (+) I not worry about the environment (+) what’s missing here? There’s something missing right (+) because as you know in Portuguese uh (+) you know (+) in Portuguese structure (+) in English structure (INCOMP) so (+) you were speaking at the present in the moment (+) you know (+) you want to tell your colleague (+) that you are not WORRIED (+) if you use the verb to be (+) you are not worried and you need (+) you can use an adjective (+) but here {ASC} you use WORRY (+) the verb (+) so you can not just put NOT there and that’s it (+) it’s missing {ASC} something and (+) this something is (+) the auxiliary (+) that you know is (+) the auxiliary (+) do (+) so you can tell I do not worry (+) about environment (+) maybe you wanna emphasize (+) or you can just (+) contract that (+) you know (+) like I don’t worry about (+) environment (+) you know (+) because (+) so try to pay attention to (+) to (+) you know (+) when we make negative using the simple present (+) right (+) you (+) you need to use (+) uh you know (+) either don’t or (+) doesn’t for third person (+) right (+) so (+) that’s it

(TEPOLI, 28 Nov 2005. Source: Baffi-Bonvino, 2007, pp. 269-270)
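The frequency-band profiling described before Extract 1 can be illustrated with a minimal sketch. It is not the RANGE software itself: the three tiny word lists, the function names and the sample sentence are hypothetical stand-ins (real lists hold 1,000 word families each), and the only point is to show how tokens, types and families are counted per list, with unlisted forms falling into a ‘not in the lists’ category.

```python
import re
from collections import defaultdict

# Hypothetical miniature lists: each maps a word form to its family headword.
# They only imitate the structure of frequency-ranked lists such as the BNC ones.
BAND_LISTS = {
    1: {"you": "you", "use": "use", "used": "use", "the": "the",
        "word": "word", "words": "word"},
    2: {"worry": "worry", "worried": "worry"},
    3: {"auxiliary": "auxiliary"},
}
NOT_IN_LISTS = "not in the lists"


def tokenize(text):
    """Lower-case the text and return its word forms."""
    return re.findall(r"[a-z']+", text.lower())


def find_band(word):
    """Return (list label, family headword) for a word form."""
    for band, forms in BAND_LISTS.items():
        if word in forms:
            return band, forms[word]
    return NOT_IN_LISTS, word


def lexical_profile(text):
    """Count tokens, types and families for each list the text draws on."""
    tokens = defaultdict(int)
    types = defaultdict(set)
    families = defaultdict(set)
    for word in tokenize(text):
        band, family = find_band(word)
        tokens[band] += 1            # every running word
        types[band].add(word)        # distinct word forms
        families[band].add(family)   # distinct headwords
    return {
        band: {
            "tokens": tokens[band],
            "types": len(types[band]),
            "families": len(families[band]),
        }
        for band in tokens
    }
```

In this sketch a speaker who repeats "you" and uses both "use" and "used" adds tokens without adding families, which is the reason given above for analysing Types and Families rather than Tokens.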

AL’s performances in the two FCE mock tests and in the TEPOLI were equivalent, if we compare the marks given to this candidate. Similarly, equivalence was found for three other participants in the study, LR, MR and MC (see Table 10-1).

Table 10-1: Marks in the FCE and in the TEPOLI (Source: Baffi-Bonvino, 2007, pp. 276-277).

Student    FCE Mock 1 (0–5)    FCE Mock 2 (0–5)    TEPOLI (0–10)
LR         3.5                 4.0                 8.0 / Band C
MR         3.0                 3.5                 7.0 / Band D
MC         3.0                 4.0                 6.0 / Band E
AL         4.5                 4.5                 9.5 / Band A
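One way to read the marks in Table 10-1 side by side is to place them on a common scale. The sketch below assumes a simple linear rescaling of the FCE marks (maximum 5.0) onto the TEPOLI range (maximum 10.0) and averages the two mock tests. Both choices are illustrative assumptions, not a procedure reported in the source study, and the function and variable names are hypothetical.

```python
# Marks copied from Table 10-1 (Baffi-Bonvino, 2007).
MARKS = {
    "LR": {"fce1": 3.5, "fce2": 4.0, "tepoli": 8.0},
    "MR": {"fce1": 3.0, "fce2": 3.5, "tepoli": 7.0},
    "MC": {"fce1": 3.0, "fce2": 4.0, "tepoli": 6.0},
    "AL": {"fce1": 4.5, "fce2": 4.5, "tepoli": 9.5},
}
FCE_MAX = 5.0
TEPOLI_MAX = 10.0


def to_tepoli_scale(fce_mark):
    """Linearly rescale an FCE mark (0-5) to the TEPOLI range (0-10).

    Assumption: a plain proportional mapping; the source does not
    prescribe a conversion rule.
    """
    return fce_mark * TEPOLI_MAX / FCE_MAX


def compare(marks):
    """Return, per student, the rescaled mean FCE mark and its gap to TEPOLI."""
    rows = {}
    for student, m in marks.items():
        fce_mean = (m["fce1"] + m["fce2"]) / 2   # assumption: average both mocks
        rescaled = to_tepoli_scale(fce_mean)
        rows[student] = {
            "fce_rescaled": rescaled,
            "tepoli": m["tepoli"],
            "gap": m["tepoli"] - rescaled,
        }
    return rows
```

Under these assumptions no student’s TEPOLI mark differs from the rescaled FCE mean by more than one point on the ten-point scale.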

In the TEPOLI, examinees are rated on a scale ranging from A (the highest band) to E (the lowest band) and given a numerical mark ranging from 10.0 to 0.0. The results presented in Table 10-1 above are based on this rating scale. An equivalence between the grades for the FCE mock tests and for the TEPOLI is observed in Table 10-1, if we consider that the highest grade for the FCE Speaking paper is 5.0. The participants’ oral proficiency levels, as assessed in the three test situations, are similar and


the criteria used for each of the three tests were based on qualitative descriptors. Both tests, FCE and TEPOLI, are based on holistic scales that consider the final product, and the test-takers’ performance is described through many specific linguistic and communicative criteria.

Silva Neto (2014) analysed the lexical competence of pre-service teachers who were graduating from a Letters course in a public university in the state of São Paulo. The data were obtained by means of a trial version of the EPPLE oral test, in the two aforementioned formats: a face-to-face test conducted with pairs of students, and a preliminary semi-electronic version of the oral test administered to the same students in a computer language laboratory. The lexical characteristics and quality of English in the speech of test-takers were analysed, such as the relevance and type of vocabulary used in the target language when interacting, the suitability of the lexical items to the test tasks, negotiations of meaning that might have arisen from the difference between the lexical competence of the candidates, the appropriateness of lexical items to the content of the speech and the coefficient of frequency of the item according to the subject matter. The results obtained by comparing the data from the face-to-face test and its ‘semi-electronic’ version show that the students’ performances do not vary significantly in the two versions of the test. Silva Neto (2014) claims that his results point to the need for revision of the descriptors for the vocabulary produced in the EPPLE oral test and the introduction of an analytical scale to rate speech that considers the differences between proficiency bands based not only on the frequency factor of lexical items but also on their appropriateness to the speaking context.
Based on the results presented by Baffi-Bonvino (2010) and by Silva Neto (2014), we recommend a combination of holistic scales with analytical ones for the EPPLE oral test, which would be more adequate to assess candidates’ oral proficiency. The existing scales assess the oral production as a combination of the oral skills involved in the tasks and, although the descriptors within each band focus on separate linguistic aspects, linguistic features are seen to operate interdependently so as to contribute to or impede communication. Analytical scales, on the other hand, would make it possible to assess each of the criteria involved in oral production in a separate way.


Accuracy and Grammatical Complexity across Levels of Proficiency in TEPOLI: The Search for Valid Scales in the EPPLE Examination

In this section we report on part of a doctoral research study (Borges-Almeida, 2009a) conducted with graduating students from a foreign language (English) teacher education course (a Letters course) in a state university in Brazil. The study was guided by the research question, “how is grammar characterized across the TEPOLI’s proficiency levels in the participants’ performance in terms of (a) accuracy and (b) complexity in the oral test and in a seminar?” (Borges-Almeida, 2009a, p. 27). According to Consolo et al. (2010), the EPPLE projects conducted over a decade ago aimed at utilizing a general theoretical background on language testing, the literature on existing tests of proficiency both in the Brazilian context and in the international scene, and data generated by results in the TEPOLI and in the preliminary versions of the EPPLE oral and written tests, in order to guide the implementation of the examination. The EPPLE projects (Consolo, 2011; Consolo, 2008) also aimed at establishing, amongst the language aspects and the different factors that constitute and influence the “linguistic-communicative ability” (Bachman, 1990; Llurda, 2000), or “language ability” (see Bachman & Palmer, 1996), criteria for the characterization of the FL proficiency of teachers, mainly in the different contexts of language education in Brazil (e.g., regular schools, namely Ensino Fundamental and Ensino Médio, university courses and private language schools).

In the doctoral study conducted by Borges-Almeida (2009a), accuracy in speech production was investigated by means of two quantitative indexes: the number of deviations per unit, or its percentage, and the percentage of deviation-free clauses (D’Ely, 2006; Guará-Tavares, 2008).
Complexity was investigated in quantitative terms through the mean length of units (given in words), the mean length of clauses, and the frequency of clauses per unit and of dependent clauses per independent unit (Wolfe-Quintero, Inagaki, & Kim, 1998). It must be noted that the sentence, a language unit that pertains to and represents written language, should not be used as a unit of analysis for spoken language. Several studies on spoken language have used Hunt’s (1965) T-unit, which consists of one main clause plus all its subordinate clauses. Since there has been criticism regarding the definition and operationalization of the T-unit in interaction contexts, Borges-Almeida (2009a) adopted Foster, Tonkyn, and Wigglesworth’s (2000) AS-unit (analysis of speech unit), which more clearly presents how to deal with the disfluent mechanisms of speech. An AS-unit is defined as: “a single speaker’s utterance consisting of an independent clause, or sub-clausal unit, together with any subordinate clause(s) associated with either” and “allows for the inclusion of independent sub-clausal units, which are common in speech.” (Foster et al., 2000, p. 365)
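The accuracy and complexity indexes named above can be made concrete with a small sketch. It assumes that a transcript has already been segmented by hand into AS-units and that each unit has been annotated with its number of words, clauses and deviations; the `ASUnit` class, its field names and the function names are hypothetical and do not come from the studies cited.

```python
from dataclasses import dataclass


@dataclass
class ASUnit:
    """One hand-annotated analysis-of-speech unit (hypothetical structure)."""
    words: int        # words in the unit
    clauses: int      # clauses in the unit (0 for a sub-clausal unit)
    deviations: int   # deviations from the grammatical norm


def accuracy_indexes(units):
    """Proportion of deviation-free units and mean deviations per unit."""
    if not units:
        raise ValueError("at least one AS-unit is required")
    n = len(units)
    return {
        "deviation_free_units": sum(1 for u in units if u.deviations == 0) / n,
        "deviations_per_unit": sum(u.deviations for u in units) / n,
    }


def complexity_indexes(units):
    """Mean unit length in words, clauses per unit, and mean clause length."""
    if not units:
        raise ValueError("at least one AS-unit is required")
    n = len(units)
    total_words = sum(u.words for u in units)
    total_clauses = sum(u.clauses for u in units)
    return {
        "words_per_unit": total_words / n,
        "clauses_per_unit": total_clauses / n,
        # Undefined when the sample contains only sub-clausal units.
        "words_per_clause": total_words / total_clauses if total_clauses else None,
    }
```

For example, four units annotated as `ASUnit(8, 2, 0)`, `ASUnit(5, 1, 1)`, `ASUnit(3, 0, 0)` and `ASUnit(8, 1, 2)` give 0.5 deviation-free units, 0.75 deviations per unit, 6.0 words per unit and 1.0 clauses per unit. Because sub-clausal units count as units but contribute no clauses, a clauses-per-unit value below 1 is possible.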

Borges-Almeida’s study (2009a) is therefore an example of the efforts to bring theoretical background on language description and analysis into the discussions towards the improvement of the EPPLE examination, as pointed out in Consolo et al. (2010). Her review of key aspects that characterize spoken language supports the analysis of data from TEPOLI reported in the next section.

Grammatical Accuracy and Complexity in TEPOLI

As previously reported in Consolo and Teixeira da Silva (2011), the results of the indexes of deviation-free units and deviations per unit (see Table 10-2 below) indicate the internal consistency of the TEPOLI proficiency scale regarding grammatical accuracy, since such indexes reveal that the higher the band, the fewer deviations can be found in a candidate’s speech. The same can be observed in recordings of class seminars presented by the same students who did TEPOLI, and this indicates that the test has predictive power over a communicative situation that is not an interview in itself; the seminars are a situation somewhat similar to an authentic class. Thus, the higher the TEPOLI band, the better the student’s grammatical performance both in a non-testing setting and in TEPOLI. A comparison of deviations from the normative standards for grammatical accuracy in English, measured as deviation-free units (first column), deviations per unit (second column) and occurrences of self-correction (third column), is presented in Table 10-2.


Table 10-2: Grammatical accuracy as measured in TEPOLI (Source: Borges-Almeida & Consolo, 2010).

        Index: Deviation-Free Units    Index: Deviations per Unit    Index: Self-Correction
Band    Oral Test    Seminar           Oral Test    Seminar          Oral Test    Seminar
B       0.88         0.84              0.14         0.19             0.08         0.12
C       0.79         0.67              0.28         0.51             0.09         0.12
D       0.65         0.62              0.46         0.60             0.10         0.06
E       0.61         0.52              0.53         0.97             0.05         0.05
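The claim that the accuracy indexes move consistently with the bands can be checked mechanically on the figures in Table 10-2. The sketch below copies the values from the table and labels each column by whether it strictly falls, strictly rises, or does neither from band B down to band E. The `trend` function and the dictionary keys are hypothetical names, and strict monotonicity is only one possible reading of "consistency".

```python
# Values copied from Table 10-2, ordered from band B down to band E.
TABLE_10_2 = {
    ("deviation_free_units", "oral_test"): [0.88, 0.79, 0.65, 0.61],
    ("deviation_free_units", "seminar"): [0.84, 0.67, 0.62, 0.52],
    ("deviations_per_unit", "oral_test"): [0.14, 0.28, 0.46, 0.53],
    ("deviations_per_unit", "seminar"): [0.19, 0.51, 0.60, 0.97],
    ("self_correction", "oral_test"): [0.08, 0.09, 0.10, 0.05],
    ("self_correction", "seminar"): [0.12, 0.12, 0.06, 0.05],
}


def trend(values):
    """Classify a sequence as strictly 'falls', strictly 'rises', or 'mixed'."""
    pairs = list(zip(values, values[1:]))
    if all(a > b for a, b in pairs):
        return "falls"
    if all(a < b for a, b in pairs):
        return "rises"
    return "mixed"


def trends_by_index(table):
    """Return the trend label of every (index, setting) column."""
    return {key: trend(values) for key, values in table.items()}
```

Applied to the table, both deviation-free columns fall and both deviations-per-unit columns rise as the band gets lower, whereas neither self-correction column is strictly ordered by band.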

Conversely, the indexes of self-corrections, for example, do not reveal that their occurrence is determined by the proficiency bands for TEPOLI, and this raises questions about how self-correction is currently described in the TEPOLI scale and calls for a revision of this specific item in the descriptors of grammatical competence and performance in spoken language.

The indexes of complexity suggest that the differences between bands are small and not very consistent when quantitatively observed, as illustrated in Table 10-3 below. One of the interpretations for the weak consistency of the index of words per unit is that such an index can also be employed as a measure of fluency. The index of clauses per unit shows little variation due to its relation to subordination, which is not a strong characteristic of spoken language. Still the differences found suggest that candidates placed in higher bands tend to produce more complex clauses than candidates placed in lower bands.

Table 10-3: Grammatical complexity as measured in TEPOLI (Source: Borges-Almeida & Consolo, 2010).

        Index: Clauses per Unit    Index: Words per Unit
Band    Oral Test    Seminar       Oral Test    Seminar
B       1.45         1.46          6.62         9.07
C       1.36         1.35          6.79         8.69
D       1.35         0.93          7.32         6.90
E       1.26         1.30          6.32         7.98
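A rough sense of how small the differences in Table 10-3 are, and of where the extreme values fall, can be obtained from the spread of each column. The sketch below copies the figures from the table and reports, for each column, the bands with the lowest and highest value and the difference between them. The names are hypothetical, and the spread is only a descriptive summary, not a statistical test.

```python
# Values copied from Table 10-3.
TABLE_10_3 = {
    ("clauses_per_unit", "oral_test"): {"B": 1.45, "C": 1.36, "D": 1.35, "E": 1.26},
    ("clauses_per_unit", "seminar"): {"B": 1.46, "C": 1.35, "D": 0.93, "E": 1.30},
    ("words_per_unit", "oral_test"): {"B": 6.62, "C": 6.79, "D": 7.32, "E": 6.32},
    ("words_per_unit", "seminar"): {"B": 9.07, "C": 8.69, "D": 6.90, "E": 7.98},
}


def summarize(by_band):
    """Return the bands holding the lowest and highest value, and their spread."""
    lowest = min(by_band, key=by_band.get)
    highest = max(by_band, key=by_band.get)
    spread = round(by_band[highest] - by_band[lowest], 2)
    return {"lowest": lowest, "highest": highest, "spread": spread}


def summarize_table(table):
    """Summarize every (index, setting) column of the table."""
    return {key: summarize(by_band) for key, by_band in table.items()}
```

In three of the four columns the extreme value belongs to band D (the lowest seminar values for both indexes and the highest oral-test value for words per unit).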


Band D can be seen as the level in which the candidates’ performance is the most varied. In a paper in which she reports an analysis of data from the TEPOLI from a phonological perspective, Borges-Almeida (2009b) mentions that data from candidates placed in band D for fluency, as analysed according to the criterion of filled pauses, also show a pattern considerably different from those observed in the other bands. One of the hypotheses for this is that candidates in this band may go through more linguistic restructuring stages, and so band D would be characterized by “a period of variable latency” (Cruz, 2001, p. 51).

The results achieved by Borges-Almeida give evidence of important differences between the proficiency bands for the linguistic elements investigated in the broad scope of the bases for the EPPLE examination and contribute to a revision and the improvement of the scale descriptors, in order to maximize its validity. Both grammatical complexity and accuracy increase towards each higher band. On the other hand, the data do not support the description of how the phenomenon of self-correction appears along the scale. Based on such results, the EPPLE scale can be improved taking into account the changes already made to the original scale for TEPOLI (Consolo, 2004).

Conclusion

The process of designing the EPPLE and its consolidation as a language examination in the context of educating FL teachers in Brazil is on the way towards a revised construct for the examination and the definition of assessment criteria informed and supported by past and present research studies. Even though the results so far achieved by our research team are mainly about English and Italian, we encourage the inclusion of other languages in future projects and studies that can support further revision of the constructs of the EPPLE examination and, as a consequence, contribute in the areas of foreign language teaching and language testing, as well as foreign language teacher education. Once the EPPLE is widely used, as pointed out by Consolo and Teixeira da Silva (2013), it is expected to motivate a revision of the course contents and aims in pre-service and in-service teacher education, especially in the Letters courses in Brazil. The standards established by such an examination may be considered a reference for LPFLT, and for the quality of language teaching and learning in the Brazilian educational contexts.


Acknowledgement

This project was supported by FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo, process 2014/10544-0).

References

Andrelino, P. J. (2014). Análise da estrutura genérica das instruções na fala do professor de Inglês: Contribuições para o teste oral do EPPLE (An analysis of the generic structure in English teachers’ talk: Contributions to the EPPLE oral test). Doctoral thesis, UNESP, Sao Jose do Rio Preto, Brazil.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.

Baffi-Bonvino, M. A. (2010). Avaliação da proficiência oral em Inglês como língua estrangeira de formandos em Letras: Uma proposta para validar o descritor ‘vocabulário’ de um teste de professores de língua Inglesa (The assessment of oral proficiency in English as a foreign language of graduating students in a Letters course: A proposal to validate the descriptor for vocabulary in a test for English language teachers). Doctoral thesis, UNESP, Sao Jose do Rio Preto, Brazil.

Baffi-Bonvino, M. A. (2007). Avaliação do componente lexical em inglês como língua estrangeira: Foco na produção oral (Assessment of the lexical component in English as a foreign language: Focus on oral production). Master’s dissertation, UNESP, Sao Jose do Rio Preto, Brazil.

Borges-Almeida, V. (2009a). Precisão e complexidade gramatical na avaliação de proficiência oral em Inglês do formando em Letras: Implicações para a validação de um teste (Grammatical precision and complexity in the assessment of oral proficiency of Letters graduating students: Implications for a test validity). Doctoral thesis, UNESP, Sao Jose do Rio Preto, Brazil.

Borges-Almeida, V. (2009b). Pausas preenchidas e domínios prosódicos: Evidências para a validação do descritor fluência em um teste de proficiência oral em língua estrangeira (Filled pauses and prosodic domains: Evidence for the validation of the descriptor for fluency in a foreign language oral proficiency test). ALFA: Revista de Linguística, 53(1), 167-193.

Borges-Almeida, V., & Consolo, D. A. (2010). Investigating accuracy and complexity across levels: The search for a valid scale for the Language Proficiency Examination for Foreign Language Teachers (EPPLE) in Brazil. Poster presented at the Language Testing Research Colloquium (LTRC), 12th-16th April, Cambridge, UK.

Colombo, C. S. (2014). O insumo linguístico oral em aulas de Inglês como língua estrangeira para crianças: A fala do professor em foco (The oral linguistic input in EFL lessons for children: Focus on teacher talk). Master’s dissertation, UNESP, Sao Jose do Rio Preto, Brazil.

Consolo, D. A. (2011). Avaliação da proficiência linguístico-comunicativa-pedagógica do professor de línguas: operacionalização de construto no Exame de Proficiência para Professores de Língua Estrangeira (EPPLE) (Assessment of language teachers’ linguistic-communicative-pedagogic proficiency: Operating the construct in the EPPLE examination). Research Project–PHASE I. Sao Jose do Rio Preto, Brazil: UNESP.

Consolo, D. A. (2008). Exame de Proficiência para Professores de Língua Estrangeira (EPPLE): Definição de construto, tarefas e parâmetros para avaliação em contextos brasileiros (Proficiency examination for foreign language teachers (EPPLE): Defining construct, tasks and parameters for assessment in Brazilian contexts). Research Project. Sao Jose do Rio Preto, Brazil: UNESP.

Consolo, D. A. (2007). A competência oral de professores de língua estrangeira: A relação teoria-prática no contexto brasileiro (The oral competence of foreign language teachers: The theory-practice relationship). In D. A. Consolo & V. L. Teixeira da Silva (Eds.), Olhares sobre competências do professor de língua estrangeira: Da formação ao desempenho profissional (pp. 165-178). Sao Jose do Rio Preto, Brazil: Editora HN.

Consolo, D. A. (2004). A construção de um instrumento de avaliação da proficiência oral do professor de língua estrangeira. Trabalhos em Linguística Aplicada, 43(2), 265-286.

Consolo, D. A., Lanzoni, H. P., Alvarenga, M. B., Concário, M., Martins, T. H. B., & Teixeira da Silva, V. L. (2010). Exame de Proficiência para Professores de Língua Estrangeira (EPPLE): Proposta inicial e implicações para o contexto brasileiro (Proficiency Examination for Foreign Language Teachers: Initial proposal and implications for the Brazilian context). In Proceedings of the II Congresso Latino-Americano de Formação de Professores de Línguas (CLAFPL) (pp. 1035-1050). Rio de Janeiro, Brazil: PUC-Rio.

Consolo, D. A., Lanzoni, H. P., Alvarenga, M. B., Concário, M., Martins, T. H. B., & Teixeira da Silva, V. L. (2009). An examination of foreign language proficiency for teachers (EPPLE): The initial proposal and implications for the Brazilian context. In Proceedings of the II Congresso Internacional da ABRAPUI (ABRAPUI International Conference) (pp. 1-15). Sao Jose do Rio Preto, Brazil: ABRAPUI. Retrieved from: http://www.abrapui.org/congresso.

Consolo, D. A., & Teixeira da Silva, V. L. (2013). A proficiência linguístico-comunicativa-pedagógica do professor de língua estrangeira: Alinhavando pesquisas em avaliação para novas políticas linguísticas no Brasil (The foreign language teacher’s linguistic-communicative-pedagogic proficiency: Research on language assessment towards new language policies in Brazil). Paper presented at the X Congresso Brasileiro de Linguística Aplicada, September 9th-12th, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil.

Consolo, D. A., & Teixeira da Silva, V. L. (2011). A discussion about tasks and criteria to assess EFL teachers’ oral proficiency in the EPPLE examination. Paper presented at the AILA World Congress, August 23rd-28th, Beijing University of Foreign Language Studies, Beijing, China.

Cruz, M. L. O. B. (2001). Estágios de interlíngua: Estudo longitudinal centrado na oralidade de sujeitos brasileiros aprendizes de espanhol [Stages of interlanguage: A longitudinal study focusing the oral production of Brazilian learners of Spanish]. Doctoral thesis, UNICAMP, Campinas, Brazil.

D’Ely, R. C. S. F. (2006). A focus on learners’ metacognitive processes: The impact of strategic planning, repetition, strategic planning plus repetition, and strategic planning for repetition on L2 oral performance. Doctoral thesis, UFSC, Florianopolis, Brazil.

Ducatti, A. L. F. (2010). A interação verbal e a proficiência oral na língua-alvo na prática de sala de aula: (Re)definindo o perfil do professor de uma professora de língua inglesa da escola pública (Verbal interaction and oral proficiency in the target language in the classroom context: (Re)defining the professional profile of an English teacher in a public school). Master’s dissertation, UNESP, Sao Jose do Rio Preto, Brazil.

Fernandes, A. M. (2011). A metalinguagem e a precisão gramatical na proficiência oral de duas professoras de Inglês como língua estrangeira (Metalanguage and grammatical accuracy in the oral proficiency of two English as a foreign language teachers). Master’s dissertation, UNESP, Sao Jose do Rio Preto, Brazil.

Foster, P., Tonkyn, A., & Wigglesworth, G. (2000). Measuring spoken language: A unit for all reasons. Applied Linguistics, 21(3), 354-375.

Guará-Tavares, M. G. (2008). Pre-task planning, working memory capacity, and L2 speech performance. Doctoral thesis, UFSC, Florianopolis, Brazil.

Hunt, K. (1965). Grammatical structures written at three grade levels. NCTE Research Report 3. Champaign, IL, USA: NCTE.

Llurda, E. (2000). On competence, proficiency and communicative language ability. International Journal of Applied Linguistics, 10(1), 85-96.

Scaramucci, M. V. R. (1997). The lexical competence of university students to read in EFL. DELTA, 13(2), 215-246. Retrieved from: http://www.scielo.br

Silva-Neto, T. M. (2014). Competência lexical na proficiência do professor de Inglês como língua estrangeira: Uma análise do teste oral do EPPLE (Lexical competence in the EFL teacher’s proficiency: An analysis of the EPPLE oral test). Master’s dissertation, UNESP, Sao Jose do Rio Preto, Brazil.

Veloso, F. S. (2012). Percurso para a elaboração de um teste de co em língua Italiana de (futuros) professores (The process of designing a listening comprehension test in Italian for (future) language teachers). Doctoral thesis, UNESP, Sao Jose do Rio Preto, Brazil.

Wolfe-Quintero, K., Inagaki, S., & Kim, H. Y. (1998). Second language development in writing: Measures of fluency, accuracy and complexity. Technical report 17. Manoa, Hawai‘i, US: University of Hawai‘i Press.

ISSUES IN LANGUAGE ASSESSMENT AND EVALUATION

CHAPTER ELEVEN

VOCABULARY SIZE ASSESSMENT AS A PREDICTOR OF PLAGIARISM

MARINA DODIGOVIC, JACOB MLYNARSKI, AND RINING WEI

Abstract

Culture, education, attitude and language proficiency have been viewed as the major causes of second language writer plagiarism (Amsberry, 2009; Erkaya, 2009). However, research data that would sufficiently substantiate these claims are scarce. The study described in this chapter investigates the relationship between plagiarism and vocabulary knowledge in the writing of over 200 students of English as a second language. It uses both lexical error and vocabulary size assessment as measures of vocabulary command. The study relies on an instructional software tool called Grammarly, which identifies both textual borrowing and language errors, as well as on the Vocabulary Size Test (VST) to measure students’ vocabulary knowledge. The results indicate that there is some correlation between the error count and plagiarism, and a strong negative correlation between vocabulary size and plagiarism rate. Therefore, the findings seem to suggest that poor vocabulary command could be a major cause of plagiarism in second language writers. Based on these findings, the importance of systematic vocabulary teaching and learning as a strategy to avoid plagiarism emerges.

Introduction

Plagiarism, or using someone else’s words or ideas without acknowledgment, appears in both native speaker (NS) and non-native speaker (NNS) university student writing (de Jaeger & Brown, 2010). Text borrowing or appropriation, as plagiarism is sometimes called, has caused librarians (e.g., Amsberry, 2009; Lund, 2004), literacy instructors (Evans & Youmans, 2000) and second language specialists (e.g., Lankamp, 2009; Liu, 2005; Shi, 2006) to give it increasing research consideration, with the focus on possible reasons and remedies. Three factors seem to have contributed to this development: the advent of computational plagiarism detection tools, the emancipation of second language writing research and the increased research in vocabulary learning and assessment. A quick library database search revealed that approximately one third of plagiarism-related journal articles talked about anti-plagiarism software, while another third dealt with second language writing. Only a handful of articles mentioned lexical issues. However, there seemed to be little overlap among the three groups of articles. To bridge this gap, the current study combined the application of electronic plagiarism detection tools with the vocabulary knowledge assessment of second language writers. The study reported in this chapter investigated the relationship between plagiarism and lexical error in the writing of over 200 ESL students. In addition, it used vocabulary size assessment to further investigate the link between lexical knowledge and plagiarism. Culture and language proficiency are often viewed as the major causes of second language writer plagiarism, but the lack of adequate instruction or consistently implemented policy has also been considered (Bacha & Bahous, 2010). In her fairly comprehensive review, Amsberry (2009) named cultural, educational and linguistic influences as the main causes of textual borrowing among second language students. In addition to these, there seem to be emotional or attitudinal causes as discussed by Erkaya (2009) in a study of textual appropriation in Turkey. Both Erkaya (2009) and Shi (2006) identified the availability of sources on the Internet as a precipitating factor in the increase of student plagiarism.
In terms of culture as a major cause of plagiarism, Confucian societies are sometimes believed not to have cultivated as strong notions of text ownership as Western societies (Lund, 2004; Shi, 2006), seeing textual borrowing as a sign of respect toward the original on the one hand and blurring the boundaries between common knowledge and original ideas on the other. Moreover, culture seems to influence educational practices such as rote learning (Ballard & Clanchy, 1984) and copying as learning along with a lack of writing instruction (Amsberry, 2009). Hyland (2001) added to the educational influences the misplaced cultural sensitivity of second language instructors, which prevents them from giving clear and meaningful feedback on textual borrowing. On the other hand, writing in a new language and trying to accommodate a new rhetorical practice are, according to Amsberry (2009), linguistic influences that push second language students toward patch-writing, i.e., seamless and unacknowledged incorporation of fragments from an original into one’s own text. Finally, students may find themselves unmotivated to write or discouraged by their instructor’s negative attitude toward writing (Erkaya, 2009). Surprisingly, little research has been done to investigate the impact of linguistic insufficiencies of second language writers on their tendency to plagiarise. It is, therefore, important to assess to what extent poor command of the target language might be the cause of plagiarism in second language (L2) writers. A good point of departure for such an investigation would be vocabulary. In the words of David Wilkins (1972, p. 111), “without grammar very little can be conveyed; without vocabulary nothing can be conveyed”. Hence, L2 student writers struggling to find the right words might indeed be enticed to borrow from the writing of more proficient authors. This study, therefore, explores the relationship between vocabulary knowledge and textual borrowing or plagiarism. It does so by correlating the language error rate and vocabulary size with the textual borrowing rate. The notions of language error and vocabulary size are discussed in the following section.

Background

There are two types of L2 vocabulary knowledge: receptive and productive (Nation, 2006). Receptive vocabulary is usually larger than the productive and enables the learner to comprehend things they read and listen to. Productive vocabulary, on the other hand, facilitates the productive skills of speaking and writing. In addition to vocabulary size, which is expressed in the number of words a learner knows, vocabulary is also measured in terms of depth (Beglar & Nation, 2007). Depth concerns everything a learner knows about a word, including ways of spelling and pronouncing it, the sentence structure it requires, its part of speech, the functions it can have in connected discourse, the contexts in which it can possibly occur, other words that may accompany it, the idiomatic expressions it is known to build and the connotations it can have (Folse, 2004). It is expected that in productive skills, such as speaking and writing, a larger vocabulary size would have the effect of a greater lexical range used, while a greater depth of vocabulary knowledge would result in a more accurate use of vocabulary. Lexical range is one of the measures of language proficiency. The underlying vocabulary size has been found to greatly affect reading success in second language learning (Nation, 2006; Nergis, 2013; Ward, 2009). However, it is interesting that the perceptions of the vocabulary size deemed sufficient for successful reading of expository discipline-specific texts continue to change: from 5,000 to 3,000 (Nation & Waring, 1997) to 2,570 (Coxhead, 2000; Nation, 2006), reaching the size of only 2,000 for engineering (Mudraya, 2006; Ward, 2009). According to Biber (2012), registers and perhaps disciplines differ in their use of linguistic repertoire, potentially placing different language demands on L2 readers across disciplines. However, except for the studies of cohesion (e.g., Dodigovic, 2005), not much previous research has focused on the effect of vocabulary command (as measured by vocabulary size) on writing quality. Cohesion as an aspect of vocabulary related to L2 writing proficiency has been sufficiently explored (Dodigovic, 2013). The lack of ability to write cohesively has also been identified as one of the factors contributing to plagiarism (Matalene, 1985). In their seminal work on cohesion, Halliday and Hasan (1976) examine the use of cohesive devices, demonstrating how much depends on the mastery of this linguistic subset. Cohesion requires writers to skilfully use the elements of lexico-grammar, such as conjunctions, lexical substitution or omission and grammatical indexing, in order to achieve cohesively composed prose (Dodigovic, 2005; Halliday & Hasan, 1976). It has been claimed that one’s ability to write cohesively can be measured by identifying and measuring the use of certain discourse markers (Halliday & Hasan, 1976; Lieber, 1981; Meisuo, 2000; Witte & Faigley, 1981; Yang, 1989). In line with this, Leo (2012) investigated Chinese students’ language at an English-medium university in Canada in comparison with Canadian-born Chinese.
The study revealed that the Chinese language incorporates the use of entirely different cohesive devices from those used in English. Therefore, the NNS Chinese students cannot base their English language development on a positive transfer from their mother tongue (L1) (Leo, 2012). She also identified significant lexical problems of Chinese learners of English as a second language (ESL) related to the use of synonymy and content words. Similar observations were made by Shi (2006) who investigated the causes of plagiarism based on students’ survey responses. Shi (2006) found that one of the most common problems was deemed to be insufficient lexical knowledge, which was also flagged as the core reason responsible for ‘cross-textual borrowings’ or plagiarism. A similar study conducted by Yu (2013) yielded comparable results, as students frequently blamed inadequate vocabulary command for their poor paraphrasing skills, which eventually resulted in unintentional plagiarism. Finally, Dodigovic (2013) also found that poor paraphrasing skills were closely associated with plagiarism. Other aspects of lexis have not been commonly associated with plagiarism. In particular, lexical error, which should be an indicator of lexical accuracy or depth of lexical knowledge (Folse, 2004; Nation, 2006), has barely been examined in the context of L2 writing. According to Augustin Llach (2011), despite the fact that lexical errors emerge as the most numerous in the available studies, research in this area is still scarce. The lack of accuracy, otherwise known as language error, is significant in three respects: it informs the teacher about what should be taught; it informs the researcher about the course of learning; and it is an outcome of the learner’s target language hypothesis testing (James, 1998). Vocabulary size is another aspect of vocabulary knowledge that might be associated with plagiarism. While research focus in this area has predominantly been on the receptive command, which enables learners to read and listen with comprehension, not much is known about the productive vocabulary command, which enables them to speak and write proficiently, and its relationship with plagiarism. According to Nation (2006), the size of productive vocabulary required for successful speaking or writing is much smaller than the receptive vocabulary size required for successful reading or listening. However, there might be some indications that in L2 contexts there is little difference between the productive and receptive vocabulary knowledge (Schmitt, 2001), which suggests that any measure of receptive vocabulary knowledge could be helpful as a productive vocabulary knowledge indicator. Another important parameter in the context of ESL plagiarism might be the academic vocabulary (Coxhead, 2000) or Academic Word List (AWL) and the ESL student writer’s ability to use this vocabulary in writing.
Studies by Augustin Llach (2011), Storch and Tapper (2009), and Deng, Lee, Varaprasad, and Leng (2010) tracked the development of academic vocabulary in the writing of ESL students over the duration of an academic English course and found evidence of significant improvement. However, the impact of this improvement on the amount of plagiarism has largely remained unexplored. Similarly, Dodigovic, Li, Chen and Guo (2014) examined a range of academic vocabulary errors committed by Chinese learners of English. However, they did not conduct their investigation in the context of textual borrowing or plagiarism. Therefore, examining lexical insufficiency as a possible cause of plagiarism emerges as a worthwhile research goal. To this end, the study reported here focused on Chinese learners of English at an English-medium university in China and investigated the relationship between the rate of lexical error and vocabulary command on the one hand and the amount of plagiarism in the students’ writing on the other.

The Study

The research question that guided this investigation was: To what extent is plagiarism related to Chinese students’ English vocabulary command? The participants in this study were 221 Chinese students at an English Medium Instruction (EMI) University. All of the students were in their first year, aged between 18 and 20, speakers of Chinese as a first language and majoring in English. All of the participants had completed their secondary education in China. The writing task used for the purpose of this study required expressing opinions and a critical review of literary sources. Taking into account the extensive need for quoting and referencing in that particular genre, this task required the student writers to apply advanced paraphrasing techniques in order for the writing to maintain its originality. The writing samples ranged in length from 800 to 1,200 words and represented a typical Anglo-American academic genre commonly found at tertiary educational level in English-speaking countries (Dodigovic, 2005).

Instruments

Grammarly

Grammarly is a plagiarism and error detection software package. Its plagiarism detection engine operates using an ‘extrinsic’ type of analysis, i.e., by identifying similarities between one’s writing and the sources commonly available on the Internet. Conversely, ‘intrinsic’ analysis focuses on the stylistic inconsistencies within the text itself, which are used to identify the plagiarised content (Sousa-Silva, 2014). Worth mentioning is the fact that apart from identifying copy-pasted parts of the text, Grammarly is capable of pinpointing weak paraphrases with only minor text modifications. Unlike Turnitin or other plagiarism detection tools, Grammarly has a feature crucial for this research, viz. the error detection or writing enhancement engine. According to the developer, Grammarly is capable of identifying up to 250 error types. Although the accuracy of the formative feedback offered by the software can be debated, the precision of the enhancement engine was estimated to be fairly high. For the purpose of this estimate, 20 Grammarly reports were randomly selected and manually reassessed, revealing that only approximately 5% of the suggestions made by Grammarly turned out to be falsely labelled as errors. A similar test was performed using the plagiarism detection tool, with a similar outcome.

Vocabulary Size Test

The Vocabulary Size Test (VST) was used to measure the size of the participants’ vocabulary (Beglar & Nation, 2007). This test has been specifically developed to “provide a reliable, accurate, and comprehensive measure” (Beglar, 2010, p. 103) of NNS English learners’ receptive vocabulary in its written form, including the 14,000 most frequent word families in English. This test, available in both electronic and hard-copy format, was used in its paper-based format and it was corrected manually.

Procedure

Student writing was analysed using the Grammarly web-based engine from the perspective of plagiarised content and lexical error. The identified errors were entered into a database following a manual identification accuracy check. VST was administered in class, two weeks after the writing samples were collected, in hard-copy format, which was filled out manually by the participants. It was also manually marked and moderated by two independent markers using the answer key. All of the activities were carried out with adherence to the ethical standards called for in the Belmont Report, Declaration of Helsinki and Nuremberg Code.

Data Analysis

The Pearson product-moment correlation coefficient (r), commonly used to reveal a possible linear association between two variables and for calculations on larger samples, in which normal distribution can be expected (Stoynoff & Chapelle, 2005), was used to calculate the correlation between the lexis-related variables and plagiarism rate. This correlation coefficient is one of the commonly used measures of effect size, although many who use it may not be aware that it is an effect size index (Ellis, 2010). In the discussion of r values below, reference to ‘effect size’ is made; this serves as a response to the recent calls, from researchers in China (e.g., Wei, 2012) and abroad (e.g., Ellis, 2010; Larson-Hall, 2012), for paying more attention to effect size, which according to the Publication Manual of the American Psychological Association (APA) is as important as the significance level (namely the p value) in inferential statistical procedures. To describe the strength of the correlation, Cohen’s (1988, cited in Ellis, 2010) frequently used guidelines were employed (for r, 0.5, 0.3 and 0.1 represent the cut-off points of the so-called ‘large’, ‘medium’ and ‘small’ effects, respectively).
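
The calculation described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than a reproduction of the authors’ procedure: the function names, the ‘negligible’ label for values below 0.1 and the sample scores are hypothetical and were chosen only for demonstration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def cohen_label(r):
    """Label |r| using the 0.1 / 0.3 / 0.5 cut-off points cited in the text.

    The 'negligible' label for |r| < 0.1 is an assumption of this sketch.
    """
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"


# Hypothetical scores for illustration only (not the study's data).
vocab_size = [4200, 5100, 6300, 7000, 8400]
plagiarism_rate = [22.0, 18.0, 15.0, 9.0, 4.0]
r = pearson_r(vocab_size, plagiarism_rate)
print(round(r, 3), cohen_label(r))
```

The label depends only on the magnitude of r, so a negative coefficient is classified in the same way as a positive one of equal size.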

Results

Lexical Errors

Based on the Grammarly output, all errors were divided into three categories: lexical, grammar and punctuation. For the purpose of this paper, only lexical errors are of interest. Using Grammarly’s categorisation, lexical errors were divided into four categories: confused words, spelling mistakes, wordiness and colloquial speech. To arrive at a more comprehensive picture, the correlation results for two types of scenarios are presented here: results with the outliers and without them (see Table 11-1).

Table 11-1: Correlation coefficient (effect size) values for lexical errors.

Cut-off Point     N     Confused Words   Spelling Mistakes   Wordiness   Colloquial Speech   Combined
2%                157   -0.0827          -0.0912             0.1623      -0.1242             -0.0792
2% no outliers    154   -0.0495          -0.0968             0.0232      -0.1763             -0.1539
3%                108   -0.0722          -0.1082             0.1804      -0.1753             -0.0872
3% no outliers    105   -0.0289          -0.1029             -0.0187     -0.2556             -0.2023
5%                66    -0.0694          -0.0418             0.2307      -0.115              0.0103
5% no outliers    63    -0.0018          -0.0085             -0.0382     -0.2304             -0.1682

According to Table 11-1, the results for the lexical errors with the cut-off point for plagiarism rate set at 5% and outliers excluded revealed that Pearson’s correlation coefficient values were the lowest for the confused words, spelling mistakes and wordiness: r = -0.0018 (p < .05), r = -0.0085 (p < .05), r = -0.0382 (p < .05) respectively. However, the statistical analysis of the use
of colloquial speech and its relation to plagiarism revealed the existence of a weak but statistically not significant negative correlation r = -0.2304 (p = .069). After adding up all the values for lexical issues and analysing them with respect to citation audit using correlation analysis, the results demonstrated the existence of a weak correlation at r = -0.1682 (see Table 11-1), albeit this result was not statistically significant (p = .187). Establishing the cut-off point for plagiarism rate at 3% did not significantly affect the correlation coefficient for lexical errors. Specifically, the value remained relatively stable at r = -0.2023, only slightly increasing from r = -0.1682 (see Table 11-1). However, these two effect size values still fell within the small-to-medium category according to Cohen’s (1988, cited in Ellis, 2010) criteria for strength of association. As Table 11-1 indicates, the highest value of Pearson’s correlation coefficient for added up lexical errors was achieved in a group of 105 participants with the lowest (below 3%) and the highest (31%, 39%, 49%) outstanding values for the citation audit excluded from the analysis. It was observed that the inclusion of the outliers at 31%, 39% and 49% threshold always had a positive effect on the correlation coefficient value for the lexical errors by significantly increasing it. It was also observed that the exclusion of papers with a high percentage of textual borrowing (31%, 39% and 49%) drastically increased the value of the correlation coefficient using the sum of all lexical error types. In summary, based on the data examined in the study, the overall lexical error rate does not seem to correlate strongly with the plagiarism rate. However, a weak and statistically not significant relationship was observed for lexical errors due to wordiness (r = 0.2307, cut-off point 5%, outliers included) and colloquial speech (3% and 5%, no outliers).

Vocabulary Size

Vocabulary Size Test (VST) results were obtained for 107 out of the 221 research participants. Hence, the 107 pairs of VST and plagiarism rate results were correlated, without excluding any data sets. The value of the correlation between the plagiarism rate and vocabulary size was high (r = -0.7791). This is a strong negative correlation, representing a ‘large’ effect and meaning that high vocabulary scores indicate low plagiarism levels (and vice versa). Based on this outcome, it is safe to assume that the larger the vocabulary size of second language writers, the less the chance they will resort to plagiarism when engaging in academic writing.
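
As a worked illustration of the size of this effect, squaring the coefficient gives the proportion of variance that the two measures share under a linear model. This calculation is added for illustration only and is not a figure reported by the authors:

```latex
r^2 = (0.7791)^2 \approx 0.607
```

That is, roughly 61% of the variability in plagiarism rate is linearly associated with VST score in this sample of 107 participants.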

Discussion

Unlike the reviewed literature (Shi, 2006; Yu, 2013), which purports that lexical errors might be a cause of plagiarism in higher education, the results of the current study revealed a weak and not significant correlation between plagiarism and lexical error rate, suggesting that this might not be the case. While one might argue that the result might be statistically significant once the sample size is increased, the findings reported above concerning the strength of correlation (effect size) remain stable across studies (including those with much larger samples) because effect size measures, unlike the p values, are unaffected by sample size (Meline & Wang, 2004). Since the writing analysed in this study consisted of academic essays in which the students were using lexically correct verbatim text from a variety of academic sources, it is also possible that the results of the study were distorted by the presence of verbatim borrowings, which could have significantly reduced the proportion of the students’ original writing and in turn might have masked the real error rate. Nevertheless, based on the results of this study, it might be safe to assume that lexical error or the absence of lexical accuracy is not a major cause of plagiarism among Chinese students at EMI universities in China. Furthermore, it seems that vocabulary depth, as the construct underlying lexical accuracy, might not be directly related to plagiarism. On the other hand, vocabulary size, due to its strong negative correlation with the textual borrowing rate, suggests that a large vocabulary might be negatively related to the level of plagiarism. In other words, NNS writers with a large vocabulary might be less likely to plagiarise, regardless of how well they know the words they are familiar with. This is consistent with Dodigovic (2013), in which the plagiarism rate was reduced by focusing on paraphrasing skills.
This skill requires both receptive vocabulary knowledge and a large vocabulary size, both tested using VST. Similarly, the present study indicates that in the case of a limited vocabulary, having a good command of the entire depth of the vocabulary known is less likely to result in plagiarism in free writing.
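
The distinction drawn in this section between effect size and statistical significance can be illustrated with the standard t statistic for a correlation, t = r√(n − 2)/√(1 − r²). The sketch below rests on stated assumptions and is not part of the authors’ analysis: the function name `t_from_r` and the larger sample of 630 are hypothetical, and a two-tailed test at the .05 level (a critical |t| of roughly 2 for samples of this size) is assumed.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic for testing a zero population correlation from r and n pairs."""
    if n < 3:
        raise ValueError("need at least 3 pairs")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# Combined lexical errors, 5% cut-off, no outliers (value taken from Table 11-1).
r = -0.1682

# 63 is the N reported for that row; 630 is a hypothetical tenfold larger sample.
for n in (63, 630):
    t = t_from_r(r, n)
    print(f"n={n}: r={r}, t={t:.2f}, |t| > 2: {abs(t) > 2}")
```

With r held at -0.1682, t is about -1.33 for N = 63, which falls short of the threshold, but about -4.28 for N = 630. The effect size is identical in both cases; only the significance decision changes with the sample size.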

Conclusion

The objective of this study was to explore the relationship between Chinese English learners’ lexical errors and vocabulary size on the one hand and the amount of borrowed content in their written prose on the other. The study was carried out with a group of 221 Chinese students majoring in English at an EMI university located in China. Learners’ lexical errors and unoriginal content were identified using Grammarly’s enhancement and plagiarism detection engine, while the vocabulary size was determined using the VST. The data were then statistically analysed using Pearson’s correlation coefficient. The results of the study suggest that vocabulary size might be a factor requiring more attention in the context of fighting plagiarism in the higher education sector, but the depth of vocabulary, which has a bearing on the lexical accuracy, might not. The outcome indicates that pedagogical effort should be invested in the vocabulary growth of the target learner population. This can be done either by stimulating deliberate vocabulary learning through the use of vocabulary cards, games and fun activities or through extensive reading programs which rely on a combination of graded readers and authentic texts. Vocabulary size testing as well as other methods of vocabulary assessment should become a more common practice (Schmitt, 2001), so that through washback they might positively impact educational practice.

Acknowledgement

This paper is a part of the output from the Jiangsu Higher Education Learning and Teaching Reform project #2015JSJG253 entitled Computational methods of lexical transfer detection in the English writing of Chinese-English bilinguals, funded by the Jiangsu Department of Education, China.

References

Amsberry, D. (2009). Deconstructing plagiarism: International students and textual borrowing practices. The Reference Librarian, 51(1), 31-44.
Augustin Llach, M. P. (2011). Lexical errors and accuracy in foreign language writing. Bristol: Multilingual Matters.
Bacha, N. N., & Bahous, R. (2010). Student and teacher perceptions of plagiarism in academic writing. Writing and Pedagogy, 2(2), 251-280.
Ballard, B., & Clanchy, J. (1984). Study abroad: A manual for Asian students. Kuala Lumpur: Longman.
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test. Language Testing, 27(1), 101-118.
Beglar, D., & Nation, P. (2007). A vocabulary size test. The Language Teacher, 31(7), 9-13.
Biber, D. (2012). Register as a predictor of linguistic variation. Corpus Linguistics and Linguistic Theory, 8(1), 9-37.
Chuo, T.-W.I., & Wenzao, U. (2007). The effects of WebQuest writing instruction program on EFL learners’ writing performance, writing apprehension and perception. TESL-EJ, 11(3), 1-27.
Coxhead, A. (2000). The academic word list. TESOL Quarterly, 34(2), 213-238.
de Jaeger, K., & Brown, C. (2010). The tangled web: Investigating academics’ views of plagiarism at the University of Cape Town. Studies in Higher Education, 35(5), 513-528.
Deng, X., Lee, K. C., Varaprasad, C., & Leng, M. L. (2010). Academic writing development of ESL/EFL graduate students in NUS. Reflections on English Language Teaching, 9(2), 119-138.
Dodigovic, M. (2005). Artificial intelligence in second language learning: Raising error awareness. Clevedon: Multilingual Matters.
Dodigovic, M. (2013). The role of anti-plagiarism software in learning to paraphrase effectively. CALL-EJ, 14(2), 23-37.
Dodigovic, M., Li, H., Chen, Y., & Guo, D. (2014). The use of academic English vocabulary in the writing of Chinese students. English Teaching in China, 5, 13-20.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge: Cambridge University Press.
Erkaya, O. R. (2009). Plagiarism by Turkish students: Causes and solutions. Asian EFL Journal, 11(2), 86-103.
Evans, F. B., & Youmans, M. (2000). ESL writers discuss plagiarism: The social construction of ideologies. Boston University Journal of Education, 182(3), 49-65.
Folse, K. (2004). Vocabulary myths. Ann Arbor: University of Michigan Press.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Hyland, K. (2001). Bringing in the reader: Addressee features in academic articles. Written Communication, 18(4), 549-574.
James, C. (1998). Errors in language learning and use: Exploring error analysis. London, England: Longman.
Lankamp, R. (2009). ESL student plagiarism: Ignorance of the rules or authorial identity problem? Journal of Education and Human Development, 3(1), 1-8.
Larson-Hall, J. (2012). Our statistical intuitions may be misleading us: Why we need robust statistics. Language Teaching, 45(4), 460-474.
Leo, K. (2012). Investigating cohesion and coherence discourse strategies of Chinese students with varied lengths of residence in Canada. TESL Canada, 29(6), 157-179.
Lieber, P. E. (1981). Cohesion in ESL students’ expository writing: A descriptive study. Doctoral thesis, New York University, USA.
Liu, D. (2005). Plagiarism in ESOL students: Is cultural conditioning truly the major culprit? ELT Journal, 59(3), 234-241.
Lund, J. R. (2004). Plagiarism: A cultural perspective. Journal of Religious & Theological Information, 6(3-4), 93-101.
Matalene, C. (1985). Contrastive rhetoric: An American writing teacher in China. College English, 47(8), 789-808.
Meisuo, Z. (2000). Cohesive features in the expository writing of undergraduates in two Chinese universities. RELC Journal, 31(1), 61-95.
Meline, T., & Wang, B. (2004). Effect-size reporting practices in AJSLP and other ASHA journals, 1999-2003. American Journal of Speech-Language Pathology, 13(3), 202-207.
Mudraya, O. (2006). Engineering English: A lexical frequency instructional model. English for Specific Purposes, 25(2), 235-256.
Nation, I. S. P. (2006). Language education-Vocabulary. In K. Brown (Ed.), Encyclopaedia of language and linguistics (Vol. 6, 2nd ed., pp. 494-499). Oxford: Elsevier.
Nation, I. S. P., & Waring, R. (1997). Vocabulary size, text coverage, and word lists. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition, pedagogy (pp. 6-19). New York: Cambridge University Press.
Nergis, A. (2013). Exploring the factors that affect reading comprehension of EAP learners. Journal of English for Academic Purposes, 12(1), 1-9.
Shi, L. (2006). Cultural backgrounds and textual appropriation. Language Awareness, 15(4), 264-282.
Schmitt, N. (2001). Vocabulary in language teaching. Cambridge: Cambridge University Press.
Sousa-Silva, R. (2014). Investigating academic plagiarism: A forensic linguistics approach to plagiarism detection. International Journal for Educational Integrity, 10(1), 31-41.
Storch, N., & Tapper, J. (2009). The impact of an EAP course on postgraduate writing. Journal of English for Academic Purposes, 8(3), 207-223.
Stoynoff, S., & Chapelle, C. (2005). ESOL tests and testing. Alexandria: Teachers of English to Speakers of Other Languages.
Ward, J. (2009). EAP reading and lexis for Thai engineering undergraduates. Journal of English for Academic Purposes, 8(4), 294-301.
Wei, R. (2012). Zaitan waiyu dingliang yanjiu zhong de xiaoying fudu [Effect size in L2 quantitative research revisited]. Xiandai Waiyu [Modern Foreign Languages], 35(4), 416-422.
Wessa, P. (2014). Free Statistics Software. Office for Research Development and Education, version 1.1.23-r7. Retrieved from: http://www.wessa.net/
Wilkins, D. A. (1972). Linguistics in language teaching. London: Edward Arnold.
Witte, S. P., & Faigley, L. (1981). Coherence, cohesion, and writing quality. College Composition and Communication, 32(2), 189-204.
Yang, W. (1989). Cohesive chains and writing quality. Word, 40(1-2), 235-254.
Yu, T. (2013). Relationship between the EAP classroom approach and plagiarism. Unpublished manuscript, Final Year Project, Xi’an Jiaotong-Liverpool University, Jiangsu, China.

CHAPTER TWELVE

WHAT IS THE IMPACT OF LANGUAGE LEARNING STRATEGIES ON TERTIARY STUDENTS’ ACADEMIC WRITING SKILLS? A CASE STUDY IN FIJI

ZAKIA ALI CHAND

Abstract

This study has assessed the language learning strategies used by a group of undergraduate students at a tertiary institute in Fiji to find out if there are any correlations with their academic writing proficiency. Data for language learning strategy use were collected through a standard questionnaire, using Oxford’s (1990) Strategy Inventory for Language Learning (SILL). In-depth interviews were also conducted to further explore the students’ language learning strategies (LLS) in early childhood. An error analysis of students’ written texts was undertaken to determine proficiency in academic language. The Statistical Package for the Social Sciences (SPSS) was used for quantitative data analysis. The results of this study showed that students used language learning strategies with a medium frequency. Metacognitive strategies were used most frequently followed by cognitive ones while affective and memory strategies were used the least frequently. There was no significant difference in the number and type of errors made in students’ written texts both before and after writing strategies were taught. In the final analysis, using Pearson’s correlation coefficient, there was no significant correlation found between strategy use and the academic language proficiency of the participants. Both successful and unsuccessful English language learners used the same strategies with almost the same frequency. This study concludes that proficiency in the academic writing of Fiji students is not influenced by their use of language learning strategies.

Introduction

In the 21st century, English has become the dominant global language, and it is now used as a medium of communication by more non-native than native speakers (Crystal, 1997; Graddol, 1997). Globalization, advances in technology, and social media have all fueled a demand for English. In Fiji and the Pacific, as elsewhere in the world, English has taken on an increasingly important role, and the individual reasons for this vary widely: from personal growth and enhancement to higher education and better employment opportunities. This is evidenced by the increasing number of enrolments in primary, secondary and tertiary institutes where English has become a mandatory subject. People with good communication skills and qualifications in English are sought after in most workplaces and educational institutes.

In Fiji, as well as in most Pacific Islands, people use English as a lingua franca (ELF) to communicate amongst themselves. For the majority of urban dwellers, English is the language of business, education, entertainment, politics, and everyday living. However, in the rural areas, the use of English as the language of daily living is not as high.

In Fiji, students learn English as a compulsory subject for thirteen years of their primary and secondary school life. However, when they enroll in tertiary institutions, it becomes apparent that their academic writing skills have not developed much over the years. Educators in Fiji tertiary institutes find that in spite of eight years of primary and four to five years of secondary school education with English as the medium of instruction (EMI), students who enroll in the local universities have weak academic writing skills.
Though no comprehensive research data is currently available on the exact areas of weaknesses in academic writing of Fiji students, the following errors are most commonly found in their written texts: tense, subject-verb agreement, weak sentence structures, mechanics (in particular punctuation and spelling), usage of articles, vocabulary, connectives, participles, word forms, word choice, and direct and reported speech. Apart from weaknesses in grammar and punctuation, students lack appropriate skills and knowledge of the structure of formal letters, essays and reports. This study investigates to what extent the use of language learning strategies can enhance the academic writing proficiency of Fijian undergraduate students who may not be aware of language learning strategies and therefore do not use appropriate strategies to enhance their language learning.

Background

Language Learning Strategies

Researchers in second language learning and acquisition have long recognized the role of the learner in the learning process, and this subject is the object of enquiry in much research. According to Ehrman and Oxford (1995), the role of the learner is complex and determined by certain variables which might correlate with successful language learning. Since Rubin’s (1975) article on the good language learner, there has been much interest and discussion on what makes some people successful at learning a second or foreign language (Ellis, 1994; Grenfell & Harris, 1999; Naiman, Fröhlich, Stern, & Todesco, 1978; Nakatani, 2006; O’Malley & Chamot, 1990). Much research has been done over the last three decades on the characteristics and traits of successful language learners which can be taught to less successful learners in ways that would benefit them. It now appears that a multitude of factors can affect language learning, including personality type, learning style, aptitude, motivation, and, the focus of this research, language learning strategies (Ehrman & Oxford, 1995; Rubin, 1975).

The field of language learning strategies, which can be defined as the methods learners use to aid their learning of a second or foreign language, is complex. Focused research on this subject began in the 1970s (Naiman et al., 1978; Rubin, 1975) with identifying and classifying good language learning strategies. However, there is still much discussion going on about the classification of these strategies and their relevance to language learning and acquisition (Hsiao & Oxford, 2002). What has become clear is that there can be effective methods or techniques that learners can use to learn a second or foreign language successfully (Lessard-Clouston, 1997; Oxford, 1990).
However, the challenge remains for many teachers and researchers of how to isolate the language strategies and teach them to learners in a way that can improve their ability to use the second or foreign language and put them on a path to more self-directed and independent learning (Chamot, 2005).

According to Scollon and Scollon (2004), language is “a multiple, complex and kaleidoscopic phenomenon” (p. 272). When one thinks about the intricacies of a language, its design, structure, grammatical systems, phonology, and how it is used according to audience, purpose and context, the challenges of learning a second language become overwhelming.

At this point, it is important to distinguish between English as a second language (ESL) and English as a foreign language (EFL) as there are
important ramifications in terms of teaching, learning and researching in the two contexts. ESL is acquired and learnt in an environment, such as Fiji, where English is the main language of wider communication and the learners are immersed in it (Ellis, 1994). On the other hand, EFL is a term used when students learn a foreign language in an environment where “they do not have ready-made contexts for communication beyond their classroom” (Brown, 2007, p. 134). Students learning English in countries like China, Japan and Iran are classified as EFL learners.

Oxford (1990) proposed six groups of learning strategies: memory, cognitive, compensation, metacognitive, affective, and social strategies. The first three are direct strategies and the latter three are indirect. Direct strategies are used in formal and informal settings where learners use the target language in order to improve their language skills, while indirect strategies help learners to manage and support their learning without involving the target language directly (Wu, 2008). These six strategy groups are measured using a questionnaire designed by Oxford (1990), the Strategy Inventory for Language Learning (SILL). The SILL employs a 5-point Likert scale to rate an individual’s use of the different strategies. These strategies are interrelated and interact with one another (Rahimi, Riazi, & Saif, 2008). According to Ellis (1994), Oxford’s classification of language learning strategies was the most comprehensive at the time. Twenty years may have elapsed since Ellis made that comment; however, more recent studies such as Hsiao and Oxford (2002) and Chamot (2004) have attested to the superiority of the SILL as an instrument for measuring language strategy use.
Although strategy use is often an unobservable event as it relates more to the mental processes of a learner, the vast literature on factors influencing it points to gender, ethnicity, motivation, learning styles and learners’ level of language proficiency. Of these, motivation, proficiency, learning style and gender seem to have a strong correlation with strategy use (Rahimi et al., 2008). Learning style differs from learning strategies as it deals with the personality of the learners and their naturalistic and habitual ways of acquiring a second language, while learning strategies are conscious steps used by learners to develop their linguistic competence (Lessard-Clouston, 1997).

Research done on strategy use over the last twenty years suggests that language performance is related to language learning strategy (Boyce, 2009; Dreyer & Oxford, 1996; Roehr, 2004; Tsan, 2008) and that strategies can be taught. In a study by Wu (2008) at the National Chin-Yi University of Technology in Taiwan, it was found that students who had a higher proficiency in English used these strategies more frequently than
those with lower proficiency. Peacock and Ho (2003) also found a positive correlation between twenty-seven strategies and learner proficiency. The most frequent strategies used were compensation followed by cognitive, metacognitive, social, memory and affective strategies. Higher proficiency learners used cognitive and metacognitive strategies more frequently than those with lower proficiency. Similar results on the correlation between high levels of proficiency and an increased use of both direct and indirect strategies were found in earlier research by Green and Oxford (1995).

Early research, from the 1970s and 1980s, found that successful language learners had “a strong desire to communicate, were willing to guess when unsure, and were not afraid of being wrong or appearing foolish” (Rubin, 1975, p. 43). However, these learners were mindful of correctness, form and meaning and monitored their own language as well as that of those surrounding them. These strategies were not employed universally by all successful language learners; their use depended on the learners’ target language proficiency, age, situation and cultural differences. Fillmore (1982) reported similar findings in her research on individual differences. She found that successful learners also used social strategies as they “spent more time...socializing” (p. 285).

By and large, research has shown that a number of variables, such as gender, ethnicity, proficiency level, socio-economic background, and level of motivation affect the type and frequency of strategy use by second/foreign language learners (Ehrman & Oxford, 1990; O’Malley, Chamot, Stewner-Manzanares, Russo, & Kupper, 1985; Oxford & Nyikos, 1989).

Academic Language Proficiency and Error Analysis

Academic language is an essential and important component of tertiary education (Scarcella, 2008). It is associated with academic success and being proficient in the use of academic language motivates and empowers students and gives them credibility in their chosen professions. At the tertiary level, academic writing is probably the most important aspect of teaching and learning. Good reading skills are also synonymous with good writing skills. In fact, reading is one of the most important tasks required for academic success. According to Grabe (1991), “literacy in academic settings exists within the context of a massive amount of print information” (p. 389). Research has found a strong correlation between reading and proficiency in academic language (Bharuthram, 2012; Lukhele, 2009; Oberholzer, 2005; Pretorius, 2002).

Much of the research on correlations between strategy use and academic proficiency employed final exam scores, language proficiency
test results, and written and spoken tasks done in the classroom (Bremner, 1999; Ketabi & Mohammadi, 2012; Tam, 2013). There is little literature on error analysis and its correlation with academic language proficiency. However, researchers have mentioned the importance of error analysis and its links to academic English proficiency (Michaelides, 1990; Richards, Platt, & Platt, 1996).

Research done by Cohen (1998), Ehrman and Oxford (1989) and Oxford (1990, 1993) showed that more frequent use of language learning strategies is often related to higher levels of academic language proficiency. However, according to Green and Oxford (1995), the picture is not crystal clear as “it shows prominent features of the landscape but only gives hints as to what the trees and buildings in the picture would look like up close” (p. 261). In their study of university students studying at different course levels in Puerto Rico, Green and Oxford (1995) found that there was a positive correlation between strategy use and academic proficiency.

Bremner (1999), working with students from the City University of Hong Kong, investigated strategy use and its correlation with language proficiency. The results showed that the participants were medium users of the learning strategies. The most frequently used strategy was compensation, followed by metacognitive, cognitive, social, memory, and affective strategies. The correlations between proficiency and strategy use showed positive relations with cognitive and compensation strategies, while there was a negative correlation with affective strategies. Goh and Kwah (1997) had similar results in their study of Singaporean learners, while Green and Oxford (1995) found that, in addition to these two strategies, metacognitive and social strategies also showed positive variation.
As for the negative correlation between proficiency and affective strategies, it could be that as learners become more proficient in their language skills, they have less use of such strategies because their confidence, knowledge and motivation have all increased.

The Study

The aim of this research study was to identify the second language learning strategies used by tertiary students from the Republic of Fiji, and investigate the impact these strategies have on their academic writing skills. The research subjects were 95 first year undergraduate students and 10 final year students in a Bachelor of Arts in Literature/Language program. Even though it was planned to have a balance of gender and ethnicity in the sample, on the day of the data collection the females (67%) outnumbered the males (33%), and the Fijian Indian students (64%)
outnumbered the indigenous Fijians or I-Taukei students (32%). The remaining 4% were made up of students from non-Fijian Indian and non-I-Taukei backgrounds.

Three methods were used for data collection. Firstly, the SILL, version 7.0, designed by Oxford (1990), was distributed to the participants during class time. Secondly, interviews were conducted with students on their language learning experience from childhood. Thirdly, data for proficiency in academic language were collected from three sources: a diagnostic test given at the beginning of the semester before any writing strategies were taught, assignments, and final exam answer scripts.

The SILL was distributed to 105 students. Of these, 95 students were first year undergraduate students enrolled in the English for Academic Purposes course, and 10 were final year undergraduates majoring in linguistics and literature. The quantitative data were analyzed using the Statistical Package for the Social Sciences (SPSS), version 19. The qualitative data were analyzed using coding techniques. Assignments were marked using a software package called Markin 4, and final examination scripts were marked manually using the same rubrics.

Results and Discussion

Use of Language Learning Strategies

The study reported an average frequency of 2.76 (out of 5); hence, students in this study were medium users of language learning strategies. Metacognitive strategies were the most frequently used strategies, followed by cognitive, social, compensation, memory, and affective strategies. Similar results have been found in other studies (Boyce, 2009; Deneme, 2008; Green & Oxford, 1995; Griffiths, 2004; Hong, 2006; Tsan, 2008).

Four strategies were found to have the highest use among the research subjects. Three were cognitive or direct strategies and one was a metacognitive or an indirect strategy:

Cognitive Strategies:
• Item 6: I watch English language TV shows or go to English movies (Mean = 3.6).
• Item 8: I write notes, messages, letters, or reports in English (Mean = 3.6).
• Item 9: I first skim an English passage (read over the passage quickly) then go back and read carefully (Mean = 3.6).

Metacognitive Strategy:
• Item 2: I notice my English mistakes and use that information to help me do better (Mean = 3.5).

Ten strategies were found to be ‘unpopular’ with means below 2.5. There were five direct and five indirect strategies in this category. Interestingly, none of the metacognitive strategies are in this category, while only one strategy from the cognitive group made it onto this list. The rest of the ‘unpopular’ strategies belonged to the memory and affective categories. Overall, memory and affective strategies recorded significantly lower usage than the other four categories, and the majority of these strategies were unpopular with the study participants.

The significantly low usage of memory strategies is surprising, as teachers are often under the impression that Fijian students rely heavily on rote learning, especially since great importance is given to success in the national examinations. These results contradict such stereotypes.

Even though metacognitive strategies had the highest overall usage with a mean of 3, the results for cognitive strategies showed that items 6, 8 and 9 had the highest usage with a mean of 3.6. This is an important result. It shows that Fijian students are high users of certain cognitive strategies, which involve watching English TV programs and movies as a means to learn English. Writing notes, letters and reports in English and skim reading are also used with a high frequency. It is also interesting to note that the overall means for cognitive strategies (Mean = 2.9) and metacognitive strategies (Mean = 3) were very close.

Results also showed that social strategies were almost as popular (Mean = 2.8). Social strategies are used when interacting with other learners and opportunities to improve and practice the language are available. Items 1 and 5 achieved a mean of 3.2.
These results show that students socialize using the English language in their daily lives and use opportunities to learn from each other; however, they are reluctant to be corrected by their peers or do not correct each other. One of the reasons for this may have to do with cultural protocols and rules of politeness in their culture. In Fiji, among both the major ethnic groups, indigenous Fijians and Fijians of Indian descent, certain customary rules prevent people from correcting each other’s mistakes in spoken discourse. Nevertheless, social strategies had the third highest usage among the six strategy categories, only preceded by cognitive and metacognitive strategies.
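The frequency bands used in this section can be made explicit with a short sketch. It assumes the cut-offs commonly cited for SILL averages (below 2.5 as low, 2.5 up to 3.5 as medium, 3.5 and above as high). Only the 2.5 boundary is stated in this chapter, so the 3.5 boundary and the function name `usage_band` are assumptions made for illustration. The means are those reported above.

```python
def usage_band(mean_score, low_cutoff=2.5, high_cutoff=3.5):
    """Classify a mean SILL score (1-5 Likert scale) into a usage band.

    The cut-offs are assumed defaults, not values fixed by this chapter.
    """
    if not 1.0 <= mean_score <= 5.0:
        raise ValueError("SILL means must lie between 1 and 5")
    if mean_score >= high_cutoff:
        return "high"
    if mean_score >= low_cutoff:
        return "medium"
    return "low"


# Means reported in this section (overall, category level and item level).
reported_means = {
    "overall": 2.76,
    "metacognitive": 3.0,
    "cognitive": 2.9,
    "social": 2.8,
    "items 6, 8 and 9": 3.6,
    "item 2": 3.5,
}

bands = {label: usage_band(mean) for label, mean in reported_means.items()}

for label, band in bands.items():
    print(f"{label}: {reported_means[label]:.2f} -> {band}")
```

Under these assumed cut-offs, the overall and category means fall in the medium band, while the four individual items singled out above fall in the high band.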

Relationship between Strategy Use and Gender/Ethnicity

Results showed that there is no relationship between gender, ethnicity and strategy use. In Table 12-1 it can be seen that the correlation coefficient (r) between memory strategy use and gender had a value of 0.169 with p > .05. The coefficient was not statistically significant, and the weak correlation shows a negligible relationship between gender and memory strategy use. Likewise, the correlation coefficients between gender and the rest of the strategies were close to 0 and declining to a negative figure in metacognitive, affective and social strategies, with p > .05. Ethnicity showed overall a negative correlation with all the strategy categories. Neither gender nor ethnicity showed any significant correlation with strategy use.

Table 12-1: Correlations between strategy use, gender and ethnicity.

Strategies       Gender r    Gender Sig.    Ethnicity r    Ethnicity Sig.
Gender           1                          .075           .448
Ethnicity        .075        .448           1
Memory           .169        .084           -.054          .582
Cognitive        .187        .056           -.187          .055
Compensation     .050        .616           -.165          .092
Metacognitive    .137        .165           -.032          .746
Affective        -.020       .839           -.099          .317
Social           -.085       .386           -.045          .649

Note: N = 105; r = Pearson correlation coefficient; Sig. = Significance value.
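As an illustration of how a coefficient such as those in Table 12-1 is obtained, the following minimal sketch computes Pearson's r between a dummy-coded grouping variable and a strategy score. The study itself used SPSS; the function name `pearson_r` and all data values here are hypothetical and are not taken from the study.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, NOT from the study: gender coded 0/1 and each
# student's mean score on one SILL strategy category.
gender = [0, 0, 0, 0, 1, 1, 1, 1]
memory_mean = [2.1, 2.6, 2.4, 2.9, 2.3, 2.8, 2.7, 3.0]

r = pearson_r(gender, memory_mean)
print(f"r = {r:.3f}")
```

With a two-level variable such as gender, the sign of r depends only on which group is coded 1, so it is the magnitude and significance that carry the interpretation.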

Interview Analysis

The interview data confirmed the results from the analysis of the SILL questionnaire. Most of the participants were not high users of language learning strategies. Social strategies were seldom used by the participants to learn English, both within the family and with relatives and friends. Often social activities were conducted using the participants’ first
language (L1). Hence, social, memory and affective strategies once again are at the bottom of strategy use.

Error Analysis of Academic Writing

Essay samples were collected from a diagnostic test, a course assignment and the final examination. Essays were examined for adherence to grammatical rules of Standard English as well as structure and formatting. Every error was identified and labeled using the rubrics from Markin 4. Table 12-2 below provides a breakdown of all the errors from highest to lowest that occurred in all the texts that were analyzed. When all the texts were combined, a total of 10,313 errors were found in the diagnostic tests, assignments and final exam scripts of the research subjects.

Table 12-2: Results of the error analysis.

Error                        %        Error                    %
Punctuation                  14.3     Article                  2.7
Word Choice                  10.9     Verb Form                2.6
Cut (Unnecessary text)       7.9      Conjunction              2.3
Repetition                   7.7      Vague Reference          1.9
Agreement (subject/verb)     7.1      Sentence Fragment        1.9
Plural (singular/plural)     6.3      Capitalization           1.3
Preposition                  6.1      Word Order               0.7
Incomprehensible Text        5.3      Inaccurate Quotation     0.7
Missing Word/s               4.8      Parallel Construction    0.3
Word Form                    4.1      Missing Space            0.3
Modifier (misplaced)         3.9      Count/Non-Count          0.2
Spelling                     3.3      Paragraphing             0.2
Verb Tense                   3.1      Formatting               0.01

The analysis shows that the greatest number of errors occurred in punctuation followed by word choice, unnecessary text (cut), repetition (of information or text), agreement, singular/plural, preposition, and incomprehensible text. More than half of all the errors were made in the first six categories. The most common errors made were in punctuation, word choice, relevancy of information, cohesion and coherence, subject-verb agreement, use of singular and plural, and use of prepositions.
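The claim that the first six categories account for more than half of all errors can be checked by accumulating the percentages reported in Table 12-2. The sketch below is illustrative only; the helper name `cumulative_shares` is an assumption, and the figures are the eight largest percentages from the table.

```python
# Percentages as reported in Table 12-2, ordered from most to least frequent.
error_shares = [
    ("Punctuation", 14.3),
    ("Word Choice", 10.9),
    ("Cut (unnecessary text)", 7.9),
    ("Repetition", 7.7),
    ("Agreement (subject/verb)", 7.1),
    ("Plural (singular/plural)", 6.3),
    ("Preposition", 6.1),
    ("Incomprehensible Text", 5.3),
]


def cumulative_shares(shares):
    """Return (label, running total) pairs, rounded to one decimal place."""
    running = 0.0
    result = []
    for label, pct in shares:
        running += pct
        result.append((label, round(running, 1)))
    return result


cumulative = cumulative_shares(error_shares)
top_six_total = cumulative[5][1]  # running total after the sixth category

for label, total in cumulative:
    print(f"{label:<26} {total:>5.1f}%")
```

The running total after the sixth category is 54.2%, which is consistent with the statement above.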

Academic Language Proficiency and Strategy Use

It was hypothesized in this study that the higher the use of language strategies, the fewer the errors in students’ academic writing. The Pearson correlation coefficient was used for this analysis because the data were parametric. In parametric correlations, the correlation coefficient (r) shows the strength of the relationship between two variables. Table 12-3 below shows the results of the Pearson correlation between language learning strategies and overall errors from all the written texts analyzed. The correlation coefficient was 0.22, which, according to Cohen and Cohen (1983), indicates a small positive linear correlation. The significance value was .04 (p < .05), so the result is statistically significant. There was thus a small positive linear correlation between errors and learning strategies: contrary to the hypothesis, as more strategies were used by students over the semester, the number of errors in their written work also increased. Therefore, the results show that the language learning strategies used by the students did not have any significant positive impact on their academic language proficiency.

Table 12-3: Correlation between strategy use and language proficiency.

                                       Errors Total    LLS Total
Errors Total    Pearson Correlation    1               .220*
                Sig. (2-tailed)                        .040
LLS Total       Pearson Correlation    .220*           1
                Sig. (2-tailed)        .040

Note: N = 88; LLS = Language Learning Strategies; * = Significant at the .05 level.
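The significance value in Table 12-3 can be cross-checked from r and N alone, using the standard t test for a correlation coefficient, t = r * sqrt((N - 2) / (1 - r^2)) with N - 2 degrees of freedom. The sketch below is not the SPSS procedure used in the study; the function names are hypothetical, and the p-value is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed.

```python
import math


def t_statistic(r, n):
    """t statistic for testing H0: rho = 0, given Pearson r and sample size n."""
    if abs(r) >= 1.0 or n < 3:
        raise ValueError("requires |r| < 1 and n >= 3")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(r, n, steps=2000):
    """Two-tailed p-value for r, integrating the t density with Simpson's rule."""
    df = n - 2
    upper = abs(t_statistic(r, n))
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # p = 2 * (1 - CDF(|t|)) and CDF(|t|) = 0.5 + area from 0 to |t|.
    return max(0.0, 1.0 - 2.0 * area_zero_to_t)


# Values reported in Table 12-3: r = .220, N = 88.
t_value = t_statistic(0.220, 88)
p_value = two_tailed_p(0.220, 88)
print(f"t = {t_value:.2f}, df = {88 - 2}, p = {p_value:.3f}")
```

For r = .220 and N = 88 the t statistic is about 2.09 on 86 degrees of freedom, and the resulting p-value falls just below .05, in line with the reported significance of .040 once rounding of r is taken into account.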

Conclusion

The research study reported in this chapter found that university students in Fiji used language learning strategies at a medium level. The most frequent strategies used were metacognitive followed by cognitive and social strategies. Affective strategies were used the least. Undergraduate students in Fiji are aware of the strategies they use to learn English and are taking control of their learning, albeit at a medium level.

The study also found that ethnicity did not have a significant influence on strategy use. Students’ ethnic background, i.e., whether they were indigenous Fijian or Fijians of Indian origin, was not significantly correlated with strategy use. Both major ethnic groups displayed
similarities in their use of language learning strategies. Overall, gender and ethnicity had no significant correlation with strategy use.

The study also found that the strategies students used were not significantly correlated with their academic language. This is contrary to previous findings as many studies, including Green and Oxford (1995), Bremner (1999) and Al-Hebaishi (2012), have found that cognitive and metacognitive strategies had a positive impact on academic language proficiency while affective strategies had an inverse relationship. This study has shown that no single strategy had a direct impact on the participants’ proficiency in academic writing.

The research study also found that a high percentage of Fiji students at the tertiary level do not have the required proficiency for academic writing. Students make a large number of errors in grammar, punctuation, sense and style. In the students’ essays, there were several types of recurrent errors in word choice and form, wrong use of prepositions, confusion with verb tenses and forms, and subject-verb agreement. The most common types of errors occurred in punctuation, followed by vocabulary, use of singular and plural, redundancies, incomprehensible text, and subject-verb agreement. The analysis has shown students’ difficulties in understanding correct grammatical structures, resulting in weak expressions, poor choice of vocabulary, repetition and redundancies.

Both successful and unsuccessful learners used the language strategies with equal frequency. No evidence was found that more successful students used more language learning strategies than the less successful ones. The results showed that the language strategies were used at a medium level, and the number of errors in academic writing increased in spite of strategy use. Memory strategies were among the least frequently used strategies.
Furthermore, the skills of application, analysis, synthesis and evaluation were not mastered by university undergraduates who participated in the study; hence, the strategies which involve more analytical activities were not used frequently by students.

Many factors influence the success of language learners. However, a focus on cognitive and language skills is required for academic success. Research indicates that successful language learners are aware of the strategies they use and why they use them (Green & Oxford, 1995; O’Malley & Chamot, 1990), and they generally tailor their learning strategies to the language task and to their own personal needs as learners (Wenden, 1986). Ellis and Sinclair (1989) also suggested that learners can achieve their goals by focusing their attention on the process so that they can become more effective learners and take on more responsibility for their own learning.

This study has shown that second language learners in Fiji are not quite aware of their learning strategies. There is a need for further research into the language learning strategies of Fijian students with a larger sample size and from institutions at all levels: primary, secondary and tertiary. Other factors must be explored to determine what is impacting academic language proficiency (or lack of it) among undergraduate students of Fiji.

There is a need to consciously teach language learning strategies in teacher training courses so that teachers can then integrate strategy use and training in their lessons. All teachers, irrespective of the subjects they teach, should be able to identify strategies by name, describe them and model them. Strategy training should be integrated within the curriculum rather than taught as a separate entity, and it should start with beginner students even if this means providing the training in the students’ first language. Students need to have experience with a variety of strategies to be able to choose those that work well for them. In case of failure in language learning, students need to be assured that their failure may not be due to lack of intelligence, but to the inability to choose appropriate strategies.

References

Al-Hebaishi, S. M. (2012). Investigating the relationships between learning styles, strategies and the academic performance of Saudi English majors. International Interdisciplinary Journal of Education, 1(8), 510-520.
Bharuthram, S. (2012). Making a case for the teaching of reading across the curriculum in higher education. South African Journal of Education, 32, 205-214. Retrieved from: http://www.ajol.info/index.php/saje/article/viewFile/76602/67051
Boyce, A. (2009). The effectiveness of increasing language learning strategy awareness for students studying English as a second language. Master’s dissertation, Auckland University of Technology, New Zealand.
Bremner, S. (1999). Language learning strategies and language proficiency: Investigating the relationship in Hong Kong. Asia Pacific Journal of Language in Education, 1(2), 490-514. Retrieved from: http://utpjournals.metapress.com/content/d27w088833436k7x/
Brown, H. D. (2007). Principles of language learning and teaching (5th ed.). White Plains, NY: Pearson Education.

Chamot, A. U. (2004). Issues in language learning strategy research and teaching. Electronic Journal of Foreign Language Teaching, 1(1), 14-26. Retrieved from: http://e-flt.nus.edu.sg
Chamot, A. U. (2005). Language learning strategy instruction: Current issues and research. Annual Review of Applied Linguistics, 25, 112-130.
Cohen, A. D. (1998). Strategies in learning and using a second language. Essex, UK: Longman.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Crystal, D. (1997). The Cambridge encyclopedia of language (2nd ed.). Cambridge: Cambridge University Press.
Deneme, S. (2008). Language learning preferences of Turkish students. Journal of Language and Linguistic Studies, 4(2), 83-93.
Dreyer, C., & Oxford, R. (1996). Learning strategies and other predictors of ESL proficiency among Afrikaans speakers in South Africa. In R. Oxford (Ed.), Language learning strategies around the world: Cross-cultural perspectives (pp. 61-74). Honolulu: University of Hawai‘i at Manoa.
Ehrman, M., & Oxford, R. (1989). Effects of sex differences, career choice, and psychological type on adult language learning strategies. The Modern Language Journal, 73(1), 1-13.
Ehrman, M., & Oxford, R. (1990). Adult language learning styles and strategies in an intensive training setting. The Modern Language Journal, 74(3), 311-327.
Ehrman, M., & Oxford, R. (1995). Cognition plus: Correlates of language learning success. The Modern Language Journal, 79(1), 67-89.
Ellis, R. (1994). The study of second language acquisition. Oxford: Oxford University Press.
Ellis, G., & Sinclair, B. (1989). Learning to learn English. Cambridge: Cambridge University Press.
Fillmore, L. W. (1982). Instructional language as linguistic input: Second language learning in classrooms. In L. C. Wilkinson (Ed.), Communication in the classroom (pp. 283-296). New York: Academic Press.
Goh, C. C. M., & Kwah, P. F. (1997). Chinese ESL students’ learning strategies: A look at frequency, proficiency and gender. Hong Kong Journal of Applied Linguistics, 2(1), 39-54.
Grabe, W. (1991). Current developments in second language reading research. TESOL Quarterly, 25(3), 375-406.

250

Chapter Twelve

Graddol, D. (1997). The future of English? Retrieved from: http://www.britishcouncil.org/learning-elt-future.pdf Green, J. M., & Oxford, R. (1995). A closer look at learning strategies, L2 proficiency, and gender. TESOL Quarterly, 29(2), 261-297. Grenfell, M., & Harris, V. (1999). Modern languages and learning strategies: In theory and practice. London: Routledge. Griffiths, C. (2004). Language learning strategy use and proficiency: The relationship between patterns of reported language learning strategy (LLS) use by speakers of other languages (SOL) and proficiency with implications for the teaching/learning situation. Doctoral thesis, University of Auckland, New Zealand. Retrieved from: https://researchspace.auckland.ac.nz/bitstream/handle/2292/9/02whole. pdf?sequence=6 Hong, K. (2006). Beliefs about language learning and language learning strategy use in EFL context: A comparison study of monolingual Korean and bi-lingual Korean-Chinese university students. Doctoral thesis, University of North Texas, USA. Hsiao, T., & Oxford, R. (2002). Comparing theories of language learning strategies: A confirmatory factor analysis. Modern Language Journal, 86(3), 368-383. Ketabi, S., & Mohammadi, A. M. (2012). Can learning strategies predict language proficiency? A case in Iranian EFL context. International Journal of Linguistics, 4(4), 407-418. Retrieved from: http://www.macrothink.org/journal/index.php/ijl/article/view/2914 Lessard-Clouston, M. (1997). Language learning strategies: An overview for L2 teachers. The Internet TESL Journal, 3(12). Retrieved from: http://iteslj.org/Articles/Lessard-Clouston-Strategy.html Lukhele, B. S. B. (2009). Exploring relationships between reading attitudes, reading ability and academic performance among teacher trainees in Swaziland. Doctoral thesis, University of South Africa, South Africa. Retrieved from: http://uir.unisa.ac.za/bitstream/handle/10500/3435/dissertation_lukhele _bs.pdf Michaelides, N. N. (1990). 
Error analysis: An aid to teaching. English Teaching Forum, 28(4), 28-30. Naiman, N., Frohlich, M., Stern, H., & Todesco, A. (1978). The good language learner. Research in Education Series No. 7. Toronto: The Ontario Institute for Studies in Education. Nakatani, Y. (2006). Developing an oral communication strategy inventory. Modern Language Journal, 90(2), 151-168.

What is the Impact of Language Learning Strategies

251

Oberholzer, B. (2005). The relationship between reading difficulties and academic performance among a group of foundation phase learners who have been: (1) identified as experiencing difficulty with reading and (2) referred for reading remediation. Master’s dissertation, University of Zululand, South Africa. Retrieved from: http://uzspace.uzulu.ac.za/bitstream/handle/10530/238/ O’Malley, J. M., Chamot, A. U., Stewner-Manzanares, G., Russo, R. P., & Kupper, L. (1985). Learning strategy applications with students of English as a second language. TESOL Quarterly, 19(3), 557-584. O’Malley, J. M., & Chamot, A. U. (1990). Learning strategies in second language acquisition. Cambridge: Cambridge University Press. Oxford, R. L. (1990). Language learning strategies: What every teacher should know. Boston: Heinle and Heinle Publishers. —. (1993). Instructional implications of gender differences in second/foreign language learning styles and strategies. Applied Language Learning, 4(1-2), 65-94. Oxford, R., & Nyikos, M. (1989). Variables affecting choice of language learning strategies by university students. The Modern Language Journal, 73(3), 291-300. Peacock, M., & Ho, B. (2003). Student language learning strategies across eight disciplines. International Journal of Applied Linguistics, 13(2), 179-200. Pretorius, E. J. (2002). Reading ability and academic performance in South Africa: Are we fiddling while Rome is burning? Language Matters: Studies in the Languages of Africa, 33(1), 169-196. Rahimi, M., Riazi, A., & Saif, S. (2008). An investigation into the factors affecting the use of language learning strategies by Persian EFL learners. Canadian Journal of Applied Linguistics, 11(2), 31-60 Retrieved from: http://www.aclacaal.org/Revue/vol-11-no2-art-rahimi-riazi-saif.pdf Richards, J. C., Plott, J., & Platt H. (1996). Dictionary of language teaching and applied linguistics. London: Longman. Roehr, K. (2004). 
Exploring the role of explicit language in adult second language learning: Language proficiency, pedagogical grammar and language learning strategies. Working paper no. 59, Centre for Research in Language Education, Lancaster University. Rubin, J. (1975). What the “good language learner” can teach us. TESOL Quarterly, 9(1), 41-51. Scarcella, R. (2008). Defining academic language. Paper presented at the National Clearinghouse for English Language Acquisition Web

252

Chapter Twelve

Conference, August 21st, University of California, Irvine, USA. Retrieved from: http://www.ncela.us/files/webinars/1/scarcella_8-21-08.pdf Scollon, R., & Scollon, S. W. (2004). Nexus analysis: Discourse and the emerging Internet. London: Routledge. Tam, K. C. (2013). A study on language learning strategies (LLSs) of university students in Hong Kong. Taiwan Journal of Linguistics, 11(2), 1-42. Tsan, S.-C. (2008). Analysis of English learning strategies of Taiwanese students at National Taiwan Normal University. Educational Journal of Thailand, 2(1), 84-94. Retrieved from: http://www.edu.buu.ac.th/journal/journalinter/Second%20EJT_08/9suz an.pdf Wenden, A. L. (1986). Incorporating learner training in the classroom. System, 14(3), 315-325. Wu, Y.-L. (2008). Language learning strategies used by students at different proficiency levels. Asian EFL Journal, 10(4), 75-95.

CHAPTER THIRTEEN

SPEAKING PRACTICE IN PRIVATE CLASSES FOR THE TOEFL IBT TEST: STUDENT PERCEPTIONS

RENATA MENDES SIMÕES

Abstract

This chapter presents a research study conducted in an English for Specific Purposes (ESP) one-to-one course focusing on speaking skills, in order to find out whether the course met the students’ learning needs and prepared them to take the Test of English as a Foreign Language–Internet-based Test (TOEFL iBT). The study was grounded in the theoretical principles of teaching ESP, needs analysis, task-based teaching, and language assessment. The instruments for the data collection were: initial and final questionnaires; an audio recording of two speaking tasks on the first and last day of class; and the teacher-researcher’s diaries, written at the end of every class, containing the students’ perceptions of their performance in class. The results revealed the students’ satisfaction with the course methodology and material, as well as the students’ perception of improvement in their speaking and writing skills. The students’ narratives also indicated the importance of teacher-student interaction and praised the attention given by the teacher to their emotional aspects. The study contributes to the field of ESP and language assessment, and helps to fill the research gap that exists in the teaching of speaking skills in private classes.

Introduction

The increasing number of students seeking to pursue graduate courses in English-speaking countries has led to an unprecedented demand for the Test of English as a Foreign Language–Internet-based Test (TOEFL iBT). Many universities in English-speaking countries require international candidates to show proof of language proficiency through a minimum TOEFL iBT score.

The current literature on teaching and learning in private lessons in preparation for standardized language proficiency tests is sparse. Although there is a wide range of books, academic papers, and courses in language schools aimed at preparing students for the TOEFL iBT, there is little research on the preparation of students in private lessons. According to Dudley-Evans and St. John (1998), private classes can be considered the purest form of teaching English for Specific Purposes (ESP), because the needs of each student determine their learning. Moreover, Hutchinson and Waters (1987) suggest that ESP should be considered an approach to language teaching, and they emphasize that the learning should be based on the students’ needs. In other words, all the decisions as to the content and methods of teaching in an ESP course should be based on the students’ needs.

This chapter presents the findings of a research study carried out in São Paulo, Brazil, with 17 students attending private preparatory classes for the TOEFL iBT proficiency test. The course was developed by the researcher and it was adapted and modified for each student.

Background

English for Specific Purposes (ESP)

According to Hutchinson and Waters (1987), ESP courses aim at helping students perform adequately in the target-situation, i.e., the situation in which they are going to use the language being learned. The main characteristics of this teaching approach are the student’s awareness of why he/she is learning the language and the satisfaction of the student’s needs for language use in the new context. Similarly, Dudley-Evans and St. John (1998) consider ESP an approach. They claim that ESP teaching should use a methodology different from those used in general English teaching. The greatest concern regarding ESP teaching should be the needs analysis and the preparation of students to communicate efficiently in tasks related to their studies or work (Dudley-Evans & St. John, 1998).

This view is also corroborated by Basturkmen (2010), who states that ESP teaching involves the discussion of the students’ needs and the role these needs will play in their working and studying environments. Basturkmen (2010) also emphasizes that ESP courses aim at teaching the language and communication abilities that specific groups of students will need to make themselves understood in working, studying and social contexts. Therefore, ESP courses need to focus on teaching language and communicative abilities.

Needs Analysis

According to Hutchinson and Waters (1987), a needs analysis should gauge the students’ learning needs and not the teachers’ teaching needs. For them, the difference between an ESP course and a general English course is not so much the nature of the need, but the awareness of such need. This is one of the most important aspects of ESP course design, which should be divided between the target-situation needs (what the student must do in the target-situation), which can be further subdivided into necessities, lacks and wants, and the student learning needs (what the student should do to learn).

Dudley-Evans and St. John (1998) also consider needs analysis extremely important for ESP courses, as it allows for a much more focused course. They claim that needs analysis is the process to determine “what to do” and “how to do” a course (Dudley-Evans & St. John, 1998, p. 121). The data collection for the needs analysis can be carried out through questionnaires, interviews, surveys, assessments and discussions. Long (2005, p. 19) states that “in changing times, educators are increasingly relying on their needs analysis results in order to develop new courses.” But he also warns that the respondents are usually the students themselves, who are not always aware of what they will need in the target language (L2). One example is international students who are preparing to attend graduate courses in English-speaking countries.

Tasks in ESP

For Willis (1996), task-based teaching should: stimulate students to use the target language collaboratively and meaningfully; allow students to participate in a complete interaction and to use different communication strategies; and help students develop self-confidence to reach their communicative goals. Based on the consensus of several researchers and educators, Skehan (1998) suggested four criteria to define a task: (i) the meaning is essential; (ii) the focus is on the objective; (iii) the task product must be assessed; and (iv) there must be a relation to the real world. A similar concept was also proposed by Willis (1996), for whom tasks are activities in which the target language is used by the learners with a communicative objective in order to reach a result. Willis (1996) highlights that task-based teaching should give the learners the freedom to choose their linguistic form in order to reach an objective, that is, to convey their ideas. This way, language is used as a vehicle to reach the objectives of the task with emphasis on meaning and communication, not on correct production of linguistic forms.

Timing is an important aspect because one of the main characteristics of the TOEFL iBT is the limited time (a few seconds) test-takers have to prepare their answers before carrying out the tasks. Willis (1996) mentions that learners must know how to start a task and how long they have to prepare and carry it out. Ellis (2003, p. 127) discusses the effects of planning, which he calls “strategic planning”, on the learners’ oral production during communicative tasks. He defines strategic planning as:

“The process by which learners plan what they are going to say or write before commencing a task. Pre-task planning can attend to propositional content, to the organization of information or to the choice of language. Strategic planning is also referred to as pre-task planning.” (Ellis, 2005a, p. 50)

Based on Skehan (1998), Ellis (2003) recommends a set of criteria to determine the level of fluency, accuracy and complexity of foreign language production in communicative tasks (Table 13-1).

Table 13-1: Classification of production variables in oral communication tasks (Adapted from Ellis, 2003, p. 117).

Aspects       Measures
Fluency       Number of words and syllables per minute
              Number of pauses (one/two seconds or longer)
              Number of repetitions and reformulations
              Number of words per turn
Accuracy      Number of self-corrections
              Percentage of error-free clauses
              Use of verb tenses/articles/vocabulary/plurals/negatives
              Ratio of definite and indefinite articles
Complexity    Number of turns per minute
              Lexical richness
              Amount of subordinate clauses
              Frequency of use of prepositions and conjunctions
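To make two of the fluency measures in Table 13-1 more concrete, the following is a minimal sketch of how words per minute and the number of pauses could be counted from a time-stamped transcript. The data format, the `Word` class, the `fluency_measures` function, the one-second pause threshold and the sample response are all hypothetical assumptions made for illustration; they are not part of the evaluation form used in the study.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with start and end times in seconds (hypothetical format)."""
    text: str
    start: float
    end: float


def fluency_measures(words, pause_threshold=1.0):
    """Return words per minute and the number of pauses at or above the threshold."""
    if not words:
        return {"words_per_minute": 0.0, "pauses": 0}
    # Speaking time runs from the start of the first word to the end of the last one.
    duration = words[-1].end - words[0].start
    # A pause is the silent gap between the end of one word and the start of the next.
    pauses = sum(
        1
        for prev, nxt in zip(words, words[1:])
        if nxt.start - prev.end >= pause_threshold
    )
    wpm = len(words) / duration * 60 if duration > 0 else 0.0
    return {"words_per_minute": wpm, "pauses": pauses}


# Invented sample response, for illustration only.
sample = [
    Word("I", 0.0, 0.2),
    Word("prefer", 0.3, 0.7),
    Word("studying", 0.8, 1.4),
    Word("alone", 2.6, 3.0),    # preceded by a gap of about 1.2 seconds
    Word("because", 3.1, 3.6),
    Word("it", 3.7, 3.8),
    Word("is", 3.9, 4.0),
    Word("quiet", 5.5, 6.0),    # preceded by a gap of about 1.5 seconds
]
result = fluency_measures(sample)
print(result)
```

The other measures in the table (for example, error-free clauses or subordinate clauses) require human judgment or linguistic annotation and are not covered by this sketch.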

Ellis (2005a) suggests that, when learners have the opportunity to make use of strategic planning before the task, their language production will be more fluent and show more complexity. Although there are many studies about strategic planning in language laboratories and classrooms, Ellis highlights the need for more studies to verify the benefit of such planning in exam contexts. The effects of strategic planning in exam contexts can be somewhat different from those in classroom settings. Ellis (2005b) mentions that one reason might be that learners feel pressure in assessment contexts, so the results might not be the expected ones. Preparation for, and adjustment to, the restricted time for answering the test tasks therefore receives special attention in a preparatory ESP course.

Tests and Assessments

Language assessments play an essential role in Applied Linguistics, operationalizing its theories and supplying researchers with data for their analysis of language use (Clapham, 2000). McNamara (2000) also discusses this issue, noting that language tests are of great importance to many people, working as gateways at key educational and employment moments. This is the case of the TOEFL iBT, which determines whether international students are accepted at English-speaking universities. It is therefore a high-stakes test, as it is a prerequisite for student admission to graduate studies. High-stakes exams require long preparation because the result will be critical for the academic future of these learners.

Proficiency tests such as the TOEFL iBT or the International English Language Testing System (IELTS) are not based on course content. Instead, they assess the adequacy of the students’ language for future use in the target situation and their ability to attend academic courses delivered through the medium of English (Jordan, 1997).

The concept of washback (or backwash), discussed by Scaramucci (2011) among many other scholars, is another very important aspect related to preparatory courses. She explains that the concept refers to how external exams (especially high-stakes ones, such as college entrance exams and some language proficiency tests) can potentially have an impact on the teaching and learning process, the curriculum, materials development and the attitudes of students, teachers and the school involved in the exams. Washback in teaching and learning is undeniable and it is out of the control of test developers (Scaramucci, 2009).

This chapter looks at a preparatory course for the TOEFL iBT which was influenced by the test, as it was developed taking into account the test constructs, how the test is administered, and above all, the students’ needs. The aim of the research study was to investigate how this ESP course met the learning needs of the students and prepared them to take the TOEFL iBT in a one-to-one class environment.

The Study

This is a qualitative exploratory research study. The main research questions that guided the study were:

1. What are the students’ needs with regards to taking the TOEFL iBT?
2. How do students perceive their language development during the TOEFL iBT preparatory course?
3. How do students perceive the TOEFL iBT preparatory course?

The investigation strategy underlying the research questions was the case study, as proposed by Stake (1988) and Johnson (1992), as it is a research approach which allows the investigation of a specific situation within a specific context. Stake (1988) considers the case study not only a methodological choice, but mainly the choice of the object to be studied. He adds that the main feature of case studies is the presence of the researcher in the context, the contact and the direct involvement of the researcher with the activities of the case, always reviewing and reflecting about the events.

The research context was the preparatory course for the TOEFL iBT, developed by the researcher. The course comprised private lessons focusing on the speaking tasks of the test. The data collection was carried out over a period of 24 months.

Participants

The participants of this study were 17 students attending the preparatory course for the TOEFL iBT. They were young adults, 7 female and 10 male, mostly (71%) between 21 and 30 years old. All but one were graduate students. Most of them had advanced levels of English, and only three were at an intermediate proficiency level. Their language proficiency level (basic, intermediate, intermediate/advanced or advanced) was classified informally on the first day of class, taking into consideration their grammar level, their capacity to express themselves without hesitations and their vocabulary mastery. In order to protect the students’ identity, their names have been omitted and each student is identified by the letter ‘A’ followed by a number, from 1 to 17.

Data Collection

The study data collection instruments and the procedures for the data collection helped to answer the three research questions. The data collection was divided into three consecutive phases. Table 13-2 provides a summary of the data collection procedures in each phase of the project, and the procedures in each of these phases are explained in detail in the following sections.

Table 13-2: Data collection procedures in each phase of the study.

Phase   Procedure
1st     initial questionnaire
        recording of student’s initial speaking task
        assessment of student’s initial speaking task
2nd     interview at the end of each class
3rd     recording of student’s final speaking task
        assessment of student’s final speaking task
        final questionnaire
        student’s score on the TOEFL iBT test

Phase 1

The initial questionnaire was used to find out and analyze the students’ learning needs. From the tabulation of the data, it was possible to understand the target audience profile and their needs, and to tailor the course content for each student. The initial recording of each student’s oral production in response to sample Tasks 1 and 2 of the TOEFL iBT Speaking Section aimed to provide data for each student on aspects such as fluency, lexical richness, accuracy, time used to formulate answers, and attitude and reaction to the limited time for responses. Task 1 of the TOEFL iBT Speaking Section asks the test-taker to give a personal opinion about a topic, and Task 2 asks the test-taker to make a personal choice between two options and give reasons and examples.

The recordings were assessed according to criteria similar to those used by the Educational Testing Service (ETS), which is responsible for the TOEFL iBT, and the theoretical framework by Ellis (2003) (see Table 13-1 above) regarding fluency, complexity and accuracy of oral production. Based on these criteria, an evaluation form was developed for the speaking task output. As in the TOEFL iBT, the scores ranged from 1 to 4 (1 = weak, 2 = fair, 3 = good, and 4 = excellent).

Phase 2

The second phase of data collection occurred at the end of each class. An interview consisting of three open-ended questions was carried out with each participant, in order to understand and evaluate the perceptions of students with regards to their oral production difficulties and the activities they performed during that lesson. The three interview questions were:

1. What did you think of today’s lesson activities?
2. What difficulties did they present for you?
3. How was your performance today compared to the last class?

After the transcription of all the answers, the data were classified into three categories defined a priori, i.e., activities, difficulties and performance (see Bardin, 2011). The themes emerged after the analysis of all the responses, which were initially grouped by similarity of content. The topics that were most often mentioned, and that later guided the analysis, were: cognition, affect and methodology.

Phase 3

In the third and final phase of data collection, the final questionnaire was used in order to find out whether the course had met the students’ specific needs raised at the beginning of the course, and the level of support the course offered them to take the TOEFL iBT. Also, the students’ performance on sample Tasks 1 and 2 of the TOEFL iBT Speaking Section was recorded. The content of this recording was compared with the initial speaking task recording and provided information for the analysis of the development of students’ speech production. As with the initial oral production, the evaluation of these final tasks was performed using the same evaluation form. The analysis of the participation and performance of the 17 students aimed at evaluating the adequacy of the course from the students’ perspective and assessing their linguistic ability.

Results and Discussion

Students’ Needs

The data from the initial questionnaire showed that the most frequently cited reason for choosing the ESP course was its specific focus (44%), followed by the belief that one-to-one tutoring is more efficient than classroom teaching and learning (26%). In terms of their course expectations, most students reported that they wanted to obtain the minimum score required for entering their desired study programs (71%), and 41% wanted to enhance their speech production. These data point to the importance of the ESP course for preparing candidates for a language proficiency test.

With regards to the test constructs, the study investigated how students felt when speaking English under time pressure. This information was taken into account during the lessons, as it was necessary to ensure that students were able to adapt to the time limitations of the test. Out of the 17 students, 11 confirmed in the initial questionnaire that they could not express themselves well under pressure. Students reported that they felt nervousness, discomfort, anxiety, and panic. From the answers to the initial questionnaire, it was possible to detect that fluency in expressing themselves in English and the development of complex ideas were the two items with which students most often reported difficulty (29%).

Students were also asked to report how they evaluated each of their English language skills using the TOEFL iBT scale. The data from this question helped to better tailor the course to the needs of the students. Table 13-3 presents the results of the students’ answers.

Table 13-3: Student perceptions of their language skills before the course.

Skill        Excellent   Good   Fair   Weak   Total
Reading          9         6      2      0      17
Listening        7         8      2      0      17
Speaking         0        11      6      0      17
Writing          3         7      4      3      17

Note: N = 17

The results showed that all students considered their oral production either fair (6 students) or good (11 students) before the course started, a good indication that they needed to improve this skill during the course. It should be noted that, as all the students needed to reach a minimum score of 85% in the test, a speaking ability rated as merely ‘good’ would not be enough to reach the desired score. This need was also highlighted in the initial questionnaire, as their main reason for attending the course was the enhancement of their speaking ability. Writing skills were also worked on extensively throughout the ESP course, as this was the only skill rated as weak (3 students). Interestingly, students reported having greater difficulty in language production (speaking and/or writing) and less difficulty in language comprehension (listening and/or reading).

Students’ Performance

With regards to student language development and performance, the final questionnaire data showed that fluency and vocabulary were still a problem for students at the end of the course. Although 35% of students mentioned fluency as the greatest difficulty in speaking English even after attending the course, and 29% signaled a lack of vocabulary as something that would still hinder their oral production, the vast majority (71%) felt more confident at the end of the course (see Figure 13-1). More measurable factors, such as lack of vocabulary or grammar, caused students less concern, or were even minimized, when compared to more subjective factors, such as fluency and objectivity when describing details and reasons in the strictly timed answers. Interestingly, all these factors are interrelated, because the grammar and vocabulary level will influence the fluency and objectivity of the answers within the 45 seconds allowed in the TOEFL iBT.

At the end of the course 47% of students claimed to feel more confident and fluent in English (see Figure 13-2). The high number of answers related to greater confidence (71%) shows that one of the main initial difficulties of the students was overcome by the end of the course.

Figure 13-1: Difficulties with language at the course (N = 17). [Bar chart; vertical axis: Number of answers.]

Figure 13-2: Student perceptions of their language ability at the end of the course (N = 17).

The increase in students’ self-esteem was always one of the objectives and concerns during the course, as the emotional factor influences students’ performance. Furthermore, helping students gain more confidence meets the assumptions of task-based learning, as Willis (1996) claims it is essential to develop students’ confidence so that they reach their communicative goals.

In the final questionnaire, students were asked to rate (on a scale of 0 to 5) their level of learning as a result of the activities in the regular classes. Figure 13-3 shows the results.

Figure 13-3: Student self-perceived learning (N = 17). [Bar chart; vertical axis: Number of answers; horizontal axis: ratings from 5 (highest) to 0 (lowest).]

Results showed that 7 students (41%) rated their learning with the highest score (5) and 9 students (47%) each gave a 4 for their language ability at the end of the course. According to these results, it can be inferred that the ESP course met the needs mentioned by the students in the initial questionnaire at the beginning of the course.

In both the initial and final questionnaires, students were asked to rate their four language skills. This question offered four response options: excellent, good, fair and poor. The following two figures (Figure 13-4 and Figure 13-5) show the evolution of this perception from the students’ point of view, while Table 13-4 compares the initial with the final student perceptions.

Figure 13-4: Students’ perceptions of their language skills before the course (N = 17). [Bar chart of Reading, Listening, Speaking and Writing ratings across the categories Weak, Fair, Good and Excellent.]

Figure 13-5: Students’ perceptions of their language skills after the course (N = 17). [Bar chart of Reading, Listening, Speaking and Writing ratings across the categories Weak, Fair, Good and Excellent.]

Table 13-4: Change in students’ perceptions (N = 17).

                     Perception
Skill        Same    Better    Worse
Reading       12        4        1
Listening     12        4        1
Speaking       9        7        1
Writing        5       11        1

Looking at the results above, it can be noted that students perceived a general improvement in all four skills. The number of students (11) who felt an improvement in their written production was 57% higher than the number of students (7) who noticed an improvement in their oral production. With regards to speaking skills, which were the focus of the ESP course, eight of the 17 students perceived their speaking skills as “good” both at the beginning and at the end of the course, while two of them considered their speaking skills excellent at the end of the course.

Performance on both Task 1 and Task 2 was assessed according to criteria set a priori (general description, delivery, language, topic development, and organization), on a scale from 1 (weak) to 4 (excellent). With regards to the outcome on both Tasks at the start and at the end of the course, Figure 13-6 shows the average improvement for each student (A1 to A17).

Figure 13-6: Student improvement in Task 1 and Task 2 (N = 17). [Bar chart titled “Overall Progress”; one bar per student (A1 to A17); scale from 0% to 70%.]

In terms of the extent to which the course had met students’ expectations, almost all students (94%) had their expectations exceeded or totally met. To examine to what extent the course contributed to the students’ language development according to their expectations, students’ initial expectations and needs were contrasted with their perceptions at the end of the course. Initially, 41% of students needed and wished to improve their speaking skills, and at the end of the course 33% of the students stated that their oral production had improved significantly, and three of the students were positively surprised with how much they had improved.

At the end of the study, and from a careful analysis of the data (see Figures 13-4 and 13-5), it was possible to quantify the improvement perceived by students in all four language skills. All findings were compiled into a single spreadsheet; the data were then compared and analyzed with the goal of finding the most usual pattern of improvement among the different students, as well as the most common correlations among the skills studied. Table 13-5 summarizes the results of this analysis. Surprisingly, the improvement in the perception of written production was much higher (38.7%) than the improvement in oral production (17.6%), although the latter was the main focus of the preparatory course for the TOEFL iBT.

Table 13-5: Average perceived overall improvement in language skills.

Skill        Before    After    Increase
Reading       1.65      1.47     12.0%
Listening     1.71      1.53     11.5%
Speaking      2.35      2.00     17.6%
Writing       2.53      1.82     38.7%

Note: N = 17

The average perception of all students for each of their skills was taken into account, both at the beginning and at the end of the course. From these values, the median variation was calculated. As the figures in the table above show, the comprehension skills, i.e., reading and listening, increased by only 12.0% and 11.5% respectively in the students’ own assessment. Despite its narrow focus, the course was able to help students improve more than one communicative ability: even though it focused on the practice of speaking skills, the wide variety of materials and the extra writing content used to support this focus may also have helped students develop their other language skills, such as writing.


TOEFL iBT Test Scores

The total score for the TOEFL iBT Test is 120 points, composed of a maximum of 30 points for each skill (Reading, Listening, Speaking and Writing). Table 13-6 below shows the final scores obtained by each of the students who attended the ESP course.

Table 13-6: Students’ final TOEFL iBT score.

Student   Score aimed   Score obtained   Reading   Listening   Speaking   Writing
A1        90            94               24        25          23         22
A2        90            106              26        26          26         28
A3        90            102              24        29          27         22
A4        80            85               22        22          21         21
A5        100           104              24        26          26         28
A6        100           102              25        27          24         26
A7        100           106              29        29          25         23
A8        90            89               23        24          20         22
A9        80            84               20        21          21         22
A10       100           100              25        27          26         22
A11       100           94               26        21          23         24
A12       100           99               26        24          26         22
A13       109           114              30        30          27         27
A14       100           85               20        18          20         20
A15       100           111              29        28          26         28
A16       80            92               24        28          20         20
A17       80            91               19        25          23         24

According to the students’ scores above, it can be said that considerable learning took place during the course, as the majority of the students (13 of them, or 76%) reached or exceeded the score they needed to be accepted into their graduate courses. Among the four students who did not reach the desired score, two had an intermediate level of English with grammar difficulties, and the other two reported serious anxiety issues during the test.
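The attainment figure can be re-tallied from Table 13-6. The short Python sketch below is not part of the original study; it only recounts the target and obtained scores as transcribed from the table, and the function and variable names are illustrative assumptions.

```python
# Target ("aimed") and obtained total scores for students A1..A17,
# transcribed from Table 13-6.
aimed = [90, 90, 90, 80, 100, 100, 100, 90, 80, 100, 100, 100, 109, 100, 100, 80, 80]
obtained = [94, 106, 102, 85, 104, 102, 106, 89, 84, 100, 94, 99, 114, 85, 111, 92, 91]


def attainment(aimed_scores, obtained_scores):
    """Return (number who reached or exceeded their target, share of the group)."""
    reached = sum(1 for a, o in zip(aimed_scores, obtained_scores) if o >= a)
    return reached, reached / len(aimed_scores)


reached, share = attainment(aimed, obtained)
course_mean = sum(obtained) / len(obtained)

print(f"{reached} of {len(aimed)} students reached their target ({share:.0%})")
print(f"Mean total score for the course: {course_mean:.1f}")
```

In this tally a score equal to the target (student A10) is counted as reaching it, which is the reading under which 13 of the 17 students attain their goal.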


Every year since 2006, ETS has published a report with the average scores of test-takers around the world. The figures cover each skill and the total score. The worldwide average total score in the period when the present study took place was 80 points (see Table 13-7).

Table 13-7: Comparative table of the TOEFL iBT scores.

                              Total Score   Reading   Listening   Speaking   Writing
Overall Average (2006-2013)   80            20        20          20         21
Brazil Average (2006-2013)    85            21        22          21         21
ESP Course Average            98            24        25          24         24

As can be seen in Table 13-7, the Brazilian test-takers’ average was 85 points, 6% above the world average. When comparing the Brazilian test-takers’ scores between 2006 and 2013 with the scores of the 17 students of the ESP course, it is apparent that the average for the participants in this study (98 points) is 15% higher than the average for the Brazilian test-takers (85 points). Thus, it can be concluded that the course met its goal, i.e., to offer high-quality teaching with a focus on the specific needs of the students.
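The two percentages quoted above are relative differences between the total-score averages in Table 13-7. As a minimal sketch of that arithmetic (the helper name `percent_above` is an assumption introduced here, not something taken from the study):

```python
def percent_above(value, baseline):
    """Relative difference of value over baseline, in percent."""
    return (value - baseline) / baseline * 100


# Total-score averages from Table 13-7.
world_avg = 80
brazil_avg = 85
course_avg = 98

brazil_vs_world = percent_above(brazil_avg, world_avg)
course_vs_brazil = percent_above(course_avg, brazil_avg)

print(f"Brazil vs. world average: {brazil_vs_world:.1f}% higher")
print(f"Course vs. Brazil average: {course_vs_brazil:.1f}% higher")
```

Rounded to whole numbers, these values correspond to the 6% and 15% reported in the text.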

Conclusion

Through the data obtained in the initial needs analysis, I had the opportunity to investigate students’ needs and tailor the ESP course to achieve certain goals, aligned with the needs and weaknesses of each student. Based on the initial questionnaire, it was possible to detect that for most students (53%) oral fluency was the greatest difficulty. After the course, it was found that this problem was indeed overcome, not only in terms of students’ perceptions, but also in terms of the evaluation of their speaking skills in the initial and final speaking tasks. The study also showed that students experienced an increase in their self-esteem and self-confidence to express themselves in English. The feedback and the encouragement given to students during the lessons proved to be effective, as they generated confidence and comfort and lowered students’ stress levels.

The data on general development as perceived by the students were essential in determining the appropriateness of the course. Reading and listening comprehension showed nearly equal gains (12.0% and 11.5%, respectively). Surprisingly, the results also showed an increase of 38.7% in the students’ perception of their writing skills, which were not the primary focus of the ESP course.

Research on the ESP approach in the form of individual one-to-one lessons is scarce, especially research that combines one-to-one tutoring and preparation for standardized proficiency tests, which is why this research study makes a significant contribution to the field of language teaching and assessment. It is hoped that this work can be an important tool for all those involved in the teaching and learning of ESP courses: not only those responsible for course design, but also teachers who are in direct contact with students, as it provides data on course design focusing on the development of speaking skills and on methodological procedures in the context of teaching and learning languages.

Acknowledgement

This research study received support from CAPES, the Brazilian Agency for the Development of Graduate Studies.


CHAPTER FOURTEEN

ASSESSING THE LEVEL OF GRAMMAR PROFICIENCY OF EFL AND ESL FRESHMAN STUDENTS: A CASE STUDY IN THE PHILIPPINES

SELWYN A. CRUZ AND ROMULO P. VILLANUEVA JR.

Abstract

The Far Eastern University (FEU) has one of the highest numbers of international undergraduate students among Philippine universities. A considerable number of these students are Korean students of English as a foreign language (EFL) who are enrolled in general education courses, such as English language classes which are also attended by Filipino learners for whom English is a second language (ESL). Recognizing the constantly increasing population of international students in the university, this research study intended to compare the English grammar proficiency of learners from two Asian English variants, namely Philippine English and Korean English. In fulfilling the objectives of the study, 30 Korean and 30 Filipino students were randomly selected to answer a 130-item grammar test based on the syllabus of their course, namely, Introduction to Language Arts English (Eng AN). Recommendations and implications for English language teaching and learning are also discussed in the study.

Introduction

The term ESL (English as a Second Language) is attributed to the use of English in countries like the Philippines and India where English is used in daily activities but not as the main language. On the other hand, English is said to function as a foreign language (EFL) in countries where it is not used in everyday situations; China, Japan and Korea are commonly cited as EFL contexts (Kirkpatrick, 2007). Kachru (2005) redefined the concept of EFL and ESL through three concentric circles. The Inner Circle refers to the sociolinguistic origins of English where it is used as the native language. The United States, Canada and the United Kingdom are examples of countries in the Inner Circle. The Outer Circle represents countries like the Philippines and Singapore, which were previously colonized by native English speaking countries and are now using English as their second language. Lastly, the Expanding Circle pertains to the users of English as a foreign language. Countries in the Expanding Circle include those in eastern Asia such as Korea and China.

In the Philippines, 93.5% of the population is said to speak and understand English well (Magno, 2010), and Philippine English is considered a legitimate variant of World Englishes with distinct features significant enough to be an area of research. Llamzon’s (1969) monograph is believed to be the pioneer study to conceptualize what he termed Standard Filipino English. It highlighted how the speech forms of the speakers of Filipino English are considered a sub-variety of English. More importantly, it stressed that Filipino English is the type of English used by the educated Filipino. Subsequently, Gonzalez (1985) and Bautista (2000) have greatly contributed to the rich literature on the existence of Philippine English. At present, it seems unquestionable that Philippine English exists not just as a sub-variety of American English but as a legitimate World English variety (Borlongan, 2011; Martin, 2012).

Korean English, like Philippine English, has also received considerable attention (Kachru, 2006). English in South Korea developed in the 1990s when it was made a mandatory subject at schools starting with elementary education as a result of the English Education Policies of Korea and the arrival of native English speakers as educators in the 1990s (Chang, 1990). At present, there are prominent studies concerning the existence of Konglish, though criticised by some as merely codified Korean English (Shim, 1999), and these include the corpus-based works of Jung and Min (1993; 1999). Overall, it appears that language contact resulting in apparent changes in its language system could be a major contributor to the prominence of the identity of Korean English (Young, 2008).


Background

Looking at the concept of grammatical proficiency, Chomsky posited that grammatical competency involves one’s knowledge of grammar. Hymes (1972), however, considered this concept inadequate and elaborated that the grammatical proficiency Chomsky was referring to, alongside the meaning or value of one’s utterance, is part of what is termed communicative competence. Canale and Swain (1980) supported this idea, describing communicative competence as:

“a synthesis of knowledge of basic grammatical principles, knowledge of how language is used in social settings to perform communicative functions, and knowledge of how utterances and communicative functions can be combined according to the principles of discourse.” (p. 20)

In view of the recognition of grammatical competency as important in achieving communicative competency, the current study focused on examining ESL and EFL learners’ grammar knowledge. Haegeman (1991) defines grammar as the rules and principles of a language that provide a systematic description of sentence formation. Fromkin, Rodman and Hyams (2003) state that knowledge of grammar comprises a learner’s knowledge of the rules of the language, including unconscious linguistic knowledge. Meanwhile, Taylor (1988) defines proficiency as one’s ability to use competence. Overall, it could be said that grammatical proficiency as a term refers to the explicit awareness of grammatical rules. As Shanklin (1995) proposed, grammatical proficiency is the “ability to make judgments about the acceptability and appropriateness of an utterance with specific reference to grammatical notions” (p. 149).

Several studies have been conducted to investigate the grammatical proficiency of learners. For instance, Diab (1998) used error analysis in an attempt to reveal what kind of grammatical, lexical, semantic and syntactic errors were caused by the negative transfer of the learners’ mother tongue (Arabic) into the learners’ second language (English). With 73 Lebanese native speakers of Arabic who were taking an intermediate-level English course as participants, the study found that the use of prepositions posed the most difficulty among Arabic-speaking students in Beirut, Lebanon (247 errors out of 558 grammar errors made by the participants). In addition, he argued that these errors might have been caused by negative transfer from the students’ mother tongue.

Bautista (2000) identified certain grammatical features of Philippine English that deviate from American Standard English. For instance, it was noted that Filipinos tend to choose results to, consist in and specialized on instead of results in, consist of and specialized in. It was stressed in the study that the variations could be broadly explained by the fact that the Filipino language does not possess distinctive items to represent different prepositions such as on, in, at, to, and toward, as there is basically one generic preposition, sa, used to signify the equivalent English prepositional concepts. The five grammatical aspects which Bautista found to be prevalent in the mistakes committed by Filipino speakers of English in their speech or writing were included in the current study.

Bauman (2010), conducting a study on Korean English, collated the prominent pronunciation, grammatical and syntactic errors of Korean students of English by means of classroom observation. By analysing the speech utterances and written production of Korean students, Bauman found that Koreans have difficulty with their use of tenses, and of definite and indefinite articles.

Filipino and Korean students’ explicit and implicit knowledge were studied (Cruz, 2011) to confirm Bautista’s (2000) findings on the grammatical features of Philippine English. After a study with sixty randomly selected first year college students, it was concluded that Filipino and Korean learners of English appeared to have identical features in their grammatical errors. Further, it was found that the students had the ability to use their explicit and implicit knowledge when a certain type of grammar test was given to them.

In Nezami and Najafi’s (2012) study of common error types of Iranian learners of English, it was revealed that the group with low scores in the Test of English as a Foreign Language (TOEFL) also got low scores in terms of grammar in their essays, where difficulties in the use of articles, verb formation, and singular and plural noun formation were found.
Meanwhile, in Li’s (2005) study of EFL and ESL college learners’ grammatical and lexical collocational errors in essays, it was found that the students had almost identical grammatical errors that were mostly concerned with verbs followed by certain prepositions and prepositions followed by a noun, and these errors outnumbered the lexical collocational errors.

The present study aims to contribute to the field of English language teaching (ELT) and to the study of Asian Englishes. Specifically, the current study aims to compare the grammar proficiency of Korean and Filipino students through an objective grammar test in order to gain an understanding of the areas of grammar that are easy and those that are difficult for students. Furthermore, the current research could provide empirical evidence for revising the English language syllabus intended for first year students.

The Study

A total of 60 first year students, 30 Filipino and 30 Korean, from various disciplines, enrolled in Eng AN or Comm Arts 1 at the Far Eastern University (FEU), were the participants of this study. Only those taking the course for the first time were chosen to take part in the study in order to ensure the reliability of results, since prior exposure to the course materials could result in apparent differences in performance. First year students were chosen because the syllabus for first year students mainly concentrates on grammar. Convenience sampling was used in the study because filtering the entire population of freshman students in terms of class standing and proficiency level was not feasible at the time that the study was conducted.

The Filipino participants consisted of 9 males and 21 females from the Institute of Arts and Sciences enrolled in the Bachelor of Science Major in Medical Technology. All of these participants were students of one of the researchers. The Korean participants were all part of a Filipino for Foreigners class, which was handled by a colleague. The Korean participants comprised 19 males and 11 females who were taking different courses. The researchers were not able to gather a homogeneous group of Korean participants in terms of the courses they were enrolled in because of the relatively small number of foreigners in each class compared to Filipinos. The Filipino participants’ ages ranged from 15 to 18 years while the Koreans were from 17 to 20 years old. There were no specific requirements for students to be part of the study apart from being enrolled in the Eng AN class. At the time the data were gathered, the students were about to have their midterm examination; hence, a good portion of the syllabus was expected to have been taught in the class already.

Instruments

A 130-item grammar test covering the parts of speech and other aspects of the English language that require the use of rules was used to collect data. The grammatical aspects in the test were mostly based on the syllabus of the Basic English course (Eng AN or Comm Arts 1), which is a course for all first year students in FEU. There was an average allocation of five test items per grammatical aspect, as suggested by Brown (2005) for language testing. The researchers took the questions from the grammar book of Folse, Ivone and Pollgreen (2005) and modifications were made for contextualization. There were numerous targeted grammatical aspects to be examined in the current study; hence, an objective type of test was needed for convenience in marking and analysis. The test contained multiple-choice questions, cloze tests, gap-fill exercises, and fill-in-the-blank questions. A pilot test was administered with second-year English Language students (21 Filipinos, 7 Koreans and 2 Chinese). The Filipino students obtained a general mean of 101.64 while the Koreans had a mean score of 96.08. Minor modifications were made to the test to make the questions more suitable for the first year students.

Procedures

The test was administered separately to each group. The Filipinos were given the test during their Eng AN class. The class was composed of 45 students, so 30 students were randomly selected to take the test, while the remaining 15 students were asked to perform a classroom task. The Korean learners, on the other hand, were given the test during their Filipino for Foreign Students class. At that time, there were exactly 30 students enrolled in the class. The students were given forty minutes to answer the test since the participants of the pilot test had been able to complete it in 25 to 35 minutes. After the test, the researchers marked all of the test papers on two separate days. Two colleagues helped in verifying the reliability of the marks. The marks were re-checked for possible errors and a recount of the test scores was also conducted.

Data Analysis

The researchers made use of descriptive statistics to analyze the data. In addition, the researchers devised a scale in order to measure the level of grammar proficiency of the participants (see Table 14-1). The researchers also analyzed the participants’ prominent mistakes. The mistakes committed were collated and tabulated.


Table 14-1: A scale for measuring grammar proficiency.

Score     5 Items   10 Items   15 Items   Interpretation
105-130   5         9-10       14-15      Highly proficient
79-104    4         7-8        11-13      Proficient
53-78     3         5-6        7-10       Average
27-52     2         3-4        3-6        Less proficient
1-26      1         1-2        0-2        Deficient
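The scale in Table 14-1 can be read as a simple lookup from a score to a label. The sketch below is one hypothetical way to encode it; the chapter does not say how fractional mean scores were mapped to bands, so rounding down is an assumption made here (it appears consistent with the labels reported in Tables 14-3 and 14-4), as is treating anything below the "Less proficient" band as "Deficient". The names `BANDS` and `interpret` are illustrative only.

```python
import math

# Lower bounds of each band from Table 14-1, keyed by the number of items.
# Each list runs from the highest band to the lowest.
BANDS = {
    130: [(105, "Highly proficient"), (79, "Proficient"), (53, "Average"), (27, "Less proficient")],
    15: [(14, "Highly proficient"), (11, "Proficient"), (7, "Average"), (3, "Less proficient")],
    10: [(9, "Highly proficient"), (7, "Proficient"), (5, "Average"), (3, "Less proficient")],
    5: [(5, "Highly proficient"), (4, "Proficient"), (3, "Average"), (2, "Less proficient")],
}


def interpret(score, n_items):
    """Map a raw or mean score to its Table 14-1 label.

    Assumption: fractional (mean) scores are rounded down before lookup,
    and anything below the "Less proficient" band counts as "Deficient".
    """
    whole = math.floor(score)
    for lower_bound, label in BANDS[n_items]:
        if whole >= lower_bound:
            return label
    return "Deficient"


print(interpret(100.67, 130))
print(interpret(8.7, 10))
print(interpret(2.87, 5))
```

Under a round-to-nearest rule, a mean of 8.7 on a 10-item area would instead land in the top band, so the choice of rounding matters at band boundaries.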

Results and Discussion

Table 14-2 shows that there was a difference between the mean scores of the participants, with the Filipino students scoring higher than the Korean students. A t-test was performed to examine the mean difference and it was found to be statistically significant (p = .026).

Table 14-2: Mean and standard deviation of participants’ total score based on nationality.

Nationality   Mean     Standard Deviation   Standard Error Mean   t-test   p-value
Korean        92.90    11.174               2.066                 2.28     .026
Filipino      100.67   14.615               2.672
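For readers who want to see how a t statistic follows from summary figures like those in Table 14-2, the sketch below recomputes one from the reported means and standard deviations. It is an illustration only: the chapter does not state which variant of the t-test was used, so a pooled-variance independent-samples test with 30 participants per group is assumed, and the function name `t_from_summary` is hypothetical. Because the inputs are rounded, the result (roughly 2.3) is close to, but not identical with, the reported value of 2.28.

```python
import math


def t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic (pooled variance) from summary statistics."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, n1 + n2 - 2


# Means and standard deviations from Table 14-2; 30 participants per group.
t_stat, df = t_from_summary(100.67, 14.615, 30, 92.90, 11.174, 30)
print(f"t({df}) = {t_stat:.2f}")
```

With equal group sizes, the pooled-variance and unequal-variance (Welch) forms give the same t statistic and differ only in their degrees of freedom.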

Tables 14-3 and 14-4 show the means and standard deviations per grammar topic for each group of participants (ESL and EFL) with the corresponding interpretations from the devised scale. Based on the mean scores of each group in each area, the ESL students achieved a proficient level in 11 areas of grammar (Table 14-3), while the EFL students achieved a proficient level in seven areas of grammar (Table 14-4). The ESL and EFL students achieved an average level in 10 and 11 areas of grammar, respectively. It is also interesting to note that the EFL participants seem to be less proficient in the present progressive tense, modals and articles, while the ESL participants appear to be less proficient in adverbs of frequency. Sentence errors seem to be the most difficult area of grammar for the EFL learners, whose mean score in this area fell in the deficient band.


Table 14-3: Means and standard deviations for the ESL participants.

ESL (Filipino Students)

Grammar area                       # of items   Mean     Standard Deviation   Interpretation
Nouns                              15           9.93     2.638                Average
Verbs                              5            4.57     0.728                Proficient
Adjectives                         5            4.07     1.202                Proficient
Identification of Parts of Speech  10           6.47     1.737                Average
Present tense of be                5            4.97     0.182                Proficient
Present tense of regular verbs     10           7.9      2.006                Proficient
Past tense of be                   5            3.47     2.08                 Average
Past tense of irregular verbs      5            4.13     1.252                Proficient
Present progressive tense          5            3.63     1.866                Average
Past progressive tense             5            3.9      1.029                Average
Perfect tenses                     5            3.27     1.413                Average
Troublesome verbs                  5            4.5      0.682                Proficient
SVA                                5            4.3      0.794                Proficient
Prepositions                       5            3.13     1.252                Average
Correlative Conjunctions           5            4.63     0.556                Proficient
Articles                           5            3.77     1.223                Average
Adverbs of frequency               5            2.87     0.73                 Less proficient
Modals                             5            3.67     1.061                Average
Demonstratives                     5            3.67     1.322                Average
Possessive Adjectives              5            4.53     1.042                Proficient
Sentence Errors                    5            4.13     1.042                Proficient
Subordinate Conjunctions           5            4.07     1.337                Proficient
Total Score                        130          100.67   14.615               Proficient


Table 14-4: Means and standard deviations for the EFL participants.

EFL (Korean Students)

Grammar area                       # of items   Mean     Standard Deviation   Interpretation
Nouns                              15           11.67    2.869                Proficient
Verbs                              5            4.63     0.765                Proficient
Adjectives                         5            4.4      1.276                Proficient
Identification of Parts of Speech  10           6        1.287                Average
Present tense of be                5            4.77     0.43                 Proficient
Present tense of regular verbs     10           8.7      2.322                Proficient
Past tense of be                   5            3.5      1.757                Average
Past tense of irregular verbs      5            3.77     1.357                Average
Present progressive tense          5            2.43     2.269                Less proficient
Past progressive tense             5            3.73     1.202                Average
Perfect tenses                     5            3.07     1.337                Average
Troublesome verbs                  5            4.13     0.629                Proficient
SVA                                5            3.07     1.048                Average
Prepositions                       5            3.57     1.331                Average
Correlative Conjunctions           5            3.83     1.367                Average
Articles                           5            2.03     1.189                Less proficient
Adverbs of frequency               5            3.3      1.149                Average
Modals                             5            2.8      0.61                 Less proficient
Demonstratives                     5            3.87     1.383                Average
Possessive Adjectives              5            4.73     0.521                Proficient
Sentence Errors                    5            1.83     1.315                Deficient
Subordinate Conjunctions           5            3.07     1.202                Average
Total Score                        130          92.9     11.174               Proficient


Based on the findings, both EFL and ESL learners achieved a proficient level in the following areas of grammar: present tense of be, possessive adjectives, identification of verbs, adjectives, and troublesome verb pairs (i.e., confusing verbs such as ‘sit’ and ‘set’, ‘rise’ and ‘raise’). However, they seemed to have difficulty with the use of articles, identifying sentence errors, subject-verb agreement, use of present progressive and perfect tenses, and prepositions. The results of the current study confirmed the findings in the studies of Nezami and Najafi (2012), Bauman (2010), Li (2005), Bautista (2000), and Diab (1998) that errors in the use of articles and prepositions seem to be among the most common mistakes committed by learners of English.

Analysis of EFL and ESL Learners’ Mistakes

Articles

The majority of the participants in the current study had difficulty in using definite and indefinite articles. This finding is supported by the studies of Bauman (2010), Nezami and Najafi (2012) and Bautista (2000). Nezami and Najafi (2012) found that the EFL learners (Iranian) in their study had difficulty with the use of articles, while Bautista found that one of the prevalent mistakes committed by Filipino speakers of English was the incorrect use of articles. These were some of the most frequent mistakes in the use of definite and indefinite articles:

1. When is the football game? Oh! It’s on the/a Tuesday.
2. Where’s an apple? I put it on the table.
3. I put it in a refrigerator.

The learners seem to interchange the with a, and it also appears that they tend to use an article when it is not needed.

Present Progressive Tense

Both EFL and ESL learners had difficulty using the present progressive tense. The following mistakes were committed by the participants in the items on the present progressive tense:

1. He can’t repair your car now. He eats lunch.
2. Sheng writes a letter to her parents now.
3. Joan uses the computer now.
4. The examinee takes the test now.


Based on the participants’ answers, it could be gleaned that the learners tend to mistake an on-going action for a habitual action. It is probable that they did not recognize the time marker of the present progressive tense, which is explicit in the sentences. The participants seem to have applied the rule that a subject must agree with its verb; however, they seem to have overlooked the role of tense in sentence construction.

Adverbs of Frequency

The test items on adverbs of frequency produced a high number of mistakes in both groups: the ESL participants averaged 57.4% on these items, their lowest score in any area, and the EFL participants averaged 66.0%. Students were asked to match the adverbs of frequency with the phrase that has the same meaning. Most of the ESL participants understood the meaning of ‘rarely’ as some of the time, while the EFL learners understood it as much of the time. The adverb of frequency ‘often’, which means most of the time, was understood by both EFL and ESL learners as usually.

Identification of Parts of Speech

Participants were given a set of sentences and were asked to identify the part of speech of the underlined words. Based on the results of the test, both EFL and ESL participants seemed to confuse nouns with verbs, and EFL students tended to interchange adjectives with adverbs.

Sentence                                Participants’ answers
1. Dina wants to water her plants.      Noun
2. I hate to sit in the dark.           Adj (EFL) / Adv (ESL)
3. He runs fast.                        Adj
4. She works hard for her family.       Adj

Modals

Based on the results, modals could be considered an area of grammar with which both EFL and ESL learners had difficulty. Here are some of the mistakes committed by the participants with regard to the use of modals:

1. Excuse me. Shall you pass the salt? (EFL)
2. This cake is too sweet. It couldn’t/wouldn’t have so much sugar in it. (EFL)
3. This cake is too sweet. It didn’t have so much sugar in it. (ESL)
4. When can you like to go shopping? Anytime is OK with me. (EFL)


EFL learners used shall instead of will and can instead of would. Both EFL and ESL learners used couldn’t/wouldn’t or didn’t have instead of shouldn’t, ignoring the previous sentence that gave them the clue.

Prepositions

The findings of the current study confirmed the study by Bautista (2000) that prepositions are one aspect of grammar ESL learners have difficulty with. Based on the results of the test, the Filipino students scored lower (62.6%) than the Korean learners (71.4%) in the use of prepositions. The following examples illustrate some of the errors students made in the use of prepositions:

1. Marilyn was born at 1990. (EFL)
2. Marilyn was born on 1990. (ESL)
3. We lived at Green Street when we were children. (EFL and ESL)
4. …but never on the same time. (ESL)
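The percentage scores quoted for prepositions correspond to each group's mean score on the five preposition items (Tables 14-3 and 14-4) expressed as a share of the items. A minimal sketch of that conversion, with a helper name (`percent_score`) that is an assumption of this illustration:

```python
def percent_score(mean_score, n_items):
    """Express a mean raw score as a percentage of the items in that area."""
    return mean_score / n_items * 100


# Mean scores on the five preposition items (Tables 14-3 and 14-4).
esl_prepositions = percent_score(3.13, 5)
efl_prepositions = percent_score(3.57, 5)

print(f"ESL: {esl_prepositions:.1f}%  EFL: {efl_prepositions:.1f}%")
```

The same conversion applied to the adverbs-of-frequency means (2.87 and 3.3 out of 5) gives the 57.4% and 66.0% cited for that area.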

Subject-Verb Agreement

Subject-verb agreement was problematic for EFL learners, with the majority of them committing errors in the items that tested this grammatical rule, as shown in the examples:

1. One of the students in the class are a scholar.
2. Either the teacher or the adviser are going to attend the workshop.
3. The poor is blessed.
4. Down the street lives the poor flower vendors.

Students had difficulty locating the real subject of the sentence. For example, in sentence (4) they mistook ‘street’ for the subject rather than ‘vendors’.

Overall, the Korean EFL learners achieved a less proficient level in a greater number of grammar areas than the Filipino ESL learners. This could be attributed to the fact that Korean EFL learners studying in the Philippines have less need to speak the English language after their classes finish because they interact almost exclusively with their fellow Korean students using their mother tongue.


Conclusion

This small-scale study supports Kachru’s (2005) model in which learners of English differ in various aspects. The study highlights how Korean students studying in the Philippines need to be closely monitored in their English language learning progress since they are in an environment which may not be very conducive to learning English, given that the majority of the learners in the class they attend come from the Outer Circle. A special English class for the Korean students is needed in order to effectively address their English language needs. However, this might pose problems since the Korean students are learning two foreign languages simultaneously (i.e., English and Filipino).

Despite its limitations, this study may offer some insights to English teachers on how to modify their courses to further address the grammar deficiencies of both EFL and ESL learners. Knowing the grammar areas where both EFL and ESL learners are less proficient would give English teachers a better idea of how to design their lessons and grammar activities in order to address these issues. Finally, it is recommended that teachers incorporate in their teaching strategies activities that highlight the use of the target grammar in communicative situations.

Acknowledgement

This study received financial support from the University Research Center of the Far Eastern University.

References

Bauman, N. (2010). A catalogue of errors made by Korean learners of English. Paper presented at the Annual Conference of the Korea Teachers of English to Speakers of Other Languages (KOTESOL), October 26th-28th, Seoul, South Korea.
Bautista, M. L. S. (2000). Defining standard Philippine English: Its status and grammatical features. Manila: De La Salle University Press.
Borlongan, A. M. (2010). On the management of innovations in English language teaching in the Philippines [Editorial commentary]. TESOL Journal, 2(2), 1-3.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1-47.
Chang, B. (2011). The origin and development of Asian Englishes. Retrieved from: http://paaljapan.org/conference2011/mate/PAAL2011Program.pdf
Cruz, S. (2011). Implicit and explicit knowledge of College of Education Majors. Master’s dissertation, De La Salle University, Philippines.
Diab, N. (1998). The transfer of Arabic in the English writings of Lebanese students. The ESP, Sao Paulo, 18(1), 71-83.
Folse, K., Ivone, J., & Pollgreen, S. (2005). 101 clear grammar tests: Reproducible grammar tests for ESL/EFL classes. Michigan: University of Michigan Press.
Fromkin, V., Rodman, R., & Hyams, N. (2003). An Introduction to language. Boston: Wadsworth.
Gonzalez, A. (1985). Studies on Philippine English. Singapore: SEAMEO Regional Language Centre.
Haegman, L. (1991). Introduction to government and binding theory. Oxford: Blackwell.
Hymes, D. H. (1972). On communicative competence. In J. B. Pride & J. Holmes (Eds.), Sociolinguistics (pp. 269-293). Baltimore, USA: Penguin Education, Penguin Books Ltd.
Jung, K., & Min, S. J. (1999). Some lexico-grammatical features of Korean English newspapers. World Englishes, 18(1), 23-37.
Kachru, B. (2005). Asian Englishes. Hong Kong: Hong Kong University Press.
Kachru, B. (2006). Asian Englishes today: World Englishes in Asian context. Hong Kong: Hong Kong University Press.
Kirkpatrick, A. (2007). World Englishes: Implications for international communication and English language teaching. Cambridge: Cambridge University Press.
Li, C. (2005). A study of collocational error types in ESL/EFL college learners’ writing. Master’s dissertation, Ming Chuan University, Taiwan.
Llamzon, T. (1969). Standard Filipino English. Quezon City: Ateneo de Manila University Press.
Magno, C. (2010). Korean students’ language learning strategies and years of studying English as predictors of proficiency in English. TESOL Journal, 39(2), 39-61.
Martin, I. (2012). Identity and communication: A framework for teaching English in the Philippines. Retrieved from: http://www.dlsu.edu.ph/library/newsette/201009_10.pdf
Nezami, A., & Najafi, M. S. (2012). Common error types of Iranian learners of English. English Language Teaching, 5(3), 160-170. Retrieved from: http://www.ccsenet.org/journal/index.php/elt/article/view/15276
Shanklin, M. (1995). The communication of grammatical proficiency. In L. Varga (Ed.), The even yearbook 1994 (pp. 147-173). Dept. of Linguistics, SEAS: ELTE.
Shim, R. J. (1999). Codified Korean English: Process, characteristics and consequence. World Englishes, 18(2), 247-258.
Taylor, D. (1988). The meaning and use of the term ‘competence’ in linguistics and applied linguistics. Applied Linguistics, 9(2), 148-168.
Young, R. F. (2008). English and identity in Asia. Asiatic, 2(2), 1-13.

CHAPTER FIFTEEN

METHODOLOGY IN WASHBACK STUDIES

GLADYS QUEVEDO-CAMARGO AND MATILDE VIRGINIA RICARDI SCARAMUCCI

Abstract

This chapter reviews studies on the washback of language assessment from 2004, when Cheng, Watanabe and Curtis published the first book on washback methodology, to 2012. Taking into account Alderson and Wall’s (1993) call for empirical evidence about the phenomenon and for a more ethnographic approach to its investigation, this review aimed to investigate the researchers’ methodological options during this period. By means of an electronic search, 78 studies from 31 countries were identified. The analyses showed that Alderson and Wall’s words were heard, as the identified works significantly diversified the ways of investigating washback by involving different stakeholders, by using a variety of data collection instruments such as document analysis, questionnaires, interviews and classroom observation, and by adopting quantitative and mixed approaches of investigation.

Introduction

Washback in language learning, that is, the impact or influence that external exams, particularly high-stakes exams, as well as achievement tests may have on language teaching and learning processes, the curriculum, material design and stakeholders’ attitudes (Scaramucci, 2004), is a relatively new concept (Cheng, 2008). Studies carried out mainly after the 1990s consolidated the idea that washback is a frequent, complex and highly important phenomenon that involves several
stakeholders that are affected differently by the same assessment instrument.

In 1993, Alderson and Wall called attention to the fact that the studies on washback conducted until then were methodologically limited, particularly due to the absence of classroom evidence. Their article, Does washback exist?, was the first to question washback more deeply. The authors considered it a highly complex phenomenon that had been treated in a naïve and simplistic way, with little empirical evidence. This was the first article to discuss methodological alternatives for washback investigations and to put forth a research agenda and fifteen hypotheses to guide future research and provide the basis for the development of a theory about the concept.

In 2004, Alderson acknowledged a substantial accumulation of empirical evidence produced from 1993 to the early 2000s which confirmed the existence of washback and of a theory about it. Although the findings confirmed the complexity of the phenomenon, they also raised new questions. According to Alderson, “the question today is not ‘does washback exist?’ but rather what does washback look like? What brings washback about? Why does washback exist?” (Alderson, 2004, p. ix). We might also add ‘How does washback work in the different contexts?’.

Bearing the authors’ words in mind, a literature review was conducted to investigate whether those words had been taken into account by researchers of the washback phenomenon. Aiming to identify the researchers’ methodological options for investigating language assessment washback, this review builds on the state-of-the-art washback studies from 2004, when the first book on the topic was published (Cheng, Watanabe, & Curtis, 2004), to 2012.

The next section briefly explains the methodology used in this review, followed by data presentation and discussion. The chapter concludes with some thoughts on how the methodological options used in the studies contribute to a better understanding of the concept and to future research.

Literature Review Methodology

This review was carried out by means of search engines such as Google Scholar (http://scholar.google.com.br), websites such as Language Testing (http://ltj.sagepub.com), among others, and a Brazilian Academic Journal website called CAPES (http://www.periodicos.capes.gov.br). The following key words were used, both in English and Portuguese: impact, washback and backwash, combined with assessment, foreign languages, testing, language, test and exam. Overall, 78 studies published from 2004 to 2012 were identified.
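The keyword combinations described above can be enumerated mechanically. The following is a minimal sketch rather than the procedure actually followed in the review: it assumes that each of the three core terms was paired with each of the six qualifiers, and the function name `build_queries` is hypothetical. Only the English terms are listed, since the Portuguese equivalents are not given in the text.

```python
from itertools import product

# Core terms and qualifiers as listed in the chapter (English versions only).
CORE_TERMS = ["impact", "washback", "backwash"]
QUALIFIERS = ["assessment", "foreign languages", "testing",
              "language", "test", "exam"]


def build_queries(core_terms, qualifiers):
    """Return one search string per (core term, qualifier) pair (hypothetical helper)."""
    return [f"{core} {qualifier}" for core, qualifier in product(core_terms, qualifiers)]


queries = build_queries(CORE_TERMS, QUALIFIERS)
for query in queries:
    print(query)
```

Under this pairing assumption, three core terms and six qualifiers yield eighteen search strings per language.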


Data Presentation and Discussion

In order to provide an overview of the most used instruments to tackle washback, the studies reviewed will be presented as follows: (1) geographical distribution; (2) year of publication and number of research studies per year; (3) most investigated exams; and (4) methodological issues.

Geographical Distribution

Table 15-1 presents the distribution of the studies by continent. It should be noted that some of them collected data in more than one country. In this case, the country where the researchers worked was considered.

Table 15-1: Studies distributed by continent.

Continent        Studies      %
Asia                  12   38.7
Europe                11   35.5
South America          3    9.7
North America          2    6.4
Oceania                2    6.4
Africa                 1    3.3
Total                 31    100

Note: Mean = 5.16; Standard Deviation = 4.95.
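The descriptive figures attached to Table 15-1 can be recomputed from the six counts. The sketch below is illustrative only. It assumes that the Note reports the sample standard deviation (n - 1 denominator), which the chapter does not state; under that assumption the results agree with the Note to within the last reported digit. Shares rounded independently may likewise differ in the final digit from the printed percentages.

```python
from statistics import mean, stdev

# Counts per continent as printed in Table 15-1.
counts = {
    "Asia": 12, "Europe": 11, "South America": 3,
    "North America": 2, "Oceania": 2, "Africa": 1,
}

values = list(counts.values())
total = sum(values)

# Share of the total for each continent, in percent.
shares = {continent: 100 * n / total for continent, n in counts.items()}

# Mean and sample standard deviation (assumed n - 1 denominator) across continents.
mean_count = mean(values)
sd_count = stdev(values)

for continent, share in shares.items():
    print(f"{continent:<15}{counts[continent]:>3}{share:>7.1f}")
print(f"Total: {total}, mean: {mean_count:.2f}, SD: {sd_count:.2f}")
```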

As observed, the concern with language assessment washback is present on all continents, mainly in Asia and Europe, where several countries have official compulsory language exams. In Asia, the places where the studies were conducted were China and Hong Kong, Israel, Iran, Japan, Jordan, Korea, Palestine, Pakistan, Taiwan, Turkey and the United Arab Emirates. In Europe, works from Cyprus, England, Germany, Greece, Hungary, Ireland, Poland, Romania, Slovenia, Spain and Sweden were identified. In South America, the studies were carried out in Brazil, Colombia and Uruguay, whereas in North America, research from Canada and the United States was present. In Oceania, the countries where research was conducted were Australia and New Zealand, and in Africa, only one work, produced in Egypt, was found.


Year of Publication and Studies per Year

As Figure 15-1 shows, the year with the highest number of publications on washback was 2009.

[Bar chart. Number of studies per year: 2004, 11; 2005, 7; 2006, 8; 2007, 10; 2008, 11; 2009, 13; 2010, 5; 2011, 6; 2012, 7.]

Figure 15-1: Studies per year of publication.

We could speculate that the growing number of studies from 2005 to 2009 was due to the publication of the book by Cheng, Watanabe and Curtis in 2004, which may be considered the first book on washback in language assessment. Until then, only journal articles had been published.

Most Investigated Exams

The 78 studies investigated 63 high- and low-stakes exams. As for the high-stakes exams, the most frequent were the International English Language Testing System (IELTS), with eight studies, and the Test of English as a Foreign Language (TOEFL), with five studies. The other studies focused mainly on the official compulsory exams in their respective countries.

Methodological Issues

In 1993, Alderson and Wall stated:

“There has been a tendency to date to rely upon participants’ reported perceptions of events through interview and questionnaire responses, or to
examine the results and relationships of test performances. […] In addition, we believe it important […] to triangulate the researcher’s perceptions of the events with some account of participants’ of how they perceived and reacted to events in class, as well as outside class—this amounts to an advocacy of a more ethnographic approach to the topic than has been common to date.” (Alderson & Wall, 1993, p. 127)

It was, therefore, necessary to expand the methodological options in washback studies in relation to both participants and instruments so as to obtain more information about the characteristics of the washback phenomenon and possible factors for its existence. The 78 studies that were reviewed revealed that Alderson and Wall’s words were heard, because the researchers went well beyond measuring the participants’ perceptions.

Some of the studies are classified as qualitative, like Hung (2012), and others as quantitative, such as Green (2006). However, several of them used a mixed-methods research methodology, i.e., a mixture of qualitative and quantitative methods (e.g., Cresswell, 2009; Dörnyei, 2007). This may have been a consequence of the growing importance, in the 2000s, of quantitative methods in Applied Linguistics (Lazaraton, 2005) as well as of the understanding that qualitative and quantitative approaches are not antagonistic, but complementary (Scaramucci, 1995).

Alderson and Wall (1993) suggested a more ethnographic approach. It is possible to state that almost all the studies identified are ethnographic; the only exceptions were the ones that used document analysis exclusively. Ethnography is relevant for washback research as it allows the investigation of phenomenological data that represent the participants’ perspectives and the use of such perspectives to describe and reflect upon the phenomenon (Fetterman, 2008; Watanabe, 2004).

In addition, the majority of the studies triangulated researchers’ and participants’ perceptions, i.e., they used multiple methods to obtain more precise conclusions from a variety of angles (Lodico, Spaulding, & Voegtle, 2006). The triangulation involved stakeholders, sources of information (or documents) and data-collection instruments.

As far as the stakeholders are concerned, ten different ones were identified: teachers, students or examination candidates, teacher supervisors, school principals, coordinators, school supervisors, educational authorities, exam designers, parents and material writers. Retorta (2007), for example, involved eight different stakeholders. This variety of perspectives provides a much more comprehensive picture of the complexity of washback and gives voice to stakeholders who had not been heard in previous research.


As for sources of information, the following were identified: exams or previous exam guidelines and other publications about the exams, documents about previous applications of the exams, candidates’ statistical data and teaching materials. With respect to triangulation by instruments, the studies included from one to five instruments, as shown in Table 15-2.

Table 15-2: Number of instruments used in the studies.

Nr. of                            Year of Study
instruments  2004  2005  2006  2007  2008  2009  2010  2011  2012  Total      %
1               0     0     0     1     1     3     0     0     0      5    6.4
2               6     2     4     4     9     3     1     1     2     32   41.0
3               3     4     3     0     1     5     2     4     2     24   30.8
4               1     1     1     3     0     2     2     0     2     12   15.4
5               1     0     0     2     0     0     0     1     1      5    6.4
Total                                                                 78    100
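The marginal totals of Table 15-2 can be checked against its cells. The sketch below is an illustration only; the cell values are transcribed from the table, and the variable names are hypothetical. Row sums give the number of studies using one to five instruments, and column sums give the number of studies per year, which is the quantity plotted in Figure 15-1.

```python
# Rows: number of instruments (1-5); columns: years 2004-2012, as in Table 15-2.
years = list(range(2004, 2013))
table = {
    1: [0, 0, 0, 1, 1, 3, 0, 0, 0],
    2: [6, 2, 4, 4, 9, 3, 1, 1, 2],
    3: [3, 4, 3, 0, 1, 5, 2, 4, 2],
    4: [1, 1, 1, 3, 0, 2, 2, 0, 2],
    5: [1, 0, 0, 2, 0, 0, 0, 1, 1],
}

# Row sums: how many studies used 1, 2, 3, 4 or 5 instruments.
row_totals = {k: sum(row) for k, row in table.items()}
grand_total = sum(row_totals.values())
row_shares = {k: round(100 * t / grand_total, 1) for k, t in row_totals.items()}

# Column sums: how many studies were published in each year.
studies_per_year = {
    year: sum(table[k][i] for k in table) for i, year in enumerate(years)
}

print(row_totals)
print(row_shares)
print(studies_per_year)
```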

In total, the researchers used ten instruments, as shown in Figure 15-2, along with their frequency of use. These instruments will be discussed below.

Figure 15-2: Data-collection instruments in the reviewed studies.


Document Analysis

Documents are the basis for the majority of qualitative research studies (Schensul, 2008). Document analysis can be a main source of data or a complementary instrument, depending on the study object and aims (Lüdke & André, 1986). All 78 studies reviewed used information obtained from documents, even when the researcher did not explicitly mention this, as in Caine (2005), for instance. Shawcross (2007) and four other studies reported document analysis as the sole source of data collection. Others, such as Barletta and May (2006), mentioned it as a secondary source.

In the studies that reported document analysis as one of the instruments used, the way the analyses were conducted varied considerably, including quantitative approaches using data codification and qualitative approaches, in which the researcher predominantly selected relevant information and then offered possible interpretations of the data. Document analysis is therefore an instrument inherent to all research that aims at investigating exam washback, since a thorough understanding of the exam and of the educational context where the study is carried out is an essential condition for any conclusions about this phenomenon.

Questionnaire

According to Brown (2001, p. 6), “questionnaires are any written instruments that present respondents with a series of questions or statements to which they are to react either by writing out their answers or selecting from among existing answers.” In Dörnyei’s words (2003, p. 1), “the popularity of questionnaires is due to the fact that they are easy to construct, extremely versatile, and uniquely capable of gathering a large amount of information quickly in a form that is readily processable.” However, questionnaires are easy to process mainly when the format chosen allows the use of statistical procedures, as is the case with multiple-item scales, generally known as Likert scales. According to Fawcett (2008, p. 836), “a questionnaire adopting a quantitative orientation is concerned with systematically collecting quantifiable data relating to a number of variables. The purpose is to statistically examine the data to discover associations and possible patterns or trends.” Open- or semi-open-ended questions are also possible. In these cases, the answers can be analyzed qualitatively, by using content analysis for example, or quantitatively, by codifying the information to be processed by computer software (Dörnyei, 2003).
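To illustrate why multiple-item scales lend themselves to statistical processing, the following minimal sketch scores a small set of invented responses. Everything in it is an assumption made for illustration: the three items, the five-point range, the reverse-keyed item and the helper `scale_score` are hypothetical and are not taken from any of the reviewed studies.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 5  # assumed five-point scale: 1 = strongly disagree, 5 = strongly agree

# Hypothetical responses: one dict per respondent, one entry per item.
responses = [
    {"q1": 4, "q2": 5, "q3": 2},
    {"q1": 3, "q2": 4, "q3": 1},
    {"q1": 2, "q2": 2, "q3": 4},
]

# Hypothetical negatively worded item whose scores are reversed before averaging.
REVERSE_KEYED = {"q3"}


def scale_score(answer_sheet):
    """Average one respondent's items after reversing the negatively worded ones."""
    adjusted = [
        (SCALE_MIN + SCALE_MAX - value) if item in REVERSE_KEYED else value
        for item, value in answer_sheet.items()
    ]
    return mean(adjusted)


scores = [scale_score(sheet) for sheet in responses]
print(scores)
print(mean(scores))
```

Once each respondent is reduced to a single scale score in this way, the scores can be compared across groups or related to other variables, which is the kind of processing the quoted passages refer to.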


The review of the selected studies confirmed the popularity of questionnaires, as they were used by 64.1% of the works (50 out of 78). The questionnaires in these studies were administered to a very high number of participants in different locations, which constitutes one of the advantages of questionnaires (Cresswell, 2009; Dörnyei, 2007). Thus, they allowed the investigation of the washback phenomenon from the perspectives of different stakeholders and in many aspects of the teaching-learning process in contexts such as public secondary schools, universities and language schools. Besides, studies like Muñoz and Alvarez (2010), among others, expanded the concept of washback, which originally referred to the consequences of high-stakes exams (Alderson & Wall, 1993), to issues related, for example, to how much students understand their school assessment systems and their perception of the connections between educational goals and assessment.

The importance of questionnaires for the development of such studies is undeniable. However, it is relevant to mention that, like any instrument, they also have disadvantages. For example, the data obtained may not be totally reliable, as the respondents might answer what they think may please those in charge, particularly when respondents are identified or when the questionnaire is administered officially, i.e., by request of the school principal or by a researcher known to the respondents. When there are questions related to classroom teaching, for example, the teacher respondents may not want to expose themselves and may end up describing something that does not correspond to what really happens in class.

One cannot deny that social relations between the researcher and the participants are always asymmetrical (Bourdieu, 1998) and in the search for a neutral approach, it is necessary

“to consider, in detail, the situation in which the texts resulting from such procedures (interviews and questionnaires) are produced; as well as the (illocutionary and perlocutionary) values of the act of asking and the ways of asking which favour the diffusion of assumptions of the researcher about the required information.” (Machado & Brito, 2009, pp. 140-141; our own translation)

Thus, it is crucial that researchers look for means to minimize such risks by making the participants aware of the important role they play in the study and the relevance of the information they provide. Participants need to be assured of confidentiality, for instance by avoiding handwriting recognition or the identification of places where the questionnaire was administered. In this sense, the use of electronic tools has helped researchers, as they allow participants to answer without being identified.


In addition, studies should also consider the participants’ views about how their answers will be used, who is conducting the study and the genre used to collect data, which may or may not allow a certain degree of negotiation between researcher and participant (Machado & Brito, 2009). This leads us to what seems to be one of the main problems in the conception of questionnaires in a research context: the (highly disseminated) idea that researchers are neutral elements, free from bias or from influences by the participants or the research context of production. This notion exists clearly in relation to language, as it is wrongly believed that language is a transparent vehicle, a semiotic costume that would allow the direct achievement of the social actors’ views (Machado & Brito, 2009, p. 157).

Interview

Out of 78 studies reviewed, 41 (52.6%) used interviews which, according to Dörnyei (2007), provide versatility in data collection and reproduce a communicative routine that is familiar to people. Many were conducted via the internet, by means of synchronous or asynchronous communication, as in Wall and Horák (2006; 2008), although the majority of the interviews happened face to face, as in Erfani (2012). Informal conversations with the participants, as shown in Li (2009), were considered as interviews, as suggested by Brinkmann (2008). Brinkmann also states that the majority of the interviews in qualitative research are of the semi-structured variety, as this format allows the researcher to guide the conversations according to his/her interests and leaves room for a more spontaneous interaction and for the researcher’s intervention in the event that a topic needs further discussion.

The advantages of interviews include the possibility of clarification of specific issues, a more personal approach towards the interviewee and the possibility of obtaining non-verbal clues relevant to the study. The disadvantages are related to difficulties in making appointments with the participants, the need to travel to the place where the interviewee is when online interviews are not possible, time-consuming data collection and analysis (in the case of manual transcription of recorded interviews), and limitations in the number of interviewees to be used in the research. The comments made above with respect to the risks and precautions related to questionnaires apply to interviews as well.


Classroom Observation

Observation entails the systematic gathering of impressions about particular aspects of the world around us, with the aim of learning about a specific phenomenon. This instrument can be the main source of data, but is often used together with interviews and questionnaires. It is used in qualitative, quantitative or mixed-methods approaches (McKechnie, 2008).

The use of classroom observation to collect information about washback was strongly advocated by Alderson and Wall (1993), who stated that “it is increasingly obvious that we need to look closely at classroom events in particular, in order to see whether what teachers and learners say they do is reflected in their behaviour” (p. 127). Messick (1996) reinforces the importance of classroom observation when he says that such an instrument could be used to record changes in teachers’ and students’ behaviours associated with the introduction of a test, thus contributing to the collection of evidence of its consequential validity.

However, observing lessons is not an easy task. Watanabe (2004) states that:

“the observation task is divided into several subtasks, typically involving construction of observation instruments, pre-observation interviews, recording classroom events, and post-observation interviews. […] Before entering the classroom, a variety of information needs to be gathered about the school (e.g., educational policy, academic level, etc.) and the teacher whose lesson is to be observed (e.g., education, age/experience, major field of study, etc.).” (p. 30)

As shown in Figure 15-2, 30 (38.5%) out of the 78 studies reviewed used classroom observation. Nikoopour and Farsani (2012) assert that this instrument became highly important for investigating washback, as its biggest advantage is that it provides access to rich data on what really happens in the research context. The disadvantages are similar to those mentioned for interviews. Other issues to be considered include the need to obtain the educational authorities’ consent to enter the classroom and limitations related to observation time and the researcher’s own availability. It is also important to take into account the possible occurrence of the so-called Hawthorne effect, which is a change in a group’s behaviour due to the awareness that they are being observed (Coombs & Smith, 2003). This effect can be minimized when the observation period is longer, so that the group becomes increasingly familiar with the observer.


Pre- and Post-Tests

Pre- and post-tests are typical of quantitative research and are consequently very common in education when the objective is to investigate the implementation of educational innovations or to compare students’ performance after some kind of intervention (Dugard & Todman, 1995; Lodico, Spaulding, & Voegtle, 2006). Of the studies reviewed, 10.3% (8 studies), such as Saif (2006), mentioned pre- and post-tests. This study is a good example of how productive the use of these instruments can be to collect both qualitative and quantitative data about students’ performance before and after instruction, which is a great advantage, particularly when high-stakes exams are used. It allows one to ensure that the level of difficulty of the questions is appropriate for the students’ level and the purpose of the research study, while avoiding the difficulty of having to design tests that should be equivalent so as to be used in different administrations.

Focus Group

Focus groups consist of collective interviews, generally in groups of 6 to 12 people, conducted by a moderator (usually the researcher). This format produces a lot of data, as the participants are led to think together and share their thoughts and impressions, which prompts reactions and gives rise to rich discussions about the topic (Dörnyei, 2007). This is a powerful means of data collection as it allows the participants to engage in deeper conversations about issues relevant to their reality and to express different views, thus giving important information to the researcher (Morgan, 2008). This is one of its biggest advantages, besides others similar to those found with interviews. The disadvantages include more difficulties with transcriptions and data analysis, problems with finding a mutually convenient time slot, an imbalance in participants’ contributions, and the intimidation of some participants in front of others.

In this review, 7.7% (6) of the studies used focus groups. Fox and Cheng (2007) reported using only focus groups, whereas others like El-Ebyary (2009) combined focus groups with other instruments. Such combinations are very productive because different instruments will not necessarily provide identical data about the same participants (Morgan, 2008).


Diary

Diaries consist of participants’ reports of their daily activities and objective experiences (Smith-Sullivan, 2008). They were used in 5.1% (4) of the studies. The studies show that diaries have the merit of capturing data that are normally inaccessible or unobservable by other means and can be used together with other instruments (Cresswell, 2009; Dörnyei, 2007), as can be seen in Tsagari (2006). In terms of the disadvantages, one should consider that the responsibility for producing data rests heavily on the participants, which requires their awareness and preparation to write frequently and carefully. In addition, data accuracy may be at risk, and this requires the researcher to carefully plan both the choice of participants and the guidelines that the participants must adhere to.

Researcher’s Writing

This instrument refers to the daily record of impressions and reflections made by the researcher based on his/her observations (Smith-Sullivan, 2008). Only one study, Li (2009), collected data with researcher’s writing. Researcher’s writing is analogous to field notes normally produced in classroom observations. These notes are generally not considered a separate instrument, but a kind of by-product of the observation.

Student Follow-up

Used in Li’s (2009) study only, student follow-up corresponds to the close observation of classroom extension activities developed by students. The use of this instrument is not very common as it demands closer contact between the researcher and the participants. This is probably its main limitation, along with the fact that it is only possible with a small number of participants or with a large research team. This data collection method can be considered a parallel technique to classroom observation as it focuses on what is visible in terms of students’ study habits, which is highly relevant when it comes to investigating washback effects on students’ learning. Student follow-up seems to be appropriate for ethnographic research. The data it generates are predominantly qualitative and are based on criteria that may or may not be pre-established; they can complement information stemming from other data sources.


Conversational Analysis

Conversational analysis (Eggins & Slade, 1997; Ervin-Tripp, 1979) has been used in studies on oral production or the assessment of oral production (Andrews, Fullilove, & Wong, 2002; Lazaraton, 2002) as it provides, among other data, statistical analyses of turn-taking and pauses. Among the studies reviewed, the only one to use this instrument was Harwood (2007).

Conclusion

Based on Alderson and Wall’s (1993) article, in which the authors call the attention of language assessment researchers to the need to adopt ethnography in the search for empirical evidence on washback, this study aimed at investigating whether the authors’ admonition had any effect on subsequent works. By means of an electronic literature review, 78 studies conducted in 31 countries were identified.

The analysis of the studies revealed that Alderson and Wall’s (1993) words had an impact on the research community, as the number of methodological options to investigate language assessment washback had significantly increased. Evidence of this is the use of mixed-methods research in which qualitative and quantitative perspectives merged to produce a more complete picture of the studied phenomenon. Furthermore, the use of ethnography and triangulation by means of a variety of stakeholders, sources of information (or documents) and data-collection instruments was found in the great majority of the studies reviewed.

Taking into consideration Alderson and Wall’s (1993) suggestion for triangulating the researcher’s perceptions with those of the participants, inside and outside the classroom, in an attempt to capture the complexity of the phenomenon, the reviewed studies gave voice to different stakeholders: teachers, students or examination candidates, teacher supervisors, school directors, coordinators, school supervisors, educational authorities, exam designers, students’ parents, and material writers.

As far as the sources of information are concerned, researchers used different kinds of documents such as previous exams, guidelines and other exam publications, statistical data on test takers’ performance, and teaching materials in order to understand more deeply the construct and the history of the exam they were working with as well as to characterize the research context.

In relation to triangulation, ten instruments were identified and are listed here from the most to the least frequently used: document analysis,
questionnaire, interview, classroom observation, pre- and post-tests, focus group, diary, researcher’s writing, student follow-up and conversational analysis. The number of instruments used in the studies varied from one to five. All of the studies used document analysis as at least one means of data collection. Only five studies reported the sole use of document analysis and they were the only ones to be classified as using one instrument. The other 73 works presented different combinations of the instruments, such as document analysis and interviews, or pre- and post-tests and interviews. Therefore, document analysis functions as an important background that provides the researcher with information about the research context and potential washback.

Future research needs to consider that, in order to have a better view of what is happening with washback, more data collection instruments and stakeholders need to be involved in washback studies. The increased use of methodological options in the reviewed studies represents a considerable advance in washback research. By conducting this literature review and focusing on the methodological options of the studies, this work also aimed at contributing to future research design and to more conscious and informed research development based on the experience of investigations carried out by researchers worldwide.
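The frequencies reported for each instrument in the preceding sections can be gathered into a single ranking. The sketch below is illustrative: the counts are those stated in the chapter, with document analysis counted for all 78 studies, while the variable names are hypothetical.

```python
TOTAL_STUDIES = 78

# Number of studies using each instrument, as reported in the chapter.
instrument_counts = {
    "document analysis": 78,
    "questionnaire": 50,
    "interview": 41,
    "classroom observation": 30,
    "pre- and post-tests": 8,
    "focus group": 6,
    "diary": 4,
    "researcher's writing": 1,
    "student follow-up": 1,
    "conversational analysis": 1,
}

# Sort from most to least used; sorted() is stable, so ties keep the order given above.
ranked = sorted(instrument_counts.items(), key=lambda item: item[1], reverse=True)

for name, count in ranked:
    print(f"{name:<25}{count:>3}{100 * count / TOTAL_STUDIES:>7.1f}%")
```

Because a study may use several instruments, the percentages are shares of the 78 studies and are not expected to sum to 100.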

References
Alderson, J. C. (2004). Foreword. In L. Cheng, Y. Watanabe, & A. Curtis (Eds.), Washback in language testing: Research contexts and methods (pp. ix-xii). New Jersey: Lawrence Erlbaum Associates.
Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14(2), 115-129.
Andrews, S., Fullilove, J., & Wong, Y. (2002). Targeting washback-a case study. System, 30(2), 207-223.
Barletta, N., & May, O. (2006). Washback of the ICFES Exam: A case study of two schools in the Departamento del Atlántico. Íkala revista de lenguaje y cultura, 11(17), 235-261.
Bourdieu, P. (1998). Compreender. In P. Bourdieu (Ed.), A miséria do mundo (2a ed., pp. 693-732). Petrópolis: Vozes.
Brinkmann, S. (2008). Interviewing. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (pp. 470-472). Thousand Oaks, CA: Sage Publications, Inc.
Brown, J. D. (2001). Using surveys in language programs. Cambridge: Cambridge University Press.


Caine, N. A. (2005). EFL examination washback in Japan: Investigating the effects of oral assessment on teaching and learning. Doctoral thesis, University of Manchester, Manchester, UK. Retrieved from: http://www.asian-efl-journal.com/Thesis_Washback_in_Japan_Caine.pdf
Cheng, L. (2008). Washback, impact and consequences. In E. Shohamy, & N. H. Hornberger (Eds.), Encyclopedia of language and education (Vol. 7: Language Testing and Assessment, 2nd ed., pp. 349-364). NY: Springer.
Cheng, L., Watanabe, Y., & Curtis, A. (2004). Washback in language testing: Research contexts and methods. New Jersey: Lawrence Erlbaum Associates.
Coombs, S. J., & Smith, I. D. (2003). The Hawthorne effect: Is it a help or a hindrance in social science research? Change: Transformations in Education, 6(1), 97-111.
Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods approaches (3rd ed.). Thousand Oaks, California: Sage Publications, Inc.
Dörnyei, Z. (2003). Questionnaires in second language research: Construction, administration and processing. Mahwah, NJ: Lawrence Erlbaum Associates Inc.
Dörnyei, Z. (2007). Research methods in applied linguistics: Quantitative, qualitative and mixed methodologies. Oxford: Oxford University Press.
Dugard, P., & Todman, J. (1995). Analysis of pre-test post-test control group designs in educational research. Educational Psychology, 15(2), 181-198.
Eggins, S., & Slade, D. (1997). Analysing casual conversation. London: Cassell.
El-Ebyary, K. (2009). Deconstructing the complexity of washback in relation to formative assessment in Egypt. Research Notes, 35, 2-5.
Erfani, S. S. (2012). A comparative washback study of IELTS and TOEFL iBT on teaching and learning activities in preparation courses in the Iranian context. English Language Teaching, 5(8), 185-195.
Ervin-Tripp, S. M. (1979). Children’s verbal turn-taking. In E. Ochs, & B. Schieffelin (Eds.), Developmental pragmatics (pp. 391-414). New York: Academic Press.
Fawcett, B. (2008). Structuralism. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (pp. 834-837). Thousand Oaks, CA: Sage Publications, Inc.


Fetterman, D. M. (2008). Ethnography. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (pp. 288-292). Thousand Oaks, CA: Sage Publications, Inc.
Fox, J., & Cheng, L. (2007). Did we take the same test? Differing accounts of the Ontario secondary school literacy test by first and second language test-takers. Assessment in Education, 14(1), 9-26.
Green, A. (2006). Washback to the learner: Learner and teacher perspectives on IELTS preparation course expectations and outcomes. Assessing Writing, 11(2), 113-134.
Harwood, C. (2007). Washback and the Cambridge ESOL Key English Test speaking component: A study from Japan. Master’s dissertation, University of Leicester, Leicester, UK.
Hung, S. T. A. (2012). A washback study on e-portfolio assessment in an English as a foreign language teacher preparation program. Computer Assisted Language Learning, 25(1), 21-36.
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests (Studies in Language Testing 14). Cambridge: Cambridge University Press.
Lazaraton, A. (2005). Quantitative research methods. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 209-224). Mahwah, NJ: Lawrence Erlbaum.
Li, Y. (2009). A preparação de candidatos chineses para o exame Celpe-Bras: Aprendendo o que significa “uso da linguagem”. Master’s dissertation, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil.
Lodico, M. G., Spaulding, D. T., & Voegtle, K. H. (2006). Methods in educational research: From theory to practice. San Francisco, CA: Jossey-Bass.
Lüdke, M., & André, M. E. D. A. (1986). A pesquisa em educação: Abordagens qualitativas. São Paulo: EPU.
Machado, A. R., & Brito, C. (2009). O agir linguageiro em questionário de pesquisa. In A. R. Machado, E. Cols, L. S. Abreu-Tardelli, V. L. L. Cristovão (Eds.), Linguagem e educação: O trabalho do professor em uma nova perspectiva (pp. 137-160). Campinas, SP: Mercado de Letras.
McKechnie, L. E. F. (2008). Observational research. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (pp. 573-575). Thousand Oaks, CA: Sage Publications, Inc.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241-256.


Morgan, D. L. (2008). Focus groups. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (pp. 252-254). Thousand Oaks, CA: Sage Publications, Inc.
Muñoz, A. P., & Alvarez, M. E. (2010). Washback of an oral assessment system in the EFL classroom. Language Testing, 27(1), 33-49.
Nikoopour, J., & Farsani, M. A. (2012). Depicting washback in Iranian high school classrooms: A descriptive study of EFL teachers’ instructional behaviors as they relate to University entrance exam. The Iranian EFL Journal, 8(1), 9-34.
Retorta, M. S. (2007). Efeito retroativo do vestibular da Universidade Federal do Paraná no ensino de língua inglesa em nível médio no Paraná: Uma investigação em escolas públicas, particulares e cursos pré-vestibulares. Doctoral thesis, Universidade Estadual de Campinas, Campinas, SP, Brazil.
Saif, S. (2006). Aiming for positive washback: A case study of international teaching assistants. Language Testing, 23(1), 1-34.
Scaramucci, M. V. R. (1995). O papel do léxico na compreensão em leitura em língua estrangeira: foco no produto e no processo. Doctoral thesis, Universidade Estadual de Campinas, Campinas, SP, Brazil.
Scaramucci, M. V. R. (2004). Efeito retroativo da avaliação no ensino/aprendizagem de línguas: o estado da arte. Trab. Ling. Aplic., 43(2), 203-226.
Schensul, J. J. (2008). Documents. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (p. 232). Thousand Oaks, CA: Sage Publications, Inc.
Shawcross, P. (2007). What do we mean by the ‘washback effect’ of testing? Paper presented at the 2nd ICAO Aviation Language Symposium, May 7th-9th, Montreal, Canada. Retrieved from: http://legacy.icao.int/icao/en/anb/meetings/ials2/Docs/15.Shawcross.pdf
Smith-Sullivan, K. (2008). Diaries and journals. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (pp. 213-215). Thousand Oaks, CA: Sage Publications, Inc.
Tsagari, K. (2006). Investigating the washback effect of a high-stakes EFL exam in the Greek context: Participants’ perceptions, material design and classroom applications. Doctoral thesis, Lancaster University, Lancaster, UK.
Wall, D., & Horák, T. (2006). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 1, the baseline study. TOEFL Monograph Series, Report Number RR.06.18. Princeton, NJ: Educational Testing Service.
Wall, D., & Horák, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 2, coping with change. TOEFL Monograph Series, Report Number RR.08.37. Princeton, NJ: Educational Testing Service.
Watanabe, Y. (2004). Methodology in washback studies. In L. Cheng, Y. Watanabe, & A. Curtis (Eds.), Washback in language testing: Research contexts and methods (pp. 19-36). Mahwah, NJ: Lawrence Erlbaum Associates.



CONTRIBUTORS

Zakia Ali-Chand is currently the Head of School of Communication, Language & Literature at the Fiji National University in Suva, Fiji. She has a PhD in Linguistics from the University of the South Pacific. Her two publications in 2015 include Using SQ3R Method to Improve Reading Comprehension Abilities and The Significance of Intercultural Communication Studies for Second Language Teaching. Her main research interests include language learning strategies and academic language skills of second language learners. She is also the current President of the Fiji Association of Women Graduates, an affiliate of Graduate Women International based in Geneva, Switzerland.

Kazuo Amma, PhD, Professor of applied linguistics and TEFL at Dokkyo University (Saitama, Japan), has worked in statistical analysis of language test data. His particular interest in the qualitative difference in language processing among learners of different proficiency levels resulted in a paper which won the Best Research Paper Award at the SAS Forum 2004 (Tokyo). He also developed a computer programme for the partial scoring of sequential items (2010). His current engagements include test projects for secondary school students at the Ministry of Education and accreditation of airline pilots’ English proficiency at the Ministry of Land and Transportation in Japan.

James Dean Brown (“JD”) is currently Professor of Second Language Studies at the University of Hawai‘i at Mānoa. He has spoken and taught in many places ranging from Brazil to Venezuela. He has published numerous articles and books on language testing, curriculum design, research methods, and connected speech. His most recent books are: Mixed methods research for TESOL (2014 from Edinburgh University Press); Cambridge guide to research in language teaching and learning (2015 with C. Coombe from Cambridge University Press); and Teaching and assessing EIL in local contexts around the world (2015 with S. L. McKay from Routledge).




Sarah Chabal (PhD from Northwestern University) is a Research Psychologist at the Naval Submarine Medical Research Lab in Groton, Connecticut. As a former member of the Bilingualism and Psycholinguistics Research Group at Northwestern University, she has conducted research on bilingualism, language processing, and interactions between language and perceptual systems. Her work has been published in peer-reviewed scientific journals and has been presented at both national and international conferences.

Douglas Altamiro Consolo is an Associate Professor of Applied Linguistics and of English as a Foreign Language at the State University of Sao Paulo (UNESP), Brazil. He teaches undergraduate and postgraduate courses, and supervises students’ projects in initial teacher education, as well as MA dissertations, PhD theses and postdoctoral studies. He is the co-ordinator of the ENAPLE-CCC research group and one of the senior researchers in the development of EPPLE, a Brazilian proficiency examination for foreign language teachers. He has published extensively in the areas of EFL and language assessment. He has co-authored and co-edited four books. He has recently joined a project on Portuguese as a foreign language (PFL) at UNESP, in which he is responsible for online assessment in PFL. His main research interests include assessment, foreign language learning and teaching, language testing and teacher development.

Christine Coombe has a PhD in Foreign/Second Language Education from The Ohio State University. She is currently on the English/General Studies faculty of Dubai Men’s College, UAE. Christine is coauthor/editor of numerous professional volumes including A Practical Guide to Assessing English Language Learners (2007, University of Michigan Press); The Cambridge Guide to Research in Language Teaching and Learning (2015, Cambridge University Press) and Reigniting, Retooling and Retiring in English Language Teaching (2012, University of Michigan Press). Most recently, Christine served as President of the TESOL International Association (2010-2013) and received the British Council’s International Assessment Award (2013).

Selwyn C. Cruz is a full-time Lecturer at the Enderun Colleges. He finished his Master of Arts in English Language Education at the De La Salle University, where he is currently taking his PhD in Applied Linguistics. He has presented papers at conferences, published articles in journals, delivered lectures, and provided training on language learning-related courses. His research interests are in the areas of Discourse Analysis, Sociolinguistics and Educational Technology.

Marina Dodigovic is the Director of both the MA TESOL program and the Research Centre for Language Technology at XJTLU. She is a member of the editorial boards of two refereed academic journals, TESOL International and Voices in Asia. In the past, she served the TESOL community as a Ruth Crymes Fellowship Academy Award reader. Marina is the author of Artificial Intelligence in Second Language Learning, the first English-language monograph on Intelligent CALL, and the editor of Attitudes to Technology in ESL/EFL Pedagogy. She has received multiple extramural and intramural research grants and has published a number of articles in refereed academic journals. Her recent research interests have gravitated toward vocabulary teaching, learning and assessment.

Carol J. Everhard was formerly a Teaching Fellow in the Department of Linguistics of the School of English, Aristotle University of Thessaloniki in Greece. She both taught and coordinated undergraduate language courses, was organizer of the School’s Resource Centre and taught a specialist course in Self-access and Foreign Language Learning. Her doctoral studies investigated the relationship between learner-centred assessment and autonomy in language learning. She was Coordinator of the IATEFL Learner Autonomy Special Interest Group between 2006 and 2008, and she co-edited Autonomy in Language Learning: Opening a Can of Worms (2011) and Assessment and Autonomy in Language Learning (2015).

Christina Gitsaki is an Associate Professor and Research Coordinator at the Center for Educational Innovation, Zayed University, UAE. Dr. Gitsaki has previously served as the UNESCO Chair in Applied Research in Education, designing professional development programs for teachers in the K-12 and higher education sectors. Dr. Gitsaki has served on the executive boards of professional associations such as the Applied Linguistics Association of Australia, the Gulf Comparative Education Society, and TESOL Arabia, and she is currently the Secretary General of the International Association of Applied Linguistics. Her research interests are in second language acquisition and pedagogy, the use of educational technology in the second language classroom, teacher professional development, and the scholarship of teaching and learning.


Gerriet Janssen is currently finishing his doctorate at the University of Hawai‘i at Mānoa, under the supervision of Dr. James Dean Brown. He has simultaneously maintained his position as Assistant Professor at Universidad de los Andes–Colombia, a position he has held since 2006. There, and within a team of colleagues, he headed the development of the language program Inglés para Doctorados, a program in which he currently teaches and which he coordinates. His main research interests include language assessment, academic writing, curriculum development, and program evaluation. His research papers have been published in refereed journals and books in both Colombian and international contexts.

Liudmila Kozhevnikova received her PhD from Moscow State University and is currently an Assistant Professor at Samara State University, Russia. She is Vice President of SETA and Board Member of the National Association of Teachers of English. She is on the Board of Experts for the Foreign Languages Council on Methodology and Research at the Russian Ministry of Education and Science, the Volga Region Affiliate, which she chaired for seven years (1998–2005). She co-authored Exam Success (2013). Her main research interests include testing and assessment, TESOL and CLIL.

Caroline Larson (M.Ed. from University of Virginia) is a practicing speech-language pathologist at The Communication Clubhouse in Chicago, Illinois. She is certified by the American Speech-Language-Hearing Association and Illinois Early Intervention, and state licensed in Illinois. She is a translational researcher for the Bilingualism and Psycholinguistics Research Group at Northwestern University, and has presented research on topics related to bilingualism and child language at state and national conferences.

Persephone Mamoukari is a PhD researcher in the field of Applied Linguistics at Democritus University of Thrace. She holds a degree in English Language and Literature from the Aristotle University of Thessaloniki and an MA in Black Sea studies from the Democritus University of Thrace. She has attended a number of congresses and has written papers in refereed proceedings. She speaks English fluently and is a proficient speaker of French. Her interests focus on language teaching, communication strategies, learning strategies, school psychology, and psycholinguistics.


Viorica Marian (Ph.D. from Cornell University) is the Ralph and Jean Sundin Professor of Communication Sciences and Disorders and Professor of Psychology and Cognitive Science at Northwestern University in the United States. Since 2000, she has directed the Bilingualism and Psycholinguistics Research Group, with funding from the National Institutes of Health and the National Science Foundation. Her research centers on bilingualism and its consequences for linguistic, cognitive, and neural function, with a focus on language processing, learning, and memory. Her research has been disseminated in over 100 publications and over 200 conference and invited presentations, and has received extensive press coverage (http://www.bilingualism.northwestern.edu/).

Jessica Midraj holds a PhD in Curriculum and Instruction (English Education) from Indiana State University, USA. Currently, she is a faculty member in the College of Arts and Sciences at the Petroleum Institute in Abu Dhabi, UAE. Dr. Midraj has over 18 years of experience in the field of English language education as a teacher, mentor, researcher, and manager. She has published in the areas of self-assessment, reading, parental involvement, educational standards, grammar teaching pedagogy, and motivation and has authored numerous language tests and assessment manuals. Her research interests include curriculum and assessment design, academic language learning instruction, and assessment management.

Sadiq Midraj is an Associate Professor of language education and the Quality Assurance Coordinator in the College of Education at Zayed University. He has provided consultations to the Ministry of Education on K-12 English standards, Bidaya Media on research, the UNESCO Office in Beirut on quality assurance standards for education units and Khalifa Award for Education as an arbitrator. Dr. Midraj also worked as the Director for the Center for Professional Development. He has published on assessing Arabic-English bilingual literacy, outcomes-based education, and language-learner variables. His research interests include English language assessment, bilingual education, and quality assurance.

Jacob Mlynarski, an EAP Lecturer at the Northern Consortium United Kingdom (NCUK), has taught Academic English in China for the past 6 years. He holds a B.A. in English Philology and an M.A. in TESOL. His main interests include TESOL, more specifically plagiarism, the use of technology in writing instruction and contrastive rhetoric. He is particularly interested in differences between rhetorical styles placed within various cultural contexts. He is currently developing a Graduate Foundation Programme in China in association with the NCUK.

Gladys Quevedo-Camargo is an Adjunct Professor in the Department of Foreign Languages and Translation, University of Brasília, Brasília/DF, Brazil. She worked for many years as an English teacher in private schools and, since 2006, she has worked with teacher education in government-funded higher education institutions. She has also been an oral examiner for the Cambridge main suite exams and the IELTS, and coordinated the testing development center in one of the universities where she worked. Her main research interests include language teaching, learning and assessment, national and international exams and teacher education.

Matilde V. R. Scaramucci, full Professor in the Department of Applied Linguistics, University of Campinas, SP, Brazil, is past Dean of the Institute of Language Studies, University of Campinas (2011-2014). She is one of the developers of the Certificate of Proficiency in Portuguese as a Foreign Language (Celpe-Bras, Brazilian Ministry of Education) and editor-in-chief of the refereed journal Trabalhos em Linguística Aplicada (2006-2014). She has published extensively in the areas of SL/FL teaching, testing and assessment. She has co-edited Português para Falantes de Espanhol (2008) and Pesquisas sobre Vocabulário em Língua Estrangeira (2008). Her main research interests include the assessment of integrated tasks, validity and washback. Her research papers have been published in numerous refereed journals and books, mostly in Brazil.

Vera Lucia Teixeira da Silva is an Associate Professor of English as a Foreign Language at the State University of Rio de Janeiro, where she works as an educator and researcher. She is currently a member of a group that is developing a Brazilian proficiency exam for teachers of foreign languages. She has published several articles in the area of assessment and EFL. She has co-authored Olhares sobre competências do professor de língua estrangeira: da formação ao desempenho profissional (2007). Her main research interests include assessment, teacher development and second language acquisition.

Renata Mendes Simões holds a PhD in Applied Linguistics from Catholic University of Sao Paulo (PUC-SP), Brazil, and a Master’s in Education and Arts from Mackenzie University, with specialization in Teaching in Higher Education. She is an autonomous English teacher and English-Portuguese translator-interpreter, a research member of the GEALIN group (English for Specific Purposes, ESP, teaching and learning research group - PUC-SP), and editorial assistant in the Academic Journal ‘the ESPecialist’. She is responsible for designing and teaching ESP courses, mainly for language proficiency tests such as TOEFL iBT and IELTS. Her main research interests include ESP, one-to-one classes, and language proficiency assessments.

Maria Giovanna Tassinari is Director of the Centre for Independent Language Learning at the Freie Universität Berlin, Germany. She is currently a committee member of LASIG, the Learner Autonomy Special Interest Group of IATEFL, and serves on the scientific committee of Mélanges Pédagogiques (Université de Lorraine, France). Her PhD has been published as Autonomes Fremdsprachenlernen: Komponenten, Kompetenzen, Strategien (Peter Lang, 2010) and was awarded the Bremer Forschungspreis 2011. Her research papers have been published in numerous refereed journals and books in German, English and French. Her research interests are in learner autonomy, language advising, affect in language learning, and formal and informal learning.

Jonathan Trace is a PhD candidate at the University of Hawai‘i at Mānoa. He has co-authored several articles and book chapters on large-scale assessment validation, rubric design, classroom assessment, and rater negotiation in performance assessment. Currently, he is working towards the completion of his doctoral dissertation on the construction of a validation argument for cloze test function for second language assessment purposes. His main research interests include second language assessment, curriculum design, program evaluation, ecological approaches to learning, multivariate quantitative methods, and listening and speaking pedagogy.

Romulo P. Villanueva Jr. finished his Master of Arts in Teaching major in English Language at De La Salle University-Manila as a Commission on Higher Education (CHED) Scholar and has finished the academic requirements for his PhD in English Language Studies at the University of Santo Tomas. He also has a Diploma in Teaching English to Speakers of Other Languages (TESOL) from the London Teacher Training College-UK. Currently, he is the Assistant Program Head of the Department of English and teaches English courses. He has conducted research on language teaching techniques, grammar proficiency, language attitudes, and sociolinguistics, and has presented papers at international conferences.


Penelope Kambakis Vougiouklis is a Professor of Linguistics, Department of Greek, Democritus University, Greece. She holds a degree from the Aristotle University of Thessaloniki and an MA and a PhD from the University of Wales. She has attended more than 80 congresses and she is the author of more than 100 papers in journals and refereed proceedings, as well as four monographs. She is fluent in English and can read and communicate in French and Italian. Her interests focus on mathematical models in language teaching, Greek as a second/foreign language, communication strategies with emphasis on guessing as a processing and/or a learning strategy, confidence as a factor of success in communication, learning strategies, dictionary use as a communication strategy, psycholinguistics and dialectology.

Beilei Wang, Associate Professor in the School of Foreign Languages, Tongji University, is currently the Director of the China English for Academic Purposes Association. She has been hosting and participating in various ELT projects at primary, secondary and tertiary levels sponsored by Tongji University, Shanghai International Studies University, Shanghai Municipal Education Commission and the Ministry of Education. She has published extensively in such journals as Foreign Language World and Journal of Foreign Languages, contributed chapters to books on EFL theory and practice such as Teaching English the Other Way in China, and coauthored a serial textbook, Reading Faster. Her main research interests include learner autonomy, formative assessment, ESP and classroom teaching.

Lee-Yen Wang has been an Associate Professor in the English department of Xiamen University Tan Kah Kee College since 2013. He has done extensive research in CALL and ESL/EFL. He was a professional computer consultant for high technology companies in the US, such as HP, Alcatel-Lucent (DSC), Samsung Telecommunications America, Fujitsu Network Communication, Cisco, and other medium to small innovative technology companies all over the United States. He started to tap his software prowess a few years ago to build a corpus and write software to analyze language features. He has also created remedial English programs for elementary school children and junior high school students in Taiwan. He is currently developing a medical translation program for English major students.


Rining Wei, PhD, is a Lecturer at the Department of English, Culture and Communication, Xi’an Jiaotong-Liverpool University (XJTLU), Suzhou, China. He teaches courses related to bilingualism, sociolinguistics, and research methods at undergraduate and postgraduate levels. He is an active member of the Research Centre for Language Technology, and serves as a core member of both the Centre’s advisory group and XJTLU’s Language Policy Working Group. He specialises in EAP, Content and Language Integrated Learning (CLIL), language policy, and quantitative methodology. He has published in journals including Asian EFL Journal, English Today, Journal of Multilingual and Multicultural Development, and World Englishes.

INDEX

AARP (Assessment for Autonomy Research Project) 160-173 academic English proficiency 241 language proficiency 240-242 Academic Vocabulary List 98-108, 110-113 academic writing skills 237, 240 accuracy 201, 203, 213 advanced proficiency students 85, 92 advising service 126 session 124 counselling sessions 160, 172 analytical scale 211 AS-unit 212, 213 assessment 64-65, 75-77, 119, 201-202, 204, 222-223, 232 assessment alignment 164-169 as learning 158 criteria 125, 203, 215 degrees of 171 dynamic 122-123, 133 for autonomy 120 for learning 123, 158 formative 125, 158, 159 iterative 125, 133 learner-centred 158 maturity 170 of language competencies 119 of learner autonomy 119, 123 of learning 119, 158 of learning competencies 119 oral 161, 166 peer- and self- 161 power 158, 173 practices 159

process 124 qualitative 125 questionnaire 162, 170-171 scales 161 summative 158-159, 171 sustainable 158, 173 tool 128, 133, 160 triangulated 161-162, 164 writing 161, 164 attrition, language 161 autonomy continuum 123 degrees of 171 fostering/promoting 159-160, 171, 173 learners’ understanding 127 multidimensionality 120, 132 awareness 119, 123, 128-129, 133 raising 159-160 reaching 172 backwash 288 bar (fuzzy) 80, 85-87, 89-93 beginners/low proficiency students 85, 92 benefits of self-assessment 125 bilingual 64-65, 71, 75-77 Brown Corpus 5 capacity 120 CEFR (Common European Framework of Reference) 161 Centre for Independent Language Learning (CILL) 126 cheating 166 checklists, criterial 162 chi-square (fixed) 15 classical test theory (CTT) 7 classroom observation 296 setting 124

College Entrance and Examination Center 99, 113 communicative competence 202, 274 comparing perspectives 124, 130, 133 competencies language 172 learning 172 confidence 80-82, 84, 87-93 confidence interval 54 consciousness, raising 159 content knowledge 179-180 words 21 conversational analysis 299 Corpus of Contemporary American English 98-101, 111-112 correlations 239-241 counselling meetings, small-group 172 criteria assessment 158, 161 assimilation of 159 learner-created 162 pre-determined/ready-made 158, 161 criterion-referenced 60 CTT item analysis 11 data collection instruments 291-292 frozen 163 gathering 126 qualitative 163, 170-171 quantitative 163 decision-making 125, 128, 172-173 declarative knowledge 179 descriptive statistics 7 descriptors learner autonomy 121-122, 125 perspective 129 dialogue, pedagogical 173 diary 298 distancing from the task 161 document analysis 293 domain 201-202


dynamic model of learner autonomy 121-122 Early Intervention 64-66, 75-77 education, distance 159 empirical, explorative-interpretative approach 121, 125-127 English as a foreign language 98-99, 272 as an additional language 178, 182-184, 187-188, 197 learning portfolios (ELP) 137 English Reference Word List 99, 101, 107-108, 110-111, 113 error 222-223, 226-227 ESL 273 ESP 253-255, 261 estimation 41 ethnographic (research) 291 ethnography 291 evaluation process 124 Expanding Circle 273 explorative-interpretative approach 121 extreme items 13 FACETS 6 feedback 158 fitting items 14 focus group 297 foreign language 201-203 formative assessment 137 frequency of use 80-82, 86-93 function words 21 general language proficiency 205 General Scholastic English Ability Test 102 General Service List 99, 100 Germanic words 23 goal-setting 173 grade, oral 161 grammar 274 grammatical competency 274 complexity 212-213, 215 proficiency 274 Greek-speaking students 84, 85


heteronomy 171 high-stakes exam 287, 290 higher education 158 holistic scales 211 impact 288 independent learning 180-182, 197 Inner Circle 273 interview 126, 295 item analysis 11 discrimination 12 Facility (IF) 11 reliability 15 separation index 14 Item Response Theory 3 Item Root Mean Square Standard Error (RMSE) 14 language assessment 202, 253, 257 development 258, 260, 262, 266, 270 learning strategies 80-81, 238-239, 241 Latinate words 23 learner agency 173 attitudes 121 autonomy (LA) 138 learner autonomy components 120 definition 120 dimensions 120-121, 131 operationalization 119 learner behaviours 121 choice of components 124, 127 contracts 162 empowerment 119 feedback on self-assessment 128-129 profile 126, 132 profile cards 162 resistance 130, 133 strategies 121 strengths 159, 172 weaknesses 159, 172


learning liberatory 158 lifelong 158 sustainable 172-173 transformatory 172 Letters course 203-205 lexical competence 208, 211 frequency 201, 203 variety 203 Likert scales 79, 81, 85-86 linguistic knowledge 202 logistic regression 46 low-stakes exam 290 Many Faceted Rasch Measurement (MFRM) 13 measurement of learner autonomy 132 metacapacity 120 metacognition 120, 133, 159, 172-173 metalanguage 204, 207 Ministry of Education 99, 182 misfitting items 14 mixed methods 132, 291, 296 mock tests 209-210 model of autonomy Everhard’s 171 Tassinari’s dynamic 172 multiple-choice items 189 natural cloze needs analysis 253, 255, 260-261, 269 objectivity 161 OPT (Oxford Placement Test) 161 oral protocols 84 Outer Circle 273 parts of speech 21 Pearson correlation coefficients 164-166 pedagogical dialogue 124-125, 129-130 knowledge 179 peer-assessment 158-172, 181 of mock samples 163 training for 166

Philippine English 273 placement 41 plagiarism 222-226 Post-Study intervention 163 Post-Study, AARP 163, 166 post-tests 297 Praxis 182, 184-187 Pre-Study, AARP 162, 166 pre-tests 297 presentations, oral 161 private classes 253-254, 258 procedural knowledge 179 professional disposition 179 proficiency 58, 83, 85, 93-94, 201-202, 215, 274 scale 207-208, 213 proficient level 278, 281 qualitative 131-132, 291, 296 quantitative 131-132, 291, 296 questionnaire 126, 293 Rasch analysis 3 vertical rulers 15 rating scale 203, 208, 210 rational deletion cloze 3 reflection 130, 133, 172 reflexive methods 132 reliability 7 research instruments 131 empirical 158 researcher’s writing 298 responsibility, assuming/taking 160, 172 rhetoric 159 rogue group 166 Rossetti 64-65, 67-68 self-assessment 158-173, 180-182, 197 deviations 162, 167-170 contexts 129 steps 123-125 self-directed language learning 123, 126 self-direction 160 self-efficacy 178


self-evaluation 180 self-regulation 180-181, 197 sense of ‘being’ as a learner 159 sense of ‘self’ 159 small-scale 41 social cognitive theory 180 SOE (School of English) 160 specialized proficiency 205 stakeholder 287 Standard Filipino English 273 statistical analysis 164 Strategy Inventory for Language Learning (SILL) 80-82, 84, 87 student follow-up 298 support for learner autonomy 119, 123, 126, 129 supporting/enhancing reflection 128, 130, 133 t-analysis 169-170 T-unit 212 tailored cloze 3 teacher certification 179, 185, 187 development 203, 205 education 202, 204, 215 effectiveness 178, 179 efficacy 178, 181 language 202, 205-206 proficiency 201, 202-203 readiness 182-183 talk 202 TESOL Standards 182, 184-185, 188 Teacher Readiness Inventory (T-TRI) 196 testing 201-202, 215 diagnostic 161 textual borrowing 222-226 theoretical-analytical approach 121 thinking criterial 158-159 critical 173 TOEFL iBT Proficiency Test 253-254, 256-257, 262, 269 transactional modes of teaching 158


transformative modes of teaching 158 transmissional modes of teaching 158 triangulation 291 Tukey-Kramer tests 164-167, 169 Turkish-Greek bilingual students 84 underfitting items 13 United Arab Emirates 181 validation methods 121


vocabulary size 222, 223-226 Vocabulary Size Test (VST) 222, 228 washback 287 WINSTEPS 6 word frequency 23 word origin 23 writing 222-226 Yes/No Test 101-104, 106, 110