Making and Using Word Lists for Language Learning and Testing [ebook ed.] 9027266271, 9789027266279



Making and Using Word Lists for Language Learning and Testing

I.S.P. Nation Victoria University of Wellington

John Benjamins Publishing Company Amsterdam / Philadelphia


The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ansi z39.48-1984.

doi 10.1075/z.208

Cataloging-in-Publication Data available from Library of Congress:
lccn 2016032923 (print) / 2016050268 (e-book)

isbn 978 90 272 1244 3 (hb)
isbn 978 90 272 1245 0 (pb)
isbn 978 90 272 6627 9 (e-book)

© 2016 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.

John Benjamins Publishing Company · https://benjamins.com


To Graeme Kennedy
1939–2016
A good man and a true scholar


Table of contents

Acknowledgements  ix
Introduction  xi

Section I. The uses of word lists
Chapter 1. Word lists  3

Section II. Deciding what to count as words
Chapter 2. Types, lemmas, and word families  23
Chapter 3. Homoforms and polysemes (Paul Nation and Kevin Parent)  41
Chapter 4. Proper nouns (Paul Nation and Polina Kobeleva)  55
Chapter 5. Hyphenated words and transparent compounds  65
Chapter 6. Multiword units (Paul Nation, Dongkwang Shin and Lynn Grant)  71
Chapter 7. Marginal words and foreign words  81
Chapter 8. Acronyms  85
Chapter 9. Function words  89

Section III. Choosing and preparing the corpus
Chapter 10. Corpus selection and design (Paul Nation and Joseph Sorell)  95
Chapter 11. Preparation for making word lists  107

Section IV. Making the lists
Chapter 12. Taking account of your purpose  113
Chapter 13. Critiquing a word list: The BNC/COCA lists  131
Chapter 14. Specialized word lists (Paul Nation, Averil Coxhead, Teresa Mihwa Chung and Betsy Quero)  145
Chapter 15. Making an essential word list for beginners (Thi Ngoc Yen Dang and Stuart Webb)  153

Section V. Using the lists
Chapter 16. Using word lists  171

Appendix 1. Proper noun tagging in the BNC  183
Appendix 2. Closed lexical set headwords  187
Appendix 3. The Essential Word List  188

References  197
Author index  207
Subject index  209

Acknowledgements

I am particularly grateful to Kevin Parent and Joseph Sorell for providing me with data and advice which went well beyond the chapters that bear their names. The following people read early versions of the manuscript and provided very useful feedback – Myq Larson, Stuart McLean, Dale Brown, Stuart Webb and Rob Waring. The advantage of having the assistance of others is that you can blame them for your errors.


Introduction

Word lists lie at the heart of good vocabulary course design, the development of graded materials for extensive listening and extensive reading, research on vocabulary load, and vocabulary test development. This book brings together past and current research and a very large amount of experience to answer the following questions: What are vocabulary lists used for? How can we decide whether a vocabulary list is a good one? How can we make good vocabulary lists? How can we use vocabulary lists well? In answering these questions we need to cover several very basic issues in vocabulary studies, such as: What vocabulary do learners need to know? How many words do they need to know? What do we count as words? Does a word have several meanings? Are multiword units like single words? These questions go well beyond the scope of vocabulary lists.

This book is not written for those who are new to the study of vocabulary. It is aimed at those who know something about the teaching and learning of vocabulary and who want to make use of word lists in an informed way, or who wish to create their own word lists for particular purposes. If you know very little about vocabulary teaching or learning, you could read either Learning Vocabulary in Another Language (Nation, 2013 second edition) or Teaching Vocabulary: Strategies and Techniques (Nation, 2008).

Unless there is an explicit statement about a particular purpose for making a list, it will be assumed in the discussion and recommendations throughout this book that word lists are being made to guide the design of a teaching and learning program aiming initially at receptive knowledge of vocabulary. This book, however, has plenty to say about lists for productive purposes and lists designed for the analysis of texts and vocabulary test construction. The assumption is made because the purpose for making a particular list has a strong effect on the decisions and


procedures that need to be followed, and this assumption makes it possible to avoid having to qualify each recommendation or suggested guideline as it is given. All recommendations and guidelines for a different purpose will be clearly signaled.

In addition, the BNC/COCA lists will often be referred to for examples. This is because they are the largest available lists of word families, and the writer of this book has been working on them since the late 1980s, when the School of Linguistics and Applied Language Studies at Victoria University of Wellington got its first personal computer and Alex Heatley wrote the first version of what would eventually become the Range program and AntWordProfiler. The BNC/COCA lists are available free from Paul Nation's web site and come with the Range program.

This book is undoubtedly written with preconceived biases. Among these are the monosemic bias (Ruhl, 1989), which sees words with the same form, with the exception of homonyms, as essentially having the same meaning, and the bias of seeing the vast majority of multiword units as being made of separate words rather than as non-decomposable word-like units. These biases are explained in this book.

There are some technical terms used in the book, and the following table explains them.

Technical terms

tokens
Every time the same word form occurs in a text, each occurrence is counted as a token of the word. So, the sentence These are the technical terms which are used in the book is made up of nine different words and eleven tokens. There are two tokens each of are and the. When we ask questions like How long is this book? How fast can someone speak? How many words does a secondary school child read in a year?, we are talking about tokens. When talking about tokens in texts, they are sometimes called running words, as in How many running words are there in this text?

types
Types are also loosely called different words. The sentence These are the technical terms which are used in the book is made up of nine types and eleven tokens. When we ask questions like How large is your vocabulary? How many words did Shakespeare use in his writing?, we are talking about word types.

word families
Counting finger and fingers as two different words does not make sense once learners know about singular and plural. Rather than counting finger and fingers as two different words, we may wish to count them as one word. If we do this, we are grouping them as members of the same word family. All the members of a word family need to have the same stem form, which has much the same meaning in all the members of the family. There are many levels of word family depending on what we decide to include in them. Bauer and Nation (1993) proposed a series of seven levels (see Table 2.2) using the criteria of frequency, regularity, productivity and predictability. Here is the Level 6 word family for walk – walk, walked, walking, walks, walker, walkers, walkable, walkie, walkies. This word family can be divided into three flemmas, seven lemmas or nine types.

lemmas
A lemma is a word family where the family members consist of the headword and inflected forms of the word. A pure lemma contains only members that are all the same part of speech. English has eight inflectional affixes – plural; third person singular present tense; past tense; past participle; -ing; comparative; superlative; possessive. Here is the lemma for the noun walk – walk, walks. The verb walk is a different lemma – walk, walks, walked, walking.

flemmas
Unlike pure lemmas, a flemma is a word family that consists of a headword and inflected forms of different parts of speech. Typically flemmas include more members than pure lemmas. Here is the flemma for the headword walk – walk, walks (third person present tense verb and plural noun), walking (in all parts of speech), walked (both past tense and past participle).

inflectional affixes
English has eight inflectional affixes – plural -s; third person singular present tense -s; past tense -ed; past participle -ed; -ing; comparative -er; superlative -est; possessive 's. Adding an inflectional affix does not change the part of speech of a word. English inflections are all suffixes, that is, they occur after the stem.

derivational affixes
English has well over 100 derivational affixes. Here are some examples – un- as in unhappy, -ly as in slowly, -ment as in government. Adding a derivational suffix usually changes the part of speech of the word. Derivational affixes can be added to free-standing stems (stems that are complete words in their own right) and to bound stems (stems that are not complete words in their own right). Here are some examples of bound stems – -fer- as in transfer, -sist- as in consistent. In the Bauer and Nation (1993) levels, all word families are based on free-standing headwords.

homoforms
Homoforms are words that have the same form but unrelated meanings. They can be divided into homonyms (words that have the same spoken and written forms but unrelated meanings [ball = a round object, ball = a formal dance party]), homographs (words that have the same written forms but different spoken forms and unrelated meanings [minute = a division of time, minute = very small]), and homophones (words that have the same spoken form, different written forms and unrelated meanings [pie = a food, pi = a mathematical term]).

BNC
The British National Corpus. This is a collection of around 100 million tokens of British English.

COCA
The Corpus of Contemporary American English. This corpus was created by Mark Davies at Brigham Young University. It continues to grow and consists of several hundred million tokens of American English of a variety of text types.
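As a rough illustration of the token/type distinction, the counts for the example sentence above can be checked in a few lines of Python. This is only a sketch, not the book's software; real counting tools handle punctuation, hyphenation and capitalization far more carefully.

```python
# Token and type counts for the example sentence from the table above.
sentence = "These are the technical terms which are used in the book"
tokens = sentence.lower().split()   # every running word is a token
types = set(tokens)                 # each different word form is a type

print(len(tokens))  # 11 tokens
print(len(types))   # 9 types
```

Lowercasing first matters: without it, These and these would be counted as two types.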


The following are some chapters and articles that are highly recommended additional reading.

Gilner, L. (2011). A primer on the General Service List. Reading in a Foreign Language, 23(1), 65–83.
This article provides an excellent critical introduction to the history and nature of the highly influential General Service List.

Martinez, R., & Schmitt, N. (2012). A phrasal expressions list. Applied Linguistics, 33(3), 299–320.
This very clear description of the making of a list of multiword units shows how carefully such a study needs to be done and how much work goes into the making of a good list. Martinez and Schmitt's research is an excellent example to follow.

Nagy, W. E., & Anderson, R. C. (1984). How many words are there in printed school English? Reading Research Quarterly, 19(3), 304–330.
This detailed study of part of the Carroll, Davies and Richman (1971) word list carefully examines a lot of the decisions that need to be made when counting words, suggesting solutions and often alternative solutions.

Sorell, C. J. (2012). Zipf's law and vocabulary. In C. A. Chapelle (Ed.), Encyclopaedia of Applied Linguistics. Oxford: Wiley-Blackwell.
Zipf's law describes the frequency distribution of vocabulary in a text or collection of texts. Understanding this law and its implications for language teaching provides valuable insights into the reasons for making and using word lists, and into the limitations of word lists. Sorell provides a very clear and accessible explanation of this important law.

Sorell, C. J. (2013). A study of issues and techniques for creating core vocabulary lists for English as an international language. Unpublished PhD thesis, Victoria University of Wellington, New Zealand.
This PhD thesis has a very comprehensive history of word lists and a lot of well-researched information on the vocabulary of some important text types.

This book draws heavily on the work of several of my PhD students. While I have written every chapter except Chapter 15, I acknowledge the large contribution that their research made to the ideas in the chapter by citing them as a co-author of the relevant chapter. In my eyes at least, their chapter can be regarded as a joint publication. Chapter 15 is written solely by Thi Ngoc Yen Dang and Stuart Webb.


Section I. The uses of word lists

Chapter 1. Word lists

This book focuses on the making and using of vocabulary lists for the teaching and learning of a foreign language including course design, research on vocabulary load, and testing vocabulary knowledge. It does not look at the use of vocabulary lists for research purposes in the field of psychology, although some of the information in this book has direct relevance for some kinds of research. The goal of this chapter is to show what word lists can be used for and how they can play a central role in learning a foreign language such as English. This chapter then is a kind of justification for the rest of the book. To be useful, word lists need to be well made and the goal of this book is to show how this can be done. This chapter also looks briefly at important factors affecting the making of word lists, namely the unit of counting and the nature of corpora (collections of texts) used for making word lists. The chapter concludes by looking at some very influential word lists that will be referred to in the book.

Course design

An obvious use for word lists is to help course designers decide which words to include in a language course. This is an important use because the various words in a language do not occur with similar frequencies. Typically one or two words like the or of are so frequent that they occur in almost every sentence, while a large number of words may be met only once or not at all by a language user in their lifetime.

The occurrence of words in a text of 1000 words or more, or in a collection of texts, can be described by Zipf's law. Zipf's law basically says that rank multiplied by frequency gives us a constant figure (the same answer) as we go down a frequency-ranked word list. Table 1.1 is part of an idealized version of Zipf's law applied to a text ten thousand words long. Real-life examples do not fit so neatly, although they can get close. Note that the number in the Rank column multiplied by the number in the Frequency column equals or almost equals the number in column 4 (700). The number in column 4 is called the constant (the unchanging figure). In the example, half of the different words occur only once in the text, which also follows from Zipf's law. These, however, are the words that carry a lot of the message. In the example,


Table 1.1  The patterning of Zipf's law

Rank   Word       Frequency (per 10,000)   Rank times frequency
1      the        700                      700
2      and        350                      700
3      a          233                      699
4      of         175                      700
5      to         140                      700
6      s          117                      702
7      as         100                      700
…
700    abide        1                      700
701    absence      1                      701
702    abundant     1                      702
703    accent       1                      703
…

the most frequent word is the, and the next most frequent is and. Typically the ten most frequent words would cover around 25% of the running words in a text written in English.

Here are some generalizations that follow from Zipf's law.

1. There is a relatively small group of words that occur very frequently in the language. Typically the ten most frequent words in a language cover from 25% (English) to 35% (Maori) of the running words. These percentages may change as more languages are analyzed, but the principle will hold true that a very small number of words can cover a very large proportion of a text. The most frequent 100 words of English cover around 50% of the running words in a text, and the first 1000 different words cover between 70% and 90%, depending partly on the content of the text and on whether the text is spoken language or written language.

2. There is a very large group of words that do not occur frequently. Any boundary between high frequency and low frequency words will be an arbitrary one. Traditionally, for English, the most frequent 2000 words have been called the high frequency words of the language, and there are several reasons to support this figure (see Nation (2001b) for a discussion). Schmitt and Schmitt (2014) argue that rather than a high frequency/low frequency division, it is better for learning purposes to have a high frequency/mid-frequency/low frequency division, with 3000 high frequency words, 6000 mid-frequency words, and the remainder low frequency words. The main purpose of this division is to signal the need to systematically learn the mid-frequency words.


3. The frequencies of words in a frequency-ranked list drop very quickly, so that from about halfway down the list the words have a frequency of only one occurrence in the text. That is, about half of the words in a text or a collection of texts will occur only once. This will happen in most sensible meaning-focused uses of a language, and even in most language teaching course books the same pattern will occur. This is largely unavoidable. Cobb (2015, unpublished material) however shows that some short texts are not so burdened with one-timers, and adaptation of a text can considerably reduce the number of one-timers with no obvious loss of naturalness. To check this out, go to the Compleat Word Lister at www.lextutor.ca and run the sample texts Lit1, Lit2 and Lit1&2 at the bottom of the window.

Each of these generalizations has implications for course design. The high frequency words of a language need to be an early focus in most language courses. The principle justifying this is the cost/benefit principle, which says that learners should get the best return for their learning effort. By learning the high frequency words first, learners will have the greatest opportunities to enrich their knowledge through later meetings with the words, and will have the greatest opportunity to produce what they know. A graded list of high frequency words can thus be a great asset for a course designer.

If a course or text contains too many words that are unknown to the learners, then learners will struggle to focus on the meaning of the text because of the need to deal with the unknown words. For example, if there are one or two unknown words in almost every line of a text, these become an intolerable burden for anyone reading the text. There are computer programs like AntWordProfiler (http://www.laurenceanthony.net/software/antwordprofiler/) that can analyze texts and highlight the words likely to be unknown. These programs are very useful for someone preparing texts for a language course, and such programs depend heavily on the word lists that they use to analyze the texts. The better the word lists, the better the quality of analysis. Adapting the vocabulary of texts (sometimes called simplifying texts) can result in a considerable reduction of the number of unknown non-repeated words in the text.

Learning is helped through repetition, especially if that repetition involves retrieval, varied use or elaboration (see Webb & Nation, 2016 forthcoming for a review). Computer programs that use word lists can identify non-repeated words so that they can be given special attention to support their learning if they deserve it. Good vocabulary course design gives attention to the most useful words first, largely excluding words outside the high frequency lists. When these words are well known, learners can specialize in their vocabulary learning if they are using English as a medium for study, or can proceed to the mid-frequency words.
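The idealized patterning of Table 1.1 can be reproduced in a few lines of Python. This is a sketch of the arithmetic only; the constant 700 comes from the ten-thousand-token example, not from any real corpus.

```python
# Idealized Zipf's law: frequency = constant / rank, so rank * frequency
# stays (almost) constant as we go down a frequency-ranked list.
CONSTANT = 700  # the constant for the 10,000-token example in Table 1.1

def zipf_frequency(rank: int, constant: int = CONSTANT) -> int:
    """Expected frequency of the word at the given rank."""
    return round(constant / rank)

for rank in (1, 2, 3, 4, 5, 6, 7):
    freq = zipf_frequency(rank)
    print(rank, freq, rank * freq)  # rank * freq: 700, 700, 699, 700, ...
```

The small wobbles (699, 702) arise only from rounding frequencies to whole numbers, just as in the table.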


Language teaching and learning

We have looked at frequency of occurrence, but in addition to frequency there are other measures of usefulness: range and dispersion. Range refers to the number of different texts a word occurs in. A word may be very frequent in one text but not occur at all in other texts; such a word would be said to have a very narrow range. Dispersion measures how evenly the occurrences of a word are spread across different texts. The word the, for example, occurs with about equal frequency across a range of different texts and so would have a high dispersion score. On the other hand, a word like subtract is likely to occur in only certain kinds of texts and so would have a low dispersion score.

Word lists based on range, frequency of occurrence and dispersion are excellent guides for choosing words for the systematic teaching and learning of vocabulary. They are also essential for the preparation of extensive listening and reading materials. The principle of the four strands (Nation, 2007; Nation, 2013; Nation & Yamamoto, 2012) suggests that a well-balanced course has four equal strands of meaning-focused input, meaning-focused output, language-focused learning, and fluency development. The language-focused learning strand includes the deliberate teaching and learning of vocabulary, and frequency-based word lists can act as useful checklists or source lists for such learning. During meaning-focused input, when learners meet unknown words in their listening and reading, the words can be checked against word lists to see if they are frequent enough to be worth learning.
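The difference between range and raw frequency can be sketched as follows; the three-text mini-corpus is invented purely for illustration.

```python
# Range vs frequency on a toy corpus of three "texts".
# range = number of texts the word occurs in; frequency = total occurrences.
def range_and_frequency(word, texts):
    counts = [text.lower().split().count(word) for text in texts]
    word_range = sum(1 for c in counts if c > 0)
    frequency = sum(counts)
    return word_range, frequency

texts = [
    "the cat sat on the mat",
    "the dog ran home",
    "subtract subtract subtract",  # a word frequent in one text only
]
print(range_and_frequency("the", texts))       # (2, 3): wide range
print(range_and_frequency("subtract", texts))  # (1, 3): same frequency, narrow range
```

Here the and subtract have the same total frequency, but their ranges differ, which is exactly why frequency alone is not a sufficient measure of usefulness.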
The ideal reading program or app for foreign language learners would not only allow easy look-up of a word in a choice of dictionaries (bilingual, monolingual or bilingualized), but would also indicate the frequency level of the word in the language as well as its frequency in the current text, so that learners can make informed decisions about whether to add it to their word cards or flash card program. Such a program could also indicate whether the word has been looked up before by the same learner. Frequency-based word lists are essential for such a program.

Graded readers are truly excellent resources for incidental language learning. The preparation of such readers requires the development of a set of well-designed word lists and ideally the use of a program such as the Online Graded Text Editor at www.er-central.com/OGTE or AntWordProfiler, which instantly compares a text to word lists. Each publisher of graded readers uses their own word lists, which is unfortunate because it makes the building and organization of a graded reader library more complicated. These word lists are often jealously guarded as sensitive commercial materials, which makes it difficult for teachers and researchers to make informed decisions about the quality and equivalence of the lists. The Extensive Reading Foundation graded reader scale is an excellent attempt to make sense of these various graded reader schemes by relating them in one carefully




integrated scheme (http://erfoundation.org/wordpress/graded-readers/erf-graded-reader-scale/). When reading a graded reader, a learner can be confident that every word they meet will be a useful word to learn.
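The core of the text-to-word-list comparison that such profiling programs perform can be reduced to a coverage calculation. The sketch below (with an invented ten-word "list" and sentence) only shows the idea; tools like AntWordProfiler work against full word-family lists and handle real tokenization.

```python
# Text coverage: the proportion of running words (tokens) in a text
# that appear in a given word list.
def coverage(text, word_list):
    tokens = text.lower().split()
    known = sum(1 for token in tokens if token in word_list)
    return known / len(tokens)

high_freq = {"the", "a", "of", "and", "to", "in", "it", "was", "on", "is"}
text = "the cat sat on the mat and it was happy"
print(f"{coverage(text, high_freq):.0%}")  # 6 of 10 tokens are on the list
```

A grader preparing a text would aim for very high coverage by the target list, flagging the residue (here cat, sat, mat, happy) for glossing, replacement or deliberate teaching.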


Specialized vocabulary

Each area of academic specialization and each area of commercial specialization has its own vocabulary, which is closely related to the ideas involved in that area. One of the most obvious and largest areas is the vocabulary of medicine. Learning to be a doctor includes learning thousands of medical words that are essential to mastering the knowledge of the field. Subject areas like botany and chemistry also involve large numbers of technical words, and trades like carpentry and building have their own specialist vocabulary.

Specialist vocabularies are made up of two kinds of words: those that are commonly known by people who are not specialists in the field, such as heart, blood, muscle, influenza, and those that are typically only known by specialists, such as xiphoid, intercostal, arrhythmia. Some specialist areas have a very large proportion of words known only by specialists, making them difficult areas to study.

In recent years there has been a strong interest in developing specialist word lists for various subject areas to see how large these lists can be, what role specialist vocabulary plays in specialist texts, and how learners studying in those areas can be supported in their vocabulary learning and the development of their specialist knowledge. There has also been strong interest in the vocabulary of broader areas such as academic study (Coxhead, 2000), science (Coxhead & Hirsh, 2007), children's language (Zeno, Ivens, Millard and Duvvuri, 1995), and spoken language (as compared to written language) (Adolphs & Schmitt, 2003).

Language testing

The results of vocabulary size tests and tests of particular levels of vocabulary are very useful in setting learning goals and in monitoring learners' progress. The construction of such tests relies heavily on the existence of well-made vocabulary lists. Typically the early stages of test construction involve taking representative samples of words from well-designed lists so that knowledge of the lists can be measured. The very early studies of vocabulary size sampled from dictionaries rather than from word lists, and this resulted in very faulty estimates of vocabulary size through heavily biased sampling (Thorndike, 1924; Nation, 1993). Because high frequency words occupy relatively more space in a dictionary than low frequency words, there tended to be more high frequency words than there should be in the sample.


As high frequency words are much more likely to be known than low frequency words, the tests based on the samples were easier than they should have been and gave wildly inflated results (see for example Diller, 1978). The development of better tests depended on the development of substantial word frequency lists. Word lists in their turn depend on having a collection of texts (a corpus) that can be analyzed to produce lists with range and frequency data. The nature of such a corpus needs to strongly reflect the learners’ goals for using the language. There is no sense, for example, in using a written corpus to make a list to guide the development of courses in speaking. The following section looks at factors that need to be taken into account when choosing a corpus to develop lists.
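One way to take representative samples from a frequency-ranked list, avoiding the dictionary-sampling bias described above, is to divide the list into equal frequency bands and draw one item from each. This is a hypothetical illustration of the idea, not the procedure used for any actual test.

```python
# Band sampling from a frequency-ranked word list: one test item per band,
# so every frequency level is equally represented in the sample.
import random

def sample_by_band(ranked_words, n_items, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    band_size = len(ranked_words) // n_items
    return [
        rng.choice(ranked_words[i * band_size:(i + 1) * band_size])
        for i in range(n_items)
    ]

ranked = [f"word{i:04d}" for i in range(1000)]  # stand-in 1000-word list
items = sample_by_band(ranked, n_items=10)
print(len(items))  # one sampled word per 100-word band
```

Scores on such a sample can then be scaled back up: if a learner knows 7 of 10 items, they are estimated to know roughly 700 of the 1000 words.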

Factors affecting the making of word lists

The distinction between receptive knowledge and productive knowledge is central to the development of word lists. This is because the unit of counting words (see Chapter 2) needs to suit the reason for counting. The unit of counting needs to represent the kind of word knowledge needed by the end-users of the list. If the learners' word knowledge for listening and reading (receptive purposes) is to be measured, then the unit of counting needs to be at least the lemma or some more inclusive level of word family (see Chapter 2), with the level of word family membership being determined by the amount of morphological knowledge learners are expected to have. This is because when we read or listen, we can do "morphological problem-solving" (Anglin, 1993). That is, even if we have never met the word form businesslike before, if we know business and know some words with -like (childlike) or even the preposition like, we can work out what businesslike means, perhaps with a little guidance from context.

However, when we use language for productive purposes, for speaking or writing, we need to know how to produce the appropriate forms of words so that they are suited to what we want to say. Particularly for the derived forms of words, knowing the stem of a word and how to use it is no guarantee that we can successfully use the same stem with an appropriate prefix or suffix. Productive knowledge is more complicated than receptive knowledge, and this is reflected in higher scores on tests of receptive knowledge than on tests of productive knowledge where the item format is similar in both tests (Waring, 1997; Webb, 2008).

The spoken/written modality is also critical when making word lists. This is because the spoken use of language is affected by the here-and-now nature of speaking and by the strong tendency for most speaking to be interactive and somewhat informal, repetitious and incomplete.
As a result, the range of vocabulary used in speaking is typically smaller than that used in writing (and of course met in


reading). The spoken/written distinction affects not only the richness of vocabulary but also the actual words that frequently occur. The most frequent word in a written corpus is invariably the. In a spoken corpus, I and you are usually at the top of the frequency list. There are also words such as alright, well, yes, hello and nope that are typical of speaking rather than writing.

Word lists made from collections of texts are also strongly affected by the variety and nature of the topics covered. Texts covering a wide variety of topics result in a much longer list of different words than texts on a single topic or within a single topic area such as economics or medicine (Sutarsyah, Nation & Kennedy, 1994; Quero, 2015). The difference can be quite striking, with twice as many different words in the diverse corpus even though the diverse corpus and the homogeneous corpus are the same length. In addition, in a homogeneous collection of texts, specialist vocabulary makes up a very large proportion of the running words of the texts. A homogeneous corpus may contain texts on the same topic, or texts of the same type such as informal speech, narrative, or general writing.

The age of the people for whom the lists are intended should also have an effect on the corpus used to make the lists and then, of course, on the resulting lists (Macalister, 1999). Young native speakers of the language talk and read about different things than adults do. A good corpus should reflect this.

Another factor affecting the nature of the corpus is whether the focus is on high frequency words, mid-frequency words or low frequency words. This can affect the size of the corpus needed, with a relatively small corpus being sufficient for making a stable list of high frequency words and academic words (Brysbaert & New, 2009; Coxhead, 2000). In Chapter 9 we look more closely at these various factors.
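The effect of topic variety on the number of different words can be shown in miniature. The two ten-token "corpora" below are invented for illustration; real comparisons of this kind use corpora of millions of tokens.

```python
# Two equal-length toy corpora: one stays on a single topic, one ranges
# across topics. The diverse corpus yields more types (different words).
def type_count(corpus):
    return len(set(corpus.lower().split()))

homogeneous = "the heart pumps blood the blood flows to the heart"
diverse = "the heart pumps blood while busy markets trade shares abroad"

print(type_count(homogeneous))  # 6 types in 10 tokens
print(type_count(diverse))      # 10 types in 10 tokens
```

The homogeneous text repeats its topic words (heart, blood), so at the same length it contributes fewer different words to a list, which is the pattern reported for single-topic corpora above.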

Influential word lists

The early history of word lists for language teaching is well covered by Fries and Traver (1950) in their book English Word Lists: A Study of their Adaptability for Instruction. They give particular attention to Basic English, and the work of Harold Palmer and Michael West. It is clear that researchers in the early 1900s were very aware of the unequal frequencies of words, though it was not until the 1930s that Zipf (1935) described the pattern behind word frequency distributions in what is now called Zipf's law.

The early counts showed a growing awareness of the effect of the nature and size of the corpus on the words occurring in a count. Some early counts included inflected forms as a part of the headword rather than just counting word types, and in 1917 Palmer distinguished word families (monologemes) from word types (monologs) and multiword units (pliologs).


Thorndike (1921) is credited with being the first to add the criterion of range (the number of different texts or sub-corpora that a word occurs in) to frequency counts, and his 10,000 (1921), 20,000 (1932), and then 30,000 (Thorndike & Lorge, 1944) word lists were a monumental achievement. His unit of counting was the headword, which included inflected forms, although adverbs ending in -ly were also included with the headword. The list was intended for use with native speakers of English at school as a part of the efficiency reforms in education, but it also provided a resource for the work of Palmer and West and others interested in the learning of English as a foreign language. It also provided a starting point for Campion and Elley's (1971) study of academic vocabulary, and was one of the sources for the original Vocabulary Levels Test (Nation, 1983; Schmitt, Schmitt & Clapham, 2001).

Palmer (1931), Bongers (1947), and West (1953) each developed lists of high frequency words for language teaching. West's A General Service List of English Words (GSL) contains around 2000 words with their frequencies and the relative frequencies of their different senses. It still remains the one that others try to improve on. Gilner (2011) provides an excellent introduction to this list and its history. One of the most notable methodological features of the list is its use of both objective criteria (word frequency and range) and subjective criteria (ease or difficulty of learning, necessity, overlap with words already in the list, style, and intensive and emotional words). The use of subjective criteria meant that the list contains some low frequency words such as accuse, ache, aeroplane, alike, aloud, annoy whose GSL frequencies (below 200 per 5 million tokens) are well below a high frequency word list frequency cut-off point. West considered they were needed for producing material within a limited vocabulary.

The subjective criteria were often used to exclude words that frequency and range may otherwise have included. For example, emotional words and formal, literary or very colloquial words were not considered necessary in a basic vocabulary largely intended for learners in the early stages of learning to read English as a foreign language. The better the corpus, the less the need for subjective criteria.

A major breakthrough in the construction of word lists came with the availability of computing resources around the 1970s. Before that, all counting was done manually, and it is astonishing that counts such as Thorndike's were ever done. The work which today takes a few minutes must have taken years of tedious labor.

The next breakthrough for English word lists was the availability of digital corpora, the first being a one million token corpus of American English, the Brown Corpus (Kučera & Francis, 1967), and word lists based on them. The lists from the Brown corpus (Francis & Kučera, 1982) and the parallel corpus for British English, the LOB corpus (Johansson & Hofland, 1989), became the sources of word frequency data for a range of pieces of research.


The construction of substantial word family lists began by using the General Service List and then the University Word List (Xue & Nation, 1984), and these were used in early versions of the Range program for frequency counts and text coverage studies. The next step in word lists came with the construction of word family lists (Nation, 2004) based on the British National Corpus. The construction of such lists was also helped by computing power, although initially the word families were made manually, and there is always the need for manual checking.

There have been several attempts to replace West's General Service List (Nation, 2014; Browne, 2014; Brezina & Gablasova, 2015), and, as we shall see later in this book, most of the attempts have failed to tackle the main issues of deciding on the purpose of such a list and then using a corpus that suitably represents that purpose. Such general service lists are very attractive to publishers who wish to have the best list for their course books and graded readers. The possibility of making such an all-purpose list is rather remote given the wide range of age groups and purposes for which such a list could be used. A list for young children, for example, would not be suitable for adults in their first year of university study.

Some attempts to make a general service list or an academic list have involved combining several existing lists. Hindmarsh (1980) combined various lists including West's General Service List to make a substantial general purpose list. The University Word List (Xue & Nation, 1984) was a combination of lists made by Praninskas (1972), Campion and Elley (1971), Lynn (1973) and Ghadessy (1979). Such lists did not directly involve creating a corpus, and while they combined the strengths of the lists they were based on, they also included their weaknesses and a mixing of different criteria.

Coxhead's (2000) Academic Word List of 570 words, however, was strongly corpus-based. She clearly defined the purpose of the list as being for students about to begin university study in an English-speaking country, with the primary purpose of reading academic texts. It assumed that the students were already familiar with the words in the General Service List and built on that. The General Service List was assumed to be known because learners on pre-university courses in English-speaking countries already have a reasonable level of English. For learners in non-English speaking countries going on to do university study largely in their first language but also with an English requirement, the high frequency words may still be poorly known, and thus the Academic Word List may be too big a step, at least initially. Because the list was intended for students from a wide variety of subject majors all studying on the same pre-university course, the corpus from which the list was made had to represent this variety.

Coxhead's (2000) list was called a "new" academic word list because it replaced the University Word List. There are new "new academic word lists" (see Gardner &


Davies, 2014, and the list at http://www.newacademicwordlist.org/) which we can judge by using the criteria suggested in this book.

The next step from an academic word list is more specialized word lists for broad faculty areas like science, commerce or humanities. Coxhead and Hirsh's (2007) science-specific list is an example of such a list. A further step, and the one most commonly taken, is to make a word list for a specific discipline area, such as medicine, engineering or applied linguistics. There are two reasons for making such lists. One reason has the goal of examining the size and nature of subject-specific vocabularies in general (Chung & Nation, 2003, 2004) and how they can be identified. The findings of such research are that technical vocabulary makes up a very substantial proportion of the words on a page in any technical text, and technical vocabularies can range in size from one or two thousand words to tens of thousands in a subject area such as medicine (Quero, 2015). The second reason is to make a list of subject-specific words that could be used in preparing learners for study in that area (Salager, 1983), or for testing vocabulary knowledge of that area, although no one seems to have done that. There are lists for

business (Hsu, 2011)
medicine (Quero, 2015; Hsu, 2013; Wang, Liang & Ge, 2008)
agriculture (Martinez, Beck, & Panza, 2009)
applied linguistics (Vongpumivitch, Huang, & Chang, 2009)
engineering (Hsu, 2014; Ward, 2009)
nursing (Yang, 2015)
pharmacology (Fraser, 2007)

A methodology for making such lists is described in Chapter 14 of this book.

So far, we have looked at general service lists, academic lists, and technical vocabulary lists. These are all aimed at learners of English as a second or foreign language. Lists for graded reading schemes for learners of English as a foreign language should also be based on careful research, but the lists are not readily available for inspection. It was not always like this. Longman (Longman Structural Readers Handbook 2nd ed., 1976; O'Neill, 1987) and then Collins (A Guide to the Collins English Library, 1978) actually published their lists in booklet form. The extreme reluctance of some publishers to make their lists available may be because the lists are not very carefully followed when writing and editing the readers. The published research on graded reader lists (Wan-a-rom, 2008; Nation & Wang, 1999) has had little to do with the preparation of such lists, although research in this clearly defined area would be of great benefit to graded reading schemes in having sensibly sized steps in the scheme, making sure the lists covered all the high frequency


vocabulary, and determining the point where graded reading schemes end and unsimplified reading replaces them.

In some countries where English is taught as a foreign language, the Ministry of Education produces word lists to guide the teaching of English. For example, Monbukagakusho in Japan used to produce such a list (see Bowles, 2001a, 2001b for critiques). Such lists have official status but their origins are not clear. It would be helpful if they were accompanied by a description of how they were made, or at least some evaluation of them.

Word lists have been used as the basis for vocabulary test construction, both for native speakers and non-native speakers. Thorndike used his Teacher's Word Book (1921, 1932) to make vocabulary tests for school students who were native speakers of English. Nation (Nation & Beglar, 2007) and Coxhead (Coxhead, Nation & Sim, 2014) used the BNC/COCA word family lists to make 140-item and 100-item tests to measure receptive knowledge of the 14,000 and 20,000 BNC/COCA word families. These tests were used with both native speakers (Coxhead, Nation & Sim, 2015) and non-native speakers (Nation & Beglar, 2007). McLean, Kramer and Beglar (2015) have used lists to make the Listening Vocabulary Levels Test (LVLT), and McLean and Kramer (2015) used lists to make the New Vocabulary Levels Test (NVLT).

There is a very different kind of word list from the ones described above, and Basic English is the most famous example of this type (Ogden, 1932). This is a word list that is not based on frequency of occurrence but on the ability of the words in the list to cover the important ideas expressed by a language and to combine well with each other to say everything that needs to be said. Basic English contains only 850 word families, but these are enough to hold coherent conversations and to write books on challenging topics. The words in Basic English are all high frequency words, but there are many high frequency words that are not in Basic English. Michael West's definition vocabulary, his minimum adequate vocabulary for speaking, and his general service list were all a little like Basic English in that they were lists that were shaped by what they could do. They were however created in a very different way. Anyone interested in word lists should read one of the books describing Basic English and its development (Ogden, 1932; Richards, 1943).

The preceding brief survey shows that researchers have been involved in making a wide variety of word lists. The rapid growth in computing power and the increasing availability of large corpora have made it possible to create such lists with ease. This deceptively simple task, however, is in fact one requiring much careful planning and numerous critical decisions. It is the purpose of this book to describe what is involved in making a good list and to provide recommendations to guide such research.


section ii

Deciding what to count as words


Obviously, word lists are lists of words, but deciding what is counted as a word involves several complicated decisions that will affect the quality and usefulness of a word list. We need to decide what forms in a corpus are counted as words. For example, are numbers like 626 and 1,000 counted as words? Are forms containing a mixture of letters and numbers counted as words, for example, U2, A1, R2D2, 10BX? Are exclamations like aaarrgh, eeek, and ooof, which do not have a commonly accepted spelling and which are not in spellcheckers, counted as words? Are hesitations like um, er, and ah counted as words even though they may not occur in a particular dictionary? Once the distinctions between words and non-words are decided, what happens to non-words? Are they deleted from the corpus or are they counted but assigned to their various categories?

We also need to decide what distinguishes one word form from another. Are words separated by blank spaces, or can punctuation such as a hyphen also act as a word separator, so that long-term would be counted as two separate words? Such punctuation could include hyphens, apostrophes, and back or forward slashes. Does capitalization result in different words that otherwise have the same spelling – An and an? In the following chapters in this section, all these questions and many additional ones will be addressed, but first we will look at how others have dealt with deciding what is counted as a word. Gardner (2007) has a very useful discussion, raising the issues of morphological knowledge, homonymy and polysemy, and multiword units.

Carroll, Davies and Richman (1971) carried out a corpus-based word frequency count for a version of the American Heritage Dictionary for use at school. They took the simplest approach, counting word forms with any change of form, even capitalization of the initial letter of a word or of the whole word, being enough to consider the form as a different word. They also counted hyphens and apostrophes as part of a word. So, mother, Mother, MOTHER, mothers, mother's, mother-land and so on would all be counted as different words. They also made no distinctions between proper names and other words, between non-words and words, or between marginal words and other words. Everything was counted without categorization. Their justification was that they were interested in what children of various ages meet when they read, so every word form was of interest.


Nagy and Anderson's (1984) study of part of the Carroll, Davies and Richman (1971) count is a very carefully thought out and explicitly described analysis of what can be counted as a word for the purposes of measuring receptive vocabulary size. The goal of Nagy and Anderson's study was to work out how many words occur in printed school English. Their data was the word list presented in the Carroll, Davies and Richman (1971) Word Frequency Book, which came from a count of over five million tokens of text representing a balanced range of school grade levels and a balanced range of school subject areas. Nagy and Anderson had to arrange the word types in the count into what would be considered as "distinct words". This involved making word families and setting up categories of items that would not be considered as words for the purpose of calculating vocabulary size. The categories not normally counted as words included formulae and numbers, compounds containing numbers, non-words and foreign words. Proper names were counted as a separate category and included derived and inflected forms of proper names. While proper names were not counted as "basic words", Nagy and Anderson estimated that there would be at least a thousand that could be considered as requiring previous knowledge.

Nagy and Anderson used a six-point rating scale based on semantic relatedness: "Assuming that a child knew the meaning of an immediate ancestor, but not the meaning of the target word, to what extent would the child be able to determine the meaning of the target word when encountering it in context while reading?" (p. 310). Here are some of their examples at the six levels, with Level 0 being the most transparent relationship and Level 5 the least obvious. The major dividing line was drawn between Level 2 and Level 3. Note that they include both derivational affixes and compound words.

Table 1.  Nagy and Anderson's relatedness levels

          Target word   Immediate ancestor
Level 0   cleverness    clever
Level 1   enthusiast    enthusiasm
Level 2   gunner        gun
Level 3   password      pass
Level 4   saucer        sauce
Level 5   peppermint    pepper

Cleverness is classified at Level 0 because it is easy to see the connection with its known immediate ancestor clever. Peppermint is at Level 5 because its connection with its immediate ancestor pepper is not easy to see. Although the BNC/COCA word family lists on Paul Nation’s web site (http://www.victoria.ac.nz/lals/staff/paulnation.aspx) are based primarily on affixation (it is the form of the words that makes them eligible for consideration as a family member), the semantic relationship


between the family members is also an essential requirement of family membership. It is worth noting that the best known member of a family is not always the headword or the stem form. Computer is much more frequent and better known than compute. Education is more frequent than educate. This does not upset the idea of "immediate ancestor" or word families, but just reminds us that knowledge of any member or members of the family can help in coping with the other members when reading or listening.

One of the very useful features of Nagy and Anderson's study is that as well as carefully distinguishing various categories of what will and will not be counted as words, they provide some indication of the number of items in each category. This means that if we disagree with any one of their decisions, we can see what effect it has on the resulting list. This is an argument in favor of having a reasonable number of sub-categories and sub-lists. It is much easier to combine information if decisions change than to re-classify data if further distinctions need to be made.

There are three kinds of considerations when setting up different categories of words. Firstly, are the categories qualitatively different from each other? For example, are proper names different kinds of words from other words? Are acronyms different kinds of words from multiword units? Secondly, do the qualitative differences relate to the reasons for doing the word count? For example, if the word count is to set up lists to guide language learning, the different kinds of lists should relate to different learning demands. Similarly, if the lists are set up to see how much vocabulary needs to be learned, the different kinds of lists should distinguish words that require learning from those that do not. If the word count is set up to provide a source for sampling for vocabulary tests, the different kinds of lists should usefully divide what words need to be tested from those that would not be included in a test. Thirdly, when considering what categories of words to distinguish, the number of words in each category and the frequency of the words need to be considered. If a category contains only a small number of items, none of which are very frequent, there may be little value in making such a category. On the other hand, a category containing a few very frequent items, such as the transparent compounds category or the acronyms category, may be worth distinguishing because of the effect of the category on text coverage figures.

Each of the chapters in this section looks at a different category of words, attempting to describe the words in the category, looking at the justifications and difficulties involved in distinguishing such words, suggesting resources for finding items for the category, and looking at the number of items and the frequency of items in the category.

Figure 1 is a very short collection of items that can be used with a counting program or with a set of wordlists to see what is counted as a different word. To do so, each group in Figure 1 needs to be run through a counting program such as Range, AntWordProfiler, or VocabProfiler to see how many words are counted


in each group. Alternatively, each group in Figure 1 can be looked up in a set of wordlists to see what distinctions are made. For example, are Book, book, BOOK counted as one, two or three different words? Are Brown (the family name) and brown (the colour) counted as different words because of capitalization? Does an apostrophe or hyphen separate words so that grand-uncle is counted as two words?


Capitalization
Book  book  BOOK  Brown  brown

Word separators
uncle  uncle's  grand-uncle  either/or  friend(s)

Numbers
2  22  U2  2U  2,200

Unit of counting (one family, six lemmas, or eight types?)
ABLE  ABILITIES  ABILITY  ABLER  ABLEST  ABLY  INABILITY  UNABLE
(What level of family is used?)

Figure 1.  Groupings of items to see what decisions have been made about what counts as a word
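To make these decisions concrete, here is a minimal sketch in Python (not the Range, AntWordProfiler, or VocabProfiler programs themselves) that counts the Capitalization and Word separators groups from Figure 1 under three hypothetical normalization policies. The function and policy names are invented for this illustration.

```python
import re

# Items from the Capitalization and Word separators groups of Figure 1.
ITEMS = ["Book", "book", "BOOK", "Brown", "brown",
         "uncle", "uncle's", "grand-uncle", "either/or", "friend(s)"]

def count_types(items, lowercase=False, split_punct=False):
    """Count distinct word types under a given normalization policy."""
    types = set()
    for item in items:
        # Optionally treat hyphens, apostrophes, slashes and brackets
        # as word separators rather than as part of the word.
        tokens = re.split(r"[-'/()]", item) if split_punct else [item]
        for tok in tokens:
            if tok:
                types.add(tok.lower() if lowercase else tok)
    return len(types)

# Carroll, Davies and Richman (1971): any change of form, even
# capitalization, makes a different word.
print(count_types(ITEMS))                                    # 10 types
# Zeno et al. (1995): capitalization ignored, punctuation kept in the word.
print(count_types(ITEMS, lowercase=True))                    # 7 types
# A policy that also splits on internal punctuation.
print(count_types(ITEMS, lowercase=True, split_punct=True))  # 8 types
```

The same ten items yield three different totals, which is exactly why a published list needs to state its counting policy.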

Cut-off points

In several of the following chapters, we need to make decisions about whether a distinction between items will affect their ranking in a frequency-ranked word list. For example, in Chapter 3, when deciding if the presence of homonyms affects the inclusion of a word amongst the high frequency words, it is useful to know where the frequency cut-off point is for the high frequency words. That is, what is the frequency of the least frequent word where we draw the line between high frequency and mid-frequency words?

If we take the most frequent word families in the British National Corpus and simply rank them by frequency, the word family cut-off points for each 1000 level based on the whole British National Corpus (98,099,501 tokens) are as shown in Table 2. The frequency given in row 1 of Table 2 is for the 1000th word family (element), which occurs 11,918 times in the whole BNC. This converts to around 120 occurrences per one million tokens. So all of the most frequent 1000 word families have a frequency of 11,918 occurrences or higher. Thus we can say that to get into a list of the most frequent 1000 families, a word will have to occur at least 11,918 times. Note that these are not the cut-off points for the BNC/COCA lists, because other criteria besides ranked family frequency were used in making the lists.
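The conversion from raw BNC counts to per-million figures used throughout this section is simple proportional arithmetic. The following sketch (the function name is mine, not from the book) reproduces it for the first three cut-off values; note that exact rounding gives 121 and 46 where the text speaks of "around 120" and "around 45".

```python
# Whole-BNC token total as given in the text.
BNC_TOKENS = 98_099_501

def per_million(raw_count, corpus_tokens=BNC_TOKENS):
    """Convert a raw corpus frequency to occurrences per million tokens."""
    return raw_count * 1_000_000 / corpus_tokens

print(round(per_million(11_918)))  # 1000th family: 121 ("around 120")
print(round(per_million(4_465)))   # 2000th family: 46 ("around 45")
print(round(per_million(2_098)))   # 3000th family: 21
```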




Table 2.  Cut-off frequencies from the whole British National Corpus for each 1000 word family level

            Lowest frequency family   Rounded frequency per million
1st 1000    11,918                    120
2nd 1000     4,465                     45
3rd 1000     2,098                     21
4th 1000     1,194                     12
5th 1000       741                      7
6th 1000       490                      5
7th 1000       352                      4
8th 1000       253                      3
9th 1000       185                      2
10th 1000      137                      1
11th 1000      102                      1
12th 1000       76                       .8
13th 1000       54                       .5
14th 1000       36                       .4

So, in Table 2, the 2000th ranked item has a frequency of 4,465 occurrences. Thus, the word frequency cut-off point for the first 2000 word families is 4,465 occurrences (which rounds off to around 45 per million), and for the first 3000 it is around 2,100 occurrences (21 per million) for the whole BNC. Thus to get into the top 2000 word families, a word would have to occur at least 45 times per million words, and to get into the top 3000 word families, a word would have to occur at least 21 times per million words. Table 3 has the cut-off points for word types.

Table 3.  Cut-off frequencies from the whole British National Corpus for each 1000 word type level

            Lowest frequency word type   Rounded frequency per million
1st 1000    10,241                       103
2nd 1000     4,953                        50
3rd 1000     3,077                        31
4th 1000     2,073                        21
5th 1000     1,502                        15
6th 1000     1,153                        12
7th 1000       902                         9
8th 1000       732                         7
9th 1000       604                         6
10th 1000      507                         5
11th 1000      429                         4
12th 1000      368                         4
13th 1000      320                         3
14th 1000      281                         3

As a comparison of Tables 2 and 3 shows, the cut-off frequency for the 1st 1000 is lower for word types than for word families, because the families in the 1st 1000 have many frequent family members. From the 2nd 1000 on, however, the word type cut-off frequencies are higher than the family cut-off frequencies. This is because some families are made up of several very high frequency members, whereas the types are single items, thus increasing the number of high frequency items.

The BNC/COCA lists are not nearly so neat. In the BNC/COCA first 3000, there are 186 word families with a frequency less than 2000 (13 in the 1st 1000, 144 in the 2nd 1000, and 44 in the 3rd 1000). This messiness is because a variety of criteria besides frequency were used to order the BNC/COCA lists. The reasons for including lower frequency words in the BNC/COCA lists are largely because the British National Corpus does not fully reflect the needs of the people who will be affected by the lists. These include learners of English who have a strong need for spoken language (only 10% of the British National Corpus is spoken text), informal language, colloquial language (less than half of the spoken component of the British National Corpus is informal spoken language), language used by young children (there is very little of this in the BNC), and up-to-date language (quite a lot of the British National Corpus has not been affected by the internet). Here are some examples of these words from the BNC/COCA lists that would not have got into the BNC/COCA 3000 high frequency words solely on the basis of frequency.

Informal spoken language (horrible, darling, goodbye, hurry)
Modern words (internet, web, email)
Children's language (thirsty, naughty, silly, rabbit, orange)
Members of lexical sets (thirteen, orange, Thursday, autumn)
Survival vocabulary (delicious, hungry, thirsty, excuse, goodbye)
US/UK inclusion (autumn, rubbish)

Note that more than one reason can apply to the same word (thirsty, hurry, thirteen). The 1st 1000 and 2nd 1000 of the BNC/COCA lists are largely informal words characteristic of spoken vocabulary. The 1st 1000 is also focused on survival vocabulary and young children. The 3rd 1000 is much more written, formal, adult, and academic.

In this book, however, we will regard the frequency cut-off point of around 2,000 occurrences (as shown in row 3 of Table 2) in the 100 million word British National Corpus (or 20 per million) as the typical cut-off point for the 3000 high frequency words.

chapter 2

Types, lemmas, and word families


The unit of counting

When deciding on a unit of counting (word type, lemma, word family and level of inclusiveness of the word family), we need to look at the reasons why the counting is being done and who the lists will be used with (Gardner, 2007; Schmitt, 2010). If the counting is being done to see how much or what needs to be learned, then the unit of counting needs to relate to the learning required. Does each new word require new learning? This is the major reason why, for receptive use (for listening and for reading), the word family at Level 2 of Bauer and Nation (1993) or higher is typically the most sensible unit of counting. Unfortunately, until recently Level 6 families have been the only ones available. When families at the right level are chosen, learners with a knowledge of the major morphological word building devices of English should be able to figure out, with the help of context clues, what a derived form of a known stem is likely to mean in a written or spoken context. When counting for the purposes of production (for speaking and for writing), the word type (ignoring capitalization) or perhaps the lemma may be the most sensible unit of counting, because a different word form often requires different collocates and different grammar. We will look at the choice of counting unit more closely later in this chapter.

The unit of counting makes a big difference to the number of items. Using the lists made by Dang and Webb (see Chapter 15 this volume), the BNC/COCA first 3000 families contain the number of lemmas and types shown in Table 2.1. Note that the lemmas do not use the strict part of speech criterion, so the lemma for walk, for example, includes its verb, noun and adjectival uses. The approximate ratio for the 3000 high frequency word families is 1 : 3 : 6 for families, lemmas and word types. That is, for the high frequency words, which typically have more family members than the lower frequency words, an average family at Level 6 will consist of around three lemmas and have a total of over six word types.

Table 2.1  Number of lemmas and types in each of the BNC/COCA 3000 families

           Level 6 families   Lemmas   Types
1st 1000   1000               3281     6859
2nd 1000   1000               2996     6371
3rd 1000   1000               2855     5880


Counting word types will give numbers of words at least six times larger than counting word families. Lemmas of course will have fewer members per lemma than families: 2 for countable nouns (singular and plural), or 3 if the possessive is included; 4 or 5 for verbs (plain stem, third person singular, past tense, past participle, present participle); and 3 for adjectives which can take -er and -est forms. Using a much more inclusive definition of a word family (each including more family members than the families in the BNC/COCA lists), Brysbaert, Stevens, Mandera and Keuleers (2016) estimated that the average 20 year old native speaker knew "42,000 lemmas derived from 11,200 word families" – a ratio of almost 4 to 1. When counting words and when comparing different studies of vocabulary size, it is important to know what the unit of counting is and whether the same or different units are being used (Reynolds & Wible, 2014).

Let us now look in more detail at what could be counted as separate words, covering types, lemmas and families.
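The difference between the three counting units can be made concrete with the ABLE group from Figure 1. The groupings below are hand-made for illustration, not taken from the BNC/COCA lists; note that this simple grouping yields five lemmas where Figure 1's parenthetical note counts six, itself a reminder that lemma totals depend on exactly which forms a lemma is allowed to absorb.

```python
# The eight word types of the ABLE group from Figure 1.
TYPES = ["able", "abilities", "ability", "abler", "ablest",
         "ably", "inability", "unable"]

# Hand-made lemma grouping: a lemma here is a stem plus its inflectional
# suffixes only (-er/-est for adjectives, plural -s for nouns);
# derivational forms such as ably, inability, unable start new lemmas.
LEMMA_OF = {"able": "able", "abler": "able", "ablest": "able",
            "ability": "ability", "abilities": "ability",
            "ably": "ably", "inability": "inability", "unable": "unable"}

# Word family grouping: all eight forms share the stem able.
FAMILY_OF = {t: "able" for t in TYPES}

print(len(set(TYPES)))               # 8 word types
print(len(set(LEMMA_OF.values())))   # 5 lemmas under this grouping
print(len(set(FAMILY_OF.values())))  # 1 family
```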

Types

When making word lists for receptive purposes, the word type is clearly not a sensible unit of counting, because learners very quickly gain some control of the inflectional affixes of the language. Carroll, Davies and Richman (1971) however, when doing a count of a five million word corpus of school texts for the American Heritage Dictionary, had spelling as one of their main concerns and so not only made word types their unit of counting (different types have different spellings) but also made capitalization a distinguishing feature of types, so that run, Run, and RUN would be counted as three different types. Note how the purpose of their count (particularly spelling and word recognition by children) affected the decision of what to count.

In a later study with similar goals to those of Carroll, Davies and Richman (1971), Zeno, Ivens, Millard and Duvvuri (1995) also used texts likely to be met in educational institutions. "The [Educator's Word Frequency Guide] Corpus was created from 60,527 samples of text obtained from 6,333 textbooks, works of literature and popular works of fiction and non-fiction used in schools and colleges throughout the United States" (Zeno et al., page 1). The corpus was over 17 million tokens long. "The corpus represents much of the textual material that typical students in the United States are likely to encounter in their school years" (p. 2). The count used word types as the unit of counting, and a word type was defined "as a string of characters bounded by spaces" (p. 6). Capitalization was ignored. Numbers and decimal points were counted as characters, so items like 1.25 were counted as a single type. Full stops in people's names were also regarded as characters, so John

Ngawang Trinley (202880) IP: 118.167.28.75 On: Tue, 14 Jan 2020 02:12:15



Chapter 2.  Types, lemmas, and word families

F.Kennedy was counted as two types, john and f.kennedy. Apostrophes, dashes, single quotation marks and hyphens were also counted as characters (except for hyphens at the end of a line), so the count contains items like mark’s, I’m, o’neill, ’olympics, comin’, cross-country, twenty-nine, eight-year-old, xiv, h2o, h2, u.s.owned. This short list of examples shows that the decisions on what to count as characters and types lead to a mixture of the readily acceptable (o’neill, comin’, h2o) and the marginal (’olympics, I’m, eight-year-old, f.kennedy). The Carroll et al. and Zeno et al. frequency counts were based on written text. When spoken text is used and transcribed for counting, the transcription conventions will affect what is counted as the same type (Nelson, 1997). Will the transcription choose American spelling over British? What hyphenation conventions will be followed? How will hesitations like um and er be transcribed? How will variant pronunciations of words like yes (yeah, yeeah, yep, yah) be transcribed? Although the word type is the simplest unit of counting, it still requires careful and consistent definition. Even as a unit for counting for productive purposes, it is probably not as sensible a unit as the lemma because the inflectional suffixes are reasonably regular and are introduced very early in language learning courses.
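As a concrete illustration, a Zeno-style type count can be operationalised in a few lines. The character-class decisions below (ignore capitalization; keep digits, full stops, apostrophes and hyphens as word-internal characters; strip other punctuation) are a sketch of the decisions described above, not Zeno et al.'s actual implementation, and the function name `zeno_types` is invented for this example.

```python
# Sketch of a Zeno et al. style word-type count: a type is a string of
# characters bounded by spaces, with capitalization ignored. The exact
# character-class rules here are illustrative only.
from collections import Counter

def zeno_types(text):
    counts = Counter()
    for token in text.split():
        token = token.lower()  # capitalization is ignored
        # keep letters, digits, full stops, apostrophes and hyphens;
        # strip all other punctuation (commas, quotation marks, etc.)
        token = "".join(ch for ch in token if ch.isalnum() or ch in ".'-")
        # a trailing full stop is sentence punctuation, not part of the type
        token = token.rstrip(".")
        if token:
            counts[token] += 1
    return counts

counts = zeno_types("Senator John F.Kennedy ran 1.25 miles. Comin' home, he ran again.")
# John F.Kennedy yields the two types 'john' and 'f.kennedy';
# 1.25 is a single type; comin' keeps its apostrophe
```

Small changes to these decisions (for example, stripping internal full stops) would change both the number of types and which items count as the same type, which is exactly why such definitions need to be stated explicitly.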

Lemmas

The strict definition of a lemma distinguishes part of speech and includes the stem of the word and its inflectional suffixes. For nouns this should include the possessive, even where it is separated by an apostrophe, but more commonly the possessive is not included. Here are some lemmas from the Leech, Rayson and Wilson (2001) frequency count of the British National Corpus.

| Lemma   | PoS  | Form     | Std freq | Range | Dispersion |
|---------|------|----------|----------|-------|------------|
| rumour  | NoC  | %        | 20       | 100   | 0.93       |
| @       | @    | rumour   | 7        | 98    | 0.93       |
| @       | @    | rumours  | 13       | 98    | 0.92       |
| run     | NoC  | %        | 60       | 100   | 0.94       |
| @       | @    | run      | 47       | 100   | 0.94       |
| @       | @    | runs     | 13       | 97    | 0.86       |
| run     | Verb | %        | 406      | 100   | 0.96       |
| @       | @    | ran      | 87       | 100   | 0.91       |
| @       | @    | run      | 174      | 100   | 0.96       |
| @       | @    | runnin’  | 0        | 12    | 0.63       |
| @       | @    | running  | 104      | 100   | 0.95       |
| @       | @    | runs     | 40       | 99    | 0.93       |
| runner  | NoC  | %        | 15       | 95    | 0.87       |
| @       | @    | runner   | 7        | 90    | 0.89       |
| @       | @    | runners  | 8        | 80    | 0.85       |
| running | Adj  | :        | 14       | 100   | 0.94       |
| running | NoC  | %        | 22       | 100   | 0.95       |
| @       | @    | running  | 22       | 100   | 0.95       |
| @       | @    | runnings | 0        | 8     | 0.58       |
| rural   | Adj  | :        | 63       | 98    | 0.85       |

Note that the same form, run, can occur in different lemmas – as a count noun (NoC), and as a verb. Note also that truncated spellings are part of the lemma (runnin’), as are irregular forms (ran). Lemmas do not distinguish different senses of the word of the same part of speech and do not distinguish homoforms (words with the same form but unrelated meanings) of the same part of speech. The three numbers indicate Frequency per 1 million tokens, Range out of 100 different 1 million token sub-corpora, and Dispersion (also out of 100 but expressed as decimals to save space). Dang and Webb (see Chapter 15 this volume) included different parts of speech in the same lemma, so that for example the noun run was included with the verb run, but a derived form such as runner would be a different lemma. A distinctive term for this type of Level 2 word family is flemma, where the f stands for family (Geoffrey Pinchbeck was the first to use this term, Pinchbeck, 2014). Note that although Bauer and Nation (1993) place lemmas at Level 2 in their scheme (See Table 2.2), these levels can be subdivided with the flemma also occurring in Level 2 (for example as Level 2.5) but being a step further than the lemma towards Level 3. The lemma is probably too restrictive as a unit of counting for receptive purposes except for the very lowest proficiency learners. Thorndike and Lorge (1944) included words ending in -ly as family members in their word count and some of the derivational affixes at Level 3 in the Bauer and Nation (1993) levels such as -er and -ness (see Table 2.2) could be usefully included in the lowest level of word family for receptive purposes. The lemma is probably a sensible unit for productive purposes.
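The difference between the strict lemma and the flemma can be sketched with a toy, hand-tagged version of the run data above. In a real count the stem and part of speech assignments would come from a tagged corpus; the tuples below are illustrative only.

```python
# Toy illustration of the lemma/flemma distinction using hand-tagged
# tokens: (surface form, lemma stem, part of speech).
from collections import Counter

tokens = [
    ("run", "run", "NoC"), ("runs", "run", "NoC"),        # the noun lemma
    ("run", "run", "Verb"), ("ran", "run", "Verb"),
    ("running", "run", "Verb"), ("runs", "run", "Verb"),  # the verb lemma
    ("runner", "runner", "NoC"), ("runners", "runner", "NoC"),
]

# A strict lemma keeps part of speech apart: run/NoC and run/Verb
# are two different lemmas.
lemmas = Counter((stem, pos) for _, stem, pos in tokens)

# A flemma collapses parts of speech: run/NoC and run/Verb merge,
# but the derived form runner still heads a separate unit.
flemmas = Counter(stem for _, stem, _ in tokens)
```

On this toy data the strict count yields three lemmas where the flemma count yields two units, mirroring the way Dang and Webb's flemma merges noun and verb uses of the same form.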

Word families

The unit of counting called word families is described in Bauer and Nation (1993). In that article some general principles and a detailed list of affixes are provided (see Table 2.2). One of the major purposes of this description was to enable the construction of frequency-ordered word family lists (the BNC/COCA lists), which could then be used for computerized text analysis using programs like Range and AntWordProfiler, and could also be used for the construction of vocabulary size tests so that the biases involved in dictionary-based sampling could be avoided. The lists can also be used as a basis for language course design, and for graded reader schemes.

Table 2.2  Summary of the Bauer and Nation (1993) levels

| Level | Description | Cumulative |
|-------|-------------|------------|
| Level 1 | A different form is a different word. Capitalization is ignored. | 1 |
| Level 2 | Regularly inflected words are part of the same family. The inflectional categories are – plural; third person singular present tense; past tense; past participle; -ing; comparative; superlative; possessive (8 affixes). | 9 |
| Level 3 | -able, -er, -ish, -less, -ly, -ness, -th, -y, non-, un-, all with restricted uses (10 affixes). | 19 |
| Level 4 | -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize, -ment, -ous, in-, all with restricted uses (11 affixes). | 30 |
| Level 5 | -age (leakage), -al (arrival), -ally (idiotically), -an (American), -ance (clearance), -ant (consultant), -ary (revolutionary), -atory (confirmatory), -dom (kingdom; officialdom), -eer (black marketeer), -en (wooden), -en (widen), -ence (emergence), -ent (absorbent), -ery (bakery; trickery), -ese (Japanese; officialese), -esque (picturesque), -ette (usherette; roomette), -hood (childhood), -i (Israeli), -ian (phonetician; Johnsonian), -ite (Paisleyite; also chemical meaning), -let (coverlet), -ling (duckling), -ly (leisurely), -most (topmost), -ory (contradictory), -ship (studentship), -ward (homeward), -ways (crossways), -wise (endwise; discussion-wise), anti- (anti-inflation), ante- (anteroom), arch- (archbishop), bi- (biplane), circum- (circumnavigate), counter- (counter-attack), en- (encage; enslave), ex- (ex-president), fore- (forename), hyper- (hyperactive), inter- (inter-African, interweave), mid- (mid-week), mis- (misfit), neo- (neo-colonialism), post- (postdate), pro- (pro-British), semi- (semi-automatic), sub- (subclassify; subterranean), un- (untie; unburden) (50 affixes). | 80 |
| Level 6 | -able, -ee, -ic, -ify, -ion, -ist, -ition, -ive, -th, -y, pre-, re- (12 affixes). | 92 |
| Level 7 | Classical roots and affixes | |

Levels 2 to 6 include 91 affixes. Note that in Table 2.2 the total number of affixes at a particular level is given in brackets, and the cumulative number of family members available at a particular level is given in column 3 with the addition of the stem form. Note also that if word families are defined as being at Level 4, then a total of 29 affixes plus the stem is permitted, because the affixes at the previous levels are also included. So, the grand total at Level 4 would be 8 from Level 2, plus 10 from Level 3, plus 11 from Level 4, which equals 29 and with the stem there would be a total of 30 possible family members.
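The arithmetic behind the Cumulative column can be checked mechanically; the dictionary below simply encodes the affix counts given in Table 2.2.

```python
# Affixes added at each Bauer and Nation (1993) level (Levels 2-6),
# as listed in Table 2.2.
affixes_per_level = {2: 8, 3: 10, 4: 11, 5: 50, 6: 12}

# Cumulative family members available at each level: the stem form
# (Level 1) plus all affixes at this level and the levels below it.
cumulative = {1: 1}
total = 1
for level in range(2, 7):
    total += affixes_per_level[level]
    cumulative[level] = total

# cumulative == {1: 1, 2: 9, 3: 19, 4: 30, 5: 80, 6: 92}
```

This reproduces the worked example in the text: a Level 4 family allows 8 + 10 + 11 = 29 affixes, or 30 possible family members once the stem is added.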


Ideally, what is included in the word family should grow as learners’ morphological proficiency develops, although this creates problems in the creation and use of word lists (see the note at the end of this chapter). Certainly, however, the restriction of word families to the affixes available up to Level 6 in Bauer and Nation should be relaxed when looking at words from, say, the 10th 1000 onwards. It would be useful to know at what proficiency level, or better still at what vocabulary size, learners can deal with most affixes when reading. It would also be useful to have some scheme based on learner knowledge, if this is possible given a variety of L1 backgrounds affecting affix knowledge.

It is important to make a distinction between vocabulary size as measured by a vocabulary size test (Nation & Beglar, 2007; Coxhead, Nation & Sim, 2014) and vocabulary knowledge as measured by a vocabulary levels test (McLean & Kramer, 2015; McLean, Kramer & Beglar, 2015), which checks knowledge of the most useful words of the language at 1000 word levels. Vocabulary levels tests have enough items at each word level to get reliable figures for each level. A vocabulary levels test is much more likely to give an accurate measure of vocabulary in regard to text coverage because getting a score of 3000 on a vocabulary size test does not mean that the learner knows the most useful 3000 words (Nguyen & Nation, 2011). A vocabulary levels test however indicates how well each level of vocabulary in the test is known and so it is possible to see if learners have large gaps in their knowledge of the high frequency words.

As Table 2.3 shows, the average size of Level 6 word families differs according to the frequency level of the words, with high frequency words typically having more family members.

Table 2.3  Number of types per family in each of the first twenty 1000 BNC/COCA word families at Level 6 of Bauer and Nation (1993)

| Level | Types | Level | Types |
|-------|-------|-------|-------|
| 1 | 6.859 | 12 | 2.755 |
| 2 | 6.371 | 13 | 2.415 |
| 3 | 5.880 | 14 | 2.299 |
| 4 | 4.865 | 15 | 2.284 |
| 5 | 4.295 | 16 | 2.086 |
| 6 | 4.103 | 17 | 2.079 |
| 7 | 3.690 | 18 | 1.933 |
| 8 | 3.422 | 19 | 1.861 |
| 9 | 3.200 | 20 | 1.820 |
| 10 | 2.982 | Average | 3.40705 |
| 11 | 2.942 | | |

The data comes from the BNC/COCA lists. The 1st 1000 word family list has of course 1000 word families, which consist of 6,859 word types – an average of almost 7 word types (family members) per family, including the head word of the family, usually the stem form. By the 5th 1000 words this is down to just over 4 members per family, by the 10th 1000 just under 3 members per family, and by the 20th 1000 under two members per family. The average for the first 20,000 is 3.4 members per family, and for the first 10,000 4.6 members per family. As frequency drops, so does the number of members per family.

The conservative position when building word families follows the Bauer and Nation guideline of only allowing free forms, not bound forms, to act as the headword for a word family. This means that clearly related items like present and presence are put into different families. When the guidelines of what can be included in a family are followed, some family members may be very infrequently occurring word forms, but if the families are properly made, the family members should be readily accessible when reading.

Ideally, there should be lists of family members for at least the high frequency words at various levels, particularly at the less inclusive levels and their subdivisions, such as Levels 3 and 4. This would allow better matching of family level to proficiency or vocabulary size level. The reason this has not yet been done is that making word family lists is very time consuming. The first word family lists (1st 1000 GSL, 2nd 1000 GSL, University Word List), made for the now obsolete FVORDS program, were created manually, with each family member added on intuition, and the families were then checked by running the program with a variety of texts.
Later a program (AffixAppender) was specially written by Chris Andreae which used the lists from Leech, Rayson and Wilson’s (2001) study of the British National Corpus, with part of speech indicated, to create word families using affixation rules. Each rule was applied in one of two possible ways, the division largely fitting with the inflection/derivation distinction. In the first way, the rule (for example, add the -s suffix to verb stems) was applied, the result was checked against a spellchecker, and the resulting word was added as a family member. In the second way, the affix was added to the stem, the result was checked against a spellchecker, and the word was then presented to me to accept, reject or modify. The checking was necessary because the application of the rules occasionally resulted in words that were real words but not members of that particular family. The noun abbreviation co for company had the suffix -y added to it to produce coy, which was not a member of that family.

This program reduced the time required to create a 1000 word family list from several days to several hours, and resulted in the first fourteen 1,000 word family lists from the BNC. These word family lists now go beyond the 28th 1000. The great difficulty in expanding the lists comes from finding new words that are not already in the lists
and which are not proper names or very low frequency family members of families already in the lists. Mark Davies and Marc Brysbaert provided lists of these which allowed the BNC/COCA lists to grow beyond the 20th 1000.

The reason for this description is to show that although it is relatively straightforward to make lemma lists using a tagged corpus, word family lists require a large amount of manual checking and editing. This is because derivational affixes are not as regularly rule-based as inflectional affixes, and the rules that do exist are not easily handled by computer programs. It is hoped that the freely available word families made from the British National Corpus can be improved and adapted so that this time-consuming work does not have to be repeated.

The word family is a sensible unit of counting for receptive purposes and the problem then lies in deciding what level of word family is most suitable for particular groups of learners. Level 6 of Bauer and Nation is too inclusive for most learners of English as a foreign language and is too exclusive for adult native speakers of English. The Bauer and Nation levels may need to be adapted to fit with research on learners’ mastery of affixes. We look at this issue more closely in the following section of this chapter.
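The two-pass AffixAppender workflow described above (apply a rule, check the result against a spellchecker, and queue the output for a human to accept or reject) can be sketched as follows. The word list, rule and function names here are invented for this illustration, and the real spelling-adjustment rules (e-deletion, consonant doubling and so on) are omitted.

```python
# Sketch of an AffixAppender-style workflow: apply an affixation rule
# to a stem, check the result against a wordlist (standing in for the
# spellchecker), and keep survivors for human review. Illustrative
# only - not the actual program or its rules.
KNOWN_WORDS = {"runs", "walks", "coy", "runner", "walker"}  # toy spellcheck list

def candidates(stems, suffix):
    """Yield (stem, derived form) pairs that pass the spellcheck.
    A human must still reject real words that are not family members,
    e.g. co + -y -> coy."""
    for stem in stems:
        form = stem + suffix  # naive: no e-deletion or consonant doubling
        if form in KNOWN_WORDS:
            yield stem, form

# Inflection-like rules could be applied automatically...
auto = list(candidates(["run", "walk", "co"], "s"))
# ...while derivation-like rules are queued for manual accept/reject:
# co + -y passes the spellcheck (coy) but is not a family member.
review = list(candidates(["run", "walk", "co"], "y"))
```

The `review` list shows exactly why the manual pass was needed: the spellcheck confirms only that the form is a real word, not that it belongs to the family.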

Lemmas versus word families

There is considerable debate about the choice between lemmas (or flemmas) and word families. The discussion regarding the unit of counting is often posed as lemmas versus word families (Gardner, 2007; Brown, 2010; Nation, 2013). Bauer and Nation’s (1993) levels, however, included types and lemmas as being part of the word family levels – Level 1 for word types, Level 2 for lemmas. The discussion could thus be more usefully focussed on what level of word family is suitable for particular learners and for particular purposes. This is not just playing with terminology. Choosing to count lemmas involves accepting the idea that lies behind word families, namely that words with the same stem may be seen as related to each other, and thus the recognition and learning of a word which is morphologically related to a known word is likely to be substantially easier than dealing with a completely unrelated word.

The current problem is that the existing lists of word families are at either end of the Bauer and Nation scale with nothing in between. That is, there are lists of types (Level 1), lemmas and flemmas (Level 2) which include only headwords and inflected forms as family members, and families (Level 6) which include headwords, inflected forms and a wide range of derived forms. With such a wide contrast (stem and eight inflections at Level 2, and 91 affixes at Level 6), it is tempting to see Level 2 and Level 6 as a binary opposition rather than
points on the same scale. Some researchers with an interest in native-speaker knowledge see Level 6 word families as not being inclusive enough (Brysbaert, Stevens, Mandera & Keuleers, 2016). The major issue is what level of word family is suited to a particular group of learners.

Brown (2013) argues that “A word family is not simply a lemma with derivations added; the lemma is a fundamentally different unit” (page 1053). Morphologists would agree, but underlying both lemmas and word families is the idea that forms sharing the same stem are related to each other and can usefully be seen as a unit. The lemma is a step on the path to fuller morphological knowledge.

Dang and Webb (see Chapter 15 this volume) argue that, for beginning learners of English, the Level 2 word family based only on inflections (in their case the flemma) is a better choice than a larger word family, because beginning learners who are still learning the 1st 1000 words of English have very limited morphological knowledge (Schmitt & Zimmerman, 2002; Ward & Chuenjundaeng, 2009), and word families contain a mixture of high frequency and low frequency family members. Their argument about learner knowledge is important because the unit of counting needs to suit the target learners. Their argument about low frequency members is true, but also applies to lemmas. In addition, it needs to be noted that family members are not chosen on the basis of frequency but on the transparency of their family relationship.

Brezina and Gablasova (2015) also used the lemma, with different parts of speech of the same form being different lemmas. Their justification was also based on the need to create a list for beginners for both receptive and productive purposes, although they also had criticisms of the transparency of the relationships between items in the same word family.
Brezina and Gablasova also noted that “opting for an alternative to the traditional word-family approach allowed us to narrow down the wordlist to significantly fewer forms than included in West’s General Service List and at the same time retain comparable coverage of text. This methodological choice also plays an important role in the ranking and the frequency-bands organization of words in the new-GSL, which reflects more precisely the actual occurrence of words in text” (Brezina & Gablasova, 2015: 18). This seems to mean that using lemmas makes it more obvious which word forms are occurring frequently. Whereas a lemma based on part of speech typically has only one to four members, a family can have many more members, especially high frequency word families. This is not a strong argument, as text analysis programs like Range and AntWordProfiler provide both family and type frequencies.

Gardner and Davies (2014) criticize the use of word families in the making of the Academic Word List (Coxhead, 2000) for the following reasons. (1) All the members of a word family are not closely related in meaning. (2) Word families do not make part of speech distinctions. (3) Learners, including young native speakers,
may not have knowledge of the word-building devices of English. Gardner and Davies conclude that using lemmas will solve many of these problems. However, these criticisms are not correct. (1) Bauer and Nation (1993) specifically note “Clearly, the meaning of the base in the derived word must be closely related to the meaning of the base when it stands alone or occurs in other derived forms” (page 253). If some of the word families in the BNC lists contain unrelated stems, then those families are not well made and need to be revised. Polysemy, however, is not usually enough to distinguish families, and the belief that polysemes should be seen as different words is where Gardner and Davies part company with Bauer and Nation. (2) Word families do not make part of speech distinctions except at Level 2, the level of the lemma. Making part of speech distinctions has the negative effect of distinguishing very closely related items (smile as a noun and smile as a verb), and does not distinguish homonyms that are the same part of speech, such as bank (for money) and bank (landform), or ball (round object) and ball (dance). Word families, like lemmas, do not effectively distinguish homoforms. (3) The word families proposed by Bauer and Nation are actually a set of levels, the second level of which includes lemmas. The issue is not whether lemmas are better than word families, but which level of word families (including lemmas) is most suitable for a particular learner or group of learners.

Most previous text coverage research with word families, such as Coxhead (2000), Nation (2006), and Webb and Macalister (2013), has used the Bauer and Nation Level 6 word families, largely because these were the only ones available, but research on learners’ knowledge of derived words indicates that this level is far too high and greatly overestimates learners’ knowledge of family members, especially derived family members.
Thus, the most telling criticism of word families for receptive purposes centers on what affixes should be allowed for membership of the family. The combined levels up to Level 6 include 91 affixes and many of these will be unknown by young native speakers and intermediate level learners of English as a foreign language. Choosing between word family levels, which also include types and lemmas, involves making a decision on (1) how much rule-based generalization involving morphology and core meaning is involved in dealing with previously unknown family members and (2) how much each family member requires new learning. Both kinds of knowledge are likely to be involved, so it is a matter of considering the degree of involvement of each. If a prospective family member requires a large amount of new learning which is different from the usual rule-based generalization, then it should be counted as a separate family. Hard and hardly, although they share a stem form and differ only by the addition of a very common suffix, do not share stems with the same or closely related meanings. They are in different families.

Just as it is important to recognize that seeing the meaning relationships between some members of the word family may require some effort, it is also important not to overstate the differences. Seeing the relationships between different lemmas sharing the same form or related forms is typically very easy; compare walk in When I go for a walk and When I walk. It does not make sense for receptive purposes to see swim as a noun (I went for a swim) requiring extra learning when swim as a verb (I like to swim) is already known. Similarly, there is minimal interpretive effort required to understand sadness and sadly once sad is already known and the learners know about -ly and -ness, and have met them in several words. Understanding a newly met regular form like sadness where the stem or another family member is known is in a completely different category of learning from understanding a newly met unknown word like grief.

It is not at all hard to show the inconsistency in making distinctions based on part of speech at Level 2 of Bauer and Nation. The main reason is that part of speech distinctions overwhelmingly involve polysemy not homonymy, and polysemy works within the same part of speech just as well as between parts of speech. If we take the form walk, and distinguish the noun and verb uses as different lemmas, why don’t we distinguish walk as a noun referring to somebody moving (I went for a walk) from walk as a physical object as in boardwalk, or a tourist attraction as in Great Walks of New Zealand? The distinction between walk as a physical object and walk as an action (both nouns) is greater than the predictable noun–verb difference, although neither difference is great enough to distinguish different words. These distinctions seem to require similar or less effort to comprehend or use than the distinction between singular and plural, or the various forms of a verb.
The next step after the flemma for low proficiency learners of English could be the word family at Level 3 of the Bauer and Nation (1993) scheme. This includes adverbs ending in -ly, nouns ending in -ness, the negative prefixes un- and non-, and the very useful and productive suffixes -er (as in worker, teacher, singer, learner), -able (as in fixable, doable, acceptable), -ish (foolish, selfish, stylish), -less (endless, regardless, useless, careless), -th (fourth, tenth, nineteenth), and -y (bloody, lucky, greeny). If Level 3 seems too big a step, then a reduced version of Level 3 could be used with just four of the ten permitted affixes (un-, -ly, -er, -th). There are now word families available using this reduced version of Level 3, probably better called Level 3 partial. The most important reason for choosing these affixes is that three of them occur in many words in the top 3000 frequency-ranked word types, and they are thus likely to be in some words already known by low proficiency learners of English, while the -th suffix is very limited in that it is only used in a very small group of ordinal numbers (fourth, tenth, hundredth). It would require a relatively small amount of teaching and learning effort to help learners quickly become familiar with these four affixes. The teaching should focus on (1) learning
the meaning of the affixes, (2) recognizing the affixes in already known words, and (3) interpreting the meaning of the affixed word in a context.

Researchers with experience in analyzing spoken English (Sinclair, 1987) argue that for productive purposes (speaking and writing), the word type is the unit of counting that best represents the knowledge involved in language production. Sinclair has argued that different members of the same lemma (the stem and its inflections) often take different collocates, and so, for productive purposes, learning and thus word counting needs to be at the level of the word type. The alternative argument is that most individual words take a range of different collocates and thus collocates are not the criterion that should be used to distinguish what is counted as the same word. Lemmas may thus be a suitable productive unit, but this needs investigation. When doing counts of specialized vocabulary, the word type is a better unit because all the members of a lemma are not similarly technical words (Chung & Nation, 2003). For receptive purposes with low proficiency learners, lemmas, flemmas, or Level 3 partial would be sensible choices.

The choice of the level of word family needs to be justified for any particular study to show that the unit of counting suits the goals of the study. This justification needs to involve, where relevant, whether the resulting list will be used for receptive or productive purposes, whether it is for oral or written material, and the proficiency level (including vocabulary size) of the learners for which it is intended. When word lists are used for text analysis, it is useful to see how the choice of a different level of word family affects the results.
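A Level 3 partial grouping of word types, using just the four affixes un-, -ly, -er and -th discussed above, can be sketched as a simple procedure. The stripping here is deliberately naive (no spelling adjustment, so a form like happily would be missed, and real data needs manual checking for false matches such as corner/corn); the function name is invented for this illustration.

```python
# Sketch of grouping word types into "Level 3 partial" families.
# Inflected forms are assumed already handled; only the affixes
# un-, -ly, -er and -th are attached to a set of known stems.
def level3_partial_family(word, stems):
    if word in stems:
        return word
    if word.startswith("un") and word[2:] in stems:
        return word[2:]
    for suffix in ("ly", "er", "th"):
        if word.endswith(suffix) and word[: -len(suffix)] in stems:
            return word[: -len(suffix)]
    return word  # no known stem: the type heads its own family

stems = {"quick", "teach", "four", "happy"}
members = {}
for w in ["quick", "quickly", "unhappy", "teacher", "fourth", "grief"]:
    members.setdefault(level3_partial_family(w, stems), []).append(w)
# quickly joins quick; unhappy joins happy; teacher joins teach;
# fourth joins four; grief heads its own family
```

Even this toy version shows why family building cannot be fully automated: every match a rule proposes still has to be a transparent member of its family.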
Table 2.4 compares the coverage of 14 million tokens of spoken and written data by different units of counting, from single word types (Bauer & Nation Level 1), Level 2 lemmas where a lemma can include more than one part of speech (flemmas), Level 3 families using inflections plus -ly, -er, un-, -th (for ordinal numbers), and Level 6 families.

Table 2.4  Percentage coverage of 14 million words of spoken and written data by the first 3000 types, Level 2 flemmas, Level 3 partial families and Level 6 families

|          | Types | Flemmas | Level 3 partial families | Level 6 families |
|----------|-------|---------|--------------------------|------------------|
| 1st 1000 | 76.46 | 80.97   | 81.67                    | 82.95            |
| 2nd 1000 | 5.45  | 5.29    | 5.28                     | 5.36             |
| 3rd 1000 | 2.88  | 2.41    | 2.32                     | 2.07             |
| Total    | 84.79 | 87.77   | 89.27                    | 90.38            |
Note in Table 2.4 that the coverage by the 1st 1000 word families is the highest. This is because many word families from the 1st 1000 have several high frequency family members. At the 2nd and 3rd 1000 levels, however, the smaller units provide
better coverage because these high frequency Level 6 family members are now separate units at the less inclusive levels of word family, and boost the 2nd and 3rd 1000 levels. Examples from the 2nd 1000 flemmas of what would be derived-form members of Level 3 and 6 families include clearly, quietly, completely, slowly, easily, mostly, fairly, totally, immediately. Examples from the 3rd 1000 include partly, equally, frequently, unable, originally, previously, unlike, unusual, currently. Note that the move from the Level 2 flemmas to the Level 3 partial families (flemmas plus four derivational affixes) gives an increase in coverage of 0.7% at the 1st 1000 level, and a drop of 0.12% at the 2nd and 3rd 1000 levels. The total increase in coverage by the first 3000 words at the different word family levels from the Level 2 flemma on is just over 1% per level for those in Table 2.4.
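A coverage figure of the kind reported in Table 2.4 is simply the percentage of corpus tokens whose counting unit falls within a given band of the word list. The sketch below, with toy data and an invented token-to-family mapping, shows why more inclusive units yield higher coverage for the same headwords.

```python
# Sketch of a text-coverage calculation like the one behind Table 2.4.
# The mapping from token to unit (type, flemma or family headword)
# is assumed precomputed; the data here is illustrative only.
def coverage(tokens, unit_of, band):
    """Percentage of tokens whose counting unit is in `band`."""
    hits = sum(1 for t in tokens if unit_of.get(t, t) in band)
    return 100 * hits / len(tokens)

# With families, runs and runner both map to the headword run.
family_of = {"runs": "run", "runner": "run", "walked": "walk"}
tokens = ["run", "runs", "runner", "walked", "grief"]

fam_cov = coverage(tokens, family_of, {"run", "walk"})  # family counting
type_cov = coverage(tokens, {}, {"run", "walk"})        # type counting
# fam_cov is 80.0 but type_cov only 20.0: the same two headwords
# cover more tokens when the unit is more inclusive
```

This is the mechanism behind the pattern in Table 2.4: moving from types to flemmas to families folds more surface forms into each list entry, so coverage by the 1st 1000 units rises with each step.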

Learners’ knowledge of affixes

There is evidence that learners differ greatly in their knowledge of affixes. For some learners, a particular level of word family is too inclusive, while for others it is not inclusive enough. For learners of English, knowledge of affixes is related to vocabulary size. The more words you know, the more affixes you are likely to know (Mochizuki & Aizawa, 2000; Schmitt & Meara, 1997). For productive purposes, the learners studied by Schmitt and Zimmermann (2002) knew between two and four of the derivatives per family. We would expect the figure to be higher for receptive knowledge.

Reynolds (2013), in a discussion of Webb and Macalister’s (2013) research using the BNC/COCA lists, and Reynolds and Wible (2014) question whether we know what non-native speakers see as a word, suggesting we need to investigate how learners actually deal with complex words. There is plenty of evidence to support this suggestion (Schmitt & Zimmerman, 2002; Schmitt & Meara, 1997; Mochizuki & Aizawa, 2000; Ward & Chuenjundaeng, 2009; Brown, 2013; Webb & Sasao, 2013; McLean, in preparation). They also indicate that Level 6 on the Bauer and Nation (1993) scale is far too advanced for most foreign language learners of English. Research on morphological knowledge and reading comprehension shows that for both first and second language learners morphological knowledge makes a significant contribution to reading proficiency and develops as vocabulary size grows (Jeon, 2011; Wang, Cheng & Chen, 2006; Droop & Verhoeven, 2003). Reynolds and Wible (2014) also make the point that the unit of counting makes a big difference when counting the frequencies of words, and that researchers need to be explicit about the unit of counting, and also need to be consistent and careful about how they actually do the counting.


Knowledge of morphology is sometimes referred to as morphological awareness. Morphological awareness is a rather broad term referring to a learner’s awareness of the morphological structure of words both in general and in relation to particular words, and the ability to make use of that knowledge in comprehending and producing words. Tests of morphological awareness have tested both receptive and productive knowledge of affixes. Receptive knowledge for use involves being able to recognise that a word contains an affix or affixes (identification), and then being able to comprehend that word as it is being used, drawing on knowledge of the affix and taking account of the context in which the word occurs (interpretation). Morphological awareness is particularly important in vocabulary growth in English because the number of words with affixes greatly outnumbers the number of stem forms, even when inflections are not counted.

Tests of morphological knowledge of second language learners have been rather demanding, with some testing productive knowledge (Schmitt & Meara, 1997; Schmitt & Zimmerman, 2002), and others requiring good metacognitive knowledge of parts of speech (Mochizuki & Aizawa, 2000). The studies show that affix knowledge does not follow the Bauer and Nation (1993) levels, although those levels may still provide a frequency-based and transparency-based rationale for teaching. On the positive side, the studies also show that some low proficiency learners do know some derivational affixes. In the Mochizuki and Aizawa (2000) study, which involved receptive knowledge tests, the learners had an average vocabulary size of 3769 words and knew 7.24 out of 13 prefixes and 10.70 out of 16 suffixes. Affix knowledge was related to vocabulary size.
As Anglin (1993) has shown for native speakers, morphological problem-­solving is a major way in which learners rapidly increase their knowledge of vocabulary, and is one of the top vocabulary learning strategies for foreign language learners.

The validity of the word family levels

The Bauer and Nation (1993) word family levels were set up from the point of view of morphology rather than learner knowledge, which is likely to differ greatly, partly depending on the L1 of the learners involved. There is no strong reason to stick to those levels. At best, they can be regarded as a starting point, an initial frame of reference. As usual, language acquisition follows function rather than form, and is organic rather than logical. It is very important, however, not to be dismissive of the idea of word families. As L1 research has shown and as common sense clearly indicates, gaining control over the morphology of English is an essential aspect of foreign language

proficiency development. We need to become skilled at gauging our learners’ progress in knowledge of word parts, and we need to have the resources to support this progress and evaluate its effects. It is not a given that the pedagogical sequence of word levels has to reflect learner knowledge. Word lists are typically based not on what learners know but on what is useful, although there is a strong argument for tests to take account of what is known.

The Bauer and Nation (1993) levels are based on frequency, regularity, productivity, and predictability. Frequency refers to the number of words in which an affix occurs: primarily the number of different words in which an affix occurs, and secondarily the frequency of those different words. Regularity refers to the amount of change to the base that occurs in the spoken and written forms when an affix is added. Some of this is very regular, with no change (sad/sadly) or with predictable change (happy/happily). Productivity refers to the likelihood that the affix will be used to form new words. That is, are new words using this affix still being created? Predictability refers to the ease of determining the meaning of the affix. Some affixes have more than one meaning, making them less predictable. Frequency and productivity relate to the usefulness of affixes; regularity and predictability relate to difficulty. Assigning affixes to levels often involved a compromise between these criteria.

The arguments in favour of the levels proposed by Bauer and Nation are that they relate directly to what needs to be learned in terms of relative importance, and they provide a series of levels that are standard across a wide range of different L1 backgrounds. In the terminology of curriculum design, they provide a useful description of the necessities for language learning (Hutchinson & Waters, 1987), and are thus particularly useful for language testing.
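Using the scale as a unit-of-counting definition can be sketched as a lookup keyed by level. The affix inventories below are a small illustrative subset only, and the matching deliberately ignores spelling changes at the morpheme boundary; see Bauer and Nation (1993) for the full levels.

```python
# Illustrative subset of the Bauer and Nation scale; not the full inventory.
AFFIXES_BY_LEVEL = {
    2: ["s", "ed", "ing", "er", "est"],       # inflections
    3: ["able", "ly", "ness", "less", "un"],  # most frequent, regular derivation
    4: ["al", "ation", "ful", "ment", "ous"], # frequent, orthographically regular
}

def reduces_to(word: str, stem: str, level: int) -> bool:
    """Naively test whether `word` belongs to `stem`'s family at `level`,
    ignoring spelling changes at the morpheme boundary."""
    if word == stem:
        return True
    for lv, affixes in AFFIXES_BY_LEVEL.items():
        if lv > level:
            continue  # affixes above the chosen level do not count
        for a in affixes:
            if word == stem + a or word == a + stem:
                return True
    return False

print(reduces_to("agreement", "agree", 2))  # False: -ment is a Level 4 affix
print(reduces_to("agreement", "agree", 4))  # True
```

The point of the sketch is that "one family or two" is not a property of the words alone; it is a property of the level chosen, which is exactly why the level must be stated and justified.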
The arguments in favour of levels based on learners’ knowledge of affixes are that they provide a more accurate measure of what is known and are thus particularly useful for word coverage studies. In the terminology of curriculum design, they provide a useful description of lacks (what learners do and do not know). Levels based on learner knowledge may differ according to the learners’ L1, but this remains to be shown. It is likely that advances in corpus tagging and database design, and research on depth of word knowledge including affix knowledge, will soon allow us the choice of a variety of well-justified and well-defined units of counting that better reflect the needs of the users of word lists.


Note


When a different definition of what is included in a word family is used for learners at different proficiency levels, the problem arises of how low frequency members of high frequency families are included in the lists. If we take the high frequency family agree, which has the following members, we can see that the frequencies of the members are very different. The frequencies given after each word are from the whole BNC.

AGREE 8057
AGREEABLE 394
AGREEABLY 70
AGREED 14390
AGREEING 813
AGREEMENT 13254
AGREEMENTS 2704
AGREES 938

If only the lemma agree is used in the 1st 1000 list, what happens to the other members of the family if we move to using word families from the 3rd 1000 or 4th 1000 level on because of learners’ increasing proficiency? One solution is to include all the excluded low frequency members of the high frequency words as a big group of items, possibly under just one headword, in say the 4th 1000 list, or whatever list marks the point at which the full word family becomes acceptable as the unit of counting. Another way is to have two or more parallel sets of word lists, one using lemmas and one using word families, so that the word family lists can replace the lemma lists at a certain point. This may mean the exclusion of a few word families that are in the high frequency word families but are not represented in the high frequency lemmas. A way of dealing with this is to make sure that all high frequency word families are represented by a high frequency lemma, but this is also problematic because some high frequency families will be represented by two or more high frequency lemmas. For example, both the lemma agree and the lemma agreement are in the most frequent 1000 lemmas. In addition to pronouns, there are 24 instances where two lemmas from the same family get into the most frequent 1000 lemmas according to the BNC, and another two instances where three lemmas from the same family (manage, relate) get in (see Table 2.5). Dividing pronoun families into lemmas makes 30 lemmas from seven families. The division of pronouns into lemmas, however, is clearly justified by the formal differences.



Chapter 2.  Types, lemmas, and word families


Table 2.5  Lemmas from the same family that all occur in the 1000 most frequent lemmas

act, action
agree, agreement
apply, application
argue, argument
associate, association
certain, certainly
claim, claimant
develop, development
discuss, discussion
final, finally
general, generally
lead, leader
manage, management, manager
move, movement
near, nearly
operate, operation
particular, particularly
play, player
recent, recently
relate, relation, relationship
simple, simply
state, statement
teach, teacher
train, trainer
treat, treatment
work, worker

There are sixteen lemmas where the derived form gets into the most frequent 1000 lemmas, but the stem form does not get in – actually, education, election, financial, government, information, national, natural, organization, professional, quickly, really, responsibility, security, situation, usually. So, actually is in the first 1000 lemmas but actual is not.
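The bookkeeping described in this note can be sketched with the agree family figures given above. The lemma groupings are an assumption for illustration (verb lemma vs noun lemma); the adjective and adverb forms fall outside both lemmas under a strict lemma unit.

```python
# BNC frequencies for the agree family, as given in the note above.
freqs = {"agree": 8057, "agreeable": 394, "agreeably": 70,
         "agreed": 14390, "agreeing": 813, "agreement": 13254,
         "agreements": 2704, "agrees": 938}

# Assumed lemma membership for illustration.
lemma_agree = ["agree", "agreed", "agreeing", "agrees"]
lemma_agreement = ["agreement", "agreements"]

family_total = sum(freqs.values())
agree_total = sum(freqs[w] for w in lemma_agree)
agreement_total = sum(freqs[w] for w in lemma_agreement)

print(family_total)     # 40620 occurrences for the whole family
print(agree_total)      # 24198 for the lemma agree
print(agreement_total)  # 15958 for the lemma agreement
# agreeable/agreeably (464 occurrences) belong to neither lemma and
# would need a home in a lower-frequency list.
```

The arithmetic shows the problem concretely: two lemmas from this one family each earn a high frequency place on their own, while the leftover members must be parked somewhere else.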

Recommendations

1. Very careful thought needs to be given to the unit of counting words, namely what level of word family should be chosen. This decision needs to be explicitly justified, not by accepting what others have done, but by explaining the purpose of the word lists and showing how the chosen unit best reflects that purpose.
2. The criteria for the unit of counting also need to be carefully described. If types, then what is classified as a word type? If families at Level 2 (lemmas) are the unit, is part of speech distinguished or not? Are alternative spellings included in a lemma, and so on? Are abbreviations and shortened forms (comin’) included? If families from Level 3 on, what level of family is being used? These decisions also need to be explicitly justified.
3. If derived forms are included as family members in the unit of counting, a careful study should include evidence that the learners involved can cope with such family members.


chapter 3

Homoforms and polysemes


Paul Nation and Kevin Parent

People often talk about the different meanings of words when they are really referring to different senses of the same basic meaning. For example, they may say that tree has many meanings in that we can talk about trees in the forest, a family tree, and a tree diagram. These meanings are all related, and linguists refer to them as “senses” of the same meaning. There are some words, however, that share the same spoken and written word form but have distinctly different meanings. The most commonly cited example is bank, which can be a place to store things such as money, and can also be a long raised piece of ground such as beside a river. There is no shortage of other examples – like, lie, bowl, box, leaves, pen, spell, well, yard. Such words, which have distinct unrelated meanings but exactly the same written and spoken form, are called homonyms. There are also words that share the same written form but have different spoken forms and distinct unrelated meanings (homographs), and words that have the same spoken form but different written forms and distinct unrelated meanings (homophones) (see Table 3.1). In this book, the word homoform will be used to cover these three types.

Table 3.1  The three types of homoforms with examples

            Spoken form   Written form   Meaning     Examples
Homonym     The same      The same       Different   bridge, case, firm, miss, ring, rest
Homograph   Different     The same       Different   close, lead, minute, present, row
Homophone   The same      Different      Different   eye/I, peace/piece, write/right/rite

The most frequent 2000 word families of English, as represented by West’s (1953) A General Service List of English Words, contain 75 homonyms, 7 homographs, and 147 homophones (Parent, 2012). In Parent’s data, then, 82 words in the first 2000, or roughly one word in every 24, involved a homograph or homonym. Wang and Nation (2004) found that the Academic Word List, which consists of 570 word families, contained 60 homonyms and homographs (attribute, decline, generation, issue,


project, volume). That is, roughly one word family in every ten involved a homograph or homonym. An analysis of their list shows that 54 were homonyms and 6 were homographs (abstract, appropriate, attribute, contract, converse, project), where the pronunciation differed largely through the placement of stress. Homonymy and homography can thus have a notable effect on the number of different items in a word list. Separating homonyms and homographs could increase the items in a list by up to 10%. Written corpora already distinguish homophones because they have different written forms, and all corpus work on word lists from spoken text is at present done using written transcriptions.

So, in most word counts using written text, we would want to count homonyms and homographs as two or more different words, particularly if we are interested in vocabulary learning and vocabulary level or vocabulary size, because each member of a homonym or homograph pair (or more) requires learning a new meaning and also learning to recognize when that meaning is being signaled. A word count or word list for spelling purposes, however, would not need to distinguish homonyms.

At present, however, word lists do not usually distinguish homographs and homonyms. There are ways to deal with this, such as using a computer program to tag homographs and homonyms in the corpus before they are counted, and Tom Cobb is working on doing this with some success. The homonym and homograph lists created from the General Service List (Parent, 2012) are very useful starting points for such tagging. There are, however, a few compromises until reliable tagging can be done. In the present BNC/COCA word lists, where affixes can distinguish at least some of the items, such as hardly from the hard family, or oriented, orients, orienting from Orient, Oriental, the items are given separate entries in the word lists.
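The kind of pre-counting tagging mentioned above can be sketched very naively with hand-picked collocational cues. The cue sets and tag names below are illustrative assumptions, not Cobb's actual method or any published disambiguation list.

```python
# A naive sketch of pre-tagging one homonym before counting, using
# hand-picked collocational cues. The cue lists are illustrative
# assumptions, not derived from any corpus study.
CUES = {
    "bank_money": {"account", "loan", "deposit", "savings"},
    "bank_river": {"river", "grassy", "steep", "opposite"},
}

def tag_bank(context_words):
    # Score each candidate tag by cue overlap with the context window.
    scores = {tag: len(cues & set(context_words))
              for tag, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "bank_unknown"

print(tag_bank(["opened", "an", "account", "at", "the"]))  # bank_money
print(tag_bank(["sat", "on", "the", "river", "grassy"]))   # bank_river
```

Once every occurrence carries such a tag, the two banks can be counted as the two different words they are; the hard part, of course, is making the tagging reliable.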
For many homographs and homonyms this means that one family will contain a form, for example orient, which should really be in both families. For most homographs and homonyms this is far from a completely satisfactory solution, but it is at least a start. The other slightly consoling factor is that the members of most pairs of homonyms and homographs differ greatly in their frequency of occurrence (Parent, 2012; Wang & Nation, 2004), with one member of the pair making up well over 70% of the total occurrences of the pair. In Parent’s data, there were only six homonyms (bowl, ring, rest, net, yard, miss) that were under the 70% level. For example, the most evenly distributed was bowl, with 50% of occurrences for the game meaning and 50% for the dish meaning. In frequency-based lists, and in particular high frequency word lists, only a small number of homographs and homonyms are likely to have more than one entry in the lists, the other member(s) being in the mid-frequency or low frequency lists. Wang and Nation (2004) found that for the Academic Word List, only three word families (intelligence, panel, offset) would drop out of the list if homonyms and homographs were separated, because the frequency and range of each of the separated items did not meet the inclusion criteria for the Academic Word List. Only


five items required more than one entry in the Academic Word List (issue, volume, objective, abstract, attribute) because both meanings of each word met the range and frequency criteria for inclusion. These compromises and reassurances should not be taken as reasons for not doing anything about homonyms and homographs. Being able to distinguish them would result in better lists.

Table 3.2 lists the high frequency homonyms. The word family of arm occurs 19,916 times in the BNC. 74% of these occurrences use the meaning related to the arm on your body, which also includes uses like He sat on the arm of the chair. 26% of the 19,916 occurrences use the meaning related to weaponry (We armed the troops). Note that to save space, the different meanings are signaled in Table 3.2 not with definitions but with just a rough clue to the meaning. Using 2000 occurrences for word families in the British National Corpus as the minimum cut-off point for inclusion in the 3000 high frequency words, the meanings marked with an asterisk would not be included among the high frequency words, for example ball meaning “social event”. Note that many of these homonyms are the same part of speech, and that some meanings can be more than one part of speech. Bank involving money can be a noun and a verb (bank your money in the bank). A corpus tagged for part of speech does not consistently distinguish homonyms. The meaning divisions in Table 3.2 and the following tables were taken from the analyses in the Oxford English Dictionary, which distinguish words with different etymologies. That is, there are well established historical reasons for distinguishing the different meanings of the homonyms.

Table 3.2  High frequency homonyms with meaning percentages and British National Corpus frequencies (* = meanings occurring less than 2000 times in the BNC)

ARM 19916      body part 74%; weapon 26%
BALL 9474      round object 96%; social event 4%*
BAND 9576      group of people 81%; hoop/ring 19%*
BANK 28750     finance 93%; landform 7%
BEAR 7940      (verb) 92%; animal 8%*
BOIL 3515      (verb) 97%; swelling 3%*
BOWL 5420      dish 50%; game 50%
BOX 14320      container 95%; sport 5%*
BRIDGE 14459   infrastructure etc. 92%; card game 8%*
CAN 263978     (modal verb) 98%; tin 2%
CASE 63060     situation 98%; container (separate word) 2%*
CHECK 24020    (various polysemous uses) 99%; pattern of crossed lines 1%*
DATE 13253     related to calendar 99%; fruit 1%*
DIE 22985      to stop living 95%; singular of noun ‘dice’ 5%*
DOWN 97491     opposite of UP 98%; downlands 2%*; feathers 0%*
EGG 6366       produced by females 99%; to egg on 1%*
FINE 16324     good/small 95%; penalty 5%*
FIRM 24002     business 56%; solid, strong 44%
FOLD 4365      to bend 99%; flock 1%*
HOST 4888      of a party, etc. 80%; multitude 19%*; sacrificial victim 2%*
LAST 75835     previous/final 92%; to continue 8%
LAY 18043      to place 98%; non-clergy 2%*
LEAVE 84131    to depart/bequeath 88%; direction (inflected as left) 9%; permission 2%*; plural of leaf (leaves) 1%*
LIE 16047      to be prostrate 92%; falsehood 8%*
LIGHT 33318    opposite of ‘dark’ 80%; opposite of ‘heavy’ 20%
LIKE 162509    to resemble 78%; opposite of dislike 22%
LINE 38277     geometric figure 97%; to apply lining 3%*
MATCH 16232    game 97%; small wooden stick 3%*
MEAN 91891     to have meaning 96%; cruel 3%; average 1%*
MISS 20748     fail to hit 75%; (title) 25%
NAIL 2433      building 70%*; finger 30%*
NET 8304       web 63%; total 37%
PAGE 22801     of book, internet, etc. 99%; to call out 1%*
PAN 2657       cooking (including ‘criticize’) 96%; to move a camera (‘panorama’) 4%*
POLICY 35464   as in ‘foreign policy’ 99%; as in ‘insurance policy’ 1%*
POOL 6112      water, combined resources 94%; billiards 6%*
POT 5502       cookware 96%; marijuana 4%*
POUND 19819    monetary unit 58%; weight 13%; to crush 29%; dog pound 0%*
PUPIL 10638    students 99%; eye 1%*
RACE 14015     competition of speed 87%; species 13%*
RAIL 5174      horizontal beam 98%; to rail against 2%*
REST 19847     remainder 73%; recuperate 27%
RIGHT 108971   correct/opposite of left 85%; legal rights 15%
RING 13165     sound of bell 88%; circle 12%*
ROLL 10684     to spin 96%; a list 4%*
ROW 7872       a line 75%; row a boat 25%*
SCALE 11959    measurement/weight 96%; to climb 2%*; reptile skin 2%*
SET 59824      to place, to be firm 83%; collection 17%
SOCK 1194      garment 94%*; to punch 6%*
SOUND 24147    audio phenomenon 87%; sea inlet 6%*; sturdy 6%*; to inquire (to sound out) 1%*
SPELL 5013     letter-by-letter/magic 70%; time interval 30%*
STEEP 2386     (adjective) 93%; (verb) 7%*
TEND 15287     to engage in habitual actions 97%; to take care of 3%*
WAKE 5574      to be awake 71%; a track (in the wake of) 25%*; vigil 4%*
WEAVE 2775     interlaced thread 85%; to move from side to side 15%*
YARD 7257      land 57%; 36 inches 43%

Using a high frequency cut-off point of 2000 occurrences for word families in the BNC, when we look at the highest occurring meaning and calculate the number of occurrences from the British National Corpus frequency, the only homonym that does not remain with its most frequent meaning above 2000 is nail. Let us look at this calculation to make the idea clearer. The nail word family occurs 2433 times in the BNC. The most frequent meaning of nail is the one used in building, for example (nail the boards, hit the nail on the head), and makes up 70% of this homonym’s occurrences. 70% of 2433 is 1,703 occurrences, which puts this use of nail below the 2000 occurrence cut-off point. So, if we counted the two meanings of the homonym nail separately, neither use would get into the high frequency level.

The most frequent uses of all the other homonyms in Table 3.2 would remain in the high frequency lists because the most frequent use makes up a very high percentage of the total occurrences of the family and/or the frequency of the family is very high. Bowl, for example, is the only homonym in the list with a 50/50 split of its uses. It has a total frequency of 5420 occurrences. Splitting these occurrences between the two meanings would still result in two entries for bowl in the high frequency lists. For the other homonyms, like rail, the lower frequency meaning or meanings would be in the mid-frequency lists (4th 1000 to 9th 1000) or the low frequency lists (10th 1000 on). 17 of the 55 homonyms in Table 3.2 should have two separate entries in the first 3000 words of English, the high frequency words (arm, bank, bowl, can, firm, last, light, like, mean, miss, net, pool, pound, rest, right, set, yard). One, leave, should have three entries.
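The cut-off arithmetic just described reduces to a single comparison: a meaning stays in the high frequency lists only if its share of the family frequency reaches 2000 occurrences. A minimal sketch, using integer arithmetic to avoid rounding:

```python
def qualifies(family_freq, pct, cutoff=2000):
    """True if this meaning alone would clear the high frequency cut-off.
    pct is the meaning's share of the family as a whole-number percentage."""
    return family_freq * pct >= cutoff * 100

# nail (2433 occurrences): building 70% -> 1703 occurrences, finger 30% -> 730
print(qualifies(2433, 70))  # False: even the dominant meaning falls short
print(qualifies(2433, 30))  # False

# bowl (5420 occurrences): a 50/50 split still leaves both meanings above 2000
print(qualifies(5420, 50))  # True
```

This makes the nail result easy to see at a glance: split the family and both meanings drop out of the high frequency level, while bowl keeps two qualifying entries.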
Note that there is often no great difficulty in distinguishing homonyms when meeting them in a text, as there are often part of speech clues (You are mean, What does it mean), inflectional clues (You can do it, cans of beans), and occasionally capitalization (Miss Jones, I miss you), along with the contextual meaning clues. The main issue with homonyms in making word lists is that the members of a homonym are different words, usually with no historical relationship between their identical word forms, and they need to be counted as separate words.

In the BNC/COCA lists, three of these 17 homonyms do not have separate entries in the lists (bank, pool, set). The fourteen others have some kind of form-based distinction (arm, arms vs armed, arming, unarmed; bowl, bowls vs bowled, bowling, bowler; can vs cans, canned; yard vs yards). Although these distinctions are makeshift, they are better than no distinction at all. No distinction was made with bank because the percentage for the landform meaning was too low and the less frequent word did not have any formally different members from the higher frequency family. The two meanings of pool and the two dominant meanings of set were not distinguished for the same reason.

Proper names and abbreviated proper names can be involved in homonymy, for example, US/us, May/may, March/march, Brown/brown. Sometimes capitalization


can sort out the problem, but occasionally it does not: St (Saint)/St (Street), June/June, May (a person’s name)/May (the month), King/King. We will look at proper names in Chapter 4.

There are many homonyms with one very low frequency member. Green, for example, has a Scottish use as a verb, meaning to yearn for something. While the word with the yearning meaning would not come very high in most frequency-ranked word lists, it serves to illustrate that distinguishing homonyms is not straightforward. While the etymologies of two identical word forms are very helpful in distinguishing most homonyms, there are some homonyms where the words were originally related but have now moved so far apart that they are best regarded as separate words. These are called cognitive homonyms. The distinction between cognitive homonymy and polysemy is subjective, and some of the words below will be seen as polysemes by some. Table 3.3 contains a list of 84 high frequency cognitive homonyms. With careful thought and imagination it is possible to work out the original shared meaning of the same forms.

Table 3.3  Cognitive homonyms in the General Service List

BAR          beam of steel, pub, bar exam, to exclude
BEAR         to give birth to, to sustain (something unpleasant)
BELT         a strip (of leather, land), to shout or sing loudly, an alcoholic drink
CHARGE       electrical ~, ~ card, to officially accuse of crime, to make a rushing attack
CHEST        upper torso, treasure ~
CLUB         ‘I’ve a good mind to join a club and beat you over the head with it!’ Groucho Marx
COAST        to move easily/not work hard, coastlines
COMPANY      a corporation, fellowship/visitors
CONCENTRATE  to focus one’s thoughts, chemical solution of increased strength (i.e., juice)
COUNTRY      nation, far from the city
COURSE       direction or path (including of course), an area constructed for certain sports
COURT        place of law, an enclosed area, monarch’s place, to woo/to invite disaster
CROSS        intersecting lines, angry or ill-mannered
CRUSH        to smash, infatuation
CRY          to weep, to exclaim
CULTURE      customs/refinement, cultivated cells
CURE         to treat medically, to preserve (by salting, drying)
CURRENT      a flow, the present (adj)
DATE         to see romantically, to reveal one’s age indirectly (That dates me!)
DEGREE       unit of measurement for temperature or angles, level of completed education
DRAG         to pull along, ~ race, women’s clothing worn by a man, boring situation
DRAW         to sketch, to pull
DRIVE        to operate a motor vehicle, disc ~
DUTY         one’s responsibilities, a payment enforced by law
FAINT        to lose consciousness, hard to perceive
FALL         autumn, to descend (often accidentally)
FAST         quick, secure
FENCE        surrounding wall, to engage in fencing
FIGURE       form, to think
FINE         good, very small, sharp
FIRE         flame, to dismiss
FLAT         ~ tire, an apartment (in BE)
FOOT         body part, 12 inches, ~ the bill
FORMAL       correct and serious (~ language, ~ attire), related to form (especially in art)
FORWARD      toward the front or future, presumptuous/direct
GAME         competition, crippled (limb), certain wild animals
HABIT        something done regularly, a piece of clothing worn by nuns
JUST         only, fair
KIND         gentle-natured/friendly, a class (including kind of)
KNOT         a fastening (of rope), nautical measurement
LATE         tardy, recently deceased
LEFT         opposite of right, past tense of leave, liberal
LETTER       writing character, missive
LOT          much (a lot of), a yard (of land), random resolution of disputes, a portion
LOVE         the emotion, a score of zero
METRE        measurement of length, poetic or musical rhythm division
MIGHT        (modal verb), power
MOUSE        rodent, computer peripheral
NATURE       the natural world, inherent quality
NUT          hard fruit, eccentric or crazy person, part for securing bolt
ORDER        sequence/(properly) organized, direct command
ORGAN        body part that performs a specific function, musical keyboard
PARK         public area for recreation, to bring a vehicle to a halt in order to leave it there
PASSAGE      passing (a law, etc.), travel (by sea), a (narrow) passageway, a short part of a book
PATIENT      to have patience, a doctor’s client
PICK         to pierce/thrust/detach, to choose, to pluck with a plectrum
PLANT        flora, a factory
PRESENT      a gift, the current time
PRESS        to push, printing press
PRETTY       attractive, moderately
PRIVATE      not public, military rank
REALISE      to become aware of, to bring to fruition (dreams, designs)
RIGHT        (see homonyms above), conservative
ROCK         stone, to move back and forth, a kind of music
SEASON       a division of the year, to add spices or herbs to food
SECOND       next after the first, one-sixtieth of a minute
SENSE        the five physical abilities, to be aware of something, not nonsense, one meaning of a word
SENTENCE     a linguistic construction, a period of punishment
SHOOT        to use a gun or camera, new growing parts of plants
SHOWER       a spray of water, a party before marriage or childbirth
SPIRIT       ghost, alcohol
SPRING       the first season of the year, a source of water, a coil
STAFF        a long pole, people employed together
STAGE        a temporal division of development, a place for actors to act
STATE        a condition, the government, to make a statement
STONE        rock (including jewels), unit of weight (especially of a person)
STORY        a narrative, a floor of a building
STRIKE       to hit, an organized protest
TABLE        ~ and chairs, tabulated data
TRAIN        railway cars connected together, trailing section of a dress/body of followers
TRIP         journey, to fall
TYPE         a category, to use a typewriter or computer keyboard
WATCH        to see, a portable clock
WHIP         instrument for flogging, member of a party in parliament

Several of these cognitive homonyms may be worth distinguishing in word lists. For others, like bear, drive, duty, press, strike, or train, a teacher could help learners see the connection between the meanings.

The General Service List contains seven homographs (same written form, different pronunciation), and these are listed in Table 3.4 (Parent, 2012). We can see, for example, that the spelling close has two pronunciations with two unrelated meanings.


Table 3.4  Homographs in the General Service List

close 32845     Close (47%) the door before he gets too close (52%).
lead 68842      The lead (95%) singer developed lead (5%) poisoning.
minute 29613    I’ll need a minute (82%) to go over the minute (18%) details.
present 41841   At present (70%), he hates to present (30%) speeches.
row 7872        This was their third row* (10%) in a row (62%). (mainly in British English)
wind 11588      The wind (88%) will wind* (12%) its way through the crevices.
wound 4980      The bandages were wound* (20%) too tightly, making the wound (80%) worse.

According to the meaning frequency data in West (1953), only four homographs would each require two entries in the most frequent 3000 words – close, lead, minute, present. The three items not frequent enough are marked by asterisks (row meaning “loud argument”, and wind/wound meaning “turn”) and would be among the mid-frequency words. In West’s (1953) General Service List of English Words, the entry for row consists of two homonyms – row of houses, row a boat – and a homograph – make a row (a lot of noise). Individually, the frequency of each of these would not be enough to get into the list, but the combined frequency was enough. To be fair to West, frequency was not the only criterion for inclusion in the list, and he carefully distinguished the three meanings and gave their relative frequencies.
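The four-homograph result can be checked mechanically against the figures in Table 3.4, applying the same 2000-occurrence cut-off used for the homonyms. For row, only the two homograph percentages given in the table are used here; the additional homonym meanings of row are left out of this sketch.

```python
# Frequencies and meaning percentages from Table 3.4 (West's data as
# reported by Parent, 2012). A homograph needs two entries when both
# of its meanings clear the 2000-occurrence cut-off.
homographs = {
    "close":   (32845, [47, 52]),
    "lead":    (68842, [95, 5]),
    "minute":  (29613, [82, 18]),
    "present": (41841, [70, 30]),
    "row":     (7872,  [62, 10]),
    "wind":    (11588, [88, 12]),
    "wound":   (4980,  [80, 20]),
}

two_entries = [w for w, (freq, pcts) in homographs.items()
               if sum(freq * p >= 2000 * 100 for p in pcts) == 2]
print(two_entries)  # ['close', 'lead', 'minute', 'present']
```

Even the 5% meaning of lead clears the bar (about 3442 occurrences) because the family itself is so frequent, which is the same family-size effect seen with the homonyms.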

Word senses

Some makers of word lists (Biemiller, 2010) want to go further than distinguishing the members of homonyms and homographs. They also wish to distinguish related senses of words and count them as different words, for example, distinguishing appeal as in court of appeal from appeal relating to a request. There are several arguments in favour of doing this. Firstly, different senses have different collocates (these of course help distinguish the senses). Secondly, different senses may be translated by different words in other languages. For foreign and second language learners, this can be a compelling argument. Thirdly, different senses of members of a word family may be distinguished by different suffixes, so government could be counted as a different word from govern because the meanings of the two word forms may be seen to differ by more than the difference caused simply by the addition of the suffix -ment. Fourthly, different senses of members of a word family may be distinguished by part of speech, so walk as a verb may be seen as a different word from walk as a noun.


Let us now look at the arguments against distinguishing different senses as different words. As a part of normal language use, we are well accustomed to dealing with different senses of a word. Nagy (1997) makes a useful distinction between sense selection and reference specification when talking about how we process the different senses of words. Sense selection involves storing each sense as a separate entry in the brain so that we can directly access the sense we need. In computer terms, this simply requires storage space and a search, and does not require a lot of online processing. Reference specification involves storing a core meaning in the brain and using contextual clues and background knowledge to elaborate this whenever we deal with a particular use of the word. In computer terms, this requires more online processing. Treating related senses of words as different words does not give learners enough credit for what they are able to do with the help of context while reading. The essence of language is its flexibility. Every day we meet new uses of familiar words, but we deal with them with no fuss. Occasionally the new uses are striking, and may be deliberately so, for example went viral, but they are nevertheless something that we can deal with. It is also very difficult to decide where one sense ends and another begins. We can see this problem clearly if we compare the entries for the same word in different dictionaries.

When deciding whether or not to count the same forms or formally similar forms as different words, we can use questions like the following.

1. Is there an historical/etymological connection between the word forms, or is the relationship purely accidental or arbitrary? For example, is the relationship between process (a noun) and process (a verb) an etymological connection or a purely accidental or arbitrary connection?
2. Is the form connection easily recognizable?
For example, the spelling of process as a noun and a verb is exactly the same, but the pronunciation is slightly different. However, seeing the similarities between the two pronunciations is not difficult. 3. Do the forms share some common meaning? For example, is there a common meaning in the use of process as a verb and as a noun? Do all the uses of sweet share a common core meaning or are their meanings completely unrelated? 4. Are the meaning differences between the forms predictable from (i) the morphological differences, or (ii) the grammatical differences, or (iii) the collocational differences? For example, (i) is the difference between process and processing largely the difference between a noun form and a stem+ing form? (ii) Is the difference between process as a verb and process as a noun simply the
general difference that exists between verbs and nouns? (iii) Is the difference between the two uses of process in process the information, and process the raw material not something which is contained in the word process but which comes from the collocations which accompany it?

5. Are there other major meaning differences? For example, are the differences between process as a noun and process as a verb beyond those caused by the verb-noun difference? On a five-point scale for example, how wide are the differences?

Usually, the same form is used in different senses because the senses are related to each other; in the vast majority of cases it is no accident that the same word form is used to express different senses. With the exception of true homographs and homonyms, Ruhl's (1989) monosemic bias says that if two items have the same form, they are the same word. We should look for what is common between the items and see the differences between them as something that we have to deal with as a part of reference specification. However, with highly frequent senses, it is likely that the brain also stores them separately in the interests of efficiency, for example odd in its two senses of "unusual" and an "odd number", or execute meaning "to kill" and "to carry out an order". This does not mean however that they should be seen as different unrelated items.

These arguments are strong enough to suggest that for most purposes, different senses should not be distinguished in word lists. Exceptions to this may be when a word list is intended for speakers of the same first language and there is some compelling reason to distinguish words with different L1 translations. The argument against this is that one of the educational reasons for learning another language is to see how experience is classified differently in different languages, and using L1 translations to distinguish different senses as different words is misrepresenting the L2.

The treatment of polysemy is a source of disagreement amongst word list makers. Low proficiency learners may have problems in dealing with polysemy, and this encourages makers of word lists for such learners to distinguish senses of some words in their lists, such as lists for low level graded readers. For example, cried can refer to crying with tears and to shouting out something ("Follow me!" he cried). While it is very difficult to be consistent about this across a range of words, it is likely to result in more comprehensible low level graded readers. Homonymy and polysemy are pervasive language features, and the makers of word lists need clear and well-thought-out policies for dealing with them.




Recommendations

1. The members of homonyms and homographs, especially those that would get into the high frequency word lists, should be counted as separate word families. This is because they are different words, usually with no etymological connection. For the high frequency items, this is not an enormous task, with 17 homonyms and 4 homographs requiring separate entries. It is likely that tagging could eventually separate the members of homonyms, allowing more comprehensive counting of homonyms, but it is essential to clearly and reliably distinguish homonyms from polysemes, and this is primarily not corpus-based decision-making. That is, a list of homonyms needs to be created before tagging can be done.

2. Unless the aim of word list making is to take close account of one particular first language of foreign or second language learners, or to prepare material for very low proficiency learners, polysemes should not be counted as separate words, largely because they are not separate words. They are uses of the same word. There may be reliable ways of distinguishing different senses of words, but dictionary making suggests that this is not at all straightforward. In addition, learners of a language need to be able to see the relatedness between polysemes, because this is the essence of dealing with unfamiliar uses of known words.


chapter 4

Proper nouns


Paul Nation and Polina Kobeleva

Proper names are notoriously difficult to define (Kobeleva, 2008), although most people have a feeling for what they are. They are typically indicated by the use of an initial capital letter, but while all proper names are likely to be capitalized, not all capitalized words are proper names. Grammatically tagged corpora invariably contain items tagged as proper names, or more precisely proper nouns. Although these two signals (capitalization and tags) provide straightforward ways to identify them, like most other types of words, proper names have their own complications.

Rather than define proper names using a classical approach, which uses a set of criteria to clearly determine membership of the category, it is more appropriate to deal with proper names using prototype theory (Rosch, 1978), which draws on intuitions and allows one category to shade into another. This better reflects the nature of proper names.

Capital letters are also used at the beginning of sentences, and this use needs to be distinguished from the use of capital letters to mark proper names. There is also some inconsistency in the use of capitalization for proper names. Should the names of subjects like geography and economics be capitalized? Why is chess not capitalized when the game Monopoly is? A common way publishers and writers deal with the inconsistencies in the use of capital letters is through the use of style guides such as the Chicago Manual of Style, or APA, or in-house style guides. These, however, are not usually based on any consistent underlying theory of the nature of proper names.

In corpus-based research it is dangerous to rely on classifications of words that other people have made unless you know the criteria they used and you agree with them, and you have evidence that the criteria have been consistently applied.
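Separating sentence-initial capitalization from name capitalization can be automated only roughly. The sketch below shows one such heuristic; the function name and regular expressions are ours, not a procedure from this chapter. A capitalized form is kept as a proper-name candidate only if it also occurs capitalized away from the start of a sentence.

```python
import re

def likely_proper_names(text):
    """Crude filter: keep a capitalized form as a proper-name candidate
    only if it also occurs capitalized in a non-sentence-initial position."""
    mid_sentence = set()
    for sentence in re.split(r"[.!?]\s+", text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for i, word in enumerate(words):
            if i > 0 and word[0].isupper():
                mid_sentence.add(word)
    return mid_sentence

# "The" is capitalized only sentence-initially, so it is filtered out;
# "Mary" and "London" also occur mid-sentence, so they survive.
print(likely_proper_names("The cat sat. Mary saw the cat. Then Mary left London."))
```

Note that such a filter wrongly discards a name that happens to occur only at the start of sentences, which is one reason why tag information and hand-checking remain necessary.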
The history of vocabulary size testing, for example, is full of examples of researchers accepting dictionary makers' decisions and assertions about the words in their dictionaries which were demonstrably wrong and which made the resulting tests grossly misleading (Thorndike, 1924; Lorge & Chall, 1963; Nation, 1993). When doing research, it is better to do the basic classifications yourself, and explicitly describe your reasoning and decision making.

The largest category of proper names is personal names. These are frequent in writing, particularly in informative writing as in newspapers. Personal names


(including the names of gods and pets) are the prototypical core of proper names. Halfway between the core and the periphery come the names of places and enterprises. Together, personal names and the names of places and enterprises account for the vast majority of proper names. On the outermost edge of proper names come the names of events and artefacts. Events can include geographical epochs, historical eras, wars, military campaigns, battles, historical events, space missions, typhoons, hurricanes, floods, earthquakes, fires, sporting events, festivals, special celebrations and conferences. Artefacts include ships, boats, trains, aircraft, spacecraft, books, periodicals, chapters, articles, poems, songs, music, paintings, sculptures, films, radio and TV shows, and brand names (Kobeleva, 2008). Capitalization is more variable in these outermost categories, and the proper names are more likely to be multiword units rather than single words.

There are proper names that are capitalized that we might not wish to count as proper names, especially if we are doing a word count related to vocabulary learning. For example, the seemingly transparent names of businesses such as Quality Meat, Super-cheap Cars, or parts of such names as in Karori Swimming Pool, Creswick Garage, Northland Medical Centre, Jackson Street may be more appropriately counted as occurrences of the common nouns quality, meat, super, cheap, cars, swimming, pool, garage, medical, centre and street.

A feature of prototypical proper nouns is that they do not carry much meaning beyond the particular unique thing they refer to. The common noun book for example can be used to refer to particular books, but it has a core meaning that applies to all books. Proper nouns on the other hand are essentially ways of conveniently referring to very particular things without necessarily carrying meaning that is shared by other things that may at some time be referred to using the same name. All Johns are not the same.
Because of this, in word counts related to text coverage by vocabulary and lists to guide learning, proper nouns are usually assumed to involve words that require little if any previous knowledge. That is, they may be considered a part of known vocabulary. The thinking behind this is that before you read a novel, you do not have to know who the characters are. You learn who Bloom, Mulligan and Dedalus are when you read the novel. Similarly, newspaper stories explain relevant proper names in the story.

There are however proper names where previous knowledge is expected. Nagy and Anderson (1984) thought that there might be about a thousand of these. Consider these sentences.

He's no Einstein.
He's not another Gandhi.

It may be useful to create a separate list of proper names that would be treated like ordinary vocabulary in that their meanings are reasonably stable and they need to be learned. Such a list could include countries, important cities, very well-known people, and significant events.


With personal names, there may be some useful core knowledge, particularly which names are typically male names (John, Fred, Jack, Tom) and which are typically female names (Janet, Joan, Jill, Johanne).

There is another feature that makes proper names different from most other words. Most, but not all, proper names remain the same regardless of the language being used. For example, my name remains the same regardless of whether it is used by people speaking English, French or Japanese. Thus knowledge of proper names is not necessarily unique to one particular language.

Brown (2010) sensibly advises caution in assuming that proper nouns are not problematical. Research by Kobeleva (2012) has shown that, for listening at least, previous knowledge of proper names significantly and substantially improved comprehension of a news text on two out of three comprehension measures. This previous knowledge included knowledge of the word form of the proper name and practice in linking it to what it referred to. It may be that proper names pose less difficulty in written text, but that is an assumption that still needs to be tested. When making word lists, it is useful to distinguish proper names from other types of words, but assuming that they do not require some previous knowledge when met in listening and reading is being optimistic. It is likely that a higher than usual density of unfamiliar proper names may increase comprehension difficulty.

At present proper names are distinguished in text analysis programs like Range and AntWordProfiler by using lists of proper names. Because there are so many different proper names and a very large proportion of proper names in most texts are outside existing lists, the initial analysis of the vocabulary in a text usually results in more words being added to the proper names list, and then the analysis is run again.
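The cycle of running the analysis, inspecting the leftovers, extending the proper-names list and re-running can be sketched as follows. This is a simplified illustration, not actual Range or AntWordProfiler code, and in practice each batch of capitalized leftovers would be vetted by hand before being added to the list.

```python
def unknown_words(tokens, known_families, proper_names):
    """One pass of a Range/AntWordProfiler-style profile (much simplified):
    return the tokens covered by neither the word lists nor the names list."""
    return [t for t in tokens
            if t.lower() not in known_families and t not in proper_names]

def grow_proper_name_list(tokens, known_families, proper_names):
    """Repeat the analysis, promoting leftover capitalized tokens to the
    proper-name list, until a pass adds nothing new."""
    proper_names = set(proper_names)
    while True:
        candidates = {t for t in unknown_words(tokens, known_families, proper_names)
                      if t[0].isupper()}
        if not candidates:
            return proper_names
        proper_names |= candidates
```

The loop always terminates because each pass either adds at least one new name or stops.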
Nation and Webb's (2011: 146) analysis of the vocabulary outside the lists in the British National Corpus indicated that well over half of the words occurring only once in the British National Corpus which were not in the BNC/COCA word family lists were proper names. Proper name lists are unavoidably in continual need of amendment. A list of proper nouns made from the BNC contains well over 250,000 types, of which well over one-half occur only once in the BNC, and of which many are multiword units, for example, New Zealand, Yves Saint-Laurent. Table 4.1 contains the proper nouns at the top of this list which would get into the top 3000 high frequency words of English. The frequencies are based on the BNC.

The proper noun tagging policy as followed in the British National Corpus is generally very suitable for dealing with proper nouns when making word lists for course design and text analysis (see Appendix 1). The categories of proper nouns tagged in the British National Corpus include Kobeleva's (2008) core category, personal names. Of the categories between the core and the periphery, the names of places and enterprises, the British National Corpus proper noun tagging includes


Table 4.1  The most frequent words tagged as proper nouns in the BNC

Proper noun         Frequency
LONDON                 10912
UNITED STATES           6929
ENGLAND                 6043
NEW YORK                5393
MAY                     5206
APRIL                   5199
BRITAIN                 5016
US                      5015
MARCH                   4798
JUNE                    4777
JULY                    4646
MR                      4251
UNITED KINGDOM          4207
NORTHERN IRELAND        4187
UK                      3578
EUROPE                  3484
SCOTLAND                3471
FRANCE                  3438
OCTOBER                 3173
JANUARY                 3067
SOUTH AFRICA            3012
SEPTEMBER               2966
NOVEMBER                2908
DECEMBER                2885
FEBRUARY                2699
OXFORD                  2649
HONG KONG               2647
NEW ZEALAND             2517
AUGUST                  2418
DARLINGTON              2412
AMERICA                 2409
WALES                   2189
DEC                     2097
JOHN                    2019
AUSTRALIA               2003

Note: Frequency per 100 million tokens


geographical names, and enterprises such as Volkswagen, Unilever, Xerox, including capitalized acronyms, such as RAF, UNESCO, IBM and RSPCA. It does not tag descriptive names as proper nouns, such as United Nations or World Health Organization. The British National Corpus proper noun tagging also includes the closed classes of days of the week and months of the year, and titles such as Mrs, Mr, Miss, Reverend, President, Prince, Sheikh, where they are followed by a proper name.

Note that in the British National Corpus two or more consecutive proper nouns are counted as the same proper noun. This includes items like Mrs Thatcher, John Major, and President Bush. This sometimes results in erroneous clusters like America Hillary. Words like British and American are not counted as single word proper nouns because they are adjectives.

There are items tagged as proper nouns in the British National Corpus that are probably the result of errors in the corpus. Here are some examples: AMERICASFALKLAND, AMERICASST VINCENT, AMERICASTRINIDAD, AFRICABURKINABASIC, AFRICANIGERIABASIC, AFRICASOUTH. This means that if you extracted all the proper nouns from the British National Corpus and used them as a proper name list, it would contain several unusual items.

In principle, not tagging descriptive names as proper names is sensible from a language learning perspective, because it is usually not necessary to distinguish words like road, organization, garage, hairdressers, hospital when they are used as part of the name of an organization or enterprise from when they are used as common nouns. Their meaning is typically the same. The highly frequent but small groups of days of the week, months of the year, and titles can be classified as proper names or not without much effort. It is preferable not to count them as proper names, as they are usually an early focus of learning in language courses and carry a meaning like common nouns.
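The policy of counting consecutive proper nouns as one multiword name can be sketched as below. The (word, tag) input format and the function name are our assumptions for illustration; NP0 is the CLAWS tag used for proper nouns in the BNC.

```python
def merge_proper_nouns(tagged_tokens):
    """Treat runs of consecutive proper-noun tokens as a single multiword
    name, in the style of the BNC policy described above."""
    merged, current = [], []
    for word, tag in tagged_tokens:
        if tag == "NP0":
            current.append(word)
        else:
            if current:
                merged.append(" ".join(current))
                current = []
            merged.append(word)
    if current:  # flush a name that ends the token stream
        merged.append(" ".join(current))
    return merged
```

The sketch also shows how clusters like America Hillary arise: two unrelated names that happen to be adjacent in the token stream are merged into one.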
Here are the words most frequently tagged as proper nouns in the British National Corpus which are also in the BNC/COCA 1st 1000 word families: Mrs, Sir, Road, Street, Miss, Lord, Mark, Hill, Lady, River, Van, General, Bill, Brown, King, Major, Young, May, Green, Island, Lake, West, Wood, Place, Father, White, Square. There are many other words like these in the 1st 1000 BNC/COCA list, but these are the most frequent. Note that words like Mark, Bill, Brown, Young, May, Green, White are homonyms if capitalization is ignored. That is, mark can be someone's name as well as being a verb and a common noun.

As Table 4.2 shows, 658 word types in the 1st 1000 families in the BNC/COCA lists also occur in the words tagged as proper nouns in the BNC, and every 1000 word level in the word family lists contains on average 200 word families that also have forms that appear in the words tagged as proper nouns. This large number of words, which includes several hundred very frequent words, makes it clear that it is


Table 4.2  Number of proper noun homonyms in the BNC/COCA word family lists

Word list        Number of word families
one                  658
two                  557
three                307
four                 324
five                 324
six                  276
seven                259
eight                251
nine                 247
ten                  238
11                   229
12                   264
13                   240
14                   233
15                   210
16                   241
17                   218
18                   210
19                   209
20                   201
21                   174
22                   167
23                   177
24                   270
25                   252
26                    52
27                    37
28                    34
Proper            15,887
Marginal              29
Compounds            468
Abbreviations        747
Not in lists      76,215
Total            100,205

important to distinguish the proper noun use from other uses of these words when counting and making word frequency lists. To reinforce this point, here are the most frequent of the 238 words from the BNC 10th 1000 list that also have uses tagged as proper nouns in the British National Corpus – Glen, Paddy, Lynch, Shah, Mart, Barrow, Dyke, Ayatollah, Palazzo, Doe, Provost, Stead, Regent, Parson, Tong, Fiat,


Alderman, Plumb. These words are a mixture of personal names (forenames and surnames), place names and geographical features, enterprises, and titles. The ones that are homonymic are the personal names (Glen, Paddy, Lynch, Barrow, Stead), and enterprises (Fiat), while the others are not homonyms (Shah, Mart, Palazzo, Alderman) in that they have the same meaning in their proper noun use as in their non-proper noun use.

There are 2,140 proper noun homonyms which occur in more than one text in the British National Corpus and which have a frequency of at least 50 occurrences in the BNC (Table 4.3). Note that this still leaves a lot of potential homonyms outside this group, but their range is narrow and their frequency is low (less than 50 per 100 million tokens).

Table 4.3  Frequent and wide range proper noun homonyms in the BNC

Over 90% of occurrences as proper nouns     1,497
10%–90% of occurrences as proper nouns        583
Under 10% of occurrences as proper nouns       60
Total                                       2,140

Around 70% (1,497) of the 2,140 proper noun homonyms have over 90% of their occurrences as proper nouns. These include a large number of items such as Jim, February, Buckinghamshire where the non-proper noun occurrences are typically only a few, often one or two, suggesting a tagging problem or a rather unusual use, perhaps adjectival. Exceptions include august (where the adjectival use meaning "revered" is rare but clearly distinguishable from the proper noun use), smith, lee, mike, ken, jack, and march.

Around 27% (583) have 10%–90% of their occurrences as proper nouns. The most frequent of these are listed in Table 4.4. Most proper noun homonyms are thus predominantly used as proper nouns rather than common nouns, but there are a few hundred where the use not as proper nouns is high enough to make it worth distinguishing the proper noun use from other uses. A few proper noun homonyms (60 of the 2,140) have over 90% of their occurrences not as proper nouns. These include items such as labor, will, good, nice.

Table 4.4 lists the most frequent of the homonyms where the proper name usage makes up from 10% to 90% of the total occurrences of the word form (583 of the 2,140). The main reason for looking at such homonyms when making word lists is to see if some words are getting into the high frequency word lists because of the combined frequency of their use as a common noun or other part of speech and their proper name frequency. Their inclusion in a high frequency word list should be because of the frequency of their use not as a proper noun, particularly where their proper name meaning is not the same as their other use. For example, we


Table 4.4  The most frequent words in the first 2000 BNC which have 10%–90% of their occurrences involving use as a proper noun

Homonym    BNC frequency    % proper noun use
May              127,301               11.68%
US                76,338               20.59%
Major             28,642               16.59%
White             23,427               10.71%
Lord              16,079               69.70%
King              15,765               24.34%
Bill              13,557               27.72%
Mark              12,091               49.74%
Rose              10,622               12.74%
Brown              8,410               44.95%
Grant              7,421               27.09%
Wood               7,177               24.31%
Hill               6,900               53.62%
China              5,467               83.60%
Castle             5,420               11.76%
Silver             4,881               12.81%
Bishop             4,553               40.21%
Lane               4,435               51.23%
Marks              4,359               20.33%
Guy                3,844               36.94%
Cook               3,811               36.56%
Bath               3,788               26.36%
Bush               3,717               75.10%
Pat                2,097               81.47%
Ray                2,094               85.44%
Ace                1,917               14.34%
Gay                1,844               12.47%
Penny              1,834               37.02%
Lamb               1,635               33.58%

need to make sure that the word bill does not sneak into a high frequency word list because of the combined frequency of its use as a proper name Bill (often a short form of William) and its use meaning the bill you need to pay or a bird's bill. Ideally it should get into the list on its use as an ordinary word.

In Table 4.4, we can see that may occurs 127,301 times in the 100 million token BNC, and 11.68% of these occurrences are as a proper noun. These uses as a proper noun include the name of a month of the year and a person's name. Even if we removed this 11.68% from the total occurrences, the word may would still easily get into the 1st 1000 words on the frequency of its use as a modal. The same applies to bill.
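This check can be done mechanically with the figures in Table 4.4. A small sketch (the function name is ours; the cut-off of 2000 occurrences per 100 million tokens is the one used for the first 3000 words in this chapter):

```python
# Frequencies and proper-noun shares are taken from Table 4.4 (BNC figures).
HIGH_FREQUENCY_CUTOFF = 2000

def non_proper_frequency(total_freq, proper_noun_share):
    """Frequency left over once proper-noun occurrences are removed."""
    return total_freq * (1 - proper_noun_share)

for word, freq, share in [
    ("may", 127_301, 0.1168),
    ("bill", 13_557, 0.2772),
    ("bush", 3_717, 0.7510),
    ("pat", 2_097, 0.8147),
    ("ray", 2_094, 0.8544),
]:
    remaining = non_proper_frequency(freq, share)
    status = "stays in" if remaining >= HIGH_FREQUENCY_CUTOFF else "drops out of"
    print(f"{word}: {remaining:,.0f} remaining occurrences -> {status} the high frequency lists")
```

may and bill comfortably survive the removal of their proper-noun uses, while bush, pat and ray fall below the cut-off, consistent with the observation later in this chapter that their high-frequency status rests on their proper-name uses.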


Note that words like street, north, northern, god, square, and island have not been included in the list because their meaning is essentially the same in both proper noun and common noun uses, except of course where they are people's names. Note also that the frequency of the last four items is below the high frequency cut-off point of 2000 per 100,000,000, which is the cut-off point for the first 3000 words (see Table 2).

Note in Table 4.4 that words like Mark, Brown, Hill, Bishop, Lane, Bush, Pat, and Ray have a high percentage of proper noun uses. Some but not all of the proper noun uses of Hill, Bishop, and Lane may be a part of a name where the word has its common noun meaning. The inclusion of bush, pat, and ray in the most frequent 2000 words is strongly influenced by their high proportion of proper name use, and if proper name uses were counted separately, they would not get into the most frequent 2000 word families on the frequency of the remaining uses.

There are other proper name homonyms in the first 2000, but their proper noun use is very infrequent (less than 10% of their total occurrences), or they are derived forms like woody, moody, baker, sung, fuller, rocky, hunter, and their proper noun frequency has little effect on the inclusion of their family in the high frequency words. In their research to create an essential word list, Dang and Webb in Chapter 15 carefully distinguish the proper name uses of proper name homonyms from their other uses.

Needed research

A list of content bearing proper names needs to be developed, including proper names like Gandhi, Einstein, and Panmunjom where background knowledge is needed to successfully deal with most uses of the name. How accurate are proper name taggers, particularly in dealing with homonyms?

Recommendations

1. Proper names should be counted as a separate category from other words. This is because semantically proper names are different from other words.

2. It is important to describe explicitly what is categorized as a proper name and what is not.

3. Proper name lists should be continually updated as a result of examining the words that are not in the existing lists when a particular text or collection of texts has been processed using the lists.



4. Within the category of proper names, a list of proper names that rely on previous knowledge should be developed. This is because these words require learning in much the same way other words are learned. This list will also need continual updating as particular individuals, places, enterprises and events become well known. Clear criteria for the inclusion of words in this list should be developed.

5. A well-conducted word frequency count should distinguish the members of homonyms where one member is a proper name with a meaning that is clearly different from its common noun use. This means that generally it is not useful to distinguish the proper name use of words like Street, Place, North, and Garage, but it is worth distinguishing the members of homonyms like Pat, Ray, and Bush (and whoever the next homonymic leader may be). The major reasons for distinguishing the different uses are that their meanings are different, there is a large number of such homonyms, and for a relatively small number of them the combined frequency of their proper noun and common uses can affect their placement in frequency-ranked lists. If large scale homonym differentiation is not possible, then there should be careful examination of each word in the small group of proper noun homonyms where the frequency of proper noun usage may result in the unwarranted inclusion of the word in a high frequency word list (see Table 4.4). In any individual text or series of related texts, particular proper names, which may be homonyms, can occur with a high frequency, and when analysing individual texts it is worth checking if proper name homonyms are affecting coverage figures.

6. The British National Corpus policy of tagging adjoining proper nouns as the same multiword name (John Major, Mrs Thatcher, Wicketkeeper Adam Parore) leads to some unwanted sequences (Widcombe Manor Bath Somerset), but it is worth considering including this policy in counting programs if there is a strong reason for wanting to preserve the unity of compound proper names.

chapter 5


Hyphenated words and transparent compounds

Hyphenated words are simply words containing a hyphen. The item before or after the hyphen can be an affix (sub-zero, wife-less), or a part of a word when justifying text (advent-ures), or a word that can stand by itself (open-plan, life-span). Although there are many reasons for using a hyphen, the overarching reason is to signal a close relationship between the hyphenated items. The many reasons for using hyphens include the following:

- syllabification in justified texts,
- hyphenation of prefixes and suffixes for clarity and to avoid awkwardness, as when two adjacent vowels are pronounced separately (co-operation not coop), and to avoid ambiguity (re-storing = "storing again"),
- to show grammatical relationships when two words modify the meaning of another word (long-term decision not a long decision),
- in items that are seen as compounds or as being like compound words,
- to replace single letters when censoring words (b----r that),
- in duplicated words (a softly-softly approach).

There are over 10,000 hyphenated words with a frequency of 2 or higher in the British National Corpus which also have a non-hyphenated compound form (written with no spaces) occurring in the corpus. Table 5.1 contains those where the hyphenated form occurs 1000 times or more in the corpus. Several of these items can also occur as separate non-hyphenated words, e.g. short term.

When we look at Table 5.1, it seems that some occurrences of the non-hyphenated forms, for example nineteenthcentury, may simply be typographical errors where the space was accidentally omitted. It is likely that many of the hyphenated words that occur only once and which have a non-hyphenated form in the BNC may be errors or incorrect uses and thus do not typically have non-hyphenated compound forms.

Here are some examples of hyphenated words from the British National Corpus to show the range of frequency relationships between hyphenated and compound forms – childcare 98, child-care 88; childlike 105, child-like 28; childbirth 178, child-birth 6; childcentred 1, child-centred 64. The examples range from roughly equal frequency, to higher frequency for the compound form, to higher frequency for the hyphenated form.
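The effect of the two hyphen policies discussed in this chapter (treating a hyphen as a letter versus treating it as a word divider) on token and type counts can be shown with a toy example. The tokenizer below is our own simplification, not a real counting program:

```python
import re

def tokenize(text, hyphen_is_letter):
    """Tokenize under two hyphen policies: either '-' is part of the word,
    or it acts as a word divider (the effect of replacing '-' with ' - ')."""
    pattern = r"[a-z]+(?:-[a-z]+)*" if hyphen_is_letter else r"[a-z]+"
    return re.findall(pattern, text.lower())

text = "A short-term plan, not a shortterm plan or a short term plan."
as_letter = tokenize(text, True)    # 'short-term' is one token: 12 tokens, 8 types
as_divider = tokenize(text, False)  # 'short' + 'term' separately: 13 tokens, 7 types
```

Counting the hyphen as a letter gives fewer tokens but more types than splitting at hyphens, which is the trade-off noted later in this chapter.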


Table 5.1  The most frequent hyphenated forms with a corresponding compound form and their frequencies in the BNC

Compound form        Frequency    Hyphenated form        Frequency
longterm                    24    long-term                   4088
videotaped                  34    video-taped                 3697
cooperation                499    co-operation                3637
socalled                     7    so-called                   2642
fulltime                    28    full-time                   2169
parttime                     3    part-time                   2053
noone                       22    no-one                      1975
shortterm                    4    short-term                  1746
northeast                   89    north-east                  1665
postwar                    124    post-war                    1598
wellknown                    6    well-known                  1509
decisionmaking               8    decision-making             1422
middleclass                 11    middle-class                1356
makeup                      69    make-up                     1296
southeast                  130    south-east                  1270
twothirds                    2    two-thirds                  1149
oldfashioned                 6    old-fashioned               1148
twentyfive                   3    twenty-five                 1111
largescale                   8    large-scale                 1110
cooperative                260    co-operative                1090
vicepresident                1    vice-president              1073
nineteenthcentury            3    nineteenth-century          1015
northwest                  133    north-west                  1013

The basic problem with compound words is being consistent in dealing with them when they occur with a hyphen (short-term), when they occur as a non-hyphenated compound word (shortterm), and when they occur as two separate words (short term). A decision needs to be made whether hyphens are seen as letters and thus as parts of words, or whether hyphens are seen as being like word dividers separating words.

The policy of not counting hyphens as part of a word and editing a corpus by replacing hyphens with [space] hyphen [space] has positive effects when the hyphen is used to show grammatical relationships and compounding, and when it is used in duplicated words. Note however, as in Table 5.1, that many hyphenated words also exist as compound words without the hyphen (colour-blind, colourblind; coal-mine, coalmine).

Separating the hyphenated parts has negative effects with syllabification, in that the syllabified words with a hyphen would be counted as new words (illustr-ated and illustrated would be counted as three different words). Ideally such hyphenation would be removed in a corpus, replacing the hyphenated words with the original

Ngawang Trinley (202880) IP: 118.167.28.75 On: Tue, 14 Jan 2020 02:12:15



Chapter 5.  Hyphenated words and transparent compounds

unhyphenated word, before counting. Such editing should also include where hyphens are used to replace single letters. Replacing hyphens with space hyphen space has negative effects with hyphens used with prefixes and suffixes because the prefixes and suffixes are then counted as separate items. If a hyphen is counted as part of a word, then this results in fewer tokens in the corpus and more types. The safest approach could be rather time-consuming but involves dealing with the different uses of hyphens in different ways. This approach could be helped by the existence of different kinds of electronic hyphens, and the internet can provide useful advice on finding soft hyphens. Here is how you can deal with soft hyphens using Find and replace. You can copy/paste one of the soft hyphens into the Find & Replace “Search for” field, or you can set “Options/Regular expressions: ON” and enter \x00AD in place of the soft hyphen. If you have trouble selecting a soft hyphen, try one of these tricks. 1. Select & copy some characters on either side, paste into the dialog, then delete the extra characters. For me, the soft hyphen is visible in the dialog. 2. Use the cursor keys instead of the mouse: move to the position before the character before the hyphen, then use the right arrow key to move one position to the right (i.e. before the soft hyphen). Then hold Shift and arrow one more step to the right. That will select one character, the soft hyphen. Copy/paste into the dialog.  https://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=26817
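The hyphen-editing policy described above can be sketched in a few lines. This is a minimal illustration only; the function name and the choice of Python are mine, not the book's, and a full corpus-preparation program would also handle the other hyphen uses discussed (prefixes, suffixes, letter replacement).

```python
SOFT_HYPHEN = "\u00ad"  # the invisible hyphen that marks line-break points

def normalize_hyphens(text, hyphens_as_dividers=True):
    # Soft hyphens only mark where a word may break across lines;
    # deleting them rejoins the word (illustr<SHY>ated -> illustrated).
    text = text.replace(SOFT_HYPHEN, "")
    if hyphens_as_dividers:
        # Treat ordinary hyphens as word dividers:
        # "short-term" -> "short - term"
        text = text.replace("-", " - ")
    return text

print(normalize_hyphens("illustr\u00adated"))   # illustrated
print(normalize_hyphens("a short-term plan"))   # a short - term plan
```

Setting hyphens_as_dividers to False corresponds to the alternative policy of counting the hyphen as part of the word, which gives fewer tokens and more types.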

Transparent compounds

Transparent compounds are words that can be written as one word (with no hyphens or spaces within the word) but are made up of existing words, and the meaning of the compound is transparently related to the meanings of its parts. The most common ones in the British National Corpus in order of frequency are railway, weekend, shareholder, lifespan, policeman, widespread, workshop, birthday, outstanding, airport, upstairs, bathroom, classroom, underground, worldwide, wildlife, businessman, all occurring over 2000 times in the BNC, and thus frequent enough to get into the most frequent 3000 word families. Most of these also occur in a hyphenated form in the British National Corpus, often but not always with a big frequency difference between the compound and hyphenated forms.

To be classified as a transparent compound, the meaning of the word should have a close and obvious connection with the meaning of its parts. The test for a transparent compound is whether the meaning of the compound can be expressed largely through a paraphrase using its parts, for example “the weekend is the days at the end of the week”. The most transparent compounds will be those where all the content words in the paraphrase are part of the compound word. Research is needed to see the problems with this criterion and how many additional content words can be permitted in the paraphrase. The paraphrase of weekend includes one (days). There is also a problem with the criterion when part of the paraphrase is a function word (inbox, output).

The reason for distinguishing transparent compounds and counting them separately as known items is that they require little or no learning once their parts are known. If they were counted as new unknown words, they would inflate the number of unknown words. One effective solution, used by Brysbaert, Stevens, Mandera and Keuleers (2016), is to split the compounds into their parts so that each part is counted as a separate word, and their occurrences are thus counted along with the non-compounded uses of the component words. So, classroom would be split into class and room. The arguments in favor of this solution are that the whole is largely made up of the parts, so separating them is not problematic, and that many compounds also exist as hyphenated forms (see Table 5.1) and as non-compounded collocations, for example, classroom, class-room, class room. Such a treatment would also be in line with separating the members of hyphenated forms. It also allows all uses of the component words to be counted under the same word. That is, the occurrence of home in homework is counted under home, which results in a more realistic representation of the frequency of home.

The argument against this solution is that the compounded form of the word would then be lost in the count. However, in a family-based count, inflected and derived family members are also lost in the count of word families. That is the purpose of counting in families. Transparent compounds could be counted as family members of one of their word parts, but this raises the problem of which family they would be counted under. Would weekend be counted under week or end?
It would also misrepresent the overall frequencies of the parts, as the frequency of the end component of weekend is ignored if weekend is counted under week. The problem of how to deal with transparent compounds would need to take account of the purpose of the count. For a text coverage study, the transparent compounds could be counted in a separate list and their coverage added to the known coverage figure. In a word frequency study, a decision would need to be made on whether to treat each compound as a family member of the most frequent or least frequent family of the words in the compound, or whether to split them into their component words. The decision would need to be described and justified. For example, the decision to split transparent compounds into their parts could be justified by showing that the meanings of the parts make up the meaning of the whole, that counting a transparent compound as one word affects the frequency counts of its parts, and that many compounds also appear in a hyphenated form or with a space between the parts, so there should be consistency in dealing with the three forms (compounded, hyphenated, spaced). In a list made for use as a source of words for a vocabulary test, the compounds should be separated into parts or counted in a separate list which is not sampled from for the test. If this is not done, an infrequent compound word in the test would be highly likely to be well known through morphological problem solving.

There is a list of transparent compounds (basewrd33) that is used with the Range and AntWordProfiler programs. This list contains over 3000 families, with around two members per family. The transparent compounds in this list make up 0.36% of the tokens (one per 300 running words) and 1.41% of the types in the BNC. The output from the Range program has been checked to make sure that all transparent compounds above a frequency of 40 per 100,000,000 tokens in the British National Corpus are included in the list. So, it contains all the higher frequency compounds but is by no means a complete list of transparent compounds. The vast majority of the transparent compounds in the list easily fit the criteria for a transparent compound, though there is likely to be a cline of transparency, with items such as grassroots and godawful being at the less transparent end of the scale. A few compounds have one component in a reduced form – flexitime, adman.

Compound words can be classified according to their internal structure and part of speech, but in creating word lists, the main classification needs to be into those that are not readily interpretable from their parts and those that are transparent. While some transparent compounds may always be written as a compound word, there is a lot of variation in whether an item is written as a compound word with no spaces, as a hyphenated item, or as two or more separate words. For this reason, we must not see compound words as a stable class of items. What is a compound word in one text could be two separate words, or two hyphenated words, in another.
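The splitting solution could be implemented along these lines. This is a sketch only: the three-entry compound table is my own illustrative stand-in for a real transparent-compound list such as basewrd33.

```python
from collections import Counter

# Tiny illustrative stand-in for a real transparent-compound list
# (e.g. basewrd33); the entries here are my own examples.
TRANSPARENT = {
    "classroom": ("class", "room"),
    "weekend": ("week", "end"),
    "homework": ("home", "work"),
}

def count_tokens(tokens, split_compounds=True):
    counts = Counter()
    for token in tokens:
        if split_compounds and token in TRANSPARENT:
            # Each part is counted along with the non-compounded
            # uses of the component words.
            counts.update(TRANSPARENT[token])
        else:
            counts[token] += 1
    return counts

tokens = "the class met in the classroom to check their homework".split()
counts = count_tokens(tokens)
print(counts["class"], counts["room"])  # 2 1
```

With split_compounds set to False, classroom is counted as a word in its own right and the frequencies of class and room are correspondingly lower, which is exactly the misrepresentation discussed above.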

Needed research

What and how many compound words are never written with a hyphen or as two separate words? How frequent are they?
Frequency of the parts of compound words: Are most compound words made from high frequency words? Are some words members of several different compounds?
What are the commonest non-transparent compounds? How many are there? Are most compounds transparent?
Are some compound parts marginal affixes, e.g. grand-, -wide?
Can learners easily cope with transparent compounds, and to what degree is this ability related to proficiency and to the learners' L1?


Recommendations

1. Hyphens fill a variety of roles, and a corpus may need to be prepared so that unwanted hyphens are removed without creating typographical errors in the corpus. It is worth developing a simple computer program for this purpose.
2. Hyphenated words and transparent compounds need to be dealt with in a consistent, well-justified way, taking account of the fact that the same two words may appear in different places in a corpus as a hyphenated item, a transparent compound and two separate words.
3. The strongest justifications are probably for separating hyphenated words and breaking transparent compounds into their parts, but there are also reasonable justifications for other solutions such as having separate lists for transparent compounds and counting hyphens as word separators.

chapter 6

Multiword units


Paul Nation, Dongkwang Shin and Lynn Grant

Multiword units are phrases that are made up of words that frequently occur together. These go under a wide variety of names, including collocations, idioms, lexical bundles, and formulaic sequences, all of which are defined in various ways. In this chapter, we will look at making lists of multiword units and at the issue of including some multiword units in word lists of largely single words.

By far the biggest problem in making lists of multiword units is developing a clear operational definition of what will be counted as a multiword unit and then consistently applying that definition. Researchers are well aware of this problem, and this awareness is one of the causes of the large number of terms used for multiword units. Renaming and redefining is an attempt to separate a new count of multiword units from previous counts by using the new name to signal a different set of defining features.

Co-occurrence and frequency

The simplest and most reliable criteria, although not necessarily the most valid, are those that rely solely on form and frequency of occurrence. A computer program is used to find strings of words (often called n-grams, where n equals the number of words in the string) where the strings are a certain length, say four words long, and of a minimum frequency. Usually in such counts, the words in a string need to occur immediately next to each other, no variations in the forms of the words are allowed, and the strings need not be grammatically or semantically complete units. This means that sequences such as the United States of would be counted as a string, and the president of the and the presidents of the would be counted as two different strings even though they differ by only a single inflection. However, many of the strings produced in such counts do make sensible units – end of the day, on the other hand, United States of America.

Table 6.1 is slightly adapted from Nation and Webb (2011: 177). It lists the factors affecting the decisions that need to be made when counting multiword units. Each of the first four factors involves a choice between a criterion that a computer can apply (listed below as the first choice) and a criterion that requires human judgement (listed below as the second choice). For the first three criteria, the second choice also includes the first choice. That is, for example, if discontinuity is allowed, then both adjacent and discontinuous items are counted. If discontinuity is not allowed, then only adjacent items are counted. For criterion 4, Grammatically incomplete/Grammatically structured, Grammatically incomplete can also include grammatically complete items, but not vice versa.

Table 6.1  Form-based factors affecting frequency counts of multiword units

1. Adjacency/Discontinuity. Words in the multiword unit can occur right next to each other or be separated by a word or words not in the multiword unit. (by and large / serve s.o. right)
   Caution: If you count items that are not adjacent, you need to be especially careful that the items are in fact part of a multiword unit.

2. Grammatically fixed/Grammatically variable. A multiword unit can be counted as a fixed form which is unchanging, or the components in the multiword unit can occur together in a variety of grammatical and affixed forms. (to and fro / pulling your leg)
   Caution: If you count variable items, you need to search using a variety of search words and combinations of words.

3. Lexically invariable/Lexically variable. A multiword unit can include some substitutable words of related or similar meaning, or it can contain items that cannot be replaced by others. (as well as / once a week)
   Caution: If you count multiword units that allow substitution, you need to examine a lot of data manually to make sure you are including and counting acceptable substitutions.

4. Grammatically incomplete/Grammatically structured. A multiword unit can be a complete grammatical unit, such as a sentence, a sentence subject, a predicate, an adverbial group etc., or it can be grammatically incomplete. (on the other hand / on the basis of)
   Caution: If you count grammatically structured items, you need to have clear criteria describing what is and what is not grammatically structured.

5. Number of components. A multiword unit must contain at least two words. Some counting is done with a limit on the number of words in the multiword units, some counting has no limit.
   Caution: If you count multiword units with no limit on the number of words, it becomes even more important to check that they meet other criteria for being a multiword unit.

Let us look briefly at the factors. If only adjacent words are counted as multiword units, then shut up may be counted as a unit but not when it occurs with a pronoun separating the two parts – shut him up. Counting discontinuous as well as adjacent occurrences would involve counting both of the two examples as the same multiword item. A big problem with counting discontinuous items is that in some idiomatic expressions the order of the words can change. That is, they can be grammatically variable – once in a blue moon, when the moon is blue. A related problem is that two items that are normally part of a multiword unit may occur close to each other and yet not be part of the same unit. For example, if we were checking whether the adjective beautiful was used with males, man or men, we would need to exclude instances such as Men like beautiful things, because beautiful is not modifying men even though men and beautiful occur near each other. Dealing with discontinuous items involves lots of manual checking.

The same cautions apply to grammatically variable multiword units. The expression pull someone’s leg means to tease or trick them – You’re pulling my leg. This expression can have numerous forms, such as Your leg is being pulled and Pull the other one, it’s got bells on it.

Lexically variable items allow substitution of words with related meanings. The most common is pronoun substitution in multiword units like serve s.o. right, where s.o. (someone) can be you, him, her, them, us etc. It is not always easy to draw the line with multiword units like seven days a week, where different numbers can be substituted, because substitution is also possible with days and week, as in ten days a year. The clearest way is to list or describe the substitutions that were counted.

A grammatically incomplete multiword unit cannot function as an immediate constituent of a sentence (subject, verb, object, adverb, adjective, conjunction, preposition, predicate, sentence) because the unit crosses constituent boundaries, or is not a complete constituent. Thus, very frequent groups like of the, the main, him in front of are grammatically incomplete. The main justification for this criterion is that a multiword unit needs to be learnable.
This criterion is often not used in computer-based studies because of the time-consuming manual judgement required to apply it. Some studies place an arbitrary limit on the number of parts making up a multiword unit, especially if the criterion of grammatical completeness is not applied.

Shin and Nation (2008) found that there was a reasonable number of grammatically well-formed high frequency multiword units (see Table 6.2 for examples), that frequent words had more collocates than less frequent words, that the frequency of multiword units followed a pattern roughly like that described by Zipf’s law, and that the frequent multiword units tended to be short. High frequency multiword units in spoken texts occurred with a much higher frequency than the highest frequency multiword units in written texts. They were also largely different multiword units, with only a few being of high frequency in both spoken and written texts. Most of these observations about multiword units also apply to single words.


Table 6.2  The most common multiword units from Shin and Nation (2008)

You know                   a little bit
I think (that)             looking at {sth}
A bit                      this morning
(always, never) used to    (not) any more
As well                    come on
A lot of (N)               number {No.}
{No.} pounds               come in (swe, sth)
Thank you                  come back
{No.} years                have a look
In fact                    in terms of {sth}
very much                  last year
{No.} pound                so much
talking about {sth}        {No.} years ago
(about) {No.} percent      this year
I suppose (that)           go back
at the moment              last night

Note in Table 6.2 that curly brackets indicate an item where substitution is possible, so {No.} stands for any number such as 10 or thirteen. Curved brackets indicate an optional item (about) that can occur or not occur.
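The simple form-and-frequency approach described at the start of this section — extracting adjacent, invariable n-grams above a frequency cut-off — can be sketched as follows. This is a minimal illustration of the general technique, not the procedure used in any of the studies cited.

```python
from collections import Counter

def frequent_ngrams(tokens, n=4, min_freq=2):
    # Adjacent word strings only, no variation in form allowed, and the
    # strings need not be grammatically or semantically complete units.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(grams)
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

tokens = "on the other hand we agreed and on the other hand we did not".split()
print(frequent_ngrams(tokens))  # {'on the other hand': 2, 'the other hand we': 2}
```

Note how the output includes both a sensible unit (on the other hand) and a grammatically incomplete fragment (the other hand we), which is exactly why further criteria such as grammatical completeness require human judgement.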

The relationship between the meaning of the parts and the meaning of the whole

We have looked at the criteria involved in co-occurrence. Let us now look at non-compositionality – the relationship between the meaning of the parts and the meaning of the whole. The non-compositionality criterion says that in some multiword units, the meaning of the whole unit is not apparent from the meaning of the parts. Such whole units are sometimes called idioms. The problem with this criterion is that, at least at one point in the life of a particular multiword unit, there was a connection between the parts and the whole. That is why the parts were used. Liu (2010) has an excellent discussion of the non-arbitrariness of multiword units. He shows that expressions like strong tea and tall building are not unmotivated combinations, but reflect the meaning and use restrictions of their parts. Similarly, Boers and colleagues (Boers, 2000; Boers, 2001; Boers & Lindstromberg, 2009) have shown that examining the history of figurative expressions can help in remembering them.


Grant and Bauer (2004) looked at the compositionality of multiword units, dividing them into three groups – core idioms (where the meaning of the whole is not apparent from the meaning of the parts), figuratives (which strictly speaking are not compositional but which can in fact be understood by applying a commonly used figurative interpretation strategy), and literals (where the meaning of the whole is transparently related to the meaning of the parts). Figure 6.1 illustrates the decision process for classifying multiword units into the three categories. Grant (2005) found that, surprisingly, there were only around 100 core idioms in English, many of which would not be known by most native speakers. The most common ones are given in Table 6.3. The full list can be found in Grant and Nation (2006).

Compositional?
  NO  → Figurative?
          NO  → A core idiom
          YES → A figurative
  YES → A literal

Figure 6.1  A flow chart for classifying multiword units according to transparency

The frequencies in Table 6.3 are from the British National Corpus. The items in italics are frozen (they cannot change their form in any way); those followed by an asterisk* have literal equivalents in the British National Corpus; and underlined items have an unusual grammatical form.


Table 6.3  The most common core idioms in English (Grant & Nation, 2006) with their frequency in the British National Corpus As well by and large so and so such and such out of hand take the piss and what have you serve s.o. right take s.o. to task red herring (be) beside s.o.self* out and out *

take the mickey pull s.o.’s leg* touch and go the Big Apple cut no ice with s.o. come a cropper put your foot in it an axe to grind make no bones about it a piece of cake* a white elephant (all) of a piece

30,499 487 327 196 141 137 136 101 92 87 72 72

71 60 53 52 50 49 48 47 44 43 43 38

Martinez and Schmitt (2012) made a very useful frequency-based list of 505 phrasal expressions using the criteria of frequency, meaningfulness (grammatically structured) and “relative non-compositionality”. “A phrasal expression is hence defined as a fixed or semi-fixed sequence of two or more not necessarily contiguous words with a cohesive meaning or function that is not easily discernable by decoding the individual words alone” (page 304). Table 6.4 contains some of the items from the list.

Table 6.4  Examples from the Phrasal Expressions list (Martinez & Schmitt, 2012)

LAST NIGHT          What did he say last night?
AS A RESULT         He was tired and as a result not very aware.
IN ADDITION (TO)    The house was well located in addition.
WORK ON             That kid needs to work on his attitude.
THINK ABOUT         I’m thinking about changing careers.
FOR INSTANCE        That holds true for even the governor for instance.
TOO MUCH            I don’t worry about it too much.
YOU SEE             And the problem you see is with awareness-raising.

They saw the purposes of their list as being to guide teaching and learning, to guide testing, and to measure progress in vocabulary acquisition. They set out a very clearly described set of criteria, painstakingly monitored their application of the criteria, and described the decision processes they faced. They also carried out their frequency count so that the frequency of the phrasal expressions could be compared with word family frequencies, thus making it possible to see what phrasal expressions could be included in a high frequency word list. They found that 32 phrasal expressions would get into the 1st 1000 word families and 85 into the 2nd 1000. The frequency cut-off points they used agree closely with Table 2 in this book.

The meaning criterion typically used is the degree of transparency of the compositionality when the multiword unit is first met in reading. For example, if stuck with was met in the sentence you were twenty minutes into a half hour programme and you stuck with it to the end, there would be little difficulty in working out its meaning from its parts and the context. Another possible related criterion is how memorable the multiword unit would be if it had been met once and its meaning was explained or looked up. This is a less conservative measure of compositionality and one that is closely related to learning. If, for example, last night, too much, or for instance were met and understood, would they be much more easily understood on subsequent meetings than, for example, a completely new word such as digress or twilight?

Including multiword units in word frequency counts

Multiword units can be counted to make a list of multiword units, or they can be counted as part of a word frequency count so that the resulting list contains both single words and multiword units. The main issue with including multiword units in a list of words is one of accuracy of representation of the range and frequency figures. If a multiword unit was included, it would be necessary to make sure that there is no double counting. That is, the multiword unit may be counted, but the words that make it up should not also be counted again as part of the frequency of the single words. This would have to apply to each of the words in the multiword unit.

There are related issues. In a multiword unit, some words may behave as they do in normal use outside the multiword unit, but other words in the unit are less obviously compositional. For example, in raining cats and dogs, raining is used in its normal meaning, but cats and dogs is a core idiom. Taking the multiword occurrence frequency away from the frequency of the single word would misrepresent the occurrences of the word that behaves as normal. Martinez and Schmitt’s (2012) solution of using cut-off frequencies and having separate lists is a safe one.

The frequency of any multiword unit is at most the frequency of its least frequent member, and usually only a part of it. This probably means that for the vast majority of words, their frequency of occurrence in a multiword unit will only be a small fraction of their total occurrences. It is worth checking to see if there are any high frequency exceptions to this statement. For example, around fifty per cent of the occurrences of instance are in the phrase for instance, which is a rather high proportion.
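The double-counting problem can be made concrete with a small sketch. The two-item unit list here is my own illustration; a real count would use a principled list and would match on token boundaries rather than raw character strings.

```python
from collections import Counter

# Illustrative multiword-unit list (my own examples, not a published list).
MWUS = ["for instance", "as well"]

def count_with_mwus(text):
    counts = Counter()
    for mwu in MWUS:
        occurrences = text.count(mwu)
        if occurrences:
            counts[mwu] = occurrences
            # Remove the unit so its component words are not counted
            # again as part of the frequency of the single words.
            text = text.replace(mwu, " ")
    counts.update(text.split())
    return counts

counts = count_with_mwus("take this for instance and that for instance")
print(counts["for instance"], counts["instance"])  # 2 0
```

Removing the unit before counting single words prevents instance from being counted twice, but it also illustrates the misrepresentation discussed above: any normally behaving occurrences of a component word inside the unit disappear from that word's total.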


Choosing between multiword lists


There are now several multiword lists, and teachers need to be able to choose between them. The following questions can help in evaluating a particular list. The questions have been ranked in order of importance for evaluating the quality of the list.

1. Is there a clear definition of what is counted as a multiword unit? This definition should deal with the compositionality of the unit (Do the parts make up the whole or is there some degree of opaqueness?) and the form of the unit (Do the parts have to occur next to each other or can essential parts be separated by other words? Is there flexibility in the form of the words that make up the unit?).
2. Does the definition of the multiword unit match with the needs of the users of the list?
3. Is the definition consistently applied in the list? There are several lists that provide a definition but then do not carefully follow it when searching for multiword units in a corpus. This can be checked by applying the definition to several of the units in the list.
4. Does the list provide frequency data? This helps when deciding what multiword units to give most attention to.
5. Does the corpus used for finding the multiword units and for gathering the frequency data include the kinds of texts relevant to the users of the list?
6. Are non-adjacent items counted? See Table 6.1.
7. Are variants of items counted under the same item? See Table 6.1.
8. Are the multiword units grammatically well-formed? See Table 6.1.
9. Is lexical substitution allowed within an item? See Table 6.1.

The Shin and Nation (2008) list contains largely transparent units and is suited to low proficiency learners because it applies a frequency restriction on the parts of the units. It also includes multiword units that consist of a function word and a content word as well as those involving at least two content words.
The Martinez and Schmitt (2012) list is the result of a very carefully applied study and consists of multiword units that have some degree of opaqueness. Both the Shin and Nation and Martinez and Schmitt lists contain multiword units that are grammatically well-formed, provide clear frequency data and count non-adjacent units as well as adjacent units. There are multiword lists for academic purposes (Biber, Conrad & Cortes, 2004; Durrant, 2009; Simpson & Mendis, 2003; Simpson-Vlach & Ellis, 2010) as well as lists of phrasal verbs (Gardner & Davies, 2007; Garnier & Schmitt, 2015; Liu, 2011). These can all be evaluated using the criteria given above.




The counting of multiword units is still in its early stages despite lists having been around since the 1930s (Palmer, 1933). As the compromises between validity and ease of making are worked through, the quality of the lists is likely to continue to improve. It is worth bearing in mind that although deliberately learning multiword units can be an effective way of increasing knowledge, most learning of multiword units is likely to come from extensive reading and listening.

Recommendations

1. The criteria for deciding what will be counted as a particular kind of multiword unit need to be carefully described, with details and examples regarding the application of the criteria.
2. The criteria need to include descriptions of how the form-based features of adjacency/discontinuity, grammatically fixed/grammatically variable, lexically invariable/lexically variable, grammatically incomplete/grammatically structured, and number of components were dealt with, and how the meaning-based feature of compositionality was dealt with.
3. Multiword units should be counted in separate lists to avoid upsetting the frequency figures of their single word components.


chapter 7


Marginal words and foreign words

In the BNC/COCA lists, there is a word list for what are called marginal words. This list was initially set up as one way of dealing with items that occur frequently in transcriptions of spoken language but which we would not consider to be words in their own right. This is because they have little referential meaning, but are used in spoken discourse to fill pauses and express emotions. These items usually have variable spelling. Here is the aargh family: AARGH ARGH ARRGH AAGH AAAGH.

Some marginal words however have stable spellings – heck, gee, ahem, gosh. Basewrd32 in the BNC/COCA lists is the marginal words list. The marginal words list also includes commonly used nonsense words (diddy, dum) and sounds (vroom, rrrm for engines). The inclusion of such words suggests that one of the motivations for the list is to improve coverage by reducing the set of words not found in any list to as small a group as possible. Marginal words are typically language specific in that hesitation (um, er, ah), anger (grr), surprise (gosh, oh, ah, geez, wah), satisfaction (aha, oho), relief (phew), stupefaction (duh), regret (aww, heck, oops, woops, woopsy), agreement or acknowledgement (yay, uhuh), disagreement (humph, bah, tsk), showing support (rah), excitement (whee), ways of attracting attention (psst, ahem, oy), asking for silence (sh), showing pain (arrgh, ouch, ow, ugh, yow), air escaping when punched (oof, umph), shivering (brr), laughter (ha, ho, hee), sleep (zzzz), and so on are signaled in different ways in different languages. Even marginal words need to be learned. Because the BNC/COCA word lists were intended to be used with school children, swear words and cursing were also included in the marginal words list, rather than in the main word lists. Like some of the marginal words described in the previous paragraph, swear words express emotions, with a single swear word being capable of expressing a very wide range of emotions. Because of their relatively stable spelling and because they often refer to bodily functions or body parts, there is justification for including them in the ordinary lists. They are however included
in the marginal words list in the BNC/COCA lists. Damn is included in the main lists largely because it is not a strong swear word and has a meaning that is not offensive (He was damned to spend the rest of his time in Hell). It could however be moved to the marginal words list. Letters of the alphabet are included in the marginal words list in the BNC/COCA lists. Some single letters are words (I, a), reduced forms (’d, ’m, ’s, ’t as in aren’t), or inflections (s), and these need to be in the main lists, sometimes as family members. For example, s and m are included in the be family as they can be short forms of is and am. The letters included in the marginal words list are B, C, E, F, G, H, J, K, L, N, O, P, Q, R, U, V, W, X, Y, Z. Letters can perform a variety of functions, such as initials in people’s names, labels in lists, and letters in acronyms if a full stop is used after each letter (I.B.M.). Letters of the alphabet are also homonyms. Because letters and their sequence need to be learned, there is some justification for including them in the highest frequency word list. The argument against doing this is that they have no meaning beyond their sequence meaning, that is, that B comes after A and before C and is the second letter in the alphabet. The marginal words list needs constant updating, largely because the variable spellings of some words mean that more family members may need to be added. The line between marginal words and words such as yes, no, absolutely, goodness, well, and phrases such as good grief and goodness me is not clear. In addition, most categories of words contain some homonyms, and marginal words are no exception. Yes can mean “yes” but can also function simply as an acknowledgement like uhuh.
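Counting variable spellings as one family can be sketched in a few lines. The family members below come from the aargh example above; the sample sentence and the matching logic are my own illustration, not the BNC/COCA implementation.

```python
# Collapse variably spelled marginal words into one family before counting.
# Family members follow the aargh example; everything else is illustrative.
aargh_family = {"aargh", "argh", "arrgh", "aagh", "aaagh"}

tokens = "Argh he said then aargh and finally arrgh".lower().split()

# Every variant spelling counts towards the same family total.
family_count = sum(1 for t in tokens if t in aargh_family)
print(family_count)  # → 3
```

Whichever variant a transcriber happened to type, the occurrence is credited to the one family, which is exactly why variable spelling argues for keeping such items in a separate, regularly updated list.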

Foreign words

Foreign words often occur in English texts and some of them are so well known that they need to be included in the main word lists. Drawing the line between what is foreign and what is now a regular part of the language is not easy, especially if you consider English to be primarily a Germanic language, because about 60% of its vocabulary comes from French, Latin or Greek. When including foreign words in the main word lists, there are two kinds of related decision-making. If a foreign word is to be included as a family member, it needs to be very similar in spelling to at least one of the other members. Hôtel could be an acceptable inclusion in the hotel family on the basis of form, as long as its meaning was also considered close enough. Including a foreign word as a headword requires more careful decision-making, as there are foreign words which have now been fully accepted into English (elite), there are those on the border (cappuccino), and there are those still seen as being




Chapter 7.  Marginal words and foreign words

clearly foreign (flambé). If a lot of clearly foreign words are included in the lists, this should be noted and justified. Some well-accepted words such as cafe and cliche are also written in English with accents – café, cliché. It makes sense to include these forms as part of the word family, because they are very closely related in form and meaning to other members of the family. If this is done, then the word lists need to be saved in a format (UTF-8 without BOM) that will preserve these non-English letters, or they may be automatically changed to some unwanted form. For anyone working with word lists, there is an enormous advantage in using a text editing program like Notepad++ (Notepad plus plus) as the default text editor for .txt files. Notepad++ can be set up to preserve the Unicode format of word list files. Notepad++ also has the advantages of being able to search in more than one file at a time and to make use of regular expressions when searching. Where a text contains a lot of foreign words, it may be useful to set up a separate file for foreign words in the interest of text coverage. When adapting Lafcadio Hearn’s Glimpses of Unfamiliar Japan to be a mid-frequency reader using AntWordProfiler, all the Japanese words that would be unknown to an English speaker were put in a separate file.
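The encoding point can be checked directly. This is a small sketch of my own (the file name and family members are invented): Python's "utf-8" codec writes no byte order mark, while "utf-8-sig" would prepend one, which a word-list program may then read as a stray character attached to the first word.

```python
from pathlib import Path

# A hypothetical family file containing accented members such as café.
family = ["cafe", "cafes", "café", "cafés"]
path = Path("cafe_family.txt")

# "utf-8" writes no byte order mark; "utf-8-sig" would add EF BB BF.
path.write_text("\n".join(family), encoding="utf-8")

raw = path.read_bytes()
print(raw.startswith(b"\xef\xbb\xbf"))  # → False: no BOM to corrupt the first word
print(path.read_text(encoding="utf-8").split("\n")[2])  # → café, accent preserved
```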

Recommendations

1. Marginal words occur in corpora, but because they have little referential meaning and may have variable spelling, they should be included in a separate list.
2. Swear words need to be counted, but the audience for the lists needs to be considered when deciding where to place them in the lists.
3. Letters of the alphabet occur frequently with a variety of functions. The target audience for the lists may determine where they are listed.
4. Foreign words can be added as family members if they share a very close form and meaning relationship with an existing family. Otherwise it may be worth setting up a separate list of foreign words.


chapter 8


Acronyms

Acronyms are like multiword units in that they are made of several words but have been reduced to the first letter of each word, and are like transparent compounds in that they are typically made of known parts. Many acronyms are also proper names in that they are the names of enterprises and suchlike and have a unique reference. Abbreviations which are a shortened form of a word (Rd for Road, or Rev for Reverend) would be included in the word family for a word. Occasionally an acronym becomes a word in its own right, for example laser, scuba and NASA, particularly if it is easy to pronounce as a word rather than as a series of letters. The most frequent acronyms in the British National Corpus are (in order of frequency) UK, PM, IBM, PC, MP, VAT, eg, MA, DC, lb, ie, PLC, ph, CD, all with a frequency above 2,000 occurrences. Note that many of these frequent acronyms are homonyms – PM can be prime minister, or post meridiem among other things, PC can be Police Constable, personal computer or politically correct, MP can be military police or member of parliament, and ph can be pH (thought to be derived from power of hydrogen as in pH factor) or an abbreviation for phone. Many native speakers would struggle to recall what pm (post meridiem) stands for although they would have no trouble with its meaning. Most would not know what eg (exempli gratia), ie (id est), and lb stand for, although they would know their meaning. lb is strictly speaking not an acronym but an abbreviation, derived from libra meaning “weighing scales”. Note also that am (ante meridiem) is not included in basewrd34 of the BNC/COCA lists as it is a homonym for the much more frequent first person form of the verb be. The frequency counts of many acronyms are likely to be underestimates if the versions containing a full stop are not also included, for example I.B.M., p.m. or U.K. 
and there could be some value in regularizing these in a corpus (that is, always writing IBM, for example, without full stops), if this is relevant to the purpose of the lists. The 1,100 or so acronyms contained in basewrd34 of the BNC/COCA lists account for 0.19% of the tokens in the BNC. The arguments for listing acronyms separately from other words are that they are made of already known parts and that for receptive purposes they may be explicitly related to their parts on their first occurrence in a text. As a result, they should be easy to learn, and their full form should be easy to recall because of the clues given by the initial letters. All of these assumptions deserve research.
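Regularizing dotted acronyms before counting can be done with a simple regular expression. This sketch is my own, not part of any published list-making tool; note that it cannot tell a sentence-final full stop from an acronym's last dot.

```python
import re

# Two or more single letters, each followed by a full stop: I.B.M., p.m., U.K.
DOTTED = re.compile(r"\b(?:[A-Za-z]\.){2,}")

def regularize(text: str) -> str:
    # Strip the internal full stops so I.B.M. and IBM are counted together.
    return DOTTED.sub(lambda m: m.group(0).replace(".", ""), text)

print(regularize("I.B.M. sold a P.C. to the U.K. government."))
# → "IBM sold a PC to the UK government."
```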


The main argument against listing acronyms in a separate list is that such a list is likely to be a mess unless homonymy and the use and non-use of full stops after each letter are taken into account. The use and nature of acronyms is rapidly changing as texting makes use of acronyms of all kinds of phrases, such as asap (as soon as possible), fyi (for your information), dyk (Do you know), btw (by the way), along with semi-acronyms such as CUL (See you later). These are made along the same lines as the very well established MIA, POW, RSVP and RIP. A word list study wishing to include corpus material from texting should probably involve a separate list of texting acronyms and other shortcuts, because this is such a rapidly changing field and the acronyms and other forms used in texting are rather different from most other acronyms in their nature and purpose.

Needed research

It would be useful to have a definitive list of the most common acronyms and also to have some information about their transparency. There is likely to be a scale of acronyms, with some being no longer seen as acronyms (scuba, laser), others moving towards that state, and others requiring glossing when they are used.

Recommendations

1. Abbreviations of single words should be included in the word family or lemma for the full word.
2. It is worth setting up a list of acronyms because they are clearly distinguishable from other types of words. They share the characteristics of multiword units, transparent compounds, and proper names, they are easier to learn than ordinary words in that their form provides strong clues to their meaning, and some are very frequent.
3. Acronyms should meet the criteria of (1) being made of the first letters of their full multiword forms, (2) the full multiword forms should be familiar to adult native speakers so that the acronyms are reasonably transparent, and (3) the typical pronunciation of the acronym involves sounding out the letters that make it up (MIT), or saying it as a word (UNESCO).
4. If the full form of an acronym is included in a list of transparent compounds or an accompanying list of multiword units, then the acronym could be included in the word family of the full form.

5. If the actual frequency of acronyms is important in a word count, then forms containing full stops separating each letter must be included in the family of the acronym. For example, UNICEF and U.N.I.C.E.F. would make up one family.
6. Word counts involving material from texting would need to set up a separate list of texting acronyms and shortcuts. This is because these kinds of shortcuts are qualitatively different from more traditional acronyms.
7. Acronyms are very likely to be homonyms including words of various types such as other acronyms, content words, and abbreviations. If the actual frequency of acronyms is important in a word count, then the members of homonyms need to be distinguished and counted separately. Acronyms which have become words in their own right, particularly where the acronym does not easily lead to recall of the full form, should be included in the non-acronym lists, that is the lists of content and function words. Frequent words like this are laser, scuba, sonar, UNESCO, DOS, NATO, ASCII. Note that these words are not pronounced by saying their letters, but have a word-like pronunciation.
8. A list of acronyms would need continual updating, because each field of specialization has its own acronyms, and each academic text (especially PhD theses and academic articles!) uses its own convenient set of acronyms.


chapter 9


Function words

A distinction is sometimes made between content words and function words. Content words (also called lexical words) include nouns, verbs, adjectives and adverbs, and function words include all the remaining grammatical categories such as articles, prepositions, conjunctions, auxiliary verbs, and particles. Function words are seen as operating the grammar of the language while content words convey the main meaning. This distinction is a hard one to maintain, as many function words, such as conjunctions and prepositions, carry substantial lexical meaning. There are also content words that sometimes act like function words, such as according to and nevertheless. Function words are also seen as consisting of closed classes where new items are not commonly added (which is one reason why numbers are usually included in function words), while content word classes (nouns, verbs, adjectives, adverbs) are often being added to. This distinction is also difficult to maintain, because there are groups of nouns such as the days of the week, the months of the year, the seasons, points of the compass and so forth which are closed classes in much the same way numbers are, but which are not included in function words. In spite of these reservations, there are possible justifications for distinguishing function words when making word lists. Grammarians see function words as being qualitatively different from other words, and many function words need to be learned in ways which are different from how content words are learned, relying more on use and perhaps grammatical knowledge. The strongest argument for counting them separately is that they are very frequent indeed, covering 51% of the tokens in the BNC. The frequency of function words is the major reason why the most frequent 1000 word families of English have such high text coverage. Table 9.1 is a list of the function words of English, categorized largely according to part of speech. 
The major source for this list is Biber, D., S. Johansson, G. Leech, and S. Conrad (1999) Longman Grammar of Spoken and Written English. London: Longman. Most of the words occur in the most frequent 3,000 words of English. Because the list in Table 9.1 is intended to be used with a program such as Range or AntWordProfiler, homonyms such as to (preposition and to infinitive) and for (preposition and conjunction) are only entered once in the earliest category in the list. So, to occurs in Others and not in Prepositions.


Table 9.1  English function words

Others
there not to

Auxiliary verbs
am are is was were be been being did do does doing done get gets getting got had has have having can could may might must ought shall should will would

Conjunctions
after albeit although and as because before but for how however if neither nor or since so than that though till unless until what whatever when whenever where whereas wherever while whither which who whoever whom whose why yet

Prepositions
about above across against along among amongst around at behind below beneath beside besides between beyond by despite down during except from in into like minus near notwithstanding of off on onto over per plus round through throughout towards under underneath unlike up upon via with within without

Adverb particles
aside away back forth out still whence

Determiners
a all an another any both certain each either enough every few fewer half less many more most much no other others several some such the these this those

Pronouns
he her hers herself him himself his I it its itself me mine my myself our ours ourselves she their theirs them themselves they us we you your yours yourself yourselves thou thee thy thine anybody anyone anything anywhere anyhow everybody everyone everything everywhere nobody none noone nothing nowhere nohow somebody someone something somewhere somehow

Numbers
billion billionth eight eighteen eighteenth eighth eightieth eighty eleven eleventh fifteen fifteenth fifth fiftieth fifty first five fortieth forty four fourteen fourteenth fourth hundred hundredth last million millionth next nine nineteen nineteenth ninetieth ninety ninth once one second seven seventeen seventeenth seventh seventieth seventy six sixteen sixteenth sixth sixtieth sixty ten tenth third thirteen thirteenth thirtieth thirty thousand thousandth three thrice twelfth twelve twentieth twenty twice two zero

Six function words are in the Academic Word List (Coxhead, 2000) (albeit, whereas, despite, notwithstanding, plus, via). Eight function words are not in the high frequency words (albeit, notwithstanding, thou, thee, thy, thine, whence, whither). The list of function words to be used in the Range program consists of 414 word types making up 162 word families. One problem with counting word types for function words is that some family members are not function words, for example, the and family contains the member
ands as in No ands and buts where the plural use is really a noun. Others include zeroed, pluses, minuses. This means that a list of function word families would contain families with a smaller set of members than the families would contain if they were not separated out from content words. There is a list of function word families to use with the Range program on Paul Nation’s website. These have been edited to remove non-function word uses as much as is possible, so the function word families are not the same as those in the word lists for Range and AntWordProfiler. There are also various alternative spellings of was (wuz, wiz) that are used to represent different pronunciations, wiv for with, nuffink for nothing, and dialect forms such as youse as the plural of you, owt for anything, and nowt for nothing. In the BNC/COCA lists, these alternative forms are included in the main word family, so owt is included with anything in the family for any. The function word list contains several homoforms – won (as in won’t), t, m, n (as in n’t, I’m), round, can, have, be (when they are main verbs), less, like, may, might, till, as well as uses of function words as other parts of speech with substantially the same meaning. If function words are a particular focus of a word list study, then it may be necessary to set up additional word families so that function word and non-function word uses of family members are distinguished. For example, nothingness and nothings would have to be separated from the nothing family even though they are clearly closely related in meaning and form to nothing, because nothings and nothingness are not function words. This then would be an inconsistency in creating word families as content word families do not distinguish part of speech but are based on form and meaning relationships. 
There are some words that could be considered as function words but which have more frequent uses as content words (considering, bar, following, inside, little, opposite, outside, past, regarding, various). The Range program has a stop list option which, if chosen, allows a designated list of words to be ignored when counting. The list needs to contain all the words to be ignored and needs to be a text file. Anyone wishing to count only content words can use this option.
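The effect of a stop list is easy to sketch. This is not Range's own code, just an illustration of the idea with a tiny invented stop list: any token on the list is skipped, so only the remaining (content) words are counted.

```python
from collections import Counter

stop_list = {"the", "a", "of", "to", "and", "in"}  # tiny invented stop list

def content_word_counts(text: str) -> Counter:
    tokens = text.lower().split()
    # Tokens on the stop list are ignored when counting.
    return Counter(t for t in tokens if t not in stop_list)

counts = content_word_counts("the cat sat in the hat and the cat slept")
print(counts.most_common(1))  # → [('cat', 2)]
```

In Range itself the stop list is supplied as a plain text file of the words to ignore, but the counting principle is the same.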

Recommendations

1. Because function words do not make up a reliably distinguished category, any study involving function words as a separate list should make it clear what are included as function words and whether function word families include non-function word members.


2. Function words include very frequent homonyms, and the members of homonyms would need to be distinguished in a function word study.
3. There are various alternative spellings of some function words, often used to represent alternative pronunciations, and these should be counted as function words.


section iii

Choosing and preparing the corpus


chapter 10

Corpus selection and design


Paul Nation and Joseph Sorell

A word list is typically made by choosing a corpus of appropriate texts and then making a list of the words in the corpus, using criteria such as frequency, range and dispersion to order the items in the list. The nature of the list is strongly determined by the nature of the corpus, and this chapter looks at the factors that need to be considered when making or choosing a corpus. At the very least, there should be a defensible rationale for choosing or compiling a particular corpus. Biber (1993: 243) argues that "theoretical research should be prior in corpus design, to identify the situational parameters that distinguish among texts in a speech community". While Biber's focus is primarily on grammatical features in different kinds of texts, his suggestions apply equally well to vocabulary.

Typically researchers focus on sample size as the most important consideration in achieving representativeness: how many texts must be included in the corpus and how many words per text sample. Books on sampling theory, however, emphasize that sample size is not the most important consideration in selecting a representative sample; rather, a thorough definition of the target population and decisions concerning the method of sampling are prior considerations. Representativeness refers to the extent to which a sample includes the full range of variability in a population.  (p. 243)

This means that there needs to be a clear and sufficiently detailed description of the population that the sample is intended to represent. Ironically, the most commonly cited purpose for a word list, a general service list, is the most difficult to describe. This is largely because where English is taught as a foreign language, there are often no clear and pressing reasons for teaching the language. In an age of acronyms such as ESP and EAP, the teaching of English as a foreign language in secondary schools has been jokingly referred to as TENOR (Teaching English for No Obvious Reason). Beyond dealing with English course books and graded readers, it is not easy to see the uses young EFL learners will actually make of the English they learn. The more detailed a description of the uses that need to be made of the language, the more accurately a corpus can represent these uses, because this means that stratified sampling of texts based on grouping of the same kinds of texts can be used rather than random sampling from any of the available texts.
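The difference between stratified and random sampling can be sketched as follows. The strata and target numbers here are invented for illustration: with stratified sampling the corpus builder fixes how many texts come from each grouping of the same kinds of texts, rather than drawing blindly from the whole pool.

```python
import random

random.seed(1)  # reproducible illustration

# Invented population of available texts, grouped into strata.
population = {
    "conversation": [f"conv_{i}" for i in range(100)],
    "fiction":      [f"fic_{i}" for i in range(100)],
    "academic":     [f"acad_{i}" for i in range(100)],
}
targets = {"conversation": 5, "fiction": 3, "academic": 2}  # desired mix

sample = []
for stratum, k in targets.items():
    sample.extend(random.sample(population[stratum], k))

# The sample is guaranteed to reflect the chosen proportions.
print(len(sample), sum(t.startswith("conv") for t in sample))  # → 10 5
```

A purely random draw of ten texts from the combined pool would only approximate these proportions; the stratified draw guarantees them, which is why a detailed description of the target uses of the language matters so much.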


Text types and other factors

The speaking/writing distinction is a primary one both for vocabulary and grammatical features. The very first word in a frequency-ranked list indicates whether the corpus used was spoken (I) or written (the). Moreover, the relative frequency of the very high frequency words is strongly contrasted between spoken and written texts, with high frequency words in spoken texts occurring more frequently, and with a general service list such as the BNC/COCA 2000 covering a much greater percentage of the tokens in a spoken text than in a written text. There is a considerable variety of types of spoken texts. Table 10.1 shows some of the possibilities.

Table 10.1  Types of spoken texts

Informal dialogue
  Spontaneous: Interactive conversation – face-to-face or telephone; Radio talkback
  Scripted: Movies and TV; Plays
Formal dialogue
  Spontaneous or scripted: Parliamentary speeches
Informal monologue
  Spontaneous: After-dinner speeches; Recounts
  Scripted: Radio talks
Formal monologue
  Spontaneous: Court rooms; University lectures
  Scripted: News broadcasts; Speeches and conference presentations

Note that the examples given in Table 10.1 do not always fit neatly into one category. Parliamentary speeches can be spontaneous or scripted, and recounts can be spontaneous or scripted. University lectures can be formal or informal. Biber (1993) also distinguished spoken texts on whether they were directed to a single person or an audience, whether they were institutional, public or private, and whether the addressees were present (as in an after-dinner speech) or not (as in a radio talk). While we can try to characterize spoken texts using various descriptors such as spontaneous or scripted, and formal or informal, what is important is whether these distinctions significantly affect the way vocabulary is used. This includes the actual words that are used, how frequently they are used, their ranking in a frequency list, and the multiword units they form. Quaglio (2009) compared the TV series Friends with unscripted conversation, finding many similarities, with the differences partly being the result of the limited range of situations in the TV program. There was no comparison directly focusing on vocabulary, but the comparison of discourse features suggests that TV sitcoms and perhaps movies may be a useful addition to a corpus of spontaneous dialogue. Sorell looked for reliability from a vocabulary perspective within text types as a way of seeing if some distinctions were truly distinctions. "Frequency lists can only
reflect the corpora they are derived from, so mixing the types of texts in the corpus inevitably produces mixed results. The question then becomes, how does one divide a language in a principled and consistent manner? This is why a taxonomy of texts is needed to guide the analysis” (Sorell, 2013: 64). Sorell started with Biber’s (1995) eight text types, and using scores on Biber’s dimensions, reduced them to four – interactive (largely spoken texts but also including personal letters), general reported exposition (written non-academic and newspaper texts), imaginative narrative (mainly fiction), and academic writing. Prepared spoken texts such as speeches could have been put into the general reported exposition category, but Sorell decided to exclude them from his research as he wanted distinctive categories and thought that prepared spoken material may be too intermediate a category between speaking and writing. His conversation category showed that with a large enough corpus, there could be a high level of agreement between the very high frequency word types and acceptable agreement up to the 9000 word type level between lists made from different source corpora of the same text type. It seems that the text category of informal spoken interaction is robust enough to yield consistent results for vocabulary. It remains to be seen if scripted informal speech as in movies, TV shows and plays is similar enough to spontaneous dialogue to be included in the same category. Brysbaert and New (2009) found that television and film subtitles gave much more useful frequency figures for matching with lexical decision times in psychological experiments than the Kučera and Francis or Celex lists that had been commonly used in reaction time experiments. If the vocabulary of films and TV programs turns out to be very similar to spontaneous spoken language, it would certainly make the construction of spoken corpora easier. 
The face validity problem of written speech as in movies, however, may be a discouraging factor. Sorell (2013) found that each text type that he studied was roughly equidistant from its neighbor in terms of vocabulary similarity, with conversation followed by narrative, then general writing and then academic writing. He saw conversation as being at the centre of any core vocabulary. His four text types were distinctive enough and gave reliable enough results for them to be seen as useful groupings of texts. Table 10.2 provides more detail of the content of the three written text types. None of the three categories included spoken texts.

Table 10.2  The three written text types used in Sorell's (2013) study

Narrative writing
  Fiction of various kinds including novels and short stories
General writing
  Biographies, essays, editorials, humor, instructions, persuasive writing, popular writing, reportage
Academic writing
  Humanities, social sciences, physical sciences, technology


General writing and academic writing could clearly be subdivided, separating out newspapers (reportage) for example, or distinguishing major academic divisions such as humanities, science, and law. Such subdividing would depend on the purpose of the word lists. Informal spoken language can usefully be seen as the medium for the learning of the high-frequency and mid-frequency words for native speakers of English. When making a general service vocabulary for use in primary and secondary schools where English is taught as a foreign language, spoken language may not be the major source of input for learning, but the high frequency words of conversation should be an important part of such a general service vocabulary.

Geographical divisions

The major distinction in English is between US and UK uses of English. In terms of vocabulary lists, this can involve a preference for certain words (movie-film), but only involves a limited number of different words for the same concept (gasoline-petrol, faucet-tap); http://www.oxforddictionaries.com/words/british-and-american-terms has more examples. It does however involve systematic spelling differences (labor-labour, center-centre). A decision needs to be made whether these spelling differences are treated as members of the same lemma or family, or as different lemmas or families. The BNC/COCA lists treat them as members of the same family because the spelling differences are very small and cause few if any problems when reading. However, a study with a strong geographical focus may want to have separate families. Leech and Fallon's (1992) very amusing comparison of the rather small Brown and LOB corpora suggests that the US/UK contrast may go deeper than different terms and different spellings.
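Treating spelling variants as one family, as the BNC/COCA lists do, amounts to mapping each variant to a shared key before counting. The variant pairs below are a tiny invented sample, not the actual BNC/COCA family files.

```python
# Map UK spellings to a shared US-spelled family key (sample pairs only).
uk_to_us = {"labour": "labor", "centre": "center", "colour": "color"}

def family_key(word: str) -> str:
    w = word.lower()
    return uk_to_us.get(w, w)  # unlisted words are their own key

print(family_key("Labour") == family_key("labor"))  # → True: counted as one family
```

A study with a strong geographical focus would simply omit this mapping, so that labour and labor remain separate families.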

Age-related material

Lists intended for young children need to be based on a corpus that reflects the language that children hear and read. There is a noted lack of such corpora, especially of spoken language used by and directed at children, and resources like CHILDES contain only very short texts. Macalister (1999; Webb & Macalister, 2013) created a million token corpus of the New Zealand School Journals. This publication is aimed at children from around six years old to twelve years old (the end of primary school). This high quality publication is issued in four parts, part 1 being for the youngest readers and part 4 for the oldest. It appears three times a year and is a mixture of fiction (including poetry) and engaging factual material.




The Oxford Children’s Corpus (Banerji, Gupta, Kilgarriff, & Tugwell, 2012) is a collection of writing by children, but it is not publicly available and it preserves the spelling errors that the children made in their writing. The use of a corpus like the British National Corpus results in many formal Latinate words, which are not typically known or used by young children, appearing with high frequency in word lists.


Language learning situation

Where English is learned as a second language, there is strong justification for including TV programs and movies in a corpus, along with writing for children. Where English is learned as a foreign language, there are very strong motivations for including university entrance tests. The first major considerations when making or choosing a corpus are thus to decide on the purpose of the lists and the people for whom they are intended, and then to decide what types of texts best meet the purpose. The next consideration is corpus size.

Corpus size

Sorell (2013: 204) found that a corpus of around 20 million words was needed to get reliable lists of the most frequent 9000 word types. With a 20 million token corpus, the 1000 most frequent words of conversation should be expected to differ by around 20 words (2%) from other lists made from different corpora of the same text type, conversation. The 3000 most frequent words would likely differ by around 150 words (5%), and the 9000 most frequent words by around 810 words (9%). The written text corpora in his study (narrative, general writing, and academic writing) were in roughly the same ballpark, with general writing texts having smaller differences at the 3000 level (3%) and 9000 level (6%). Smaller corpus sizes than 20 million tokens will result in greater variability.

The 20 million figure is roughly in agreement with Brysbaert and New (2009) for low frequency words, though they suggested that a one-million token corpus may be enough for high frequency words. Their criterion for evaluating size was the stability of the correlation between frequency figures and lexical decision times. Brezina and Gablasova (2015) also argue that for high frequency general service word lists, it makes little difference if the corpus size is one million tokens, 100 million, or 12 billion. However, it is important to note the amount of non-overlap in their lists. They found a 78%–84% overlap between each of the 3000 high frequency lemma lists from the four corpora with high correlations between the ranks of the
shared items. When looking at the overlap, we can stress either the overlap or the differences. Around 2,400 lemmas overlapped between individual lists, and 600 in any one list did not. 2,122 lemmas (71% of 3000) were common to all four lists, so in any one 3000 word list there were 878 lemmas not common to all four lists. This figure is comparable to that found by Nation and Hwang (1995) when comparing overlap between the high-frequency wide-range words in the General Service List, LOB and Brown corpora. This is not a great overlap, especially when compared with the figures found by Sorell (2013) when looking at corpora all of the same homogeneous text type. Rather than differing by 878 words at the 3000 lemma level, Sorell’s lists differed by 90 to 150 word types depending on the text type. Brezina and Gablasova are a bit too tolerant in accepting that 71% or even 78%–84% overlap is good enough. If roughly one out of every four or five words is different from one list to another, that is a lot of difference. The cause of such difference most likely lies in the very different nature of the corpora used, and size does not overcome this content difference.

There are various ways of deciding how large a corpus needs to be. A very practical way is to make an arbitrary decision about how many examples you would need of the least frequent words you are interested in. If the purpose for making the lists is to provide a list of around 3000 high frequency words, then the cut-off points in Table 2 show that the least frequent words in the first 3000 word families occur with a frequency of 20 per million tokens or higher. If you want each word to have at least 50 occurrences so that you have enough examples to do a useful concordance to look for homonyms, the core meaning, the common collocations, or the basic grammatical features of a word, then a corpus of two and a half million tokens would be needed.
If however you wanted a list of high and mid-frequency words (9000 word families), then the cut-off points in Table 2 suggest a corpus size of at least 25 million tokens to get at least 50 occurrences of the least frequent word families at the 9000 word level. To get at least 50 occurrences for all word families, a 100 million token corpus like the British National Corpus is not enough beyond the 13th 1000 level. For word types (Table 3), a corpus size of eight to nine million words would give 50 occurrences of the least frequent word types at the 9,000 word type level. Sorell (2013) used the criterion of the stability (or reliability) of the words in the list. That is, how large a corpus do you need to keep getting largely the same words occurring in your list even when you re-make the list from a different corpus of the same kinds of texts? His conclusion was that a corpus size of around 20 million tokens was needed for the first 9000 word types. Even with this size, around 540 (6%) of the word types in a general written corpus (magazines, newspapers etc.) would differ from lists made from a different general writing corpus.
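The arithmetic behind these corpus-size decisions can be sketched as a small calculation. The cut-off frequencies below are the ones quoted from Table 2 (the 2-per-million figure for the 9000-family level is inferred from the 25-million-token figure above, not taken directly from the table); the function name is invented:

```python
def corpus_size_needed(min_freq_per_million, occurrences_wanted):
    """Tokens needed so that a word occurring min_freq_per_million
    times per million tokens appears at least occurrences_wanted times."""
    return occurrences_wanted / min_freq_per_million * 1_000_000

# Least frequent word in the first 3000 families: ~20 per million (Table 2).
print(corpus_size_needed(20, 50))  # 2,500,000 tokens
# Least frequent family at the 9000 level: ~2 per million.
print(corpus_size_needed(2, 50))   # 25,000,000 tokens
```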




Clearly, a large well-constructed corpus is better than a small well-constructed corpus. For high- and mid-frequency words, a corpus size of 20 million tokens would be likely to give reliable results. Smaller corpus sizes will give less reliable results, but may be unavoidable if there are not enough texts of the right type.


Proportion of text types in a corpus

As well as the types of texts and corpus size, the proportion of different kinds of texts in a corpus needs careful consideration. The Academic Word List used equally sized sub-corpora (four sub-corpora) and sub-sub-corpora (twenty-eight of them) because Coxhead (2000) wanted each faculty and discipline area to have equal representation. She did not want one discipline area or faculty to strongly affect the lists. The Academic Word List was intended to be equally valuable to students, regardless of what subject they study.

When making a corpus for young native speakers, it would be good if the proportion of the various kinds of texts in the corpus represented the proportion of types of language use they meet in their daily lives. A very large proportion of this (probably 80% or more) is likely to be spoken input from peers, parents and family, teachers, and TV and movies. According to Statistics Canada (1998) and the United States Department of Labor (2006), Americans and Canadians watch television five times as much as they read. The amount of reading done varies considerably between individuals, with estimates for school children ranging from under 100,000 tokens of written text per year to over four million (Cunningham, 2005). Nagy, Herman and Anderson (1985) suggest that the median for fifth graders was around 700,000 tokens a year. This is around 2000 tokens a day. At a slow reading speed of 150 words per minute, this is around thirteen minutes a day, one minute per waking hour. Even if this figure is doubled or tripled, for most young native speakers, reading makes up only a very small part of their language exposure. Some of this reading now comes via electronic devices.
The Guardian newspaper notes that according to a report on media consumption by ZenithOptimedia, “the amount of time spent reading newspapers across the world averaged 16.3 minutes per reader a day in 2015, down 25.6% from the 21.9 minutes daily average in 2010” (http://www.theguardian.com/media/2015/jun/01/global-newspaper-readership-zenithoptimedia-media-consumption). Only a very small part of a corpus representing typical daily exposure to language should consist of newspapers. “In the UK, the increase in use of the internet has been dramatic, with the average minutes per day spent online rising 55% from 82 minutes to 2 hours and 7 minutes between 2010 and 2014” (ibid). Much of this may be spent on social networking, but a
modern corpus needs to represent this, meaning that up to a quarter of a modern corpus might need to include material from the internet.

When Michael West (1953) made his General Service List for young foreign learners of English, he saw reading as the main goal, and in many places in his writing about language teaching and learning he stressed the practicality and importance of reading. The General Service List thus does not include some very common spoken words, such as OK, Goodbye, and alright. In spite of that, the General Service List provides good coverage of spoken text.

The British National Corpus is not a well-balanced corpus from the viewpoint of the daily use of language. Ninety per cent of it is written. Ten per cent is spoken, and less than half of this spoken section is unscripted spoken interaction. Sorell (2013) suggests that the core of a general service list should be informal spoken language. A shortage of transcribed text of this type, especially children’s spoken language, limits the building of such a corpus, but there are several million words of informal spoken language available in the British National Corpus (around four million), in the material prepared for the American National Corpus (over three million), in the Wellington Corpus of Spoken New Zealand English (one million), and in the International Corpus of English (600,000 tokens per English variety, although only an average of around 200,000 tokens per variety could be truly considered to be informal conversation). It remains to be shown whether movies and TV programs are close enough to informal spoken text to be a useful addition to spoken corpora.

The largest proportion of a general corpus should be spoken text, and if the daily use of language is considered, it should make up at least 80% of the corpus. Informal chat however may now need to include digital chat. The second section of a general corpus, according to Sorell, should be narrative writing.
According to the 5th edition of the Scholastic report on reading, about 30% of children are frequent readers, reading for fun on at least five days a week, and this proportion is declining (http://www.scholastic.com/readingreport/). Reading hard-copy books may however be replaced by various forms of digital reading. Perhaps the proportion of narrative in a corpus should be around 10% or less. The third section of a corpus should be general writing, which includes factual material. This includes newspapers, consulting the internet using sites like Wikipedia or doing a web search, and reading popular magazines. This should make up about 5% of the corpus. The fourth section of a corpus could involve academic writing, but it may be more appropriate to move from a general service list to an academic list once the main general service words are known, and thus not include academic text in the general corpus.

We have looked at possible proportions of text types in a single corpus, but it may be that this is not the best way to build a general service word list. Mixing
different text types together when counting is likely to affect the reliability of the count and also to obscure the effects of important text types like informal conversation, written narrative and general writing. It may be better to do separate counts of each text type and then find ways of using the separate results to make a combined word list. For example, if informal conversation is given the priority it seems to deserve, then the initial list could be made from a corpus of informal conversation. After that, words that are important (frequent and of wide range) in narrative could be added, and then words from general writing and other appropriate text types. Words that are of moderate frequency in several text types might then need to be given the priority suggested by their combined frequency. This approach suggests that general service word lists may need to be more modular than they are now.
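A minimal sketch of such a modular approach might look like the following. This is an illustration only, not a published procedure: the function name, the toy frequency data, and the simple frequency-ranking within each text type are all assumptions.

```python
def modular_list(counts_by_text_type, priority, size):
    """Build a word list by working through text types in priority order
    (e.g. informal conversation first, then narrative), taking the most
    frequent unseen words from each until the list reaches `size`."""
    result = []
    seen = set()
    for text_type in priority:
        # Within each text type, take the most frequent words first.
        ranked = sorted(counts_by_text_type[text_type].items(),
                        key=lambda kv: -kv[1])
        for word, _freq in ranked:
            if word not in seen:
                seen.add(word)
                result.append(word)
            if len(result) >= size:
                return result
    return result

counts = {  # invented toy frequencies
    "conversation": {"okay": 900, "go": 700, "thing": 500},
    "narrative":    {"go": 400, "castle": 300, "said": 800},
}
print(modular_list(counts, ["conversation", "narrative"], 4))
# ['okay', 'go', 'thing', 'said']
```

A real implementation would also use range within each text type, and would need a principled way of promoting words that are moderately frequent in several text types, as the text notes.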

How do you divide a corpus into sub-corpora?

Dividing a corpus into sub-corpora allows the creation of range and dispersion figures. In some ways range figures are more important than frequency figures, because a range figure shows how widely used a word is, and this indicates its “general service”. Brysbaert and New (2009) found that a range measure was a good predictor of lexical decision times. Carroll, Davies and Richman (1971) found in their study that frequency and their measure of dispersion correlated at .8538 (page xxix), showing that the more widely used a word is, the more likely it is to be frequent. Some words however are frequent in just one or two texts or sub-corpora and may not even occur in others. The use of a range figure, a dispersion figure, or both can identify such words.

There are three principles to guide the division of a corpus into sub-corpora: (1) the sub-corpora should be large enough, (2) the sub-corpora should be of equal size, and (3) each sub-corpus should be coherent. A coherent sub-corpus is made up of the same kinds of texts, not a mixture of different kinds of texts. Let us look at the justification for each of these principles and their application.

Each sub-corpus should be large. Each sub-corpus needs to be large enough to give the words you are interested in a chance to occur. Engels’ (1968) evaluation of the General Service List used short texts as sub-corpora, but these texts were so short that very few 2nd 1000 GSL words had a chance to occur (Gilner, 2011). Engels used this non-occurrence as a criticism of the General Service List 2nd 1000 words. If the focus is on the high frequency words, then a sub-corpus size of one million tokens would be enough for each high frequency word to occur at least 20 times. If the focus is on the mid-frequency words, then a corpus size of 25 million tokens would be enough for the lowest frequency mid-frequency words to occur at
least 50 times. Leech, Rayson and Wilson (2001) used 100 sub-corpora each of one million words. While this may be a little small, the number of sub-corpora provides some compensation. At the other extreme, when making the BNC/COCA lists, Nation used ten sub-corpora each of 10 million words. This does not allow such fine range distinctions as the Leech et al. study, but each mid-frequency word in the study certainly had ample opportunity to occur in each sub-corpus.

The sub-corpora should be the same size. The use of a range figure involves the comparison of one sub-corpus with another, and this comparison will be distorted if the sub-corpora are not of equal length. A long sub-corpus provides plenty of opportunity for lower frequency words to occur and a shorter one does not. From a practical standpoint, the size of the smallest coherent sub-corpus determines the size of the remaining sub-corpora. In the BNC, the spoken section makes up 10% of the total corpus. This guided Nation’s decision to have ten sub-corpora, and still allowed Leech et al. to have one hundred coherent sub-corpora. The demographic spoken section of the BNC makes up around 4% of the corpus, so it would be possible to make sub-corpora of around four million words each, although some shorter left-over sections, such as the two million token remainder of the spoken context-governed section, could not be used. Note that even though the content of several sub-corpora may be the same kinds of texts, for example spontaneous spoken texts, each sub-corpus should be the same size. In the Leech, Rayson and Wilson (2001) study, there were probably four spontaneous spoken sub-corpora each a million tokens long, six more formal spoken corpora each a million words long, several million words of narrative with each sub-corpus a million words long, and so on.

The texts in a sub-corpus should all be of a similar type.
The reason for using a range measurement is to see whether a word occurs across a range of different topics and types of texts. For this reason, the texts in each sub-corpus should not be a mixture of different types of texts but should represent a similar type of text. Then, a high range figure would show that the word is needed regardless of the type of text that is met. If each sub-corpus contained a mixture of different types of texts, then a word could occur across a range of sub-corpora even though it only occurred in one type of text, for example in the spoken part of each sub-corpus. Several different sub-corpora can be made of the same type of text, but mixing very different texts within the same sub-corpus is not a good idea.

The content and size of a corpus and the size and nature of its sub-corpora determine the quality of any lists made from it. Poor corpus design will result in poor lists, and simply increasing the size of a corpus will not make up for weaknesses in the design of its content.
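The range measure described here, a word's frequency alongside the number of sub-corpora it occurs in, can be sketched in a few lines (a hedged illustration; the function and variable names are invented, and real tools like Range and AntWordProfiler compute this for you):

```python
from collections import Counter

def range_and_frequency(sub_corpora):
    """sub_corpora: list of token lists, one per equally sized sub-corpus.
    Returns {word: (range, total_frequency)}, where range is the number
    of sub-corpora the word occurs in."""
    freq = Counter()
    rng = Counter()
    for tokens in sub_corpora:
        counts = Counter(tokens)
        freq.update(counts)
        for word in counts:      # count each word once per sub-corpus
            rng[word] += 1
    return {w: (rng[w], freq[w]) for w in freq}

subs = [["the", "cat", "the"], ["the", "dog"], ["a", "dog", "dog"]]
stats = range_and_frequency(subs)
print(stats["the"])  # (2, 3): occurs in 2 sub-corpora, 3 times in all
print(stats["dog"])  # (2, 3)
```

Note that the comparison of range figures is only meaningful because the sub-corpora are assumed to be of equal size, which is exactly the second principle above.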




Recommendations

1. The content of a corpus should represent the actual or potential language uses of the target audience for the resulting word lists.
2. The corpus size should match the frequency level of the words that are the focus of the count, and should be enough to get a reliable list. A corpus size of around 20 million tokens is recommended for high-frequency and mid-frequency words.
3. To ensure that the most useful words occur early in the lists, both range and frequency of occurrence need to be considered when gathering data on the words.
4. Because any corpus is likely to be only an approximate representation of what learners need to know, and because of the limitations of word counting programs, the use of the criteria of frequency and range may need to be accompanied by more subjective criteria.
5. It is worth exploring the feasibility of a modular approach to making word lists that initially separates the results of counting different text types.


chapter 11


Preparation for making word lists

The quality of a list made from a corpus depends on the quality of the corpus. Size does not compensate for quality. In the chapter on corpus design we looked at the quality, size and representation of a corpus. In this chapter we look at tidying up a corpus to get rid of errors and unwanted items, and the use of additional lists to contain words that need to be separated from the other lists. It is worth spending time getting a corpus ready for analysis even though it can be very time consuming. A well prepared corpus not only results in better lists but can be used for different purposes in future studies.

Preparing the corpus

Many corpora contain various kinds of typographical errors, and these can make up a significant proportion of some corpora. They may come from the inputting of the text, the formatting of the text (for example, where the deletion of end-of-line markers results in words being joined together), or the nature of the text itself, where for example colloquial or accented speech is represented by unusual spelling (orf for off, huntin’ for hunting, ’ouse for house, dawg for dog). Ideally the corpora used for making a word list should be free from typographical errors such as misspellings and words joined together. An efficient way of checking this is to run the corpus through a counting program such as AntWordProfiler, choosing the output to be presented in range or frequency order. The words not in any list should then be inspected for errors. If for example yousaid occurs 140 times, then Find and replace should be used to get rid of this error in the corpora. This is where a program like Notepad++ (Notepad Plus Plus) is very helpful, because it can do Find and replace in several files simultaneously (choose Replace all in all opened documents in the Find and replace dialogue box).

The words not found in any list can be cut from the output file after running the corpus through the counting program and can be run through a spellchecker, keeping a note of the frequent spelling errors for a later Find and replace. Running the spellchecker over the output is likely to be much less time-consuming than running it over the whole corpus, because types rather than tokens are being checked. The disadvantage is that such checking needs to be followed by Find and replace for each misspelled type.
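The checking step described above, listing the types that match no list so that frequent errors such as joined words can be spotted, might be sketched like this. This is a simplified stand-in for running a profiler such as AntWordProfiler; the function name, tokenization, and threshold are assumptions:

```python
import re
from collections import Counter

def frequent_unlisted_types(corpus_text, known_words, min_freq=50):
    """Return word types not in any known list, most frequent first,
    so that errors like 'yousaid' can be found and then fixed in the
    corpus with Find and replace."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", corpus_text.lower())
    counts = Counter(t for t in tokens if t not in known_words)
    return [(w, n) for w, n in counts.most_common() if n >= min_freq]

known = {"you", "said", "that", "it", "was", "fine"}
corpus = "yousaid that it was fine . " * 60   # invented toy corpus
print(frequent_unlisted_types(corpus, known))  # [('yousaid', 60)]
```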


Notepad++ allows the use of what are called Regular expressions in the Find and replace dialogue box. Regular expressions are ways of referring to formatting marks such as tabs or hard returns, as well as to whole classes of symbols such as letters of the alphabet or numbers. It is well worth learning how to use simple regular expressions, and there are several websites that provide tutorials and guides.

When preparing a corpus, it is tempting to delete words from the corpus. For example, when preparing an academic corpus, the decision might be to not count proper names and years used in references in the text (Nation & Webb, 2011). In a spoken corpus there may be notes like laughs, latch (signaling overlapping speech), or unclear, which are comments on the text rather than parts of the text itself. Movie scripts may contain stage directions and the name of each speaker in front of each speech. The safest way of dealing with these unwanted items is not to delete them but to enclose them in triangular brackets, for example <laughs>. Both Range and AntWordProfiler have an option where items in triangular brackets can be ignored in the counting. This option preserves the original nature of the corpus and still allows the exclusion of some items. It also makes it easy to change your mind about the decision later.

Neufeld, Hancioglu and Eldridge (2011) usefully point out the very severe effects that can occur when a corpus containing non-ANSI characters is processed using the Range program. This underlines the importance of carefully checking the output from processing programs, and particularly looking at the words not found in any list. Moreover, if the lists used with the program are not saved in Unicode format, some characters, like the e with an acute accent in élite, may become corrupted and affect counting.
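The triangular-bracket convention can be mimicked with a simple regular expression that skips bracketed material when tokenizing (a sketch only; Range and AntWordProfiler implement this option internally, and the function name here is invented):

```python
import re

def countable_tokens(text):
    """Tokenize while ignoring anything enclosed in triangular brackets,
    mirroring the ignore-bracketed-items option described above."""
    without_markup = re.sub(r"<[^>]*>", " ", text)
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", without_markup)

# Speaker name and comments are bracketed, so they are not counted.
line = "<ANNA:> well <laughs> I never <unclear> saw him"
print(countable_tokens(line))
# ['well', 'I', 'never', 'saw', 'him']
```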

Preparing the lists As well as preparing the corpora, the lists to be used in a word count also need to be prepared. It is important to make your lists available for others to use and critique. This means not only making the headwords available but also the family members. Coxhead did this for the Academic Word List, and the BNC/COCA families are freely available. This availability makes the criteria for inclusion in a family easier to understand but also shows how carefully the family members were chosen and checked. Once again, looking at the words which do not occur in any list is the most useful source of data. Inevitably, many of the frequent words that occur in a corpus or text and which are not in the existing lists will be proper names. At least, the most frequent proper names not in the proper names list need to be added to the

Ngawang Trinley (202880) IP: 118.167.28.75 On: Tue, 14 Jan 2020 02:12:15



Chapter 11.  Preparation for making word lists 109

list. It is best to do this systematically, so that at least all proper names above a stated frequency are included in the proper nouns list.

If a text contains a lot of foreign words, it may be useful to create an additional list of such words that then becomes one of the lists used by the program. When adapting Glimpses of Unfamiliar Japan by Lafcadio Hearn to be a mid-frequency reader, it was necessary to do this by making a list of most of the Japanese words used in the text from the not-in-any-list output. The justification for using this separate list was that most of the Japanese words were explained in the text or could readily be guessed from context clues. It did not seem reasonable to treat them as words to be replaced in an adaptation or as unknown vocabulary in the same way that low frequency English words were unknown vocabulary. In addition, by having them as a separate list, their overall effect on the difficulty of the text could be gauged by seeing how many such words were used, how often the individual words were used, and what proportion of the text consisted of such words. Moreover, as the text was most likely to be of interest to Japanese readers, for whom Lafcadio Hearn is a well-known figure, it was useful to separate these words from unknown English words, thus allowing a more realistic assessment of the vocabulary difficulty of the text for Japanese readers.

A reasonable number of words not in any list may be low frequency members of word families that are already in the lists. Here are some examples: acquirement, adjustor, adulterant, adventurously. If the unit of counting is the word family, then these family members can be added to the families in the lists. Notepad++, with its ability to search in several lists at the same time, is a quick way of finding the family. Because high frequency families tend to have more family members than low frequency families, most of the additions will be to the higher frequency lists.
When adding a new member to a family, it is important to make sure that the new member is clearly a member of that family in that it shares the same word stem with a similar meaning, and the new member has an affix that is permitted in the definition of word family used in the lists, for example Bauer and Nation (1993) Level 3 or Level 6.
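Finding a likely family for a stray type can be semi-automated by stripping the affixes allowed at the chosen level, though the match must still be checked by hand for shared stem meaning, as the paragraph above stresses. This is a rough sketch: the affix list is a small illustrative subset, not Bauer and Nation's full Level 3 or Level 6 set, and the names are invented.

```python
SUFFIXES = ["ment", "or", "ant", "ously", "ly", "ness"]  # illustrative subset

def candidate_headword(new_type, headwords):
    """Suggest which existing family a new type might belong to by
    stripping one suffix and matching the remaining stem against the
    headwords. A human must still confirm shared meaning."""
    for suffix in SUFFIXES:
        if new_type.endswith(suffix):
            stem = new_type[: -len(suffix)]
            for head in headwords:
                if len(stem) >= 4 and head.startswith(stem):
                    return head
    return None

heads = {"acquire", "adjust", "adventure"}
print(candidate_headword("acquirement", heads))    # acquire
print(candidate_headword("adventurously", heads))  # adventure
```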

Preparing the program

Both AntWordProfiler and Range have ways of configuring the program to determine what is counted as a word. To do this, Range uses a text file called range.txt. If unwanted items have been put in triangular brackets, make sure the option to ignore items between '<' and '>' has been selected in Range, or the correct Tag setting chosen in Global settings in AntWordProfiler.

Recommendations

1. Tidy up the corpus by correcting errors and by enclosing unwanted items in triangular brackets. This can be very time consuming but is very important.
2. Check the Not in any list section of the output from Range or AntWordProfiler to see what needs to be added to existing lists such as proper nouns, marginal words, transparent compounds, or family members.
3. Check to see if any additional lists need to be made to deal with substantial categories of unwanted items such as foreign words.


section iv

Making the lists


chapter 12

Taking account of your purpose


Purposes

In this section we look at a range of purposes for word lists, providing examples of the purposes from published studies. The section also includes discussion of the criteria of range, frequency and dispersion, as well as other more subjective criteria.

General service vocabulary

The most common purpose for making a word list is to decide what vocabulary learners of English as a foreign language should learn in their first years of learning English. This was West’s (1953) intention when making A General Service List of English Words, with the main focus on reading. The major problem with this purpose is that children begin learning English at a wide range of age levels, from pre-school to primary school to secondary school, and in addition begin learning English several times. That is, teachers at university level may consider that their learners know so little English that they have to start again from the beginning. Lessons for very young learners seem to focus on spoken language, animals, colors, songs and fairy stories, while this kind of vocabulary is less relevant for secondary school students and first year university students. The greatest need for university students is likely to be reading, particularly in subject areas. It seems unlikely that it is possible to make a general service list beyond the 1st 1000 words that will be equally relevant to learners of all ages and with differing language needs.

General service lists include West’s (1953) A General Service List of English Words, Nation’s BNC/COCA lists (available from http://www.victoria.ac.nz/lals/staff/paul-nation.aspx), Brezina and Gablasova’s (2015) New General Service List, Browne’s (2014) New General Service List, and Dang and Webb’s (see Chapter 15 this volume) Essential Word List.

The assumption has been that a general service list should contain around 2,000 words because West’s list had around that number. Nation (2001b) examined this assumption and found that this number could be supported by several reasons based on coverage, cost/benefit analysis, overlap of different lists, total number of words, and criteria based on meaning and use.
The main idea is that a general service list should contain the high frequency words of the language because the way a teacher deals with these words should differ from the way the teacher deals with
mid-frequency and low frequency words (Nation, 2013). High frequency words each deserve attention from the teacher, while mid- and low frequency words do not, but provide opportunities for the practice of vocabulary learning strategies. Now Schmitt and Schmitt argue that the group of high frequency words should number 3,000 rather than 2,000, so presumably general service lists should also include 3,000 words.

Dang and Webb (see Chapter 15 this volume) consider that lists of 2,000 to 3,000 words are not truly practicable in that they are too large for a teacher to cover in a course. At the least, such lists should be broken into smaller lists that correspond at most to a year’s learning, and preferably to smaller chunks of learning. With smaller lists of a hundred words or less, teachers may be encouraged to make more use of lists in their own course design and teaching, because small lists are more manageable and relate more closely to short-term learning goals.

General academic vocabulary

There have been several studies focusing on the academic vocabulary necessary for academic study across a wide range of disciplines. The target audience for such a list is students in pre-university English courses or in the first year of academic study at English-medium universities. One of the early calls for such a list came from Barber (1962), who noted the recurrence of certain academic words in different subject areas. The earliest list was Campion and Elley (1971), which was made in preparation for the development of an English proficiency test for overseas students coming to study in New Zealand. Praninskas’ (1972) American University Word List for the American University in Beirut was reading and learning focused and, like Campion and Elley’s study, was based on a manual count of academic texts. Coxhead’s (2000) Academic Word List was the first computer-based study to develop a list of general academic vocabulary. The purpose of the list was “to show which words are worth focusing on during valuable class and independent study time” (page 213) in an English for Academic Purposes course.

All three studies assumed a general service vocabulary and looked for words in academic texts not in the general service vocabulary. Praninskas and Coxhead used West’s General Service List as their list of words assumed known, while Campion and Elley used the first 5000 words of the Thorndike and Lorge (1944) list. The reason for assuming a general service vocabulary was that foreign learners of English about to enter an English-medium university already know several thousand English words but may not have a good mastery of general academic vocabulary. An academic word list thus needs to take them from knowledge of the high frequency words to a level
of proficiency more focused on academic study. Coxhead’s list contains 570 word families and is an achievable goal on an intensive pre-university English course of around twelve weeks. A more recent general academic word list by Gardner and Davies (2014) contains three thousand words (it also singles out the top 500 of these). It does not assume an already known high frequency vocabulary, but simply looks for the high frequency words in academic texts. Rather than using a high frequency word list to distinguish general high frequency words from academic words as Coxhead (2000) did, Gardner and Davies used a corpus-comparison approach, classifying words as academic if they were 50% more frequent per million words in the academic corpus than in the general corpus. A possible issue is whether academic words are distinguished from technical words. Chung and Nation (2004) found that what could be called technical words occurred among the high frequency, general academic and low frequency words. Nation (2013 2nd edition) in a change from Nation (2001a 1st edition) describes technical words not as one of a series of word levels but as a category of words that cuts across frequency levels. Gardner and Davies data suggests that it may be sensible to view academic vocabulary in the same way, that is, not as a level after high frequency but as a different kind of classification that cuts across high, mid, and low frequency levels. So, a particular word, for example cost, could be (1) a high frequency word, and also (2) a general academic word, and as well (3) a technical word in economics. Because the purpose and users of a general academic word list can be easily and clearly defined, the resulting lists have been very useful and well made. The corpora they have used have been comprised of academic texts, often including the same texts that the students will eventually study from. 
They have also been drawn from a wide range of subject areas representing the range of courses offered at university. Because the Gardner and Davies list does not assume a previously known vocabulary, it is somewhat like the list that Ward and Chuenjundaeng (2009) made, making it most suitable for learners who come to academic study in English with a very small vocabulary size. Many of the words in Gardner and Davies’ Academic Vocabulary List are high frequency words in the BNC/COCA lists.

The corpus for developing a general academic vocabulary needs to represent the needs of the users of the list. A list aimed at learners beginning academic study needs to be made up of the kinds of texts that first-year university students need to deal with. It not only needs to represent the range of subjects they may study, including the most popular subjects, but also the kinds of texts they may have to read, such as articles, book chapters, books, and laboratory manuals.
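Gardner and Davies’ 50%-more-frequent-per-million criterion can be sketched in a few lines of code. Everything below except the threshold itself — the counts, corpus sizes, and function names — is invented for illustration:

```python
# Sketch of a corpus-comparison criterion for academic vocabulary:
# a word is classed as "academic" if it is at least 50% more frequent
# per million tokens in an academic corpus than in a general corpus.
# All names, sizes and counts below are invented for illustration.

def per_million(count, corpus_size):
    return count * 1_000_000 / corpus_size

def is_academic(word, acad_counts, gen_counts, acad_size, gen_size, ratio=1.5):
    acad_pm = per_million(acad_counts.get(word, 0), acad_size)
    gen_pm = per_million(gen_counts.get(word, 0), gen_size)
    return gen_pm > 0 and acad_pm >= ratio * gen_pm

acad = {"hypothesis": 120, "cost": 400, "walk": 5}   # counts in academic corpus
gen = {"hypothesis": 10, "cost": 350, "walk": 90}    # counts in general corpus
ACAD_SIZE, GEN_SIZE = 2_000_000, 4_000_000           # corpus sizes in tokens

academic_words = [w for w in acad if is_academic(w, acad, gen, ACAD_SIZE, GEN_SIZE)]
# hypothesis and cost pass the threshold; walk does not
```

Note that with these made-up counts cost qualifies as academic while also being frequent in the general corpus, which illustrates the point above that a word can belong to more than one classification at once.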

Specialized vocabulary

The next step from a general academic word list is a word list for a particular subject area. Often the pedagogical purpose of such lists is not clearly thought out. There is little point in teaching the technical vocabulary of a subject area in a pre-university course or even an adjunct course unless the content of the subject area is also being dealt with. Most technical words need to be learned as a part of learning the content of the subject area, and some technical vocabulary does not make sense unless the content of the subject area is understood. Many teachers on pre-university English preparation programs would struggle with teaching the technical vocabulary of a subject that they are not very familiar with.

A study of technical vocabulary can show the size of the task facing the learner in terms of the number and types of technical words. Chung and Nation’s (2003) research on the technical vocabularies of anatomy and applied linguistics sought to answer the following questions: How big is a technical vocabulary? What kinds of words make up a technical vocabulary? How important is technical vocabulary in specialized texts? How can learners be helped to cope with technical vocabulary? Their purpose was clearly focused on the nature of technical vocabulary in general.

Salager (1983) was more interested in how learners could go about learning the vocabulary of a particular subject area – medicine – in an English for Specific Purposes course. Martinez, Beck and Panza (2009) had an even narrower focus, the behavior of words from the Academic Word List in a particular subject area – agriculture. They found that some of the words from the Academic Word List were technical words in agriculture, as were many general service words. Their list of words was intended to be used as a support in teaching students to write research papers relating to agriculture.
Hsu (2013) also had a clear pedagogical goal in creating a medical word list “to bridge the gap between non-technical and technical vocabulary” (p. 454), suggesting the use of online concordancers to help establish the vocabulary. Hsu’s list contained 595 words and so represented a feasible goal for a teaching program. Ward (2009), working with low English-proficiency university students, looked at the feasibility of not following the general vocabulary → academic vocabulary → technical vocabulary sequence but going straight to the vocabulary of the technical texts.

“Our research question, then, is: how can we create a word list, as a basis for a lexical syllabus, which is
– Useful, in terms of word frequency and general coverage, for engineers in all subdisciplines and
– Easy enough, in terms of length and technicality, for learners who have nothing like mastery of the GSL or AWL.”  (p. 172)

Ward described his students well and chose a corpus that truly represented their potential needs. His resulting list of 299 word types represented a very feasible learning goal for his students. He chose types as his unit of counting because he had evidence that they struggled to see the connection even between inflected forms and the stem. Where substantial lists of technical vocabulary of more than a thousand words are developed, there needs to be a clear statement of how such lists might be used.

The vocabulary of English course books

Word lists can be used in the evaluation of course books. One caution in doing such studies is that, like graded readers, course books are likely to show a vocabulary distribution that exemplifies Zipf’s law, with a very large proportion of the vocabulary occurring once or twice. Such a study can show something about the repetition of vocabulary, but its greatest value lies in showing the amount and level of vocabulary covered in the course.

Matsuoka and Hirsh (2010) examined a popular upper intermediate text for learning English, concluding that the text dealt well enough with vocabulary from the 2nd 1000 words and included some words from the Academic Word List. The frequency distribution of words roughly followed Zipf’s law, with 33% of the 603 words from the 2nd 1000 words occurring only once in the course book. Two thirds of the unlisted words (beyond the 2nd 1000 and the Academic Word List) occurred only once. The course book itself, however, is not the only source of repetition, which can be strongly affected by how the teacher uses the book.
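The Zipf-style skew described here — a large share of the different words occurring only once — is easy to measure for any text. The toy token list below is invented; with a real course book one would tokenize the book’s text instead:

```python
from collections import Counter

def proportion_occurring_once(tokens):
    """Fraction of the different words (types) that occur exactly once."""
    freqs = Counter(tokens)
    return sum(1 for count in freqs.values() if count == 1) / len(freqs)

# Toy "course book": 10 running words, 8 different words, 7 of them occurring once.
tokens = "the cat sat on the mat while the dog ran".split()
share_once = proportion_occurring_once(tokens)   # 7 of 8 types occur once
```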

The vocabulary of graded readers

There have been no published studies on the construction of word lists for graded readers. The closest article is by Nation and Wang (1999), which examined the grading of the Oxford Bookworms series and described what a set of researched graded reader levels might look like.

It is worth considering the content of a corpus for research on graded readers. Many but not all graded readers are narratives, so a corpus of contemporary novels might be a suitable corpus, especially if the goal of graded reading is seen as helping learners move to reading unsimplified texts. Because some graded readers are not fiction, part of the corpus could be what Sorell (2013) calls general writing. This includes newspapers, magazines, biographies, instructions and editorials. General writing shares quite a lot of vocabulary with narrative, and a corpus with sub-corpora of general writing and narrative would be likely to yield a reliable list.

Another possibility, a slightly circular one, involves making a corpus of a wide variety of existing graded readers. The argument in favor of such a corpus is that the words in the resulting lists would be tried and true. That is, it is possible to write graded texts using such words. The negative aspect of such a corpus is that it assumes the status quo. If the word lists for the existing readers were not well researched, then using a corpus of material based on these lists would not bring about improvement. The diversity of the existing graded reader schemes, in terms of number of levels and the size of each level, suggests that research is needed in this area.

Testing

Word lists can be the basis for tests. The vocabulary section of LATOS (the Language Achievement Test for Overseas Students) was based on an academic word list developed by Campion and Elley (1971) in New Zealand. Campion and Elley used a corpus of just over 300,000 words of material from 19 different university disciplines. These disciplines had the highest enrolments in New Zealand universities. The material in their corpus included lectures, journal articles, examination papers, and textbooks. They only counted words which were not in Thorndike and Lorge’s (1944) first 5000 word level and which were not technical words related to one specific discipline. They also gathered familiarity ratings for the words in their list. Although their main purpose was selecting words for a test at university entrance level, they also saw other purposes for the list (pp. 16–26), namely selecting words for vocabulary enrichment exercises, estimating the difficulty of prose, and checking the word familiarity of learners in sixth form classes against national trends. The Vocabulary Size Tests (Nation & Beglar, 2007; Coxhead, Nation & Sim, 2014) were developed from the BNC/COCA lists.

Approaches to making lists

There are two major approaches to making corpus-based word lists. One is to stick strictly to criteria based on range, frequency and dispersion (Brezina & Gablasova, 2015; Dang & Webb, Chapter 15 this volume; Leech, Rayson & Wilson, 2001). The other is to use a similar statistical approach but to adjust the results using other criteria, such as ensuring that lexical sets such as numbers, days of the week, months of the year, family relationships (mother, father, child, son, daughter) or survival vocabulary (Nation & Crabbe, 1991) are complete within a particular list, and taking account of the age and needs of the learners using the lists.

The major advantage of a strict statistical approach is that the research leading to the lists is replicable. That is, someone else with the same corpus and criteria should be able to re-make exactly the same list. Such an approach also makes the sequencing of word lists easier, such as setting up successive groups of 250, 500 or 1000 words. Because the procedure is replicable, it makes comparison with other lists using different corpora or different criteria much more controlled. Brezina and Gablasova (2015), for example, used exactly the same criteria and procedure for making a high frequency word list using four different corpora, and thus suggested that corpus size was not a major factor in making high-frequency word lists. In such a comparative study, there is value in using statistically based decision-making.

There are, however, disadvantages to statistically based decision-making. The major disadvantage is that corpus effects are more noticeable in the resulting word lists. For example, a British corpus will result in peculiarly British words (bloke, chap, owt) occurring reasonably high in the frequency order. Or a written corpus will result in rather formal Latinate words occurring high in the frequency order. Similarly, very important survival words (Nation & Crabbe, 1991) such as hello, delicious, goodbye, and some numbers may fall outside the most frequent words.

The second approach, a statistical approach using additional more subjective criteria, aims to make the resulting word lists as useful and complete as possible for the target users. This usefulness is its major advantage.
The disadvantage of this approach is that there is a subjective element to this decision-making about what to add to a list and what to delete from it, and different list makers are likely to end up with slightly different lists. In this chapter, we try to estimate the likely strength of this subjective element in terms of how many words may be affected, and will suggest possible ways of making the adjustments less subjective by specifying categories and groups of the actual words that may be involved.

While replicability is a clear virtue in experimental research, it is not necessarily a virtue in making a word list. This is because a word list is not an end in itself but a tool for curriculum design, teaching and testing. The quality and relevance of the word list are its most important features, and quality and relevance are most likely to be achieved with the involvement of procedures that are not easily replicated. These procedures include completing lexical sets, taking account of users’ needs, and making up for deficiencies in the corpus. It is worth noting that several of the most useful word lists, for example the General Service List, Basic English, and the Survival Vocabulary for Foreign Travel (Nation & Crabbe, 1991), were all made using non-replicable procedures.

Transparency is a virtue because it can contribute to the validity of a list by providing information about the nature of the words in the list, such as whether they are common in spoken or written language, whether they are likely to be known by young children, and so on. While transparency is connected to replicability, transparency can include non-replicable procedures. Let us now look at the criteria used in sequencing words in lists.

Range, frequency, dispersion

Range

If we were forced to rank the statistical criteria for the inclusion and ordering of words in a word list, range would come at the top of the list. This is because range shows how many different texts or sub-corpora a word occurs in. Truly useful words occur everywhere. Range however is affected by the size of the sub-corpora. The larger the sub-corpora, the greater chance a lower frequency word has to occur. For range figures to be valid, each sub-corpus needs to be the same size and each sub-corpus needs to be internally coherent. That is, each individual sub-corpus needs to consist of similar kinds of texts. Different sub-corpora can be very different from each other, but all the text within a sub-corpus should be of the same type.

Range and frequency are closely related. The more different texts a word occurs in, the more frequent it is likely to be. There are some words with wide range and rather low frequency, but they are a rather small group.

A problem with range is that a corpus may be divided into very large sub-corpora and a word might occur in all sub-corpora, but occur only once or twice in some of them. In spite of this very low frequency in some of the sub-corpora, it still has a high range figure because it occurs in all the sub-corpora. The use of a dispersion measure is a way to take account of this (see below).

Frequency

Frequency refers to how often a word occurs in a corpus. Some strongly topic-related words are very frequent but have very limited range. This is the weakness of using frequency figures without any range measure. Topic-related words might occur in one long text and be central to the ideas in that text. The unusual word profitboss is like this in the BNC. It has a frequency of 425 and yet occurs in only one text.

Word families are more frequent than lemmas, and lemmas are more frequent than word types, because the frequency of families and lemmas is the sum of their members (see Reynolds and Wible (2014) for a discussion of this). Families and lemmas also tend to have larger range figures for the same reason.

Dispersion

Dispersion combines frequency and range and looks at how balanced the frequencies of a word are across the different sub-corpora. Carroll, Davies and Richman (1971) looked at dispersion across seventeen different subject areas. Range and frequency are measured simply by counting – how many sub-corpora a word occurs in and how often the word occurs. Dispersion requires a calculation from a formula involving range and frequency data, namely the frequency of a word in each of the sub-corpora. Range, frequency and dispersion are all closely related, and some studies do not use all three measures, partly because of this relationship. Leech, Rayson and Wilson (2001) however used all three in their study of the words in the British National Corpus, thus providing a very useful set of data for analysis.
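For concreteness, all three measures can be computed from a word’s per-sub-corpus counts. The sketch below uses Juilland’s D as the dispersion formula purely as an illustration — it is a common choice, but not necessarily the formula used in the studies cited — and it assumes equal-sized sub-corpora so that raw counts can be compared directly. The counts are invented:

```python
import statistics

def range_freq_dispersion(subcorpus_counts):
    """Compute (range, frequency, dispersion) for one word, given its
    occurrence counts in each of several equal-sized sub-corpora.
    Dispersion here is Juilland's D: 1 - (sd/mean)/sqrt(n - 1), which is
    close to 1 for evenly spread words and close to 0 for bursty ones."""
    n = len(subcorpus_counts)
    rng = sum(1 for c in subcorpus_counts if c > 0)   # sub-corpora the word occurs in
    freq = sum(subcorpus_counts)                      # total occurrences
    if freq == 0 or n < 2:
        return rng, freq, 0.0
    mean = freq / n
    sd = statistics.pstdev(subcorpus_counts)
    return rng, freq, 1 - (sd / mean) / (n - 1) ** 0.5

even = range_freq_dispersion([10, 10, 10, 10])   # wide range, evenly spread: D = 1.0
bursty = range_freq_dispersion([40, 0, 0, 0])    # same frequency, one sub-corpus: D near 0
```

The two example words have identical frequency (40), which is exactly the situation where range and dispersion do the work that frequency alone cannot.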

Combined measures

The use of range, frequency and dispersion measures provides three numbers that then need to be used to rank words in a word list. In order to make use of two or three of these numbers in a consistent way, researchers have combined them in a formula that produces just one number that can be used for the ranking. Carroll, Davies and Richman (1971) used the Standard Frequency Index (SFI), which made use of frequency and dispersion data. Leech, Rayson and Wilson (2001) do not provide a combined measure, but their data allows the use of one.

Applying the criteria

A combined measure balances the criteria used according to the formula used in the measure. When a combined measure is not used, decisions need to be made regarding the priority and cut-off points for the criteria. When making a list based on general usefulness, the range and dispersion criteria should be given priority. This is because range and dispersion show how widely a word is used, and if the sub-corpora well represent the kinds of uses that will be made of the language, then it is more useful to learn words that are frequent across all relevant uses of the language than words that are particularly frequent in one or two uses. This guideline still fits even if a modular approach is taken to the creation of a general service list.

The problem that occurs when range is given priority is that there is a frequency bump when the items in a sequenced list move from one range level to another. For example, when making the BNC/COCA lists, all the words in the first 9000 word families had the highest range of 10 (ten 10-million-word sub-corpora of the BNC were used). The coverage by the words in the 10th 1000 was markedly higher than the coverage of words in the 9th 1000, because the 10th 1000 included the highest frequency words with a range of 9, and these words were more frequent than the last of the words with a range of 10. Using dispersion can help make this frequency increase less marked by putting words with a lower dispersion figure but with a range of 9 or 10 in the later sub-lists. The situation becomes more complicated in a modular approach when a certain sub-corpus (such as informal conversation) or certain sub-corpora are given priority over other sub-corpora (such as general writing or academic). However, if a list has a general purpose, words that are more generally useful than others (that is, those that have wide range) should be given a higher ranking in the list than those that are less generally useful.

In Chapter 16 we look at how long lists should be. An important consideration relates to practicality: a list or its sub-lists should be short enough to represent achievable learning goals. These can be short-term goals or longer-term goals or both. One factor affecting the length of lists and sub-lists is word frequency. A large proportion of the words in the whole list derived from a corpus will occur only once (around half of the different words, according to Zipf’s law). If a list is intended for teaching and learning, then frequency figures much higher than 1 are needed.
Coxhead (2000) used a minimum frequency cut-off point of 100 occurrences in 3,600,000 tokens because she wanted each word family to have enough occurrences for researchers to be able to do reliable concordance analysis of meanings, collocations and grammatical use. Any decision like this has a degree of arbitrariness about it, but it is helpful to have some rationale for deciding on list size.
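Putting these criteria together — range first, then dispersion, then frequency, with a minimum-frequency cut-off — might look like the sketch below. All the figures are invented except profitboss’s frequency of 425 in a single text, quoted earlier:

```python
# Rank words by range first, then dispersion, then frequency, after
# dropping words below a minimum-frequency cut-off. All figures are
# invented except profitboss's frequency of 425 in one text.
words = {
    # word: (range, dispersion, frequency)
    "the": (10, 0.97, 61000),
    "suitable": (10, 0.90, 820),
    "profitboss": (1, 0.05, 425),   # frequent, but only in one text
    "chalk": (6, 0.55, 90),         # falls below the frequency cut-off
}

MIN_FREQ = 100

ranked = sorted(
    (w for w, (rng, disp, freq) in words.items() if freq >= MIN_FREQ),
    key=lambda w: (-words[w][0], -words[w][1], -words[w][2]),
)
# ranked == ["the", "suitable", "profitboss"]
```

Note how profitboss, despite its respectable frequency, sinks to the bottom once range is given priority.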

Other criteria

Lexical sets

One way of adapting a sub-list is to make sure that all the members of certain closed lexical sets are included. A closed lexical set consists of words that easily fit under a headword and which are complete with a relatively small number of words. These include numbers (thirty-one words), days of the week (seven words), months of the year (twelve words, some of which are homonyms and proper noun homonyms), seasons of the year (five words including both fall and autumn), points of the compass (four words), and family members (twelve words). See Appendix 2 for a list of the headwords and see the BNC/COCA lists for the members of each word family. Greetings and politeness words (thanks, sorry, excuse) are also worth including early, and most of these are in the survival vocabulary. Including sets like family members is on the borderline of what is easy to justify. Those making lists for children may also want to include lists of animals, colours, and classroom items and activities.

If we make the first 3000 word families adhering strictly to range and frequency criteria using the British National Corpus as the corpus, we find the following things about the lexical sets mentioned above. Of the days of the week, Tuesday and Thursday are not in the first 2000 words. All the months of the year are in the first 2000 words, with February and August in the 2nd 1000. Of the numbers, the word thirteen is not in the first 2000. Eleven, twelve and billion are in the 2nd 1000. All the seasons except autumn are in the first 2000. All the points of the compass are in the 1st 1000. The family members aunt, uncle, nephew, niece and cousin are not in the first 2000, and daughter, sister and grandfather (which includes other family members beginning with grand-) are in the 2nd 1000.

Lexical sets seem more strongly associated with certain text types. In Sorell’s (2013: 178–180) study, the days of the week all occurred in the top 500 in conversation (a focus on personal activities), and the months were well within the top 1000 for general writing (a focus at an annual scale). Colour words were best represented in narrative writing due to the need for description.

The survival language learning syllabus for foreign travel (Nation & Crabbe, 1991) consists of around 120 words and phrases which are very useful when spending some time in an English-speaking country.
Excluding numbers (31 words), it contains 117 word families. Most of the words in the survival vocabulary are in the 1st 500 words of English. However, nineteen words in the survival vocabulary are in the 2nd 500 most frequent families according to BNC/COCA family figures. They include please, thank [you], fine, closed, [good] morning, wait. Thirteen are in the 2nd 1000 (bus, hello, sick, sorry, restaurant, tomorrow, cheap, gentlemen (for toilets), ticket, welcome), and eight are not in the most frequent 2000 words (excuse [me], delicious, airport, entrance, exit, goodbye, toilet, underground (metro)). For adults at least, all of the survival vocabulary should be learned in the first few lessons of English.

Brezina and Gablasova’s (2015) New-GSL does not contain the numbers eleven, twelve, thirteen, thirty, and does not contain the following words from the survival vocabulary: hello, goodbye, excuse [me], delicious, toilet, exit, airport, underground. It contains all the days of the week and the months of the year.

If we wanted to adjust the 1st 1000 word families made solely by range and frequency criteria to include members of closed lexical sets and survival vocabulary, we would need to make the forty-two additions shown in Table 12.1. The full sets are in Appendix 2.

Table 12.1  Adjustments to the 1st 1000 families to include useful words and to complete sets

Types of words    Number of words   Headwords to add to 1st 1000
Weekdays           2                Tuesday, Thursday
Months             2                February, August
Numbers            4                eleven, twelve, thirteen, billion
Seasons            4                spring, summer, winter, autumn
Compass points     0
Family members     8                daughter, sister, grandfather, aunt, uncle, nephew, niece, cousin
Survival words    22                ahead, bus, expensive, hello, restaurant, sick, sorry, straight, ticket, tomorrow, welcome, cheap, gentleman, excuse, delicious, airport, entrance, exit, goodbye, post-office, toilet, underground
Total changes     42

The seasons probably should be in the 2nd 1000 rather than the first, as three occur there and autumn is beyond the 2nd 1000.

There are arguments against the inclusion of lexical sets. Research has shown that learning items in lexical sets generally has a negative effect on vocabulary learning (Nation, 2000). West (1951, 1955) also argued against what he called “catenizing”, showing that words in sets were typically of very different frequencies. Teaching lexical sets thus not only makes learning more difficult but also results in having to learn a mixture of useful and not so useful words. Completing lexical sets in lists may be seen by course designers as encouragement to teach lexical sets, which would be bad for learning. In addition, the inclusion of lower frequency words in a list to make up a set means that other more frequent words do not get into that particular list. As Table 12.1 shows, that can amount to 42 words for the first 1000, over 4% of the list.
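Mechanically, the kind of adjustment summarized in Table 12.1 — forcing missing closed-set members into a list and displacing an equal number of the lowest-ranked other words — can be sketched as follows. The tiny ranking and the list size of five are invented for illustration:

```python
def complete_sets(ranked_words, required, list_size):
    """Take the top list_size words of a frequency ranking, then force in
    any missing members of closed sets (required), displacing an equal
    number of the lowest-ranked words that are not themselves required."""
    head = ranked_words[:list_size]
    missing = [w for w in required if w not in head]
    optional = [w for w in head if w not in required]
    survivors = optional[:len(optional) - len(missing)]
    return [w for w in head if w in required or w in survivors] + missing

# Toy ranking and a two-member closed set; list size of five.
ranked = ["the", "Monday", "of", "go", "red", "blue", "Tuesday"]
adjusted = complete_sets(ranked, ["Monday", "Tuesday"], 5)
# adjusted == ["the", "Monday", "of", "go", "Tuesday"]: red is displaced
```

The displacement of red illustrates the cost noted above: completing a set pushes a more frequent word out of the list.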

Making additions because of the age of the corpus

It is also worth considering “new” or “modern” words like internet and email. Brezina and Gablasova (2015) compared recent corpora with older ones and found about eleven frequent words that could be said to be recent innovations (internet, online, web, website, video, television, TV, email, cd, mobile, drug). This was a very small number of items, and we should not overestimate the number or text coverage of such items (Nation & Hwang, 1995). Such items are well worth including if the corpora are too old to include them.

West’s efficiency criteria

When West (1953) made the General Service List, he wanted to make an efficient list, so he also used the criteria of ease or difficulty of learning, necessity, overlap with words already in the list, style, and intensive and emotional words. Ease or difficulty of learning meant that he included some words that were easy to learn because they were cognates or loan words or were closely related to known words through word parts.

Necessity is like including words in the survival vocabulary. Some words may not be high frequency, but they are the only way of saying something important. The classroom word chalk fits into this category, as might soap and soup. Many of the “necessity” words probably occurred to West when writing his New Method Readers. He found that certain words were often needed to express what he wanted to say. One word that I found I often needed when writing vocabulary-controlled texts and which is not in the General Service List is suitable.

Overlap with words already in the list meant that a word like commence did not get in because begin or start could cover most of its uses. The criteria of style and intensive and emotional words seemed to be used more for exclusion from the list. West decided that his list should be as neutral as possible, so emotional expressions and words associated with a flowery or literary style would not be included in the list.

Taking account of user characteristics

Because it is often impossible to make a large enough corpus that truly represents the language use of the target users, it may be necessary to add words that are clearly relevant to the users or to exclude words that are clearly not. The problem lies in doing this in a principled and transparent way. For example, there is no corpus of young children’s language that is both satisfactory and large enough. Adult spoken language corpora are still small, especially those that involve unscripted interactive spoken use.

One way of dealing with this lack is to use what relevant corpora we have and to give the frequency and range figures from such corpora priority over frequency and range figures from larger, less relevant corpora. It may not be possible to do this without applying some subjective judgement to the resulting lists to counter the possible biasing effects of a small corpus. The justification for the subjective judgement would be that small corpora are less reliable (see Sorell, 2013), and as long as the adjustments were relatively small, the number of them was within the usual variation we would expect to see with a corpus of that size, and they were noted in a commentary on the list, then this adjustment would be acceptable. An additional safeguard could be comparison with other lists (see Nation, 2001b for an example), and close inspection of the range and frequency of particular words among those added or deleted.

Evaluating a word list

The number of adjustments that need to be made to a word list primarily depends on how well the corpus used reflects the reason for making the list. The more the corpus includes texts that are relevant to the needs of the users of any resulting list, the fewer the adjustments that will need to be made.

Many studies do not evaluate their lists well. To do a fair evaluation, the following steps need to be applied. Find out as much as possible about the purpose of each list involved in the evaluation. Make sure that the unit of counting for each list (type, lemma, family) is the same; if they are not the same, then convert the lists so that they are. The number of words in each list being compared should be the same, or the difference in number of words and its effect should be clearly described. The coverage of each list should be tested on corpora that are different from those used for making the lists. It may be interesting to compare the lists using the corpora they were made from, but this connection between the lists and the corpora should be clearly noted. The relevance of the corpora used should be described, particularly in relation to the purpose of the lists. Where the lists have a clear purpose, they should also be evaluated on their inclusion of words in relevant lexical sets, such as numbers, days of the week and so on, and the survival vocabulary.

Dang and Webb (under review) evaluated four word lists across eighteen different corpora (nine spoken and nine written) using text coverage by headwords and families as the main criterion. This evaluation was particularly useful because the corpora used were different from those used to make the lists. Brezina and Gablasova’s (2015) study comparing word lists made from different sized corpora did not specify the purpose of their lists, although the implication from its name, the new-GSL, suggests that they were intended for the same audience as West’s General Service List.
That is, they were for children or young adults studying English as a foreign language with an emphasis on reading.
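The coverage-testing step above reduces to a simple computation: tokenise the corpus and count the tokens that appear in the list. A minimal sketch follows; the regular-expression tokeniser and the toy list are illustrative stand-ins, not the tools used in the studies discussed.

```python
import re

def coverage(word_list, corpus_text):
    """Percentage of corpus tokens accounted for by the word list."""
    tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", corpus_text.lower())
    known = sum(1 for t in tokens if t in word_list)
    return 100 * known / len(tokens)

# A toy list and corpus; a real evaluation uses 1000-family lists and
# independent corpora of a million tokens or more.
toy_list = {"the", "cat", "sat", "on", "mat", "a"}
toy_corpus = "The cat sat on the mat. A dog barked."
print(round(coverage(toy_list, toy_corpus), 1))   # → 77.8
```

Comparing two lists fairly then means calling this on the same independent corpus for each list.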




Chapter 12.  Taking account of your purpose 127

Brezina and Gablasova (2015) used four corpora of three very different sizes (1 million, 100 million, and 12 billion tokens) to see if a frequency-, range- and dispersion-based count, “a purely quantitative approach”, would result in similar high frequency lists. The unit of counting was the lemma, with different parts of speech of the same form being different lemmas. Their justification for the lemma is primarily based on the need to create a list for beginners for both receptive and productive purposes, although they also have criticisms of the transparency of the relationships between items in the same word family. The British National Corpus is the only corpus used in their study containing spoken material. As we saw in Chapter 10, they found an overlap of around 80%, with all four lists having only 70% common vocabulary, which leaves a lot of non-overlapping words. Brezina and Gablasova added 378 lemmas to their New-GSL that overlapped between the two most recent word lists. This was done to make sure that current vocabulary was well represented. When the New-GSL was compared with West’s General Service List, “Surprisingly, the largest proportion of the second 1,000 words in the New-GSL overlaps also with West’s first 1,000 word families” (page 16). This seems to me to provide evidence that a more inclusive level of word family may have been a more suitable unit of counting than lemmas (Level 2), given that both headwords and derivational members of the same families commonly occurred with high frequency. Just 178 lemmas (7.1%) out of a total of 2,494 occurred in the New-GSL and not in the old General Service List or Academic Word List. If we carefully analyse these 178 lemmas, there are only about eleven recent words (internet, online, web, website, video, television, TV, email, cd, mobile, drug).
The rest are words like acid, alcohol, atmosphere, barrier, capture, carbon, cell, climate, column, disorder, engage, entitle, fuel, height, infect, muscle, organic, peak, plot, profile, protein, recall, session, species. These words reflect the nature of the corpora rather than the General Service List being out-of-date. An analysis of the Academic Word List reveals that with the exception of computer, few if any words could be classified as recent additions to the language. West’s General Service List comes out extremely well in the comparison with the New-GSL, both in overlap of lemmas and text coverage. These high overlap figures and coverage figures are partly a result of its size, that is, the New-GSL contains 40% fewer lemmas than West’s General Service List. The higher coverage figures for West’s General Service List are also remarkable because they are based on the corpora from which the New-GSL was made, which strongly stacks the odds in favour of the New-GSL. When evaluating a list, it is necessary to evaluate it on a corpus which is different from the one from which it was made. It is worth bringing the General Service List and other word lists up-to-date, but, as Nation and Hwang (1995) and Brezina and Gablasova (2015) found, the required additions and their effects are rather small.
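Overlap comparisons like the ones above reduce to set operations once each list is held as a set of headwords. The sketch below is illustrative only; the two sets are tiny hypothetical stand-ins for full lists of a thousand or more items.

```python
# Tiny hypothetical stand-ins for two competing high frequency lists.
bnc_coca_1000 = {"aunt", "autumn", "awful", "rate", "policy"}
frequency_based = {"rate", "policy", "sector", "output"}

only_in_bnc_coca = sorted(bnc_coca_1000 - frequency_based)
shared = sorted(bnc_coca_1000 & frequency_based)
print(only_in_bnc_coca)   # → ['aunt', 'autumn', 'awful']
print(shared)             # → ['policy', 'rate']
```

Inspecting the non-overlapping sets by hand, as done with the 178 lemmas above, is what reveals whether the differences reflect recency, corpus bias, or genuine improvement.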



A very useful finding of the Brezina and Gablasova (2015) study is that there is a fairly consistent frequency and range ordering of the items in a high frequency word list. This agrees with Biemiller and Slonim’s (2001) finding that the order in which young native speakers learn words is largely predictable. This is an important finding when applying word lists to course design, because it shows that well-made word lists can not only usefully guide what should be presented for learning, but are also effective in sequencing that learning.

Sorell’s (2013) research shows that with a large enough homogeneous corpus, around 20 million tokens, it is possible to make a high frequency word list that is likely to contain mostly the same words as another word list made from a different but similarly sized homogeneous corpus containing the same kinds of texts. For course design, there should be little debate about which high frequency words need to be learned.

The corpus used for the comparison in Table 12.2 consists of eight million tokens, made up of seven million words of British, American and New Zealand informal spoken language, and one million tokens of texts written for young children (School Journals). The comparison with the BNC/COCA lists is unfair because, when they were being made, the BNC/COCA lists were checked against spoken corpora and the School Journal corpus. However, in contrast with the Brezina and Gablasova study, where the General Service List did worse in spite of the extra families it contains, in the comparison in Table 12.2 it is the New-GSL that did worse. Note that the lemmas in the New-GSL were all converted to word families with extra members added, so that each word family for a particular word was exactly the same in all three lists. The point of the comparison in Table 12.2 is to show that some corpora suit some lists and other corpora suit others.
Table 12.2  Coverage of a variety of one million token corpora by three high frequency word lists

Lists      Families   School     Novels    US        UK        TV/       Academic
                      journals             spoken    spoken    Movies
New-GSL    1866       85.12%     86.17%    90.50%    90.10%    88.51%    84.28%
GSL        2168       88.77%     88.44%    90.14%    90.91%    89.09%    76.19%
BNC/COCA   2000       88.82%     88.88%    91.82%    92.16%    90.87%    78.59%

Note that the New-GSL does best on the academic text, largely because it was developed from formal written corpora. Although West’s General Service List provides better coverage of most of the other texts in comparison, the New-GSL has 302 fewer word families.

A general service list is a list that is useful in a wide variety of uses of the language. That is what “general service” means. General service should include spoken and written uses, should include language used by both adults and children, and should cover a range of daily uses of the language such as casual conversation, watching TV, reading novels, reading children’s books, reading newspapers, and reading more serious texts. The New-GSL reflects the largely formal written corpora from which it was made, although given its smaller number of word families, it performs reasonably well on other kinds of texts. The corpus for a general service list, however, must be a much more varied and representative corpus of daily language use. Although Michael West saw his list as a vocabulary for reading, it does reasonably well on spoken uses.

The New-GSL contains 144 word families fewer than the BNC/COCA lists. If we take the UK spoken corpus as an example, the least frequent 144 families in the BNC/COCA lists covered 0.32% of the corpus. Taking these away from the BNC/COCA coverage of 92.16% (giving 91.84%) to equate family numbers has little effect on the comparison.

Coverage of the tokens in text is not the only way to evaluate a word list. The words that a list contains can be compared with other word lists to see what words are excluded and included (see Nation, 2001b, Schmitt & Schmitt, 2014 for examples). Various closed lexical sets like numbers, days of the week, months of the year, family relationships, points of the compass, and seasons of the year can be checked for inclusion. More open sets like academic words, survival vocabulary, colours, and greetings and terms of politeness can also be checked. The relevance of the sets used will depend on the purpose for the list.

There may also be a role for drawing on learners’ word knowledge or native speakers’ judgements as a way of evaluating and developing high and mid-frequency word lists.
Some researchers have suggested that, particularly when designing vocabulary tests for EFL learners, sequencing based on word knowledge may be a useful criterion. The danger here is that knowledge and usefulness are not the same, as word difficulty factors such as cognates, pronounceability and word parts can have a strong effect on what words are answered correctly in a vocabulary test (see Cobb (2000) for an example). Nonetheless, in some vocabulary lists, most notably West’s (1953) General Service List, criteria involving judgements about the usefulness of words have been included. One criterion, disponibilité, involved words that native speakers saw as being very useful but which were not always very high frequency. Examples include words such as soap, toilet, and silly (Richards, 1970).


Recommendations

1. The purpose and audience for a word list need to be clearly described. These descriptions provide a basis for evaluating the list.
2. Different purposes require different corpora, different units of counting, and different criteria for inclusion in a list.
3. Usually word lists should be evaluated on different corpora from the corpus from which they are made. In some cases, however, as in Ward’s (2009) study, the corpus is actually the target materials, so it is appropriate to evaluate the list on these materials.
4. Lists can be checked against various relevant lexical sets to see what is included and excluded.

chapter 13

Critiquing a word list


The BNC/COCA lists

The BNC/COCA word family lists have been available now for several years and have been used in various pieces of research. They have been continually revised but remain, and always will remain, a piece of unfinished work. Word lists are a bit like a black hole that absorbs hours and hours of work for little obvious improvement. In this chapter we look at a framework for critiquing a word list, based on the previous chapters in this book. This critique framework is then applied to the BNC/COCA lists.

Critiquing a word list

Table 13.1  Questions for critiquing a word list

Purpose: Was the target population for the word list clearly described? Was the purpose of the list clearly described?

Unit of counting: Was the unit of counting suited to the purpose? Was the unit of counting clearly defined, including issues such as UK vs US spelling, alternative spellings, part of speech, abbreviations and numbers? Was the unit of counting explicitly well-justified?

Corpus: Was the content of the corpus suited to the purpose of the list? Was the corpus large enough to get reliable results? Was the corpus divided into sub-corpora so range and dispersion could be measured? Were the sub-corpora large enough, of equal size, and coherent? Was the corpus checked for errors?

Main word lists: Was there an explicit description of what would be counted as words and what would not be included? Were homoforms dealt with? Were proper names dealt with, including proper name homoforms? Were content bearing proper names distinguished? Were hyphenated words dealt with? Were transparent compounds dealt with in a way consistent with hyphenated words? Were acronyms dealt with, including acronym homoforms? Were the proper name lists and other lists revised on the basis of initial output?

Other lists: Were marginal words dealt with? Were any other supplementary lists used?

Making the lists: Were the criteria for inclusion and ordering in the list (frequency, range, dispersion, or some composite measure) clearly described and justified? Were the criteria for making sub-lists clearly described and justified? Were any subjective criteria used? Were they described and justified? Were the lists checked against competing lists not just for coverage but also for overlapping and non-overlapping words?

Self-criticism: Are the weaknesses of the lists clearly acknowledged?

Availability: Are the lists readily available in electronic form for evaluation?

The BNC/COCA word family lists

The lists

The BNC/COCA word family lists consist of 32 word family lists. Twenty-eight of the lists contain word families based on frequency and range data. The four additional lists are (1) an ever-growing list of proper names, (2) a list of marginal words including swear words, exclamations, and letters of the alphabet, (3) a list of transparent compounds, and (4) a list of acronyms. In the lists for AntWordProfiler, each list has a name which describes its content. In the lists for Range, because of the requirements of the Range program, each list has a fixed name – basewrdx.txt, where x is a number. Basewrd29 and 30 just contain one nonsense word each. They were made to provide space for additional lists and to avoid having to keep changing the names of the proper nouns etc. lists. Basewrd31 contains proper nouns, basewrd32 marginal words, basewrd33 transparent compounds and basewrd34 acronyms. More detail on these additional lists can be found in Nation and Webb (2011: Chapter 8). The lists are text files saved in UTF-8 without BOM (choose under Encoding in Notepad++).

The COCA data used in the BNC/COCA lists was in the form of word lists. It was not possible to run the Range program over COCA. The low frequency words from COCA which were not in the BNC lists were added to the low frequency BNC/COCA lists. The high frequency words from COCA were checked against the high frequency BNC lists to make sure that there were no notable words missing in the BNC lists. Ideally it would be good to have a combined and balanced BNC/COCA corpus to use in the making of the lists, but this was not possible.
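As a rough illustration, a family file in such a basewrd format could be loaded like this. The layout assumed here (headwords flush left, family members indented with a tab) is an assumption made for the sketch, not a documented specification, so check it against the actual files before relying on it.

```python
# Load a family list in an assumed basewrd layout: headwords flush left,
# family members indented with a tab. This layout is an assumption for
# illustration, not a documented specification of the files.
def load_basewrd(path):
    families = {}        # headword -> set of family members (lower case)
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            word = line.strip().lower()
            if line[0] not in (" ", "\t"):   # a new headword
                current = word
                families[current] = {current}
            elif current is not None:        # an indented family member
                families[current].add(word)
    return families
```

A call like `load_basewrd("basewrd1.txt")` would then give a headword-to-members mapping for the 1st 1000 families, if the files do follow this layout.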




Table 13.2 contains the specifications for the BNC/COCA lists. The first 2000 words are intended to be a general service list, and research by Dang and Webb (under review) suggests that they fill this role reasonably well.


Table 13.2  Specifications for the BNC/COCA lists

Purpose: Course design for EFL at secondary school level; receptive vocabulary size testing of NS & NNS

Size of lists: 1000 word families per list

Number of lists: At least 28 (total 28,000 word families), goal 30

Unit of counting: Word families – Bauer & Nation Level 6. Each family includes US and UK spellings, abbreviations, dialect forms

Capitalization: No distinction between capitalized and non-capitalized words

Numbers: Range does not count numbers except if preceded by letters, for example U2 is counted but not 2B

Criteria for list: The corpus differed for the first 2000, the mid-frequency words and the low frequency words. (1) Family range across 10 BNC sub-corpora. (2) The high frequency words were checked against a variety of corpora and lists. The mid-frequency words used the 10 million spoken tokens of the British National Corpus and the whole British National Corpus for the rest. (3) Subjective judgement was used in the high frequency words for some lexical sets and survival words

Word separators: Space, apostrophe, hyphen, full stop

Separate lists:
  Proper names – various parts of speech, Roman numerals
  Marginal words – hesitations, exclamations, letters of the alphabet, swear words and impolite words, sounds (Zzzzzz, umph)
  Transparent compounds – parts define the whole
  Acronyms

Function words: Included in the frequency lists

Homoforms: Homonyms and homographs are not separate families, but forms unique to one member of a homoform are listed as a separate family with the stem in the most frequent family – Orient, Oriental etc. in one family; oriented, orienting, orients, orientate etc. in another.

Corpus: BNC 100 million tokens divided into 10 sub-corpora according to type of texts (all spoken in one sub-corpus). Data from lists from the COCA corpus was used but without direct access to the corpus.

Possible criticisms: Use of subjective criteria; corpus change at the 3000 and 10,000 levels; data from lists from COCA and Brysbaert was used as a source and check for later lists; homoforms were not consistently distinguished, including names which can be words (Bush – bush, Brown – brown); the British National Corpus is predominantly written and formal


The making of the lists

The 1st 1000 and 2nd 1000 word family lists


The first two 1000 word family lists were made using a specially designed 10 million token corpus. Six million tokens of this corpus were spoken English from both British and American English as well as movies and TV programs. The written sections included texts for young children and fiction (see Table 13.3).

Table 13.3  The corpus used for the first two 1000 word family lists

Spoken
  US:
    1. AmNC spoken face to face, telephone 1    1,107,602
    2. AmNC spoken face to face, telephone 2    1,029,831
    3. Movies and TV                            1,000,000
  UK/NZ:
    4. BNC 1                                    1,036,097
    5. BNC 2                                    1,125,523
    6. BNC plus half of WSC                     1,132,620

Written
  US:
    7. AmNC written fiction, letters 1          1,145,081
    8. AmNC written fiction, letters 2            939,407
  UK/NZ:
    9. School journals                          1,028,842
    10. BNC fiction                             1,040,204

This unusual step of creating a special corpus for the first 2000 word families was followed because the previous lists made from the British National Corpus were so strongly influenced by the written formal nature of the corpus that they were not suitable lists for creating language courses or graded reader lists (see Nation, 2004). Using a corpus biased more towards spoken language meant that very common words in spoken English like alright, pardon, hello, Dad, bye were included in the high frequency words. Other arbitrary adjustments included putting all the word forms of numbers (one, two, hundred etc.) and weekdays in the 1st 1000, and the months of the year in the 2nd 1000, even though their frequency did not always justify this. The goal was to have a set of high frequency word lists that were suitable for teaching and course design. Words from the survival vocabulary for foreign travel (Nation & Crabbe, 1991) were also included.

The 3rd 1000 onwards

The lists from the 3rd 1000 to the 9th 1000 inclusive used the spoken section of the British National Corpus and COCA/BNC frequencies in data kindly provided by Mark Davies, after removing my specially created first 2000 word families. The whole British National Corpus was used from the 10th 1000 on.




Word families

The criteria used to make word families were based on Bauer and Nation’s (1993) Level 6, which includes all the affixes from levels 2 to 6 (see Table 2.2). The word families were developed over several years, and low frequency family members continue to be added to the existing families.


The nature of the families

The word lists were made to be used with the AntWordProfiler and Range computer programs, and these programs cannot distinguish between homonyms like Smith (the family name) and smith (blacksmith), or March (the month) and march (as soldiers do). Thus when the program runs, these uses are not distinguished and would be counted in the same family and as the same type. There was an attempt to deal with this wherever possible. Marched, marching, marches, marcher, marchers etc., for example, were put in one family and March into another. This does not completely distinguish the homonyms, but it is a step towards doing so.

The high frequency word families tend to be quite large, as it appears that higher frequency stems can generally take a greater range of affixes than lower frequency words. For example, the high frequency word family nation has the following members: nations, national, nationally, nationwide, nationalism, nationalisms, internationalism, internationalisms, nationalisations, internationalisation, nationalist, nationalists, nationalistic, nationalistically, internationalist, internationalists, nationalise, nationalised, nationalising, nationalisation, nationalize, nationalized, nationalizing, nationalization, nationhood, nationhoods.

The word family lists group items together that would be perceived as the same words for the receptive skills of listening and reading. If word lists were made for productive purposes, for speaking and writing, the lemma (Bauer & Nation (1993) Level 2) would be the largest sensible unit to use. Some researchers argue for the word type. The word lists contain compound words but they do not contain phrases. According to or au fait, for example, might be best counted as a unit, but in the lists the unit is the single word.
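Grouping running words by family can be illustrated with a small sketch. The two families below are toy stand-ins, and the type-to-headword index is one plausible way a profiler might look families up; it is not a description of how Range or AntWordProfiler are actually implemented.

```python
# Toy stand-ins for two word families; real Level 6 families are larger.
families = {
    "nation": {"nation", "nations", "national", "nationally",
               "nationalism", "nationhood"},
    "march": {"march", "marched", "marching", "marches",
              "marcher", "marchers"},
}

# Invert to a type -> headword index so each running word in a text can
# be credited to its family.
type_to_family = {member: head
                  for head, members in families.items()
                  for member in members}

text = "Nationalism marched across the nations".lower().split()
found = sorted({type_to_family[w] for w in text if w in type_to_family})
print(found)   # → ['march', 'nation']
```

Note that lower-casing every token is exactly why Smith/smith and March/march collapse together, as described above.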

The validity of the BNC word family lists

There are ways of checking whether the word family lists are properly ordered. From the 1st 1000 to the 28th 1000, the number of tokens, types, and families found in an independent corpus should decrease. That is, when the lists are run over a different corpus from the British National Corpus or COCA, the 1st 1000 word family list should account for more tokens, types and families than the 2nd 1000 family list does. Similarly, the 2nd 1000 word family list should account for more tokens, types and families than the 3rd 1000 family list does, and so on. While this does not show that each word family is in the right list, it does show that the lists are properly ordered. Table 13.4 presents such data using the Range output from the Wellington Written Corpus.


Table 13.4  Tokens, types and families in the Wellington Written Corpus

Wordlist            Tokens/%            Types/%          Families
one                  772697/75.22       4762/11.74         999
two                   91545/8.91        4299/10.60         999
three                 53591/5.22        3903/9.62          999
four                  17967/1.75        2853/7.03          995
five                  10899/1.06        2336/5.76          981
six                    7267/0.71        1986/4.90          950
seven                  4513/0.44        1564/3.86          904
eight                  4313/0.42        1336/3.29          853
nine                   2592/0.25        1089/2.68          760
ten                    2005/0.20         920/2.27          700
11                     1533/0.15         721/1.78          585
12                     1063/0.10         589/1.45          489
13                      832/0.08         438/1.08          391
14                      737/0.07         346/0.85          304
15                      531/0.05         276/0.68          246
16                      443/0.04         220/0.54          198
17                      628/0.06         194/0.48          173
18                      250/0.02         127/0.31          117
19                      247/0.02         104/0.26          101
20                      269/0.03         104/0.26           89
21                      132/0.01          79/0.19           74
22                      130/0.01          63/0.16           59
23                       80/0.01          43/0.11           40
24                      296/0.03          52/0.13           48
25                      134/0.01          31/0.08           29
26                        0/0.00           0/0.00            0
27                        0/0.00           0/0.00            0
28                        0/0.00           0/0.00            0
29                        0/0.00           0/0.00            0
30                        0/0.00           0/0.00            0
31                    30991/3.02        3844/9.48         3691
32                     3111/0.30          90/0.22           33
33                     4203/0.41        1200/2.96          926
34                     1380/0.13         191/0.47          188
Not in the lists      12819/1.25        6803/16.77       ?????
Total               1027198            40563             16921
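The ordering check itself is mechanical: read the counts list by list and confirm each is no larger than the one before. A sketch using the token counts for the first ten lists from Table 13.4:

```python
# Token counts for the 1st to 10th 1000 lists from Table 13.4
# (Wellington Written Corpus output).
tokens = [772697, 91545, 53591, 17967, 10899, 7267, 4513, 4313, 2592, 2005]

def properly_ordered(counts):
    """True if each list accounts for no more than the one before it."""
    return all(a >= b for a, b in zip(counts, counts[1:]))

print(properly_ordered(tokens))   # → True
```

Run over all the lists, a strict check like this would flag the occasional exception, such as List 17 accounting for more tokens than List 16 in this corpus.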


Note that the percentage token coverage for each list does not always drop as we go down the lists; see List 17, for example, where 0.06% coverage is more than List 16 with 0.04% coverage. This occurs because there is a word or two in List 17 that is particularly frequent in the corpus being analyzed.

A second way of checking the validity of the lists is to look at the total number of types in each list. Low frequency words tend to have fewer family members than high frequency words, so even though the number of families in each list is the same, one thousand, the number of types should be smaller. Table 13.5 contains this data.

Table 13.5  The number of types (family members) in each of the first twenty-five 1000 word family lists

 1  6857     6  4104    11  2941    16  2086    21  1651
 2  6374     7  3679    12  2754    17  2076    22  1539
 3  5880     8  3417    13  2415    18  1933    23  1394
 4  4863     9  3196    14  2299    19  1872    24  1296
 5  4294    10  2985    15  2283    20  1820    25  1675

The 1st 1000 word list contains 6,857 word types, an average of 6.857 per family, as each list contains exactly 1000 word families. There is a decrease in word types from one list to the next.

A third way of checking the validity of the lists is to make sure that no wide range, high or mid-frequency words are missing from the lists. To check this, the lists were run over a wide range of different corpora, existing lists, and texts. No frequent, wide range word families were missing.

Table 13.6 looks at words that are outside the existing lists, including the proper noun, marginal words, transparent compounds and acronyms lists. The goal is for few if any headwords to be outside the main lists. There are 272,782 word types in the British National Corpus that are not in the first 20 word lists used with the Range program, plus a list of proper nouns, a list of transparent compounds, and a list of exclamations, hesitations and other spoken marginal words. Note in Table 13.6 that almost half of the different words are proper nouns. Four percent are foreign words, and 6% are low frequency members of word families already in the first twenty 1000 word lists. Ideally, these family members should be added to the families in the existing lists. The main point of the table is to show that the new words (49,101) plus the 20,000 in the word lists total around 70,000 word families, which is a figure not too far from Nagy and Anderson’s (1984) estimates, and the number of words in most reasonably sized non-historical dictionaries. The reason for distinguishing recurring words (those occurring 2 times or more in the British National Corpus) from those occurring only once in the corpus (one-timers) is to show that the proportion of new words in the one-timers is half that in the recurring words.


Table 13.6  The percentage amounts of different kinds of word types in the British National Corpus and not in the first twenty 1000 word family British National Corpus word lists and additional lists

Kinds of words                 Recurring   One-     Total   Projected   Examples
                               words       timers   %       total
New words                      12           6        18%     49,101     tucuxi, pericentric, escritoire, polyacrylonitrile, trochar, pancreata
Proper nouns                   23          25        48%    130,937     Southwick, Akrokorinth, Frakes, Aalberse, Stycar, Thucyd, Wellferon, Mlungisi
Foreign words                   2           2         4%     10,911     nationaux, panellinion
Low frequency family members    2           4         6%     16,367     obeyance, velcros, realizational, ungrouped
Transparent compounds           2           2         4%     10,911     lockgates, poolrooms, countertop, duststorm
Acronyms, abbrev                5           2         7%     19,095     USYN, MLD, EMW
Alternative spellings           0           3         3%      8,183     velem, Hindostani, cronicles
Letters with numbers            1           3         4%     10,911     AW17, MX300, PTFMA215
Exclamations                    2           0         2%      5,455     fattafattafatta, cheeeeee
Errors                          0           4         4%     10,911     approprite, gorups, dispoal
Total                          49          51       100%    272,782
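The third validity check above, hunting for frequent words missing from all the lists, can be sketched as below. The tiny corpus and list are hypothetical, and Counter stands in for proper frequency and range data across sub-corpora.

```python
from collections import Counter

# Union of all list headwords and a tiny corpus, both hypothetical
# stand-ins for the real lists and checking corpora.
all_listed = {"the", "cat", "sat", "on", "mat"}
corpus_tokens = "the dog and the dog sat on the mat".split()

# Frequency of every token not accounted for by any list; the most
# frequent leftovers are the candidates for missing words.
missing = Counter(t for t in corpus_tokens if t not in all_listed)
print(missing.most_common(2))   # → [('dog', 2), ('and', 1)]
```

In practice each leftover word would also need its range across sub-corpora checked before being judged a genuine omission.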

Evaluation and critique of the BNC/COCA lists

The writing of this book resulted in several improvements to the BNC/COCA lists. In fact I wish I had written the book before I made the lists, but the long process of making the lists gave me much of the knowledge needed to write the book. If you are going to build a shed, build a shed for someone else first, so that you have a chance not to make the same mistakes when you eventually build your own shed. Here are some of the improvements made to the BNC/COCA lists while writing this book. In the following evaluation of the BNC/COCA lists, these revised versions are used to show them in the best possible light!

1. The less frequent word families in the 1st 1000 and 2nd 1000 words were checked to see if they met subjective criteria for inclusion.




2. The high frequency homonyms were checked for separate entries and several entries were added to distinguish these homonyms as much as possible.
3. A few proper noun homonyms were moved from the high frequency word lists.
4. The transparent compound list was checked to make sure it contained all the high frequency compounds.
5. The marginal words list was reorganized and expanded.

When the total BNC/COCA 1st 2000 coverage of the British National Corpus (84.10%) is compared with the coverage of a 2000 word family list (86.6%) based solely on frequency and range with no subjective adjustments, the difference in coverage is 2.5% in favor of the range and frequency list. The lists based solely on range and frequency should do best on the British National Corpus because they are made solely from that corpus, and any list will perform best on the corpus from which it is made.

There are 54 word families in the BNC/COCA 1st 1000 which are not in the frequency-based 2000. These 54 are: amaze, ashamed, aunt, autumn, awful, bet, boring, bother, bread, cake, chicken, crazy, darling, delicious, dig, dirty, excuse, glad, goodbye, hat, horrible, hunger, hurry, internet, lazy, loud, mad, mess, movie, naughty, neat, orange, pardon, penny, rabbit, rid, rubbish, scare, shy, silly, snow, stupid, tail, thirst, thirteen, throat, Thursday, Tuesday, ugly, uncle, underneath, web, wed, zero. Eleven of these 54 words come from completed sets – numbers, days of the week, family members, and survival vocabulary. Two are recent words – internet, web.

There are 463 word families in the BNC/COCA 2000 which are not in the frequency-based 2000 word families. 354 of these 463 are also in the first 2000 of the children’s lists, which is a good indication of their usefulness for young learners of English.

Table 13.1 contains questions that can be applied to most word list studies. In Table 13.7 they are applied to the BNC/COCA lists.
The criticisms largely relate to the information supplied in this chapter, which was written to accompany the BNC/COCA lists before this book was written. The criticisms will be used to improve the lists and the description of them. In the following discussion, there is an attempt to rank the criticisms of the BNC/COCA lists according to their relevance to the validity of the lists.

The two strongest criticisms of the BNC/COCA lists are that their purpose is not clearly enough described, and that, whatever their purpose, the British National Corpus is too much a written, formal corpus containing texts written by and for adults to be suitable for making a list for children or young adults who are learning English as a foreign language. The unit of counting is also a weakness for foreign language learners. For a vocabulary size test for teenage and adult native speakers, it is probably satisfactory.


Table 13.7  Questions for critiquing a word list applied to the BNC/COCA lists

Purpose
Questions: Was the target population for the word list clearly described? Was the purpose of the list clearly described?
The BNC/COCA lists: This is poorly done. There is no clear description of the people who will learn from the lists and only a very vague reference to the uses of the lists.

Unit of counting
Questions: Was the unit of counting suited to the purpose? Was the unit of counting clearly defined, including issues such as UK vs US spelling, alternative spellings, part of speech, abbreviations and numbers? Was the unit of counting explicitly well-justified?
The BNC/COCA lists: The unit of counting is the right one to use for the lists but there is no justification for it. Setting the level at Bauer & Nation Level 6 makes the lists more suitable for native speakers but is too high for foreign language learners. What is included in a family needs to be more explicitly described and justified with examples. The availability of the lists partly counters this criticism.

Corpus
Questions: Was the content of the corpus suited to the purpose of the list? Was the corpus large enough to get reliable results? Was the corpus divided into sub-corpora so range and dispersion could be measured? Were the sub-corpora large enough, of equal size, and coherent? Was the corpus checked for errors?
The BNC/COCA lists: The British National Corpus is not suitable for high-frequency and mid-frequency words for general purposes. The use of a largely spoken and general corpus for the first 2000 was a good idea. The British National Corpus is large, has good sub-corpora, but contains a lot of errors.

Main word lists
Questions: Was there an explicit description of what would be counted as words and what would not be included? Were homoforms dealt with? Were proper names dealt with, including proper name homoforms? Were content-bearing proper names distinguished? Were hyphenated words dealt with? Were transparent compounds dealt with in a way consistent with hyphenated words? Were acronyms dealt with, including acronym homoforms? Were the proper name lists and other lists revised on the basis of initial output?
The BNC/COCA lists: The description of words is satisfactory. The lists deal with homoforms to a degree. There is no list of content-bearing proper names. Hyphens were replaced with space hyphen space in the BNC but transparent compounds were not split. A good acronym list needs to be developed. Whenever the lists are used the proper names list is added to.

(continued)

Chapter 13.  Critiquing a word list: the BNC/COCA lists 141

Ngawang Trinley (202880) IP: 118.167.28.75 On: Tue, 14 Jan 2020 02:12:15



Focus

Questions

The BNC/COCA lists

Other lists

Were marginal words dealt with? Were any other supplementary lists used?

Marginal words are dealt with satisfactorily. No other supplementary lists.

Making the lists

Were the criteria for inclusion and ordering in the list (frequency, range dispersion, or some composite measure) clearly described and justified? Were the criteria for making sub-lists clearly described and justified? Were any subjective criteria used? Were they described and justified? Were the lists checked against competing lists not just for coverage but also for overlapping and nonoverlapping words?

Only range and frequency were used. With only ten sub-corpora, this is satisfactory for the high and midfrequency words, most of which have a range of 10. The subjective criteria need to be formalized and more explicitly applied. The high- and mid-frequency lists have had a lot checking.

Self-criticism Are the weaknesses of the lists clearly acknowledged?

Several weaknesses are acknowledged.

Availability

Yes, on Paul Nation’s website.

Are the lists readily available in electronic form for evaluation?

The main strengths of the BNC/COCA lists are that they use word families and that most of these word families have been well checked over the years. In spite of this, there are still very low frequency members that could be added to the families, and it is hoped that a program can be written to make this task less laborious. The low frequency members not in the lists occur with a frequency of less than 20 and a range of less than 7 out of 10 in the 100 million tokens of the BNC. These words nonetheless seem very familiar, and include words like sheikdom, rubberize, roguishly and puppetry. It is good that the first 3000 words are now available as flemmas thanks to Dang and Webb (see Chapter 15 this volume), and the first 3000 families are also available in a version of Level 3 of the Bauer and Nation (1993) list (headword, inflected forms and -ly, -er, -th and un-). One great benefit of the lists is that the word families are there for others to freely use in their own lists, thus saving a lot of time.

The next most serious criticism of the BNC/COCA lists is that they are not replicable. There has been so much adjustment to them over the years that they could not be re-made following a clear procedure. I am not too worried by this criticism, because it is the quality of the resulting lists that matters, and the adjustments have been made to improve their quality. I will, however, try to make their composition more explicit and principled.


A positive feature of the lists is that they do work well. The aspects of validity described earlier in this chapter show that they can do many of the jobs they were intended to do, and with the supplementary lists of proper nouns etc., they typically cover over 99% of the words in a text.

The treatment of homoforms in the lists is at best a barely satisfactory compromise. Range and AntWordProfiler cannot deal with the same form twice in the lists, but the treatment of homoforms would be improved if there were some kind of corpus tagging program which could go through a corpus marking the distinct meanings of homoforms. This is not beyond the capabilities of present-day computing. We know what the most frequent homoforms are, although some may disagree about what is in the list, and Tom Cobb (personal communication) has reported success rates in the high 90 per cents in separating some homoforms using a computer program. If, say, one meaning of bowl was left untagged and the other had an asterisk or some other mark placed after the word and its family members (bowl*, bowls*), Range and AntWordProfiler could easily count them separately. This would increase the precision of word counts and improve the resulting word lists.

A weakness for some uses of the lists is the treatment of transparent compounds. These are listed in basewrd33, which contains the most frequent transparent compounds in the BNC. Typically their frequency is added to the coverage by the main lists in text coverage studies, because they are made up of frequent parts and are easily decomposable for comprehension. Ideally, a text processing program would go through a corpus and split these compounds before Range or AntWordProfiler is used. The program could draw on the list in basewrd33, which would have the breaks in the words indicated.
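Both mechanisms described above — asterisk-marked homoforms and break-marked transparent compounds — amount to simple pre-processing before a profiler counts forms. A minimal sketch follows; the compound entries and tokens are invented for illustration, and the hard step (deciding which occurrences of a homoform get the asterisk) is assumed to have been done by a tagger.

```python
# Sketch: pre-processing a token stream so that (a) transparent
# compounds are split into their frequent parts, and (b) a marked
# homoform ("bowl*") is counted separately from its unmarked twin.
from collections import Counter

# Break-marked compound list, in the spirit of basewrd33 with the
# breaks indicated; these entries are invented examples.
COMPOUND_SPLITS = {"teapot": ["tea", "pot"], "postman": ["post", "man"]}

def preprocess(tokens):
    """Split listed transparent compounds into their parts."""
    out = []
    for t in tokens:
        out.extend(COMPOUND_SPLITS.get(t, [t]))
    return out

def count_forms(tokens):
    """Count forms after splitting; an asterisk-marked homoform stays
    a distinct type, so the two senses are counted apart."""
    return Counter(preprocess(tokens))

counts = count_forms(["the", "postman", "bowl", "bowl*", "teapot"])
print(counts["post"], counts["bowl"], counts["bowl*"])  # 1 1 1
```

As the text notes, the effect on total coverage would be small, but the per-word frequencies of the compound parts would become more accurate.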
The overall effect would be small, with little if any effect on coverage but with a very small effect on the frequency of individual high frequency words involved in transparent compounds. This would be consistent with the way in which hyphenated words are treated.

For me, one of the greatest benefits of the BNC/COCA lists has been to shed some light on how large a native speaker's vocabulary could be and on the kinds of words that make up such a vocabulary. If a native speaker's vocabulary is based on word families, as common sense and the first language evidence suggest (Nagy, Anderson, Schommer, Scott & Stallman, 1989; Bertram, Baayen & Schreuder, 2000; Bertram, Laine & Virkkala, 2000), looking at how difficult it is to find non-technical word families after the 25th 1000 makes one aware that most native-speaking adults' general vocabulary is somewhere around the twenty thousand level. Brysbaert, Stevens, Mandera and Keuleers (2016) argue that the definition of word family used in the BNC/COCA lists is too conservative, because it restricts what affixes can be used and requires the word stem to be a possible free-standing word (a free form). They argue for an adult size closer to ten to eleven thousand




families. I can accept their arguments. It is clear, however, that earlier estimates of vocabulary sizes of fifty thousand or a hundred thousand or more word families are way out of the ballpark.

The BNC/COCA lists represent one way of working with lists. Researchers like Mark Davies, Marc Brysbaert, and Rob Waring use relational databases to work with lists, giving them much greater flexibility. However, the quality of their lists still depends on how well the families are made, chosen and ordered.


Chapter 14

Specialized word lists


Paul Nation, Averil Coxhead, Teresa Mihwa Chung and Betsy Quero

The Academic Word List (Coxhead, 2000) and similar word lists (Gardner & Davies, 2014) look at a general academic vocabulary that can be used across a wide range of subject areas and disciplines. The Science Word List (Coxhead & Hirsh, 2007) narrows its focus to words that occur across the sciences. There is a growing number of word lists and studies of specialized areas that look at the technical vocabulary of a particular subject area such as commerce, engineering, nursing or medicine. In this chapter we look at the uses of such studies and lists, and examine the methodology for making specialized word lists.

Values of specialized word lists

The values of specialized lists range from providing knowledge about the nature of vocabulary and vocabulary size to providing guidance for teachers in subject areas.

Studies of technical vocabulary can contribute to vocabulary size studies. Such studies can answer questions like the following: How much does technical knowledge add to your vocabulary? How big is a technical vocabulary? This goal requires technical vocabulary to be classified into technical vocabulary available through general knowledge and that available only through specialized knowledge. If we take medicine as an example, technical words available through general knowledge include words like arm, lungs, and penicillin. These words are common in medicine but are known by most people. Some may not wish to call these technical words, but they are closely related to the field of medicine. Words that are available only through specialized knowledge include words like costal, xiphoid, and haemoglobin. Some technical areas present less of a vocabulary barrier than others because their technical vocabulary is generally widely known.

Studies of technical vocabulary can help us understand the size of the vocabulary learning task that learners face when beginning to study a technical area.

Studies of technical vocabulary can suggest paths towards dealing with such vocabulary from a curriculum perspective, either going directly into the study of


technical vocabulary, or progressing through high frequency and general academic vocabulary before learning technical vocabulary.

Studies of technical vocabulary can guide the development of appropriate vocabulary learning strategies, particularly for word parts that are important in the specialized area.

Studies of technical vocabulary can help in developing subject matter material for English for Academic Purposes courses and can support teachers' awareness of words that may need attention. This requires frequency-ranked technical word lists.

Studies of technical vocabulary can help in examining the role of technical vocabulary in specialized texts and its possible effects on comprehension. This requires coverage studies using word lists and examination of the discourse role of technical vocabulary in specific texts.

Cross-language studies of technical vocabulary can show the amount and nature of technical word borrowing.

Studies of technical vocabulary can help in developing tests of previous topic knowledge, because knowledge of technical words is a good indicator of familiarity with the topic.

Some of these purposes have significance for the particular specialized area focused on. Others contribute to our knowledge of the field of vocabulary studies in general.

Making a specialized word list

Technical vocabulary consists of words and phrases that are closely related to the ideas covered in a particular subject area. Because of this close meaning connection between the words and the subject area, some technical vocabulary may occur only in that subject area, or may take on a narrowed technical meaning in that subject area, and most technical vocabulary will occur more often in the subject area than outside it.

There are several ways of trying to identify technical vocabulary. The most obvious way is to get an expert in the field to say which words are closely connected with the field and which are not. Typically a rating scale is used for this (see Chung & Nation, 2004 for an example). This turns out to be a very time-consuming and difficult process, and experts in the specialist field can differ considerably on what they consider to be a technical word in their subject area.

Technical dictionaries are also a source of technical vocabulary, but the procedures and criteria used in compiling the dictionaries are typically not explained, and a researcher may wish to use a more transparent and replicable process for identifying technical vocabulary. Chung and Nation (2004) found that a dictionary


contained about 80% of the technical vocabulary identified by a corpus-comparison process.

Another approach is to examine technical texts to see what words are highlighted and explicitly defined in a specialized text. It is also possible to examine diagrams, illustrations and labelled pictures in the texts to see what words are explained by pictorial means. Chung and Nation (2004) found that not enough of the technical words were explicitly dealt with in these ways.

The corpus-comparison approach (Chung, 2003) has been shown to be an efficient and effective way of finding most technical terms, and we look at this in the next section.

Using a corpus-comparison approach to make technical word lists

The corpus-comparison approach involves developing a corpus that represents the specialized area you wish to study, and making a similarly sized general corpus that does not contain any texts related to the specialized area that is the focus of the study. The two corpora are run at the same time through a word frequency counting program like AntWordProfiler or Range, and the output is put into a spreadsheet such as Microsoft Excel. The two frequencies of each word are compared, and words that are unique to the specialized corpus or have a much higher frequency in it are highly likely to be technical words.

Having large enough corpora of a similar size may be necessary to ensure that the non-technical words have an equal chance to occur and thus do not have differing frequencies that suggest they are technical words. It is also important that the technical corpus properly represents the field in that it covers the topics that are seen to be essential to the field. Generally, technical word studies have not made much use of a technical corpus divided into sub-corpora, although Quero (2015) used two large basic textbooks of medicine and compared the occurrence of technical vocabulary in them.

One of the issues faced in such a study is deciding what the unit of counting should be – word type, lemma, or family? The decision is usually in favour of the word type, because not all the members of a word family are technical words. It would be acceptable to have a mixture of units of counting – for some words the type, for some the lemma, and for some the family – as long as all the members of a lemma or family were closely related in meaning and were technical words in that field.

Most technical dictionaries contain large numbers of phrases and acronyms.
A thorough study of the technical vocabulary of a particular subject area would need to include these units of counting, requiring a principled decision about whether to count the phrases separately or to combine them in a count with single-word units. Some of the issues involved were discussed in Chapter 6 on multiword units.


Another issue is whether highly topic-related words which are widely known (e.g. bone, muscle, leg in medicine) should be counted as technical vocabulary, or whether only specialized words which are largely unique to the subject area should be counted. The answer depends, of course, on the purpose of the study. A study of the nature of technical vocabulary would count all technical words, familiar and unfamiliar. A study that considers what needs to be learned might at least separate unique and lower frequency technical words from those that are already known.

Related to the idea of familiar and unfamiliar technical words is the idea of degrees of technicalness. That is, are some words more technical than others? The same criterion would need to run through all steps in a scale of technicalness, and that criterion might be the accessibility of the word to a non-specialist.

The corpus-comparison approach uses a cut-off ratio. That is, if a word is x times more frequent in the specialized corpus, it is likely to be a technical word. It is not clear what this ratio should be. Chung (2003) used 50 times more frequent; other researchers have used different ratios (Quero, 2015). It may not be necessary to have a threshold-type cut-off point, and the size of the ratio may also depend on the corpus size.
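The comparison step described above is easy to sketch in code. The following is a minimal illustration of the frequency-ratio idea, using Chung's (2003) ratio of 50 as the default; the toy corpora are invented, and a real study would work with word types from corpora of a million tokens or more and would still check the candidates manually.

```python
# Sketch: corpus-comparison identification of technical word candidates.
# A type is flagged if it is unique to the specialized corpus, or if its
# specialized frequency is at least `ratio` times its general frequency.
from collections import Counter

def technical_candidates(spec_tokens, gen_tokens, ratio=50):
    spec, gen = Counter(spec_tokens), Counter(gen_tokens)
    candidates = set()
    for word, f_spec in spec.items():
        f_gen = gen.get(word, 0)
        if f_gen == 0 or f_spec / f_gen >= ratio:
            candidates.add(word)
    return candidates

# Toy corpora of equal size-ish: "xiphoid" is unique to the specialized
# corpus; "patient" is 50 times more frequent there; "the" is not skewed.
spec = ["xiphoid"] * 3 + ["patient"] * 100 + ["the"] * 200
gen = ["patient"] * 2 + ["the"] * 200
print(sorted(technical_candidates(spec, gen)))  # ['patient', 'xiphoid']
```

With corpora of unequal sizes, the raw frequencies would first be normalized (e.g. per million tokens) before the ratio is applied.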

Making an academic word list

We have already looked at general academic word lists in Chapter 1. There is considerable debate over the usefulness of general academic word lists, with those in favour presenting the following arguments.

1. Teachers on pre-university English proficiency programs are typically faced with classes containing learners about to study a wide range of disciplines. Part of their study should focus on the vocabulary that is shared by these disciplines.
2. The sub-technical (academic) vocabulary is often not well known, and is not salient in academic texts, making it less likely to be learned than the technical vocabulary. It thus needs attention.
3. Academic vocabulary covers a large proportion of an academic text, usually around 10% or more. This means that at least one word in every line will be a general academic word. The time spent learning such words will be well repaid by the opportunities to meet and use them.
4. The most useful academic words are such a small group that it is easily feasible to learn them during a pre-university English course.

The arguments against separating out academic vocabulary are as follows.


1. The meanings and uses of academic words are different in different disciplines. Academic words need to be learned in relation to a specific subject area, so that their discipline-specific senses and collocations are learned (Hyland & Tse, 2007; Hyland, 2008).
2. Word lists encourage learning out of context. Vocabulary needs to be learned in use.
3. Academic word lists such as those made by Campion and Elley (1971), Praninskas (1972), and Coxhead (2000) assume previous knowledge of a general service vocabulary. This assumption negatively affects the words that are included in an academic vocabulary (Gardner & Davies, 2014). Academic vocabulary needs to be seen as cutting across the three frequency levels of high-, mid-, and low-frequency words. Words can be both academic words and high frequency words, both academic words and mid-frequency words, and so on.

The arguments for and against academic vocabulary are not as opposed as they might seem, especially if one takes an inclusive approach to learning which sees opportunities for learning occurring across a range of strands (Nation, 2007). It is good to do both deliberate decontextualized learning and learning in context. It is good to learn words with their core meanings and also to give attention to their various related senses in different contexts. Learners need to learn both academic vocabulary and the vocabulary of their specific disciplines, and a well-run pre-university course will provide opportunities for these different focuses.

Coxhead's (2000) Academic Word List (AWL) was designed to replace the University Word List, which was made by combining existing word lists.
The initial goal in the development of the Academic Word List was to have one million words each of Science, Humanities, Commerce and Law, but the pressure of time to complete a Masters thesis and the difficulty of obtaining texts already in electronic format meant that the figure for each faculty area was around 875,000, making a total corpus of 3,513,330 tokens. In hindsight, it would have been good to get copyright clearance for the material so that the corpus could be made freely available to other researchers. In reality, this was impossible given the way the texts were gathered, from authors rather than publishers.

To further ensure the representativeness of the corpus, each faculty area was divided into seven discipline areas, making a total of 28. This meant that each discipline area had around 125,000 tokens, which is rather small to comfortably represent that area, but it nevertheless ensured that the corpus was not heavily weighted towards a particular subject area. Two-thirds of the word families in the Academic Word List occur in 25 or more of the 28 discipline areas, and 94% of the 570 word families occur in 20 or more discipline areas. So the relatively small corpus sizes for the discipline areas did not overly restrict the occurrence of the academic words. The word family was the unit of counting


because reading was seen as being of prime importance, although it was clear that academic vocabulary needed to be used both receptively and productively, and across the four skills of listening, speaking, reading, and writing (Corson, 1995).

The Academic Word List has been widely used and has also had its fair share of criticism. A recurrent criticism is that it assumed knowledge of the General Service List. This criticism was anticipated, and Nation and Hwang's (1995) study was done to ensure that the General Service List was still good enough to use in this way. Coxhead (2000) also checked which words in the Academic Word List would better be seen as general service words rather than academic words, and found only a small number. The Academic Word List was conceptualized as the next step beyond knowing the high frequency words, but academic vocabulary is probably better seen as a separate kind of vocabulary that is not directly related to the high frequency, mid-frequency and low frequency word levels (Nation, 2013: Chapter 1), but cuts across them as technical vocabulary does (see Neufeld, Hancioglu & Eldridge (2011) for a similar suggestion).

Hyland and Tse (2007) suggested that the same word form is not used in the same way in different disciplines, and that lumping this variety of uses together in one word family misrepresents these words. That is, counting the same word forms across different disciplines might result in combining meanings that are really homonyms and not polysemes. They argued that rather than focusing on "academic" words, it is better to focus on how these words are used in the discipline that each learner is studying. This criticism was also at least partly anticipated.
Wang and Nation (2004) looked at homography in the Academic Word List, finding that it affected only a few items in any significant way: only three word families (intelligence, panel, offset) would be excluded from the list because none of their homographs met the criteria for inclusion. Hyland and Tse's criticism goes deeper than this, however, looking at the contextual use of the words as well as their meaning senses. The compromise must be not either/or but both/and. While it is ideal that learners read and study texts in their own disciplines in a pre-university course, this is what learners need to do individually or with classmates studying in the same discipline, while teachers focus on what is shared among the various disciplines. Working on the core meaning and uses of an academic word enables rather than disables current or later learning in more discipline-specific contexts.

From the viewpoint of making an academic word list, it is time to reconsider the methodology. If academic vocabulary is best viewed as a different kind of vocabulary from the three main frequency levels, then the idea of assuming a known high frequency vocabulary as a starting point may need to be rethought. Gardner and Davies' (2014) approach, using corpus-comparison and relative frequency as with technical words, was only partially successful, with their academic list including words such as between, such, however, within, student, group, program, which seem


to be only marginally academic. It may be that an assumed vocabulary is still necessary, and there are now several viable alternatives to West's (1953) General Service List, including smaller well-designed lists. If an academic word list can be made without an assumed vocabulary, it needs to contain words that are clearly academic and deserving of attention in a pre-university course on academic English.

The value of lists of academic and specialist vocabulary needs to be seen within the broader scope of the various opportunities for language learning. Lists encourage deliberate attention, and while repeated deliberate attention to vocabulary has strong effects on learning, the greatest amount of time needs to be directed towards incidental learning through language use in input and output and through fluency development. Deliberate learning and incidental learning complement and support each other, and there needs to be a balance of these types of learning (Nation, 2007; Nation, 2013). Specialized word lists and academic word lists are thus only one part of the larger picture of vocabulary learning.

Recommendations

1. It is important to consider the possible applications of the results of a study of specialized vocabulary, largely because a large list of specialized vocabulary would be of little value to non-specialist teachers wishing to prepare learners for specialist study. Specialist vocabulary needs to be learned in the context of learning the subject matter of the field.
2. A corpus-comparison approach is an efficient and effective way of identifying technical vocabulary. The results still need careful manual checking.
3. The non-specialized comparison corpus needs to be large enough for all general service words and most non-technical mid-frequency words to occur. This means the corpus needs to be well over a million tokens and preferably several million tokens.
4. The comparison corpus must not contain any texts from the specialist area.
5. The unit of counting needs to result in only technical uses being counted. This typically means using the word type as the unit of counting, or using a mixture of types, lemmas, and families where all the members are technical words.
6. Although there are useful word lists of academic English, the methodology for making word lists of this type needs investigation.


Chapter 15

Making an essential word list for beginners


Thi Ngoc Yen Dang and Stuart Webb

This chapter describes a word list study which expands on earlier studies and creates a practical wordlist that would provide a starting point for L2 beginners' lexical development. An initial aim is to identify which words should be included in an essential wordlist for L2 beginners. A second aim is to determine how many items should be included in a wordlist for L2 beginners using three criteria: practicability, change in the coverage curve, and amount of lexical coverage. The word list could serve as the foundation for L2 beginner lexical development. The points to note about the study are its choice of the unit of counting, the size of the list and its sublists, the treatment of proper noun homonyms, and the extensive validation of the list.

Which items should be included in a word list for beginners?

Analysis of established lists that were developed from large corpora using precise and valid methodologies may provide a reliable list of essential vocabulary for beginners. West's (1953) GSL was chosen as one of the source lists in the present study because it is the oldest and most influential high-frequency wordlist. Nation's (2006) BNC2000, Nation's (2012) BNC/COCA2000, and Brezina and Gablasova's (2013) New-GSL were chosen because they are three recently created high-frequency wordlists, and earlier studies (Dang & Webb, under review; Brezina & Gablasova, 2013) have shown that these lists provide higher lexical coverage than the GSL in multiple corpora.

Another high-frequency wordlist, Browne's (2013) New General Service List, was also created recently. It was not used in the present study for two reasons. First, preliminary analysis of the list as a whole showed that it provided lower average coverage per item than any of the four lists in the present study, including the GSL. Second, little has been written about the way it was developed.

In a comparison of the lexical coverage provided by items in the GSL, BNC2000, BNC/COCA2000, and New-GSL in nine spoken and nine written corpora, Dang and Webb (under review) found that each list had both strong and weak items. This suggests that a list made of the best items from the four lists may provide greater coverage than any one list. Moreover, because high-frequency words occur


frequently in a wide range of texts, validation of the items in high-frequency wordlists should be based on coverage in a larger number of corpora, with a greater degree of variation in the English they represent, than has been used in earlier studies. The present study aims to fill this gap by ranking the items from all the lists based on their lexical coverage in a large number of corpora representing different discourse types and varieties of English.

What should be the unit of counting in a wordlist created for beginners?

An important issue when developing wordlists is the unit of counting. The GSL, BNC2000, and BNC/COCA2000 used Level 6 word-families (Bauer & Nation, 1993) as the unit of counting, while the New-GSL used lemmas (Level 2 word-families). The choice of word-families as the unit of counting is based on the assumption that, if learners know one word-form, they may recognize its inflected and closely derived forms. In contrast, the choice of Level 2 families is based on the assumption that, if learners know one word-form, they may only recognize its inflected forms; Level 2 word-families in the Bauer and Nation scheme involve only inflected forms. Each option has its advantages and disadvantages, and the choice of the unit of counting should be based on the characteristics of the target list-users (Gardner, 2007).

In wordlists for L2 beginners, Level 2 word-families are more suitable than Level 6 word-families for two reasons. First, L2 beginners' morphological awareness may be limited, and it may be inappropriate to assume that if they know one member of a word-family, they will recognize its derivational forms. This is supported by Schmitt and Zimmerman's (2002) and Ward and Chuenjundaeng's (2009) studies, which found that not all derivational members of a word-family were known by L2 learners. Second, for L2 beginners who lack sufficient English morphological knowledge, and for their teachers, a Level 2 word-family list may be more useful than a Level 6 word-family list. Level 2 lists consist mainly of high-frequency lemmas (study) while word-family lists are made up of both high-frequency (study) and low-frequency lemmas (studious, studiously). Introducing Level 2 lists to L2 beginners will draw their attention to the high-frequency words first.
By developing knowledge of these most important forms, it may be easier to learn the infrequent members of the same word-family at a later stage of lexical development. For these reasons, the present study chose Level 2 word-families rather than Level 6 word-families as the unit of counting for the EWL. However, unlike the traditional definition of lemmas, which separates parts of speech, the present study defined Level 2 families as a word-form (headword) plus its inflections without distinguishing between parts of speech. This expanded version of the lemma has been called a flemma (family lemma), but in this study we will refer to them as Level 2 families.
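The regrouping of a Level 6 word-family into Level 2 families can be sketched in code. The following is an illustrative sketch only, not the study's actual procedure: the suffix test is a deliberately crude stand-in for a full lemmatization scheme, and the function names are my own.

```python
# Illustrative sketch (not the study's actual procedure): regroup the
# members of a Level 6 word family into Level 2 families (flemmas),
# i.e., a headword plus its inflected forms only.

INFLECTIONAL_SUFFIXES = ("s", "es", "d", "ed", "ing")

def is_inflection_of(form, headword):
    """Crude check: is `form` the headword itself, or the headword
    (with final -y -> -i, or a dropped final -e) plus a suffix?"""
    if form == headword:
        return True
    stems = {headword}
    if headword.endswith("y"):
        stems.add(headword[:-1] + "i")   # study -> studi(es|ed)
    if headword.endswith("e"):
        stems.add(headword[:-1])         # make -> mak(ing)
    return any(form == stem + suffix
               for stem in stems for suffix in INFLECTIONAL_SUFFIXES)

def regroup(members):
    """Split one word family into flemmas: shorter unmatched forms
    become headwords; inflections attach to an existing headword."""
    flemmas = {}
    for form in sorted(members, key=len):  # candidate headwords first
        for head in flemmas:
            if is_inflection_of(form, head):
                flemmas[head].append(form)
                break
        else:
            flemmas[form] = [form]
    return flemmas
```

For the study family this reproduces the grouping given above: study, studied, studies, and studying fall under study, while studious and studiously each become their own Level 2 family.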

Chapter 15.  Making an essential word list for beginners 155



Research questions

The aim of the present study is to develop a wordlist for L2 beginners by including the best items in terms of lexical coverage from the GSL, BNC2000, BNC/COCA2000, and New-GSL. It sought to develop the Essential Word List (EWL) by answering the following seven questions.








1. What is the mean coverage provided by each set of 100 Level 2 headwords from a master list made up of Level 2 headwords from the GSL, BNC2000, BNC/COCA2000, and New-GSL in 18 corpora?
2. What is the mean coverage provided by each set of 100 Level 2 headwords plus members from the master list in the 18 corpora?
3. How many headwords should be included in an EWL?
4. Do the EWL headwords provide higher mean coverage in 18 corpora than the best headwords from each of the source lists from which the EWL was developed (GSL, BNC2000, BNC/COCA2000, and New-GSL)?
5. Do the EWL families provide higher mean coverage in 18 corpora than the best families from each of the source lists?
6. What is the overlap between the EWL headwords and the best headwords from the master list that were found in nine spoken corpora?
7. What is the overlap between the EWL headwords and the best headwords from the master list that were found in nine written corpora?

Materials

The master list

A master list was created of Level 2 word family (flemma) headwords from four source lists: West's (1953) GSL, Nation's (2006) BNC2000, Nation's (2012) BNC/COCA2000, and Brezina and Gablasova's (2013) New-GSL. Because word families at Level 2 of Bauer and Nation (1993) were chosen as the unit of counting in the present study while word-families at Level 6 were the unit of counting in the original versions of the GSL, BNC2000, and BNC/COCA2000, Level 6 word-families from these lists were converted into Level 2 families. This was done by regrouping the GSL, BNC2000, and BNC/COCA2000 word-family members following Leech, Rayson, and Wilson's (2001) principles for creating lemmatized wordlists. For example, the word-family study has six members: study, studied, studies, studying, studious, and studiously. When converted into flemmas, these members were grouped into three families: study (study, studied, studies, studying), studious (studious), and studiously (studiously). Once the conversion had been completed, the


156 Making and Using Word Lists for Language Learning and Testing

Level 2 word family versions of the GSL, BNC2000, and BNC/COCA2000 had 6,601, 6,465, and 6,412 headwords, respectively. Because there was overlap between the items in the four lists, repeated headwords were excluded, leaving 8,722 headwords in the master list. A further 66 headwords were then excluded. These items were letters (e.g., B, X) (20), affixes (e.g., anti, non) (7), the names of cities (2), people's names (2), and the names of places and languages (35). Although learning the letters of the alphabet is important, letters were excluded because it was assumed that L2 beginners would know them before learning English words. Learning affixes, especially high-frequency affixes, has value for L2 beginners because they may have insufficient English morphological knowledge. However, it may be more reasonable to introduce a list of affixes once learners have reached a certain level rather than introducing affixes together with words right at the beginning (Nation, 2013). Proper nouns such as the names of cities and people were not included because they are usually transparent and may have less value to learners than content words. The names of places and languages were excluded for two reasons. First, to be consistent with the decision to exclude the names of people and cities, these proper nouns should not be included in the master list. Second, the 35 names of places and languages may be biased towards the corpora from which the source lists were developed. For example, Scot, a BNC2000 headword, appeared 496 times in the BNC but not in several corpora of other English varieties. This suggests that Scot was included in the BNC2000 not because it is a high-frequency word, but because it occurred very frequently in the BNC, the corpus from which the BNC2000 was developed.

Table 15.1  Nine spoken corpora used in the present study

Name                                                 Tokens       Variety of English
British National Corpus (spoken component)           10,484,320   British
International Corpus of English (spoken component)    5,641,642   Indian, Philippino, Singapore, Canadian, Hong Kong, Irish, Jamaican & New Zealand
Open American National Corpus (spoken component)      3,243,449   American
Webb and Rodgers (2009a) movie corpus                 2,841,573   British & American
Wellington Corpus of Spoken New Zealand-English       1,112,905   New Zealand
Hong Kong Corpus of Spoken English                      977,923   Hong Kong
Webb and Rodgers (2009b) TV program corpora             943,110   British & American
London-Lund corpus                                      512,801   British
Santa Barbara Corpus of Spoken American-English         320,496   American





Table 15.2  Nine written corpora used in the present study

Name                                                 Tokens       Variety of English
British National Corpus (written component)          87,602,389   British
Open American National Corpus (written component)    12,839,527   American
International Corpus of English (written component)   3,467,451   Indian, Philippino, Singapore, Canadian, Hong Kong, Irish, Jamaican, New Zealand & American
Freiburg-Brown corpus of American-English             1,024,320   American
Freiburg–LOB Corpus of British-English                1,021,357   British
Wellington Corpus of Written New Zealand-English      1,019,642   New Zealand
Lancaster-Oslo/Bergen corpus                          1,018,455   British
Brown corpus                                          1,017,502   American
Kolhapur Corpus of Indian-English                     1,011,760   Indian

The corpora

Nine spoken and nine written corpora were used in the present study to examine the coverage provided by the headwords from the master list (Tables 15.1 and 15.2). These 18 corpora were in the form of untagged text files. They varied in terms of size, type of discourse, and variety of English. The number of tokens ranged from 320,496 to 10,484,320 in the spoken corpora, and from 1,011,760 to 87,602,389 in the written corpora. The corpora represented 10 varieties of English: American-English, British-English, Canadian-English, Hong Kong-English, Indian-English, Irish-English, Jamaican-English, New Zealand-English, Philippino-English, and Singapore-English. Thus, it was expected that the 18 corpora would provide a thorough picture of the vocabulary that is essential for L2 beginners.

Procedure

This study had three phases: (1) ranking the Level 2 headwords in the master list according to the mean coverage they provided in the 18 corpora, (2) determining the number of headwords to include in the EWL, and (3) assessing the EWL. Phase 1 related to determining the relative value of items in the four source lists. Phase 2 determined the cut-off point of the EWL. Phase 3 focused on evaluating the EWL.



Ranking the headwords in the master list

Four steps were followed to determine the ranking of the headwords in the master list. First, the frequency of each headword was examined in each corpus. This was done by running each corpus through the Range program with the master list serving as the baseword list. Range is a program which analyses the lexical coverage provided by a wordlist in a text. It can be downloaded from Paul Nation's website (http://www.victoria.ac.nz/lals/about/staff/paul-nation). The second step was to calculate the coverage provided by each headword in each corpus. In this step, the frequency of each headword was divided by the number of running words in the corpus and multiplied by 100. For example, the coverage of programme in the Wellington Corpus of Spoken New Zealand-English (WSC) was 0.015% (165 ÷ 1,112,905 × 100 = 0.015%). The third step was to calculate the mean coverage of each Level 2 family in all 18 corpora. This was done by adding together the coverage provided by the headword in each of the 18 corpora and then dividing by the number of corpora (18). Mean coverage of the headwords across the corpora was more useful than combined frequencies because combined frequency would bias the results towards findings in the largest corpora. By using the mean coverage of the headwords across 18 different corpora, range of lexical coverage was a key criterion for ranking the items in the present study. The fourth step was to rank the headwords from the master list according to their mean coverage. That is, Level 2 word family headwords with the largest mean coverage were at the top of the master list while those with the smallest mean coverage were at the bottom.
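Steps 2 to 4 of this procedure amount to a simple computation. The sketch below is illustrative only (it is not the Range program); the data structures `freqs` and `sizes`, and the function names, are assumptions of mine, not the study's.

```python
# Illustrative sketch of steps 2-4 of the ranking procedure.
# `freqs` maps corpus name -> {headword: frequency}; `sizes` maps
# corpus name -> number of running words in that corpus.

def coverage(frequency, corpus_size):
    """Step 2: coverage of one item in one corpus, as a percentage."""
    return frequency / corpus_size * 100

def mean_coverage(headword, freqs, sizes):
    """Step 3: mean of the item's coverage across all corpora."""
    covs = [coverage(freqs[name].get(headword, 0), size)
            for name, size in sizes.items()]
    return sum(covs) / len(covs)

def rank_headwords(headwords, freqs, sizes):
    """Step 4: sort headwords by mean coverage, largest first."""
    return sorted(headwords,
                  key=lambda h: mean_coverage(h, freqs, sizes),
                  reverse=True)
```

The worked example above is reproduced by `coverage(165, 1_112_905)`, which rounds to 0.015%.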

Determining the number of EWL headwords

To determine how many headwords should be included in the EWL, two steps were followed. In the first step, the mean coverage provided by each set of 100 headwords from the master list and by each set of 100 Level 2 word families was determined. The present study examined the mean coverage provided by master list items at every 100-headword level up to the 2,000-headword level. The mean coverage provided by each set of 100 headwords was calculated by adding together the mean coverage of each headword in the set. For example, the mean coverage provided by the 1st 100-headword set was the sum of the mean coverage of each item in the top 100 headwords of the master list. To determine the mean coverage provided by sets of 100 Level 2 families, the coverage provided by each set of 100 families in each corpus was determined by running each corpus through Range with each set serving as the baseword list. Then, the mean coverage provided by each set of 100 Level 2 word families was calculated by adding the coverage in each of the 18 corpora together and dividing by 18.

In the second step, the cut-off point of the EWL was decided based on three criteria: practicability, change in the lexical coverage curve, and amount of lexical coverage. Practicability considered the size of the EWL in relation to the feasible amount of vocabulary that can be acquired by L2 learners within a language program. The purpose of the present study is to develop a more practical wordlist for L2 beginners; therefore, practicability was the primary criterion for determining the length of the EWL. It influenced the decisions related to the other two criteria: change in the lexical coverage curve, and amount of lexical coverage. Change in the lexical coverage curve involved examining the change in the lexical coverage provided by each set of 100 Level 2 headwords, and by these headwords plus members. Coverage by headwords is the actual coverage that learners may gain if they know the headwords. Coverage by headwords plus members reflects the potential coverage that learners may achieve if they can recognize members of these headwords. Although headwords were chosen as the primary unit of counting in the present study because they were usually the most frequent member in a lemma, it is still possible that some members are more frequent than their headwords. Therefore, it is also useful to use the coverage provided by Level 2 word families as a criterion. Using both units of counting to decide the cut-off point provides an indication of how knowledge of two related but different units of counting might affect comprehension. Amount of lexical coverage examined the number of words necessary to reach different lexical coverage figures. Earlier studies have decided the length of a list based on the amount of vocabulary necessary to reach 95% coverage of text. However, lower coverage figures may still provide some indication of learners' progress in overall language development and assist teachers and course designers in organizing their English language programs to support learners' comprehension, as well as their lexical development. The number of words needed to reach different coverage figures was determined by the cumulative coverage provided by each set of 100 Level 2 headwords, and by the cumulative coverage provided by these headwords plus members.
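The cut-off computations just described can be sketched as follows. This is an illustrative sketch only; the function names are mine, and `mean_covs` stands in for the ranked list of per-headword mean coverage percentages from the master list.

```python
# Illustrative sketch of the cut-off computations (names are mine).
# `mean_covs` is the ranked list of per-headword mean coverage
# percentages from the master list.

def coverage_per_set(mean_covs, set_size=100):
    """Additional coverage contributed by each successive set of items."""
    return [sum(mean_covs[i:i + set_size])
            for i in range(0, len(mean_covs), set_size)]

def cumulative_coverage(per_set):
    """Running total of the per-set coverage figures."""
    totals, running = [], 0.0
    for extra in per_set:
        running += extra
        totals.append(running)
    return totals

def words_needed(mean_covs, target, set_size=100):
    """Smallest number of ranked words (in whole sets) whose cumulative
    coverage reaches `target` percent, or None if it is never reached."""
    for sets, total in enumerate(
            cumulative_coverage(coverage_per_set(mean_covs, set_size)), 1):
        if total >= target:
            return sets * set_size
    return None
```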

Assessing the EWL

Four criteria were used to evaluate the EWL. The first criterion involved a comparison between the mean coverage provided by the EWL headwords in the 18 corpora and that provided by the best headwords in terms of lexical coverage from the four source lists. The second criterion compared the mean coverage provided by the EWL word families with the mean coverage provided by the best word families from each source list. The mean coverage provided by the best items in terms of lexical coverage in each source list was determined by following the same steps used to find the mean coverage provided by the EWL items. The third criterion was the overlap between the EWL headwords and the best headwords in terms of lexical coverage from the master list that were found in the nine spoken corpora. The fourth criterion was the overlap between the EWL headwords and the best headwords in terms of lexical coverage from the master list that were found in the nine written corpora. To determine the best headwords in the nine spoken corpora and in the nine written corpora, the same steps used to select the EWL headwords were followed. Using both the coverage provided by headwords and the coverage provided by Level 2 word families as criteria to compare the EWL with the four source lists provided a better picture of the actual coverage and the potential coverage that L2 beginners may gain by knowing these wordlists. Looking at the mean coverage of the EWL and of the best items in the source lists in the 18 corpora demonstrated the relative value of the lists in general, while the overlap between the EWL headwords and the best headwords in spoken and written corpora assessed the value of the EWL in different kinds of discourse. Together, these four criteria should provide a thorough assessment of the EWL.
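The overlap criteria reduce to simple set operations. A minimal sketch follows (the function names are mine, not the study's):

```python
# Minimal sketch of the overlap criteria (function names are mine).

def overlap_percent(ewl_headwords, best_headwords):
    """Percentage of EWL headwords also found among `best_headwords`."""
    ewl, best = set(ewl_headwords), set(best_headwords)
    return len(ewl & best) / len(ewl) * 100

def overlap_breakdown(ewl_headwords, best_spoken, best_written):
    """How many EWL headwords fall in both, only one, or neither of
    the best-spoken and best-written headword sets."""
    ewl, sp, wr = set(ewl_headwords), set(best_spoken), set(best_written)
    return {
        "both": len(ewl & sp & wr),
        "spoken_only": len((ewl & sp) - wr),
        "written_only": len((ewl & wr) - sp),
        "neither": len(ewl - sp - wr),
    }
```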

Ranking the headwords in the master list

The coverage provided by the sets of 100 headwords from the master list up to 2,000 headwords, as well as examples of items from each set, are shown in Table 15.3. The mean coverage figures for the items in the different sets reflect their varying relative values. Those at higher levels are of greater value to language learners than those at lower levels. The 1st 100-headword level included items such as the and okay. The 10th 100-headword level included items such as garden and huge. The 20th 100-headword level included items such as consumer and loud. In answer to Research Questions 1 and 2, the 1st 100 headwords provided mean coverage of 45.68% and the 1st 100 Level 2 word families (flemmas) provided 55.46% coverage. After the 1st 100 headwords, the mean coverage fell quickly. The 2nd 100 headwords provided mean coverage of only 5.62%; plus members, they provided 6.71% coverage. The coverage provided by headwords from the 3rd, 4th, 5th, 6th, and 7th 100-headword levels was 2.94%, 2.08%, 1.61%, 1.29%, and 1.04%, respectively. The coverage provided by these headwords with their members was 3.71% (3rd 100-headword level), 2.70% (4th 100-headword level), 2.06% (5th 100-headword level), 1.78% (6th 100-headword level), and 1.36% (7th 100-headword level). Beyond the 8th 100-headword level, the mean coverage provided by each 100-headword set was less than 1%, while the mean coverage provided by families was less than 1% by the 10th 100-headword level.




Table 15.3  Additional coverage provided by the master list headwords and members at each 100 lemma headword level in 18 corpora

Headword level   Examples               Headwords (%)   Headwords & members (%)
1st 100          the, okay              45.68           55.46
2nd 100          sure, maybe             5.62            6.71
3rd 100          sorry, hey              2.94            3.71
4th 100          please, run             2.08            2.70
5th 100          alright, hi             1.61            2.06
6th 100          thanks, ok              1.29            1.78
7th 100          hello, bye              1.04            1.36
8th 100          drink, fast             0.89            1.31
9th 100          tea, heavy              0.77            1.11
10th 100         garden, huge            0.69            0.99
11th 100         busy, weather           0.62            0.90
12th 100         fresh, draw             0.56            0.76
13th 100         active, holiday         0.51            0.72
14th 100         fire, ride              0.46            0.63
15th 100         shoot, lake             0.41            0.61
16th 100         tiny, neck              0.37            0.54
17th 100         vast, snow              0.34            0.49
18th 100         attractive, channel     0.32            0.45
19th 100         journey, calm           0.29            0.43
20th 100         consumer, loud          0.27            0.43

Determining the number of EWL headwords

In answer to Research Question 3, the three criteria (practicability, change in the lexical coverage curve, and amount of lexical coverage) provide support for 800 items as the cut-off point for the Essential Word List. As the primary criterion, practicability will first be discussed alone, and then in relation to the other two criteria. Practicability indicates that the EWL should have no more than 1,000 items. Earlier research on vocabulary growth has shown that L2 learners can acquire around 400 word families (Webb & Chang, 2012) or 500 lemmas (Milton, 2009) in a year. At this modest vocabulary growth rate, learning a list of more than 1,000 items may be too ambitious a goal for L2 beginners within an institution. This is supported by earlier research showing that EFL students from a range of contexts often fail to master the 1st 1,000 items despite a lengthy period of English instruction (Webb & Chang, 2012; Henriksen & Danelund, in press; Nurweni & Read, 1999; Quinn, 1968). A wordlist of fewer than 1,000 items is a more feasible task that might be learned within a single institution over two years. It focuses learners' attention on the most important items, which provide a much larger amount of lexical coverage than the subsequent 1,000 items (Dang & Webb, under review; Engels, 1968).

Figure 15.1  Coverage by each set of 100 headwords
[Line graph: coverage (%), from 0 to 60 on the y-axis, plotted against each set of 100 lemma headwords (1st to 10th) on the x-axis; one line shows coverage by headwords, the other by headwords & members.]

Practicability was then considered together with the other two criteria to determine a cut-off point within the 1st 1,000-headword level. Figure 15.1 illustrates changes in the coverage curves up to the 10th 100-headword level. The lower line presents the coverage provided by sets of headwords while the upper line presents the coverage provided by sets of headwords plus members. In both cases, there was a decline in the coverage provided by each set of 100 items as the headwords became less frequent. There was a huge drop in coverage between the 1st and 2nd 100-headword levels. From the 2nd to the 8th 100-headword levels, the amount of additional coverage, though not as high as that at the 1st 100-headword level, was still relatively large. However, beyond the 8th 100-headword level, the curve flattens out and the amount of additional coverage was less than 1%. The small change in coverage between sets of 100 headwords beyond the 800 cut-off point suggests that the sequencing of items becomes less reliable because of the small difference in the mean coverage provided by headwords in adjoining levels. That is, items which are in the 9th 100 could just as well be in the 10th 100. The lexical coverage curve criterion suggested two possible cut-off points for the EWL: 100 words and 800 words. If only the lexical coverage curve were used as the criterion to determine the cut-off point, 100 words would have been the more reasonable option because there is an extremely large decrease in coverage between the 1st and 2nd 100-headword sets. However, when the lexical coverage curve was considered together with practicability, 800 words is the better option. In the 800-item list, 78% of the words were lexical words and 22% were function words. In contrast, the percentages of lexical words and function words in the 100-item list were 28% and 72%, respectively. Lexical words are "words that convey content meaning" while function words are "words that express grammatical relationship" (Biber, Conrad, & Leech, 2002: 457–458). As lexical words enable L2 beginners to express their ideas, a list with an insufficient number of lexical words may not be very useful for them. Therefore, an 800-item list seems more appropriate than a 100-item list when the coverage curves and practicability are considered together.

Table 15.4  Cumulative mean coverage provided by the master list headwords and members at each cut-off point in 18 corpora

Number of headwords   Headwords (%)   Headwords & members (%)
100                   45.68           55.46
200                   51.30           62.17
300                   54.24           65.88
400                   56.32           68.58
500                   57.93           70.64
600                   59.22           72.42
700                   60.26           73.78
800                   61.15           75.09
900                   61.92           76.20
1,000                 62.61           77.19

The 800-item cut-off point was also supported by the third criterion (amount of lexical coverage). The top 800 headwords provided mean coverage of 61.15%, and potential coverage of 75.09% if all members of the lemmas were known (Table 15.4). The purpose of the EWL is to provide L2 beginners with the foundation for further vocabulary learning. Learning a relatively small number of words but reaching the 60% and 75% levels of coverage might be considered meaningful and practical for all stakeholders: teachers, program coordinators, and students. In this case, learning the 800 headwords would allow students to recognize over 60% of the running words in English texts, and as much as 75% if all members of the Level 2 families are known. The pedagogical significance of gaining knowledge of such a large proportion of English through studying a relatively short wordlist should be motivating to all stakeholders. Taken together, the three criteria suggested that 800 items should be the number of items in the EWL. The EWL is included in Appendix 3.



Assessing the EWL

In answer to Research Questions 4 and 5, the EWL headwords and families provided higher mean coverage in the 18 corpora than the best 800 headwords and similarly-sized families from each source list. The coverage provided by the EWL headwords in the 18 corpora was 61.16%. This is higher than the coverage provided by the top 800 GSL headwords (57.86%), top 800 BNC2000 headwords (57.66%), top 800 BNC/COCA2000 headwords (58.39%), and top 800 New-GSL headwords (60.83%). Similarly, the EWL Level 2 families provided higher mean coverage in the 18 corpora (75.09%) than the best 800 Level 2 families from each source list (72.24%, 71.63%, 72.72%, and 74.92%, respectively). This is not surprising because the EWL headwords were the best items from the four source lists. The fact that the EWL families provided the highest coverage also strongly supports the choice of headwords as the primary unit of counting in the present study. It might be assumed that the top-ranked 800 items in the best source list in the comparisons (the New-GSL) are a reasonable substitute for the EWL. However, our analysis indicated that while there were many strong items in the New-GSL (and the other three lists), the rank order of the items is quite different when based on their coverage in the 18 corpora. There were 186 items that differed between the EWL and the top 800 items of the New-GSL, indicating that the lists are quite different, and that the EWL is not simply a replica of the New-GSL. In answer to Research Questions 6 and 7, 86.5% of the EWL headwords (692 items) appeared among the best 800 spoken headwords, and 698 EWL headwords (87.25%) were included among the best 800 written headwords. Importantly, 590 of the 800 EWL headwords (73.75%) appeared in both the best 800 spoken headwords and the best 800 written headwords. Among the 210 remaining headwords, 102 (12.75%) appeared in the top 800 spoken headwords alone and 108 (13.5%) appeared in the top 800 written headwords alone.
The fact that most EWL headwords appeared in both the top 800 spoken and the top 800 written headwords, and that there was a good balance in the number of remaining headwords unique to spoken and written discourse, indicates that the EWL includes basic words that are necessary for both written and spoken texts. This suggests that it would likely meet the needs of L2 beginners.





Discussion

As well as providing higher mean coverage in 18 corpora, the EWL has seven other strengths that make it superior to the source lists. First, unlike the four source lists, the number of items in the EWL was determined by examining the issue from different perspectives with the characteristics of the target list-users (L2 beginners) in mind. Therefore, it may better reflect L2 beginners' needs. Second, the EWL items may have greater validity than those from the four source lists. Unlike the four source lists, in the selection of EWL words, the frequencies of 110 words that can be either proper nouns or common words (e.g., frank, mark) were adjusted to reflect the real value of learning these items. That is, the frequency with which a headword occurred as a proper noun was subtracted from the total frequency of the headword in the corpus. For example, in the WSC, mark appeared 176 times in total, but it was used as a proper noun 77 times. Therefore, the final frequency of mark in the WSC was 99. Without this adjustment, mark would be among the top 800 headwords of the master list. Also, the frequency of 92 headwords which had American variants was adjusted by adding the frequency of the American variants to the total frequency. For instance, the final frequency of programme in the WSC (165) was the sum of the frequency of the British variant, programme (148), and its American variant, program (17). Counting the frequencies of both British and American variants in the final frequency of the headwords ensures that the EWL better represents the essential vocabulary that learners often encounter in different language contexts. Moreover, while the other lists included letters (BNC2000), proper nouns (BNC2000), and affixes (BNC2000, BNC/COCA2000), the EWL excluded these items.
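The two frequency adjustments just described amount to one arithmetic rule. The sketch below is illustrative only (it is not the study's code); it reproduces the mark and programme examples from the WSC.

```python
# Illustrative sketch of the frequency adjustments (not the study's
# code): subtract proper-noun occurrences, add spelling-variant
# occurrences.

def adjusted_frequency(total, proper_noun_uses=0, variant_uses=0):
    """Final frequency = raw frequency - uses as a proper noun
    + uses of the (e.g., American) spelling variant."""
    return total - proper_noun_uses + variant_uses
```

For mark in the WSC, `adjusted_frequency(176, proper_noun_uses=77)` gives 99; for programme, `adjusted_frequency(148, variant_uses=17)` gives 165.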
Without this treatment, 12 names of places (e.g., Indian, London), 18 letters (e.g., B, Y), one affix (non), and seven items that can be either proper nouns or common words (e.g., mark, lord) would be included in the EWL. This would have meant excluding 39 items from the EWL, including colour, dollar, fight, park, and television. Compared with names of places, letters, and affixes, these words should provide greater value to L2 beginners. Third, the EWL included items which are very common in general conversation but are absent from some source lists. For example, okay and alright do not appear in the GSL and New-GSL; hey, hi, hello, and bye are absent from the New-GSL. A wordlist which contains common words in general spoken conversation might be more valuable for L2 beginners because, "for most people, the spoken language is the main source of exposure to language, and is thus the main engine for language change and dynamism" (McCarthy & Carter, 1997: 38). With the widespread use of Communicative Language Teaching and Task-based Language Teaching approaches that pay more attention to spoken language, a list with a considerable number of words common in spoken discourse may be very attractive to teachers and learners. Fourth, unlike the GSL, BNC2000, and BNC/COCA2000, which use Level 6 word-families as the unit of counting, the EWL uses Level 2 families (flemmas). This is a more reasonable decision because the EWL neither requires sophisticated morphological knowledge nor includes low-frequency lemmas. Therefore, it is more appropriate for L2 beginners, who are unlikely to be able to recognize many family members. Fifth, the EWL items were selected according to their lexical coverage in 18 corpora representing different discourse types and 10 different varieties of English. In contrast, the items in the earlier lists were selected on the basis of at most four corpora. Moreover, the frequencies of both American and British variants were counted in the development of the EWL. Hence, the EWL should better represent the essential vocabulary encountered by learners in diverse situations. Sixth, while none of the four source lists distinguishes between function words (e.g., the, of, in, at) and lexical words (e.g., know, big, people), the EWL was divided into a list of 624 lexical words and a list of 176 function words. Although there are a number of ways of classifying function words and lexical words, to be consistent, the present study follows Biber et al.'s (2002) classification. Words which can be either function words or lexical words (e.g., have, past) are considered function words. However, to allow flexibility in the implementation of the EWL, teachers and learners can reclassify some EWL items into the function word or lexical word lists. Classifying the EWL items into function words and lexical words has pedagogical value because of their different characteristics. In a text, lexical words are more salient than function words; therefore, the way to deal with lexical words should be different from the way to deal with most function words (Carter & McCarthy, 1988).
It will be best to sequence the teaching of lexical words according to their frequency. However, it is more reasonable to incorporate the teaching of function words into other components of language lessons because of their lack of salience in the text. No other word list has made the distinction between lexical and function words. This also makes the EWL more pedagogically appropriate. Seventh, the EWL list of lexical words has sub-lists of manageable sizes. While the other lists either do not have sub-lists (New-GSL) or have 1,000-item sub-lists that might be too large to be incorporated effectively into language learning programs (GSL, BNC2000, BNC/COCA2000), the EWL list of lexical words is divided into 13 sub-lists according to decreasing mean coverage. The first 12 sub-lists have 50 headwords each while Sub-list 13 has 24 headwords. The mean coverage provided by each sub-list ranges from 6.26% (Sub-list 1) to 0.20% (Sub-list 13). Breaking the EWL list of lexical words into 50-headword sub-lists has two benefits.




Chapter 15.  Making an essential word list for beginners 167

First, the size of the sub-lists is small enough to fit into individual courses within an English language program. Second, teaching the EWL lexical words following the rank order of sub-lists will increase learning effectiveness because it ensures that the most useful items are learned first. It also allows programs to prepare a curriculum that covers all sub-lists, and avoids teaching the same items in different courses.

With these strengths, the EWL is a more suitable list for L2 beginners than the four source lists. Considering the influence of the GSL in vocabulary learning and practice, it is hoped that in the long run the EWL will receive the same attention from textbook authors, course designers, teachers, learners and researchers. However, promoting the use of the EWL does not mean that the present study does not recognize the value of the four source lists. The GSL, BNC2000, BNC/COCA2000, and New-GSL still have value, but perhaps they are more useful for intermediate-level learners and researchers than for L2 beginners.
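The mechanics of this division are simple. As an illustration (a sketch in Python, not Dang and Webb's own procedure), a headword list already ranked by decreasing mean coverage can be split into 50-item sub-lists; the function name and the placeholder words are invented:

```python
def make_sublists(ranked_words, size=50):
    """Split headwords (already ranked by decreasing mean coverage)
    into fixed-size sub-lists; the final sub-list may be shorter."""
    return [ranked_words[i:i + size] for i in range(0, len(ranked_words), size)]

# 624 lexical headwords yield 13 sub-lists: twelve of 50 and a final 24.
sublists = make_sublists(["word%d" % n for n in range(624)])
print(len(sublists), len(sublists[0]), len(sublists[-1]))  # 13 50 24
```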


section v

Using the lists


chapter 16


Using word lists

In this chapter, we look at how word lists can be used for course design, language teaching and learning, the design of graded reading programs, analyzing the vocabulary load of texts, and developing vocabulary tests.

Course design

Setting long-range vocabulary learning goals

An important role of course design involves setting language learning goals. Word lists and the data gathered from their use in analyzing corpora provide very practical ways of deciding what and how much vocabulary needs to be learned at various stages of language learning. The research on text coverage using the BNC/COCA word family lists (Nation, 2006) has suggested the levels of high, mid-, and low frequency words (Schmitt & Schmitt, 2014; Nation, 2013) which are seen as long-range learning goals for learners of English as a foreign language. The most recent changes have been the expansion of the traditional 2000 high frequency words (West, 1953; Nation, 2001b) to 3000, and the distinguishing of 6000 mid-frequency words (4th 1000 to 9th 1000) from the low frequency words (Schmitt & Schmitt, 2014). The high frequency and mid-frequency levels are seen as being the major receptive vocabulary learning goals for those who want to be able to read in English and use English for a range of everyday purposes.

Because of this, there is still a lot of interest in creating the ultimate high frequency word list, either through combining existing lists, or creating new lists. Such work might be more productively directed towards making smaller word lists to meet shorter term learning goals, or towards examining the size and nature of the under-researched mid-frequency word lists and improving them. This is because different high frequency word lists will overlap to a large degree (around 80% to 90%) and the non-overlapping words will result from corpus differences (Nation & Hwang, 1995; Sorell, 2013). There can be improvements in high frequency word lists if there are dramatic improvements in corpus construction with a greater availability of colloquial spoken texts and texts produced for and by younger users of English. Apart from the Essential Word List and the first two 1000-word BNC/
COCA lists, current corpora used for high frequency word lists are too heavily formal, written and adult. However, it is fair to say that the field is reasonably well served by the number and range of available high frequency word lists.

The influence of word lists on curriculum design is rather uncertain. Most course designers do not take account of vocabulary knowledge and vocabulary levels in a systematic way. An exception to this is the Academic Word List and the old University Word List. There are several course books that explicitly and systematically set out to teach the words on these lists (Schmitt & Schmitt, 2005; Valcourt & Wells, 1999). This may be at least partly because the Academic Word List is a manageable length for an intensive pre-university course with 570 word families divided into ten lists. Longer lists of 2000 words or more may seem too overwhelming, requiring planning across several years of a course. Part of Dang and Webb's (Chapter 15 this book) motivation in creating a relatively short essential word list was to have a list length that teachers could see as something they could cover within their own course. Their division of the list into smaller sub-lists also had this goal.

The use of word lists in course design requires course designers to have some knowledge about the nature of vocabulary frequency, vocabulary control, and the teaching and learning of vocabulary. It also requires course designers to be aware of existing word lists and their relevance for particular groups of students. Only a few Masters programs in TESOL, Second Language Acquisition or Applied Linguistics have a course on the teaching and learning of vocabulary. The vocabulary aspect of course design requires the choice of a suitable list that fits the knowledge and needs of the learners.
Once the list has been chosen, the course designer then needs to make sure that the words in the list are covered in the course, and more importantly that words outside the lists make up only a minimal proportion of the words that occur in the course. There are two ways of doing this (Nation & Macalister, 2010: 72). One way, a series approach, is to work through the list of words making sure that each is covered and repeated. Another way, a field approach, is to prepare and organize the materials for the course making sure that words outside the lists are largely excluded, giving attention to the target vocabulary as it appears. The field approach is typical of graded reader schemes where books are written at a particular level without any deliberate attempt to cover particular words. The field approach can also be applied to courses, and if the material is in electronic form, it can be run through a program like AntWordProfiler to make sure that there are no glaring vocabulary omissions or inclusions.
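As a rough illustration of what such a profiling pass does (a toy Python stand-in for AntWordProfiler or Range, not their actual code), the sketch below reports the coverage a target list gives over a text and the word types that fall outside it; the function name and the tiny word list are invented:

```python
import re

def off_list_tokens(text, word_list):
    """Return the text coverage given by the list, plus the off-list types."""
    tokens = re.findall(r"[a-z]+", text.lower())
    known = set(word_list)
    outside = [t for t in tokens if t not in known]
    coverage = 1 - len(outside) / len(tokens)
    return coverage, sorted(set(outside))

target = ["the", "cat", "sat", "on", "a", "mat", "near", "dog"]
cov, outside = off_list_tokens("The cat sat on the mat beside the dog.", target)
print(round(cov, 2), outside)  # 0.89 ['beside']
```

A course designer would run every unit of draft material through a check like this and then rewrite around the flagged words or add them to the target list.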




Deciding on short-term vocabulary learning goals

Word lists tend to be around 1000 words long, and while native speakers of English may increase their vocabulary size by close to 1000 words a year, most non-native speakers struggle to do this. Logically, such a vocabulary learning rate is possible. For deliberate vocabulary learning, forty weeks a year, five days a week, at a rate of five new words per day would result in 1000 words being learned. The amount of input needed to have a chance of doing this incidentally through extensive reading for the 3rd 1000 words is around 300,000 tokens (the equivalent of three novels) which at a slow to moderate average reading speed of 150 words per minute would require only ten minutes reading per day, five days per week, forty weeks per year (Nation, 2014). A well-organized English program should easily be able to arrange both of these mutually reinforcing vocabulary learning opportunities.

Nonetheless it is likely that word lists of 1000 words or more are just too long for teachers to use to plan for short-term learning goals. There is value for curriculum design and the preparation of teaching materials including computer-based programs in having smaller word lists of less than a hundred or just a few hundred words long. The early stages of most graded reading series are in steps of a few hundred words, and sub-lists of the Essential Word List and the Academic Word List are just fifty and sixty words long. A set of fifty or sixty word cards is a suitable size for intensive word card learning.

There are two approaches to deciding on the size of word lists. The approach used in graded reading schemes is to have lists of gradually increasing size. This may be because as word frequency becomes lower, more words are needed to provide the same percentage of text coverage as each of the preceding lists in the series.
Nation and Wang (1999), using data from graded readers, provided a way of working out an ideal series of list sizes based on this idea, so that the words at each target level make up a consistent small percentage of the running words and reading is not burdensome. The other approach to list size is simply to decide how many words there should be in a list and make each list the same length. It would be good to make this decision of list length in a rational way, basing it on research evidence or some practicality criterion such as words per lesson or course book unit, or some regular proportion of 1000 (100, 250 or 500) so that the lists easily fit with the 1000 word lists.

One of the most useful very short word lists is the survival vocabulary for foreign travel (Nation & Crabbe, 1991). This is a list of roughly 120 words and phrases which have proven to be particularly useful for greeting people and being polite, finding food and accommodation, shopping, getting around and getting help. The list has been translated into several languages.
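Why equal-coverage levels must grow in size can be shown with a toy model. The sketch below assumes word frequencies follow a rough Zipf distribution (frequency proportional to 1/rank); this is an assumption of the sketch, not a figure from Nation and Wang's graded reader data. It counts how many word ranks are needed for each successive 10% slice of text coverage:

```python
# Toy Zipf model: weight of the word at rank r is 1/r.
N = 10_000
weights = [1.0 / r for r in range(1, N + 1)]
total = sum(weights)

# Find the rank at which each successive 10% of coverage is reached.
boundaries, acc = [], 0.0
for rank, w in enumerate(weights, start=1):
    acc += w / total
    if acc >= 0.10 * (len(boundaries) + 1):
        boundaries.append(rank)
        if len(boundaries) == 5:
            break

sizes = [boundaries[0]] + [b - a for a, b in zip(boundaries, boundaries[1:])]
print(sizes)  # each successive equal-coverage level needs more words
```

Each level in the output is larger than the one before it, which is exactly the growth pattern equal-coverage graded reader steps would have to follow.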


Dang and Webb (see Chapter 15 this volume) used two frequency-based criteria (change in the lexical coverage curve using levels of 100 words, and percentage of text coverage) and a practicality criterion to decide on the cut-off point for their Essential Word List of lemmas. Different parts of speech were included in the same lemma, so walk as a noun was included with walk as a verb. After the 800 word level, each successive 100 word level gave less than 1% coverage of text and the coverage curve started to level out. Dang and Webb also saw 800 words as a practical learning goal for two years of study of English as a foreign language.

To evaluate their list, Dang and Webb looked at text coverage over nine spoken and nine written corpora, and the overlap of items with their master list. The coverage of the Essential Word List was 61.156%, which was higher than the coverage provided by the top 800 lemmas in other word lists (by around 0.4% to 3.5%). Most of the words in the Essential Word List appeared in the top 800 spoken words and 800 written words, showing its value for both spoken and written use. The 800 word list has been broken into sub-lists of 50 words.

A notable strength of the Essential Word List is its excellent representation of spoken English. When lists are made from largely written corpora, words like hello, bye, alright, okay, yes, which are essential in spoken English, are not included. The Essential Word List includes them.

Because the sole criterion for inclusion is frequency, some members of sets are missing. Twelve is included but not eleven, and of the -teens only fifteen and nineteen occur. The tens (twenty, fifty) are similarly patchy. Sunday is the only day of the week included. Summer is included but not winter or spring. Survival words like hungry and thirsty or excuse (as in excuse me) are not there. Neither are toilet or bathroom, or sick, ill, hurt or pain.
These examples of missing words are given here to show that while frequency and range of occurrence are very important criteria, there are additional reasons for including words in a basic word list. Having said that, the Essential Word List is undoubtedly one of the best high frequency general purpose word lists which are solely frequency-based.
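Dang and Webb's first criterion, the point where the coverage curve levels out, can be sketched as a scan for the first 100-word band that adds less than 1% coverage. The per-band coverage figures below are invented for illustration, not their published data:

```python
def find_cutoff(band_gains, band_size=100, threshold=1.0):
    """band_gains: % text coverage added by each successive
    frequency-ranked band of band_size words. Return the word count
    at which the added coverage first falls below the threshold."""
    for i, gain in enumerate(band_gains):
        if gain < threshold:
            return i * band_size
    return len(band_gains) * band_size

# Invented per-band coverage gains (%) for the first ten bands:
gains = [25.0, 9.0, 5.0, 3.0, 2.2, 1.8, 1.4, 1.1, 0.9, 0.7]
print(find_cutoff(gains))  # 800
```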

Language teaching and learning

A well-balanced language learning program has a roughly even mixture of the four strands of learning from input, learning from output, deliberate learning, and fluency development (Nation, 2007; Nation, 2013). Let us look at how word lists can inform each of these strands.




Vocabulary lists and learning from meaning-focused input

Learning from meaning-focused input occurs when learners read or listen with the main aim of understanding the messages they receive. To have meaning-focused input at all levels of proficiency, it is essential that at least until learners know around 3000 word families and preferably more, there are large amounts of graded material available which is largely but not completely within the vocabulary that they already know. The relatively low proportion of words beyond the learners' current knowledge represents the new words to learn, and they can be largely learned through meeting them in context, with reference to a dictionary or gloss if necessary. Word lists play a key role in the selection and preparation of such material (see the following section on graded reading), and in doing the research that can justify and provide guidelines for planning such a program. Such research has shown that dealing with and learning from large amounts of input is feasible (Nation, 2014), and that the vocabulary of such material needs to be adapted to reduce the heavy vocabulary load (Nation & DeWeerdt, 2001; Nation, 2009; Nation, 2016 forthcoming). Lexical frequency profilers such as Range and AntWordProfiler can be used to help select appropriate materials that include target vocabulary in meaning-focused input (Webb & Nation, 2008).

Vocabulary lists and learning from meaning-focused output

Word lists such as the survival word list for foreign travel (Nation & Crabbe, 1991) and lists of high frequency spoken words and phrases provide a very quick way of getting spoken communication going. West (1956; 1960; 1968) saw the need to develop a minimum adequate vocabulary for speech which would allow learners to communicate with native speakers. He saw that one way to do this was to include a basic vocabulary which could be used to define words to fill productive vocabulary gaps. The development of a minimum adequate vocabulary for speaking could begin with corpus-based research, but would undoubtedly require a lot of careful needs analysis.

Vocabulary lists and deliberate language-focused learning

Learning from word cards or flash cards: Any discussion of vocabulary lists immediately triggers negative ideas of the deliberate rote learning of words out of context. However, when efficiently done with the right words, such learning is very effective. There is plenty of good research evidence to support it (Nation, 2013: Chapter 11). Nevertheless, deliberate vocabulary learning should make up only a very small part of
a course. Using the four strands framework as a guide, the deliberate rote learning of vocabulary should make up no more than one quarter of the language-focused learning strand, which makes it one-sixteenth or less of the total course time, including time spent learning out-of-class. Guidelines for independent deliberate learning using word cards or electronic flash cards can be found in Nation (2008; 2013: Chapter 11). An excellent set of guidelines for evaluating flash card apps can be found in Nakata (2011).

Well-designed word lists are an essential resource for effective deliberate vocabulary learning using word cards or a flash card app. Such word lists need to be in reasonably small groupings. Coxhead's (2000) Academic Word List has sub-lists which each contain 60 words with 30 in the last list. Dang and Webb's (see Chapter 15 this volume) sub-lists for the 800 word Essential Word List each contain 50 words with 24 in the last list. Both Coxhead's and Dang and Webb's sub-lists were created using range and frequency data. Fifty or sixty words are reasonable sizes for groups of words for deliberate learning. Fewer words may be needed for absolute beginners. Unfortunately, a lot of commercial or free flash card programs do not use well-researched lists.

Learners' dictionaries: Word list research has an important role to play in the construction of learners' dictionaries: in the choice of words to go into the dictionary, in the annotation of words in the dictionary to indicate their relative usefulness, and in the construction of a well-controlled defining vocabulary to convey their meaning. Although virtually all dictionaries draw on a corpus of one type or another, it was only in the 1980s that computer-based corpus research was used in compiling a dictionary for foreign learners, namely Collins COBUILD English Language Dictionary for learners of English as a foreign language (Sinclair, 1987).
Carroll, Davies and Richman (1971) produced the Word Frequency Book for the American Heritage Dictionary which was intended primarily for native speakers of English using a very carefully designed corpus of school texts totaling five million tokens. Such a corpus size now seems small, and subsequent research has shown much larger corpus sizes are needed to get reliable results for the fifteen thousand or more word families that students at the end of secondary school are likely to know. Nonetheless, the high quality of the corpus design makes the Carroll, Davies and Richman research an exemplary model. The Collins COBUILD dictionary was the first to include frequency markings to indicate the most useful words. This was originally done using diamond shapes with the more filled diamonds indicating greater frequency. The Longman Dictionary of Contemporary English (Summers, 1978, 2001) used research by Kilgarriff (1997) to
indicate which of the words were in the 1st 1000, 2nd 1000, and 3rd 1000 for each of speech and writing. The rankings were indicated by the labels S1, S2, S3, W1, W2, W3. Putting this information in a dictionary is a very innovative and useful move. At the very least high-frequency words should be marked, and because it is possible to get reliable lists up to the 9,000 word level (Sorell, 2013), mid-frequency words should also be marked, with perhaps 1000 level gradations within each of these two kinds of words. This would be useful for both teachers and learners.

In 1935, Michael West developed a 1,490 word definition vocabulary consisting of the smallest number of words that could be used to define other words. This list was published in the same year as his New Method Dictionary and was used to define the words in the dictionary, so it is a well-tested list. Because so much work has gone into this list, it seems a waste not to use it, updating it if necessary. It could usefully be checked against Basic English as a test of its adequacy, because Basic English was used to write several texts and to make a dictionary. The Longman Dictionary of Contemporary English (Summers, 1978) uses a defining vocabulary of 2000 words.

Books or computerized sets of vocabulary activities: Vocabulary learning from flash cards is a kind of deliberate vocabulary learning activity, probably the best one, because when done properly, every word that is studied is eventually learned. This is because any word not learned early on is recycled for continued study. There are lots of deliberate learning activities for vocabulary learning, but some are better than others (Hulstijn & Laufer, 2001), and anyone writing a program for vocabulary learning should look at the research on involvement load (see Nation, 2013: Chapter 3 for a review), and also consider Technique Feature Analysis (Nation & Webb, 2011: Chapter 1).
In the same way that word lists can provide well-researched graded input for word card learning, word lists are useful input for vocabulary activities.
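The recycling idea, that cards not yet recalled return to the deck until every word is eventually learned, can be sketched in a few lines of Python. The deck, the `recall` model and its two-exposure threshold are all invented for illustration; a real learner's recall is of course not this mechanical:

```python
from collections import deque

def study(words, recall):
    """Cycle through a deck; words not yet recalled go back in."""
    deck = deque(words)
    learned, exposures = [], 0
    while deck:
        word = deck.popleft()
        exposures += 1
        if recall(word):
            learned.append(word)
        else:
            deck.append(word)   # recycle the card for continued study
    return learned, exposures

# Toy recall model: a word is recalled from its second exposure onwards.
seen = {}
def recall(word):
    seen[word] = seen.get(word, 0) + 1
    return seen[word] >= 2

learned, exposures = study(["walk", "summer", "twelve"], recall)
print(sorted(learned), exposures)  # ['summer', 'twelve', 'walk'] 6
```

The point of the loop is the `else` branch: nothing leaves the deck without eventually being recalled, which is why word card learning can claim that every studied word is learned.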

Vocabulary lists and fluency development

The essence of the strand of fluency development is that learners should work with easy, familiar material in each of the four skills of listening, speaking, reading, and writing with the aim of increasing the speed with which they use the language receptively and productively. Until learners have a vocabulary size of at least three thousand words and preferably more, this material for fluency development needs to be strongly vocabulary controlled so that the learners meet no new words in their fluency activities. Text adaptation programs like AntWordProfiler use vocabulary lists to analyze the material to be adapted, so that there can be careful control of vocabulary.


Designing graded reading programs

A well-designed graded reading scheme has the vocabulary specified for each level of the scheme. Books are written or adapted for each level so that they contain few or no words outside the current level and the preceding levels. After reading several books at one level, the learner then moves on to books at the next level. Because of the strict vocabulary control, there are only a few unknown words per page in each book and so reading has a chance to be successful and enjoyable. When moving to the next level, the step up is not too great in that there are still only a few unknown words per page.

In several places in this book, we have questioned the research basis for word lists for graded reader programs. Current graded reading schemes are of a very high standard in terms of the quality of the books produced. The winners of the annual Extensive Reading Foundation competition are very good evidence of that. However, there is little published research on what would be an ideal graded scheme. It may be that the word list research for such a scheme needs to be done by an individual or organization not connected with any publisher. The free availability of such a list might eventually lead to its general adoption and thus make integrating the various publishers' series of graded readers more straightforward.

The general principles behind such graded reader lists would be (1) that the most useful words are met early on in the lists, (2) the size of the steps from one level to another should be large enough to provide some useful vocabulary to learn but not so large that there is too high a density of unknown words (Michael West's suggestion of one unknown word in 50 running words on average is still a reasonable guide), (3) the graded readers should cover at least the 3000 high frequency word families of English.
The lists should be closely followed when writing texts, but essential topic vocabulary from outside the current level or outside the lists can be used, provided each added word is repeated several times in the text. Principle 2 on the size of the steps means that the steps will get bigger through the scheme because it will take more different words to cover the one or two per cent of running words of the text which are outside the learners’ current knowledge. This is because the frequency of the words will be lower in successive sub-lists of the scheme. It may be possible to design an effective scheme with equally sized steps of one or two hundred words, but this needs to be tested in practice.
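West's guideline of one unknown word in 50 running words is easy to check mechanically. The following Python sketch computes the unknown-word density of a draft text against a known-word list; the list and text are invented for illustration, not taken from any graded reader scheme:

```python
import re

def unknown_density(text, known_words):
    """Proportion of running words falling outside the known list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    unknown = sum(1 for t in tokens if t not in known_words)
    return unknown / len(tokens)

known = {"the", "dog", "ran", "to", "a", "big", "house", "and", "sat",
         "down", "by", "door", "it", "was", "very", "old", "man", "in"}
text = ("The dog ran to a big house and sat down by the door. "
        "It was a very old house and a grumpy man was in the house.")
density = unknown_density(text, known)
print(round(density, 3), density <= 1 / 50)  # 0.037 False
```

Here one unknown word ("grumpy") in 27 running words already exceeds the 1-in-50 guideline, so a writer would either replace the word or repeat it enough times to make it learnable.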




Analyzing the vocabulary load of texts

With the ready availability of text analysis programs like Tom Cobb's VocabProfiler on his Compleat Lexical Tutor website (www.lextutor.ca), Range, and AntWordProfiler, there is a rapidly growing body of research on the vocabulary load of texts and on the development of specialized word lists. In a typical study, a program such as VocabProfiler is used to analyze a text. The text is pasted into the profiler and the program analyzes it by seeing how many words in the text are at various frequency levels and how many are not in any of the lists. A judgement is made on the vocabulary difficulty (vocabulary load) of the text by looking at what proportion of the tokens in the text are at the low frequency word list levels or outside the lists.

Such research needs to look very carefully at words outside the lists and see if (1) they are members of already existing families (they should be added to the families if they fit the word family criteria), (2) they are errors in the text being analyzed (they should be corrected in the text), (3) they are words which are already well known by the potential readers of the text (they should be counted as familiar vocabulary), (4) they are L1 words or foreign words known by the potential readers (they should be counted as familiar vocabulary), or (5) they are repeated topic words closely related to the meaning of the text (these words may need to be discussed as a separate group as they are likely to be learned while reading the text).

Vocabulary load research depends on the use of well-constructed word lists. Fortunately there is an openness regarding such lists, with the list makers freely sharing them through publication and through postings on web sites. This can only improve the quality of such lists and thus the research that uses them. Much of the research on text coverage has been too uncritical of its own weaknesses.
Such research requires very careful preparation of the corpora and updating of the lists as described in Chapters 10 and 11 of this book. The words that appear in the corpus but which are not in any list need very careful analysis both in preparation for the research and as part of the analysis of the results. Although the focus is on text coverage by known words, it is the words outside the lists that determine text difficulty from a vocabulary perspective. Viewed in this way, text coverage is a rather blunt instrument for carrying out analysis. There is some research evidence that a learner’s vocabulary size is a better predictor of text difficulty than text coverage (Larson, 2016; Webb & Paribakht, 2015). Research by Chujo and Utiyama (2005a, 2005b) provides useful guidelines for carrying out text coverage research.
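The five-way treatment of off-list words described above amounts to a routing decision for each word found outside the lists. A minimal sketch, with invented function name and placeholder category sets rather than real research data:

```python
def triage(word, families, errors, known_to_readers, l1_or_foreign, topic_words):
    """Route an off-list word to one of the five treatments."""
    if word in families:
        return "add to an existing family"
    if word in errors:
        return "correct in the text"
    if word in known_to_readers:
        return "count as familiar"
    if word in l1_or_foreign:
        return "count as familiar"
    if word in topic_words:
        return "treat as a repeated topic word"
    return "genuinely off-list"

label = triage("walked", families={"walked"}, errors=set(),
               known_to_readers=set(), l1_or_foreign=set(), topic_words=set())
print(label)  # add to an existing family
```

In practice the category sets would be built by hand during corpus preparation; the point of the sketch is simply that only the words reaching the final branch should count against a text's coverage.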


Developing vocabulary tests

Vocabulary proficiency tests measure learners' vocabulary knowledge without reference to a particular course of study. A vocabulary levels test, for example, measures how well certain frequency levels of the language are known. The original Vocabulary Levels Tests (Nation, 1983; Schmitt, Schmitt & Clapham, 2001) tested the 2000, 3000, 5000, Academic Word List and 10,000 levels. Note that the levels are not all adjacent to each other in that the 4000 level and the 6000 to 9000 levels are not tested. These original tests relied heavily on the Thorndike and Lorge (1944) word book, and now that better lists are available, better levels tests are beginning to appear.

A vocabulary size test measures how many words a learner knows in the language, and this can then be used to see how many more words need to be learned to do certain language tasks such as read unsimplified texts, or follow movies or TV programs. A vocabulary size test thus needs to use a sample that represents the vocabulary of the language. This sample of words used to be taken from dictionaries, but such sampling was fraught with difficulty, typically resulting in a very large sampling error (Nation, 1993). Sampling from word lists avoids most of these difficulties but depends on the quality of the lists. Do the lists include all the necessary words, and are the words ordered in a reliable way?

When sampling from a frequency list, a random sample can be taken from each frequency level or every nth word can be sampled from each frequency level. The use of frequency levels ensures that the words in the sample represent a range of relevant frequency levels and are not biased towards any one level. Research indicates that a sample needs to include around 30 words to have a chance of being reliable (Beglar & Hunt, 1999; Schmitt, Schmitt & Clapham, 2001).
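The every-nth-word sampling just described can be sketched in a few lines of Python. The 1000-word band and the 30-item target come from the discussion above; the function and placeholder words are invented:

```python
def sample_level(level_words, items=30):
    """Take every nth word from a frequency band so the sample
    spreads across the whole band rather than clustering at the top."""
    n = max(1, len(level_words) // items)
    return level_words[::n][:items]

band = ["word%d" % i for i in range(1000)]   # stand-in for a 1000-word band
sample = sample_level(band)
print(len(sample), sample[0], sample[1])  # 30 word0 word33
```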
This means that in vocabulary levels tests where there is interest in how well each level is known, each level needs to contain around thirty items. In a vocabulary size test where the total score is the focus of attention, the whole test needs to contain at least thirty items, with more items likely to increase reliability.

The words in word lists depend on the size and quality of the corpus used for the research, and there is always possible variability in what words get into a particular sub-list or into the complete set of lists. Vocabulary tests based on word lists are affected by this variability and we thus need to be cautious when interpreting their results, even though this variability may be relatively small. Sorell (2013) found that even up to the 9000 word level, word lists made from different corpora using the same types of texts varied by around 500 to 800 words (6% to 9%), provided a large enough corpus of around 20 million tokens was used.

The unit of counting used in a word list for a vocabulary test must match the kind of knowledge that is being measured, and must take account of the proficiency
level of the learners taking the test. Proficiency level particularly relates to the level of morphological knowledge that the learners can cope with. Much more research is needed on the development of morphological knowledge in foreign language learners. The small amount of relevant research focuses on the productive control of affixes rather than receptive control (Schmitt & Meara, 1997; Schmitt & Zimmerman, 2002), although the research by Mochizuki and Aizawa (2000) is an exception to this. They found a relationship between affix knowledge and vocabulary size.

The availability of word lists affects the development of vocabulary tests, and the BNC/COCA word lists have been used to develop vocabulary size tests (Nation & Beglar, 2007; Coxhead, Nation & Sim, 2014) and vocabulary levels tests (McLean & Kramer, 2015; McLean, Kramer & Beglar, 2015). The Academic Word List has also been used for test development (Schmitt, Schmitt & Clapham, 2001; McLean & Kramer, 2015).

The distinction between vocabulary size tests and vocabulary levels tests is an important one. Size tests try to measure total vocabulary including words outside the learners' usual word levels. Levels tests focus on particular frequency levels and try to provide reliable results for each frequency level in the test. When planning a teaching program and when assigning learners to graded reading, vocabulary levels tests are much more likely to provide useful results because they focus on the actual vocabulary that needs to be known.

One of the problems when sampling words from lists for a vocabulary test for learners of a particular language background is making sure that cognates or loan words are properly represented in the lists. Cognates or loan words are more likely to be answered correctly in a test than words that have no connection with an L2 learner's first language (Elgort, 2013).
Ideally, a random sample of words from a list should in itself take account of the proportion of loan words. Unfortunately, there are few if any reliable figures for how many words of a particular language are loan words from English. So, there is nothing to check the sample against. Moreover, word borrowing is occurring at such a fast rate that any figure on the number of loan words in a particular language would soon be out-of-date. For the present at least, it may have to be enough to say how many words in the sample are loan words for a particular group of users of the same language background, and how many of these were known when the test was administered compared with the words which are not borrowings.

The issue of the size of the word family is also important when making lists for vocabulary testing. An inclusive definition of a word family, such as Level 6 or higher in the Bauer and Nation (1993) levels, will underestimate the number of words known and overestimate knowledge of particular word families. Too low a level of word family will overestimate the number of families known (see Table 2.4 in Chapter 2). Existing tests such as the Vocabulary Size Test which use

182 Making and Using Word Lists for Language Learning and Testing

Ngawang Trinley (202880) IP: 118.167.28.75 On: Tue, 14 Jan 2020 02:12:15

Level 6 families are probably not inclusive enough for adult native speakers (see Brysbaert, Stevens, Mandera & Keuleers, 2016), and too inclusive for lower-proficiency learners of English as a foreign language (Schmitt & Meara, 1997; Schmitt & Zimmerman, 2002; Mochizuki & Aizawa, 2000; Ward & Chuenjundaeng, 2009; Brown, 2013; Webb & Sasao, 2013; McLean, in preparation).

This book has looked at the making and use of word lists. It was motivated by a surge in the creation of new word lists and my dissatisfaction with the methodologies used in making them. Unfortunately, this dissatisfaction includes my own word lists. Writing this book has been like most of my writing – a way of extending my learning. As I wrote the book, I realized that there was a lot I did not know about word lists and their construction. Writing the book has helped fill some of the gaps, but we still have much to learn about making corpora and making lists from them. I hope this book is just one of many steps in our learning.

appendix 1

Proper noun tagging in the BNC


The following data is copied from the British National Corpus Part of Speech Tagging Guide (http://ucrel.lancs.ac.uk/bnc2/bnc2guide.htm#m2).

Proper nouns

The tag NP0 ideally should denote any kind of proper noun, but in practice the open-endedness of naming expressions makes it difficult to capture all possible types consistently. We have confined its coverage mainly to personal and geographical names, and even within these, somewhat arbitrary borderlines have had to be drawn. Users of version 1 of the corpus should be aware of a few small but important changes in BNC2 listed below.

a. Personal names: Sally; Joe Bloggs; Madame Pompadour; Leonardo da Vinci
b. Geographical names: London; Lake Tanganyika; New York
c. Also: days of the week; months of the year: April; Sunday

Notes

The distinction between singular and plural proper nouns is not indicated in the tagset, plural proper nouns being a comparative rarity.

John Smith. All of the Smiths.

Multiwords. As the examples in (a) and (b) above show, proper nouns are not processed as multiwords (even though there may be good linguistic reasons for doing so). Each word in such a sequence gets its own tag.

Initials in names. A person's initials preceding a surname are tagged NP0, just as the surname itself. The choice whether to use a space and/or full-stop between initials (eg J.F. or J. F. or J F or JF) is determined in the original source text; the tagged version follows the same format.

John F. Kennedy = John F. Kennedy
J. F. Kennedy = J. F. Kennedy
J.F. Kennedy = J.F. Kennedy


IMPORTANT NOTE: In the spoken part of the BNC, however, the components of names – and, in fact, most words – that are spelt aloud as individual letters, such as I B M, and J R in J R Hartley, are not tagged NP0 but ZZ0 (letter of the alphabet).


Nouns of style

Preceding a proper noun, or sequence of proper nouns, style (or title) nouns with uppercase initial capitals are tagged NP0.

Pastor Tokes; Chairman Mao; Sub-Lieutenant R C V Wynn; Sister Wendy

Contrast: You remember your sister Wendy… [HGJ.800] where Wendy is in apposition to a common noun sister, in lowercase letters.

Geographical names

For names of towns, streets, countries and states, seas, oceans, lakes, rivers, mountains and other geographical placenames, the general rule is to tag as NP0. (If the word the precedes, it is tagged AT0, as normally.)

East Timor; South Carolina; Baker Street; West Harbour Lane; the United States; the United Kingdom; the Baltic; the Indian Ocean; Mount St Helens; the Alps

Ordinary (non-NP0) tags are applied to more verbose (especially political) descriptions of placenames, or those that are not typically marked on maps. (As above, the preceding word the is optional.)

Latin America; Western Europe; the Western Region; the Soviet Union; the People's Republic of China; the Dominican Republic; the Sultanate of Oman

The examples show a little arbitrariness in application, for example with United States counting, and Soviet Union not counting as proper nouns. (Also: the ex-Soviet Union [KJS.28])

NB. Multiple-word names containing a compass point, ie. those beginning North, South, East, West, North East, South-west etc. nearly always become NP0, whereas those with Northern, Southern, Eastern, Western follow the non-NP0 pattern. Rare exceptions are:

Northern Ireland; Western Samoa
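The compass-point rule is mechanical enough to express as code. The sketch below is only an illustration of the guideline as stated, not the BNC's actual tagging software, and its exception list contains just the two examples above.

```python
NON_NP0_PREFIXES = {"Northern", "Southern", "Eastern", "Western"}
COMPASS_POINTS = {"North", "South", "East", "West"}
EXCEPTIONS = {"Northern Ireland", "Western Samoa"}  # the rare exceptions noted above

def compass_name_is_np0(name):
    """Guideline sketched above: names beginning North, South, East, West
    (including North East, South-west etc.) are nearly always NP0, while
    those beginning Northern, Southern, Eastern, Western follow the
    non-NP0 pattern, apart from the listed exceptions."""
    if name in EXCEPTIONS:
        return True
    first = name.split()[0]          # first orthographic word
    if first in NON_NP0_PREFIXES:
        return False
    return first.split("-")[0] in COMPASS_POINTS  # 'South-west' -> 'South'

print(compass_name_is_np0("East Timor"))        # True
print(compass_name_is_np0("Western Europe"))    # False
print(compass_name_is_np0("Northern Ireland"))  # True (listed exception)
```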


Non-personal and non-geographical names – including eg names of organisations, sports teams, commercial products (incl newspapers), shops, restaurants, horses, ships etc. When such names consist of ordinary words (common nouns, adjectives etc.), they receive ordinary tags (NN1, AJ0 etc.). Where a word as part of a name is an existing NP0 (typically a personal or geographical name), or a specially-coined name, it is tagged NP0. Examples:

Organisations, sports teams etc.

Ordinary tags: Cable and Wireless; Acorn Marketing Limited; Wolverhampton Wanderers (football club); The Chicago Bears; World Health Organisation
Tagged NP0: Procter and Gamble; Minolta; IBM; NATO; Tottenham Hotspur (football club); Spartak Moscow; Oxfam

There is a slight inconsistency here, in that acronyms of organisation names (WHO, NATO, IBM etc.) take NP0, whereas the expanded forms of these names take regular tags.

Products (including newspapers and magazines)

Ordinary tags: Windows software; Lancashire Evening Post; Time Magazine; The Reader's Digest
Tagged NP0: Weetabix; Mars bars; Scotchgard; Perrier water

Company names may sometimes be used to represent product names; in such cases the same tags apply. For example:

John drives a Volkswagen Golf. John drives a Volkswagen.

Shops, pubs, restaurants, hotels, horses, ships etc.

Ordinary tags: Body Shop; The Grand Theatre; The King's Arms; Red Rum; The Bounty
Tagged NP0: Mothercare; Sainsburys supermarket; The Ritz; Aldaniti; The Titanic

Here again NP0 is reserved for parts of names that are specially coined, or derived from existing personal/geographical proper nouns.



Changes in NP0 assignment since BNC1

In the first release of the BNC, the use of NP0 tags applied a little more widely. The geographical category tagged NP0 used to include names of buildings and other institutions. Names of newspapers and magazines used to be treated separately from other products and tagged NP0. NOTE THAT IN BNC2 BOTH THESE TYPES NOW TAKE ORDINARY (non-NP0) TAGS:

Buildings and institutions
BNC1: Blackpool Tower; Prospect Theatre Company; Austro-Hungarian Empire
BNC2: Blackpool Tower [B22.1633]; Prospect Theatre Company [A06.1962]; Austro-Hungarian Empire [G3B.617]

Newspapers and magazines
BNC1: the Daily Mail; Railway Gazette
BNC2: the Daily Mail [D95.334]; Railway Gazette [HWM.1860]

appendix 2


Closed lexical set headwords

Numbers: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, thirty, forty, fifty, sixty, seventy, eighty, ninety, hundred, thousand, million, billion.

Days of the week: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday.

Months of the year: January, February, March, April, May, June, July, August, September, October, November, December.

Seasons: spring, summer, autumn, fall, winter.

Points of the compass: north, south, east, west.

Family members: mother, father, son, daughter, uncle, aunt, brother, sister, grand-, niece, nephew, cousin.
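When such closed sets are used in list making, the members of a set can be collapsed under a single headword so that, for example, all the days of the week count as one item. The sketch below is one illustrative way of doing that collapsing; only two of the sets above are included, and the set labels are invented.

```python
CLOSED_SETS = {
    "DAY": {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"},
    "MONTH": {"january", "february", "march", "april", "may", "june",
              "july", "august", "september", "october", "november",
              "december"},
}

def closed_set_headword(token):
    """Map a member of a closed lexical set onto its set label so the
    whole set counts as a single headword; other tokens are returned
    unchanged (lowercased).  Note that a form like 'may' is ambiguous
    between the month and the modal verb, which simple form matching
    cannot resolve."""
    t = token.lower()
    for label, members in CLOSED_SETS.items():
        if t in members:
            return label
    return t

print(closed_set_headword("Monday"))   # DAY
print(closed_set_headword("October"))  # MONTH
print(closed_set_headword("list"))     # list
```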

appendix 3

The Essential Word List


EWL list of 176 function words (each word is followed by its percentage text coverage)

the 5.076702; and 2.575968; of 2.493762; to 2.435477; a 2.137175; I 1.962055;
in 1.671146; you 1.548362; that 1.478688; it 1.454354; for 0.787768; he 0.682112;
on 0.652326; we 0.600016; they 0.576618; be 0.572301; with 0.572169; this 0.554955;
have 0.552630; but 0.525926; as 0.517697; not 0.462806; at 0.449336; what 0.423312;
so 0.419961; there 0.398397; or 0.369339; one 0.361277; by 0.354970; from 0.344882;
all 0.344091; she 0.341661; no 0.329884; his 0.328752; do 0.318649; can 0.307325;
if 0.291651; about 0.271931; my 0.270334; her 0.245510; which 0.244671; up 0.242768;
out 0.239861; would 0.231944; when 0.229850; your 0.212242; will 0.207630; their 0.198558;
who 0.195225; some 0.183434; two 0.167555; because 0.161094; how 0.153304; other 0.146183;
could 0.137160; our 0.132441; into 0.130945; these 0.130654; than 0.121168; any 0.119501;
where 0.117378; over 0.117275; back 0.115672; first 0.110854; much 0.105415; down 0.102800;
its 0.100849; should 0.099182; after 0.093835; those 0.092260; may 0.091314; something 0.091102;
three 0.090475; little 0.086051; many 0.082514; why 0.081919; before 0.079630; such 0.076452;
off 0.075810; through 0.075656; still 0.074548; last 0.073349; being 0.071413; must 0.065188;
another 0.059581; between 0.059331; might 0.056154; both 0.053861; five 0.053276; four 0.052873;
around 0.051047; while 0.050673; each 0.049819; under 0.047632; away 0.046554; every 0.046188;
next 0.045869; anything 0.045157; few 0.043898; though 0.042847; since 0.040884; against 0.038173;
second 0.037336; nothing 0.037328; without 0.037313; during 0.036829; six 0.036549; enough 0.036392;
once 0.035365; however 0.035215; half 0.034138; yet 0.032636; whether 0.032346; everything 0.030842;
until 0.030745; hundred 0.030596; within 0.028519; ten 0.026953; twenty 0.026911; either 0.026430;
although 0.024921; past 0.024446; himself 0.024254; seven 0.023815; eight 0.023031; along 0.022695;
round 0.022173; several 0.021865; someone 0.021324; whatever 0.019997; among 0.019839;
across 0.019763; behind 0.019638; million 0.018905; outside 0.018569; nine 0.017620; thousand 0.017567;
shall 0.017356; myself 0.017326; themselves 0.017138; itself 0.017120; somebody 0.017042; upon 0.017025;
thirty 0.016772; third 0.016468; above 0.016127; therefore 0.016070; everybody 0.015302; towards 0.015032;
thus 0.014827; everyone 0.014537; near 0.014437; inside 0.014433; nineteen 0.014274;
yourself 0.014113; fifty 0.014075; whose 0.013802; anyone 0.013335; per 0.012703; except 0.012239;
forty 0.010608; nobody 0.010585; unless 0.010361; mine 0.010310; anybody 0.009820; till 0.009615;
herself 0.009479; twelve 0.009250; fifteen 0.009125; beyond 0.008964; whom 0.008951; below 0.008498;
none 0.008345; nor 0.009141; more 0.186355; most 0.081376; Total 44.40842

List of 624 lexical words (each word is followed by its percentage text coverage)

Sub-list 1

know 0.393575; like 0.290277; well 0.283758; just 0.277469; think 0.215088; right 0.202345;
then 0.199168; now 0.197925; get 0.188457; time 0.176987; go 0.166317; yes 0.162819;
very 0.160123; see 0.153409; people 0.151158; here 0.148245; good 0.139109;
only 0.135443; really 0.128289; say 0.109920; mean 0.109712; come 0.106297; also 0.105675;
okay 0.105115; want 0.104179; way 0.103877; even 0.094256; new 0.093810; too 0.090746;
work 0.089604; take 0.087640; make 0.085923; year 0.079147; look 0.078027;
thing 0.075902; man 0.074627; put 0.070657; let 0.070174; day 0.069527; never 0.068699;
long 0.067751; need 0.065478; Mr. 0.064079; thought 0.063134; lot 0.062738; same 0.060849;
old 0.057487; word 0.057425; course 0.057167; life 0.056876; Total 6.256458



Sub-list 2

again 0.056625; own 0.056571; quite 0.055452; give 0.055183; home 0.054775; tell 0.054753;
world 0.054354; use 0.054158; always 0.053158; great 0.052864; kind 0.052557; actually 0.052459;
sort 0.051300; government 0.050554; house 0.049411; find 0.049356; place 0.048279;
different 0.047884; part 0.047245; sure 0.046453; point 0.045747; number 0.044106; school 0.044078;
end 0.043562; money 0.043558; better 0.043428; big 0.043236; probably 0.042851; fact 0.041579;
bit 0.041086; night 0.041073; left 0.040818; found 0.040250; high 0.038944;
help 0.038014; maybe 0.037433; far 0.037370; case 0.036894; whole 0.036503; today 0.036113;
side 0.035457; god 0.035396; week 0.035332; family 0.035265; ever 0.035249; talk 0.035168;
state 0.035060; set 0.034935; system 0.034549; keep 0.034398; Total 2.210841

Sub-list 3

problem 0.034271; love 0.034195; name 0.034186; percent 0.033760; call 0.033632; water 0.033497;
important 0.033193; country 0.033119; small 0.033112; feel 0.033107; real 0.033062; best 0.032601;
laugh 0.032538; room 0.032531; remember 0.032109; nice 0.031557; rather 0.031300;
public 0.031041; mother 0.030929; less 0.030802; later 0.030602; hand 0.03045; already 0.030438;
mind 0.030402; thank 0.030139; job 0.030005; business 0.029944; else 0.028800; group 0.028692;
question 0.028588; sorry 0.028292; show 0.028288; able 0.028240; together 0.028221;
order 0.028167; head 0.028138; least 0.028016; read 0.027935; morning 0.027898; car 0.027888;
try 0.027798; change 0.027757; general 0.027642; area 0.027571; believe 0.027292; young 0.026961;
power 0.026837; almost 0.026773; start 0.026712; person 0.026595; Total 1.505626



Sub-list 4

company 0.026472; perhaps 0.026427; ago 0.026359; hard 0.026236; form 0.026194; party 0.026178;
means 0.026066; often 0.026042; example 0.025926; play 0.025628; care 0.025454; ask 0.025271;
social 0.025263; national 0.025214; book 0.025200; father 0.025152; bad 0.025098;
war 0.025047; large 0.024974; true 0.024919; open 0.024859; pretty 0.024774; matter 0.024766;
information 0.024718; hey 0.024655; face 0.024332; early 0.024218; please 0.024182; pay 0.024181;
woman 0.024030; city 0.023972; possible 0.023485; cause 0.023377; present 0.023293;
leave 0.023256; service 0.023152; stuff 0.023112; idea 0.023085; line 0.023079; guess 0.023042;
become 0.022910; local 0.022778; run 0.022526; anyway 0.022498; full 0.022468; office 0.022461;
live 0.022156; development 0.021770; level 0.021664; understand 0.021615; Total 1.213536

Sub-list 5

fine 0.021578; certain 0.021425; turn 0.021419; control 0.021297; sometimes 0.021230; political 0.021172;
top 0.021130; story 0.021125; door 0.021057; hear 0.021042; food 0.021004; moment 0.020992;
sense 0.020959; law 0.020948; programme 0.020938; free 0.020866; front 0.020777;
language 0.020733; study 0.020485; white 0.020474; wrong 0.020441; further 0.020390; move 0.020342;
guy 0.020314; girl 0.020252; wife 0.020228; education 0.020202; close 0.020125; stop 0.020020;
class 0.020019; reason 0.019946; black 0.019817; human 0.019781; university 0.019744;
body 0.019701; act 0.019634; hope 0.019520; child 0.019455; interest 0.019321; air 0.019234;
boy 0.019119; research 0.018974; light 0.018966; data 0.018955; wait 0.018954; road 0.018953;
particular 0.018892; paper 0.018838; view 0.018664; major 0.018646; Total 1.008097



Sub-list 6

thinking 0.018646; stay 0.018643; market 0.018609; age 0.018604; support 0.018598; clear 0.018592;
police 0.018543; department 0.018537; table 0.018472; experience 0.018288; late 0.018267; cut 0.018244;
south 0.018220; report 0.018147; soon 0.017991; usually 0.017869; bring 0.017814;
society 0.017782; history 0.017775; couple 0.017769; short 0.017760; community 0.017688; cost 0.017633;
friend 0.017576; economic 0.017575; position 0.017571; period 0.017558; policy 0.017465; special 0.017439;
centre 0.017356; Mrs. 0.017228; living 0.017216; difficult 0.017193; town 0.017091;
music 0.017061; health 0.017052; buy 0.017037; certainly 0.017027; type 0.017014; street 0.016952;
deal 0.016908; news 0.016858; future 0.016784; death 0.016782; figure 0.016609; land 0.016608;
process 0.016524; meeting 0.016505; seem 0.016490; interesting 0.016293; Total 0.878261

Sub-list 7

situation 0.016293; alright 0.016116; court 0.016101; exactly 0.016098; main 0.016035; rest 0.015876;
son 0.015872; miss 0.015752; hold 0.015745; especially 0.015743; available 0.015710; mum 0.015694;
church 0.015681; rate 0.015634; council 0.015547; answer 0.015544; common 0.015496;
north 0.015492; happy 0.015408; building 0.015391; low 0.015316; meet 0.015314; west 0.015280;
art 0.015238; century 0.015004; dead 0.014941; particularly 0.014935; hi 0.014925; result 0.014886;
plan 0.014842; effect 0.014811; subject 0.014773; college 0.014764; red 0.014712;
hour 0.014663; provide 0.014640; watch 0.014564; staff 0.014518; board 0.014492; husband 0.014452;
private 0.014442; easy 0.014398; month 0.014364; evidence 0.014291; total 0.014269; indeed 0.014263;
strong 0.014262; stage 0.014170; committee 0.014104; tax 0.014101; Total 0.754961



Sub-list 8

thanks 0.014019; middle 0.013979; suppose 0.013937; field 0.013886; game 0.013850; test 0.013830;
dad 0.013794; section 0.013769; industry 0.013755; sir 0.013727; instead 0.013679; action 0.013674;
personal 0.013646; write 0.013603; team 0.013574; heart 0.013532; value 0.013523;
issue 0.013521; various 0.013489; alone 0.013476; ready 0.013440; sound 0.013433; international 0.013430;
necessary 0.013399; training 0.013318; ok 0.013283; according 0.013218; single 0.013208; term 0.013148;
stand 0.013103; individual 0.013046; similar 0.013020; doctor 0.013008; yesterday 0.013007;
nature 0.012995; increase 0.012898; Dr 0.012869; bed 0.012831; fire 0.012804; baby 0.012802;
listen 0.012797; likely 0.012794; sit 0.012749; feeling 0.012667; voice 0.012650; poor 0.012585;
self 0.012492; chance 0.012466; speak 0.012465; earlier 0.012442; Total 0.662631

Sub-list 9

finally 0.012431; quality 0.012411; happen 0.012388; amount 0.012380; role 0.012378; force 0.012369;
difference 0.012353; phone 0.012337; forward 0.012265; member 0.012231; letter 0.012157; range 0.012142;
ground 0.012137; reading 0.012119; tomorrow 0.012093; due 0.012066; knowledge 0.012059;
brother 0.012052; decision 0.012045; beautiful 0.011974; hair 0.011964; bank 0.011951; obviously 0.011937;
east 0.011927; writing 0.011913; return 0.011906; break 0.011902; project 0.011899; minute 0.011894;
check 0.011830; cold 0.011799; central 0.011757; throat 0.011717; eat 0.011716;
learn 0.011703; price 0.011679; foreign 0.011630; lead 0.011618; analysis 0.011616; post 0.011586;
evening 0.011572; hello 0.011548; hospital 0.011454; simply 0.011425; walk 0.011419; hit 0.011398;
recent 0.011382; final 0.011370; beginning 0.011345; attention 0.011307; Total 0.594553



Sub-list 10

president 0.011299; production 0.011216; trouble 0.011168; management 0.011127; account 0.011086; size 0.011085;
hot 0.011083; worth 0.011028; simple 0.010999; hell 0.010915; standard 0.010901; record 0.010835;
ball 0.010833; piece 0.010822; interested 0.010785; natural 0.010784; agree 0.010778;
modern 0.010740; student 0.010739; summer 0.010738; wish 0.010733; serious 0.010730; minister 0.010704;
growth 0.010667; blood 0.010663; bill 0.010642; trade 0.010583; list 0.010496; basis 0.010481;
floor 0.010432; fun 0.010374; financial 0.010358; ahead 0.010355; tonight 0.010349;
lower 0.010348; current 0.010342; recently 0.010342; model 0.010326; population 0.010313; funny 0.010309;
security 0.010305; send 0.010302; dog 0.010299; degree 0.010257; dear 0.010223; date 0.010207;
normal 0.010155; blue 0.010150; material 0.010136; choice 0.010129; Total 0.530673

Sub-list 11

approach 0.010088; computer 0.010085; straight 0.010074; space 0.010063; colour 0.009980; fall 0.009972;
performance 0.009969; culture 0.009952; pick 0.009945; river 0.009910; pressure 0.009899; picture 0.009886;
relationship 0.009878; dark 0.009853; drive 0.009848; visit 0.009832; green 0.009783;
teaching 0.009782; hotel 0.009781; truth 0.009745; island 0.009730; sign 0.009709; basic 0.009709;
military 0.009687; press 0.009685; spend 0.009680; consider 0.009635; sea 0.009600; complete 0.009583;
bye 0.009568; fish 0.009562; clearly 0.009556; share 0.009541; doubt 0.009538;
wide 0.009529; step 0.009480; income 0.009465; drink 0.009445; authority 0.009439; film 0.009405;
cover 0.009393; afternoon 0.009384; science 0.009330; movement 0.009321; expect 0.009321; capital 0.009315;
economy 0.009300; Christmas 0.009297; fast 0.009292; continue 0.009286; Total 0.483109



Sub-list 12

kill 0.009277; follow 0.009268; function 0.009250; practice 0.009236; former 0.009209; sister 0.009195;
absolutely 0.009169; somewhere 0.009153; include 0.009137; Sunday 0.009131; design 0.009112; wonder 0.009111;
dinner 0.009100; medical 0.009099; fair 0.009097; radio 0.009089; legal 0.009089;
pass 0.009087; nearly 0.009082; daughter 0.009072; theory 0.009055; shot 0.009045; energy 0.009044;
offer 0.009040; patients 0.009014; significant 0.008989; deep 0.008965; begin 0.008959; quickly 0.008931;
charge 0.008902; completely 0.008875; worry 0.008866; generally 0.008859; page 0.008856;
average 0.008845; respect 0.008834; structure 0.008814; club 0.008795; purpose 0.008785; specific 0.008780;
earth 0.008745; organisation 0.008717; wall 0.008712; property 0.008704; activity 0.008695; note 0.008638;
treatment 0.008625; station 0.008606; teacher 0.008598; forget 0.008576; Total 0.447832

Sub-list 13

television 0.008575; western 0.008539; opportunity 0.008535; key 0.008485; series 0.008484; born 0.008465;
park 0.008427; response 0.008413; style 0.008397;
hall 0.008379; trust 0.008368; window 0.008360; carry 0.008359; rights 0.008350; fight 0.008342;
environment 0.008336; cool 0.008312; sex 0.008286;
eye 0.008284; region 0.008272; original 0.008272; dollar 0.008270; square 0.008269; direct 0.008228;
Total 0.201008
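The coverage figures in the tables above are simply each word's percentage share of the tokens in the corpus the list was made from. The following sketch computes such figures over a tiny invented "corpus":

```python
from collections import Counter

def coverage(tokens, words):
    """Percentage of the corpus tokens accounted for by the given words."""
    counts = Counter(tokens)
    return 100 * sum(counts[w] for w in words) / len(tokens)

corpus = "the cat sat on the mat and the dog sat on the cat".split()
print(round(coverage(corpus, {"the"}), 2))               # 4 of 13 tokens -> 30.77
print(round(coverage(corpus, {"the", "sat", "on"}), 2))  # 8 of 13 tokens -> 61.54
```

Summing such figures over a whole list gives the list's total coverage, like the 44.40842 total for the function words above.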


References

A Guide to Collins English Library. (1978). Glasgow: William Collins & Son.
Adolphs, S., & Schmitt, N. (2003). Lexical coverage of spoken discourse. Applied Linguistics, 24(4), 425–438. doi: 10.1093/applin/24.4.425
Anglin, J. M. (1993). Vocabulary development: A morphological analysis. Monographs of the Society for Research in Child Development, 58(10, Serial No. 238), 1–165.
Banerji, N., Gupta, V., Kilgarriff, A., & Tugwell, D. (2012). Oxford Children's Corpus: A corpus of children's writing, reading, and education. https://www.sketchengine.co.uk/documentation/raw.../Beebox_final.docx
Barber, C. L. (1962). Some measurable characteristics of modern scientific prose. In C. L. Barber (Ed.), Contributions to English Syntax and Philology (pp. 21–43). Göteborg: Acta Universitatis Gothoburgensis.
Bauer, L., & Nation, I. S. P. (1993). Word families. International Journal of Lexicography, 6(4), 253–279. doi: 10.1093/ijl/6.4.253
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 word level and the university word level vocabulary tests. Language Testing, 16(2), 131–162.
Bertram, R., Baayen, R., & Schreuder, R. (2000). Effects of family size for complex words. Journal of Memory and Language, 42, 390–405. doi: 10.1006/jmla.1999.2681
Bertram, R., Laine, M., & Virkkala, M. (2000). The role of derivational morphology in vocabulary acquisition: Get by with a little help from my morpheme friends. Scandinavian Journal of Psychology, 41(4), 287–296. doi: 10.1111/1467-9450.00201
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257. doi: 10.1093/llc/8.4.243
Biber, D. (1995). Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511519871
Biber, D., Conrad, S., & Cortes, V. (2004). "If you look at …": Lexical bundles in university teaching and textbooks. Applied Linguistics, 25(3), 371–405. doi: 10.1093/applin/25.3.371
Biber, D., Conrad, S., & Leech, G. (2002). Longman Student Grammar of Spoken and Written English. Pearson Education Limited.
Biber, D., Johansson, S., Leech, G., & Conrad, S. (1999). Longman Grammar of Spoken and Written English. London: Longman.
Biemiller, A. (2010). Words Worth Teaching: Closing the Vocabulary Gap. Columbus: McGraw-Hill.
Biemiller, A., & Slonim, N. (2001). Estimating root word vocabulary growth in normative and advantaged populations: Evidence for a common sequence of vocabulary acquisition. Journal of Educational Psychology, 93(3), 498–520. doi: 10.1037/0022-0663.93.3.498
Boers, F. (2000). Metaphor awareness and vocabulary retention. Applied Linguistics, 21(4), 553–571. doi: 10.1093/applin/21.4.553
Boers, F. (2001). Remembering figurative idioms by hypothesising about their origin. Prospect, 16(3), 35–43.



Boers, F., & Lindstromberg, S. (2009). Optimizing a Lexical Approach to Instructed Second Language Acquisition. Basingstoke: Palgrave Macmillan. doi: 10.1057/9780230245006
Bongers, H. (1947). The History and Principles of Vocabulary Control. Woerden: Wocopi.
Bowles, M. (2001a). A quantitative look at Monbusho's prescribed word list and words found in Monbusho-approved textbooks. The Language Teacher, 25(9), 21–27.
Bowles, M. (2001b). What's wrong with Monbusho's prescribed word list? The Language Teacher, 25(1), 7–14.
Brezina, V., & Gablasova, D. (2015). Is there a core general vocabulary? Introducing the New General Service List. Applied Linguistics, 36(1), 1–22. doi: 10.1093/applin/amt018
Brown, D. (2010). An improper assumption? The treatment of proper nouns in text coverage counts. Reading in a Foreign Language, 22, 355–361.
Brown, D. (2013). Types of words identified as unknown by L2 learners when reading. System, 41, 1043–1055. doi: 10.1016/j.system.2013.10.013
Browne, C. (2014). A new general service list: The better mousetrap we have been looking for? Vocabulary Learning and Instruction, 3(1), 1–10.
Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. doi: 10.3758/BRM.41.4.977
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant's age. Frontiers in Psychology.
Campion, M. E., & Elley, W. B. (1971). An Academic Vocabulary List. Wellington: NZCER.
Carroll, J. B., Davies, P., & Richman, B. (1971). The American Heritage Word Frequency Book. Boston: Houghton Mifflin.
Carter, R., & McCarthy, M. (1988). Vocabulary and Language Teaching. London: Longman.
Chujo, K., & Utiyama, M. (2005a). Exploring sampling methodology for obtaining reliable text coverage. Language Education and Technology, 42, 1–18.
Chujo, K., & Utiyama, M. (2005b). Understanding the role of text length, sample size and vocabulary size in determining text coverage. Reading in a Foreign Language, 17(1), 1–22.
Chung, T. M. (2003). A corpus comparison approach for terminology extraction. Terminology, 9(2), 221–245. doi: 10.1075/term.9.2.05chu
Chung, T. M., & Nation, P. (2003). Technical vocabulary in specialised texts. Reading in a Foreign Language, 15(2), 103–116.
Chung, T. M., & Nation, P. (2004). Identifying technical vocabulary. System, 32(2), 251–263. doi: 10.1016/j.system.2003.11.008
Cobb, T. (2000). One size fits all? Francophone learners and English vocabulary tests. Canadian Modern Language Review, 57(2), 295–324. doi: 10.3138/cmlr.57.2.295
Cobb, T. (2015). What does Zipf's Law tell us about the limits of vocabulary recycling in a text?
Corson, D. J. (1995). Using English Words. Dordrecht: Kluwer Academic Publishers. doi: 10.1007/978-94-011-0425-8
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238. doi: 10.2307/3587951
Coxhead, A., & Hirsh, D. (2007). A pilot science-specific word list. Revue Française de Linguistique Appliquée, 12(2), 65–78.



Coxhead, A., Nation, P., & Sim, D. (2014). Creating and trialling six versions of the Vocabulary Size Test. The TESOLANZ Journal, 22, 13–27.
Coxhead, A., Nation, P., & Sim, D. (2015). The vocabulary size of native speakers of English in New Zealand secondary schools. New Zealand Journal of Educational Studies, 50(1), 121–135.
Cunningham, A. E. (2005). Vocabulary growth through independent reading and reading aloud to children. In E. H. Hiebert & M. L. Kamil (Eds.), Teaching and Learning Vocabulary: Bringing Research to Practice (pp. 45–68). Mahwah, NJ: Lawrence Erlbaum.
Dang, T. N. Y., & Webb, S. (2016). Evaluating lists of high frequency vocabulary. Manuscript submitted for publication.
Diller, K. C. (1978). The Language Teaching Controversy. Rowley, MA: Newbury House.
Droop, M., & Verhoeven, L. (2003). Language proficiency and reading ability in first- and second-language learners. Reading Research Quarterly, 38(1), 78–103. doi: 10.1598/RRQ.38.1.4
Durrant, P. (2009). Investigating the viability of a collocation list for students of English for academic purposes. English for Specific Purposes, 28, 157–169. doi: 10.1016/j.esp.2009.02.002
Elgort, I. (2013). Effects of L1 definitions and cognate status of test items on the Vocabulary Size Test. Language Testing, 30(2), 253–272. doi: 10.1177/0265532212459028
Engels, L. K. (1968). The fallacy of word counts. IRAL, 6(3), 213–231. doi: 10.1515/iral.1968.6.1-4.213
Francis, W. N., & Kučera, H. (1982). Frequency Analysis of English Usage. Boston: Houghton Mifflin.
Fraser, S. (2007). Providing ESP learners with the vocabulary they need: Corpora and the creation of specialized word lists. Hiroshima Studies in Language and Language Education, 10, 127–143.
Fries, C. C., & Traver, A. A. (1950). English Word Lists. Ann Arbor: George Wahr.
Gardner, D. (2007). Validating the construct of "word" in applied corpus-based research: A critical survey. Applied Linguistics, 28(2), 241–265. doi: 10.1093/applin/amm010
Gardner, D., & Davies, M. (2007). Pointing out frequent phrasal verbs: A corpus-based analysis. TESOL Quarterly, 41(2), 339–359. doi: 10.1002/j.1545-7249.2007.tb00062.x
Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327. doi: 10.1093/applin/amt015
Garnier, M., & Schmitt, N. (2015). The PHaVE List: A pedagogical list of phrasal verbs and their most frequent meaning senses. Language Teaching Research, 19(6), 645–666. doi: 10.1177/1362168814559798
Ghadessy, M. (1979). Frequency counts, word lists, and materials preparation: A new approach. English Teaching Forum, 17(1), 24–27.
Gilner, L. (2011). A primer on the General Service List. Reading in a Foreign Language, 23(1), 65–83.
Grant, L. (2005). Frequency of 'core idioms' in the British National Corpus (BNC). International Journal of Corpus Linguistics, 10(4), 429–451. doi: 10.1075/ijcl.10.4.03gra
Grant, L., & Bauer, L. (2004). Criteria for redefining idioms: Are we barking up the wrong tree? Applied Linguistics, 25(1), 38–61. doi: 10.1093/applin/25.1.38
Grant, L., & Nation, I. S. P. (2006). How many idioms are there in English? ITL – International Journal of Applied Linguistics, 151, 1–14. doi: 10.2143/ITL.151.0.2015219
Henriksen, B., & Danelund, L. (in press). Studies of Danish L2 learners' vocabulary knowledge and the lexical richness of their written production in English. In P. Pietilä, K. Doró, & R. Pipalová (Eds.), Lexical Issues in L2 Writing (pp. 1–27). Newcastle upon Tyne: Cambridge Scholars Publishing.
Hindmarsh, R. (1980). Cambridge English Lexicon. Cambridge: Cambridge University Press.

200 Making and Using Word Lists for Language Learning and Testing

Hsu, W. (2014). Measuring the vocabulary load of engineering textbooks for EFL undergraduates. English for Specific Purposes, 33 (Special Issue: ESP in Asia), 54–65.
Hsu, W. (2013). Bridging the vocabulary gap for EFL medical undergraduates: The establishment of a medical word list. Language Teaching Research, 17(4), 454–484. doi: 10.1177/1362168813494121
Hsu, W. (2011). A business word list for prospective EFL business postgraduates. Asian ESP Journal, 7(4).

Hulstijn, J., & Laufer, B. (2001). Some empirical evidence for the involvement load hypothesis in vocabulary acquisition. Language Learning, 51(3), 539–558. doi: 10.1111/0023-8333.00164
Hutchinson, T., & Waters, A. (1987). English for Specific Purposes. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511733031
Hyland, K., & Tse, P. (2007). Is there an “Academic Vocabulary”? TESOL Quarterly, 41(2), 235–253. doi: 10.1002/j.1545-7249.2007.tb00058.x

Jeon, E. H. (2011). Contribution of morphological awareness to second-language reading comprehension. The Modern Language Journal, 95, 217–235. doi: 10.1111/j.1540-4781.2011.01179.x
Johansson, S., & Hofland, K. (1989). Frequency Analysis of English Vocabulary and Grammar 1 & 2. Oxford: Clarendon Press.
Kilgarriff, A. (1997). Putting frequencies in the dictionary. International Journal of Lexicography, 10(2), 135–155. doi: 10.1093/ijl/10.2.135
Kobeleva, P. P. (2012). Second language listening and unfamiliar proper names: Comprehension barrier? RELC Journal, 43(1), 83–98. doi: 10.1177/0033688212440637
Kobeleva, P. P. (2008). The impact of unfamiliar proper names on ESL learners’ listening comprehension. Unpublished PhD thesis, Victoria University of Wellington, New Zealand.
Kučera, H., & Francis, W. N. (1967). A Computational Analysis of Present-Day American English. Providence, R.I.: Brown University Press.
Larson, M. (2016). Thresholds, text coverage, vocabulary size, and reading comprehension in applied linguistics. Unpublished PhD thesis, Victoria University of Wellington, New Zealand.
Leech, G., & Fallon, R. (1992). Computer corpora – what do they tell us about culture? ICAME Journal, 16, 29–50.
Leech, G., Rayson, P., & Wilson, A. (2001). Word Frequencies in Written and Spoken English. Harlow: Longman.
Liu, D. (2011). The most frequently used English phrasal verbs in American and British English: A multicorpus examination. TESOL Quarterly, 45(4), 661–688. doi: 10.5054/tq.2011.247707
Liu, D. (2010). Going beyond patterns: Involving cognitive analysis in the learning of collocations. TESOL Quarterly, 44(1), 4–30. doi: 10.5054/tq.2010.214046
Longman Structural Readers Handbook (2nd ed.). (1976). London: Longman.
Lorge, I., & Chall, J. (1963). Estimating the size of vocabularies of children and adults: an analysis of methodological issues. Journal of Experimental Education, 32(2), 147–157. doi: 10.1080/00220973.1963.11010819

Lynn, R. W. (1973). Preparing word lists: a suggested method. RELC Journal, 4(1), 25–32. doi: 10.1177/003368827300400103

Macalister, J. (1999). School Journals and TESOL: an evaluation of the reading difficulty of School Journals for second and foreign language learners. New Zealand Studies in Applied Linguistics, 5, 61–85.
Martinez, I. A., Beck, S. C., & Panza, C. B. (2009). Academic vocabulary in agriculture research articles: A corpus-based study. English for Specific Purposes, 28, 183–198. doi: 10.1016/j.esp.2009.04.003


Martinez, R., & Schmitt, N. (2012). A phrasal expressions list. Applied Linguistics, 33(3), 299–320. doi: 10.1093/applin/ams010
Matsuoka, W., & Hirsh, D. (2010). Vocabulary learning through reading: Does an ELT course book provide good opportunities? Reading in a Foreign Language, 22(1), 56–70.
McCarthy, M., & Carter, R. (1997). Written and spoken vocabulary. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, Acquisition and Pedagogy (pp. 20–39). Cambridge: Cambridge University Press.
McLean, S. (2016, in preparation). An appropriate word counting unit.
McLean, S., & Kramer, B. (2015). The creation of a new Vocabulary Levels Test. Shiken, 19(2), 1–11.
McLean, S., Kramer, B., & Beglar, D. (2015). The creation and validation of a listening vocabulary levels test. Language Teaching Research, 19(6), 741–760. doi: 10.1177/1362168814567889
Milton, J. (2009). Measuring Second Language Vocabulary Acquisition. Bristol: Multilingual Matters. doi: 10.1057/9780230242258

Mochizuki, M., & Aizawa, K. (2000). An affix acquisition order for EFL learners: an exploratory study. System, 28, 291–304. doi: 10.1016/S0346-251X(00)00013-0
Nagy, W. E. (1997). On the role of context in first- and second-language learning. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, Acquisition and Pedagogy (pp. 64–83). Cambridge: Cambridge University Press.
Nagy, W. E., & Anderson, R. C. (1984). How many words are there in printed school English? Reading Research Quarterly, 19(3), 304–330. doi: 10.2307/747823
Nagy, W. E., Anderson, R., Schommer, M., Scott, J. A., & Stallman, A. (1989). Morphological families in the internal lexicon. Reading Research Quarterly, 24(3), 263–282. doi: 10.2307/747770
Nagy, W. E., Herman, P., & Anderson, R. C. (1985). Learning words from context. Reading Research Quarterly, 20(2), 233–253. doi: 10.2307/747758
Nakata, T. (2011). Computer-assisted second language vocabulary learning in a paired-associate paradigm: A critical investigation of flashcard software. Computer Assisted Language Learning, 24(1), 17–38. doi: 10.1080/09588221.2010.520675
Nation, I. S. P. (2016). Reading a whole book to learn vocabulary.
Nation, I. S. P. (2014). How much input do you need to learn the most frequent 9,000 words? Reading in a Foreign Language, 26(2), 1–16.
Nation, I. S. P. (2013). Learning Vocabulary in Another Language (2nd ed.). Cambridge: Cambridge University Press.
Nation, I. S. P. (2009). New roles for L2 vocabulary? In W. Li & V. J. Cook (Eds.), Contemporary Applied Linguistics Volume 1: Language Teaching and Learning (pp. 99–116). London: Continuum.
Nation, I. S. P. (2008). Teaching Vocabulary: Strategies and Techniques. Boston: Heinle Cengage Learning.
Nation, I. S. P. (2007). The four strands. Innovation in Language Learning and Teaching, 1(1), 1–12. doi: 10.2167/illt039.0

Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? Canadian Modern Language Review, 63(1), 59–82. doi: 10.3138/cmlr.63.1.59
Nation, I. S. P. (2004). A study of the most frequent word families in the British National Corpus. In P. Bogaards & B. Laufer (Eds.), Vocabulary in a Second Language: Selection, Acquisition, and Testing (pp. 3–13). Amsterdam: John Benjamins. doi: 10.1075/lllt.10.03nat
Nation, I. S. P. (2001a). Learning Vocabulary in Another Language. Cambridge: Cambridge University Press. doi: 10.1017/CBO9781139524759


Nation, I. S. P. (2001b). How many high frequency words are there in English? In M. Gill, A. Johnson, L. Koski, R. Sell & B. Warvik (Eds.), Language, Learning, and Literature: Studies Presented to Håkan Ringbom (pp. 167–181). Åbo: Åbo Akademi University.
Nation, I. S. P. (2000). Learning vocabulary in lexical sets: dangers and guidelines. TESOL Journal, 9(2), 6–10.
Nation, I. S. P. (1993). Using dictionaries to estimate vocabulary size: essential, but rarely followed, procedures. Language Testing, 10(1), 27–40. doi: 10.1177/026553229301000102
Nation, I. S. P. (1983). Testing and teaching vocabulary. Guidelines, 5(1), 12–25.
Nation, P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7), 9–13.
Nation, P., & Crabbe, D. (1991). A survival language learning syllabus for foreign travel. System, 19(3), 191–201. doi: 10.1016/0346-251X(91)90044-P
Nation, I. S. P., & Deweerdt, J. (2001). A defence of simplification. Prospect, 16(3), 55–67.
Nation, I. S. P., & Hwang, K. (1995). Where would general service vocabulary stop and special purposes vocabulary begin? System, 23(1), 35–41. doi: 10.1016/0346-251X(94)00050-G
Nation, I. S. P., & Macalister, J. (2010). Language Curriculum Design. New York: Routledge.
Nation, P., & Wang, K. (1999). Graded readers and vocabulary. Reading in a Foreign Language, 12(2), 355–380.
Nation, I. S. P., & Webb, S. (2011). Researching and Analyzing Vocabulary. Boston: Heinle Cengage Learning.
Nation, I. S. P., & Yamamoto, A. (2012). Applying the four strands to language learning. International Journal of Innovation in English Language Teaching and Research, 1(2), 167–181.
Nelson, G. (1997). Standardizing wordforms in a spoken corpus. Literary and Linguistic Computing, 12(2), 79–85. doi: 10.1093/llc/12.2.79
Neufeld, S., Hancioglu, N., & Eldridge, J. (2011). Beware the range in RANGE, and the academic in AWL. System, 39, 533–538. doi: 10.1016/j.system.2011.10.010
Nguyen, L. T. C., & Nation, I. S. P. (2011). A bilingual vocabulary size test of English for Vietnamese learners. RELC Journal, 42(1), 86–99. doi: 10.1177/0033688210390264
Nurweni, A., & Read, J. (1999). The English vocabulary knowledge of Indonesian university students. English for Specific Purposes, 18(2), 161–175. doi: 10.1016/S0889-4906(98)00005-2
O’Neill, R. (1987). The Longman Structural Readers Handbook. London: Longman.
Ogden, C. K. (1932). The Basic Words. London: Kegan Paul, Trench, Trubner & Co.
Palmer, H. E. (1933). Second Interim Report on English Collocations. Tokyo: Kaitakusha.
Palmer, H. E. (1931). Second interim report on vocabulary selection submitted to the Eighth Annual Conference of English Teachers under the auspices of the Institute for Research in English Teaching. Tokyo: IRET.
Parent, K. (2012). The most frequent English homonyms. RELC Journal, 43(1), 69–81. doi: 10.1177/0033688212439356

Pinchbeck, G. G. (2014). Lexical frequency profiling of a large sample of Canadian high school diploma exam expository writing: L1 and L2 academic English. Roundtable presentation at the American Association for Applied Linguistics conference, Portland, OR, USA.
Praninskas, J. (1972). American University Word List. London: Longman.
Quaglio, P. (2009). Television Dialogue: The Sitcom Friends vs. Natural Conversation (Studies in Corpus Linguistics 36). Amsterdam: John Benjamins. doi: 10.1075/scl.36
Quero, B. (2015). Estimating the vocabulary size of L1 Spanish ESP learners and the vocabulary load of medical textbooks. Unpublished PhD thesis, Victoria University of Wellington, New Zealand.


Quinn, G. (1968). The English vocabulary of some Indonesian university entrants. Salatiga: English Department Monograph, IKIP Kristen Satya Watjana.
Reynolds, B. L. (2013). Comments on Stuart Webb and John Macalister’s “Is Text Written for Children Useful for L2 Extensive Reading?” TESOL Quarterly, 47(4), 849–852. doi: 10.1002/tesq.145

Reynolds, B. L., & Wible, D. (2014). Frequency in incidental vocabulary acquisition research: An undefined concept and some consequences. TESOL Quarterly, 48(4), 843–861. doi: 10.1002/tesq.197


Richards, I. A. (1943). Basic English and its Uses. London: Kegan Paul, Trench, Trubner & Co.
Richards, J. C. (1970). A psycholinguistic measure of vocabulary selection. IRAL, 8(2), 87–102. doi: 10.1515/iral.1970.8.2.87

Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and Categorization (pp. 27–48). Hillsdale, N.J.: Lawrence Erlbaum.
Ruhl, C. (1989). On Monosemy: A Study in Linguistic Semantics. Albany: State University of New York Press.
Salager, F. (1983). The lexis of fundamental medical English: classificatory framework and rhetorical function (a statistical approach). Reading in a Foreign Language, 1(1), 54–64.
Schmitt, N. (2010). Researching Vocabulary: A Vocabulary Research Manual. Basingstoke: Palgrave Macmillan. doi: 10.1057/9780230293977
Schmitt, N., & Meara, P. (1997). Researching vocabulary through a word knowledge framework: word associations and verbal suffixes. Studies in Second Language Acquisition, 19, 17–36. doi: 10.1017/S0272263197001022

Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching, 47(4), 484–503. doi: 10.1017/S0261444812000018
Schmitt, D., & Schmitt, N. (2005). Focus on Vocabulary: Mastering the Academic Word List. New York: Longman Pearson Education.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of two new versions of the Vocabulary Levels Test. Language Testing, 18(1), 55–88.
Schmitt, N., & Zimmerman, C. (2002). Derivative word forms: What do learners know? TESOL Quarterly, 36(2), 145–171. doi: 10.2307/3588328
Shin, D., & Nation, I. S. P. (2008). Beyond single words: the most frequent collocations in spoken English. ELT Journal, 62(4), 339–348. doi: 10.1093/elt/ccm091
Simpson, R., & Mendis, D. (2003). A corpus-based study of idioms in academic speech. TESOL Quarterly, 37(3), 419–441. doi: 10.2307/3588398
Simpson-Vlach, R., & Ellis, N. C. (2010). An academic formulas list: New methods in phraseology research. Applied Linguistics, 31(4), 487–512. doi: 10.1093/applin/amp058
Sinclair, J. M. (Ed.). (1987). Collins COBUILD English Language Dictionary. London: Collins.
Sinclair, J. M. (1987). Looking Up. London: Collins ELT.
Sorell, C. J. (2013). A study of issues and techniques for creating core vocabulary lists for English as an international language. Unpublished PhD thesis, Victoria University of Wellington, New Zealand.
Sorell, C. J. (2012). Zipf’s law and vocabulary. In C. A. Chapelle (Ed.), Encyclopaedia of Applied Linguistics. Oxford: Wiley-Blackwell.
Statistics Canada (1998). Average time spent on activities, by sex. Ottawa, Ontario: Author. Retrieved October 21, 2007 from http://www40.statcan.ca/l01/cst01/famil36a.htm


Summers, D. (2001). Longman Dictionary of Contemporary English (3rd ed.). Harlow: Pearson Education Ltd.
Sutarsyah, C., Nation, P., & Kennedy, G. (1994). How useful is EAP vocabulary for ESP? A corpus based study. RELC Journal, 25(2), 34–50. doi: 10.1177/003368829402500203
Thorndike, E. L. (1921). The Teacher’s Word Book. New York: Teachers College, Columbia University.
Thorndike, E. L. (1924). The vocabularies of school pupils. In J. C. Bell (Ed.), Contributions to Education (pp. 69–76). New York: World Book Co.
Thorndike, E. L. (1932). Teacher’s Word Book of 20,000 Words. New York: Teachers College, Columbia University.
Thorndike, E. L., & Lorge, I. (1944). The Teacher’s Word Book of 30,000 Words. New York: Teachers College, Columbia University.
United States Department of Labor. (2006). American time use survey summary. Washington, D.C.: Author. Retrieved October 21, 2007 from http://www.bls.gov/news.release/atus.nr0.htm
Valcourt, G., & Wells, L. (1999). Mastery: A University Word List Reader. Ann Arbor: University of Michigan Press.
Vongpumivitch, V., Huang, J., & Chang, Y.-C. (2009). Frequency analysis of the words in the Academic Word List (AWL) and non-AWL content words in applied linguistics research papers. English for Specific Purposes, 28(1), 33–41. doi: 10.1016/j.esp.2008.08.003
Wan-a-rom, U. (2008). Comparing the vocabulary of different graded-reading schemes. Reading in a Foreign Language, 20(1), 43–69.
Wang, J., Liang, S., & Ge, G. (2008). Establishment of a medical academic word list. English for Specific Purposes, 27(4), 442–458. doi: 10.1016/j.esp.2008.05.003
Wang, M.-t. K., & Nation, P. (2004). Word meaning in academic English: Homography in the Academic Word List. Applied Linguistics, 25(3), 291–314. doi: 10.1093/applin/25.3.291
Wang, M., Cheng, C., & Chen, S. W. (2006). Contribution of morphological awareness to Chinese-English biliteracy acquisition. Journal of Educational Psychology, 98(3), 542–553. doi: 10.1037/0022-0663.98.3.542

Ward, J. (1999). How large a vocabulary do EAP Engineering students need? Reading in a Foreign Language, 12(2), 309–323.
Ward, J. (2009). A basic engineering English word list for less proficient foundation engineering undergraduates. English for Specific Purposes, 28, 170–182. doi: 10.1016/j.esp.2009.04.001
Ward, J., & Chuenjundaeng, J. (2009). Suffix knowledge: Acquisition and applications. System, 37, 461–469. doi: 10.1016/j.system.2009.01.004
Waring, R. (1997). A study of receptive and productive learning from word cards. Studies in Foreign Languages and Literature (Notre Dame Seishin University, Okayama), 21(1), 94–114.
Webb, S. (2008). Receptive and productive vocabulary size. Studies in Second Language Acquisition, 30(1), 79–95. doi: 10.1017/S0272263108080042
Webb, S. A., & Chang, A. C.-S. (2012). Second language vocabulary growth. RELC Journal, 43(1), 113–126. doi: 10.1177/0033688212439367
Webb, S., & Macalister, J. (2013). Is text written for children useful for L2 extensive reading? TESOL Quarterly, 47(2), 300–322. doi: 10.1002/tesq.70
Webb, S., & Macalister, J. (2013). A response. TESOL Quarterly, 47, 852–855.
Webb, S., & Nation, I. S. P. (2016, in press). How Vocabulary is Learned. Oxford: Oxford University Press.


Webb, S., & Nation, I. S. P. (2008). Evaluating the vocabulary load of written text. TESOLANZ Journal, 16, 1–10.
Webb, S., & Paribakht, T. S. (2015). What is the relationship between the lexical profile of test items and performance on a standardized English proficiency test? English for Specific Purposes, 38, 34–43. doi: 10.1016/j.esp.2014.11.001
Webb, S., & Rodgers, M. P. H. (2009a). The lexical coverage of movies. Applied Linguistics, 30(3), 407–427. doi: 10.1093/applin/amp010
Webb, S., & Rodgers, M. P. H. (2009b). The vocabulary demands of television programs. Language Learning, 59(2), 335–366. doi: 10.1111/j.1467-9922.2009.00509.x
Webb, S. A., & Sasao, Y. (2013). New directions in vocabulary testing. RELC Journal, 44(3), 263–277. doi: 10.1177/0033688213500582
West, M. (1968). The minimum adequate: a quest. ELT Journal, 22(3), 205–210. doi: 10.1093/elt/XXII.3.205

West, M. (1960). Teaching English in Difficult Circumstances. London: Longman.
West, M. (1956). A plateau vocabulary for speech. Language Learning, 7(1&2), 1–7. doi: 10.1111/j.1467-1770.1956.tb00852.x

West, M. (1955). Catenizing (chaining words together). In Learning to Read a Foreign Language (pp. 61–68). London: Longman.
West, M. (1953). A General Service List of English Words. London: Longman, Green & Co.
West, M. (1951). Catenizing. ELT Journal, 5(6), 147–151. doi: 10.1093/elt/V.6.147
West, M. (1935). Definition vocabulary. Bulletin of the Department of Educational Research, University of Toronto, 4.
Xue, G., & Nation, I. S. P. (1984). A university word list. Language Learning and Communication, 3(2), 215–229.
Yang, M. N. (2015). A nursing academic word list. English for Specific Purposes, 37, 27–38. doi: 10.1016/j.esp.2014.05.003

Zeno, S. M., Ivens, S. H., Millard, R. T., & Duvvuri, R. (1995). The Educator’s Word Frequency Guide. Brewster, N.Y.: Touchstone Applied Science Associates.
Zipf, G. K. (1935). The Psycho-Biology of Language. Cambridge, Mass.: M.I.T. Press.


Author index

A
Adolphs, S.  7
Aizawa, K.  35, 36, 181, 182
Anderson, R.  xiv, 18–19, 56, 101, 137, 142
Anglin, J.  8, 36

B
Baayen, R.  142
Banerji, N.  99
Bauer, L.  xii, 23, 25, 26–29, 30, 32, 33, 35, 36, 37, 75, 133, 141, 154, 181
Beck, S.  12, 116
Beglar, D.  13, 27, 28, 118, 181
Bertram, R.  142
Biber, D.  78, 89, 95–96, 97, 163, 166
Biemiller, A.  50
Boers, F.  74
Bongers, H.  10
Bowles, M.  13
Brezina, V.  11, 31, 99–100, 118, 119, 123, 124, 126–129, 153–167
Brown, D.  30, 35, 57, 182
Browne, C.  11
Brysbaert, M.  9, 24, 30, 31, 68, 97, 99, 103, 142, 143, 182

C
Campion, M.  10, 11, 114, 118, 149
Carroll, J.  17, 18, 24, 103, 121, 176
Carter, R.  165–166
Chall, J.  55
Chang, A.  161
Chang, Y.  12
Chen, S.  35
Cheng, C.  35
Chuenjundaeng, J.  35, 115, 154, 182
Chujo, K.  179
Chung, T.  12, 34, 115, 116, 146–148
Clapham, C.  10, 180, 181
Cobb, T.  5, 129, 142, 179
Conrad, S.  78, 89, 163, 166
Corson, D.  150
Cortes, V.  78
Coxhead, A.  7, 9, 11, 12, 13, 28, 32, 90, 108, 114, 118, 122, 146, 148–151, 176, 181
Crabbe, D.  119, 123, 134, 173, 175
Cunningham, A.  101

D
Danelund, L.  161
Dang, T. N. Y.  18, 26, 31, 118, 126, 133, 153–167, 174, 176
Davies, M.  12, 30, 31, 32, 115, 143, 145, 146, 149
Davies, P.  17, 18, 24, 103, 121, 176
Deweerdt, J.  175
Diller, K.  8
Droop, M.  35
Durrant, P.  78
Duvvuri, R.  7, 24–25

E
Eldridge, J.  108, 150
Elgort, I.  181
Elley, W.  10, 11, 114, 118, 149
Ellis, N.  78
Engels, L.  103, 162

F
Fallon, R.  98
Francis, W.  10, 97
Fraser, S.  12
Fries, C.  9

G
Gablasova, D.  11, 31, 99–100, 118, 119, 123, 124, 126–129, 153–167
Gardner, D.  11, 17, 23, 30, 31, 32, 115, 145, 146, 149, 154
Garnier, M.  78
Ge, G.  12
Ghadessy, M.  11
Gilner, L.  xiv, 10, 103
Grant, L.  71, 75–76
Gupta, V.  99

H
Hancioglu, N.  108, 150
Hearn, L.  109
Henriksen, B.  161
Herman, P.  101
Hirsh, D.  7, 12, 117, 146
Hofland, K.  10
Hsu, W.  12, 116
Huang, J.  12
Hutchinson, T.  37
Hwang, K.  100, 150, 163
Hyland, K.  149, 150

I
Ivens, S.  7, 24–25

J
Jeon, E.  35
Johansson, S.  10, 89

K
Kennedy, G.  9
Keuleers, E.  24, 31, 68, 142, 182
Kilgarriff, A.  99, 176
Kobeleva, P.  55–64
Kramer, B.  13, 28, 181
Kučera, H.  10, 97


L
Laine, M.  142
Larson, M.  179
Leech, G.  25, 29, 89, 98, 104, 118, 121, 163, 166
Liang, S.  12
Lindstromberg, S.  74
Liu, D.  74, 78
Lorge, I.  10, 26, 55, 114, 118
Lynn, R.  11

M
Macalister, J.  9, 32, 98, 172
Mandera, P.  24, 31, 68, 142, 182
Martinez, I.  12, 116
Martinez, R.  xiv, 76–78
Matsuoka, W.  117
McCarthy, M.  165–166
McLean, S.  13, 28, 35, 181, 182
Meara, P.  35, 36, 181, 182
Mendis, D.  78
Millard, R.  7, 24–25
Milton, J.  161
Mochizuki, M.  35, 36, 181, 182

N
Nagy, W.  xiv, 18–19, 51, 56, 101, 137, 142
Neufeld, S.  108, 150
New, B.  9, 97, 99, 103
Nguyen, L.  28
Nurweni, A.  161

O
O’Neill, R.  12
Ogden, C.  13

P
Palmer, H.  9, 10, 79
Panza, C.  12, 116
Parent, K.  41–53
Paribakht, S.  179
Pinchbeck, G.  25
Praninskas, J.  11, 114, 149

Q
Quaglio, P.  96
Quero, B.  9, 12, 147–148
Quinn, G.  161

R
Rayson, P.  25, 29, 104, 118, 121
Read, J.  161
Reynolds, B.  24, 35, 121
Richards, I.  13
Richards, J.  129
Richman, B.  17, 18, 24, 103, 121, 176
Rosch, E.  55
Ruhl, C.  xii, 52

S
Salager, F.  12, 116
Sasao, Y.  35, 182
Schmitt, D.  4, 10, 129, 171–172, 180, 181
Schmitt, N.  xiv, 4, 7, 10, 23, 35, 36, 76–78, 114, 129, 154, 171–172, 180, 181, 182
Schommer, M.  142
Schreuder, R.  142
Scott, J. A.  142
Shin, D.  73, 78, 114
Sim, D.  13, 28, 118, 181
Simpson, R.  78
Simpson-Vlach, R.  78
Sinclair, J.  176
Sorell, C. J.  xiv, 95–105, 118, 123, 125, 163, 180
Stallman, A.  142
Stevens, M.  24, 31, 68, 142, 182
Summers, D.  176
Sutarsyah, C.  9

T
Thorndike, E.  7, 10, 13, 26, 55, 114, 118
Traver, A.  9
Tse, P.  147, 150
Tugwell, D.  99

U
Utiyama, M.  179

V
Valcourt, G.  172
Verhoeven, L.  35
Virkkala, M.  142
Vongpumivitch, V.  12

W
Wan-a-rom, U.  12
Wang, J.  12
Wang, K.  12, 41–42, 117, 150, 173
Wang, M.  35
Ward, J.  12, 35, 115, 116–117, 130, 154, 182
Waring, R.  8, 143
Waters, A.  37
Webb, S.  5, 8, 18, 31, 32, 35, 57, 71, 98, 118, 126, 132, 133, 153–167, 174, 176, 177, 182
Wells, L.  172
West, M.  10, 13, 41, 50, 102, 113, 124, 125, 126, 129, 151, 153, 155, 175, 177
Wible, D.  24, 35, 121
Wilson, A.  25, 29, 104, 118, 121

X
Xue, G.  11

Y
Yamamoto, A.  6
Yang, M.  12

Z
Zeno, S.  7, 24–25
Zimmerman, C.  35, 36, 154, 181, 182
Zipf, G.  9


Subject index

A
Abbreviations  86
Academic Word List  11, 42–43, 90, 101, 108, 114, 116, 149–150
Acronyms  19, 59, 85–87, 131, 137, 140
AffixAppender  29
Affixes  27, 33, 156
Alternative spellings  39, 91–92, 131, 140
Amount of reading  101
AntWordProfiler  5, 6, 31, 57, 69, 89, 91, 107, 108–110, 132, 142

B
Basic English  13, 119
Bauer & Nation levels  27, 37, 135
BNC/COCA lists  xii, 13, 30, 35, 46, 57, 59–61, 85, 91, 98, 108, 113, 115, 118, 128, 131–143

C
Capital letters  17, 20, 24, 27, 46, 55–56, 133, 184
Categories of words  19, 39
CHILDES  98
Cognitive homonyms  47–49
Compleat Word Lister  5
Compositionality  74–77
Content words  89
Core idioms  75–76
Corpus comparison  147–148, 151
Corpus design  95–99, 101–105, 171
Corpus size  9, 99–101, 107, 128
Course books  117
Course design  3–5
Coverage  4, 159, 162, 179
Cut-off points  20–22, 46, 77

D
Derivational affixes  xiii
Dispersion  6, 121, 131, 141
Diverse corpus  9

E
Essential Word List  114, 153–167, 174, 188–195
Evaluating lists  126–129, 138–143, 159–160, 164
Extensive Reading Foundation  6

F
Field approach  172
Flemmas  xiii, 33, 155
Foreign words  82–83
Four strands  6, 174–177
Frequency  120
Frequent derived forms  38–39
Function words  89–92

G
General Service List  xiv, 10–11, 29, 41, 49–50, 102, 103, 113, 119, 128, 151, 153
Graded readers  6, 12, 117–118, 178

H
High frequency words  4, 5, 7–8, 28, 114, 171
Homoforms  xiii, 41–53, 131, 133, 142
Homographs  41–43, 49–50
Homonyms  41–50, 61–64, 87, 92, 150
Homophones  41–42
Hyphenated words  65–67, 70, 131

I
Inflectional affixes  xiii
Internet use  101

K
Knowledge of affixes  35–36

L
Lemmas  xiii, 25–26, 30–35, 38
Letters of the alphabet  83
Level 3 partial  33
Low frequency words  4, 7–8, 115

M
Maori  4
Marginal words  81–83, 131, 137
Mid-frequency words  4, 115
Monosemic bias  xii, 52
Morphological awareness  36
Morphological knowledge  35–36, 39, 156
Morphological problem-solving  8, 36
Multiword units  xi, xii, xiv, 19, 64, 71–79, 85–86, 147

N
n-grams  71
Non-compositional  74–75
Non-words  17, 18
Notepad++  83, 107–108, 132

P
Part of speech  32–33, 131
Polysemy  32, 41–53, 150
Productive knowledge  8, 31
Proper nouns  18, 46–47, 55–64, 108–109, 131, 140, 183–186

R
Range  120
Range program  57, 69, 89, 90, 91, 108, 109–110, 132–133, 135–136, 137, 142, 147, 158, 175, 179
Receptive knowledge  xi, 8, 24, 31
Relatedness levels  18


S
Specialised vocabulary  7, 145–151
Spoken/written  8–9
Spoken texts  96–97
Standard Frequency Index  121
Sub-corpora  103–105, 131
Subjective criteria  10, 105, 119–126, 129, 132, 133, 141, 187
Subject-specific lists  12
Swear words  83, 132

T
Technical vocabulary  116–117, 147–148
Text types  96
Texting  87
Thorndike lists  10
Tokens  xii
Topics  9
Transparent compounds  19, 65–70, 86, 110, 131, 137, 140, 142
TV programs/movies  97, 134
Types  xii, 24–25, 30, 137

U
Unit of counting  8, 23–24, 35, 37, 39, 130, 131, 140, 147, 151, 154, 181
University Word List  11, 29
US/UK  98, 133, 140

V
Vocabulary learning goals  171–174
Vocabulary levels test  28
Vocabulary load  179
Vocabulary size  7
Vocabulary size test  28, 118, 181
Vocabulary testing  7, 117, 122, 180–182

W
Wellington Spoken Corpus  156
Wellington Written Corpus  136, 157
Word families  xii, 26–39, 135
Word senses  50–53
Word separators  133
Written texts  97–98

Z
Zipf’s law  xiv, 3–5, 9, 73


Word lists lie at the heart of good vocabulary course design, the development of graded materials for extensive listening and extensive reading, research on vocabulary load, and vocabulary test development. This book has been written for vocabulary researchers and curriculum designers to describe the factors they need to consider when they create frequency-based word lists. These include the purpose for which the word list is to be used, the design of the corpus from which the list will be made, the unit of counting, and what should and should not be counted as words. The book draws on research to show the current state of knowledge of these factors and provides very practical guidelines for making word lists for language teaching and testing. The writer is well known for his work in the teaching and learning of vocabulary and in the creation of word lists and vocabulary size tests based on word lists.

isbn 978 90 272 1244 3

John Benjamins Publishing Company