A Statistical Linguistic Analysis of American English [Reprint 2021 ed.] 3112416414, 9783112416419, 9783112416426


English Pages 442 [438] Year 1965


A STATISTICAL LINGUISTIC ANALYSIS OF AMERICAN ENGLISH

JANUA LINGUARUM
STUDIA MEMORIAE NICOLAI VAN WIJK DEDICATA
edenda curat

CORNELIS H. VAN SCHOONEVELD
STANFORD UNIVERSITY

SERIES PRACTICA
VIII

1965
MOUTON & CO.
LONDON • THE HAGUE • PARIS

A STATISTICAL LINGUISTIC ANALYSIS OF AMERICAN ENGLISH

by

A. HOOD ROBERTS

1965
MOUTON & CO.
LONDON • THE HAGUE • PARIS

© Copyright 1965 Mouton & Co., Publishers, The Hague, The Netherlands.

No part of this book may be translated or reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publishers.

Printed in The Netherlands

PREFACE

Frequency studies of the components of language have always been hampered by the prodigious amount of man-hours required to make the studies extensive enough to be either valid or useful. The great semantic counts and the Lorge "Magazine Count" of the 1930's were by-products of the depression. For example, The Semantic Count of the 570 Commonest English Words by Irving Lorge employed the efforts of several hundred workers provided by the W.P.A. for a period of six years beginning in 1934. Fortunately, scores of phoneme counters were not needed to do the work on this project owing to these factors:
1. The data from the frequency counts of the past can be converted into a form more useful in present-day linguistic study.
2. Modern digital computers can now do in minutes work which would require thousands of man-hours.

The original idea for this study came from Professor Frederic G. Cassidy, who suggested it to me as a possible dissertation topic and who was instrumental in obtaining the funds necessary to accomplish this investigation. I owe a great debt of gratitude to Professor Cassidy for the helpfulness and patient understanding which he showed me in his direction of this dissertation.

I would like to acknowledge the grant made available for this project by the Research Committee of the Graduate School of the University of Wisconsin which enabled me to use the facilities of the Numerical Analysis Laboratory.

Despite the speed of the computer used in this project, the preparatory work, which had to be done by hand, was both laborious and time-consuming and required something over 2,000 man-hours. It is with pleasure that I acknowledge my indebtedness to those helpful participants in what was, for the most part, sheer drudgery. My appreciation is expressed here to those who assisted in various stages of the project:

Professor Gerald B. Kelley, for his assistance in determining the phonemics of the informant.

My colleague and the informant for this study, Donald C. Green, not only for his willingness to record the tapes but also for his assistance in the laborious task of checking the first print-out for errors.

Mrs. Jean Walsh and my brother Hal Roberts for their hours spent in the tedious recording of data on the laboratory sheets.


John Cerveny for his aid in determining the number of letters in the corpus and for his help in numerous other parts of the study.

Miss Margaret Horigan for her assistance in several areas of the investigation.

Miss Nancy Krahn, who did part of the programming, for her perfect cooperation and willingness to give of her time.

My colleague, Richard George Wolfe, who did the major part of the programming for this project, and whose capability was matched by his eagerness to serve. Without his expenditure of time and effort, this project would still be far from completion.

Professor Murray Fowler, for his willingness to serve as acting chairman of the examining committee and his aid in seeing this project through to its completion.

To my wife, Carolyn Roberts, go my deepest thanks, for her understanding, assistance and encouragement during the work on this project. Without her I could not possibly have done this work and to her is owed my deepest gratitude.

Western Reserve University
A. H. R.

CONTENTS

Preface

I. PIONEERS IN WORD COUNTING

II. GENERAL PLAN OF THE PRESENT STUDY

III. PREPARATORY WORK
    Choice of System of Phonemic Notation
    Selection of Informant
    Informant's Phonemics
    Recording the Corpus
    Phonemic Transcription of the Corpus
    Recomputation of Phoneme Frequencies
    Choice of Etymological Authority
    Manner of Recording Etymological Sources
    Recomputation of Etymological Frequencies
    Choice of Word Count for Use in this Study
    Review of the Criteria for Selection of Word Count

IV. PROCESSING THE DATA

V. RESULTS OF THE INVESTIGATION
    Relationship between Alphabetic and Phonemic Codes
    The Etymological Composition of English
    Phoneme Frequencies
    Statistical Analysis of Phoneme Frequencies
    Vowel/Consonant Ratio
    Word Length in Phonemes and in Syllables
    Canonical Forms
    Canonical Forms with Respect to Manner of Articulation
    Canonical Forms with Respect to Points of Articulation
    Transitional Probabilities for Sequences of Two Phonemes
    Transitional Probabilities for Word-Initial Phoneme Sequences
    Entropy and Redundancy
    Initial Consonants and Consonant Clusters
    Intervocalic Consonants and Consonant Clusters
    Final Consonants and Consonant Clusters

VI. PREVIOUS STUDIES OF SPEECH SOUNDS

BIBLIOGRAPHY

APPENDICES
    I. Etymological Composition of English
    II. Relative Frequency of Segmental Phonemes
    III. Relative Frequency of Vowels
    IV. Relative Frequency of Consonants and Semivowels
    V. Phoneme Proportions by Number and Frequency for Each Decile
    VI. Word Length in Phonemes
    VII. Word Length in Syllables
    VIII. Joint Frequency Distribution of Word Length by Syllable and Phoneme Number
    IX. Canonical Forms of Consonant, Vowel, Semivowel
    X. Canonical Forms by Manner of Articulation of Phonemes
    XI. Canonical Forms by Place of Production of Phonemes
    XII. Transitional Probabilities for Sequences of Two Phonemes
    XIII. Transitional Probabilities for Word-Initial Phoneme Sequences
    XIV. Entropy and Redundancy based on Phoneme Frequencies by Decile
    XV. Entropy and Redundancy based on Word Length in Phonemes by Decile
    XVI. Entropy and Redundancy based on Word Length in Syllables by Decile
    XVII. Initial Consonants and Consonant Clusters
    XVIII. Intervocalic Consonants and Consonant Clusters including Pre-Consonantal Off-Glides
    XIX. Final Consonants and Consonant Clusters including Pre-Consonantal Off-Glides
    XX. Intervocalic Consonants and Consonant Clusters
    XXI. Final Consonants and Consonant Clusters
    XXII. Phonemic Transcription of the First Decile

I PIONEERS IN WORD COUNTING

Frequency studies of the components of languages have been concentrated largely on the two components, words and sounds. Some investigations have been based on counts of entries in dictionaries. This type of count reveals the overall pattern of the lexicon, but its great limitation is that it does not take into account the importance of the components as determined by their frequency of use. Other counts have been based on running words - either printed, written, or spoken. These frequency counts have been made with a variety of purposes in mind. Some have been made to determine the most frequent sounds in the language; others have determined the most frequently used words in various types of reading or in spelling. Despite the variety of these latter studies, underlying them all is the principle that, to a great extent, a word's importance is measured by its frequency of occurrence. Perhaps the best known investigators of word frequency in the United States are R. C. Eldridge, Edward L. Thorndike, Irving Lorge, Ernest Horn, and G. K. Zipf. Although each of these pioneers in this field used word frequency as the basis for their studies, their purposes and interests were different. R. C. Eldridge, the manager of a factory which employed a high proportion of foreign born, was concerned with the employees' problems in learning English. His count grew out of this concern and his interest in a universal phonetic alphabet and a universal vocabulary. The sample for his count was taken from four different newspapers published in Buffalo, New York, on different dates in 1909. A word list was made from the count of the words in each newspaper, and a fifth list containing 6,000 different words out of a total of 43,989 running words was made by combining the four counts. Edward L. Thorndike, a professor of Educational Psychology at Teachers College, Columbia University, was interested chiefly in the study of reading vocabulary. 
His work in this field was continued for many years, and in this time he compiled three influential word lists, at intervals of a decade. His first study, The Teacher's Word Book, "...is an alphabetical list of the 10,000 words which are found to occur most widely in a count of about 625,000 words from literature for children; about 3,000,000 words from the Bible and English classics; about 300,000 words from elementary-school text books; about 50,000 words from books about cooking, sewing, farming, the trades, and the like; about 90,000 words from the daily newspapers; and about


500,000 words from correspondence. Forty-one sources were used."1 With the publication of this count, Thorndike became the acknowledged leader in this area of study in the United States. However, he was not content to stop with the publication of this work, and in 1932 he published an expanded word count.2 This list added the counts from over two hundred sources including about 5,000,000 words, and incorporated the results of other published counts. Then in 1944, with Irving Lorge, Thorndike published The Teacher's Word Book of 30,000 Words.3 With the publication of this compilation, word counts of reading vocabulary had gone about as far as they were ever apt to go. Based on counts of over 20,000,000 running words, the 1944 count is regarded by many as the epitome of word counts. Perhaps the 1944 count helped foster an attitude of reverence for the work. The 1921 and 1932 counts both emphasized that they were counts of English reading, as, indeed, the 1944 count does in the "Preface" which states "This book is not final as a frequency count of English reading." However, the "Introduction" to the 1944 count begins, "Part 1 of this book is a list of words, each followed by a record of the frequency of occurrence of the word in general, and four different sets of reading matter."4 The word general perhaps has been misinterpreted by some. One is reminded of what Irving Lorge, the co-author of the 1944 count, has said: "Practically all counts that have been made show that there is no finality in word counts. The extent of the sampling, the choice of the materials counted (printed books or magazines, spoken vocabulary, written correspondence, compositions, or school work), the nature of the selection of materials (geographic, urban-rural) all play a part in the specification of the universe of background materials in communication."5 Irving Lorge was a professor of Education at Teachers College, Columbia University, and a frequent collaborator with Thorndike.
Lorge and Thorndike were co-authors of A Semantic Count of English Words.6 This count, based upon approximately five million words, gives the total frequency of occurrence for each word (except the five hundred most frequent) and the relative occurrence of all its different meanings. The omission of the five hundred most frequent words in A Semantic Count is a serious disadvantage inasmuch as the most frequently used words are generally those which have the largest number of different meanings. In 1949 this disadvantage was overcome with the publication of Lorge's The Semantic Count of the 570 Commonest English Words.7 Lorge was also the author of a frequency count based on popular magazines. In this count he was attempting to get an estimate of the frequency of occurrence of words read by the average adult. This study was never published separately and is to be found only in The Teacher's Word Book of 30,000 Words.

Ernest Horn, a professor of Education at the University of Iowa, was interested in spelling rather than in reading vocabulary. His major work, A Basic Writing Vocabulary, which was the result of several years of original work and a compilation of previous studies, was based on a count of 5,136,816 running words. Horn combined the data presented in the following correspondence studies which had been made from samples taken outside the school:

1. Chancellor, W. E., "Spelling: 1000 Words", The Journal of Education, Boston, Vol. 71-2 (May, 1910).
2. Ayres, L. P., The Spelling Vocabularies of Personal and Business Letters (New York, Russell Sage Foundation, 1913).
3. Nicholson, Anne, A Speller for the Use of the Teachers of California (California State Printing Office, Sacramento, 1914), containing the following investigations:
   a. McFadden, Effie, and Burk, Frederic, Ninety-one Friends' Letters (1914);
   b. Social letters of the members of the Parents' Association, Normal Training School, San Jose, California;
   c. 100 letters from the California Barrel Company;
   d. 400 letters from the Emporium, San Francisco, and Hale's Department Store, San Jose, California.
4. Cook, W. A., and O'Shea, M. V., The Child and His Spelling (Indianapolis, The Bobbs-Merrill Company, 1914).
5. Andersen, W. N., Determination of a Spelling Vocabulary Based upon Written Correspondence (Iowa City) ( = University of Iowa Studies in Education, Vol. II, No. 1) (1921).
6. Houser, J. D., "An Investigation of the Writing Vocabularies of Representatives of an Economic Class", Elementary School Journal, Vol. XVII (1916-1917), pp. 708-718.
7. Clarke, W. F., "Writing Vocabularies", Elementary School Journal, Vol. XXI (January, 1921), pp. 349-351.
8. Horn, Ernest, "The Vocabulary of Bankers' Letters", English Journal, Vol. XII, No. 6 (June, 1923).
9. Horn, Ernest, The Vocabulary of Highly Personal Letters, 1922. Unpublished.8

In the preceding studies there was a total of about 865,000 running words represented. This compilation (begun in 1919) is referred to as the compilation of 1922.

1 Edward L. Thorndike, The Teacher's Word Book (New York, Teachers College, Columbia University, 1921), p. iii.
2 Edward L. Thorndike, A Teacher's Word Book of the 20,000 Words Found Most Frequently and Widely in General Reading for Children and Young People (New York, Teachers College, Columbia University, 1932).
3 Edward L. Thorndike and Irving Lorge, The Teacher's Word Book of 30,000 Words (New York, Teachers College, Columbia University, 1944).
4 Ibid., p. ix.
5 Irving Lorge, "Word Lists as Background for Communication", Teachers College Record, VI (1944), p. 546.
6 Irving Lorge and Edward L. Thorndike, A Semantic Count of English Words (New York, Teachers College, Columbia University, 1938).
7 Irving Lorge, The Semantic Count of the 570 Commonest English Words (New York, Teachers College, Columbia University, 1949).
8 Ernest Horn, A Basic Writing Vocabulary (Iowa City, Iowa, 1926), p. 7.


Then with a grant from the Commonwealth Fund Horn began to make new investigations of writing vocabulary. Horn classifies the new investigations as follows:

1. The Nature and Extent of the Vocabulary of Business Correspondence.
2. The Nature and Extent of the Vocabulary of Personal Correspondence.
3. The Vocabulary of the Letters of People of more than Average Literary Ability.
   a. The Vocabulary of Letters of Well-known Writers.
   b. The Vocabulary of Letters Printed in Magazines and Metropolitan Newspapers.
4. The Nature and Extent of Vocabularies of Letters of Application and Recommendation.
5. The Vocabulary of Adult Writing Needs other than Correspondence.
   a. The Vocabulary of Minutes, Resolutions, and Committee Reports.
   b. The Vocabulary of Excuses Written to Teachers by Parents.
6. The Vocabulary of the Letters of a Single Individual.9

In A Basic Writing Vocabulary all words, including slang, colloquial, and supposedly obsolete words, were recorded except words of less than four letters, proper nouns, and forty-one high frequency words. The probable frequencies of the omitted words were estimated (for proper nouns only the months and days of the week) using the data gathered in the compilation of 1922. Throughout the study Horn is careful to explain all the steps used in tabulating the huge amount of data which he examined. Thus the reader may understand fully what was done in the investigation and how the results were obtained. This gives to the reader a certain confidence in the work not always felt when examining other studies. Concerning this word list, Fries and Traver say, "This tremendous work of Horn was the greatest of all spelling and writing lists, and remains the definitive work in this particular area of word counting."10

George Kingsley Zipf of Harvard University is mentioned here, not because he made any significant word counts, but rather because he had a genius for utilizing their results.
Despite the fact that one of his fundamental principles (that of the relation between word frequency and rank-order) has recently been discredited and that the reader is not always absolutely convinced of the correctness of his conclusions, still Zipf constantly impresses one by his skill in analysis and ability in presentation. Zipf's major contribution was that he was able to regard speech as a natural phenomenon and investigated it as such. His The Psycho-Biology of Language and Human Behavior and the Principle of Least Effort were both early applications of the statistical method to the investigation of language, and many of his statistical tabulations were based upon the materials in word counts.

9 Ibid., p. 23.
10 Charles C. Fries and A. Aileen Traver, English Word Lists (Ann Arbor, The George Wahr Publishing Co., 1950), p. 18.

II GENERAL PLAN OF THE PRESENT STUDY

In the present study we wanted to do much as Zipf had done, that is, we wanted to base this investigation on one of the extensive word counts which was available to us. It was felt that if a suitable word count could be found, then the words could be converted into their segmental phonemes. By having a speaker use the words in sentences rather than calling out the words in their artificial, citation forms, the phonemes thus abstracted from the stream of speech would be the phonemes ordinarily used by the speaker of an idiolect. It was obvious from the beginning that the writer's own idiolect would not be a suitable one on which to base this study. The reasons governing this decision were as follows: (1) A study based on my idiolect would not be as readily comparable to other studies which have been made in the past. (2) Almost one-third of my life has been spent outside my native dialect area (Northeast Florida) and this has, no doubt, wrought certain changes in my idiolect which would cause it to be somewhat atypical. (3) The element of subjectivity could be better overcome, perhaps, in analyzing another's idiolect. Therefore, it was decided to make use of an informant from a dialect area other than my own and one which had a larger geographic spread. When the informant had put the words from the word count into sentences and when the segmental phonemes had been recorded, we would have a count of phonemes based upon a count of word frequencies, rather than upon a count of running words. Therefore, it would not be necessary to count millions of running words or sounds in order to obtain a reliable estimate of the frequency of the segmental phonemes of English. Once the frequencies of the phonemes were determined, the phonemes and phoneme sequences could be studied in relation to their frequencies. 
And, since the phonemes and phoneme sequences were not expected to be the same uniformly, because of the mixed composition of the lexicon, it was felt that they could most profitably be studied by their frequencies per thousands. One other study which was needed was an etymological count of words of single or multiple components by thousands of frequency. This study should get (in the standard lexicographical manner) at the language of proximate origin rather than the


language of ultimate origin. Again, because of the mixed lexicon of English, the etymological composition of the language would be expected to vary considerably by thousands of frequency. Owing to the amount and nature of the information desired, the digital computer was reckoned to be the only means by which this study could be accomplished. The digital computer was well suited for use in the present study in that it can sort, compile, and compute. Within a matter of minutes it can scan the entire corpus, pick out certain combinations, total how many times they occur, compute their absolute and relative frequencies, arrange them in order of frequency, and print the results with far greater accuracy than any investigator or team of investigators could. For example, Godfrey Dewey, The Relativ Frequency of English Speech Sounds, p. 13, mentions that the analysis of sounds according to their occurrences initially, medially or finally in words and syllables took about 720 hours to perform. Work of a similar nature in this study took about thirty minutes from the time that the CDC 1604 began its searching until the results were printed.
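
The plan described above, a phoneme count derived from a word-frequency list rather than from running text, can be sketched in a few lines of Python. This is only an illustration of the idea and not the program run on the CDC 1604; the function name, the ASCII phoneme labels, and the three toy entries are all hypothetical.

```python
from collections import Counter


def phoneme_frequencies(lexicon):
    """Weight each phoneme by the frequency of the word it occurs in.

    `lexicon` is an iterable of (phonemes, word_frequency) pairs, where
    `phonemes` is a sequence of phoneme labels for one transcribed word.
    Returns rows of (phoneme, absolute count, relative frequency),
    ordered from most to least frequent.
    """
    counts = Counter()
    for phonemes, word_frequency in lexicon:
        for phoneme in phonemes:
            counts[phoneme] += word_frequency
    total = sum(counts.values())
    return [(p, n, n / total) for p, n in counts.most_common()]


# Hypothetical toy entries (not data from the study).
toy_lexicon = [
    (("dh", "a"), 10),
    (("k", "ae", "t"), 3),
    (("t", "a"), 2),
]

table = phoneme_frequencies(toy_lexicon)
for phoneme, count, share in table:
    print(f"{phoneme:>3} {count:>4} {share:.3f}")
```

Because every phoneme inherits the frequency credit of its word, one pass over the word list stands in for a count over millions of running words.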

III PREPARATORY WORK

Choice of System of Phonemic Notation

The system of phonemic notation followed most closely in this study is that used by W. Nelson Francis. 1 This system is based upon that of G. L. Trager and H. L. Smith. 2 The wide acceptance of this system by linguists in this country far outweighed the disadvantage which accrued from its use, viz. that the results would not be initially comparable, since most of the previous studies were done prior to the time of Trager and Smith's formulation.

Selection of Informant

Mr. Donald C. Green, a graduate student in English at the University of Wisconsin, offered to serve as the informant for this study. Mr. Green, a native of Minnesota, speaks a North Central variety of "General American". He has never resided outside the Midwest, and therefore his idiolect is relatively free from foreign influences. In addition to being an educated speaker of an idiolect with few, if any, outside admixtures, Mr. Green also is representative of what might be termed, for want of a better word, the "careful" speaker, e.g., his /fifθ/ is my /fiθ/ fifth, his /kaeptan/ is my /kaepam/ captain, etc. Despite the impression that Mr. Green gives of exercising care in speaking, he is not inclined to overcorrect. (The only utterances in the entire corpus which impress me as being overcorrections are those forms ending in -day, i.e., /mandey/, /wenzdey/, /saetardey/ instead of /mandi/, /winzdi/, /saetardi/ as in my idiolect.) He consistently produces the traditional pronunciations of words which are often overcorrected by spelling-pronouncers: often is /ofan/, forehead /farid/, toward /tord/. Furthermore, he apocopates, syncopates, and metathesizes, as one would normally do in informal speech (e.g., /swif/ swift, /saprayz/ surprise, /fufil/ fulfill, /tabrbli/ tolerably, /winsiyld/ windshield, /purtiyar/ prettier, /ankamftarbal/ uncomfortable).

1 The Structure of American English (New York, The Ronald Press Co., 1958).
2 An Outline of English Structure (Norman, Oklahoma, Battenburg Press, 1951).


Informant's Phonemics

After the informant had been selected, the next step was to determine the phonemics of his idiolect. The traditional search for minimal pairs began. As was to be expected, the consonants posed no serious problem, but the vowels were quite another matter. Tape recordings of the informant's idiolect were used to check on troublesome sounds, and several sessions with the informant, Professor Frederic G. Cassidy and Professor Gerald B. Kelley in which vocoids were elicited and re-elicited established the existence of the following vowels, all of which can occur without a following offglide.

    i         u
    e    ə    o
    æ    a    ɔ

The consonants were the following:

    Stops:              /p t k b d g/
    Affricated stops:   /č ǰ/
    Fricatives:         /f θ v ð/
    Sibilants:          /s š z ž/
    Nasals:             /m n ŋ/
    Lateral:            /l/
    Semivowels:         /y h r w/

Recording the Corpus

Mr. Green was instructed to put each word in the Horn list into what he felt to be a normal sentence frame. Emphatic or other unnatural stress was to be avoided. The results of this method were better than had been expected, for the mental effort required to put these words rapidly into normal sentence frames apparently minimized the premeditated use of prestige pronunciations. On several occasions the informant, realizing that he had uttered an unprestigious form, stopped to remark that he did not always pronounce the word in this manner. An example of this was the pronunciation /ædvərtayzmənt/ advertisement. The prestige pronunciation of this form was /ædvərtismənt/.

Phonemic Transcription of the Corpus

The following principles were established as guides to be followed throughout the process of phonemic transcription: (1) Only what was heard was to be recorded and (2) care was to be taken not to "hear" something merely because it was expected to


be heard. How successful I was in abiding by these principles, I cannot tell; but, at least every effort was made to put them into practice. Adherence to the first principle led to the recording of missing as /misin/ although this was the only example in the entire corpus of a present participial ending which did not end with the velar nasal /ŋ/. Nevertheless, it was recorded as pronounced and was punched on the cards as /misin/ (much to my delight and to my informant's chagrin). Many times it was almost impossible to determine what was heard or, indeed, if anything was heard at all. In many instances the distinctions between unstressed /a, u/ and /i/ in rapid speech were almost imperceptible. Often the presence or absence of glides was extremely difficult to detect, and the tape was replayed many times before a final decision was hesitatingly made. Following the second principle meant that I had to double-check those distributions in the informant's idiolect which were different from my own, e.g., /e/ before nasals.

Recomputation of Phoneme Frequencies

It was apparent even before the phonemicizing of the corpus was underway that some method would have to be devised for proportioning the frequencies of words which had the same spelling but different pronunciations. Irving Lorge and Edward L. Thorndike's A Semantic Count of English Words provided a ready and rather reliable solution to this problem. As an illustration, I shall begin with abuse. In the Horn list abuse occurred seventy-six times. But since Horn's is not a semantic count, there is no way to determine how many times out of the seventy-six occurrences the form was a noun and how many times it was a verb, i.e., how many times the form is /abyuws/ and how many times /abyuwz/. Consulting A Semantic Count, which gives the frequency per thousand occurrences, one finds that the verb occurred 318 times per mille and the noun 644 times per mille (the remainder of 38 [actually 37] refers to abused, which, since it was counted separately by Horn, can be disregarded). Simple mathematical proportioning yields the answer:

    /abyuws/ (noun)    51
    /abyuwz/ (verb)    25
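
The proportioning for abuse can be checked with a short Python sketch. The function name `apportion` is hypothetical, and simple rounding is assumed here as the way fractional credits were resolved; the text itself says only that "simple mathematical proportioning" was used.

```python
def apportion(total, weights):
    """Split an integer total in proportion to the given weights.

    Each share is rounded to the nearest whole number. With other inputs
    the rounded shares need not add up to `total` exactly; this sketch
    does not try to correct for that.
    """
    denominator = sum(weights.values())
    return {key: round(total * w / denominator) for key, w in weights.items()}


# Figures quoted in the text: 76 occurrences of "abuse" in the Horn list,
# and per-mille frequencies of 644 (noun) and 318 (verb) from A Semantic Count.
shares = apportion(76, {"noun": 644, "verb": 318})
print(shares)
```

The residue assigned to abused is left out of the weights, since the text notes that Horn counted that form separately.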

The problem of recomputation of frequencies was a little more involved when the same part of speech happened to have different pronunciations. I shall illustrate how this was done with bow. Under this entry A Semantic Count gives a separate tabulation for three nouns and one verb. In order to distinguish among the three nouns, one must consult the OED (1889), to which A Semantic Count is keyed. The first noun is the bow with which one shoots arrows, the second noun is the bow which one makes from the waist, the third noun is the bow which is the forward part of a ship. The verb, of course, is the inclining from the waist.


A Semantic Count entry looks (in part) like this:

    sb1 [/bow/]
        senses:     1    4    4c   6    16b  16c  19
        per mille:  006  353  006  013  006  006  032

    sb2 [/baw/], sb3 [/baw/], v1 [/baw/]
        senses:     1 3 2 3 4 5 6 7 7b 8c 9 11 11b 12
        per mille:  109 064 019 006 013 006 038 103 122 006 006 006 038 013 013
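
The per-mille figures in this entry can be totalled by pronunciation with a few lines of Python. This is a sketch for checking the arithmetic only: the figures are grouped simply as /bow/ versus /baw/, and no attempt is made to assign individual figures to particular senses.

```python
# Per-mille figures copied from the entry: the first seven belong to
# sb1 (/bow/); the remaining fifteen belong to sb2, sb3 and v1 (/baw/).
per_mille = {
    "/bow/": [6, 353, 6, 13, 6, 6, 32],
    "/baw/": [109, 64, 19, 6, 13, 6, 38, 103, 122, 6, 6, 6, 38, 13, 13],
}

totals = {pron: sum(values) for pron, values in per_mille.items()}
grand_total = sum(totals.values())

for pron, subtotal in totals.items():
    print(f"{pron}: {subtotal} per mille, {subtotal / grand_total:.1%} of the two")
```

The two subtotals give the ratio by which a word's frequency in the Horn list is divided between its two pronunciations.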

Thus it may be determined that the proportion of /bow/ to /baw/ is 422:562, and the frequency from the Horn list may be adjusted accordingly. Abbreviations were recorded exactly as they were pronounced by the informant; when not pronounced differently from the full form, their credits were added to the full form. Thus the abbreviation for advertisement was recorded as /aehd/ and A.M. was recorded as /eyem/, while the frequency credits for ans. were added to answer and those of Apr. were added to April. No attempt was made to consolidate or combine homophones, such as ladies, ladies', lady's, and they were entered each time they occurred in the Horn list. Little could have been gained by attempting to consolidate homophones, e.g., right and write, and much would have been lost (not the least of which was time). The recomputation of variants which could not be determined from either A Semantic Count or Lorge's The Semantic Count presented something of a problem. For example, the frequency credits for the, which varies as it is stressed or unstressed and in its occurrence before a vowel or consonant, would be impossible to compute using The Semantic Count. Godfrey Dewey says, "To consider 10 % of the occurrences of each of these 2 words [the and a] as named or emfatic or accented is a maximum


assumption so far as I can deduce from observation and the estimates of others. As the proportion of initial vowels to initial consonants is roughly 1 to 2, the unemfatic 90% of occurrences of the may properly be distributed as 30% ði, and 60% ðə."3 Dewey's estimate that the emphatic the /ðiy/ accounts for 10% of the occurrences of the most certainly is far too high. Since his figures struck me as being unreliable, another means of recomputation had to be found. The means decided on was as follows: I played back the recorded tapes from the entry ha to immensely and from writer to zoology recording every major variant of the which occurred in the sentence frames. The 149 occurrences of the had the phonemic shapes below:

    /ðə/   (before consonants)   127 occurrences
    /ði/   (before vowels)        22 occurrences
    /ðiy/  (emphatic)              0 occurrences

The frequency credits for the in the Horn list were therefore proportioned between /ði/ and /ðə/ and entered on the lab sheets (/ði/ 82,773 and /ðə/ 477,828). The same method was used to proportion the frequency credits for he (/hiy/ and /hi/), her (/hər/ and /ər/), him (/him/ and /im/), his (/hiz/ and /iz/), she (/šiy/ and /ši/), and them (/ðem/ and /ðəm/ [/əm/ did not occur]). Recomputations for several other reduced forms, such as those of will, shall, are, have, had, am, would, is, did not have to be made since they were entered in the Horn list with the words with which they combine (it's, I'd, I'll, etc.).

Choice of Etymological Authority

What was desired in the ultimate etymological authority was that it be recent and reliable, and give separate etymologies in the entries for cognate forms. This latter requirement was more than just a convenience, for a large number of words which would seem to be derived from a common head-word are actually cognates. The following dictionaries were examined according to the preceding criteria: Webster's New Collegiate Dictionary, 6th edition (Springfield, Massachusetts, G. and C. Merriam Co., 1953); The Concise Oxford Dictionary of Current English, 3rd edition (Oxford, 1934); W. W. Skeat, An Etymological Dictionary of the English Language, New and Rev. edition (Oxford, 1910); Thorndike-Barnhart, Comprehensive Desk Dictionary (Garden City, N.Y., 1958); Webster's New World Dictionary of the American Language, College edition (Cleveland, The World Publishing Co., 1958); The American College Dictionary (New York, Random House, 1959).

The comparison of these dictionaries was made as follows: Omitting abbreviations, proper nouns, plurals, and third person singular present verbs, I listed in order 250 words (a - admiring, each - edition, label - lantern, wade - warn) from The Teacher's Word Book of 30,000 Words. Two hundred of these words had a separate lexical and etymological entry given at least once among the six dictionaries. Two hundred was therefore the maximum score that any dictionary could make.

3 Godfrey Dewey, Relativ Frequency of English Speech Sounds (Cambridge, Harvard University Press, 1923), p. 124.

Dictionary                        Number of etymological entries    Score
American College Dictionary       161                               80.5%
The Concise Oxford                182                               91.0%
Skeat, Etymological Dictionary    141                               70.5%
Thorndike-Barnhart                135                               67.5%
Webster's New Collegiate          154                               77.0%
Webster's New World               187                               93.5%
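The scores in the table follow directly from the entry counts and the maximum of 200. A minimal sketch of that arithmetic follows; the variable names are illustrative only.

```python
MAX_SCORE = 200  # words with a separate entry in at least one dictionary

entries = {
    "American College Dictionary": 161,
    "The Concise Oxford": 182,
    "Skeat, Etymological Dictionary": 141,
    "Thorndike-Barnhart": 135,
    "Webster's New Collegiate": 154,
    "Webster's New World": 187,
}

# Score = share of the 200 possible etymological entries, as a percentage.
scores = {name: 100 * n / MAX_SCORE for name, n in entries.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:32s}{score:5.1f}%")
print("Highest score:", best)
```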

The importance of having the maximum number of etymological entries is illustrated below ( " F " is French, " L " is Latin, " O F " is Old French, " A F " is Anglo-French, " M F " is Middle French, " X " means that there is no separate lexical entry, "—" means that there is no etymological entry) :

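[Table: etymologies given by each dictionary for ten words. Columns: ACD, Oxford, Skeat, Thorndike-Barnhart, Webster's New Coll., Webster's New World. Only the Webster's New World column can still be aligned with its rows.]

Word            Webster's New World
persist         F
persistent      L
persuade        F
persuasion      L
prosper         F
prosperous      L
public          L
publicity       F
recollect       L
recollection    F

[Entries for the other five dictionaries, in the order extracted and no longer assignable to cells: L; L X L L F F F X L L; L L OF L L —; L —; F X F F MF L F F F X; L; F; —; L OF L L; F F F AF F; L.]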

The list could be expanded tenfold if there were any necessity for it; however, these ten words seem to demonstrate amply the desirability of having the maximum number of separate lexical entries and etymological entries. Webster's New World Dictionary was selected as the authority for the etymologies of the words in the Horn list.

Manner of Recording Etymological Sources

The following criteria were adhered to in the recording of etymological sources:

1. The source of the immediate, rather than the ultimate, etyma was recorded. Thus telegraph is assigned to French rather than to Greek. Academy, which proceeded from Greek to Latin to French to English, is recorded as French.

2. Words listed as echoic in origin were credited to Anglo-Saxon since it seemed fairly certain that they had had their origin in English speech.

3. The source of derivational affixes was recorded, although the source of inflectional affixes was not. Although there is still not complete agreement among linguists as to exactly what the distinctions are between the two types of affixes, I separated the two types in accordance with criteria stated by Francis: "Those suffixes which must always come at the end of the morpheme groups to which they belong we will call inflectional suffixes. Those which may be followed by other suffixes we will call derivational suffixes." 4 Francis also states that all prefixes are derivational. 5 A base to which an affix has been joined is indicated by the symbol "+". If both the base and affix are from the same source, the "+" is omitted. Therefore, roadster is recorded simply as Anglo-Saxon, although it is as certain that King Alfred never had one as it is that Plato never used a telephone. Anglo-Saxon means, therefore, Anglo-Saxon or Anglo-Saxon forms combined in later stages of English. The exclusion of inflectional affixes from consideration may seem arbitrary or subject to criticism, but their inclusion would have enormously increased the already large number of combinations. Since all the inflectional affixes are Anglo-Saxon in origin, they may be allowed for (if one wishes to do so) by increasing the Anglo-Saxon element all along the line, since the use of inflection may be expected to be uniform for all parts of the lexicon.

4. It was decided to record with "+" those few instances in which a change in spelling altered pronunciation:

blot        AS + French
dawn        AS + Old Norse
endorse     French + Latin
example     French + Latin
examples    French + Latin
fault       French + Latin
faulty      French + Latin
glance      French + Dutch
schedule    French + Latin

Those instances in which a spelling change was made but in which the pronunciation remained the same (crescent, debt, debtor, debts, doubt, liquor, receipt, scissors) are not indicated by "+".

5. Distinctions were preserved as to whether the establishment of the source of a word was definite or probable and whether the word came from two sources or from either one or the other of two sources, though it is impossible to determine exactly which one.

6. The arbitrary formations gas, kodak, and quiz were recorded simply as "arbitrary formations".

7. Various stages in the development of a language were disregarded. Old French, Middle French, and Modern French, for example, are all recorded as French.

4 Francis, p. 197.
5 Ibid., p. 199.

Recomputation of Etymological Frequencies

Since many morphemes in English are phonemically identical although they have different etymologies, it was necessary to find a method for adjusting the frequencies in the Horn list. The same method used for computing phoneme frequencies was available for this purpose. For example, the verb race is from Old Norse, while the noun race is from French. By consulting A Semantic Count, the frequencies of occurrence per mille for the noun and verb were found and the frequency in the Horn list was easily proportioned between the two sources: French 160, Old Norse 115.

Choice of Word Count for Use in this Study

The Teacher's Word Book of 30,000 Words: Since Thorndike is usually the name that comes to mind when one mentions "word count", it was perhaps natural that we first thought of using The Teacher's Word Book of 30,000 Words as a basis for this study. Here was a word count which was based on counts of an enormous number of running words (around 20,000,000). It listed a large number of these words (30,000) and it also included abbreviations. Furthermore, due respect had been paid to the concept of "range" in the tabulation of the results. ("Frequency" refers to how often a word is used, while "range" refers to how widely it is used. In general, a word's importance is determined by both these considerations.) All the preceding characteristics of this count recommended this work for use in the present study.

Despite these advantages, The Teacher's Word Book of 30,000 Words had some decided shortcomings:

1. The frequency of occurrence of all words in the count was not given.
For the first 1,069 most frequent words (those occurring 100 or more times per million) no frequencies are given; they are merely marked as "AA". The 952 next most frequent words (those occurring 50 to 99 times per million) are marked simply as "A". Thus the frequencies of occurrence of the commonest 2,021 words could not be obtained from this count.

2. It was apparent quite early that one of the count's major disadvantages for use in this investigation stemmed from the nature of the sample. It was desirable that this study be based on a count which represented something akin to "everyday" English. Needless to say, The Teacher's Word Book of 30,000 Words was not intended to record this type of vocabulary. A brief examination of the count reveals that the influence of the materials (the Bible and English classics) of the original count (1921) has not been overcome despite the incorporation of millions of words from other counts. The literary and learned words are disproportionately numerous in comparison with the common, ordinary words. Among the 1,069 most frequent words one finds the vocative O. O'er has a frequency of 18 occurrences per million. Ope is listed among the first 20,000 words, as are mustachio, Mussulman, myrrh, naiad, nuncio. Pecan and popcorn are of the same frequency as ambrosia (1 occurrence per million), as are such words as pangolin, pard, Pegasus, Peloponnesian, Pendragon, petiole, Petrarch, Phidias, Phrygia, pistillate, Plantagenet, Pleiad, Poincaré, polyp, poniard, and porphyry. Perhaps the best indication of the learned quality of the sample is that ibid. is recorded as one of the 2,000 most frequent words.

3. Another shortcoming of the sample was that much of it was not recent. The samples from the Bible and the works of Cowper, Pope, Milton, Dryden, DeFoe, Boswell, and Gibbon would be enough to give the count a somewhat archaic flavor. If one should desire proof from the list itself, he need only compare roadster (5) with chariot (18), motorcycle (2) with palfrey (3), rifle (31) with spear (40), bayonet (9) with lance (16), or schooner (11) with steamship (9). Lo (18) and sayeth (2) are additional evidence of the archaic nature of the sample.

4. Perhaps the most serious defect of The Teacher's Word Book of 30,000 Words was that the unit of entry was the lexical or "dictionary" unit, rather than the word. Concerning these two general types of entry Lorge says, "The methodology of word counts, however, must consider the basic unit of entry. In the Kaeding count, for instance, the unit of entry was the word in its fully inflected form so that Buch, Buches, Bücher and Büchern were separately tabulated. In contradistinction to the words as a unit of entry, many modern word-counters have used the lexical unit as the basic unit of entry. When words are entered as lexical units, the four inflected forms of Buch would be entered under Buch. In English, give, gives, and given would be tabulated under give as the basic entry." 6 In this word count "Regular plurals, comparatives and superlatives, verb forms in s, d, ed, and ing, past participles formed by adding n, adverbs in ly that occur less than once in a million words, and equally rare adjectives formed by adding n to names of places are ordinarily counted in under the main word." 7 It is obvious that if this count were to be used as the basis for a determination of the frequency of English segmental phonemes, the relative frequencies of /s, z, t, d/ and /iŋ/ would be too low.

5. Another shortcoming of The Teacher's Word Book of 30,000 Words was that it included proper names in the count. Many of these names are from Classical antiquity or from other foreign sources. Therefore it was felt that the inclusion of these names would be a liability in attempting to determine the frequency of American English segmental phonemes.

6 Lorge, "Word Lists", p. 545.
7 Lorge, 30,000 Words, p. ix.

For the five reasons given above, The Teacher's Word Book of 30,000 Words was deemed to be unsuitable for the purposes of the present investigation.

Six Thousand Common English Words: The next word count examined to determine whether or not it would be suitable for use as the basis for this investigation was that by R. C. Eldridge, Six Thousand Common English Words (Niagara Falls, 1911). Unlike the previous count, Eldridge's study used the word as the unit of entry rather than the lexical unit, e.g., act, acts, acting. This method of entry was desirable from our point of view. Eldridge's study gave the frequency for each word; thus it would have been possible to compute the frequencies of all the phonemes which occurred in this word count. Eldridge's count also was suitable from the point of view of the recency of the sample. Since the count was made in 1909, no archaisms should be present. Eldridge's count also recommended itself for this study inasmuch as it did not include proper names.

Unfortunately, however, Eldridge's count had a number of shortcomings which impaired its use as the basis for the present investigation: (1) A total of 43,989 running words could scarcely be considered as "extensive". (2) A total of 6,002 words, although perhaps enough to work with, could hardly be termed "large". (3) The samples counted probably represent "everyday" newspaper English very well, but that they represent "everyday" English is highly doubtful. For instance, the vocabulary runs rather heavily to words associated with crime and politics, the staples of newspaper writing. Among the 238 most frequent words are the following: court, police, committee, case, country, tax, body, killed, state, law, national, Senate. (4) Eldridge did not apply the principle of range to his study, for, indeed, with the nature and small number of his samples, it would have been impossible for him to do so. Eldridge's count was not found to be suitable for use in the present investigation.
However, one hates to leave this pioneer work without praising Eldridge for this study which is still, after fifty years, of value and interest.

The Vocabulary of College Students in Classroom Speeches: This count had as its goals "to sample and appraise the formal speaking vocabulary of young men of college age and to specify the over-all vocabulary of a group of 274 students in 607 classroom speeches". 8 The speeches of the students were recorded on discs and played back and transcribed by typists. From these transcriptions the frequency list was compiled. "The total sample was 288,152 word symbols; these included 6,826 different words. The frequency of usage ranged from the (approximately 15,000 occurrences) to nearly 2,000 words that occurred only one time." 9 The authors present their findings in two lists - one with the words arranged alphabetically and the other with the words arranged by frequency.

This study would have been an excellent one to use as the basis for the present investigation since all the words were actually spoken rather than taken from the printed page. Despite the fact that it was at least as suitable as the word counts previously examined, it did not meet the requirement established for the basic unit of entry. All forms were transcribed by the typists as they were heard. Unfortunately for the purposes of this study, however, these words were then grouped into their appropriate lexical units. According to the authors, "... we simply followed Thorndike's principal rules as closely as possible in grouping forms for a more economical tabulation:

Different tenses of verbs were usually considered one word when the phonetic element of one tense appeared in another (look, looked; not see, saw). Exceptions were permitted in a few instances in which Thorndike had obviously deviated from his general practice.

Adverbs formed by adding ly were not entered unless this form was the only one to occur (soft, not softly).

Comparatives and superlatives formed by adding er, est, r, and st were considered with their related primary words (long, rich; not longer, longest, richer, richest).

Past participles formed by adding n to verbs, and adjectives formed by adding n to proper nouns were considered with their related root words (grow, America; not grown, American).

Plurals formed by adding s and es were combined with the singular form.

Homographs were treated as single words (tear, fluid from the eye, and tear, to pull apart).

Contractions, slang, abbreviations, and neologisms were preserved." 10

Thus, the grouping of the words under lexical entries made it impossible to determine the frequencies of the segmental phonemes.

A Study of the Oral Vocabulary of Adults: This impressive work was designed to aid in the teaching of English to immigrants in Australia. The method in which the raw material for the count was collected is of considerable interest: "Workers were interviewed and tape recordings made of their speech. Month after month recordings of workers' speech were made from a variety of sources until over a half a million spoken words from 3,000 workers (men and women) in 1,500 different everyday work, street and home situations had been collected. All the while the pages of shorthand records and spools of tape recordings were being typed for treatment yielding 1,300 foolscap pages of typescript. The words and expressions on these 1,300 pages of material were just as carefully and laboriously counted and classified according to special rules of function and structure. For four years the work went on and involved much time and thought on the part of many people - and a cost of over 4,000 pounds." 11

8 John W. Black and Marian Ausherman, The Vocabulary of College Students in Classroom Speeches (Columbus, Bureau of Educational Research, Ohio State University, 1955), p. 1.
9 Ibid., p. 5.
10 Black and Ausherman, p. 5, quoting E. L. Thorndike's A Teacher's Word Book of the Twenty Thousand Words (New York, 1932), pp. iv and v.
11 F. J. Schonell, I. G. Meddleton, B. A. Shaw, et al., A Study of the Oral Vocabulary of Adults (Brisbane, University of Queensland Press, 1956), p. 6.


Here was a study, based on spoken English, which listed the actual frequency of occurrence. It was based on a large number of running words (512,647) from recent samples of English. These samples of informal speech are by far the best available for the English language. The principle of range was adequately followed, and proper names except for the months, days, and a few others were excluded. All these characteristics of the count were desirable from the standpoint of this study.

However, this word count had some features which made it unsuitable to serve as the basis for the present investigation:

1. Some words on the list which are widely used in Australia have no currency in this country. Actually, the number of these words, e.g., quid, pub (hotel), bloke, local (nearby hotel), is quite small, and this shortcoming, in itself, would not be significant.

2. Although 12,611 "word-forms" and 4,539 head-words were found in the count of 512,647 running words, only about 6,000 words in these two categories were listed in this publication.

3. Although praiseworthy in other respects, the method of classifying and listing the words did not permit the use of this list in the present investigation. As an example of the method of classification and presentation of the materials of the study, I cite the following entry:

PACK (v.)          18
Packs (v.)          2
Packed (v.)        15
Packing (v.)        1
To pack (inf.)      7
Pack (n.)           6
Packing (n.)       10
Packer (n.)         2
Packers (n.)        9
Packet (n.)        12
Pack-horse (n.)     1
Pack (adj.)         2
Packing (adj.)      5
Packed (adj.)       1
                   94

Concerning this entry, the authors say, "According to our rules we must consider this as four different words, while it is counted as one head-word and fifteen different word-forms" (p. 48). The first thousand head-words were listed as above in alphabetical order. Each group of head-word and "word-forms" had a total frequency of over 21. One may see that although the most frequent head-words are listed, the various "word-forms" which constitute the entry grouping are of mixed frequencies, many of them having a frequency of occurrence of only 1 or 2. Doubtless, quite a few "word-forms" which occurred in the running count and which do not appear in the word list had higher frequencies of occurrence than the "word-forms" which are listed. What was needed for the present study was a word list which gave the commonest "word-forms" rather than the most frequent head-words.
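The difference between a head-word listing and a word-form listing can be made concrete with the PACK entry cited above. The sketch below is illustrative only: it uses spelling as a rough stand-in for phonemic shape, includes only the forms and frequencies shown in the entry, and the helper name `ending_total` is hypothetical.

```python
from collections import Counter

# (word-form, class, frequency) as shown in the PACK entry.
PACK_ENTRY = [
    ("PACK", "v", 18), ("Packs", "v", 2), ("Packed", "v", 15),
    ("Packing", "v", 1), ("To pack", "inf", 7), ("Pack", "n", 6),
    ("Packing", "n", 10), ("Packer", "n", 2), ("Packers", "n", 9),
    ("Packet", "n", 12), ("Pack-horse", "n", 1), ("Pack", "adj", 2),
    ("Packing", "adj", 5), ("Packed", "adj", 1),
]

# Word-form view: one frequency per distinct spelled form.
by_form = Counter()
for form, _word_class, freq in PACK_ENTRY:
    by_form[form.lower()] += freq

def ending_total(suffix):
    """Occurrences whose spelled form ends in `suffix`."""
    return sum(f for form, f in by_form.items() if form.endswith(suffix))

print(by_form.most_common())
print("-ing:", ending_total("ing"), " -ed:", ending_total("ed"),
      " -s:", ending_total("s"))
```

If only the head-word total were available, the occurrences carrying -ing, -ed, and -s could not be separated out, which is why a head-word list cannot support a count of the phonemes in those endings.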

Review of the Criteria for Selection of Word Count

At this stage in the investigation, four word counts had been examined, and all had proven to be unsuitable. However, more definite ideas had been formulated as to exactly what was wanted in a word count. It was highly desirable that the word count selected meet the following requirements:

1. The actual frequency of occurrence of all words entered in the list should be shown. Some lists, e.g. Interim Report on Vocabulary Selection, The Institute for Research in English Teaching (Tokyo, 1930), are compilations of "important" words rather than actual word counts. From such lists there is no way to determine the frequencies of occurrence of the words in the list. Other lists (e.g. Bongers' K.L.M. List and Thorndike's counts) give the frequency of occurrence only for the less frequently used words. The higher frequency words, in such lists, are merely given credit numbers which indicate their rank. In order to properly compute the frequency of segmental phonemes in American English, the actual frequency of each entry must be shown.

2. The word count should be based on a large number of running words. In general, the validity of the count becomes greater when more words are added to the sample examined.

3. The word count should list a large number of the words found in the sample. As these words were to be the raw material for this study, it was important that a large number of them be recorded. One early word counter, the Rev. J. Knowles, The London Point System of Reading for the Blind (London, 1904), counted 100,000 words but saw fit to publish only the 353 most common.

4. The sample upon which the count was based should consist of something like the "everyday" English of the American adult. Word counts of the vocabularies of children, of the Bible, or of other specialized sources were too narrow for use in the present investigation.

5. The word count should be based on samples of living English. The inclusion of archaic forms would invalidate the findings of this study.

6. The "word" should be the unit of entry, rather than the lexical unit or head-word. The listing of lexical units only would make impossible the determination of the frequencies of phonemes. The effect of the omission of plural endings and verb endings would be incalculable.

7. Proper names other than the names of the months and days should not be entered in the tabulations. Names are some of the linguistic forms most subject to variation from time to time and from place to place. Many foreign names have not yet been assimilated into English; e.g., although Schmidt, Schneider, Schlitz, etc. are not uncommon names in the U.S., they contain quite un-English phoneme sequences. Some of the entries beginning with "Fr-" in The Teacher's Word Book of 30,000 Words illustrate well the hazards of using proper names to determine English phonemes and phoneme sequences: Fra, Francesca, Francois, Franz, Frau, Fraulein, Fritz.

8. Due regard should be paid to the concept of range. Frequency, of course, refers to how many times the word occurred. Range refers to how many sources the word occurred in. It is easy to imagine what would happen if, in a small sample, one included Poe's "The Bells" or Kipling's "Boots". Boots occurs thirty-two times out of 265 running words in Kipling's poem. Without the consideration of range the effect of such a sample would be extremely distorting.

9. The word count should record abbreviations. In this acronymic age due regard should be paid to abbreviations inasmuch as they are quite commonly pronounced as they are spelled, rather than as the full forms. Frequency and phonotactic differences would be found insofar as a word count did or did not include abbreviations. For example, to omit A.M. entirely would be to omit a relatively frequent phoneme sequence. This could not but have its effect on the total frequency of segmental phonemes.

Through the process of trial and error the nine specifications above had been formulated. The "ideal" word count for use in this study would be the count that most closely conformed to these requirements. The process of examining word counts began once more with the examination of Ernest Horn's A Basic Writing Vocabulary.
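The distinction between frequency and range drawn in requirement 8 can be illustrated with a small sketch. The three sources below are invented toy data, not material from any of the counts discussed, and the function name `frequency_and_range` is hypothetical.

```python
from collections import Counter

def frequency_and_range(sources):
    """Return (frequency, range) tables: total occurrences of each word,
    and the number of distinct sources in which it occurs."""
    frequency = Counter()
    word_range = Counter()
    for words in sources.values():
        counts = Counter(words)
        frequency.update(counts)          # add occurrences
        word_range.update(counts.keys())  # add one per source
    return frequency, word_range

# Hypothetical sample: one source repeats a single word heavily.
sources = {
    "poem": ["boots"] * 32 + ["the"] * 5,
    "letter": ["the"] * 6 + ["garage"],
    "memo": ["the"] * 4 + ["garage"],
}

frequency, word_range = frequency_and_range(sources)
for word in ("boots", "the", "garage"):
    print(word, frequency[word], word_range[word])
```

In this toy sample "boots" has the highest frequency but a range of one, so a selection rule that weighs range would rank it below its raw count.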
A Basic Writing Vocabulary: This study by Ernest Horn met the first requirement in that it gave the actual frequencies of occurrence of all words in the word count. These frequencies ranged from 715,130 (1) to 11 (12 words). The frequencies for all but 372 words are based on actual counts; the frequencies for the 372 words are estimates.

The Horn list was based on a large number of running words (5,136,816). When the frequencies for the omitted words had been estimated and added to the lists, the total frequencies added up to more than 15,000,000. Therefore, the results of the Horn count approximate the results which would have been found had 15,000,000 running words actually been counted.

Horn's count lists the most frequent 10,000 words. This amply fulfills requirement #3.

How closely the Horn list approximated the everyday English of the American adult was an important question. The list was based on written English, whereas the present investigation was intent on determining the frequency of occurrence of phonemes in speech. According to the authors of A Study of the Oral Vocabulary of Adults, "It is a truism to say that most of us have three somewhat different vocabularies - a relatively wide recognition vocabulary, one less extensive for writing purposes, and a still narrower one for everyday usage" (p. 14). As Horn's list was based on written, not printed, matter, it represents the second type of vocabulary. Horn himself made a tentative comparison of the spoken vocabulary of children (the only study of spoken vocabulary available to him at that time) with his study and concluded that the two showed a large amount of overlap. Although the two studies are not readily comparable, A Study of the Oral Vocabulary of Adults and the Horn list have been scrutinized by the present writer. The conclusion is that the two studies show a considerable area of agreement.

To a certain extent one's experience has prepared him to judge the relative frequency of words. One does not need a word list to recognize which words are common words and which ones are rarely used words. Horn says, "There is perhaps a certain validation which comes from a common-sense judgment of the importance of the words in the lists. With a few exceptions the words in this list impress one as words likely to be needed." (p. 189). Of the 25 words cited in the discussion of Thorndike as examples of literary and learned words, none appear in the Horn list. On the other hand, popcorn, mentioned in contrast as a common word, appears in the Horn list with a frequency of 31. The fact that almost one-third (1,433,948) of the sample came from personal letters would probably cause the vocabulary to tend toward the common, informal, and colloquial, e.g.
babe, bacon, banana, bandage, bang, barley, baseball, basin, bathrobe, bathroom, beans, bedroom, beer, beet, belt, biscuits, blackberries, blanket, blouse, blues, body, boiler, bone, booster, boots, booze, boss, bottle, brakes, breakdown, breakers, breeches, breezy, broom, brunette, bubbles, bucket, buckle, bug, bumper, bunch, bunk, burg, bus, buttons, gallon, galvanized, gang, garage, garbage, garters, gasoline, gear, girdle, girlie, glasses, etc. Horn's samples from business and personal correspondence account for over half the running words counted. These samples, it seems to the present writer, would reflect the common, ordinary, everyday activities and interests in our society. For this reason and the reasons above, it is felt that the Horn list is about as close as one can come to the indefinable entity which we call "everyday English".

The samples which Horn used to make his count are comparatively recent ones; nearly all the samples are from this century. Although in the thirty-five years since the list was published some words have decreased in frequency of use, little or nothing in the list strikes one as being archaic. Submitting the Horn list to the same test that was given above to the Thorndike-Lorge list is instructive: roadster (26), chariot (-); motorcycle (24), palfrey (-); rifle (28), spear (-); bayonet (-), lance (-); schooner (-), steamship (58). Lo and sayeth do not occur. The Horn list fulfills the requirement that the count be based on samples of living English.

The word was the unit of entry in the Horn list, rather than the lexical unit or head-word. For example, nation, national, nationality, nationally, nation's and nations are all entered separately, as are said, say, saying, says. Thus this method of entry permits the determination of the frequencies of the segmental phonemes which make up the words.

Proper names, except for the months and days, were not counted in the Horn list. This was desirable from the point of view of the present investigation.

Horn rigorously applied the principle of range in his tabulations. The exact procedure which he followed is too detailed to explain here. 12 Suffice it to say that both the total frequency of each word and its spread among 65 different sources were important considerations in choosing the 10,000 most important words.

Horn's list fulfilled requirement #9 in that it recorded abbreviations found in the sample.

In addition to the desirable characteristics mentioned above, Horn carefully detailed every step of the procedure used in making his count. His careful explanation of all that he had done increased the present writer's confidence both in the word count and in his own selection of it as the basis of the present study.

12 For Horn's rules for selecting the 10,000 commonest words see A Basic Writing Vocabulary, pp. 50-53.

IV PROCESSING THE DATA

When it had been decided that the Horn list was to be used in this study, the informant put the words into sentence frames and recorded them on magnetic tape. The tapes were played back, and the words were transcribed phonemically. Then the etymologies of the words in the Horn list were looked up in Webster's New World Dictionary. The phonemic transcription was converted into numeric codes, as were the etymological sources. These codes and the frequencies of the words in the Horn list were then entered on large laboratory sheets.

From these sheets the input cards were first punched on the IBM 026 printing card punch and verified on the IBM 056 card verifier. The cards were then sorted on the IBM 083 sorter in order of frequency. The information on the cards was then transcribed onto a magnetic tape by the CDC 160. A short program for the CDC 1604 was developed which checked the images on tape for illegal codes, and produced a printed list of the words recorded into the alphabetic codes of the final output. This list of 10,065 words was checked completely by hand and corrections were made in the cards. They were again put on tape.

A total of seven 1604 FORTRAN programs was used to process the cards. When it was necessary to record a large number of combinations (e.g., when recording the occurrences of combinations of vowel, consonant, and semi-vowel), the table was built as the cards were read and enlarged when a new combination was found; parallel tables were kept for frequency and number of occurrences. This was feasible even when the tables became several thousand members long because of the great searching speed of the CDC 1604. The processing which involved a manageable number of possibilities (e.g., the phoneme frequency counts, which required a table of only 32 members) was done using fixed-size tables into which frequencies were added when the proper phonemes and etymologies came up. Output was obtained after each decile.
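The growing table with its parallel frequency and occurrence tables can be sketched as follows. This is a Python illustration of the idea only, not the FORTRAN programs actually used; the card layout, the combination labels, and the function name `tabulate` are assumptions made for the example.

```python
def tabulate(cards):
    """Build a table of combinations as the cards are read.

    `cards` yields (combination, frequency_credits) pairs. Three parallel
    lists are kept: the combinations in order of first appearance, their
    summed frequency credits, and the number of words showing them.
    """
    combos, freq, occurrences = [], [], []
    for combo, credits in cards:
        try:
            i = combos.index(combo)   # linear search of the table so far
        except ValueError:
            combos.append(combo)      # enlarge the table for a new combination
            freq.append(0)
            occurrences.append(0)
            i = len(combos) - 1
        freq[i] += credits
        occurrences[i] += 1
    return combos, freq, occurrences

# Hypothetical cards: a consonant-vowel pattern and the word's frequency.
cards = [("CVC", 120), ("CV", 300), ("CVC", 30)]
combos, freq, occurrences = tabulate(cards)
print(list(zip(combos, freq, occurrences)))
```

A dictionary keyed on the combination would be the idiomatic choice today; the parallel lists are kept here only to mirror the description in the text.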
All results were stored on two magnetic tapes and printed by the Analex printer.

Despite the fact that numerous precautions were taken to prevent and detect errors, some inevitably crept in. Examination of the entire output has revealed the presence of the following errors:

(1) merchandise /meriandays/ (frequency 2,393) was incorrectly recorded as /merdanaays/
(2) designated /dezigneytid/ (frequency 39) was incorrectly recorded as /dezigneytd/
(3) funniest /faniyest/ (frequency 32) was incorrectly recorded as /faniest/
(4) emphatic /emfaetik/ (frequency 27) was incorrectly recorded with an etymology of —5 instead of 06 (Greek).

The total frequency of the erroneous entries is 2,491; the total frequency of the corpus is 15,465,010. These errors make up only 0.016% of the total frequencies. Since the significance of these errors was so slight and the amount of time and money ($300.00 per hour for use of the CDC 1604) necessary to correct them was so great, another run on the computer did not seem worth making, and thus these errors are present in the final print-out. If desired, one can make adjustments for these errors throughout the appendices.

It should be mentioned here that three entries in the Horn list were not tabulated in this study because of their length. Owing to the amount of information fed into the CDC 1604, storage space in the computer presented something of a problem, and it became necessary to drop all words of over fifteen phonemes. (The term word, as used in this study, signifies any entry in the Horn list which is normally pronounced in American English. Thus Dr. is not considered to be a word, and its credits have been added to doctor; however, C.O.D. is considered to be a word and has been recorded as /siyodiy/. The advantage in using word in this way is that it subsumes an unwieldy variety of other terms, such as morphemes, minimal and non-minimal free forms, lexemes, idioms, etc.) The three entries dropped were:

Entry                Transcription          Origin    Frequency
self-explanatory     /selfiksplsenatori/    AS + L    63
misrepresentation    /misreprizenteysan/    French    18
superintendency      /suwprintendentsiy/    Latin     20

None of these entries have figured in any of the tabulations in this study. The code for the etymological sources appears below:

1: Anglo-Saxon (AS)
2: Arabic (Ar)
3: French (Fr)
4: Latin (L)
5: Spanish (Sp)
6: Greek (Gr)
7: Italian (It)
8: Dutch (D)
9: Norse (N)
10: Celtic
11: Swedish
12: Ar + AS
13: Fr + AS
14: L + AS
15: Sp + AS
16: Gr + AS
17: It + AS
18: D + AS
19: N + AS
20: Portuguese (Port)
21: AS + probably Low German (LG)
22: AS + LG
23: Fr and L + Fr
24: Fr + L + AS
25: Fr or Sp
26: Gr + Fr
27: Fr or It
28: AS + LG or D
29: N and AS
30: Persian
31: AS and Fr
32: Fr or L + AS
33: Fr or L
34: Fr and L
35: Sp and Port
36: Irish
37: It + Fr
38: LG or D
39: N + Celtic
40: Hindustani
41: Fr + AS and Fr
42: Chinese
43: Fr + L
44: L + prob. LG
45: Sp and Port + AS
46: Hawaiian
47: It and Sp
48: Fr and D
49: N or Fr
50: American Indian
51: AS or N
52: Gaelic
53: F + N
54: L + N
55: Sp and Fr
56: LG or Flemish
57: N and D
58: LG and D
59: N and Fr
60: Japanese
61: Celtic + AS
62: arbitrary formation
63: Fr + D
64: AS + F and L
65: Sp or Port
66: LG or AS
67: prob. LG
68: N or LG
69: AS and N + AS
70: German (G)
71: LG
72: LG + Fr
73: prob. F + AS
74: prob. AS and N
75: prob. N + AS
76: prob. Fr or LG
77: prob. F + N
78: prob. D and AS
79: prob. N and Fr
80: English Gypsy
81: prob. AS
82: prob. German
83: prob. Fr
84: prob. L
85: prob. Sp
86: prob. LG or D
87: prob. LG or D + AS
88: prob. D
89: prob. N
90: unknown or uncertain (?) + AS
91: prob. Welsh
92: prob. East Frisian
93: prob. East Frisian + AS
94: prob. L + F
95: prob. Walloon
96: prob. LG or Scandinavian
97: Scandinavian
98: prob. Mod. Scandinavian
99: unknown or uncertain
100: (not used)
101: Fr and It
102: L or It
103: LG + Gr
104: Gr + L
105: prob. Port

There were no occurrences of words with Codes 12, 15, and 20. Code 100 was not used.
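The share of the corpus affected by the four transcription errors reported earlier in this chapter can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the frequencies and the corpus total are those quoted in the text, while the variable and function names are hypothetical.

```python
# Frequencies of the four erroneous entries, as quoted in the text.
error_frequencies = {
    "merchandise": 2393,
    "designated": 39,
    "funniest": 32,
    "emphatic": 27,
}

# Total frequency of the corpus, as quoted in the text.
CORPUS_TOTAL = 15_465_010


def error_share_percent(errors, corpus_total):
    """Return the summed error frequency and its share of the corpus in percent."""
    total_errors = sum(errors.values())
    return total_errors, 100.0 * total_errors / corpus_total


total_errors, share = error_share_percent(error_frequencies, CORPUS_TOTAL)
print(total_errors, round(share, 3))
```

The sum and the percentage agree with the figures of 2,491 and .016% given above.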

V RESULTS OF THE INVESTIGATION

It is hoped that the results of this investigation will help fill several needs in present-day linguistic study. To my knowledge, this is the first study which gives:

1. The etymological composition of English according to proximate sources by thousands of frequency.
2. The etymological composition of English including derivational affixes.
3. The canonical forms of the language with respect to points of articulation.
4. The canonical forms of the language with respect to manner of articulation.
5. The frequencies of the canonical forms of Vowel, Consonant, and Semivowel in polysyllabic words.
6. The frequency of American English phonemes in a Midwestern idiolect according to the Trager-Smith notation.
7. The average word length in phonemes, by thousands of frequency, using the Trager-Smith notation.
8. The average word length in syllables, by thousands of frequency, using the Trager-Smith notation.
9. The relationship between the alphabetic and the Trager-Smith phonemic code.
10. The frequency of English phonemes using the Trager-Smith notation and based on a large corpus.
11. The frequencies of occurrence of all initial, medial, and final consonants and consonant clusters based on an extensive corpus.
12. The entropy of English determined by the relative frequencies of the phonemes in an extensive corpus using the Trager-Smith system of notation.
13. The entropy of English determined by word length in phonemes based on an extensive corpus using the Trager-Smith system of notation.
14. The entropy of English determined by word length in syllables based on an extensive corpus using the Trager-Smith system of notation.
15. The transitional probabilities of phonemes to the 2nd order, i.e., sequences of three phonemes, based on an extensive corpus.
16. Approximate transitional probabilities of phonemes to the 3rd order, i.e., sequences of four phonemes, based on a large corpus.
17. A statistical analysis of the phonemes of English, using the Trager-Smith system of phonemic notation, which gives the Standard Error of a Proportion, the Standard Error of Difference between two proportions, and the Standard Error Deviation for consonants and vowels separately and together.

Relationship between Alphabetic and Phonemic Codes

According to Herdan, "With the exception of a few languages like Finnish, Czech, and Spanish, all European languages show a greater or smaller gap between their phonemic and alphabetic systems, in which respect English may well be said to represent an extreme."¹ The present writer was interested in determining the correlation between the conventional alphabetic code and the phonemic code of Trager and Smith as used in this study.² The ratio of the number of symbols in both codes is 32/26 (32 phonemic characters and 26 alphabetic characters). It might appear that since there are fewer alphabetic symbols than phonemic symbols, the alphabetic code is more economical than the phonemic code; however, as Herdan points out, the actual situation is just the opposite.³

The 10,065 words in the Horn list comprised 66,534 phonemes. The sum was totaled by the CDC 1604 computer. The number of letters comprising the 10,065 words was counted by hand and totaled 70,979. Thus the total number of phonemes was 93.7% of the total number of letters. Herdan's count of the phonemes and letters in the study of French, Carter, and Koenig⁴ showed that the total of phonemes was only 81.5% of the total number of letters based on number, rather than frequency.⁵ The disparity between the figures in the present study and in Herdan's may be explained by the fact that whereas the system of notation in the count of French, Carter, and Koenig uses only one character to represent a diphthong, the Trager-Smith system uses two.
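The comparison of the two codes reduces to simple ratios of the counts just quoted. The sketch below recomputes them; the three counts come from the text, and the names are hypothetical.

```python
# Counts quoted in the text for the 10,065 words of the Horn list.
WORDS = 10_065
PHONEMES = 66_534   # totaled by the computer
LETTERS = 70_979    # counted by hand

# Phonemes as a percentage of letters (by number, not by frequency).
phonemes_per_100_letters = 100.0 * PHONEMES / LETTERS

# Average length of a word in each code, by number.
phonemes_per_word = PHONEMES / WORDS
letters_per_word = LETTERS / WORDS

print(round(phonemes_per_100_letters, 1))  # share of phonemes relative to letters
print(round(phonemes_per_word, 3), round(letters_per_word, 3))
```

On these counts the phonemic transcription is the shorter of the two, at roughly 6.6 phonemes against roughly 7.1 letters per word.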

The Etymological Composition of English

The editors of the Oxford English Dictionary have compared the vocabulary of English to a nebulous mass in the heavens which has a bright and clearly defined center and concentric zones of decreasing brightness that finally become imperceptible. At the center, of course, are the most frequent words: the structural or function words and the most common nouns and verbs, etc. The circumference of this core would seem to lie near the thousandth most frequent word: "West finds the same faults with frequency counts as do Ayres, Faucett, and Thorndike: i.e. the counts are very sensitive to the nature of the material from which they are extracted. While the validity of the specific word count is unquestionable, that of a large and general count is not. Such counts do not agree on selection of words and their order after the first 1,000 words."⁶

In this study we wanted to find out what the etymological composition of the core was, and the composition of all the other concentric circles. We wanted to find out the immediate or proximate etymological sources for the words in English, rather than the ultimate etymological sources.⁷ This has been done (Fig. 1), and what has been made abundantly clear is the almost solid (83%) Anglo-Saxon character of the core (the 1st decile). However, the relative frequency of Anglo-Saxon drops from 83% in the 1st decile to 34% in the 2nd decile, and continues to drop until the 8th decile. The variation between the 1st and 2nd deciles (the sum of the differences in frequency for Anglo-Saxon, French, Latin, and Norse) is greater than the combined variation between all the remaining deciles (Table 1).

¹ Gustav Herdan, Language as Choice and Chance (Groningen, P. Noordhoff, 1956), p. 139.
² Trager and Smith's system has 33 segmental phonemes. In the idiolect upon which the present study is based, Trager and Smith's /ɨ/ is an allophone of /i/ and does not have phonemic status.
³ Herdan, Language as Choice and Chance, p. 139.
⁴ Norman R. French, Charles W. Carter, Jr., and Walter Koenig, Jr., "The Words and Sounds of Telephone Conversations", Bell System Technical Publications, Monograph B-491 (1930).
⁵ Herdan, Language as Choice and Chance, p. 143.

TABLE 1
Frequency Variations between Deciles

Deciles        1/2   2/3   3/4   4/5   5/6   6/7   7/8   8/9   9/10
Anglo-Saxon     49     4     1     2     2     5     3     0     0
French          35     0     1     2     5     3     4     0     1
Latin            5     3     4     1     2     2     1     0     0
Norse            0     1     0     0     1     0     0     0     1
Total:          89     8     6     5    10    10     8     0     2

Total (deciles 2/3 through 9/10): 49
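The totals of Table 1 can be recomputed from the four rows. The following sketch, with hypothetical names, sums each column and compares the variation between the 1st and 2nd deciles with the combined variation between all later pairs of deciles.

```python
# Rows of Table 1: variation in relative frequency between adjacent deciles.
DECILE_PAIRS = ["1/2", "2/3", "3/4", "4/5", "5/6", "6/7", "7/8", "8/9", "9/10"]
variations = {
    "Anglo-Saxon": [49, 4, 1, 2, 2, 5, 3, 0, 0],
    "French":      [35, 0, 1, 2, 5, 3, 4, 0, 1],
    "Latin":       [5, 3, 4, 1, 2, 2, 1, 0, 0],
    "Norse":       [0, 1, 0, 0, 1, 0, 0, 0, 1],
}

# Sum the four sources for each pair of adjacent deciles.
column_totals = [sum(column) for column in zip(*variations.values())]

first_pair_total = column_totals[0]
remaining_total = sum(column_totals[1:])

for pair, total in zip(DECILE_PAIRS, column_totals):
    print(pair, total)
print(first_pair_total, remaining_total)
```

The first column alone (89) exceeds the sum of the other eight columns (49), which is the point made in the text.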

After the first decile, French takes the lead, which it holds throughout (despite a slight decrease in the higher deciles). Latin increases until the 4th decile, where it tends to level off. It is of interest to note the general increase of other sources which are not marked. In the first decile Anglo-Saxon, French, Latin and Norse account for 99% of all frequencies; in the last decile they account for only 89%. Of interest, also, is the way in which the changes in the etymological composition of the language level off by the 8th decile. The stability of the proportions in the last three deciles is remarkable. After the 2nd decile both Anglo-Saxon and French show a general decrease, while Latin shows a general increase. Contrary to the action of these languages, Norse shows little change at all. Norse is quite exceptional in that it never varies more than one percent between deciles.

⁶ Fries and Traver, p. 62.
⁷ Edward Y. Lindsay, An Etymological Study of the Ten Thousand Words in Thorndike's "Teacher's Word Book" (= Indiana University Studies, Vol. XII, Study No. 65) (March, 1925). This study, done at the suggestion of the American Classical League, records the ultimate sources of English words.

Fig. 1. Etymological Composition of English by Relative Frequency and by Decile. [Legend: Anglo-Saxon, French, Latin, Norse, Dutch, Uncertain or Unknown. Horizontal axis: deciles 1 to 10.]

The influence of the high frequencies of the words in the 1st decile may be seen in the overall etymological composition of the language (Fig. 2). Despite the secondary position of Anglo-Saxon in nine deciles, its high frequencies in the first largely determine the composition of the whole. The complete tabulations by decile and overall are to be found in Appendix I.

Fig. 2. Etymological Composition of English. ["Prob." and "••!•" figures included.]

Phoneme Frequencies

If all thirty-two phonemes occurred with the same frequency, the relative frequency of each phoneme would be approximately 3.13%. Of course, they do not, but it is interesting to see how far they depart from this theoretical standard of equality (Appendix II and Fig. 3). Eleven phonemes, /ə, i, t, y, r, n, e, a, w, s, l/, have a relative frequency of more than 3.13%; 21 have less. The frequency of the phoneme /ə/ is almost four times greater than the theoretical standard, while the frequency of the least frequent phoneme, /ž/, is 100 times smaller. The 11 commonest phonemes together account for 69% of the total frequency. (The frequencies of vowels are shown in Fig. 4 and Appendix III. The frequencies of consonants and semivowels are shown in Fig. 5 and Appendix IV.)

Fig. 3. Relative Frequency of Segmental Phonemes.

Fig. 4. Relative Frequency of Vowels (Frequency indicated by height).

Fig. 5. Relative Frequency of Consonants & Semivowels.

When the rank orders of the phonemes are compared by decile, it becomes apparent that some phonemes are quite stable, while others are not. The maximum variation by rank and relative frequency for each phoneme is shown in Table 2. /ð/, with a rank variation of 18, has the least stability, while /t, ž, i, e, ə/, with a variation of only 1 rank, have the greatest. The explanation for the wide variation of /ð/ is that it is used chiefly in Anglo-Saxon words, and particularly in those Anglo-Saxon words which have very high frequencies of occurrence. This explanation accounts for other wide variations in rank.
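The "theoretical standard of equality" used at the start of this section is simply 100% divided by the number of phonemes. The sketch below compares that standard with the two extreme values printed in Fig. 3 (11.82% and .03%); attaching those two values to the most and least frequent phonemes follows the text, and the names are hypothetical.

```python
# Number of segmental phonemes in the idiolect studied.
N_PHONEMES = 32

# Relative frequency (in percent) each phoneme would have under equal use.
uniform_percent = 100.0 / N_PHONEMES

# Extreme values read from Fig. 3 (percent).
most_frequent_percent = 11.82
least_frequent_percent = 0.03

# How far the extremes depart from the uniform standard.
times_above = most_frequent_percent / uniform_percent
times_below = uniform_percent / least_frequent_percent

print(uniform_percent)
print(round(times_above, 2), round(times_below, 1))
```

The commonest phoneme comes out a little under four times the uniform value, and the rarest about one hundred times below it, in line with the text.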

The sum of the rank variations for all phonemes in all deciles is 159. The sum of the rank variations for all phonemes between the 1st and 2nd deciles is 108, whereas the sum of the variations between the 2nd and 3rd deciles is only 22. Just as the etymological composition of English changes drastically after the 1st decile, so does the rank order of phonemes.

TABLE 2
Rank Order Variation of Phonemes

            Ranks          Maximum Variation   Maximum Variation
Phoneme   Low    High      (Ranks)             (Relative Frequencies)
p          24     12         12                  1.85
t           5      4          1                  1.32
č          30     28          2                   .22
k          16     10          6                  2.39
b          21     17          4                   .60
d          13     10          3                  1.99
ǰ          31     26          5                   .59
g          27     25          2                   .50
f          21     17          4                   .48
θ          30     29          1                   .19
s          10      5          5                  2.72
š          28     22          6                   .80
v          24     18          6                   .84
ð          32     14         18                  2.67
z          19     12          7                  1.28
ž          32     31          1                   .09
l          11      8          3                  1.80
m          15     13          2                   .47
n           6      4          2                   .86
ŋ          26     18          8                  1.02
y           7      3          4                  1.93
h          25     12         13                  1.77
r           5      3          2                  1.52
w          17      8          9                  3.03
i           2      1          1                  1.20
e           9      8          1                   .68
æ          22     16          6                   .57
ə           2      1          1                  2.95
a          14      7          7                  2.26
u          26     17          9                   .94
o          23     17          6                   .62
ɔ          29     27          2                   .27
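The rank columns of Table 2 are enough to recompute the maximum rank variation of each phoneme and the overall sum. In the sketch below the low and high ranks are those of the table; the phoneme symbols are a reading of the table's first column in the usual Trager-Smith order and should be treated as an assumption, as should all names.

```python
# (phoneme, lowest rank, highest rank) from Table 2.
# Symbols are an assumed reading of the phoneme column.
RANKS = [
    ("p", 24, 12), ("t", 5, 4), ("č", 30, 28), ("k", 16, 10),
    ("b", 21, 17), ("d", 13, 10), ("ǰ", 31, 26), ("g", 27, 25),
    ("f", 21, 17), ("θ", 30, 29), ("s", 10, 5), ("š", 28, 22),
    ("v", 24, 18), ("ð", 32, 14), ("z", 19, 12), ("ž", 32, 31),
    ("l", 11, 8), ("m", 15, 13), ("n", 6, 4), ("ŋ", 26, 18),
    ("y", 7, 3), ("h", 25, 12), ("r", 5, 3), ("w", 17, 8),
    ("i", 2, 1), ("e", 9, 8), ("æ", 22, 16), ("ə", 2, 1),
    ("a", 14, 7), ("u", 26, 17), ("o", 23, 17), ("ɔ", 29, 27),
]

# Maximum variation in rank = lowest rank minus highest rank.
variation = {phoneme: low - high for phoneme, low, high in RANKS}

total_variation = sum(variation.values())
least_stable = max(variation, key=variation.get)
most_stable = sorted(p for p, v in variation.items() if v == 1)

print(total_variation)
print(least_stable, variation[least_stable])
print(most_stable)
```

The differences reproduce the "Maximum Variation (Ranks)" column, and their sum is the figure of 159 quoted in the text.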

Statistical Analysis of Phoneme Frequencies

In Appendices II, III, and IV are given the Standard Error of a Proportion for each phoneme, the Standard Error of the Difference between successive phonemes, and their Standard Error Deviations. The formulae used in these computations were those set forth by David W. Reed⁸ and later used by Rebecca E. Hayden.⁹ According to Reed, the Standard Error of a Proportion "...indicates accurately the degree of reliability of a given proportion which occurs in a particular sample. It seems reasonable that, whenever a percentage figure is used to express the frequency of a given linguistic form, there ought also to be some indication of the reliability of that percentage figure. Standard error provides a simple, concise method of indicating this reliability and could profitably be employed in this connection."¹⁰ The formula for the Standard Error of a Proportion is

$$SE = \sqrt{\frac{pq}{N}}.$$

The symbol p stands for the relative frequency of a phoneme; q stands for the total relative frequencies of all other phonemes; N represents the total frequency of occurrence of all phonemes in the sample.

We turn again to Reed, whose article this part of the present study follows closely, for an explanation of the Standard Error of the Difference: "Any study of the relative frequency of the structural units of a language must provide some means of analyzing which differences in frequency found in a sample of linguistic forms are probably due to sheer chance and which are probably due to a real difference in the whole language or dialect being studied."¹¹ The formula used in determining the Standard Error of the Difference is

$$SE_{\mathrm{diff}} = \sqrt{SE_1^{2} + SE_2^{2}}.$$ ¹²

The Standard Error Deviations were calculated by dividing the relative frequency differences of pairs of phonemes by their Standard Error of the Difference. When the Standard Error Deviations are known, their significance or nonsignificance may be determined by consulting a table which lists the probability of occurrence of Standard Error Unit deviations.¹³ Reed says, "...a percentage of probability above 5.00 is usually taken to indicate a mere chance deviation; a percentage of probability below 5.00 is usually taken to indicate a strong likelihood that some factor other than chance is responsible for the deviation; and a percentage of probability below 1.00 is usually taken to indicate almost the certainty that some factor other than chance is responsible for the deviation."¹⁴

⁸ David W. Reed, "A Statistical Approach to Quantitative Linguistic Analysis", Word, V (1949), pp. 235-247.
⁹ Rebecca E. Hayden, "The Relative Frequency of Phonemes in General-American English", Word, VI (1950), pp. 217-223.
¹⁰ Reed, p. 240.
¹¹ Ibid., p. 245.
¹² Ibid., p. 242.
¹³ Such a table is presented by Reed, p. 244.
¹⁴ Reed, pp. 243-244.


From Reed's table one determines that a percentage of probability of 5.00 coincides with 1.96 Standard Error Deviations, and a percentage of probability of 1.00 occurs between 2.57 and 2.58 Standard Error Deviations. Thus, by consulting the last column in Appendices II, III, and IV, one may know to what extent the rank order of any phoneme is determined by chance. For instance, in the 1st decile all Standard Error Deviations are significant except that for the pair /θ/ and /č/, which is less than 1.96; the relative order of these two phonemes is therefore assumed to be due to chance.
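The three quantities described in this section follow directly from the textbook formulas for the standard error of a proportion and of a difference. The sketch below is not the program run for this study; proportions are taken as fractions of 1 rather than percentages, the two-tailed probability is computed from the normal distribution instead of being read from Reed's table, and the example proportions and sample size are hypothetical.

```python
import math


def se_proportion(p, n):
    """Standard Error of a Proportion: sqrt(p * q / N), with q = 1 - p."""
    return math.sqrt(p * (1.0 - p) / n)


def se_difference(se1, se2):
    """Standard Error of the Difference between two proportions."""
    return math.sqrt(se1 ** 2 + se2 ** 2)


def se_deviation(p1, p2, n):
    """Difference of two proportions in units of its standard error."""
    return abs(p1 - p2) / se_difference(se_proportion(p1, n), se_proportion(p2, n))


def two_tailed_probability(z):
    """Probability of a deviation at least this large under a normal law."""
    return math.erfc(z / math.sqrt(2.0))


def interpret(z):
    """Apply the 1.96 and 2.58 thresholds quoted from Reed."""
    if z < 1.96:
        return "chance"
    if z < 2.58:
        return "probably not chance"
    return "almost certainly not chance"


# Hypothetical example: two phonemes of nearly equal relative frequency
# in a hypothetical sample of one million phoneme occurrences.
z = se_deviation(0.0190, 0.0188, 1_000_000)
print(round(z, 2), interpret(z))
```

With these illustrative inputs the deviation falls below 1.96, so the order of the two phonemes would be attributed to chance.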

Vowel/Consonant Ratio

Having seen the frequencies and rank orders of individual phonemes, we now look at them as a group and as vowels and consonants (semivowels are counted as consonants). Figure 6 and Appendix V, which show the frequency of phonemes in each decile, illustrate the dominant role in the language which is played by the core. Despite the preponderance of frequencies in the 1st decile, the actual number of phonemes in each decile varies only slightly (Fig. 7 and Appendix V).¹⁵ The lowest number of phonemes in any decile occurs in the first. This is due to the relatively large number of monosyllables found among the most frequent words. The relative number of the consonants is remarkably close from decile to decile, having a maximum variation of less than 1% (Fig. 8 and Appendix V). One can see that the consonant/vowel ratio is nearly the same throughout the language, varying almost imperceptibly from the core to the 10th decile. In contrast, when the relative frequency, rather than the relative number, of consonants is tabulated, one sees that the maximum variation is over 4% (Appendix V).

Word Length in Phonemes and in Syllables

The relationship between phonemes and words is shown in Figure 9 and in Appendix VI. The average word length in phonemes by number is 6.610, while the average length by frequency is 3.625. Here again is seen the influence of the core of the language. The relationship between syllables and words is shown in Figure 10 and in Appendix VII. The average word length in syllables by number is 2.194, and the average length by frequency is 1.309. Table 3 and Appendix VIII show the joint frequency distribution of word length by syllable and phoneme number. This table and the two preceding show clearly the inherent economy in the language, i.e. the most frequently used words are the shortest.

¹⁵ As used in this study, "number" means "how many" and "frequency" means "how often". For example, in the first decile there were 5,185 phonemes. The frequencies of occurrence of these phonemes totaled 46,270,952. Therefore, 5,185 is the number of phonemes, and 46,270,952 is the frequency.
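The two averages "by frequency" quoted above can be recovered from the marginal totals of Table 3, since each is a mean word length weighted by frequency of occurrence. The sketch below uses the row and column totals of that table; the names are hypothetical.

```python
# Row totals of Table 3: word length in phonemes -> total frequency.
FREQ_BY_PHONEMES = {
    1: 359_119, 2: 4_738_940, 3: 4_512_942, 4: 2_099_406, 5: 1_337_455,
    6: 1_007_061, 7: 631_239, 8: 334_212, 9: 222_258, 10: 92_626,
    11: 62_716, 12: 39_520, 13: 16_336, 14: 9_994, 15: 1_186,
}

# Column totals of Table 3: word length in syllables -> total frequency.
FREQ_BY_SYLLABLES = {
    1: 11_896_895, 2: 2_637_562, 3: 704_208, 4: 176_942,
    5: 48_377, 6: 1_012, 7: 14,
}


def weighted_mean_length(freq_by_length):
    """Mean word length, each length weighted by its frequency of occurrence."""
    total = sum(freq_by_length.values())
    weighted = sum(length * freq for length, freq in freq_by_length.items())
    return weighted / total


mean_phonemes = weighted_mean_length(FREQ_BY_PHONEMES)
mean_syllables = weighted_mean_length(FREQ_BY_SYLLABLES)
print(round(mean_phonemes, 3), round(mean_syllables, 3))
```

Both marginals sum to the corpus total of 15,465,010, and the weighted means agree with the averages by frequency given in the text.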

Fig. 6. Frequencies of Phonemes in Each Decile.

Fig. 7. Number of Phonemes in Each Decile. [Vertical axis: 0 to 7,000; horizontal axis: deciles 1 to 10.]

Fig. 8. Relative Frequencies of Consonants by Number. [Vertical axis: 63.2 to 67; horizontal axis: deciles 1 to 10 and total.]

Fig. 9. Words of 1-15 Phonemes.

Fig. 10. Words of 1-7 Syllables.

TABLE 3
Joint Frequency Distribution of Word Length by Syllable and Phoneme Number

                                        SYLLABLES
PHONEMES           1          2        3        4       5      6    7       Total
15                 -          -        -      294     811     81    -       1,186
14                 -          -        -    5,884   3,401    695   14       9,994
13                 -          -       16    4,576  11,663     81    -      16,336
12                 -          -    4,699   15,437  19,229    155    -      39,520
11                 -          -   11,344   42,166   9,206      -    -      62,716
10                 -      1,479   51,143   36,920   3,084      -    -      92,626
9                  -     13,920  166,725   40,630     983      -    -     222,258
8                  -     82,302  221,035   30,875       -      -    -     334,212
7              1,371    429,032  200,676      160       -      -    -     631,239
6             42,721    915,837   48,503        -       -      -    -   1,007,061
5            559,554    777,834       67        -       -      -    -   1,337,455
4          1,733,258    366,148        -        -       -      -    -   2,099,406
3          4,461,932     51,010        -        -       -      -    -   4,512,942
2          4,738,940          -        -        -       -      -    -   4,738,940
1            359,119          -        -        -       -      -    -     359,119
Total     11,896,895  2,637,562  704,208  176,942  48,377  1,012   14  15,465,010

Canonical Forms

Charles F. Hockett defines a canonical form as "...a sort of generalized phonemic shape".¹⁶ For example, in the sentence above, the words form, sort, and shape are of the same canonical form: /form/, /sort/, and /šeyp/ are similar in that they are formed from the sequence consonant, vowel, semivowel, consonant (CVSC). The words of /əv/ and as /æz/ also have the same canonical form (VC). An interesting thing about canonical forms is that some forms which are theoretically possible never or infrequently occur, while others are greatly favored. Out of 10,065 words in the corpus there were 1,790 different canonical forms, and of these words 298 had the form CVCC and 16 had the form VCC, but no word had the pattern CCV. A total of 307 words had the form CVC, but only one had the opposite pattern (Appendix IX). Out of the 1,790 canonical forms, only eleven were common to more than 100 words. The favored forms determined by the number of words of each shape are given in Table 4.

¹⁶ Charles F. Hockett, A Course in Modern Linguistics (New York, The MacMillan Co., 1958), p. 284.

TABLE 4
Favored Canonical Forms by Number

Form      Number of words of each form
CVSC      394
CVC       307
CVCC      298
CVCVC     266
CVSCC     226
CVSCVC    224
CVCCVC    137
CSVSC     112
CVCVCC    112
SVC       104
SVSC      103
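Deriving a canonical form from a transcription amounts to replacing each phoneme by its class letter. The sketch below illustrates this for the examples given in the text. It rests on an assumption that is not stated in this section: that /y, w, h, r/ are the phonemes counted as semivowels, which is what the CVSC analysis of form and sort appears to require. The function and variable names are hypothetical.

```python
from collections import Counter

# Eight vowels of the notation used in this study.
VOWELS = {"i", "e", "æ", "ə", "a", "u", "o", "ɔ"}

# ASSUMPTION: phonemes treated as semivowels (S); not confirmed by the text.
SEMIVOWELS = {"y", "w", "h", "r"}


def canonical_form(phonemes):
    """Map a sequence of phoneme symbols to a string over C, V, and S."""
    shape = []
    for phoneme in phonemes:
        if phoneme in VOWELS:
            shape.append("V")
        elif phoneme in SEMIVOWELS:
            shape.append("S")
        else:
            shape.append("C")
    return "".join(shape)


# The five words discussed in the text, one symbol per phoneme.
lexicon = {
    "form": ["f", "o", "r", "m"],
    "sort": ["s", "o", "r", "t"],
    "shape": ["š", "e", "y", "p"],
    "of": ["ə", "v"],
    "as": ["æ", "z"],
}

forms = {word: canonical_form(phonemes) for word, phonemes in lexicon.items()}
form_counts = Counter(forms.values())
print(forms)
print(form_counts.most_common())
```

Tallying the forms with a counter, as in the last lines, is the same operation that yields a table of favored forms by number when applied to a whole word list.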


The favored forms from the standpoint of frequency of occurrence, however, are given in Table 5. The twenty-one forms in this table account for 75% of all occurrences. The remaining 1,769 canonical forms make up but 25% of the total frequencies of occurrence in English. Or, to rephrase this statement, 75% of the words used in English have one of only twenty-one different forms.

TABLE 5
Favored Canonical Forms by Relative Frequency of Occurrence

Form      Number of Words    Frequency of    Relative
          in Each Form       Occurrence      Frequency (%)
VC         29                2,243,699       14.50822858
CVC       307                1,472,752        9.52312349
CV         14                1,255,846        8.12056378
CVS        85                1,167,258        7.54773518
SVC       104                  848,175        5.48447754
VS         10                  783,990        5.06944386
CVSC      394                  592,888        3.83373823
SVS        24                  555,156        3.58975520
SV          6                  455,405        2.94474430
CVCC      298                  364,620        2.35770944
V           1                  359,119        2.32213882
SVSC      103                  216,311        1.39871232
VSS         4                  213,772        1.38229461
CVCVC     266                  172,842        1.11763264
VSC        28                  161,831        1.04643321
CVCVS      86                  161,347        1.04330356
CVSCC     226                  152,175         .98399548
SSVC        5                  138,645         .89650766
CVSCVC    224                  128,727         .83237580
CSVC       71                   97,660         .63149005
CCVSC      95                   92,734         .59963750
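The 75% figure can be checked by dividing the summed frequencies of the twenty-one forms in Table 5 by the corpus total. The sketch below does so with the frequencies of the table; the names are hypothetical.

```python
# Frequency of occurrence of each favored form, from Table 5.
FORM_FREQUENCIES = {
    "VC": 2_243_699, "CVC": 1_472_752, "CV": 1_255_846, "CVS": 1_167_258,
    "SVC": 848_175, "VS": 783_990, "CVSC": 592_888, "SVS": 555_156,
    "SV": 455_405, "CVCC": 364_620, "V": 359_119, "SVSC": 216_311,
    "VSS": 213_772, "CVCVC": 172_842, "VSC": 161_831, "CVCVS": 161_347,
    "CVSCC": 152_175, "SSVC": 138_645, "CVSCVC": 128_727, "CSVC": 97_660,
    "CCVSC": 92_734,
}

CORPUS_TOTAL = 15_465_010  # total frequency of the corpus


def relative_frequency_percent(frequency, total=CORPUS_TOTAL):
    """Relative frequency of one form, in percent of all occurrences."""
    return 100.0 * frequency / total


coverage = relative_frequency_percent(sum(FORM_FREQUENCIES.values()))
vc_share = relative_frequency_percent(FORM_FREQUENCIES["VC"])
print(len(FORM_FREQUENCIES), round(coverage, 2), round(vc_share, 2))
```

The twenty-one forms together cover just over three quarters of all occurrences, and the share computed for VC matches the first relative frequency in the table.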

Canonical Forms with Respect to Manner of Articulation

Canonical forms (generalized phonemic shapes) can be determined according to the manner of the production of phonemes, i.e., stops, fricatives, etc. Reference to Appendix X will show that only eighteen forms account for 50% of the occurrences by frequency, although they account for only 2.7% of the occurrences by number. It should be noticed also that these eighteen commonest forms are all monosyllables. Two forms are used over 5% of the time: Vowel, Nasal (6.574%) and Vowel, Semivowel (5.069%). Although 86 words have the form Stop, Vowel, Stop and 68 words have the form Stop, Vowel, Semivowel, Stop, these two forms account for but 1.6% of the relative frequency. The 10,065 words in the corpus yielded 5,726 different forms. 5,708 forms had a total relative frequency of only 50%.


Canonical Forms with Respect to Points of Articulation

Canonical forms can be established according to the place of the production of phonemes, e.g. labial, alveolar, etc. The 10,065 words in the corpus contained 4,301 different canonical forms. Appendix XI shows that the first fifteen forms account for 50% of the frequencies of occurrence. One form alone (Vowel, Alveolar) accounts for more than 10% of the total frequencies of occurrence. The fifteen commonest forms are all monosyllables.
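Canonical forms by manner or by point of articulation can be derived in the same way as the C, V, S forms, by substituting a different class label for each phoneme. The sketch below is only an illustration: the two class maps cover a handful of phonemes and are assumptions made here for the example, not the classification actually used in Appendices X and XI, and all names are hypothetical.

```python
# ASSUMED partial classification by manner of articulation.
MANNER = {
    "p": "Stop", "t": "Stop", "k": "Stop", "b": "Stop", "d": "Stop", "g": "Stop",
    "s": "Fricative", "z": "Fricative",
    "m": "Nasal", "n": "Nasal",
    "i": "Vowel", "e": "Vowel", "a": "Vowel",
}

# ASSUMED partial classification by point of articulation.
PLACE = {
    "p": "Labial", "b": "Labial", "m": "Labial",
    "t": "Alveolar", "d": "Alveolar", "n": "Alveolar", "s": "Alveolar", "z": "Alveolar",
    "k": "Velar", "g": "Velar",
    "i": "Vowel", "e": "Vowel", "a": "Vowel",
}


def articulatory_form(phonemes, class_map):
    """Describe a word as a comma-separated sequence of articulatory classes."""
    return ", ".join(class_map[phoneme] for phoneme in phonemes)


# Example words, one symbol per phoneme: in /in/ and it /it/.
word_in = ["i", "n"]
word_it = ["i", "t"]

print(articulatory_form(word_in, MANNER))
print(articulatory_form(word_in, PLACE))
print(articulatory_form(word_it, MANNER))
```

Under these assumed maps the word in falls under the form Vowel, Nasal by manner and Vowel, Alveolar by point of articulation, the two forms singled out in the text as especially frequent.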

Transitional Probabilities for Sequences of Two Phonemes

In Appendix XII are listed the transitional probabilities¹⁷ for sequences of two phonemes.¹⁸ An abbreviated, but perhaps more easily comprehended, presentation of these probabilities is to be found in Table 6. None means the absence of a segmental phoneme immediately preceding or following. None, therefore, is the equivalent of a cover symbol for junctures. Turning to the entry for the phoneme /u/ in Appendix XII, one sees the following:

U    BEFORE    NONE    1    28    .00261862

This entry means that /u/ occurs word-initial in one word which had a frequency of occurrence of 28. Out of all sequences in which /u/ was the second member, this particular sequence accounted for .00261862% (this was the word oodles /uwdalz/). The second part of the entry for /u/ is as follows:

     AFTER     NONE    2    172967    16.17626704

This entry means that /u/ occurred word-final in two words which had a total frequency of occurrence of 172,967. (The two words are value /vælyu/ and one of the two variants of you /yu/.) This frequency accounted for 16.176% of the occurrences of /u/ as the first member of a sequence. Table 6 shows the combinatorial patterns of the phonemes, e.g. the combinatorial latitude of /s/ as compared to the restrictions of /z/.

¹⁷ Strictly speaking, a probability can never be greater than one; therefore, by moving the decimal point two places to the left, one will have the probability, rather than the percentage of probability. Percentage of probability was used in this study simply because 2% is apt to be more meaningful to the average reader than is a probability expressed as .02.
¹⁸ Transitional probabilities for sequences of two and three phonemes have been calculated by Prof. John B. Carroll of Harvard University. See Chapter 6 of the present study for additional details concerning Prof. Carroll's work.
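Each percentage in such an entry is the frequency of one sequence divided by the total frequency of the phoneme concerned. That total is not quoted in the text for /u/, but it can be inferred from either of the two entries above. The sketch below performs this inference as a consistency check; the frequencies and percentages are those of the entries, while the names and the derived total are not from the source.

```python
def implied_total(frequency, percent):
    """Total frequency implied by one sequence frequency and its percentage."""
    return frequency / (percent / 100.0)


def transition_percent(frequency, total):
    """Percentage of a phoneme's occurrences taken up by one sequence."""
    return 100.0 * frequency / total


# The two entries for /u/ quoted from Appendix XII.
total_from_initial = implied_total(28, 0.00261862)
total_from_final = implied_total(172_967, 16.17626704)

# Recompute the word-final percentage from the rounded implied total.
recomputed_final = transition_percent(172_967, round(total_from_final))

print(round(total_from_initial), round(total_from_final))
print(round(recomputed_final, 3))
```

The two implied totals agree to within rounding, at a little over one million occurrences of /u/, which suggests that both percentages were computed on the same base.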

TABLE 6. [Abbreviated presentation of the transitional probabilities for sequences of two phonemes; the tabulated values are illegible in this copy.]