Is Intuition as Good as Corpus Frequency for Selecting Vocabulary?

When it comes to language learning, size apparently matters: vocabulary is at the core of language learning, and research has shown that learners with larger vocabularies are more proficient (Meara, 1996). Given that vocabulary is crucial, the challenge is knowing both how many and which words to teach. Estimates ranging from 3,000 to 14,000 word families have been suggested if students are to read authentic material unassisted, with 6,000-7,000 possibly being the sweet spot until learners advance into their specific fields and disciplines. Okamoto (2015) looks at the question of how to decide which word families to teach. In particular, the article asks whether native-speaker intuition is as accurate as corpus frequency when selecting vocabulary. It turns out that it is, up to a certain level.

Background: How Many Words Should We Teach?

There are differing views of how much vocabulary students need to learn:

  • 3,000 word families for academic texts (Laufer, 1992)
  • 5,000 word families for novels (Hirsh and Nation, 1992)
  • 10,000 word families for university textbooks (Hazenberg and Hulstijn, 1996)
  • 14,000 word families for university textbooks (Chujo and Hasegawa, 2003)
  • 8,000-9,000 for written texts, 6,000-7,000 for spoken texts (Nation, 2006)

Schmitt has argued that the common approach of disregarding lower-frequency words is a mistake and recommends more focus on mid-frequency vocabulary (4,000-8,000). In addition, Schmitt and Schmitt (2012) view textbooks as poor vehicles for teaching vocabulary because they often rely on intuition rather than corpus data, though that is changing.

What Do Frequency Numbers Represent?

A term like “mid-level frequency” vocabulary refers to word families found in a compiled list of the most frequent words of English. This list is broken into “bands” of 1,000 families each. So, the 1,000 most common word families are in the first band (1k). These include words like about, east, and yard. The next 1,000 are in the second band (2k, e.g. accident, lack, wise) and so on. A 10k list would have words like abet, fen, and wrest. As you can see, the higher the band, the less frequent the vocabulary. The vocabulary bands for the study below come from Paul Nation’s (2006) BNC-compiled list, which is available online among many of his other lists.
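To make the banding concrete, here is a minimal Python sketch that maps a word family's rank in a frequency list to its 1,000-family band. The function names and the example ranks are hypothetical illustrations; they are not taken from Nation's list or from Okamoto's study.

```python
def frequency_band(rank: int, band_size: int = 1000) -> int:
    """Return the 1-based band for a 1-based frequency rank.

    Ranks 1-1000 fall in band 1, ranks 1001-2000 in band 2, and so on.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return (rank - 1) // band_size + 1


def band_label(rank: int, band_size: int = 1000) -> str:
    """Format the band the way the post writes it, e.g. '1k' or '10k'."""
    return f"{frequency_band(rank, band_size)}k"


if __name__ == "__main__":
    # Hypothetical ranks chosen only to show the band boundaries.
    for rank in (1, 1000, 1001, 9500):
        print(rank, band_label(rank))
```

Under this scheme a family ranked 1,000th is still in the 1k band, while the 1,001st is the first entry of the 2k band.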

How Is Vocabulary Usually Chosen?

According to Okamoto, there are typically two methods to choose vocabulary:

  • Native-speaker intuition – some research has shown that these judgments vary and are not reliable when compared to corpus frequency (Alderson, 2007)
  • Corpus data based mostly on frequency (how often a word appears) and to a lesser extent on dispersion (how often a word appears in various domains, e.g. Academic, Spoken, Fiction, etc.)
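The difference between frequency and dispersion can be sketched in a few lines of Python. This is only an illustration: the toy corpus is invented, and dispersion is approximated here by "range" (the share of domains in which a word appears at least once), which is one simple measure and not necessarily the one Okamoto used.

```python
from collections import Counter


def frequency_and_range(corpus_by_domain, word):
    """Return (total count, share of domains containing the word)."""
    counts = {
        domain: Counter(tokens)[word]
        for domain, tokens in corpus_by_domain.items()
    }
    total = sum(counts.values())
    domains_present = sum(1 for count in counts.values() if count > 0)
    return total, domains_present / len(counts)


# Invented toy corpus, split into three hypothetical domains.
toy_corpus = {
    "academic": "the data lack a clear pattern so the lack matters".split(),
    "spoken": "there was an accident about a mile east".split(),
    "fiction": "the fen lay east of the yard".split(),
}

if __name__ == "__main__":
    for word in ("east", "lack", "fen"):
        print(word, frequency_and_range(toy_corpus, word))
```

In this toy data, "east" and "lack" are equally frequent, but "east" turns up in two domains while "lack" is confined to one, so frequency alone would hide the difference.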

The Study

Okamoto’s study looked at two aspects of word selection: (1) how native-speaker judgments correlate with corpus frequency, and (2) how frequency is related to dispersion. To answer the first question, the author chose 238 words (roughly 20 from each of 12 frequency levels) and had 17 native speakers (12 from the US, 5 from the UK) rate each one as (1) I know this word and use it often, (2) I know this word but rarely use it, or (3) I do not know this word. Frequency percentages and linear regression analysis revealed the main finding: native-speaker word use is significantly correlated with corpus frequency up to the 7,000 word family level. For the dispersion analysis, the author found that words are equally dispersed up to the 6,000 word level.
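For readers who want a feel for this kind of analysis, the sketch below computes, for each frequency band, the share of "use it often" ratings and then fits a straight line through those shares. The study's exact regression setup is not described here, so treat this as a rough illustration only. The ratings are invented placeholders, not data from the study, and the function names are hypothetical.

```python
def share_used_often(ratings):
    """Proportion of ratings equal to 1 (the "use it often" option)."""
    return sum(1 for rating in ratings if rating == 1) / len(ratings)


def least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented placeholder ratings (1, 2 or 3) for three bands, NOT study data.
placeholder_ratings = {
    1: [1, 1, 1, 2],
    2: [1, 1, 2, 2],
    3: [1, 2, 2, 3],
}

bands = sorted(placeholder_ratings)
shares = [share_used_often(placeholder_ratings[band]) for band in bands]
slope, intercept = least_squares(bands, shares)

if __name__ == "__main__":
    print("share used often per band:", shares)
    print("slope:", slope, "intercept:", intercept)
```

A negative slope means reported use falls as the band number rises, which is the pattern one would expect if intuition tracks corpus frequency.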

The Study’s Implications

Whereas Schmitt and Schmitt (2012) and Alderson (2007) claim that intuition is an unreliable method for selecting vocabulary, and that books based on this methodology may therefore have flaws, Okamoto found that native speakers, “who play an important role in vocabulary selection for EFL textbooks, seem to make reasonable judgments based on the frequency of their actual word use up to the 7000- or the 6000-word level”. Implied here is that intuition-based vocabulary found in textbooks is likely to be sound. However, these judgments should not be seen as “identical” to corpus frequency, but as “complementary yardsticks for selecting a vocabulary to teach under temporally restricted conditions”. In other words, the main implication of the study is that judgment can serve as a good method for selecting vocabulary when corpus use is not possible.

Article Discussion

I found a number of things interesting about this study. First was the focus on native-speaker intuition. Textbook authors do seem to be largely L1 English language users – “native-speakers.” However, it’s more than probable that non-native English speakers – L2 English language users – make up the majority of English language teachers, who are responsible not just for teaching vocabulary but also for developing their own classroom materials. It is likely important to understand to what degree their intuitions reflect corpus frequency. Related to this, I wonder how using a corpus of English as a lingua franca (ELF) would change the word frequency or dispersion results, or even the estimates of how many word families are needed, especially for successful ELF communication.

One final interesting thing of note is the study design itself. The current study looked at “self-reported frequency of word use” and made the assumption that the type of information collected reflected participants’ “judgment as to whether they should use [the words] in the textbooks they write” (p. 3). However, one’s report of word use and frequency based on a vocabulary list devoid of context or purpose does not represent the realistic pedagogical judgments of teachers actually choosing words to teach. It is more about measuring vocabulary recognition than vocabulary selection. A study that asks teachers to look at a text and indicate vocabulary that should be taught, and perhaps even to sort words into bands of importance, may be a more accurate measurement of whether intuition can in fact be a reasonable complement to corpus frequency.

If you are interested in learning more about different ways to select vocabulary from a text, I have written a post about 7 different methods of mining for vocabulary, not including using the plethora of freely available corpora and text analysis tools out there!


Alderson, J. C. (2007). Judging the frequency of English words. Applied Linguistics, 28(3), 383-409.

Chujo, K., & Hasegawa, S. (2003). Jijieigo no jugyo de motiirareru eibunsozai no goi reberuchousa: BNC (British National Corpus) wo kijun ni site [An investigation of vocabulary levels of materials used in current English class: in reference to BNC]. Jiji Eigogaku Kenkyu, 42, 439-451.

Hazenberg, S., & Hulstijn, J. (1996). Defining a minimal receptive second-language vocabulary for non-native university students: an empirical investigation. Applied Linguistics, 17(2), 145-163.

Hirsh, D., & Nation, P. (1992). What vocabulary size is needed to read unsimplified texts for pleasure? Reading in a Foreign Language, 8, 689-696.

Laufer, B. (1992). How much lexis is necessary for reading comprehension? In P. J. L. Arnaud & H. Bejoint (Eds.), Vocabulary and applied linguistics (pp. 126-132). London: Macmillan Academic and Professional.

Nation, P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63(1), 59-82.

Okamoto, M. (2015). Is corpus word frequency a good yardstick for selecting words to teach? Threshold levels for vocabulary selection. System, 51, 1-10. 

Schmitt, N., & Schmitt, D. (2012). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching.

Anthony Schmidt
English language Instructor at University of Tennessee, Knoxville
Anthony Schmidt is editor of ELT Research Bites. He also has his own blog. Offline, he is a full-time English language instructor in a university IEP program. He is interested in all aspects of applied linguistics, in particular English for Academic Purposes.

4 thoughts on “Is Intuition as Good as Corpus Frequency for Selecting Vocabulary?”

  1. Regarding this study, I would like to know more about the L1 English speakers who participated – maybe there would be a difference in the findings if the study was repeated with certain groups of L1 speakers, e.g. with a degree, teachers, etc.

    1. From the article: “It involved 17 NSs with a Master or PhD degree in language education, who were teachers of English at Japanese universities. Twelve of them were American and five were British by nationality.”

  2. Interesting!

    I definitely think intuition is perfectly acceptable as a kind of vocabulary selection heuristic device, and that it’s valuable if the corpus derived vocabulary lists haven’t controlled for things like genre dispersion and so students are getting words that won’t be frequent in contexts that they will likely encounter.

    I haven’t read the study yet, and maybe this is handled in the actual article, but I’ve thought the (a?) value of using a corpus to derive vocab lists is that it will be able to select words that expert speakers (native or otherwise) wouldn’t come up with on their own. In other words, intuition is good for hearing/reading a word and determining its relative frequency/commonness/appropriateness for study, but it’s not so good for selecting words ‘from scratch’.

    1. Hi Michael,

      Thanks for commenting! Your comment about the value of intuition is right on the money: the study demonstrated that the participants could judge a word’s frequency, but there was absolutely no information about actually choosing vocabulary from scratch (e.g. in writing a model text) nor about judging vocabulary in the context of a teaching task.
