Middle School Vocabulary Lists (MSVL)

This page describes the Middle School Vocabulary Lists (MSVL), giving information on what the MSVL are, how the lists were developed, and coverage of the MSVL.

What are the MSVL?

The Middle School Vocabulary Lists (MSVL) are a series of five lists of academic and technical vocabulary for middle school (grades 6-8) students, covering five subject areas: (1) English Grammar and Writing, (2) Health, (3) Mathematics, (4) Science, and (5) Social Studies and History. Each list consists of 300-400 word families that occur frequently in the relevant subject area but are not contained in the GSL (General Service List).

How were the MSVL developed?

The MSVL were developed from a corpus of 109 middle school textbooks, called the MS-CAT corpus (Middle School Content-Area Textbook corpus), with multiple textbooks for each subject area and grade level. There were a total of 18.2 million words in the MS-CAT corpus.

Academic words for the MSVL were selected based on range and frequency. Words needed to occur with a minimum frequency of 11.4 per million words in each subject area subcorpus, and a minimum frequency of 28.5 per million words in the MS-CAT corpus as a whole. These frequency levels were first used to identify words in the AWL (Academic Word List) which occurred in all subject areas, which were placed in all five lists. The same frequencies were used to identify AWL words which occurred in one or more but not all subject areas, which were placed in the relevant list(s). The same frequencies were then used to identify non-AWL words which occurred in all subject areas, which were again placed in all lists. These frequencies were chosen since they are proportionally the same as those used by Coxhead when devising the AWL.
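The selection rule above can be sketched in a few lines of Python. This is only an illustration of one reading of the description, not the procedure actually used to build the MSVL: the function names, the subcorpus sizes and the word counts are all invented for the example, and only the two thresholds come from the text.

```python
from typing import Dict, List

# Thresholds quoted in the text, in occurrences per million words.
MIN_PER_SUBCORPUS = 11.4
MIN_OVERALL = 28.5


def per_million(count: int, corpus_size: int) -> float:
    """Convert a raw count into a frequency per million running words."""
    return count / corpus_size * 1_000_000


def lists_for_word(counts: Dict[str, int], sizes: Dict[str, int], in_awl: bool) -> List[str]:
    """Return the subject lists a word would join (hypothetical helper).

    counts: subject area -> raw count of the word in that subcorpus
    sizes:  subject area -> total running words in that subcorpus
    """
    total_count = sum(counts.get(s, 0) for s in sizes)
    if per_million(total_count, sum(sizes.values())) < MIN_OVERALL:
        return []
    passing = [s for s in sizes
               if per_million(counts.get(s, 0), sizes[s]) >= MIN_PER_SUBCORPUS]
    if len(passing) == len(sizes):
        return passing              # frequent in every subject area: all lists
    # Assumption: only AWL words may join a subset of the lists.
    return passing if in_awl else []


# Invented subcorpus sizes and counts, for illustration only.
sizes = {"mathematics": 2_000_000, "science": 2_000_000}
print(lists_for_word({"mathematics": 100, "science": 30}, sizes, in_awl=False))
print(lists_for_word({"mathematics": 120, "science": 10}, sizes, in_awl=True))
print(lists_for_word({"mathematics": 120, "science": 10}, sizes, in_awl=False))
```

As a check on the "proportionally the same" remark: assuming the AWL criteria were 100 occurrences in a corpus of 3.5 million words and 10 occurrences in each of four sub-corpora of roughly 875,000 words, per_million(100, 3_500_000) and per_million(10, 875_000) give about 28.6 and 11.4, close to the thresholds quoted above.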

Additionally, in order to identify and include suitable technical vocabulary, words which occurred at least 100 times per million words in a subcorpus were added to the list for that subject area.

Although the lists are arranged by word family, the MSVL include only those word family members which meet the above frequency thresholds, with the exception of headwords, which were added if missing. This contrasts with the AWL, which includes all members of the family, regardless of frequency.

Proper nouns, acronyms and abbreviations are not included in the MSVL.

What is the coverage of the MSVL?

The coverage of the lists ranges from 5.85% (social studies and history) to 10.17% (science). In combination with the GSL, the coverage ranges from 83.74% (social studies and history) to 91.43% (health). The average coverage of the GSL on the MS-CAT corpus is 79.56%, which contrasts with 76.1% for the academic corpus used to create the AWL, possibly reflecting the fact that middle school textbooks, intended for a lower age range, use a greater proportion of high frequency vocabulary.

One reason for the lower percentages for social studies and history may be the fact that these subjects include a large number of proper nouns, which are excluded from the MSVL.
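Coverage here means the percentage of running words in a corpus that belong to a given list. A minimal sketch of that calculation follows; the coverage function, the sample sentence and the two tiny word sets are invented for illustration, and a real calculation would first expand each word family into its member forms.

```python
from typing import Iterable, Set


def coverage(tokens: Iterable[str], word_list: Set[str]) -> float:
    """Percentage of running words (tokens) that appear in word_list."""
    lowered = [t.lower() for t in tokens]
    if not lowered:
        return 0.0
    hits = sum(1 for t in lowered if t in word_list)
    return 100.0 * hits / len(lowered)


# Invented sample text and word sets, not taken from the real lists.
tokens = "The cell divides and the nucleus splits".split()
gsl_sample = {"the", "and"}
msvl_sample = {"cell", "nucleus", "divides"}

print(round(coverage(tokens, gsl_sample), 2))                # GSL sample only
print(round(coverage(tokens, msvl_sample), 2))               # MSVL sample only
print(round(coverage(tokens, gsl_sample | msvl_sample), 2))  # combined
```

Because the MSVL exclude words already in the GSL, the two sets do not overlap, so the combined coverage is simply the sum of the two separate figures.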

The coverage of the MSVL can be contrasted with the AWL on the MS-CAT corpus. The AWL gives only 5.37% coverage, much lower than the MSVL, showing that the MSVL is a better list for middle school students, and that the AWL, which is intended for university level students, is less suitable.

The lists were tested by using a parallel corpus, which demonstrated similar coverage, ranging from 5.95% (social studies and history) to 9.48% (science).

The lists were also used on a corpus of middle school reading and literature textbooks, giving coverage of between 1.73% (mathematics) and 2.89% (English grammar and writing). This much lower coverage on non-academic texts demonstrates that the MSVL are true academic lists.

The coverage of each list, along with the coverage by the GSL and the GSL and MSVL combined, is shown in the following table.

Content area                  MSVL    GSL      Total
English grammar and writing   6.83%   82.14%   88.97%
Social studies and history    5.83%   77.91%   83.74%


