Haladyna, T. M., Rodriguez, M. C., & Stevens, C. (2019). Are multiple-choice items too fat? Applied Measurement in Education, 32(4), 350–364. https://doi.org/10.1080/08957347.2019.1660348

Journal Article



Accommodation/s not specified; College entrance test; Elementary; High school; K-12; Language arts; Middle school; Multiple ages; No disability; Postsecondary; Reading; Science; U.S. context





Accommodations were not reported, yet assessment design elements bear on the performance of all students, including students with disabilities. The purpose was to examine, analytically rather than experimentally, the evidence for using three-option multiple-choice items, rather than the typical four options, on standardized exams. Distractor functioning was a central focus, and the distractor analyses supplied the evidence bearing on this question.


Extant datasets from 58 standardized tests, incorporating an unreported number of individual test-takers, were analyzed. Test-takers were from K–12 and postsecondary levels. These were large-scale datasets; information about whether students had disabilities was not reported, and no comparisons by test-taker characteristics within each dataset were made.

Dependent Variable

The data comprised a wide selection of 58 standardized assessments, apparently with approximately 5,204 items overall, including general state achievement assessments and tests measuring college readiness or professional certification. The assessments used selected-response (multiple-choice) items, each with a correct option and up to four additional incorrect options, identified as "distractors." The authors stated, "A distractor should be plausible enough to be chosen by examinees with low ability and implausible to those examinees with high ability" (p. 355). These extant datasets were analyzed to determine the relative benefits of selected-response items having three rather than four or five options. The related matter of distractor functioning was also discussed, with the aims of clarifying appropriate criteria and advancing an item typology. Data Set 1 had 1,000 items from reading assessments and 1,000 items from math assessments for grades 3–8 and 10, and Data Set 2 had 1,000 items from college readiness assessments covering math, reading, science, and English language arts. Data Set 3 had 1,000 items from postsecondary admissions tests, and Data Set 4 had over 1,200 items from professional credentialing tests. [Data Set 4 was not further discussed, as it appeared not relevant to the educational levels in the Accommodations Bibliography.]


The researchers asserted that assessments with three-option items can sufficiently measure the knowledge and skills of all students. To counter the argument that a wider set of response options is needed to neutralize the inaccuracy that random guessing can introduce, the authors demonstrated that the probability that random guessing affects scores decreases as test length increases. They showed that, on three-option items, scores beyond 40% had relatively small likelihoods with 100 items, and that the likelihood of scores beyond 50% differed negligibly with three options even with 50 test items, and especially with more items.
The results of distractor analyses were reported. Low selection frequency and limited discriminatory power were each identified as criteria indicating that a distractor had little usefulness, that is, was non-functioning. In Data Set 1's math assessments (grades 3–8 and 10), 4% of distractors were low in selection frequency and 36% were low in discrimination; that is, about 40% of distractors were non-functioning. Distractor analysis results for each grade level were also reported. In Data Set 1's reading assessments (grades 3–8 and 10), 12% of distractors were low in selection frequency and 22% were low in discrimination; that is, about 34% of distractors were non-functioning. In Data Set 2 (pre-college readiness), 33% of distractors overall, across math, reading, science, and English language arts assessments, were non-functioning; the proportion ranged from 23% of reading assessment distractors to 48% of math assessment distractors. In Data Set 3 (postsecondary admission), 32% of distractors overall were non-functioning, with 17% non-functioning in reading and 46% non-functioning in math. For both Data Set 2 and Data Set 3, the math assessments had five response options and the other assessments had four.
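The guessing argument in the findings can be illustrated with a binomial tail probability. The sketch below is not the authors' computation; it assumes blind, independent guessing with every option equally likely, and it reads "beyond 40%" on 100 items as 41 or more correct and "beyond 50%" on 50 items as 26 or more correct. The function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(n_items: int, n_options: int, min_correct: int) -> float:
    """Probability of getting at least `min_correct` items right by blind guessing.

    Assumes every item has `n_options` equally likely options and that
    guesses are independent, so the number correct is binomial.
    """
    p = 1.0 / n_options
    return sum(
        comb(n_items, k) * p**k * (1.0 - p) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


if __name__ == "__main__":
    # Compare three-, four-, and five-option items under the assumed readings
    # of "beyond 40%" (100 items) and "beyond 50%" (50 items).
    for n_options in (3, 4, 5):
        print(
            n_options,
            round(prob_at_least(100, n_options, 41), 4),
            round(prob_at_least(50, n_options, 26), 4),
        )
```

Under these assumptions, fewer options always raise the chance of a given guessed score, but the tail probabilities shrink quickly as the number of items grows, which is the pattern the summary describes.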
The researchers reported that a large proportion of test items had non-functioning distractors. They asserted that the possibility of strategic guessing ought to obligate assessment developers to bring more precision to item design, which can be achieved with fewer than the typical four or even five response options. The authors concluded that assessment design ought to incorporate items that have fewer response options and that better measure the distinction between understanding and not understanding academic content.
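The two criteria named in the findings, low selection frequency and low discrimination, can be sketched for a single distractor as follows. This is an illustration only: the 5% frequency cutoff, the mean-score comparison used as a discrimination check, and the names `classify_distractor` and `MIN_FREQUENCY` are assumptions, not the criteria or thresholds reported by the authors.

```python
from statistics import mean

MIN_FREQUENCY = 0.05  # assumed cutoff; the summary does not state a threshold


def classify_distractor(choices, total_scores, option, min_frequency=MIN_FREQUENCY):
    """Label one distractor using two simple, assumed checks.

    choices      -- the option each examinee selected on this item
    total_scores -- each examinee's total test score, in the same order
    option       -- the distractor being evaluated
    """
    chose = [s for c, s in zip(choices, total_scores) if c == option]
    others = [s for c, s in zip(choices, total_scores) if c != option]
    frequency = len(chose) / len(choices)
    if not chose or frequency < min_frequency:
        return "low selection frequency"
    # A functioning distractor should attract lower-scoring examinees, so
    # those who chose it should score lower on average than those who did not.
    if not others or mean(chose) >= mean(others):
        return "low discrimination"
    return "functioning"


if __name__ == "__main__":
    # Made-up responses to one item whose correct answer is "A".
    choices = ["A"] * 10 + ["B"] * 6 + ["C"] * 4
    scores = [18, 17, 16, 15, 15, 14, 14, 13, 12, 12,
              6, 7, 8, 8, 9, 10,
              15, 16, 14, 17]
    for option in ("B", "C", "D"):
        print(option, classify_distractor(choices, scores, option))
```

In this made-up item, option B draws mostly low scorers, option C draws high scorers, and option D is never chosen, so the three options exercise each branch of the check.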