Seo, D. G., & De Jong, G. (2015). Comparability of online- and paper-based tests in a statewide assessment program: Using propensity score matching. Journal of Educational Computing Research, 52(1), 88–113.

Journal Article



Disabilities Not Specified; Electronic administration; High school; K-12; Math; Middle school; Multiple ages; No disability; Reading; Social studies; Student survey; U.S. context




This study investigated student performance on a computer-based, online-delivered assessment in comparison to the performance of students taking the same assessment on paper. The study also focused on the potential usefulness of propensity score matching as a method for constructing comparable groups for these performance comparisons.


Samples of the data set consisted of 12,203 grade 6 students who took the state social studies assessment on paper and an equal number of grade 6 students who took the same assessment via an online version of the test. These two groups were determined to be equivalent through propensity score matching, which drew very similar sets of students based on demographics (including gender, ethnicity, and English proficiency status) and on previous state reading and math test scores. Accordingly, approximately 700 special education students and 11,503 general education students took the paper-based test, and 709 special education students and 11,494 general education students took the online version. Similarly, 15,563 grade 9 students took each test version (paper or online); the paper version was completed by approximately 805 special education students and 14,758 general education students, and the online version by 834 special education students and 14,729 general education students. These numbers by student type are approximate. The mean performance levels of the two samples at each grade level were compared, and item-level performance was also calculated; however, the performance scores of specific students were not compared according to disability status.

Dependent Variable

The study sampled an extant 2012 data set from the state social studies assessment (Michigan Educational Assessment Program/MEAP) for this testing mode comparison. That year's data were selected because both an online and a paper-based version of the social studies assessment were offered. Districts and schools self-selected (volunteered) to implement the online version rather than the paper-based version. To ensure similar test-taker characteristics in each group, the researchers applied propensity score matching, a quasi-experimental design for approximating equivalent comparison groups from nonrandomized data. A brief survey about the online test was also administered.
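The summary does not detail the exact matching algorithm the researchers used. As an illustrative sketch only, the following shows one common variant, 1:1 nearest-neighbor matching without replacement within a caliper. The propensity scores here are hypothetical; in practice each score would be the predicted probability of taking the online test, estimated from a logistic regression on covariates such as gender, ethnicity, English proficiency, and prior reading and math scores.

```python
# Illustrative sketch of 1:1 nearest-neighbor propensity score matching
# without replacement. All scores and IDs are hypothetical, not from the
# study's data set.

def match_nearest(online_scores, paper_scores, caliper=0.05):
    """Pair each online test-taker with the closest still-unmatched
    paper test-taker whose propensity score lies within the caliper."""
    available = dict(paper_scores)  # id -> score, not yet matched
    pairs = []
    # Process online test-takers in score order for deterministic results.
    for oid, os in sorted(online_scores.items(), key=lambda kv: kv[1]):
        if not available:
            break
        pid = min(available, key=lambda p: abs(available[p] - os))
        if abs(available[pid] - os) <= caliper:
            pairs.append((oid, pid))
            del available[pid]  # matching is without replacement
    return pairs

online = {"o1": 0.62, "o2": 0.48, "o3": 0.90}
paper = {"p1": 0.60, "p2": 0.50, "p3": 0.10, "p4": 0.88}
print(match_nearest(online, paper))
```

Note that the paper test-taker with a very different score (0.10) is left unmatched, which is how this design discards test-takers who have no close counterpart in the other mode.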


The propensity score matching process indicated that the two samples of students at each grade level were very similar in demographic characteristics and had very similar mean math and reading scale scores. These performance scores became even more similar after matching: the group mean differences across all the tests were smaller once the data matching was completed for each data set. Differential item functioning (DIF) analyses indicated that "none of the items were identified as performing differentially between two modes" (p. 101). Additional test-level comparisons indicated no significant differences in scoring patterns between the grade 6 and grade 9 students taking the paper-based test version and those taking the online version. The grade 6 online test-takers' mean scores were slightly (not statistically significantly) higher than the mean scores of their matched paper-based test-taker peers, and similarly slightly higher than the mean of all paper-based test-takers. (Effect size differences for both raw and scale scores at both grade levels were negligible.) The grade 9 online test-takers' mean scores were slightly (not statistically significantly) higher than the mean of their matched paper-based test-taker peers, yet slightly lower than the mean of all paper-based test-takers. This discrepancy between the matched and unmatched grade 9 comparisons suggested to the researchers that "the appropriate matching method can have significant impact on the comparability study" (p. 106). In other words, the propensity score matching process was asserted to yield more equivalent comparison groups and thus more precise comparisons.
Consistent with this overall trend of slightly higher means, the proportions of online test-takers in grades 6 and 9 who achieved the advanced, proficient, and partially proficient performance levels were slightly larger than the proportions of their matched paper-based test-taking peers, and the proportion of online test-takers at the not proficient performance level was smaller than the proportion of matched paper-based test-takers in both grades. The brief survey indicated that about 70% of students preferred the online test mode, 10% preferred the paper-based test, and 20% had no preference; further, none of the online test-takers indicated discomfort with the computer. Future research directions were suggested.
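The summary reports that DIF analyses flagged no items between modes, but does not name the DIF method used. As a hedged sketch, the Mantel-Haenszel procedure below is one common way such mode-DIF checks are run: students are stratified by total score, and correct/incorrect counts for one item are compared between online and paper test-takers within each stratum. All counts here are hypothetical.

```python
# Hypothetical Mantel-Haenszel DIF sketch for one test item across modes.
# Each stratum is a score band: (online_correct, online_incorrect,
#                                paper_correct, paper_incorrect).

def mantel_haenszel_or(strata):
    """Common odds ratio across score strata.

    A value near 1.0 suggests the item functions similarly for
    online and paper test-takers (no DIF flagged)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Made-up counts with equal item odds for both modes in every stratum:
strata = [(80, 20, 40, 10), (50, 50, 25, 25), (30, 70, 15, 35)]
print(round(mantel_haenszel_or(strata), 3))  # 1.0 -> item shows no mode DIF
```

An odds ratio well above or below 1.0 (conventionally judged against effect-size thresholds) would flag the item as performing differently between the two administration modes, which is the pattern the study did not find.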