Wednesday, June 13, 2007

Limitations of Threshold-Based Assessment

New York state announced significant improvement in this year's math scores, especially in New York City and schools in high-poverty areas; but Robert Tobias, former director of New York City's office of assessment, cautions that large increases in one grade that are not accompanied by similar increases in other grades may indicate significant sources of influence apart from the quality of instruction:

On this year’s reading test, for example, the proportion of state eighth graders reaching proficiency surged by 7.7 percentage points, but the proportion of proficient sixth graders increased by a more modest 2.8 points and that of seventh graders by only 1.4 points.
Indeed, state officials explained that, owing to statistical factors, the jump in the percentage of proficient eighth graders is not as significant as it first appears:
David M. Abrams, the state’s assistant commissioner for standards and assessment, noted that sixth graders and eighth graders improved about the same in raw numbers--five points on a scale in which a score of 650 represents proficiency. But since a comparatively large number of sixth graders were already proficient the year before and a relatively large number of eighth graders were clustered just below the 650 threshold, the same five points qualified many more eighth graders as proficient while doing far less for the sixth-grade showing.
This example underscores two weaknesses of using the percentage of students receiving proficient scores from year to year as a measure of a school's improvement.

First, each year the test assesses a different cohort of students, and many people who have worked in schools for any significant length of time will confirm that different groups of students perform better or worse, on average, than their older or younger peers. Due to natural variations in aptitude and attitude, some cohorts simply achieve more than others, even with the same courses, teachers, and resources. As a result, any progress the school as a whole is making toward universal proficiency can be obscured by the variation in what the students bring to the academic enterprise from cohort to cohort.

Second, because percent proficient is a threshold measure, a change in the proportion of proficient students says little about how far scores actually moved, and therefore little about the effort required to achieve the change. If, as in the case of New York eighth graders, many of the previous year's scores were clustered just below the threshold, a modest gain in test performance will appear as a significant improvement. However, if, as in the case of New York sixth graders, a comparatively large number of students were already above the threshold, the same modest gain will produce only a modest increase in the number of proficient students. This threshold measure, therefore, fails to give an accurate picture of the actual improvement in instructional quality.
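The threshold effect can be illustrated with a small sketch. The scores below are invented for illustration and are not New York data; the only figures taken from the example above are the 650 cutoff and the five-point gain. The sketch also assumes that a score of exactly 650 counts as proficient.

```python
from statistics import mean

CUTOFF = 650  # proficiency threshold cited in the quoted example


def percent_proficient(scores, cutoff=CUTOFF):
    """Percentage of scores at or above the cutoff (assumes >= counts as proficient)."""
    return 100 * sum(s >= cutoff for s in scores) / len(scores)


def summarize(before, gain):
    """Apply the same gain to every score and report both measures of change."""
    after = [s + gain for s in before]
    return {
        "mean_gain": mean(after) - mean(before),
        "proficiency_gain": percent_proficient(after) - percent_proficient(before),
    }


# Hypothetical cohorts: many eighth graders sit just below 650,
# while most sixth graders are already above it.
eighth_before = [630, 640, 646, 647, 648, 649, 652, 660, 670, 680]
sixth_before = [620, 648, 655, 660, 665, 670, 675, 680, 690, 700]

eighth = summarize(eighth_before, gain=5)
sixth = summarize(sixth_before, gain=5)
print("eighth grade:", eighth)
print("sixth grade:", sixth)
```

In this made-up data both cohorts gain five points on average, yet the share of proficient eighth graders rises by 40 percentage points while the share of proficient sixth graders rises by only 10.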

If test results were reported as average scores instead of percent proficient, on the other hand, analysts could compare effect sizes, and the five-point increases among New York's sixth and eighth graders would be rightly understood as roughly the same accomplishment. Moreover, value-added assessment could provide a much better indicator of how much schools and teachers themselves are contributing to students' academic progress, factoring out a number of cohort differences. Teachers could also benefit greatly from being able to target instruction based on prompt results from beginning-of-the-year baseline assessments, which would indicate the specific needs of each student in the new cohort.
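As a rough sketch of what comparing effect sizes might look like, the snippet below computes a standardized mean difference (Cohen's d with a pooled standard deviation, one common effect-size measure, chosen here as an assumption). The scores are hypothetical, and the two cohorts are deliberately given the same spread so that the only difference between them is where they sit relative to the cutoff.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(before, after):
    """Standardized mean difference using a pooled SD (assumes equal group sizes)."""
    pooled_sd = sqrt((stdev(before) ** 2 + stdev(after) ** 2) / 2)
    return (mean(after) - mean(before)) / pooled_sd


# Hypothetical scores, invented for illustration (not New York data).
eighth_before = [630, 640, 646, 647, 648, 649, 652, 660, 670, 680]
sixth_before = [s + 20 for s in eighth_before]  # same spread, shifted upward

# Every student gains the same five points in both cohorts.
eighth_after = [s + 5 for s in eighth_before]
sixth_after = [s + 5 for s in sixth_before]

d_eighth = cohens_d(eighth_before, eighth_after)
d_sixth = cohens_d(sixth_before, sixth_after)
print(round(d_eighth, 2), round(d_sixth, 2))
```

Because the gain and the spread are identical in both hypothetical cohorts, the two effect sizes come out the same, even though the change in percent proficient would differ sharply between them.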

(Incidentally, if value-added scores were incorporated into teacher evaluations, more high-quality teachers might be inclined to work with poor students: They wouldn't have to be as concerned with negative consequences for not reaching absolute proficiency in a single year if their students are still making remarkable gains, and students who start at the bottom have more room to move up than average students.)

Of course, we can't simply ignore the brute passing rate, since we want children not merely to experience improvement, but to achieve actual proficiency in core areas of knowledge and skills. But until we reach universal proficiency, a more detailed view of how students are improving (or not) could help us target interventions as we attempt to improve our system of education.
