Less technical post about VAM: What ‘value-added’ is and is not
Value-added has been getting a lot of media attention lately but, unfortunately, most stories are missing the point. In Gotham Schools I read about a teacher who got a low score because it was based on students who were not assigned to her. In The New York Times I read about three teachers who were rated in the single digits because they had high performing students and a few of those students’ scores went down. In The Washington Post I read about a teacher who was fired for getting low value-added on her IMPACT report, possibly because her students had inflated pretest scores after their teachers from the year before cheated.
Each of these stories makes it sound like the flaws in value-added are very fixable: get the student data more accurate, make some kind of curve for teachers of high performing students, get better test security so cheating can’t affect the next year’s teacher’s score. But the flaws in value-added go WAY beyond that, which is what I’ve been trying to show in my posts. These are not just exceptional scenarios; the flaws affect the majority of teachers.
In part 1 I demonstrated that the same teachers generally get wildly different value-added percentile rankings from one year to the next. In part 2 I showed that many teachers who taught multiple grades got rated ineffective and effective in the same year. The only thing crazier, I guess, is teachers getting rated effective and ineffective in the same year IN THE SAME CLASS. Yet this is exactly what happened.
The TDR spreadsheet has a lot of columns. Looking through them, I found that in addition to getting a value-added percentile rank for the entire class, there were also separate value-added scores for the top 1/3 of each class, the middle 1/3, and the bottom 1/3. So I made a plot to compare how the 5th grade math teachers did in adding value for the top 1/3 vs. the bottom 1/3. Again, there was little correlation.
Percentile ranks are often not a good way to compare things. When you see the different plots I’ve made showing little correlation between things that should relate closely, you might think that because the value-added percentile ranks fluctuate so much from year to year, or from class to class (or now from one portion of a class to another), the value-added scores fluctuate just as much. Actually, these wild changes in percentile rank are not due to the value-added scores changing by much. The issue is that all teachers have pretty much the same value-added scores. They range from about -.5 (the class scores went down by about 10%) to +.5 (the class scores went up by about 10%). A zero on this scale means that the students stayed in the same relative position as they were in the year before, not that they didn’t learn anything. Since the scores are so close, a slight change one way or the other can drastically change someone’s percentile rank, and the percentile ranks are the scores we hear about in the paper, where this teacher got a 1 or that teacher got a 7. This is why all the graphs that compare percentile ranks look so random.
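To get a feel for how much a small change in score can move a percentile rank when scores are bunched together, here is a minimal sketch. It does not use the TDR data. It assumes, purely for illustration, that value-added scores follow a bell curve centered at 0 with a standard deviation of 0.15; that number, and the names ASSUMED_SCORES, percentile_rank and rank_change, are my own inventions for this example.

```python
from statistics import NormalDist

# Assumption for illustration only: value-added scores are roughly normal,
# centered at 0 with a standard deviation of 0.15. These are not TDR figures.
ASSUMED_SCORES = NormalDist(mu=0.0, sigma=0.15)


def percentile_rank(score, dist=ASSUMED_SCORES):
    """Percent of teachers expected to score below `score` under `dist`."""
    return 100.0 * dist.cdf(score)


def rank_change(score, shift, dist=ASSUMED_SCORES):
    """Percentile rank before and after the score moves by `shift`."""
    return percentile_rank(score, dist), percentile_rank(score + shift, dist)


# A teacher at exactly 0 whose score rises by 0.1 the following year.
before, after = rank_change(0.0, 0.1)
print(f"percentile rank before: {before:.0f}, after: {after:.0f}")
```

Under that assumed spread, a change of just 0.1 in the score takes a teacher from the middle of the pack to roughly the 75th percentile. A tighter spread would make the jump even bigger.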
When things are grouped together so closely, it does not make sense to do a percentile ranking. Instead, there should be some kind of cutoff score that is determined beforehand as ‘good.’ Suppose I give my class a test and the average is an 85, with most of the kids getting between a 75 and a 95 and 2 kids getting under a 65. With a cutoff of 65, those two kids fail, and it would have been possible for everyone to pass. But if I do a percentile ranking, a kid who got a 75 might now be in the bottom 10 percent, since he has the third lowest grade in a class of 34.
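The classroom example can be checked with a few lines of code. This is only a sketch with made-up grades: a hypothetical class of 34 with two kids under 65, one kid at 75, and everyone else between 76 and 95. The function name percentile_rank and the exact grades are assumptions for illustration.

```python
def percentile_rank(score, scores):
    """Percent of scores in `scores` that are strictly below `score`."""
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)


# Made-up class of 34: two students under 65, one at 75,
# and 31 students spread between 76 and 95.
scores = [58, 62, 75] + [76 + (i % 20) for i in range(31)]

CUTOFF = 65  # passing score fixed before the test is given

failing = [s for s in scores if s < CUTOFF]
print("students below the cutoff:", len(failing))
print("percentile rank of the 75:", round(percentile_rank(75, scores), 1))
```

With the fixed cutoff only the two lowest grades fail, and the 75 passes comfortably. Ranked by percentile, that same 75 lands in the bottom 10 percent, even though it is a perfectly respectable grade.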
Here is a plot showing how a teacher’s pretest scores relate to her value-added score. Notice that the scores are all pretty much between -.6 and +.6. Also see that there is no correlation, so the teachers at the ‘failing’ schools (the left side of the plot, where the pretest scores are lower) add as much ‘value’ as the teachers at the ‘good’ schools. I’ve added color for the five categories: high (top 5%), above average (5% to 25%), average (25% to 75%), below average (75% to 95%), and low (bottom 5%).
I also made a histogram to see how clustered these points really are. What I learned is that 98% of the teachers in this data set had scores between -.3 and +.6. If there were to be some kind of ‘cutoff’ for passing, it would probably be around -.6 in which case 99% of teachers would have to be rated effective.
Statistically, allowing for a margin of error, these 99% of teachers are ‘equal’ in terms of ‘value-added.’ Ironically, a tool that was designed to show how widely teacher quality varies is actually showing that all teachers are about the same. It is only through the improper use of percentile rankings that the bottom 5%, no matter how close they are to the average, get the awful-seeming single digit ‘scores’ we see in the paper.
I want to make it clear that I’m not saying that all teachers are equally good. But when a tool designed to show how vastly different teachers are at improving test scores just ‘proves’ all teachers are equally good, that tool needs to be scrapped.