Last year I spent a lot of time making scatter plots of the released New York City teacher data reports to demonstrate how unreliable value-added measurements are. Over a series of six posts, which you can read here, I showed that the same teacher can get completely different value-added rankings in two consecutive years, in the same year with two different subjects, and in the same year with the same subject but in two different grades.

Here is an example of such a scatter plot, this one showing the ‘raw’ score (for value added, this is a number between -1 and +1) for the same teachers in two consecutive years. Notice how it looks like someone fired a blue paint ball at the middle of the screen. This is known, mathematically, as a ‘weak correlation.’ If the value-added scores were truly stable from one year to the next, you would see a generally upward sloping line from the bottom left to the top right.

There is actually a slight correlation in this blob. I’d expect this, since some of the biases in these kinds of calculations will hurt or help the same teachers in the same schools in two consecutive years. But the correlation is so low that I, and many others who have created similar graphs, concluded that this kind of measurement is far from ready to be used for high-stakes purposes like determining salaries and laying off senior teachers who are below average on this metric.

Teacher evaluations are a hot topic right now, as Race To The Top required ‘winners’ to implement evaluations that incorporate ‘student learning’ as a significant component. Though value-added is not the same thing as student learning, most states have taken it to mean that anyway. In some places, value-added is now as much as 50% of a teacher’s evaluation. A big question is what the appropriate weight is for such an unreliable measure. In D.C. it was originally 50%, but has been scaled back to 35%. I’d say that it should currently be pretty close to 0%.

Bill Gates has spent $50 million on a three-year project known as the MET (Measures of Effective Teaching) project. They just concluded the study and released a final report, which can be found here. In the final report they conclude that teacher evaluations have an ideal weighting of 33% value-added, 33% principal observations, and 33% student surveys. They justify the 33% value-added because they have analyzed the data and found, contrary to everyone else’s analysis of similar data, that teachers DO have similar value-added scores from one year to the next. To prove their point, they print on page 8 this very compelling set of graphs.

These graphs are scatter plots comparing the ‘predicted achievement’ to the ‘actual achievement’ for the teachers in the study. This ‘predicted achievement’ is, presumably, based on the score that the teacher got the previous year. As these points line up pretty nicely on the slanted line, they conclude that value-added is actually very stable.

Well, there were a lot more than twenty teachers in the study. The reason that there are twenty dots on these graphs is that they averaged the predicted and actual scores within groups of five percentile points each (100 / 5 = 20 groups). In doing this, they mask much of the variability among individual teachers. They don’t let us see the kind of scatter plot with thousands of points like I presented above.

To test how much this averaging masks the unreliability of the metric, I took the exact same data that I used to create the ‘paint ball’ graph at the top of this post and averaged it the same way. Here’s what that data looks like after averaging.

Since even a ‘paint ball’ produces such a nice line when subjected to averaging, we can safely assume that the Gates data, if we were to see it in its un-averaged form, would be just as volatile as my first graph.
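
For anyone who wants to try the averaging effect themselves, here is a minimal Python sketch. It does not use the actual New York City data or the MET data. It generates synthetic scores, and the function names, the 5,000 teachers, and the 0.6 spread of the ‘stable’ component are all assumptions chosen only for illustration. It compares the correlation of the individual points with the correlation of the twenty group averages.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_scores(n_teachers=5000, stable_sd=0.6, seed=0):
    """Synthetic (hypothetical) year-1 and year-2 scores.

    Each teacher gets a small stable component that persists across
    years, plus independent noise with standard deviation 1 in each year.
    """
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        stable = rng.gauss(0, stable_sd)
        year1.append(stable + rng.gauss(0, 1))
        year2.append(stable + rng.gauss(0, 1))
    return year1, year2


def bin_means(xs, ys, n_bins=20):
    """Sort pairs by x, split into n_bins equal-size groups, and
    return the mean x and mean y of each group.

    Any leftover pairs beyond n_bins * (len // n_bins) are dropped.
    """
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_bins
    mean_x, mean_y = [], []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        mean_x.append(statistics.fmean(p[0] for p in chunk))
        mean_y.append(statistics.fmean(p[1] for p in chunk))
    return mean_x, mean_y


if __name__ == "__main__":
    y1, y2 = simulate_scores()
    bx, by = bin_means(y1, y2)
    print("correlation of individual teachers:", round(pearson(y1, y2), 2))
    print("correlation of 20 group averages:  ", round(pearson(bx, by), 2))
```

Under these assumptions the individual points should show only a weak correlation, while the twenty averaged points should line up almost perfectly, which is the same pattern as the two graphs above.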

It seems like the point of this ‘research’ is to simply ‘prove’ that Gates was right about what he expected to be true. He hired some pretty famous economists, people who certainly know enough about math to know that their conclusions are invalid.

Wow! Great work, Gary.