Less technical post about VAM: What ‘value-added’ is and is not

It’s all about the ‘error rate’ — or so even I used to think.

Nearly everything I’ve read that questions the validity of the value-added metric mentions the astronomical ‘error rates.’ When the NYC Teacher Data Reports were first released, the New York Times website ran an article with some numbers that have been frequently quoted:

But citing both the wide margin of error — on average, a teacher’s math score could be 35 percentage points off, or 53 points on the English exam — as well as the limited sample size — some teachers are being judged on as few as 10 students — city education officials said their confidence in the data varied widely from case to case.

After thinking so much about value-added for my analysis, it recently hit me that the above critique actually understates the real problem with value-added error rates. The implication is that if they could just figure out some way to get those error rates down under some acceptable threshold, then the measure would be much more useful. But I plan to show in this post that it would not matter if they got the error rates down to 0%, because the error rates do not actually mean what most people think they mean.

So what is an ‘error rate’? Well, if the temperature outside is 70 degrees and my thermometer says the temperature is 77 degrees, then my thermometer, at that moment, has a 10% error. If I read the thermometer twenty times when the temperature is 70 degrees and get readings as low as 56 degrees and as high as 84 degrees, we can say that my thermometer is not very accurate, since there is a 20% error rate compared to the ‘true’ temperature. This is what we think of when we hear about error rates: how the measurement compares to the ‘true’ number.
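To make the arithmetic concrete, here is a minimal sketch (my own illustration, using only the numbers from the thermometer example above) of percent error relative to a known ‘true’ value:

```python
def percent_error(reading, true_value):
    """Percent error of a single reading relative to the 'true' value."""
    return abs(reading - true_value) / true_value * 100

true_temp = 70

# One reading of 77 when the truth is 70: 7/70 = 10% error.
print(percent_error(77, true_temp))  # 10.0

# Twenty noisy readings span 56 to 84; the extremes are 14 degrees off,
# and 14/70 gives the 20% error rate for this thermometer.
readings = [56, 63, 70, 77, 84]  # the extremes plus a few in between
worst = max(percent_error(r, true_temp) for r in readings)
print(worst)  # 20.0
```

The key point carried by this snippet: computing an error rate this way requires already knowing `true_value`.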

In the case of the temperature, the ‘true’ temperature that my readings are compared to is measured by some kind of very expensive and very accurate thermometer. Without that other thermometer that has the ‘true’ temperature, there is no way to measure the accuracy of my thermometer.

So when we hear that the value-added metric has a 35 percent error rate for a particular teacher and that the teacher scores at the 40th percentile, we take this to mean that the teacher’s ‘true’ quality is somewhere between the 5th percentile and the 75th percentile. There is no way the teacher’s ‘true’ quality can be lower than the 5th percentile or higher than the 75th percentile; otherwise the error rate for that teacher would not be 35 percent.

This is making the very reasonable assumption that these ‘error rates’ are defined by how the value-added measure compares to the ‘true’ measure of teacher quality. But since there is no equivalent, in teaching, of the super accurate thermometer that measures the ‘true’ quality, how can it possibly be compared to that?

It can’t. And that is why the error rates are more meaningless than I had realized. They don’t compare the value-added number to the ‘true’ teacher quality number — they can’t. Instead, all that the error rate measures is how the value-added number for that teacher compares to what the value-added number for that teacher would be if we re-calculated it with about fifty times the amount of data. That’s it. With more data the error rates go down, so that with fifty years of data the error rate would be pretty close to zero, and then we could say, definitively, that this teacher is in the 40th percentile as a ‘value-adder.’ But that is not the same thing as saying that the teacher is in the 40th percentile in her ‘true’ teacher quality.
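This distinction can be seen in a toy simulation (my own sketch, not the actual VAM model — the quality, bias, and noise numbers are invented for illustration): an estimator can sit a fixed distance from the ‘true’ value, yet its reported margin of error, which only reflects sampling noise, shrinks toward zero as data accumulates. The systematic gap never shows up in the error rate.

```python
import random

random.seed(0)

TRUE_QUALITY = 50.0   # the teacher's 'true' quality (unknowable in practice)
BIAS = 15.0           # systematic gap between the value-added score and 'true' quality
NOISE = 20.0          # year-to-year sampling noise in the score

def observed_scores(n_years):
    """Simulated value-added scores: the true quality, plus bias, plus noise."""
    return [TRUE_QUALITY + BIAS + random.gauss(0, NOISE) for _ in range(n_years)]

for n in (1, 50):
    scores = observed_scores(n)
    estimate = sum(scores) / n
    se = NOISE / n ** 0.5  # standard error of the mean shrinks like 1/sqrt(n)
    print(f"{n:>2} years of data: estimate {estimate:5.1f} +/- {1.96 * se:4.1f} "
          f"(true quality is still {TRUE_QUALITY})")
```

With 50 years the reported margin of error collapses to a few points, so the estimate looks very ‘accurate’ — but it converges on the biased value, not on the true one.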

Now that is not to say that this more accurate ‘value-adder’ percentile would be completely useless — but it still would not deserve to count as a large portion of a teacher’s evaluation.

My point is that most people who hear about these ‘error rates’ assume that they are based on comparing the number to the teacher’s ‘true’ quality. Even I’ve written things in the past like “The 30% error rate means that 30% of the time an effective teacher will be rated ineffective by this measure and an ineffective teacher will be rated effective.” Now I realize that this was too generous. It would have been more accurate to write “30% of the time an effective ‘value-adder’ will be rated as an ineffective ‘value-adder’ and vice-versa.” Until the ‘true’ quality of a teacher can be measured accurately with some other method, we’ll never be able to say anything more definitive than that about value-added.


I was going to comment something to this effect on your previous value-added posts. In the language of statistics, it appears that the scores have neither precision nor accuracy, which makes them doubly invalid.