In my last post, I showed that the final report from the Gates Foundation MET project contains a very misleading graph. Though the implication of this graph (namely, that value-added measures are consistent from one year to the next) was not the only point of this study, I called it THE $50 million lie because it is the thing that will be used by ‘reformers’ for years to come as ‘proof’ that test scores should be a significant factor in teacher evaluations.

Here is the cover of the report. Note the hip white male teacher and the black student with both a hoodie (and wearing it with the hood on in class!) and glasses (note to the sensitive reader: I’m not saying that no black kids who wear hoodies also wear glasses nor that no black kids with glasses also wear hoodies. Also I’m aware that anyone can need glasses, which has nothing to do with intellectual ability or academic motivation. I just found this picture to contain a lot of ‘subtext’). Note also how the effective ‘teaching’ is actually ‘blended’ learning, with the kid learning at his own pace with the aid of a very old computer with the old-school giant monitor.

There are an infinite number of ways to make charts and graphs of different results obtained. The choice of which ones to include and which ones to omit can reveal what the authors THINK the data proves. The authors also describe their conclusions, which are supported by the charts and graphs they chose to make. As the readers don’t get all the objective raw data, it is easy to be swayed into believing the authors’ interpretations of the numbers.

In this post I intend to show that the few numbers they present in this paper can just as easily be interpreted to contradict the conclusions the authors make. In this way, it seems to me, this costly undertaking will have served little purpose. There will be a few things (mostly misleading) that will be quoted by ‘reformers’ and also plenty that I hope, based on this post, can be used by ‘anti-reformers’ when this study is used as evidence by ‘reformers’ in any sort of debate.

One thing that isn’t mentioned in the paper, but is very evident from the graphs I showed in the previous post, is that according to test score gains, the vast majority of teachers are statistically ‘equivalent.’ It seems like 90% of the teachers are within .05 standard deviations of the mean. They don’t say how much extra learning the .05 is, but they do say that .25 is equivalent to one year. So I’d say (if I had to, that is; I don’t think ‘learning’ is measured in time units) that .05 would amount to a few weeks of learning. Where are the mythical teachers who get a ‘year and a half of growth’ and the even rarer ‘effective triplets’ who, if you happen to get three of them in a row, can erase the built-up achievement gap? An implication of this is that finding the ideal weighting for components of teacher evaluation will not make a significant difference in student achievement.
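For anyone who wants to check that conversion, here is a rough sketch of the arithmetic. The .25 and .05 figures are the ones quoted above; the 36-week school year is my own assumption for illustration and does not come from the report.

```python
# Figures quoted in the post: 0.25 SD is treated as one year of learning,
# and most teachers appear to fall within 0.05 SD of the mean.
SD_PER_YEAR = 0.25
TEACHER_SPREAD_SD = 0.05

# Assumption (not from the report): about 36 instructional weeks per school year.
WEEKS_PER_SCHOOL_YEAR = 36


def sd_to_weeks(effect_sd, sd_per_year=SD_PER_YEAR,
                weeks_per_year=WEEKS_PER_SCHOOL_YEAR):
    """Convert an effect size in standard deviations to weeks of learning."""
    return effect_sd / sd_per_year * weeks_per_year


weeks = sd_to_weeks(TEACHER_SPREAD_SD)
print(f"{TEACHER_SPREAD_SD} SD is about {weeks:.1f} weeks "
      f"of a {WEEKS_PER_SCHOOL_YEAR}-week year")
```

Under that assumption, .05 comes out to a fifth of a school year, so the exact number of weeks depends entirely on how long you take the year to be.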

The main conclusion of this study is that when it comes to teacher evaluations, the best policy is to base them on ‘multiple measures’ and that observations are not as good a tool for measuring teacher quality as test score gains are. Though they don’t commit to an exact breakdown of how the measures should be weighted, they suggest that 33% to 50% could be based on standardized test scores while the rest is based on equal parts of classroom observations and student surveys. This 33% to 50% figure is no different from what the primary author, Thomas Kane, one of the leading figures in value-added, has been saying for years.

Though science does sometimes prove things that are not intuitive, science does depend on accurate premises. So, in this case, IF the conclusion is that “you can’t believe your eyes” in teacher evaluation (just because you watch a teacher doing a great job, this could be a mirage, since that teacher doesn’t necessarily get the same ‘gains’ as the other teacher that you thought was terrible based on your observation), well, it could also mean that one of the initial premises was incorrect. To me, the faulty premise behind this counter-intuitive conclusion is value-added itself, which says that teacher quality can be determined by comparing student test scores to what a computer would predict those same students would have gotten with an ‘average’ teacher. Would we accept it if a new computer programmed to evaluate music told us that The Beatles’ ‘Yesterday’ is a bad song?

One thing that struck me right away with this report is that student surveys, something that isn’t realistically ever going to be a significant part of high stakes teacher evaluations, are given such a large percentage in each of the three main weightings they consider (these three scenarios are, for test scores-classroom observations-student surveys, 50-25-25, 33-33-33, and 25-50-25).

Conspicuously missing from the various weighting schemes they compare is one with 100% classroom observations. As this is what many districts currently do and since this report is supposed to guide those who are designing new systems, wouldn’t it be scientifically necessary to include the existing system as the ‘control’ group? As implementing a change is a costly and difficult process, shouldn’t we know what we could expect to gain over the already existing system?

This large inclusion of student surveys in each scenario is not very good scientific practice since it adds an unnecessary (for practical purposes) variable into the mix and makes it that much more difficult to make decisive claims about what the weighted averages mean (perhaps this was something they intended). This study would have been much more useful if they had focused on value-added and classroom observations. And by including the student surveys so heavily, they won’t be able to guide policy makers who don’t have student surveys as an option. If a district wants to do the 33-33-33 model, but student surveys aren’t permissible, then since they can’t just do 33-33 for test scores and observations, they would make it 50-50. Likewise, if they wanted to do the 50-25-25 split, but couldn’t use student surveys, it would require a 67-33 split to keep test scores double the weight of observations.
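The 50-50 and 67-33 figures come from dropping the survey share and rescaling what is left so it adds up to 100 again. Here is a small sketch of that arithmetic; the function and the dictionary keys are just names I made up for illustration, not anything from the report.

```python
def drop_component(weights, dropped):
    """Remove one component and rescale the remaining weights to sum to 100."""
    remaining = {name: w for name, w in weights.items() if name != dropped}
    total = sum(remaining.values())
    return {name: 100 * w / total for name, w in remaining.items()}


# Two of the weightings from the report, as percentages.
equal_thirds = {"test_scores": 33, "observations": 33, "surveys": 33}
test_heavy = {"test_scores": 50, "observations": 25, "surveys": 25}

for label, weights in [("33-33-33", equal_thirds), ("50-25-25", test_heavy)]:
    rescaled = drop_component(weights, "surveys")
    print(label, "without surveys:",
          {name: round(w) for name, w in rescaled.items()})
```

The rescaling keeps the ratio between test scores and observations the same as in the original scheme, which is why the 50-25-25 split turns into roughly 67-33.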

For many reasons, I don’t think that student surveys will ever be a part of teacher evaluation. I know that the proposed surveys are not just the students ranking the teacher on a scale of 1 to 10, which would not be very accurate and might cause teachers to game the system at the expense of student learning (like giving everyone high grades). Instead they ask a series of enigmatic questions like “Does your teacher spend a lot of time doing test prep?” or “Does your teacher care about how much each student in the class learns?” and the teacher’s quality is pieced together from the answers.

Now I’m a teacher and I like to think that my students think that I’m doing a good job. Certainly there are students who like me more than other students do. But I don’t know if I want to stake my job on how my students interpret what I do in class. Maybe I see that my class is looking lethargic and I spend a minute telling them I saw a good movie over the weekend, as something to wake them up. Now they get a survey at the end of the year with the question “Does your teacher often stray from the subject matter?” and though I only did it a bit, and for a specific reason, this could have stuck out in their minds and suddenly I’m losing my job or getting a pay cut. On ratemyteacher.com, I have an average of 4 out of 5 stars, but I have seen some bizarre comments. A recent rater gave me two stars and wrote “he seems to be popular…although i’m not sure why. he gives zero notes, and if you are not good at teaching yourself you will be screwed for his tests (which are mostly problems similar to ones we do).” I have good ones too, like “Really really great teacher. He truly cares about his students and help them do their best. Offers help even if the student doesnt want it haha. Gives fair tests, and is a very encouraging teacher!” On the other hand, a teacher who retired a few years ago who I felt was really ‘phoning it in’ at the end has five stars including comments like “ideal teacher: never checks hw, doesn’t yell, doesn’t get upset, you can eat in class, you can sleep in class, you can play your PSP or listen to music and he doesn’t notice =)” and “Best teacher ever… He understands his students and handles everything very maturely. You can do ANYTHING in his class. ANYTHING. You can answer 1/8 ?s on the test and he’ll give you a 65. Not too clear but so nice and cool.” Though I do understand that students observe a teacher for hundreds of hours and therefore do have insight into the teacher’s quality, I’m not convinced that the information we get from these student surveys adds much to what can be obtained by a competent principal.

Some of the results of the study are prominently displayed in figure 4 on page 12 of the report.

It took me a while to process what this graph is supposed to tell us, but I’ll attempt to explain this thoroughly here. I’m hoping this will help people discuss this study from an informed perspective.

This graph shows how the middle school language arts teachers’ ratings under four weighting systems correlate with three other measurements. When two things are highly correlated, they have a ‘correlation coefficient’ close to 1. When correlation coefficients are low and you make a scatter plot of the data, it looks like a bunch of random points. Though it depends on the scenario, anything less than .4 is considered to be pretty weakly correlated.
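To make ‘correlation coefficient’ a bit more concrete, here is a minimal sketch that computes the standard Pearson coefficient by hand. The two small data sets are numbers I made up purely for illustration; they have nothing to do with the MET data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up example 1: the second list is exactly double the first.
in_lockstep = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Made-up example 2: the second list bounces around with little pattern.
scattered = pearson_r([1, 2, 3, 4, 5, 6], [3, 1, 4, 1, 5, 2])

print(f"in lockstep: {in_lockstep:.2f}")
print(f"scattered:   {scattered:.2f}")
```

The first pair moves in perfect lockstep, so its coefficient is 1. The second pair is the ‘bunch of random points’ situation, and its coefficient lands well below the .4 mark.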

The yellow bars (system 2) show what happens when value-added is 50%, observations are 25%, and student surveys are 25%. The green bars (system 3) show what happens when each is weighted 33%. The blue bars (system 4) show what happens when observations are 50% and value-added and student surveys are 25% each. The red bars (system 1) show the weighting that would have produced the highest correlation with value-added. This is something that is hypothetical and calculated after the fact, to see what weighting would have given the most accurate predictor of state test gains. In this case it would have been 81% value-added, 2% observations, and 17% student surveys.
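To get a feel for how much the choice of weighting can matter, here is a sketch that runs one invented teacher through all four systems. The percentages are the ones listed above. The teacher’s three scores are made up, and I am assuming each measure has already been put on a common scale where 0 is average; the report may well combine the measures differently.

```python
# Weights as (value-added, observations, student surveys), in percent.
WEIGHTINGS = {
    "system 1 (81-2-17)": (81, 2, 17),
    "system 2 (50-25-25)": (50, 25, 25),
    "system 3 (33-33-33)": (33, 33, 33),
    "system 4 (25-50-25)": (25, 50, 25),
}


def composite(value_added, observation, survey, weights):
    """Weighted average of three scores that share a common scale."""
    w_va, w_obs, w_survey = weights
    total = w_va + w_obs + w_survey
    return (w_va * value_added + w_obs * observation + w_survey * survey) / total


# Hypothetical teacher: weak value-added, strong observation, decent surveys.
teacher = {"value_added": -1.0, "observation": 1.0, "survey": 0.5}

for name, weights in WEIGHTINGS.items():
    score = composite(teacher["value_added"], teacher["observation"],
                      teacher["survey"], weights)
    print(f"{name}: {score:+.2f}")
```

For this invented teacher, the rating swings from clearly below average under system 1 to above average under system 4, even though nothing about the teaching has changed.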

The first set of bars shows that an evaluation that uses nearly all value-added correlates best with state test score gains. This does look compelling, but it would imply that if the teacher is evaluated with primarily value-added for two straight years, she should get around the same score, which contradicts what we see in the third set of bars, where this system is, by far, the least reliable. Most interesting, however, is the second set of bars. The fact that the red bar is the shortest of the 4 means that of the four systems, the one that is nearly all value-added correlates the least with other tests of higher-order thinking. Now the other three systems are equally predictive of scores on these more difficult tests, and as all the numbers are under .4, it seems that none of these systems are very predictive of scores on these other tests which, we can assume, are what the common core tests are supposed to be like.

Throughout the report the authors describe the value-added as ‘student learning.’ While these other tests might be more indicative of student learning, it’s hard to say whether either kind of test really measures it. To me this second set of bars indicates that none of the systems really correlate with these other tests, so if you really think that these other tests measure student learning, you’d want to go with the system that is most reliable and also one that wouldn’t be needlessly expensive. So from this perspective, I’d say that this report “defeats its own purpose” (‘Raging Bull’ reference!) if the point was to demonstrate that value-added should be a significant part of evaluations.

Now these graphs were all based on the data for middle school language arts. Presumably these numbers and the conclusions that go along with them can’t be only good for one grade level and subject. Otherwise we would need a different weighting system for every grade level and for every subject within that grade level. Well, though they only produced one set of these graphs, they did give data on page 14 for the other three situations: elementary ELA, elementary math, and middle school math.

Aside from middle school math, classroom observations are as good as or better than value-added as predictors of scores on higher-order tests. What a mess.

Even the authors, despite their obvious bias in using the term ‘student learning’ throughout the paper when they mean ‘gains’ compared to computer predictions, aren’t very enthusiastic about their conclusions. I think a relevant quote is something I found on page three of the supplemental technical report which said:

To guard against over-interpretation, we add two caveats: First, a prediction can be correct on average but still be subject to prediction error. For example, many of the classrooms taught by teachers in the bottom decile in the measures of effectiveness saw large gains in achievement. In fact, some bottom decile teachers saw average student gains larger than those for teachers with higher measures of effectiveness. But there were also teachers in the bottom decile who did worse than the measures predicted they would. Anyone using these measures for high stakes decisions should be cognizant of the possibility of error for individual teachers.

This report will surely be quoted by ‘reformers’ as some kind of scientific proof that value-added has finally been vindicated. But my examination of the same data (and I look forward to the deeper analysis that others will surely do soon) tells me that they really didn’t come up with anything we didn’t already know about the problems with these crude metrics. From my perspective, these numbers don’t really reveal any $50 million secret.