It was the best of teachers. It was the worst of teachers.

But how much better are the best teachers than the worst teachers? Well THAT’S a tough one, but one that is very important to answer. The corporate reformers believe that the gap is great so a feasible solution is to fire those bad teachers.

Now, nobody who ever had more than two teachers in their lives would argue that there is no difference in teacher quality. As a student I’ve had great teachers and lousy ones. As a teacher I’ve been a great one and a lousy one at different parts of my career. But we do need to get an accurate measure for how many and how bad the ‘bad’ teachers are before we go forward with a plan that assumes the worst.

Of course the reformers don’t just assume the worst, they say that ‘research shows’ that they are right. One of the most quoted pieces of research by reformers like Michelle Rhee and Wendy Kopp is that the difference between a student who has an effective teacher three years in a row vs. a similar student who has three ineffective teachers in a row is ‘life-changing.’

Watch thirty seconds of this video, from 20:00 to 20:30, to see an example of this statistic being described.

The challenge in proving or disproving this is that we can’t go to a parallel universe and compare what would have been had this student had this combination of teachers rather than that combination.

So we settle for educational research. Like psychology, education research is sometimes called a ‘soft’ science. Unlike something like chemistry or physics, the experiments that prove different theories are generally impossible to replicate. Also, there are many ways to use statistics to analyze the results of the various experiments and, like with most science, the interpretation can often be affected by what the researcher is hoping is true.

With this in mind, I decided to take on a challenge I’ve wanted to take on for a while: find the study Rhee is referring to, read it, and interpret it for myself. For clues, I looked at presentations from The New Teacher Project, which Rhee founded. I learned that the two main value-added studies are from 1996 and 1997. The 1996 Cumulative and Residual Effects of Teachers on Future study was conducted by Sanders and Rivers from Tennessee. The 1997 study ‘Teacher Effects on Longitudinal Student Achievement’ was conducted by Jordan, Mendro, and Weerasinghe from Dallas. Both studies concluded that the difference between having three effective teachers vs. three ineffective teachers can result in significantly higher standardized test scores after three years. They don’t make the leap to say that these standardized test score improvements are ‘life changing.’

I downloaded both studies. The Sanders study was very short so there was not a lot of raw data for me to play with to see if I could discover some of my own conclusions. But the Jordan study was 50 pages long with about 30 pages of raw data. The Jordan study is also the source of the most common example cited by reformers.

Here’s a summary from here:

A study conducted by Jordan et al. (1997) estimated that average reading scores of sixth graders in Dallas schools would be expected to increase from the 59th percentile to the 76th percentile if they were assigned to three highly effective teachers in a row, while average scores for sixth graders would be expected to decrease from the 60th to the 42nd percentile if they were assigned to a series of three highly ineffective teachers during the same period. In mathematics, third graders in Dallas schools would be expected to increase their average mathematics score from the 55th percentile to the 76th percentile if they were assigned to three highly effective teachers, while the average mathematics score for third graders would be expected to decline from the 57th percentile to the 27th percentile if they were assigned to highly ineffective teachers for three years in a row.

Often a reproduction of one of the 20 graphs from the report is displayed in presentations comparing the scores of three groups of students who had similar scores in 1st grade and then after having three ineffective teachers, three medium teachers, or three effective teachers.

To get a context for interpreting the consequences of this chart, I read the Jordan report. Here’s what I understand about their methodology:

First they get a group of 3,200 kids about to start 2nd grade. They will track the reading scores for this group for three years, until they complete 4th grade. This group is called R4. They then get another 3,200 2nd graders to track math, and call them M4. Then they do the same with 3rd graders, 4th graders, 5th graders, and 6th graders to get 8 more 3,200 student cohorts, R5, R6, R7, R8, M5, M6, M7, M8.

They want to see the effects of having ineffective teachers vs. having effective teachers so they take the teachers that each of the 32,000 kids have each year and give them an effectiveness ranking, either 1 for bottom 20%, 2 for 20% to 40%, 3 for 40% to 60%, 4 for 60% to 80%, and 5 for 80% to 100%.
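To make the 1-to-5 ranking concrete, here is a minimal sketch of how a quintile ranking could be computed. The report summary above does not spell out the exact procedure, so the function `quintile_ranks` and the ten scores are my own made-up illustration, not the Dallas method or data.

```python
def quintile_ranks(scores):
    """Give each teacher a rank from 1 (bottom 20%) to 5 (top 20%).

    Assumption: teachers are ranked purely by their position in the
    sorted list of effectiveness scores.
    """
    n = len(scores)
    # Teacher indices ordered from lowest score to highest.
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0] * n
    for position, teacher in enumerate(order):
        # position / n is the share of teachers ranked below this one.
        ranks[teacher] = position * 5 // n + 1
    return ranks


# Hypothetical effectiveness scores for ten teachers (invented numbers).
toy_scores = [12, 47, 33, 90, 58, 71, 25, 64, 5, 83]
toy_ranks = quintile_ranks(toy_scores)
print(toy_ranks)
```

With ten teachers, each of the five ranks is handed out exactly twice, which is the whole idea of a quintile: the labels say where a teacher sits relative to the others, not how far apart the teachers are.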

Then after three years, they sort the kids into groups depending on what quality teachers they had. If a student in R4 had a 3 teacher, a 5 teacher, and then a 2 teacher, they are in group 352. There are 125 groups for each 3,200 student cohort, ranging from 111 to 555. If students were randomly assigned teachers, there should be about 25 kids on average (3,200 divided by 125) who had each combination of teachers.
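The group labels and the head count per group are easy to check. This is a small sketch using only the numbers above (five ranks, three years, 3,200 students per cohort); the helper name `group_code` is mine.

```python
from itertools import product

RANKS = (1, 2, 3, 4, 5)
COHORT_SIZE = 3200  # students per cohort, as described above


def group_code(year1, year2, year3):
    """Build the three-digit label from three yearly teacher ranks."""
    return f"{year1}{year2}{year3}"


# Every possible three-year sequence of teacher ranks.
all_groups = [group_code(*combo) for combo in product(RANKS, repeat=3)]

# Average group size if students were spread evenly across the groups.
expected_per_group = COHORT_SIZE / len(all_groups)

print(len(all_groups), all_groups[0], all_groups[-1], expected_per_group)
```

An even spread is only the average under random assignment. Any particular group could come out larger or smaller, and a group with few students gives a much noisier average score.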

Finally, they compare the change in standardized test scores for the different combinations. How much better did the R4 kids in group 555 do than in group 111, or group 314, or any other combination.

Sounds fair, right?

Well, the first thing that should cause some concern is the initial rating of the teachers. Their effectiveness is not an objective thing as would be, for instance, number of years teaching. So the teachers who are effective are the ones who are known to get the good test scores.

What this means is that the study is at risk of simply proving that ‘effective teachers are effective’ and ‘ineffective teachers are ineffective,’ something that must be true by common sense. But they are also studying the effects of having several effective or ineffective teachers in a row. Maybe the results magnify. Maybe having three effective teachers in a row is a lot better than the benefit you get from having one effective teacher multiplied by 3.

Another potential flaw in the methodology is whether or not students are truly randomly selected to be in group 111, or 314, or 555. Perhaps there are ‘confounding’ factors that made the sampling not random. On page 6 of the Jordan report, they even admitted:

Because there was some bias in student assignment to groups (which will be reported on in further research) …

This offhand admission is pretty serious, as it could invalidate all the results!

In the appendices, I was able to find some of the data that they used to produce the graph. They don’t compare group 111 to 555 because those two groups had different starting points. Instead they compare 111, 314, and 535. The raw scores in the low group went from 57.2 to 33.4. The raw scores in the middle group went from 57.3 to 50.2. The raw scores in the high group went from 56.6 to 59.7. To make the graph, they translated these scores into percentiles. They did not provide the conversion table, so I wasn’t able to verify these percentiles. For the same reason, my own examples that don’t support the hypothesis are based on raw score changes only.
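For reference, the raw score changes behind that graph work out as follows. This sketch uses only the three start and end scores quoted above; the variable names are mine, and no percentile conversion is attempted because the report did not supply one.

```python
# (start, end) raw scores for the three groups used in the graph.
raw_scores = {
    "111": (57.2, 33.4),  # low group
    "314": (57.3, 50.2),  # middle group
    "535": (56.6, 59.7),  # high group
}

# Change in raw score over the three years, rounded to one decimal.
raw_changes = {
    group: round(end - start, 1)
    for group, (start, end) in raw_scores.items()
}

for group, change in raw_changes.items():
    print(f"group {group}: {change:+.1f}")
```

In raw score terms, the low group drops by about 24 points, the middle group drops by about 7, and the high group gains about 3.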

When you look at the data it is, at first, pretty intimidating: 125 combinations for 10 different cohorts. To test the stability of the data, I thought of a good experiment. I took the results for cohort R8 and checked the starting scores and ending scores for all six orderings of the numbers 1, 3, and 4. If their process is stable, it shouldn’t matter if kids have the three teachers in the order 314 or 134. It’s still a middle teacher, a low teacher, and a high teacher.

I found that although they all started at about 42%, three years later the scores varied a lot. So either this data is highly unstable or somehow the order in which you encounter these teachers makes a big difference. This is something that was not mentioned in the paper, but it is very revealing, and probably the easiest way to challenge the accuracy of their data.

Here are the results for the six possible orderings of 1,3,4:

R8-134 42.3% to 41.7% change of -0.7%

R8-143 42.6% to 40.8% change of -1.8%

R8-314 40.6% to 38.9% change of -1.6%

R8-341 42.8% to 35.8% change of -7.0%

R8-413 42.7% to 34.0% change of -8.8%

R8-431 45.7% to 42.0% change of -3.7%

As you can see, the results are very unstable.
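One way to put a number on that instability is the spread between the best and worst ordering. The sketch below uses the six reported changes listed above exactly as given; a few of them differ by 0.1 from what you get by subtracting the rounded start and end scores, presumably because of rounding in the source. The variable names are mine.

```python
# Reported change for each ordering of teachers 1, 3, 4 in cohort R8.
r8_changes = {
    "134": -0.7,
    "143": -1.8,
    "314": -1.6,
    "341": -7.0,
    "413": -8.8,
    "431": -3.7,
}

best_order = max(r8_changes, key=r8_changes.get)
worst_order = min(r8_changes, key=r8_changes.get)

# Gap between the smallest drop and the largest drop.
spread = r8_changes[best_order] - r8_changes[worst_order]
mean_change = sum(r8_changes.values()) / len(r8_changes)

print(f"best {best_order}, worst {worst_order}, spread {spread:.1f}")
print(f"mean change {mean_change:.2f}")
```

The same three teacher ratings, just shuffled, produce changes that differ by about 8 points. For comparison, the whole gap between the middle group (314) and the high group (535) in the published graph is about 10 raw points.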

One easy thing I did was take cohort R4 and make a scatter plot of their gains. Rather than have 125 different combinations to check, I instead added the three numbers together. My assumption is that if someone had combination 314, it’s the same as having 413 or 134 or 341. I’m also making an assumption that 314, since it adds up to 8, is similar enough to 332, 431, or even 215. This made it easier to get a handle on the data since now there were only 13 groups to check (a sum of 3 at the minimum for 111 up to 15 for 555).
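Here is a small sketch of that bucketing step. It only builds the buckets; the gains themselves come from the report’s appendix and are not reproduced here. The names `teacher_sum` and `groups_by_sum` are mine.

```python
from collections import defaultdict
from itertools import product


def teacher_sum(code):
    """Add up the three teacher ranks in a group label such as '314'."""
    return sum(int(digit) for digit in code)


# Bucket all 125 group labels by the sum of their three digits.
groups_by_sum = defaultdict(list)
for combo in product("12345", repeat=3):
    code = "".join(combo)
    groups_by_sum[teacher_sum(code)].append(code)

for total in sorted(groups_by_sum):
    print(total, len(groups_by_sum[total]))
```

The buckets are very uneven in size. A sum of 3 or 15 can only come from one combination (111 or 555), while a sum of 9 can come from 19 different combinations, so the two ends of a plot built this way each rest on a single group.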

When I made a scatter plot for group R4 based on this, it looked like this:

The first thing to notice is that the one dot for 15, the 555 group, actually had only a 2-point gain. The 111 group was not the lowest and the 555 group was not the highest. Also, the 14s did not do that well, considering they had two 5s and a 4. The 12s, who averaged 4 per teacher, had as many groups with negative gains as positive ones. The 9s (including the 3/3/3 group) had all negative gains, though that could still be an increase in percentile. They did not provide the conversions except for their 20 graphs.

Another thing to notice is that there is what I’d call a ‘weak’ linear correlation — weaker than I would have predicted, actually, considering the circular nature of the methodology where they are proving that effective teachers are, indeed, effective.

Another inconsistency I noticed is that in the 10 cohort groups, the 555 group had one of the top two gains only in M5. The 111 group was in the bottom two twice.

Now, to me, the weak linear correlation is much less than I would have predicted, but you might think that the fact that there is any correlation at all proves, in some way, the point.

So, if you think that the study is valid, I have another argument: So what?

If having three great teachers in a row is so much better than having three horrible ones, how does that guide us in what to do? Sanders says the data is useful since it can help us give ineffective teachers support so they can become more effective. He also thinks we can use this data to prevent a student from getting a bad combination of teachers. There’s no mention of firing those teachers. The ironic thing is that these studies were done fifteen years ago, and Tennessee and Dallas have had the chance to use their value-added studies to improve education in that state and city. And where has it gotten them? Neither is known as a leader or model for successful ed reform.

Now Rhee and the corporate reformers use these studies to justify making it easier to fire teachers. But does she take the authors’ advice and have any plan to help ineffective teachers improve? Rhee and others using these (flawed, in my opinion) research papers to justify their agenda reminds me, a bit, of Nazis (I know this is a low blow) using Nietzsche’s philosophy books to justify their agenda.

[Note: I have since apologized to Michelle Rhee for using an analogy like this. Name-calling is not a mature thing to do, and I shouldn't compare someone who I have no doubt believes that her work will help students with a group of people who murdered millions of people. I will not use such analogies in the future.]

Now, I should point out that I am a highly effective teacher myself. Don’t I like to think that I make more of a difference than an ineffective co-worker? Of course I do. But I don’t teach to the test, so it is unlikely that my students’ test scores would be remarkably higher than my peers’. I actually did have the opportunity when I taught in Houston to have many of the same students for three consecutive years. That was almost 20 years ago. I don’t think their standardized test scores went up by that much, but I do know that they went home and told their families about the math riddle of the day more often. I also feel that when their own children are going to school, they will tell those children how fun math is, which is something I contributed to in an immeasurable way.

Yes, there are good teachers and there are not-so-good teachers.

If all teachers were a little better, that would be good for kids. I think that the number of math teachers who don’t really know or love math is a problem, but we just don’t have enough people who do know and love math and who are willing to teach it.

And I remember having some duds when I was in school. I don’t think it hurt me much. A teacher who was bad for me might have been good for someone else. Who am I to say? And I don’t think I would have even wanted to have all ‘highly effective’ teachers. Imagine going through 8 periods a day in high school with a ‘Dead Poets Society’ Robin Williams teacher? I think I’d have a heart attack. I liked my mix. Some good ones, some OK ones, a few pretty bad ones.

But I digress … The main point here is that there were some value-added studies which have been adopted by the reformers. When the Jordan study came out in 1997, it wasn’t such a threat to teachers and the U.S. education system, so I don’t know if many people took the time to examine it closely. Now that study has taken on the aura of legend. We don’t question it. It just is. But since so much seems to be riding on the conclusion about 3 effective teachers in a row vs. 3 ineffective teachers in a row, I hope that I’ve pulled this study out of its vault and raised at least some reasonable doubt about its validity, especially now that it’s become a weapon in the wrong hands.

I’ve written a continuation of this post with further analysis which you can read here.

If you’d like to mess with the Jordan data, I managed to pull it from the pdf file and convert it to an Excel file. You can download it here. Let me know if you come up with anything interesting.

For more scholarly examination of this 3 consecutive highly effective teachers argument, start with this post on the Shanker Blog.