Mar 30 2012

Analyzing Released NYC Value-Added Data Part 5

Part 1

Part 2

Part 3

Part 4

Part 5

Part 6

Less technical post about VAM: What ‘value-added’ is and is not
It’s all about the ‘error rate’, or so even I used to think.

Nearly everything I’ve read that questions the validity of the value-added metric mentions the astronomical ‘error rates.’  When the NYC Teacher Data Reports were first released, the New York Times website ran an article with some numbers that have been frequently quoted:

But citing both the wide margin of error — on average, a teacher’s math score could be 35 percentage points off, or 53 points on the English exam — as well as the limited sample size — some teachers are being judged on as few as 10 students — city education officials said their confidence in the data varied widely from case to case.

Thinking so much about value-added for my analysis, it recently hit me that the above critique actually understates the real problem with value-added error rates.  Implied is the possibility that if they could just figure out some way to get those error rates down to under some acceptable threshold, then the measure would be much more useful.  But I plan to show in this post that it would not matter if they got the error rates down to 0% because the error rates do not actually mean what most people think they mean.

So what is an ‘error rate’?  Well, if the temperature outside is 70 degrees and my thermometer says it is 77 degrees, then my thermometer, at that moment, has a 10% error.  If I read the thermometer twenty times when the temperature is 70 degrees and get readings as low as 56 degrees and as high as 84 degrees, we can say that my thermometer is not very accurate, since it has a 20% error rate compared to the ‘true’ temperature.  This is what we think of when we hear about error rates: how the measurement compares to the ‘true’ number.

In the case of the temperature, the ‘true’ temperature that my readings are compared to is measured by some kind of very expensive and very accurate thermometer.  Without that other thermometer that has the ‘true’ temperature, there is no way to measure the accuracy of my thermometer.
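The thermometer analogy can be sketched in a few lines of code. The numbers below are the ones from the example above; the point is that "percent error" only makes sense relative to a known true value.

```python
# Percent error measured against a known 'true' value, as in the
# thermometer example above. All numbers are illustrative.
def percent_error(reading, true_value):
    """Absolute error as a percentage of the true value."""
    return abs(reading - true_value) / true_value * 100

true_temp = 70.0
print(percent_error(77, true_temp))   # one reading 7 degrees off: a 10% error

# Readings ranging from 56 to 84 degrees: the worst case is 20% off.
readings = [56, 63, 70, 77, 84]       # a few representative readings
print(max(percent_error(r, true_temp) for r in readings))
```

Note that without `true_temp`, supplied by the reference thermometer, neither number can be computed at all.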

So when we hear that the value-added metrics have a 35 percent error rate for a particular teacher and that the teacher scores at the 40th percentile, we think that this means that the teacher’s ‘true’ quality is somewhere between the 5th percentile and the 75th percentile.  There is no way the teacher’s ‘true’ quality can be lower than the 5th percentile or higher than the 75th percentile otherwise the error rate for that teacher would not be 35 percent.

This is making the very reasonable assumption that these ‘error rates’ are defined by how the value-added measure compares to the ‘true’ measure of teacher quality.  But since there is no equivalent, in teaching, of the super accurate thermometer that measures the ‘true’ quality, how can it possibly be compared to that?

Because the error rates are more meaningless than I had realized.  They don’t compare the value-added number to the ‘true’ teacher quality number — they can’t.  Instead, all that the error rate measures is how the value-added number for that teacher compares to what the value-added number for that teacher would be if we re-calculated it with about fifty times the amount of data.  That’s it.  With more data the error rates go down, so with fifty years of data the error rate would be pretty close to zero and then we could say, definitively, that this teacher is in the 40th percentile as a ‘value-adder.’  But that is not the same thing as saying that the teacher is in the 40th percentile in her ‘true’ teacher quality number.
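To make this distinction concrete, here is a minimal simulation with entirely hypothetical numbers (not from any real VAM): a teacher's measured score is an average over noisy student-level gains, and the "error" that shrinks with more data is only the sampling spread of that average.

```python
import random
import statistics

random.seed(0)
teacher_effect = 0.3   # the stable signal a VAM is trying to estimate
student_noise = 1.0    # spread of individual students' measured gains

def estimate(n_students):
    """One teacher's measured score: the mean gain over one class."""
    gains = [random.gauss(teacher_effect, student_noise) for _ in range(n_students)]
    return statistics.mean(gains)

# Recompute the estimate many times at each sample size and see how
# much it bounces around. This spread -- not the distance from 'true'
# teacher quality -- is what the reported error margins describe.
for n in (10, 100, 5000):
    spread = statistics.stdev(estimate(n) for _ in range(200))
    print(n, round(spread, 3))   # spread falls roughly as 1/sqrt(n)
```

The spread goes to zero as classes grow, yet nothing in the calculation ever compares the estimate to the teacher's ‘true’ quality, which is exactly the point of the paragraph above.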

Now that is not to say that this more accurate ‘value-adder’ percentile would be completely useless — but it still would not deserve to count as a large portion of a teacher’s evaluation.

My point is that most people who hear about these ‘error rates’ assume that the error rate is based on comparing the number to the teacher’s ‘true’ quality.  Even I have written things in the past like “The 30% error rate means that 30% of the time an effective teacher will be rated ineffective by this measure and an ineffective teacher will be rated effective.”  Now I realize that this was too generous.  It would have been more accurate to write “30% of the time an effective ‘value-adder’ will be rated as an ineffective ‘value-adder’ and vice-versa.”  Until the ‘true’ quality of a teacher can be measured accurately with some other method, we’ll never be able to say anything more definitive than that about value-added.

16 Responses

  1. I was going to comment something to this effect on your previous value-added posts. In the language of statistics, it appears that the scores have neither precision nor accuracy, which makes them doubly invalid.

  2. John Smith

    How do we know that the scores are not accurate? Perhaps they converge in probability to the true value, or maybe they are even unbiased estimators of the true value.

    • Gary Rubinstein

      True. But until we are sure that they are, why would we implement them, especially if they actually cause teachers to teach worse and students to learn less, which is what I think is happening?

      • John Smith

        I don’t think that we can be absolutely sure about consistency without, literally, decades worth of information. To an outsider, it certainly seems like a reasonable assumption to make in this model.

        Your blog is absolutely fantastic. The posts are always so well written and thought provoking.

  3. Tom Forbes

    In the old world of statistics, this was called reliability and validity. I think the corporate types who use these metrics and data have forgotten about them for their own convenience. They do not allow independent review of the data in any meaningful way and run public school systems like corporations. They purposely change the measuring sticks every few years so they can claim distorted success. It is like keeping the stock price artificially inflated. By the time their crimes are realized by the public, it will be too late; the damage will have been done.

  4. I do not think analysis and statistics can solve this problem; you should see these statistics.

  5. Gary, you and I both found that the ‘temporal stability’ of the NYC value-added scores was extremely low, with r^2 values of about 0.05 to 0.09 (thus, correlation coefficients of about 0.22 to 0.3). That’s what most researchers are finding for this sort of thing in other states as well.
    But not in Los Angeles.
    Apparently there, the correlation coefficients start at 0.6 and go all the way up to 0.96.
    I just don’t believe it.
    Any thoughts?
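For anyone checking the arithmetic in the comment above, converting an r^2 value back to a correlation coefficient is just a square root:

```python
import math

# r^2 values quoted above for NYC year-to-year score stability.
for r_squared in (0.05, 0.09):
    print(round(math.sqrt(r_squared), 2))   # about 0.22 and 0.3
```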

    • I noticed the same thing after browsing the LATimes database. It’s hard to be sure without seeing the full dataset (is it available somewhere that I missed?).

      In any case, one conspicuous difference is that LA has more years of data (at least 7).

      The methods also differ, though it’s hard to say much about that without having the actual model code rather than just technical descriptions.

      One thing I did notice about the NYC data is that it doesn’t behave like a Bayesian update when new data is added, which is odd. I haven’t looked at other jurisdictions, so from my view it could be that NYC is uniquely bad rather than LA is uniquely good.
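The Bayesian update mentioned here has a standard closed form, sketched below for a simple normal/normal model. All numbers are illustrative, and this is not the NYC model itself.

```python
def bayes_update(prior_mean, prior_var, new_mean, new_var):
    """Precision-weighted blend of a prior estimate and new evidence."""
    w = (1 / prior_var) / (1 / prior_var + 1 / new_var)
    post_mean = w * prior_mean + (1 - w) * new_mean
    post_var = 1 / (1 / prior_var + 1 / new_var)
    return post_mean, post_var

# A well-measured prior estimate near 0.40, one noisy new year at 0.60:
mean, var = bayes_update(prior_mean=0.40, prior_var=0.01,
                         new_mean=0.60, new_var=0.04)
print(mean, var)   # posterior stays close to the better-measured prior
```

If new data were being folded in this way, each added year would pull the estimate toward the new evidence only in proportion to its precision; the comment's point is that the NYC recalculations don't appear to behave like that.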

      • I’ve got something that purports to be a full LA teacher/student/test score data set for the years that the LATimes paid an analyst to analyze. It’s huge – terabytes?
        I can let you have it, but we need to use one of those services for sharing huge data files.
        Another problem: it’s really raw data. I have neither the skills, nor the experience, nor the specialized software needed to crunch all the data myself and produce these useless Value-Added measures for each year for each teacher, in English and in Math.
        Obviously, the LATimes has that data, both raw and in processed form.
        They told me that it didn’t matter how much money I paid them, there was no way on earth that they would even sell me a simple spreadsheet with just a few columns:
        Name of teacher; name(s) of school(s) taught at; subject(s); value-added scores for year 1, year 2, year 3, and so on.

        I suspect that those scores would vary a lot from year to year and from reading to math. I can’t tell, though, because the LATimes folks refuse to release the results.

        Weirdly, even NEPC, which did its own analysis of the same (or a similar) dataset, wouldn’t give me their results, either. They claimed it would be too much work.

        Go figure. Seems to me that what I’m asking for is fairly simple and straightforward; and that if you haven’t produced such a table already, there’s something wrong.

        • If it were only a couple gigabytes I’d be interested in having a peek. Too much bigger than that and it would probably take too much time just to process the raw stuff into a usable format.

          Easiest way to move it might be to put it in one of the cloud drives, like SkyDrive or Dropbox, and share a file link via email.

          My email is just below the Submission Guidelines, if you’re motivated.

          Even if the LAT won’t release a table, they ought to be willing to release some basic diagnostics, like a scatter plot of year vs. year scores.

    • Thinking a bit further about this … the NYC single year rankings use a single year of data (as far as I can tell). Therefore they’re highly volatile. I’m not sure how much total data there is for NYC – it appears to be more than the 3 years released under FOI, but hard to say how much more.

      The LAT index uses about 7 years. In the latest iteration, they added a new year of data, and dropped the first year. So, 5 out of 7 years of the data overlap in the 1st and 2nd generation estimates. That’s the likely reason for greater score stability.

      So, it’s not really fair to compare the correlations LAT vs NYC’s single-year scores. Comparing LAT to NYC multi-year scores might be more fair, though LAT may still have an advantage in data volume.
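The overlap effect described above is easy to simulate. Under a deliberately extreme assumption that every single-year score is pure noise (no stable teacher signal at all), two 7-year averages sharing 5 years of data are still strongly correlated:

```python
import random
import statistics

random.seed(1)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

gen1, gen2 = [], []
for _ in range(2000):                         # 2000 simulated teachers
    years = [random.gauss(0, 1) for _ in range(9)]
    gen1.append(statistics.mean(years[0:7]))  # first generation: years 1-7
    gen2.append(statistics.mean(years[2:9]))  # second: years 3-9 (5-year overlap)

print(round(pearson(gen1, gen2), 2))          # near 5/7, about 0.71
```

Even with zero persistent teacher effect, overlapping data windows alone can manufacture apparent stability, which is why the LAT correlations can't be compared directly to NYC's single-year ones.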

    • Steve McFadden

      Guy, I was going to send this critique of the LA Times study to that paper back in August 2010, but I never did. The issue died down a bit after a couple of weeks, and I eventually just saved my file. Perhaps my thoughts from nearly two years ago will be of benefit, but I would have to really go back and read everything over again to make any new intelligent comments:

      Regarding the methods employed in Richard Buddin’s study “How Effective Are Los Angeles Elementary Teachers and Schools?”:
      Buddin used an instrumental variable approach to deal with errors inherent to the use of the lagged student achievement data (achievement data of students’ peers) that his analysis is based upon. In his words, the intention of using such an approach was to reduce “some noise in the prior year test score and improve the quality of the model estimates.” Unfortunately, instrumental variable models themselves are controversial, and require sets of independent data free of any commonality. The author’s choice to use lagged math scores in lagged English Language Arts error estimations (and vice versa) requires that the math and ELA performance levels of students not be related in any way. Such an assumption has a statistical probability of zero, meaning that Buddin’s study inherently has more error than he reports.
      Due to the ambiguity of certain sections of the document, I am not sure whether Buddin’s panel model utilizes a student retention coefficient (that is, the proportion of what students retain from one year to the next) of 0.87 or 1.00 in its second level regressions. Studies similar to Buddin’s have generated retention coefficients as low as 0.50, and there is marked debate amongst researchers regarding what size of a coefficient is appropriate for use in Value-Added Models (VAM). In any case, the paper suggests that when Buddin performs his second set of regressions he assumes students have year-to-year retention rates of 100%. If this value is actually used, then all of the author’s calculated second-level effects, including teacher effects, will be smaller than those published.

      Regarding the scope of Buddin’s study:
      The greatest flaw in Buddin’s analysis is his near-complete disregard for the effects that individual schools have on their students’ academic achievement, and the resulting assignment of those effects to teachers…a fault common to all VAM. In light of this, one can imagine LAUSD Superintendent Cortines and Deputy Superintendent Deasy giddy over the prospect of assigning the blame for many of the district’s shortcomings to its teachers, and also understand why those two are so eager to include value-added evaluations within teacher contracts.
      Buddin states that school fixed effects on student achievement could be decomposed into various elements in exactly the same type of second level regressions that he performs for teacher fixed effects. Yet, he maintains that there are relatively few measurable school characteristics that may be included in such an analysis, and thus casually dismisses much of what affects students on a daily basis. Any educator would have had very little difficulty coming up with several internal school factors that influence students and which could be used in a more rigorous study: differences in class size; disparities in facilities allocated to staff members; disparities in resources made available to teachers; school-wide and grade-level intervention programs adopted by faculty/staff; team-teaching efforts between instructors.
      Proponents of VAM know that it is very difficult and expensive to actually go into the field and estimate factors like those mentioned above, with the result that such elements are completely omitted in most value-added studies. The Times’ analysis did omit such factors, and its estimated school effects were the minuscule 0.06 standard deviations in ELA and 0.08 standard deviations in math that Buddin calculated. Yet, Buddin admits in a later portion of his analysis that another study has determined that school reforms implemented in elementary schools have an average effect of 0.33 standard deviations. When we see that a single school-related factor can have more than five times the effect on student achievement than Buddin’s total estimated school effects, we can assume that the effects schools have on students were essentially ignored in Buddin’s second level regressions. This leads me to conclude that Buddin’s model is not valid and makes me seriously doubt his subsequent claims, chief of which is that “…the [LAUSD] teacher effects in ELA and math are more than three times as large as the corresponding school effects. The teacher and school results indicate that teacher effectiveness varies much more from classroom to classroom within schools than it does across schools.”

      Regarding the study’s results:
      After adjustments, Buddin found that his models produced standard deviations for the teacher effects in ELA and math of 0.1902 and 0.2772 respectively. However, for all of the observed teacher effects the analysis produces coefficient of determination (R²) values of 0.6863 for ELA and 0.5966 for math. This means that, even if his model is valid, more than 31% of the ELA effect and more than 40% of the math effect Buddin attributes to teachers cannot be so attributed. Yet, Times reporters Felch, Song and Smith advocate that LAUSD use these metrics in such venues as teacher evaluations and personnel decisions.
      The study correctly stated that between 2002 and 2009 the average class size for 3rd grade was 19 as compared with 28 for 5th and 6th grades. However, Buddin then makes the assertion that class size is unrelated to teacher effectiveness in ELA and math. This was probably not his intention, as the author’s phrasing in that portion of the study is rather awkward. However, if this stark statement was indeed his conclusion regarding class size effects on students, then one can only be left bewildered, as Buddin did not include class size effects relative to the deviation of a teacher’s class size from the average in his results. Buddin’s study is both ambiguous and deficient in this respect.

      To provide more perspective on the whole “teacher value-added” issue, I present some results of a series of statistical calculations that I performed on the average student ELA scores of LAUSD’s third grade teachers for the year 2004. I must first qualify my results by saying that they are very close to the actual averages, but are slightly off since my calculations are limited by the unavailability of individual teacher and student data. If the Times’ database had reported average raw CST scores for teachers’ students on a year-to-year basis it would have gone a long way towards assuring skeptics like me of the validity of the paper’s teacher assessments…thus my statement of disappointment at the beginning of this letter. Instead, the format of the database’s results encourages me to suspect that the Times actually wishes to mislead the public.
      In its new database, the LA Times rates the best and worst elementary school teachers’ effectiveness in teaching English according to the following schedule: a “most effective” English teacher is one whose students repeatedly increase their average raw ELA scores by 7% or more relative to the students’ LAUSD peers; a “least effective” teacher of English is one whose students’ average raw ELA scores repeatedly decrease by 7% or more compared to the students’ LAUSD peers. We must note that these definitions reflect changes in students’ scores relative to the students’ peers, and not as simple gains and losses from one year to another. This is key, as the use of relative percentile gains and losses makes it appear to the general public that gross differences in teacher effects are occurring, when the reality is that only small differences are happening in most cases. I would not say that this is a statistical “trick”; rather, it is often a means to dupe people who really do not care about the admittedly boring topic of statistics.
      Since large differences in teacher effects are not occurring in most cases, this classification scheme can rightfully be deemed as ridiculous. Observe how the Times’ method produces the following absurd labels:
      In 2004, the average LAUSD third grader’s raw ELA score was approximately 37.8, with a standard deviation of about 6.5. This means that the typical student correctly answered 37.8 out of 65 questions, and that more than two thirds of the district’s third graders received raw scores between 31.3 and 44.3. Teachers at an underperforming school, with students who averaged 35.1 out of 65 on their third grade ELA exam, would be classified as “most effective” if their students’ average scores went up to 36.3 (that is, if the teachers’ students managed to correctly answer 1.2 more questions than the average school peer). An unfortunate colleague next door would be labeled as “least effective” if her students answered 33.8 out of 65 questions correctly (1.3 questions fewer than the school’s average third grader).
      In a high performing elementary school, where third grade students averaged 43.3 out of 65 on their ELA exams, a teacher would be labeled as “most effective” if her students scored 45.1 (answering nearly two questions more than their average peer). On the other hand, a teacher down the hall would be deemed “least effective” if his students answered an average of 41.8 questions correctly (1.5 questions fewer than his school’s average third grader).
      Why are these “most effective” and “least effective” labels absurd? They are absurd because teachers place varying degrees of importance on the annual California Standards Test. Some educators compensate for their mediocre or poor instruction throughout the year by dedicating the entire month of April to CST review and drill. On the other hand, many teachers feel comfortable with what they have imparted to their students throughout the year and would rather present new lessons than spend precious hours reviewing for what they consider a meaningless test.
      The result is that some average or exceptional teachers in the LAUSD will have students who correctly answer one or two questions fewer than their school peers, and some of the worst teachers will produce above average, or even stellar CST scores. I have observed both types in my sixteen-year career as a teacher. Indeed, I can say that some of the worst teachers I have known have been those that teach only those concepts covered by the CST, producing students who have a hard time doing anything other than respond to a number of similarly phrased, trivial questions.

      I would advise parents of students who attend LAUSD schools not to place much importance on the LA Times’ teacher rankings. Instead, keep abreast of what your children’s teachers are doing and ask your school principals to ensure that your children’s teachers do not “teach to the CST”.
      In the interest of fairness, I would ask the LA Times to publish the average year-to-year CST scores for all of the teachers included in your LAUSD database. Those averages are much more telling than your alarmist/sensationalist classification scheme.

    • Steve McFadden (LAUSD teacher)

      Guy, I posted a VERY lengthy comment regarding the LA Times study yesterday, but Gary either deleted it or will post it again. Here are the relevant values that you are referring to:

      After adjustments, Buddin (the author of the Times study) found that his models produced standard deviations for the teacher effects in ELA and math of 0.1902 and 0.2772 respectively. However, for all of the observed teacher effects the analysis produces coefficient of determination (R²) values of 0.6863 for ELA and 0.5966 for math. This means that, even if his model is valid, more than 31% of the ELA effect and more than 40% of the math effect Buddin attributes to teachers cannot be so attributed. This assumes that Buddin’s model is valid, which it is not…as I showed in the opening paragraphs of my deleted post.
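The unexplained-variance arithmetic quoted above is just 1 - R², checked here with the study's reported values:

```python
# Share of the estimated teacher effect that the model's R^2 values
# leave unexplained (figures as reported in the comment above).
for subject, r2 in (("ELA", 0.6863), ("math", 0.5966)):
    print(subject, round((1 - r2) * 100, 1), "% unexplained")
    # prints 31.4 for ELA and 40.3 for math
```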

  6. Alan

    These tests are a scam. I feel sorry for NYC teachers right now. It is a rigged game. The media says that education is all dependent on having a good teacher. So what do they do? They increase class sizes, cut the budgets, and then they want to find the “bad” teachers. Guess what, most beginning teachers are bad teachers. It takes around 10 years to really get into the groove. Also, a teacher can only control maybe 10% of how well a student does on a standardized test, if they even take it seriously. Teaching as a career is really going down the drain. We are letting billionaires like Bill Gates, who know nothing about education, design teaching reform because they are rich. Good luck fighting this garbage. We are getting this in Chicago too. Hopefully this whole movement will blow up soon.

  7. Guy Brandenburg

    Thanks, Steve. Mind if I repost your comments on my blog? You raise some excellent points.
    BTW I do see your long comment.

    • Steve McFadden

      Guy, go ahead and use my post as you see fit. But looking back at what I wrote two years ago, I would clarify this part a bit more:

      “Any educator would have had very little difficulty coming up with several internal school factors that influence students and which could be used in a more rigorous study: differences in class size; disparities in facilities allocated to staff members; disparities in resources made available to teachers; school-wide and grade-level intervention programs adopted by faculty/staff; team-teaching efforts between instructors.”

      Value added models usually ignore school-level effects, and Buddin’s LA Times study did exactly that. But, what I called school-level effects may not necessarily fall into that category. School-wide reading and intervention programs would certainly fit the bill, but would a team-teaching situation count? Funding disparities between two similar schools would count, but would mismanagement of funds within a school (e.g. disproportionate funding going to some departments or small learning communities within a school) be classified as a school effect? We can come up with a dozen of these questions, I’m sure. But, the people who create value added models gloss over all of this simply because it is HARD to quantify all of these considerations and it would muddy their models even more (and expose them to even more criticism).

      *Gary, thanks for reposting my entry. I know that it was overly long, but I think it can be useful to a few like-minded people out there.

About this Blog

By a somewhat frustrated 1991 alum
