Measuring Teachers by Measuring Students. Are Teacher VAM Scores So Discriminatory, Arbitrary and Capricious As to Constitute an Abuse of Discretion?

Are the regulations implementing Section 3012-d of the Education Law so discriminatory, arbitrary and capricious that they constitute an abuse of discretion, and, if so, what shall be the remedy?

Months, or perhaps a year, down the road the question above will be posed to an arbitrator or to a judge (in what is generally referred to as an Article 78 proceeding).

If I were representing a teacher at the arbitration I would first explore the use of an outside evaluator. In most school districts there are about 160 days of actual instruction; at five teaching periods a day, that comes to 800 teaching periods per school year. Can one or two “ineffective” lesson observations determine that a teacher is “ineffective” for an entire school year?

For the many decades prior to the passage of the Annual Professional Performance Review (APPR), school-based supervisors were the sole deciders of teacher competence. Cycles of pre-observation, lesson observation and post-observation meetings would result in an observation report: a summary of the lesson, commentary on the quality of the lesson, perhaps using an agreed-upon rubric, a conclusion and recommendations. Under the former law lessons were rated “satisfactory” or “unsatisfactory”; under the current system lessons are rated on the HEDI scale (Highly Effective, Effective, Developing, Ineffective).

Over a school year, commonly, there would be a cycle of observations. A satisfactory teacher may be observed once a year, new or probationary teachers many more times. The observations may be linked; the recommendations in a single observation may be referenced in subsequent observations. In addition to the formal observations, the supervisor may be in the teacher’s classroom on a daily basis.

Under the new law, Section 3012-d, outside evaluator judgements count for 35% of the teacher observation section of the assessment and principal evaluations count for 15%.

35% of a teacher’s HEDI score would be determined by an outside evaluator who bases their judgement on observing about one quarter of one percent (assuming two lesson observations out of 800 teaching periods per school year) of a teacher’s practice.
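As a quick check on that fraction, here is a minimal Python sketch. The 160 days, five periods a day and two observations are the figures used in this post; the function and variable names are only illustrative.

```python
# Figures taken from the post; two observations per year is the post's own assumption.
INSTRUCTION_DAYS = 160
PERIODS_PER_DAY = 5
OBSERVED_LESSONS = 2


def observed_share(days: int, periods_per_day: int, observed: int) -> float:
    """Return the fraction of a year's teaching periods that were observed."""
    total_periods = days * periods_per_day
    return observed / total_periods


total_periods = INSTRUCTION_DAYS * PERIODS_PER_DAY
share = observed_share(INSTRUCTION_DAYS, PERIODS_PER_DAY, OBSERVED_LESSONS)

# Prints: 800 teaching periods, 0.25% observed
print(f"{total_periods} teaching periods, {share:.2%} observed")
```

With a single observation instead of two, the same calculation gives half that share.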

“Mr. Evaluator, would you agree that in the week before and after you observed the teacher, the teacher could have taught a highly effective lesson?”

Of course, in the real world, it is highly unlikely that a school district would seek to discharge a teacher if the outside evaluator and the principal disagreed in their observation judgements.

The other side of the matrix is student performance measured by a Value-Added Assessment Model, usually referred to as a VAM. The Regents/State Education Department is in the process of developing scoring bands that determine teacher HEDI scores, essentially cut scores.

The American Statistical Association issued a report critical of the use of VAMs for high-stakes decisions.

As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.

Many states and school districts have adopted Value-Added Models (VAMs) as part of educational accountability systems. The goal of these models, which are also referred to as Value-Added Assessment (VAA) Models, is to estimate effects of individual teachers or schools on student achievement while accounting for differences in student background. VAMs are increasingly promoted or mandated as a component in high-stakes decisions such as determining compensation, evaluating and ranking teachers, hiring or dismissing teachers, awarding tenure, or closing schools.

The American Statistical Association (ASA) makes the following recommendations regarding the use of VAMs.

* The ASA endorses the wise use of data, statistical models, and designed experiments for improving the quality of education.

* VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and interpret the results.

* Estimates from VAMs should always be accompanied by measurements of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.

* VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.

* VAMs typically measure correlation, not causation: Effects – positive and negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.

* Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.

* VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.

Read the entire report here.

The Washington Post slams the use of VAM data here.

A month after the release of the American Statistical Association (ASA) report, Chetty, Friedman and Rockoff responded to the report in a paper entitled, “Discussion of the American Statistical Association’s Statement (2014) on Using VAMs for Education Assessment.” The response acknowledges the concerns voiced in the ASA report and defends the authors’ own research. The response fails to address whether VAMs should be used for high-stakes decision-making, namely, firing teachers.

On Thursday, May 7th, the Regents/State Education Department will host a summit in Albany, an all-day series of panels, by invitation only: superintendents, school boards, parent and teacher associations and an “expert” panel. How do you define expert? The panel includes experts in economics, sociology and education policy, not in statistics (read the Chalkbeat article here).

Tom Kane, one of the panelists and the author of the richly funded MET Report, suggests that student performance data should count for 33-50% of an individual teacher assessment; the report, however, is extremely light on evidence.

Howard Wainer, a statistician with a long resume, challenges the misuse of data to determine education policies in “Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies” (Princeton University Press, 2011); also listen to the interview on the site.

Watch a superb Wainer presentation at NYU discussing the differences between data and evidence (the Wainer presentation begins at minute 8).

To return to the original question: does the use of VAM meet the DAC (discriminatory, arbitrary or capricious) standard?

In my view the answer is clearly “no.”

There are lots of anecdotes, and, while anecdotes are data, they are not evidence. The public believes that we should rid the profession of “bad teachers,” and looking at student outcomes seems to make sense.

In an oft-quoted remark, “[A governor] asked his legislature for enough money to give a cassette or CD of classical music to every newborn child in the state. The governor cited scientific evidence to support this unusual budget request. ‘There’s even a study,’ he declared in his State of the State address, ‘that showed that after college students listened to a Mozart piano sonata for ten minutes, their IQ scores increased by nine points’.”

The only problem is that the widely repeated assumption, called the Mozart Effect, is not supported by evidence.

The highly controversial 1994 “Bell Curve” book on intelligence is another example.

People often react most defensively when challenged not on their firmly held beliefs but on beliefs they wish were true but suspect at some level to be false.

If we could differentiate among teachers based on student test scores we would be living in Lake Wobegon, that wonderful National Public Radio village where “All Children are Above Average.”

I suspect that school districts will avoid charging teachers with “ineffectiveness” in situations where the sole determinant is student test scores; neither the state nor school districts want to defend what is essentially the governor’s deeply flawed law.

The governor, following in the footsteps of Mayor Bloomberg, would simply attack the arbitrator or the judge who rejects his law.

The most effective judges of teacher quality are experienced principals and superintendents, as well as teacher peers, utilizing all sources of evidence.

Data alone should not rule; evidence should rule.

