Tag Archives: VAM

How Should We Evaluate/Assess/Rate Teacher Performance? (Maybe Peer Review)

We live in a world of assessment; let's take a look at sports. Every major league baseball team has a group of data wonks who collect bits and pieces of data and create algorithms to assess and predict future performance. Once upon a time we could quote batting averages, home runs and earned run averages; now we're overwhelmed by Wins Above Replacement (WAR), launch angle and more. We live in the world of sabermetrics ("A Guide to Sabermetrics Research"). Every sport has its own set of data used to assess player performance and to predict outcomes.

If we work out we keep track of minutes on the treadmill, the number of pull-ups and dips, deep knee bends; we can measure our performance and track it on an iPhone or Apple Watch. If we play golf: has our handicap dropped? Or tennis: are we beating players we used to lose to?

Dancers and musicians practice with a coach, guided practice, and improve at their art.

Which raises the nature/nurture question. Do some athletes and artists have encoded DNA that makes them better athletes or musicians, or does 10,000 hours of practice produce excellence? Grit and determination, or natural ability?

David Epstein, in The Sports Gene: Inside the Science of Extraordinary Athletic Performance, explores,

The debate is as old as physical competition. Are stars like Usain Bolt, Michael Phelps, and Serena Williams genetic freaks put on Earth to dominate their respective sports? Or are they simply normal people who overcame their biological limits through sheer force of will and obsessive training?

The truth is far messier than a simple dichotomy between nature and nurture. In the decade since the sequencing of the human genome, researchers have slowly begun to uncover how the relationship between biological endowments and a competitor’s training environment affects athleticism. Sports scientists have gradually entered the era of modern genetic research.

In his book Outliers, Malcolm Gladwell lays out the much-quoted "10,000 hours rule"; simply put: gaining mastery requires 10,000 hours of "deliberate" practice.

The principle holds that 10,000 hours of “deliberate practice” are needed to become world-class in any field.

But a new Princeton study tears that theory down. In a meta-analysis of 88 studies on deliberate practice, the researchers found that practice accounted for just a 12% difference in performance in various domains.

In education, a 4% difference
In professions, just a 1% difference

In it, [the authors] argue that deliberate practice is only a predictor of success in fields that have super stable structures. For example, in tennis, chess, and classical music, the rules never change, so you can study up to become the best.

But in less stable fields, like entrepreneurship  [and teaching]… rules can go out the window… mastery is more than a matter of practice.

Teaching is a far more complex task: on one side the teacher, with whatever skills s/he possesses; on the other side twenty or thirty students with a wide range of life experiences (are they hungry, or bullied, or depressed?); and, in the middle, the content you're expected to transmit to the students: content, or standards, or a curriculum, or a program, none of which you played a role in selecting. Almost ten years ago the Obama-Duncan administration decided dense algorithms could be used to compare teachers to teachers who are teaching "similar" students. The tool is called Value-Added Measurement, referred to as VAM, and it was rolled out as "we can use results on standardized test scores to rate and compare teachers." John King, at that time the NYS Commissioner, adopted the use of VAM, combined with supervisory observations, to assess teacher performance.

The pushback was vigorous. Chancellor Merryl Tisch convened a summit of experts from around the country to discuss the efficacy of using the VAM tool. The experts were crystal clear: VAM was never intended to assess the performance of an individual teacher. The Board of Regents agreed upon a four-year moratorium on the use of standardized test scores to assess teacher performance. Last week both houses of the state legislature passed a bill returning the question of teacher assessment to school districts, with considerable pushback from parents who felt districts would simply substitute another off-the-shelf test.

See my blog here

We should completely de-link teacher assessment from test results.

The Netherlands is among the highest achieving school systems in the OECD: 8,000 unionized public schools functioning like charter schools. The schools have extremely wide discretion in how they run. Read a detailed description here.

European school systems use an inspectorate system (See links in the blog here), the school supervisory authority sends teams of experts into schools to assess the functioning of the school.

Back in the 90’s and early 2000’s New York State sent Schools Under Registration Review (SURR) teams into schools for a deep dive into the functioning of the school and produced highly specific (“Findings and Recommendations”) reports. I was the teacher union representative on many teams.

New York City conducts periodic Quality Review visits to schools, a type of inspectorate system.

Experienced educators conduct a two-day school visit. They observe classrooms, speak with parents, students, teachers, and school leaders. They use the Quality Review Rubric to help them examine the information they gather during the school visit.

After the school visit, a Quality Review Report is published on the school’s DOE webpage. The Quality Review Report rates the school on 10 indicators of the Quality Review Rubric. The report also describes areas of strength and areas of focus and has written feedback on six of the indicators. Information from this report is also used in the School Quality Snapshot.

The QR teams can be improved; they should be joint Department/Union teams, and the union should play a role in constructing the Quality Review Rubric.

As far as the assessment of individual teachers goes, we shouldn't fear peer review: respected colleagues providing feedback.

Let me say, I'm not hopeful. At a recent live-streamed town hall (by invitation only), the mayor, the chancellor and the chancellor's crew met with parent and community leaders from the Bronx. To a question about the large number of schools in a district, the chancellor posited an additional deputy superintendent, and added that the press would attack him for bloating the administration, and the press would be correct. Level upon level of supervision "monitors" data; educational decisions should be made in schools, not in distant offices. A parent worried: she was in her son's 6th grade class and saw student work replete with frequent spelling errors. The deputy chancellor suggested a Google spelling app; the parent sighed, "He'll only want to play video games on the computer." Maybe a sign the school has serious instructional issues?

Empowering schools and holding them accountable for their decisions makes much more sense than measuring and punishing. And, BTW, resources matter, they matter a great deal; any school assessment should factor in "poverty risk load." (See discussion here.)

Fighting over whether a teacher is "developing" or "effective" is insane; maybe we should be working to create collaborative school communities in which school leaders, parents and teachers work together to craft better outcomes.

The Assessment Wars: Is a Grassroots Revolution Bubbling Up Across the Nation Opposing Punitive Annual Testing?

The education community has been fighting the Reading Wars for decades, and the battle continues, unabated.

"Why Johnny Can't Read" became a national best seller in the 50's and the battle simmered for decades. For some, "whole language instruction" was a political attempt to capture the minds of our children; for others, namely E. D. Hirsch, phonics was the path to effective reading instruction. "Why Johnny Can't Read" even became a popular song.

They're back. Or maybe the "reading wars" never really went away. For decades, political skirmishes have raged between supporters of phonics instruction and proponents of whole language.

Recently the skirmishes have boiled over into battles.

The Assessment Wars are not far behind.

20% of parents in New York State have opted their children out of state tests; the Long Island Opt Out Facebook page has 25,000 members who are active in local politics, endorsing candidates and working in campaigns. They are a political force.

Is testing of children “new”?

We've tested children for decades; in New York State children in the 4th and 8th grades were tested annually, and additionally there were city and school district tests. The Regents Examinations, required for a diploma in New York State, have been around for over 100 years.

No Child Left Behind (NCLB), the bipartisan law widely heralded in 2002, required testing of all children in grades 3-8 in ELA and Math, and states had to establish Adequate Yearly Progress (AYP). The goal of the law was that by setting AYP goals states would incrementally move forward, with all children being at grade level by 2014. The law seemed like Garrison Keillor's mythical town of Lake Wobegon, where all children are above average. If schools failed to meet their goals, i.e., higher test scores, the law required interventions: school closings, re-staffing, conversion to charter; the punitive section of the law.

The successor law, the Every Student Succeeds Act (ESSA), continues annual testing; the law does change the school measurement metric from proficiency to a combination of growth plus proficiency. Major civil rights organizations strongly supported annual testing. The Leadership Conference on Civil and Human Rights, an umbrella advocacy organization led by Wade Henderson, insisted on continuing annual testing.

“I don’t think you can dismiss the role that assessments play in holding educators and states overall responsible for the quality of education provided,” said Wade Henderson, president and chief executive of the Leadership Conference on Civil and Human Rights, an umbrella group of civil rights advocates that includes the NAACP and the National Urban League. States and school districts that don’t want to deal with the daunting task of improving the achievement of poor students complain about testing as a way of shirking accountability, Henderson said. “This is a political debate, and opponents will use cracks in the facade as a basis for driving a truck through it,” he said.

In spite of efforts by unions and other advocates to test every third year, and other compromises, the law continued annual testing.

Has annual testing improved student outcomes?

The answer is a resounding “no.”

With a roll of drums the Obama administration rolled out the Common Core standards, followed by Common Core based testing. The result: student scores declined, and failed to recover.

The National Assessment of Educational Progress (NAEP), called the nation's report card, compares educational progress by state and large urban districts. Over the last few years New York State has been moving in the wrong direction; scores are flat or actually regressing.

How do we assess student learning: the collision of teaching and learning, that point at which the light bulb goes on, that magical moment in which a student "learns"? Mike Petrilli at the Fordham Institute sees more research needed to uncover the "secret sauce." For others the "solution" was the stick: use test scores to assess teacher effectiveness, use Value-Added Measurements, and reward and punish teachers. VAM has been trashed by leading statisticians; reformers ignored the criticism.

The reformers who led the VAM crusade ignored the "Cuban-Tyack Principle": unless teachers and parents embrace the innovation, the reform is doomed.

David Steiner challenges a basic premise: Common Core is not an "answer," the answer should be curriculum; we should test what we actually teach.

Are there alternatives to the current testing regimen?

New Hampshire and a number of school districts are using performance tasks; Louisiana has an approved federal waiver to use periodic tests instead of end-of-year testing; forty schools in New York State use portfolios in lieu of Regents Exams.

Can testing be useful?

Testing to inform teachers, to inform instruction, is used every day: the Friday spelling test, the math problems. Mike Schmoker in Focus suggests multiple checks for understanding in every lesson.

Statewide testing has nothing to do with individual students; the purpose is to assess school, school district or state "progress," or lack thereof. It is also used to shame and stigmatize, and it has created growing opposition among parents.

The opt outs are now a national political movement. In the recent election teachers and parents played a major role. The "blue wave" included massive numbers of teachers, both in-service and retired; not only at the polls but in the trenches, ringing doorbells and manning phone banks, oftentimes side by side with active parents.

All politics is local, and the revolt against testing is bubbling up across the nation. In New York State progressives rolled to victory, first in the September Democratic primary and then in the November general election. While education has been on a political back burner, the new crop of progressive electeds might very well be at the heart of a growing anti-testing revolt.

As Jefferson wrote in 1787, "I hold that a little rebellion now and then is a good thing and as necessary in the political world as storms in the physical."

VAM, the Lederman Decision and the Misuse of Statistical Tools. “Gut versus Actual Evidence.”

What if the educators making important decisions about schools and colleges are acting too much on their guts and not enough based on actual evidence? (Review of Howard Wainer, “Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies,” 2011)

Back in my union rep days I occasionally represented members in grievance arbitrations, claims of violations of the agreement. The Board fired a paraprofessional, claiming he had assisted students in cheating through the erasure of incorrect answers, and used expert testimony explaining how software was used to analyze the erasures. I scrambled to find my own expert. I worried that the technical evidence would be too dense; however, the arbitrator had a background in math and economics and not only understood the testimony, he asked numerous questions of the expert witnesses.

A few months later I won the case; I was ecstatic, believing the inappropriate use of the erasure analysis software would be barred.

While the arbitrator found the use of the software was not "persuasive," he sustained our case, writing that the Board failed to meet its burden of proof. It was a victory, but a narrow victory that did not resolve the question of the misuse of the software.
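For the curious, the core idea behind such erasure-analysis software is simple to sketch: count wrong-to-right erasures per answer sheet and flag classrooms whose average is an extreme statistical outlier. Here is a minimal illustration with invented data (real forensic systems use far more careful statistics):

```python
import math
import statistics

# Hypothetical wrong-to-right erasure counts, one number per answer
# sheet, grouped by classroom. All names and numbers are invented.
classrooms = {
    "room 101": [1, 0, 2, 1, 0, 1, 2, 0],
    "room 102": [0, 1, 1, 0, 2, 1, 0, 1],
    "room 103": [9, 11, 8, 10, 12, 9, 10, 11],  # anomalous cluster
}

# Pool all sheets to estimate a baseline mean and spread.
all_counts = [c for counts in classrooms.values() for c in counts]
baseline = statistics.mean(all_counts)
spread = statistics.stdev(all_counts)

# Flag rooms whose average erasure count sits far above the baseline,
# using a z-score on the room mean (the spread of a mean shrinks with
# the number of sheets in the room).
flags = {}
for room, counts in classrooms.items():
    z = (statistics.mean(counts) - baseline) / (spread / math.sqrt(len(counts)))
    flags[room] = z > 2
    print(f"{room}: mean={statistics.mean(counts):.1f}, z={z:+.2f}, "
          f"{'FLAG' if flags[room] else 'ok'}")
```

As the arbitration showed, a flag like this is a reason to investigate, not proof that any individual cheated.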

A couple of years ago Sheri Lederman, a teacher on Long Island, received an "ineffective" rating on the Value-Added Measurement (VAM) side of the teacher evaluation metric. The petitioner introduced evidence from numerous experts, all challenging the use of VAM to assess individual teachers.

In a narrowly worded decision a New York State Supreme Court judge overturned the "ineffective" rating, ruling that the use of Value-Added Measurement for the petitioner in the instant case was "arbitrary and capricious." No precedent was set.

Read the Lederman filing here: http://www.capitalnewyork.com/sites/default/files/Sheri%20Aff%20Final.pdf

Read an excellent analysis here: https://www.the74million.org/article/ny-teacher-wins-court-case-against-states-evaluation-system-but-she-may-appeal-to-set-wider-precedent

In 2010 the New Teacher Project (TNTP), an advocacy organization firmly embedded in the (de)form side of the aisle, issued a report, a survey of school districts across a number of states. The findings:

  • All teachers are rated good or great. Less than 1 percent of teachers receive unsatisfactory ratings, making it impossible to identify truly exceptional teachers.
  • Professional development is inadequate. Almost 3 in 4 teachers did not receive any specific feedback on improving their performance in their last evaluation.
  • Novice teachers are neglected. Low expectations for beginning teachers translate into benign neglect in the classroom and a toothless tenure process.
  • Poor performance goes unaddressed. Half of the districts studied have not dismissed a single tenured teacher for poor performance in the past five years.

Six years later New York State is working on Teacher Evaluation 4.0, and we are in the first year of a four-year moratorium on the use of grade 3-8 standardized test scores to assess teachers.

Value-Added Models, also referred to as growth scores, attempt to compare teachers from around the state who are teaching similar students. A dense mathematical algorithm incorporates a variety of variables and generates a numerical score for each teacher. For example, a fourth grade teacher is compared to other fourth grade teachers across the state, taking into account the percentages of students she teaches who are Title 1 eligible, students with IEPs, English Language Learners, by gender, and perhaps other variables. The criticism is the use of the formula to assess individual teachers: the experts aver the scores are "unreliable," with large errors of measurement, i.e., plus or minus five or ten or fifteen percent, and "unstable," with teacher scores varying widely from year to year.
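The mechanics can be illustrated with a toy model (all data, variable names and the simple regression below are illustrative assumptions; the state's actual formula is far denser): predict each student's score from prior scores and student characteristics, then treat a teacher's mean residual as her "value-added."

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 students, 10 teachers, 20 students each.
n = 200
teacher = np.repeat(np.arange(10), 20)
prior = rng.normal(650, 30, n)       # prior-year scale score
title1 = rng.integers(0, 2, n)       # Title 1 eligibility flag
ell = rng.integers(0, 2, n)          # English Language Learner flag
current = 0.9 * prior - 5 * title1 - 4 * ell + rng.normal(0, 10, n)

# Predict current scores from student characteristics alone
# (ordinary least squares).
X = np.column_stack([np.ones(n), prior, title1, ell])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
residual = current - X @ beta

# A teacher's "value-added" is the mean residual of her students:
# how much better (or worse) they did than similar students statewide.
vam = {t: residual[teacher == t].mean() for t in range(10)}
for t, score in sorted(vam.items(), key=lambda kv: -kv[1]):
    print(f"teacher {t}: {score:+.2f}")
```

Rerun this sketch with a different random seed and the teacher rankings reshuffle even though nothing about the "teachers" changed; that instability is exactly what the critics describe.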

The use of value-added measurements to assess individual teachers has been pilloried by experts.

The New York State Learning Summit  brought together experts from across the country – they were sharply critical of the use of VAM to assess individual teachers.

Howard Wainer, a statistician with decades of experiences and published articles has been a harsh critic of the misuse of statistical tools,

Ideas whose worth diminishes with data and thought are too frequently offered as the only way to do things. Promulgators of these ideas either did not look for data to test their ideas, or worse, actively avoided considering evidence that might discredit them.

The issue is not the mathematical model; the issue is how the model is used. If a particular teacher over a number of years consistently receives high scores it is worthwhile to ask: what is that particular teacher doing? What instructional practices is the teacher utilizing? Can these practices be isolated and taught to prospective teachers in college teacher preparation programs? In school and district-based professional development? Or, are these practices unique to the individual teacher?  Is there a “teaching gene,” an innate quality that resonates with students?

Sadly, VAM has been misused in spite of the evidence that discredits the use of the tool to assess individual teachers.

Six years after the Widget Effect report, a report that bemoaned that only 1 percent of teachers were rated unsatisfactory, six years into the use of student achievement data and dense mathematical prestidigitation, we find that 1 percent of teachers are found "ineffective."

Millions of dollars and endless conflicts, and the percentage of teachers found unsatisfactory remains at 1 percent!

Insanity: doing the same thing over and over again and expecting a different result.

In New York State we are in year one of a four-year moratorium on the use of grade 3-8 student test scores to evaluate teachers.

How should management evaluate teacher competence?

“One size fits all” fits no one.

The state should explore a menu of choices to fit the many differences among the 700 school districts in the state.

Chancellor Tisch Will Not Seek Another Term: Some Suggestions – How To Begin to Win Back Parents and Teachers

If you used the word “Regent” a decade ago I would have said one of the five exams a student needs to graduate high school in New York State. The “Board” of Regents, with origins in the late 18th century, met monthly and anonymously in Albany; the members were college professors, retired superintendents, former legislators and business leaders, who had enough political clout to be “elected” (in reality, selected) by the Speaker of the Assembly.

No newspaper stories, no blogs, no one paid much attention to the meetings and the policy determinations.

A test: who was the chancellor prior to Merryl Tisch?

The major chore of the Board was to select a commissioner, who usually was a senior, well-regarded superintendent, who ran the State Education Department.

The board is a policy board and the policy items usually originated with the commissioner.

In 2009 Merryl Tisch, who had been a board member for a dozen years, was selected by her colleagues as chancellor.

Commissioner Mills "retired," and Chancellor Tisch chose a new path: instead of selecting a state superintendent the board selected David Steiner, the Dean of the School of Education at Hunter College and, before that, the leader of the National Academy of the Arts.

The "dirty little secret," a secret that everyone suspected: the scores on the state reading/math tests, which had been inching up every year, had been "pushed along" by the exiting commissioner.

Tisch and Steiner asked Daniel Koretz, a Harvard professor, to examine testing practices, and, yes, the testing practices in place allowed the scores to incrementally increase each year.

Test practices were corrected, and the scores dipped.

I was optimistic; the Tisch-Steiner team signaled a new, more open board that might actually address the major issue, the elephant in the room: a funding formula based on local property tax revenue that guaranteed the richest districts would get richer and the poorer districts poorer. It was a national disgrace.

To my disappointment the commissioner resigned, John King was appointed without a search, and the chancellor did not address the funding catastrophe; instead she moved down the path of Race to the Top, Common Core, testing, and teacher evaluation tied to student growth scores.

There is no question in my mind that the chancellor's goal is to improve opportunities for the most disadvantaged, to drive the members of the board to adopt policies to improve futures for every child in the state. The chancellorship, which had been an anonymous position, became the driver of education policy across the state; every meeting is covered by the media. At Regents' meetings the chancellor is frequently the kid in the class who calls out loud, who interrupts the speaker to ask a challenging question, who pushes, who cross-examines the speaker, whether a State Ed staffer or the commissioner.

To me she has been an enigma – a brilliant leader, passionately concerned about education, a champion of the disempowered, a champion of the poorest communities and the children without the power to change their own paths, a leader who somehow wandered down the wrong road.

While the state constitution designates the Board of Regents as the body that sets education policy in the state, more and more policy is set by the second floor of the Capitol building, the offices of the executive, the governor. The "education governor" is now trying to repair his torn education legacy.

The NY Daily News reports on a new poll,

In the poll’s education section, the Common Core curriculum got boos from respondents who believe the strict standards have made schools worse. By a 2-1 margin, voters gave Common Core a thumbs down, with 40% of those surveyed saying public education has gotten worse. Another 21% said Common Core standards have made little impact.

A year ago John King was pushed out as commissioner; the Cuomo Education Commission was reconstituted with a specific agenda, any district can receive a waiver from the new teacher evaluation plan, a teacher appeal process has been put in place, the commissioner is exploring the efficacy of the use of growth scores in assessing teachers, and the governor has selected a superintendent as his chief education advisor, a superintendent who had been a sharp critic of the Cuomo policies.

The Speaker of the Assembly dumped the two most senior members of the board, and the chancellor chose not to seek another term.

Two Regents' terms (Tisch and Anthony Bottar of Syracuse) expire this year, and next year three more Regents' terms expire.

Geoff Decker, at Chalkbeat does an excellent job of tracking the rapid changes here.

The Governor selected a new deputy for education who moves from being a critic of the policies of the governor and the Regents to being the chief educational advisor to the governor.

Hochman has been critical of state education policies in the past. Last year, he said the Common Core has become too tied to a “culture of testing” and questioned whether it would need to be revamped.

“The whole accountability, ‘gotcha’ culture is so out of control that we need a fresh start,” Hochman told The Journal News in September of last year. “The standards are OK, but every problem is connected to the Common Core. New York needs to take a bold stance so we can focus on educating kids.”

With 200,000 parents opting out of state tests, with an angry teaching force, and with nosediving polling numbers, the governor is bailing the sinking ship of state.

Although Chancellor Tisch will not be seeking another term, she does have five months to begin to turn around the leviathan: the education system in New York State.

A few suggestions:

Reduce or eliminate the impact of Value-Added Measures (VAM) on teacher evaluation.

Whether or not the use of student growth scores is a valid, stable and reliable tool to assess teacher performance, the impact has been to alienate teachers and parents and to overemphasize annual school testing. The state should begin to explore an inspectorate system, the school and teacher evaluation system used in almost all other nations. (Read about the Inspectorate System here.)

Reduce the length of state tests, release more test items and release the scores earlier

While the changes I suggest are relatively minor, they address the complaints of parents and teachers and are achievable for the next testing cycle. The state should begin the move to adaptive testing; the data is immediately available to teachers and parents and helps guide instruction tailored to the needs of each individual child.

Fully examine and adjust the Common Core and Engage NY Curriculum Modules

The standards are not well-written, and all policies should be regularly reviewed in a transparent process; the complaints about the Common Core, for example the inappropriateness of the early childhood standards, should be explored. The Curriculum Modules should have been constructed from the bottom up and should be constantly expanded to reflect the brilliance of teachers around the state.

Increase the role of teachers in the Common Core/Curriculum Module revision process

A no-brainer: "Participation Reduces Resistance."

Continue to advocate changing the federal testing requirements for English language learners and Students with Disabilities.

With Arne Duncan almost out the door, hopefully Secretary-Designate King will accept the New York State request to alter the cruel testing requirements now in place for English language learners and students with disabilities.

Robert Frost mulls over "The Road Not Taken"; we have taken the wrong road, and it is time to correct our error.

Hot Potato: The NYS Regents May Have to Decide How Teachers Are Assessed (in 60 days), Any Ideas?

Ever play “hot potato”?

♫ One potato
Two potatoes
Three potatoes
Four!
Five potatoes
Six potatoes
Seven potatoes
More!

Maybe it's appropriate that the state legislature and the governor are playing a children's game.

The governor's blustering and threatening resulted in a cyclonic backlash, from teachers to parents to electeds, with his approval rating in free fall. New plans emerged: a "matrix," as convoluted as the movie; a committee. Each suggestion was met with suspicion; the "hot potato" bounced from the governor to the Senate to the Assembly. No one wanted it; it was too politically hot.

News reports and the rumor mill claim the budget process will return teacher evaluation to the Board of Regents to craft a plan and return it to the legislature for action by June 1. Let me underline: I have not seen a bill, just reporting based on news accounts.

The entire teacher evaluation catastrophe seems beyond redemption. The heart of the argument is tying teacher effectiveness to test scores, using a dense algorithm so that teachers teaching similar students are compared to each other; the statistical term is value-added modeling (VAM). All the experts agree that while the data is interesting, especially over time, it should not be used for high stakes decisions, like firing and promotion; it's too unstable. However, the argument that teachers are totally responsible for test scores, a simple answer to a complex issue, has swept from state to state.

The Race to the Top (RttT) application required a teacher evaluation plan, a multiple measures plan incorporating student test scores (VAM).

The pushback has been unabated, with examples of teachers rated highly effective by the principal and ineffective by the VAM algorithm, and, in a few cases, the reverse.

To further confuse matters, about 75% of teachers teach non-tested subjects or classes; how do you use student data to assess the 75%? The Measures of Student Learning (MOSL) vary from school to school and from school district to school district.

The New York State plan calls for 20% assessment by student test scores, 20% by a locally negotiated metric, and 60% by supervisory observations using one of six approved rubrics.
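The arithmetic of the 20/20/60 split can be sketched in a few lines (the function name, the 0-100 subscore scales, and the rating cut points below are hypothetical illustrations, not the regulation's actual bands):

```python
def composite_rating(state_test, local_measure, observation):
    """Combine the three subscores, each assumed to be on a 0-100
    scale, using the 20/20/60 weights described above."""
    score = 0.20 * state_test + 0.20 * local_measure + 0.60 * observation
    # Illustrative cut points for the four ratings (hypothetical;
    # the real bands were negotiated and changed over time).
    if score >= 91:
        return score, "highly effective"
    if score >= 75:
        return score, "effective"
    if score >= 65:
        return score, "developing"
    return score, "ineffective"

print(composite_rating(60, 70, 90))   # strong observations lift the total
print(composite_rating(90, 90, 50))   # weak observations sink the total
```

Because observations carry 60% of the weight, a teacher's rating is dominated by the supervisory score; this is one reason the distribution of final ratings tracks the observation scores so closely.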

When the dust settled after year one 51% of teachers were rated “highly effective” and 1% “ineffective.”

In the days before Charlotte Danielson became an iconic name, I met with Charlotte and about twenty principals. At the end of the session one principal proudly proclaimed, “In my school every teacher will be highly effective.” Danielson shook her head, “You’re lucky if a teacher is highly effective occasionally during a single lesson.”

BTW, Danielson emphasized that her Frameworks were a professional development tool, not an evaluative tool, and trashed the use of student test scores as an evaluative tool.

A closer look at the scores across the state is disturbing. In many districts every teacher received very high observation scores: can all 200 teachers in a district be highly effective? Teachers teaching special education, English language learners and very high poverty kids tended to get lower scores. Do we attract less competent teachers, or is the algorithm flawed? Some teachers (art, music, physical education) are rated based upon the school-wide ELA and/or Math scores; does that make any sense?

How are the seventeen members of the Board of Regents going to “correct” the current system in 60 days?

In a handful of schools teachers play a role in assessing colleagues: peer assessment is commonplace in other professions. Should we include teachers, colleagues in the same school, on teams with principals? Should we use teachers from other schools?

A paper from the Chicago Consortium on School Research published in Education Next, Does Better Observation Make Better Teachers, November 2014, assesses a teacher observation experiment in Chicago,

The principals’ role evolved from pure evaluation to a dual role in which, by incorporating instructional coaching, the principal served as both evaluator and formative assessor of a teacher’s instructional practice. It seems reasonable to expect that more-able principals could make this transition more effectively than less-able principals. A very similar argument can be made for the demands that the new evaluation process placed on teachers. More-capable teachers are likely more able to incorporate principal feedback and assessment into their instructional practice.

Our results indicate that while the pilot evaluation system led to large short-term, positive effects on school reading performance, these effects were concentrated in schools that, on average, served higher-achieving and less-disadvantaged students. For high-poverty schools, the effect of the pilot is basically zero.

Another study, “Teacher Dismissal Under New Evaluation System” (Grover Whitehurst and Katherine Lindquist), also published in Education Next, sees “troublesome” flaws in observational teacher evaluation systems,


… we identified flaws in the evaluation systems that need correction. The most troublesome of these is a strong bias in classroom observations that leads to teachers who are assigned more able students receiving better observation scores. The classroom observation systems capture not only what the teacher is doing, but also how students are responding. This makes the teacher’s classroom performance look better to an observer when the teacher has academically well-prepared students than when she doesn’t.

We are a long way from a system that clearly differentiates effectiveness among teachers. There is no question that the variation from school to school, from school district to school district, is significant.

European countries use teacher inspectorates, teams that visit schools and assess both school and teacher quality.

I wish the Board of Regents luck.

Any ideas?

UPDATE: Just Out!! General outline of education initiatives in the budget here

Teacher Evaluation: Firing to Excellence is Faux Reform; Consistent, On-Going Assessment by Skilled Observers Engenders Excellence.

The State Education Department released the second year of teacher assessment scores, the first year for New York City teachers. The anti-teacher “fire to excellence” (de)former forces slammed the scores, teacher unions gave a modified approval and others appeared to be waiting for the governor’s state of the state message.

The New York Times reports,

In the city, only 9 percent of teachers received the highest rating, “highly effective,” compared with 58 percent in the rest of the state. Seven percent of teachers in the city received the second-lowest rating — “developing” — while 1.2 percent received the lowest rating, “ineffective.” In the rest of the state, the comparable figures were 2 percent and 0.4 percent.

Gov. Andrew M. Cuomo has said he wants to strengthen the evaluation system … a spokeswoman said, “As the governor previously stated, stronger, more competitive, teacher evaluation standards will be a priority” for the next legislative session.

A State Education Department press release,

… similar to the first year, the vast majority of teachers and principals received a high performance rating. The preliminary results show more than 95 percent of teachers statewide are rated effective (53.7 percent) or highly effective (41.9 percent); 3.7 percent are rated as developing; approximately one percent are rated ineffective.

Chancellor Tisch, the leader of the Board of Regents, does not seem to fully understand what the ratings assess,

“The ratings show there’s much more work to do to strengthen the evaluation system,” Board of Regents Chancellor Merryl H. Tisch said. “There’s a real contrast between how our students are performing and how their teachers and principals are evaluated. The goal of the APPR process is to identify exceptional teachers who can serve as mentors and role models, and identify struggling teachers to make sure they get the help they need to improve. I don’t think we’ve reached that goal yet. The ratings from districts aren’t differentiating performance. We look forward to working with the Governor, Legislature, NYSUT, and other education stakeholders to strengthen the evaluation law in the coming legislative session to make it a more effective tool for professional development.”

Former Commissioner David Steiner, John King and UFT president Michael Mulgrew spent months working on a teacher evaluation plan that eventually was crafted into a statute. Commissioner King convened a technical working group that worked intensely for months to develop the final regulations, and the 700 school districts in New York State negotiated plans within the regs to develop district-specific plans (the New York City plan was imposed by the commissioner after the union and the mayor could not agree upon a plan).

The plans all divide into three sections: 20% of the teacher score is based on growth in student test scores, 20% on a locally negotiated metric and 60% on principal observations using an approved observation tool (New York City uses the Danielson Frameworks).
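The three-part weighting described above can be sketched as a simple calculation. This is an illustrative sketch only, assuming each component is already scaled to 0–100; actual APPR plans award points within each component and map the composite to a rating band rather than reporting a raw weighted average.

```python
# Illustrative sketch only (not the official APPR formula): a weighted
# composite of the three components described above, assuming each
# component is already scaled to 0-100.
def composite_score(growth: float, local: float, observation: float) -> float:
    """20% state growth score, 20% local measure, 60% principal observation."""
    return 0.20 * growth + 0.20 * local + 0.60 * observation

# With a 60% weight, a strong observation score dominates the composite:
print(round(composite_score(growth=40, local=70, observation=85), 1))  # -> 73.0
```

The 60% observation weight is why the variability of observer judgment, discussed below, matters so much: most of a teacher's composite rides on how a particular observer scores a handful of lessons.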

To compare student test scores on the new common core tests with teacher scores is to compare apples to oranges. The common core exams are new exams with a new baseline. The commissioner, through the standard-setting process, made an arbitrary decision to set the “passing” grade (“proficiency”) at a high level. A few members of the Regents suggested that the “proficiency” level be phased in over a number of years; the commissioner decided to move directly from the former “proficiency” level to the new, far higher level. Not surprisingly, the 2/3 “proficiency” rates on the old exam flipped to 2/3 “below proficiency” rates on the new common core exams. The commissioner, by his actions, acknowledged his error: the new common core regents exams will have the grades phased in, what we used to call “scaling” the grades.

Governor Cuomo and Chancellor Tisch seem to be upset because the teacher scores do not match the student scores – there is absolutely no correlation between the scores.

1. Virtually every statistics expert agrees that student test scores should not be used for high-stakes decisions. The American Statistical Association sharply criticizes the use of student test scores through Value-Added Models (VAMs).

* VAMs … do not directly measure potential teacher contributions toward student outcomes.

* VAMs typically measure correlation, not causation: effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.

* Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores.

2. The principal observation section, 60% of the overall score, assumes some comparability in assessments. In fact, school districts can select from six different assessment rubrics, and, most importantly, observers view lessons differently from supervisor to supervisor, from school to school, and from district to district: observation scores vary widely.

The Gates-funded MET Project report, “The Reliability of Classroom Observation by School Personnel” (January, 2013), authored by Andrew Ho and Thomas Kane of the Harvard Graduate School of Education, takes a deep dive into the world of teacher observations. The research,

…highlights the importance of involving multiple observers … [and] to provide better training and certification tests for prospective raters … the only way a district can monitor the reliability of classroom observation and ensure a fair and reliable system for teachers would be to use multiple observers and set up systems to check and compare feedback given to teachers by different observers … the element of surprise [unannounced observations] may not be necessary.

…classroom observations are not discerning large absolute differences in practice. The vast majority of teachers are in the middle of the scale with small differences in scores producing large changes in percentile ranks … the underlying practice on the existing scales does not vary that much.

The American Statistical Association and the Gates Foundation directly challenge the comments by the outgoing commissioner, the governor and the chancellor; to attempt to “contrast” teacher and student test scores is to completely misunderstand the findings of recognized experts,

The purpose of a teacher assessment system must be to improve practice, not to assign a grade. The current system compares teachers to teachers instead of having teachers respond to observations by skilled observers. Teachers change the grades they teach from year to year, and groups of students vary from year to year; trying to create a “growth model” is futile, and it fails the tests of “validity and reliability.” How do you “measure” teachers of students with disabilities? Can you compare the effectiveness of teachers in high-functioning schools and school districts with that of teachers in low-functioning schools and districts?

The “vast majority” of teachers, to use the Gates terminology, fall in the mid-range. If “correcting” the teacher evaluation metric means “failing” more teachers, the entire highly flawed system will crumble.

We never master teaching; great teachers, athletes, musicians and artists practice, they strive to upgrade skills. Whether for the novice or the expert, guided practice is crucial.

Gates suggests multiple, highly trained observers, and I would add breaking the isolation of teachers by creating teams of teachers which include peer review and peer exchanges.

From acceptance to teacher preparation programs, to the content of the programs, to teacher preparation exit standards, to hiring, to supporting probationary teachers and the tenure-granting process to consistent, on-going professional development, we must strive to embed research-based policies with wide stakeholder input.

Unfortunately the ideology-driven “disruptive” reformers advocate for “market-driven” solutions: teacher against teacher, public school versus charter schools, “solutions” that fly in the face of valid and reliable research.

Shaming teachers is ludicrous. The current policies are antithetical to positive change; the (de)formers have succeeded in uniting teachers against the policies. We should create policies that attract teachers, that build communities of learners.

Perhaps the governor will have an epiphany.

Byte-ing Teachers: Is Teaching an Art or a Science? Does VAM Help Improve Practice?

Billy Beane, the general manager of the Oakland Athletics, changed the method of evaluating baseball players – instead of relying on cigar-chomping baseball scouts, Beane used a data-driven approach, memorialized in Michael Lewis’ Moneyball (2003). The analysis/manipulation of large, really large, sets of data has become the sine qua non, the “standard” for decision-making, in baseball and healthcare as well as in education. In an Atlantic article, “Can Government Play Moneyball?,” the authors aver,

The moneyball formula in baseball—replacing scouts’ traditional beliefs and biases about players with data-intensive studies of what skills actually contribute most to winning—is just as applicable to the battle against out-of-control health-care costs. According to the Institute of Medicine, more than half of treatments provided to patients lack clear evidence that they’re effective. If we could stop ineffective treatments, and swap out expensive treatments for ones that are less expensive but just as effective, we would achieve better outcomes for patients and save money.

The field of education is no different – we now have the ability to parse large data sets to assist educators to fine tune, to individualize instruction to students,

“The important thing with the data as we see it is this: How does it improve instruction in the classroom? … The trick is to be able to combine what I call ‘autopsy data’ of what has happened with the child, with what goes on currently in class, with formative and summative evaluations on an ongoing basis.”

In addition to longitudinal data, scores from online work or assessments scanned in from offline work go immediately into the platform, and its predictive analytics engines go to work to develop recommendations to help the student get up to speed … “Teachers don’t continue to teach things their students already know. It gives just-in-time feedback of what to pay attention to now, before students get so far behind that they can’t catch up.”

One thing is certain: As education becomes more Big Data–driven, educators and IT leaders must remember that human judgment matters too. “You have to pay attention to whether the data resonates with what the teachers know to be true about a student’s performance … There’s no substitute for authentic analysis.”

We can receive real-time feedback on suggested approaches to remediating student errors; of course, the process does not explain why a student gets a wrong answer and does not involve critical thinking skills. It can tell us, for example, that 26% of Afro-American seventh graders who are eligible for Title 1 services cannot successfully divide fractions 80% of the time. How can the school district use the data? Why are 74% of students succeeding? Are the teachers of the 74% consistently successful year to year? Are the textbooks the same? Is the race/gender/experience level of the teachers a significant factor?

The use of “big data” can, within a statistical range, answer the questions; however, the “answer” does not tell us why…

The Gates-funded Measures of Effective Teaching (MET) study identifies more effective teachers using test scores to define effectiveness. What the MET study does not do is tell us why some teachers are more effective than others. Are they more effective in teaching particular skills, or more effective in motivating students, or some combination? Highly effective teachers themselves have no idea why they are more or less effective from year to year.

Paul Tough, in How Children Succeed, challenges “the cognitive hypothesis, the belief ‘that success today depends primarily on cognitive skills — the kind of intelligence that gets measured on I.Q. tests, including the abilities to recognize letters and words, to calculate, to detect patterns — and that the best way to develop these skills is to practice them as much as possible, beginning as early as possible.’” Tough sets out to replace this assumption with what might be called the character hypothesis: the notion that noncognitive skills, like persistence, self-control, curiosity, conscientiousness, grit and self-confidence, are more crucial than sheer brainpower to achieving success.

The teacher who ignites noncognitive skills may be more effective than the teacher who uses the proper teaching techniques, as “measured” by the Danielson Frameworks.

The policy wonks, the decision-makers, are seduced by data: the right data set, the right algorithm, the right combination of variables can result in attributing a numerical score to a teacher. Once we’ve identified the most effective teachers we can use the “score” to drive decisions: who gets tenure, who gets fired, who gets a raise or a promotion. What used to be a decision solely made by the principal is now part of a multiple-measures rubric with a value-added measurement (VAM) counting for 20% to 50% of the score.

The Educational Testing Service (ETS) warns that, given the instability and unreliability of VAM algorithms, VAM scores should not be used for decisions that impact careers.

Edward H. Haertel (March, 2013) in “Reliability and Validity of Inferences About Teachers Based on Student Test Scores,” warns,

Teacher value-added scores are unreliable … that means that teachers whose students show the biggest gains one year are often not the same teachers whose students show the largest gains the next year…

The goal for VAM is to strip away just those student differences that are outside of the current teacher’s control … those things the teacher should not be held accountable for…

Teacher VAM scores should emphatically not be included as a substantial factor with a fixed weight in consequential teacher personnel decisions … It is not just that the information is noisy … the scores may be systemically biased for some teachers and against others.
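Haertel’s instability point can be illustrated with a toy simulation (hypothetical numbers, not from the paper): if a measured value-added score is true skill plus a noise term of comparable size, the set of teachers flagged in the bottom decile churns from year to year even though no one’s underlying skill changed.

```python
# Hypothetical illustration only: simulated "value-added" scores where
# measurement error is comparable in size to real teacher differences,
# consistent with the low year-to-year reliability Haertel describes.
import random

random.seed(7)
n_teachers = 100
true_skill = [random.gauss(0, 1) for _ in range(n_teachers)]

def observed_vam(skill: float, noise_sd: float = 1.0) -> float:
    """One year's measured score: true skill plus classroom-level noise."""
    return skill + random.gauss(0, noise_sd)

year1 = [observed_vam(s) for s in true_skill]
year2 = [observed_vam(s) for s in true_skill]

def bottom_decile(scores):
    """Indices of the lowest-scoring 10% of teachers."""
    cutoff = sorted(scores)[n_teachers // 10 - 1]
    return {i for i, s in enumerate(scores) if s <= cutoff}

# How many of year 1's "ineffective" teachers are flagged again in year 2?
overlap = len(bottom_decile(year1) & bottom_decile(year2))
print(f"{overlap} of 10 bottom-decile teachers were flagged in both years")
```

In runs like this, the overlap is typically well short of 10 of 10, which is the point: a noisy score reshuffles the “ineffective” list annually without any change in actual teaching.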

In spite of the evidence, the US Department of Education steadfastly hews to the VAM line. Data is the answer – if you can create the right mathematical equation, if the mountain of data is large enough – you can solve all.

The use of data-driven decision-making presumes that teaching is a science; it presumes that with the proper mix of “chemicals” you can “create” a desired outcome.

Is the process of teaching a science which can be measured, or is teaching an art? Can you assign numerical values to every Danielson element and VAM growth score and assign a teacher a grade, or was Justice Potter Stewart correct when he wrote that he could not define pornography but he “knew it when he saw it”?

The teacher evaluation law in New York State is incredibly dense (see website here). The State Education Department approved 700 locally negotiated plans, collected gigabytes of data, spun the computers, and, (roll of drums!!)

51% of teachers are “highly effective”
40% of teachers are “effective”
8% of teachers are “developing”
1% of teachers are “ineffective.”

In June 2012, 2.7% of New York City teachers received an “unsatisfactory” rating. The dense formula identified fewer ineffective teachers.

Data has become an addiction, the “meth” of the world of education.

Perhaps it would be more cost effective if we poured the dollars into a genome project – is there a “teaching” gene? Are “highly effective” teachers the product of “nature” or “nurture”?

What does a numerical score tell a teacher? What can a teacher learn from a VAM score?

The byte-ing of teaching is a failure.