
Presidential Polling and Measuring Teachers: The Misuse of Data and the Gullibility of the Public

We are addicted to predicting winners: at race tracks the betting public creates the odds for each horse in a race, and every Sunday the oddsmakers in Las Vegas predict winners and point spreads based on previous records and a plethora of player-related statistics.

This is called gambling.

Data can be used for more respectable purposes, namely predicting winners in elections, another type of race, a political race, as well as predicting “success” in teaching by measuring increases in student achievement attributed to individual teachers.

Each day the New York Times online publishes odds, in the form of a percentage, for the presidential election – on Sunday Hillary was “leading” Trump 90% to 10%, on Wednesday 88% to 12%. The section is called The Upshot and the site explains the methodology. One of the sources is the Princeton Election Consortium and, if you want to get into the weeds, you can read about “symmetric random drift” and “setting a Bayesian prior,” probably well beyond the interest and knowledge of the vast majority of “ordinary” folk.
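To see what “setting a Bayesian prior” means in practice, here is a minimal sketch, with entirely made-up poll numbers, of how a single poll can be turned into a “percent chance of winning” via a Beta-Binomial update (this is an illustration of the general technique, not the Upshot’s or the Consortium’s actual model):

```python
import random

random.seed(0)

# Hypothetical poll: 540 of 1,000 respondents favor candidate A.
favor_a, sample_size = 540, 1000

# Start from a uniform Beta(1, 1) prior on A's true support;
# after the poll the posterior is Beta(1 + 540, 1 + 460).
post_a = 1 + favor_a
post_b = 1 + (sample_size - favor_a)

def beta_draw(a, b):
    """Draw from Beta(a, b) via two gamma draws (stdlib only)."""
    x = random.gammavariate(a, 1.0)
    y = random.gammavariate(b, 1.0)
    return x / (x + y)

# Monte Carlo estimate of P(true support > 50%) under the posterior.
draws = [beta_draw(post_a, post_b) for _ in range(20000)]
p_win = sum(d > 0.5 for d in draws) / len(draws)
print(f"Posterior probability A leads: {p_win:.2f}")
```

Note how a modest 54-46 poll result becomes a win probability well above 90% – which is exactly why headline “odds” can look far more lopsided than the underlying polling.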

The essential problem is the source data, the actual polling. Lo those many years ago we learned we had to create a stratified random sample, a microcosm of the population we wished to poll. An example is the upcoming September 13th Democratic primary election in the 65th Assembly District in Manhattan, the seat formerly held by Sheldon Silver, who is awaiting sentencing by the feds. There are six contenders for the seat, and a close look at the population in the district is revealing.

Population figures, though, do not always translate into actual voters. According to 2014 census data, there were 32,952 Asian and South Asian citizens of voting age in the district. But only 15,284 were registered Democrats, said Jerry Skurnik, a partner at Prime New York, which compiles voter information. Of those, only 5,500 voted in the last three primaries.

Far fewer registered Hispanic and Portuguese Democrats voted in those three previous primaries, said Mr. Skurnik, who analyzed election data relating to social groups based on surnames. Of 11,675 registered voters, only 4,101 participated in a previous primary election, he said. Those of “European background,” including English, Irish, Italian and likely-to-be-Jewish voters, were the largest group, at 20,496 registered Democrats, with 8,205 showing up in previous primaries.

Randomly selecting names from census data does not produce a stratified random sample; selecting names from prime voter lists is a major step forward. However, how many potential prime voters don’t answer the phone and participate in the poll? Do the participants constitute a “stratified random sample”? I understand that fewer than 10% of those called actually respond to a polling call.
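The danger of low response rates can be illustrated with a toy simulation (all numbers below are hypothetical): if one side simply answers the phone more often, the poll stays biased no matter how many calls are made.

```python
import random

random.seed(42)

# Hypothetical electorate: 48% support Remain, 52% support Leave,
# but Remain supporters answer the phone twice as often.
TRUE_REMAIN = 0.48
RESPONSE_RATE = {"remain": 0.10, "leave": 0.05}

def run_poll(calls=100_000):
    """Call `calls` random voters; keep only those who pick up."""
    responses = []
    for _ in range(calls):
        voter = "remain" if random.random() < TRUE_REMAIN else "leave"
        if random.random() < RESPONSE_RATE[voter]:
            responses.append(voter)
    return responses

sample = run_poll()
remain_share = sum(v == "remain" for v in sample) / len(sample)
print(f"True Remain share:   {TRUE_REMAIN:.0%}")
print(f"Polled Remain share: {remain_share:.0%}")  # biased well above the truth
```

One hundred thousand calls and a losing position still polls like a comfortable winner: sample size cannot cure differential nonresponse, only reweighting or better sampling can.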

In June the United Kingdom voted in the Brexit referendum, an election to decide whether the UK would remain in the European Union. Extensive polling predicted that the Brits would remain by a 52-48 vote; when the dust cleared the Brits voted to leave 52-48 – what went wrong?

An experienced pollster commented on “what went wrong.”

The difference between survey and election outcome can be broken down into five terms:

  1. Survey respondents not being a representative sample of potential voters (for whatever reason, Remain voters being more reachable or more likely to respond to the poll, compared to Leave voters);
  2. Survey responses being a poor measure of voting intentions (people saying Remain or Undecided even though it was likely they’d vote to leave);
  3. Shift in attitudes during the last days;
  4. Unpredicted patterns of voter turnout, with more voting than expected in areas and groups that were supporting Leave, and lower-than-expected turnout among Remain supporters.
  5. And, of course, sampling variability.

In spite of extensive polling by “the best and the brightest,” the pollsters were off by four percentage points!

Howard Wainer, a statistician with vast experience, explains:

… the response rate for virtually all of the polls ranges from 8 to 9 percent. Yes, more than 90% of those asked for their opinion hang up. Do you know anyone who chooses to answer the phone? Who? Do you? Professional pollsters never talk about this because their paychecks depend on it.

The only way to use such polls is to make heroic assumptions — most commonly what is assumed is ‘ignorable nonresponse’ — that is that those who don’t respond are just the same as those who do — clearly nonsense.

Even such a sensible person as [pollster] Nate Silver has to make do with terrible information. Yes, drawing inferences from flawed data is usually better than doing it with no information at all, but it is hardly enough to keep from being terrified.

The one aspect of this in which I find some solace is that the polls may be self-fulfilling. This is seen in the shrinkage of donations to Republicans.

Although it is an unintended consequence, polling results influence voters – polls discourage voters on the trailing side and sway voters who want to be on the winning side – the bandwagon effect.

The only absolute winners are the pollsters who receive fees for parsing out the results.

Attempts to use dense mathematical algorithms to assess teacher performance face the same core issue. Value Added Measurement (VAM) purports to compare teachers who are teaching similar students, i.e., Title 1, English language learners, special education, etc. The formula creates a score for each teacher on a 1-100 scale so that teachers can be compared. The problem is not the dense formula – the issue is that teachers teach different students each year and the VAM scores have high errors of measurement that swing widely from year to year. A score with an error of measurement of plus or minus fifteen points means the teacher’s score falls within a thirty-point range. The following year the score may be substantially higher or lower, and the entire system is predicated on student tests that may be fatally flawed.
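The arithmetic behind that thirty-point range can be made concrete with a toy simulation; the “true” score and error band below are hypothetical, not drawn from any actual VAM formula.

```python
import random

random.seed(7)

TRUE_SCORE = 55   # hypothetical teacher's stable "true" effectiveness (1-100 scale)
ERROR = 15        # plus-or-minus measurement error, in points

# Five years of observed VAM scores for the SAME teacher:
# the true score plus noise anywhere in the +/- 15 point band.
observed = [round(TRUE_SCORE + random.uniform(-ERROR, ERROR)) for _ in range(5)]

swing = max(observed) - min(observed)
print("Observed scores:", observed)
print(f"Spread across years: {swing} points, with no change in the teacher")
```

A teacher whose effectiveness never changes can look mediocre one year and strong the next on noise alone, which is exactly the year-to-year instability researchers report.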

If the stratified random sample is flawed, or the test is flawed, all conclusions emanating from them are flawed.

The other method of assessing teacher performance is supervisory observations, which may be helpful in improving teacher performance; however, they have no inter-rater reliability.

An irony is that there are numerous examples of low scores from supervisors and considerably higher VAM scores. VAM scores, although deeply flawed, in many cases protect teachers from low observational scores that may be biased.

Polls are a photograph, a moment in time based on available data that might very well be flawed or change dramatically in the days or hours before the “final” poll, the election.

Value Added Measurements have enriched testing companies, confused and angered teachers and parents, and created a quixotic quest (“…revive chivalry, undo wrongs, and bring justice to the world”) that is impossible to fulfill.

We are gullible and accept complex formulas as truth. If an explanation is filled with abstruse Greek letters and symbols, it must be accurate.

Australia has compulsory voting, so polling there is probably far more accurate; in the United States local voting participation is commonly below 50%, and the voters vary from election to election. The only accurate poll is the election.

If teachers taught the same students every year, and the tests met statistical standards of validity, reliability and stability, the VAM scores might be reasonably accurate.

Bottom line: polling is an informed guess and VAM scores are of little value.


The Politicization of State Tests: Creating Tests in Which “All Students in New York State Are Above Average”

When the dust cleared the greatest ally to the anti-testing clique was (roll of drums!!!)  MaryEllen Elia, the New York State Commissioner of Education.

The deeply flawed state tests (“All children are above average”) reignited the argument – why do we have state tests at all (aside from the federal requirements)?

Statewide ELA test scores jumped by around 7% – although the racial achievement gap remained the same.

A magic potion, incompetence or simply political legerdemain?

A little review: in September 2015 Governor Cuomo reconvened a blue ribbon panel, actually a process to repair the Governor’s foolhardy attacks on teachers and parents. In 2014 it appeared that Cuomo had a clear path to the Democratic nomination for his second term and deep pockets for the November general election. Seemingly out of nowhere Zephyr Teachout, a law professor at Fordham, challenged Cuomo for the Working Families Party spot on the ballot and then challenged him in the Democratic primary. While the teacher union made no endorsement, some members and locals were on the Teachout side. After defeating Teachout, and then Rob Astorino, his Republican challenger, Cuomo decided to punish teachers. He cozied up to the charter school folks, used the budgeting process to tack on legislation extending teacher probation, and was nastier than usual. NYSUT, the statewide teacher union, responded with a series of aggressive TV ads, the opt-out movement was born, and 20% of kids opted out of the 2015 state tests.

Cuomo’s popularity rating tumbled.

I suspect clearer heads prevailed.

The purpose of the Task Force was to guide education policy from afar and place the Board of Regents and the commissioner in the foreground. The recommendations were more than recommendations; they were a pathway for state education policy. (Cuomo: This is the endgame – you figure out how to get us there.)

The Task Force Report (Read here), released in December, contained twenty-one recommendations; the last was a four-year moratorium on the use of state tests to evaluate principals and teachers, applauded by the teacher union. The recommendations called for a thorough review of the Common Core Standards, with teachers included in every step of the process.

Recommendation 15 – Undertake a formal review to determine whether to transition to untimed tests for existing and new standardized tests aligned to the standards – was not controversial and garnered little, if any, discussion; perhaps a pilot in a few schools and school districts across the state.

Surprisingly, very surprisingly, without any discussion with the Board of Regents, the Commissioner announced that the 2016 state tests would be untimed.

The January announcement, entitled “Changes for the 2016 Grades 3-8 ELA and Mathematics Tests” begins,

This memo outlines changes made as a result of feedback from the field:

* Greater involvement of educators in the test development process

* Decrease in the number of test questions, and

* A shift to untimed testing

The announcement came from Angela Infante, Deputy Commissioner, Office of Instructional Support and Peter Swerdewski, Assistant Commissioner, Office of State Assessment.

The state document states, “…students will be provided with as much time as they need.” No pilot, no transition, jumping off the diving board into the pool, and, the state made no attempt to identify students who took additional time.

The scores soared; the state commissioner, in the Daily News, admitted the scores are “not exactly a perfect comparison.”

After widespread opposition to the difficulty of the tests erupted in 2015, state education department officials shortened the exams for 2016 and eliminated time limits.

“Because of the changes in testing, it’s not exactly a perfect comparison,” Elia said. “And even with the increases this year, there remains much work to be done.”

The state spent many millions of dollars purchasing tests, and teachers and students spent months on test prep, to collect data from what turns out to be a non-standardized test – a test that might not even meet federal requirements, although I’m sure the feds will simply ignore the faux jump in scores.

Was the test itself “harder” or “easier”? Many months down the road a Technical Advisory Committee (TAC) will release a report, hundreds of pages of dense analysis that few will read and fewer will understand.

The basic questions: are the results of the test useful?  Can they be compared with the previous year? Can schools and school districts be compared? And, at the top of the list: are the schools in New York State making academic progress?

Howard Wainer, a Distinguished Research Scientist, the author of innumerable books and articles, an internationally recognized expert writes,

Because of the changes this year’s scores can’t be compared to last year’s and because of the untimed nature of the test (and there being no record of how long anyone took) you can’t compare scores of students who took it this year with one another. It is, in no uncertain terms, an unstandardized test.

This test is akin to measuring children’s heights but allowing some students, we don’t know who, to stand on a stool, we don’t know how high, and then declaring some taller than others.

Fred Smith, another testing expert, writing in City Limits, had doubts about the validity of the test before the test administration.

Either the state education psychometrician is lacking in competence, or knew that by adopting untimed tests scores would likely jump – either is unacceptable.

If the state continues down the same path, retaining the untimed tests, even if it keeps track of students who take extra time, and the amount of extra time, we will once again be comparing apples to oranges. Kids who take extra time, or choose not to, may not be the same kids as this year – we simply can’t know.

Will states across the nation also jump on the untimed tests bandwagon?

In the politicized world of education the charter school folk and their acolytes beamed at higher scores – of course, we have no way of knowing why charter school scores were generally higher than public school scores – and the pro-charter print media crowed. Mayor de Blasio and Chancellor Farina also took a victory lap, and the Mayor immediately claimed the scores were proof that mayoral control should be made permanent.

Board of Regents Chancellor Rosa reminded us it’s not time for a victory lap; unfortunately everyone else is milking the results – de Blasio and Farina, the charters – and principals and teachers are breathing a sigh of relief.

A perverse kind of victimless crime: except for the kids who were tortured preparing for a non-standardized test.

Although the law has changed – No Child Left Behind has been replaced by the Every Student Succeeds Act – the requirement for annual testing in grades 3-8 remains. The Leadership Conference, an umbrella group representing the major civil rights organizations across the spectrum, has strongly supported the accountability requirements, aka testing and reporting scores by subgroup, and the law is not changing.

Testing is here to stay.

The US Department of Education has announced it will be selecting six or so states, or consortia of states, to experiment with alternate assessments.

The anti-testing crowd points to the new law and the testing kerfuffle in New York State: why not move to portfolios and performance tasks to replace current testing? This is not a new idea.

Vermont spent a decade working to create an assessment system based on portfolios, and after an external report pointed to fatal flaws, abandoned the effort.

…report by the RAND Corporation … found that the “rater reliability” in scoring the portfolios–the extent to which scorers agreed about the quality of a student’s work–was very low. The researchers urged the state to release the assessment results only at the state level.

Daniel M. Koretz, a senior social scientist at RAND and the report’s author, said the low levels of reliability indicate that the scores are essentially meaningless, since a different set of raters could come up with a completely different set of scores.

Can thousands of teachers be expected to rate portfolios the same?
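Inter-rater reliability is a measurable quantity. One standard statistic is Cohen’s kappa, which corrects raw percent agreement for the agreement two raters would reach by chance; the portfolio scores below are hypothetical, chosen only to illustrate the calculation.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters on the same items, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    # Chance agreement: product of each rater's marginal rate per score label.
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical 1-4 portfolio scores from two teachers rating the same ten students.
teacher_1 = [4, 3, 2, 4, 1, 3, 2, 4, 3, 2]
teacher_2 = [3, 3, 1, 4, 2, 2, 2, 3, 4, 2]

print(f"kappa = {cohens_kappa(teacher_1, teacher_2):.2f}")
```

Two raters who agree on four of ten portfolios land near kappa of 0.17, far below the 0.6-0.8 range usually described as substantial agreement – which is the problem RAND found in Vermont, multiplied across thousands of raters.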

The portfolio process was expensive and extremely time-consuming, and there is no guarantee the portfolio work was not “assisted” by parents or others.

Yes, portfolios and performance tasks are effective classroom tools and in a perfect world might be a way of assessing student progress; in the real world, the world in which we live, it is not reasonable to expect inter-rater reliability.

The anti-testing movement will not disappear and the opt-out movement is alive.

What is absent is leadership – Arne Duncan drove us down a path for seven years that divided education: reformers versus deformers, marketeers versus public schools, unions versus the hedge funders: education is bitterly divided. Will the next president nominate an education leader who can bring together the disparate constituencies?

Education is adrift and the unstandardized testing regimen in New York State is a prime example.

Off to Minneapolis: Preparing for the American Federation of Teachers Convention: Will the Bernie and Hillary Supporters Bond?

On Monday the American Federation of Teachers will celebrate its hundredth anniversary at its biennial convention, this year in Minneapolis. About 3,000 teachers, school-related personnel and nurses will spend four days setting policy for the national union, listening to a range of speakers and, on Monday afternoon, meeting the “presumptive” Democratic nominee Hillary Clinton. (You can watch on the AFT.org website.)

National conventions are always fascinating, an opportunity to meet teachers from around the nation. Chicago (CTU-Local 1) has been at war with Rahm Emanuel, their mayor, with another strike possible in September. California appears to be making positive changes away from endless testing – or is it creating a dense accountability system? Talking with teacher trade unionists from across the nation is always enlightening. I will be meeting teacher union guests from other countries. Recently I was speaking with teachers and school leaders from Austria: How do you become a principal in Austria? “You belong to the right political party.” Are teachers involved in hiring staff? (Odd look) “No, neither is the principal; teachers are assigned by the bureaucracy and have lifetime tenure after a few years. We needed a history teacher, they sent us a gym teacher; our system is totally top down.” BTW, Austria scores above average on PISA assessments (See here).

The convention schedule is packed full of meetings – first meeting 7 am Monday morning. The delegates will debate changes to the AFT constitution and bylaws and debate, in committees, ninety-one resolutions submitted from locals around the country. Linda Darling-Hammond will lead a discussion of teacher assessment and the new federal Every Student Succeeds Act (ESSA) that replaces No Child Left Behind (NCLB). The US Department of Education has just released draft regulations – no question the regulations and the possibilities for innovation pilots will be discussed (See Education Week discussion here).

On the convention floor there are multiple microphones (usually six, seven or eight) scattered around the arena. Any delegate can jump up to a microphone to support, oppose or amend a resolution. The committees, after debating the resolutions, set priorities; the highest-priority resolutions must be debated on the floor – there are thirteen committees, and the top three priorities must reach the floor.

The Democratic Platform on Education was set last week. The original platform reflected the reformers, led by Democrats for Education Reform (DFER); a coalition led by Bernie supporters and Randi Weingarten made significant changes (Read details here), angering the DFER faction, who are supporters of the Duncan-King policies.

A theme of the convention will be bonding the Hillary and Bernie acolytes and building a teacher-led Hillary campaign across the nation. Not an easy task, since passions ran high during the lengthy campaign; considering the Trump alternative, one would hope the Bernie folks will jump on board.

The latest polls, if you have any confidence in polls, predict a very close election (Read polling results here).

I’ll be blogging from Minneapolis – stay tuned.

Magic Bullets Equal Duds: Why Do Top-Down Educational Initiatives Rarely Succeed? Will ESSA Change the Face of Education?

I asked an astronomer friend whether the Juno spacecraft would find life under the frozen oceans of Ganymede, one of the moons of Jupiter; he answered,

The frozen ocean is essentially the surface, it’s thought that the watery mantle of Ganymede might be within the range tolerated by extremophiles, but again, a lot of speculation has been done. 

Much heat and little light, or, when Sagan was asked “yeah, but what are your gut feelings?” he replied “I don’t think with my gut.”

Science is a process of enforced intellectual rigor (enforced by peer review) which requires going from known data toward new understandings. We fill in the pages of a blank book with observations, measurements, and analysis, and then try to elucidate new models of how nature works. 

Going from the “already filled in book” to elicit behavior changes is the province of religion. 

Sadly, education policy-making is in the realm of faith, not science.

The reformers abjure “enforced intellectual rigor” and make sweeping decisions that impact millions of students based upon the absence of peer reviewed “observations, measurements and analysis.”

The current and former US Secretaries of Education support a portfolio system of schools – public and charter schools competing with each other for students – the competition, they argue, will raise student achievement in both. The belief is loosely based on the theories of Nobel Laureate economist Milton Friedman,

 In a famous 1955 essay, Friedman argued that there is no need for government to run schools. Instead, families could be provided with publicly financed vouchers for use at the K-12 educational institutions of their choice. Such a system, Friedman believed, would promote competition among schools vying to attract students, thus improving quality, driving down costs, and creating a more dynamic education system.

The current-day reformers cannot point to any evidence; in fact, the current system of side-by-side public and charter schools has not “raised all boats,” and may actually diminish student achievement.

In New York City another example is the rekindling of the “Reading Wars.” Chancellor Farina is a close friend of Lucy Calkins and of “balanced literacy,” her approach to the teaching of reading. Most experts are sharply critical of the Calkins approach and support the use of phonics to teach reading. Friendship rules: the chancellor supports her friend (Read a discussion of the “Reading Wars” here).

Will a teacher evaluation system based on student test scores sort the best and the worst teachers and lead to higher student achievement?  Once again, there is no evidence, and, in fact, scholars tell us that value-added measurement is highly inaccurate and inappropriate for measuring teacher competency.

Making the realm of policy creation and implementation even more depressing: when schools and school districts attempt to use the “wisdom and knowledge of experts,” the attempts fail.

Anthony S. Bryk, Louis M. Gomez, Alicia Grunow, and Paul G. LeMahieu, in Learning to Improve: How America’s Schools Can Get Better at Getting Better, argue,

… there is no universal mechanism in education for transforming the wisdom and knowledge experts accumulate as they work into a broader professional knowledge base … well-intentioned educational reforms across the ideological spectrum were unsuccessful because they were formed around a novel solution (such as the small schools movement, etc.,) rather than a practitioner-driven problem and were imposed from above without attention to the ways local conditions might require adaptation.

To address these two challenges, the authors argue that practitioners, policy makers, and researchers should collaborate across traditional organizational boundaries to engage in ongoing disciplined inquiry.

(Read a detailed description of the book here)

The authors lay out what they call “The Six Core Principles of Improvement”:

  1. Make the work problem-specific and user-centered. It starts with a single question: “What specifically is the problem we are trying to solve?” It enlivens a co-development orientation: engage key participants early and often.

  2. Variation in performance is the core problem to address. The critical issue is not what works, but rather what works, for whom, and under what set of conditions. Aim to advance efficacy reliably at scale.

  3. See the system that produces the current outcomes. It is hard to improve what you do not fully understand. Go and see how local conditions shape work processes. Make your hypotheses for change public and clear.

  4. We cannot improve at scale what we cannot measure. Embed measures of key outcomes and processes to track if change is an improvement. We intervene in complex organizations. Anticipate unintended consequences and measure these too.

  5. Anchor practice improvement in disciplined inquiry. Engage rapid cycles of Plan, Do, Study, Act (PDSA) to learn fast, fail fast, and improve quickly. That failures may occur is not the problem; that we fail to learn from them is.

  6. Accelerate improvements through networked communities. Embrace the wisdom of crowds. We can accomplish more together than even the best of us can accomplish alone.

In other words, there are no magic bullets. The “answer” is not the program; the answer is the competency and cooperation of the practitioners at the school, district and university level.

By competency I mean the ability to collaborate within and across schools, the ability to understand data and convert data into classroom practice, and the capacity to become reflective practitioners. The New York City-based Progressive Redesign Opportunity Schools for Excellence (PROSE) encourages schools to break free of perceived or real constraints, to craft research-based solutions at schools with the guidance and support of labor and management.

The International Network supports twenty schools, fifteen in New York City, that work with English language learner high school students who have been in the country four years or less. The six-year graduation rates match those of all other schools, and the schools share instructional practices.

How do we seed fertile soils? How do we prepare teachers and school leaders to use peer-reviewed research to drive actual practice? And, vitally important, how do we create district leadership that supports schools and does not constantly chase the holy grail, the magic bullet that has never existed?

There are highly successful schools, succeeding, frequently under the radar, while schools with similar populations struggle. Unfortunately the most successful principals, and occasionally superintendents, must resort to practicing creative resistance: smiling, nodding, and continuing to do what actually works in spite of “higher ups” who chase that elusive secret sauce.

The new Every Student Succeeds Act (ESSA) devolves power from the feds to the states; states have until the spring of 2017 to create plans to address struggling schools. Will states simply replicate the failed federal programs or actually craft creative approaches to school improvement?

The New York State Legislature Adjourns with a “Whimper,” as Educational Policy-Making Moves to the Board of Regents

This is the way the world ends
This is the way the world ends
This is the way the world ends
Not with a bang but a whimper

T. S. Eliot, The Hollow Men (1925)

The last stanza of Eliot’s poem is an apt description of the end of the 2016 legislative session. The final days, called “the Big Ugly,” are a scramble, an endgame, the Republicans and the Democrats vying for advantage as the state moves toward the November election. All the seats in the legislature, the 150 in the Assembly and the 63 in the Senate, will be on the ballot. While the Assembly is firmly in the Democratic column, the Senate is far more complex, and byzantine. The Democrats hold a single-seat edge in the Senate (32-31); however, five Democrats (Jeff Klein, Diane Savino, Tony Avella, David Valesky, and David Carlucci), the Independent Democratic Conference (IDC), under the leadership of Klein (Bronx), caucus with the Republicans, giving the Republicans control of the Senate.

Hanging in the balance were mayoral control, campaign finance reform, removal of pensions for convicted legislators, online fantasy sports betting and scores of other bills.

You may ask: why is all this conflict and wheeling and dealing necessary? Why can’t legislators have civil conversations and decide the issues?

James Madison, in Federalist # 51 wrote,

Ambition must be made to counteract ambition. The interest of the man must be connected with the constitutional rights of the place. It may be a reflection on human nature, that such devices should be necessary to control the abuses of government. But what is government itself, but the greatest of all reflections on human nature? If men were angels, no government would be necessary. If angels were to govern men, neither external nor internal controls on government would be necessary

The Constitutional Convention (1787) was not covered on C-SPAN; it was a secret meeting. The only notes we have are Madison’s personal notes, not made public until after the death of all the delegates. The fifty-five delegates argued, came and went, delivered lengthy speeches, met in private, and made deals.

Slavery was one of the most significant stumbling blocks, the anti-slavery Northerners versus the slave-holding South. The compromise: slavery is not mentioned in the constitution, the question of slavery was left to the states, and, as part of a compromise, slaves were counted as 3/5 of a “free person” and referred to in the clause as “all other Persons.”

Representatives and direct taxes shall be apportioned among the several states which may be included within this union, according to their respective numbers, which shall be determined by adding to the whole number of free persons, including those bound to service for a term of years, and excluding Indians not taxed, three fifths of all other Persons.

Deal-making, as reprehensible as it may seem, is at the essence of making government work.

Whether to extend mayoral control in New York City had nothing to do with education. Weakening the mayor might give the Republicans a chance in the 2017 mayoral election. In spite of pleas from Merryl Tisch and others in the upper echelons of power, Senate leader John Flanagan offered “unacceptable” plan after plan until, in the closing hours, an agreement was reached. The NY Times describes the plan as a one-year extension plus,

It would effectively create a parallel system of charter schools within the city, allowing “high-performing charter schools in good standing” to switch to join the State University of New York umbrella or the Board of Regents of the State Education Department.

Probably a meaningless change: currently, charter schools authorized by both New York City and Buffalo make reauthorization proposals after five years, and the authorizer, SUNY or the Board of Regents, can reject the recommendation. The proposal allows a charter school, if it’s “high performing and in good standing,” to move directly to SUNY or the Regents for reauthorization.

The session is most interesting for what it did not do – the houses steered clear of legislation directing the State Education Department to take any actions. A host of education bills simply died. Neither the governor nor the party leaders had any desire to once again get involved in the morass of teacher accountability or testing, any of the issues that birthed the opt outs and/or angered teachers and their unions.

The budget was generous and the political leaders appear to be leaving the educational decisions to the educational leaders.

In December the Cuomo-appointed Task Force released its report with 21 recommendations: a blueprint for the Commissioner and the Board of Regents. The core of the report was a four-year moratorium on the use of student test scores as part of a metric to assess teacher performance.

In the six months since the release of the report the Commissioner has made tests untimed, a recommendation in the report, and established a number of large field-based committees to review elements of the Common Core; the Regents have created a number of alternative pathways to graduation.

Quietly, very quietly, the Commissioner announced a change in the observation section of the teacher evaluation regulation. The outside observer would be scrapped – what might be a good idea in theory was both overly complex and a financial burden on school districts. There was no high drama – no headlines, simply an announcement undoubtedly based on quiet discussions.

The decisions before the Board of Regents are complex, politically explosive and without explicit answers.

Can you create a teacher evaluation plan that is acceptable to principals and teachers and not trashed by external critics?

Can better tests win back opt out parents?  And, what do you mean by “better tests?”

Will alternatives to testing, perhaps, portfolios or other performance assessments, be acceptable to the feds, and acceptable to the principals and teachers?  Are performance assessments practicable in actual classroom settings?

Will additional alternative pathways to high school graduation make students more or less prepared for college?

The Regents appear to have a window – three or four years – to make decisions based on their expertise as well as respond to external pressures and scrutiny, and, hovering aloft: “disruptive” solutions such as unlimited charter schools or vouchers.

Windows open, and windows close.

Getting It Right: Building a Research-Based Teacher Assessment System

A couple of years ago I was participating in a Danielson Training Workshop, two Saturdays in a room filled with principals and network support folk. We watched a video of part of a lesson – we were told we were watching a first year teacher in November in a high school classroom.

Under the former Satisfactory/Unsatisfactory rating system the lesson was clearly satisfactory. The Danielson Frameworks (read the 115-page NYSED document here) require that teachers be rated on a four-point scale (Distinguished, Proficient, Basic and Unsatisfactory), while New York State also requires a four-point scale (Highly Effective, Effective, Developing and Ineffective). The Frameworks divide the teaching process into four domains, 22 components and 76 elements.

The instructor asked us to rate the lesson: at my table we were all over the place. For a teacher in the third month of her first year of teaching the lesson was excellent – clearly “proficient.”  Others argued the time in teaching was irrelevant, you had to rate her against all other teachers regardless of experience – at best, she was “developing.” Inter-rater reliability was absent.

Decades ago the union sent me to an Educational Testing Service conference on teacher assessment: about thirty experienced superintendents from all over the Northeast, and me, one union guy. We began by watching three 15-minute videos of lessons. In the first, an “old-fashioned” classroom, the kids sat in rows, answered teacher questions and stood when they answered; the questions were at a high level, although a small number of kids dominated the discussion. In the second video kids sat at tables; the teacher asked a question, gave the kids a few minutes to “huddle,” one of the kids answered for the group, and the teacher followed up with a few clarifying questions. In the third classroom the kids were at stations around the room; it was noisy, but the noise was the kids discussing the assignment, and the teacher flitted around the room, answering, clarifying and asking questions.

We were asked to rate the lessons on a provided checklist.

The result: the superintendent ratings were all over the place.

I was serving as the teacher union rep on a Schools Under Registration Review (SURR) team – we were visiting a low performing school. We were told to wait, the principal was busy, four of the 50 teachers were absent and there were three vacancies, the principal was assigning classroom coverages.

At the initial get acquainted session a team member, considering the staffing issues asked, “What are the primary qualities you look for in assessing teacher quality?” The principal blurted, “They come every day and blood doesn’t run out from under the door.”

A colleague was touring a school with very high test scores.  As he walked the building with the principal, he saw uniformly “mediocre” instruction – teacher-dominated, no student engagement. He mentioned the low quality of instruction to the principal, who shrugged, “Why mess with success?”

Once again, there is no inter-rater reliability.

In a number of school districts across the state almost all teachers received maximum observation ratings.

The State Ed folk simply accept the observation ratings of principals and school districts.

Charlotte Danielson, in her other book, Talk About Teaching (September, 2015), discusses the complex role of the principal as rater as well as staff developer: how can a principal who is the summative evaluator honestly engage with the teachers they rate?

In an excellent article from the Center for Educator Compensation Reform, Measuring and Promoting Inter-Rater Agreement of Teacher and Principal Performance Ratings (February, 2012), the authors parse the reliability of teacher observation ratings. There are a number of statistical tools to assess reliability – the state uses none of them.
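One of the standard statistical tools for this purpose is Cohen’s kappa, which measures how much two raters agree beyond what chance alone would produce. A minimal sketch in Python follows; the paired principal ratings below are invented for illustration (neither the state nor the cited article reports such paired data):

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters beyond chance.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected if the two
    raters assigned categories independently.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)


# Two hypothetical principals rating the same ten lessons on the
# HEDI scale (H = Highly Effective, E = Effective, D = Developing).
principal_1 = ["E", "E", "H", "D", "E", "E", "H", "E", "D", "E"]
principal_2 = ["E", "H", "H", "E", "E", "D", "H", "E", "D", "E"]
print(round(cohens_kappa(principal_1, principal_2), 2))  # prints 0.5
```

Here the two raters agree on 70% of lessons, but chance alone would produce 40% agreement, so kappa is only 0.5, middling agreement. The point of the tool is exactly the one the article raises: raw agreement percentages flatter the raters, and without some such statistic no one knows whether observation scores mean the same thing from school to school.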

In New York State 60% of a teacher’s rating is made up of the observation score, and we have no idea of the accuracy of that rating.

In the pre-Race to the Top days, the Satisfactory/Unsatisfactory rating days, the entire rating depended on the observation; in the last year of the Bloomberg administration 2.7% of teachers in New York City received Unsatisfactory ratings. Under the current, far more complex system that incorporates student test scores and other measures of student growth, only 1% of teachers were rated ineffective (read a description of the plan: APPR 3012-c).

Under the newest system the other 40% is a combination of Measures of Student Learning and Student Learning Objectives; the use of state test scores is suspended until the 2019-20 school year.

Read a detailed description of the current APPR 3012-d teacher evaluation law here and a lengthy Power Point here.

In May, 2015 the Regents convened a Learning Summit and asked a number of experts to discuss the use of student growth scores (VAM): watch the lengthy, sometimes contentious discussion here.

With one exception the experts criticized the use of student growth scores (VAM): the scores did not meet the tests of “validity,” “reliability” and “stability.”

There have been glaring errors in the system. In the Sheri Lederman lawsuit, a teacher with very high observation scores received, due to the composition of her class, very low student growth scores. The judge ruled that the use of the growth scores, in her individual case, was “arbitrary and capricious.”

The APPR plan negotiated in New York City, on the other hand, allows for appeals by a neutral third party, and, the “neutral” has overturned appeals in which there was a wide disparity between the observation and VAM scores.

The current plan, created by the governor and approved by the legislature, has been rejected by teachers and parents. Teachers are convinced that their score depends on the ability of the students they teach, not their competence. Parents feel schools are forced to “teach to the test” due to the consequences facing principals and teachers.

Angry parents, angry teachers and principals and a governor and a legislature looking for a way out of the box they created.

And, a cynicism from elements among the public – if two-thirds of kids are “failing” state tests how is it possible that only one percent of principals and teachers are rated “ineffective?”

The Board of Regents has been tasked with finding the “right” plan.

There has been surprisingly little research and public discussion of teacher attrition – in high-poverty schools staggering percentages of teachers, 30%, 40%, 50% or more, leave within their first few years.

The December, 2015 Cuomo Common Core Task Force, in a scathing report, tasked the Regents with “correcting” what has been a disastrous path: partially the governor’s incredibly complex teacher evaluation matrix, and partially Commissioner King’s rush to adopt the Common Core, Common Core testing and teacher evaluation simultaneously.

Can the Regents separate political decisions from research-based and guided decisions? Can the Regents move from the John King path, an emotion-guided political path to actually following “what the research says”?

On Tuesday the new Research Work Group, chaired by Regent Johnson will convene for the first time.

The roadmap for the State Ed Department and the Board of Regents is the twenty-one recommendations of the Cuomo Common Core Task Force. A number of the recommendations are already in progress: untimed testing, an in-depth review from the field of the standards, greater transparency of the test items, alternatives to the use of examinations for students with disabilities, and the beginning of a review of teacher evaluation.

The Commissioner and the Regents have to regain a lost credibility: from policy emanating from the Gates Foundation and the so-called reformers to policies guided by scholarship and supported by parents and educators.

Killing the Zombies: Why the “Bad Teacher” Canard Refuses to Die

Who is Clay Christensen and what is disruptive innovation in education?

Christensen is a professor in the Harvard Business School and the intellectual force behind the current education reform movement. The professor proffers that education has been basically unchanged for decades: the traditional classroom model persists, with little improvement in achievement. Christensen acolytes argue that the traditional model must be “disrupted,” and they offer a wide range of examples: placing schools in competition (public, private, charter, parochial and home-schooling) through a voucher system; replacing traditional instruction with an iteration of personalized learning in the form of computer-based learning; and removing impediments to dismissing “bad” teachers.

The “disruptors” include the political leadership, from the White House to state capitals. The $4.4 billion in competitive state grants, the Race to the Top (RttT), is a prime example of the lure of federal dollars to disrupt the traditional systems: RttT required the creation and expansion of charter schools as well as a student test score-based teacher evaluation system.

The New Teacher Project (TNTP), an advocacy organization, a “disruptor” organization, conducted a survey of school districts and issued a report, The Widget Effect. The findings:

Effective teachers are the key to student success, yet our school systems treat all teachers as interchangeable parts, not professionals. Excellence goes unrecognized and poor performance goes unaddressed. This indifference to performance disrespects teachers and gambles with students’ lives.

The 2009 report, which surveyed fifteen school districts across four states, points to the absence of formalized evaluation systems, resulting in virtually all teachers rated satisfactory with few classroom observations. For the TNTP there was no sorting of teachers by ability: no one is fired and no one is identified as being an exemplary teacher.

In districts that use binary evaluation ratings (generally “satisfactory” or “unsatisfactory”), more than 99% of teachers receive the satisfactory rating. Districts that use a broader range of rating options do little better; in their districts, 94% of teachers receive one of the top two ratings and less than 1% are rated unsatisfactory.

Since the release of the report the reformers, the “disruptors,” have been successful, enormously successful, in convincing, coercing and luring states into highly structured teacher evaluation systems: to identify the high performers (merit pay) and prune away the low performers.

As part of its winning Race to the Top proposal New York State designed a multiple measures teacher evaluation system: 60% of a teacher’s score would be supervisory observations based on a rubric selected by the school district (Danielson, Marzano, Marshall and others), 20% based on student growth data (VAM) on state grades 3-8 test scores and 20% on a locally negotiated metric, which could be test scores or other measures of student learning (MOSL). The data is pumped into a dense, extremely dense mathematical algorithm and all teachers receive a score that translates into a letter grade on the HEDI spectrum: Highly Effective, Effective, Developing and Ineffective. The teacher evaluation plan, called the Annual Professional Performance Review (APPR), has been amended a number of times; the current plan, called the “matrix,” prohibits the use of student test scores for four years.
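Stripped of the state’s dense conversion tables, the 60/20/20 weighting above can be sketched as a toy calculation. The weights come from the original APPR design described here, but the 0-100 subscore scale and the HEDI cut points below are invented for illustration; the actual regulatory conversion charts are far more elaborate.

```python
def composite_score(observation, state_growth, local_measure):
    """Weighted composite per the original 60/20/20 APPR split.

    Each input is assumed to already be a 0-100 subscore; the real
    system converts rubric and growth results through its own tables.
    """
    return 0.60 * observation + 0.20 * state_growth + 0.20 * local_measure


def hedi_rating(score):
    """Map a composite to a HEDI band.

    These cut points are hypothetical, not the ones fixed in
    New York regulation.
    """
    if score >= 91:
        return "Highly Effective"
    if score >= 75:
        return "Effective"
    if score >= 65:
        return "Developing"
    return "Ineffective"


# A strong observation score with a middling growth score still
# lands in the "Effective" band under these illustrative cut points.
score = composite_score(observation=88, state_growth=70, local_measure=80)
print(round(score, 1), hedi_rating(score))
```

The sketch makes the structural point plain: because observation carries 60% of the weight, a principal’s rating largely determines the band, which is why the reliability of those observations matters so much.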

The inclusion of a value-added measurement, the student test score algorithm, has been sharply criticized by a range of scholars as well as teacher organizations, and a state court, in a non-precedent-setting decision, found the use “arbitrary and capricious.”

Millions of dollars to create a multiple measures teacher evaluation plan and the result: 1% of teachers are ineffective – the same as the Widget Effect report.

In 2009 The New Teacher Project bemoaned that only 1% of teachers were rated “unsatisfactory,” and seven years later the New York State APPR plan found, you guessed it: once again, only 1% of teachers rated “ineffective.”

Millions of dollars to create a teacher evaluation system, a host of “experts,” the application of dense mathematical formulations and the percentage of teachers rated ineffective is unchanged.

This couldn’t possibly be right!! … so say the disruptors.

In a recently released report, The Widget Effect Revisited: Teacher Evaluation Reforms and the Distribution of Teacher Effectiveness (February, 2016), the authors surveyed newly designed teacher evaluation plans across a number of states and interviewed principals.

On the new plans, “… less than 3% of teachers were rated below Proficient”

The raters, the principals, also reported,

“…evaluators perceive more than three times as many teachers in their school as below Proficient than they rate as such.”

In lengthy interviews the principals expressed the reasons for not rating more teachers as below Proficient.

* Time constraints (“It takes too much time away from running a school”)

* Teacher potential and motivation (“Fear of the discouraging of teachers”)

* Personal discomfort (“I have a difficult time telling teachers they’re failing”)

* Racial tensions (“Very difficult for a White principal to rate Black teachers poorly”)

* Quality of replacements (“I can’t find adequate replacements”)

* Voluntary departures (“I rate them Proficient and they leave – a deal is made”)

* Burdensome dismissal processes (“The process is too complex and time consuming”)

Is the “problem” too many “below Proficient” teachers or “below Proficient” principals?

What the authors failed to investigate (admittedly not the purpose of the study):

* Inter-rater reliability: do the raters from school to school use the same rubrics, and, are they competent to assess teacher performance?

* The bell-curve conundrum: is the lowest rated teacher in the school “below Proficient” when compared with all other teachers in the district?  In other words, is it “arbitrary and capricious” to establish a system that guarantees that the lowest performer in a school must be below Proficient? After all, they may be more proficient than teachers in other schools in the district.

A second baseman on a major league team may be the “least proficient” among major leaguers and in the top 1% of all second basemen across colleges and minor leagues.

I know the idea is disquieting – perhaps only 1% of teachers actually are ineffective.

Prospective teachers must be accepted by a college and meet standards set by the Council for the Accreditation of Educator Preparation (CAEP); teachers must pass a number of nationally recognized pre-service exams, pass interviews by principals/hiring committees, teach a demonstration lesson and serve a probationary period as an at-will employee.  It should not be surprising that very few teachers who survive the rigorous pre-screening end up as “below Proficient.”

Nobel Laureate Paul Krugman coined the term zombie idea: a zombie idea is “a proposition that has been thoroughly refuted by analysis and evidence, and should be dead — but won’t stay dead because it serves a political purpose, appeals to prejudices, or both.”

The disruptor “bad teacher” solution to increasing student achievement is an example of a zombie idea – in spite of reams of evidence the idea refuses to die.

What is so depressing is that, compared to teacher attrition, the “bad” teacher argument pales: in the lowest-achieving, highest-poverty schools about half of all teachers leave within five years, and we have a pretty good idea of why they leave: the way they’re treated.

Susan Moore Johnson, at the Next Generation of Teachers Project at Harvard, examines the issue of teacher attrition in the highest-poverty schools in detail.  Yes, teachers commonly leave for wealthier, whiter schools; however, they are not fleeing the students, they are fleeing the working conditions.

If we know how to make significant differences (“Improve working conditions in the poorest school”), if we’ve identified the core problem (“Teacher morale and treatment”), why don’t we address the solution?

Those zombies are tough to kill off.