New York State Cut Scores: From the Inside

Below, I present, quite unintentionally and with great serendipity, a first-hand account of New York State cut score-setting from my gracious guest blogger, Dr. Maria Baldassarre-Hopkins, Assistant Professor of Language, Literacy and Technology at my alma mater, Nazareth College in Rochester, New York.

If you have ever wondered…

~ Where the hell do those cut scores come from, anyway?

~ How does the state actually set a cut score?

~ How political is this process?

~ Are there any teachers involved at all?

…then this is the article for you.

While writing this piece, Maria spoke to me often of having to navigate legalese, non-disclosure agreements and so on, so you will not find a tell-all blow-up here. However, you may certainly find a window into the process that you may not have had before. The fight begins, as it always should begin, with good intelligence.

Take it and use it well.  ~ Dina

UPDATE: This post has been picked up by Diane Ravitch, who has a testing expert respond. There is also a whole new set of comments. Go check it out.


The Mission

If you are a teacher who has ever yearned to be sent on a secret mission (really, who hasn’t?), complete with coded documents, non-disclosure agreements, men in dark suits watching you, advance e-mails to dress in layers, unanswered questions and enormous breakfast buffets with zero protein, you must spend a week of your life working for Pearson. My mission: Make cut score recommendations to the commissioner for the maiden voyage of the Common Core ELA assessments.

The most thought-provoking question asked of me during my five days in Albany was by a no-nonsense, b.s.-calling principal from the Bronx whose intensity scared me a little: “So, what do you hope to get out of this?” I said something about using my new knowledge of the standards, the tests and how they are constructed and evaluated to help my students think about using sound literacy instruction to prepare their students for them.  That was Day 2.

A Word About Cut Scores

Cut scores were to be decided upon after NYS students in grades 3-8 took the tests.  By looking alternately at test questions, Common Core State Standards, Performance Level Descriptors, and other data-which-shall-not-be-named (thank you non-disclosure agreement!), 100 educators sat in four meeting rooms at the Hilton using a method known as “bookmarking” to decide what score a student needed to earn in order to be labeled as a particular “level” (i.e., 1-4).  How many questions does a student need to answer correctly in order to be considered a 3?  How about a 4?  2?  1?

In each room sat teachers, administrators and college faculty from across the state.  This mix made for some interesting discussion, heated debates, and a touch of hilarity.  There were smartly dressed psychometricians everywhere (i.e., Pearson stats people) and silent “gamemakers,” unable to participate, sitting in the back of the room looking on, clicking away on their laptops.  Sometimes they nodded to each other or whispered, other times they furrowed their brows, and at least twice when the tension was high in the room, one gamemaker (who I called “the snappy dresser” and others called “the Matrix guy”) stood up and leaned over the table like he was going to do something to make us rue the day.  I kept my eye on that one.

So, Bookmarking…

We began the bookmarking process with grade 8, later repeating the entire process for grades 7 and 6.  I will try to be as brief as possible:

  • We first reviewed Performance Level Descriptors for each level of performance on the assessment (levels 2-4; we were not provided PLDs for level 1).  PLDs were originally crafted by “content experts” at NYSED/Pearson and were categorized based on anchor standards that were being assessed.  While we were not permitted to leave with copies of the PLDs, we were told they would be made available to the public online … eventually.
  • We then had to consider this question:  What should a student who is barely at level ___ be able to do?  In other words, what separates a level 3 student from a level 2 student (etc.) on each anchor standard?  These were known as threshold descriptors.  I can’t say exactly what our threshold descriptors were, but I can say they included words like “smidge,” “glimmer,” “morsel,” and “nugget,” and phrases like “predominately consistently…”  This.  Took.  Long.
  • We reviewed selected passages and questions in original test booklets (40 minutes) and then the rest of the test (20 minutes).  (Just curious: How long did your students take to complete the test?  No reason.  Just wondering.)
  • We received Ordered Item Books (OIB) where all test questions for that grade were ordered by “experts” from least to most difficult.  Constructed and extended response questions were listed multiple times at various places, once for each possible point value.  Passage difficulty was considered in the ordering of the questions, where “difficulty” was synonymous with “Lexile score.”
  • Time to “bookmark”:  Each of us would place a post-it note in the OIB on the last question a student at a particular threshold level of proficiency would have a 2/3 chance of answering correctly.
  • But before we began, we were told which page numbers correlate with external benchmark data (I could tell you what those data were, but then I would have to kill you).  So, it was sort of like this:  “Here is how students who are successful in college do on these ‘other’ assessments.  If you put your bookmark on page X for level 3, it would be aligned with these data.”
  • We had three rounds of paging through the OIB, bookmarking questions, getting feedback data on our determined cut scores, and revising.  We had intense discussion as we began to realize the incredible weight of our task.  We were given more data in the form of p-values for each question in the OIB – the percentage of students who answered it correctly on the actual assessment. Our ultimate results were still not the final recommendation.
  • On our final day of bookmarking we came back to grade 8 (after the process took place for grades 6 and 7) and did one last round.  This 4th round determined the actual cut scores that would go to the commissioner as a recommendation.
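
For readers who want to see the arithmetic behind the "2/3 chance" rule, here is a minimal sketch of how a bookmark can become a cut score. It is not the procedure used in Albany, which this account does not disclose. It assumes a simple Rasch model, takes the group's cut from the median bookmark, and places the cut at the bookmarked item's own location. All function names, item difficulties, and bookmark pages are invented for illustration, and constructed response items (which appear once per point value in a real OIB) are left out.

```python
import math
from statistics import median_low

RP = 2 / 3  # the "2/3 chance" criterion described above


def p_correct(theta, difficulty):
    """Rasch-model probability (an assumption) that a student of ability theta answers correctly."""
    return 1 / (1 + math.exp(-(theta - difficulty)))


def rp_location(difficulty, rp=RP):
    """Ability at which p_correct equals rp for an item of this difficulty."""
    return difficulty + math.log(rp / (1 - rp))


def build_oib(difficulties):
    """Order items from easiest to hardest, as (item_id, rp_location) pages."""
    pages = [(item, rp_location(b)) for item, b in difficulties.items()]
    return sorted(pages, key=lambda page: page[1])


def cut_from_bookmarks(oib, bookmark_pages):
    """Turn panelists' 1-based bookmark pages into one cut on the ability scale.

    Assumes the group cut is the (low) median bookmark and that the cut sits
    at the bookmarked item's location; other conventions exist.
    """
    page = median_low(bookmark_pages)
    item, location = oib[page - 1]
    return item, location


# Hypothetical data: four items and four panelists' bookmarks for one level.
difficulties = {"q1": -1.0, "q2": 0.0, "q3": 1.0, "q4": 0.5}
oib = build_oib(difficulties)
bookmarks = [2, 3, 3, 4]
cut_item, cut_theta = cut_from_bookmarks(oib, bookmarks)

print([item for item, _ in oib])
print(cut_item, round(cut_theta, 3))
```

In this sketch, a student sitting exactly at the cut has a 2/3 chance of answering the bookmarked item correctly, a better chance on every earlier page, and a worse chance on every later page. Translating that ability value into a raw or scale score is a separate step that is not modeled here.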

I, along with the people in my room, completed this entire process for grades 6-8.   A group of educators in a similarly tiny room did the same for grades 3-5.  On day five, table leaders got together for vertical articulation.  This meant we looked across all of the cut scores to see whether or not there was some general consistency across all grades, 3-8.

And Now, a Gentle Plea to the Reader:

I received word on the day I write this that the commissioner has made a final decision on the cut scores.  I am not at liberty to say whether the recommendation we made was the last word, but once the cut scores are announced I would like for you, with kindness in your heart, to hold the same image I cling to a month later – and it is this image that will have the most profound impact on how I channel this experience in my own teaching:

In the room where I sat for five days, I was among some of the most critical, thoughtful and intelligent teachers, administrators and college faculty I’ve ever met, all of whom were fiercely loyal to the students in their classrooms and communities. Despite the rigidly scaffolded and tightly constrained process of recommending cut scores, the educators in our room fought tirelessly for high standards and, at the same time, fairness to teachers and students.

  • Through gentle inquisition, they took the commissioner to task when he gave us our charge.
  • They challenged any and every part of the methodology that seemed problematic.
  • They thought about how these decisions would impact teacher and principal evaluations.
  • They pushed back hard at the reality that the cut score decisions could actually diminish the quality of education students—especially non-white students, ELLs and SWDs—would experience on a daily basis.
  • They realized that at the end of the day many questions would remain unanswered or unaddressed. Though our facilitator was lovely (a psychometrician from Pearson who was as intelligent and kind as she was passionate about the work she was doing), she was not a policy-maker.
  • They drank beer.

What I Took from This:

It was not quite what I had hoped or expected.  I wish I could say I now have answers to satiate my students’ hunger for the best practical answers to their instructional quandaries related to these tests.  If anything, my thoughts about that are slightly more muddied.

While I am required here to be vague about specific data, details and conversations, I trust that the discerning eye of the critical practitioner might read between these lines.  But I will be frank when I say that it has never been so clear to me that the dataphilia that is now the culture of our profession is not non-ideological.

My geek-life hero, Marilyn Cochran-Smith (among others), has written that teaching is never neutral. Every single thing we teach and how we choose to teach it is political, including how and what we assess and how we evaluate those assessments.

That admonition has never felt so real to me.  I am heartened by the vehemence with which the professionals in that room pushed back, working within the system in order to simultaneously work against it.

And that is my takeaway.  I was glad to be at the table.  I wish they had given us ten more days or two more years to make some substantive changes.  And I hope thoughtful people with a dog in the fight like the ones I met in Albany continue to fight to have their voices heard.

~ Maria Baldassarre-Hopkins