New York State Cut Scores: From the Inside

Below, I present, quite unintentionally and with great serendipity, the first-hand account of New York State cut score-setting from my gracious guest blogger Dr. Maria Baldassarre-Hopkins, Assistant Professor of Language, Literacy and Technology at my alma mater, Nazareth College in Rochester, New York.

If you have ever wondered…

~ Where the hell do those cut scores come from, anyway?

~ How does the state actually set a cut score?

~ How political is this process?

~ Are there any teachers involved at all?

…then this is the article for you.

Maria spoke to me often during the writing of this piece about having to navigate legalese, non-disclosure agreements, and so on, so you will not find a tell-all blow-up here. However, you may certainly find a window into the process that you may not have had before. The fight begins, as it always should, with good intelligence.

Take it and use it well.  ~ Dina

UPDATE: This post has been picked up by Diane Ravitch, who has a testing expert respond, along with a whole new set of comments. Go check it out.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Mission

If you are a teacher who has ever yearned to be sent on a secret mission (really, who hasn’t?), complete with coded documents, non-disclosure agreements, men in dark suits watching you, advance e-mails to dress in layers, unanswered questions and enormous breakfast buffets with zero protein, you must spend a week of your life working for Pearson. My mission: Make cut score recommendations to the commissioner for the maiden voyage of the Common Core ELA assessments.

The most thought-provoking question asked of me during my five days in Albany came from a no-nonsense, b.s.-calling principal from the Bronx whose intensity scared me a little: “So, what do you hope to get out of this?” I said something about using my new knowledge of the standards, the tests, and how they are constructed and evaluated to help my students think about how sound literacy instruction can prepare their students for them.  That was Day 2.

A Word About Cut Scores

Cut scores were to be decided upon after NYS students in grades 3-8 took the tests.  By looking alternately at test questions, Common Core State Standards, Performance Level Descriptors, and other data-which-shall-not-be-named (thank you non-disclosure agreement!), 100 educators sat in four meeting rooms at the Hilton using a method known as “bookmarking” to decide what score a student needed to earn in order to be labeled as a particular “level” (i.e., 1-4).  How many questions does a student need to answer correctly in order to be considered a 3?  How about a 4?  2?  1?
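For readers who want the arithmetic made concrete: a cut score is just a threshold on the test score, and the levels are whatever falls between the thresholds. Here is a minimal sketch in Python; the cut values (and the use of raw scores rather than scale scores) are invented for illustration and are not the actual NYS cuts.

```python
from bisect import bisect_right

# Hypothetical cut scores: the minimum raw score needed for levels 2, 3 and 4.
# (Illustrative numbers only -- not the actual NYS cuts.)
CUTS = [20, 35, 50]

def score_to_level(raw_score, cuts=CUTS):
    """Return the performance level (1-4) for a raw score.

    `cuts` lists, in ascending order, the lowest raw score that earns
    levels 2, 3 and 4; anything below cuts[0] is a level 1.
    """
    return 1 + bisect_right(cuts, raw_score)

if __name__ == "__main__":
    for raw in (12, 20, 34, 35, 49, 50):
        print(raw, "->", score_to_level(raw))
```

Nudge the middle cut up by a single point and every student sitting exactly on that boundary is relabeled from a 3 to a 2, which is why where the bookmark lands matters so much.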

In each room sat teachers, administrators and college faculty from across the state.  This mix made for some interesting discussion, heated debates, and a touch of hilarity.  There were smartly dressed psychometricians everywhere (i.e., Pearson stats people) and silent “gamemakers,” unable to participate, sitting in the back of the room looking on, clicking away on their laptops.  Sometimes they nodded to each other or whispered, other times they furrowed their brows, and at least twice, when the tension was high in the room, one gamemaker (whom I called “the snappy dresser” and others called “the Matrix guy”) stood up and leaned over the table like he was going to do something to make us rue the day.  I kept my eye on that one.

So, Bookmarking…

We began the bookmarking process with grade 8, later repeating the entire process for grades 7 and 6.  I will try to be as brief as possible:

  • We first reviewed Performance Level Descriptors for each level of performance on the assessment (levels 2-4; we were not provided PLDs for level 1).  PLDs were originally crafted by “content experts” at NYSED/Pearson and were categorized based on anchor standards that were being assessed.  While we were not permitted to leave with copies of the PLDs, we were told they would be made available to the public online … eventually.
  • We then had to consider this question:  What should a student who is barely at level ___ be able to do?  In other words, what separates a level 3 student from a level 2 student (etc.) on each anchor standard?  These were known as threshold descriptors.  I can’t say exactly what our threshold descriptors were, but I can say they included words like “smidge,” “glimmer,” “morsel,” and “nugget,” and phrases like “predominately consistently…”  This.  Took.  Long.
  • We reviewed selected passages and questions in original test booklets (40 minutes) and then the rest of the test (20 minutes).  (Just curious: How long did your students take to complete the test?  No reason.  Just wondering.)
  • We received Ordered Item Books (OIB) where all test questions for that grade were ordered by “experts” from least to most difficult.  Constructed and extended response questions were listed multiple times at various places, once for each possible point value.  Passage difficulty was considered in the ordering of the questions, where “difficulty” was synonymous with “Lexile score.”
  • Time to “bookmark”:  Each of us would place a post-it note in the OIB on the last question a student at a particular threshold level of proficiency would have a 2/3 chance of answering correctly.  (A rough sketch of this arithmetic appears after this list.)
  • But before we began, we were told which page numbers correlate with external benchmark data (I could tell you what those data were, but then I would have to kill you).  So, it was sort of like this:  “Here is how students who are successful in college do on these ‘other’ assessments.  If you put your bookmark on page X for level 3, it would be aligned with these data.”
  • We had three rounds of paging through the OIB, bookmarking questions, getting feedback data on our determined cut scores, and revising.  We had intense discussions as we began to realize the incredible weight of our task.  We were given more data in the form of p-values for each question in the OIB: the percentage of students who answered it correctly on the actual assessment.  Even those results were still not the final recommendation.
  • On our final day of bookmarking we came back to grade 8 (after the process took place for grades 6 and 7) and did one last round.  This 4th round determined the actual cut scores that would go to the commissioner as a recommendation.
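What follows is a minimal sketch of the arithmetic behind the bookmark placement described in the list above, assuming a simple Rasch (1PL) model and the 2/3 response-probability criterion. The item difficulties are invented, and Pearson’s actual models, scaling and rounding rules are certainly more involved; this is only meant to show how a page in the Ordered Item Book can be translated into a cut score.

```python
import math

# Hypothetical Rasch (1PL) item difficulties for an Ordered Item Book,
# already sorted from easiest to hardest (illustrative values only).
ITEM_DIFFICULTIES = [-1.8, -1.2, -0.7, -0.3, 0.0, 0.4, 0.9, 1.3, 1.8, 2.4]

RP = 2.0 / 3.0  # the "2/3 chance of answering correctly" criterion

def rasch_p(theta, difficulty):
    """Probability that a student of ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def theta_for_bookmark(bookmark_index, difficulties=ITEM_DIFFICULTIES, rp=RP):
    """Ability implied by bookmarking item `bookmark_index` as the last one a
    threshold student would answer correctly with probability rp.

    Under the Rasch model, P(correct) = rp exactly when
    theta = difficulty + ln(rp / (1 - rp)).
    """
    return difficulties[bookmark_index] + math.log(rp / (1.0 - rp))

def expected_raw_score(theta, difficulties=ITEM_DIFFICULTIES):
    """Expected raw score of a student at ability theta across all items."""
    return sum(rasch_p(theta, d) for d in difficulties)

if __name__ == "__main__":
    # Suppose a panelist bookmarks the 7th item (index 6) for the
    # "barely level 3" threshold.
    theta3 = theta_for_bookmark(6)
    print(f"threshold ability for level 3: {theta3:.2f}")
    print(f"implied raw cut: {expected_raw_score(theta3):.1f} of {len(ITEM_DIFFICULTIES)}")
```

The p-values mentioned above play a different role: they are simply the observed percent-correct for each item on the live test, which panelists could weigh against where their bookmarks fell.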

I, along with the people in my room, completed this entire process for grades 6-8.  A group of educators in a similarly tiny room did the same for grades 3-5.  On day five, table leaders got together for vertical articulation, which meant looking across all of the cut scores to see whether there was general consistency across grades 3-8.
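As a rough illustration of what that cross-grade check involves (the percentages below are invented, not the panel’s data), vertical articulation amounts to asking whether the share of students landing at or above each cut moves reasonably smoothly from grade to grade:

```python
# Hypothetical "impact data": the percentage of students at or above the
# level 3 cut in each grade under a draft set of cut scores (invented numbers).
PERCENT_PROFICIENT = {3: 34.0, 4: 32.5, 5: 30.1, 6: 36.8, 7: 29.9, 8: 31.2}

def articulation_flags(impact, max_jump=5.0):
    """Flag grade-to-grade swings in percent proficient larger than max_jump.

    A crude stand-in for vertical articulation: adjacent grades that differ
    sharply get their cuts revisited.
    """
    grades = sorted(impact)
    flags = []
    for lo, hi in zip(grades, grades[1:]):
        jump = impact[hi] - impact[lo]
        if abs(jump) > max_jump:
            flags.append((lo, hi, round(jump, 1)))
    return flags

if __name__ == "__main__":
    print(articulation_flags(PERCENT_PROFICIENT))
    # -> [(5, 6, 6.7), (6, 7, -6.9)]: the grade 6 cut would get a second look
```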

And Now, a Gentle Plea to the Reader:

I received word on the day I write this that the commissioner has made a final decision on the cut scores.  I am not at liberty to say whether the recommendation we made was the last word, but once the cut scores are announced I would like you, with kindness in your heart, to hold the same image I cling to a month later, the image that will have the most profound impact on how I channel this experience in my own teaching:

In the room where I sat for five days, I was among some of the most critical, thoughtful and intelligent teachers, administrators and college faculty I’ve ever met, all of whom were fiercely loyal to the students in their classrooms and communities. Despite the rigidly scaffolded and tightly constrained process of recommending cut scores, the educators in our room fought tirelessly for high standards and, at the same time, fairness to teachers and students.

  • Through gentle inquisition, they took the commissioner to task when he gave us our charge.
  • They challenged any and every part of the methodology that seemed problematic.
  • They thought about how these decisions would impact teacher and principal evaluations.
  • They pushed back hard at the reality that the cut score decisions could actually diminish the quality of education students—especially non-white students, ELLs and SWDs—would experience on a daily basis.
  • They realized that at the end of the day many questions would remain unanswered or unaddressed. Though our facilitator was lovely (a psychometrician from Pearson who was as intelligent and kind as she was passionate about the work she was doing), she was not a policy-maker.
  • They drank beer.

What I Took from This:

It was not quite what I had hoped or expected.  I wish I could say I now have answers to satiate my students’ hunger for the best practical answers to their instructional quandaries related to these tests.  If anything, my thoughts about that are slightly more muddied.

While I am required here to be vague about specific data, details and conversations, I trust that the discerning eye of the critical practitioner might read between these lines.  But I will be frank when I say that it has never been so clear to me that the dataphilia that is now the culture of our profession is not non-ideological.

My geek-life hero, Marilyn Cochran-Smith (among others), has written that teaching is never neutral. Every single thing we teach and how we choose to teach it is political, including how and what we assess and how we evaluate those assessments.

That admonition has never felt so real to me.  I am heartened by the vehemence with which the professionals in that room pushed back, working within the system in order to simultaneously work against it.

And that is my takeaway.  I was glad to be at the table.  I wish they had given us ten more days or two more years to make some substantive changes.  And I hope thoughtful people with a dog in the fight, like the ones I met in Albany, continue to fight to have their voices heard.

~ Maria Baldassarre-Hopkins

26 thoughts on “New York State Cut Scores: From the Inside”

  1. It seems the commentator has a hard time admitting that the process was thoughtful and done with integrity. It was heartening to hear of all of the care all sides put into the process. This piece is the best thing I’ve read about Pearson in a long time.

  2. (Mr. Spock voice): “Fascinating.” As thrilling a read as any great suspense novel. Thank you for shining the light on this process.
    Yes indeed, we can read between the lines; I can imagine a few times “everybody was kung fu fighting” for what they thought was right.

  3. As far as the ELA cut scores this year, the range for a 4 expanded and the range for a 3 contracted severely. Looking at the NYC numbers, this resulted in many more 4s (roughly 500% more in grades 6 and 8, and more than 150% in grade 7), and many fewer 3s, as high 3s became 4s and low 3s became 2s.

    Grade 3 was an outlier in this trend, which is odd in itself, but I can’t figure out the reason for the trend to begin with, and I think it lends itself in future years to a re-tightening of the range for a level 4, which is going to leave a lot of kids who got 4s this year very confused.

    A quick look at the data is here.

  4. nycsblog, this came up in discussion and was a concern. And I don’t have any answers, per se, but rather points to consider:
    1) Proficiency is limited to levels 3 and 4. If you compare the sum of 3s and 4s from 2012 to 2013, it works out that overall proficiency decreases across the board (not news). We were strongly encouraged to think in terms of “proficiency” and not individual levels when looking at the data we generated.
    2) I do wonder, though, if this was also, in part, a result of the design of the process. We coded for 3 first, then 2, then 4. This (possibly) meant that considerably more time and energy went into placing that level 3 bookmark, less with a 2 and even less with a 4. This may be neither here nor there, but it’s something I want to think more about.
    3) I really hope I am not trivializing what you write here (not my intention), but I think part of the problem is that NYSED didn’t change the names of the levels. They were very emphatic about the idea that a “4” last year is not the same thing as a “4” this year, and at least pretended to toy with the idea of changing the names of the levels to something different. It seems like these vestiges of the “old” system interfere with our ability to make sense of the “new.”

  5. I really appreciate your time and the thoughtfulness of your reply; it was just one of the things that jumped out at me as I started filtering through the data, and it seems like this happened only on the ELA side.

    The thing I see working with kids at both ends of the curve is that it’s the ones near the top who tend to take these results very seriously. A sudden shift like this (especially since they’re not likely to hear any more than “3” or “4”) is going to make it difficult for them next year if they work just as hard but the range for a 4 tightens and the range for a 3 widens, more in line with the way the cuts were done in recent years, and they end up with a high 3 when this year they had a solid 4. To some of the kids I work with, that 4 is really important, so I think there should be more consistency in the way that cut is marked, or we should just go to a pass/fail model.

    You write, “I wish I could say I now have answers to satiate my students’ hunger for the best practical answers to their instructional quandaries related to these tests. If anything, my thoughts about that are slightly more muddied.” Leaving the instructional quandaries aside, how do you feel about the cut selection process? Was there any serious discussion of scrapping the current model completely and going in an entirely different direction — simple student percentile rankings, perhaps, or something else? And did you come away with any ideas of your own for what you’d like to see done in the future?

    Thanks again for your efforts, both on the days you describe and here helping spread some understanding of what happened.

  6. Quick correction: I wrote in point 2 above that we rated first for 3, then 2, then 4. It was actually 3, 4, then 2. This likely has little import, but there you go.

    There were no conversations about scrapping this model to my knowledge, other than a quasi-admission that staying with the 1-4 system is confusing. There were musings (“Just put it on a damn curve!”), but as we had pretty limited contact with policy-makers in open discussion, there really wasn’t space for that. When questions like this came up, we were gently redirected to the task at hand (after some room was given to talk) and encouraged to share our thoughts with the Network Team. (Also, the “curve” thing up there was a joke.)

    I agree with you that the numbers do actually matter to the students (to say nothing of teachers and parents). I have no real ideas other than to say that if proficient or not proficient is the actual concern, then let’s just call it that. But. That’s not the only concern:

    The data Pearson has on test takers based on how they answer questions is unbelievable. So, for us, a 4 is a 4, right? They are actually able to ascertain, using the magic of psychometrics, what questions a level 4 answered correctly because she/he knew the answers, which ones she/he answered incorrectly but gave a really good effort, and which she/he just took a random stab at (for better or for worse). I’m not sure if or how these data will be used, but I found it absolutely fascinating.
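What Maria describes here sounds like the “guessing” parameter familiar from item response theory. As a purely illustrative sketch (a standard three-parameter logistic model with invented numbers, not a description of Pearson’s actual machinery), the decomposition of a correct answer into “knew it” versus “lucky guess” looks roughly like this:

```python
import math

def p_star(theta, a, b):
    """Probability the student actually 'knows' the item (the 2PL core of the 3PL)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response: know it, or guess it (probability c)."""
    p = p_star(theta, a, b)
    return p + (1.0 - p) * c

def p_knew_given_correct(theta, a, b, c):
    """Posterior chance that a correct answer reflects knowledge, not a lucky guess."""
    p = p_star(theta, a, b)
    return p / (p + (1.0 - p) * c)

if __name__ == "__main__":
    # Invented parameters: a strong student (theta = 1.5) on a very hard
    # four-option multiple-choice item (b = 2.5, guessing floor c = 0.25).
    theta, a, b, c = 1.5, 1.0, 2.5, 0.25
    print(f"P(correct)           = {p_correct_3pl(theta, a, b, c):.2f}")
    print(f"P(knew it | correct) = {p_knew_given_correct(theta, a, b, c):.2f}")
```

The c term is the floor that pure guessing provides on a multiple-choice item; whatever sits above that floor gets attributed to ability, which is the sense in which a model can call a correct answer a “random stab.”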

  7. Thanks so much for your fascinating observations. One question: Many teachers and students commented on the poor quality of the exams this year: that the tests were too long, too difficult for their grade level (with whole sections repeated at different grade levels), full of ambiguous and confusing questions, with many featuring distracting commercial logos besides. I received hundreds of such comments, many of them included on my blog at http://shar.es/yemBe. Teachers reported that many top students didn’t finish or left in tears. Did any of the panelists notice these features or observe that the exams themselves were poorly designed and thus not likely to give accurate results?

  8. Leonie, thanks for the critical insights here. Your observations and those of your commenters are so important and should absolutely not be ignored or glossed over. Period.

  9. “But I will be frank when I say that it has never been so clear to me that the dataphilia that is now the culture of our profession is not non-ideological.”

    This deserves its own post, IMO.

  10. This is probably here but I confess I am getting a bit lost in all this. It’s fascinating, though.

    A question: Did Dr. Baldassarre-Hopkins and the others in her group know how students had scored (raw scores, of course) when they came up with the proposed cut scores? I assume Commissioner King had that information when he announced the final cut scores. Any reasons to think he didn’t?

    One thing to add. In the spring, I spoke with a principal of a school that does well on tests about whether the administration and staff at the school were concerned the new tests could end their record of strong showings. The principal said this would depend entirely on how the state scored the tests (whether there would be a curve) and, this principal continued, the city and state had not disclosed that information some six weeks after students had taken the tests.

    It’s difficult to know what good purpose could be served by that kind of secrecy, but I’d be happy to know if anyone has any idea.

  11. The biggest weakness of the process seems to be the “Here is how students who are successful in college do on these ‘other’ assessments.”

    While Pearson may have a reasonable sense of how an 11th grader will do in the first year of college based on 11th-grade scores, it has much less sense of how an 8th grader will do, and even less of how a 3rd grader will do based on 3rd-grade scores. Consequently, tying the definition of achievement in early grades to college readiness *by its nature* raises expectations in the early grades much more than in high school, in order to achieve similar certainty of college success.

    In other words, even if the expectations of 11th graders are reasonable for college success (and I wouldn’t bet too much on it), the expectations in grades 3-8 are inflated, probably by a lot.

    But in this case there was also a political override to set the final cut scores. Not really surprising, and nothing wrong with it in principle, BUT THAT SHOULD HAVE BEEN CLEARLY STATED by NYS officials. It was not.

  12. “We were given more data in the form of p-values for each question in the OIB – the percentage of students who answered it correctly on the actual assessment.”

    “But before we began, we were told which page numbers correlate with external benchmark data (I could tell you what those data were, but then I would have to kill you). So, it was sort of like this: “Here is how students who are successful in college do on these ‘other’ assessments. If you put your bookmark on page X for level 3, it would be aligned with these data.””

    Regarding the second quote: NYSED has already said the external benchmarks are the SAT and PSAT, and their correlation to first year college GPA:

    http://www.p12.nysed.gov/assessment/reports/summary38externalbenchmarkstudies.pdf

    Can you confirm this, Maria?

    Let’s be honest: the SAT remains a norm-based assessment. And the reason they gave you the p-values for the items was so you could figure out which students in a normal distribution would answer each item correctly.

    Any notion that the NY tests are criteria-based is pretty much shot at this point. Your sharply-dressed psychometricians were there to ensure that the percentages matched up with where they thought they should match up. The entire thing was reverse-engineered to create cut scores where about 30% of the students were “college ready.” “The Matrix” is exactly right: reality is what they create.

    I feel for you, professor. You tried to bring some integrity to a process that clearly has none.
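To restate the commenter’s reverse-engineering argument as arithmetic (a sketch of the claim, not a statement about what NYSED actually did, and with an invented score distribution): if the target percentage is chosen first, the cut is simply a percentile of the observed scores.

```python
import math
import random

def cut_for_target_percent(raw_scores, target_at_or_above=30.0):
    """Return the lowest raw score that leaves roughly `target_at_or_above`
    percent of students at or above the cut."""
    ordered = sorted(raw_scores, reverse=True)
    k = max(1, math.ceil(len(ordered) * target_at_or_above / 100.0))
    return ordered[k - 1]

if __name__ == "__main__":
    random.seed(0)
    # Invented raw-score distribution for illustration (a 0-60 point test).
    scores = [min(60, max(0, int(random.gauss(32, 9)))) for _ in range(10000)]
    cut = cut_for_target_percent(scores, 30.0)
    share = 100.0 * sum(s >= cut for s in scores) / len(scores)
    print(f"cut = {cut}, leaves {share:.1f}% of students at or above it")
```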

  13. The only way the shenanigans and chicanery can be avoided from state to state on the cut scores is for the two consortia to establish federal cut scores. Everything else, sadly and according to history, will be politicized.

  14. Jazzman, the document you shared includes an accurate description of some of the external benchmark data, yes.

  15. Gail, we did not know how students scored; we only knew the percentage of students that answered each item correctly.

  16. Jazz, I can confirm based on an e-mail we received from SED. I did not realize this was made public. My informed assumption was that it would not be.

  17. Thanks so much for sharing your experience. Do you believe NYS students are accurately described by the scored results/proficiency descriptors of these tests? (Not whether the defined process of determining cut scores was done according to protocol, but your overall professional judgment?)

  18. Mary, I think your question deserves an entire conversation … one that begins with a discussion of the fairness and developmental appropriateness of the tests (see Leonie Haimson’s comment above) and ends with teachers with their boots on the ground talking about what they see their students doing on a daily basis in their classrooms. In terms of what I can say to this end: We were as judicious as possible given the constraints of both time and methodology.

  19. Thanks for your response, Maria. I would love to see that conversation take place in Rochester. You have made an extremely valuable contribution with the discussion here. I wonder if any of your colleagues (at Naz or any of the other local Ed programs) would facilitate a conversation among teachers. There are some ongoing discussions among parents about testing concerns and refusal, etc., but the teacher dialogue is critical. I think there is an underestimate of “passive refusals,” especially in older grades, in which students sit for the tests but work with minimum effort or answer randomly. The psychometricians have checked for that, I’m sure, but Pearson and NYSED are not likely to publish any findings.

  20. It’s an outrage that deliberation on a matter of public policy should not be fully disclosable to the public, unless literally a matter of life and death. The demand that participants in the process sign a non-disclosure agreement is proof positive that the process was fraudulent and corrupt, and that the people in charge of it are acting in complete bad faith. Those who participated in it were conned. As has been the public at large.

  21. Mary, that’s a great idea. Would love to talk more with you about what you think an opportunity for that conversation could look like. Feel free to shoot me an e-mail.

  22. Very much enjoyed this post and am grateful that you took the time to publicly share your thinking on a controversial topic that is so important to all of us. You gave what looks like a sincere accounting of what had to be difficult work. Thank you, thank you–for the work and for the glimpse into your experience.
