mathismore wrote: We must test what we value, both locally and nationally.
Mathematical literacy is becoming a survival skill. We strongly believe in accountability to a rich set of mathematical goals. We want students to master core facts and procedures, but this is not enough. We want conceptual understanding, problem-solving, and flexible use of the mathematics to solve both pure and applied problems. Like standards, assessments must reflect our goals—most importantly, the ability to apply mathematical reasoning to analyze and attack real-world problems. If mathematical literacy includes the ability to make use of mathematics, and we believe in the importance of mathematical literacy, then we must align our testing accordingly. Testing must not be about punishment for failure, but about giving students and teachers a clearer understanding of what they do and do not know. Testing should inform instruction, not determine it.
While I agree wholeheartedly with the above, assessment and its misuses and abuses raise many difficulties. Some kinds of assessment cost far more than, say, machine-scored multiple-choice or student-generated numerical-response tests. There are also significant questions about the validity, and especially the reliability, of scoring alternative assessments such as the free-response and open-ended questions that appear on some state math tests.
It should go without saying (but sadly does not) that experts in psychometrics and companies such as the Educational Testing Service constantly warn against the abuse and misuse of test scores, and that those warnings increasingly go unheeded. We see test scores used to determine or measure things that the tests were not designed to measure. We see single test scores used as the sole or primary determinant in evaluating a host of things. Sometimes mere ignorance is the culprit, which is bad enough. But increasingly such abuses are fueled by political agendas. People who should know better are willing to use a test of student achievement to attack a particular philosophy of teaching or learning. Data is cherry-picked to promote or attack a particular textbook, pedagogical approach, teaching tool, or technology. We see science ignored, suppressed, flogged, or distorted in order to convince the public that one way to teach a subject, one sort of text, one kind of classroom, one set of resources, etc., is the best and only one that suits all students in all circumstances.
To get specific about test reliability, I have done something perhaps few reading this list have done: I recently worked as a "reader" (meaning a scorer) for middle school math "free response" problems given by one of our most affluent and educationally progressive states. Non-disclosure agreements prevent me from identifying the state, exactly whom I worked for, or the actual content of the problems. But I have to say that I witnessed things that were standard operating procedure during the scoring process, ostensibly approved and/or ordered by the state itself, and that, if widely known by parents in that state, would likely lead to a host of lawsuits or at least a public outcry for major changes in the test and/or its scoring.
What I can say without risk, I believe, is that the interpretation of how a particular item and a particular kind of response to it should be scored sometimes changed while scoring was under way. This might result in something that readers had previously been told should not get credit suddenly earning credit OR just the opposite. When we asked whether that meant we would go back and RE-SCORE the papers already scored, we were repeatedly told that we would not. The reason is rather obvious: working on a deadline, already facing innumerable technological delays that put us at risk of not finishing the scoring on time, there was no way the state or the company was going to re-score any papers. This happened not on one item my team scored, but on most of them. There were many other teams scoring other items from these grades and other grade bands, and I know that our experience wasn't unique. So just how reliable are those scores, when this practice appears to run counter to the notion that the same answer should receive the same score no matter who produces it, when they produce it, or who scores it? Isn't that the essence of reliability? Isn't it the essence of fairness?
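The "same answer, same score" criterion can be made concrete with a small sketch. Everything in it is hypothetical: the response labels, the scores, and the function names are invented for illustration and have nothing to do with the actual test, which I cannot describe. It simply flags any response that received more than one distinct score, which is what happens when a scoring rule changes midstream and the earlier papers are not re-scored.

```python
from collections import defaultdict


def inconsistent_responses(records):
    """Map each response that was scored more than one way to its distinct scores.

    `records` is an iterable of (response, score) pairs, one per scored paper.
    """
    scores_by_response = defaultdict(set)
    for response, score in records:
        scores_by_response[response].add(score)
    return {r: sorted(s) for r, s in scores_by_response.items() if len(s) > 1}


def consistency_rate(records):
    """Fraction of distinct responses that always received the same score."""
    distinct = {response for response, _ in records}
    if not distinct:
        return 1.0  # nothing scored, so nothing scored inconsistently
    return 1 - len(inconsistent_responses(records)) / len(distinct)


# Hypothetical scoring log: "answer A" is scored under two different rules.
records = [
    ("answer A", 0),  # scored before the interpretation changed
    ("answer B", 2),
    ("answer A", 1),  # identical answer, scored after the change
    ("answer C", 1),
    ("answer B", 2),
]

print(inconsistent_responses(records))  # {'answer A': [0, 1]}
print(consistency_rate(records))
```

A real reliability study would be far more involved than this, but even this crude check could not be run on the papers I describe, because nobody went back to compare early scores against late ones.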
Knowing that this practice is likely not unique to this administration of this particular test by this particular state, would politicians, parents, researchers, schools, teachers, administrators, kids themselves, or any reasonable person be quite so willing to give weight to the results of at least these non-machine-scored items?
And yet, we need MORE and BETTER non-machine-scored items. Part of the tragedy here is that many of those most willing to attack reform in mathematics and other areas of education are huge abusers of test scores and are quite willing to see these expensive, difficult-to-write, difficult-to-score items deep-sixed permanently. They believe in "objective" items, which are the ones least subject to the criticism offered above, but also often the least informative and least useful for promoting assessment as part of learning and instruction, rather than as a goad to drive and determine what is taught and how.
I've barely scratched the surface here, but I hope some of what I've written will provoke a lot of thought among the reasonable and knowledgeable readers of this forum. Not only should we assess what we value, but we'd best give serious consideration to what message we're sending when we decide what to assess and how, as well as in the ways we use the results. I'm sure that sentiment is at the heart of Plank 7, but there's a lot of very ugly stuff lurking under the wood that needs to see the light of day.