What's behind a scaled score
Superintendent Dr. Teresa Thayer Snyder
(August 18, 2010) I admit that I am a long-time skeptic as to the value of some
of the New York State testing regimen, especially at the elementary
level. I often tell people about my experiences as an elementary school
principal when the tests were first initiated for fourth and eighth
graders. The nine-year-olds took ELA exams, Math exams, and Science
exams. These exams were designed to give us hard data about student
performance. I observed levels of stress among the students and
collected my own hard data—how many fourth graders visited the school
nurse during testing weeks compared to the rest of the year. You can
probably guess that visits spiked remarkably. I decided at the end of
one testing cycle to surprise my fourth graders with an ice cream party.
When I announced that all fourth graders should come to the cafeteria,
one little boy asked his teacher if he should bring his number 2 pencil!
A few years later, the grades 3-8 tests were initiated. I admit my
skepticism remained intact, but when it was announced that all children
would be proficient by 2014, I actually became prophetic. “Well, what
that means is the scaled scores will increase gradually to give the
illusion that there is growth towards attaining that lofty goal of 100%
proficiency, despite the fact that the lofty goal makes no sense at
all.” You do not need too much of a background in educational statistics
to realize that if you give such a test and everybody passes, it is a
bad test, just as it is a bad test if everybody fails. You may recall
this occurred a few years back on the
Regents Math exam, which was re-scaled because so many students failed
it. To say all will be proficient means the results have to be
statistically tweaked to accomplish this. It is rather like the Lake
Wobegon Effect, where all the children are above average.
Now we have a new dilemma. In order to prove that New York State tests
have been soft, this year’s versions were bumped up and moved to the end
of the year. We educators were told that these assessments would be
testing a good deal more than previous ones. Once they were completed
and sent off to the State, an announcement was made that indicated a new
cut point was implemented for determining levels of performance. I am
sure you have read the local press regarding the shock that school
people and parents will be feeling when they see that a child’s
performance level on the State exams has been seriously affected by the
new cut point. But again, that old skepticism of mine inched forward.
One of my administrators pointed out an irregularity. On a third-grade
math test, a child who answered 37 out of 39 questions correctly (that
is 95% correct) was deemed performing at level 3. A child who answered
85% of the questions correctly was deemed performing at level 2. Some
children, who were performing at level 1, actually had scaled scores
that would have put them at level 3 last year. I scratched my head over
this and wondered if this was an aberration on the grade 3 math
assessment. As I began to dig through the data, I found that it was not
an aberration, but was true across all tests, in both ELA and math. I
contacted a couple of colleagues who advised me to remember that you
can’t compute percentages and compare them to scaled scores because test
items are weighted by difficulty. So I ran an error analysis, and I
learned that the errors the students made were random (especially for
higher-scoring students), which suggests item difficulty did not impact
their outcomes.
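
For readers who want to check the arithmetic behind the "percent-correct" view, here is a minimal sketch in Python. The function name percent_correct is my own label, not anything the State uses, and the sketch deliberately treats every item as worth the same, so it shows only the raw-score picture and says nothing about how the State converts raw scores to scaled scores or where it sets cut points.

```python
def percent_correct(num_correct: int, num_items: int) -> float:
    """Return the share of items answered correctly, as a percentage.

    Assumption: every item counts equally (no weighting by difficulty).
    """
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    if not 0 <= num_correct <= num_items:
        raise ValueError("num_correct must be between 0 and num_items")
    return 100.0 * num_correct / num_items


# The third-grade math example from the text: 37 of 39 items correct.
print(round(percent_correct(37, 39)))  # 95

# How quickly the percentage falls with each missed item on a 39-item test.
for missed in range(0, 7):
    correct = 39 - missed
    print(missed, "missed:", round(percent_correct(correct, 39), 1), "% correct")
```

On a test this short, each missed item moves the percentage by roughly two and a half points, which is why a one-error margin between levels is so narrow.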
Since psychometrics is not my passion,
I decided to consult a guru of psychometrics whom I met several years
ago. Dr. W. James Popham is professor emeritus of statistics and
educational measurement from UCLA. He is an authoritative author and a
renowned scholar in this field. When I emailed him my dilemma and asked
if I was on track in looking at these data sets differently, he
generously responded: “One of the problems with scaled scores is that,
although their potentials for analysis are considerable, it is really
impossible to make any sense out of them. Thus, when you use a
‘percent-correct’ prism in an attempt to interpret the meaning of your
school’s scaled scores, this is a really sensible thing to do. I wish
more educators would be sensible!”
I have continued to correspond with him
as I have dug deeper and his response has been consistent: “Your
analysis is one way of trying to figure out on your own what’s meant by
these mystical numbers.” Dr. Popham has written a book I recommend for
parents called Testing! Testing! What Every Parent Should Know
About School Tests.
As I have been analyzing the data sets
through this new prism, I am finding that the cut points were adjusted
to force more students into levels 3 and 2, even though they answered
large percentages of the questions correctly. The handful of children who
scored less well are children with whom we are already working to help
them achieve. But I have to be honest: I simply cannot look a parent in
the eye and say your child is performing at level 3, despite having
answered 95% of the test questions correctly. On many of the tests, the
margin for reaching level 4 is no more than one error. I have decided
that, given this different view of the data sets, when we receive the
parent report to mail home, I will be attaching a label which will tell
parents the percentage of items their child answered correctly. I hope
this will make the testing agenda more transparent and will ease undue
anxiety about student learning. I am all for rigor and I am surely in
favor of educational reform, but I don’t want it carried on the backs of
school children. From my point of view, you don’t make a poor assessment
stronger by making it harder to pass—you reform the assessment.
Respectfully submitted,
Dr. Teresa Thayer Snyder, Superintendent