Thank you very much. I am very happy to be here. I realize that my speaking tonight is a double controversy.
First of all, when we think of a place such as Café Scientifique we usually think of the physical sciences such as
chemistry, biology or physics. I am a social scientist and so we are now meeting at Café Sociale Scientifique.
Social scientists are the Rodney Dangerfields of science. "We just don’t get no respect." Many may hope that my being
here tonight does not start a trend and I can appreciate that.
The second controversy deals with my topic. Academic testing can be controversial, especially as encouraged
in unfunded federal mandates. I can appreciate that also. However, the misuse of testing can be dealt with much better if
we look at what is the appropriate use of testing. I feel I have friends at FairTest.org because I was once a lurker of their
on-line discussions and found much to agree on. I have the greatest respect for them and I certainly do not intend to gloss
over the dangers of the misuse of tests. However, what I have to say points out the good use of testing
and it does not necessarily lend itself to addressing all the possible problems that are associated
with testing as practiced today in the academic policies of our nation.
I would like to address the question “Are the social sciences really science?” Yes, they
are. Psychology made its debut as a science in 1879, when Wilhelm Wundt opened a psychology laboratory at Leipzig. Since then psychological
studies have striven to meet as high a standard as possible as scientific research. Unfortunately, there are many psychological
studies that have missed the mark. Especially with the popularity of Freud and his unempirical findings in the early part
of the twentieth century, psychology as a science suffered a set-back. To make up for this, the behaviorists in the mid-century
greatly criticized all who tried to study anything but observable behavior. To the behaviorist the mind was a “locked
box” that no one could look into. Cognitive psychologists who had very interesting experimental techniques for studying
mental processes were looked down on by the behaviorists as unscientific.
This criticism would be similar to telling physicists that they cannot study subatomic particles
because no one can see them, even though the physicists produce replicable evidence that the particles not only exist but operate exactly
in the theoretically described way. Therefore, two drawbacks in the twentieth century for psychology as a science were
1) the rise of popular psychology and 2) paradigm dominance.
Another problem for psychology as a science is its roots in philosophy. If you remember correctly,
the physical sciences also had their roots in philosophy. It was not long ago that the physical sciences were referred to
as natural philosophy. Also, long before Socrates there was a bright young man named Thales of Miletus who believed that the purpose of life was to observe everything. He observed and recorded so much that he realized
he could predict climate changes. One summer he recognized that the conditions were right for a bumper crop of olives and
he rented all the olive presses for the summer. He made a killing. He just wanted to get back at all the people who
asked him “If you are so smart, why aren’t you rich?”
The social sciences are too close to their philosophical roots. Anyone can have a philosophy (and
most people do). Not everyone can have a sound psychological theory, but too many unqualified people claim they do. They justify
themselves by saying everyone is entitled to an opinion. Let me think about that a minute. No, you are not. If your opinion
is based on propositions that have been proven false, espousing it can cause too many problems, especially if you are in a
position of influence. And I think this is where the opposition to testing in the schools might come in. School officials
are constantly under pressure to show results or to recognize problems and to just do something. They are often in a position
where doing anything seems better than doing nothing. In the seventies and eighties, when the curriculum was made more demanding
even down to the early grades, there was a cry to “Do something!” for all the kids who were in no shape to take
on the heavy curriculum. These young students were criticized for not being ready for school. It is easier to blame the students
for not being ready for school than to blame the schools for not being ready for the students. At that time some “readiness”
tests were developed to assure that children would not start school unless they were ready to take on the extra work. Critics
of these tests often focus on the cause of these tests which is the lack of developmentally appropriate curriculum. However,
what should more seriously be considered is the fact that schools were buying a measurement instrument for “readiness,”
something that had not yet been proven to exist. You might as well buy a tape measure to measure the air.
When psychological constructs, such as intelligence or academic achievement, are studied in a scientific
manner, the results can be as undeniable and sustainable as the results of studies in the physical sciences. That is, to the
degree that the scientific method is applied in the social sciences, the results are just as accurate as the results in the
physical sciences. The problem is: studying humans in true experiments is not always easy. Many times the independent variable
of interest is something negative or harmful, such as neglectful upbringing; therefore, we would not want to introduce it
to randomly selected individuals. What if I divided up the people in this room, and this half has to live with neglectful
parents and this half gets attentive parents. Let’s see how you turn out. It is absurd to think of such a thing. So
we have to approximate the random assignment of participants and the manipulation of the independent variable. To some degree,
the accuracy is therefore watered-down.
Another problem with studying humans is that humans are the only organisms that can determine their
own behavior and motivation. If we do not like the findings of a study, we can decide that it does not apply to us. This is
good if the findings are ominous. Research says there is a correlation between neglectful parents and the self-efficacy of
their children. However, enough young people decide that that finding does not apply to them and go on to live productive
lives. We all know people who claim to be unpredictable. Such people would deny
themselves benefits of psychological results just to prove how unpredictable they are. Trying to negate the effects of an
independent variable may not be worth the effort but it can be done.
These two phenomena are the reasons why we say that psychological science investigates “principles
of behavior” and not “laws of behavior”. Whether we are studying laws or principles or behavior or mental
processes, we could get a good handle on things if reporting followed the scientific method: 1) if the hypothesis is expressed
in testable terms, 2) if methods are reported clearly, even those that acknowledge limitations to the study, and 3) if findings
are reported accurately, even those that undermine the acceptance of the stated hypothesis. These three points are the most
important components of the scientific method for separating social sciences from everyday thinking. Everyday thinking has
hypotheses, research methods and reporting of findings but we cannot trust those results even though people do. Grant Dahlstrom
of UNC Chapel Hill calls the everyday approach something you would expect “Aunt Fanny” to say. How many of you
have an aunt who is always making predictions about the nieces and nephews, usually punctuating her predictions with “Mark
my words! I have always seen it happen with a kid like that.” The problem with the Aunt Fanny approach is that what is proposed
is based on trivial commonalities which minimize important differences. Furthermore, those who make these kinds of predictions
often note well the experiences that support their predictions while ignoring non-occurrences, experiences where the predictions
did not come true. Everyday predictions are not stated in testable terms. If you caught someone by disproving an everyday
prediction, he/she would have some “out” such as “Well… I did not mean all the time!” Well…
if you are searching for sustainable truth, you had better mean all the time or at least describe the conditions under which
it may not be true. Everyday thinking is all right for everyday situations; however, if you are making serious decisions that
will affect the lives of others, you should rely on the most accurate information based on scientific research. This was one
of the problems with the readiness testing I described above. Even though the test did predict some students who would have
trouble in the early grades, the accuracy rate was 50%. What about the others? Half the students who maybe needed help were
not identified while half of those identified did not need any extra help. If the parents of this latter group had followed
the advice of the readiness people, those students would have needlessly wasted an extra year waiting for first grade. If
you confronted the developers of the readiness test, what would they say? “Well… I did not mean all the time.”
They noticed the occurrences and not the non-occurrences.
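The two kinds of error described above can be tallied in a small table. What follows is only a sketch: the talk gives the 50% rates but no head counts, so the class of 100 children, the function name and every number below are hypothetical, chosen just to reproduce those rates.

```python
def screening_summary(flagged_needed, flagged_not_needed,
                      missed_needed, cleared_not_needed):
    """Summarize a screening test from the four cells of a 2x2 table.

    All four arguments are head counts (hypothetical in this sketch).
    """
    needed = flagged_needed + missed_needed        # children who really needed help
    flagged = flagged_needed + flagged_not_needed  # children the test singled out
    return {
        # of the children who needed help, what share did the test catch?
        "share_of_needy_identified": flagged_needed / needed,
        # of the children the test flagged, what share really needed help?
        "share_of_flagged_who_needed_help": flagged_needed / flagged,
        # the occurrences everyone notices are the hits; these are the rest
        "held_back_needlessly": flagged_not_needed,
        "missed": missed_needed,
        "total": needed + flagged_not_needed + cleared_not_needed,
    }


# Hypothetical class of 100: 20 children need help and the test flags 20.
summary = screening_summary(flagged_needed=10, flagged_not_needed=10,
                            missed_needed=10, cleared_not_needed=70)
for name, value in summary.items():
    print(name, value)
```

Counting the misses and the needless hold-backs alongside the hits is what the "Aunt Fanny" approach leaves out.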
I have had much experience with testing and I am supposed to tell you what really goes on behind
the scenes. I worked with a school district in upstate New York to develop an achievement test
for Math mastery at the fourth grade level. The objectives that the teachers felt their students should know by the end of
fourth grade were rather limited. However, the whole universe of test questions had to be considered in order to decide which
of those questions best represented all the others. Why was 4 + 5 a better representative than 6 + 7? I don’t know, but in order to
reach the point where I could say it was a better representative, I had to give all the possible test items to representative
students. I had third graders take the test to see which questions were consistently answered correctly by the third graders,
meaning those questions would be too easy for a fourth grade math test. I also had some fifth grade students take the test
to see if there were any questions that the fifth graders consistently had trouble with. Those, of course, would be too hard.
I gave the test questions to representative fourth graders of various math ability levels. Did the question distinguish between
the highly competent fourth graders and the ones who had trouble? If any question was so easy that all the representative
fourth graders got it right, it was left off. Whatever questions were left after all these eliminations found their way onto
the achievement test in some way. The teachers also wanted word problems in addition to the math facts and math operations
questions. I got creative and developed some word problems. My adviser was interested in how humor enhanced or distracted
from testing, so I allowed him to include a humorous item.
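The screening of candidate items described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure actually used for that test: the cut-off values (0.9, 0.5, 0.2), the function names and the sample responses are all hypothetical, and "discrimination" is taken here as a simple difference in proportion correct between stronger and weaker fourth graders.

```python
def proportion_correct(responses):
    """Share of students who answered the item correctly (1 = right, 0 = wrong)."""
    return sum(responses) / len(responses)


def discrimination_index(high_group, low_group):
    """How much better the stronger students did on the item than the weaker ones."""
    return proportion_correct(high_group) - proportion_correct(low_group)


def keep_item(third, fourth_high, fourth_low, fifth,
              too_easy=0.9, too_hard=0.5, min_discrimination=0.2):
    """Apply the eliminations from the talk; all thresholds are assumed values."""
    if proportion_correct(third) >= too_easy:
        return False  # third graders consistently get it right: too easy
    if proportion_correct(fifth) <= too_hard:
        return False  # fifth graders consistently have trouble: too hard
    if proportion_correct(fourth_high + fourth_low) == 1.0:
        return False  # every fourth grader gets it right: tells us nothing
    # finally, does the item separate competent fourth graders from struggling ones?
    return discrimination_index(fourth_high, fourth_low) >= min_discrimination


# Hypothetical responses to one item from each trial group.
print(keep_item(third=[0, 1, 0, 0], fourth_high=[1, 1, 1, 1],
                fourth_low=[0, 1, 0, 0], fifth=[1, 1, 1, 1]))
```

An item survives only if it passes every check, which is why a large starting pool of questions is needed.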
The humorous item was a cartoon of a little girl named Amy. In the cartoon, Amy arrives at her friend’s
house eating an apple and carrying some banana skins. Amy says “Do you know that your house is two bananas and half
an apple from my home?” The math question was “If Amy can eat one banana in the time it takes to walk one block
and one apple in the time it takes to walk four blocks, how many blocks is Amy’s house from her friend’s house?”
It was cute, but did not produce any discernible added effects.
I left the administration of the test up to the individual teachers. Some of them used it as a mid-term
test and some gave it to the students as a test that does not count. In one interesting case, a student whom the teacher thought
would do exceptionally badly ended up acing the test. When she asked him “Why?”, he said “Because you told
me it didn’t count.” He lost his test anxiety and performed better than ever.
Developing appropriate test items can be quite a painstaking chore but it has to be done, no matter
what the test is. I was involved in developing test items for the MCAT in 1994. I was one of several item developers around
the country. There was a serious need for as many test questions as possible, so that in due course the number of items could
be whittled down by eliminating those that did not work as expected. Nine years before that, I am sure someone similar
to me was working on the Political Science GRE and coming up with equally interesting items for test-takers like me.
When I took the Political Science GRE, I thought I knew what it was I did and did not know. However,
one question fascinated me so much I remember it to this day. Anyone who has studied political science knows that Karl Marx
believed that an ideal government could be formed as a “dictatorship of the proletariat”. This I knew, so I thought
I could easily get any question correct if I was asked about this objective. Well…. The question on the GRE turned out
to be:
“Karl Marx believed that an ideal government would be:
A) a dictatorship of the people
B) an authoritarian rule of the proletariat.”
with two or three other possible choices that could not be correct.
So what was I to do? One of the choices had the one key term and the other choice had the other key
term. Long after I finished the test, I thought about it and thought about it. Years later a friend in the testing industry
suggested that there was no correct answer. He felt it was probably an experimental item. It would not count the first time
it was introduced but if most of the top performers on the test chose one of the answers while the low performers chose the
other, the one chosen by the top performers would be the keyed answer for many years to come. How about that: democratic choice
on a political science exam!
On the contrary, recently someone with a political science background suggested that perhaps I remembered the
question wrong. He suggested that if the first choice had been “a dictatorship of the working class” it would
have been a better choice than the other. His reasoning was that “working class” is a better paraphrase for “proletariat”
than “authoritarian rule” is for “dictatorship”. This answer would probably be chosen by the student
who was more informed about what was behind the concepts.
There are two types of test based on their interpretation. One is the norm-referenced test and the other
is the criterion-referenced test. The first compares the test-takers with each other. The second rates the test-taker based on
the number of right answers. Intelligence tests are an example of norm-referenced tests; tests of mastery learning are criterion-referenced.
Norm-referenced tests can be problematic in that if you are a very smart person compared to even smarter people, you may look
bad. You might want to look out at the audience and say “Geez, tough crowd!”
The Graduate Record Exam is not necessarily interpreted as a norm-referenced test. However, it does
report a norm-referenced result. Our friends at FairTest.Org point out some of the hazards of norm-referenced tests. They
also point out the dangers of high-stakes testing. I do not see the GRE as representative of either of these issues. There
may be some people who are denied entry into graduate school based on their results on the GRE, but such a case is rare. Many
people begin graduate school with a few courses as a non-matriculated student. They earn the respect of their professors who
end up writing a recommendation for them. Those recommendations weigh more heavily on their application than any silly result
on the GRE.
I know! I took the Psychology GRE under much confusion and did not do so well. I thought I could guarantee
that no one would ever see those results. But I signed the wrong papers and my Psych GRE results ended up in the hands of
the admission committee when I applied for my doctorate program. Many other factors played in my favor and those results were
ignored.
The norm referencing of the Political Science GRE played in my favor when I was trying to finish my bachelor’s
through testing. My college, Excelsior College in New York, was a pioneer in non-traditional college
education where returning students did not have to attend classes to prove they had college-level knowledge. We took tests
such as the CLEP test for lower-level credit and tests designed by Excelsior College for upper-level credit. Many of you may be familiar with the CLEP test, the College-Level Examination Program, which
is a criterion-referenced test. The third type of test we could take at Excelsior was a subject GRE such as Political Science. At that time, Excelsior’s philosophy was that if you could do better on a subject
GRE than 50% of the test-takers, you had demonstrated college-level knowledge in that area. The idea was based on the assumption
that most of the test-takers had already earned their bachelor’s at a traditional school and if the Excelsior student was in
the upper 50 percent, then the Excelsior student must know as much as or more than the traditional student and therefore deserved
the credit. (I was in the 68th percentile. Would anyone like to hear my talk on ideal government?) Anyway, Excelsior has since
had to change its philosophy and standards because too many of the GRE test-takers were looking for credit in
comparison to those test-takers who already had their bachelor’s.
Another test familiar to many is the SAT. I have had some back-door experience with the SAT in my
work with an SAT preparation course. There are a couple of misunderstandings about what the SAT measures in addition to the
issues about how it is used. The construct that the SAT is measuring is “preparation for college”. One of the
components of “preparation for college” is “test-taking skills”. If you cannot understand how to answer
the questions on the SAT, your test-taking skills will get in the way of your succeeding in college, especially in upper-level
courses. If a freshman applicant to college does not do well on the SAT, they will probably not get accepted at their first
choice for college. However, I do not see SAT as high-stakes testing. First-choice for college is not necessarily the best
choice, especially considering the high dropout rate among college freshmen. Furthermore, many students who were rejected
by their first-choice college end up graduating from those same colleges. They attend another college for their first two
years and then transfer. Many transfer from a community college. If you do not know of any good community colleges, I can
show you one over in Alexandria. Sign up for my courses; I’m an easy “A”. (If you work hard
and pass the tests, I am an easy “A”.)
The developers of the SAT promote it as if it were an intelligence test. They say there is no way you
can increase your score. They also say that you should not guess because it will decrease your overall score. These are misconceptions.
The SAT can be compared to an intelligence test in the sense that both attempt to measure reasoning ability. However, the
abilities measured by the SAT are more conducive to training than those presumed to be measured by intelligence tests. There
are various test preparation programs for the SAT; they vary in their effectiveness. However, all of them will guarantee that
taking the course will raise your score by 100 points. If that is their only guarantee, it is a sure bet. Just taking the
SAT for a second time should raise your score by 100 points as you learn the “nature of the beast”; i.e., the
kind of questions that are asked. The questions in each section of the SAT are positioned with the easiest questions at the
beginning and the harder ones toward the end. If a student knows this, he/she can trust their feeling in the beginning of
each section, but should be more cautious of “trick questions” as they get closer to the end. The SAT administrators do not tell the test-takers this. A preparatory course would. Another way that familiarity
with the test can improve scores is recognizing what the question is asking when the format is a bit unusual.
In the Math Reasoning portion of the SAT, there is a section where each question consists of two numbers
or terms, one labeled “A” and the other labeled “B”. The test-taker is supposed to choose “A”
if the number or term labeled “A” is greater; “B” if the number or term labeled “B” is
greater; “C” if they are equal; and “D” if there is no way to tell. This is an unusual way to ask
a question and it must baffle many test-takers their first time. Their score for this section may not match their actual reasoning
ability. However, with a little practice, anyone should be able to demonstrate their reasoning. That practice has to come
outside of the testing time. This section has many questions that may be considered “trick questions”. For example,
if choice A is x and choice B is x²,
the knee-jerk response would be B because the whole numbers we usually encounter in everyday life are less than their
squares. However, we do not know if x is a fraction between 0 and 1, in which case x is greater than its square. The first-time test-taker would not be likely to consider such a possibility.
Test preparation would encourage this comprehensive thinking.
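The trap in that example is easy to see by trying a few values. The sketch below is only an illustration: the function names and the sample values are hypothetical, and the letter codes follow the answer format described above (A greater, B greater, C equal, D cannot tell).

```python
def compare(x):
    """Compare quantity A (x) with quantity B (x squared) for one value of x."""
    a, b = x, x * x
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "C"


def answer(candidates):
    """Pick the test answer: a single letter only if every candidate agrees."""
    outcomes = {compare(x) for x in candidates}
    return outcomes.pop() if len(outcomes) == 1 else "D"


# The knee-jerk check uses only familiar whole numbers.
print(answer([2, 3, 10]))
# The prepared test-taker also tries 1 and a fraction between 0 and 1.
print(answer([2, 1, 0.5]))
```

Trying a fraction and the number 1 alongside the usual whole numbers is the "comprehensive thinking" a preparation course would drill.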
Should you guess on the SAT? Yes! Break the rules. This is a case where girls may experience an unreal
decrease in test results. Girls tend to watch the rules, whereas boys tend to break them. If you have absolutely no idea what
the correct answer is, then you should not guess. It will gain you nothing. However, most guessing on tests is not this blind
choice. Most test-takers who are inclined to guess have already identified two very good answers and the guess is just a matter
of choosing between the two. For every wrong answer you give, you are deducted 25% of a point. If you are guessing blindly,
the odds are that you will get 20% of the answers right. The quarter-point deductions for the four wrong guesses out of every five cancel the one point earned for the lucky guess, leaving
the test-taker with a “0” at the end of the test, with all the deductions for wrong answers eroding any points for accidentally
guessing right. But… if you are guessing between two good answers, your odds are 50% that you got it right. It is not likely
that educated guesses would yield every other one wrong, but if they did you would still come out ahead: four right and four wrong
nets three points, rather than nothing for leaving those questions unanswered. This is another thing a test-taker learns
through SAT preparation. One word of caution about SAT test preparation: I have encountered several programs that look at what they
think the SAT should be and not what it really is. What they think it should be is a college entrance test. Ah, but it is
not. An SAT preparation program following this approach will train the test-taker to answer questions related to introductory
college courses and not what is really on the SAT.
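The guessing arithmetic earlier in this passage can be checked directly. This is a minimal sketch assuming the scoring described in the talk: one point for a right answer, a quarter-point deduction for a wrong one, and five answer choices per question (implied by the 20% figure). The function name is hypothetical.

```python
from fractions import Fraction

PENALTY = Fraction(1, 4)  # quarter-point deduction for each wrong answer


def expected_gain(p_correct):
    """Average points gained per guessed question, given the chance of being right."""
    p = Fraction(p_correct)
    return p * 1 - (1 - p) * PENALTY


# Blind guess among five choices: the deductions cancel the lucky hits.
print(expected_gain(Fraction(1, 5)))
# Educated guess between two good answers: a clear gain on average.
print(expected_gain(Fraction(1, 2)))
# Eight educated guesses, every other one wrong: four points earned, one lost.
print(4 * 1 - 4 * PENALTY)
```

Exact fractions are used so that the blind-guess case comes out to exactly zero rather than a rounding artifact.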
Finally, I would like to point out some of the issues involved in the interpretation of test results.
One of the most important concerns in our pluralistic society is what is commonly referred to as “cultural bias”. This is
not true bias in the sense that members of a particular group are singled out, but no matter how clean the test is there may
be particular questions that are interpreted differently by one group of people than by the majority. A very simple example
is an item that appeared in a nationally distributed achievement test. The students were required to recognize a misspelled
word which in this case was D-U-T-C-H-E-S-S for the word meaning the wife of a duke. Many smart students around the country
recognized that duchess is spelled without a “T”. However, many smart students living two counties above New York City got that question wrong. They lived in a county called DUTCHESS, which at one time I believe meant “the woman
from the Netherlands”. That particular question did not function the way it was intended. Many
times the way the question is worded may have a slightly different meaning to a cultural minority than to the majority. There
are effective ways of detecting this systematic error.
The definition of a good test question is how good a predictor it is of the final test score. If
a question appears to be a good predictor of the final score except when the scores for a particular group are considered,
the question may have a bias, or Differential Item Functioning. To determine this, the choices would have to be examined,
and if all the high performers among that minority chose the same answer, then there must be some cultural content to that
answer. That question would have to be disregarded for that group. Another way that cultural bias may slip into testing
is if all the examples in the test are appropriate for one group and not the other. This may occur as simply as if all examples
referred to life in the suburbs instead of including some obvious urban references.
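The idea of an item that predicts the final score for one group but not for another can be sketched with a plain correlation computed group by group. This is an illustration only, not a formal Differential Item Functioning procedure: the group labels, scores and function names are hypothetical, loosely modeled on the duchess/Dutchess item.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def item_total_by_group(records):
    """Correlate item result (1/0) with final score separately for each group.

    records: list of (group, item_correct, total_score) tuples.
    """
    groups = {}
    for group, item, total in records:
        items, totals = groups.setdefault(group, ([], []))
        items.append(item)
        totals.append(total)
    return {g: pearson(items, totals) for g, (items, totals) in groups.items()}


# Hypothetical data: elsewhere the strong students get the item right,
# while in the local county the strong students are the ones who miss it.
records = [
    ("elsewhere", 1, 90), ("elsewhere", 1, 80),
    ("elsewhere", 0, 60), ("elsewhere", 0, 50),
    ("local", 0, 90), ("local", 0, 85),
    ("local", 1, 60), ("local", 1, 55),
]
print(item_total_by_group(records))
```

A positive correlation in one group and a negative one in the other is the kind of pattern that would send the item back for review.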
Another component of test interpretation is standard-setting. Where do we draw the line between “Pass”
and “Fail”? This is the least scientific procedure in testing
because it involves human judgment. However, if the test results are accurate and consistent and the appropriateness of the
use of the results is verified, the subjectivity of the standard-setting is of little concern. One polymer contains the same
monomer as another, but because their uses might be completely different, one molecular chain may be “stretched out”
while the other is kept short. It is the same with standard-setting. Its
primary goal is to assure “due process” in testing by allowing the stakeholders in the testing situation to know
how the test results will be interpreted. One popular standard-setting method is called Contrasting Groups and is appropriate
in setting the standard for professional testing such as in licensing. One of the groups consists of those who are expected
to pass such as those already in the profession and the other group is the new applicants.
The distributions of scores for both groups are mapped on the same graph and where the two distributions intersect
is the cut-off score. This is interesting because those applicants who score the same as the lowest scores of those already
in the profession are rejected.
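One simple way to approximate the point where the two distributions cross is to pick the cut score that misclassifies the fewest people. The sketch below swaps that counting rule in for reading the crossing off a graph, and it assumes the applicant group is treated as the contrasting, lower-scoring group. The scores and the function name are hypothetical.

```python
def contrasting_groups_cutoff(expected_pass, applicants):
    """Return the cut score with the fewest misclassifications.

    A score at or above the cut counts as a pass. An error is either a member
    of the expected-to-pass group falling below the cut or a member of the
    contrasting group landing at or above it. Ties go to the lowest cut.
    """
    candidates = sorted(set(expected_pass) | set(applicants))
    best_cut, best_errors = None, None
    for cut in candidates:
        errors = (sum(score < cut for score in expected_pass)
                  + sum(score >= cut for score in applicants))
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


# Hypothetical scores for people already in the profession and for applicants.
professionals = [70, 75, 80, 85, 90]
applicants = [50, 55, 60, 65, 72]
print(contrasting_groups_cutoff(professionals, applicants))
```

With equally sized groups, the score that minimizes these errors sits where the two frequency curves cross, which is the graphical rule described above.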
In my dissertation study, I applied an iterative process where teachers
set their standards before the test was administered but were allowed to review feedback from the students on the difficulty
level of the sections of the test. The teachers were allowed to reconsider the standards for those sections where the students’
judgment of difficulty differed from the teachers’. Many times the teachers did change the standard in line with what was reported
by the students. However, there were some situations where the teachers all stood pat.