I.
PRINCIPLES OF LANGUAGE ASSESSMENT
Practicality
A test that is prohibitively expensive is impractical.
A test of language proficiency that takes a student five hours to complete is impractical: it consumes more time (and money) than necessary to accomplish its objective. A test that requires individual one-on-one proctoring is impractical for a group of several hundred test-takers and only a handful of examiners. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations. A test that can be scored only by a computer is impractical if the test takes place a thousand miles away from the nearest computer. The value and quality of a test sometimes hinge on such nitty-gritty, practical considerations.
Here’s a little horror story about practicality gone awry. An administrator of a six-week summertime short course needed to place the 50 or so students who had enrolled in the program. A quick search yielded a copy of an old English Placement Test from the University of Michigan. It had 20 listening items based on an audiotape and 80 items on grammar, vocabulary, and reading comprehension, all in multiple-choice format. A scoring grid accompanied the test. On the day of the test, the required number of test booklets had been secured, a proctor had been assigned to monitor the process, and the administrator and the proctor had planned to have the scoring completed by later that afternoon so students could begin classes the next day. Sounds simple, right? Wrong.
The students arrived, test booklets were distributed, and directions were given. The proctor started the tape. Soon students began to look puzzled. By the time the tenth item played, everyone looked bewildered. Finally, the proctor checked a test booklet and was horrified to discover that the wrong tape was playing; it was a tape for another form of the same test! Now what? She decided to randomly select a short passage from a textbook that was in the room and give the students a dictation. The students responded reasonably well. The next 80 non-tape-based items proceeded without incident, and the students handed in their score sheets and dictation papers.
When the red-faced administrator and the proctor got together later to score the tests, they faced the problem of how to score the dictation, a more subjective process than some other forms of assessment. After a lengthy exchange, the two established a point system, but after the first few papers had been scored, it was clear that the point system needed revision. That meant going back to the first papers to make sure the new system was applied consistently.
The two faculty members had barely begun to score the 80 multiple-choice items when students began returning to the office to receive their placements. Students were told to come back the next morning for the results. Later that evening, having combined the dictation scores and the 80 multiple-choice scores, the two frustrated examiners finally arrived at placements for all the students. It’s easy to see what went wrong here. While the listening comprehension section of the test was apparently highly practical, the administrator had failed to check the materials ahead of time. Then the two established a scoring procedure that did not fit the time constraints. In classroom-based testing, time is almost always a crucial practicality factor for busy teachers with too few hours in the day.
Reliability
A reliable test is consistent and
dependable. If you give the same test to the same student or matched students
on two different occasions, the test should yield similar results. The issue of
reliability of the test may best be addressed by considering a number of
factors that may contribute to the unreliability of a test. Consider the
following possibilities: fluctuations in the student, in scoring, in test
administration, and in the test itself.
Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness, fatigue, a “bad day,” anxiety, and other physical or psychological factors, which may make an “observed” score deviate from one’s true score. Also included in this category are such factors as a test-taker’s “test-wiseness,” or strategies for efficient test taking.
Rater Reliability
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores on the same test, possibly because of lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the placement test, the initial scoring plan for the dictation was found to be unreliable; that is, the two scorers were not applying the same standards.
Rater reliability issues are not limited to contexts in which two or more scorers are involved. Intra-rater unreliability is a common occurrence for classroom teachers because of unclear scoring criteria, fatigue, bias toward particular “good” and “bad” students, or simple carelessness. When I am faced with up to 40 tests to grade in only a week, I know that the standards I apply, however subliminally, to the first few tests will be different from those I apply to the last few. I may be “easier” or “harder” on those first few papers, or I may get tired, and the result may be an inconsistent evaluation across all tests. One solution to such intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, then to recycle back through the whole set of tests to ensure an even-handed judgment. In tests of writing skill, rater reliability is particularly hard to achieve since writing proficiency involves numerous traits that are difficult to define. The careful specification of an analytical scoring instrument, however, can increase rater reliability (J. D. Brown, 1991).
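One informal way to see whether two scorers are applying the same standards is to compare their scores on the same set of papers numerically. The sketch below only illustrates that idea; the scores, the 1–5 rubric, and the one-point tolerance are invented assumptions, not part of any published scoring instrument.

```python
# Minimal sketch: comparing two raters' scores on the same set of essays.
# The scores, the 1-5 rubric, and the 1-point tolerance are hypothetical.

rater_a = [4, 3, 5, 2, 4, 3, 5, 4]  # rater A's scores on a 1-5 rubric
rater_b = [4, 2, 5, 3, 4, 4, 5, 3]  # rater B's scores on the same essays

n = len(rater_a)

# Exact agreement: both raters gave the identical score.
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Near agreement: scores differ by at most one rubric point.
near = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n

print(f"Exact agreement: {exact:.0%}")
print(f"Agreement within 1 point: {near:.0%}")
```

If exact or near agreement turns out to be low, the scoring criteria probably need to be specified more tightly before the remaining papers are scored.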
Test Administration Reliability
Unreliability may also result from the conditions in which the test is administered. I once witnessed the administration of a test of aural comprehension in which a tape recorder played items for comprehension, but because of street noise outside the building, students sitting next to the windows could not hear the tape accurately. This was a clear case of unreliability caused by the conditions of the test administration. Other sources of unreliability are found in photocopying variations, the amount of light in different parts of the room, variations in temperature, and even the condition of desks and chairs.
Test Reliability
Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may discriminate against students who do not perform well under a time limit. We all know people (and you may be included in this category) who “know” the course material perfectly but who are adversely affected by the presence of a clock ticking away. Poorly written test items (that are ambiguous or that have more than one correct answer) may be a further source of test unreliability.
Validity
By far the most complex criterion of an effective test, and arguably the most important principle, is validity, “the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment” (Gronlund, 1998, p. 226). A valid test of reading ability actually measures reading ability, not 20/20 vision, previous knowledge in a subject, or some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring quite dependable (reliable). But it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the organization of ideas, among other factors.
How is the validity of a test established? There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. In some cases, it may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested. In other cases, we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence. Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test’s validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker’s perception of validity. We will look at these five types of evidence below.
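To make the idea of “statistical correlation with other related but independent measures” concrete, the sketch below computes a Pearson correlation between scores on a classroom test and scores on an established external measure. All of the numbers are invented for illustration; no real test data are implied.

```python
# Minimal sketch: correlating classroom test scores with an independent,
# established measure of the same ability. All numbers are invented.

classroom_scores = [72, 85, 60, 90, 78, 66, 88, 70]
external_scores = [68, 80, 58, 94, 75, 70, 85, 65]

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(f"r = {pearson_r(classroom_scores, external_scores):.2f}")
```

A strong positive correlation would count as one piece of supporting evidence, not as proof of validity on its own.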
Content-Related Evidence
If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003). You can usually identify content-related evidence observationally if you can clearly define the achievement that you are measuring. A test of tennis competency that asks someone to run a 100-yard dash obviously lacks content validity. If you are trying to assess a person’s ability to speak a second language in a conversational setting, asking the learner to answer paper-and-pencil multiple-choice questions requiring grammatical judgment does not achieve content validity. A test that requires the learner actually to speak within some sort of authentic context does. And if a course has perhaps ten objectives but only two are covered in a test, then content validity suffers.
Consider the following quiz on English articles for a high-beginner conversation class (listening and speaking) for English learners.

English article quiz
The students had a unit on zoo animals and had engaged in some open discussions and group work in which they had practiced articles, all in listening and speaking modes of performance. In that the quiz uses a familiar setting and focuses on previously practiced language forms, it is somewhat content valid. The fact that it was administered in written form, however, and required students to read the passage and write their responses makes it quite low in content validity for a listening/speaking class.
There are a few cases of highly specialized and sophisticated testing instruments that may have questionable content-related evidence of validity. It is possible to contend, for example, that standard language proficiency tests, with their context-reduced, academically oriented language and limited stretches of discourse, lack content validity because they do not require the full spectrum of communicative performance on the part of the learner (see Bachman, 1990, for a full discussion). There is good reasoning behind such criticism; nevertheless, what such proficiency tests lack in content-related evidence they may gain in other forms of evidence, not to mention practicality and reliability.
Another way of understanding content validity is to consider the difference between direct and indirect testing. Direct testing involves the test-taker in actually performing the target task. In indirect testing, learners do not perform the target task itself but rather a task that is only related to it, such as a paper-and-pencil grammar test used to draw conclusions about speaking ability.
Criterion-Related Evidence
A second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity, or the extent to which the “criterion” of the test has actually been reached. You will recall that in Chapter 1 it was noted that most classroom-based assessment with teacher-designed tests fits the concept of criterion-referenced assessment. In such tests, specified classroom objectives are measured, and implied predetermined levels of performance are expected to be reached (80 percent is often considered a minimal passing grade).
Construct-Related Evidence
A third kind of evidence that can support validity, but one that does not play as large a role for classroom teachers, is construct-related validity, commonly referred to as construct validity. A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions. Constructs may or may not be directly or empirically measured; their verification often requires inferential data.
Consequential Validity
As well as the above three widely accepted forms of evidence that may be introduced to support the validity of an assessment, two other categories may be of some interest and utility in your own quest for validating classroom tests. Messick (1989), Gronlund (1998), McNamara (2000), and Brindley (2001), among others, underscore the potential importance of the consequences of using an assessment. Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test’s interpretation and use.
Face Validity
An important facet of consequential validity is the extent to which “students view the assessment as fair, relevant, and useful for improving learning” (Gronlund, 1998, p. 210), or what is popularly known as face validity. “Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers” (Mousavi, 2002, p. 244).
Authenticity
A fourth major principle of language
testing is authenticity, a concept that is a little slippery to define,
especially within the art and science of evaluating and designing tests.
Bachman and Palmer (1996, p. 23) define authenticity as “the degree of
correspondence of the characteristics of a given language test task to the
features of a target language task,” and then suggest an agenda for identifying
those target language tasks and for transforming them into valid test items.
Washback
A facet of consequential validity, discussed above, is “the effect of testing on teaching and learning” (Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback. In large-scale assessment, washback generally refers to the effects the tests have on instruction in terms of how students prepare for the test.
“Cram” courses and “teaching to the test” are examples of such washback. Another form of washback that occurs more in classroom assessment is the information that “washes back” to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment. Informal performance assessment is by nature more likely to have built-in washback effects because the teacher is usually providing interactive feedback. Formal tests can also have positive washback, but they provide no washback if the students receive a simple letter grade or a single overall numerical score.
The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Students’ incorrect responses can become windows of insight into further work. Their correct responses need to be praised, especially when they represent accomplishments in a student’s interlanguage. Teachers can suggest strategies for success as part of their “coaching” role. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.
One way to enhance washback is to comment generously and specifically on test performance. Many overworked teachers return tests to students with a single letter grade or numerical score and consider their job done. In reality, letter grades and numerical scores give absolutely no information of intrinsic interest to the student. Grades and scores reduce a mountain of linguistic and cognitive performance data to an absurd molehill. At best, they give a relative indication of a formulaic judgment of performance as compared to others in the class, which fosters competitive, not cooperative, learning.
With this in mind, when you return a
written test or a data sheet from an oral production test, consider giving more
than a number, grade, or phrase as your feedback. Even if your evaluation is not
a neat little paragraph appended to the test, you can respond to as many
details throughout the test as time will permit. Give praise for strengths -the
“good stuff”- as well as constructive criticism of weaknesses. Give strategic
hints on how a student might improve certain elements of performance. In other
words, take some time to make the test performance an intrinsically motivating
experience from which a student will gain a sense of accomplishment and
challenge.
A little bit of washback may also help
students through a specification of the numerical scores on the various
subsections of the test. A subsection on verb tenses, for example, that yields
a relatively low score may serve the diagnostic purpose of showing the student
an area of challenge.
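As a concrete illustration of such a subsection breakdown, the following sketch prints the kind of per-section summary that could accompany a returned test. The section names and point values are hypothetical.

```python
# Minimal sketch: a per-section score report for diagnostic washback.
# Section names and point values are hypothetical.

sections = {
    "Verb tenses": (12, 20),      # (points earned, points possible)
    "Articles": (17, 20),
    "Listening cloze": (25, 30),
    "Short answers": (24, 30),
}

total_earned = sum(earned for earned, _ in sections.values())
total_possible = sum(possible for _, possible in sections.values())

for name, (earned, possible) in sections.items():
    print(f"{name:16} {earned:>2}/{possible:<2} ({earned / possible:.0%})")
print(f"{'Total':16} {total_earned}/{total_possible} ({total_earned / total_possible:.0%})")
```

A report like this lets a student see at a glance that, say, verb tenses deserve more attention than articles, which a single overall score cannot show.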
Another viewpoint on washback is achieved by a quick consideration of the differences between formative and summative tests. Formative tests, by definition, provide washback in the form of information to the learner on progress toward goals. But teachers might be tempted to feel that summative tests, which provide assessment at the end of a course or program, do not need to offer much in the way of washback. Such an attitude is unfortunate because the end of every language course or program is always the beginning of further pursuits, more learning, more goals, and more challenges to face. Even a final examination in a course should carry with it some means for giving washback to students.
In my courses I never give a final examination during the last scheduled class session. Instead, I administer the final exam earlier so that I can return it to students during the last class. At that time, the students receive scores, grades, and comments on their work, and I spend some of the class session addressing material on which the students were not completely clear. My summative assessment is thereby enhanced by some beneficial washback that is usually not expected of final examinations.
Finally, washback also implies that students have ready access to you to discuss the feedback and evaluation you have given. While you almost certainly have known teachers with whom you wouldn’t dare argue about a grade, an interactive, cooperative, collaborative classroom can nevertheless promote an atmosphere of dialogue between students and teachers regarding evaluative judgments. For learning to continue, students need to have a chance to give feedback on your feedback, to seek clarification of any issues that are fuzzy, and to set new and appropriate goals for themselves for the days and weeks ahead.
II.
APPLYING PRINCIPLES TO THE EVALUATION OF CLASSROOM TESTS
The five principles of practicality, reliability, validity, authenticity, and washback go a long way toward providing useful guidelines for both evaluating an existing assessment procedure and designing one of your own. Quizzes, tests, exams, and standardized proficiency tests can all be scrutinized through these five lenses.
Are there other principles that should be invoked in evaluating and designing assessments? The answer, of course, is yes. Language assessment is an extraordinarily broad discipline with many branches, interest areas, and issues. The process of designing effective assessment instruments is far too complex to be reduced to five principles. Good test construction, for example, is governed by research-based rules of test preparation, sampling of tasks, item design and construction, scoring responses, ethical standards, and so on. But the five principles cited here serve as an excellent foundation on which to evaluate existing instruments and to build your own.
The questions that follow here, indexed by the five principles, will help you evaluate existing tests for your own classroom. It is important for you to remember, however, that the sequence of these questions does not imply a priority order. Validity, for example, is certainly the most significant cardinal principle of assessment evaluation. Practicality may be a secondary issue in classroom testing. Or, for a particular test, you may need to place authenticity as your primary consideration. When all is said and done, however, if validity is not substantiated, all other considerations may be rendered useless.
1. Are the test procedures practical?
Practicality is determined by the teacher’s (and the students’) time constraints, costs, and administrative details, and to some extent by what occurs before and after the test. To determine whether a test is practical for your needs, you may want to use the checklist below.
Practicality checklist
1. Are administrative details clearly established before the test?
2. Can students complete the test reasonably within the set time frame?
3. Can the test be administered smoothly, without procedural “glitches”?
4. Are all materials and equipment ready?
5. Is the cost of the test within budgeted limits?
6. Is the scoring/evaluation system feasible in the teacher’s time frame?
7. Are methods for reporting results determined in advance?
As this checklist suggests, after you account for the administrative details of giving a test, you need to think about the practicality of your plans for scoring the test. In teachers’ busy lives, time often emerges as the most important factor, one that overrides other considerations in evaluating an assessment. If you need to tailor a test to fit your own time frame, as teachers frequently do, you need to accomplish this without damaging the test’s validity and washback. Teachers should, for example, avoid the temptation to offer only quickly scored multiple-choice selection items that may be neither appropriate nor well designed. Everyone knows teachers secretly hate to grade tests (almost as much as students hate to take them!) and will do almost anything to get through that task as quickly and effortlessly as possible. Yet good teaching almost always implies an investment of the teacher’s time in giving feedback (comments and suggestions) to students on their tests.
2. Is the test reliable?
Reliability applies to both the test and the teacher, and at least four sources of unreliability must be guarded against, as noted earlier in this chapter. Test and test administration reliability can be achieved by making sure that all students receive the same quality of input, whether written or auditory. Part of achieving test reliability depends on the physical context: making sure, for example, that
· every student has a cleanly photocopied test sheet,
· sound amplification is clearly audible to everyone in the room,
· video input is equally visible to all,
· lighting, temperature, extraneous noise, and other classroom conditions are equal (and optimal) for all students, and
· objective scoring procedures leave little debate about the correctness of an answer.
Rater reliability, another common issue in assessment, may be more difficult to achieve, perhaps because we too often overlook it as an issue. Since classroom tests rarely involve two scorers, inter-rater reliability is seldom an issue. Instead, intra-rater reliability is of constant concern to teachers: what happens to our fallible concentration and stamina over the period of time during which we are evaluating a test? Teachers need to find ways to maintain their concentration and stamina over the time it takes to score an assessment. In open-ended response tests, this issue is of paramount importance. It is easy to let mentally established standards erode over the hours you require to evaluate the test.
Intra-rater reliability for open-ended responses may be enhanced by the following guidelines:
· Use consistent sets of criteria for a correct response.
· Give uniform attention to those sets throughout the evaluation time.
· Read through tests at least twice to check your consistency.
· If you have made “mid-stream” modifications of what you consider a correct response, go back and apply the same standards to all.
· Avoid fatigue by reading the tests in several sittings, especially if the time requirement is a matter of several hours.
3. Does the procedure demonstrate content validity?
The major source of validity in a classroom test is content validity: the extent to which the assessment requires students to perform tasks that were included in the previous classroom lessons and that directly represent the objectives of the unit on which the assessment is based. If you have been teaching an English language class to fifth graders who have been reading, summarizing, and responding to short passages, and if your assessment is based on this work, then to be content valid, the test needs to include performance in those skills.
There
are two steps to evaluating the content validity of a classroom test.
1. Are classroom objectives identified and appropriately framed?
Underlying every good classroom test are the objectives of the lesson, module, or unit of the course in question. So the first measure of an effective classroom test is the identification of objectives. Sometimes this is easier said than done. Too often teachers work through lessons day after day with little or no cognizance of the objectives they seek to fulfill. Or perhaps those objectives are so poorly framed that determining whether or not they were accomplished is impossible. Consider the following objectives for lessons, all of which appeared on lesson plans designed by students in teacher preparation programs:
a. Students should be able to demonstrate some reading comprehension.
b. To practice vocabulary in context.
c. Students will have fun through a relaxed activity and thus enjoy their learning.
d. To give students a drill on the /I/ - /I/ contrast.
e. Students will produce yes/no questions with final rising intonation.
Only the last objective is framed in a form that lends itself to assessment. In (a), the modal should is ambiguous and the expected performance is not stated. In (b), everyone can fulfill the act of “practicing”; no standards are stated or implied. For obvious reasons, (c) cannot be assessed. And (d) is really just a teacher’s note on the type of activity to be used.
Objective
(e), on the other hand, includes a performance
verb and a specific linguistic target.
By specifying acceptable and unacceptable levels of performance, the goal can
be tested. An appropriate test would elicit an adequate number of samples of
student performance, have a clearly framed set of standards for evaluating the
performance, and provide some sort of feedback to the student.
2. Are lesson objectives represented in the form of test specifications?
The next content-validity issue that can be applied to a classroom test centers on the concept of test specifications. Don’t let this term scare you. It simply means that a test should have a structure that follows logically from the lesson or unit you are testing. Many tests have a design that
· divides them into a number of sections (corresponding, perhaps, to the objectives that are being assessed),
· offers students a variety of item types, and
· gives an appropriate relative weight to each section.
Some tests, of course, do not lend themselves to this kind of structure. A test in a course in academic writing at the university level might justifiably consist of an in-class written essay on a given topic: only one “item” and one response, in a manner of speaking. But in this case the specs (specifications) would be embedded in the prompt itself and in the scoring or evaluation rubric used to grade it and give feedback. We will return to the concept of test specs in the next chapter.
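To illustrate how test specifications can assign a relative weight to each section, the sketch below expresses a simple set of specs and computes a weighted total. The section names, objectives, weights, and raw scores are hypothetical and are not drawn from any particular test.

```python
# Minimal sketch: test specifications as weighted sections.
# Section names, objectives, weights, and raw scores are hypothetical.

specs = [
    # (section, objective assessed, weight, raw score out of 100)
    ("Listening comprehension", "Objective 1", 0.30, 80),
    ("Grammar in context", "Objective 2", 0.30, 70),
    ("Oral interview", "Objective 3", 0.40, 90),
]

# The weights should account for the whole test.
assert abs(sum(weight for _, _, weight, _ in specs) - 1.0) < 1e-9

weighted_total = sum(weight * score for _, _, weight, score in specs)
print(f"Weighted total: {weighted_total:.1f} / 100")
```

Keeping the weights explicit in the specs makes it easy to check that they sum to one and that each objective carries the emphasis intended for it.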
The content validity of an existing classroom test should
be apparent in how the objectives of the unit being tested are represented in
the form of the content of items, clusters of items, and item types. Do you
clearly perceive the performance of test-takers as reflective of the classroom
objectives? If so, and you can argue this, content validity has probably been
achieved.
4. Is the Procedure Face Valid and “Biased for Best”?
This question integrates the concept of face validity with the importance of structuring an assessment procedure to elicit the optimal performance of the student. Students will generally judge a test to be face valid if
· directions are clear,
· the structure of the test is organized logically,
· its difficulty level is appropriately pitched,
· the test has no “surprises,” and
· timing is appropriate.
A phrase that has come to be associated with face validity is “biased for best,” a term that goes a little beyond how the student views the test to a degree of strategic involvement on the part of student and teacher in preparing for, setting up, and following up on the test itself. According to Swain (1984), to give an assessment procedure that is “biased for best,” a teacher
· offers students appropriate review and preparation for the test,
· suggests strategies that will be beneficial, and
· structures the test so that the best students will be modestly challenged and the weaker students will not be overwhelmed.
It’s easy for teachers to forget how challenging some tests can be, and so a well-planned testing experience will include some strategic suggestions on how students might optimize their performance. In evaluating a classroom test, consider the extent to which before-, during-, and after-test options are fulfilled.
Test-Taking Strategies
Before the Test
1. Give students all the information you can about the test: Exactly what will the test cover? Which topics will be the most important? What kind of items will be on it? How long will it be?
2. Encourage students to do a systematic review of material. For example, they should skim the textbook and other material, outline major points, and write down examples.
3. Give them practice tests or exercises, if available.
4. Facilitate the formation of a study group, if possible.
5. Caution students to get a good night’s rest before the test.
6. Remind students to get to the classroom early.
During the Test
1. After the test is distributed, tell students to look over the whole test quickly in order to get a good grasp of its different parts.
2. Remind them to mentally figure out how much time they will need for each part.
3. Advise them to concentrate as carefully as possible.
4. Warn students a few minutes before the end of the class period so that they can finish on time, proofread their answers, and catch careless errors.
After the Test
1. When you return the test, include feedback on specific things each student did well, what he or she did not do well, and, if possible, the reasons for your comments.
2. Advise students to pay careful attention in class to whatever you say about the test results.
3. Encourage questions from students.
4. Advise students to pay special attention in the future to points on which they are weak.
Keep in mind that what comes before and after the test also contributes to its face validity. Good class preparation will give students a comfort level with the test, and good feedback (washback) will allow them to learn from it.
5. Are the Test Tasks as Authentic as Possible?
Evaluate the extent to which a test is authentic by asking the following questions:
· Is the language in the test as natural as possible?
· Are items as contextualized as possible rather than isolated?
· Are topics and situations interesting, enjoyable, and/or humorous?
· Is some thematic organization provided, such as through a story line or episode?
· Do tasks represent, or closely approximate, real-world tasks?
Consider the following two excerpts from tests, and the concept of authenticity may become a little clearer.
Multiple-Choice Tasks: Contextualized
“Going To”
1. What ________ this weekend?
a. you are going to do
b. are going to do
c. you going to do
2. I’m not sure _________ anything special?
a. are going to do
b. you are going to do
c. is going to do
3. My friend Melissa and I _______ a party. Would you like to come?
a. am going to
b. are going to go to
c. go to
4. I’d love to! ________
a. What’s it going to be?
b. Who’s going to be?
c. Where’s it going to be?
5. It is __________ to be at Ruth’s house.
a. go
b. going
c. going to
Multiple-Choice Tasks: Decontextualized
1. There are three countries I would like to visit. One is Italy.
a. The other is New Zealand and other is Nepal.
b. The others are New Zealand and Nepal.
c. Others are New Zealand and Nepal.
2. When I was twelve years old, I used _______ every day.
a. swimming
b. artistic
c. artist
3. When Mr. Brown designs a website, he always creates it _________.
a. artistically
b. artistic
c. artist
4. Since the beginning of the year, I ________ at Millennium Industries.
a. am working
b. had been working
c. have been working
5. When Mona broke her leg, she asked her husband ________ her to work.
a. to drive
b. driving
c. drive
The sequence of items in the contextualized tasks achieves a modicum of authenticity by contextualizing all the items in a story line. The conversation is one that might occur in the real world, even if with a little less formality. The sequence of items in the decontextualized tasks takes the test-taker into five different topic areas with no context for any of them. Each sentence is likely to be written or spoken in the real world, but not in that sequence. Given the constraints of a multiple-choice format, on a measure of authenticity I would say the first excerpt is “good” and the second excerpt is only “fair.”
6. Does the Test Offer Beneficial Washback to the Learner?
The design of an effective test should point the way to beneficial washback. A test that achieves content validity demonstrates relevance to the curriculum in question and thereby sets the stage for washback. When test items represent the various objectives of a unit, and/or when sections of a test clearly focus on major topics of the unit, classroom tests can serve in a diagnostic capacity even if they aren’t specifically labeled as such.
Other evidence of washback may be less visible from an examination of the test itself. Here again, what happens before and after the test is critical. Preparation time before the test can contribute to washback since the learner is reviewing and focusing in a potentially broader way on the objectives in question. By spending classroom time after the test reviewing its content, students discover their areas of strength and weakness. Teachers can raise the washback potential by asking students to use test results as a guide to setting goals for their future effort. The key is to play down the “Whew, I’m glad that’s over” feeling that students are likely to have, and play up the learning that can now take place from their knowledge of the results.
Some of the “alternatives” in assessment referred to in Chapter 1 may also enhance washback from tests. Self-assessment may sometimes be an appropriate way to challenge students to discover their own mistakes. This can be particularly effective for writing performance: once the pressure of assessment has come and gone, students may be able to look back on their written work, rather than simply listening to the teacher tell everyone what they got right and wrong and why. Journal writing may offer students a specific place to record their feelings, what they learned, and their resolutions for future effort.
The five basic principles of language assessment were expanded here into six essential questions you might ask yourself about an assessment. As you use the principles and the guidelines to evaluate various forms of tests and procedures, be sure to allow each one of the five to take on greater or lesser importance, depending on the context. In large-scale standardized testing, for example, practicality is usually more important than washback, but the reverse may be true of a number of classroom tests. Validity is of course always the final arbiter. And remember, too, that these principles, important as they are, are not the only considerations in evaluating or making an effective test. Leave some space for other factors to enter in.
In the next chapter, the focus is on how to design a test. The same five principles underlie test construction as well as test evaluation, along with some new facets that will expand your ability to apply principles to the practicalities of language assessment in your own classroom.
By: 2nd group
Catrine Mei Windri
Cindy Aprilia
Santi Novitasari
Novri Karyati
Junitrin