Test Construction and Adaptation
Test Construction
Five main stages in test development

Test conceptualization

Test construction

Test try-out

Item Analysis

Test Revision
Stage 1: Test conceptualization
❑ The beginnings of any published test can probably be traced to thoughts, or self-talk, framed in behavioural terms.
❑ The test developer might say to himself/herself something like, "There ought to be a test designed to measure ______ in (such and such) a way."
❑ A review of the available literature on existing tests designed to measure a particular construct might indicate that such tests leave much to be desired in psychometric soundness.
❑ An emerging social phenomenon or pattern of behavior might serve as the stimulus for the development of a new test. The development of a new test may also be a response to the need to assess mastery in an emerging occupation or profession.
Some important questions
❑ Some preliminary questions include:
▪ What is the test designed to measure?
▪ What is the objective of the test?
▪ Is there a need for the test?
▪ Who will use and take this test?
▪ What content should it cover?
▪ How will the test be administered?
▪ What is the ideal format of the test?
▪ What special training will be required of test users for
administering or interpreting the test?
▪ Is there any potential for harm as the result of an
administration of this test?
▪ How will meaning be attributed to scores on the test?
Norm-referenced vs. criterion-referenced tests: Item development issues

Norm-referenced
❑ Norm-referenced tests allow for interpretation in reference to a large standardization sample.
❑ An item that differentiates well between high and low scorers is a good one.
❑ Such tests are insufficient and inappropriate when knowledge of mastery is required.

Criterion-referenced
❑ Criterion-referenced tests evaluate an individual's score with reference to a set standard.
❑ Differentiating between high and low scorers is not what makes an item good or acceptable from a criterion-oriented perspective.
❑ Each item addresses whether the test taker has met certain criteria.
❑ Development of a criterion-referenced test may entail exploratory work with at least two groups of test takers (one that has mastered the material and one that has not). The items that best discriminate between these two groups would be considered "good" items.
❑ Such tests are mostly employed in licensing contexts, be it a license to practice medicine or to drive a car.
Stage 2: Test construction
❑ Measurement
▪ Assignment of numbers according to rules.
❑ Scaling
▪ The process of setting rules for assigning numbers in measurement.
▪ It is the process by which a measuring device is designed and calibrated, and the way numbers (or other indices), called scale values, are assigned to different amounts of the trait, attribute, or characteristic being measured.
▪ Historically, Thurstone is credited with being at the forefront of efforts to develop methodologically sound scaling methods.
▪ Scales can be meaningfully categorized along a continuum of level of measurement and be referred to as nominal, ordinal, interval, or ratio.
Scaling continued…

Types of scales
• Rating scales (e.g., the Likert scale)
• Method of paired comparisons
• Guttman scale
• Adjective checklist
• Likert scales (Likert, 1932) are used extensively in psychology and are relatively easy to construct. Each item presents the test taker with five alternative responses (sometimes seven), usually on an agree–disagree or approve–disapprove continuum.
• Method of paired comparisons: test takers are presented with pairs of stimuli (two photographs, two objects, two statements), which they are asked to compare. They must select one of the stimuli according to some rule; for example, the rule that they agree more with one statement than the other, or the rule that they find one stimulus more appealing than the other.
• Guttman scales (Guttman, 1944a, 1944b, 1947) are ordinal-level measures. Items on them range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. A feature of Guttman scales is that all respondents who agree with the stronger statements will also agree with the milder statements; for example, on a 10-item scale, a score of 8 means the respondent agreed with the first 8 (mildest) statements. (A scoring sketch follows this list.)
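A minimal Python sketch of the two scoring patterns described above; the item responses and scale lengths are hypothetical, not taken from any published instrument.

    # Hypothetical sketch: summing a 5-point Likert item set and checking a Guttman pattern.
    likert_responses = [4, 5, 3, 4, 2]        # five items, each rated 1 (disagree) to 5 (agree)
    likert_total = sum(likert_responses)      # cumulative Likert score -> 18

    # Guttman items ordered from mildest (index 0) to strongest (last item);
    # a "perfect" pattern has all agreements (1) before any disagreement (0).
    guttman_pattern = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]   # a score of 8 on a 10-item scale

    def is_perfect_guttman(pattern):
        # True if no agreement follows a disagreement, i.e. the cumulative structure holds
        return "01" not in "".join(str(x) for x in pattern)

    print(likert_total, is_perfect_guttman(guttman_pattern))   # 18 True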
Writing Items

⮚ How does one develop items for the item pool?


⮚ The test developer may write a large number of items from
personal experience or academic acquaintance with the subject
matter. Help may also be sought from others, including experts.
⮚ For psychological tests designed to be used in clinical settings, clinicians,
patients, patients’ family members, clinical staff, and others may be
interviewed for insights that could assist in item writing.
⮚ For psychological tests designed to be used by personnel psychologists,
interviews with members of a targeted industry or organization will likely
be of great value.
⮚ For psychological tests designed to be used by school psychologists,
interviews with teachers, administrative staff, educational psychologists,
and others may be invaluable.
How many items should be included?
⮚ It is advisable that the first draft contain approximately twice the number of items that the final version of the test will contain.
⮚ An item pool is the reservoir or well from which
items will or will not be drawn for the final version
of the test.
⮚ A comprehensive sampling provides a basis for
content validity of the final version of the test.
❑ Item format
Variables such as form, plan, structure, arrangement,
and layout of individual test items are collectively
referred to as item format.
Types of item format

Selected-response format
❑ Requires the test taker to select a response from a set of alternative responses.
❑ Examples include multiple choice, matching, and true–false.

Constructed-response format
❑ Requires the test taker to supply or create the correct answer, not merely to select it.
❑ Examples include completion items, the short answer, and the essay.
Scoring items

Cumulative scoring

Class/category scoring

Ipsative scoring
• The rule in a cumulatively scored test is that the higher the
score on the test, the higher the test taker is on the ability,
trait, or other characteristic that the test purports to measure.
• In tests that employ class scoring (also referred to as category scoring), test takers' responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is presumably similar in some way. This approach is used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis.
• A third scoring model, ipsative scoring, departs radically in rationale from both the cumulative and class models. A typical objective in ipsative scoring is comparing a test taker's score on one scale within a test with that same test taker's score on another scale within the same test. The Edwards Personal Preference Schedule (EPPS) is a well-known example of an ipsative scoring system. (A brief illustration follows.)
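A minimal sketch of the cumulative-versus-ipsative distinction, using made-up scale names and raw scores (not EPPS data):

    # Hypothetical scale scores for a single test taker
    scores = {"need_achievement": 18, "need_affiliation": 12}

    # Cumulative logic: one total, interpreted relative to other test takers
    cumulative_total = sum(scores.values())                                        # 30

    # Ipsative logic: one scale is compared with another scale
    # within the same person, not with other test takers
    ipsative_difference = scores["need_achievement"] - scores["need_affiliation"]  # +6

    print(cumulative_total, ipsative_difference)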
Stage 3: Test tryout

• Having created a pool of items from which the final version of the test will be developed, the test developer will try out the test. The test should be tried out on people who are similar in critical respects to the people for whom the test was designed.
• Equally important are questions about the number of people on whom the test should be tried out. An informal rule of thumb is that there should be no fewer than 5 subjects, and preferably as many as 10, for each item on the test (a subject-to-item ratio of at least 5:1); a rough calculation is sketched after this list.
• The test tryout should be executed under conditions as
identical as possible to the conditions under which the
standardized test will be administered; all instructions, and
everything from the time limits allotted for completing the test
to the atmosphere at the test site, should be as similar as
possible.
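As a rough check of the rule of thumb above, the implied tryout sample size can be estimated as follows (the item count is hypothetical):

    # Hypothetical size of the first-draft item pool
    n_items = 100

    # Rule of thumb from the slide: 5 to 10 tryout subjects per item
    min_subjects = 5 * n_items    # 500
    max_subjects = 10 * n_items   # 1000
    print(min_subjects, max_subjects)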
Stage 4: Item analysis

❑ The criteria for the best items may differ as a function of the test developer's objectives.
❑ For example, one test developer might deem the best items to be those that optimally contribute to internal reliability; another may wish to design a test with the highest possible criterion-related validity and thus select items accordingly.
Tools used to analyse and select items:
• an item difficulty index
• an item discrimination index
• an index of item reliability
• an index of item validity
Item difficulty index
❑ An index of an item's difficulty is obtained by calculating the proportion of the total number of test takers who got the item right.
❑ A lowercase italicized p is used to denote item difficulty, and a subscript refers to the item number (p1 is read as "the item difficulty index for item 1").
❑ This value can range from 0 (if no one got the item right) to 1 (if everyone got the item right).
❑ For example, if 50 out of 100 examinees got item 2 right, then the item difficulty index is 50/100 = .5 (p2 = .5).
❑ This statistic is referred to as the item difficulty index in the context of achievement testing, while it is called the item endorsement index in the context of personality testing. There the statistic provides not a measure of the percentage of people passing the item, but a measure of the percentage of people who said yes to, agreed with, or otherwise endorsed the item.
❑ An index of the difficulty of the average test item for a particular test can be calculated by averaging the item difficulty indices for all test items: sum the item difficulty indices and divide by the total number of items (see the sketch after this list).
❑ For maximum discrimination among the abilities of the test takers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
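A small sketch of how the p values and the average item difficulty could be computed; the response matrix is made up (rows are examinees, columns are items, 1 = correct):

    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 1],
        [1, 0, 0, 1],
    ]

    n_examinees = len(responses)
    # p for each item: proportion of examinees answering that item correctly
    p_values = [sum(item) / n_examinees for item in zip(*responses)]
    # average item difficulty: mean of the item p values
    average_difficulty = sum(p_values) / len(p_values)

    print(p_values)             # [0.75, 0.5, 0.25, 1.0]
    print(average_difficulty)   # 0.625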
Item reliability index

❑ It provides an indication of the internal consistency of a test; the higher this index, the greater the test's internal consistency.
❑ This index is equal to the product of the item-score standard deviation (s) and the correlation between the item score and the total test score (a computational sketch follows this list).
❑ Factor analysis and inter-item consistency can also be examined to check whether the items on a test appear to be measuring the same thing.
❑ If too many items appear to be tapping a particular area, the weakest of such items can be eliminated.
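A brief computational sketch of the item reliability index, using hypothetical item and total scores (the Pearson correlation is computed by hand to keep the example self-contained):

    from statistics import pstdev

    def pearson(x, y):
        # Pearson correlation between two equal-length score lists
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
        return cov / (pstdev(x) * pstdev(y))

    item_scores  = [1, 0, 1, 1, 0, 1]    # one item, six examinees (1 = correct)
    total_scores = [9, 4, 8, 7, 5, 10]   # total test scores for the same examinees

    # item reliability index = item standard deviation x item-total correlation
    item_reliability_index = pstdev(item_scores) * pearson(item_scores, total_scores)
    print(round(item_reliability_index, 3))   # about 0.42 for these made-up data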
Item validity index

❑ It is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure; the higher this index, the greater the test's criterion-related validity.
❑ It is particularly important to calculate this index when the test developer's goal is to maximize the criterion-related validity of the test.
❑ The item validity index can be calculated from the following two statistics (see the sketch after this list):
▪ the item-score standard deviation
▪ the correlation between the item score and the criterion score
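The item validity index follows the same pattern as the reliability index, with an external criterion in place of the total score; the sketch below assumes Python 3.10+ (for statistics.correlation) and uses made-up criterion values:

    from statistics import pstdev, correlation

    item_scores      = [1, 0, 1, 1, 0, 1]        # 1 = correct / endorsed (hypothetical)
    criterion_scores = [72, 55, 80, 65, 50, 90]  # external criterion, e.g. ratings (hypothetical)

    # item validity index = item standard deviation x item-criterion correlation
    item_validity_index = pstdev(item_scores) * correlation(item_scores, criterion_scores)
    print(round(item_validity_index, 3))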
Item discrimination index

❑ Measures of item discrimination indicate how adequately an item separates, or discriminates, between high scorers and low scorers on an entire test.
❑ For example, a multiple-choice item on an achievement test is a good item if most of the high scorers answer it correctly and most of the low scorers answer it incorrectly.
❑ The item-discrimination index is symbolized by a lowercase italicized d. The higher the value of d, the greater the number of high scorers answering the item correctly.
❑ This index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering an item correctly.
❑ The formula for calculating it is d = (U - L) / n, where U is the number of test takers in the upper-scoring group who answered the item correctly, L is the number in the lower-scoring group who answered it correctly, and n is the number of test takers in each group.
❑ If the same proportion of the U and L groups pass the item, the item is not discriminating between test takers at all, and d equals 0.
❑ The lowest possible d value is -1. This is a test developer's nightmare: it indicates that all members of the U group failed the item and all members of the L group passed it.
❑ A negative d value is a red flag; it indicates that low scorers are more likely than high scorers to answer the item correctly. (A computational sketch follows.)
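A minimal sketch of d = (U - L) / n with hypothetical upper- and lower-group counts (the group size of 27 is just an example):

    def discrimination_index(upper_correct, lower_correct, group_size):
        # upper_correct and lower_correct are counts of correct answers in the
        # upper and lower scoring groups; group_size is the number in each group
        return (upper_correct - lower_correct) / group_size

    print(discrimination_index(24, 8, 27))    # about 0.59: separates high and low scorers well
    print(discrimination_index(10, 10, 27))   # 0.0: no discrimination
    print(discrimination_index(0, 27, 27))    # -1.0: all low scorers passed, no high scorer did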
Item discrimination indices for five hypothetical items
Translation & Adaptation

Test translation and adaptation

• Test adaptation is a process by which a test (or assessment instrument) is transformed from a source language and/or culture into a target language and/or culture.
Linguistic validation

❑ The aim of linguistic validation is to produce a translated version in a target language that is conceptually equivalent to the original version, as well as clear and easy to understand.
❑ The translated instrument should be understood by most respondents in the selected population and should maintain a reading and comprehension level accessible to most respondents, even those with a low education level.
❑ Conceptual equivalence is the absence of differences in meaning and content between the source-language version and the translated version. It is achieved through a process called linguistic validation.
Steps in standard linguistic validation (Mapi, 2008):

1. Selection and briefing of translators
2. Forward translation
3. Backward translation
4. Pilot testing
5. Validation
6. Proofreading
7. Final tool

Characteristics of the translator

❑ Bilingual
❑ Previous experience in translating tests/instruments
Translation process
• First, the translation process ought to be conducted from the beginning by bilinguals, that is, by people proficient in both languages (forward translation). They should also conduct the so-called back-translation: initially they translate the original version of the instrument, and then they transfer this version back into the original language (backward translation). The two versions are then compared.
• Both versions of a questionnaire can be administered to the same bilingual individuals. If the investigator obtains similar results on both versions, this is a good indicator that the translation was conducted successfully.
• On the basis of comparison with the original versions of the scales, each and every item is reviewed. If any discrepancy is found, that item is revised and reviewed again.
• After the final translated version is shaped, the items are proofread for further clarity.
• Before starting the research, a tryout is carried out to check the feasibility of the data collection and to see whether the sample easily comprehends the items of the scale and responds accordingly.
Thank You ☺
